[Qemu-devel] [PATCH] MAINTAINERS: update block/sheepdog maintainers
From: Liu Yuan

E-mail to one of the block/sheepdog maintainers, Hitoshi Mitake, bounces
("unknown user: mitake.hitoshi") and no current address is known, so just
remove the entry.

Signed-off-by: Liu Yuan
---
 MAINTAINERS | 1 -
 1 file changed, 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 3275cc6..6670c16 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2007,7 +2007,6 @@ F: block/rbd.c
 T: git git://github.com/codyprime/qemu-kvm-jtc.git block

 Sheepdog
-M: Hitoshi Mitake
 M: Liu Yuan
 M: Jeff Cody
 L: qemu-bl...@nongnu.org
--
2.7.4
Re: [Qemu-devel] [PATCH] sheepdog: discard the payload if the header is invalid
On Tue, Sep 01, 2015 at 10:05:38AM +0800, Liu Yuan wrote:
> On Mon, Aug 31, 2015 at 09:51:00PM -0400, Jeff Cody wrote:
> > On Tue, Sep 01, 2015 at 09:29:31AM +0800, Liu Yuan wrote:
> > > From: Liu Yuan <liuy...@cmss.chinamobile.com>
> > >
> > > We need to discard the payload if we get an invalid header, for whatever
> > > reason, to avoid corrupting the data stream.
> >
> > If the header is invalid / corrupted, how can rsp.data_length be
> > trusted?  Out of curiosity, is this an issue you are seeing occur "in
> > the wild"?

On second thought, we might not need this patch upstream because of the
auto-reconnection feature, which closes the socket and so throws away the
whole buffer. But old QEMU without auto-reconnection might need this patch
to drain the buffer.

Thanks,
Yuan

> This is a defensive patch. The header is invalid in the sense that only
> rsp.id is invalid, due to sheepdog driver bugs: e.g. the request was
> misplaced after being sent, or duplicated requests were sent to the sheep
> daemon and we got duplicated responses for the same request.
>
> Actually, in late 2012 we had seen this problem, but we didn't find the
> root cause by looking at the code statically, and the problem went away
> silently while we restructured the code.
>
> But yesterday some CentOS 6 users reported the 'cannot find aio_req'
> problem to me, which hangs the guest disk. That QEMU's sheepdog driver was
> rather old and would hit the rsp.id mismatch problem as we did in late
> 2012. By looking at the code again, I found this error handling problem.
> However, this patch is not aimed at solving the rsp.id mismatch problem
> (if it still exists) but is just a defensive one for after that problem
> happens.
>
> Thanks,
> Yuan
Re: [Qemu-devel] [PATCH v3] sheepdog: add reopen support
On Fri, Aug 28, 2015 at 10:53:58AM +0800, Liu Yuan wrote:
> From: Liu Yuan <liuy...@cmss.chinamobile.com>
>
> With reopen supported, block-commit (and offline commit) is now supported
> for image files whose base image uses the Sheepdog protocol driver.
>
> Cc: qemu-devel@nongnu.org
> Cc: Jeff Cody <jc...@redhat.com>
> Cc: Kevin Wolf <kw...@redhat.com>
> Cc: Stefan Hajnoczi <stefa...@redhat.com>
> Signed-off-by: Liu Yuan <liuy...@cmss.chinamobile.com>
> ---
> v3:
>  - init cache flags in sd_reopen_prepare() as in sd_open()
> v2:
>  - free AioHandler [Jeff Cody]

ping...
[Qemu-devel] [PATCH] sheepdog: discard the payload if the header is invalid
From: Liu Yuan <liuy...@cmss.chinamobile.com>

We need to discard the payload if we get an invalid header, for whatever
reason, to avoid corrupting the data stream. For example, a response consists
of a header plus a data payload. If we simply read the header, the data
payload is left in the socket buffer, and the next read would pick up the
garbage data and corrupt the whole connection.

Cc: qemu-devel@nongnu.org
Cc: Jeff Cody <jc...@redhat.com>
Cc: Kevin Wolf <kw...@redhat.com>
Cc: Stefan Hajnoczi <stefa...@redhat.com>
Signed-off-by: Liu Yuan <liuy...@cmss.chinamobile.com>
---
 block/sheepdog.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index 9585beb..9ed3458 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -794,6 +794,14 @@ static void coroutine_fn aio_read_response(void *opaque)
         }
     }
     if (!aio_req) {
+        if (rsp.data_length) {
+            void *garbage = g_malloc(rsp.data_length);
+            ret = qemu_co_recv(fd, garbage, rsp.data_length);
+            if (ret != rsp.data_length) {
+                error_report("failed to discard the data, %s", strerror(errno));
+            }
+            g_free(garbage);
+        }
         error_report("cannot find aio_req %x", rsp.id);
         goto err;
     }
--
1.9.1
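For illustration only, the same discard could also be done with a small
bounded scratch buffer instead of one g_malloc() of rsp.data_length bytes, so
a corrupted length field cannot force a single huge allocation. The sketch
below is not the driver code: it uses plain blocking POSIX recv() rather than
QEMU's coroutine-based qemu_co_recv(), and drain_payload() is a hypothetical
helper name.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Drain 'data_length' bytes of an unwanted payload from the socket in
 * fixed-size chunks, so we never allocate a buffer as large as the
 * (possibly untrustworthy) length field. */
static int drain_payload(int sockfd, size_t data_length)
{
    char scratch[4096];

    while (data_length > 0) {
        size_t chunk = data_length < sizeof(scratch) ? data_length
                                                     : sizeof(scratch);
        ssize_t ret = recv(sockfd, scratch, chunk, 0);
        if (ret <= 0) {
            fprintf(stderr, "failed to discard the data, %s\n",
                    strerror(errno));
            return -1;
        }
        data_length -= (size_t)ret;
    }
    return 0;
}

Whether such a bounded-buffer variant is worth it depends on how much
rsp.data_length can be trusted once the header is already suspect, which is
exactly the concern raised in the review below.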
Re: [Qemu-devel] [PATCH] sheepdog: discard the payload if the header is invalid
On Mon, Aug 31, 2015 at 09:51:00PM -0400, Jeff Cody wrote:
> On Tue, Sep 01, 2015 at 09:29:31AM +0800, Liu Yuan wrote:
> > From: Liu Yuan <liuy...@cmss.chinamobile.com>
> >
> > We need to discard the payload if we get an invalid header, for whatever
> > reason, to avoid corrupting the data stream.
>
> If the header is invalid / corrupted, how can rsp.data_length be
> trusted?  Out of curiosity, is this an issue you are seeing occur "in
> the wild"?

This is a defensive patch. The header is invalid in the sense that only
rsp.id is invalid, due to sheepdog driver bugs: e.g. the request was
misplaced after being sent, or duplicated requests were sent to the sheep
daemon and we got duplicated responses for the same request.

Actually, in late 2012 we had seen this problem, but we didn't find the root
cause by looking at the code statically, and the problem went away silently
while we restructured the code.

But yesterday some CentOS 6 users reported the 'cannot find aio_req' problem
to me, which hangs the guest disk. That QEMU's sheepdog driver was rather old
and would hit the rsp.id mismatch problem as we did in late 2012. By looking
at the code again, I found this error handling problem. However, this patch
is not aimed at solving the rsp.id mismatch problem (if it still exists) but
is just a defensive one for after that problem happens.

Thanks,
Yuan
Re: [Qemu-devel] [PATCH v2] sheepdog: add reopen support
On Thu, Aug 27, 2015 at 03:00:32PM -0400, Jeff Cody wrote:
> On Thu, Aug 27, 2015 at 10:54:01AM +0800, Liu Yuan wrote:
> > From: Liu Yuan <liuy...@cmss.chinamobile.com>
> >
> > With reopen supported, block-commit (and offline commit) is now supported
> > for image files whose base image uses the Sheepdog protocol driver.
> >
> > Cc: qemu-devel@nongnu.org
> > Cc: Jeff Cody <jc...@redhat.com>
> > Cc: Kevin Wolf <kw...@redhat.com>
> > Cc: Stefan Hajnoczi <stefa...@redhat.com>
> > Signed-off-by: Liu Yuan <liuy...@cmss.chinamobile.com>
> > ---
> > v2:
> >  - free AioHandler [Jeff Cody]
> >
> >  block/sheepdog.c | 75
> >  1 file changed, 75 insertions(+)
> >
> > diff --git a/block/sheepdog.c b/block/sheepdog.c
> > index 9585beb..e54add4 100644
> > --- a/block/sheepdog.c
> > +++ b/block/sheepdog.c
> > @@ -377,6 +377,11 @@ typedef struct BDRVSheepdogState {
> >      QLIST_HEAD(inflight_aiocb_head, SheepdogAIOCB) inflight_aiocb_head;
> >  } BDRVSheepdogState;
> >
> > +typedef struct BDRVSheepdogReopenState {
> > +    int fd;
> > +    int cache_flags;
> > +} BDRVSheepdogReopenState;
> > +
> >  static const char * sd_strerror(int err)
> >  {
> >      int i;
> > @@ -1486,6 +1491,67 @@ out:
> >      return ret;
> >  }
> >
> > +static int sd_reopen_prepare(BDRVReopenState *state, BlockReopenQueue *queue,
> > +                             Error **errp)
> > +{
> > +    BDRVSheepdogState *s = state->bs->opaque;
> > +    BDRVSheepdogReopenState *re_s;
> > +    int ret = 0;
> > +
> > +    re_s = state->opaque = g_new0(BDRVSheepdogReopenState, 1);
> > +
> > +    if (state->flags & BDRV_O_NOCACHE) {
> > +        re_s->cache_flags = SD_FLAG_CMD_DIRECT;
> > +    }
>
> In sd_open(), s->cache_flags is set to SD_FLAG_CMD_CACHE by default, and
> then overwritten if BDRV_O_NOCACHE is set.  Do we want that same behavior
> here?
>
> (Sorry I didn't ask this the first time around)

As far as I understood, sd_open() was called before sd_reopen, so I thought
s->cache_flags = SD_FLAG_CMD_CACHE would be duplicated in sd_reopen. But on
second thought, if reopen implies a complete re-parsing of the flags, I agree
with you. I'll prepare a third version of the patch.

Thanks,
Yuan
[Qemu-devel] [PATCH v3] sheepdog: add reopen support
From: Liu Yuan liuy...@cmss.chinamobile.com With reopen supported, block-commit (and offline commit) is now supported for image files whose base image uses the Sheepdog protocol driver. Cc: qemu-devel@nongnu.org Cc: Jeff Cody jc...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan liuy...@cmss.chinamobile.com --- v3: - init cache flags in sd_reopen_prepare() as in sd_open() v2: - free AioHandler [Jeff Cody] block/sheepdog.c | 76 1 file changed, 76 insertions(+) diff --git a/block/sheepdog.c b/block/sheepdog.c index 9585beb..6e8a783 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -377,6 +377,11 @@ typedef struct BDRVSheepdogState { QLIST_HEAD(inflight_aiocb_head, SheepdogAIOCB) inflight_aiocb_head; } BDRVSheepdogState; +typedef struct BDRVSheepdogReopenState { +int fd; +int cache_flags; +} BDRVSheepdogReopenState; + static const char * sd_strerror(int err) { int i; @@ -1486,6 +1491,68 @@ out: return ret; } +static int sd_reopen_prepare(BDRVReopenState *state, BlockReopenQueue *queue, + Error **errp) +{ +BDRVSheepdogState *s = state-bs-opaque; +BDRVSheepdogReopenState *re_s; +int ret = 0; + +re_s = state-opaque = g_new0(BDRVSheepdogReopenState, 1); + +re_s-cache_flags = SD_FLAG_CMD_CACHE; +if (state-flags BDRV_O_NOCACHE) { +re_s-cache_flags = SD_FLAG_CMD_DIRECT; +} + +re_s-fd = get_sheep_fd(s, errp); +if (re_s-fd 0) { +ret = re_s-fd; +return ret; +} + +return ret; +} + +static void sd_reopen_commit(BDRVReopenState *state) +{ +BDRVSheepdogReopenState *re_s = state-opaque; +BDRVSheepdogState *s = state-bs-opaque; + +if (s-fd) { +aio_set_fd_handler(s-aio_context, s-fd, NULL, NULL, NULL); +closesocket(s-fd); +} + +s-fd = re_s-fd; +s-cache_flags = re_s-cache_flags; + +g_free(state-opaque); +state-opaque = NULL; + +return; +} + +static void sd_reopen_abort(BDRVReopenState *state) +{ +BDRVSheepdogReopenState *re_s = state-opaque; +BDRVSheepdogState *s = state-bs-opaque; + +if (re_s == NULL) { +return; +} + +if (re_s-fd) { +aio_set_fd_handler(s-aio_context, re_s-fd, NULL, NULL, NULL); +closesocket(re_s-fd); +} + +g_free(state-opaque); +state-opaque = NULL; + +return; +} + static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, Error **errp) { @@ -2703,6 +2770,9 @@ static BlockDriver bdrv_sheepdog = { .instance_size = sizeof(BDRVSheepdogState), .bdrv_needs_filename = true, .bdrv_file_open = sd_open, +.bdrv_reopen_prepare= sd_reopen_prepare, +.bdrv_reopen_commit = sd_reopen_commit, +.bdrv_reopen_abort = sd_reopen_abort, .bdrv_close = sd_close, .bdrv_create= sd_create, .bdrv_has_zero_init = bdrv_has_zero_init_1, @@ -2736,6 +2806,9 @@ static BlockDriver bdrv_sheepdog_tcp = { .instance_size = sizeof(BDRVSheepdogState), .bdrv_needs_filename = true, .bdrv_file_open = sd_open, +.bdrv_reopen_prepare= sd_reopen_prepare, +.bdrv_reopen_commit = sd_reopen_commit, +.bdrv_reopen_abort = sd_reopen_abort, .bdrv_close = sd_close, .bdrv_create= sd_create, .bdrv_has_zero_init = bdrv_has_zero_init_1, @@ -2769,6 +2842,9 @@ static BlockDriver bdrv_sheepdog_unix = { .instance_size = sizeof(BDRVSheepdogState), .bdrv_needs_filename = true, .bdrv_file_open = sd_open, +.bdrv_reopen_prepare= sd_reopen_prepare, +.bdrv_reopen_commit = sd_reopen_commit, +.bdrv_reopen_abort = sd_reopen_abort, .bdrv_close = sd_close, .bdrv_create= sd_create, .bdrv_has_zero_init = bdrv_has_zero_init_1, -- 1.9.1
[Qemu-devel] [PATCH] sheepdog: add reopen support
From: Liu Yuan liuy...@cmss.chinamobile.com With reopen supported, block-commit (and offline commit) is now supported for image files whose base image uses the Sheepdog protocol driver. Cc: qemu-devel@nongnu.org Cc: Jeff Cody jc...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan liuy...@cmss.chinamobile.com --- block/sheepdog.c | 72 1 file changed, 72 insertions(+) diff --git a/block/sheepdog.c b/block/sheepdog.c index 9585beb..26d09e9 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -377,6 +377,11 @@ typedef struct BDRVSheepdogState { QLIST_HEAD(inflight_aiocb_head, SheepdogAIOCB) inflight_aiocb_head; } BDRVSheepdogState; +typedef struct BDRVSheepdogReopenState { +int fd; +int cache_flags; +} BDRVSheepdogReopenState; + static const char * sd_strerror(int err) { int i; @@ -1486,6 +1491,64 @@ out: return ret; } +static int sd_reopen_prepare(BDRVReopenState *state, BlockReopenQueue *queue, + Error **errp) +{ +BDRVSheepdogState *s = state-bs-opaque; +BDRVSheepdogReopenState *re_s; +int ret = 0; + +re_s = state-opaque = g_new0(BDRVSheepdogReopenState, 1); + +if (state-flags BDRV_O_NOCACHE) { +re_s-cache_flags = SD_FLAG_CMD_DIRECT; +} + +re_s-fd = get_sheep_fd(s, errp); +if (re_s-fd 0) { +ret = re_s-fd; +return ret; +} + +return ret; +} + +static void sd_reopen_commit(BDRVReopenState *state) +{ +BDRVSheepdogReopenState *re_s = state-opaque; +BDRVSheepdogState *s = state-bs-opaque; + +if (s-fd) { +closesocket(s-fd); +} + +s-fd = re_s-fd; +s-cache_flags = re_s-cache_flags; + +g_free(state-opaque); +state-opaque = NULL; + +return; +} + +static void sd_reopen_abort(BDRVReopenState *state) +{ +BDRVSheepdogReopenState *re_s = state-opaque; + +if (re_s == NULL) { +return; +} + +if (re_s-fd) { +closesocket(re_s-fd); +} + +g_free(state-opaque); +state-opaque = NULL; + +return; +} + static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, Error **errp) { @@ -2703,6 +2766,9 @@ static BlockDriver bdrv_sheepdog = { .instance_size = sizeof(BDRVSheepdogState), .bdrv_needs_filename = true, .bdrv_file_open = sd_open, +.bdrv_reopen_prepare= sd_reopen_prepare, +.bdrv_reopen_commit = sd_reopen_commit, +.bdrv_reopen_abort = sd_reopen_abort, .bdrv_close = sd_close, .bdrv_create= sd_create, .bdrv_has_zero_init = bdrv_has_zero_init_1, @@ -2736,6 +2802,9 @@ static BlockDriver bdrv_sheepdog_tcp = { .instance_size = sizeof(BDRVSheepdogState), .bdrv_needs_filename = true, .bdrv_file_open = sd_open, +.bdrv_reopen_prepare= sd_reopen_prepare, +.bdrv_reopen_commit = sd_reopen_commit, +.bdrv_reopen_abort = sd_reopen_abort, .bdrv_close = sd_close, .bdrv_create= sd_create, .bdrv_has_zero_init = bdrv_has_zero_init_1, @@ -2769,6 +2838,9 @@ static BlockDriver bdrv_sheepdog_unix = { .instance_size = sizeof(BDRVSheepdogState), .bdrv_needs_filename = true, .bdrv_file_open = sd_open, +.bdrv_reopen_prepare= sd_reopen_prepare, +.bdrv_reopen_commit = sd_reopen_commit, +.bdrv_reopen_abort = sd_reopen_abort, .bdrv_close = sd_close, .bdrv_create= sd_create, .bdrv_has_zero_init = bdrv_has_zero_init_1, -- 1.9.1
[Qemu-devel] [PATCH v2] sheepdog: add reopen support
From: Liu Yuan liuy...@cmss.chinamobile.com With reopen supported, block-commit (and offline commit) is now supported for image files whose base image uses the Sheepdog protocol driver. Cc: qemu-devel@nongnu.org Cc: Jeff Cody jc...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan liuy...@cmss.chinamobile.com --- v2: - free AioHandler [Jeff Cody] block/sheepdog.c | 75 1 file changed, 75 insertions(+) diff --git a/block/sheepdog.c b/block/sheepdog.c index 9585beb..e54add4 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -377,6 +377,11 @@ typedef struct BDRVSheepdogState { QLIST_HEAD(inflight_aiocb_head, SheepdogAIOCB) inflight_aiocb_head; } BDRVSheepdogState; +typedef struct BDRVSheepdogReopenState { +int fd; +int cache_flags; +} BDRVSheepdogReopenState; + static const char * sd_strerror(int err) { int i; @@ -1486,6 +1491,67 @@ out: return ret; } +static int sd_reopen_prepare(BDRVReopenState *state, BlockReopenQueue *queue, + Error **errp) +{ +BDRVSheepdogState *s = state-bs-opaque; +BDRVSheepdogReopenState *re_s; +int ret = 0; + +re_s = state-opaque = g_new0(BDRVSheepdogReopenState, 1); + +if (state-flags BDRV_O_NOCACHE) { +re_s-cache_flags = SD_FLAG_CMD_DIRECT; +} + +re_s-fd = get_sheep_fd(s, errp); +if (re_s-fd 0) { +ret = re_s-fd; +return ret; +} + +return ret; +} + +static void sd_reopen_commit(BDRVReopenState *state) +{ +BDRVSheepdogReopenState *re_s = state-opaque; +BDRVSheepdogState *s = state-bs-opaque; + +if (s-fd) { +aio_set_fd_handler(s-aio_context, s-fd, NULL, NULL, NULL); +closesocket(s-fd); +} + +s-fd = re_s-fd; +s-cache_flags = re_s-cache_flags; + +g_free(state-opaque); +state-opaque = NULL; + +return; +} + +static void sd_reopen_abort(BDRVReopenState *state) +{ +BDRVSheepdogReopenState *re_s = state-opaque; +BDRVSheepdogState *s = state-bs-opaque; + +if (re_s == NULL) { +return; +} + +if (re_s-fd) { +aio_set_fd_handler(s-aio_context, re_s-fd, NULL, NULL, NULL); +closesocket(re_s-fd); +} + +g_free(state-opaque); +state-opaque = NULL; + +return; +} + static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, Error **errp) { @@ -2703,6 +2769,9 @@ static BlockDriver bdrv_sheepdog = { .instance_size = sizeof(BDRVSheepdogState), .bdrv_needs_filename = true, .bdrv_file_open = sd_open, +.bdrv_reopen_prepare= sd_reopen_prepare, +.bdrv_reopen_commit = sd_reopen_commit, +.bdrv_reopen_abort = sd_reopen_abort, .bdrv_close = sd_close, .bdrv_create= sd_create, .bdrv_has_zero_init = bdrv_has_zero_init_1, @@ -2736,6 +2805,9 @@ static BlockDriver bdrv_sheepdog_tcp = { .instance_size = sizeof(BDRVSheepdogState), .bdrv_needs_filename = true, .bdrv_file_open = sd_open, +.bdrv_reopen_prepare= sd_reopen_prepare, +.bdrv_reopen_commit = sd_reopen_commit, +.bdrv_reopen_abort = sd_reopen_abort, .bdrv_close = sd_close, .bdrv_create= sd_create, .bdrv_has_zero_init = bdrv_has_zero_init_1, @@ -2769,6 +2841,9 @@ static BlockDriver bdrv_sheepdog_unix = { .instance_size = sizeof(BDRVSheepdogState), .bdrv_needs_filename = true, .bdrv_file_open = sd_open, +.bdrv_reopen_prepare= sd_reopen_prepare, +.bdrv_reopen_commit = sd_reopen_commit, +.bdrv_reopen_abort = sd_reopen_abort, .bdrv_close = sd_close, .bdrv_create= sd_create, .bdrv_has_zero_init = bdrv_has_zero_init_1, -- 1.9.1
Re: [Qemu-devel] [sheepdog] [PATCH] sheepdog: fix overlapping metadata update
On Sun, Aug 02, 2015 at 02:52:08PM +0300, Vasiliy Tolstov wrote:
> 2015-07-31 15:08 GMT+03:00 Vasiliy Tolstov <v.tols...@selfip.ru>:
> > Please wait for the performance comparison. As I see it, Liu's patch may
> > be slower than Hitoshi's.
>
> I'm switching to the local cluster driver to test only the local SSD and
> not the network overhead. But now Liu's qemu is not able to completely
> install a vps: sheep runs as:

What did you mean by 'not able to completely install'? Is it an installation
problem or something else? My QEMU or my patch?

I guess you can test the same QEMU with my patch and with Hitoshi's patch
separately. You know, a different QEMU base might obscure the difference.

Thanks,
Yuan
Re: [Qemu-devel] [sheepdog] [PATCH] sheepdog: fix overlapping metadata update
On Thu, Jul 30, 2015 at 09:27:44AM -0400, Jeff Cody wrote: On Thu, Jul 30, 2015 at 09:41:08AM +0300, Vasiliy Tolstov wrote: 2015-07-29 12:31 GMT+03:00 Liu Yuan namei.u...@gmail.com: Technically, it won't affect the performance because index updates are not range but concrete in terms of underlying 4M block size. Only 2 or 3 indexes in a range will be updated and 90+% updates will be only 1. So if 2 updates stride a large range, it will actually worse the performance of sheepdog because many additional unref of object will be executed by sheep internally. It is not a performance problem but more the right fix. Even with your patch, updates of inode can overlap. You just don't allow overlapped requests go to sheepdog, which is a overkill approach. IMHO, we should only adjust to avoid the overlapped inode updates, which can be done easily and incrementally on top of old code, rather than take on a complete new untested overkill mechanism. So what we get from your patch? Covering the problem and lock every requests? Your patch actually fix nothing but just cover the problem by slowing down the request and even with your patch, the problem still exists because inode updates can overlap. Your commit log doesn't explain what is the real problem and why your fix works. This is not your toy project that can commit whatever you want. BTW, sheepdog project was already forked, why don't you fork the block driver, too? What makes you think you own the block driver? We forked the sheepdog project because it is low quality of code partly and mostly some company tries to make it a private project. It is not as open source friendly as before and that is the main reason Kazutaka and I chose to fork the sheepdog project. But this doesn't mean we need to fork the QEMU project, it is an open source project and not your home toy. Kazutaka and I are the biggest contributers of both sheepdog and QEMU sheepdog block driver for years, so I think I am eligible to review the patch and responsible to suggest the right fix. If you are pissed off when someone else have other opinions, you can just fork the code and play with it at home or you follow the rule of open source project. Jeff Cody, please be the judge, patch from Hitoshi solved my problem that i emailed in sheepdog list (i have test environment with 8 hosts on each 6 SSD disks and infiniband interconnect between hosts) before Hitoshi patch, massive writing to sheepdog storage breaks file system and corrupt it. After the patch i don't see issues. I'd rather see some sort consensus amongst Liu, Hitoshi, yourself, or others more intimately familiar with sheepdog. Right now, we have Hitoshi's patch in the main git repo, slated for 2.4 release (which is Monday). It sounds, from Liu's email, as this may not fix the root cause. Vasiliy said he would test Liu's patch; if he can confirm this new patch fix, then I would be inclined to use Liu's patch, based on the detailed analysis of the issue in the commit message. This is my performance comparison on top of latest QEMU with my latop with SSD. sheepdog cluster run with 3 nodes with '-n' to get best volume performance. QEMU command: qemu-system-x86_64 -m 1024 --enable-kvm \ -drive file=debian_squeeze_amd64_standard.qcow2,cache=writeback,if=virtio \ -drive file=sheepdog:test,if=virtio sheepdog:test is created as 'dog vdi create test 80G' I test both time for mkfs and iops for fio write. 
fio.conf:

[global]
ioengine=libaio
direct=1
thread=1
norandommap=1
runtime=60
size=300M
directory=/mnt

[write4k-rand]
stonewall
group_reporting
bs=4k
rw=randwrite
numjobs=8
iodepth=32

Result:

sheep formatted with -c 2:1 (erasure coding)

          mkfs (s)   fio (IOPS)
Yuan      0.069      4578
Hitoshi   0.071      3722

sheep formatted with -c 2 (replication)

          mkfs (s)   fio (IOPS)
Yuan      0.074      6873
Hitoshi   0.081      6174

Thanks,
Yuan
Re: [Qemu-devel] [sheepdog] [PATCH] sheepdog: fix overlapping metadata update
On Fri, Jul 31, 2015 at 03:08:09PM +0300, Vasiliy Tolstov wrote:
> 2015-07-31 14:55 GMT+03:00 Vasiliy Tolstov <v.tols...@selfip.ru>:
> > Liu's patch also works for me. But, as with Hitoshi's patch, it breaks
> > when using discards in qemu =(.
>
> Please wait for the performance comparison. As I see it, Liu's patch may
> be slower than Hitoshi's. Thanks for your time!

Well, as far as I know, my patch should be slightly better performance-wise
because it preserves the parallelism of requests. Given the scatter-gather
characteristics of I/O requests, we can assume the following I/O pattern as
an illustration:

req1 is split into 2 sheep reqs: create(2), create(10)
req2 is split into 2 sheep reqs: create(5), create(100)

So there are finally 4 sheep requests. With my patch they will be run in
parallel by the sheep cluster and only 4 unrefs of objects will be executed
internally:

update_inode(2), update_inode(10), update_inode(5), update_inode(100)

With Hitoshi's patch, however, req1 and req2 will be serialized, and only
when one req is finished will the other be sent to sheep; 9 + 96 = 105 unrefs
of objects will be executed internally. There are still chances of data
corruption because update_inode(2,10) and update_inode(5,100) will both
update the range [5,10], which is a potential problem if the overlapped range
has different values when the requests are queued with stale data.

This is really a years-old bug: we should update the inode bits exactly as we
create the objects, not update bits we don't touch at all. This bug wasn't
revealed for a long time because most of the time min == max in
update_inode(min, max), and before we introduced generation reference
counting for the snapshot reference mechanism, updating an inode bit with 0
wouldn't cause a remove request in sheepdog.

I'm also concerned about the completely new mechanism, since the current
request handling mechanism has become solid over time. It has existed for
years. The completely new code might need a long time to stabilize, and we
may need to fix side effects we don't know about yet.

Thanks,
Yuan
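To make the arithmetic above concrete, the following is only a small,
self-contained illustration (not driver code) that reproduces the slot counts
for the example requests that created object indexes {2, 10} and {5, 100}:

#include <stdio.h>

/* Number of inode slots rewritten by a ranged update_inode(min, max). */
static unsigned range_slots(unsigned min_idx, unsigned max_idx)
{
    return max_idx - min_idx + 1;
}

int main(void)
{
    /* Per-index updates: one inode update per created object. */
    unsigned per_index = 4;   /* update_inode(2), (10), (5), (100) */

    /* Ranged updates: req1 covers [2,10], req2 covers [5,100]. */
    unsigned ranged = range_slots(2, 10) + range_slots(5, 100);

    printf("per-index inode updates: %u\n", per_index);   /* 4 */
    printf("ranged inode updates   : %u\n", ranged);      /* 9 + 96 = 105 */
    printf("slots rewritten by both ranged updates: [5,10]\n");
    return 0;
}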
Re: [Qemu-devel] [sheepdog] [PATCH] sheepdog: fix overlapping metadata update
On Thu, Jul 30, 2015 at 09:41:08AM +0300, Vasiliy Tolstov wrote: 2015-07-29 12:31 GMT+03:00 Liu Yuan namei.u...@gmail.com: Technically, it won't affect the performance because index updates are not range but concrete in terms of underlying 4M block size. Only 2 or 3 indexes in a range will be updated and 90+% updates will be only 1. So if 2 updates stride a large range, it will actually worse the performance of sheepdog because many additional unref of object will be executed by sheep internally. It is not a performance problem but more the right fix. Even with your patch, updates of inode can overlap. You just don't allow overlapped requests go to sheepdog, which is a overkill approach. IMHO, we should only adjust to avoid the overlapped inode updates, which can be done easily and incrementally on top of old code, rather than take on a complete new untested overkill mechanism. So what we get from your patch? Covering the problem and lock every requests? Your patch actually fix nothing but just cover the problem by slowing down the request and even with your patch, the problem still exists because inode updates can overlap. Your commit log doesn't explain what is the real problem and why your fix works. This is not your toy project that can commit whatever you want. BTW, sheepdog project was already forked, why don't you fork the block driver, too? What makes you think you own the block driver? We forked the sheepdog project because it is low quality of code partly and mostly some company tries to make it a private project. It is not as open source friendly as before and that is the main reason Kazutaka and I chose to fork the sheepdog project. But this doesn't mean we need to fork the QEMU project, it is an open source project and not your home toy. Kazutaka and I are the biggest contributers of both sheepdog and QEMU sheepdog block driver for years, so I think I am eligible to review the patch and responsible to suggest the right fix. If you are pissed off when someone else have other opinions, you can just fork the code and play with it at home or you follow the rule of open source project. Jeff Cody, please be the judge, patch from Hitoshi solved my problem that i emailed in sheepdog list (i have test environment with 8 hosts on each 6 SSD disks and infiniband interconnect between hosts) before Hitoshi patch, massive writing to sheepdog storage breaks file system and corrupt it. After the patch i don't see issues. Hi Vasiliy, Did you test my patch? I think both patch can make the problem gone. The trick is that which way is the right fix. This is quit technical because sometimes the problem is gone not because being fixed but is covered out. If you have problem with applying the patch, feel free to mail me and I'll package a patched QEMU tree for you. Thanks, Yuan
Re: [Qemu-devel] [sheepdog] [PATCH] sheepdog: fix overlapping metadata update
On Wed, Jul 29, 2015 at 02:04:55PM +0900, Hitoshi Mitake wrote: At Wed, 29 Jul 2015 12:02:35 +0800, Liu Yuan wrote: From: Liu Yuan liuy...@cmss.chinamobile.com Current sheepdog driver use a range update_inode(min_idx, max_idx) for batching the updates. But there is subtle problem by determining min_idx and max_idx: for a single create request, min_idx == max_idx, so actually we just update one one bit as expected. Suppose we have 2 create request, create(10) and create(20), then min == 10, max==20 even though we just need to update index 10 and index 20, update_inode(10,20) will actually update range from 10 to 20. This would work if all the update_inode() requests won't overlap. But unfortunately, this is not true for some corner case. So the problem arise as following: req 1: update_inode(10,20) req 2: update_inode(15,22) req 1 and req 2 might have different value between [15,20] and cause problems and can be illustrated as following by adding a printf in sd_write_done: @@ -1976,6 +1976,7 @@ static void coroutine_fn sd_write_done(SheepdogAIOCB *acb) mn = s-min_dirty_data_idx; mx = s-max_dirty_data_idx; +printf(min %u, max %u\n, mn, mx); if (mn = mx) { /* we need to update the vdi object. */ offset = sizeof(s-inode) - sizeof(s-inode.data_vdi_id) + ... min 4294967295, max 0 min 9221, max 9222 min 9224, max 9728 min 9223, max 9223 min 9729, max 9730 min 9731, max 9732 min 9733, max 9734 min 9736, max 10240 min 9735, max 10241 ... Every line represents a update_inode(min, max) last 2 lines show 2 requests actually overlap while I ran mkfs.ext4 on a sheepdog volume. The overlapped requests might be reordered via network and cause inode[idx] back to 0 after being set by last request. Then a wrong remove request will be executed by sheep internally and accidentally remove a victim object, which is reported at: https://bugs.launchpad.net/sheepdog-project/+bug/1456421 The fix is simple that we just update inode one by one for aio_req. Since aio_req is never overlapped, we'll have inode update never overlapped. This patch increase a number of indoe update request than existing approach. current: (max - min + 1) * data object creation + 1 inode update this patch: (max - min + 1) * data object creation + (max - min + 1) * inode update The increased number means increased number of network + disk I/O. Even the overwrapping requests can be processed in parallel, the overhead seems to be large. It has an impact especially in a case of disk I/O heavy workload. I don't have a comparison of benchmark result, but it is not obvious that the approach of this patch is better. Technically, it won't affect the performance because index updates are not range but concrete in terms of underlying 4M block size. Only 2 or 3 indexes in a range will be updated and 90+% updates will be only 1. So if 2 updates stride a large range, it will actually worse the performance of sheepdog because many additional unref of object will be executed by sheep internally. It is not a performance problem but more the right fix. Even with your patch, updates of inode can overlap. You just don't allow overlapped requests go to sheepdog, which is a overkill approach. IMHO, we should only adjust to avoid the overlapped inode updates, which can be done easily and incrementally on top of old code, rather than take on a complete new untested overkill mechanism. So what we get from your patch? Covering the problem and lock every requests? 
Your patch actually fix nothing but just cover the problem by slowing down the request and even with your patch, the problem still exists because inode updates can overlap. Your commit log doesn't explain what is the real problem and why your fix works. This is not your toy project that can commit whatever you want. BTW, sheepdog project was already forked, why don't you fork the block driver, too? What makes you think you own the block driver? We forked the sheepdog project because it is low quality of code partly and mostly some company tries to make it a private project. It is not as open source friendly as before and that is the main reason Kazutaka and I chose to fork the sheepdog project. But this doesn't mean we need to fork the QEMU project, it is an open source project and not your home toy. Kazutaka and I are the biggest contributers of both sheepdog and QEMU sheepdog block driver for years, so I think I am eligible to review the patch and responsible to suggest the right fix. If you are pissed off when someone else have other opinions, you can just fork the code and play with it at home or you follow the rule of open source project. Yuan
Re: [Qemu-devel] [sheepdog] [PATCH] sheepdog: serialize requests to overwrapping area
On Sat, Jul 18, 2015 at 01:44:24AM +0900, Hitoshi Mitake wrote:
> Current sheepdog driver only serializes create requests in oid unit. This
> mechanism isn't enough for handling requests to overwrapping area spanning
> multiple oids, so it can result bugs like below:
>
> https://bugs.launchpad.net/sheepdog-project/+bug/1456421

I'm a bit late to review the patch since I'm not on the cc list, but I'd like
to understand how the mentioned bug relates to the serialization of requests.
The mentioned bug looks to me more like a bug in sheepdog, because the create
and write request will only unref a single oid. The bug report is unclear
about why the object idx in the inode becomes zero, and at least doesn't
point out that it relates to QEMU. But this patch assumes QEMU sends the
requests the wrong way and just vaguely says it is wrong, without giving a
reason.

What are overlapping requests? As far as I understand, a request that strides
over two objects is split into two, to make sure all requests fit the
sheepdog object size. Allowing requests to run concurrently on different SD
objects is the way to achieve high performance. This patch mutes this
feature, to me, without a decent reason.

Probably I'm missing something hidden, but I'd like someone to enlighten me,
because this patch might slow down a QEMU VM over sheepdog.

Thanks,
Yuan
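For illustration of the splitting described above, here is a toy,
self-contained sketch (not the driver's code; the 4 MB object size is an
assumption matching the hard-coded default discussed elsewhere in the
thread) of how a guest request that strides an object boundary is split into
per-object sub-requests, which can then be issued concurrently:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define OBJ_SIZE (4ULL * 1024 * 1024)   /* assumed 4 MB sheepdog object size */

int main(void)
{
    /* A guest request that starts in the middle of object 3 and ends in
     * object 4, so it has to be split into two per-object sub-requests. */
    uint64_t offset = 3 * OBJ_SIZE + OBJ_SIZE / 2;
    uint64_t len    = OBJ_SIZE;

    while (len > 0) {
        uint64_t idx    = offset / OBJ_SIZE;
        uint64_t in_obj = offset % OBJ_SIZE;
        uint64_t chunk  = OBJ_SIZE - in_obj < len ? OBJ_SIZE - in_obj : len;

        printf("sub-request: object %" PRIu64 ", offset %" PRIu64
               ", length %" PRIu64 "\n", idx, in_obj, chunk);

        offset += chunk;
        len    -= chunk;
    }
    return 0;
}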
Re: [Qemu-devel] [sheepdog] [PATCH] sheepdog: serialize requests to overwrapping area
On Tue, Jul 28, 2015 at 04:50:08PM +0800, Liu Yuan wrote: On Sat, Jul 18, 2015 at 01:44:24AM +0900, Hitoshi Mitake wrote: Current sheepdog driver only serializes create requests in oid unit. This mechanism isn't enough for handling requests to overwrapping area spanning multiple oids, so it can result bugs like below: https://bugs.launchpad.net/sheepdog-project/+bug/1456421 I'm a bit late to review the patch since I'm not on the cc list, but I'd like to get the idea how the mentioned bug relates to the serialization of requests? The mentioned bug looks to me more a bug of sheepdog because the create and write request will only unref a single oid. The bug report is unclear about why the object idx in inode becomes zero, at least not pointing that it relates to QEMU. But this patch assume QEMU send the requests the wrong way and just vaguely says it is just wrong without reason. What is overrapping requests? As far as I understand, the request that stride over two objects will be split into two, to make sure all the requests fit the sheepdog object size. Allow requests run concurrently on different SD objects is way achieving high performance. This patch mutes this feature, to me, without a decent reason. Probably I miss something hidden, but I'd like someone enlighten me about it because this patch might slow down QEMU VM over sheepdog. Thanks, Yuan Cc Jeff
Re: [Qemu-devel] [PATCH] sheepdog: serialize requests to overwrapping area
On Mon, Jul 27, 2015 at 11:23:02AM -0400, Jeff Cody wrote:
> On Sat, Jul 18, 2015 at 01:44:24AM +0900, Hitoshi Mitake wrote:
> > Current sheepdog driver only serializes create requests in oid unit. This
> > mechanism isn't enough for handling requests to overwrapping area
> > spanning multiple oids, so it can result bugs like below:
> >
> > https://bugs.launchpad.net/sheepdog-project/+bug/1456421
> >
> > This patch adds a new serialization mechanism for the problem. The
> > difference from the old one is:
> > 1. serialize entire aiocb if their targetting areas overwrap
> > 2. serialize all requests (read, write, and discard), not only creates
> >
> > This patch also removes the old mechanism because the new one can be an
> > alternative.

Okay, I figured out what the problem is myself, and allow me to try to make
it clear to non-sheepdog devs:

A sheepdog volume is thin-provisioned, so for the first write we create the
object internally, meaning we need to handle a write in two cases (see the
sketch after this mail):

1. write to a non-allocated object: create it, then update the inode, so in
   this case two requests are generated: create(oid),
   update_inode(oid_to_inode_idx)
2. write to an already allocated object: just write(oid).

The current sheepdog driver uses a range update_inode(min_idx, max_idx) for
batching the updates. But there is a subtle problem in determining min_idx
and max_idx: for a single create request, min_idx == max_idx, so we actually
just update one bit as expected. Suppose we have 2 create requests,
create(10) and create(20); then min == 10, max == 20, and even though we
just need to update index 10 and index 20, update_inode(10,20) will actually
update the range from 10 to 20.

This would work if the update_inode() requests never overlapped. But
unfortunately, this is not true for some corner cases. So the problem arises
as follows:

req 1: update_inode(10,20)
req 2: update_inode(15,22)

req 1 and req 2 might have different values between [15,20] and cause
problems.

Based on the above analysis, I think the real fix is to fix update_inode(),
not to serialize all the requests in an overkill way. The fix would be easy:
considering that most update_inode() calls update only 1 index, we could just
make update_inode a single-bit updater, not a ranged one, in which case we
don't affect performance as the above patch does. (I actually suspect that
the above patch might not solve the problem, because update_inode() can
overlap even with the patch.)

If everyone agrees with my analysis, I'll post the fix.

Thanks,
Yuan
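To restate the two write cases above in code form, here is a toy,
self-contained sketch. The helper names (create_object, write_object,
update_inode) are hypothetical and not the driver's API; it only illustrates
the point that the inode is updated for exactly the slot that was created:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NUM_OBJS 8

/* Toy inode: slot i is non-zero once data object i has been created. */
static uint32_t data_vdi_id[NUM_OBJS];
static const uint32_t vdi_id = 0xfd32;

static void create_object(uint32_t idx) { printf("create(%u)\n", idx); }
static void write_object(uint32_t idx)  { printf("write(%u)\n", idx); }
static void update_inode(uint32_t idx)  { printf("update_inode(%u)\n", idx); }

static void handle_write(uint32_t idx)
{
    if (data_vdi_id[idx] == 0) {
        /* case 1: object not yet allocated - create it, then record the
         * new object in the inode (metadata) object, touching only the
         * slot that was actually created */
        create_object(idx);
        data_vdi_id[idx] = vdi_id;
        update_inode(idx);
    } else {
        /* case 2: object already allocated - plain overwrite, no inode
         * update needed */
        write_object(idx);
    }
}

int main(void)
{
    memset(data_vdi_id, 0, sizeof(data_vdi_id));
    handle_write(2);   /* first write: create(2), update_inode(2) */
    handle_write(2);   /* second write to the same object: write(2) only */
    handle_write(5);   /* first write elsewhere: create(5), update_inode(5) */
    return 0;
}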
Re: [Qemu-devel] [PATCH] sheepdog: serialize requests to overwrapping area
On Tue, Jul 28, 2015 at 10:31:32PM +0800, Liu Yuan wrote: On Mon, Jul 27, 2015 at 11:23:02AM -0400, Jeff Cody wrote: On Sat, Jul 18, 2015 at 01:44:24AM +0900, Hitoshi Mitake wrote: Current sheepdog driver only serializes create requests in oid unit. This mechanism isn't enough for handling requests to overwrapping area spanning multiple oids, so it can result bugs like below: https://bugs.launchpad.net/sheepdog-project/+bug/1456421 This patch adds a new serialization mechanism for the problem. The difference from the old one is: 1. serialize entire aiocb if their targetting areas overwrap 2. serialize all requests (read, write, and discard), not only creates This patch also removes the old mechanism because the new one can be an alternative. Okay, I figured out what the problem is myself and allow me to try to make it clear to non-sheepdog devs: sheepdog volume is thin-provision, so for the first write, we create the object internally, meaning that we need to handle write in two case: 1. write to non-allocated object, create it then update inode, so in this case two request will be generated: create(oid), update_inode(oid_to_inode_idx) 2. write the allocated object, just write(oid). Current sheepdog driver use a range update_inode(min_idx, max_idx) for batching the updates. But there is subtle problem by determining min_idx and max_idx: for a single create request, min_idx == max_idx, so actually we just update one one bit as expected. Suppose we have 2 create request, create(10) and create(20), then min == 10, max==20 even though we just need to update index 10 and index 20, update_inode(10,20) will actually update range from 10 to 20. This would work if all the update_inode() requests won't overlap. But unfortunately, this is not true for some corner case. So the problem arise as following: req 1: update_inode(10,20) req 2: update_inode(15,22) I patched the qemu to print min and max, and confirmed my analysis: diff --git a/block/sheepdog.c b/block/sheepdog.c index bd7cbed..f3e40e8 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -1976,6 +1976,7 @@ static void coroutine_fn sd_write_done(SheepdogAIOCB *acb) mn = s-min_dirty_data_idx; mx = s-max_dirty_data_idx; +printf(min %u, max %u\n, mn, mx); if (mn = mx) { /* we need to update the vdi object. */ offset = sizeof(s-inode) - sizeof(s-inode.data_vdi_id) + diff --git a/dtc b/dtc index 65cc4d2..bc895d6 16 ... min 4294967295, max 0 min 9221, max 9222 min 9224, max 9728 min 9223, max 9223 min 9729, max 9730 min 9731, max 9732 min 9733, max 9734 min 9736, max 10240 min 9735, max 10241 ... Every line represents a update_inode(min, max) last 2 lines show 2 requests actually overlap while I ran mkfs.ext4 on a sheepdog volume. The overlapped requests might be reordered via network and cause inode[idx] back to 0 after being set by last request. Then a wrong remove request will be executed by sheep internally and accidentally remove a victim object. Thanks, Yuan req 1 and req 2 might have different value between [15,20] and cause problems. Based on above analysis, I think the real fix is to fix update_inode(), not serialize all the requests in overkill way. The fix would be easy, considering most update_inode() update only 1 index, we could just make update_inode a single bit updater, not a range one, in which way we don't affect performance as the above patch did. (I actually suspect that the above patch might not solve the problem because update_inode() can overlap even with the patch). If everyone agrees with my analysis, I'll post the fix. 
Thanks, Yuan
[Qemu-devel] [PATCH] sheepdog: fix overlapping metadata update
From: Liu Yuan <liuy...@cmss.chinamobile.com>

The current sheepdog driver uses a range update_inode(min_idx, max_idx) for
batching the updates. But there is a subtle problem in determining min_idx
and max_idx: for a single create request, min_idx == max_idx, so we actually
just update one bit as expected. Suppose we have 2 create requests,
create(10) and create(20); then min == 10, max == 20, and even though we
just need to update index 10 and index 20, update_inode(10,20) will actually
update the range from 10 to 20.

This would work if the update_inode() requests never overlapped. But
unfortunately, this is not true for some corner cases. So the problem arises
as follows:

req 1: update_inode(10,20)
req 2: update_inode(15,22)

req 1 and req 2 might have different values between [15,20] and cause
problems, which can be illustrated by adding a printf in sd_write_done:

@@ -1976,6 +1976,7 @@ static void coroutine_fn sd_write_done(SheepdogAIOCB *acb)

     mn = s->min_dirty_data_idx;
     mx = s->max_dirty_data_idx;
+    printf("min %u, max %u\n", mn, mx);
     if (mn <= mx) {
         /* we need to update the vdi object. */
         offset = sizeof(s->inode) - sizeof(s->inode.data_vdi_id) +

...
min 4294967295, max 0
min 9221, max 9222
min 9224, max 9728
min 9223, max 9223
min 9729, max 9730
min 9731, max 9732
min 9733, max 9734
min 9736, max 10240
min 9735, max 10241
...

Every line represents an update_inode(min, max); the last 2 lines show 2
requests that actually overlap while I ran mkfs.ext4 on a sheepdog volume.
The overlapped requests might be reordered via the network and cause
inode[idx] to go back to 0 after being set by the last request. Then a wrong
remove request will be executed by sheep internally and accidentally remove
a victim object, which is reported at:

https://bugs.launchpad.net/sheepdog-project/+bug/1456421

The fix is simple: we just update the inode one by one per aio_req. Since
aio_reqs never overlap, the inode updates never overlap either.

Cc: Jeff Cody <jc...@redhat.com>
Cc: Kevin Wolf <kw...@redhat.com>
Cc: Stefan Hajnoczi <stefa...@redhat.com>
Cc: Teruaki Ishizaki <ishizaki.teru...@lab.ntt.co.jp>
Cc: Vasiliy Tolstov <v.tols...@selfip.ru>
Cc: Hitoshi Mitake <mitake.hito...@lab.ntt.co.jp>
Signed-off-by: Liu Yuan <liuy...@cmss.chinamobile.com>
---
 block/sheepdog.c | 77
 1 file changed, 22 insertions(+), 55 deletions(-)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index bd7cbed..d1eeb81 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -342,9 +342,6 @@ typedef struct BDRVSheepdogState {

     SheepdogInode inode;

-    uint32_t min_dirty_data_idx;
-    uint32_t max_dirty_data_idx;
-
     char name[SD_MAX_VDI_LEN];
     bool is_snapshot;
     uint32_t cache_flags;
@@ -782,6 +779,26 @@ static coroutine_fn void reconnect_to_sdog(void *opaque)
     }
 }

+static void update_inode(BDRVSheepdogState *s, AIOReq *aio_req)
+{
+    struct iovec iov;
+    uint32_t offset, data_len;
+    SheepdogAIOCB *acb = aio_req->aiocb;
+    int idx = data_oid_to_idx(aio_req->oid);
+
+    offset = SD_INODE_HEADER_SIZE + sizeof(uint32_t) * idx;
+    data_len = sizeof(uint32_t);
+
+    iov.iov_base = &s->inode;
+    iov.iov_len = sizeof(s->inode);
+    aio_req = alloc_aio_req(s, acb, vid_to_vdi_oid(s->inode.vdi_id),
+                            data_len, offset, 0, false, 0, offset);
+    QLIST_INSERT_HEAD(&s->inflight_aio_head, aio_req, aio_siblings);
+    add_aio_request(s, aio_req, &iov, 1, AIOCB_WRITE_UDATA);
+
+    return;
+}
+
 /*
  * Receive responses of the I/O requests.
  *
@@ -820,25 +837,15 @@ static void coroutine_fn aio_read_response(void *opaque)

     switch (acb->aiocb_type) {
     case AIOCB_WRITE_UDATA:
-        /* this coroutine context is no longer suitable for co_recv
-         * because we may send data to update vdi objects */
-        s->co_recv = NULL;
         if (!is_data_obj(aio_req->oid)) {
             break;
         }
         idx = data_oid_to_idx(aio_req->oid);
         if (aio_req->create) {
-            /*
-             * If the object is newly created one, we need to update
-             * the vdi object (metadata object). min_dirty_data_idx
-             * and max_dirty_data_idx are changed to include updated
-             * index between them.
-             */
             if (rsp.result == SD_RES_SUCCESS) {
                 s->inode.data_vdi_id[idx] = s->inode.vdi_id;
-                s->max_dirty_data_idx = MAX(idx, s->max_dirty_data_idx);
-                s->min_dirty_data_idx = MIN(idx, s->min_dirty_data_idx);
+                update_inode(s, aio_req);
             }
             /*
              * Some requests may be blocked because simultaneous
@@ -1518,8 +1525,6 @@ static int sd_open(BlockDriverState *bs, QDict *options, int flags,
     }

     memcpy(&s->inode, buf, sizeof(s->inode));
-    s->min_dirty_data_idx = UINT32_MAX;
-    s->max_dirty_data_idx = 0;

     bs->total_sectors = s->inode.vdi_size / BDRV_SECTOR_SIZE;
     pstrcpy(s->name, sizeof
Re: [Qemu-devel] [PATCH v2] sheepdog: fix confused return values
On Wed, Feb 18, 2015 at 11:57:55AM +0800, Liu Yuan wrote: From: Liu Yuan liuy...@cmss.chinamobile.com These functions mix up -1 and -errno in return values and would might cause trouble error handling in the call chain. This patch let them return -errno and add some comments. Cc: qemu-devel@nongnu.org Cc: Markus Armbruster arm...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Reported-by: Markus Armbruster arm...@redhat.com Signed-off-by: Liu Yuan liuy...@cmss.chinamobile.com --- v2: - use socket_error() instead of errno block/sheepdog.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/block/sheepdog.c b/block/sheepdog.c index be3176f..e4b30ba 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -527,6 +527,7 @@ static SheepdogAIOCB *sd_aio_setup(BlockDriverState *bs, QEMUIOVector *qiov, return acb; } +/* Return -EIO in case of error, file descriptor on success */ static int connect_to_sdog(BDRVSheepdogState *s, Error **errp) { int fd; @@ -546,11 +547,14 @@ static int connect_to_sdog(BDRVSheepdogState *s, Error **errp) if (fd = 0) { qemu_set_nonblock(fd); +} else { +fd = -EIO; } return fd; } +/* Return 0 on success and -errno in case of error */ static coroutine_fn int send_co_req(int sockfd, SheepdogReq *hdr, void *data, unsigned int *wlen) { @@ -559,11 +563,13 @@ static coroutine_fn int send_co_req(int sockfd, SheepdogReq *hdr, void *data, ret = qemu_co_send(sockfd, hdr, sizeof(*hdr)); if (ret != sizeof(*hdr)) { error_report(failed to send a req, %s, strerror(errno)); +ret = -socket_error(); return ret; } ret = qemu_co_send(sockfd, data, *wlen); if (ret != *wlen) { +ret = -socket_error(); error_report(failed to send a req, %s, strerror(errno)); } @@ -638,6 +644,11 @@ out: srco-finished = true; } +/* + * Send the request to the sheep in a synchronous manner. + * + * Return 0 on success, -errno in case of error. + */ static int do_req(int sockfd, AioContext *aio_context, SheepdogReq *hdr, void *data, unsigned int *wlen, unsigned int *rlen) { -- 1.9.1 Ping...
Re: [Qemu-devel] [PATCH] sheepdog: fix confused return values
On Wed, Feb 25, 2015 at 10:35:17AM +0100, Markus Armbruster wrote:
> Liu Yuan <namei.u...@gmail.com> writes:
> > On Wed, Feb 18, 2015 at 09:35:07AM +0100, Markus Armbruster wrote:
> > > Liu Yuan <namei.u...@gmail.com> writes:
> > > > From: Liu Yuan <liuy...@cmss.chinamobile.com>
> > > >
> > > > These functions mix up -1 and -errno in return values, which might
> > > > cause trouble for error handling in the call chain. This patch lets
> > > > them return -errno and adds some comments.
> > > >
> > > > Reported-by: Markus Armbruster <arm...@redhat.com>
> > > > Signed-off-by: Liu Yuan <liuy...@cmss.chinamobile.com>
> > >
> > > Did you review all functions returning negative errno, or just the two
> > > that caught my eye?
> >
> > Umm, mostly the two you mentioned.
>
> I encourage you to review the whole file for this error pattern! In my
> experience, bugs occur in clusters. When you find one, there's a high
> chance similar ones exist, and looking for them is good practice.

Hi Markus, I've checked the whole file as thoroughly as I can. I can't find
any more return-value mixing bugs.

Yuan
Re: [Qemu-devel] [PATCH] sheepdog: fix confused return values
On Wed, Feb 18, 2015 at 09:35:07AM +0100, Markus Armbruster wrote:
> Liu Yuan <namei.u...@gmail.com> writes:
> > From: Liu Yuan <liuy...@cmss.chinamobile.com>
> >
> > These functions mix up -1 and -errno in return values, which might cause
> > trouble for error handling in the call chain. This patch lets them return
> > -errno and adds some comments.
> >
> > Reported-by: Markus Armbruster <arm...@redhat.com>
> > Signed-off-by: Liu Yuan <liuy...@cmss.chinamobile.com>
>
> Did you review all functions returning negative errno, or just the two
> that caught my eye?

Umm, mostly the two you mentioned.

Yuan
[Qemu-devel] [PATCH v2] sheepdog: fix confused return values
From: Liu Yuan <liuy...@cmss.chinamobile.com>

These functions mix up -1 and -errno in return values, which might cause
trouble for error handling in the call chain. This patch lets them return
-errno and adds some comments.

Cc: qemu-devel@nongnu.org
Cc: Markus Armbruster <arm...@redhat.com>
Cc: Kevin Wolf <kw...@redhat.com>
Cc: Stefan Hajnoczi <stefa...@redhat.com>
Reported-by: Markus Armbruster <arm...@redhat.com>
Signed-off-by: Liu Yuan <liuy...@cmss.chinamobile.com>
---
v2:
 - use socket_error() instead of errno

 block/sheepdog.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index be3176f..e4b30ba 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -527,6 +527,7 @@ static SheepdogAIOCB *sd_aio_setup(BlockDriverState *bs, QEMUIOVector *qiov,
     return acb;
 }

+/* Return -EIO in case of error, file descriptor on success */
 static int connect_to_sdog(BDRVSheepdogState *s, Error **errp)
 {
     int fd;
@@ -546,11 +547,14 @@ static int connect_to_sdog(BDRVSheepdogState *s, Error **errp)

     if (fd >= 0) {
         qemu_set_nonblock(fd);
+    } else {
+        fd = -EIO;
     }

     return fd;
 }

+/* Return 0 on success and -errno in case of error */
 static coroutine_fn int send_co_req(int sockfd, SheepdogReq *hdr, void *data,
                                     unsigned int *wlen)
 {
@@ -559,11 +563,13 @@ static coroutine_fn int send_co_req(int sockfd, SheepdogReq *hdr, void *data,
     ret = qemu_co_send(sockfd, hdr, sizeof(*hdr));
     if (ret != sizeof(*hdr)) {
         error_report("failed to send a req, %s", strerror(errno));
+        ret = -socket_error();
         return ret;
     }

     ret = qemu_co_send(sockfd, data, *wlen);
     if (ret != *wlen) {
+        ret = -socket_error();
         error_report("failed to send a req, %s", strerror(errno));
     }

@@ -638,6 +644,11 @@ out:
     srco->finished = true;
 }

+/*
+ * Send the request to the sheep in a synchronous manner.
+ *
+ * Return 0 on success, -errno in case of error.
+ */
 static int do_req(int sockfd, AioContext *aio_context, SheepdogReq *hdr,
                   void *data, unsigned int *wlen, unsigned int *rlen)
 {
--
1.9.1
Re: [Qemu-devel] [PATCH] sheepdog: fix confused return values
On Mon, Feb 16, 2015 at 12:56:04PM +0100, Kevin Wolf wrote: Am 13.02.2015 um 04:45 hat Liu Yuan geschrieben: From: Liu Yuan liuy...@cmss.chinamobile.com These functions mix up -1 and -errno in return values and would might cause trouble error handling in the call chain. This patch let them return -errno and add some comments. Reported-by: Markus Armbruster arm...@redhat.com Signed-off-by: Liu Yuan liuy...@cmss.chinamobile.com --- block/sheepdog.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/block/sheepdog.c b/block/sheepdog.c index be3176f..c28658c 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -527,6 +527,7 @@ static SheepdogAIOCB *sd_aio_setup(BlockDriverState *bs, QEMUIOVector *qiov, return acb; } +/* Return -EIO in case of error, file descriptor on success */ static int connect_to_sdog(BDRVSheepdogState *s, Error **errp) { int fd; @@ -546,11 +547,14 @@ static int connect_to_sdog(BDRVSheepdogState *s, Error **errp) if (fd = 0) { qemu_set_nonblock(fd); +} else { +fd = -EIO; } return fd; } +/* Return 0 on success and -errno in case of error */ static coroutine_fn int send_co_req(int sockfd, SheepdogReq *hdr, void *data, unsigned int *wlen) { @@ -559,11 +563,13 @@ static coroutine_fn int send_co_req(int sockfd, SheepdogReq *hdr, void *data, ret = qemu_co_send(sockfd, hdr, sizeof(*hdr)); if (ret != sizeof(*hdr)) { error_report(failed to send a req, %s, strerror(errno)); +ret = -errno; qemu_co_sendv_recvv() uses socket_error() internally to access the return code. This is defined as errno on POSIX, but as WSAGetLastError() on Windows. You should probably either use socket_error() here, or change qemu_co_sendv_recvv() to return a negative error code instead of -1. return ret; } ret = qemu_co_send(sockfd, data, *wlen); if (ret != *wlen) { +ret = -errno; error_report(failed to send a req, %s, strerror(errno)); } The same here. Hi, Kevin, thanks for your tips. I'll post v2 against your comments. Yuan
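For illustration of the pattern under discussion, here is a small sketch
outside QEMU. It uses plain POSIX send() and errno rather than QEMU's
qemu_co_send()/socket_error(), and send_req() is a hypothetical helper, but
it shows the two points raised above: return a negative errno instead of a
bare -1, and capture the error code immediately after the failing call,
before anything can clobber it.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Return 0 on success and -errno (or -EIO for a short send) on error. */
static int send_req(int sockfd, const void *buf, size_t len)
{
    ssize_t ret = send(sockfd, buf, len, 0);

    if (ret < 0) {
        int err = -errno;                /* capture before it can be clobbered */
        fprintf(stderr, "failed to send a req, %s\n", strerror(errno));
        return err;                      /* -errno, not -1 */
    }
    if ((size_t)ret != len) {
        fprintf(stderr, "short send: %zd of %zu bytes\n", ret, len);
        return -EIO;                     /* no meaningful errno for a short send */
    }
    return 0;
}

On Windows the error would come from WSAGetLastError() instead of errno,
which is exactly why the v2 patch switches to socket_error().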
Re: [Qemu-devel] [sheepdog] [PATCH v4] sheepdog: selectable object size support
On Thu, Feb 12, 2015 at 05:13:55PM +0900, Hitoshi Mitake wrote: At Thu, 12 Feb 2015 15:42:17 +0800, Liu Yuan wrote: On Thu, Feb 12, 2015 at 04:28:01PM +0900, Hitoshi Mitake wrote: At Thu, 12 Feb 2015 15:00:49 +0800, Liu Yuan wrote: On Thu, Feb 12, 2015 at 03:19:21PM +0900, Hitoshi Mitake wrote: At Tue, 10 Feb 2015 18:35:58 +0800, Liu Yuan wrote: On Tue, Feb 10, 2015 at 06:56:33PM +0900, Teruaki Ishizaki wrote: (2015/02/10 17:58), Liu Yuan wrote: On Tue, Feb 10, 2015 at 05:22:02PM +0900, Teruaki Ishizaki wrote: (2015/02/10 12:10), Liu Yuan wrote: On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote: Previously, qemu block driver of sheepdog used hard-coded VDI object size. This patch enables users to handle block_size_shift value for calculating VDI object size. When you start qemu, you don't need to specify additional command option. But when you create the VDI which doesn't have default object size with qemu-img command, you specify block_size_shift option. If you want to create a VDI of 8MB(1 23) object size, you need to specify following command option. # qemu-img create -o block_size_shift=23 sheepdog:test1 100M In addition, when you don't specify qemu-img command option, a default value of sheepdog cluster is used for creating VDI. # qemu-img create sheepdog:test2 100M Signed-off-by: Teruaki Ishizaki ishizaki.teru...@lab.ntt.co.jp --- V4: - Limit a read/write buffer size for creating a preallocated VDI. - Replace a parse function for the block_size_shift option. - Fix an error message. V3: - Delete the needless operation of buffer. - Delete the needless operations of request header. for SD_OP_GET_CLUSTER_DEFAULT. - Fix coding style problems. V2: - Fix coding style problem (white space). - Add members, store_policy and block_size_shift to struct SheepdogVdiReq. - Initialize request header to use block_size_shift specified by user. 
--- block/sheepdog.c | 138 ++--- include/block/block_int.h |1 + 2 files changed, 119 insertions(+), 20 deletions(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index be3176f..a43b947 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -37,6 +37,7 @@ #define SD_OP_READ_VDIS 0x15 #define SD_OP_FLUSH_VDI 0x16 #define SD_OP_DEL_VDI0x17 +#define SD_OP_GET_CLUSTER_DEFAULT 0x18 #define SD_FLAG_CMD_WRITE0x01 #define SD_FLAG_CMD_COW 0x02 @@ -167,7 +168,8 @@ typedef struct SheepdogVdiReq { uint32_t base_vdi_id; uint8_t copies; uint8_t copy_policy; -uint8_t reserved[2]; +uint8_t store_policy; +uint8_t block_size_shift; uint32_t snapid; uint32_t type; uint32_t pad[2]; @@ -186,6 +188,21 @@ typedef struct SheepdogVdiRsp { uint32_t pad[5]; } SheepdogVdiRsp; +typedef struct SheepdogClusterRsp { +uint8_t proto_ver; +uint8_t opcode; +uint16_t flags; +uint32_t epoch; +uint32_t id; +uint32_t data_length; +uint32_t result; +uint8_t nr_copies; +uint8_t copy_policy; +uint8_t block_size_shift; +uint8_t __pad1; +uint32_t __pad2[6]; +} SheepdogClusterRsp; + typedef struct SheepdogInode { char name[SD_MAX_VDI_LEN]; char tag[SD_MAX_VDI_TAG_LEN]; @@ -1544,6 +1561,7 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, hdr.vdi_size = s-inode.vdi_size; hdr.copy_policy = s-inode.copy_policy; hdr.copies = s-inode.nr_copies; +hdr.block_size_shift = s-inode.block_size_shift; ret = do_req(fd, s-aio_context, (SheepdogReq *)hdr, buf, wlen, rlen); @@ -1569,9 +1587,12 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, static int sd_prealloc(const char *filename, Error **errp) { BlockDriverState *bs = NULL; +BDRVSheepdogState *base = NULL; +unsigned long buf_size; uint32_t idx, max_idx; +uint32_t object_size; int64_t vdi_size; -void *buf = g_malloc0(SD_DATA_OBJ_SIZE); +void *buf = NULL; int ret
Re: [Qemu-devel] [PATCH v4] sheepdog: selectable object size support
On Fri, Feb 13, 2015 at 10:33:04AM +0900, Teruaki Ishizaki wrote: (2015/02/12 11:55), Liu Yuan wrote: On Thu, Feb 12, 2015 at 11:33:16AM +0900, Teruaki Ishizaki wrote: (2015/02/12 11:19), Liu Yuan wrote: On Thu, Feb 12, 2015 at 10:51:25AM +0900, Teruaki Ishizaki wrote: (2015/02/10 20:12), Liu Yuan wrote: On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote: Previously, qemu block driver of sheepdog used hard-coded VDI object size. This patch enables users to handle block_size_shift value for calculating VDI object size. When you start qemu, you don't need to specify additional command option. But when you create the VDI which doesn't have default object size with qemu-img command, you specify block_size_shift option. If you want to create a VDI of 8MB(1 23) object size, you need to specify following command option. # qemu-img create -o block_size_shift=23 sheepdog:test1 100M In addition, when you don't specify qemu-img command option, a default value of sheepdog cluster is used for creating VDI. # qemu-img create sheepdog:test2 100M Signed-off-by: Teruaki Ishizaki ishizaki.teru...@lab.ntt.co.jp --- V4: - Limit a read/write buffer size for creating a preallocated VDI. - Replace a parse function for the block_size_shift option. - Fix an error message. V3: - Delete the needless operation of buffer. - Delete the needless operations of request header. for SD_OP_GET_CLUSTER_DEFAULT. - Fix coding style problems. V2: - Fix coding style problem (white space). - Add members, store_policy and block_size_shift to struct SheepdogVdiReq. - Initialize request header to use block_size_shift specified by user. --- block/sheepdog.c | 138 ++--- include/block/block_int.h |1 + 2 files changed, 119 insertions(+), 20 deletions(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index be3176f..a43b947 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -37,6 +37,7 @@ #define SD_OP_READ_VDIS 0x15 #define SD_OP_FLUSH_VDI 0x16 #define SD_OP_DEL_VDI0x17 +#define SD_OP_GET_CLUSTER_DEFAULT 0x18 This might not be necessary. For old qemu or the qemu-img without setting option, the block_size_shift will be 0. If we make 0 to represent 4MB object, then we don't need to get the default cluster object size. We migth even get rid of the idea of cluster default size. The downsize is that, if we want to create a vdi with different size not the default 4MB, we have to write it every time for qemu-img or dog. If we choose to keep the idea of cluster default size, I think we'd also try to avoid call this request from QEMU to make backward compatibility easier. In this scenario, 0 might be used to ask new sheep to decide to use cluster default size. Both old qemu and new QEMU will send 0 to sheep and both old and new sheep can handle 0 though it has different meanings. Table for this bit as 0: Qe: qemu SD: Sheep daemon CDS: Cluster Default Size Ign: Ignored by the sheep daemon Qe/sd newold new CDSIgn old CDSNULL Does Ign mean that VDI is handled as 4MB object size? Yes, old sheep can only handle 4MB object and doesn't check this field at all. I think this approach is acceptable. The difference to your patch is that we don't send SD_OP_GET_CLUSTER_DEFAULT to sheep daemon and SD_OP_GET_CLUSTER_DEFAULT can be removed. When users create a new VDI with qemu-img, qemu's Sheepdog backend driver calculates max limit VDI size. But if block_size_shift option is not specified, qemu's Sheepdog backend driver can't calculate max limit VDI size. 
If block_size_shift is not specified, this means: 1) for old sheep, use the 4MB size; 2) for new sheep, use the cluster-wide default value. And sheep can then calculate it on its own, no?

The dog command (client) calculates the max size, so I think that qemu's Sheepdog backend driver should calculate it the same way the dog command does. Is that policy changeable?

I checked the QEMU code and got your idea. In the past the object size was fixed, so it was very easy to hardcode the check in the client; no communication with sheep was needed. Yes, if it is reasonable, we can change it. I think we can push the size calculation logic into sheep and, if the size is not right, return INVALID_PARAMETER to clients. Clients just check this and report the error back to users. There is no backward compatibility problem with this approach, since 4MB is the smallest size: old QEMU will limit the max size to 4TB, which is no problem for new sheep.

I have checked the QEMU and sheepdog code. When we resize a VDI, sd_truncate() is called and the resize value is handled by QEMU. (Sorry, I hadn't noticed this operation.) Then sd_truncate() writes the Sheepdog inode object directly, so the Sheepdog server can't enforce the maximum VDI size. As I thought, should we use SD_OP_GET_CLUSTER_DEFAULT?
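For reference, the client-side limit being debated reduces to a one-line check once the object size is known. A minimal sketch, assuming the inode keeps a fixed number of data-object slots regardless of object size; MAX_DATA_OBJS and vdi_size_ok() are illustrative names here, not the driver's actual definitions:

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_DATA_OBJS (UINT64_C(1) << 20)   /* assumption: slot count is unchanged */

    /* true if the requested VDI size fits into the inode's object slots */
    static bool vdi_size_ok(uint64_t vdi_size, uint8_t block_size_shift)
    {
        uint64_t object_size = UINT64_C(1) << block_size_shift;

        /* 4TB for 22-bit (4MB) objects, 8TB for 23-bit (8MB) objects, ... */
        return vdi_size <= object_size * MAX_DATA_OBJS;
    }

With 4MB objects this reproduces the familiar 4TB ceiling; larger objects raise it proportionally, which is exactly why a hard-coded client-side limit no longer works.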
Re: [Qemu-devel] [PATCH] sheepdog: Fix misleading error messages in sd_snapshot_create()
On Thu, Feb 12, 2015 at 02:49:50PM +0100, Markus Armbruster wrote:

If do_sd_create() fails, it first reports the error returned, then reports another one with strerror(errno). errno is meaningless at that point. Report just one error combining the valid information from both messages.

Reported-by: Eric Blake ebl...@redhat.com
Signed-off-by: Markus Armbruster arm...@redhat.com
---
Applies on top of my [PATCH v2 00/10] Clean up around error_get_pretty(), qerror_report_err(), but rebasing to master would be trivial.

 block/sheepdog.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index 6a4a3bd..0e8c712 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -2225,9 +2225,8 @@ static int sd_snapshot_create(BlockDriverState *bs, QEMUSnapshotInfo *sn_info)
 
     ret = do_sd_create(s, &new_vid, 1, &local_err);
     if (ret < 0) {
-        error_report_err(local_err);
-        error_report("failed to create inode for snapshot. %s",
-                     strerror(errno));
+        error_report("failed to create inode for snapshot: %s",
+                     error_get_pretty(local_err));
         goto cleanup;
     }
 

Reviewed-by: Liu Yuan namei.u...@gmail.com
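The gist of the fix: local_err already carries the meaningful message, while errno at that point may be stale or left over from an unrelated call. A toy, self-contained illustration of the two reporting styles, using a stand-in Error type rather than QEMU's real error machinery:

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    /* toy stand-ins for QEMU's Error API, for illustration only */
    typedef struct Error { const char *msg; } Error;
    static const char *error_get_pretty(const Error *err) { return err->msg; }

    int main(void)
    {
        Error local_err = { "VDI already exists" };

        errno = 0;                          /* nothing failed at the libc level */
        fprintf(stderr, "failed to create inode for snapshot. %s\n",
                strerror(errno));           /* misleading: prints "Success" */
        fprintf(stderr, "failed to create inode for snapshot: %s\n",
                error_get_pretty(&local_err));  /* prints the real cause */
        return 0;
    }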
Re: [Qemu-devel] [PATCH] sheepdog: fix confused return values
On Fri, Feb 13, 2015 at 11:45:42AM +0800, Liu Yuan wrote:

From: Liu Yuan liuy...@cmss.chinamobile.com

These functions mix up -1 and -errno in return values, which might cause trouble in error handling along the call chain. This patch lets them return -errno and adds some comments.

Reported-by: Markus Armbruster arm...@redhat.com
Signed-off-by: Liu Yuan liuy...@cmss.chinamobile.com

Cc Kevin and Stefan
[Qemu-devel] [PATCH] sheepdog: fix confused return values
From: Liu Yuan liuy...@cmss.chinamobile.com

These functions mix up -1 and -errno in return values, which might cause trouble in error handling along the call chain. This patch lets them return -errno and adds some comments.

Reported-by: Markus Armbruster arm...@redhat.com
Signed-off-by: Liu Yuan liuy...@cmss.chinamobile.com
---
 block/sheepdog.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index be3176f..c28658c 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -527,6 +527,7 @@ static SheepdogAIOCB *sd_aio_setup(BlockDriverState *bs, QEMUIOVector *qiov,
     return acb;
 }
 
+/* Return -EIO in case of error, file descriptor on success */
 static int connect_to_sdog(BDRVSheepdogState *s, Error **errp)
 {
     int fd;
@@ -546,11 +547,14 @@ static int connect_to_sdog(BDRVSheepdogState *s, Error **errp)
 
     if (fd >= 0) {
         qemu_set_nonblock(fd);
+    } else {
+        fd = -EIO;
     }
 
     return fd;
 }
 
+/* Return 0 on success and -errno in case of error */
 static coroutine_fn int send_co_req(int sockfd, SheepdogReq *hdr, void *data,
                                     unsigned int *wlen)
 {
@@ -559,11 +563,13 @@ static coroutine_fn int send_co_req(int sockfd, SheepdogReq *hdr, void *data,
     ret = qemu_co_send(sockfd, hdr, sizeof(*hdr));
     if (ret != sizeof(*hdr)) {
         error_report("failed to send a req, %s", strerror(errno));
+        ret = -errno;
         return ret;
     }
 
     ret = qemu_co_send(sockfd, data, *wlen);
     if (ret != *wlen) {
+        ret = -errno;
         error_report("failed to send a req, %s", strerror(errno));
     }
 
@@ -638,6 +644,11 @@ out:
     srco->finished = true;
 }
 
+/*
+ * Send the request to the sheep in a synchronous manner.
+ *
+ * Return 0 on success, -errno in case of error.
+ */
 static int do_req(int sockfd, AioContext *aio_context, SheepdogReq *hdr,
                   void *data, unsigned int *wlen, unsigned int *rlen)
 {
-- 
1.9.1
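A side note on why the mixed conventions bite: on Linux, EPERM is 1, so a bare -1 return is indistinguishable from -EPERM for any caller that compares against specific error codes. A toy, self-contained illustration; old_style_connect() and new_style_connect() are made-up stand-ins, not driver functions:

    #include <errno.h>
    #include <stdio.h>

    static int old_style_connect(void) { return -1; }    /* generic failure */
    static int new_style_connect(void) { return -EIO; }  /* failure as -errno */

    int main(void)
    {
        if (old_style_connect() == -EPERM) {
            printf("old style: -1 is mistaken for -EPERM\n");
        }
        if (new_style_connect() == -EIO) {
            printf("new style: the caller sees the real error code\n");
        }
        return 0;
    }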
Re: [Qemu-devel] [sheepdog] [PATCH v4] sheepdog: selectable object size support
On Thu, Feb 12, 2015 at 05:01:05PM +0900, Teruaki Ishizaki wrote: (2015/02/12 16:42), Liu Yuan wrote: On Thu, Feb 12, 2015 at 04:28:01PM +0900, Hitoshi Mitake wrote: At Thu, 12 Feb 2015 15:00:49 +0800, Liu Yuan wrote: On Thu, Feb 12, 2015 at 03:19:21PM +0900, Hitoshi Mitake wrote: At Tue, 10 Feb 2015 18:35:58 +0800, Liu Yuan wrote: On Tue, Feb 10, 2015 at 06:56:33PM +0900, Teruaki Ishizaki wrote: (2015/02/10 17:58), Liu Yuan wrote: On Tue, Feb 10, 2015 at 05:22:02PM +0900, Teruaki Ishizaki wrote: (2015/02/10 12:10), Liu Yuan wrote: On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote: Previously, qemu block driver of sheepdog used hard-coded VDI object size. This patch enables users to handle block_size_shift value for calculating VDI object size. When you start qemu, you don't need to specify additional command option. But when you create the VDI which doesn't have default object size with qemu-img command, you specify block_size_shift option. If you want to create a VDI of 8MB(1 23) object size, you need to specify following command option. # qemu-img create -o block_size_shift=23 sheepdog:test1 100M In addition, when you don't specify qemu-img command option, a default value of sheepdog cluster is used for creating VDI. # qemu-img create sheepdog:test2 100M Signed-off-by: Teruaki Ishizaki ishizaki.teru...@lab.ntt.co.jp --- V4: - Limit a read/write buffer size for creating a preallocated VDI. - Replace a parse function for the block_size_shift option. - Fix an error message. V3: - Delete the needless operation of buffer. - Delete the needless operations of request header. for SD_OP_GET_CLUSTER_DEFAULT. - Fix coding style problems. V2: - Fix coding style problem (white space). - Add members, store_policy and block_size_shift to struct SheepdogVdiReq. - Initialize request header to use block_size_shift specified by user. 
--- block/sheepdog.c | 138 ++--- include/block/block_int.h |1 + 2 files changed, 119 insertions(+), 20 deletions(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index be3176f..a43b947 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -37,6 +37,7 @@ #define SD_OP_READ_VDIS 0x15 #define SD_OP_FLUSH_VDI 0x16 #define SD_OP_DEL_VDI0x17 +#define SD_OP_GET_CLUSTER_DEFAULT 0x18 #define SD_FLAG_CMD_WRITE0x01 #define SD_FLAG_CMD_COW 0x02 @@ -167,7 +168,8 @@ typedef struct SheepdogVdiReq { uint32_t base_vdi_id; uint8_t copies; uint8_t copy_policy; -uint8_t reserved[2]; +uint8_t store_policy; +uint8_t block_size_shift; uint32_t snapid; uint32_t type; uint32_t pad[2]; @@ -186,6 +188,21 @@ typedef struct SheepdogVdiRsp { uint32_t pad[5]; } SheepdogVdiRsp; +typedef struct SheepdogClusterRsp { +uint8_t proto_ver; +uint8_t opcode; +uint16_t flags; +uint32_t epoch; +uint32_t id; +uint32_t data_length; +uint32_t result; +uint8_t nr_copies; +uint8_t copy_policy; +uint8_t block_size_shift; +uint8_t __pad1; +uint32_t __pad2[6]; +} SheepdogClusterRsp; + typedef struct SheepdogInode { char name[SD_MAX_VDI_LEN]; char tag[SD_MAX_VDI_TAG_LEN]; @@ -1544,6 +1561,7 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, hdr.vdi_size = s-inode.vdi_size; hdr.copy_policy = s-inode.copy_policy; hdr.copies = s-inode.nr_copies; +hdr.block_size_shift = s-inode.block_size_shift; ret = do_req(fd, s-aio_context, (SheepdogReq *)hdr, buf, wlen, rlen); @@ -1569,9 +1587,12 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, static int sd_prealloc(const char *filename, Error **errp) { BlockDriverState *bs = NULL; +BDRVSheepdogState *base = NULL; +unsigned long buf_size; uint32_t idx, max_idx; +uint32_t object_size; int64_t vdi_size; -void *buf = g_malloc0(SD_DATA_OBJ_SIZE); +void *buf = NULL; int ret; ret = bdrv_open(bs, filename, NULL, NULL, BDRV_O_RDWR | BDRV_O_PROTOCOL, @@ -1585,18 +1606,24 @@ static int sd_prealloc(const char *filename, Error **errp) ret = vdi_size; goto out; } -max_idx = DIV_ROUND_UP(vdi_size, SD_DATA_OBJ_SIZE); + +base = bs-opaque; +object_size = (UINT32_C(1) base-inode.block_size_shift); +buf_size = MIN(object_size, SD_DATA_OBJ_SIZE); +buf = g_malloc0(buf_size); + +max_idx = DIV_ROUND_UP(vdi_size, buf_size); for (idx = 0; idx max_idx; idx++) { /* * The created image can be a cloned image, so we need to read * a data from the source image. */ -ret = bdrv_pread(bs, idx * SD_DATA_OBJ_SIZE, buf, SD_DATA_OBJ_SIZE); +ret
Re: [Qemu-devel] [PATCH v4] sheepdog: selectable object size support
On Thu, Feb 12, 2015 at 10:51:25AM +0900, Teruaki Ishizaki wrote: (2015/02/10 20:12), Liu Yuan wrote: On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote: Previously, qemu block driver of sheepdog used hard-coded VDI object size. This patch enables users to handle block_size_shift value for calculating VDI object size. When you start qemu, you don't need to specify additional command option. But when you create the VDI which doesn't have default object size with qemu-img command, you specify block_size_shift option. If you want to create a VDI of 8MB(1 23) object size, you need to specify following command option. # qemu-img create -o block_size_shift=23 sheepdog:test1 100M In addition, when you don't specify qemu-img command option, a default value of sheepdog cluster is used for creating VDI. # qemu-img create sheepdog:test2 100M Signed-off-by: Teruaki Ishizaki ishizaki.teru...@lab.ntt.co.jp --- V4: - Limit a read/write buffer size for creating a preallocated VDI. - Replace a parse function for the block_size_shift option. - Fix an error message. V3: - Delete the needless operation of buffer. - Delete the needless operations of request header. for SD_OP_GET_CLUSTER_DEFAULT. - Fix coding style problems. V2: - Fix coding style problem (white space). - Add members, store_policy and block_size_shift to struct SheepdogVdiReq. - Initialize request header to use block_size_shift specified by user. --- block/sheepdog.c | 138 ++--- include/block/block_int.h |1 + 2 files changed, 119 insertions(+), 20 deletions(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index be3176f..a43b947 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -37,6 +37,7 @@ #define SD_OP_READ_VDIS 0x15 #define SD_OP_FLUSH_VDI 0x16 #define SD_OP_DEL_VDI0x17 +#define SD_OP_GET_CLUSTER_DEFAULT 0x18 This might not be necessary. For old qemu or the qemu-img without setting option, the block_size_shift will be 0. If we make 0 to represent 4MB object, then we don't need to get the default cluster object size. We migth even get rid of the idea of cluster default size. The downsize is that, if we want to create a vdi with different size not the default 4MB, we have to write it every time for qemu-img or dog. If we choose to keep the idea of cluster default size, I think we'd also try to avoid call this request from QEMU to make backward compatibility easier. In this scenario, 0 might be used to ask new sheep to decide to use cluster default size. Both old qemu and new QEMU will send 0 to sheep and both old and new sheep can handle 0 though it has different meanings. Table for this bit as 0: Qe: qemu SD: Sheep daemon CDS: Cluster Default Size Ign: Ignored by the sheep daemon Qe/sd newold new CDSIgn old CDSNULL Does Ign mean that VDI is handled as 4MB object size? Yes, old sheep can only handle 4MB object and doesn't check this field at all. I think this approach is acceptable. The difference to your patch is that we don't send SD_OP_GET_CLUSTER_DEFAULT to sheep daemon and SD_OP_GET_CLUSTER_DEFAULT can be removed. When users create a new VDI with qemu-img, qemu's Sheepdog backend driver calculates max limit VDI size. But if block_size_shift option is not specified, qemu's Sheepdog backend driver can't calculate max limit VDI size. If block_size_shift not specified, this means 1 for old sheep, use 4MB size 2 for new sheep, use cluster wide default value. And sheep then can calculate it on its own, no? Thanks Yuan
Re: [Qemu-devel] [PATCH v4] sheepdog: selectable object size support
On Thu, Feb 12, 2015 at 11:33:16AM +0900, Teruaki Ishizaki wrote: (2015/02/12 11:19), Liu Yuan wrote: On Thu, Feb 12, 2015 at 10:51:25AM +0900, Teruaki Ishizaki wrote: (2015/02/10 20:12), Liu Yuan wrote: On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote: Previously, qemu block driver of sheepdog used hard-coded VDI object size. This patch enables users to handle block_size_shift value for calculating VDI object size. When you start qemu, you don't need to specify additional command option. But when you create the VDI which doesn't have default object size with qemu-img command, you specify block_size_shift option. If you want to create a VDI of 8MB(1 23) object size, you need to specify following command option. # qemu-img create -o block_size_shift=23 sheepdog:test1 100M In addition, when you don't specify qemu-img command option, a default value of sheepdog cluster is used for creating VDI. # qemu-img create sheepdog:test2 100M Signed-off-by: Teruaki Ishizaki ishizaki.teru...@lab.ntt.co.jp --- V4: - Limit a read/write buffer size for creating a preallocated VDI. - Replace a parse function for the block_size_shift option. - Fix an error message. V3: - Delete the needless operation of buffer. - Delete the needless operations of request header. for SD_OP_GET_CLUSTER_DEFAULT. - Fix coding style problems. V2: - Fix coding style problem (white space). - Add members, store_policy and block_size_shift to struct SheepdogVdiReq. - Initialize request header to use block_size_shift specified by user. --- block/sheepdog.c | 138 ++--- include/block/block_int.h |1 + 2 files changed, 119 insertions(+), 20 deletions(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index be3176f..a43b947 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -37,6 +37,7 @@ #define SD_OP_READ_VDIS 0x15 #define SD_OP_FLUSH_VDI 0x16 #define SD_OP_DEL_VDI0x17 +#define SD_OP_GET_CLUSTER_DEFAULT 0x18 This might not be necessary. For old qemu or the qemu-img without setting option, the block_size_shift will be 0. If we make 0 to represent 4MB object, then we don't need to get the default cluster object size. We migth even get rid of the idea of cluster default size. The downsize is that, if we want to create a vdi with different size not the default 4MB, we have to write it every time for qemu-img or dog. If we choose to keep the idea of cluster default size, I think we'd also try to avoid call this request from QEMU to make backward compatibility easier. In this scenario, 0 might be used to ask new sheep to decide to use cluster default size. Both old qemu and new QEMU will send 0 to sheep and both old and new sheep can handle 0 though it has different meanings. Table for this bit as 0: Qe: qemu SD: Sheep daemon CDS: Cluster Default Size Ign: Ignored by the sheep daemon Qe/sd newold new CDSIgn old CDSNULL Does Ign mean that VDI is handled as 4MB object size? Yes, old sheep can only handle 4MB object and doesn't check this field at all. I think this approach is acceptable. The difference to your patch is that we don't send SD_OP_GET_CLUSTER_DEFAULT to sheep daemon and SD_OP_GET_CLUSTER_DEFAULT can be removed. When users create a new VDI with qemu-img, qemu's Sheepdog backend driver calculates max limit VDI size. But if block_size_shift option is not specified, qemu's Sheepdog backend driver can't calculate max limit VDI size. If block_size_shift not specified, this means 1 for old sheep, use 4MB size 2 for new sheep, use cluster wide default value. And sheep then can calculate it on its own, no? 
The dog command (client) calculates the max size, so I think that qemu's Sheepdog backend driver should calculate it the same way the dog command does. Is that policy changeable?

I checked the QEMU code and got your idea. In the past the object size was fixed, so it was very easy to hardcode the check in the client; no communication with sheep was needed. Yes, if it is reasonable, we can change it. I think we can push the size calculation logic into sheep and, if the size is not right, return INVALID_PARAMETER to clients. Clients just check this and report the error back to users. There is no backward compatibility problem with this approach, since 4MB is the smallest size: old QEMU will limit the max size to 4TB, which is no problem for new sheep.

Thanks Yuan
Re: [Qemu-devel] [sheepdog] [PATCH v4] sheepdog: selectable object size support
On Thu, Feb 12, 2015 at 03:19:21PM +0900, Hitoshi Mitake wrote: At Tue, 10 Feb 2015 18:35:58 +0800, Liu Yuan wrote: On Tue, Feb 10, 2015 at 06:56:33PM +0900, Teruaki Ishizaki wrote: (2015/02/10 17:58), Liu Yuan wrote: On Tue, Feb 10, 2015 at 05:22:02PM +0900, Teruaki Ishizaki wrote: (2015/02/10 12:10), Liu Yuan wrote: On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote: Previously, qemu block driver of sheepdog used hard-coded VDI object size. This patch enables users to handle block_size_shift value for calculating VDI object size. When you start qemu, you don't need to specify additional command option. But when you create the VDI which doesn't have default object size with qemu-img command, you specify block_size_shift option. If you want to create a VDI of 8MB(1 23) object size, you need to specify following command option. # qemu-img create -o block_size_shift=23 sheepdog:test1 100M In addition, when you don't specify qemu-img command option, a default value of sheepdog cluster is used for creating VDI. # qemu-img create sheepdog:test2 100M Signed-off-by: Teruaki Ishizaki ishizaki.teru...@lab.ntt.co.jp --- V4: - Limit a read/write buffer size for creating a preallocated VDI. - Replace a parse function for the block_size_shift option. - Fix an error message. V3: - Delete the needless operation of buffer. - Delete the needless operations of request header. for SD_OP_GET_CLUSTER_DEFAULT. - Fix coding style problems. V2: - Fix coding style problem (white space). - Add members, store_policy and block_size_shift to struct SheepdogVdiReq. - Initialize request header to use block_size_shift specified by user. --- block/sheepdog.c | 138 ++--- include/block/block_int.h |1 + 2 files changed, 119 insertions(+), 20 deletions(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index be3176f..a43b947 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -37,6 +37,7 @@ #define SD_OP_READ_VDIS 0x15 #define SD_OP_FLUSH_VDI 0x16 #define SD_OP_DEL_VDI0x17 +#define SD_OP_GET_CLUSTER_DEFAULT 0x18 #define SD_FLAG_CMD_WRITE0x01 #define SD_FLAG_CMD_COW 0x02 @@ -167,7 +168,8 @@ typedef struct SheepdogVdiReq { uint32_t base_vdi_id; uint8_t copies; uint8_t copy_policy; -uint8_t reserved[2]; +uint8_t store_policy; +uint8_t block_size_shift; uint32_t snapid; uint32_t type; uint32_t pad[2]; @@ -186,6 +188,21 @@ typedef struct SheepdogVdiRsp { uint32_t pad[5]; } SheepdogVdiRsp; +typedef struct SheepdogClusterRsp { +uint8_t proto_ver; +uint8_t opcode; +uint16_t flags; +uint32_t epoch; +uint32_t id; +uint32_t data_length; +uint32_t result; +uint8_t nr_copies; +uint8_t copy_policy; +uint8_t block_size_shift; +uint8_t __pad1; +uint32_t __pad2[6]; +} SheepdogClusterRsp; + typedef struct SheepdogInode { char name[SD_MAX_VDI_LEN]; char tag[SD_MAX_VDI_TAG_LEN]; @@ -1544,6 +1561,7 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, hdr.vdi_size = s-inode.vdi_size; hdr.copy_policy = s-inode.copy_policy; hdr.copies = s-inode.nr_copies; +hdr.block_size_shift = s-inode.block_size_shift; ret = do_req(fd, s-aio_context, (SheepdogReq *)hdr, buf, wlen, rlen); @@ -1569,9 +1587,12 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, static int sd_prealloc(const char *filename, Error **errp) { BlockDriverState *bs = NULL; +BDRVSheepdogState *base = NULL; +unsigned long buf_size; uint32_t idx, max_idx; +uint32_t object_size; int64_t vdi_size; -void *buf = g_malloc0(SD_DATA_OBJ_SIZE); +void *buf = NULL; int ret; ret = bdrv_open(bs, filename, NULL, NULL, BDRV_O_RDWR | 
BDRV_O_PROTOCOL, @@ -1585,18 +1606,24 @@ static int sd_prealloc(const char *filename, Error **errp) ret = vdi_size; goto out; } -max_idx = DIV_ROUND_UP(vdi_size, SD_DATA_OBJ_SIZE); + +base = bs-opaque; +object_size = (UINT32_C(1) base-inode.block_size_shift); +buf_size = MIN(object_size, SD_DATA_OBJ_SIZE); +buf = g_malloc0(buf_size); + +max_idx = DIV_ROUND_UP(vdi_size, buf_size); for (idx = 0; idx max_idx; idx++) { /* * The created image can be a cloned image, so we need to read * a data from the source image. */ -ret = bdrv_pread(bs, idx
Re: [Qemu-devel] [sheepdog] [PATCH v4] sheepdog: selectable object size support
On Thu, Feb 12, 2015 at 04:28:01PM +0900, Hitoshi Mitake wrote: At Thu, 12 Feb 2015 15:00:49 +0800, Liu Yuan wrote: On Thu, Feb 12, 2015 at 03:19:21PM +0900, Hitoshi Mitake wrote: At Tue, 10 Feb 2015 18:35:58 +0800, Liu Yuan wrote: On Tue, Feb 10, 2015 at 06:56:33PM +0900, Teruaki Ishizaki wrote: (2015/02/10 17:58), Liu Yuan wrote: On Tue, Feb 10, 2015 at 05:22:02PM +0900, Teruaki Ishizaki wrote: (2015/02/10 12:10), Liu Yuan wrote: On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote: Previously, qemu block driver of sheepdog used hard-coded VDI object size. This patch enables users to handle block_size_shift value for calculating VDI object size. When you start qemu, you don't need to specify additional command option. But when you create the VDI which doesn't have default object size with qemu-img command, you specify block_size_shift option. If you want to create a VDI of 8MB(1 23) object size, you need to specify following command option. # qemu-img create -o block_size_shift=23 sheepdog:test1 100M In addition, when you don't specify qemu-img command option, a default value of sheepdog cluster is used for creating VDI. # qemu-img create sheepdog:test2 100M Signed-off-by: Teruaki Ishizaki ishizaki.teru...@lab.ntt.co.jp --- V4: - Limit a read/write buffer size for creating a preallocated VDI. - Replace a parse function for the block_size_shift option. - Fix an error message. V3: - Delete the needless operation of buffer. - Delete the needless operations of request header. for SD_OP_GET_CLUSTER_DEFAULT. - Fix coding style problems. V2: - Fix coding style problem (white space). - Add members, store_policy and block_size_shift to struct SheepdogVdiReq. - Initialize request header to use block_size_shift specified by user. --- block/sheepdog.c | 138 ++--- include/block/block_int.h |1 + 2 files changed, 119 insertions(+), 20 deletions(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index be3176f..a43b947 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -37,6 +37,7 @@ #define SD_OP_READ_VDIS 0x15 #define SD_OP_FLUSH_VDI 0x16 #define SD_OP_DEL_VDI0x17 +#define SD_OP_GET_CLUSTER_DEFAULT 0x18 #define SD_FLAG_CMD_WRITE0x01 #define SD_FLAG_CMD_COW 0x02 @@ -167,7 +168,8 @@ typedef struct SheepdogVdiReq { uint32_t base_vdi_id; uint8_t copies; uint8_t copy_policy; -uint8_t reserved[2]; +uint8_t store_policy; +uint8_t block_size_shift; uint32_t snapid; uint32_t type; uint32_t pad[2]; @@ -186,6 +188,21 @@ typedef struct SheepdogVdiRsp { uint32_t pad[5]; } SheepdogVdiRsp; +typedef struct SheepdogClusterRsp { +uint8_t proto_ver; +uint8_t opcode; +uint16_t flags; +uint32_t epoch; +uint32_t id; +uint32_t data_length; +uint32_t result; +uint8_t nr_copies; +uint8_t copy_policy; +uint8_t block_size_shift; +uint8_t __pad1; +uint32_t __pad2[6]; +} SheepdogClusterRsp; + typedef struct SheepdogInode { char name[SD_MAX_VDI_LEN]; char tag[SD_MAX_VDI_TAG_LEN]; @@ -1544,6 +1561,7 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, hdr.vdi_size = s-inode.vdi_size; hdr.copy_policy = s-inode.copy_policy; hdr.copies = s-inode.nr_copies; +hdr.block_size_shift = s-inode.block_size_shift; ret = do_req(fd, s-aio_context, (SheepdogReq *)hdr, buf, wlen, rlen); @@ -1569,9 +1587,12 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, static int sd_prealloc(const char *filename, Error **errp) { BlockDriverState *bs = NULL; +BDRVSheepdogState *base = NULL; +unsigned long buf_size; uint32_t idx, max_idx; +uint32_t object_size; int64_t vdi_size; -void 
*buf = g_malloc0(SD_DATA_OBJ_SIZE); +void *buf = NULL; int ret; ret = bdrv_open(bs, filename, NULL, NULL, BDRV_O_RDWR | BDRV_O_PROTOCOL, @@ -1585,18 +1606,24 @@ static int sd_prealloc(const char *filename, Error **errp) ret = vdi_size; goto out; } -max_idx = DIV_ROUND_UP(vdi_size, SD_DATA_OBJ_SIZE); + +base = bs-opaque; +object_size = (UINT32_C(1) base-inode.block_size_shift
Re: [Qemu-devel] [PATCH v4] sheepdog: selectable object size support
On Tue, Feb 10, 2015 at 05:22:02PM +0900, Teruaki Ishizaki wrote: (2015/02/10 12:10), Liu Yuan wrote: On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote: Previously, qemu block driver of sheepdog used hard-coded VDI object size. This patch enables users to handle block_size_shift value for calculating VDI object size. When you start qemu, you don't need to specify additional command option. But when you create the VDI which doesn't have default object size with qemu-img command, you specify block_size_shift option. If you want to create a VDI of 8MB(1 23) object size, you need to specify following command option. # qemu-img create -o block_size_shift=23 sheepdog:test1 100M In addition, when you don't specify qemu-img command option, a default value of sheepdog cluster is used for creating VDI. # qemu-img create sheepdog:test2 100M Signed-off-by: Teruaki Ishizaki ishizaki.teru...@lab.ntt.co.jp --- V4: - Limit a read/write buffer size for creating a preallocated VDI. - Replace a parse function for the block_size_shift option. - Fix an error message. V3: - Delete the needless operation of buffer. - Delete the needless operations of request header. for SD_OP_GET_CLUSTER_DEFAULT. - Fix coding style problems. V2: - Fix coding style problem (white space). - Add members, store_policy and block_size_shift to struct SheepdogVdiReq. - Initialize request header to use block_size_shift specified by user. --- block/sheepdog.c | 138 ++--- include/block/block_int.h |1 + 2 files changed, 119 insertions(+), 20 deletions(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index be3176f..a43b947 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -37,6 +37,7 @@ #define SD_OP_READ_VDIS 0x15 #define SD_OP_FLUSH_VDI 0x16 #define SD_OP_DEL_VDI0x17 +#define SD_OP_GET_CLUSTER_DEFAULT 0x18 #define SD_FLAG_CMD_WRITE0x01 #define SD_FLAG_CMD_COW 0x02 @@ -167,7 +168,8 @@ typedef struct SheepdogVdiReq { uint32_t base_vdi_id; uint8_t copies; uint8_t copy_policy; -uint8_t reserved[2]; +uint8_t store_policy; +uint8_t block_size_shift; uint32_t snapid; uint32_t type; uint32_t pad[2]; @@ -186,6 +188,21 @@ typedef struct SheepdogVdiRsp { uint32_t pad[5]; } SheepdogVdiRsp; +typedef struct SheepdogClusterRsp { +uint8_t proto_ver; +uint8_t opcode; +uint16_t flags; +uint32_t epoch; +uint32_t id; +uint32_t data_length; +uint32_t result; +uint8_t nr_copies; +uint8_t copy_policy; +uint8_t block_size_shift; +uint8_t __pad1; +uint32_t __pad2[6]; +} SheepdogClusterRsp; + typedef struct SheepdogInode { char name[SD_MAX_VDI_LEN]; char tag[SD_MAX_VDI_TAG_LEN]; @@ -1544,6 +1561,7 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, hdr.vdi_size = s-inode.vdi_size; hdr.copy_policy = s-inode.copy_policy; hdr.copies = s-inode.nr_copies; +hdr.block_size_shift = s-inode.block_size_shift; ret = do_req(fd, s-aio_context, (SheepdogReq *)hdr, buf, wlen, rlen); @@ -1569,9 +1587,12 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, static int sd_prealloc(const char *filename, Error **errp) { BlockDriverState *bs = NULL; +BDRVSheepdogState *base = NULL; +unsigned long buf_size; uint32_t idx, max_idx; +uint32_t object_size; int64_t vdi_size; -void *buf = g_malloc0(SD_DATA_OBJ_SIZE); +void *buf = NULL; int ret; ret = bdrv_open(bs, filename, NULL, NULL, BDRV_O_RDWR | BDRV_O_PROTOCOL, @@ -1585,18 +1606,24 @@ static int sd_prealloc(const char *filename, Error **errp) ret = vdi_size; goto out; } -max_idx = DIV_ROUND_UP(vdi_size, SD_DATA_OBJ_SIZE); + +base = bs-opaque; +object_size = 
(UINT32_C(1) base-inode.block_size_shift); +buf_size = MIN(object_size, SD_DATA_OBJ_SIZE); +buf = g_malloc0(buf_size); + +max_idx = DIV_ROUND_UP(vdi_size, buf_size); for (idx = 0; idx max_idx; idx++) { /* * The created image can be a cloned image, so we need to read * a data from the source image. */ -ret = bdrv_pread(bs, idx * SD_DATA_OBJ_SIZE, buf, SD_DATA_OBJ_SIZE); +ret = bdrv_pread(bs, idx * buf_size, buf, buf_size); if (ret 0) { goto out; } -ret = bdrv_pwrite(bs, idx * SD_DATA_OBJ_SIZE, buf, SD_DATA_OBJ_SIZE); +ret = bdrv_pwrite(bs, idx * buf_size, buf, buf_size); if (ret 0) { goto out; } @@ -1669,6 +1696,21 @@ static int parse_redundancy(BDRVSheepdogState *s, const char *opt) return 0; } +static int
Re: [Qemu-devel] [PATCH v4] sheepdog: selectable object size support
On Tue, Feb 10, 2015 at 06:56:33PM +0900, Teruaki Ishizaki wrote: (2015/02/10 17:58), Liu Yuan wrote: On Tue, Feb 10, 2015 at 05:22:02PM +0900, Teruaki Ishizaki wrote: (2015/02/10 12:10), Liu Yuan wrote: On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote: Previously, qemu block driver of sheepdog used hard-coded VDI object size. This patch enables users to handle block_size_shift value for calculating VDI object size. When you start qemu, you don't need to specify additional command option. But when you create the VDI which doesn't have default object size with qemu-img command, you specify block_size_shift option. If you want to create a VDI of 8MB(1 23) object size, you need to specify following command option. # qemu-img create -o block_size_shift=23 sheepdog:test1 100M In addition, when you don't specify qemu-img command option, a default value of sheepdog cluster is used for creating VDI. # qemu-img create sheepdog:test2 100M Signed-off-by: Teruaki Ishizaki ishizaki.teru...@lab.ntt.co.jp --- V4: - Limit a read/write buffer size for creating a preallocated VDI. - Replace a parse function for the block_size_shift option. - Fix an error message. V3: - Delete the needless operation of buffer. - Delete the needless operations of request header. for SD_OP_GET_CLUSTER_DEFAULT. - Fix coding style problems. V2: - Fix coding style problem (white space). - Add members, store_policy and block_size_shift to struct SheepdogVdiReq. - Initialize request header to use block_size_shift specified by user. --- block/sheepdog.c | 138 ++--- include/block/block_int.h |1 + 2 files changed, 119 insertions(+), 20 deletions(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index be3176f..a43b947 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -37,6 +37,7 @@ #define SD_OP_READ_VDIS 0x15 #define SD_OP_FLUSH_VDI 0x16 #define SD_OP_DEL_VDI0x17 +#define SD_OP_GET_CLUSTER_DEFAULT 0x18 #define SD_FLAG_CMD_WRITE0x01 #define SD_FLAG_CMD_COW 0x02 @@ -167,7 +168,8 @@ typedef struct SheepdogVdiReq { uint32_t base_vdi_id; uint8_t copies; uint8_t copy_policy; -uint8_t reserved[2]; +uint8_t store_policy; +uint8_t block_size_shift; uint32_t snapid; uint32_t type; uint32_t pad[2]; @@ -186,6 +188,21 @@ typedef struct SheepdogVdiRsp { uint32_t pad[5]; } SheepdogVdiRsp; +typedef struct SheepdogClusterRsp { +uint8_t proto_ver; +uint8_t opcode; +uint16_t flags; +uint32_t epoch; +uint32_t id; +uint32_t data_length; +uint32_t result; +uint8_t nr_copies; +uint8_t copy_policy; +uint8_t block_size_shift; +uint8_t __pad1; +uint32_t __pad2[6]; +} SheepdogClusterRsp; + typedef struct SheepdogInode { char name[SD_MAX_VDI_LEN]; char tag[SD_MAX_VDI_TAG_LEN]; @@ -1544,6 +1561,7 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, hdr.vdi_size = s-inode.vdi_size; hdr.copy_policy = s-inode.copy_policy; hdr.copies = s-inode.nr_copies; +hdr.block_size_shift = s-inode.block_size_shift; ret = do_req(fd, s-aio_context, (SheepdogReq *)hdr, buf, wlen, rlen); @@ -1569,9 +1587,12 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, static int sd_prealloc(const char *filename, Error **errp) { BlockDriverState *bs = NULL; +BDRVSheepdogState *base = NULL; +unsigned long buf_size; uint32_t idx, max_idx; +uint32_t object_size; int64_t vdi_size; -void *buf = g_malloc0(SD_DATA_OBJ_SIZE); +void *buf = NULL; int ret; ret = bdrv_open(bs, filename, NULL, NULL, BDRV_O_RDWR | BDRV_O_PROTOCOL, @@ -1585,18 +1606,24 @@ static int sd_prealloc(const char *filename, Error **errp) ret = vdi_size; 
goto out; } -max_idx = DIV_ROUND_UP(vdi_size, SD_DATA_OBJ_SIZE); + +base = bs-opaque; +object_size = (UINT32_C(1) base-inode.block_size_shift); +buf_size = MIN(object_size, SD_DATA_OBJ_SIZE); +buf = g_malloc0(buf_size); + +max_idx = DIV_ROUND_UP(vdi_size, buf_size); for (idx = 0; idx max_idx; idx++) { /* * The created image can be a cloned image, so we need to read * a data from the source image. */ -ret = bdrv_pread(bs, idx * SD_DATA_OBJ_SIZE, buf, SD_DATA_OBJ_SIZE); +ret = bdrv_pread(bs, idx * buf_size, buf, buf_size); if (ret 0) { goto out; } -ret = bdrv_pwrite(bs, idx * SD_DATA_OBJ_SIZE, buf, SD_DATA_OBJ_SIZE); +ret = bdrv_pwrite(bs, idx * buf_size, buf, buf_size); if (ret 0) { goto out; } @@ -1669,6 +1696,21 @@ static int
Re: [Qemu-devel] [PATCH v4] sheepdog: selectable object size support
On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote: Previously, qemu block driver of sheepdog used hard-coded VDI object size. This patch enables users to handle block_size_shift value for calculating VDI object size. When you start qemu, you don't need to specify additional command option. But when you create the VDI which doesn't have default object size with qemu-img command, you specify block_size_shift option. If you want to create a VDI of 8MB(1 23) object size, you need to specify following command option. # qemu-img create -o block_size_shift=23 sheepdog:test1 100M In addition, when you don't specify qemu-img command option, a default value of sheepdog cluster is used for creating VDI. # qemu-img create sheepdog:test2 100M Signed-off-by: Teruaki Ishizaki ishizaki.teru...@lab.ntt.co.jp --- V4: - Limit a read/write buffer size for creating a preallocated VDI. - Replace a parse function for the block_size_shift option. - Fix an error message. V3: - Delete the needless operation of buffer. - Delete the needless operations of request header. for SD_OP_GET_CLUSTER_DEFAULT. - Fix coding style problems. V2: - Fix coding style problem (white space). - Add members, store_policy and block_size_shift to struct SheepdogVdiReq. - Initialize request header to use block_size_shift specified by user. --- block/sheepdog.c | 138 ++--- include/block/block_int.h |1 + 2 files changed, 119 insertions(+), 20 deletions(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index be3176f..a43b947 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -37,6 +37,7 @@ #define SD_OP_READ_VDIS 0x15 #define SD_OP_FLUSH_VDI 0x16 #define SD_OP_DEL_VDI0x17 +#define SD_OP_GET_CLUSTER_DEFAULT 0x18 This might not be necessary. For old qemu or the qemu-img without setting option, the block_size_shift will be 0. If we make 0 to represent 4MB object, then we don't need to get the default cluster object size. We migth even get rid of the idea of cluster default size. The downsize is that, if we want to create a vdi with different size not the default 4MB, we have to write it every time for qemu-img or dog. If we choose to keep the idea of cluster default size, I think we'd also try to avoid call this request from QEMU to make backward compatibility easier. In this scenario, 0 might be used to ask new sheep to decide to use cluster default size. Both old qemu and new QEMU will send 0 to sheep and both old and new sheep can handle 0 though it has different meanings. Table for this bit as 0: Qe: qemu SD: Sheep daemon CDS: Cluster Default Size Ign: Ignored by the sheep daemon Qe/sd newold new CDSIgn old CDSNULL I think this approach is acceptable. The difference to your patch is that we don't send SD_OP_GET_CLUSTER_DEFAULT to sheep daemon and SD_OP_GET_CLUSTER_DEFAULT can be removed. Thanks Yuan
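The compatibility matrix above, restated: with the field left at 0, a new sheep substitutes the cluster default size whether QEMU is old or new, while an old sheep either ignores the field (new QEMU) or never sees it at all (old QEMU) and sticks to 4MB objects. A minimal sketch of the driver-side encoding; choose_block_size_shift() is an illustrative helper name, not code from the patch:

    #include <stdbool.h>
    #include <stdint.h>

    /* 0 means "not specified": a new sheep fills in the cluster default,
     * an old sheep ignores the field and keeps 4MB objects. */
    static uint8_t choose_block_size_shift(bool user_specified, uint8_t user_value)
    {
        return user_specified ? user_value : 0;
    }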
Re: [Qemu-devel] [PATCH v4] sheepdog: selectable object size support
On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote: Previously, qemu block driver of sheepdog used hard-coded VDI object size. This patch enables users to handle block_size_shift value for calculating VDI object size. When you start qemu, you don't need to specify additional command option. But when you create the VDI which doesn't have default object size with qemu-img command, you specify block_size_shift option. If you want to create a VDI of 8MB(1 23) object size, you need to specify following command option. # qemu-img create -o block_size_shift=23 sheepdog:test1 100M In addition, when you don't specify qemu-img command option, a default value of sheepdog cluster is used for creating VDI. # qemu-img create sheepdog:test2 100M Signed-off-by: Teruaki Ishizaki ishizaki.teru...@lab.ntt.co.jp --- V4: - Limit a read/write buffer size for creating a preallocated VDI. - Replace a parse function for the block_size_shift option. - Fix an error message. V3: - Delete the needless operation of buffer. - Delete the needless operations of request header. for SD_OP_GET_CLUSTER_DEFAULT. - Fix coding style problems. V2: - Fix coding style problem (white space). - Add members, store_policy and block_size_shift to struct SheepdogVdiReq. - Initialize request header to use block_size_shift specified by user. --- block/sheepdog.c | 138 ++--- include/block/block_int.h |1 + 2 files changed, 119 insertions(+), 20 deletions(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index be3176f..a43b947 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -37,6 +37,7 @@ #define SD_OP_READ_VDIS 0x15 #define SD_OP_FLUSH_VDI 0x16 #define SD_OP_DEL_VDI0x17 +#define SD_OP_GET_CLUSTER_DEFAULT 0x18 #define SD_FLAG_CMD_WRITE0x01 #define SD_FLAG_CMD_COW 0x02 @@ -167,7 +168,8 @@ typedef struct SheepdogVdiReq { uint32_t base_vdi_id; uint8_t copies; uint8_t copy_policy; -uint8_t reserved[2]; +uint8_t store_policy; +uint8_t block_size_shift; uint32_t snapid; uint32_t type; uint32_t pad[2]; @@ -186,6 +188,21 @@ typedef struct SheepdogVdiRsp { uint32_t pad[5]; } SheepdogVdiRsp; +typedef struct SheepdogClusterRsp { +uint8_t proto_ver; +uint8_t opcode; +uint16_t flags; +uint32_t epoch; +uint32_t id; +uint32_t data_length; +uint32_t result; +uint8_t nr_copies; +uint8_t copy_policy; +uint8_t block_size_shift; +uint8_t __pad1; +uint32_t __pad2[6]; +} SheepdogClusterRsp; + typedef struct SheepdogInode { char name[SD_MAX_VDI_LEN]; char tag[SD_MAX_VDI_TAG_LEN]; @@ -1544,6 +1561,7 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, hdr.vdi_size = s-inode.vdi_size; hdr.copy_policy = s-inode.copy_policy; hdr.copies = s-inode.nr_copies; +hdr.block_size_shift = s-inode.block_size_shift; ret = do_req(fd, s-aio_context, (SheepdogReq *)hdr, buf, wlen, rlen); @@ -1569,9 +1587,12 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, static int sd_prealloc(const char *filename, Error **errp) { BlockDriverState *bs = NULL; +BDRVSheepdogState *base = NULL; +unsigned long buf_size; uint32_t idx, max_idx; +uint32_t object_size; int64_t vdi_size; -void *buf = g_malloc0(SD_DATA_OBJ_SIZE); +void *buf = NULL; int ret; ret = bdrv_open(bs, filename, NULL, NULL, BDRV_O_RDWR | BDRV_O_PROTOCOL, @@ -1585,18 +1606,24 @@ static int sd_prealloc(const char *filename, Error **errp) ret = vdi_size; goto out; } -max_idx = DIV_ROUND_UP(vdi_size, SD_DATA_OBJ_SIZE); + +base = bs-opaque; +object_size = (UINT32_C(1) base-inode.block_size_shift); +buf_size = MIN(object_size, SD_DATA_OBJ_SIZE); +buf = 
g_malloc0(buf_size); + +max_idx = DIV_ROUND_UP(vdi_size, buf_size); for (idx = 0; idx max_idx; idx++) { /* * The created image can be a cloned image, so we need to read * a data from the source image. */ -ret = bdrv_pread(bs, idx * SD_DATA_OBJ_SIZE, buf, SD_DATA_OBJ_SIZE); +ret = bdrv_pread(bs, idx * buf_size, buf, buf_size); if (ret 0) { goto out; } -ret = bdrv_pwrite(bs, idx * SD_DATA_OBJ_SIZE, buf, SD_DATA_OBJ_SIZE); +ret = bdrv_pwrite(bs, idx * buf_size, buf, buf_size); if (ret 0) { goto out; } @@ -1669,6 +1696,21 @@ static int parse_redundancy(BDRVSheepdogState *s, const char *opt) return 0; } +static int parse_block_size_shift(BDRVSheepdogState *s, QemuOpts *opt) +{ +struct SheepdogInode *inode = s-inode; +inode-block_size_shift = +
Re: [Qemu-devel] [PATCH v4] sheepdog: selectable object size support
On Tue, Feb 10, 2015 at 11:10:51AM +0800, Liu Yuan wrote: On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote: Previously, qemu block driver of sheepdog used hard-coded VDI object size. This patch enables users to handle block_size_shift value for calculating VDI object size. When you start qemu, you don't need to specify additional command option. But when you create the VDI which doesn't have default object size with qemu-img command, you specify block_size_shift option. If you want to create a VDI of 8MB(1 23) object size, you need to specify following command option. # qemu-img create -o block_size_shift=23 sheepdog:test1 100M In addition, when you don't specify qemu-img command option, a default value of sheepdog cluster is used for creating VDI. # qemu-img create sheepdog:test2 100M Signed-off-by: Teruaki Ishizaki ishizaki.teru...@lab.ntt.co.jp --- V4: - Limit a read/write buffer size for creating a preallocated VDI. - Replace a parse function for the block_size_shift option. - Fix an error message. V3: - Delete the needless operation of buffer. - Delete the needless operations of request header. for SD_OP_GET_CLUSTER_DEFAULT. - Fix coding style problems. V2: - Fix coding style problem (white space). - Add members, store_policy and block_size_shift to struct SheepdogVdiReq. - Initialize request header to use block_size_shift specified by user. --- block/sheepdog.c | 138 ++--- include/block/block_int.h |1 + 2 files changed, 119 insertions(+), 20 deletions(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index be3176f..a43b947 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -37,6 +37,7 @@ #define SD_OP_READ_VDIS 0x15 #define SD_OP_FLUSH_VDI 0x16 #define SD_OP_DEL_VDI0x17 +#define SD_OP_GET_CLUSTER_DEFAULT 0x18 #define SD_FLAG_CMD_WRITE0x01 #define SD_FLAG_CMD_COW 0x02 @@ -167,7 +168,8 @@ typedef struct SheepdogVdiReq { uint32_t base_vdi_id; uint8_t copies; uint8_t copy_policy; -uint8_t reserved[2]; +uint8_t store_policy; +uint8_t block_size_shift; uint32_t snapid; uint32_t type; uint32_t pad[2]; @@ -186,6 +188,21 @@ typedef struct SheepdogVdiRsp { uint32_t pad[5]; } SheepdogVdiRsp; +typedef struct SheepdogClusterRsp { +uint8_t proto_ver; +uint8_t opcode; +uint16_t flags; +uint32_t epoch; +uint32_t id; +uint32_t data_length; +uint32_t result; +uint8_t nr_copies; +uint8_t copy_policy; +uint8_t block_size_shift; +uint8_t __pad1; +uint32_t __pad2[6]; +} SheepdogClusterRsp; + typedef struct SheepdogInode { char name[SD_MAX_VDI_LEN]; char tag[SD_MAX_VDI_TAG_LEN]; @@ -1544,6 +1561,7 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, hdr.vdi_size = s-inode.vdi_size; hdr.copy_policy = s-inode.copy_policy; hdr.copies = s-inode.nr_copies; +hdr.block_size_shift = s-inode.block_size_shift; ret = do_req(fd, s-aio_context, (SheepdogReq *)hdr, buf, wlen, rlen); @@ -1569,9 +1587,12 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, static int sd_prealloc(const char *filename, Error **errp) { BlockDriverState *bs = NULL; +BDRVSheepdogState *base = NULL; +unsigned long buf_size; uint32_t idx, max_idx; +uint32_t object_size; int64_t vdi_size; -void *buf = g_malloc0(SD_DATA_OBJ_SIZE); +void *buf = NULL; int ret; ret = bdrv_open(bs, filename, NULL, NULL, BDRV_O_RDWR | BDRV_O_PROTOCOL, @@ -1585,18 +1606,24 @@ static int sd_prealloc(const char *filename, Error **errp) ret = vdi_size; goto out; } -max_idx = DIV_ROUND_UP(vdi_size, SD_DATA_OBJ_SIZE); + +base = bs-opaque; +object_size = (UINT32_C(1) base-inode.block_size_shift); 
+buf_size = MIN(object_size, SD_DATA_OBJ_SIZE); +buf = g_malloc0(buf_size); + +max_idx = DIV_ROUND_UP(vdi_size, buf_size); for (idx = 0; idx max_idx; idx++) { /* * The created image can be a cloned image, so we need to read * a data from the source image. */ -ret = bdrv_pread(bs, idx * SD_DATA_OBJ_SIZE, buf, SD_DATA_OBJ_SIZE); +ret = bdrv_pread(bs, idx * buf_size, buf, buf_size); if (ret 0) { goto out; } -ret = bdrv_pwrite(bs, idx * SD_DATA_OBJ_SIZE, buf, SD_DATA_OBJ_SIZE); +ret = bdrv_pwrite(bs, idx * buf_size, buf, buf_size); if (ret 0) { goto out; } @@ -1669,6 +1696,21 @@ static int parse_redundancy(BDRVSheepdogState *s, const char *opt
Re: [Qemu-devel] [PATCH v4] sheepdog: selectable object size support
On Fri, Feb 06, 2015 at 04:57:55PM +0900, Teruaki Ishizaki wrote:
(2015/02/06 11:18), Liu Yuan wrote:
On Wed, Feb 04, 2015 at 01:54:19PM +0900, Teruaki Ishizaki wrote:
(2015/02/02 15:52), Liu Yuan wrote:
On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote:

Previously, the qemu block driver of sheepdog used a hard-coded VDI object size. This patch enables users to set the block_size_shift value used for calculating the VDI object size. When you start qemu, you don't need to specify an additional command option. But when you create a VDI that doesn't have the default object size with the qemu-img command, you specify the block_size_shift option. If you want to create a VDI of 8MB (1 << 23) object size, you need to specify the following command option.

 # qemu-img create -o block_size_shift=23 sheepdog:test1 100M

In addition, when you don't specify the qemu-img command option, the default value of the sheepdog cluster is used for creating the VDI.

 # qemu-img create sheepdog:test2 100M

Is it possible to make this option more user friendly? such as

 $ qemu-img create -o object_size=8M sheepdog:test 1G

At first, I thought that object_size was user friendly. But Sheepdog already has the block_size_shift value in the inode layout, which means the same thing as object_size. 'object_size' doesn't always fit exactly into 'block_size_shift'; on the other hand, 'block_size_shift' always fits exactly into 'object_size'. I think that the existing layout shouldn't be changed easily, and it seems difficult for users to specify an object_size value that fits exactly into 'block_size_shift'.

I don't think we need to change the layout. block_size_shift is what QEMU talks to sheep, and QEMU options are what users talk to QEMU. We can convert the user-friendly object size into block_size_shift internally in the driver before sending it to the sheep daemon, no? For example, if users specify 12MB for the object size, block_size_shift doesn't come out as an integer. In this case, we should abort, print an error message and notify the users of the acceptable range, as the erasure coding option does.

I suppose that normally an administrator formats the sheepdog cluster specifying block_size_shift, and users usually run the qemu-img command without a block_size_shift option.

block_size_shift is too developer-centered and has a direct relation to the underlying algorithm. If in the future, e.g., we change the underlying algorithm for how we represent block size, block_size_shift might not even exist. So using object_size would be more generic and won't have this kind of problem. Secondly, it is not hard to parse object_size into block_size_shift, so I'd suggest we try our best to make it user friendly.

Thanks Yuan
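A minimal sketch of the conversion Yuan suggests, assuming the option parser has already turned a string such as "8M" into a byte count; object_size_to_shift() is an illustrative name, not the function the driver ended up with. It rejects sizes like 12MB that are not a power of two, matching the abort-and-report behaviour described above:

    #include <errno.h>
    #include <stdint.h>

    static int object_size_to_shift(uint64_t object_size, uint8_t *shift)
    {
        uint8_t s = 0;

        if (object_size == 0 || (object_size & (object_size - 1))) {
            return -EINVAL;                 /* e.g. 12MB: not a power of two */
        }
        while ((UINT64_C(1) << s) < object_size) {
            s++;
        }
        *shift = s;                         /* 8MB maps to 23 */
        return 0;
    }

With this in place, "qemu-img create -o object_size=8M ..." would translate to block_size_shift=23 before the request is sent to the sheep daemon.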
Re: [Qemu-devel] [PATCH v4] sheepdog: selectable object size support
On Wed, Feb 04, 2015 at 01:54:19PM +0900, Teruaki Ishizaki wrote: (2015/02/02 15:52), Liu Yuan wrote: On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote: Previously, qemu block driver of sheepdog used hard-coded VDI object size. This patch enables users to handle block_size_shift value for calculating VDI object size. When you start qemu, you don't need to specify additional command option. But when you create the VDI which doesn't have default object size with qemu-img command, you specify block_size_shift option. If you want to create a VDI of 8MB(1 23) object size, you need to specify following command option. # qemu-img create -o block_size_shift=23 sheepdog:test1 100M Is it possible to make this option more user friendly? such as $ qemu-img create -o object_size=8M sheepdog:test 1G At first, I thought that the object_size was user friendly. But, Sheepdog has already the value of block_size_shift in the inode layout that means like object_size. 'object_size' doesn't always fit right in 'block_size_shift'. On the other hands, 'block_size_shift' always fit right in 'object_size'. I think that existing layout shouldn't be changed easily and it seems that it is difficult for users to specify the object_size value that fit right in 'block_size_shift'. I don't think we need to change the layout. block_size_shift is what QEMU talks to sheep, and QEMU options is what users talks to QEMU. We can convert the user friendly object size into block_size_shift internally in the driver before sending it tosheep daemon, no? Thanks Yuan
Re: [Qemu-devel] [PATCH v4] sheepdog: selectable object size support
On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote:

Previously, the qemu block driver of sheepdog used a hard-coded VDI object size. This patch enables users to set the block_size_shift value used for calculating the VDI object size. When you start qemu, you don't need to specify an additional command option. But when you create a VDI that doesn't have the default object size with the qemu-img command, you specify the block_size_shift option. If you want to create a VDI of 8MB (1 << 23) object size, you need to specify the following command option.

 # qemu-img create -o block_size_shift=23 sheepdog:test1 100M

In addition, when you don't specify the qemu-img command option, the default value of the sheepdog cluster is used for creating the VDI.

 # qemu-img create sheepdog:test2 100M

Is it possible to make this option more user friendly? such as

 $ qemu-img create -o object_size=8M sheepdog:test 1G

Thanks Yuan
Re: [Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver
On Sun, Sep 07, 2014 at 05:12:31PM +0200, Benoît Canet wrote: The Monday 01 Sep 2014 à 15:43:06 (+0800), Liu Yuan wrote : This patch set mainly add mainly two logics to implement device recover - notify qourum driver of the broken states from the child driver(s) - dirty track and sync the device after it is repaired Thus quorum allow VMs to continue while some child devices are broken and when the child devices are repaired and return back, we sync dirty bits during downtime to keep data consistency. The recovery logic is based on the driver state bitmap and will sync the dirty bits with a timeslice window in a coroutine in this prtimive implementation. Simple graph about 2 children with threshold=1 and read-pattern=fifo: (similary to DRBD) + denote device sync iteration - IO on a single device = IO on two devices sync complete, release dirty bitmap ^ | -++== | | | v | device repaired and begin to sync v device broken, create a dirty bitmap This sync logic can take care of nested broken problem, that devices are broken while in sync. We just start a sync process after the devices are repaired again and switch the devices from broken to sound only when the sync completes. For read-pattern=quorum mode, it enjoys the recovery logic without any problem. Todo: - use aio interface to sync data (multiple transfer in one go) - dynamic slice window to control sync bandwidth more smoothly - add auto-reconnection mechanism to other protocol (if not support yet) - add tests Cc: Eric Blake ebl...@redhat.com Cc: Benoit Canet ben...@irqsave.net Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Liu Yuan (8): block/quorum: initialize qcrs.aiocb for read block: add driver operation callbacks block/sheepdog: propagate disconnect/reconnect events to upper driver block/quorum: add quorum_aio_release() helper quorum: fix quorum_aio_cancel() block/quorum: add broken state to BlockDriverState block: add two helpers quorum: add basic device recovery logic block.c | 17 +++ block/quorum.c| 324 +- block/sheepdog.c | 9 ++ include/block/block.h | 9 ++ include/block/block_int.h | 6 + trace-events | 5 + 6 files changed, 336 insertions(+), 34 deletions(-) -- 1.9.1 Hi liu, Had you noticed that your series conflict with one of Fam's series in the quorum cancel function fix patch ? Not yet, thanks for reminding. Could you find an arrangement with Fam so the two patches don't collide anymore ? Do you intend to respin your series ? Yes, I'll rebase the v2 later before more possible reviews. Thanks Yuan
Re: [Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver
On Wed, Sep 03, 2014 at 12:19:14AM +0200, Benoît Canet wrote: The Monday 01 Sep 2014 à 15:43:06 (+0800), Liu Yuan wrote : Liu, Do you think this could work with qcow2 file backed by NFS servers ? It depends on which client we use. If we use Linux NFS client by 'posix file' protocol, I guess we need to hack raw-posix.c and insert reconnect/disconnect callbacks. I'm exploring the possibility of QEMU's nfs client block/nfs.c too. Either way, this should work with all the protocols with proper hacks. Thanks Yuan
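Patches 2/8 and 3/8 of this series are about exactly that plumbing: the protocol driver (sheepdog, NFS, raw-posix, ...) signals connection state changes and the parent driver reacts. A rough, standalone sketch of the shape such callbacks could take; the type and function names are illustrative, not the series' actual API:

    typedef struct BlockDriverState BlockDriverState;   /* opaque for this sketch */

    typedef struct BlockOpsCallbacks {
        void (*disconnect)(BlockDriverState *child, void *opaque);
        void (*reconnect)(BlockDriverState *child, void *opaque);
        void *opaque;                       /* e.g. the quorum parent's state */
    } BlockOpsCallbacks;

    /* called by the protocol driver when its transport drops or comes back */
    static void notify_disconnect(BlockOpsCallbacks *ops, BlockDriverState *child)
    {
        if (ops && ops->disconnect) {
            ops->disconnect(child, ops->opaque);
        }
    }

    static void notify_reconnect(BlockOpsCallbacks *ops, BlockDriverState *child)
    {
        if (ops && ops->reconnect) {
            ops->reconnect(child, ops->opaque);
        }
    }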
[Qemu-devel] [PATCH 1/8] block/quorum: initialize qcrs.aiocb for read
This is required by quorum_aio_cancel().

Cc: Eric Blake <ebl...@redhat.com>
Cc: Benoit Canet <ben...@irqsave.net>
Cc: Kevin Wolf <kw...@redhat.com>
Cc: Stefan Hajnoczi <stefa...@redhat.com>
Signed-off-by: Liu Yuan <namei.u...@gmail.com>
---
 block/quorum.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index af48e8c..5866bca 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -653,8 +653,10 @@ static BlockDriverAIOCB *read_quorum_children(QuorumAIOCB *acb)
     }
 
     for (i = 0; i < s->num_children; i++) {
-        bdrv_aio_readv(s->bs[i], acb->sector_num, &acb->qcrs[i].qiov,
-                       acb->nb_sectors, quorum_aio_cb, &acb->qcrs[i]);
+        acb->qcrs[i].aiocb = bdrv_aio_readv(s->bs[i], acb->sector_num,
+                                            &acb->qcrs[i].qiov,
+                                            acb->nb_sectors, quorum_aio_cb,
+                                            &acb->qcrs[i]);
     }
 
     return &acb->common;
@@ -663,15 +665,14 @@ static BlockDriverAIOCB *read_quorum_children(QuorumAIOCB *acb)
 static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb)
 {
     BDRVQuorumState *s = acb->common.bs->opaque;
-
-    acb->qcrs[acb->child_iter].buf = qemu_blockalign(s->bs[acb->child_iter],
-                                                     acb->qiov->size);
-    qemu_iovec_init(&acb->qcrs[acb->child_iter].qiov, acb->qiov->niov);
-    qemu_iovec_clone(&acb->qcrs[acb->child_iter].qiov, acb->qiov,
-                     acb->qcrs[acb->child_iter].buf);
-    bdrv_aio_readv(s->bs[acb->child_iter], acb->sector_num,
-                   &acb->qcrs[acb->child_iter].qiov, acb->nb_sectors,
-                   quorum_aio_cb, &acb->qcrs[acb->child_iter]);
+    int i = acb->child_iter;
+
+    acb->qcrs[i].buf = qemu_blockalign(s->bs[i], acb->qiov->size);
+    qemu_iovec_init(&acb->qcrs[i].qiov, acb->qiov->niov);
+    qemu_iovec_clone(&acb->qcrs[i].qiov, acb->qiov, acb->qcrs[i].buf);
+    acb->qcrs[i].aiocb = bdrv_aio_readv(s->bs[i], acb->sector_num,
+                                        &acb->qcrs[i].qiov, acb->nb_sectors,
+                                        quorum_aio_cb, &acb->qcrs[i]);
 
     return &acb->common;
 }
-- 
1.9.1
[Qemu-devel] [PATCH 4/8] block/quorum: add quorum_aio_release() helper
Cc: Eric Blake <ebl...@redhat.com>
Cc: Benoit Canet <ben...@irqsave.net>
Cc: Kevin Wolf <kw...@redhat.com>
Cc: Stefan Hajnoczi <stefa...@redhat.com>
Signed-off-by: Liu Yuan <namei.u...@gmail.com>
---
 block/quorum.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index 5866bca..9e056d6 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -130,6 +130,12 @@ struct QuorumAIOCB {
 
 static bool quorum_vote(QuorumAIOCB *acb);
 
+static void quorum_aio_release(QuorumAIOCB *acb)
+{
+    g_free(acb->qcrs);
+    qemu_aio_release(acb);
+}
+
 static void quorum_aio_cancel(BlockDriverAIOCB *blockacb)
 {
     QuorumAIOCB *acb = container_of(blockacb, QuorumAIOCB, common);
@@ -141,8 +147,7 @@ static void quorum_aio_cancel(BlockDriverAIOCB *blockacb)
         bdrv_aio_cancel(acb->qcrs[i].aiocb);
     }
 
-    g_free(acb->qcrs);
-    qemu_aio_release(acb);
+    quorum_aio_release(acb);
 }
 
 static AIOCBInfo quorum_aiocb_info = {
@@ -168,8 +173,7 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
         }
     }
 
-    g_free(acb->qcrs);
-    qemu_aio_release(acb);
+    quorum_aio_release(acb);
 }
 
 static bool quorum_sha256_compare(QuorumVoteValue *a, QuorumVoteValue *b)
-- 
1.9.1
[Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver
This patch set adds two main pieces of logic to implement device recovery:
- notify the quorum driver of the broken state reported by the child driver(s)
- track dirty data and sync the device after it is repaired

Thus quorum allows VMs to continue while some child devices are broken, and when the child devices are repaired and come back, we sync the dirty bits during the downtime to keep data consistent.

The recovery logic is based on the driver state bitmap and syncs the dirty bits within a timeslice window in a coroutine in this primitive implementation.

Simple picture for 2 children with threshold=1 and read-pattern=fifo (similar to DRBD): '+' denotes a device sync iteration, '-' denotes IO on a single device and '=' denotes IO on two devices. A dirty bitmap is created when a device breaks, the sync starts once the device is repaired, and the dirty bitmap is released when the sync completes.

This sync logic can take care of the nested-failure problem, where devices break again while a sync is in progress. We simply start a new sync process after the devices are repaired and switch a device from broken to sound only when its sync completes.

The read-pattern=quorum mode enjoys the recovery logic without any problem.

Todo:
- use the aio interface to sync data (multiple transfers in one go)
- dynamic slice window to control sync bandwidth more smoothly
- add an auto-reconnection mechanism to other protocols (if not supported yet)
- add tests

Cc: Eric Blake <ebl...@redhat.com>
Cc: Benoit Canet <ben...@irqsave.net>
Cc: Kevin Wolf <kw...@redhat.com>
Cc: Stefan Hajnoczi <stefa...@redhat.com>

Liu Yuan (8):
  block/quorum: initialize qcrs.aiocb for read
  block: add driver operation callbacks
  block/sheepdog: propagate disconnect/reconnect events to upper driver
  block/quorum: add quorum_aio_release() helper
  quorum: fix quorum_aio_cancel()
  block/quorum: add broken state to BlockDriverState
  block: add two helpers
  quorum: add basic device recovery logic

 block.c                   |  17 +++
 block/quorum.c            | 324 ++++++++++++++++++++++++++++++++++++++++-----
 block/sheepdog.c          |   9 ++
 include/block/block.h     |   9 ++
 include/block/block_int.h |   6 +
 trace-events              |   5 +
 6 files changed, 336 insertions(+), 34 deletions(-)

-- 
1.9.1
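To make the lifecycle above concrete, here is a self-contained toy model of the broken/dirty/resync flow: plain arrays stand in for BdrvDirtyBitmap, a direct function call stands in for each driver callback, and the resync runs in one pass instead of timesliced coroutine iterations. All names are made up for the illustration.

#include <stdbool.h>
#include <stdio.h>

#define NB_CHUNKS 16

/* Toy model: chunks written while a child is broken are marked dirty,
 * and only those chunks are copied again once the child comes back. */
static bool dirty[NB_CHUNKS];
static bool child_broken;

static void on_disconnect(void)      /* cf. driver_disconnect */
{
    child_broken = true;
}

static void on_write(int chunk)
{
    if (child_broken) {
        dirty[chunk] = true;
    }
}

static void on_reconnect(void)       /* cf. driver_reconnect */
{
    child_broken = false;
    /* in the series this runs in a coroutine, a few chunks per
     * timeslice, instead of one tight loop */
    for (int i = 0; i < NB_CHUNKS; i++) {
        if (dirty[i]) {
            printf("resync chunk %d from the sound child\n", i);
            dirty[i] = false;
        }
    }
}

int main(void)
{
    on_disconnect();
    on_write(3);
    on_write(7);
    on_reconnect();   /* resyncs only chunks 3 and 7 */
    return 0;
}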
[Qemu-devel] [PATCH 5/8] quorum: fix quorum_aio_cancel()
For a fifo read pattern we only have one running aio (other cases may issue fewer requests than num_children in the future), so we need to check whether qcrs[i].aiocb is NULL before calling bdrv_aio_cancel() to avoid a segfault.

Cc: Eric Blake <ebl...@redhat.com>
Cc: Benoit Canet <ben...@irqsave.net>
Cc: Kevin Wolf <kw...@redhat.com>
Cc: Stefan Hajnoczi <stefa...@redhat.com>
Signed-off-by: Liu Yuan <namei.u...@gmail.com>
---
 block/quorum.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/quorum.c b/block/quorum.c
index 9e056d6..b9eeda3 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -144,7 +144,9 @@ static void quorum_aio_cancel(BlockDriverAIOCB *blockacb)
 
     /* cancel all callbacks */
     for (i = 0; i < s->num_children; i++) {
-        bdrv_aio_cancel(acb->qcrs[i].aiocb);
+        if (acb->qcrs[i].aiocb) {
+            bdrv_aio_cancel(acb->qcrs[i].aiocb);
+        }
     }
 
     quorum_aio_release(acb);
-- 
1.9.1
[Qemu-devel] [PATCH 2/8] block: add driver operation callbacks
Driver operations are defined as callbacks passed from block upper drivers to lower drivers and are supposed to be called by lower drivers. Requests handling(queuing, submitting, etc.) are done in protocol tier in the block layer and connection states are also maintained down there. Driver operations are supposed to notify the upper tier (such as quorum) of the states changes. For now only two operation are added: driver_disconnect: called when connection is off driver_reconnect: called when connection is on after disconnection Which are used to notify upper tier of the connection state. Cc: Eric Blake ebl...@redhat.com Cc: Benoit Canet ben...@irqsave.net Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block.c | 7 +++ include/block/block.h | 7 +++ include/block/block_int.h | 3 +++ 3 files changed, 17 insertions(+) diff --git a/block.c b/block.c index c12b8de..22eb3e4 100644 --- a/block.c +++ b/block.c @@ -2152,6 +2152,13 @@ void bdrv_set_dev_ops(BlockDriverState *bs, const BlockDevOps *ops, bs-dev_opaque = opaque; } +void bdrv_set_drv_ops(BlockDriverState *bs, const BlockDrvOps *ops, + void *opaque) +{ +bs-drv_ops = ops; +bs-drv_opaque = opaque; +} + static void bdrv_dev_change_media_cb(BlockDriverState *bs, bool load) { if (bs-dev_ops bs-dev_ops-change_media_cb) { diff --git a/include/block/block.h b/include/block/block.h index 8f4ad16..a61eaf0 100644 --- a/include/block/block.h +++ b/include/block/block.h @@ -82,6 +82,11 @@ typedef struct BlockDevOps { void (*resize_cb)(void *opaque); } BlockDevOps; +typedef struct BlockDrvOps { +void (*driver_reconnect)(BlockDriverState *bs); +void (*driver_disconnect)(BlockDriverState *bs); +} BlockDrvOps; + typedef enum { BDRV_REQ_COPY_ON_READ = 0x1, BDRV_REQ_ZERO_WRITE = 0x2, @@ -234,6 +239,8 @@ void bdrv_detach_dev(BlockDriverState *bs, void *dev); void *bdrv_get_attached_dev(BlockDriverState *bs); void bdrv_set_dev_ops(BlockDriverState *bs, const BlockDevOps *ops, void *opaque); +void bdrv_set_drv_ops(BlockDriverState *bs, const BlockDrvOps *ops, + void *opaque); void bdrv_dev_eject_request(BlockDriverState *bs, bool force); bool bdrv_dev_has_removable_media(BlockDriverState *bs); bool bdrv_dev_is_tray_open(BlockDriverState *bs); diff --git a/include/block/block_int.h b/include/block/block_int.h index 2334895..9fdec7f 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -319,6 +319,9 @@ struct BlockDriverState { const BlockDevOps *dev_ops; void *dev_opaque; +const BlockDrvOps *drv_ops; +void *drv_opaque; + AioContext *aio_context; /* event loop used for fd handlers, timers, etc */ char filename[1024]; -- 1.9.1
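Later patches in the series wire these hooks into quorum; the fragment below is only a sketch of the usage the API implies. The callback names and the exact registration point are assumptions, not code taken from patch 6/8 or 8/8.

/* Sketch only: how an upper driver such as quorum could consume the new
 * BlockDrvOps hooks introduced by this patch. */
static void quorum_child_disconnect(BlockDriverState *child_bs)
{
    /* mark the corresponding child broken so new requests skip it */
}

static void quorum_child_reconnect(BlockDriverState *child_bs)
{
    /* start the dirty-bitmap resync, clear the broken flag when done */
}

static const BlockDrvOps quorum_drv_ops = {
    .driver_disconnect = quorum_child_disconnect,
    .driver_reconnect  = quorum_child_reconnect,
};

static void quorum_register_child_ops(BDRVQuorumState *s)
{
    int i;

    for (i = 0; i < s->num_children; i++) {
        /* the protocol driver (sheepdog in patch 3/8) invokes these
         * hooks when its connection drops or is re-established */
        bdrv_set_drv_ops(s->bs[i], &quorum_drv_ops, s);
    }
}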
[Qemu-devel] [PATCH 8/8] quorum: add basic device recovery logic
For some configuration, quorum allow VMs to continue while some child devices are broken and when the child devices are repaired and return back, we need to sync dirty bits during downtime to keep data consistency. The recovery logic is based on the driver state bitmap and will sync the dirty bits with a timeslice window in a coroutine in this prtimive implementation. Simple graph about 2 children with threshold=1 and read-pattern=fifo: + denote device sync iteration - IO on a single device = IO on two devices sync complete, release dirty bitmap ^ | -++== | | | v | device repaired and begin to sync v device broken, create a dirty bitmap This sync logic can take care of nested broken problem, that devices are broken while in sync. We just start a sync process after the devices are repaired again and switch the devices from broken to sound only when the sync completes. For read-pattern=quorum mode, it enjoys the recovery logic without any problem. Cc: Eric Blake ebl...@redhat.com Cc: Benoit Canet ben...@irqsave.net Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block/quorum.c | 189 - trace-events | 5 ++ 2 files changed, 191 insertions(+), 3 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index 7b07e35..ffd7c2d 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -23,6 +23,7 @@ #include qapi/qmp/qlist.h #include qapi/qmp/qstring.h #include qapi-event.h +#include trace.h #define HASH_LENGTH 32 @@ -31,6 +32,10 @@ #define QUORUM_OPT_REWRITErewrite-corrupted #define QUORUM_OPT_READ_PATTERN read-pattern +#define SLICE_TIME 1ULL /* 100 ms */ +#define CHUNK_SIZE (1 20) /* 1M */ +#define SECTORS_PER_CHUNK (CHUNK_SIZE BDRV_SECTOR_BITS) + /* This union holds a vote hash value */ typedef union QuorumVoteValue { char h[HASH_LENGTH]; /* SHA-256 hash */ @@ -64,6 +69,7 @@ typedef struct QuorumVotes { /* the following structure holds the state of one quorum instance */ typedef struct BDRVQuorumState { +BlockDriverState *mybs;/* Quorum block driver base state */ BlockDriverState **bs; /* children BlockDriverStates */ int num_children; /* children count */ int threshold; /* if less than threshold children reads gave the @@ -82,6 +88,10 @@ typedef struct BDRVQuorumState { */ QuorumReadPattern read_pattern; +BdrvDirtyBitmap *dirty_bitmap; +uint8_t *sync_buf; +HBitmapIter hbi; +int64_t sector_num; } BDRVQuorumState; typedef struct QuorumAIOCB QuorumAIOCB; @@ -290,12 +300,11 @@ static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source) } } -static int next_fifo_child(QuorumAIOCB *acb) +static int get_good_child(BDRVQuorumState *s, int iter) { -BDRVQuorumState *s = acb-common.bs-opaque; int i; -for (i = acb-child_iter; i s-num_children; i++) { +for (i = iter; i s-num_children; i++) { if (!s-bs[i]-broken) { break; } @@ -306,6 +315,13 @@ static int next_fifo_child(QuorumAIOCB *acb) return i; } +static int next_fifo_child(QuorumAIOCB *acb) +{ +BDRVQuorumState *s = acb-common.bs-opaque; + +return get_good_child(s, acb-child_iter); +} + static void quorum_aio_cb(void *opaque, int ret) { QuorumChildRequest *sacb = opaque; @@ -951,6 +967,171 @@ static int parse_read_pattern(const char *opt) return -EINVAL; } +static void sync_prepare(BDRVQuorumState *qs, int64_t *num) +{ +int64_t nb, total = bdrv_nb_sectors(qs-mybs); + +qs-sector_num = hbitmap_iter_next(qs-hbi); +/* Wrap around if previous bits get dirty while syncing */ +if (qs-sector_num 0) { +bdrv_dirty_iter_init(qs-mybs, qs-dirty_bitmap, qs-hbi); +qs-sector_num = 
hbitmap_iter_next(qs-hbi); +assert(qs-sector_num = 0); +} + +for (nb = 1; nb SECTORS_PER_CHUNK qs-sector_num + nb total; + nb++) { +if (!bdrv_get_dirty(qs-mybs, qs-dirty_bitmap, qs-sector_num + nb)) { +break; +} +} +*num = nb; +} + +static void sync_finish(BDRVQuorumState *qs, int64_t num) +{ +int64_t i; + +for (i = 0; i num; i++) { +/* We need to advance the iterator manually */ +hbitmap_iter_next(qs-hbi); +} +bdrv_reset_dirty(qs-mybs, qs-sector_num, num); +} + +static int quorum_sync_iteration(BDRVQuorumState *qs, BlockDriverState *target) +{ +BlockDriverState *source; +QEMUIOVector qiov; +int ret, good; +int64_t nb_sectors; +struct iovec iov; +const char *sname, *tname = bdrv_get_filename(target); + +good = get_good_child
[Qemu-devel] [PATCH 3/8] block/sheepdog: propagate disconnect/reconnect events to upper driver
This is the reference usage of how we propagate connection state to the upper tier.

Cc: Eric Blake <ebl...@redhat.com>
Cc: Benoit Canet <ben...@irqsave.net>
Cc: Kevin Wolf <kw...@redhat.com>
Cc: Stefan Hajnoczi <stefa...@redhat.com>
Signed-off-by: Liu Yuan <namei.u...@gmail.com>
---
 block/sheepdog.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index 53c24d6..9c0fc49 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -714,6 +714,11 @@ static coroutine_fn void reconnect_to_sdog(void *opaque)
 {
     BDRVSheepdogState *s = opaque;
     AIOReq *aio_req, *next;
+    BlockDriverState *bs = s->bs;
+
+    if (bs->drv_ops && bs->drv_ops->driver_disconnect) {
+        bs->drv_ops->driver_disconnect(bs);
+    }
 
     aio_set_fd_handler(s->aio_context, s->fd, NULL, NULL, NULL);
     close(s->fd);
@@ -756,6 +761,10 @@ static coroutine_fn void reconnect_to_sdog(void *opaque)
         QLIST_INSERT_HEAD(&s->inflight_aio_head, aio_req, aio_siblings);
         resend_aioreq(s, aio_req);
     }
+
+    if (bs->drv_ops && bs->drv_ops->driver_reconnect) {
+        bs->drv_ops->driver_reconnect(bs);
+    }
 }
 
 /*
-- 
1.9.1
[Qemu-devel] [PATCH 6/8] block/quorum: add broken state to BlockDriverState
This allow VM continues to process even if some devices are broken meanwhile with proper configuration. We mark the device broken when the protocol tier notify back some broken state(s) of the device, such as diconnection via driver operations. We could also reset the device as sound when the protocol tier is repaired. Origianlly .threshold controls how we should decide the success of read/write and return the failure only if the success count of read/write is less than .threshold specified by users. But it doesn't record the states of underlying states and will impact performance a bit in some cases. For example, we have 3 children and .threshold is set 2. If one of the devices broken, we should still return success and continue to run VM. But for every IO operations, we will blindly send the requests to the broken device. To store broken state into driver state we can save requests to borken devices and resend the requests to the repaired ones by setting broken as false. This is especially useful for network based protocol such as sheepdog, which has a auto-reconnection mechanism and will never report EIO if the connection is broken but just store the requests to its local queue and wait for resending. Without broken state, quorum request will not come back until the connection is re-established. So we have to skip the broken deivces to allow VM to continue running with networked backed child (glusterfs, nfs, sheepdog, etc). With the combination of read-pattern and threshold, we can easily mimic the DRVD behavior with following configuration: read-pattern=fifo,threshold=1 will two children. Cc: Eric Blake ebl...@redhat.com Cc: Benoit Canet ben...@irqsave.net Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block/quorum.c| 102 ++ include/block/block_int.h | 3 ++ 2 files changed, 87 insertions(+), 18 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index b9eeda3..7b07e35 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -120,6 +120,7 @@ struct QuorumAIOCB { int rewrite_count; /* number of replica to rewrite: count down to * zero once writes are fired */ +int issued_count; /* actual readwrite issued count */ QuorumVotes votes; @@ -170,8 +171,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb) if (acb-is_read) { /* on the quorum case acb-child_iter == s-num_children - 1 */ for (i = 0; i = acb-child_iter; i++) { -qemu_vfree(acb-qcrs[i].buf); -qemu_iovec_destroy(acb-qcrs[i].qiov); +if (acb-qcrs[i].buf) { +qemu_vfree(acb-qcrs[i].buf); +qemu_iovec_destroy(acb-qcrs[i].qiov); +} } } @@ -207,6 +210,7 @@ static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s, acb-count = 0; acb-success_count = 0; acb-rewrite_count = 0; +acb-issued_count = 0; acb-votes.compare = quorum_sha256_compare; QLIST_INIT(acb-votes.vote_list); acb-is_read = false; @@ -286,6 +290,22 @@ static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source) } } +static int next_fifo_child(QuorumAIOCB *acb) +{ +BDRVQuorumState *s = acb-common.bs-opaque; +int i; + +for (i = acb-child_iter; i s-num_children; i++) { +if (!s-bs[i]-broken) { +break; +} +} +if (i == s-num_children) { +return -1; +} +return i; +} + static void quorum_aio_cb(void *opaque, int ret) { QuorumChildRequest *sacb = opaque; @@ -293,11 +313,18 @@ static void quorum_aio_cb(void *opaque, int ret) BDRVQuorumState *s = acb-common.bs-opaque; bool rewrite = false; +if (ret 0) { +s-bs[acb-child_iter]-broken = true; +} + if (acb-is_read s-read_pattern == QUORUM_READ_PATTERN_FIFO) { /* 
We try to read next child in FIFO order if we fail to read */ -if (ret 0 ++acb-child_iter s-num_children) { -read_fifo_child(acb); -return; +if (ret 0) { +acb-child_iter = next_fifo_child(acb); +if (acb-child_iter 0) { +read_fifo_child(acb); +return; +} } if (ret == 0) { @@ -315,9 +342,9 @@ static void quorum_aio_cb(void *opaque, int ret) } else { quorum_report_bad(acb, sacb-aiocb-bs-node_name, ret); } -assert(acb-count = s-num_children); -assert(acb-success_count = s-num_children); -if (acb-count s-num_children) { +assert(acb-count = acb-issued_count); +assert(acb-success_count = acb-issued_count); +if (acb-count acb-issued_count) { return; } @@ -647,22 +674,46 @@ free_exit: return rewrite; } +static bool has_enough_children(BDRVQuorumState *s) +{ +int good = 0, i
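The rule described in the commit message, that requests keep flowing as long as at least vote-threshold children are still sound, reduces to counting the unbroken children. The following sketch only illustrates that check; it is not the series' exact code.

/* Illustration only: a request should proceed only if at least
 * 'threshold' children are still usable, otherwise it must fail. */
static bool enough_sound_children(BDRVQuorumState *s)
{
    int sound = 0, i;

    for (i = 0; i < s->num_children; i++) {
        if (!s->bs[i]->broken) {
            sound++;
        }
    }
    return sound >= s->threshold;
}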
[Qemu-devel] [PATCH 7/8] block: add two helpers
These helpers are needed by later quorum sync device logic.

Cc: Eric Blake <ebl...@redhat.com>
Cc: Benoit Canet <ben...@irqsave.net>
Cc: Kevin Wolf <kw...@redhat.com>
Cc: Stefan Hajnoczi <stefa...@redhat.com>
Signed-off-by: Liu Yuan <namei.u...@gmail.com>
---
 block.c               | 10 ++++++++++
 include/block/block.h |  2 ++
 2 files changed, 12 insertions(+)

diff --git a/block.c b/block.c
index 22eb3e4..2e2f1d9 100644
--- a/block.c
+++ b/block.c
@@ -2145,6 +2145,16 @@ void *bdrv_get_attached_dev(BlockDriverState *bs)
     return bs->dev;
 }
 
+BlockDriverState *bdrv_get_file(BlockDriverState *bs)
+{
+    return bs->file;
+}
+
+const char *bdrv_get_filename(BlockDriverState *bs)
+{
+    return bs->filename;
+}
+
 void bdrv_set_dev_ops(BlockDriverState *bs, const BlockDevOps *ops,
                       void *opaque)
 {
diff --git a/include/block/block.h b/include/block/block.h
index a61eaf0..1e116cc 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -237,6 +237,8 @@ int bdrv_attach_dev(BlockDriverState *bs, void *dev);
 void bdrv_attach_dev_nofail(BlockDriverState *bs, void *dev);
 void bdrv_detach_dev(BlockDriverState *bs, void *dev);
 void *bdrv_get_attached_dev(BlockDriverState *bs);
+BlockDriverState *bdrv_get_file(BlockDriverState *bs);
+const char *bdrv_get_filename(BlockDriverState *bs);
 void bdrv_set_dev_ops(BlockDriverState *bs, const BlockDevOps *ops,
                       void *opaque);
 void bdrv_set_drv_ops(BlockDriverState *bs, const BlockDrvOps *ops,
-- 
1.9.1
Re: [Qemu-devel] [PATCH 2/8] block: add driver operation callbacks
On Mon, Sep 01, 2014 at 10:28:54AM +0200, Benoît Canet wrote: The Monday 01 Sep 2014 à 15:43:08 (+0800), Liu Yuan wrote : Driver operations are defined as callbacks passed from block upper drivers to lower drivers and are supposed to be called by lower drivers. Requests handling(queuing, submitting, etc.) are done in protocol tier in the block layer and connection states are also maintained down there. Driver operations are supposed to notify the upper tier (such as quorum) of the states changes. For now only two operation are added: driver_disconnect: called when connection is off driver_reconnect: called when connection is on after disconnection Which are used to notify upper tier of the connection state. Cc: Eric Blake ebl...@redhat.com Cc: Benoit Canet ben...@irqsave.net Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block.c | 7 +++ include/block/block.h | 7 +++ include/block/block_int.h | 3 +++ 3 files changed, 17 insertions(+) diff --git a/block.c b/block.c index c12b8de..22eb3e4 100644 --- a/block.c +++ b/block.c @@ -2152,6 +2152,13 @@ void bdrv_set_dev_ops(BlockDriverState *bs, const BlockDevOps *ops, bs-dev_opaque = opaque; } +void bdrv_set_drv_ops(BlockDriverState *bs, const BlockDrvOps *ops, + void *opaque) +{ +bs-drv_ops = ops; +bs-drv_opaque = opaque; We need to be very carefull of the mix between these fields and the infamous bdrv_swap function. Also I don't know if driver operations is the right name since the BlockDriver structure's callback could also be named driver operations. BlockDrvierState has a device operation for callbacks from devices. So I choose driver operation. So any sugguestion for better name? Thanks Yuan
Re: [Qemu-devel] [PATCH 3/8] block/sheepdog: propagate disconnect/reconnect events to upper driver
On Mon, Sep 01, 2014 at 10:31:47AM +0200, Benoît Canet wrote: The Monday 01 Sep 2014 à 15:43:09 (+0800), Liu Yuan wrote :

This is the reference usage of how we propagate connection state to the upper tier.

Cc: Eric Blake <ebl...@redhat.com>
Cc: Benoit Canet <ben...@irqsave.net>
Cc: Kevin Wolf <kw...@redhat.com>
Cc: Stefan Hajnoczi <stefa...@redhat.com>
Signed-off-by: Liu Yuan <namei.u...@gmail.com>
---
 block/sheepdog.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index 53c24d6..9c0fc49 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -714,6 +714,11 @@ static coroutine_fn void reconnect_to_sdog(void *opaque)
 {
     BDRVSheepdogState *s = opaque;
     AIOReq *aio_req, *next;
+    BlockDriverState *bs = s->bs;
+
+    if (bs->drv_ops && bs->drv_ops->driver_disconnect) {
+        bs->drv_ops->driver_disconnect(bs);
+    }

Since this sequence will be strictly the same for all the implementations, could we create a bdrv_signal_disconnect(bs) in the block layer to make this code generic?

I'm not sure the other protocol drivers can have the same auto-reconnection logic. Probably for simplicity we keep it as is in this patch. Later, when we have more flesh on the implementation of the other protocols, we can make a better decision.

Thanks
Yuan
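The helper Benoît suggests would simply fold the NULL checks into block.c. A sketch of what it might look like follows; this is hypothetical and was not part of the series as posted.

/* Hypothetical wrappers around the BlockDrvOps hooks from patch 2/8;
 * they would live in block.c next to bdrv_set_drv_ops(). */
void bdrv_signal_disconnect(BlockDriverState *bs)
{
    if (bs->drv_ops && bs->drv_ops->driver_disconnect) {
        bs->drv_ops->driver_disconnect(bs);
    }
}

void bdrv_signal_reconnect(BlockDriverState *bs)
{
    if (bs->drv_ops && bs->drv_ops->driver_reconnect) {
        bs->drv_ops->driver_reconnect(bs);
    }
}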
Re: [Qemu-devel] [PATCH 5/8] quorum: fix quorum_aio_cancel()
On Mon, Sep 01, 2014 at 10:35:27AM +0200, Benoît Canet wrote: The Monday 01 Sep 2014 à 15:43:11 (+0800), Liu Yuan wrote :

For a fifo read pattern we only have one running aio (other cases may issue fewer requests than num_children in the future)

I have trouble understanding this part of the commit message, could you try to clarify it?

Until this patch we had two cases: read a single child or read all children. But a later patch allows VMs to continue if some devices are down, so the discrete numbers 1 and N become a range [1, N]; that is, the number of in-flight requests can be anywhere from 1 to N.

Thanks
Yuan
Re: [Qemu-devel] [PATCH 6/8] block/quorum: add broken state to BlockDriverState
On Mon, Sep 01, 2014 at 10:57:43AM +0200, Benoît Canet wrote: The Monday 01 Sep 2014 à 15:43:12 (+0800), Liu Yuan wrote : This allow VM continues to process even if some devices are broken meanwhile with proper configuration. We mark the device broken when the protocol tier notify back some broken state(s) of the device, such as diconnection via driver operations. We could also reset the device as sound when the protocol tier is repaired. Origianlly .threshold controls how we should decide the success of read/write and return the failure only if the success count of read/write is less than .threshold specified by users. But it doesn't record the states of underlying states and will impact performance a bit in some cases. For example, we have 3 children and .threshold is set 2. If one of the devices broken, we should still return success and continue to run VM. But for every IO operations, we will blindly send the requests to the broken device. To store broken state into driver state we can save requests to borken devices and resend the requests to the repaired ones by setting broken as false. This is especially useful for network based protocol such as sheepdog, which has a auto-reconnection mechanism and will never report EIO if the connection is broken but just store the requests to its local queue and wait for resending. Without broken state, quorum request will not come back until the connection is re-established. So we have to skip the broken deivces to allow VM to continue running with networked backed child (glusterfs, nfs, sheepdog, etc). With the combination of read-pattern and threshold, we can easily mimic the DRVD behavior with following configuration: read-pattern=fifo,threshold=1 will two children. Cc: Eric Blake ebl...@redhat.com Cc: Benoit Canet ben...@irqsave.net Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block/quorum.c| 102 ++ include/block/block_int.h | 3 ++ 2 files changed, 87 insertions(+), 18 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index b9eeda3..7b07e35 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -120,6 +120,7 @@ struct QuorumAIOCB { int rewrite_count; /* number of replica to rewrite: count down to * zero once writes are fired */ +int issued_count; /* actual readwrite issued count */ QuorumVotes votes; @@ -170,8 +171,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb) if (acb-is_read) { /* on the quorum case acb-child_iter == s-num_children - 1 */ for (i = 0; i = acb-child_iter; i++) { -qemu_vfree(acb-qcrs[i].buf); -qemu_iovec_destroy(acb-qcrs[i].qiov); +if (acb-qcrs[i].buf) { +qemu_vfree(acb-qcrs[i].buf); +qemu_iovec_destroy(acb-qcrs[i].qiov); +} } } @@ -207,6 +210,7 @@ static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s, acb-count = 0; acb-success_count = 0; acb-rewrite_count = 0; +acb-issued_count = 0; acb-votes.compare = quorum_sha256_compare; QLIST_INIT(acb-votes.vote_list); acb-is_read = false; @@ -286,6 +290,22 @@ static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source) } } +static int next_fifo_child(QuorumAIOCB *acb) +{ +BDRVQuorumState *s = acb-common.bs-opaque; +int i; + +for (i = acb-child_iter; i s-num_children; i++) { +if (!s-bs[i]-broken) { +break; +} +} +if (i == s-num_children) { +return -1; +} +return i; +} + static void quorum_aio_cb(void *opaque, int ret) { QuorumChildRequest *sacb = opaque; @@ -293,11 +313,18 @@ static void quorum_aio_cb(void *opaque, int ret) BDRVQuorumState *s = acb-common.bs-opaque; bool rewrite = false; 
+if (ret 0) { +s-bs[acb-child_iter]-broken = true; +} child_iter is fifo mode stuff. Do we need to write if (s-read_pattern == QUORUM_READ_PATTERN_FIFO ret 0) here ? Probably not. child_iter denotes which bs the QuorumChildRequest belongs to. + if (acb-is_read s-read_pattern == QUORUM_READ_PATTERN_FIFO) { /* We try to read next child in FIFO order if we fail to read */ -if (ret 0 ++acb-child_iter s-num_children) { -read_fifo_child(acb); -return; +if (ret 0) { +acb-child_iter = next_fifo_child(acb); You don't seem to increment child_iter anymore. No, I use a function next_fifo_child() to get the next_fifo_child because right now we have to skip
Re: [Qemu-devel] [PATCH 2/8] block: add driver operation callbacks
On Mon, Sep 01, 2014 at 11:28:22AM +0200, Benoît Canet wrote: The Monday 01 Sep 2014 à 17:19:19 (+0800), Liu Yuan wrote : On Mon, Sep 01, 2014 at 10:28:54AM +0200, Benoît Canet wrote: The Monday 01 Sep 2014 à 15:43:08 (+0800), Liu Yuan wrote : Driver operations are defined as callbacks passed from block upper drivers to lower drivers and are supposed to be called by lower drivers. Requests handling(queuing, submitting, etc.) are done in protocol tier in the block layer and connection states are also maintained down there. Driver operations are supposed to notify the upper tier (such as quorum) of the states changes. For now only two operation are added: driver_disconnect: called when connection is off driver_reconnect: called when connection is on after disconnection Which are used to notify upper tier of the connection state. Cc: Eric Blake ebl...@redhat.com Cc: Benoit Canet ben...@irqsave.net Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block.c | 7 +++ include/block/block.h | 7 +++ include/block/block_int.h | 3 +++ 3 files changed, 17 insertions(+) diff --git a/block.c b/block.c index c12b8de..22eb3e4 100644 --- a/block.c +++ b/block.c @@ -2152,6 +2152,13 @@ void bdrv_set_dev_ops(BlockDriverState *bs, const BlockDevOps *ops, bs-dev_opaque = opaque; } +void bdrv_set_drv_ops(BlockDriverState *bs, const BlockDrvOps *ops, + void *opaque) +{ +bs-drv_ops = ops; +bs-drv_opaque = opaque; We need to be very carefull of the mix between these fields and the infamous bdrv_swap function. Also I don't know if driver operations is the right name since the BlockDriver structure's callback could also be named driver operations. BlockDrvierState has a device operation for callbacks from devices. So I choose driver operation. So any sugguestion for better name? From what I see in this series the job of these callbacks is to send a message or a signal to the upper BDS. Also the name must reflect it goes from the child to the parent. child_signals ? child_messages ? As far as I see, put child in the name will make it too quorum centric. Since it is operation in BlockDriverState, we need to keep it as generic as we could. These operations [here we mean disconnect() and reconnect(), but probably later some other will add more opeartions] are passed from 'upper driver' to protocol driver [in the code we call the protocol as 'file' driver, a narrow name too], so I chose to name it as 'driver operation'. If we can rename 'file' as protocol, include file, nfs, sheepdog, etc, such as bdrv_create_file - bdrv_create_protocol bs.file - bs.protocol then the 'driver operation' here would sound better. Thanks Yuan
Re: [Qemu-devel] [PATCH 8/8] quorum: add basic device recovery logic
On Mon, Sep 01, 2014 at 11:37:20AM +0200, Benoît Canet wrote: The Monday 01 Sep 2014 à 15:43:14 (+0800), Liu Yuan wrote : For some configuration, quorum allow VMs to continue while some child devices are broken and when the child devices are repaired and return back, we need to sync dirty bits during downtime to keep data consistency. The recovery logic is based on the driver state bitmap and will sync the dirty bits with a timeslice window in a coroutine in this prtimive implementation. Simple graph about 2 children with threshold=1 and read-pattern=fifo: + denote device sync iteration - IO on a single device = IO on two devices sync complete, release dirty bitmap ^ | -++== | | | v | device repaired and begin to sync v device broken, create a dirty bitmap This sync logic can take care of nested broken problem, that devices are broken while in sync. We just start a sync process after the devices are repaired again and switch the devices from broken to sound only when the sync completes. For read-pattern=quorum mode, it enjoys the recovery logic without any problem. Cc: Eric Blake ebl...@redhat.com Cc: Benoit Canet ben...@irqsave.net Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block/quorum.c | 189 - trace-events | 5 ++ 2 files changed, 191 insertions(+), 3 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index 7b07e35..ffd7c2d 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -23,6 +23,7 @@ #include qapi/qmp/qlist.h #include qapi/qmp/qstring.h #include qapi-event.h +#include trace.h #define HASH_LENGTH 32 @@ -31,6 +32,10 @@ #define QUORUM_OPT_REWRITErewrite-corrupted #define QUORUM_OPT_READ_PATTERN read-pattern +#define SLICE_TIME 1ULL /* 100 ms */ +#define CHUNK_SIZE (1 20) /* 1M */ +#define SECTORS_PER_CHUNK (CHUNK_SIZE BDRV_SECTOR_BITS) + /* This union holds a vote hash value */ typedef union QuorumVoteValue { char h[HASH_LENGTH]; /* SHA-256 hash */ @@ -64,6 +69,7 @@ typedef struct QuorumVotes { /* the following structure holds the state of one quorum instance */ typedef struct BDRVQuorumState { +BlockDriverState *mybs;/* Quorum block driver base state */ BlockDriverState **bs; /* children BlockDriverStates */ int num_children; /* children count */ int threshold; /* if less than threshold children reads gave the @@ -82,6 +88,10 @@ typedef struct BDRVQuorumState { */ QuorumReadPattern read_pattern; +BdrvDirtyBitmap *dirty_bitmap; +uint8_t *sync_buf; +HBitmapIter hbi; +int64_t sector_num; } BDRVQuorumState; typedef struct QuorumAIOCB QuorumAIOCB; @@ -290,12 +300,11 @@ static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source) } } -static int next_fifo_child(QuorumAIOCB *acb) +static int get_good_child(BDRVQuorumState *s, int iter) { -BDRVQuorumState *s = acb-common.bs-opaque; int i; -for (i = acb-child_iter; i s-num_children; i++) { +for (i = iter; i s-num_children; i++) { if (!s-bs[i]-broken) { break; } @@ -306,6 +315,13 @@ static int next_fifo_child(QuorumAIOCB *acb) return i; } +static int next_fifo_child(QuorumAIOCB *acb) +{ +BDRVQuorumState *s = acb-common.bs-opaque; + +return get_good_child(s, acb-child_iter); +} + static void quorum_aio_cb(void *opaque, int ret) { QuorumChildRequest *sacb = opaque; @@ -951,6 +967,171 @@ static int parse_read_pattern(const char *opt) return -EINVAL; } +static void sync_prepare(BDRVQuorumState *qs, int64_t *num) +{ +int64_t nb, total = bdrv_nb_sectors(qs-mybs); + +qs-sector_num = hbitmap_iter_next(qs-hbi); +/* Wrap around if previous bits get dirty 
while syncing */ +if (qs-sector_num 0) { +bdrv_dirty_iter_init(qs-mybs, qs-dirty_bitmap, qs-hbi); +qs-sector_num = hbitmap_iter_next(qs-hbi); +assert(qs-sector_num = 0); +} + +for (nb = 1; nb SECTORS_PER_CHUNK qs-sector_num + nb total; + nb++) { +if (!bdrv_get_dirty(qs-mybs, qs-dirty_bitmap, qs-sector_num + nb)) { +break; +} +} +*num = nb; +} + +static void sync_finish(BDRVQuorumState *qs, int64_t num) +{ +int64_t i; + +for (i = 0; i num; i++) { +/* We
Re: [Qemu-devel] [PATCH 5/8] quorum: fix quorum_aio_cancel()
On Mon, Sep 01, 2014 at 11:32:04AM +0200, Benoît Canet wrote: The Monday 01 Sep 2014 à 17:26:09 (+0800), Liu Yuan wrote : On Mon, Sep 01, 2014 at 10:35:27AM +0200, Benoît Canet wrote: The Monday 01 Sep 2014 à 15:43:11 (+0800), Liu Yuan wrote : For a fifo read pattern, we only have one running aio (possible other cases that has less number than num_children in the future) I have trouble understanding this part of the commit message could you try to clarify it ? Until this patch, we have two cases, read single or read all. But later patch allow VMs to continue if some devices are down. So the discrete number 1 and N becomes a range [1, N], that is possible running requests are from 1 to N. Why not (In some other cases some children of the num_children BDS could be disabled reducing the number of requests needed) ? Sounds better, I'll take yours, thanks! Yuan
[Qemu-devel] [PATCH] block: kill tail whitespace in block.c
Cc: Kevin Wolf <kw...@redhat.com>
Cc: Stefan Hajnoczi <stefa...@redhat.com>
Signed-off-by: Liu Yuan <namei.u...@gmail.com>
---
 block.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block.c b/block.c
index e9380f6..c12b8de 100644
--- a/block.c
+++ b/block.c
@@ -2239,7 +2239,7 @@ int bdrv_commit(BlockDriverState *bs)
 
     if (!drv)
         return -ENOMEDIUM;
-    
+
     if (!bs->backing_hd) {
         return -ENOTSUP;
     }
-- 
1.9.1
Re: [Qemu-devel] [PATCH v6 0/2] add read-pattern for block qourum
On Mon, Aug 25, 2014 at 10:46:21AM +0800, Liu Yuan wrote: On Mon, Aug 18, 2014 at 05:41:03PM +0800, Liu Yuan wrote: v6: - fix a unused warning introduced by last version Hi Stefan and Kevin, Benoît Canet has added Reviewed-by tag to both patches, could one of you pick this patch set? ping
[Qemu-devel] [PATCH] sheepdog: fix a core dump while do auto-reconnecting
We should reinit local_err as NULL inside the while loop or g_free() will report corruption and abort QEMU when the sheepdog driver tries reconnecting.

qemu-system-x86_64: failed to get the header, Resource temporarily unavailable
qemu-system-x86_64: Failed to connect to socket: Connection refused
qemu-system-x86_64: (null)
[xcb] Unknown sequence number while awaiting reply
[xcb] Most likely this is a multi-threaded client and XInitThreads has not been called
[xcb] Aborting, sorry about that.
qemu-system-x86_64: ../../src/xcb_io.c:298: poll_for_response: Assertion `!xcb_xlib_threads_sequence_lost' failed.
Aborted (core dumped)

Cc: qemu-devel@nongnu.org
Cc: Markus Armbruster <arm...@redhat.com>
Cc: Kevin Wolf <kw...@redhat.com>
Cc: Stefan Hajnoczi <stefa...@redhat.com>
Signed-off-by: Liu Yuan <namei.u...@gmail.com>
---
 block/sheepdog.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index 3af8743..a3a19b1 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -712,7 +712,6 @@ static void coroutine_fn send_pending_req(BDRVSheepdogState *s, uint64_t oid)
 
 static coroutine_fn void reconnect_to_sdog(void *opaque)
 {
-    Error *local_err = NULL;
     BDRVSheepdogState *s = opaque;
     AIOReq *aio_req, *next;
     BlockDriverState *bs = s->bs;
@@ -731,6 +730,7 @@ static coroutine_fn void reconnect_to_sdog(void *opaque)
 
     /* Try to reconnect the sheepdog server every one second. */
     while (s->fd < 0) {
+        Error *local_err = NULL;
         s->fd = get_sheep_fd(s, &local_err);
         if (s->fd < 0) {
             DPRINTF("Wait for connection to be established\n");
-- 
1.9.1
[Qemu-devel] [PATCH v2] sheepdog: fix a core dump while do auto-reconnecting
We should reinit local_err as NULL inside the while loop or g_free() will report corruption and abort QEMU when the sheepdog driver tries reconnecting. This was broken in commit 356b4ca.

qemu-system-x86_64: failed to get the header, Resource temporarily unavailable
qemu-system-x86_64: Failed to connect to socket: Connection refused
qemu-system-x86_64: (null)
[xcb] Unknown sequence number while awaiting reply
[xcb] Most likely this is a multi-threaded client and XInitThreads has not been called
[xcb] Aborting, sorry about that.
qemu-system-x86_64: ../../src/xcb_io.c:298: poll_for_response: Assertion `!xcb_xlib_threads_sequence_lost' failed.
Aborted (core dumped)

Cc: qemu-devel@nongnu.org
Cc: Markus Armbruster <arm...@redhat.com>
Cc: Kevin Wolf <kw...@redhat.com>
Cc: Stefan Hajnoczi <stefa...@redhat.com>
Reviewed-by: Markus Armbruster <arm...@redhat.com>
Signed-off-by: Liu Yuan <namei.u...@gmail.com>
---
 block/sheepdog.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index 12cbd9d..53c24d6 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -712,7 +712,6 @@ static void coroutine_fn send_pending_req(BDRVSheepdogState *s, uint64_t oid)
 
 static coroutine_fn void reconnect_to_sdog(void *opaque)
 {
-    Error *local_err = NULL;
     BDRVSheepdogState *s = opaque;
     AIOReq *aio_req, *next;
 
@@ -727,6 +726,7 @@ static coroutine_fn void reconnect_to_sdog(void *opaque)
 
     /* Try to reconnect the sheepdog server every one second. */
     while (s->fd < 0) {
+        Error *local_err = NULL;
         s->fd = get_sheep_fd(s, &local_err);
         if (s->fd < 0) {
             DPRINTF("Wait for connection to be established\n");
-- 
1.9.1
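QEMU specifics aside, the bug is the classic pattern of an error object outliving the retry loop that allocates it. The self-contained toy loop below (plain malloc'd strings, not QEMU's Error API) shows why the declaration has to move inside the loop.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a call that either succeeds or allocates an error. */
static int try_connect(int attempt, char **errp)
{
    if (attempt < 3) {
        *errp = strdup("Failed to connect to socket: Connection refused");
        return -1;
    }
    return 42; /* pretend this is the new fd */
}

int main(void)
{
    int fd = -1;

    for (int attempt = 0; fd < 0; attempt++) {
        char *local_err = NULL;   /* reset on every iteration, as in the fix */

        fd = try_connect(attempt, &local_err);
        if (fd < 0) {
            fprintf(stderr, "%s\n", local_err);
            free(local_err);
            /* If local_err lived outside the loop and were not reset to
             * NULL, the next failing iteration would overwrite a dangling
             * pointer and free it again: the corruption the patch fixes. */
        }
    }
    printf("connected, fd=%d\n", fd);
    return 0;
}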
Re: [Qemu-devel] [PATCH v6 0/2] add read-pattern for block qourum
On Thu, Aug 28, 2014 at 11:33:08AM +0100, Stefan Hajnoczi wrote: On Mon, Aug 18, 2014 at 05:41:03PM +0800, Liu Yuan wrote: v6: - fix a unused warning introduced by last version v5: - simplify a for loop in quorum_aio_finalize() v4: - swap the patch order - update comment for fifo pattern in qaip - use qapi enumeration in quorum driver instead of manual parsing v3: - separate patch into two, one for quorum and one for qapi for easier review - add enumeration for quorum read pattern - remove unrelated blank line fix from this patch set v2: - rename single as 'fifo' - rename read_all_children as read_quorum_children - fix quorum_aio_finalize() for fifo pattern This patch adds single read pattern to quorum driver and quorum vote is default pattern. For now we do a quorum vote on all the reads, it is designed for unreliable underlying storage such as non-redundant NFS to make sure data integrity at the cost of the read performance. For some use cases as following: VM -- || vv AB Both A and B has hardware raid storage to justify the data integrity on its own. So it would help performance if we do a single read instead of on all the nodes. Further, if we run VM on either of the storage node, we can make a local read request for better performance. This patch generalize the above 2 nodes case in the N nodes. That is, vm - write to all the N nodes, read just one of them. If single read fails, we try to read next node in FIFO order specified by the startup command. The 2 nodes case is very similar to DRBD[1] though lack of auto-sync functionality in the single device/node failure for now. But compared with DRBD we still have some advantages over it: - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed storage. And if A crashes, we need to restart all the VMs on node B. But for practice case, we can't because B might not have enough resources to setup 20 VMs at once. So if we run our 20 VMs with quorum driver, and scatter the replicated images over the data center, we can very likely restart 20 VMs without any resource problem. After all, I think we can build a more powerful replicated image functionality on quorum and block jobs(block mirror) to meet various High Availibility needs. E.g, Enable single read pattern on 2 children, -drive driver=quorum,children.0.file.filename=0.qcow2,\ children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1 [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device Cc: Benoit Canet ben...@irqsave.net Cc: Eric Blake ebl...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Liu Yuan (2): qapi: add read-pattern enum for quorum block/quorum: add simple read pattern support block/quorum.c | 177 +-- qapi/block-core.json | 20 +- 2 files changed, 148 insertions(+), 49 deletions(-) I dropped the \n from the error_setg() error message while merging. Please do not use \n with error_setg(). Thanks for your fix. Please extend the quorum qemu-iotests to cover the new fifo read pattern. You can send the tests as a separate patch series. Okay, will do later. Yuan
Re: [Qemu-devel] [PATCH v6 0/2] add read-pattern for block qourum
On Mon, Aug 18, 2014 at 05:41:03PM +0800, Liu Yuan wrote: v6: - fix a unused warning introduced by last version Hi Stefan and Kevin, Benoît Canet has added Reviewed-by tag to both patches, could one of you pick this patch set? Thanks Yuan
Re: [Qemu-devel] [PATCH v5 2/2] block/quorum: add simple read pattern support
On Mon, Aug 18, 2014 at 01:59:28PM +0800, Liu Yuan wrote: On Fri, Aug 15, 2014 at 03:59:04PM +0200, Benoît Canet wrote: The Friday 15 Aug 2014 à 13:05:17 (+0800), Liu Yuan wrote : This patch adds single read pattern to quorum driver and quorum vote is default pattern. For now we do a quorum vote on all the reads, it is designed for unreliable underlying storage such as non-redundant NFS to make sure data integrity at the cost of the read performance. For some use cases as following: VM -- || vv AB Both A and B has hardware raid storage to justify the data integrity on its own. So it would help performance if we do a single read instead of on all the nodes. Further, if we run VM on either of the storage node, we can make a local read request for better performance. This patch generalize the above 2 nodes case in the N nodes. That is, vm - write to all the N nodes, read just one of them. If single read fails, we try to read next node in FIFO order specified by the startup command. The 2 nodes case is very similar to DRBD[1] though lack of auto-sync functionality in the single device/node failure for now. But compared with DRBD we still have some advantages over it: - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed storage. And if A crashes, we need to restart all the VMs on node B. But for practice case, we can't because B might not have enough resources to setup 20 VMs at once. So if we run our 20 VMs with quorum driver, and scatter the replicated images over the data center, we can very likely restart 20 VMs without any resource problem. After all, I think we can build a more powerful replicated image functionality on quorum and block jobs(block mirror) to meet various High Availibility needs. E.g, Enable single read pattern on 2 children, -drive driver=quorum,children.0.file.filename=0.qcow2,\ children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1 [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device Cc: Benoit Canet ben...@irqsave.net Cc: Eric Blake ebl...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block/quorum.c | 176 ++--- 1 file changed, 129 insertions(+), 47 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index d5ee9c0..1235d7c 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -24,6 +24,7 @@ #define QUORUM_OPT_VOTE_THRESHOLD vote-threshold #define QUORUM_OPT_BLKVERIFY blkverify #define QUORUM_OPT_REWRITErewrite-corrupted +#define QUORUM_OPT_READ_PATTERN read-pattern /* This union holds a vote hash value */ typedef union QuorumVoteValue { @@ -74,6 +75,8 @@ typedef struct BDRVQuorumState { bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted * block if Quorum is reached. */ + +QuorumReadPattern read_pattern; } BDRVQuorumState; typedef struct QuorumAIOCB QuorumAIOCB; @@ -117,6 +120,7 @@ struct QuorumAIOCB { bool is_read; int vote_ret; +int child_iter; /* which child to read in fifo pattern */ I don't understand what fifo pattern could mean for a bunch of disk as they are not forming a queue. Naming isn't 100% accurate but as in Eric's comment (see below), both FIFO and Round-Robin can be used for two different patterns. Maybe round-robin is more suitable but your code does not implement round-robin since it will alway start from the first disk. Your code is scanning the disks set it's a scan pattern. That said is it a problem that the first disk will be accessed more often than the other ? 
As my commit log documented, the purpose of the read pattern I added is to speed up read against quorum original read pattern. And the use case is clear (I hope so) and you can take DRBD as a good example for why we need it. Of course we are far away from DRBD, which need a recovery logic after all kinds of failures. My patch set can be taken as a prelimitary step to implement a DRBD like service driver. Eric previously commented on two read patterns that might be useful: Should we offer multiple modes in addition to 'quorum'? For example, I could see a difference between 'fifo' (favor read from the first quorum member always, unless it fails, good when the first member is local and other member is remote) and 'round-robin' (evenly distribute reads; each read goes to the next available quorum member, good when all members are equally distant
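The two candidate patterns Eric describes differ only in how the next child is chosen. The following self-contained illustration contrasts both selection policies; the series implements only the fifo one, and the helper names here are made up.

#include <stdio.h>

#define NUM_CHILDREN 3

/* fifo: always prefer the lowest-numbered healthy child (good when
 * child 0 is local and the others are remote). */
static int pick_fifo(const int healthy[NUM_CHILDREN])
{
    for (int i = 0; i < NUM_CHILDREN; i++) {
        if (healthy[i]) {
            return i;
        }
    }
    return -1;
}

/* round-robin: rotate across healthy children to spread the read load
 * (good when all children are equally distant). */
static int pick_round_robin(const int healthy[NUM_CHILDREN], int *next)
{
    for (int tries = 0; tries < NUM_CHILDREN; tries++) {
        int i = (*next + tries) % NUM_CHILDREN;
        if (healthy[i]) {
            *next = (i + 1) % NUM_CHILDREN;
            return i;
        }
    }
    return -1;
}

int main(void)
{
    int healthy[NUM_CHILDREN] = { 1, 1, 1 };
    int next = 0;
    int a, b, c;

    printf("fifo: %d %d %d\n", pick_fifo(healthy), pick_fifo(healthy),
           pick_fifo(healthy));                 /* 0 0 0 */

    a = pick_round_robin(healthy, &next);
    b = pick_round_robin(healthy, &next);
    c = pick_round_robin(healthy, &next);
    printf("rr:   %d %d %d\n", a, b, c);        /* 0 1 2 */
    return 0;
}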
Re: [Qemu-devel] [PATCH v5 2/2] block/quorum: add simple read pattern support
On Fri, Aug 15, 2014 at 03:59:04PM +0200, Benoît Canet wrote: The Friday 15 Aug 2014 à 13:05:17 (+0800), Liu Yuan wrote : This patch adds single read pattern to quorum driver and quorum vote is default pattern. For now we do a quorum vote on all the reads, it is designed for unreliable underlying storage such as non-redundant NFS to make sure data integrity at the cost of the read performance. For some use cases as following: VM -- || vv AB Both A and B has hardware raid storage to justify the data integrity on its own. So it would help performance if we do a single read instead of on all the nodes. Further, if we run VM on either of the storage node, we can make a local read request for better performance. This patch generalize the above 2 nodes case in the N nodes. That is, vm - write to all the N nodes, read just one of them. If single read fails, we try to read next node in FIFO order specified by the startup command. The 2 nodes case is very similar to DRBD[1] though lack of auto-sync functionality in the single device/node failure for now. But compared with DRBD we still have some advantages over it: - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed storage. And if A crashes, we need to restart all the VMs on node B. But for practice case, we can't because B might not have enough resources to setup 20 VMs at once. So if we run our 20 VMs with quorum driver, and scatter the replicated images over the data center, we can very likely restart 20 VMs without any resource problem. After all, I think we can build a more powerful replicated image functionality on quorum and block jobs(block mirror) to meet various High Availibility needs. E.g, Enable single read pattern on 2 children, -drive driver=quorum,children.0.file.filename=0.qcow2,\ children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1 [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device Cc: Benoit Canet ben...@irqsave.net Cc: Eric Blake ebl...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block/quorum.c | 176 ++--- 1 file changed, 129 insertions(+), 47 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index d5ee9c0..1235d7c 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -24,6 +24,7 @@ #define QUORUM_OPT_VOTE_THRESHOLD vote-threshold #define QUORUM_OPT_BLKVERIFY blkverify #define QUORUM_OPT_REWRITErewrite-corrupted +#define QUORUM_OPT_READ_PATTERN read-pattern /* This union holds a vote hash value */ typedef union QuorumVoteValue { @@ -74,6 +75,8 @@ typedef struct BDRVQuorumState { bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted * block if Quorum is reached. */ + +QuorumReadPattern read_pattern; } BDRVQuorumState; typedef struct QuorumAIOCB QuorumAIOCB; @@ -117,6 +120,7 @@ struct QuorumAIOCB { bool is_read; int vote_ret; +int child_iter; /* which child to read in fifo pattern */ I don't understand what fifo pattern could mean for a bunch of disk as they are not forming a queue. Naming isn't 100% accurate but as in Eric's comment (see below), both FIFO and Round-Robin can be used for two different patterns. Maybe round-robin is more suitable but your code does not implement round-robin since it will alway start from the first disk. Your code is scanning the disks set it's a scan pattern. That said is it a problem that the first disk will be accessed more often than the other ? 
As my commit log documented, the purpose of the read pattern I added is to speed up read against quorum original read pattern. And the use case is clear (I hope so) and you can take DRBD as a good example for why we need it. Of course we are far away from DRBD, which need a recovery logic after all kinds of failures. My patch set can be taken as a prelimitary step to implement a DRBD like service driver. Eric previously commented on two read patterns that might be useful: Should we offer multiple modes in addition to 'quorum'? For example, I could see a difference between 'fifo' (favor read from the first quorum member always, unless it fails, good when the first member is local and other member is remote) and 'round-robin' (evenly distribute reads; each read goes to the next available quorum member, good when all members are equally distant). You will have to care to insert disks in different order on each QEMU to spread the load. This is another use case that my patch set didn't try to solve. Shouldn't the code try to spread
[Qemu-devel] [PATCH v6 1/2] qapi: add read-pattern enum for quorum
Cc: Eric Blake <ebl...@redhat.com>
Reviewed-by: Eric Blake <ebl...@redhat.com>
Reviewed-by: Benoît Canet <benoit.ca...@nodalink.com>
Signed-off-by: Liu Yuan <namei.u...@gmail.com>
---
 qapi/block-core.json | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index e378653..42033d9 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1384,6 +1384,19 @@
       'raw': 'BlockdevRef' } }
 
 ##
+# @QuorumReadPattern
+#
+# An enumeration of quorum read patterns.
+#
+# @quorum: read all the children and do a quorum vote on reads
+#
+# @fifo: read only from the first child that has not failed
+#
+# Since: 2.2
+##
+{ 'enum': 'QuorumReadPattern', 'data': [ 'quorum', 'fifo' ] }
+
+##
 # @BlockdevOptionsQuorum
 #
 # Driver specific block device options for Quorum
@@ -1398,12 +1411,17 @@
 # @rewrite-corrupted: #optional rewrite corrupted data when quorum is reached
 #                     (Since 2.1)
 #
+# @read-pattern: #optional choose read pattern and set to quorum by default
+#                (Since 2.2)
+#
 # Since: 2.0
 ##
 { 'type': 'BlockdevOptionsQuorum',
   'data': { '*blkverify': 'bool',
             'children': [ 'BlockdevRef' ],
-            'vote-threshold': 'int', '*rewrite-corrupted': 'bool' } }
+            'vote-threshold': 'int',
+            '*rewrite-corrupted': 'bool',
+            '*read-pattern': 'QuorumReadPattern' } }
 
 ##
 # @BlockdevOptions
-- 
1.9.1
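On the driver side the new enum replaces manual string parsing; the standalone sketch below mirrors the shape of the driver's parse_read_pattern(). In QEMU the enum and its name table are generated from qapi/block-core.json, so the local copies here are stand-ins for the generated code and are only for illustration.

#include <string.h>
#include <errno.h>
#include <stdio.h>

/* Standalone mirror of the QAPI enum above, for illustration only. */
typedef enum QuorumReadPattern {
    QUORUM_READ_PATTERN_QUORUM = 0,
    QUORUM_READ_PATTERN_FIFO,
    QUORUM_READ_PATTERN_MAX,
} QuorumReadPattern;

static const char *const quorum_read_pattern_names[] = {
    [QUORUM_READ_PATTERN_QUORUM] = "quorum",
    [QUORUM_READ_PATTERN_FIFO]   = "fifo",
};

static int parse_read_pattern(const char *opt)
{
    if (!opt) {
        return QUORUM_READ_PATTERN_QUORUM;  /* quorum is the default */
    }
    for (int i = 0; i < QUORUM_READ_PATTERN_MAX; i++) {
        if (!strcmp(opt, quorum_read_pattern_names[i])) {
            return i;
        }
    }
    return -EINVAL;
}

int main(void)
{
    printf("%d %d %d\n", parse_read_pattern(NULL), parse_read_pattern("fifo"),
           parse_read_pattern("bogus"));
    /* prints "0 1 -22" on Linux, where EINVAL is 22 */
    return 0;
}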
[Qemu-devel] [PATCH v6 0/2] add read-pattern for block qourum
v6: - fix a unused warning introduced by last version v5: - simplify a for loop in quorum_aio_finalize() v4: - swap the patch order - update comment for fifo pattern in qaip - use qapi enumeration in quorum driver instead of manual parsing v3: - separate patch into two, one for quorum and one for qapi for easier review - add enumeration for quorum read pattern - remove unrelated blank line fix from this patch set v2: - rename single as 'fifo' - rename read_all_children as read_quorum_children - fix quorum_aio_finalize() for fifo pattern This patch adds single read pattern to quorum driver and quorum vote is default pattern. For now we do a quorum vote on all the reads, it is designed for unreliable underlying storage such as non-redundant NFS to make sure data integrity at the cost of the read performance. For some use cases as following: VM -- || vv AB Both A and B has hardware raid storage to justify the data integrity on its own. So it would help performance if we do a single read instead of on all the nodes. Further, if we run VM on either of the storage node, we can make a local read request for better performance. This patch generalize the above 2 nodes case in the N nodes. That is, vm - write to all the N nodes, read just one of them. If single read fails, we try to read next node in FIFO order specified by the startup command. The 2 nodes case is very similar to DRBD[1] though lack of auto-sync functionality in the single device/node failure for now. But compared with DRBD we still have some advantages over it: - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed storage. And if A crashes, we need to restart all the VMs on node B. But for practice case, we can't because B might not have enough resources to setup 20 VMs at once. So if we run our 20 VMs with quorum driver, and scatter the replicated images over the data center, we can very likely restart 20 VMs without any resource problem. After all, I think we can build a more powerful replicated image functionality on quorum and block jobs(block mirror) to meet various High Availibility needs. E.g, Enable single read pattern on 2 children, -drive driver=quorum,children.0.file.filename=0.qcow2,\ children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1 [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device Cc: Benoit Canet ben...@irqsave.net Cc: Eric Blake ebl...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Liu Yuan (2): qapi: add read-pattern enum for quorum block/quorum: add simple read pattern support block/quorum.c | 177 +-- qapi/block-core.json | 20 +- 2 files changed, 148 insertions(+), 49 deletions(-) -- 1.9.1
[Qemu-devel] [PATCH v6 2/2] block/quorum: add simple read pattern support
This patch adds single read pattern to quorum driver and quorum vote is default pattern. For now we do a quorum vote on all the reads, it is designed for unreliable underlying storage such as non-redundant NFS to make sure data integrity at the cost of the read performance. For some use cases as following: VM -- || vv AB Both A and B has hardware raid storage to justify the data integrity on its own. So it would help performance if we do a single read instead of on all the nodes. Further, if we run VM on either of the storage node, we can make a local read request for better performance. This patch generalize the above 2 nodes case in the N nodes. That is, vm - write to all the N nodes, read just one of them. If single read fails, we try to read next node in FIFO order specified by the startup command. The 2 nodes case is very similar to DRBD[1] though lack of auto-sync functionality in the single device/node failure for now. But compared with DRBD we still have some advantages over it: - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed storage. And if A crashes, we need to restart all the VMs on node B. But for practice case, we can't because B might not have enough resources to setup 20 VMs at once. So if we run our 20 VMs with quorum driver, and scatter the replicated images over the data center, we can very likely restart 20 VMs without any resource problem. After all, I think we can build a more powerful replicated image functionality on quorum and block jobs(block mirror) to meet various High Availibility needs. E.g, Enable single read pattern on 2 children, -drive driver=quorum,children.0.file.filename=0.qcow2,\ children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1 [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device Cc: Benoit Canet ben...@irqsave.net Cc: Eric Blake ebl...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block/quorum.c | 177 + 1 file changed, 129 insertions(+), 48 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index d5ee9c0..c5f1b35 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -24,6 +24,7 @@ #define QUORUM_OPT_VOTE_THRESHOLD vote-threshold #define QUORUM_OPT_BLKVERIFY blkverify #define QUORUM_OPT_REWRITErewrite-corrupted +#define QUORUM_OPT_READ_PATTERN read-pattern /* This union holds a vote hash value */ typedef union QuorumVoteValue { @@ -74,6 +75,8 @@ typedef struct BDRVQuorumState { bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted * block if Quorum is reached. 
*/ + +QuorumReadPattern read_pattern; } BDRVQuorumState; typedef struct QuorumAIOCB QuorumAIOCB; @@ -117,6 +120,7 @@ struct QuorumAIOCB { bool is_read; int vote_ret; +int child_iter; /* which child to read in fifo pattern */ }; static bool quorum_vote(QuorumAIOCB *acb); @@ -143,7 +147,6 @@ static AIOCBInfo quorum_aiocb_info = { static void quorum_aio_finalize(QuorumAIOCB *acb) { -BDRVQuorumState *s = acb-common.bs-opaque; int i, ret = 0; if (acb-vote_ret) { @@ -153,7 +156,8 @@ static void quorum_aio_finalize(QuorumAIOCB *acb) acb-common.cb(acb-common.opaque, ret); if (acb-is_read) { -for (i = 0; i s-num_children; i++) { +/* on the quorum case acb-child_iter == s-num_children - 1 */ +for (i = 0; i = acb-child_iter; i++) { qemu_vfree(acb-qcrs[i].buf); qemu_iovec_destroy(acb-qcrs[i].qiov); } @@ -256,6 +260,21 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret) quorum_aio_finalize(acb); } +static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb); + +static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source) +{ +int i; +assert(dest-niov == source-niov); +assert(dest-size == source-size); +for (i = 0; i source-niov; i++) { +assert(dest-iov[i].iov_len == source-iov[i].iov_len); +memcpy(dest-iov[i].iov_base, + source-iov[i].iov_base, + source-iov[i].iov_len); +} +} + static void quorum_aio_cb(void *opaque, int ret) { QuorumChildRequest *sacb = opaque; @@ -263,6 +282,21 @@ static void quorum_aio_cb(void *opaque, int ret) BDRVQuorumState *s = acb-common.bs-opaque; bool rewrite = false; +if (acb-is_read s-read_pattern == QUORUM_READ_PATTERN_FIFO) { +/* We try to read next child in FIFO order if we fail to read */ +if (ret 0 ++acb-child_iter s-num_children) { +read_fifo_child(acb); +return; +} + +if (ret == 0) { +quorum_copy_qiov(acb-qiov, acb-qcrs[acb-child_iter].qiov); +} +acb-vote_ret = ret
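A self-contained sketch of the fifo pattern described above, stripped of QEMU's AIO machinery (the read_child callback and its signature are illustrative, not QEMU APIs): try the children in the configured order and return the first successful read, reporting failure only when every child fails.

#include <stddef.h>

typedef int (*read_child_fn)(int child_index, void *buf, size_t len);

static int fifo_read(read_child_fn read_child, int num_children,
                     void *buf, size_t len)
{
    int ret = -1;
    int i;

    for (i = 0; i < num_children; i++) {
        ret = read_child(i, buf, len);
        if (ret == 0) {
            return 0;   /* first healthy child wins */
        }
        /* on failure, fall through to the next child in FIFO order */
    }
    return ret;         /* every child failed; return the last error */
}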
Re: [Qemu-devel] [PATCH v4 2/2] block/quorum: add simple read pattern support
On Tue, Aug 12, 2014 at 10:41:28AM +0800, Liu Yuan wrote: On Mon, Aug 11, 2014 at 02:31:43PM +0200, Benoît Canet wrote: The Thursday 17 Jul 2014 à 13:18:56 (+0800), Liu Yuan wrote : This patch adds single read pattern to quorum driver and quorum vote is default pattern. For now we do a quorum vote on all the reads, it is designed for unreliable underlying storage such as non-redundant NFS to make sure data integrity at the cost of the read performance. For some use cases as following: VM -- || vv AB Both A and B has hardware raid storage to justify the data integrity on its own. So it would help performance if we do a single read instead of on all the nodes. Further, if we run VM on either of the storage node, we can make a local read request for better performance. This patch generalize the above 2 nodes case in the N nodes. That is, vm - write to all the N nodes, read just one of them. If single read fails, we try to read next node in FIFO order specified by the startup command. The 2 nodes case is very similar to DRBD[1] though lack of auto-sync functionality in the single device/node failure for now. But compared with DRBD we still have some advantages over it: - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed storage. And if A crashes, we need to restart all the VMs on node B. But for practice case, we can't because B might not have enough resources to setup 20 VMs at once. So if we run our 20 VMs with quorum driver, and scatter the replicated images over the data center, we can very likely restart 20 VMs without any resource problem. After all, I think we can build a more powerful replicated image functionality on quorum and block jobs(block mirror) to meet various High Availibility needs. E.g, Enable single read pattern on 2 children, -drive driver=quorum,children.0.file.filename=0.qcow2,\ children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1 [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device Cc: Benoit Canet ben...@irqsave.net Cc: Eric Blake ebl...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block/quorum.c | 179 + 1 file changed, 131 insertions(+), 48 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index d5ee9c0..ebf5c71 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -24,6 +24,7 @@ #define QUORUM_OPT_VOTE_THRESHOLD vote-threshold #define QUORUM_OPT_BLKVERIFY blkverify #define QUORUM_OPT_REWRITErewrite-corrupted +#define QUORUM_OPT_READ_PATTERN read-pattern /* This union holds a vote hash value */ typedef union QuorumVoteValue { @@ -74,6 +75,8 @@ typedef struct BDRVQuorumState { bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted * block if Quorum is reached. 
*/ + +QuorumReadPattern read_pattern; } BDRVQuorumState; typedef struct QuorumAIOCB QuorumAIOCB; @@ -117,6 +120,7 @@ struct QuorumAIOCB { bool is_read; int vote_ret; +int child_iter; /* which child to read in fifo pattern */ }; static bool quorum_vote(QuorumAIOCB *acb); @@ -154,8 +158,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb) if (acb-is_read) { for (i = 0; i s-num_children; i++) { -qemu_vfree(acb-qcrs[i].buf); -qemu_iovec_destroy(acb-qcrs[i].qiov); +if (i = acb-child_iter) { +qemu_vfree(acb-qcrs[i].buf); +qemu_iovec_destroy(acb-qcrs[i].qiov); +} } } @@ -256,6 +262,21 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret) quorum_aio_finalize(acb); } +static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb); + +static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source) +{ +int i; +assert(dest-niov == source-niov); +assert(dest-size == source-size); +for (i = 0; i source-niov; i++) { +assert(dest-iov[i].iov_len == source-iov[i].iov_len); +memcpy(dest-iov[i].iov_base, + source-iov[i].iov_base, + source-iov[i].iov_len); +} +} + static void quorum_aio_cb(void *opaque, int ret) { QuorumChildRequest *sacb = opaque; @@ -263,6 +284,21 @@ static void quorum_aio_cb(void *opaque, int ret) BDRVQuorumState *s = acb-common.bs-opaque; bool rewrite = false
Re: [Qemu-devel] [PATCH v3 0/2] sheepdog driver update related to VDI locking feature
On Mon, Aug 11, 2014 at 02:43:44PM +0900, Hitoshi Mitake wrote: Recently, sheepdog revived its VDI locking functionality. This patchset updates the sheepdog driver of QEMU for this feature. v3: - keep backward compatibility v2: - don't handle SD_RES_VDI_NOT_LOCKED as a special case Hitoshi Mitake (2): sheepdog: adopting protocol update for VDI locking sheepdog: improve error handling for a case of failed lock block/sheepdog.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) Reviewed-by: Liu Yuan namei.u...@gmail.com
Re: [Qemu-devel] [PATCH v4 2/2] block/quorum: add simple read pattern support
On Thu, Aug 14, 2014 at 01:09:32PM +0200, Benoît Canet wrote: The Thursday 17 Jul 2014 à 13:18:56 (+0800), Liu Yuan wrote : This patch adds single read pattern to quorum driver and quorum vote is default pattern. For now we do a quorum vote on all the reads, it is designed for unreliable underlying storage such as non-redundant NFS to make sure data integrity at the cost of the read performance. For some use cases as following: VM -- || vv AB Both A and B has hardware raid storage to justify the data integrity on its own. So it would help performance if we do a single read instead of on all the nodes. Further, if we run VM on either of the storage node, we can make a local read request for better performance. This patch generalize the above 2 nodes case in the N nodes. That is, vm - write to all the N nodes, read just one of them. If single read fails, we try to read next node in FIFO order specified by the startup command. The 2 nodes case is very similar to DRBD[1] though lack of auto-sync functionality in the single device/node failure for now. But compared with DRBD we still have some advantages over it: - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed storage. And if A crashes, we need to restart all the VMs on node B. But for practice case, we can't because B might not have enough resources to setup 20 VMs at once. So if we run our 20 VMs with quorum driver, and scatter the replicated images over the data center, we can very likely restart 20 VMs without any resource problem. After all, I think we can build a more powerful replicated image functionality on quorum and block jobs(block mirror) to meet various High Availibility needs. E.g, Enable single read pattern on 2 children, -drive driver=quorum,children.0.file.filename=0.qcow2,\ children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1 [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device Cc: Benoit Canet ben...@irqsave.net Cc: Eric Blake ebl...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block/quorum.c | 179 + 1 file changed, 131 insertions(+), 48 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index d5ee9c0..ebf5c71 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -24,6 +24,7 @@ #define QUORUM_OPT_VOTE_THRESHOLD vote-threshold #define QUORUM_OPT_BLKVERIFY blkverify #define QUORUM_OPT_REWRITErewrite-corrupted +#define QUORUM_OPT_READ_PATTERN read-pattern /* This union holds a vote hash value */ typedef union QuorumVoteValue { @@ -74,6 +75,8 @@ typedef struct BDRVQuorumState { bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted * block if Quorum is reached. */ + +QuorumReadPattern read_pattern; } BDRVQuorumState; typedef struct QuorumAIOCB QuorumAIOCB; @@ -117,6 +120,7 @@ struct QuorumAIOCB { bool is_read; int vote_ret; +int child_iter; /* which child to read in fifo pattern */ }; static bool quorum_vote(QuorumAIOCB *acb); @@ -154,8 +158,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb) if (acb-is_read) { for (i = 0; i s-num_children; i++) { -qemu_vfree(acb-qcrs[i].buf); -qemu_iovec_destroy(acb-qcrs[i].qiov); +if (i = acb-child_iter) { +qemu_vfree(acb-qcrs[i].buf); +qemu_iovec_destroy(acb-qcrs[i].qiov); +} This seems convoluted. What about ? /* on the quorum case acb-child_iter == s-num_children - 1 */ for (i = 0; i = acb-child_iter; i++) { qemu_vfree(acb-qcrs[i].buf); qemu_iovec_destroy(acb-qcrs[i].qiov); } } Sounds good. I'll update the patch v5. 
Thanks Yuan
[Qemu-devel] [PATCH v5 0/2] add read-pattern for block quorum
v5: - simplify a for loop in quorum_aio_finalize() v4: - swap the patch order - update comment for fifo pattern in qaip - use qapi enumeration in quorum driver instead of manual parsing v3: - separate patch into two, one for quorum and one for qapi for easier review - add enumeration for quorum read pattern - remove unrelated blank line fix from this patch set v2: - rename single as 'fifo' - rename read_all_children as read_quorum_children - fix quorum_aio_finalize() for fifo pattern This patch adds single read pattern to quorum driver and quorum vote is default pattern. For now we do a quorum vote on all the reads, it is designed for unreliable underlying storage such as non-redundant NFS to make sure data integrity at the cost of the read performance. For some use cases as following: VM -- || vv AB Both A and B has hardware raid storage to justify the data integrity on its own. So it would help performance if we do a single read instead of on all the nodes. Further, if we run VM on either of the storage node, we can make a local read request for better performance. This patch generalize the above 2 nodes case in the N nodes. That is, vm - write to all the N nodes, read just one of them. If single read fails, we try to read next node in FIFO order specified by the startup command. The 2 nodes case is very similar to DRBD[1] though lack of auto-sync functionality in the single device/node failure for now. But compared with DRBD we still have some advantages over it: - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed storage. And if A crashes, we need to restart all the VMs on node B. But for practice case, we can't because B might not have enough resources to setup 20 VMs at once. So if we run our 20 VMs with quorum driver, and scatter the replicated images over the data center, we can very likely restart 20 VMs without any resource problem. After all, I think we can build a more powerful replicated image functionality on quorum and block jobs(block mirror) to meet various High Availibility needs. E.g, Enable single read pattern on 2 children, -drive driver=quorum,children.0.file.filename=0.qcow2,\ children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1 [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device Cc: Benoit Canet ben...@irqsave.net Cc: Eric Blake ebl...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com Liu Yuan (2): qapi: add read-pattern enum for quorum block/quorum: add simple read pattern support block/quorum.c | 176 +-- qapi/block-core.json | 20 +- 2 files changed, 148 insertions(+), 48 deletions(-) -- 1.9.1
[Qemu-devel] [PATCH v5 2/2] block/quorum: add simple read pattern support
This patch adds single read pattern to quorum driver and quorum vote is default pattern. For now we do a quorum vote on all the reads, it is designed for unreliable underlying storage such as non-redundant NFS to make sure data integrity at the cost of the read performance. For some use cases as following: VM -- || vv AB Both A and B has hardware raid storage to justify the data integrity on its own. So it would help performance if we do a single read instead of on all the nodes. Further, if we run VM on either of the storage node, we can make a local read request for better performance. This patch generalize the above 2 nodes case in the N nodes. That is, vm - write to all the N nodes, read just one of them. If single read fails, we try to read next node in FIFO order specified by the startup command. The 2 nodes case is very similar to DRBD[1] though lack of auto-sync functionality in the single device/node failure for now. But compared with DRBD we still have some advantages over it: - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed storage. And if A crashes, we need to restart all the VMs on node B. But for practice case, we can't because B might not have enough resources to setup 20 VMs at once. So if we run our 20 VMs with quorum driver, and scatter the replicated images over the data center, we can very likely restart 20 VMs without any resource problem. After all, I think we can build a more powerful replicated image functionality on quorum and block jobs(block mirror) to meet various High Availibility needs. E.g, Enable single read pattern on 2 children, -drive driver=quorum,children.0.file.filename=0.qcow2,\ children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1 [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device Cc: Benoit Canet ben...@irqsave.net Cc: Eric Blake ebl...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block/quorum.c | 176 ++--- 1 file changed, 129 insertions(+), 47 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index d5ee9c0..1235d7c 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -24,6 +24,7 @@ #define QUORUM_OPT_VOTE_THRESHOLD vote-threshold #define QUORUM_OPT_BLKVERIFY blkverify #define QUORUM_OPT_REWRITErewrite-corrupted +#define QUORUM_OPT_READ_PATTERN read-pattern /* This union holds a vote hash value */ typedef union QuorumVoteValue { @@ -74,6 +75,8 @@ typedef struct BDRVQuorumState { bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted * block if Quorum is reached. 
*/ + +QuorumReadPattern read_pattern; } BDRVQuorumState; typedef struct QuorumAIOCB QuorumAIOCB; @@ -117,6 +120,7 @@ struct QuorumAIOCB { bool is_read; int vote_ret; +int child_iter; /* which child to read in fifo pattern */ }; static bool quorum_vote(QuorumAIOCB *acb); @@ -153,7 +157,8 @@ static void quorum_aio_finalize(QuorumAIOCB *acb) acb-common.cb(acb-common.opaque, ret); if (acb-is_read) { -for (i = 0; i s-num_children; i++) { +/* on the quorum case acb-child_iter == s-num_children - 1 */ +for (i = 0; i = acb-child_iter; i++) { qemu_vfree(acb-qcrs[i].buf); qemu_iovec_destroy(acb-qcrs[i].qiov); } @@ -256,6 +261,21 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret) quorum_aio_finalize(acb); } +static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb); + +static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source) +{ +int i; +assert(dest-niov == source-niov); +assert(dest-size == source-size); +for (i = 0; i source-niov; i++) { +assert(dest-iov[i].iov_len == source-iov[i].iov_len); +memcpy(dest-iov[i].iov_base, + source-iov[i].iov_base, + source-iov[i].iov_len); +} +} + static void quorum_aio_cb(void *opaque, int ret) { QuorumChildRequest *sacb = opaque; @@ -263,6 +283,21 @@ static void quorum_aio_cb(void *opaque, int ret) BDRVQuorumState *s = acb-common.bs-opaque; bool rewrite = false; +if (acb-is_read s-read_pattern == QUORUM_READ_PATTERN_FIFO) { +/* We try to read next child in FIFO order if we fail to read */ +if (ret 0 ++acb-child_iter s-num_children) { +read_fifo_child(acb); +return; +} + +if (ret == 0) { +quorum_copy_qiov(acb-qiov, acb-qcrs[acb-child_iter].qiov); +} +acb-vote_ret = ret; +quorum_aio_finalize(acb); +return; +} + sacb-ret = ret; acb-count++; if (ret == 0) { @@ -343,19 +378,6 @@ static bool quorum_rewrite_bad_versions(BDRVQuorumState *s, QuorumAIOCB *acb
[Qemu-devel] [PATCH v5 1/2] qapi: add read-pattern enum for quorum
Cc: Eric Blake ebl...@redhat.com Reviewed-by: Eric Blake ebl...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- qapi/block-core.json | 20 +++- 1 file changed, 19 insertions(+), 1 deletion(-) diff --git a/qapi/block-core.json b/qapi/block-core.json index e378653..42033d9 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -1384,6 +1384,19 @@ 'raw': 'BlockdevRef' } } ## +# @QuorumReadPattern +# +# An enumeration of quorum read patterns. +# +# @quorum: read all the children and do a quorum vote on reads +# +# @fifo: read only from the first child that has not failed +# +# Since: 2.2 +## +{ 'enum': 'QuorumReadPattern', 'data': [ 'quorum', 'fifo' ] } + +## # @BlockdevOptionsQuorum # # Driver specific block device options for Quorum @@ -1398,12 +1411,17 @@ # @rewrite-corrupted: #optional rewrite corrupted data when quorum is reached # (Since 2.1) # +# @read-pattern: #optional choose read pattern and set to quorum by default +#(Since 2.2) +# # Since: 2.0 ## { 'type': 'BlockdevOptionsQuorum', 'data': { '*blkverify': 'bool', 'children': [ 'BlockdevRef' ], -'vote-threshold': 'int', '*rewrite-corrupted': 'bool' } } +'vote-threshold': 'int', +'*rewrite-corrupted': 'bool', +'*read-pattern': 'QuorumReadPattern' } } ## # @BlockdevOptions -- 1.9.1
Re: [Qemu-devel] [PATCH v4 2/2] block/quorum: add simple read pattern support
On Thu, Jul 17, 2014 at 01:18:56PM +0800, Liu Yuan wrote: This patch adds single read pattern to quorum driver and quorum vote is default pattern. For now we do a quorum vote on all the reads, it is designed for unreliable underlying storage such as non-redundant NFS to make sure data integrity at the cost of the read performance. For some use cases as following: VM -- || vv AB Both A and B has hardware raid storage to justify the data integrity on its own. So it would help performance if we do a single read instead of on all the nodes. Further, if we run VM on either of the storage node, we can make a local read request for better performance. This patch generalize the above 2 nodes case in the N nodes. That is, vm - write to all the N nodes, read just one of them. If single read fails, we try to read next node in FIFO order specified by the startup command. The 2 nodes case is very similar to DRBD[1] though lack of auto-sync functionality in the single device/node failure for now. But compared with DRBD we still have some advantages over it: - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed storage. And if A crashes, we need to restart all the VMs on node B. But for practice case, we can't because B might not have enough resources to setup 20 VMs at once. So if we run our 20 VMs with quorum driver, and scatter the replicated images over the data center, we can very likely restart 20 VMs without any resource problem. After all, I think we can build a more powerful replicated image functionality on quorum and block jobs(block mirror) to meet various High Availibility needs. E.g, Enable single read pattern on 2 children, -drive driver=quorum,children.0.file.filename=0.qcow2,\ children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1 [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device Cc: Benoit Canet ben...@irqsave.net Cc: Eric Blake ebl...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block/quorum.c | 179 + 1 file changed, 131 insertions(+), 48 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index d5ee9c0..ebf5c71 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -24,6 +24,7 @@ #define QUORUM_OPT_VOTE_THRESHOLD vote-threshold #define QUORUM_OPT_BLKVERIFY blkverify #define QUORUM_OPT_REWRITErewrite-corrupted +#define QUORUM_OPT_READ_PATTERN read-pattern /* This union holds a vote hash value */ typedef union QuorumVoteValue { @@ -74,6 +75,8 @@ typedef struct BDRVQuorumState { bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted * block if Quorum is reached. 
*/ + +QuorumReadPattern read_pattern; } BDRVQuorumState; typedef struct QuorumAIOCB QuorumAIOCB; @@ -117,6 +120,7 @@ struct QuorumAIOCB { bool is_read; int vote_ret; +int child_iter; /* which child to read in fifo pattern */ }; static bool quorum_vote(QuorumAIOCB *acb); @@ -154,8 +158,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb) if (acb-is_read) { for (i = 0; i s-num_children; i++) { -qemu_vfree(acb-qcrs[i].buf); -qemu_iovec_destroy(acb-qcrs[i].qiov); +if (i = acb-child_iter) { +qemu_vfree(acb-qcrs[i].buf); +qemu_iovec_destroy(acb-qcrs[i].qiov); +} } } @@ -256,6 +262,21 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret) quorum_aio_finalize(acb); } +static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb); + +static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source) +{ +int i; +assert(dest-niov == source-niov); +assert(dest-size == source-size); +for (i = 0; i source-niov; i++) { +assert(dest-iov[i].iov_len == source-iov[i].iov_len); +memcpy(dest-iov[i].iov_base, + source-iov[i].iov_base, + source-iov[i].iov_len); +} +} + static void quorum_aio_cb(void *opaque, int ret) { QuorumChildRequest *sacb = opaque; @@ -263,6 +284,21 @@ static void quorum_aio_cb(void *opaque, int ret) BDRVQuorumState *s = acb-common.bs-opaque; bool rewrite = false; +if (acb-is_read s-read_pattern == QUORUM_READ_PATTERN_FIFO) { +/* We try to read next child in FIFO order if we fail to read */ +if (ret 0 ++acb-child_iter s-num_children) { +read_fifo_child(acb); +return; +} + +if (ret == 0) { +quorum_copy_qiov(acb-qiov, acb-qcrs[acb-child_iter].qiov); +} +acb-vote_ret = ret
Re: [Qemu-devel] [PATCH v4 2/2] block/quorum: add simple read pattern support
On Mon, Aug 11, 2014 at 02:31:43PM +0200, Benoît Canet wrote: The Thursday 17 Jul 2014 à 13:18:56 (+0800), Liu Yuan wrote : This patch adds single read pattern to quorum driver and quorum vote is default pattern. For now we do a quorum vote on all the reads, it is designed for unreliable underlying storage such as non-redundant NFS to make sure data integrity at the cost of the read performance. For some use cases as following: VM -- || vv AB Both A and B has hardware raid storage to justify the data integrity on its own. So it would help performance if we do a single read instead of on all the nodes. Further, if we run VM on either of the storage node, we can make a local read request for better performance. This patch generalize the above 2 nodes case in the N nodes. That is, vm - write to all the N nodes, read just one of them. If single read fails, we try to read next node in FIFO order specified by the startup command. The 2 nodes case is very similar to DRBD[1] though lack of auto-sync functionality in the single device/node failure for now. But compared with DRBD we still have some advantages over it: - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed storage. And if A crashes, we need to restart all the VMs on node B. But for practice case, we can't because B might not have enough resources to setup 20 VMs at once. So if we run our 20 VMs with quorum driver, and scatter the replicated images over the data center, we can very likely restart 20 VMs without any resource problem. After all, I think we can build a more powerful replicated image functionality on quorum and block jobs(block mirror) to meet various High Availibility needs. E.g, Enable single read pattern on 2 children, -drive driver=quorum,children.0.file.filename=0.qcow2,\ children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1 [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device Cc: Benoit Canet ben...@irqsave.net Cc: Eric Blake ebl...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block/quorum.c | 179 + 1 file changed, 131 insertions(+), 48 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index d5ee9c0..ebf5c71 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -24,6 +24,7 @@ #define QUORUM_OPT_VOTE_THRESHOLD vote-threshold #define QUORUM_OPT_BLKVERIFY blkverify #define QUORUM_OPT_REWRITErewrite-corrupted +#define QUORUM_OPT_READ_PATTERN read-pattern /* This union holds a vote hash value */ typedef union QuorumVoteValue { @@ -74,6 +75,8 @@ typedef struct BDRVQuorumState { bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted * block if Quorum is reached. 
*/ + +QuorumReadPattern read_pattern; } BDRVQuorumState; typedef struct QuorumAIOCB QuorumAIOCB; @@ -117,6 +120,7 @@ struct QuorumAIOCB { bool is_read; int vote_ret; +int child_iter; /* which child to read in fifo pattern */ }; static bool quorum_vote(QuorumAIOCB *acb); @@ -154,8 +158,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb) if (acb-is_read) { for (i = 0; i s-num_children; i++) { -qemu_vfree(acb-qcrs[i].buf); -qemu_iovec_destroy(acb-qcrs[i].qiov); +if (i = acb-child_iter) { +qemu_vfree(acb-qcrs[i].buf); +qemu_iovec_destroy(acb-qcrs[i].qiov); +} } } @@ -256,6 +262,21 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret) quorum_aio_finalize(acb); } +static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb); + +static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source) +{ +int i; +assert(dest-niov == source-niov); +assert(dest-size == source-size); +for (i = 0; i source-niov; i++) { +assert(dest-iov[i].iov_len == source-iov[i].iov_len); +memcpy(dest-iov[i].iov_base, + source-iov[i].iov_base, + source-iov[i].iov_len); +} +} + static void quorum_aio_cb(void *opaque, int ret) { QuorumChildRequest *sacb = opaque; @@ -263,6 +284,21 @@ static void quorum_aio_cb(void *opaque, int ret) BDRVQuorumState *s = acb-common.bs-opaque; bool rewrite = false; +if (acb-is_read s-read_pattern == QUORUM_READ_PATTERN_FIFO) { +/* We try to read next child in FIFO order if we fail to read */ +if (ret 0 ++acb-child_iter s-num_children
Re: [Qemu-devel] [PATCH 1/2] sheepdog: adopting protocol update for VDI locking
On Mon, Aug 11, 2014 at 11:17:33AM +0900, Hitoshi Mitake wrote: At Fri, 8 Aug 2014 15:49:37 +0800, Liu Yuan wrote: On Fri, Aug 08, 2014 at 03:12:17PM +0900, Hitoshi Mitake wrote: At Fri, 8 Aug 2014 13:20:39 +0800, Liu Yuan wrote: On Thu, Aug 07, 2014 at 04:28:39PM +0900, Hitoshi Mitake wrote: The update is required for supporting iSCSI multipath. It doesn't affect behavior of QEMU driver but adding a new field to vdi request struct is required. Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Cc: Liu Yuan namei.u...@gmail.com Cc: MORITA Kazutaka morita.kazut...@lab.ntt.co.jp Signed-off-by: Hitoshi Mitake mitake.hito...@lab.ntt.co.jp --- block/sheepdog.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index 8d9350c..36f76f0 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -103,6 +103,9 @@ #define SD_INODE_SIZE (sizeof(SheepdogInode)) #define CURRENT_VDI_ID 0 +#define LOCK_TYPE_NORMAL 1 +#define LOCK_TYPE_SHARED 2 /* for iSCSI multipath */ How about #define LOCK_TYPE_NORMAL 0 #define LOCK_TYPE_SHARED 1 Then we don't need this patch. Since qemu won't make use of multipath for the near future, we should avoid adding stuff related to multipath to qemu driver. (Cc-ing current Kazutaka-san's address) I think this isn't a good idea. Because it means that sheep has an assumption about padding field of the request data struct. This sort of workaround can cause hard to find problems in the future. The point is, how to keep backward compatibilty? E.g, old QEMU with present sheep daemon with lock features. Then QEMU will send 0 instead of 1 to the sheep daemon and based on how you handle the invalid value, these might cause problems. Suppose we have 1 old QEMU and 1 present QEMU who try to start the same image A. Old QEMU will send invalid 0 to sheep daemon. We shouldn't deny it because it can run with old sheep and should run on new sheep too. Then this image A isn't locked due to invalid 0 value. Then present QEMU send correct LOCK signal and will wrongly set up the image. Start with 0 instead of 1, in my option, is reasonable to keep backward compatibility. I don't think so. Because the backward compatibility of the locking functionality is already broken in the far past. As Meng repoted in the sheepdog mailing list, old QEMU can issue locking request to sheep with vid == 0: http://lists.wpkg.org/pipermail/sheepdog/2014-August/015438.html I don't really understand why we can't start with 0 and can't keep backward compatibility. By the way, I think the link has nothing to do with qemu, it is a bug in sheep. locking has two state, one is lock and the other unlock. We choose 0 to mean 'lock' the vdi and 1 to 'unlock' the vdi. So both old and new QEMU issue 0 to lock the image and 'release' request to unlock the image. What is in the way? Even we set the default lock type as 0, the old QEMU cannot issue a correct locking request. why? I'll post a patch for incrementing protocol version number later. But before doing that, I also want to clean DISCARD request. Because this request cannot work with snapshot (not only with the new GC algorithm. The old naive algorithm cannot work with the DISCARD request). If you remove discard, what if users run new qemu with old sheep, which make use of old algorithm and people expect discard to work? I don't think remove operation from protocol is good idea. 
If the sheep daemon can't support discard, you can simply have it return success without doing anything. Thanks, Yuan
Re: [Qemu-devel] [PATCH 1/2] sheepdog: adopting protocol update for VDI locking
On Mon, Aug 11, 2014 at 11:34:56AM +0800, Liu Yuan wrote: On Mon, Aug 11, 2014 at 11:17:33AM +0900, Hitoshi Mitake wrote: At Fri, 8 Aug 2014 15:49:37 +0800, Liu Yuan wrote: On Fri, Aug 08, 2014 at 03:12:17PM +0900, Hitoshi Mitake wrote: At Fri, 8 Aug 2014 13:20:39 +0800, Liu Yuan wrote: On Thu, Aug 07, 2014 at 04:28:39PM +0900, Hitoshi Mitake wrote: The update is required for supporting iSCSI multipath. It doesn't affect behavior of QEMU driver but adding a new field to vdi request struct is required. Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Cc: Liu Yuan namei.u...@gmail.com Cc: MORITA Kazutaka morita.kazut...@lab.ntt.co.jp Signed-off-by: Hitoshi Mitake mitake.hito...@lab.ntt.co.jp --- block/sheepdog.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index 8d9350c..36f76f0 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -103,6 +103,9 @@ #define SD_INODE_SIZE (sizeof(SheepdogInode)) #define CURRENT_VDI_ID 0 +#define LOCK_TYPE_NORMAL 1 +#define LOCK_TYPE_SHARED 2 /* for iSCSI multipath */ How about #define LOCK_TYPE_NORMAL 0 #define LOCK_TYPE_SHARED 1 Then we don't need this patch. Since qemu won't make use of multipath for the near future, we should avoid adding stuff related to multipath to qemu driver. (Cc-ing current Kazutaka-san's address) I think this isn't a good idea. Because it means that sheep has an assumption about padding field of the request data struct. This sort of workaround can cause hard to find problems in the future. The point is, how to keep backward compatibilty? E.g, old QEMU with present sheep daemon with lock features. Then QEMU will send 0 instead of 1 to the sheep daemon and based on how you handle the invalid value, these might cause problems. Suppose we have 1 old QEMU and 1 present QEMU who try to start the same image A. Old QEMU will send invalid 0 to sheep daemon. We shouldn't deny it because it can run with old sheep and should run on new sheep too. Then this image A isn't locked due to invalid 0 value. Then present QEMU send correct LOCK signal and will wrongly set up the image. Start with 0 instead of 1, in my option, is reasonable to keep backward compatibility. I don't think so. Because the backward compatibility of the locking functionality is already broken in the far past. As Meng repoted in the sheepdog mailing list, old QEMU can issue locking request to sheep with vid == 0: http://lists.wpkg.org/pipermail/sheepdog/2014-August/015438.html I don't really understand why we can't start with 0 and can't keep backward compatibility. By the way, I think the link has nothing to do with qemu, it is a bug in sheep. locking has two state, one is lock and the other unlock. We choose 0 to mean 'lock' the vdi and 1 to 'unlock' the vdi. So both old and new QEMU issue 0 to lock the image and 'release' request to unlock the image. What is in the way? Even we set the default lock type as 0, the old QEMU cannot issue a correct locking request. why? I'll post a patch for incrementing protocol version number later. But before doing that, I also want to clean DISCARD request. Because this request cannot work with snapshot (not only with the new GC algorithm. The old naive algorithm cannot work with the DISCARD request). If you remove discard, what if users run new qemu with old sheep, which make use of old algorithm and people expect discard to work? 
Okay, you mean discard can't work with snapshots, but it should work with non-snapshots, so for the users of non-snapshots, people can expect it to work. As stated in my last email, you can handle this problem transparently without modification of protocol. QEMU -- discard[offset, lenght] -- sheep sheep: if req on snapshot return success; else return sheep_handle_discard(). Thanks Yuan
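A hedged sketch of the server-side dispatch suggested above; the struct layout and helper names are hypothetical, not the actual sheepdog source. The idea is that sheep can keep accepting DISCARD unconditionally and only act on it when the target VDI is not a snapshot, so neither old nor new QEMU needs a protocol change:

#include <stdbool.h>
#include <stdint.h>

#define SD_RES_SUCCESS 0   /* placeholder value for the sketch */

struct discard_request {
    uint32_t vid;          /* target VDI id */
    uint64_t offset;
    uint64_t length;
};

/* assumed helpers, declared only for the sketch */
bool vdi_is_snapshot(uint32_t vid);
int sheep_handle_discard(struct discard_request *req);

int handle_discard(struct discard_request *req)
{
    if (vdi_is_snapshot(req->vid)) {
        /* discarding a snapshot is unsafe; report success and do nothing */
        return SD_RES_SUCCESS;
    }
    return sheep_handle_discard(req);
}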
[Qemu-devel] [PATCH] cluster/zookeeper: add log information for zk auto-reconnect
Reported-by: Valerio Pachera siri...@gmail.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- sheep/group.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sheep/group.c b/sheep/group.c index 06a80bd..08e3884 100644 --- a/sheep/group.c +++ b/sheep/group.c @@ -979,7 +979,7 @@ static int send_join_request(void) { struct sd_node *n = &sys->this_node; - sd_info("%s", node_to_str(n)); + sd_info("%s going to rejoin the cluster", node_to_str(n)); return sys->cdrv->join(n, &sys->cinfo, sizeof(sys->cinfo)); } -- 1.9.1
Re: [Qemu-devel] [PATCH 2/2] sheepdog: improve error handling for a case of failed lock
On Fri, Aug 08, 2014 at 03:17:59PM +0900, Hitoshi Mitake wrote: At Fri, 8 Aug 2014 13:31:39 +0800, Liu Yuan wrote: On Thu, Aug 07, 2014 at 04:28:40PM +0900, Hitoshi Mitake wrote: Recently, sheepdog revived its VDI locking functionality. This patch updates sheepdog driver of QEMU for this feature: 1. Improve error message when QEMU fails to acquire lock of VDI. Current sheepdog driver prints an error message VDI isn't locked when it fails to acquire lock. It is a little bit confusing because the mesage says VDI isn't locked but it is actually locked by other VM. This patch modifies this confusing message. 2. Change error code for a case of failed locking. -EBUSY is a suitable one. Reported-by: Valerio Pachera siri...@gmail.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Cc: Liu Yuan namei.u...@gmail.com Cc: MORITA Kazutaka morita.kazut...@lab.ntt.co.jp Signed-off-by: Hitoshi Mitake mitake.hito...@lab.ntt.co.jp --- block/sheepdog.c | 4 1 file changed, 4 insertions(+) diff --git a/block/sheepdog.c b/block/sheepdog.c index 36f76f0..0b3f86d 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -1112,9 +1112,13 @@ static int find_vdi_name(BDRVSheepdogState *s, const char *filename, if (rsp-result != SD_RES_SUCCESS) { error_setg(errp, cannot get vdi info, %s, %s % PRIu32 %s, + rsp-result == SD_RES_VDI_NOT_LOCKED ? I'm puzzled by this check. we use SD_RES_VDI_LOCKED to indicate vid is already locked, no? We use SD_RES_VDI_NOT_LOCKED for indicating locking by this VM fails. + VDI is already locked by other VM : But this message said it was locked by others, and we have SD_RES_VDI_LOCKED for this case. We need fix sheep daemon for this case to return SD_RES_VDI_LOCKED for already locked case and NOT_LOCKED for other sheep internal errors. sd_strerror(rsp-result), filename, snapid, tag); if (rsp-result == SD_RES_NO_VDI) { ret = -ENOENT; +} else if (rsp-result == SD_RES_VDI_NOT_LOCKED) { +ret = -EBUSY; } else { ret = -EIO; } It is better to use switch case to handle the result. using switch statement in this case only increases a number of lines of code: Current change: if (rsp-result == SD_RES_NO_VDI) { ret = -ENOENT; } else if (rsp-result == SD_RES_VDI_NOT_LOCKED) { ... Change with switch: switch (rsp-result) { case SD_RES_NO_VDI: ret = -ENOENT; break; case SD_RES_VDI_NOT_LOCKED: ... The change with switch statement requires one more line for break;. I think if statement is suitable for this case. If you insist on 'if-else' over swtich case, it is fine with me. But I'd suggest switch-case because it looks cleaner and easier to understand if we have more than 2 branches. Thanks Yuan
Re: [Qemu-devel] [PATCH 1/2] sheepdog: adopting protocol update for VDI locking
On Fri, Aug 08, 2014 at 03:12:17PM +0900, Hitoshi Mitake wrote: At Fri, 8 Aug 2014 13:20:39 +0800, Liu Yuan wrote: On Thu, Aug 07, 2014 at 04:28:39PM +0900, Hitoshi Mitake wrote: The update is required for supporting iSCSI multipath. It doesn't affect behavior of QEMU driver but adding a new field to vdi request struct is required. Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Cc: Liu Yuan namei.u...@gmail.com Cc: MORITA Kazutaka morita.kazut...@lab.ntt.co.jp Signed-off-by: Hitoshi Mitake mitake.hito...@lab.ntt.co.jp --- block/sheepdog.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index 8d9350c..36f76f0 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -103,6 +103,9 @@ #define SD_INODE_SIZE (sizeof(SheepdogInode)) #define CURRENT_VDI_ID 0 +#define LOCK_TYPE_NORMAL 1 +#define LOCK_TYPE_SHARED 2 /* for iSCSI multipath */ How about #define LOCK_TYPE_NORMAL 0 #define LOCK_TYPE_SHARED 1 Then we don't need this patch. Since qemu won't make use of multipath for the near future, we should avoid adding stuff related to multipath to qemu driver. (Cc-ing current Kazutaka-san's address) I think this isn't a good idea. Because it means that sheep has an assumption about padding field of the request data struct. This sort of workaround can cause hard to find problems in the future. The point is, how to keep backward compatibilty? E.g, old QEMU with present sheep daemon with lock features. Then QEMU will send 0 instead of 1 to the sheep daemon and based on how you handle the invalid value, these might cause problems. Suppose we have 1 old QEMU and 1 present QEMU who try to start the same image A. Old QEMU will send invalid 0 to sheep daemon. We shouldn't deny it because it can run with old sheep and should run on new sheep too. Then this image A isn't locked due to invalid 0 value. Then present QEMU send correct LOCK signal and will wrongly set up the image. Start with 0 instead of 1, in my option, is reasonable to keep backward compatibility. Thanks Yuan
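A hedged illustration of the compatibility argument above (field layout simplified, not the exact sheepdog wire format): an old QEMU zeroes the whole request before filling it in, so a legacy client implicitly sends whatever lock type the value 0 denotes. Numbering LOCK_TYPE_NORMAL as 0 would therefore let old binaries keep requesting an exclusive lock without any change:

#include <stdint.h>
#include <string.h>

enum { LOCK_TYPE_NORMAL = 0, LOCK_TYPE_SHARED = 1 };  /* proposed numbering */

struct vdi_req_sketch {
    uint8_t  opcode;
    uint32_t type;   /* new field; an old QEMU driver leaves this as 0 */
};

static void old_qemu_builds_lock_request(struct vdi_req_sketch *req,
                                         uint8_t lock_opcode)
{
    memset(req, 0, sizeof(*req));  /* old driver: type stays 0 */
    req->opcode = lock_opcode;     /* with 0 == NORMAL, this is still an
                                    * exclusive lock request */
}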
Re: [Qemu-devel] [PATCH 1/2] sheepdog: adopting protocol update for VDI locking
On Thu, Aug 07, 2014 at 04:28:39PM +0900, Hitoshi Mitake wrote: The update is required for supporting iSCSI multipath. It doesn't affect behavior of QEMU driver but adding a new field to vdi request struct is required. Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Cc: Liu Yuan namei.u...@gmail.com Cc: MORITA Kazutaka morita.kazut...@lab.ntt.co.jp Signed-off-by: Hitoshi Mitake mitake.hito...@lab.ntt.co.jp --- block/sheepdog.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index 8d9350c..36f76f0 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -103,6 +103,9 @@ #define SD_INODE_SIZE (sizeof(SheepdogInode)) #define CURRENT_VDI_ID 0 +#define LOCK_TYPE_NORMAL 1 +#define LOCK_TYPE_SHARED 2 /* for iSCSI multipath */ How about #define LOCK_TYPE_NORMAL 0 #define LOCK_TYPE_SHARED 1 Then we don't need this patch. Since qemu won't make use of multipath for the near future, we should avoid adding stuff related to multipath to qemu driver. Thanks Yuan + typedef struct SheepdogReq { uint8_t proto_ver; uint8_t opcode; @@ -166,7 +169,8 @@ typedef struct SheepdogVdiReq { uint8_t copy_policy; uint8_t reserved[2]; uint32_t snapid; -uint32_t pad[3]; +uint32_t type; +uint32_t pad[2]; } SheepdogVdiReq; typedef struct SheepdogVdiRsp { @@ -1090,6 +1094,7 @@ static int find_vdi_name(BDRVSheepdogState *s, const char *filename, memset(hdr, 0, sizeof(hdr)); if (lock) { hdr.opcode = SD_OP_LOCK_VDI; +hdr.type = LOCK_TYPE_NORMAL; } else { hdr.opcode = SD_OP_GET_VDI_INFO; } @@ -1793,6 +1798,7 @@ static void sd_close(BlockDriverState *bs) memset(hdr, 0, sizeof(hdr)); hdr.opcode = SD_OP_RELEASE_VDI; +hdr.type = LOCK_TYPE_NORMAL; hdr.base_vdi_id = s-inode.vdi_id; wlen = strlen(s-name) + 1; hdr.data_length = wlen; -- 1.8.3.2
Re: [Qemu-devel] [PATCH 2/2] sheepdog: improve error handling for a case of failed lock
On Thu, Aug 07, 2014 at 04:28:40PM +0900, Hitoshi Mitake wrote: Recently, sheepdog revived its VDI locking functionality. This patch updates sheepdog driver of QEMU for this feature: 1. Improve error message when QEMU fails to acquire lock of VDI. Current sheepdog driver prints an error message VDI isn't locked when it fails to acquire lock. It is a little bit confusing because the mesage says VDI isn't locked but it is actually locked by other VM. This patch modifies this confusing message. 2. Change error code for a case of failed locking. -EBUSY is a suitable one. Reported-by: Valerio Pachera siri...@gmail.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Cc: Liu Yuan namei.u...@gmail.com Cc: MORITA Kazutaka morita.kazut...@lab.ntt.co.jp Signed-off-by: Hitoshi Mitake mitake.hito...@lab.ntt.co.jp --- block/sheepdog.c | 4 1 file changed, 4 insertions(+) diff --git a/block/sheepdog.c b/block/sheepdog.c index 36f76f0..0b3f86d 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -1112,9 +1112,13 @@ static int find_vdi_name(BDRVSheepdogState *s, const char *filename, if (rsp-result != SD_RES_SUCCESS) { error_setg(errp, cannot get vdi info, %s, %s % PRIu32 %s, + rsp-result == SD_RES_VDI_NOT_LOCKED ? I'm puzzled by this check. we use SD_RES_VDI_LOCKED to indicate vid is already locked, no? + VDI is already locked by other VM : sd_strerror(rsp-result), filename, snapid, tag); if (rsp-result == SD_RES_NO_VDI) { ret = -ENOENT; +} else if (rsp-result == SD_RES_VDI_NOT_LOCKED) { +ret = -EBUSY; } else { ret = -EIO; } It is better to use switch case to handle the result. Thanks Yuan
Re: [Qemu-devel] [PATCH] block/quorum: implement .bdrv_co_get_block_status
On Fri, Jul 18, 2014 at 03:35:57PM +0200, Paolo Bonzini wrote: On 17/07/2014 13:50, Liu Yuan wrote: - allow drive-mirror to create sparse mirror on images like qcow2 - allow qemu-img map to work as expected on quorum driver Cc: Benoit Canet ben...@irqsave.net Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block/quorum.c | 16 1 file changed, 16 insertions(+) diff --git a/block/quorum.c b/block/quorum.c index ebf5c71..f0d0a98 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -780,6 +780,21 @@ static coroutine_fn int quorum_co_flush(BlockDriverState *bs) return result; } +static int64_t coroutine_fn quorum_co_get_block_status(BlockDriverState *bs, + int64_t sector_num, + int nb_sectors, + int *pnum) +{ +BDRVQuorumState *s = bs->opaque; +BlockDriverState *child_bs = s->bs[0]; + +if (child_bs->drv->bdrv_co_get_block_status) +return child_bs->drv->bdrv_co_get_block_status(child_bs, sector_num, + nb_sectors, pnum); Is this if necessary? Yes, otherwise bdrv_get_block_status() will be called multiple times and the result for qcow2 won't return the desired value in my test. Or we can simply call bdrv_get_block_status() plus some tricks? +return bdrv_get_block_status(child_bs, sector_num, nb_sectors, pnum); Also, the definition of BDRV_BLOCK_OFFSET_VALID explicitly refers to bs->file, so you probably have to exclude it from the result as well as BDRV_BLOCK_RAW. Thanks for reminding. Yuan
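A stand-alone sketch of the masking Paolo suggests above (the flag values are placeholders, not QEMU's real BDRV_BLOCK_* definitions): a filter driver that forwards get_block_status from one of its children should strip the flags that are only meaningful relative to that child's bs->file before returning the status to its own caller.

#include <stdint.h>

#define BLOCK_RAW          (1u << 0)  /* placeholder for BDRV_BLOCK_RAW */
#define BLOCK_OFFSET_VALID (1u << 1)  /* placeholder for BDRV_BLOCK_OFFSET_VALID */

static int64_t filter_block_status(int64_t child_status)
{
    /* keep allocation/zero information, drop child-relative offset flags */
    return child_status & ~(int64_t)(BLOCK_RAW | BLOCK_OFFSET_VALID);
}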
[Qemu-devel] [PATCH] block/quorum: implement .bdrv_co_get_block_status
- allow drive-mirror to create sparse mirror on images like qcow2 - allow qemu-img map to work as expected on quorum driver Cc: Benoit Canet ben...@irqsave.net Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block/quorum.c | 16 1 file changed, 16 insertions(+) diff --git a/block/quorum.c b/block/quorum.c index ebf5c71..f0d0a98 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -780,6 +780,21 @@ static coroutine_fn int quorum_co_flush(BlockDriverState *bs) return result; } +static int64_t coroutine_fn quorum_co_get_block_status(BlockDriverState *bs, + int64_t sector_num, + int nb_sectors, + int *pnum) +{ +BDRVQuorumState *s = bs->opaque; +BlockDriverState *child_bs = s->bs[0]; + +if (child_bs->drv->bdrv_co_get_block_status) +return child_bs->drv->bdrv_co_get_block_status(child_bs, sector_num, + nb_sectors, pnum); + +return bdrv_get_block_status(child_bs, sector_num, nb_sectors, pnum); +} + static bool quorum_recurse_is_first_non_filter(BlockDriverState *bs, BlockDriverState *candidate) { @@ -1038,6 +1053,7 @@ static BlockDriver bdrv_quorum = { .bdrv_close = quorum_close, .bdrv_co_flush_to_disk = quorum_co_flush, +.bdrv_co_get_block_status = quorum_co_get_block_status, .bdrv_getlength = quorum_getlength, -- 1.9.1
[Qemu-devel] [PATCH v4 0/2] add read-pattern for block quorum
v4: - swap the patch order - update comment for fifo pattern in qaip - use qapi enumeration in quorum driver instead of manual parsing v3: - separate patch into two, one for quorum and one for qapi for easier review - add enumeration for quorum read pattern - remove unrelated blank line fix from this patch set v2: - rename single as 'fifo' - rename read_all_children as read_quorum_children - fix quorum_aio_finalize() for fifo pattern This patch adds single read pattern to quorum driver and quorum vote is default pattern. For now we do a quorum vote on all the reads, it is designed for unreliable underlying storage such as non-redundant NFS to make sure data integrity at the cost of the read performance. For some use cases as following: VM -- || vv AB Both A and B has hardware raid storage to justify the data integrity on its own. So it would help performance if we do a single read instead of on all the nodes. Further, if we run VM on either of the storage node, we can make a local read request for better performance. This patch generalize the above 2 nodes case in the N nodes. That is, vm - write to all the N nodes, read just one of them. If single read fails, we try to read next node in FIFO order specified by the startup command. The 2 nodes case is very similar to DRBD[1] though lack of auto-sync functionality in the single device/node failure for now. But compared with DRBD we still have some advantages over it: - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed storage. And if A crashes, we need to restart all the VMs on node B. But for practice case, we can't because B might not have enough resources to setup 20 VMs at once. So if we run our 20 VMs with quorum driver, and scatter the replicated images over the data center, we can very likely restart 20 VMs without any resource problem. After all, I think we can build a more powerful replicated image functionality on quorum and block jobs(block mirror) to meet various High Availibility needs. E.g, Enable single read pattern on 2 children, -drive driver=quorum,children.0.file.filename=0.qcow2,\ children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1 [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device Liu Yuan (2): qapi: add read-pattern support for quorum block/quorum: add simple read pattern support block/quorum.c | 181 +-- qapi/block-core.json | 19 +- 2 files changed, 149 insertions(+), 51 deletions(-) -- 1.9.1
[Qemu-devel] [PATCH v4 2/2] block/quorum: add simple read pattern support
This patch adds single read pattern to quorum driver and quorum vote is default pattern. For now we do a quorum vote on all the reads, it is designed for unreliable underlying storage such as non-redundant NFS to make sure data integrity at the cost of the read performance. For some use cases as following: VM -- || vv AB Both A and B has hardware raid storage to justify the data integrity on its own. So it would help performance if we do a single read instead of on all the nodes. Further, if we run VM on either of the storage node, we can make a local read request for better performance. This patch generalize the above 2 nodes case in the N nodes. That is, vm - write to all the N nodes, read just one of them. If single read fails, we try to read next node in FIFO order specified by the startup command. The 2 nodes case is very similar to DRBD[1] though lack of auto-sync functionality in the single device/node failure for now. But compared with DRBD we still have some advantages over it: - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed storage. And if A crashes, we need to restart all the VMs on node B. But for practice case, we can't because B might not have enough resources to setup 20 VMs at once. So if we run our 20 VMs with quorum driver, and scatter the replicated images over the data center, we can very likely restart 20 VMs without any resource problem. After all, I think we can build a more powerful replicated image functionality on quorum and block jobs(block mirror) to meet various High Availibility needs. E.g, Enable single read pattern on 2 children, -drive driver=quorum,children.0.file.filename=0.qcow2,\ children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1 [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device Cc: Benoit Canet ben...@irqsave.net Cc: Eric Blake ebl...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block/quorum.c | 179 + 1 file changed, 131 insertions(+), 48 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index d5ee9c0..ebf5c71 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -24,6 +24,7 @@ #define QUORUM_OPT_VOTE_THRESHOLD vote-threshold #define QUORUM_OPT_BLKVERIFY blkverify #define QUORUM_OPT_REWRITErewrite-corrupted +#define QUORUM_OPT_READ_PATTERN read-pattern /* This union holds a vote hash value */ typedef union QuorumVoteValue { @@ -74,6 +75,8 @@ typedef struct BDRVQuorumState { bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted * block if Quorum is reached. 
*/ + +QuorumReadPattern read_pattern; } BDRVQuorumState; typedef struct QuorumAIOCB QuorumAIOCB; @@ -117,6 +120,7 @@ struct QuorumAIOCB { bool is_read; int vote_ret; +int child_iter; /* which child to read in fifo pattern */ }; static bool quorum_vote(QuorumAIOCB *acb); @@ -154,8 +158,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb) if (acb-is_read) { for (i = 0; i s-num_children; i++) { -qemu_vfree(acb-qcrs[i].buf); -qemu_iovec_destroy(acb-qcrs[i].qiov); +if (i = acb-child_iter) { +qemu_vfree(acb-qcrs[i].buf); +qemu_iovec_destroy(acb-qcrs[i].qiov); +} } } @@ -256,6 +262,21 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret) quorum_aio_finalize(acb); } +static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb); + +static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source) +{ +int i; +assert(dest-niov == source-niov); +assert(dest-size == source-size); +for (i = 0; i source-niov; i++) { +assert(dest-iov[i].iov_len == source-iov[i].iov_len); +memcpy(dest-iov[i].iov_base, + source-iov[i].iov_base, + source-iov[i].iov_len); +} +} + static void quorum_aio_cb(void *opaque, int ret) { QuorumChildRequest *sacb = opaque; @@ -263,6 +284,21 @@ static void quorum_aio_cb(void *opaque, int ret) BDRVQuorumState *s = acb-common.bs-opaque; bool rewrite = false; +if (acb-is_read s-read_pattern == QUORUM_READ_PATTERN_FIFO) { +/* We try to read next child in FIFO order if we fail to read */ +if (ret 0 ++acb-child_iter s-num_children) { +read_fifo_child(acb); +return; +} + +if (ret == 0) { +quorum_copy_qiov(acb-qiov, acb-qcrs[acb-child_iter].qiov); +} +acb-vote_ret = ret; +quorum_aio_finalize(acb); +return; +} + sacb-ret = ret; acb-count++; if (ret == 0) { @@ -343,19 +379,6 @@ static bool quorum_rewrite_bad_versions(BDRVQuorumState *s, QuorumAIOCB *acb
[Qemu-devel] [PATCH v4 1/2] qapi: add read-pattern enum for quorum
Cc: Eric Blake ebl...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- qapi/block-core.json | 20 +++- 1 file changed, 19 insertions(+), 1 deletion(-) diff --git a/qapi/block-core.json b/qapi/block-core.json index e378653..42033d9 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -1384,6 +1384,19 @@ 'raw': 'BlockdevRef' } } ## +# @QuorumReadPattern +# +# An enumeration of quorum read patterns. +# +# @quorum: read all the children and do a quorum vote on reads +# +# @fifo: read only from the first child that has not failed +# +# Since: 2.2 +## +{ 'enum': 'QuorumReadPattern', 'data': [ 'quorum', 'fifo' ] } + +## # @BlockdevOptionsQuorum # # Driver specific block device options for Quorum @@ -1398,12 +1411,17 @@ # @rewrite-corrupted: #optional rewrite corrupted data when quorum is reached # (Since 2.1) # +# @read-pattern: #optional choose read pattern and set to quorum by default +#(Since 2.2) +# # Since: 2.0 ## { 'type': 'BlockdevOptionsQuorum', 'data': { '*blkverify': 'bool', 'children': [ 'BlockdevRef' ], -'vote-threshold': 'int', '*rewrite-corrupted': 'bool' } } +'vote-threshold': 'int', +'*rewrite-corrupted': 'bool', +'*read-pattern': 'QuorumReadPattern' } } ## # @BlockdevOptions -- 1.9.1
[Qemu-devel] [PATCH v3 2/2] qapi: add read-pattern support for quorum
Cc: Eric Blake ebl...@redhat.com
Signed-off-by: Liu Yuan namei.u...@gmail.com
---
 qapi/block-core.json | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index e378653..22491bc 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1384,6 +1384,18 @@
     'raw': 'BlockdevRef' } }
 
 ##
+# @QuorumReadPattern
+# An enumeration of quorum read pattern.
+#
+# @quorum: read all the children and do a quorum vote on reads
+#
+# @fifo: read a single child and try next child in FIFO order if read fails.
+#
+# Since: 2.2
+##
+{ 'enum': 'QuorumReadPattern', 'data': [ 'quorum', 'fifo' ] }
+
+##
 # @BlockdevOptionsQuorum
 #
 # Driver specific block device options for Quorum
@@ -1398,12 +1410,17 @@
 # @rewrite-corrupted: #optional rewrite corrupted data when quorum is reached
 #                     (Since 2.1)
 #
+# @read-pattern: #optional choose read pattern and set to quorum by default
+#                (Since 2.2)
+#
 # Since: 2.0
 ##
 { 'type': 'BlockdevOptionsQuorum',
   'data': { '*blkverify': 'bool',
             'children': [ 'BlockdevRef' ],
-            'vote-threshold': 'int', '*rewrite-corrupted': 'bool' } }
+            'vote-threshold': 'int',
+            '*rewrite-corrupted': 'bool',
+            '*read-pattern': 'QuorumReadPattern' } }
 
 ##
 # @BlockdevOptions
-- 
1.9.1
[Qemu-devel] [PATCH v3 1/2] block/quorum: add simple read pattern support
This patch adds a single-read pattern to the quorum driver; quorum vote remains the default pattern. Currently we do a quorum vote on all reads; this is designed for unreliable underlying storage such as non-redundant NFS, to ensure data integrity at the cost of read performance. Consider the following use case:

            VM
          ------
          |    |
          v    v
          A    B

Both A and B have hardware RAID storage that guarantees data integrity on its own, so it helps performance to issue a single read instead of reading from all the nodes. Further, if the VM runs on either of the storage nodes, the read can be served locally for better performance.

This patch generalizes the above 2-node case to N nodes: the VM writes to all N nodes and reads from just one of them. If the single read fails, we try the next node in the FIFO order specified by the startup command. The 2-node case is very similar to DRBD[1], though it lacks auto-sync functionality after a single device/node failure for now. But compared with DRBD we still have some advantages:

- Suppose we have 20 VMs running on one node (assume A) of 2-node DRBD-backed storage. If A crashes, we need to restart all the VMs on node B. In practice we often can't, because B might not have enough resources to run 20 VMs at once. If we instead run our 20 VMs with the quorum driver and scatter the replicated images over the data center, we can very likely restart the 20 VMs without any resource problem.

In the long run, I think we can build more powerful replicated-image functionality on top of quorum and block jobs (block mirror) to meet various High Availability needs.

E.g., enable the single-read pattern on 2 children:

-drive driver=quorum,children.0.file.filename=0.qcow2,\
children.1.file.filename=1.qcow2,read-pattern=single,vote-threshold=1

[1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device

Cc: Benoit Canet ben...@irqsave.net
Cc: Kevin Wolf kw...@redhat.com
Cc: Stefan Hajnoczi stefa...@redhat.com
Signed-off-by: Liu Yuan namei.u...@gmail.com
---
 block/quorum.c | 181 ++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 131 insertions(+), 50 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index d5ee9c0..1443fc5 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -24,6 +24,7 @@
 #define QUORUM_OPT_VOTE_THRESHOLD "vote-threshold"
 #define QUORUM_OPT_BLKVERIFY      "blkverify"
 #define QUORUM_OPT_REWRITE        "rewrite-corrupted"
+#define QUORUM_OPT_READ_PATTERN   "read-pattern"
 
 /* This union holds a vote hash value */
 typedef union QuorumVoteValue {
@@ -74,6 +75,16 @@ typedef struct BDRVQuorumState {
     bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
                             * block if Quorum is reached.
                             */
+
+#define READ_PATTERN_QUORUM 0 /* default */
+#define READ_PATTERN_FIFO   1
+    int read_pattern;          /* fifo: read a single child and try first one
+                                * first. If error, try next child in an
+                                * FIFO order specified by command line.
+                                * Return error if no child read succeeds.
+                                * quorum: read all the children and do a quorum
+                                * vote on reads.
+                                */
 } BDRVQuorumState;
 
 typedef struct QuorumAIOCB QuorumAIOCB;
@@ -117,6 +128,7 @@ struct QuorumAIOCB {
 
     bool is_read;
     int vote_ret;
+    int child_iter;             /* which child to read in fifo pattern */
 };
 
 static bool quorum_vote(QuorumAIOCB *acb);
@@ -154,8 +166,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
 
     if (acb->is_read) {
         for (i = 0; i < s->num_children; i++) {
-            qemu_vfree(acb->qcrs[i].buf);
-            qemu_iovec_destroy(&acb->qcrs[i].qiov);
+            if (i <= acb->child_iter) {
+                qemu_vfree(acb->qcrs[i].buf);
+                qemu_iovec_destroy(&acb->qcrs[i].qiov);
+            }
         }
     }
 
@@ -256,6 +270,21 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret)
     quorum_aio_finalize(acb);
 }
 
+static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb);
+
+static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
+{
+    int i;
+    assert(dest->niov == source->niov);
+    assert(dest->size == source->size);
+    for (i = 0; i < source->niov; i++) {
+        assert(dest->iov[i].iov_len == source->iov[i].iov_len);
+        memcpy(dest->iov[i].iov_base,
+               source->iov[i].iov_base,
+               source->iov[i].iov_len);
+    }
+}
+
 static void quorum_aio_cb(void *opaque, int ret)
 {
     QuorumChildRequest *sacb = opaque;
@@ -263,6 +292,21 @@ static void quorum_aio_cb(void *opaque, int ret)
     BDRVQuorumState *s = acb->common.bs->opaque;
     bool rewrite = false;
 
+    if (acb->is_read && s->read_pattern == READ_PATTERN_FIFO) {
+        /* We try to read next child in FIFO order if we fail to read
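The quorum_copy_qiov() helper added above simply copies the winning child's scatter/gather buffers into the guest-visible vector. A self-contained sketch of the same idea using plain struct iovec (rather than QEMU's QEMUIOVector) is shown below; it is an illustration, not the driver code.

#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>

/* Copy every segment of 'source' into the matching segment of 'dest'.
 * Both vectors must have the same shape, as the asserts in the patch enforce. */
static void copy_iov(struct iovec *dest, const struct iovec *source, int niov)
{
    for (int i = 0; i < niov; i++) {
        assert(dest[i].iov_len == source[i].iov_len);
        memcpy(dest[i].iov_base, source[i].iov_base, source[i].iov_len);
    }
}

int main(void)
{
    char src_buf[4] = "abc", dst_buf[4] = "";
    struct iovec src = { .iov_base = src_buf, .iov_len = sizeof(src_buf) };
    struct iovec dst = { .iov_base = dst_buf, .iov_len = sizeof(dst_buf) };

    copy_iov(&dst, &src, 1);
    printf("copied: %s\n", dst_buf);
    return 0;
}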
[Qemu-devel] [PATCH v3 0/2] add read-pattern for block quorum
v3:
 - separate the patch into two, one for quorum and one for qapi, for easier review
 - add an enumeration for the quorum read pattern
 - remove the unrelated blank-line fix from this patch set
v2:
 - rename 'single' as 'fifo'
 - rename read_all_children as read_quorum_children
 - fix quorum_aio_finalize() for the fifo pattern

This patch set adds a single-read pattern to the quorum driver; quorum vote remains the default pattern. Currently we do a quorum vote on all reads; this is designed for unreliable underlying storage such as non-redundant NFS, to ensure data integrity at the cost of read performance. Consider the following use case:

            VM
          ------
          |    |
          v    v
          A    B

Both A and B have hardware RAID storage that guarantees data integrity on its own, so it helps performance to issue a single read instead of reading from all the nodes. Further, if the VM runs on either of the storage nodes, the read can be served locally for better performance.

This patch set generalizes the above 2-node case to N nodes: the VM writes to all N nodes and reads from just one of them. If the single read fails, we try the next node in the FIFO order specified by the startup command. The 2-node case is very similar to DRBD[1], though it lacks auto-sync functionality after a single device/node failure for now. But compared with DRBD we still have some advantages:

- Suppose we have 20 VMs running on one node (assume A) of 2-node DRBD-backed storage. If A crashes, we need to restart all the VMs on node B. In practice we often can't, because B might not have enough resources to run 20 VMs at once. If we instead run our 20 VMs with the quorum driver and scatter the replicated images over the data center, we can very likely restart the 20 VMs without any resource problem.

In the long run, I think we can build more powerful replicated-image functionality on top of quorum and block jobs (block mirror) to meet various High Availability needs.

E.g., enable the single-read pattern on 2 children:

-drive driver=quorum,children.0.file.filename=0.qcow2,\
children.1.file.filename=1.qcow2,read-pattern=single,vote-threshold=1

[1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device

Liu Yuan (2):
  block/quorum: add simple read pattern support
  qapi: add read-pattern support for quorum

 block/quorum.c       | 181 ++++++++++++++++++++++++++++++++++++------------
 qapi/block-core.json |  19 +++++-
 2 files changed, 149 insertions(+), 51 deletions(-)

-- 
1.9.1
Re: [Qemu-devel] [PATCH v3 2/2] qapi: add read-pattern support for quorum
On Tue, Jul 15, 2014 at 06:22:14AM -0600, Eric Blake wrote:
> On 07/15/2014 12:34 AM, Liu Yuan wrote:
> > Cc: Eric Blake ebl...@redhat.com
> > Signed-off-by: Liu Yuan namei.u...@gmail.com
> > ---
> >  qapi/block-core.json | 19 ++++++++++++++++++-
> >  1 file changed, 18 insertions(+), 1 deletion(-)
> >
> > diff --git a/qapi/block-core.json b/qapi/block-core.json
> > index e378653..22491bc 100644
> > --- a/qapi/block-core.json
> > +++ b/qapi/block-core.json
> > @@ -1384,6 +1384,18 @@
> >      'raw': 'BlockdevRef' } }
> >
> >  ##
> > +# @QuorumReadPattern
> > +# An enumeration of quorum read pattern.
>
> s/pattern/patterns/
>
> > +#
> > +# @quorum: read all the children and do a quorum vote on reads
> > +#
> > +# @fifo: read a single child and try next child in FIFO order if read fails.
>
> Inconsistent on whether you are ending a line with '.'. I'm not quite
> sure if this reads well; maybe:
>
> @fifo: read only from the first child that has not failed
>
> I also wonder if 'single' might be a better name than 'fifo',
> especially if we later add a 'round-robin' for read load balancing.

Both fifo and round-robin in our context indicate a single read, but either
one looks okay to me. We need to reach agreement on the name before I send
the next version: single, fifo, or anything else suggested. Benoit, Kevin,
Stefan?

Thanks
Yuan
Re: [Qemu-devel] [PATCH v3 2/2] qapi: add read-pattern support for quorum
On Wed, Jul 16, 2014 at 10:22:20AM +0800, Liu Yuan wrote:
> On Tue, Jul 15, 2014 at 06:22:14AM -0600, Eric Blake wrote:
> > On 07/15/2014 12:34 AM, Liu Yuan wrote:
> > > Cc: Eric Blake ebl...@redhat.com
> > > Signed-off-by: Liu Yuan namei.u...@gmail.com
> > > ---
> > >  qapi/block-core.json | 19 ++++++++++++++++++-
> > >  1 file changed, 18 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/qapi/block-core.json b/qapi/block-core.json
> > > index e378653..22491bc 100644
> > > --- a/qapi/block-core.json
> > > +++ b/qapi/block-core.json
> > > @@ -1384,6 +1384,18 @@
> > >      'raw': 'BlockdevRef' } }
> > >
> > >  ##
> > > +# @QuorumReadPattern
> > > +# An enumeration of quorum read pattern.
> >
> > s/pattern/patterns/
> >
> > > +#
> > > +# @quorum: read all the children and do a quorum vote on reads
> > > +#
> > > +# @fifo: read a single child and try next child in FIFO order if read fails.
> >
> > Inconsistent on whether you are ending a line with '.'. I'm not quite
> > sure if this reads well; maybe:
> >
> > @fifo: read only from the first child that has not failed
> >
> > I also wonder if 'single' might be a better name than 'fifo',
> > especially if we later add a 'round-robin' for read load balancing.
>
> Both fifo and round-robin in our context indicate a single read, but either
> one looks okay to me. We need to reach agreement on the name before I send
> the next version: single, fifo, or anything else suggested. Benoit, Kevin,
> Stefan?

Cc others mentioned.
[Qemu-devel] [PATCH v2] block/quorum: add simple read pattern support
This patch adds a single-read pattern to the quorum driver; quorum vote remains the default pattern. Currently we do a quorum vote on all reads; this is designed for unreliable underlying storage such as non-redundant NFS, to ensure data integrity at the cost of read performance. Consider the following use case:

            VM
          ------
          |    |
          v    v
          A    B

Both A and B have hardware RAID storage that guarantees data integrity on its own, so it helps performance to issue a single read instead of reading from all the nodes. Further, if the VM runs on either of the storage nodes, the read can be served locally for better performance.

This patch generalizes the above 2-node case to N nodes: the VM writes to all N nodes and reads from just one of them. If the single read fails, we try the next node in the FIFO order specified by the startup command. The 2-node case is very similar to DRBD[1], though it lacks auto-sync functionality after a single device/node failure for now. But compared with DRBD we still have some advantages:

- Suppose we have 20 VMs running on one node (assume A) of 2-node DRBD-backed storage. If A crashes, we need to restart all the VMs on node B. In practice we often can't, because B might not have enough resources to run 20 VMs at once. If we instead run our 20 VMs with the quorum driver and scatter the replicated images over the data center, we can very likely restart the 20 VMs without any resource problem.

In the long run, I think we can build more powerful replicated-image functionality on top of quorum and block jobs (block mirror) to meet various High Availability needs.

E.g., enable the single-read pattern on 2 children:

-drive driver=quorum,children.0.file.filename=0.qcow2,\
children.1.file.filename=1.qcow2,read-pattern=single,vote-threshold=1

[1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device

Cc: Benoit Canet ben...@irqsave.net
Cc: Kevin Wolf kw...@redhat.com
Cc: Stefan Hajnoczi stefa...@redhat.com
Signed-off-by: Liu Yuan namei.u...@gmail.com
---
v2:
 - rename 'single' as 'fifo'
 - rename read_all_children as read_quorum_children
 - fix quorum_aio_finalize() for the fifo pattern

 block.c              |   2 +-
 block/quorum.c       | 181 ++++++++++++++++++++++++++++++++++++------------
 qapi/block-core.json |   7 +-
 3 files changed, 138 insertions(+), 52 deletions(-)

diff --git a/block.c b/block.c
index 8800a6b..3dfa27c 100644
--- a/block.c
+++ b/block.c
@@ -2212,7 +2212,7 @@ int bdrv_commit(BlockDriverState *bs)
 
     if (!drv)
         return -ENOMEDIUM;
-    
+
     if (!bs->backing_hd) {
         return -ENOTSUP;
     }
diff --git a/block/quorum.c b/block/quorum.c
index d5ee9c0..1443fc5 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -24,6 +24,7 @@
 #define QUORUM_OPT_VOTE_THRESHOLD "vote-threshold"
 #define QUORUM_OPT_BLKVERIFY      "blkverify"
 #define QUORUM_OPT_REWRITE        "rewrite-corrupted"
+#define QUORUM_OPT_READ_PATTERN   "read-pattern"
 
 /* This union holds a vote hash value */
 typedef union QuorumVoteValue {
@@ -74,6 +75,16 @@ typedef struct BDRVQuorumState {
     bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
                             * block if Quorum is reached.
                             */
+
+#define READ_PATTERN_QUORUM 0 /* default */
+#define READ_PATTERN_FIFO   1
+    int read_pattern;          /* fifo: read a single child and try first one
+                                * first. If error, try next child in an
+                                * FIFO order specified by command line.
+                                * Return error if no child read succeeds.
+                                * quorum: read all the children and do a quorum
+                                * vote on reads.
+                                */
 } BDRVQuorumState;
 
 typedef struct QuorumAIOCB QuorumAIOCB;
@@ -117,6 +128,7 @@ struct QuorumAIOCB {
 
     bool is_read;
     int vote_ret;
+    int child_iter;             /* which child to read in fifo pattern */
 };
 
 static bool quorum_vote(QuorumAIOCB *acb);
@@ -154,8 +166,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
 
     if (acb->is_read) {
         for (i = 0; i < s->num_children; i++) {
-            qemu_vfree(acb->qcrs[i].buf);
-            qemu_iovec_destroy(&acb->qcrs[i].qiov);
+            if (i <= acb->child_iter) {
+                qemu_vfree(acb->qcrs[i].buf);
+                qemu_iovec_destroy(&acb->qcrs[i].qiov);
+            }
         }
     }
 
@@ -256,6 +270,21 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret)
     quorum_aio_finalize(acb);
 }
 
+static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb);
+
+static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
+{
+    int i;
+    assert(dest->niov == source->niov);
+    assert(dest->size == source->size);
+    for (i = 0; i < source->niov; i++) {
+        assert(dest->iov[i].iov_len == source->iov[i].iov_len);
+        memcpy(dest->iov[i].iov_base
Re: [Qemu-devel] [PATCH v2] block/quorum: add simple read pattern support
On Mon, Jul 14, 2014 at 09:33:59PM -0600, Eric Blake wrote:
> On 07/14/2014 09:19 PM, Liu Yuan wrote:
> > This patch adds single read pattern to quorum driver and quorum vote is
> > default pattern.
> >
> > This patch generalize the above 2 nodes case in the N nodes. That is,
> > vm -> write to all the N nodes, read just one of them. If single read fails,
> > we try to read next node in FIFO order specified by the startup command.
> >
> > The 2 nodes case is very similar to DRBD[1] though lack of auto-sync
> > functionality in the single device/node failure for now. But compared with
> > DRBD we still have some advantages over it:
> >
> > +++ b/block.c
> > @@ -2212,7 +2212,7 @@ int bdrv_commit(BlockDriverState *bs)
> >
> >      if (!drv)
> >          return -ENOMEDIUM;
> > -    
> > +
> >      if (!bs->backing_hd) {
> >          return -ENOTSUP;
> >      }
>
> While this whitespace cleanup is nice, it doesn't belong in this patch,
> when there is no other change to this unrelated file.
>
> > +++ b/qapi/block-core.json
> > @@ -1398,12 +1398,17 @@
> >  # @rewrite-corrupted: #optional rewrite corrupted data when quorum is reached
> >  #                     (Since 2.1)
> >  #
> > +# @read-pattern: #optional choose quorum or fifo pattern for read
> > +#                set to quorum by default (Since 2.2)
> > +#
> >  # Since: 2.0
> >  ##
> >  { 'type': 'BlockdevOptionsQuorum',
> >    'data': { '*blkverify': 'bool',
> >              'children': [ 'BlockdevRef' ],
> > -            'vote-threshold': 'int', '*rewrite-corrupted': 'bool' } }
> > +            'vote-threshold': 'int',
> > +            '*rewrite-corrupted': 'bool',
> > +            '*read-pattern': 'str' } }
>
> Raw strings that encode a finite set of values are bad for type-safety.
> Please add an enum:
>
> { 'enum': 'QuorumReadPattern', 'data': [ 'quorum', 'fifo' ] }
>
> then use '*read-pattern': 'QuorumReadPattern'

For the startup command line, does this imply a change too? Sorry, I'm not
familiar with block-core.json. With your suggestion, how do we pass
read-pattern via QMP: read-pattern: fifo or read-pattern: quorum?

> Should we offer multiple modes in addition to 'quorum'? For example, I
> could see a difference between 'fifo' (favor read from the first quorum
> member always, unless it fails, good when the first member is local and
> other member is remote) and 'round-robin' (evenly distribute reads; each
> read goes to the next available quorum member, good when all members are
> equally distant).

I guess so. 'round-robin' in your context would benefit use cases seeking
better read load balance.

Thanks
Yuan
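The 'round-robin' pattern Eric mentions is only discussed in this thread and is not implemented by the series. If it were added later, the child selection could look roughly like this standalone sketch; RoundRobinState and round_robin_pick() are hypothetical names invented for the illustration.

#include <stdio.h>

/* Hypothetical round-robin selection for read load balancing: each read
 * goes to the next child, spreading the load evenly across all of them. */
typedef struct {
    int num_children;
    int next_child;          /* rotates so reads spread over all children */
} RoundRobinState;

static int round_robin_pick(RoundRobinState *s)
{
    int child = s->next_child;

    s->next_child = (s->next_child + 1) % s->num_children;
    return child;
}

int main(void)
{
    RoundRobinState s = { .num_children = 3, .next_child = 0 };

    for (int i = 0; i < 6; i++) {
        printf("read %d -> child %d\n", i, round_robin_pick(&s));
    }
    return 0;
}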
[Qemu-devel] [PATCH] block/quorum: make quorum_getlength error message user friendly
When starting the quorum driver with 2 different-sized images, we get:

qemu-system-x86_64: -drive if=virtio,driver=quorum...: Could not refresh total \
sector count: Input/output error

EIO would confuse users. With this patch, the error output becomes:

Children images are not in the same size
qemu-system-x86_64: -drive if=virtio,driver=quorum...: Could not refresh total \
sector count: Invalid argument

Cc: Benoit Canet ben...@irqsave.net
Cc: Kevin Wolf kw...@redhat.com
Cc: Stefan Hajnoczi stefa...@redhat.com
Signed-off-by: Liu Yuan namei.u...@gmail.com
---
 block/quorum.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/quorum.c b/block/quorum.c
index 2f18755..51437ad 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -736,7 +736,8 @@ static int64_t quorum_getlength(BlockDriverState *bs)
             return value;
         }
         if (value != result) {
-            return -EIO;
+            error_printf("Children images are not in the same size\n");
+            return -EINVAL;
         }
     }
 
-- 
1.9.1
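For context, the check this patch touches compares every child's length against the first child's length and refuses to continue on a mismatch. A simplified, self-contained sketch of that loop, with the QEMU bdrv_getlength() call replaced by a stub, might look like the following; it is illustrative only.

#include <errno.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_CHILDREN 2

/* Stub standing in for querying each child's image size, in bytes. */
static int64_t child_length(int i)
{
    static const int64_t sizes[NUM_CHILDREN] = { 1048576, 2097152 };
    return sizes[i];
}

/* Simplified consistency check: all children must report the same length,
 * otherwise fail with a clear error instead of a bare I/O error. */
static int64_t quorum_length_check(void)
{
    int64_t result = child_length(0);

    for (int i = 1; i < NUM_CHILDREN; i++) {
        int64_t value = child_length(i);

        if (value < 0) {
            return value;
        }
        if (value != result) {
            fprintf(stderr, "Children images are not in the same size\n");
            return -EINVAL;
        }
    }
    return result;
}

int main(void)
{
    printf("length check returned %lld\n", (long long)quorum_length_check());
    return 0;
}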
[Qemu-devel] [PATCH v2] configure: make libnfs not_found message user friendly
Cc: Kevin Wolf kw...@redhat.com
Signed-off-by: Liu Yuan namei.u...@gmail.com
---
 configure | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/configure b/configure
index 7dd43fd..78e7baf 100755
--- a/configure
+++ b/configure
@@ -3996,7 +3996,7 @@ if test "$libnfs" != "no" ; then
     LIBS="$LIBS $libnfs_libs"
   else
     if test "$libnfs" = "yes" ; then
-      feature_not_found "libnfs"
+      feature_not_found "libnfs" "Install libnfs-devel >= 1.9.3"
     fi
     libnfs="no"
   fi
-- 
1.9.1