[Qemu-devel] [PATCH] MAINTAINERS: update block/sheepdog maintainers
From: Liu Yuan

E-mail to one of the block/sheepdog maintainers, Hitoshi Mitake, bounces
("unknown user: mitake.hitoshi") and no current address is known, so just
remove the entry.

Signed-off-by: Liu Yuan
---
 MAINTAINERS | 1 -
 1 file changed, 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 3275cc6..6670c16 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2007,7 +2007,6 @@ F: block/rbd.c
 T: git git://github.com/codyprime/qemu-kvm-jtc.git block

 Sheepdog
-M: Hitoshi Mitake
 M: Liu Yuan
 M: Jeff Cody
 L: qemu-bl...@nongnu.org
--
2.7.4
Re: [Qemu-devel] [PATCH] sheepdog: discard the payload if the header is invalid
On Tue, Sep 01, 2015 at 10:05:38AM +0800, Liu Yuan wrote:
> On Mon, Aug 31, 2015 at 09:51:00PM -0400, Jeff Cody wrote:
> > On Tue, Sep 01, 2015 at 09:29:31AM +0800, Liu Yuan wrote:
> > > From: Liu Yuan <liuy...@cmss.chinamobile.com>
> > >
> > > We need to discard the payload if we get an invalid header, for whatever
> > > reason, to avoid corrupting the data stream.
> >
> > If the header is invalid / corrupted, how can rsp.data_length be
> > trusted?  Out of curiosity, is this an issue you are seeing occur "in
> > the wild"?

On second thought, we might not need this patch upstream because of the
auto-reconnection feature, which closes the socket and so throws away the
whole buffer. But old QEMU without auto-reconnection might need this patch
to drain the buffer.

Thanks,
Yuan

> This is a defensive patch. The header is invalid in the sense that only
> rsp.id is invalid, due to sheepdog driver bugs: e.g. the request was
> misplaced after being sent, or duplicated requests were sent to the sheep
> daemon and we got duplicated responses for the same request.
>
> Actually, in late 2012 we had seen this problem, but we didn't find the
> root cause by looking at the code statically, and the problem went away
> silently while we restructured the code.
>
> But yesterday some CentOS 6 users reported the 'cannot find aio_req'
> problem to me, which hangs the guest disk. That QEMU's sheepdog driver was
> rather old and would hit the rsp.id mismatch problem as we did in late
> 2012. By looking at the code again, I found this error handling problem.
> However, this patch is not aimed at solving the rsp.id mismatch problem
> (if it still exists) but is just a defensive one for after that problem
> happens.
>
> Thanks,
> Yuan
Re: [Qemu-devel] [PATCH v3] sheepdog: add reopen support
On Fri, Aug 28, 2015 at 10:53:58AM +0800, Liu Yuan wrote:
> From: Liu Yuan <liuy...@cmss.chinamobile.com>
>
> With reopen supported, block-commit (and offline commit) is now supported
> for image files whose base image uses the Sheepdog protocol driver.
>
> Cc: qemu-devel@nongnu.org
> Cc: Jeff Cody <jc...@redhat.com>
> Cc: Kevin Wolf <kw...@redhat.com>
> Cc: Stefan Hajnoczi <stefa...@redhat.com>
> Signed-off-by: Liu Yuan <liuy...@cmss.chinamobile.com>
> ---
> v3:
>  - init cache flags in sd_reopen_prepare() as in sd_open()
> v2:
>  - free AioHandler [Jeff Cody]

ping...
[Qemu-devel] [PATCH] sheepdog: discard the payload if the header is invalid
From: Liu Yuan <liuy...@cmss.chinamobile.com>

We need to discard the payload if we get an invalid header, for whatever
reason, to avoid corrupting the data stream. For example, a response consists
of a header plus a data payload. If we simply read the header, the data
payload is left in the socket buffer, and the next read would pick up the
garbage data and corrupt the whole connection.

Cc: qemu-devel@nongnu.org
Cc: Jeff Cody <jc...@redhat.com>
Cc: Kevin Wolf <kw...@redhat.com>
Cc: Stefan Hajnoczi <stefa...@redhat.com>
Signed-off-by: Liu Yuan <liuy...@cmss.chinamobile.com>
---
 block/sheepdog.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index 9585beb..9ed3458 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -794,6 +794,14 @@ static void coroutine_fn aio_read_response(void *opaque)
         }
     }
     if (!aio_req) {
+        if (rsp.data_length) {
+            void *garbage = g_malloc(rsp.data_length);
+            ret = qemu_co_recv(fd, garbage, rsp.data_length);
+            if (ret != rsp.data_length) {
+                error_report("failed to discard the data, %s", strerror(errno));
+            }
+            g_free(garbage);
+        }
         error_report("cannot find aio_req %x", rsp.id);
         goto err;
     }
--
1.9.1
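For illustration only, the same discard could also be done with a small
bounded scratch buffer instead of one g_malloc() of rsp.data_length bytes, so
a corrupted length field cannot force a single huge allocation. The sketch
below is not the driver code: it uses plain blocking POSIX recv() rather than
QEMU's coroutine-based qemu_co_recv(), and drain_payload() is a hypothetical
helper name.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Drain 'data_length' bytes of an unwanted payload from the socket in
 * fixed-size chunks, so we never allocate a buffer as large as the
 * (possibly untrustworthy) length field. */
static int drain_payload(int sockfd, size_t data_length)
{
    char scratch[4096];

    while (data_length > 0) {
        size_t chunk = data_length < sizeof(scratch) ? data_length
                                                     : sizeof(scratch);
        ssize_t ret = recv(sockfd, scratch, chunk, 0);
        if (ret <= 0) {
            fprintf(stderr, "failed to discard the data, %s\n",
                    strerror(errno));
            return -1;
        }
        data_length -= (size_t)ret;
    }
    return 0;
}

Whether such a bounded-buffer variant is worth it depends on how much
rsp.data_length can be trusted once the header is already suspect, which is
exactly the concern raised in the review below.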
Re: [Qemu-devel] [PATCH] sheepdog: discard the payload if the header is invalid
On Mon, Aug 31, 2015 at 09:51:00PM -0400, Jeff Cody wrote:
> On Tue, Sep 01, 2015 at 09:29:31AM +0800, Liu Yuan wrote:
> > From: Liu Yuan <liuy...@cmss.chinamobile.com>
> >
> > We need to discard the payload if we get an invalid header, for whatever
> > reason, to avoid corrupting the data stream.
>
> If the header is invalid / corrupted, how can rsp.data_length be
> trusted?  Out of curiosity, is this an issue you are seeing occur "in
> the wild"?

This is a defensive patch. The header is invalid in the sense that only
rsp.id is invalid, due to sheepdog driver bugs: e.g. the request was
misplaced after being sent, or duplicated requests were sent to the sheep
daemon and we got duplicated responses for the same request.

Actually, in late 2012 we had seen this problem, but we didn't find the root
cause by looking at the code statically, and the problem went away silently
while we restructured the code.

But yesterday some CentOS 6 users reported the 'cannot find aio_req' problem
to me, which hangs the guest disk. That QEMU's sheepdog driver was rather old
and would hit the rsp.id mismatch problem as we did in late 2012. By looking
at the code again, I found this error handling problem. However, this patch
is not aimed at solving the rsp.id mismatch problem (if it still exists) but
is just a defensive one for after that problem happens.

Thanks,
Yuan
Re: [Qemu-devel] [PATCH v2] sheepdog: add reopen support
On Thu, Aug 27, 2015 at 03:00:32PM -0400, Jeff Cody wrote:
> On Thu, Aug 27, 2015 at 10:54:01AM +0800, Liu Yuan wrote:
> > From: Liu Yuan <liuy...@cmss.chinamobile.com>
> >
> > With reopen supported, block-commit (and offline commit) is now supported
> > for image files whose base image uses the Sheepdog protocol driver.
> >
> > Cc: qemu-devel@nongnu.org
> > Cc: Jeff Cody <jc...@redhat.com>
> > Cc: Kevin Wolf <kw...@redhat.com>
> > Cc: Stefan Hajnoczi <stefa...@redhat.com>
> > Signed-off-by: Liu Yuan <liuy...@cmss.chinamobile.com>
> > ---
> > v2:
> >  - free AioHandler [Jeff Cody]
> >
> >  block/sheepdog.c | 75
> >  1 file changed, 75 insertions(+)
> >
> > diff --git a/block/sheepdog.c b/block/sheepdog.c
> > index 9585beb..e54add4 100644
> > --- a/block/sheepdog.c
> > +++ b/block/sheepdog.c
> > @@ -377,6 +377,11 @@ typedef struct BDRVSheepdogState {
> >      QLIST_HEAD(inflight_aiocb_head, SheepdogAIOCB) inflight_aiocb_head;
> >  } BDRVSheepdogState;
> >
> > +typedef struct BDRVSheepdogReopenState {
> > +    int fd;
> > +    int cache_flags;
> > +} BDRVSheepdogReopenState;
> > +
> >  static const char * sd_strerror(int err)
> >  {
> >      int i;
> > @@ -1486,6 +1491,67 @@ out:
> >      return ret;
> >  }
> >
> > +static int sd_reopen_prepare(BDRVReopenState *state, BlockReopenQueue *queue,
> > +                             Error **errp)
> > +{
> > +    BDRVSheepdogState *s = state->bs->opaque;
> > +    BDRVSheepdogReopenState *re_s;
> > +    int ret = 0;
> > +
> > +    re_s = state->opaque = g_new0(BDRVSheepdogReopenState, 1);
> > +
> > +    if (state->flags & BDRV_O_NOCACHE) {
> > +        re_s->cache_flags = SD_FLAG_CMD_DIRECT;
> > +    }
>
> In sd_open(), s->cache_flags is set to SD_FLAG_CMD_CACHE by default, and
> then overwritten if BDRV_O_NOCACHE is set.  Do we want that same behavior
> here?
>
> (Sorry I didn't ask this the first time around)

As far as I understood, sd_open() was called before sd_reopen, so I thought
s->cache_flags = SD_FLAG_CMD_CACHE would be duplicated in sd_reopen. But on
second thought, if reopen implies a complete re-parsing of the flags, I agree
with you. I'll prepare a third version of the patch.

Thanks,
Yuan
[Qemu-devel] [PATCH v3] sheepdog: add reopen support
From: Liu Yuan liuy...@cmss.chinamobile.com With reopen supported, block-commit (and offline commit) is now supported for image files whose base image uses the Sheepdog protocol driver. Cc: qemu-devel@nongnu.org Cc: Jeff Cody jc...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan liuy...@cmss.chinamobile.com --- v3: - init cache flags in sd_reopen_prepare() as in sd_open() v2: - free AioHandler [Jeff Cody] block/sheepdog.c | 76 1 file changed, 76 insertions(+) diff --git a/block/sheepdog.c b/block/sheepdog.c index 9585beb..6e8a783 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -377,6 +377,11 @@ typedef struct BDRVSheepdogState { QLIST_HEAD(inflight_aiocb_head, SheepdogAIOCB) inflight_aiocb_head; } BDRVSheepdogState; +typedef struct BDRVSheepdogReopenState { +int fd; +int cache_flags; +} BDRVSheepdogReopenState; + static const char * sd_strerror(int err) { int i; @@ -1486,6 +1491,68 @@ out: return ret; } +static int sd_reopen_prepare(BDRVReopenState *state, BlockReopenQueue *queue, + Error **errp) +{ +BDRVSheepdogState *s = state-bs-opaque; +BDRVSheepdogReopenState *re_s; +int ret = 0; + +re_s = state-opaque = g_new0(BDRVSheepdogReopenState, 1); + +re_s-cache_flags = SD_FLAG_CMD_CACHE; +if (state-flags BDRV_O_NOCACHE) { +re_s-cache_flags = SD_FLAG_CMD_DIRECT; +} + +re_s-fd = get_sheep_fd(s, errp); +if (re_s-fd 0) { +ret = re_s-fd; +return ret; +} + +return ret; +} + +static void sd_reopen_commit(BDRVReopenState *state) +{ +BDRVSheepdogReopenState *re_s = state-opaque; +BDRVSheepdogState *s = state-bs-opaque; + +if (s-fd) { +aio_set_fd_handler(s-aio_context, s-fd, NULL, NULL, NULL); +closesocket(s-fd); +} + +s-fd = re_s-fd; +s-cache_flags = re_s-cache_flags; + +g_free(state-opaque); +state-opaque = NULL; + +return; +} + +static void sd_reopen_abort(BDRVReopenState *state) +{ +BDRVSheepdogReopenState *re_s = state-opaque; +BDRVSheepdogState *s = state-bs-opaque; + +if (re_s == NULL) { +return; +} + +if (re_s-fd) { +aio_set_fd_handler(s-aio_context, re_s-fd, NULL, NULL, NULL); +closesocket(re_s-fd); +} + +g_free(state-opaque); +state-opaque = NULL; + +return; +} + static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, Error **errp) { @@ -2703,6 +2770,9 @@ static BlockDriver bdrv_sheepdog = { .instance_size = sizeof(BDRVSheepdogState), .bdrv_needs_filename = true, .bdrv_file_open = sd_open, +.bdrv_reopen_prepare= sd_reopen_prepare, +.bdrv_reopen_commit = sd_reopen_commit, +.bdrv_reopen_abort = sd_reopen_abort, .bdrv_close = sd_close, .bdrv_create= sd_create, .bdrv_has_zero_init = bdrv_has_zero_init_1, @@ -2736,6 +2806,9 @@ static BlockDriver bdrv_sheepdog_tcp = { .instance_size = sizeof(BDRVSheepdogState), .bdrv_needs_filename = true, .bdrv_file_open = sd_open, +.bdrv_reopen_prepare= sd_reopen_prepare, +.bdrv_reopen_commit = sd_reopen_commit, +.bdrv_reopen_abort = sd_reopen_abort, .bdrv_close = sd_close, .bdrv_create= sd_create, .bdrv_has_zero_init = bdrv_has_zero_init_1, @@ -2769,6 +2842,9 @@ static BlockDriver bdrv_sheepdog_unix = { .instance_size = sizeof(BDRVSheepdogState), .bdrv_needs_filename = true, .bdrv_file_open = sd_open, +.bdrv_reopen_prepare= sd_reopen_prepare, +.bdrv_reopen_commit = sd_reopen_commit, +.bdrv_reopen_abort = sd_reopen_abort, .bdrv_close = sd_close, .bdrv_create= sd_create, .bdrv_has_zero_init = bdrv_has_zero_init_1, -- 1.9.1
[Qemu-devel] [PATCH] sheepdog: add reopen support
From: Liu Yuan liuy...@cmss.chinamobile.com With reopen supported, block-commit (and offline commit) is now supported for image files whose base image uses the Sheepdog protocol driver. Cc: qemu-devel@nongnu.org Cc: Jeff Cody jc...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan liuy...@cmss.chinamobile.com --- block/sheepdog.c | 72 1 file changed, 72 insertions(+) diff --git a/block/sheepdog.c b/block/sheepdog.c index 9585beb..26d09e9 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -377,6 +377,11 @@ typedef struct BDRVSheepdogState { QLIST_HEAD(inflight_aiocb_head, SheepdogAIOCB) inflight_aiocb_head; } BDRVSheepdogState; +typedef struct BDRVSheepdogReopenState { +int fd; +int cache_flags; +} BDRVSheepdogReopenState; + static const char * sd_strerror(int err) { int i; @@ -1486,6 +1491,64 @@ out: return ret; } +static int sd_reopen_prepare(BDRVReopenState *state, BlockReopenQueue *queue, + Error **errp) +{ +BDRVSheepdogState *s = state-bs-opaque; +BDRVSheepdogReopenState *re_s; +int ret = 0; + +re_s = state-opaque = g_new0(BDRVSheepdogReopenState, 1); + +if (state-flags BDRV_O_NOCACHE) { +re_s-cache_flags = SD_FLAG_CMD_DIRECT; +} + +re_s-fd = get_sheep_fd(s, errp); +if (re_s-fd 0) { +ret = re_s-fd; +return ret; +} + +return ret; +} + +static void sd_reopen_commit(BDRVReopenState *state) +{ +BDRVSheepdogReopenState *re_s = state-opaque; +BDRVSheepdogState *s = state-bs-opaque; + +if (s-fd) { +closesocket(s-fd); +} + +s-fd = re_s-fd; +s-cache_flags = re_s-cache_flags; + +g_free(state-opaque); +state-opaque = NULL; + +return; +} + +static void sd_reopen_abort(BDRVReopenState *state) +{ +BDRVSheepdogReopenState *re_s = state-opaque; + +if (re_s == NULL) { +return; +} + +if (re_s-fd) { +closesocket(re_s-fd); +} + +g_free(state-opaque); +state-opaque = NULL; + +return; +} + static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, Error **errp) { @@ -2703,6 +2766,9 @@ static BlockDriver bdrv_sheepdog = { .instance_size = sizeof(BDRVSheepdogState), .bdrv_needs_filename = true, .bdrv_file_open = sd_open, +.bdrv_reopen_prepare= sd_reopen_prepare, +.bdrv_reopen_commit = sd_reopen_commit, +.bdrv_reopen_abort = sd_reopen_abort, .bdrv_close = sd_close, .bdrv_create= sd_create, .bdrv_has_zero_init = bdrv_has_zero_init_1, @@ -2736,6 +2802,9 @@ static BlockDriver bdrv_sheepdog_tcp = { .instance_size = sizeof(BDRVSheepdogState), .bdrv_needs_filename = true, .bdrv_file_open = sd_open, +.bdrv_reopen_prepare= sd_reopen_prepare, +.bdrv_reopen_commit = sd_reopen_commit, +.bdrv_reopen_abort = sd_reopen_abort, .bdrv_close = sd_close, .bdrv_create= sd_create, .bdrv_has_zero_init = bdrv_has_zero_init_1, @@ -2769,6 +2838,9 @@ static BlockDriver bdrv_sheepdog_unix = { .instance_size = sizeof(BDRVSheepdogState), .bdrv_needs_filename = true, .bdrv_file_open = sd_open, +.bdrv_reopen_prepare= sd_reopen_prepare, +.bdrv_reopen_commit = sd_reopen_commit, +.bdrv_reopen_abort = sd_reopen_abort, .bdrv_close = sd_close, .bdrv_create= sd_create, .bdrv_has_zero_init = bdrv_has_zero_init_1, -- 1.9.1
[Qemu-devel] [PATCH v2] sheepdog: add reopen support
From: Liu Yuan liuy...@cmss.chinamobile.com With reopen supported, block-commit (and offline commit) is now supported for image files whose base image uses the Sheepdog protocol driver. Cc: qemu-devel@nongnu.org Cc: Jeff Cody jc...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan liuy...@cmss.chinamobile.com --- v2: - free AioHandler [Jeff Cody] block/sheepdog.c | 75 1 file changed, 75 insertions(+) diff --git a/block/sheepdog.c b/block/sheepdog.c index 9585beb..e54add4 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -377,6 +377,11 @@ typedef struct BDRVSheepdogState { QLIST_HEAD(inflight_aiocb_head, SheepdogAIOCB) inflight_aiocb_head; } BDRVSheepdogState; +typedef struct BDRVSheepdogReopenState { +int fd; +int cache_flags; +} BDRVSheepdogReopenState; + static const char * sd_strerror(int err) { int i; @@ -1486,6 +1491,67 @@ out: return ret; } +static int sd_reopen_prepare(BDRVReopenState *state, BlockReopenQueue *queue, + Error **errp) +{ +BDRVSheepdogState *s = state-bs-opaque; +BDRVSheepdogReopenState *re_s; +int ret = 0; + +re_s = state-opaque = g_new0(BDRVSheepdogReopenState, 1); + +if (state-flags BDRV_O_NOCACHE) { +re_s-cache_flags = SD_FLAG_CMD_DIRECT; +} + +re_s-fd = get_sheep_fd(s, errp); +if (re_s-fd 0) { +ret = re_s-fd; +return ret; +} + +return ret; +} + +static void sd_reopen_commit(BDRVReopenState *state) +{ +BDRVSheepdogReopenState *re_s = state-opaque; +BDRVSheepdogState *s = state-bs-opaque; + +if (s-fd) { +aio_set_fd_handler(s-aio_context, s-fd, NULL, NULL, NULL); +closesocket(s-fd); +} + +s-fd = re_s-fd; +s-cache_flags = re_s-cache_flags; + +g_free(state-opaque); +state-opaque = NULL; + +return; +} + +static void sd_reopen_abort(BDRVReopenState *state) +{ +BDRVSheepdogReopenState *re_s = state-opaque; +BDRVSheepdogState *s = state-bs-opaque; + +if (re_s == NULL) { +return; +} + +if (re_s-fd) { +aio_set_fd_handler(s-aio_context, re_s-fd, NULL, NULL, NULL); +closesocket(re_s-fd); +} + +g_free(state-opaque); +state-opaque = NULL; + +return; +} + static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, Error **errp) { @@ -2703,6 +2769,9 @@ static BlockDriver bdrv_sheepdog = { .instance_size = sizeof(BDRVSheepdogState), .bdrv_needs_filename = true, .bdrv_file_open = sd_open, +.bdrv_reopen_prepare= sd_reopen_prepare, +.bdrv_reopen_commit = sd_reopen_commit, +.bdrv_reopen_abort = sd_reopen_abort, .bdrv_close = sd_close, .bdrv_create= sd_create, .bdrv_has_zero_init = bdrv_has_zero_init_1, @@ -2736,6 +2805,9 @@ static BlockDriver bdrv_sheepdog_tcp = { .instance_size = sizeof(BDRVSheepdogState), .bdrv_needs_filename = true, .bdrv_file_open = sd_open, +.bdrv_reopen_prepare= sd_reopen_prepare, +.bdrv_reopen_commit = sd_reopen_commit, +.bdrv_reopen_abort = sd_reopen_abort, .bdrv_close = sd_close, .bdrv_create= sd_create, .bdrv_has_zero_init = bdrv_has_zero_init_1, @@ -2769,6 +2841,9 @@ static BlockDriver bdrv_sheepdog_unix = { .instance_size = sizeof(BDRVSheepdogState), .bdrv_needs_filename = true, .bdrv_file_open = sd_open, +.bdrv_reopen_prepare= sd_reopen_prepare, +.bdrv_reopen_commit = sd_reopen_commit, +.bdrv_reopen_abort = sd_reopen_abort, .bdrv_close = sd_close, .bdrv_create= sd_create, .bdrv_has_zero_init = bdrv_has_zero_init_1, -- 1.9.1
Re: [Qemu-devel] [sheepdog] [PATCH] sheepdog: fix overlapping metadata update
On Sun, Aug 02, 2015 at 02:52:08PM +0300, Vasiliy Tolstov wrote:
> 2015-07-31 15:08 GMT+03:00 Vasiliy Tolstov <v.tols...@selfip.ru>:
> > Please wait for the performance comparison. As I see it, Liu's patch may
> > be slower than Hitoshi's.
>
> I'm switching to the local cluster driver to test only the local SSD and
> not the network overhead. But now Liu's qemu is not able to completely
> install a vps: sheep runs as:

What did you mean by 'not able to completely install'? Is it an installation
problem or something else? My QEMU or my patch?

I guess you can test the same QEMU with my patch and with Hitoshi's patch
separately. You know, a different QEMU base might obscure the difference.

Thanks,
Yuan
Re: [Qemu-devel] [sheepdog] [PATCH] sheepdog: fix overlapping metadata update
On Thu, Jul 30, 2015 at 09:27:44AM -0400, Jeff Cody wrote: On Thu, Jul 30, 2015 at 09:41:08AM +0300, Vasiliy Tolstov wrote: 2015-07-29 12:31 GMT+03:00 Liu Yuan namei.u...@gmail.com: Technically, it won't affect the performance because index updates are not range but concrete in terms of underlying 4M block size. Only 2 or 3 indexes in a range will be updated and 90+% updates will be only 1. So if 2 updates stride a large range, it will actually worse the performance of sheepdog because many additional unref of object will be executed by sheep internally. It is not a performance problem but more the right fix. Even with your patch, updates of inode can overlap. You just don't allow overlapped requests go to sheepdog, which is a overkill approach. IMHO, we should only adjust to avoid the overlapped inode updates, which can be done easily and incrementally on top of old code, rather than take on a complete new untested overkill mechanism. So what we get from your patch? Covering the problem and lock every requests? Your patch actually fix nothing but just cover the problem by slowing down the request and even with your patch, the problem still exists because inode updates can overlap. Your commit log doesn't explain what is the real problem and why your fix works. This is not your toy project that can commit whatever you want. BTW, sheepdog project was already forked, why don't you fork the block driver, too? What makes you think you own the block driver? We forked the sheepdog project because it is low quality of code partly and mostly some company tries to make it a private project. It is not as open source friendly as before and that is the main reason Kazutaka and I chose to fork the sheepdog project. But this doesn't mean we need to fork the QEMU project, it is an open source project and not your home toy. Kazutaka and I are the biggest contributers of both sheepdog and QEMU sheepdog block driver for years, so I think I am eligible to review the patch and responsible to suggest the right fix. If you are pissed off when someone else have other opinions, you can just fork the code and play with it at home or you follow the rule of open source project. Jeff Cody, please be the judge, patch from Hitoshi solved my problem that i emailed in sheepdog list (i have test environment with 8 hosts on each 6 SSD disks and infiniband interconnect between hosts) before Hitoshi patch, massive writing to sheepdog storage breaks file system and corrupt it. After the patch i don't see issues. I'd rather see some sort consensus amongst Liu, Hitoshi, yourself, or others more intimately familiar with sheepdog. Right now, we have Hitoshi's patch in the main git repo, slated for 2.4 release (which is Monday). It sounds, from Liu's email, as this may not fix the root cause. Vasiliy said he would test Liu's patch; if he can confirm this new patch fix, then I would be inclined to use Liu's patch, based on the detailed analysis of the issue in the commit message. This is my performance comparison on top of latest QEMU with my latop with SSD. sheepdog cluster run with 3 nodes with '-n' to get best volume performance. QEMU command: qemu-system-x86_64 -m 1024 --enable-kvm \ -drive file=debian_squeeze_amd64_standard.qcow2,cache=writeback,if=virtio \ -drive file=sheepdog:test,if=virtio sheepdog:test is created as 'dog vdi create test 80G' I test both time for mkfs and iops for fio write. 
fio.conf:

[global]
ioengine=libaio
direct=1
thread=1
norandommap=1
runtime=60
size=300M
directory=/mnt

[write4k-rand]
stonewall
group_reporting
bs=4k
rw=randwrite
numjobs=8
iodepth=32

Result:

sheep formatted with -c 2:1 (erasure coding)

          mkfs (s)   fio (IOPS)
Yuan      0.069      4578
Hitoshi   0.071      3722

sheep formatted with -c 2 (replication)

          mkfs (s)   fio (IOPS)
Yuan      0.074      6873
Hitoshi   0.081      6174

Thanks,
Yuan
Re: [Qemu-devel] [sheepdog] [PATCH] sheepdog: fix overlapping metadata update
On Fri, Jul 31, 2015 at 03:08:09PM +0300, Vasiliy Tolstov wrote:
> 2015-07-31 14:55 GMT+03:00 Vasiliy Tolstov <v.tols...@selfip.ru>:
> > Liu's patch also works for me. But, as with Hitoshi's patch, it breaks
> > when using discards in qemu =(.
>
> Please wait for the performance comparison. As I see it, Liu's patch may
> be slower than Hitoshi's. Thanks for your time!

Well, as far as I know, my patch should be slightly better performance-wise
because it preserves the parallelism of requests. Given the scatter-gather
characteristics of I/O requests, we can assume the following I/O pattern as
an illustration:

req1 is split into 2 sheep reqs: create(2), create(10)
req2 is split into 2 sheep reqs: create(5), create(100)

So there are finally 4 sheep requests. With my patch they will be run in
parallel by the sheep cluster and only 4 unrefs of objects will be executed
internally:

update_inode(2), update_inode(10), update_inode(5), update_inode(100)

With Hitoshi's patch, however, req1 and req2 will be serialized, and only
when one req is finished will the other be sent to sheep; 9 + 96 = 105 unrefs
of objects will be executed internally. There are still chances of data
corruption because update_inode(2,10) and update_inode(5,100) will both
update the range [5,10], which is a potential problem if the overlapped range
has different values when the requests are queued with stale data.

This is really a years-old bug: we should update the inode bits exactly as we
create the objects, not update bits we don't touch at all. This bug wasn't
revealed for a long time because most of the time min == max in
update_inode(min, max), and before we introduced generation reference
counting for the snapshot reference mechanism, updating an inode bit with 0
wouldn't cause a remove request in sheepdog.

I'm also concerned about the completely new mechanism, since the current
request handling mechanism has become solid over time. It has existed for
years. The completely new code might need a long time to stabilize, and we
may need to fix side effects we don't know about yet.

Thanks,
Yuan
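To make the arithmetic above concrete, the following is only a small,
self-contained illustration (not driver code) that reproduces the slot counts
for the example requests that created object indexes {2, 10} and {5, 100}:

#include <stdio.h>

/* Number of inode slots rewritten by a ranged update_inode(min, max). */
static unsigned range_slots(unsigned min_idx, unsigned max_idx)
{
    return max_idx - min_idx + 1;
}

int main(void)
{
    /* Per-index updates: one inode update per created object. */
    unsigned per_index = 4;   /* update_inode(2), (10), (5), (100) */

    /* Ranged updates: req1 covers [2,10], req2 covers [5,100]. */
    unsigned ranged = range_slots(2, 10) + range_slots(5, 100);

    printf("per-index inode updates: %u\n", per_index);   /* 4 */
    printf("ranged inode updates   : %u\n", ranged);      /* 9 + 96 = 105 */
    printf("slots rewritten by both ranged updates: [5,10]\n");
    return 0;
}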
Re: [Qemu-devel] [sheepdog] [PATCH] sheepdog: fix overlapping metadata update
On Thu, Jul 30, 2015 at 09:41:08AM +0300, Vasiliy Tolstov wrote: 2015-07-29 12:31 GMT+03:00 Liu Yuan namei.u...@gmail.com: Technically, it won't affect the performance because index updates are not range but concrete in terms of underlying 4M block size. Only 2 or 3 indexes in a range will be updated and 90+% updates will be only 1. So if 2 updates stride a large range, it will actually worse the performance of sheepdog because many additional unref of object will be executed by sheep internally. It is not a performance problem but more the right fix. Even with your patch, updates of inode can overlap. You just don't allow overlapped requests go to sheepdog, which is a overkill approach. IMHO, we should only adjust to avoid the overlapped inode updates, which can be done easily and incrementally on top of old code, rather than take on a complete new untested overkill mechanism. So what we get from your patch? Covering the problem and lock every requests? Your patch actually fix nothing but just cover the problem by slowing down the request and even with your patch, the problem still exists because inode updates can overlap. Your commit log doesn't explain what is the real problem and why your fix works. This is not your toy project that can commit whatever you want. BTW, sheepdog project was already forked, why don't you fork the block driver, too? What makes you think you own the block driver? We forked the sheepdog project because it is low quality of code partly and mostly some company tries to make it a private project. It is not as open source friendly as before and that is the main reason Kazutaka and I chose to fork the sheepdog project. But this doesn't mean we need to fork the QEMU project, it is an open source project and not your home toy. Kazutaka and I are the biggest contributers of both sheepdog and QEMU sheepdog block driver for years, so I think I am eligible to review the patch and responsible to suggest the right fix. If you are pissed off when someone else have other opinions, you can just fork the code and play with it at home or you follow the rule of open source project. Jeff Cody, please be the judge, patch from Hitoshi solved my problem that i emailed in sheepdog list (i have test environment with 8 hosts on each 6 SSD disks and infiniband interconnect between hosts) before Hitoshi patch, massive writing to sheepdog storage breaks file system and corrupt it. After the patch i don't see issues. Hi Vasiliy, Did you test my patch? I think both patch can make the problem gone. The trick is that which way is the right fix. This is quit technical because sometimes the problem is gone not because being fixed but is covered out. If you have problem with applying the patch, feel free to mail me and I'll package a patched QEMU tree for you. Thanks, Yuan
Re: [Qemu-devel] [sheepdog] [PATCH] sheepdog: fix overlapping metadata update
On Wed, Jul 29, 2015 at 02:04:55PM +0900, Hitoshi Mitake wrote: At Wed, 29 Jul 2015 12:02:35 +0800, Liu Yuan wrote: From: Liu Yuan liuy...@cmss.chinamobile.com Current sheepdog driver use a range update_inode(min_idx, max_idx) for batching the updates. But there is subtle problem by determining min_idx and max_idx: for a single create request, min_idx == max_idx, so actually we just update one one bit as expected. Suppose we have 2 create request, create(10) and create(20), then min == 10, max==20 even though we just need to update index 10 and index 20, update_inode(10,20) will actually update range from 10 to 20. This would work if all the update_inode() requests won't overlap. But unfortunately, this is not true for some corner case. So the problem arise as following: req 1: update_inode(10,20) req 2: update_inode(15,22) req 1 and req 2 might have different value between [15,20] and cause problems and can be illustrated as following by adding a printf in sd_write_done: @@ -1976,6 +1976,7 @@ static void coroutine_fn sd_write_done(SheepdogAIOCB *acb) mn = s-min_dirty_data_idx; mx = s-max_dirty_data_idx; +printf(min %u, max %u\n, mn, mx); if (mn = mx) { /* we need to update the vdi object. */ offset = sizeof(s-inode) - sizeof(s-inode.data_vdi_id) + ... min 4294967295, max 0 min 9221, max 9222 min 9224, max 9728 min 9223, max 9223 min 9729, max 9730 min 9731, max 9732 min 9733, max 9734 min 9736, max 10240 min 9735, max 10241 ... Every line represents a update_inode(min, max) last 2 lines show 2 requests actually overlap while I ran mkfs.ext4 on a sheepdog volume. The overlapped requests might be reordered via network and cause inode[idx] back to 0 after being set by last request. Then a wrong remove request will be executed by sheep internally and accidentally remove a victim object, which is reported at: https://bugs.launchpad.net/sheepdog-project/+bug/1456421 The fix is simple that we just update inode one by one for aio_req. Since aio_req is never overlapped, we'll have inode update never overlapped. This patch increase a number of indoe update request than existing approach. current: (max - min + 1) * data object creation + 1 inode update this patch: (max - min + 1) * data object creation + (max - min + 1) * inode update The increased number means increased number of network + disk I/O. Even the overwrapping requests can be processed in parallel, the overhead seems to be large. It has an impact especially in a case of disk I/O heavy workload. I don't have a comparison of benchmark result, but it is not obvious that the approach of this patch is better. Technically, it won't affect the performance because index updates are not range but concrete in terms of underlying 4M block size. Only 2 or 3 indexes in a range will be updated and 90+% updates will be only 1. So if 2 updates stride a large range, it will actually worse the performance of sheepdog because many additional unref of object will be executed by sheep internally. It is not a performance problem but more the right fix. Even with your patch, updates of inode can overlap. You just don't allow overlapped requests go to sheepdog, which is a overkill approach. IMHO, we should only adjust to avoid the overlapped inode updates, which can be done easily and incrementally on top of old code, rather than take on a complete new untested overkill mechanism. So what we get from your patch? Covering the problem and lock every requests? 
Your patch actually fix nothing but just cover the problem by slowing down the request and even with your patch, the problem still exists because inode updates can overlap. Your commit log doesn't explain what is the real problem and why your fix works. This is not your toy project that can commit whatever you want. BTW, sheepdog project was already forked, why don't you fork the block driver, too? What makes you think you own the block driver? We forked the sheepdog project because it is low quality of code partly and mostly some company tries to make it a private project. It is not as open source friendly as before and that is the main reason Kazutaka and I chose to fork the sheepdog project. But this doesn't mean we need to fork the QEMU project, it is an open source project and not your home toy. Kazutaka and I are the biggest contributers of both sheepdog and QEMU sheepdog block driver for years, so I think I am eligible to review the patch and responsible to suggest the right fix. If you are pissed off when someone else have other opinions, you can just fork the code and play with it at home or you follow the rule of open source project. Yuan
Re: [Qemu-devel] [sheepdog] [PATCH] sheepdog: serialize requests to overwrapping area
On Sat, Jul 18, 2015 at 01:44:24AM +0900, Hitoshi Mitake wrote:
> Current sheepdog driver only serializes create requests in oid unit. This
> mechanism isn't enough for handling requests to overwrapping area spanning
> multiple oids, so it can result bugs like below:
>
> https://bugs.launchpad.net/sheepdog-project/+bug/1456421

I'm a bit late to review the patch since I'm not on the cc list, but I'd like
to understand how the mentioned bug relates to the serialization of requests.
The mentioned bug looks to me more like a bug in sheepdog, because the create
and write request will only unref a single oid. The bug report is unclear
about why the object idx in the inode becomes zero, and at least doesn't
point out that it relates to QEMU. But this patch assumes QEMU sends the
requests the wrong way and just vaguely says it is wrong, without giving a
reason.

What are overlapping requests? As far as I understand, a request that strides
over two objects is split into two, to make sure all requests fit the
sheepdog object size. Allowing requests to run concurrently on different SD
objects is the way to achieve high performance. This patch mutes this
feature, to me, without a decent reason.

Probably I'm missing something hidden, but I'd like someone to enlighten me,
because this patch might slow down a QEMU VM over sheepdog.

Thanks,
Yuan
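For illustration of the splitting described above, here is a toy,
self-contained sketch (not the driver's code; the 4 MB object size is an
assumption matching the hard-coded default discussed elsewhere in the
thread) of how a guest request that strides an object boundary is split into
per-object sub-requests, which can then be issued concurrently:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define OBJ_SIZE (4ULL * 1024 * 1024)   /* assumed 4 MB sheepdog object size */

int main(void)
{
    /* A guest request that starts in the middle of object 3 and ends in
     * object 4, so it has to be split into two per-object sub-requests. */
    uint64_t offset = 3 * OBJ_SIZE + OBJ_SIZE / 2;
    uint64_t len    = OBJ_SIZE;

    while (len > 0) {
        uint64_t idx    = offset / OBJ_SIZE;
        uint64_t in_obj = offset % OBJ_SIZE;
        uint64_t chunk  = OBJ_SIZE - in_obj < len ? OBJ_SIZE - in_obj : len;

        printf("sub-request: object %" PRIu64 ", offset %" PRIu64
               ", length %" PRIu64 "\n", idx, in_obj, chunk);

        offset += chunk;
        len    -= chunk;
    }
    return 0;
}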
Re: [Qemu-devel] [sheepdog] [PATCH] sheepdog: serialize requests to overwrapping area
On Tue, Jul 28, 2015 at 04:50:08PM +0800, Liu Yuan wrote: On Sat, Jul 18, 2015 at 01:44:24AM +0900, Hitoshi Mitake wrote: Current sheepdog driver only serializes create requests in oid unit. This mechanism isn't enough for handling requests to overwrapping area spanning multiple oids, so it can result bugs like below: https://bugs.launchpad.net/sheepdog-project/+bug/1456421 I'm a bit late to review the patch since I'm not on the cc list, but I'd like to get the idea how the mentioned bug relates to the serialization of requests? The mentioned bug looks to me more a bug of sheepdog because the create and write request will only unref a single oid. The bug report is unclear about why the object idx in inode becomes zero, at least not pointing that it relates to QEMU. But this patch assume QEMU send the requests the wrong way and just vaguely says it is just wrong without reason. What is overrapping requests? As far as I understand, the request that stride over two objects will be split into two, to make sure all the requests fit the sheepdog object size. Allow requests run concurrently on different SD objects is way achieving high performance. This patch mutes this feature, to me, without a decent reason. Probably I miss something hidden, but I'd like someone enlighten me about it because this patch might slow down QEMU VM over sheepdog. Thanks, Yuan Cc Jeff
Re: [Qemu-devel] [PATCH] sheepdog: serialize requests to overwrapping area
On Mon, Jul 27, 2015 at 11:23:02AM -0400, Jeff Cody wrote:
> On Sat, Jul 18, 2015 at 01:44:24AM +0900, Hitoshi Mitake wrote:
> > Current sheepdog driver only serializes create requests in oid unit. This
> > mechanism isn't enough for handling requests to overwrapping area
> > spanning multiple oids, so it can result bugs like below:
> >
> > https://bugs.launchpad.net/sheepdog-project/+bug/1456421
> >
> > This patch adds a new serialization mechanism for the problem. The
> > difference from the old one is:
> > 1. serialize entire aiocb if their targetting areas overwrap
> > 2. serialize all requests (read, write, and discard), not only creates
> >
> > This patch also removes the old mechanism because the new one can be an
> > alternative.

Okay, I figured out what the problem is myself, and allow me to try to make
it clear to non-sheepdog devs:

A sheepdog volume is thin-provisioned, so for the first write we create the
object internally, meaning we need to handle a write in two cases (see the
sketch after this mail):

1. write to a non-allocated object: create it, then update the inode, so in
   this case two requests are generated: create(oid),
   update_inode(oid_to_inode_idx)
2. write to an already allocated object: just write(oid).

The current sheepdog driver uses a range update_inode(min_idx, max_idx) for
batching the updates. But there is a subtle problem in determining min_idx
and max_idx: for a single create request, min_idx == max_idx, so we actually
just update one bit as expected. Suppose we have 2 create requests,
create(10) and create(20); then min == 10, max == 20, and even though we
just need to update index 10 and index 20, update_inode(10,20) will actually
update the range from 10 to 20.

This would work if the update_inode() requests never overlapped. But
unfortunately, this is not true for some corner cases. So the problem arises
as follows:

req 1: update_inode(10,20)
req 2: update_inode(15,22)

req 1 and req 2 might have different values between [15,20] and cause
problems.

Based on the above analysis, I think the real fix is to fix update_inode(),
not to serialize all the requests in an overkill way. The fix would be easy:
considering that most update_inode() calls update only 1 index, we could just
make update_inode a single-bit updater, not a ranged one, in which case we
don't affect performance as the above patch does. (I actually suspect that
the above patch might not solve the problem, because update_inode() can
overlap even with the patch.)

If everyone agrees with my analysis, I'll post the fix.

Thanks,
Yuan
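To restate the two write cases above in code form, here is a toy,
self-contained sketch. The helper names (create_object, write_object,
update_inode) are hypothetical and not the driver's API; it only illustrates
the point that the inode is updated for exactly the slot that was created:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NUM_OBJS 8

/* Toy inode: slot i is non-zero once data object i has been created. */
static uint32_t data_vdi_id[NUM_OBJS];
static const uint32_t vdi_id = 0xfd32;

static void create_object(uint32_t idx) { printf("create(%u)\n", idx); }
static void write_object(uint32_t idx)  { printf("write(%u)\n", idx); }
static void update_inode(uint32_t idx)  { printf("update_inode(%u)\n", idx); }

static void handle_write(uint32_t idx)
{
    if (data_vdi_id[idx] == 0) {
        /* case 1: object not yet allocated - create it, then record the
         * new object in the inode (metadata) object, touching only the
         * slot that was actually created */
        create_object(idx);
        data_vdi_id[idx] = vdi_id;
        update_inode(idx);
    } else {
        /* case 2: object already allocated - plain overwrite, no inode
         * update needed */
        write_object(idx);
    }
}

int main(void)
{
    memset(data_vdi_id, 0, sizeof(data_vdi_id));
    handle_write(2);   /* first write: create(2), update_inode(2) */
    handle_write(2);   /* second write to the same object: write(2) only */
    handle_write(5);   /* first write elsewhere: create(5), update_inode(5) */
    return 0;
}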
Re: [Qemu-devel] [PATCH] sheepdog: serialize requests to overwrapping area
On Tue, Jul 28, 2015 at 10:31:32PM +0800, Liu Yuan wrote: On Mon, Jul 27, 2015 at 11:23:02AM -0400, Jeff Cody wrote: On Sat, Jul 18, 2015 at 01:44:24AM +0900, Hitoshi Mitake wrote: Current sheepdog driver only serializes create requests in oid unit. This mechanism isn't enough for handling requests to overwrapping area spanning multiple oids, so it can result bugs like below: https://bugs.launchpad.net/sheepdog-project/+bug/1456421 This patch adds a new serialization mechanism for the problem. The difference from the old one is: 1. serialize entire aiocb if their targetting areas overwrap 2. serialize all requests (read, write, and discard), not only creates This patch also removes the old mechanism because the new one can be an alternative. Okay, I figured out what the problem is myself and allow me to try to make it clear to non-sheepdog devs: sheepdog volume is thin-provision, so for the first write, we create the object internally, meaning that we need to handle write in two case: 1. write to non-allocated object, create it then update inode, so in this case two request will be generated: create(oid), update_inode(oid_to_inode_idx) 2. write the allocated object, just write(oid). Current sheepdog driver use a range update_inode(min_idx, max_idx) for batching the updates. But there is subtle problem by determining min_idx and max_idx: for a single create request, min_idx == max_idx, so actually we just update one one bit as expected. Suppose we have 2 create request, create(10) and create(20), then min == 10, max==20 even though we just need to update index 10 and index 20, update_inode(10,20) will actually update range from 10 to 20. This would work if all the update_inode() requests won't overlap. But unfortunately, this is not true for some corner case. So the problem arise as following: req 1: update_inode(10,20) req 2: update_inode(15,22) I patched the qemu to print min and max, and confirmed my analysis: diff --git a/block/sheepdog.c b/block/sheepdog.c index bd7cbed..f3e40e8 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -1976,6 +1976,7 @@ static void coroutine_fn sd_write_done(SheepdogAIOCB *acb) mn = s-min_dirty_data_idx; mx = s-max_dirty_data_idx; +printf(min %u, max %u\n, mn, mx); if (mn = mx) { /* we need to update the vdi object. */ offset = sizeof(s-inode) - sizeof(s-inode.data_vdi_id) + diff --git a/dtc b/dtc index 65cc4d2..bc895d6 16 ... min 4294967295, max 0 min 9221, max 9222 min 9224, max 9728 min 9223, max 9223 min 9729, max 9730 min 9731, max 9732 min 9733, max 9734 min 9736, max 10240 min 9735, max 10241 ... Every line represents a update_inode(min, max) last 2 lines show 2 requests actually overlap while I ran mkfs.ext4 on a sheepdog volume. The overlapped requests might be reordered via network and cause inode[idx] back to 0 after being set by last request. Then a wrong remove request will be executed by sheep internally and accidentally remove a victim object. Thanks, Yuan req 1 and req 2 might have different value between [15,20] and cause problems. Based on above analysis, I think the real fix is to fix update_inode(), not serialize all the requests in overkill way. The fix would be easy, considering most update_inode() update only 1 index, we could just make update_inode a single bit updater, not a range one, in which way we don't affect performance as the above patch did. (I actually suspect that the above patch might not solve the problem because update_inode() can overlap even with the patch). If everyone agrees with my analysis, I'll post the fix. 
Thanks, Yuan
[Qemu-devel] [PATCH] sheepdog: fix overlapping metadata update
From: Liu Yuan <liuy...@cmss.chinamobile.com>

The current sheepdog driver uses a range update_inode(min_idx, max_idx) for
batching the updates. But there is a subtle problem in determining min_idx
and max_idx: for a single create request, min_idx == max_idx, so we actually
just update one bit as expected. Suppose we have 2 create requests,
create(10) and create(20); then min == 10, max == 20, and even though we
just need to update index 10 and index 20, update_inode(10,20) will actually
update the range from 10 to 20.

This would work if the update_inode() requests never overlapped. But
unfortunately, this is not true for some corner cases. So the problem arises
as follows:

req 1: update_inode(10,20)
req 2: update_inode(15,22)

req 1 and req 2 might have different values between [15,20] and cause
problems, which can be illustrated by adding a printf in sd_write_done:

@@ -1976,6 +1976,7 @@ static void coroutine_fn sd_write_done(SheepdogAIOCB *acb)

     mn = s->min_dirty_data_idx;
     mx = s->max_dirty_data_idx;
+    printf("min %u, max %u\n", mn, mx);
     if (mn <= mx) {
         /* we need to update the vdi object. */
         offset = sizeof(s->inode) - sizeof(s->inode.data_vdi_id) +

...
min 4294967295, max 0
min 9221, max 9222
min 9224, max 9728
min 9223, max 9223
min 9729, max 9730
min 9731, max 9732
min 9733, max 9734
min 9736, max 10240
min 9735, max 10241
...

Every line represents an update_inode(min, max); the last 2 lines show 2
requests that actually overlap while I ran mkfs.ext4 on a sheepdog volume.
The overlapped requests might be reordered via the network and cause
inode[idx] to go back to 0 after being set by the last request. Then a wrong
remove request will be executed by sheep internally and accidentally remove
a victim object, which is reported at:

https://bugs.launchpad.net/sheepdog-project/+bug/1456421

The fix is simple: we just update the inode one by one per aio_req. Since
aio_reqs never overlap, the inode updates never overlap either.

Cc: Jeff Cody <jc...@redhat.com>
Cc: Kevin Wolf <kw...@redhat.com>
Cc: Stefan Hajnoczi <stefa...@redhat.com>
Cc: Teruaki Ishizaki <ishizaki.teru...@lab.ntt.co.jp>
Cc: Vasiliy Tolstov <v.tols...@selfip.ru>
Cc: Hitoshi Mitake <mitake.hito...@lab.ntt.co.jp>
Signed-off-by: Liu Yuan <liuy...@cmss.chinamobile.com>
---
 block/sheepdog.c | 77
 1 file changed, 22 insertions(+), 55 deletions(-)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index bd7cbed..d1eeb81 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -342,9 +342,6 @@ typedef struct BDRVSheepdogState {

     SheepdogInode inode;

-    uint32_t min_dirty_data_idx;
-    uint32_t max_dirty_data_idx;
-
     char name[SD_MAX_VDI_LEN];
     bool is_snapshot;
     uint32_t cache_flags;
@@ -782,6 +779,26 @@ static coroutine_fn void reconnect_to_sdog(void *opaque)
     }
 }

+static void update_inode(BDRVSheepdogState *s, AIOReq *aio_req)
+{
+    struct iovec iov;
+    uint32_t offset, data_len;
+    SheepdogAIOCB *acb = aio_req->aiocb;
+    int idx = data_oid_to_idx(aio_req->oid);
+
+    offset = SD_INODE_HEADER_SIZE + sizeof(uint32_t) * idx;
+    data_len = sizeof(uint32_t);
+
+    iov.iov_base = &s->inode;
+    iov.iov_len = sizeof(s->inode);
+    aio_req = alloc_aio_req(s, acb, vid_to_vdi_oid(s->inode.vdi_id),
+                            data_len, offset, 0, false, 0, offset);
+    QLIST_INSERT_HEAD(&s->inflight_aio_head, aio_req, aio_siblings);
+    add_aio_request(s, aio_req, &iov, 1, AIOCB_WRITE_UDATA);
+
+    return;
+}
+
 /*
  * Receive responses of the I/O requests.
  *
@@ -820,25 +837,15 @@ static void coroutine_fn aio_read_response(void *opaque)

     switch (acb->aiocb_type) {
     case AIOCB_WRITE_UDATA:
-        /* this coroutine context is no longer suitable for co_recv
-         * because we may send data to update vdi objects */
-        s->co_recv = NULL;
         if (!is_data_obj(aio_req->oid)) {
             break;
         }
         idx = data_oid_to_idx(aio_req->oid);
         if (aio_req->create) {
-            /*
-             * If the object is newly created one, we need to update
-             * the vdi object (metadata object). min_dirty_data_idx
-             * and max_dirty_data_idx are changed to include updated
-             * index between them.
-             */
             if (rsp.result == SD_RES_SUCCESS) {
                 s->inode.data_vdi_id[idx] = s->inode.vdi_id;
-                s->max_dirty_data_idx = MAX(idx, s->max_dirty_data_idx);
-                s->min_dirty_data_idx = MIN(idx, s->min_dirty_data_idx);
+                update_inode(s, aio_req);
             }
             /*
              * Some requests may be blocked because simultaneous
@@ -1518,8 +1525,6 @@ static int sd_open(BlockDriverState *bs, QDict *options, int flags,
     }

     memcpy(&s->inode, buf, sizeof(s->inode));
-    s->min_dirty_data_idx = UINT32_MAX;
-    s->max_dirty_data_idx = 0;

     bs->total_sectors = s->inode.vdi_size / BDRV_SECTOR_SIZE;
     pstrcpy(s->name, sizeof
Re: [Qemu-devel] [PATCH v2] sheepdog: fix confused return values
On Wed, Feb 18, 2015 at 11:57:55AM +0800, Liu Yuan wrote: From: Liu Yuan liuy...@cmss.chinamobile.com These functions mix up -1 and -errno in return values and would might cause trouble error handling in the call chain. This patch let them return -errno and add some comments. Cc: qemu-devel@nongnu.org Cc: Markus Armbruster arm...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Reported-by: Markus Armbruster arm...@redhat.com Signed-off-by: Liu Yuan liuy...@cmss.chinamobile.com --- v2: - use socket_error() instead of errno block/sheepdog.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/block/sheepdog.c b/block/sheepdog.c index be3176f..e4b30ba 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -527,6 +527,7 @@ static SheepdogAIOCB *sd_aio_setup(BlockDriverState *bs, QEMUIOVector *qiov, return acb; } +/* Return -EIO in case of error, file descriptor on success */ static int connect_to_sdog(BDRVSheepdogState *s, Error **errp) { int fd; @@ -546,11 +547,14 @@ static int connect_to_sdog(BDRVSheepdogState *s, Error **errp) if (fd = 0) { qemu_set_nonblock(fd); +} else { +fd = -EIO; } return fd; } +/* Return 0 on success and -errno in case of error */ static coroutine_fn int send_co_req(int sockfd, SheepdogReq *hdr, void *data, unsigned int *wlen) { @@ -559,11 +563,13 @@ static coroutine_fn int send_co_req(int sockfd, SheepdogReq *hdr, void *data, ret = qemu_co_send(sockfd, hdr, sizeof(*hdr)); if (ret != sizeof(*hdr)) { error_report(failed to send a req, %s, strerror(errno)); +ret = -socket_error(); return ret; } ret = qemu_co_send(sockfd, data, *wlen); if (ret != *wlen) { +ret = -socket_error(); error_report(failed to send a req, %s, strerror(errno)); } @@ -638,6 +644,11 @@ out: srco-finished = true; } +/* + * Send the request to the sheep in a synchronous manner. + * + * Return 0 on success, -errno in case of error. + */ static int do_req(int sockfd, AioContext *aio_context, SheepdogReq *hdr, void *data, unsigned int *wlen, unsigned int *rlen) { -- 1.9.1 Ping...
Re: [Qemu-devel] [PATCH] sheepdog: fix confused return values
On Wed, Feb 25, 2015 at 10:35:17AM +0100, Markus Armbruster wrote:
> Liu Yuan <namei.u...@gmail.com> writes:
> > On Wed, Feb 18, 2015 at 09:35:07AM +0100, Markus Armbruster wrote:
> > > Liu Yuan <namei.u...@gmail.com> writes:
> > > > From: Liu Yuan <liuy...@cmss.chinamobile.com>
> > > >
> > > > These functions mix up -1 and -errno in return values, which might
> > > > cause trouble for error handling in the call chain. This patch lets
> > > > them return -errno and adds some comments.
> > > >
> > > > Reported-by: Markus Armbruster <arm...@redhat.com>
> > > > Signed-off-by: Liu Yuan <liuy...@cmss.chinamobile.com>
> > >
> > > Did you review all functions returning negative errno, or just the two
> > > that caught my eye?
> >
> > Umm, mostly the two you mentioned.
>
> I encourage you to review the whole file for this error pattern! In my
> experience, bugs occur in clusters. When you find one, there's a high
> chance similar ones exist, and looking for them is good practice.

Hi Markus, I've checked the whole file as thoroughly as I can. I can't find
any more return-value mixing bugs.

Yuan
Re: [Qemu-devel] [PATCH] sheepdog: fix confused return values
On Wed, Feb 18, 2015 at 09:35:07AM +0100, Markus Armbruster wrote:
> Liu Yuan <namei.u...@gmail.com> writes:
> > From: Liu Yuan <liuy...@cmss.chinamobile.com>
> >
> > These functions mix up -1 and -errno in return values, which might cause
> > trouble for error handling in the call chain. This patch lets them return
> > -errno and adds some comments.
> >
> > Reported-by: Markus Armbruster <arm...@redhat.com>
> > Signed-off-by: Liu Yuan <liuy...@cmss.chinamobile.com>
>
> Did you review all functions returning negative errno, or just the two
> that caught my eye?

Umm, mostly the two you mentioned.

Yuan
[Qemu-devel] [PATCH v2] sheepdog: fix confused return values
From: Liu Yuan <liuy...@cmss.chinamobile.com>

These functions mix up -1 and -errno in return values, which might cause
trouble for error handling in the call chain. This patch lets them return
-errno and adds some comments.

Cc: qemu-devel@nongnu.org
Cc: Markus Armbruster <arm...@redhat.com>
Cc: Kevin Wolf <kw...@redhat.com>
Cc: Stefan Hajnoczi <stefa...@redhat.com>
Reported-by: Markus Armbruster <arm...@redhat.com>
Signed-off-by: Liu Yuan <liuy...@cmss.chinamobile.com>
---
v2:
 - use socket_error() instead of errno

 block/sheepdog.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index be3176f..e4b30ba 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -527,6 +527,7 @@ static SheepdogAIOCB *sd_aio_setup(BlockDriverState *bs, QEMUIOVector *qiov,
     return acb;
 }

+/* Return -EIO in case of error, file descriptor on success */
 static int connect_to_sdog(BDRVSheepdogState *s, Error **errp)
 {
     int fd;
@@ -546,11 +547,14 @@ static int connect_to_sdog(BDRVSheepdogState *s, Error **errp)

     if (fd >= 0) {
         qemu_set_nonblock(fd);
+    } else {
+        fd = -EIO;
     }

     return fd;
 }

+/* Return 0 on success and -errno in case of error */
 static coroutine_fn int send_co_req(int sockfd, SheepdogReq *hdr, void *data,
                                     unsigned int *wlen)
 {
@@ -559,11 +563,13 @@ static coroutine_fn int send_co_req(int sockfd, SheepdogReq *hdr, void *data,
     ret = qemu_co_send(sockfd, hdr, sizeof(*hdr));
     if (ret != sizeof(*hdr)) {
         error_report("failed to send a req, %s", strerror(errno));
+        ret = -socket_error();
         return ret;
     }

     ret = qemu_co_send(sockfd, data, *wlen);
     if (ret != *wlen) {
+        ret = -socket_error();
         error_report("failed to send a req, %s", strerror(errno));
     }

@@ -638,6 +644,11 @@ out:
     srco->finished = true;
 }

+/*
+ * Send the request to the sheep in a synchronous manner.
+ *
+ * Return 0 on success, -errno in case of error.
+ */
 static int do_req(int sockfd, AioContext *aio_context, SheepdogReq *hdr,
                   void *data, unsigned int *wlen, unsigned int *rlen)
 {
--
1.9.1
Re: [Qemu-devel] [PATCH] sheepdog: fix confused return values
On Mon, Feb 16, 2015 at 12:56:04PM +0100, Kevin Wolf wrote: Am 13.02.2015 um 04:45 hat Liu Yuan geschrieben: From: Liu Yuan liuy...@cmss.chinamobile.com These functions mix up -1 and -errno in return values and would might cause trouble error handling in the call chain. This patch let them return -errno and add some comments. Reported-by: Markus Armbruster arm...@redhat.com Signed-off-by: Liu Yuan liuy...@cmss.chinamobile.com --- block/sheepdog.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/block/sheepdog.c b/block/sheepdog.c index be3176f..c28658c 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -527,6 +527,7 @@ static SheepdogAIOCB *sd_aio_setup(BlockDriverState *bs, QEMUIOVector *qiov, return acb; } +/* Return -EIO in case of error, file descriptor on success */ static int connect_to_sdog(BDRVSheepdogState *s, Error **errp) { int fd; @@ -546,11 +547,14 @@ static int connect_to_sdog(BDRVSheepdogState *s, Error **errp) if (fd = 0) { qemu_set_nonblock(fd); +} else { +fd = -EIO; } return fd; } +/* Return 0 on success and -errno in case of error */ static coroutine_fn int send_co_req(int sockfd, SheepdogReq *hdr, void *data, unsigned int *wlen) { @@ -559,11 +563,13 @@ static coroutine_fn int send_co_req(int sockfd, SheepdogReq *hdr, void *data, ret = qemu_co_send(sockfd, hdr, sizeof(*hdr)); if (ret != sizeof(*hdr)) { error_report(failed to send a req, %s, strerror(errno)); +ret = -errno; qemu_co_sendv_recvv() uses socket_error() internally to access the return code. This is defined as errno on POSIX, but as WSAGetLastError() on Windows. You should probably either use socket_error() here, or change qemu_co_sendv_recvv() to return a negative error code instead of -1. return ret; } ret = qemu_co_send(sockfd, data, *wlen); if (ret != *wlen) { +ret = -errno; error_report(failed to send a req, %s, strerror(errno)); } The same here. Hi, Kevin, thanks for your tips. I'll post v2 against your comments. Yuan
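For illustration of the pattern under discussion, here is a small sketch
outside QEMU. It uses plain POSIX send() and errno rather than QEMU's
qemu_co_send()/socket_error(), and send_req() is a hypothetical helper, but
it shows the two points raised above: return a negative errno instead of a
bare -1, and capture the error code immediately after the failing call,
before anything can clobber it.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Return 0 on success and -errno (or -EIO for a short send) on error. */
static int send_req(int sockfd, const void *buf, size_t len)
{
    ssize_t ret = send(sockfd, buf, len, 0);

    if (ret < 0) {
        int err = -errno;                /* capture before it can be clobbered */
        fprintf(stderr, "failed to send a req, %s\n", strerror(errno));
        return err;                      /* -errno, not -1 */
    }
    if ((size_t)ret != len) {
        fprintf(stderr, "short send: %zd of %zu bytes\n", ret, len);
        return -EIO;                     /* no meaningful errno for a short send */
    }
    return 0;
}

On Windows the error would come from WSAGetLastError() instead of errno,
which is exactly why the v2 patch switches to socket_error().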
Re: [Qemu-devel] [sheepdog] [PATCH v4] sheepdog: selectable object size support
On Thu, Feb 12, 2015 at 05:13:55PM +0900, Hitoshi Mitake wrote: At Thu, 12 Feb 2015 15:42:17 +0800, Liu Yuan wrote: On Thu, Feb 12, 2015 at 04:28:01PM +0900, Hitoshi Mitake wrote: At Thu, 12 Feb 2015 15:00:49 +0800, Liu Yuan wrote: On Thu, Feb 12, 2015 at 03:19:21PM +0900, Hitoshi Mitake wrote: At Tue, 10 Feb 2015 18:35:58 +0800, Liu Yuan wrote: On Tue, Feb 10, 2015 at 06:56:33PM +0900, Teruaki Ishizaki wrote: (2015/02/10 17:58), Liu Yuan wrote: On Tue, Feb 10, 2015 at 05:22:02PM +0900, Teruaki Ishizaki wrote: (2015/02/10 12:10), Liu Yuan wrote: On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote: Previously, qemu block driver of sheepdog used hard-coded VDI object size. This patch enables users to handle block_size_shift value for calculating VDI object size. When you start qemu, you don't need to specify additional command option. But when you create the VDI which doesn't have default object size with qemu-img command, you specify block_size_shift option. If you want to create a VDI of 8MB(1 23) object size, you need to specify following command option. # qemu-img create -o block_size_shift=23 sheepdog:test1 100M In addition, when you don't specify qemu-img command option, a default value of sheepdog cluster is used for creating VDI. # qemu-img create sheepdog:test2 100M Signed-off-by: Teruaki Ishizaki ishizaki.teru...@lab.ntt.co.jp --- V4: - Limit a read/write buffer size for creating a preallocated VDI. - Replace a parse function for the block_size_shift option. - Fix an error message. V3: - Delete the needless operation of buffer. - Delete the needless operations of request header. for SD_OP_GET_CLUSTER_DEFAULT. - Fix coding style problems. V2: - Fix coding style problem (white space). - Add members, store_policy and block_size_shift to struct SheepdogVdiReq. - Initialize request header to use block_size_shift specified by user. 
--- block/sheepdog.c | 138 ++--- include/block/block_int.h |1 + 2 files changed, 119 insertions(+), 20 deletions(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index be3176f..a43b947 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -37,6 +37,7 @@ #define SD_OP_READ_VDIS 0x15 #define SD_OP_FLUSH_VDI 0x16 #define SD_OP_DEL_VDI0x17 +#define SD_OP_GET_CLUSTER_DEFAULT 0x18 #define SD_FLAG_CMD_WRITE0x01 #define SD_FLAG_CMD_COW 0x02 @@ -167,7 +168,8 @@ typedef struct SheepdogVdiReq { uint32_t base_vdi_id; uint8_t copies; uint8_t copy_policy; -uint8_t reserved[2]; +uint8_t store_policy; +uint8_t block_size_shift; uint32_t snapid; uint32_t type; uint32_t pad[2]; @@ -186,6 +188,21 @@ typedef struct SheepdogVdiRsp { uint32_t pad[5]; } SheepdogVdiRsp; +typedef struct SheepdogClusterRsp { +uint8_t proto_ver; +uint8_t opcode; +uint16_t flags; +uint32_t epoch; +uint32_t id; +uint32_t data_length; +uint32_t result; +uint8_t nr_copies; +uint8_t copy_policy; +uint8_t block_size_shift; +uint8_t __pad1; +uint32_t __pad2[6]; +} SheepdogClusterRsp; + typedef struct SheepdogInode { char name[SD_MAX_VDI_LEN]; char tag[SD_MAX_VDI_TAG_LEN]; @@ -1544,6 +1561,7 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, hdr.vdi_size = s-inode.vdi_size; hdr.copy_policy = s-inode.copy_policy; hdr.copies = s-inode.nr_copies; +hdr.block_size_shift = s-inode.block_size_shift; ret = do_req(fd, s-aio_context, (SheepdogReq *)hdr, buf, wlen, rlen); @@ -1569,9 +1587,12 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, static int sd_prealloc(const char *filename, Error **errp) { BlockDriverState *bs = NULL; +BDRVSheepdogState *base = NULL; +unsigned long buf_size; uint32_t idx, max_idx; +uint32_t object_size; int64_t vdi_size; -void *buf = g_malloc0(SD_DATA_OBJ_SIZE); +void *buf = NULL; int ret
Re: [Qemu-devel] [PATCH v4] sheepdog: selectable object size support
On Fri, Feb 13, 2015 at 10:33:04AM +0900, Teruaki Ishizaki wrote: (2015/02/12 11:55), Liu Yuan wrote: On Thu, Feb 12, 2015 at 11:33:16AM +0900, Teruaki Ishizaki wrote: (2015/02/12 11:19), Liu Yuan wrote: On Thu, Feb 12, 2015 at 10:51:25AM +0900, Teruaki Ishizaki wrote: (2015/02/10 20:12), Liu Yuan wrote: On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote: Previously, qemu block driver of sheepdog used hard-coded VDI object size. This patch enables users to handle block_size_shift value for calculating VDI object size. When you start qemu, you don't need to specify additional command option. But when you create the VDI which doesn't have default object size with qemu-img command, you specify block_size_shift option. If you want to create a VDI of 8MB(1 23) object size, you need to specify following command option. # qemu-img create -o block_size_shift=23 sheepdog:test1 100M In addition, when you don't specify qemu-img command option, a default value of sheepdog cluster is used for creating VDI. # qemu-img create sheepdog:test2 100M Signed-off-by: Teruaki Ishizaki ishizaki.teru...@lab.ntt.co.jp --- V4: - Limit a read/write buffer size for creating a preallocated VDI. - Replace a parse function for the block_size_shift option. - Fix an error message. V3: - Delete the needless operation of buffer. - Delete the needless operations of request header. for SD_OP_GET_CLUSTER_DEFAULT. - Fix coding style problems. V2: - Fix coding style problem (white space). - Add members, store_policy and block_size_shift to struct SheepdogVdiReq. - Initialize request header to use block_size_shift specified by user. --- block/sheepdog.c | 138 ++--- include/block/block_int.h |1 + 2 files changed, 119 insertions(+), 20 deletions(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index be3176f..a43b947 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -37,6 +37,7 @@ #define SD_OP_READ_VDIS 0x15 #define SD_OP_FLUSH_VDI 0x16 #define SD_OP_DEL_VDI0x17 +#define SD_OP_GET_CLUSTER_DEFAULT 0x18 This might not be necessary. For old qemu or the qemu-img without setting option, the block_size_shift will be 0. If we make 0 to represent 4MB object, then we don't need to get the default cluster object size. We migth even get rid of the idea of cluster default size. The downsize is that, if we want to create a vdi with different size not the default 4MB, we have to write it every time for qemu-img or dog. If we choose to keep the idea of cluster default size, I think we'd also try to avoid call this request from QEMU to make backward compatibility easier. In this scenario, 0 might be used to ask new sheep to decide to use cluster default size. Both old qemu and new QEMU will send 0 to sheep and both old and new sheep can handle 0 though it has different meanings. Table for this bit as 0: Qe: qemu SD: Sheep daemon CDS: Cluster Default Size Ign: Ignored by the sheep daemon Qe/sd newold new CDSIgn old CDSNULL Does Ign mean that VDI is handled as 4MB object size? Yes, old sheep can only handle 4MB object and doesn't check this field at all. I think this approach is acceptable. The difference to your patch is that we don't send SD_OP_GET_CLUSTER_DEFAULT to sheep daemon and SD_OP_GET_CLUSTER_DEFAULT can be removed. When users create a new VDI with qemu-img, qemu's Sheepdog backend driver calculates max limit VDI size. But if block_size_shift option is not specified, qemu's Sheepdog backend driver can't calculate max limit VDI size. 
If block_size_shift is not specified, this means: 1) for old sheep, use the 4MB size; 2) for new sheep, use the cluster-wide default value. And sheep can then calculate it on its own, no?

The dog command (client) calculates the max size, so I think that qemu's Sheepdog backend driver should calculate it the same way the dog command does. Is that policy changeable?

I checked the QEMU code and got your idea. In the past the object size was fixed, so it was very easy to hardcode the check in the client; no communication with sheep was needed. Yes, if it is reasonable, we can change it. I think we can push the size calculation logic into sheep and, if the size is not right, return INVALID_PARAMETER to clients. Clients just check this and report the error back to users. There is no backward compatibility problem with this approach, since 4MB is the smallest size: old QEMU will limit the max size to 4TB, which is no problem for new sheep.

I have checked the QEMU and sheepdog code. When we resize a VDI, sd_truncate() is called and the resize value is handled by QEMU. (Sorry, I hadn't noticed this operation.) Then sd_truncate() writes the Sheepdog inode object directly, so the Sheepdog server can't enforce the maximum VDI size. As I thought, should we use SD_OP_GET_CLUSTER_DEFAULT?
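For reference, the client-side limit being debated reduces to a one-line check once the object size is known. A minimal sketch, assuming the inode keeps a fixed number of data-object slots regardless of object size; MAX_DATA_OBJS and vdi_size_ok() are illustrative names here, not the driver's actual definitions:

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_DATA_OBJS (UINT64_C(1) << 20)   /* assumption: slot count is unchanged */

    /* true if the requested VDI size fits into the inode's object slots */
    static bool vdi_size_ok(uint64_t vdi_size, uint8_t block_size_shift)
    {
        uint64_t object_size = UINT64_C(1) << block_size_shift;

        /* 4TB for 22-bit (4MB) objects, 8TB for 23-bit (8MB) objects, ... */
        return vdi_size <= object_size * MAX_DATA_OBJS;
    }

With 4MB objects this reproduces the familiar 4TB ceiling; larger objects raise it proportionally, which is exactly why a hard-coded client-side limit no longer works.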
Re: [Qemu-devel] [PATCH] sheepdog: Fix misleading error messages in sd_snapshot_create()
On Thu, Feb 12, 2015 at 02:49:50PM +0100, Markus Armbruster wrote:

If do_sd_create() fails, it first reports the error returned, then reports another one with strerror(errno). errno is meaningless at that point. Report just one error combining the valid information from both messages.

Reported-by: Eric Blake ebl...@redhat.com
Signed-off-by: Markus Armbruster arm...@redhat.com
---
Applies on top of my [PATCH v2 00/10] Clean up around error_get_pretty(), qerror_report_err(), but rebasing to master would be trivial.

 block/sheepdog.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index 6a4a3bd..0e8c712 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -2225,9 +2225,8 @@ static int sd_snapshot_create(BlockDriverState *bs, QEMUSnapshotInfo *sn_info)
 
     ret = do_sd_create(s, &new_vid, 1, &local_err);
     if (ret < 0) {
-        error_report_err(local_err);
-        error_report("failed to create inode for snapshot. %s",
-                     strerror(errno));
+        error_report("failed to create inode for snapshot: %s",
+                     error_get_pretty(local_err));
         goto cleanup;
     }
 

Reviewed-by: Liu Yuan namei.u...@gmail.com
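The gist of the fix: local_err already carries the meaningful message, while errno at that point may be stale or left over from an unrelated call. A toy, self-contained illustration of the two reporting styles, using a stand-in Error type rather than QEMU's real error machinery:

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    /* toy stand-ins for QEMU's Error API, for illustration only */
    typedef struct Error { const char *msg; } Error;
    static const char *error_get_pretty(const Error *err) { return err->msg; }

    int main(void)
    {
        Error local_err = { "VDI already exists" };

        errno = 0;                          /* nothing failed at the libc level */
        fprintf(stderr, "failed to create inode for snapshot. %s\n",
                strerror(errno));           /* misleading: prints "Success" */
        fprintf(stderr, "failed to create inode for snapshot: %s\n",
                error_get_pretty(&local_err));  /* prints the real cause */
        return 0;
    }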
Re: [Qemu-devel] [PATCH] sheepdog: fix confused return values
On Fri, Feb 13, 2015 at 11:45:42AM +0800, Liu Yuan wrote:

From: Liu Yuan liuy...@cmss.chinamobile.com

These functions mix up -1 and -errno in return values, which might cause trouble in error handling along the call chain. This patch lets them return -errno and adds some comments.

Reported-by: Markus Armbruster arm...@redhat.com
Signed-off-by: Liu Yuan liuy...@cmss.chinamobile.com

Cc Kevin and Stefan
[Qemu-devel] [PATCH] sheepdog: fix confused return values
From: Liu Yuan liuy...@cmss.chinamobile.com

These functions mix up -1 and -errno in return values, which might cause trouble in error handling along the call chain. This patch lets them return -errno and adds some comments.

Reported-by: Markus Armbruster arm...@redhat.com
Signed-off-by: Liu Yuan liuy...@cmss.chinamobile.com
---
 block/sheepdog.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index be3176f..c28658c 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -527,6 +527,7 @@ static SheepdogAIOCB *sd_aio_setup(BlockDriverState *bs, QEMUIOVector *qiov,
     return acb;
 }
 
+/* Return -EIO in case of error, file descriptor on success */
 static int connect_to_sdog(BDRVSheepdogState *s, Error **errp)
 {
     int fd;
@@ -546,11 +547,14 @@ static int connect_to_sdog(BDRVSheepdogState *s, Error **errp)
 
     if (fd >= 0) {
         qemu_set_nonblock(fd);
+    } else {
+        fd = -EIO;
     }
 
     return fd;
 }
 
+/* Return 0 on success and -errno in case of error */
 static coroutine_fn int send_co_req(int sockfd, SheepdogReq *hdr, void *data,
                                     unsigned int *wlen)
 {
@@ -559,11 +563,13 @@ static coroutine_fn int send_co_req(int sockfd, SheepdogReq *hdr, void *data,
     ret = qemu_co_send(sockfd, hdr, sizeof(*hdr));
     if (ret != sizeof(*hdr)) {
         error_report("failed to send a req, %s", strerror(errno));
+        ret = -errno;
         return ret;
     }
 
     ret = qemu_co_send(sockfd, data, *wlen);
     if (ret != *wlen) {
+        ret = -errno;
         error_report("failed to send a req, %s", strerror(errno));
     }
 
@@ -638,6 +644,11 @@ out:
     srco->finished = true;
 }
 
+/*
+ * Send the request to the sheep in a synchronous manner.
+ *
+ * Return 0 on success, -errno in case of error.
+ */
 static int do_req(int sockfd, AioContext *aio_context, SheepdogReq *hdr,
                   void *data, unsigned int *wlen, unsigned int *rlen)
 {
-- 
1.9.1
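A side note on why the mixed conventions bite: on Linux, EPERM is 1, so a bare -1 return is indistinguishable from -EPERM for any caller that compares against specific error codes. A toy, self-contained illustration; old_style_connect() and new_style_connect() are made-up stand-ins, not driver functions:

    #include <errno.h>
    #include <stdio.h>

    static int old_style_connect(void) { return -1; }    /* generic failure */
    static int new_style_connect(void) { return -EIO; }  /* failure as -errno */

    int main(void)
    {
        if (old_style_connect() == -EPERM) {
            printf("old style: -1 is mistaken for -EPERM\n");
        }
        if (new_style_connect() == -EIO) {
            printf("new style: the caller sees the real error code\n");
        }
        return 0;
    }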
Re: [Qemu-devel] [sheepdog] [PATCH v4] sheepdog: selectable object size support
On Thu, Feb 12, 2015 at 05:01:05PM +0900, Teruaki Ishizaki wrote: (2015/02/12 16:42), Liu Yuan wrote: On Thu, Feb 12, 2015 at 04:28:01PM +0900, Hitoshi Mitake wrote: At Thu, 12 Feb 2015 15:00:49 +0800, Liu Yuan wrote: On Thu, Feb 12, 2015 at 03:19:21PM +0900, Hitoshi Mitake wrote: At Tue, 10 Feb 2015 18:35:58 +0800, Liu Yuan wrote: On Tue, Feb 10, 2015 at 06:56:33PM +0900, Teruaki Ishizaki wrote: (2015/02/10 17:58), Liu Yuan wrote: On Tue, Feb 10, 2015 at 05:22:02PM +0900, Teruaki Ishizaki wrote: (2015/02/10 12:10), Liu Yuan wrote: On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote: Previously, qemu block driver of sheepdog used hard-coded VDI object size. This patch enables users to handle block_size_shift value for calculating VDI object size. When you start qemu, you don't need to specify additional command option. But when you create the VDI which doesn't have default object size with qemu-img command, you specify block_size_shift option. If you want to create a VDI of 8MB(1 23) object size, you need to specify following command option. # qemu-img create -o block_size_shift=23 sheepdog:test1 100M In addition, when you don't specify qemu-img command option, a default value of sheepdog cluster is used for creating VDI. # qemu-img create sheepdog:test2 100M Signed-off-by: Teruaki Ishizaki ishizaki.teru...@lab.ntt.co.jp --- V4: - Limit a read/write buffer size for creating a preallocated VDI. - Replace a parse function for the block_size_shift option. - Fix an error message. V3: - Delete the needless operation of buffer. - Delete the needless operations of request header. for SD_OP_GET_CLUSTER_DEFAULT. - Fix coding style problems. V2: - Fix coding style problem (white space). - Add members, store_policy and block_size_shift to struct SheepdogVdiReq. - Initialize request header to use block_size_shift specified by user. 
--- block/sheepdog.c | 138 ++--- include/block/block_int.h |1 + 2 files changed, 119 insertions(+), 20 deletions(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index be3176f..a43b947 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -37,6 +37,7 @@ #define SD_OP_READ_VDIS 0x15 #define SD_OP_FLUSH_VDI 0x16 #define SD_OP_DEL_VDI0x17 +#define SD_OP_GET_CLUSTER_DEFAULT 0x18 #define SD_FLAG_CMD_WRITE0x01 #define SD_FLAG_CMD_COW 0x02 @@ -167,7 +168,8 @@ typedef struct SheepdogVdiReq { uint32_t base_vdi_id; uint8_t copies; uint8_t copy_policy; -uint8_t reserved[2]; +uint8_t store_policy; +uint8_t block_size_shift; uint32_t snapid; uint32_t type; uint32_t pad[2]; @@ -186,6 +188,21 @@ typedef struct SheepdogVdiRsp { uint32_t pad[5]; } SheepdogVdiRsp; +typedef struct SheepdogClusterRsp { +uint8_t proto_ver; +uint8_t opcode; +uint16_t flags; +uint32_t epoch; +uint32_t id; +uint32_t data_length; +uint32_t result; +uint8_t nr_copies; +uint8_t copy_policy; +uint8_t block_size_shift; +uint8_t __pad1; +uint32_t __pad2[6]; +} SheepdogClusterRsp; + typedef struct SheepdogInode { char name[SD_MAX_VDI_LEN]; char tag[SD_MAX_VDI_TAG_LEN]; @@ -1544,6 +1561,7 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, hdr.vdi_size = s-inode.vdi_size; hdr.copy_policy = s-inode.copy_policy; hdr.copies = s-inode.nr_copies; +hdr.block_size_shift = s-inode.block_size_shift; ret = do_req(fd, s-aio_context, (SheepdogReq *)hdr, buf, wlen, rlen); @@ -1569,9 +1587,12 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, static int sd_prealloc(const char *filename, Error **errp) { BlockDriverState *bs = NULL; +BDRVSheepdogState *base = NULL; +unsigned long buf_size; uint32_t idx, max_idx; +uint32_t object_size; int64_t vdi_size; -void *buf = g_malloc0(SD_DATA_OBJ_SIZE); +void *buf = NULL; int ret; ret = bdrv_open(bs, filename, NULL, NULL, BDRV_O_RDWR | BDRV_O_PROTOCOL, @@ -1585,18 +1606,24 @@ static int sd_prealloc(const char *filename, Error **errp) ret = vdi_size; goto out; } -max_idx = DIV_ROUND_UP(vdi_size, SD_DATA_OBJ_SIZE); + +base = bs-opaque; +object_size = (UINT32_C(1) base-inode.block_size_shift); +buf_size = MIN(object_size, SD_DATA_OBJ_SIZE); +buf = g_malloc0(buf_size); + +max_idx = DIV_ROUND_UP(vdi_size, buf_size); for (idx = 0; idx max_idx; idx++) { /* * The created image can be a cloned image, so we need to read * a data from the source image. */ -ret = bdrv_pread(bs, idx * SD_DATA_OBJ_SIZE, buf, SD_DATA_OBJ_SIZE); +ret
Re: [Qemu-devel] [PATCH v4] sheepdog: selectable object size support
On Thu, Feb 12, 2015 at 10:51:25AM +0900, Teruaki Ishizaki wrote: (2015/02/10 20:12), Liu Yuan wrote: On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote: Previously, qemu block driver of sheepdog used hard-coded VDI object size. This patch enables users to handle block_size_shift value for calculating VDI object size. When you start qemu, you don't need to specify additional command option. But when you create the VDI which doesn't have default object size with qemu-img command, you specify block_size_shift option. If you want to create a VDI of 8MB(1 23) object size, you need to specify following command option. # qemu-img create -o block_size_shift=23 sheepdog:test1 100M In addition, when you don't specify qemu-img command option, a default value of sheepdog cluster is used for creating VDI. # qemu-img create sheepdog:test2 100M Signed-off-by: Teruaki Ishizaki ishizaki.teru...@lab.ntt.co.jp --- V4: - Limit a read/write buffer size for creating a preallocated VDI. - Replace a parse function for the block_size_shift option. - Fix an error message. V3: - Delete the needless operation of buffer. - Delete the needless operations of request header. for SD_OP_GET_CLUSTER_DEFAULT. - Fix coding style problems. V2: - Fix coding style problem (white space). - Add members, store_policy and block_size_shift to struct SheepdogVdiReq. - Initialize request header to use block_size_shift specified by user. --- block/sheepdog.c | 138 ++--- include/block/block_int.h |1 + 2 files changed, 119 insertions(+), 20 deletions(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index be3176f..a43b947 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -37,6 +37,7 @@ #define SD_OP_READ_VDIS 0x15 #define SD_OP_FLUSH_VDI 0x16 #define SD_OP_DEL_VDI0x17 +#define SD_OP_GET_CLUSTER_DEFAULT 0x18 This might not be necessary. For old qemu or the qemu-img without setting option, the block_size_shift will be 0. If we make 0 to represent 4MB object, then we don't need to get the default cluster object size. We migth even get rid of the idea of cluster default size. The downsize is that, if we want to create a vdi with different size not the default 4MB, we have to write it every time for qemu-img or dog. If we choose to keep the idea of cluster default size, I think we'd also try to avoid call this request from QEMU to make backward compatibility easier. In this scenario, 0 might be used to ask new sheep to decide to use cluster default size. Both old qemu and new QEMU will send 0 to sheep and both old and new sheep can handle 0 though it has different meanings. Table for this bit as 0: Qe: qemu SD: Sheep daemon CDS: Cluster Default Size Ign: Ignored by the sheep daemon Qe/sd newold new CDSIgn old CDSNULL Does Ign mean that VDI is handled as 4MB object size? Yes, old sheep can only handle 4MB object and doesn't check this field at all. I think this approach is acceptable. The difference to your patch is that we don't send SD_OP_GET_CLUSTER_DEFAULT to sheep daemon and SD_OP_GET_CLUSTER_DEFAULT can be removed. When users create a new VDI with qemu-img, qemu's Sheepdog backend driver calculates max limit VDI size. But if block_size_shift option is not specified, qemu's Sheepdog backend driver can't calculate max limit VDI size. If block_size_shift not specified, this means 1 for old sheep, use 4MB size 2 for new sheep, use cluster wide default value. And sheep then can calculate it on its own, no? Thanks Yuan
Re: [Qemu-devel] [PATCH v4] sheepdog: selectable object size support
On Thu, Feb 12, 2015 at 11:33:16AM +0900, Teruaki Ishizaki wrote: (2015/02/12 11:19), Liu Yuan wrote: On Thu, Feb 12, 2015 at 10:51:25AM +0900, Teruaki Ishizaki wrote: (2015/02/10 20:12), Liu Yuan wrote: On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote: Previously, qemu block driver of sheepdog used hard-coded VDI object size. This patch enables users to handle block_size_shift value for calculating VDI object size. When you start qemu, you don't need to specify additional command option. But when you create the VDI which doesn't have default object size with qemu-img command, you specify block_size_shift option. If you want to create a VDI of 8MB(1 23) object size, you need to specify following command option. # qemu-img create -o block_size_shift=23 sheepdog:test1 100M In addition, when you don't specify qemu-img command option, a default value of sheepdog cluster is used for creating VDI. # qemu-img create sheepdog:test2 100M Signed-off-by: Teruaki Ishizaki ishizaki.teru...@lab.ntt.co.jp --- V4: - Limit a read/write buffer size for creating a preallocated VDI. - Replace a parse function for the block_size_shift option. - Fix an error message. V3: - Delete the needless operation of buffer. - Delete the needless operations of request header. for SD_OP_GET_CLUSTER_DEFAULT. - Fix coding style problems. V2: - Fix coding style problem (white space). - Add members, store_policy and block_size_shift to struct SheepdogVdiReq. - Initialize request header to use block_size_shift specified by user. --- block/sheepdog.c | 138 ++--- include/block/block_int.h |1 + 2 files changed, 119 insertions(+), 20 deletions(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index be3176f..a43b947 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -37,6 +37,7 @@ #define SD_OP_READ_VDIS 0x15 #define SD_OP_FLUSH_VDI 0x16 #define SD_OP_DEL_VDI0x17 +#define SD_OP_GET_CLUSTER_DEFAULT 0x18 This might not be necessary. For old qemu or the qemu-img without setting option, the block_size_shift will be 0. If we make 0 to represent 4MB object, then we don't need to get the default cluster object size. We migth even get rid of the idea of cluster default size. The downsize is that, if we want to create a vdi with different size not the default 4MB, we have to write it every time for qemu-img or dog. If we choose to keep the idea of cluster default size, I think we'd also try to avoid call this request from QEMU to make backward compatibility easier. In this scenario, 0 might be used to ask new sheep to decide to use cluster default size. Both old qemu and new QEMU will send 0 to sheep and both old and new sheep can handle 0 though it has different meanings. Table for this bit as 0: Qe: qemu SD: Sheep daemon CDS: Cluster Default Size Ign: Ignored by the sheep daemon Qe/sd newold new CDSIgn old CDSNULL Does Ign mean that VDI is handled as 4MB object size? Yes, old sheep can only handle 4MB object and doesn't check this field at all. I think this approach is acceptable. The difference to your patch is that we don't send SD_OP_GET_CLUSTER_DEFAULT to sheep daemon and SD_OP_GET_CLUSTER_DEFAULT can be removed. When users create a new VDI with qemu-img, qemu's Sheepdog backend driver calculates max limit VDI size. But if block_size_shift option is not specified, qemu's Sheepdog backend driver can't calculate max limit VDI size. If block_size_shift not specified, this means 1 for old sheep, use 4MB size 2 for new sheep, use cluster wide default value. And sheep then can calculate it on its own, no? 
The dog command (client) calculates the max size, so I think that qemu's Sheepdog backend driver should calculate it the same way the dog command does. Is that policy changeable?

I checked the QEMU code and got your idea. In the past the object size was fixed, so it was very easy to hardcode the check in the client; no communication with sheep was needed. Yes, if it is reasonable, we can change it. I think we can push the size calculation logic into sheep and, if the size is not right, return INVALID_PARAMETER to clients. Clients just check this and report the error back to users. There is no backward compatibility problem with this approach, since 4MB is the smallest size: old QEMU will limit the max size to 4TB, which is no problem for new sheep.

Thanks Yuan
Re: [Qemu-devel] [sheepdog] [PATCH v4] sheepdog: selectable object size support
On Thu, Feb 12, 2015 at 03:19:21PM +0900, Hitoshi Mitake wrote: At Tue, 10 Feb 2015 18:35:58 +0800, Liu Yuan wrote: On Tue, Feb 10, 2015 at 06:56:33PM +0900, Teruaki Ishizaki wrote: (2015/02/10 17:58), Liu Yuan wrote: On Tue, Feb 10, 2015 at 05:22:02PM +0900, Teruaki Ishizaki wrote: (2015/02/10 12:10), Liu Yuan wrote: On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote: Previously, qemu block driver of sheepdog used hard-coded VDI object size. This patch enables users to handle block_size_shift value for calculating VDI object size. When you start qemu, you don't need to specify additional command option. But when you create the VDI which doesn't have default object size with qemu-img command, you specify block_size_shift option. If you want to create a VDI of 8MB(1 23) object size, you need to specify following command option. # qemu-img create -o block_size_shift=23 sheepdog:test1 100M In addition, when you don't specify qemu-img command option, a default value of sheepdog cluster is used for creating VDI. # qemu-img create sheepdog:test2 100M Signed-off-by: Teruaki Ishizaki ishizaki.teru...@lab.ntt.co.jp --- V4: - Limit a read/write buffer size for creating a preallocated VDI. - Replace a parse function for the block_size_shift option. - Fix an error message. V3: - Delete the needless operation of buffer. - Delete the needless operations of request header. for SD_OP_GET_CLUSTER_DEFAULT. - Fix coding style problems. V2: - Fix coding style problem (white space). - Add members, store_policy and block_size_shift to struct SheepdogVdiReq. - Initialize request header to use block_size_shift specified by user. --- block/sheepdog.c | 138 ++--- include/block/block_int.h |1 + 2 files changed, 119 insertions(+), 20 deletions(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index be3176f..a43b947 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -37,6 +37,7 @@ #define SD_OP_READ_VDIS 0x15 #define SD_OP_FLUSH_VDI 0x16 #define SD_OP_DEL_VDI0x17 +#define SD_OP_GET_CLUSTER_DEFAULT 0x18 #define SD_FLAG_CMD_WRITE0x01 #define SD_FLAG_CMD_COW 0x02 @@ -167,7 +168,8 @@ typedef struct SheepdogVdiReq { uint32_t base_vdi_id; uint8_t copies; uint8_t copy_policy; -uint8_t reserved[2]; +uint8_t store_policy; +uint8_t block_size_shift; uint32_t snapid; uint32_t type; uint32_t pad[2]; @@ -186,6 +188,21 @@ typedef struct SheepdogVdiRsp { uint32_t pad[5]; } SheepdogVdiRsp; +typedef struct SheepdogClusterRsp { +uint8_t proto_ver; +uint8_t opcode; +uint16_t flags; +uint32_t epoch; +uint32_t id; +uint32_t data_length; +uint32_t result; +uint8_t nr_copies; +uint8_t copy_policy; +uint8_t block_size_shift; +uint8_t __pad1; +uint32_t __pad2[6]; +} SheepdogClusterRsp; + typedef struct SheepdogInode { char name[SD_MAX_VDI_LEN]; char tag[SD_MAX_VDI_TAG_LEN]; @@ -1544,6 +1561,7 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, hdr.vdi_size = s-inode.vdi_size; hdr.copy_policy = s-inode.copy_policy; hdr.copies = s-inode.nr_copies; +hdr.block_size_shift = s-inode.block_size_shift; ret = do_req(fd, s-aio_context, (SheepdogReq *)hdr, buf, wlen, rlen); @@ -1569,9 +1587,12 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, static int sd_prealloc(const char *filename, Error **errp) { BlockDriverState *bs = NULL; +BDRVSheepdogState *base = NULL; +unsigned long buf_size; uint32_t idx, max_idx; +uint32_t object_size; int64_t vdi_size; -void *buf = g_malloc0(SD_DATA_OBJ_SIZE); +void *buf = NULL; int ret; ret = bdrv_open(bs, filename, NULL, NULL, BDRV_O_RDWR | 
BDRV_O_PROTOCOL, @@ -1585,18 +1606,24 @@ static int sd_prealloc(const char *filename, Error **errp) ret = vdi_size; goto out; } -max_idx = DIV_ROUND_UP(vdi_size, SD_DATA_OBJ_SIZE); + +base = bs-opaque; +object_size = (UINT32_C(1) base-inode.block_size_shift); +buf_size = MIN(object_size, SD_DATA_OBJ_SIZE); +buf = g_malloc0(buf_size); + +max_idx = DIV_ROUND_UP(vdi_size, buf_size); for (idx = 0; idx max_idx; idx++) { /* * The created image can be a cloned image, so we need to read * a data from the source image. */ -ret = bdrv_pread(bs, idx
Re: [Qemu-devel] [sheepdog] [PATCH v4] sheepdog: selectable object size support
On Thu, Feb 12, 2015 at 04:28:01PM +0900, Hitoshi Mitake wrote: At Thu, 12 Feb 2015 15:00:49 +0800, Liu Yuan wrote: On Thu, Feb 12, 2015 at 03:19:21PM +0900, Hitoshi Mitake wrote: At Tue, 10 Feb 2015 18:35:58 +0800, Liu Yuan wrote: On Tue, Feb 10, 2015 at 06:56:33PM +0900, Teruaki Ishizaki wrote: (2015/02/10 17:58), Liu Yuan wrote: On Tue, Feb 10, 2015 at 05:22:02PM +0900, Teruaki Ishizaki wrote: (2015/02/10 12:10), Liu Yuan wrote: On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote: Previously, qemu block driver of sheepdog used hard-coded VDI object size. This patch enables users to handle block_size_shift value for calculating VDI object size. When you start qemu, you don't need to specify additional command option. But when you create the VDI which doesn't have default object size with qemu-img command, you specify block_size_shift option. If you want to create a VDI of 8MB(1 23) object size, you need to specify following command option. # qemu-img create -o block_size_shift=23 sheepdog:test1 100M In addition, when you don't specify qemu-img command option, a default value of sheepdog cluster is used for creating VDI. # qemu-img create sheepdog:test2 100M Signed-off-by: Teruaki Ishizaki ishizaki.teru...@lab.ntt.co.jp --- V4: - Limit a read/write buffer size for creating a preallocated VDI. - Replace a parse function for the block_size_shift option. - Fix an error message. V3: - Delete the needless operation of buffer. - Delete the needless operations of request header. for SD_OP_GET_CLUSTER_DEFAULT. - Fix coding style problems. V2: - Fix coding style problem (white space). - Add members, store_policy and block_size_shift to struct SheepdogVdiReq. - Initialize request header to use block_size_shift specified by user. --- block/sheepdog.c | 138 ++--- include/block/block_int.h |1 + 2 files changed, 119 insertions(+), 20 deletions(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index be3176f..a43b947 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -37,6 +37,7 @@ #define SD_OP_READ_VDIS 0x15 #define SD_OP_FLUSH_VDI 0x16 #define SD_OP_DEL_VDI0x17 +#define SD_OP_GET_CLUSTER_DEFAULT 0x18 #define SD_FLAG_CMD_WRITE0x01 #define SD_FLAG_CMD_COW 0x02 @@ -167,7 +168,8 @@ typedef struct SheepdogVdiReq { uint32_t base_vdi_id; uint8_t copies; uint8_t copy_policy; -uint8_t reserved[2]; +uint8_t store_policy; +uint8_t block_size_shift; uint32_t snapid; uint32_t type; uint32_t pad[2]; @@ -186,6 +188,21 @@ typedef struct SheepdogVdiRsp { uint32_t pad[5]; } SheepdogVdiRsp; +typedef struct SheepdogClusterRsp { +uint8_t proto_ver; +uint8_t opcode; +uint16_t flags; +uint32_t epoch; +uint32_t id; +uint32_t data_length; +uint32_t result; +uint8_t nr_copies; +uint8_t copy_policy; +uint8_t block_size_shift; +uint8_t __pad1; +uint32_t __pad2[6]; +} SheepdogClusterRsp; + typedef struct SheepdogInode { char name[SD_MAX_VDI_LEN]; char tag[SD_MAX_VDI_TAG_LEN]; @@ -1544,6 +1561,7 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, hdr.vdi_size = s-inode.vdi_size; hdr.copy_policy = s-inode.copy_policy; hdr.copies = s-inode.nr_copies; +hdr.block_size_shift = s-inode.block_size_shift; ret = do_req(fd, s-aio_context, (SheepdogReq *)hdr, buf, wlen, rlen); @@ -1569,9 +1587,12 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, static int sd_prealloc(const char *filename, Error **errp) { BlockDriverState *bs = NULL; +BDRVSheepdogState *base = NULL; +unsigned long buf_size; uint32_t idx, max_idx; +uint32_t object_size; int64_t vdi_size; -void 
*buf = g_malloc0(SD_DATA_OBJ_SIZE); +void *buf = NULL; int ret; ret = bdrv_open(bs, filename, NULL, NULL, BDRV_O_RDWR | BDRV_O_PROTOCOL, @@ -1585,18 +1606,24 @@ static int sd_prealloc(const char *filename, Error **errp) ret = vdi_size; goto out; } -max_idx = DIV_ROUND_UP(vdi_size, SD_DATA_OBJ_SIZE); + +base = bs-opaque; +object_size = (UINT32_C(1) base-inode.block_size_shift
Re: [Qemu-devel] [PATCH v4] sheepdog: selectable object size support
On Tue, Feb 10, 2015 at 05:22:02PM +0900, Teruaki Ishizaki wrote: (2015/02/10 12:10), Liu Yuan wrote: On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote: Previously, qemu block driver of sheepdog used hard-coded VDI object size. This patch enables users to handle block_size_shift value for calculating VDI object size. When you start qemu, you don't need to specify additional command option. But when you create the VDI which doesn't have default object size with qemu-img command, you specify block_size_shift option. If you want to create a VDI of 8MB(1 23) object size, you need to specify following command option. # qemu-img create -o block_size_shift=23 sheepdog:test1 100M In addition, when you don't specify qemu-img command option, a default value of sheepdog cluster is used for creating VDI. # qemu-img create sheepdog:test2 100M Signed-off-by: Teruaki Ishizaki ishizaki.teru...@lab.ntt.co.jp --- V4: - Limit a read/write buffer size for creating a preallocated VDI. - Replace a parse function for the block_size_shift option. - Fix an error message. V3: - Delete the needless operation of buffer. - Delete the needless operations of request header. for SD_OP_GET_CLUSTER_DEFAULT. - Fix coding style problems. V2: - Fix coding style problem (white space). - Add members, store_policy and block_size_shift to struct SheepdogVdiReq. - Initialize request header to use block_size_shift specified by user. --- block/sheepdog.c | 138 ++--- include/block/block_int.h |1 + 2 files changed, 119 insertions(+), 20 deletions(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index be3176f..a43b947 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -37,6 +37,7 @@ #define SD_OP_READ_VDIS 0x15 #define SD_OP_FLUSH_VDI 0x16 #define SD_OP_DEL_VDI0x17 +#define SD_OP_GET_CLUSTER_DEFAULT 0x18 #define SD_FLAG_CMD_WRITE0x01 #define SD_FLAG_CMD_COW 0x02 @@ -167,7 +168,8 @@ typedef struct SheepdogVdiReq { uint32_t base_vdi_id; uint8_t copies; uint8_t copy_policy; -uint8_t reserved[2]; +uint8_t store_policy; +uint8_t block_size_shift; uint32_t snapid; uint32_t type; uint32_t pad[2]; @@ -186,6 +188,21 @@ typedef struct SheepdogVdiRsp { uint32_t pad[5]; } SheepdogVdiRsp; +typedef struct SheepdogClusterRsp { +uint8_t proto_ver; +uint8_t opcode; +uint16_t flags; +uint32_t epoch; +uint32_t id; +uint32_t data_length; +uint32_t result; +uint8_t nr_copies; +uint8_t copy_policy; +uint8_t block_size_shift; +uint8_t __pad1; +uint32_t __pad2[6]; +} SheepdogClusterRsp; + typedef struct SheepdogInode { char name[SD_MAX_VDI_LEN]; char tag[SD_MAX_VDI_TAG_LEN]; @@ -1544,6 +1561,7 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, hdr.vdi_size = s-inode.vdi_size; hdr.copy_policy = s-inode.copy_policy; hdr.copies = s-inode.nr_copies; +hdr.block_size_shift = s-inode.block_size_shift; ret = do_req(fd, s-aio_context, (SheepdogReq *)hdr, buf, wlen, rlen); @@ -1569,9 +1587,12 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, static int sd_prealloc(const char *filename, Error **errp) { BlockDriverState *bs = NULL; +BDRVSheepdogState *base = NULL; +unsigned long buf_size; uint32_t idx, max_idx; +uint32_t object_size; int64_t vdi_size; -void *buf = g_malloc0(SD_DATA_OBJ_SIZE); +void *buf = NULL; int ret; ret = bdrv_open(bs, filename, NULL, NULL, BDRV_O_RDWR | BDRV_O_PROTOCOL, @@ -1585,18 +1606,24 @@ static int sd_prealloc(const char *filename, Error **errp) ret = vdi_size; goto out; } -max_idx = DIV_ROUND_UP(vdi_size, SD_DATA_OBJ_SIZE); + +base = bs-opaque; +object_size = 
(UINT32_C(1) base-inode.block_size_shift); +buf_size = MIN(object_size, SD_DATA_OBJ_SIZE); +buf = g_malloc0(buf_size); + +max_idx = DIV_ROUND_UP(vdi_size, buf_size); for (idx = 0; idx max_idx; idx++) { /* * The created image can be a cloned image, so we need to read * a data from the source image. */ -ret = bdrv_pread(bs, idx * SD_DATA_OBJ_SIZE, buf, SD_DATA_OBJ_SIZE); +ret = bdrv_pread(bs, idx * buf_size, buf, buf_size); if (ret 0) { goto out; } -ret = bdrv_pwrite(bs, idx * SD_DATA_OBJ_SIZE, buf, SD_DATA_OBJ_SIZE); +ret = bdrv_pwrite(bs, idx * buf_size, buf, buf_size); if (ret 0) { goto out; } @@ -1669,6 +1696,21 @@ static int parse_redundancy(BDRVSheepdogState *s, const char *opt) return 0; } +static int
Re: [Qemu-devel] [PATCH v4] sheepdog: selectable object size support
On Tue, Feb 10, 2015 at 06:56:33PM +0900, Teruaki Ishizaki wrote: (2015/02/10 17:58), Liu Yuan wrote: On Tue, Feb 10, 2015 at 05:22:02PM +0900, Teruaki Ishizaki wrote: (2015/02/10 12:10), Liu Yuan wrote: On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote: Previously, qemu block driver of sheepdog used hard-coded VDI object size. This patch enables users to handle block_size_shift value for calculating VDI object size. When you start qemu, you don't need to specify additional command option. But when you create the VDI which doesn't have default object size with qemu-img command, you specify block_size_shift option. If you want to create a VDI of 8MB(1 23) object size, you need to specify following command option. # qemu-img create -o block_size_shift=23 sheepdog:test1 100M In addition, when you don't specify qemu-img command option, a default value of sheepdog cluster is used for creating VDI. # qemu-img create sheepdog:test2 100M Signed-off-by: Teruaki Ishizaki ishizaki.teru...@lab.ntt.co.jp --- V4: - Limit a read/write buffer size for creating a preallocated VDI. - Replace a parse function for the block_size_shift option. - Fix an error message. V3: - Delete the needless operation of buffer. - Delete the needless operations of request header. for SD_OP_GET_CLUSTER_DEFAULT. - Fix coding style problems. V2: - Fix coding style problem (white space). - Add members, store_policy and block_size_shift to struct SheepdogVdiReq. - Initialize request header to use block_size_shift specified by user. --- block/sheepdog.c | 138 ++--- include/block/block_int.h |1 + 2 files changed, 119 insertions(+), 20 deletions(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index be3176f..a43b947 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -37,6 +37,7 @@ #define SD_OP_READ_VDIS 0x15 #define SD_OP_FLUSH_VDI 0x16 #define SD_OP_DEL_VDI0x17 +#define SD_OP_GET_CLUSTER_DEFAULT 0x18 #define SD_FLAG_CMD_WRITE0x01 #define SD_FLAG_CMD_COW 0x02 @@ -167,7 +168,8 @@ typedef struct SheepdogVdiReq { uint32_t base_vdi_id; uint8_t copies; uint8_t copy_policy; -uint8_t reserved[2]; +uint8_t store_policy; +uint8_t block_size_shift; uint32_t snapid; uint32_t type; uint32_t pad[2]; @@ -186,6 +188,21 @@ typedef struct SheepdogVdiRsp { uint32_t pad[5]; } SheepdogVdiRsp; +typedef struct SheepdogClusterRsp { +uint8_t proto_ver; +uint8_t opcode; +uint16_t flags; +uint32_t epoch; +uint32_t id; +uint32_t data_length; +uint32_t result; +uint8_t nr_copies; +uint8_t copy_policy; +uint8_t block_size_shift; +uint8_t __pad1; +uint32_t __pad2[6]; +} SheepdogClusterRsp; + typedef struct SheepdogInode { char name[SD_MAX_VDI_LEN]; char tag[SD_MAX_VDI_TAG_LEN]; @@ -1544,6 +1561,7 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, hdr.vdi_size = s-inode.vdi_size; hdr.copy_policy = s-inode.copy_policy; hdr.copies = s-inode.nr_copies; +hdr.block_size_shift = s-inode.block_size_shift; ret = do_req(fd, s-aio_context, (SheepdogReq *)hdr, buf, wlen, rlen); @@ -1569,9 +1587,12 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, static int sd_prealloc(const char *filename, Error **errp) { BlockDriverState *bs = NULL; +BDRVSheepdogState *base = NULL; +unsigned long buf_size; uint32_t idx, max_idx; +uint32_t object_size; int64_t vdi_size; -void *buf = g_malloc0(SD_DATA_OBJ_SIZE); +void *buf = NULL; int ret; ret = bdrv_open(bs, filename, NULL, NULL, BDRV_O_RDWR | BDRV_O_PROTOCOL, @@ -1585,18 +1606,24 @@ static int sd_prealloc(const char *filename, Error **errp) ret = vdi_size; 
goto out; } -max_idx = DIV_ROUND_UP(vdi_size, SD_DATA_OBJ_SIZE); + +base = bs-opaque; +object_size = (UINT32_C(1) base-inode.block_size_shift); +buf_size = MIN(object_size, SD_DATA_OBJ_SIZE); +buf = g_malloc0(buf_size); + +max_idx = DIV_ROUND_UP(vdi_size, buf_size); for (idx = 0; idx max_idx; idx++) { /* * The created image can be a cloned image, so we need to read * a data from the source image. */ -ret = bdrv_pread(bs, idx * SD_DATA_OBJ_SIZE, buf, SD_DATA_OBJ_SIZE); +ret = bdrv_pread(bs, idx * buf_size, buf, buf_size); if (ret 0) { goto out; } -ret = bdrv_pwrite(bs, idx * SD_DATA_OBJ_SIZE, buf, SD_DATA_OBJ_SIZE); +ret = bdrv_pwrite(bs, idx * buf_size, buf, buf_size); if (ret 0) { goto out; } @@ -1669,6 +1696,21 @@ static int
Re: [Qemu-devel] [PATCH v4] sheepdog: selectable object size support
On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote: Previously, qemu block driver of sheepdog used hard-coded VDI object size. This patch enables users to handle block_size_shift value for calculating VDI object size. When you start qemu, you don't need to specify additional command option. But when you create the VDI which doesn't have default object size with qemu-img command, you specify block_size_shift option. If you want to create a VDI of 8MB(1 23) object size, you need to specify following command option. # qemu-img create -o block_size_shift=23 sheepdog:test1 100M In addition, when you don't specify qemu-img command option, a default value of sheepdog cluster is used for creating VDI. # qemu-img create sheepdog:test2 100M Signed-off-by: Teruaki Ishizaki ishizaki.teru...@lab.ntt.co.jp --- V4: - Limit a read/write buffer size for creating a preallocated VDI. - Replace a parse function for the block_size_shift option. - Fix an error message. V3: - Delete the needless operation of buffer. - Delete the needless operations of request header. for SD_OP_GET_CLUSTER_DEFAULT. - Fix coding style problems. V2: - Fix coding style problem (white space). - Add members, store_policy and block_size_shift to struct SheepdogVdiReq. - Initialize request header to use block_size_shift specified by user. --- block/sheepdog.c | 138 ++--- include/block/block_int.h |1 + 2 files changed, 119 insertions(+), 20 deletions(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index be3176f..a43b947 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -37,6 +37,7 @@ #define SD_OP_READ_VDIS 0x15 #define SD_OP_FLUSH_VDI 0x16 #define SD_OP_DEL_VDI0x17 +#define SD_OP_GET_CLUSTER_DEFAULT 0x18 This might not be necessary. For old qemu or the qemu-img without setting option, the block_size_shift will be 0. If we make 0 to represent 4MB object, then we don't need to get the default cluster object size. We migth even get rid of the idea of cluster default size. The downsize is that, if we want to create a vdi with different size not the default 4MB, we have to write it every time for qemu-img or dog. If we choose to keep the idea of cluster default size, I think we'd also try to avoid call this request from QEMU to make backward compatibility easier. In this scenario, 0 might be used to ask new sheep to decide to use cluster default size. Both old qemu and new QEMU will send 0 to sheep and both old and new sheep can handle 0 though it has different meanings. Table for this bit as 0: Qe: qemu SD: Sheep daemon CDS: Cluster Default Size Ign: Ignored by the sheep daemon Qe/sd newold new CDSIgn old CDSNULL I think this approach is acceptable. The difference to your patch is that we don't send SD_OP_GET_CLUSTER_DEFAULT to sheep daemon and SD_OP_GET_CLUSTER_DEFAULT can be removed. Thanks Yuan
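The compatibility matrix above, restated: with the field left at 0, a new sheep substitutes the cluster default size whether QEMU is old or new, while an old sheep either ignores the field (new QEMU) or never sees it at all (old QEMU) and sticks to 4MB objects. A minimal sketch of the driver-side encoding; choose_block_size_shift() is an illustrative helper name, not code from the patch:

    #include <stdbool.h>
    #include <stdint.h>

    /* 0 means "not specified": a new sheep fills in the cluster default,
     * an old sheep ignores the field and keeps 4MB objects. */
    static uint8_t choose_block_size_shift(bool user_specified, uint8_t user_value)
    {
        return user_specified ? user_value : 0;
    }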
Re: [Qemu-devel] [PATCH v4] sheepdog: selectable object size support
On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote: Previously, qemu block driver of sheepdog used hard-coded VDI object size. This patch enables users to handle block_size_shift value for calculating VDI object size. When you start qemu, you don't need to specify additional command option. But when you create the VDI which doesn't have default object size with qemu-img command, you specify block_size_shift option. If you want to create a VDI of 8MB(1 23) object size, you need to specify following command option. # qemu-img create -o block_size_shift=23 sheepdog:test1 100M In addition, when you don't specify qemu-img command option, a default value of sheepdog cluster is used for creating VDI. # qemu-img create sheepdog:test2 100M Signed-off-by: Teruaki Ishizaki ishizaki.teru...@lab.ntt.co.jp --- V4: - Limit a read/write buffer size for creating a preallocated VDI. - Replace a parse function for the block_size_shift option. - Fix an error message. V3: - Delete the needless operation of buffer. - Delete the needless operations of request header. for SD_OP_GET_CLUSTER_DEFAULT. - Fix coding style problems. V2: - Fix coding style problem (white space). - Add members, store_policy and block_size_shift to struct SheepdogVdiReq. - Initialize request header to use block_size_shift specified by user. --- block/sheepdog.c | 138 ++--- include/block/block_int.h |1 + 2 files changed, 119 insertions(+), 20 deletions(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index be3176f..a43b947 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -37,6 +37,7 @@ #define SD_OP_READ_VDIS 0x15 #define SD_OP_FLUSH_VDI 0x16 #define SD_OP_DEL_VDI0x17 +#define SD_OP_GET_CLUSTER_DEFAULT 0x18 #define SD_FLAG_CMD_WRITE0x01 #define SD_FLAG_CMD_COW 0x02 @@ -167,7 +168,8 @@ typedef struct SheepdogVdiReq { uint32_t base_vdi_id; uint8_t copies; uint8_t copy_policy; -uint8_t reserved[2]; +uint8_t store_policy; +uint8_t block_size_shift; uint32_t snapid; uint32_t type; uint32_t pad[2]; @@ -186,6 +188,21 @@ typedef struct SheepdogVdiRsp { uint32_t pad[5]; } SheepdogVdiRsp; +typedef struct SheepdogClusterRsp { +uint8_t proto_ver; +uint8_t opcode; +uint16_t flags; +uint32_t epoch; +uint32_t id; +uint32_t data_length; +uint32_t result; +uint8_t nr_copies; +uint8_t copy_policy; +uint8_t block_size_shift; +uint8_t __pad1; +uint32_t __pad2[6]; +} SheepdogClusterRsp; + typedef struct SheepdogInode { char name[SD_MAX_VDI_LEN]; char tag[SD_MAX_VDI_TAG_LEN]; @@ -1544,6 +1561,7 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, hdr.vdi_size = s-inode.vdi_size; hdr.copy_policy = s-inode.copy_policy; hdr.copies = s-inode.nr_copies; +hdr.block_size_shift = s-inode.block_size_shift; ret = do_req(fd, s-aio_context, (SheepdogReq *)hdr, buf, wlen, rlen); @@ -1569,9 +1587,12 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, static int sd_prealloc(const char *filename, Error **errp) { BlockDriverState *bs = NULL; +BDRVSheepdogState *base = NULL; +unsigned long buf_size; uint32_t idx, max_idx; +uint32_t object_size; int64_t vdi_size; -void *buf = g_malloc0(SD_DATA_OBJ_SIZE); +void *buf = NULL; int ret; ret = bdrv_open(bs, filename, NULL, NULL, BDRV_O_RDWR | BDRV_O_PROTOCOL, @@ -1585,18 +1606,24 @@ static int sd_prealloc(const char *filename, Error **errp) ret = vdi_size; goto out; } -max_idx = DIV_ROUND_UP(vdi_size, SD_DATA_OBJ_SIZE); + +base = bs-opaque; +object_size = (UINT32_C(1) base-inode.block_size_shift); +buf_size = MIN(object_size, SD_DATA_OBJ_SIZE); +buf = 
g_malloc0(buf_size); + +max_idx = DIV_ROUND_UP(vdi_size, buf_size); for (idx = 0; idx max_idx; idx++) { /* * The created image can be a cloned image, so we need to read * a data from the source image. */ -ret = bdrv_pread(bs, idx * SD_DATA_OBJ_SIZE, buf, SD_DATA_OBJ_SIZE); +ret = bdrv_pread(bs, idx * buf_size, buf, buf_size); if (ret 0) { goto out; } -ret = bdrv_pwrite(bs, idx * SD_DATA_OBJ_SIZE, buf, SD_DATA_OBJ_SIZE); +ret = bdrv_pwrite(bs, idx * buf_size, buf, buf_size); if (ret 0) { goto out; } @@ -1669,6 +1696,21 @@ static int parse_redundancy(BDRVSheepdogState *s, const char *opt) return 0; } +static int parse_block_size_shift(BDRVSheepdogState *s, QemuOpts *opt) +{ +struct SheepdogInode *inode = s-inode; +inode-block_size_shift = +
Re: [Qemu-devel] [PATCH v4] sheepdog: selectable object size support
On Tue, Feb 10, 2015 at 11:10:51AM +0800, Liu Yuan wrote: On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote: Previously, qemu block driver of sheepdog used hard-coded VDI object size. This patch enables users to handle block_size_shift value for calculating VDI object size. When you start qemu, you don't need to specify additional command option. But when you create the VDI which doesn't have default object size with qemu-img command, you specify block_size_shift option. If you want to create a VDI of 8MB(1 23) object size, you need to specify following command option. # qemu-img create -o block_size_shift=23 sheepdog:test1 100M In addition, when you don't specify qemu-img command option, a default value of sheepdog cluster is used for creating VDI. # qemu-img create sheepdog:test2 100M Signed-off-by: Teruaki Ishizaki ishizaki.teru...@lab.ntt.co.jp --- V4: - Limit a read/write buffer size for creating a preallocated VDI. - Replace a parse function for the block_size_shift option. - Fix an error message. V3: - Delete the needless operation of buffer. - Delete the needless operations of request header. for SD_OP_GET_CLUSTER_DEFAULT. - Fix coding style problems. V2: - Fix coding style problem (white space). - Add members, store_policy and block_size_shift to struct SheepdogVdiReq. - Initialize request header to use block_size_shift specified by user. --- block/sheepdog.c | 138 ++--- include/block/block_int.h |1 + 2 files changed, 119 insertions(+), 20 deletions(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index be3176f..a43b947 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -37,6 +37,7 @@ #define SD_OP_READ_VDIS 0x15 #define SD_OP_FLUSH_VDI 0x16 #define SD_OP_DEL_VDI0x17 +#define SD_OP_GET_CLUSTER_DEFAULT 0x18 #define SD_FLAG_CMD_WRITE0x01 #define SD_FLAG_CMD_COW 0x02 @@ -167,7 +168,8 @@ typedef struct SheepdogVdiReq { uint32_t base_vdi_id; uint8_t copies; uint8_t copy_policy; -uint8_t reserved[2]; +uint8_t store_policy; +uint8_t block_size_shift; uint32_t snapid; uint32_t type; uint32_t pad[2]; @@ -186,6 +188,21 @@ typedef struct SheepdogVdiRsp { uint32_t pad[5]; } SheepdogVdiRsp; +typedef struct SheepdogClusterRsp { +uint8_t proto_ver; +uint8_t opcode; +uint16_t flags; +uint32_t epoch; +uint32_t id; +uint32_t data_length; +uint32_t result; +uint8_t nr_copies; +uint8_t copy_policy; +uint8_t block_size_shift; +uint8_t __pad1; +uint32_t __pad2[6]; +} SheepdogClusterRsp; + typedef struct SheepdogInode { char name[SD_MAX_VDI_LEN]; char tag[SD_MAX_VDI_TAG_LEN]; @@ -1544,6 +1561,7 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, hdr.vdi_size = s-inode.vdi_size; hdr.copy_policy = s-inode.copy_policy; hdr.copies = s-inode.nr_copies; +hdr.block_size_shift = s-inode.block_size_shift; ret = do_req(fd, s-aio_context, (SheepdogReq *)hdr, buf, wlen, rlen); @@ -1569,9 +1587,12 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, static int sd_prealloc(const char *filename, Error **errp) { BlockDriverState *bs = NULL; +BDRVSheepdogState *base = NULL; +unsigned long buf_size; uint32_t idx, max_idx; +uint32_t object_size; int64_t vdi_size; -void *buf = g_malloc0(SD_DATA_OBJ_SIZE); +void *buf = NULL; int ret; ret = bdrv_open(bs, filename, NULL, NULL, BDRV_O_RDWR | BDRV_O_PROTOCOL, @@ -1585,18 +1606,24 @@ static int sd_prealloc(const char *filename, Error **errp) ret = vdi_size; goto out; } -max_idx = DIV_ROUND_UP(vdi_size, SD_DATA_OBJ_SIZE); + +base = bs-opaque; +object_size = (UINT32_C(1) base-inode.block_size_shift); 
+buf_size = MIN(object_size, SD_DATA_OBJ_SIZE); +buf = g_malloc0(buf_size); + +max_idx = DIV_ROUND_UP(vdi_size, buf_size); for (idx = 0; idx max_idx; idx++) { /* * The created image can be a cloned image, so we need to read * a data from the source image. */ -ret = bdrv_pread(bs, idx * SD_DATA_OBJ_SIZE, buf, SD_DATA_OBJ_SIZE); +ret = bdrv_pread(bs, idx * buf_size, buf, buf_size); if (ret 0) { goto out; } -ret = bdrv_pwrite(bs, idx * SD_DATA_OBJ_SIZE, buf, SD_DATA_OBJ_SIZE); +ret = bdrv_pwrite(bs, idx * buf_size, buf, buf_size); if (ret 0) { goto out; } @@ -1669,6 +1696,21 @@ static int parse_redundancy(BDRVSheepdogState *s, const char *opt
Re: [Qemu-devel] [PATCH v4] sheepdog: selectable object size support
On Fri, Feb 06, 2015 at 04:57:55PM +0900, Teruaki Ishizaki wrote:
(2015/02/06 11:18), Liu Yuan wrote:
On Wed, Feb 04, 2015 at 01:54:19PM +0900, Teruaki Ishizaki wrote:
(2015/02/02 15:52), Liu Yuan wrote:
On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote:

Previously, the qemu block driver of sheepdog used a hard-coded VDI object size. This patch enables users to set the block_size_shift value used for calculating the VDI object size. When you start qemu, you don't need to specify an additional command option. But when you create a VDI that doesn't have the default object size with the qemu-img command, you specify the block_size_shift option. If you want to create a VDI of 8MB (1 << 23) object size, you need to specify the following command option.

 # qemu-img create -o block_size_shift=23 sheepdog:test1 100M

In addition, when you don't specify the qemu-img command option, the default value of the sheepdog cluster is used for creating the VDI.

 # qemu-img create sheepdog:test2 100M

Is it possible to make this option more user friendly? such as

 $ qemu-img create -o object_size=8M sheepdog:test 1G

At first, I thought that object_size was user friendly. But Sheepdog already has the block_size_shift value in the inode layout, which means the same thing as object_size. 'object_size' doesn't always fit exactly into 'block_size_shift'; on the other hand, 'block_size_shift' always fits exactly into 'object_size'. I think that the existing layout shouldn't be changed easily, and it seems difficult for users to specify an object_size value that fits exactly into 'block_size_shift'.

I don't think we need to change the layout. block_size_shift is what QEMU talks to sheep, and QEMU options are what users talk to QEMU. We can convert the user-friendly object size into block_size_shift internally in the driver before sending it to the sheep daemon, no? For example, if users specify 12MB for the object size, block_size_shift doesn't come out as an integer. In this case, we should abort, print an error message and notify the users of the acceptable range, as the erasure coding option does.

I suppose that normally an administrator formats the sheepdog cluster specifying block_size_shift, and users usually run the qemu-img command without a block_size_shift option.

block_size_shift is too developer-centered and has a direct relation to the underlying algorithm. If in the future, e.g., we change the underlying algorithm for how we represent block size, block_size_shift might not even exist. So using object_size would be more generic and won't have this kind of problem. Secondly, it is not hard to parse object_size into block_size_shift, so I'd suggest we try our best to make it user friendly.

Thanks Yuan
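A minimal sketch of the conversion Yuan suggests, assuming the option parser has already turned a string such as "8M" into a byte count; object_size_to_shift() is an illustrative name, not the function the driver ended up with. It rejects sizes like 12MB that are not a power of two, matching the abort-and-report behaviour described above:

    #include <errno.h>
    #include <stdint.h>

    static int object_size_to_shift(uint64_t object_size, uint8_t *shift)
    {
        uint8_t s = 0;

        if (object_size == 0 || (object_size & (object_size - 1))) {
            return -EINVAL;                 /* e.g. 12MB: not a power of two */
        }
        while ((UINT64_C(1) << s) < object_size) {
            s++;
        }
        *shift = s;                         /* 8MB maps to 23 */
        return 0;
    }

With this in place, "qemu-img create -o object_size=8M ..." would translate to block_size_shift=23 before the request is sent to the sheep daemon.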
Re: [Qemu-devel] [PATCH v4] sheepdog: selectable object size support
On Wed, Feb 04, 2015 at 01:54:19PM +0900, Teruaki Ishizaki wrote: (2015/02/02 15:52), Liu Yuan wrote: On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote: Previously, qemu block driver of sheepdog used hard-coded VDI object size. This patch enables users to handle block_size_shift value for calculating VDI object size. When you start qemu, you don't need to specify additional command option. But when you create the VDI which doesn't have default object size with qemu-img command, you specify block_size_shift option. If you want to create a VDI of 8MB(1 23) object size, you need to specify following command option. # qemu-img create -o block_size_shift=23 sheepdog:test1 100M Is it possible to make this option more user friendly? such as $ qemu-img create -o object_size=8M sheepdog:test 1G At first, I thought that the object_size was user friendly. But, Sheepdog has already the value of block_size_shift in the inode layout that means like object_size. 'object_size' doesn't always fit right in 'block_size_shift'. On the other hands, 'block_size_shift' always fit right in 'object_size'. I think that existing layout shouldn't be changed easily and it seems that it is difficult for users to specify the object_size value that fit right in 'block_size_shift'. I don't think we need to change the layout. block_size_shift is what QEMU talks to sheep, and QEMU options is what users talks to QEMU. We can convert the user friendly object size into block_size_shift internally in the driver before sending it tosheep daemon, no? Thanks Yuan
Re: [Qemu-devel] [PATCH v4] sheepdog: selectable object size support
On Tue, Jan 27, 2015 at 05:35:27PM +0900, Teruaki Ishizaki wrote:

Previously, the qemu block driver of sheepdog used a hard-coded VDI object size. This patch enables users to set the block_size_shift value used for calculating the VDI object size. When you start qemu, you don't need to specify an additional command option. But when you create a VDI that doesn't have the default object size with the qemu-img command, you specify the block_size_shift option. If you want to create a VDI of 8MB (1 << 23) object size, you need to specify the following command option.

 # qemu-img create -o block_size_shift=23 sheepdog:test1 100M

In addition, when you don't specify the qemu-img command option, the default value of the sheepdog cluster is used for creating the VDI.

 # qemu-img create sheepdog:test2 100M

Is it possible to make this option more user friendly? such as

 $ qemu-img create -o object_size=8M sheepdog:test 1G

Thanks Yuan
Re: [Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver
On Sun, Sep 07, 2014 at 05:12:31PM +0200, Benoît Canet wrote: The Monday 01 Sep 2014 à 15:43:06 (+0800), Liu Yuan wrote : This patch set mainly add mainly two logics to implement device recover - notify qourum driver of the broken states from the child driver(s) - dirty track and sync the device after it is repaired Thus quorum allow VMs to continue while some child devices are broken and when the child devices are repaired and return back, we sync dirty bits during downtime to keep data consistency. The recovery logic is based on the driver state bitmap and will sync the dirty bits with a timeslice window in a coroutine in this prtimive implementation. Simple graph about 2 children with threshold=1 and read-pattern=fifo: (similary to DRBD) + denote device sync iteration - IO on a single device = IO on two devices sync complete, release dirty bitmap ^ | -++== | | | v | device repaired and begin to sync v device broken, create a dirty bitmap This sync logic can take care of nested broken problem, that devices are broken while in sync. We just start a sync process after the devices are repaired again and switch the devices from broken to sound only when the sync completes. For read-pattern=quorum mode, it enjoys the recovery logic without any problem. Todo: - use aio interface to sync data (multiple transfer in one go) - dynamic slice window to control sync bandwidth more smoothly - add auto-reconnection mechanism to other protocol (if not support yet) - add tests Cc: Eric Blake ebl...@redhat.com Cc: Benoit Canet ben...@irqsave.net Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Liu Yuan (8): block/quorum: initialize qcrs.aiocb for read block: add driver operation callbacks block/sheepdog: propagate disconnect/reconnect events to upper driver block/quorum: add quorum_aio_release() helper quorum: fix quorum_aio_cancel() block/quorum: add broken state to BlockDriverState block: add two helpers quorum: add basic device recovery logic block.c | 17 +++ block/quorum.c| 324 +- block/sheepdog.c | 9 ++ include/block/block.h | 9 ++ include/block/block_int.h | 6 + trace-events | 5 + 6 files changed, 336 insertions(+), 34 deletions(-) -- 1.9.1 Hi liu, Had you noticed that your series conflict with one of Fam's series in the quorum cancel function fix patch ? Not yet, thanks for reminding. Could you find an arrangement with Fam so the two patches don't collide anymore ? Do you intend to respin your series ? Yes, I'll rebase the v2 later before more possible reviews. Thanks Yuan
Re: [Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver
On Wed, Sep 03, 2014 at 12:19:14AM +0200, Benoît Canet wrote: The Monday 01 Sep 2014 à 15:43:06 (+0800), Liu Yuan wrote : Liu, Do you think this could work with qcow2 file backed by NFS servers ? It depends on which client we use. If we use Linux NFS client by 'posix file' protocol, I guess we need to hack raw-posix.c and insert reconnect/disconnect callbacks. I'm exploring the possibility of QEMU's nfs client block/nfs.c too. Either way, this should work with all the protocols with proper hacks. Thanks Yuan
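Patches 2/8 and 3/8 of this series are about exactly that plumbing: the protocol driver (sheepdog, NFS, raw-posix, ...) signals connection state changes and the parent driver reacts. A rough, standalone sketch of the shape such callbacks could take; the type and function names are illustrative, not the series' actual API:

    typedef struct BlockDriverState BlockDriverState;   /* opaque for this sketch */

    typedef struct BlockOpsCallbacks {
        void (*disconnect)(BlockDriverState *child, void *opaque);
        void (*reconnect)(BlockDriverState *child, void *opaque);
        void *opaque;                       /* e.g. the quorum parent's state */
    } BlockOpsCallbacks;

    /* called by the protocol driver when its transport drops or comes back */
    static void notify_disconnect(BlockOpsCallbacks *ops, BlockDriverState *child)
    {
        if (ops && ops->disconnect) {
            ops->disconnect(child, ops->opaque);
        }
    }

    static void notify_reconnect(BlockOpsCallbacks *ops, BlockDriverState *child)
    {
        if (ops && ops->reconnect) {
            ops->reconnect(child, ops->opaque);
        }
    }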
[Qemu-devel] [PATCH 1/8] block/quorum: initialize qcrs.aiocb for read
This is required by quorum_aio_cancel().

Cc: Eric Blake <ebl...@redhat.com>
Cc: Benoit Canet <ben...@irqsave.net>
Cc: Kevin Wolf <kw...@redhat.com>
Cc: Stefan Hajnoczi <stefa...@redhat.com>
Signed-off-by: Liu Yuan <namei.u...@gmail.com>
---
 block/quorum.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index af48e8c..5866bca 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -653,8 +653,10 @@ static BlockDriverAIOCB *read_quorum_children(QuorumAIOCB *acb)
     }
 
     for (i = 0; i < s->num_children; i++) {
-        bdrv_aio_readv(s->bs[i], acb->sector_num, &acb->qcrs[i].qiov,
-                       acb->nb_sectors, quorum_aio_cb, &acb->qcrs[i]);
+        acb->qcrs[i].aiocb = bdrv_aio_readv(s->bs[i], acb->sector_num,
+                                            &acb->qcrs[i].qiov,
+                                            acb->nb_sectors, quorum_aio_cb,
+                                            &acb->qcrs[i]);
     }
 
     return &acb->common;
@@ -663,15 +665,14 @@ static BlockDriverAIOCB *read_quorum_children(QuorumAIOCB *acb)
 static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb)
 {
     BDRVQuorumState *s = acb->common.bs->opaque;
-
-    acb->qcrs[acb->child_iter].buf = qemu_blockalign(s->bs[acb->child_iter],
-                                                     acb->qiov->size);
-    qemu_iovec_init(&acb->qcrs[acb->child_iter].qiov, acb->qiov->niov);
-    qemu_iovec_clone(&acb->qcrs[acb->child_iter].qiov, acb->qiov,
-                     acb->qcrs[acb->child_iter].buf);
-    bdrv_aio_readv(s->bs[acb->child_iter], acb->sector_num,
-                   &acb->qcrs[acb->child_iter].qiov, acb->nb_sectors,
-                   quorum_aio_cb, &acb->qcrs[acb->child_iter]);
+    int i = acb->child_iter;
+
+    acb->qcrs[i].buf = qemu_blockalign(s->bs[i], acb->qiov->size);
+    qemu_iovec_init(&acb->qcrs[i].qiov, acb->qiov->niov);
+    qemu_iovec_clone(&acb->qcrs[i].qiov, acb->qiov, acb->qcrs[i].buf);
+    acb->qcrs[i].aiocb = bdrv_aio_readv(s->bs[i], acb->sector_num,
+                                        &acb->qcrs[i].qiov, acb->nb_sectors,
+                                        quorum_aio_cb, &acb->qcrs[i]);
 
     return &acb->common;
 }
-- 
1.9.1
[Qemu-devel] [PATCH 4/8] block/quorum: add quorum_aio_release() helper
Cc: Eric Blake <ebl...@redhat.com>
Cc: Benoit Canet <ben...@irqsave.net>
Cc: Kevin Wolf <kw...@redhat.com>
Cc: Stefan Hajnoczi <stefa...@redhat.com>
Signed-off-by: Liu Yuan <namei.u...@gmail.com>
---
 block/quorum.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index 5866bca..9e056d6 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -130,6 +130,12 @@ struct QuorumAIOCB {
 
 static bool quorum_vote(QuorumAIOCB *acb);
 
+static void quorum_aio_release(QuorumAIOCB *acb)
+{
+    g_free(acb->qcrs);
+    qemu_aio_release(acb);
+}
+
 static void quorum_aio_cancel(BlockDriverAIOCB *blockacb)
 {
     QuorumAIOCB *acb = container_of(blockacb, QuorumAIOCB, common);
@@ -141,8 +147,7 @@ static void quorum_aio_cancel(BlockDriverAIOCB *blockacb)
         bdrv_aio_cancel(acb->qcrs[i].aiocb);
     }
 
-    g_free(acb->qcrs);
-    qemu_aio_release(acb);
+    quorum_aio_release(acb);
 }
 
 static AIOCBInfo quorum_aiocb_info = {
@@ -168,8 +173,7 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
         }
     }
 
-    g_free(acb->qcrs);
-    qemu_aio_release(acb);
+    quorum_aio_release(acb);
 }
 
 static bool quorum_sha256_compare(QuorumVoteValue *a, QuorumVoteValue *b)
-- 
1.9.1
[Qemu-devel] [PATCH 0/8] add basic recovery logic to quorum driver
This patch set adds two main pieces of logic to implement device recovery:
- notify the quorum driver of the broken state reported by the child driver(s)
- track dirty data and sync the device after it is repaired

Thus quorum allows VMs to continue while some child devices are broken, and when the child devices are repaired and come back, we sync the dirty bits during the downtime to keep data consistent.

The recovery logic is based on the driver state bitmap and syncs the dirty bits within a timeslice window in a coroutine in this primitive implementation.

Simple picture for 2 children with threshold=1 and read-pattern=fifo (similar to DRBD): '+' denotes a device sync iteration, '-' denotes IO on a single device and '=' denotes IO on two devices. A dirty bitmap is created when a device breaks, the sync starts once the device is repaired, and the dirty bitmap is released when the sync completes.

This sync logic can take care of the nested-failure problem, where devices break again while a sync is in progress. We simply start a new sync process after the devices are repaired and switch a device from broken to sound only when its sync completes.

The read-pattern=quorum mode enjoys the recovery logic without any problem.

Todo:
- use the aio interface to sync data (multiple transfers in one go)
- dynamic slice window to control sync bandwidth more smoothly
- add an auto-reconnection mechanism to other protocols (if not supported yet)
- add tests

Cc: Eric Blake <ebl...@redhat.com>
Cc: Benoit Canet <ben...@irqsave.net>
Cc: Kevin Wolf <kw...@redhat.com>
Cc: Stefan Hajnoczi <stefa...@redhat.com>

Liu Yuan (8):
  block/quorum: initialize qcrs.aiocb for read
  block: add driver operation callbacks
  block/sheepdog: propagate disconnect/reconnect events to upper driver
  block/quorum: add quorum_aio_release() helper
  quorum: fix quorum_aio_cancel()
  block/quorum: add broken state to BlockDriverState
  block: add two helpers
  quorum: add basic device recovery logic

 block.c                   |  17 +++
 block/quorum.c            | 324 ++++++++++++++++++++++++++++++++++++++++-----
 block/sheepdog.c          |   9 ++
 include/block/block.h     |   9 ++
 include/block/block_int.h |   6 +
 trace-events              |   5 +
 6 files changed, 336 insertions(+), 34 deletions(-)

-- 
1.9.1
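To make the lifecycle above concrete, here is a self-contained toy model of the broken/dirty/resync flow: plain arrays stand in for BdrvDirtyBitmap, a direct function call stands in for each driver callback, and the resync runs in one pass instead of timesliced coroutine iterations. All names are made up for the illustration.

#include <stdbool.h>
#include <stdio.h>

#define NB_CHUNKS 16

/* Toy model: chunks written while a child is broken are marked dirty,
 * and only those chunks are copied again once the child comes back. */
static bool dirty[NB_CHUNKS];
static bool child_broken;

static void on_disconnect(void)      /* cf. driver_disconnect */
{
    child_broken = true;
}

static void on_write(int chunk)
{
    if (child_broken) {
        dirty[chunk] = true;
    }
}

static void on_reconnect(void)       /* cf. driver_reconnect */
{
    child_broken = false;
    /* in the series this runs in a coroutine, a few chunks per
     * timeslice, instead of one tight loop */
    for (int i = 0; i < NB_CHUNKS; i++) {
        if (dirty[i]) {
            printf("resync chunk %d from the sound child\n", i);
            dirty[i] = false;
        }
    }
}

int main(void)
{
    on_disconnect();
    on_write(3);
    on_write(7);
    on_reconnect();   /* resyncs only chunks 3 and 7 */
    return 0;
}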
[Qemu-devel] [PATCH 5/8] quorum: fix quorum_aio_cancel()
For a fifo read pattern we only have one running aio (other cases may issue fewer requests than num_children in the future), so we need to check whether qcrs[i].aiocb is NULL before calling bdrv_aio_cancel() to avoid a segfault.

Cc: Eric Blake <ebl...@redhat.com>
Cc: Benoit Canet <ben...@irqsave.net>
Cc: Kevin Wolf <kw...@redhat.com>
Cc: Stefan Hajnoczi <stefa...@redhat.com>
Signed-off-by: Liu Yuan <namei.u...@gmail.com>
---
 block/quorum.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/quorum.c b/block/quorum.c
index 9e056d6..b9eeda3 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -144,7 +144,9 @@ static void quorum_aio_cancel(BlockDriverAIOCB *blockacb)
 
     /* cancel all callbacks */
     for (i = 0; i < s->num_children; i++) {
-        bdrv_aio_cancel(acb->qcrs[i].aiocb);
+        if (acb->qcrs[i].aiocb) {
+            bdrv_aio_cancel(acb->qcrs[i].aiocb);
+        }
     }
 
     quorum_aio_release(acb);
-- 
1.9.1
[Qemu-devel] [PATCH 2/8] block: add driver operation callbacks
Driver operations are defined as callbacks passed from block upper drivers to lower drivers and are supposed to be called by lower drivers. Requests handling(queuing, submitting, etc.) are done in protocol tier in the block layer and connection states are also maintained down there. Driver operations are supposed to notify the upper tier (such as quorum) of the states changes. For now only two operation are added: driver_disconnect: called when connection is off driver_reconnect: called when connection is on after disconnection Which are used to notify upper tier of the connection state. Cc: Eric Blake ebl...@redhat.com Cc: Benoit Canet ben...@irqsave.net Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block.c | 7 +++ include/block/block.h | 7 +++ include/block/block_int.h | 3 +++ 3 files changed, 17 insertions(+) diff --git a/block.c b/block.c index c12b8de..22eb3e4 100644 --- a/block.c +++ b/block.c @@ -2152,6 +2152,13 @@ void bdrv_set_dev_ops(BlockDriverState *bs, const BlockDevOps *ops, bs-dev_opaque = opaque; } +void bdrv_set_drv_ops(BlockDriverState *bs, const BlockDrvOps *ops, + void *opaque) +{ +bs-drv_ops = ops; +bs-drv_opaque = opaque; +} + static void bdrv_dev_change_media_cb(BlockDriverState *bs, bool load) { if (bs-dev_ops bs-dev_ops-change_media_cb) { diff --git a/include/block/block.h b/include/block/block.h index 8f4ad16..a61eaf0 100644 --- a/include/block/block.h +++ b/include/block/block.h @@ -82,6 +82,11 @@ typedef struct BlockDevOps { void (*resize_cb)(void *opaque); } BlockDevOps; +typedef struct BlockDrvOps { +void (*driver_reconnect)(BlockDriverState *bs); +void (*driver_disconnect)(BlockDriverState *bs); +} BlockDrvOps; + typedef enum { BDRV_REQ_COPY_ON_READ = 0x1, BDRV_REQ_ZERO_WRITE = 0x2, @@ -234,6 +239,8 @@ void bdrv_detach_dev(BlockDriverState *bs, void *dev); void *bdrv_get_attached_dev(BlockDriverState *bs); void bdrv_set_dev_ops(BlockDriverState *bs, const BlockDevOps *ops, void *opaque); +void bdrv_set_drv_ops(BlockDriverState *bs, const BlockDrvOps *ops, + void *opaque); void bdrv_dev_eject_request(BlockDriverState *bs, bool force); bool bdrv_dev_has_removable_media(BlockDriverState *bs); bool bdrv_dev_is_tray_open(BlockDriverState *bs); diff --git a/include/block/block_int.h b/include/block/block_int.h index 2334895..9fdec7f 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -319,6 +319,9 @@ struct BlockDriverState { const BlockDevOps *dev_ops; void *dev_opaque; +const BlockDrvOps *drv_ops; +void *drv_opaque; + AioContext *aio_context; /* event loop used for fd handlers, timers, etc */ char filename[1024]; -- 1.9.1
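Later patches in the series wire these hooks into quorum; the fragment below is only a sketch of the usage the API implies. The callback names and the exact registration point are assumptions, not code taken from patch 6/8 or 8/8.

/* Sketch only: how an upper driver such as quorum could consume the new
 * BlockDrvOps hooks introduced by this patch. */
static void quorum_child_disconnect(BlockDriverState *child_bs)
{
    /* mark the corresponding child broken so new requests skip it */
}

static void quorum_child_reconnect(BlockDriverState *child_bs)
{
    /* start the dirty-bitmap resync, clear the broken flag when done */
}

static const BlockDrvOps quorum_drv_ops = {
    .driver_disconnect = quorum_child_disconnect,
    .driver_reconnect  = quorum_child_reconnect,
};

static void quorum_register_child_ops(BDRVQuorumState *s)
{
    int i;

    for (i = 0; i < s->num_children; i++) {
        /* the protocol driver (sheepdog in patch 3/8) invokes these
         * hooks when its connection drops or is re-established */
        bdrv_set_drv_ops(s->bs[i], &quorum_drv_ops, s);
    }
}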
[Qemu-devel] [PATCH 8/8] quorum: add basic device recovery logic
For some configuration, quorum allow VMs to continue while some child devices are broken and when the child devices are repaired and return back, we need to sync dirty bits during downtime to keep data consistency. The recovery logic is based on the driver state bitmap and will sync the dirty bits with a timeslice window in a coroutine in this prtimive implementation. Simple graph about 2 children with threshold=1 and read-pattern=fifo: + denote device sync iteration - IO on a single device = IO on two devices sync complete, release dirty bitmap ^ | -++== | | | v | device repaired and begin to sync v device broken, create a dirty bitmap This sync logic can take care of nested broken problem, that devices are broken while in sync. We just start a sync process after the devices are repaired again and switch the devices from broken to sound only when the sync completes. For read-pattern=quorum mode, it enjoys the recovery logic without any problem. Cc: Eric Blake ebl...@redhat.com Cc: Benoit Canet ben...@irqsave.net Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block/quorum.c | 189 - trace-events | 5 ++ 2 files changed, 191 insertions(+), 3 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index 7b07e35..ffd7c2d 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -23,6 +23,7 @@ #include qapi/qmp/qlist.h #include qapi/qmp/qstring.h #include qapi-event.h +#include trace.h #define HASH_LENGTH 32 @@ -31,6 +32,10 @@ #define QUORUM_OPT_REWRITErewrite-corrupted #define QUORUM_OPT_READ_PATTERN read-pattern +#define SLICE_TIME 1ULL /* 100 ms */ +#define CHUNK_SIZE (1 20) /* 1M */ +#define SECTORS_PER_CHUNK (CHUNK_SIZE BDRV_SECTOR_BITS) + /* This union holds a vote hash value */ typedef union QuorumVoteValue { char h[HASH_LENGTH]; /* SHA-256 hash */ @@ -64,6 +69,7 @@ typedef struct QuorumVotes { /* the following structure holds the state of one quorum instance */ typedef struct BDRVQuorumState { +BlockDriverState *mybs;/* Quorum block driver base state */ BlockDriverState **bs; /* children BlockDriverStates */ int num_children; /* children count */ int threshold; /* if less than threshold children reads gave the @@ -82,6 +88,10 @@ typedef struct BDRVQuorumState { */ QuorumReadPattern read_pattern; +BdrvDirtyBitmap *dirty_bitmap; +uint8_t *sync_buf; +HBitmapIter hbi; +int64_t sector_num; } BDRVQuorumState; typedef struct QuorumAIOCB QuorumAIOCB; @@ -290,12 +300,11 @@ static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source) } } -static int next_fifo_child(QuorumAIOCB *acb) +static int get_good_child(BDRVQuorumState *s, int iter) { -BDRVQuorumState *s = acb-common.bs-opaque; int i; -for (i = acb-child_iter; i s-num_children; i++) { +for (i = iter; i s-num_children; i++) { if (!s-bs[i]-broken) { break; } @@ -306,6 +315,13 @@ static int next_fifo_child(QuorumAIOCB *acb) return i; } +static int next_fifo_child(QuorumAIOCB *acb) +{ +BDRVQuorumState *s = acb-common.bs-opaque; + +return get_good_child(s, acb-child_iter); +} + static void quorum_aio_cb(void *opaque, int ret) { QuorumChildRequest *sacb = opaque; @@ -951,6 +967,171 @@ static int parse_read_pattern(const char *opt) return -EINVAL; } +static void sync_prepare(BDRVQuorumState *qs, int64_t *num) +{ +int64_t nb, total = bdrv_nb_sectors(qs-mybs); + +qs-sector_num = hbitmap_iter_next(qs-hbi); +/* Wrap around if previous bits get dirty while syncing */ +if (qs-sector_num 0) { +bdrv_dirty_iter_init(qs-mybs, qs-dirty_bitmap, qs-hbi); +qs-sector_num = 
hbitmap_iter_next(qs-hbi); +assert(qs-sector_num = 0); +} + +for (nb = 1; nb SECTORS_PER_CHUNK qs-sector_num + nb total; + nb++) { +if (!bdrv_get_dirty(qs-mybs, qs-dirty_bitmap, qs-sector_num + nb)) { +break; +} +} +*num = nb; +} + +static void sync_finish(BDRVQuorumState *qs, int64_t num) +{ +int64_t i; + +for (i = 0; i num; i++) { +/* We need to advance the iterator manually */ +hbitmap_iter_next(qs-hbi); +} +bdrv_reset_dirty(qs-mybs, qs-sector_num, num); +} + +static int quorum_sync_iteration(BDRVQuorumState *qs, BlockDriverState *target) +{ +BlockDriverState *source; +QEMUIOVector qiov; +int ret, good; +int64_t nb_sectors; +struct iovec iov; +const char *sname, *tname = bdrv_get_filename(target); + +good = get_good_child
[Qemu-devel] [PATCH 3/8] block/sheepdog: propagate disconnect/reconnect events to upper driver
This is the reference usage of how we propagate connection state to the upper tier.

Cc: Eric Blake <ebl...@redhat.com>
Cc: Benoit Canet <ben...@irqsave.net>
Cc: Kevin Wolf <kw...@redhat.com>
Cc: Stefan Hajnoczi <stefa...@redhat.com>
Signed-off-by: Liu Yuan <namei.u...@gmail.com>
---
 block/sheepdog.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index 53c24d6..9c0fc49 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -714,6 +714,11 @@ static coroutine_fn void reconnect_to_sdog(void *opaque)
 {
     BDRVSheepdogState *s = opaque;
     AIOReq *aio_req, *next;
+    BlockDriverState *bs = s->bs;
+
+    if (bs->drv_ops && bs->drv_ops->driver_disconnect) {
+        bs->drv_ops->driver_disconnect(bs);
+    }
 
     aio_set_fd_handler(s->aio_context, s->fd, NULL, NULL, NULL);
     close(s->fd);
@@ -756,6 +761,10 @@ static coroutine_fn void reconnect_to_sdog(void *opaque)
         QLIST_INSERT_HEAD(&s->inflight_aio_head, aio_req, aio_siblings);
         resend_aioreq(s, aio_req);
     }
+
+    if (bs->drv_ops && bs->drv_ops->driver_reconnect) {
+        bs->drv_ops->driver_reconnect(bs);
+    }
 }
 
 /*
-- 
1.9.1
[Qemu-devel] [PATCH 6/8] block/quorum: add broken state to BlockDriverState
This allow VM continues to process even if some devices are broken meanwhile with proper configuration. We mark the device broken when the protocol tier notify back some broken state(s) of the device, such as diconnection via driver operations. We could also reset the device as sound when the protocol tier is repaired. Origianlly .threshold controls how we should decide the success of read/write and return the failure only if the success count of read/write is less than .threshold specified by users. But it doesn't record the states of underlying states and will impact performance a bit in some cases. For example, we have 3 children and .threshold is set 2. If one of the devices broken, we should still return success and continue to run VM. But for every IO operations, we will blindly send the requests to the broken device. To store broken state into driver state we can save requests to borken devices and resend the requests to the repaired ones by setting broken as false. This is especially useful for network based protocol such as sheepdog, which has a auto-reconnection mechanism and will never report EIO if the connection is broken but just store the requests to its local queue and wait for resending. Without broken state, quorum request will not come back until the connection is re-established. So we have to skip the broken deivces to allow VM to continue running with networked backed child (glusterfs, nfs, sheepdog, etc). With the combination of read-pattern and threshold, we can easily mimic the DRVD behavior with following configuration: read-pattern=fifo,threshold=1 will two children. Cc: Eric Blake ebl...@redhat.com Cc: Benoit Canet ben...@irqsave.net Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block/quorum.c| 102 ++ include/block/block_int.h | 3 ++ 2 files changed, 87 insertions(+), 18 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index b9eeda3..7b07e35 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -120,6 +120,7 @@ struct QuorumAIOCB { int rewrite_count; /* number of replica to rewrite: count down to * zero once writes are fired */ +int issued_count; /* actual readwrite issued count */ QuorumVotes votes; @@ -170,8 +171,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb) if (acb-is_read) { /* on the quorum case acb-child_iter == s-num_children - 1 */ for (i = 0; i = acb-child_iter; i++) { -qemu_vfree(acb-qcrs[i].buf); -qemu_iovec_destroy(acb-qcrs[i].qiov); +if (acb-qcrs[i].buf) { +qemu_vfree(acb-qcrs[i].buf); +qemu_iovec_destroy(acb-qcrs[i].qiov); +} } } @@ -207,6 +210,7 @@ static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s, acb-count = 0; acb-success_count = 0; acb-rewrite_count = 0; +acb-issued_count = 0; acb-votes.compare = quorum_sha256_compare; QLIST_INIT(acb-votes.vote_list); acb-is_read = false; @@ -286,6 +290,22 @@ static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source) } } +static int next_fifo_child(QuorumAIOCB *acb) +{ +BDRVQuorumState *s = acb-common.bs-opaque; +int i; + +for (i = acb-child_iter; i s-num_children; i++) { +if (!s-bs[i]-broken) { +break; +} +} +if (i == s-num_children) { +return -1; +} +return i; +} + static void quorum_aio_cb(void *opaque, int ret) { QuorumChildRequest *sacb = opaque; @@ -293,11 +313,18 @@ static void quorum_aio_cb(void *opaque, int ret) BDRVQuorumState *s = acb-common.bs-opaque; bool rewrite = false; +if (ret 0) { +s-bs[acb-child_iter]-broken = true; +} + if (acb-is_read s-read_pattern == QUORUM_READ_PATTERN_FIFO) { /* 
We try to read next child in FIFO order if we fail to read */ -if (ret 0 ++acb-child_iter s-num_children) { -read_fifo_child(acb); -return; +if (ret 0) { +acb-child_iter = next_fifo_child(acb); +if (acb-child_iter 0) { +read_fifo_child(acb); +return; +} } if (ret == 0) { @@ -315,9 +342,9 @@ static void quorum_aio_cb(void *opaque, int ret) } else { quorum_report_bad(acb, sacb-aiocb-bs-node_name, ret); } -assert(acb-count = s-num_children); -assert(acb-success_count = s-num_children); -if (acb-count s-num_children) { +assert(acb-count = acb-issued_count); +assert(acb-success_count = acb-issued_count); +if (acb-count acb-issued_count) { return; } @@ -647,22 +674,46 @@ free_exit: return rewrite; } +static bool has_enough_children(BDRVQuorumState *s) +{ +int good = 0, i
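The rule described in the commit message, that requests keep flowing as long as at least vote-threshold children are still sound, reduces to counting the unbroken children. The following sketch only illustrates that check; it is not the series' exact code.

/* Illustration only: a request should proceed only if at least
 * 'threshold' children are still usable, otherwise it must fail. */
static bool enough_sound_children(BDRVQuorumState *s)
{
    int sound = 0, i;

    for (i = 0; i < s->num_children; i++) {
        if (!s->bs[i]->broken) {
            sound++;
        }
    }
    return sound >= s->threshold;
}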
[Qemu-devel] [PATCH 7/8] block: add two helpers
These helpers are needed by later quorum sync device logic.

Cc: Eric Blake <ebl...@redhat.com>
Cc: Benoit Canet <ben...@irqsave.net>
Cc: Kevin Wolf <kw...@redhat.com>
Cc: Stefan Hajnoczi <stefa...@redhat.com>
Signed-off-by: Liu Yuan <namei.u...@gmail.com>
---
 block.c               | 10 ++++++++++
 include/block/block.h |  2 ++
 2 files changed, 12 insertions(+)

diff --git a/block.c b/block.c
index 22eb3e4..2e2f1d9 100644
--- a/block.c
+++ b/block.c
@@ -2145,6 +2145,16 @@ void *bdrv_get_attached_dev(BlockDriverState *bs)
     return bs->dev;
 }
 
+BlockDriverState *bdrv_get_file(BlockDriverState *bs)
+{
+    return bs->file;
+}
+
+const char *bdrv_get_filename(BlockDriverState *bs)
+{
+    return bs->filename;
+}
+
 void bdrv_set_dev_ops(BlockDriverState *bs, const BlockDevOps *ops,
                       void *opaque)
 {
diff --git a/include/block/block.h b/include/block/block.h
index a61eaf0..1e116cc 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -237,6 +237,8 @@ int bdrv_attach_dev(BlockDriverState *bs, void *dev);
 void bdrv_attach_dev_nofail(BlockDriverState *bs, void *dev);
 void bdrv_detach_dev(BlockDriverState *bs, void *dev);
 void *bdrv_get_attached_dev(BlockDriverState *bs);
+BlockDriverState *bdrv_get_file(BlockDriverState *bs);
+const char *bdrv_get_filename(BlockDriverState *bs);
 void bdrv_set_dev_ops(BlockDriverState *bs, const BlockDevOps *ops,
                       void *opaque);
 void bdrv_set_drv_ops(BlockDriverState *bs, const BlockDrvOps *ops,
-- 
1.9.1
Re: [Qemu-devel] [PATCH 2/8] block: add driver operation callbacks
On Mon, Sep 01, 2014 at 10:28:54AM +0200, Benoît Canet wrote: The Monday 01 Sep 2014 à 15:43:08 (+0800), Liu Yuan wrote : Driver operations are defined as callbacks passed from block upper drivers to lower drivers and are supposed to be called by lower drivers. Requests handling(queuing, submitting, etc.) are done in protocol tier in the block layer and connection states are also maintained down there. Driver operations are supposed to notify the upper tier (such as quorum) of the states changes. For now only two operation are added: driver_disconnect: called when connection is off driver_reconnect: called when connection is on after disconnection Which are used to notify upper tier of the connection state. Cc: Eric Blake ebl...@redhat.com Cc: Benoit Canet ben...@irqsave.net Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block.c | 7 +++ include/block/block.h | 7 +++ include/block/block_int.h | 3 +++ 3 files changed, 17 insertions(+) diff --git a/block.c b/block.c index c12b8de..22eb3e4 100644 --- a/block.c +++ b/block.c @@ -2152,6 +2152,13 @@ void bdrv_set_dev_ops(BlockDriverState *bs, const BlockDevOps *ops, bs-dev_opaque = opaque; } +void bdrv_set_drv_ops(BlockDriverState *bs, const BlockDrvOps *ops, + void *opaque) +{ +bs-drv_ops = ops; +bs-drv_opaque = opaque; We need to be very carefull of the mix between these fields and the infamous bdrv_swap function. Also I don't know if driver operations is the right name since the BlockDriver structure's callback could also be named driver operations. BlockDrvierState has a device operation for callbacks from devices. So I choose driver operation. So any sugguestion for better name? Thanks Yuan
Re: [Qemu-devel] [PATCH 3/8] block/sheepdog: propagate disconnect/reconnect events to upper driver
On Mon, Sep 01, 2014 at 10:31:47AM +0200, Benoît Canet wrote: The Monday 01 Sep 2014 à 15:43:09 (+0800), Liu Yuan wrote :

This is the reference usage of how we propagate connection state to the upper tier.

Cc: Eric Blake <ebl...@redhat.com>
Cc: Benoit Canet <ben...@irqsave.net>
Cc: Kevin Wolf <kw...@redhat.com>
Cc: Stefan Hajnoczi <stefa...@redhat.com>
Signed-off-by: Liu Yuan <namei.u...@gmail.com>
---
 block/sheepdog.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index 53c24d6..9c0fc49 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -714,6 +714,11 @@ static coroutine_fn void reconnect_to_sdog(void *opaque)
 {
     BDRVSheepdogState *s = opaque;
     AIOReq *aio_req, *next;
+    BlockDriverState *bs = s->bs;
+
+    if (bs->drv_ops && bs->drv_ops->driver_disconnect) {
+        bs->drv_ops->driver_disconnect(bs);
+    }

Since this sequence will be strictly the same for all the implementations, could we create a bdrv_signal_disconnect(bs) in the block layer to make this code generic?

I'm not sure the other protocol drivers can have the same auto-reconnection logic. Probably for simplicity we keep it as is in this patch. Later, when we have more flesh on the implementation of the other protocols, we can make a better decision.

Thanks
Yuan
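The helper Benoît suggests would simply fold the NULL checks into block.c. A sketch of what it might look like follows; this is hypothetical and was not part of the series as posted.

/* Hypothetical wrappers around the BlockDrvOps hooks from patch 2/8;
 * they would live in block.c next to bdrv_set_drv_ops(). */
void bdrv_signal_disconnect(BlockDriverState *bs)
{
    if (bs->drv_ops && bs->drv_ops->driver_disconnect) {
        bs->drv_ops->driver_disconnect(bs);
    }
}

void bdrv_signal_reconnect(BlockDriverState *bs)
{
    if (bs->drv_ops && bs->drv_ops->driver_reconnect) {
        bs->drv_ops->driver_reconnect(bs);
    }
}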
Re: [Qemu-devel] [PATCH 5/8] quorum: fix quorum_aio_cancel()
On Mon, Sep 01, 2014 at 10:35:27AM +0200, Benoît Canet wrote: The Monday 01 Sep 2014 à 15:43:11 (+0800), Liu Yuan wrote :

For a fifo read pattern we only have one running aio (other cases may issue fewer requests than num_children in the future)

I have trouble understanding this part of the commit message, could you try to clarify it?

Until this patch we had two cases: read a single child or read all children. But a later patch allows VMs to continue if some devices are down, so the discrete numbers 1 and N become a range [1, N]; that is, the number of in-flight requests can be anywhere from 1 to N.

Thanks
Yuan
Re: [Qemu-devel] [PATCH 6/8] block/quorum: add broken state to BlockDriverState
On Mon, Sep 01, 2014 at 10:57:43AM +0200, Benoît Canet wrote: The Monday 01 Sep 2014 à 15:43:12 (+0800), Liu Yuan wrote : This allow VM continues to process even if some devices are broken meanwhile with proper configuration. We mark the device broken when the protocol tier notify back some broken state(s) of the device, such as diconnection via driver operations. We could also reset the device as sound when the protocol tier is repaired. Origianlly .threshold controls how we should decide the success of read/write and return the failure only if the success count of read/write is less than .threshold specified by users. But it doesn't record the states of underlying states and will impact performance a bit in some cases. For example, we have 3 children and .threshold is set 2. If one of the devices broken, we should still return success and continue to run VM. But for every IO operations, we will blindly send the requests to the broken device. To store broken state into driver state we can save requests to borken devices and resend the requests to the repaired ones by setting broken as false. This is especially useful for network based protocol such as sheepdog, which has a auto-reconnection mechanism and will never report EIO if the connection is broken but just store the requests to its local queue and wait for resending. Without broken state, quorum request will not come back until the connection is re-established. So we have to skip the broken deivces to allow VM to continue running with networked backed child (glusterfs, nfs, sheepdog, etc). With the combination of read-pattern and threshold, we can easily mimic the DRVD behavior with following configuration: read-pattern=fifo,threshold=1 will two children. Cc: Eric Blake ebl...@redhat.com Cc: Benoit Canet ben...@irqsave.net Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block/quorum.c| 102 ++ include/block/block_int.h | 3 ++ 2 files changed, 87 insertions(+), 18 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index b9eeda3..7b07e35 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -120,6 +120,7 @@ struct QuorumAIOCB { int rewrite_count; /* number of replica to rewrite: count down to * zero once writes are fired */ +int issued_count; /* actual readwrite issued count */ QuorumVotes votes; @@ -170,8 +171,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb) if (acb-is_read) { /* on the quorum case acb-child_iter == s-num_children - 1 */ for (i = 0; i = acb-child_iter; i++) { -qemu_vfree(acb-qcrs[i].buf); -qemu_iovec_destroy(acb-qcrs[i].qiov); +if (acb-qcrs[i].buf) { +qemu_vfree(acb-qcrs[i].buf); +qemu_iovec_destroy(acb-qcrs[i].qiov); +} } } @@ -207,6 +210,7 @@ static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s, acb-count = 0; acb-success_count = 0; acb-rewrite_count = 0; +acb-issued_count = 0; acb-votes.compare = quorum_sha256_compare; QLIST_INIT(acb-votes.vote_list); acb-is_read = false; @@ -286,6 +290,22 @@ static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source) } } +static int next_fifo_child(QuorumAIOCB *acb) +{ +BDRVQuorumState *s = acb-common.bs-opaque; +int i; + +for (i = acb-child_iter; i s-num_children; i++) { +if (!s-bs[i]-broken) { +break; +} +} +if (i == s-num_children) { +return -1; +} +return i; +} + static void quorum_aio_cb(void *opaque, int ret) { QuorumChildRequest *sacb = opaque; @@ -293,11 +313,18 @@ static void quorum_aio_cb(void *opaque, int ret) BDRVQuorumState *s = acb-common.bs-opaque; bool rewrite = false; 
+if (ret 0) { +s-bs[acb-child_iter]-broken = true; +} child_iter is fifo mode stuff. Do we need to write if (s-read_pattern == QUORUM_READ_PATTERN_FIFO ret 0) here ? Probably not. child_iter denotes which bs the QuorumChildRequest belongs to. + if (acb-is_read s-read_pattern == QUORUM_READ_PATTERN_FIFO) { /* We try to read next child in FIFO order if we fail to read */ -if (ret 0 ++acb-child_iter s-num_children) { -read_fifo_child(acb); -return; +if (ret 0) { +acb-child_iter = next_fifo_child(acb); You don't seem to increment child_iter anymore. No, I use a function next_fifo_child() to get the next_fifo_child because right now we have to skip
Re: [Qemu-devel] [PATCH 2/8] block: add driver operation callbacks
On Mon, Sep 01, 2014 at 11:28:22AM +0200, Benoît Canet wrote: The Monday 01 Sep 2014 à 17:19:19 (+0800), Liu Yuan wrote : On Mon, Sep 01, 2014 at 10:28:54AM +0200, Benoît Canet wrote: The Monday 01 Sep 2014 à 15:43:08 (+0800), Liu Yuan wrote : Driver operations are defined as callbacks passed from block upper drivers to lower drivers and are supposed to be called by lower drivers. Requests handling(queuing, submitting, etc.) are done in protocol tier in the block layer and connection states are also maintained down there. Driver operations are supposed to notify the upper tier (such as quorum) of the states changes. For now only two operation are added: driver_disconnect: called when connection is off driver_reconnect: called when connection is on after disconnection Which are used to notify upper tier of the connection state. Cc: Eric Blake ebl...@redhat.com Cc: Benoit Canet ben...@irqsave.net Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block.c | 7 +++ include/block/block.h | 7 +++ include/block/block_int.h | 3 +++ 3 files changed, 17 insertions(+) diff --git a/block.c b/block.c index c12b8de..22eb3e4 100644 --- a/block.c +++ b/block.c @@ -2152,6 +2152,13 @@ void bdrv_set_dev_ops(BlockDriverState *bs, const BlockDevOps *ops, bs-dev_opaque = opaque; } +void bdrv_set_drv_ops(BlockDriverState *bs, const BlockDrvOps *ops, + void *opaque) +{ +bs-drv_ops = ops; +bs-drv_opaque = opaque; We need to be very carefull of the mix between these fields and the infamous bdrv_swap function. Also I don't know if driver operations is the right name since the BlockDriver structure's callback could also be named driver operations. BlockDrvierState has a device operation for callbacks from devices. So I choose driver operation. So any sugguestion for better name? From what I see in this series the job of these callbacks is to send a message or a signal to the upper BDS. Also the name must reflect it goes from the child to the parent. child_signals ? child_messages ? As far as I see, put child in the name will make it too quorum centric. Since it is operation in BlockDriverState, we need to keep it as generic as we could. These operations [here we mean disconnect() and reconnect(), but probably later some other will add more opeartions] are passed from 'upper driver' to protocol driver [in the code we call the protocol as 'file' driver, a narrow name too], so I chose to name it as 'driver operation'. If we can rename 'file' as protocol, include file, nfs, sheepdog, etc, such as bdrv_create_file - bdrv_create_protocol bs.file - bs.protocol then the 'driver operation' here would sound better. Thanks Yuan
Re: [Qemu-devel] [PATCH 8/8] quorum: add basic device recovery logic
On Mon, Sep 01, 2014 at 11:37:20AM +0200, Benoît Canet wrote: The Monday 01 Sep 2014 à 15:43:14 (+0800), Liu Yuan wrote : For some configuration, quorum allow VMs to continue while some child devices are broken and when the child devices are repaired and return back, we need to sync dirty bits during downtime to keep data consistency. The recovery logic is based on the driver state bitmap and will sync the dirty bits with a timeslice window in a coroutine in this prtimive implementation. Simple graph about 2 children with threshold=1 and read-pattern=fifo: + denote device sync iteration - IO on a single device = IO on two devices sync complete, release dirty bitmap ^ | -++== | | | v | device repaired and begin to sync v device broken, create a dirty bitmap This sync logic can take care of nested broken problem, that devices are broken while in sync. We just start a sync process after the devices are repaired again and switch the devices from broken to sound only when the sync completes. For read-pattern=quorum mode, it enjoys the recovery logic without any problem. Cc: Eric Blake ebl...@redhat.com Cc: Benoit Canet ben...@irqsave.net Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block/quorum.c | 189 - trace-events | 5 ++ 2 files changed, 191 insertions(+), 3 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index 7b07e35..ffd7c2d 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -23,6 +23,7 @@ #include qapi/qmp/qlist.h #include qapi/qmp/qstring.h #include qapi-event.h +#include trace.h #define HASH_LENGTH 32 @@ -31,6 +32,10 @@ #define QUORUM_OPT_REWRITErewrite-corrupted #define QUORUM_OPT_READ_PATTERN read-pattern +#define SLICE_TIME 1ULL /* 100 ms */ +#define CHUNK_SIZE (1 20) /* 1M */ +#define SECTORS_PER_CHUNK (CHUNK_SIZE BDRV_SECTOR_BITS) + /* This union holds a vote hash value */ typedef union QuorumVoteValue { char h[HASH_LENGTH]; /* SHA-256 hash */ @@ -64,6 +69,7 @@ typedef struct QuorumVotes { /* the following structure holds the state of one quorum instance */ typedef struct BDRVQuorumState { +BlockDriverState *mybs;/* Quorum block driver base state */ BlockDriverState **bs; /* children BlockDriverStates */ int num_children; /* children count */ int threshold; /* if less than threshold children reads gave the @@ -82,6 +88,10 @@ typedef struct BDRVQuorumState { */ QuorumReadPattern read_pattern; +BdrvDirtyBitmap *dirty_bitmap; +uint8_t *sync_buf; +HBitmapIter hbi; +int64_t sector_num; } BDRVQuorumState; typedef struct QuorumAIOCB QuorumAIOCB; @@ -290,12 +300,11 @@ static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source) } } -static int next_fifo_child(QuorumAIOCB *acb) +static int get_good_child(BDRVQuorumState *s, int iter) { -BDRVQuorumState *s = acb-common.bs-opaque; int i; -for (i = acb-child_iter; i s-num_children; i++) { +for (i = iter; i s-num_children; i++) { if (!s-bs[i]-broken) { break; } @@ -306,6 +315,13 @@ static int next_fifo_child(QuorumAIOCB *acb) return i; } +static int next_fifo_child(QuorumAIOCB *acb) +{ +BDRVQuorumState *s = acb-common.bs-opaque; + +return get_good_child(s, acb-child_iter); +} + static void quorum_aio_cb(void *opaque, int ret) { QuorumChildRequest *sacb = opaque; @@ -951,6 +967,171 @@ static int parse_read_pattern(const char *opt) return -EINVAL; } +static void sync_prepare(BDRVQuorumState *qs, int64_t *num) +{ +int64_t nb, total = bdrv_nb_sectors(qs-mybs); + +qs-sector_num = hbitmap_iter_next(qs-hbi); +/* Wrap around if previous bits get dirty 
while syncing */ +if (qs-sector_num 0) { +bdrv_dirty_iter_init(qs-mybs, qs-dirty_bitmap, qs-hbi); +qs-sector_num = hbitmap_iter_next(qs-hbi); +assert(qs-sector_num = 0); +} + +for (nb = 1; nb SECTORS_PER_CHUNK qs-sector_num + nb total; + nb++) { +if (!bdrv_get_dirty(qs-mybs, qs-dirty_bitmap, qs-sector_num + nb)) { +break; +} +} +*num = nb; +} + +static void sync_finish(BDRVQuorumState *qs, int64_t num) +{ +int64_t i; + +for (i = 0; i num; i++) { +/* We
Re: [Qemu-devel] [PATCH 5/8] quorum: fix quorum_aio_cancel()
On Mon, Sep 01, 2014 at 11:32:04AM +0200, Benoît Canet wrote: The Monday 01 Sep 2014 à 17:26:09 (+0800), Liu Yuan wrote : On Mon, Sep 01, 2014 at 10:35:27AM +0200, Benoît Canet wrote: The Monday 01 Sep 2014 à 15:43:11 (+0800), Liu Yuan wrote : For a fifo read pattern, we only have one running aio (possible other cases that has less number than num_children in the future) I have trouble understanding this part of the commit message could you try to clarify it ? Until this patch, we have two cases, read single or read all. But later patch allow VMs to continue if some devices are down. So the discrete number 1 and N becomes a range [1, N], that is possible running requests are from 1 to N. Why not (In some other cases some children of the num_children BDS could be disabled reducing the number of requests needed) ? Sounds better, I'll take yours, thanks! Yuan
[Qemu-devel] [PATCH] block: kill tail whitespace in block.c
Cc: Kevin Wolf <kw...@redhat.com>
Cc: Stefan Hajnoczi <stefa...@redhat.com>
Signed-off-by: Liu Yuan <namei.u...@gmail.com>
---
 block.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block.c b/block.c
index e9380f6..c12b8de 100644
--- a/block.c
+++ b/block.c
@@ -2239,7 +2239,7 @@ int bdrv_commit(BlockDriverState *bs)
 
     if (!drv)
         return -ENOMEDIUM;
-    
+
     if (!bs->backing_hd) {
         return -ENOTSUP;
     }
-- 
1.9.1
Re: [Qemu-devel] [PATCH v6 0/2] add read-pattern for block qourum
On Mon, Aug 25, 2014 at 10:46:21AM +0800, Liu Yuan wrote: On Mon, Aug 18, 2014 at 05:41:03PM +0800, Liu Yuan wrote: v6: - fix a unused warning introduced by last version Hi Stefan and Kevin, Benoît Canet has added Reviewed-by tag to both patches, could one of you pick this patch set? ping
[Qemu-devel] [PATCH] sheepdog: fix a core dump while do auto-reconnecting
We should reinit local_err as NULL inside the while loop or g_free() will report corruption and abort QEMU when the sheepdog driver tries reconnecting.

qemu-system-x86_64: failed to get the header, Resource temporarily unavailable
qemu-system-x86_64: Failed to connect to socket: Connection refused
qemu-system-x86_64: (null)
[xcb] Unknown sequence number while awaiting reply
[xcb] Most likely this is a multi-threaded client and XInitThreads has not been called
[xcb] Aborting, sorry about that.
qemu-system-x86_64: ../../src/xcb_io.c:298: poll_for_response: Assertion `!xcb_xlib_threads_sequence_lost' failed.
Aborted (core dumped)

Cc: qemu-devel@nongnu.org
Cc: Markus Armbruster <arm...@redhat.com>
Cc: Kevin Wolf <kw...@redhat.com>
Cc: Stefan Hajnoczi <stefa...@redhat.com>
Signed-off-by: Liu Yuan <namei.u...@gmail.com>
---
 block/sheepdog.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index 3af8743..a3a19b1 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -712,7 +712,6 @@ static void coroutine_fn send_pending_req(BDRVSheepdogState *s, uint64_t oid)
 
 static coroutine_fn void reconnect_to_sdog(void *opaque)
 {
-    Error *local_err = NULL;
     BDRVSheepdogState *s = opaque;
     AIOReq *aio_req, *next;
     BlockDriverState *bs = s->bs;
@@ -731,6 +730,7 @@ static coroutine_fn void reconnect_to_sdog(void *opaque)
 
     /* Try to reconnect the sheepdog server every one second. */
     while (s->fd < 0) {
+        Error *local_err = NULL;
         s->fd = get_sheep_fd(s, &local_err);
         if (s->fd < 0) {
             DPRINTF("Wait for connection to be established\n");
-- 
1.9.1
[Qemu-devel] [PATCH v2] sheepdog: fix a core dump while do auto-reconnecting
We should reinit local_err as NULL inside the while loop or g_free() will report corruption and abort QEMU when the sheepdog driver tries reconnecting. This was broken in commit 356b4ca.

qemu-system-x86_64: failed to get the header, Resource temporarily unavailable
qemu-system-x86_64: Failed to connect to socket: Connection refused
qemu-system-x86_64: (null)
[xcb] Unknown sequence number while awaiting reply
[xcb] Most likely this is a multi-threaded client and XInitThreads has not been called
[xcb] Aborting, sorry about that.
qemu-system-x86_64: ../../src/xcb_io.c:298: poll_for_response: Assertion `!xcb_xlib_threads_sequence_lost' failed.
Aborted (core dumped)

Cc: qemu-devel@nongnu.org
Cc: Markus Armbruster <arm...@redhat.com>
Cc: Kevin Wolf <kw...@redhat.com>
Cc: Stefan Hajnoczi <stefa...@redhat.com>
Reviewed-by: Markus Armbruster <arm...@redhat.com>
Signed-off-by: Liu Yuan <namei.u...@gmail.com>
---
 block/sheepdog.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index 12cbd9d..53c24d6 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -712,7 +712,6 @@ static void coroutine_fn send_pending_req(BDRVSheepdogState *s, uint64_t oid)
 
 static coroutine_fn void reconnect_to_sdog(void *opaque)
 {
-    Error *local_err = NULL;
     BDRVSheepdogState *s = opaque;
     AIOReq *aio_req, *next;
 
@@ -727,6 +726,7 @@ static coroutine_fn void reconnect_to_sdog(void *opaque)
 
     /* Try to reconnect the sheepdog server every one second. */
     while (s->fd < 0) {
+        Error *local_err = NULL;
         s->fd = get_sheep_fd(s, &local_err);
         if (s->fd < 0) {
             DPRINTF("Wait for connection to be established\n");
-- 
1.9.1
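QEMU specifics aside, the bug is the classic pattern of an error object outliving the retry loop that allocates it. The self-contained toy loop below (plain malloc'd strings, not QEMU's Error API) shows why the declaration has to move inside the loop.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a call that either succeeds or allocates an error. */
static int try_connect(int attempt, char **errp)
{
    if (attempt < 3) {
        *errp = strdup("Failed to connect to socket: Connection refused");
        return -1;
    }
    return 42; /* pretend this is the new fd */
}

int main(void)
{
    int fd = -1;

    for (int attempt = 0; fd < 0; attempt++) {
        char *local_err = NULL;   /* reset on every iteration, as in the fix */

        fd = try_connect(attempt, &local_err);
        if (fd < 0) {
            fprintf(stderr, "%s\n", local_err);
            free(local_err);
            /* If local_err lived outside the loop and were not reset to
             * NULL, the next failing iteration would overwrite a dangling
             * pointer and free it again: the corruption the patch fixes. */
        }
    }
    printf("connected, fd=%d\n", fd);
    return 0;
}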
Re: [Qemu-devel] [PATCH v6 0/2] add read-pattern for block qourum
On Thu, Aug 28, 2014 at 11:33:08AM +0100, Stefan Hajnoczi wrote: On Mon, Aug 18, 2014 at 05:41:03PM +0800, Liu Yuan wrote: v6: - fix a unused warning introduced by last version v5: - simplify a for loop in quorum_aio_finalize() v4: - swap the patch order - update comment for fifo pattern in qaip - use qapi enumeration in quorum driver instead of manual parsing v3: - separate patch into two, one for quorum and one for qapi for easier review - add enumeration for quorum read pattern - remove unrelated blank line fix from this patch set v2: - rename single as 'fifo' - rename read_all_children as read_quorum_children - fix quorum_aio_finalize() for fifo pattern This patch adds single read pattern to quorum driver and quorum vote is default pattern. For now we do a quorum vote on all the reads, it is designed for unreliable underlying storage such as non-redundant NFS to make sure data integrity at the cost of the read performance. For some use cases as following: VM -- || vv AB Both A and B has hardware raid storage to justify the data integrity on its own. So it would help performance if we do a single read instead of on all the nodes. Further, if we run VM on either of the storage node, we can make a local read request for better performance. This patch generalize the above 2 nodes case in the N nodes. That is, vm - write to all the N nodes, read just one of them. If single read fails, we try to read next node in FIFO order specified by the startup command. The 2 nodes case is very similar to DRBD[1] though lack of auto-sync functionality in the single device/node failure for now. But compared with DRBD we still have some advantages over it: - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed storage. And if A crashes, we need to restart all the VMs on node B. But for practice case, we can't because B might not have enough resources to setup 20 VMs at once. So if we run our 20 VMs with quorum driver, and scatter the replicated images over the data center, we can very likely restart 20 VMs without any resource problem. After all, I think we can build a more powerful replicated image functionality on quorum and block jobs(block mirror) to meet various High Availibility needs. E.g, Enable single read pattern on 2 children, -drive driver=quorum,children.0.file.filename=0.qcow2,\ children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1 [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device Cc: Benoit Canet ben...@irqsave.net Cc: Eric Blake ebl...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Liu Yuan (2): qapi: add read-pattern enum for quorum block/quorum: add simple read pattern support block/quorum.c | 177 +-- qapi/block-core.json | 20 +- 2 files changed, 148 insertions(+), 49 deletions(-) I dropped the \n from the error_setg() error message while merging. Please do not use \n with error_setg(). Thanks for your fix. Please extend the quorum qemu-iotests to cover the new fifo read pattern. You can send the tests as a separate patch series. Okay, will do later. Yuan
Re: [Qemu-devel] [PATCH v6 0/2] add read-pattern for block qourum
On Mon, Aug 18, 2014 at 05:41:03PM +0800, Liu Yuan wrote: v6: - fix a unused warning introduced by last version Hi Stefan and Kevin, Benoît Canet has added Reviewed-by tag to both patches, could one of you pick this patch set? Thanks Yuan
Re: [Qemu-devel] [PATCH v5 2/2] block/quorum: add simple read pattern support
On Mon, Aug 18, 2014 at 01:59:28PM +0800, Liu Yuan wrote: On Fri, Aug 15, 2014 at 03:59:04PM +0200, Benoît Canet wrote: The Friday 15 Aug 2014 à 13:05:17 (+0800), Liu Yuan wrote : This patch adds single read pattern to quorum driver and quorum vote is default pattern. For now we do a quorum vote on all the reads, it is designed for unreliable underlying storage such as non-redundant NFS to make sure data integrity at the cost of the read performance. For some use cases as following: VM -- || vv AB Both A and B has hardware raid storage to justify the data integrity on its own. So it would help performance if we do a single read instead of on all the nodes. Further, if we run VM on either of the storage node, we can make a local read request for better performance. This patch generalize the above 2 nodes case in the N nodes. That is, vm - write to all the N nodes, read just one of them. If single read fails, we try to read next node in FIFO order specified by the startup command. The 2 nodes case is very similar to DRBD[1] though lack of auto-sync functionality in the single device/node failure for now. But compared with DRBD we still have some advantages over it: - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed storage. And if A crashes, we need to restart all the VMs on node B. But for practice case, we can't because B might not have enough resources to setup 20 VMs at once. So if we run our 20 VMs with quorum driver, and scatter the replicated images over the data center, we can very likely restart 20 VMs without any resource problem. After all, I think we can build a more powerful replicated image functionality on quorum and block jobs(block mirror) to meet various High Availibility needs. E.g, Enable single read pattern on 2 children, -drive driver=quorum,children.0.file.filename=0.qcow2,\ children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1 [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device Cc: Benoit Canet ben...@irqsave.net Cc: Eric Blake ebl...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block/quorum.c | 176 ++--- 1 file changed, 129 insertions(+), 47 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index d5ee9c0..1235d7c 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -24,6 +24,7 @@ #define QUORUM_OPT_VOTE_THRESHOLD vote-threshold #define QUORUM_OPT_BLKVERIFY blkverify #define QUORUM_OPT_REWRITErewrite-corrupted +#define QUORUM_OPT_READ_PATTERN read-pattern /* This union holds a vote hash value */ typedef union QuorumVoteValue { @@ -74,6 +75,8 @@ typedef struct BDRVQuorumState { bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted * block if Quorum is reached. */ + +QuorumReadPattern read_pattern; } BDRVQuorumState; typedef struct QuorumAIOCB QuorumAIOCB; @@ -117,6 +120,7 @@ struct QuorumAIOCB { bool is_read; int vote_ret; +int child_iter; /* which child to read in fifo pattern */ I don't understand what fifo pattern could mean for a bunch of disk as they are not forming a queue. Naming isn't 100% accurate but as in Eric's comment (see below), both FIFO and Round-Robin can be used for two different patterns. Maybe round-robin is more suitable but your code does not implement round-robin since it will alway start from the first disk. Your code is scanning the disks set it's a scan pattern. That said is it a problem that the first disk will be accessed more often than the other ? 
As my commit log documented, the purpose of the read pattern I added is to speed up read against quorum original read pattern. And the use case is clear (I hope so) and you can take DRBD as a good example for why we need it. Of course we are far away from DRBD, which need a recovery logic after all kinds of failures. My patch set can be taken as a prelimitary step to implement a DRBD like service driver. Eric previously commented on two read patterns that might be useful: Should we offer multiple modes in addition to 'quorum'? For example, I could see a difference between 'fifo' (favor read from the first quorum member always, unless it fails, good when the first member is local and other member is remote) and 'round-robin' (evenly distribute reads; each read goes to the next available quorum member, good when all members are equally distant
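The two candidate patterns Eric describes differ only in how the next child is chosen. The following self-contained illustration contrasts both selection policies; the series implements only the fifo one, and the helper names here are made up.

#include <stdio.h>

#define NUM_CHILDREN 3

/* fifo: always prefer the lowest-numbered healthy child (good when
 * child 0 is local and the others are remote). */
static int pick_fifo(const int healthy[NUM_CHILDREN])
{
    for (int i = 0; i < NUM_CHILDREN; i++) {
        if (healthy[i]) {
            return i;
        }
    }
    return -1;
}

/* round-robin: rotate across healthy children to spread the read load
 * (good when all children are equally distant). */
static int pick_round_robin(const int healthy[NUM_CHILDREN], int *next)
{
    for (int tries = 0; tries < NUM_CHILDREN; tries++) {
        int i = (*next + tries) % NUM_CHILDREN;
        if (healthy[i]) {
            *next = (i + 1) % NUM_CHILDREN;
            return i;
        }
    }
    return -1;
}

int main(void)
{
    int healthy[NUM_CHILDREN] = { 1, 1, 1 };
    int next = 0;
    int a, b, c;

    printf("fifo: %d %d %d\n", pick_fifo(healthy), pick_fifo(healthy),
           pick_fifo(healthy));                 /* 0 0 0 */

    a = pick_round_robin(healthy, &next);
    b = pick_round_robin(healthy, &next);
    c = pick_round_robin(healthy, &next);
    printf("rr:   %d %d %d\n", a, b, c);        /* 0 1 2 */
    return 0;
}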
Re: [Qemu-devel] [PATCH v5 2/2] block/quorum: add simple read pattern support
On Fri, Aug 15, 2014 at 03:59:04PM +0200, Benoît Canet wrote: The Friday 15 Aug 2014 à 13:05:17 (+0800), Liu Yuan wrote : This patch adds single read pattern to quorum driver and quorum vote is default pattern. For now we do a quorum vote on all the reads, it is designed for unreliable underlying storage such as non-redundant NFS to make sure data integrity at the cost of the read performance. For some use cases as following: VM -- || vv AB Both A and B has hardware raid storage to justify the data integrity on its own. So it would help performance if we do a single read instead of on all the nodes. Further, if we run VM on either of the storage node, we can make a local read request for better performance. This patch generalize the above 2 nodes case in the N nodes. That is, vm - write to all the N nodes, read just one of them. If single read fails, we try to read next node in FIFO order specified by the startup command. The 2 nodes case is very similar to DRBD[1] though lack of auto-sync functionality in the single device/node failure for now. But compared with DRBD we still have some advantages over it: - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed storage. And if A crashes, we need to restart all the VMs on node B. But for practice case, we can't because B might not have enough resources to setup 20 VMs at once. So if we run our 20 VMs with quorum driver, and scatter the replicated images over the data center, we can very likely restart 20 VMs without any resource problem. After all, I think we can build a more powerful replicated image functionality on quorum and block jobs(block mirror) to meet various High Availibility needs. E.g, Enable single read pattern on 2 children, -drive driver=quorum,children.0.file.filename=0.qcow2,\ children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1 [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device Cc: Benoit Canet ben...@irqsave.net Cc: Eric Blake ebl...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block/quorum.c | 176 ++--- 1 file changed, 129 insertions(+), 47 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index d5ee9c0..1235d7c 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -24,6 +24,7 @@ #define QUORUM_OPT_VOTE_THRESHOLD vote-threshold #define QUORUM_OPT_BLKVERIFY blkverify #define QUORUM_OPT_REWRITErewrite-corrupted +#define QUORUM_OPT_READ_PATTERN read-pattern /* This union holds a vote hash value */ typedef union QuorumVoteValue { @@ -74,6 +75,8 @@ typedef struct BDRVQuorumState { bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted * block if Quorum is reached. */ + +QuorumReadPattern read_pattern; } BDRVQuorumState; typedef struct QuorumAIOCB QuorumAIOCB; @@ -117,6 +120,7 @@ struct QuorumAIOCB { bool is_read; int vote_ret; +int child_iter; /* which child to read in fifo pattern */ I don't understand what fifo pattern could mean for a bunch of disk as they are not forming a queue. Naming isn't 100% accurate but as in Eric's comment (see below), both FIFO and Round-Robin can be used for two different patterns. Maybe round-robin is more suitable but your code does not implement round-robin since it will alway start from the first disk. Your code is scanning the disks set it's a scan pattern. That said is it a problem that the first disk will be accessed more often than the other ? 
As my commit log documented, the purpose of the read pattern I added is to speed up read against quorum original read pattern. And the use case is clear (I hope so) and you can take DRBD as a good example for why we need it. Of course we are far away from DRBD, which need a recovery logic after all kinds of failures. My patch set can be taken as a prelimitary step to implement a DRBD like service driver. Eric previously commented on two read patterns that might be useful: Should we offer multiple modes in addition to 'quorum'? For example, I could see a difference between 'fifo' (favor read from the first quorum member always, unless it fails, good when the first member is local and other member is remote) and 'round-robin' (evenly distribute reads; each read goes to the next available quorum member, good when all members are equally distant). You will have to care to insert disks in different order on each QEMU to spread the load. This is another use case that my patch set didn't try to solve. Shouldn't the code try to spread
[Qemu-devel] [PATCH v6 1/2] qapi: add read-pattern enum for quorum
Cc: Eric Blake <ebl...@redhat.com>
Reviewed-by: Eric Blake <ebl...@redhat.com>
Reviewed-by: Benoît Canet <benoit.ca...@nodalink.com>
Signed-off-by: Liu Yuan <namei.u...@gmail.com>
---
 qapi/block-core.json | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index e378653..42033d9 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1384,6 +1384,19 @@
       'raw': 'BlockdevRef' } }
 
 ##
+# @QuorumReadPattern
+#
+# An enumeration of quorum read patterns.
+#
+# @quorum: read all the children and do a quorum vote on reads
+#
+# @fifo: read only from the first child that has not failed
+#
+# Since: 2.2
+##
+{ 'enum': 'QuorumReadPattern', 'data': [ 'quorum', 'fifo' ] }
+
+##
 # @BlockdevOptionsQuorum
 #
 # Driver specific block device options for Quorum
@@ -1398,12 +1411,17 @@
 # @rewrite-corrupted: #optional rewrite corrupted data when quorum is reached
 #                     (Since 2.1)
 #
+# @read-pattern: #optional choose read pattern and set to quorum by default
+#                (Since 2.2)
+#
 # Since: 2.0
 ##
 { 'type': 'BlockdevOptionsQuorum',
   'data': { '*blkverify': 'bool',
             'children': [ 'BlockdevRef' ],
-            'vote-threshold': 'int', '*rewrite-corrupted': 'bool' } }
+            'vote-threshold': 'int',
+            '*rewrite-corrupted': 'bool',
+            '*read-pattern': 'QuorumReadPattern' } }
 
 ##
 # @BlockdevOptions
-- 
1.9.1
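On the driver side the new enum replaces manual string parsing; the standalone sketch below mirrors the shape of the driver's parse_read_pattern(). In QEMU the enum and its name table are generated from qapi/block-core.json, so the local copies here are stand-ins for the generated code and are only for illustration.

#include <string.h>
#include <errno.h>
#include <stdio.h>

/* Standalone mirror of the QAPI enum above, for illustration only. */
typedef enum QuorumReadPattern {
    QUORUM_READ_PATTERN_QUORUM = 0,
    QUORUM_READ_PATTERN_FIFO,
    QUORUM_READ_PATTERN_MAX,
} QuorumReadPattern;

static const char *const quorum_read_pattern_names[] = {
    [QUORUM_READ_PATTERN_QUORUM] = "quorum",
    [QUORUM_READ_PATTERN_FIFO]   = "fifo",
};

static int parse_read_pattern(const char *opt)
{
    if (!opt) {
        return QUORUM_READ_PATTERN_QUORUM;  /* quorum is the default */
    }
    for (int i = 0; i < QUORUM_READ_PATTERN_MAX; i++) {
        if (!strcmp(opt, quorum_read_pattern_names[i])) {
            return i;
        }
    }
    return -EINVAL;
}

int main(void)
{
    printf("%d %d %d\n", parse_read_pattern(NULL), parse_read_pattern("fifo"),
           parse_read_pattern("bogus"));
    /* prints "0 1 -22" on Linux, where EINVAL is 22 */
    return 0;
}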
[Qemu-devel] [PATCH v6 0/2] add read-pattern for block qourum
v6: - fix a unused warning introduced by last version v5: - simplify a for loop in quorum_aio_finalize() v4: - swap the patch order - update comment for fifo pattern in qaip - use qapi enumeration in quorum driver instead of manual parsing v3: - separate patch into two, one for quorum and one for qapi for easier review - add enumeration for quorum read pattern - remove unrelated blank line fix from this patch set v2: - rename single as 'fifo' - rename read_all_children as read_quorum_children - fix quorum_aio_finalize() for fifo pattern This patch adds single read pattern to quorum driver and quorum vote is default pattern. For now we do a quorum vote on all the reads, it is designed for unreliable underlying storage such as non-redundant NFS to make sure data integrity at the cost of the read performance. For some use cases as following: VM -- || vv AB Both A and B has hardware raid storage to justify the data integrity on its own. So it would help performance if we do a single read instead of on all the nodes. Further, if we run VM on either of the storage node, we can make a local read request for better performance. This patch generalize the above 2 nodes case in the N nodes. That is, vm - write to all the N nodes, read just one of them. If single read fails, we try to read next node in FIFO order specified by the startup command. The 2 nodes case is very similar to DRBD[1] though lack of auto-sync functionality in the single device/node failure for now. But compared with DRBD we still have some advantages over it: - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed storage. And if A crashes, we need to restart all the VMs on node B. But for practice case, we can't because B might not have enough resources to setup 20 VMs at once. So if we run our 20 VMs with quorum driver, and scatter the replicated images over the data center, we can very likely restart 20 VMs without any resource problem. After all, I think we can build a more powerful replicated image functionality on quorum and block jobs(block mirror) to meet various High Availibility needs. E.g, Enable single read pattern on 2 children, -drive driver=quorum,children.0.file.filename=0.qcow2,\ children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1 [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device Cc: Benoit Canet ben...@irqsave.net Cc: Eric Blake ebl...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Liu Yuan (2): qapi: add read-pattern enum for quorum block/quorum: add simple read pattern support block/quorum.c | 177 +-- qapi/block-core.json | 20 +- 2 files changed, 148 insertions(+), 49 deletions(-) -- 1.9.1
[Qemu-devel] [PATCH v6 2/2] block/quorum: add simple read pattern support
This patch adds single read pattern to quorum driver and quorum vote is default pattern. For now we do a quorum vote on all the reads, it is designed for unreliable underlying storage such as non-redundant NFS to make sure data integrity at the cost of the read performance. For some use cases as following: VM -- || vv AB Both A and B has hardware raid storage to justify the data integrity on its own. So it would help performance if we do a single read instead of on all the nodes. Further, if we run VM on either of the storage node, we can make a local read request for better performance. This patch generalize the above 2 nodes case in the N nodes. That is, vm - write to all the N nodes, read just one of them. If single read fails, we try to read next node in FIFO order specified by the startup command. The 2 nodes case is very similar to DRBD[1] though lack of auto-sync functionality in the single device/node failure for now. But compared with DRBD we still have some advantages over it: - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed storage. And if A crashes, we need to restart all the VMs on node B. But for practice case, we can't because B might not have enough resources to setup 20 VMs at once. So if we run our 20 VMs with quorum driver, and scatter the replicated images over the data center, we can very likely restart 20 VMs without any resource problem. After all, I think we can build a more powerful replicated image functionality on quorum and block jobs(block mirror) to meet various High Availibility needs. E.g, Enable single read pattern on 2 children, -drive driver=quorum,children.0.file.filename=0.qcow2,\ children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1 [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device Cc: Benoit Canet ben...@irqsave.net Cc: Eric Blake ebl...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block/quorum.c | 177 + 1 file changed, 129 insertions(+), 48 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index d5ee9c0..c5f1b35 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -24,6 +24,7 @@ #define QUORUM_OPT_VOTE_THRESHOLD vote-threshold #define QUORUM_OPT_BLKVERIFY blkverify #define QUORUM_OPT_REWRITErewrite-corrupted +#define QUORUM_OPT_READ_PATTERN read-pattern /* This union holds a vote hash value */ typedef union QuorumVoteValue { @@ -74,6 +75,8 @@ typedef struct BDRVQuorumState { bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted * block if Quorum is reached. 
*/ + +QuorumReadPattern read_pattern; } BDRVQuorumState; typedef struct QuorumAIOCB QuorumAIOCB; @@ -117,6 +120,7 @@ struct QuorumAIOCB { bool is_read; int vote_ret; +int child_iter; /* which child to read in fifo pattern */ }; static bool quorum_vote(QuorumAIOCB *acb); @@ -143,7 +147,6 @@ static AIOCBInfo quorum_aiocb_info = { static void quorum_aio_finalize(QuorumAIOCB *acb) { -BDRVQuorumState *s = acb-common.bs-opaque; int i, ret = 0; if (acb-vote_ret) { @@ -153,7 +156,8 @@ static void quorum_aio_finalize(QuorumAIOCB *acb) acb-common.cb(acb-common.opaque, ret); if (acb-is_read) { -for (i = 0; i s-num_children; i++) { +/* on the quorum case acb-child_iter == s-num_children - 1 */ +for (i = 0; i = acb-child_iter; i++) { qemu_vfree(acb-qcrs[i].buf); qemu_iovec_destroy(acb-qcrs[i].qiov); } @@ -256,6 +260,21 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret) quorum_aio_finalize(acb); } +static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb); + +static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source) +{ +int i; +assert(dest-niov == source-niov); +assert(dest-size == source-size); +for (i = 0; i source-niov; i++) { +assert(dest-iov[i].iov_len == source-iov[i].iov_len); +memcpy(dest-iov[i].iov_base, + source-iov[i].iov_base, + source-iov[i].iov_len); +} +} + static void quorum_aio_cb(void *opaque, int ret) { QuorumChildRequest *sacb = opaque; @@ -263,6 +282,21 @@ static void quorum_aio_cb(void *opaque, int ret) BDRVQuorumState *s = acb-common.bs-opaque; bool rewrite = false; +if (acb-is_read s-read_pattern == QUORUM_READ_PATTERN_FIFO) { +/* We try to read next child in FIFO order if we fail to read */ +if (ret 0 ++acb-child_iter s-num_children) { +read_fifo_child(acb); +return; +} + +if (ret == 0) { +quorum_copy_qiov(acb-qiov, acb-qcrs[acb-child_iter].qiov); +} +acb-vote_ret = ret
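A self-contained sketch of the fifo pattern described above, stripped of QEMU's AIO machinery (the read_child callback and its signature are illustrative, not QEMU APIs): try the children in the configured order and return the first successful read, reporting failure only when every child fails.

#include <stddef.h>

typedef int (*read_child_fn)(int child_index, void *buf, size_t len);

static int fifo_read(read_child_fn read_child, int num_children,
                     void *buf, size_t len)
{
    int ret = -1;
    int i;

    for (i = 0; i < num_children; i++) {
        ret = read_child(i, buf, len);
        if (ret == 0) {
            return 0;   /* first healthy child wins */
        }
        /* on failure, fall through to the next child in FIFO order */
    }
    return ret;         /* every child failed; return the last error */
}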
Re: [Qemu-devel] [PATCH v4 2/2] block/quorum: add simple read pattern support
On Tue, Aug 12, 2014 at 10:41:28AM +0800, Liu Yuan wrote: On Mon, Aug 11, 2014 at 02:31:43PM +0200, Benoît Canet wrote: The Thursday 17 Jul 2014 à 13:18:56 (+0800), Liu Yuan wrote : This patch adds single read pattern to quorum driver and quorum vote is default pattern. For now we do a quorum vote on all the reads, it is designed for unreliable underlying storage such as non-redundant NFS to make sure data integrity at the cost of the read performance. For some use cases as following: VM -- || vv AB Both A and B has hardware raid storage to justify the data integrity on its own. So it would help performance if we do a single read instead of on all the nodes. Further, if we run VM on either of the storage node, we can make a local read request for better performance. This patch generalize the above 2 nodes case in the N nodes. That is, vm - write to all the N nodes, read just one of them. If single read fails, we try to read next node in FIFO order specified by the startup command. The 2 nodes case is very similar to DRBD[1] though lack of auto-sync functionality in the single device/node failure for now. But compared with DRBD we still have some advantages over it: - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed storage. And if A crashes, we need to restart all the VMs on node B. But for practice case, we can't because B might not have enough resources to setup 20 VMs at once. So if we run our 20 VMs with quorum driver, and scatter the replicated images over the data center, we can very likely restart 20 VMs without any resource problem. After all, I think we can build a more powerful replicated image functionality on quorum and block jobs(block mirror) to meet various High Availibility needs. E.g, Enable single read pattern on 2 children, -drive driver=quorum,children.0.file.filename=0.qcow2,\ children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1 [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device Cc: Benoit Canet ben...@irqsave.net Cc: Eric Blake ebl...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block/quorum.c | 179 + 1 file changed, 131 insertions(+), 48 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index d5ee9c0..ebf5c71 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -24,6 +24,7 @@ #define QUORUM_OPT_VOTE_THRESHOLD vote-threshold #define QUORUM_OPT_BLKVERIFY blkverify #define QUORUM_OPT_REWRITErewrite-corrupted +#define QUORUM_OPT_READ_PATTERN read-pattern /* This union holds a vote hash value */ typedef union QuorumVoteValue { @@ -74,6 +75,8 @@ typedef struct BDRVQuorumState { bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted * block if Quorum is reached. 
*/ + +QuorumReadPattern read_pattern; } BDRVQuorumState; typedef struct QuorumAIOCB QuorumAIOCB; @@ -117,6 +120,7 @@ struct QuorumAIOCB { bool is_read; int vote_ret; +int child_iter; /* which child to read in fifo pattern */ }; static bool quorum_vote(QuorumAIOCB *acb); @@ -154,8 +158,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb) if (acb-is_read) { for (i = 0; i s-num_children; i++) { -qemu_vfree(acb-qcrs[i].buf); -qemu_iovec_destroy(acb-qcrs[i].qiov); +if (i = acb-child_iter) { +qemu_vfree(acb-qcrs[i].buf); +qemu_iovec_destroy(acb-qcrs[i].qiov); +} } } @@ -256,6 +262,21 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret) quorum_aio_finalize(acb); } +static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb); + +static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source) +{ +int i; +assert(dest-niov == source-niov); +assert(dest-size == source-size); +for (i = 0; i source-niov; i++) { +assert(dest-iov[i].iov_len == source-iov[i].iov_len); +memcpy(dest-iov[i].iov_base, + source-iov[i].iov_base, + source-iov[i].iov_len); +} +} + static void quorum_aio_cb(void *opaque, int ret) { QuorumChildRequest *sacb = opaque; @@ -263,6 +284,21 @@ static void quorum_aio_cb(void *opaque, int ret) BDRVQuorumState *s = acb-common.bs-opaque; bool rewrite = false
Re: [Qemu-devel] [PATCH v3 0/2] sheepdog driver update related to VDI locking feature
On Mon, Aug 11, 2014 at 02:43:44PM +0900, Hitoshi Mitake wrote: Recently, sheepdog revived its VDI locking functionality. This patchset updates the sheepdog driver of QEMU for this feature. v3: - keep backward compatibility v2: - don't handle SD_RES_VDI_NOT_LOCKED as a special case Hitoshi Mitake (2): sheepdog: adopting protocol update for VDI locking sheepdog: improve error handling for a case of failed lock block/sheepdog.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) Reviewed-by: Liu Yuan namei.u...@gmail.com
Re: [Qemu-devel] [PATCH v4 2/2] block/quorum: add simple read pattern support
On Thu, Aug 14, 2014 at 01:09:32PM +0200, Benoît Canet wrote: The Thursday 17 Jul 2014 à 13:18:56 (+0800), Liu Yuan wrote : This patch adds single read pattern to quorum driver and quorum vote is default pattern. For now we do a quorum vote on all the reads, it is designed for unreliable underlying storage such as non-redundant NFS to make sure data integrity at the cost of the read performance. For some use cases as following: VM -- || vv AB Both A and B has hardware raid storage to justify the data integrity on its own. So it would help performance if we do a single read instead of on all the nodes. Further, if we run VM on either of the storage node, we can make a local read request for better performance. This patch generalize the above 2 nodes case in the N nodes. That is, vm - write to all the N nodes, read just one of them. If single read fails, we try to read next node in FIFO order specified by the startup command. The 2 nodes case is very similar to DRBD[1] though lack of auto-sync functionality in the single device/node failure for now. But compared with DRBD we still have some advantages over it: - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed storage. And if A crashes, we need to restart all the VMs on node B. But for practice case, we can't because B might not have enough resources to setup 20 VMs at once. So if we run our 20 VMs with quorum driver, and scatter the replicated images over the data center, we can very likely restart 20 VMs without any resource problem. After all, I think we can build a more powerful replicated image functionality on quorum and block jobs(block mirror) to meet various High Availibility needs. E.g, Enable single read pattern on 2 children, -drive driver=quorum,children.0.file.filename=0.qcow2,\ children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1 [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device Cc: Benoit Canet ben...@irqsave.net Cc: Eric Blake ebl...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block/quorum.c | 179 + 1 file changed, 131 insertions(+), 48 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index d5ee9c0..ebf5c71 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -24,6 +24,7 @@ #define QUORUM_OPT_VOTE_THRESHOLD vote-threshold #define QUORUM_OPT_BLKVERIFY blkverify #define QUORUM_OPT_REWRITErewrite-corrupted +#define QUORUM_OPT_READ_PATTERN read-pattern /* This union holds a vote hash value */ typedef union QuorumVoteValue { @@ -74,6 +75,8 @@ typedef struct BDRVQuorumState { bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted * block if Quorum is reached. */ + +QuorumReadPattern read_pattern; } BDRVQuorumState; typedef struct QuorumAIOCB QuorumAIOCB; @@ -117,6 +120,7 @@ struct QuorumAIOCB { bool is_read; int vote_ret; +int child_iter; /* which child to read in fifo pattern */ }; static bool quorum_vote(QuorumAIOCB *acb); @@ -154,8 +158,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb) if (acb-is_read) { for (i = 0; i s-num_children; i++) { -qemu_vfree(acb-qcrs[i].buf); -qemu_iovec_destroy(acb-qcrs[i].qiov); +if (i = acb-child_iter) { +qemu_vfree(acb-qcrs[i].buf); +qemu_iovec_destroy(acb-qcrs[i].qiov); +} This seems convoluted. What about ? /* on the quorum case acb-child_iter == s-num_children - 1 */ for (i = 0; i = acb-child_iter; i++) { qemu_vfree(acb-qcrs[i].buf); qemu_iovec_destroy(acb-qcrs[i].qiov); } } Sounds good. I'll update the patch v5. 
Thanks Yuan
[Qemu-devel] [PATCH v5 0/2] add read-pattern for block quorum
v5: - simplify a for loop in quorum_aio_finalize() v4: - swap the patch order - update comment for fifo pattern in qaip - use qapi enumeration in quorum driver instead of manual parsing v3: - separate patch into two, one for quorum and one for qapi for easier review - add enumeration for quorum read pattern - remove unrelated blank line fix from this patch set v2: - rename single as 'fifo' - rename read_all_children as read_quorum_children - fix quorum_aio_finalize() for fifo pattern This patch adds single read pattern to quorum driver and quorum vote is default pattern. For now we do a quorum vote on all the reads, it is designed for unreliable underlying storage such as non-redundant NFS to make sure data integrity at the cost of the read performance. For some use cases as following: VM -- || vv AB Both A and B has hardware raid storage to justify the data integrity on its own. So it would help performance if we do a single read instead of on all the nodes. Further, if we run VM on either of the storage node, we can make a local read request for better performance. This patch generalize the above 2 nodes case in the N nodes. That is, vm - write to all the N nodes, read just one of them. If single read fails, we try to read next node in FIFO order specified by the startup command. The 2 nodes case is very similar to DRBD[1] though lack of auto-sync functionality in the single device/node failure for now. But compared with DRBD we still have some advantages over it: - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed storage. And if A crashes, we need to restart all the VMs on node B. But for practice case, we can't because B might not have enough resources to setup 20 VMs at once. So if we run our 20 VMs with quorum driver, and scatter the replicated images over the data center, we can very likely restart 20 VMs without any resource problem. After all, I think we can build a more powerful replicated image functionality on quorum and block jobs(block mirror) to meet various High Availibility needs. E.g, Enable single read pattern on 2 children, -drive driver=quorum,children.0.file.filename=0.qcow2,\ children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1 [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device Cc: Benoit Canet ben...@irqsave.net Cc: Eric Blake ebl...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com Liu Yuan (2): qapi: add read-pattern enum for quorum block/quorum: add simple read pattern support block/quorum.c | 176 +-- qapi/block-core.json | 20 +- 2 files changed, 148 insertions(+), 48 deletions(-) -- 1.9.1
[Qemu-devel] [PATCH v5 2/2] block/quorum: add simple read pattern support
This patch adds single read pattern to quorum driver and quorum vote is default pattern. For now we do a quorum vote on all the reads, it is designed for unreliable underlying storage such as non-redundant NFS to make sure data integrity at the cost of the read performance. For some use cases as following: VM -- || vv AB Both A and B has hardware raid storage to justify the data integrity on its own. So it would help performance if we do a single read instead of on all the nodes. Further, if we run VM on either of the storage node, we can make a local read request for better performance. This patch generalize the above 2 nodes case in the N nodes. That is, vm - write to all the N nodes, read just one of them. If single read fails, we try to read next node in FIFO order specified by the startup command. The 2 nodes case is very similar to DRBD[1] though lack of auto-sync functionality in the single device/node failure for now. But compared with DRBD we still have some advantages over it: - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed storage. And if A crashes, we need to restart all the VMs on node B. But for practice case, we can't because B might not have enough resources to setup 20 VMs at once. So if we run our 20 VMs with quorum driver, and scatter the replicated images over the data center, we can very likely restart 20 VMs without any resource problem. After all, I think we can build a more powerful replicated image functionality on quorum and block jobs(block mirror) to meet various High Availibility needs. E.g, Enable single read pattern on 2 children, -drive driver=quorum,children.0.file.filename=0.qcow2,\ children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1 [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device Cc: Benoit Canet ben...@irqsave.net Cc: Eric Blake ebl...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block/quorum.c | 176 ++--- 1 file changed, 129 insertions(+), 47 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index d5ee9c0..1235d7c 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -24,6 +24,7 @@ #define QUORUM_OPT_VOTE_THRESHOLD vote-threshold #define QUORUM_OPT_BLKVERIFY blkverify #define QUORUM_OPT_REWRITErewrite-corrupted +#define QUORUM_OPT_READ_PATTERN read-pattern /* This union holds a vote hash value */ typedef union QuorumVoteValue { @@ -74,6 +75,8 @@ typedef struct BDRVQuorumState { bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted * block if Quorum is reached. 
*/ + +QuorumReadPattern read_pattern; } BDRVQuorumState; typedef struct QuorumAIOCB QuorumAIOCB; @@ -117,6 +120,7 @@ struct QuorumAIOCB { bool is_read; int vote_ret; +int child_iter; /* which child to read in fifo pattern */ }; static bool quorum_vote(QuorumAIOCB *acb); @@ -153,7 +157,8 @@ static void quorum_aio_finalize(QuorumAIOCB *acb) acb-common.cb(acb-common.opaque, ret); if (acb-is_read) { -for (i = 0; i s-num_children; i++) { +/* on the quorum case acb-child_iter == s-num_children - 1 */ +for (i = 0; i = acb-child_iter; i++) { qemu_vfree(acb-qcrs[i].buf); qemu_iovec_destroy(acb-qcrs[i].qiov); } @@ -256,6 +261,21 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret) quorum_aio_finalize(acb); } +static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb); + +static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source) +{ +int i; +assert(dest-niov == source-niov); +assert(dest-size == source-size); +for (i = 0; i source-niov; i++) { +assert(dest-iov[i].iov_len == source-iov[i].iov_len); +memcpy(dest-iov[i].iov_base, + source-iov[i].iov_base, + source-iov[i].iov_len); +} +} + static void quorum_aio_cb(void *opaque, int ret) { QuorumChildRequest *sacb = opaque; @@ -263,6 +283,21 @@ static void quorum_aio_cb(void *opaque, int ret) BDRVQuorumState *s = acb-common.bs-opaque; bool rewrite = false; +if (acb-is_read s-read_pattern == QUORUM_READ_PATTERN_FIFO) { +/* We try to read next child in FIFO order if we fail to read */ +if (ret 0 ++acb-child_iter s-num_children) { +read_fifo_child(acb); +return; +} + +if (ret == 0) { +quorum_copy_qiov(acb-qiov, acb-qcrs[acb-child_iter].qiov); +} +acb-vote_ret = ret; +quorum_aio_finalize(acb); +return; +} + sacb-ret = ret; acb-count++; if (ret == 0) { @@ -343,19 +378,6 @@ static bool quorum_rewrite_bad_versions(BDRVQuorumState *s, QuorumAIOCB *acb
[Qemu-devel] [PATCH v5 1/2] qapi: add read-pattern enum for quorum
Cc: Eric Blake ebl...@redhat.com Reviewed-by: Eric Blake ebl...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- qapi/block-core.json | 20 +++- 1 file changed, 19 insertions(+), 1 deletion(-) diff --git a/qapi/block-core.json b/qapi/block-core.json index e378653..42033d9 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -1384,6 +1384,19 @@ 'raw': 'BlockdevRef' } } ## +# @QuorumReadPattern +# +# An enumeration of quorum read patterns. +# +# @quorum: read all the children and do a quorum vote on reads +# +# @fifo: read only from the first child that has not failed +# +# Since: 2.2 +## +{ 'enum': 'QuorumReadPattern', 'data': [ 'quorum', 'fifo' ] } + +## # @BlockdevOptionsQuorum # # Driver specific block device options for Quorum @@ -1398,12 +1411,17 @@ # @rewrite-corrupted: #optional rewrite corrupted data when quorum is reached # (Since 2.1) # +# @read-pattern: #optional choose read pattern and set to quorum by default +#(Since 2.2) +# # Since: 2.0 ## { 'type': 'BlockdevOptionsQuorum', 'data': { '*blkverify': 'bool', 'children': [ 'BlockdevRef' ], -'vote-threshold': 'int', '*rewrite-corrupted': 'bool' } } +'vote-threshold': 'int', +'*rewrite-corrupted': 'bool', +'*read-pattern': 'QuorumReadPattern' } } ## # @BlockdevOptions -- 1.9.1
Re: [Qemu-devel] [PATCH v4 2/2] block/quorum: add simple read pattern support
On Thu, Jul 17, 2014 at 01:18:56PM +0800, Liu Yuan wrote: This patch adds single read pattern to quorum driver and quorum vote is default pattern. For now we do a quorum vote on all the reads, it is designed for unreliable underlying storage such as non-redundant NFS to make sure data integrity at the cost of the read performance. For some use cases as following: VM -- || vv AB Both A and B has hardware raid storage to justify the data integrity on its own. So it would help performance if we do a single read instead of on all the nodes. Further, if we run VM on either of the storage node, we can make a local read request for better performance. This patch generalize the above 2 nodes case in the N nodes. That is, vm - write to all the N nodes, read just one of them. If single read fails, we try to read next node in FIFO order specified by the startup command. The 2 nodes case is very similar to DRBD[1] though lack of auto-sync functionality in the single device/node failure for now. But compared with DRBD we still have some advantages over it: - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed storage. And if A crashes, we need to restart all the VMs on node B. But for practice case, we can't because B might not have enough resources to setup 20 VMs at once. So if we run our 20 VMs with quorum driver, and scatter the replicated images over the data center, we can very likely restart 20 VMs without any resource problem. After all, I think we can build a more powerful replicated image functionality on quorum and block jobs(block mirror) to meet various High Availibility needs. E.g, Enable single read pattern on 2 children, -drive driver=quorum,children.0.file.filename=0.qcow2,\ children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1 [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device Cc: Benoit Canet ben...@irqsave.net Cc: Eric Blake ebl...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block/quorum.c | 179 + 1 file changed, 131 insertions(+), 48 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index d5ee9c0..ebf5c71 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -24,6 +24,7 @@ #define QUORUM_OPT_VOTE_THRESHOLD vote-threshold #define QUORUM_OPT_BLKVERIFY blkverify #define QUORUM_OPT_REWRITErewrite-corrupted +#define QUORUM_OPT_READ_PATTERN read-pattern /* This union holds a vote hash value */ typedef union QuorumVoteValue { @@ -74,6 +75,8 @@ typedef struct BDRVQuorumState { bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted * block if Quorum is reached. 
*/ + +QuorumReadPattern read_pattern; } BDRVQuorumState; typedef struct QuorumAIOCB QuorumAIOCB; @@ -117,6 +120,7 @@ struct QuorumAIOCB { bool is_read; int vote_ret; +int child_iter; /* which child to read in fifo pattern */ }; static bool quorum_vote(QuorumAIOCB *acb); @@ -154,8 +158,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb) if (acb-is_read) { for (i = 0; i s-num_children; i++) { -qemu_vfree(acb-qcrs[i].buf); -qemu_iovec_destroy(acb-qcrs[i].qiov); +if (i = acb-child_iter) { +qemu_vfree(acb-qcrs[i].buf); +qemu_iovec_destroy(acb-qcrs[i].qiov); +} } } @@ -256,6 +262,21 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret) quorum_aio_finalize(acb); } +static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb); + +static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source) +{ +int i; +assert(dest-niov == source-niov); +assert(dest-size == source-size); +for (i = 0; i source-niov; i++) { +assert(dest-iov[i].iov_len == source-iov[i].iov_len); +memcpy(dest-iov[i].iov_base, + source-iov[i].iov_base, + source-iov[i].iov_len); +} +} + static void quorum_aio_cb(void *opaque, int ret) { QuorumChildRequest *sacb = opaque; @@ -263,6 +284,21 @@ static void quorum_aio_cb(void *opaque, int ret) BDRVQuorumState *s = acb-common.bs-opaque; bool rewrite = false; +if (acb-is_read s-read_pattern == QUORUM_READ_PATTERN_FIFO) { +/* We try to read next child in FIFO order if we fail to read */ +if (ret 0 ++acb-child_iter s-num_children) { +read_fifo_child(acb); +return; +} + +if (ret == 0) { +quorum_copy_qiov(acb-qiov, acb-qcrs[acb-child_iter].qiov); +} +acb-vote_ret = ret
Re: [Qemu-devel] [PATCH v4 2/2] block/quorum: add simple read pattern support
On Mon, Aug 11, 2014 at 02:31:43PM +0200, Benoît Canet wrote: The Thursday 17 Jul 2014 à 13:18:56 (+0800), Liu Yuan wrote : This patch adds single read pattern to quorum driver and quorum vote is default pattern. For now we do a quorum vote on all the reads, it is designed for unreliable underlying storage such as non-redundant NFS to make sure data integrity at the cost of the read performance. For some use cases as following: VM -- || vv AB Both A and B has hardware raid storage to justify the data integrity on its own. So it would help performance if we do a single read instead of on all the nodes. Further, if we run VM on either of the storage node, we can make a local read request for better performance. This patch generalize the above 2 nodes case in the N nodes. That is, vm - write to all the N nodes, read just one of them. If single read fails, we try to read next node in FIFO order specified by the startup command. The 2 nodes case is very similar to DRBD[1] though lack of auto-sync functionality in the single device/node failure for now. But compared with DRBD we still have some advantages over it: - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed storage. And if A crashes, we need to restart all the VMs on node B. But for practice case, we can't because B might not have enough resources to setup 20 VMs at once. So if we run our 20 VMs with quorum driver, and scatter the replicated images over the data center, we can very likely restart 20 VMs without any resource problem. After all, I think we can build a more powerful replicated image functionality on quorum and block jobs(block mirror) to meet various High Availibility needs. E.g, Enable single read pattern on 2 children, -drive driver=quorum,children.0.file.filename=0.qcow2,\ children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1 [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device Cc: Benoit Canet ben...@irqsave.net Cc: Eric Blake ebl...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block/quorum.c | 179 + 1 file changed, 131 insertions(+), 48 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index d5ee9c0..ebf5c71 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -24,6 +24,7 @@ #define QUORUM_OPT_VOTE_THRESHOLD vote-threshold #define QUORUM_OPT_BLKVERIFY blkverify #define QUORUM_OPT_REWRITErewrite-corrupted +#define QUORUM_OPT_READ_PATTERN read-pattern /* This union holds a vote hash value */ typedef union QuorumVoteValue { @@ -74,6 +75,8 @@ typedef struct BDRVQuorumState { bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted * block if Quorum is reached. 
*/ + +QuorumReadPattern read_pattern; } BDRVQuorumState; typedef struct QuorumAIOCB QuorumAIOCB; @@ -117,6 +120,7 @@ struct QuorumAIOCB { bool is_read; int vote_ret; +int child_iter; /* which child to read in fifo pattern */ }; static bool quorum_vote(QuorumAIOCB *acb); @@ -154,8 +158,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb) if (acb-is_read) { for (i = 0; i s-num_children; i++) { -qemu_vfree(acb-qcrs[i].buf); -qemu_iovec_destroy(acb-qcrs[i].qiov); +if (i = acb-child_iter) { +qemu_vfree(acb-qcrs[i].buf); +qemu_iovec_destroy(acb-qcrs[i].qiov); +} } } @@ -256,6 +262,21 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret) quorum_aio_finalize(acb); } +static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb); + +static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source) +{ +int i; +assert(dest-niov == source-niov); +assert(dest-size == source-size); +for (i = 0; i source-niov; i++) { +assert(dest-iov[i].iov_len == source-iov[i].iov_len); +memcpy(dest-iov[i].iov_base, + source-iov[i].iov_base, + source-iov[i].iov_len); +} +} + static void quorum_aio_cb(void *opaque, int ret) { QuorumChildRequest *sacb = opaque; @@ -263,6 +284,21 @@ static void quorum_aio_cb(void *opaque, int ret) BDRVQuorumState *s = acb-common.bs-opaque; bool rewrite = false; +if (acb-is_read s-read_pattern == QUORUM_READ_PATTERN_FIFO) { +/* We try to read next child in FIFO order if we fail to read */ +if (ret 0 ++acb-child_iter s-num_children
Re: [Qemu-devel] [PATCH 1/2] sheepdog: adopting protocol update for VDI locking
On Mon, Aug 11, 2014 at 11:17:33AM +0900, Hitoshi Mitake wrote: At Fri, 8 Aug 2014 15:49:37 +0800, Liu Yuan wrote: On Fri, Aug 08, 2014 at 03:12:17PM +0900, Hitoshi Mitake wrote: At Fri, 8 Aug 2014 13:20:39 +0800, Liu Yuan wrote: On Thu, Aug 07, 2014 at 04:28:39PM +0900, Hitoshi Mitake wrote: The update is required for supporting iSCSI multipath. It doesn't affect behavior of QEMU driver but adding a new field to vdi request struct is required. Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Cc: Liu Yuan namei.u...@gmail.com Cc: MORITA Kazutaka morita.kazut...@lab.ntt.co.jp Signed-off-by: Hitoshi Mitake mitake.hito...@lab.ntt.co.jp --- block/sheepdog.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index 8d9350c..36f76f0 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -103,6 +103,9 @@ #define SD_INODE_SIZE (sizeof(SheepdogInode)) #define CURRENT_VDI_ID 0 +#define LOCK_TYPE_NORMAL 1 +#define LOCK_TYPE_SHARED 2 /* for iSCSI multipath */ How about #define LOCK_TYPE_NORMAL 0 #define LOCK_TYPE_SHARED 1 Then we don't need this patch. Since qemu won't make use of multipath for the near future, we should avoid adding stuff related to multipath to qemu driver. (Cc-ing current Kazutaka-san's address) I think this isn't a good idea. Because it means that sheep has an assumption about padding field of the request data struct. This sort of workaround can cause hard to find problems in the future. The point is, how to keep backward compatibilty? E.g, old QEMU with present sheep daemon with lock features. Then QEMU will send 0 instead of 1 to the sheep daemon and based on how you handle the invalid value, these might cause problems. Suppose we have 1 old QEMU and 1 present QEMU who try to start the same image A. Old QEMU will send invalid 0 to sheep daemon. We shouldn't deny it because it can run with old sheep and should run on new sheep too. Then this image A isn't locked due to invalid 0 value. Then present QEMU send correct LOCK signal and will wrongly set up the image. Start with 0 instead of 1, in my option, is reasonable to keep backward compatibility. I don't think so. Because the backward compatibility of the locking functionality is already broken in the far past. As Meng repoted in the sheepdog mailing list, old QEMU can issue locking request to sheep with vid == 0: http://lists.wpkg.org/pipermail/sheepdog/2014-August/015438.html I don't really understand why we can't start with 0 and can't keep backward compatibility. By the way, I think the link has nothing to do with qemu, it is a bug in sheep. locking has two state, one is lock and the other unlock. We choose 0 to mean 'lock' the vdi and 1 to 'unlock' the vdi. So both old and new QEMU issue 0 to lock the image and 'release' request to unlock the image. What is in the way? Even we set the default lock type as 0, the old QEMU cannot issue a correct locking request. why? I'll post a patch for incrementing protocol version number later. But before doing that, I also want to clean DISCARD request. Because this request cannot work with snapshot (not only with the new GC algorithm. The old naive algorithm cannot work with the DISCARD request). If you remove discard, what if users run new qemu with old sheep, which make use of old algorithm and people expect discard to work? I don't think remove operation from protocol is good idea. 
If the sheep daemon can't support discard, you can simply have it return success without doing anything. Thanks, Yuan
Re: [Qemu-devel] [PATCH 1/2] sheepdog: adopting protocol update for VDI locking
On Mon, Aug 11, 2014 at 11:34:56AM +0800, Liu Yuan wrote: On Mon, Aug 11, 2014 at 11:17:33AM +0900, Hitoshi Mitake wrote: At Fri, 8 Aug 2014 15:49:37 +0800, Liu Yuan wrote: On Fri, Aug 08, 2014 at 03:12:17PM +0900, Hitoshi Mitake wrote: At Fri, 8 Aug 2014 13:20:39 +0800, Liu Yuan wrote: On Thu, Aug 07, 2014 at 04:28:39PM +0900, Hitoshi Mitake wrote: The update is required for supporting iSCSI multipath. It doesn't affect behavior of QEMU driver but adding a new field to vdi request struct is required. Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Cc: Liu Yuan namei.u...@gmail.com Cc: MORITA Kazutaka morita.kazut...@lab.ntt.co.jp Signed-off-by: Hitoshi Mitake mitake.hito...@lab.ntt.co.jp --- block/sheepdog.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index 8d9350c..36f76f0 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -103,6 +103,9 @@ #define SD_INODE_SIZE (sizeof(SheepdogInode)) #define CURRENT_VDI_ID 0 +#define LOCK_TYPE_NORMAL 1 +#define LOCK_TYPE_SHARED 2 /* for iSCSI multipath */ How about #define LOCK_TYPE_NORMAL 0 #define LOCK_TYPE_SHARED 1 Then we don't need this patch. Since qemu won't make use of multipath for the near future, we should avoid adding stuff related to multipath to qemu driver. (Cc-ing current Kazutaka-san's address) I think this isn't a good idea. Because it means that sheep has an assumption about padding field of the request data struct. This sort of workaround can cause hard to find problems in the future. The point is, how to keep backward compatibilty? E.g, old QEMU with present sheep daemon with lock features. Then QEMU will send 0 instead of 1 to the sheep daemon and based on how you handle the invalid value, these might cause problems. Suppose we have 1 old QEMU and 1 present QEMU who try to start the same image A. Old QEMU will send invalid 0 to sheep daemon. We shouldn't deny it because it can run with old sheep and should run on new sheep too. Then this image A isn't locked due to invalid 0 value. Then present QEMU send correct LOCK signal and will wrongly set up the image. Start with 0 instead of 1, in my option, is reasonable to keep backward compatibility. I don't think so. Because the backward compatibility of the locking functionality is already broken in the far past. As Meng repoted in the sheepdog mailing list, old QEMU can issue locking request to sheep with vid == 0: http://lists.wpkg.org/pipermail/sheepdog/2014-August/015438.html I don't really understand why we can't start with 0 and can't keep backward compatibility. By the way, I think the link has nothing to do with qemu, it is a bug in sheep. locking has two state, one is lock and the other unlock. We choose 0 to mean 'lock' the vdi and 1 to 'unlock' the vdi. So both old and new QEMU issue 0 to lock the image and 'release' request to unlock the image. What is in the way? Even we set the default lock type as 0, the old QEMU cannot issue a correct locking request. why? I'll post a patch for incrementing protocol version number later. But before doing that, I also want to clean DISCARD request. Because this request cannot work with snapshot (not only with the new GC algorithm. The old naive algorithm cannot work with the DISCARD request). If you remove discard, what if users run new qemu with old sheep, which make use of old algorithm and people expect discard to work? 
Okay, you mean discard can't work with snapshots, but it should work with non-snapshots, so for the users of non-snapshots, people can expect it to work. As stated in my last email, you can handle this problem transparently without modification of protocol. QEMU -- discard[offset, lenght] -- sheep sheep: if req on snapshot return success; else return sheep_handle_discard(). Thanks Yuan
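A hedged sketch of the server-side dispatch suggested above; the struct layout and helper names are hypothetical, not the actual sheepdog source. The idea is that sheep can keep accepting DISCARD unconditionally and only act on it when the target VDI is not a snapshot, so neither old nor new QEMU needs a protocol change:

#include <stdbool.h>
#include <stdint.h>

#define SD_RES_SUCCESS 0   /* placeholder value for the sketch */

struct discard_request {
    uint32_t vid;          /* target VDI id */
    uint64_t offset;
    uint64_t length;
};

/* assumed helpers, declared only for the sketch */
bool vdi_is_snapshot(uint32_t vid);
int sheep_handle_discard(struct discard_request *req);

int handle_discard(struct discard_request *req)
{
    if (vdi_is_snapshot(req->vid)) {
        /* discarding a snapshot is unsafe; report success and do nothing */
        return SD_RES_SUCCESS;
    }
    return sheep_handle_discard(req);
}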
[Qemu-devel] [PATCH] cluster/zookeeper: add log information for zk auto-reconnect
Reported-by: Valerio Pachera siri...@gmail.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- sheep/group.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sheep/group.c b/sheep/group.c index 06a80bd..08e3884 100644 --- a/sheep/group.c +++ b/sheep/group.c @@ -979,7 +979,7 @@ static int send_join_request(void) { struct sd_node *n = &sys->this_node; - sd_info("%s", node_to_str(n)); + sd_info("%s going to rejoin the cluster", node_to_str(n)); return sys->cdrv->join(n, &sys->cinfo, sizeof(sys->cinfo)); } -- 1.9.1
Re: [Qemu-devel] [PATCH 2/2] sheepdog: improve error handling for a case of failed lock
On Fri, Aug 08, 2014 at 03:17:59PM +0900, Hitoshi Mitake wrote: At Fri, 8 Aug 2014 13:31:39 +0800, Liu Yuan wrote: On Thu, Aug 07, 2014 at 04:28:40PM +0900, Hitoshi Mitake wrote: Recently, sheepdog revived its VDI locking functionality. This patch updates sheepdog driver of QEMU for this feature: 1. Improve error message when QEMU fails to acquire lock of VDI. Current sheepdog driver prints an error message VDI isn't locked when it fails to acquire lock. It is a little bit confusing because the mesage says VDI isn't locked but it is actually locked by other VM. This patch modifies this confusing message. 2. Change error code for a case of failed locking. -EBUSY is a suitable one. Reported-by: Valerio Pachera siri...@gmail.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Cc: Liu Yuan namei.u...@gmail.com Cc: MORITA Kazutaka morita.kazut...@lab.ntt.co.jp Signed-off-by: Hitoshi Mitake mitake.hito...@lab.ntt.co.jp --- block/sheepdog.c | 4 1 file changed, 4 insertions(+) diff --git a/block/sheepdog.c b/block/sheepdog.c index 36f76f0..0b3f86d 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -1112,9 +1112,13 @@ static int find_vdi_name(BDRVSheepdogState *s, const char *filename, if (rsp-result != SD_RES_SUCCESS) { error_setg(errp, cannot get vdi info, %s, %s % PRIu32 %s, + rsp-result == SD_RES_VDI_NOT_LOCKED ? I'm puzzled by this check. we use SD_RES_VDI_LOCKED to indicate vid is already locked, no? We use SD_RES_VDI_NOT_LOCKED for indicating locking by this VM fails. + VDI is already locked by other VM : But this message said it was locked by others, and we have SD_RES_VDI_LOCKED for this case. We need fix sheep daemon for this case to return SD_RES_VDI_LOCKED for already locked case and NOT_LOCKED for other sheep internal errors. sd_strerror(rsp-result), filename, snapid, tag); if (rsp-result == SD_RES_NO_VDI) { ret = -ENOENT; +} else if (rsp-result == SD_RES_VDI_NOT_LOCKED) { +ret = -EBUSY; } else { ret = -EIO; } It is better to use switch case to handle the result. using switch statement in this case only increases a number of lines of code: Current change: if (rsp-result == SD_RES_NO_VDI) { ret = -ENOENT; } else if (rsp-result == SD_RES_VDI_NOT_LOCKED) { ... Change with switch: switch (rsp-result) { case SD_RES_NO_VDI: ret = -ENOENT; break; case SD_RES_VDI_NOT_LOCKED: ... The change with switch statement requires one more line for break;. I think if statement is suitable for this case. If you insist on 'if-else' over swtich case, it is fine with me. But I'd suggest switch-case because it looks cleaner and easier to understand if we have more than 2 branches. Thanks Yuan
Re: [Qemu-devel] [PATCH 1/2] sheepdog: adopting protocol update for VDI locking
On Fri, Aug 08, 2014 at 03:12:17PM +0900, Hitoshi Mitake wrote: At Fri, 8 Aug 2014 13:20:39 +0800, Liu Yuan wrote: On Thu, Aug 07, 2014 at 04:28:39PM +0900, Hitoshi Mitake wrote: The update is required for supporting iSCSI multipath. It doesn't affect behavior of QEMU driver but adding a new field to vdi request struct is required. Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Cc: Liu Yuan namei.u...@gmail.com Cc: MORITA Kazutaka morita.kazut...@lab.ntt.co.jp Signed-off-by: Hitoshi Mitake mitake.hito...@lab.ntt.co.jp --- block/sheepdog.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index 8d9350c..36f76f0 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -103,6 +103,9 @@ #define SD_INODE_SIZE (sizeof(SheepdogInode)) #define CURRENT_VDI_ID 0 +#define LOCK_TYPE_NORMAL 1 +#define LOCK_TYPE_SHARED 2 /* for iSCSI multipath */ How about #define LOCK_TYPE_NORMAL 0 #define LOCK_TYPE_SHARED 1 Then we don't need this patch. Since qemu won't make use of multipath for the near future, we should avoid adding stuff related to multipath to qemu driver. (Cc-ing current Kazutaka-san's address) I think this isn't a good idea. Because it means that sheep has an assumption about padding field of the request data struct. This sort of workaround can cause hard to find problems in the future. The point is, how to keep backward compatibilty? E.g, old QEMU with present sheep daemon with lock features. Then QEMU will send 0 instead of 1 to the sheep daemon and based on how you handle the invalid value, these might cause problems. Suppose we have 1 old QEMU and 1 present QEMU who try to start the same image A. Old QEMU will send invalid 0 to sheep daemon. We shouldn't deny it because it can run with old sheep and should run on new sheep too. Then this image A isn't locked due to invalid 0 value. Then present QEMU send correct LOCK signal and will wrongly set up the image. Start with 0 instead of 1, in my option, is reasonable to keep backward compatibility. Thanks Yuan
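A hedged illustration of the compatibility argument above (field layout simplified, not the exact sheepdog wire format): an old QEMU zeroes the whole request before filling it in, so a legacy client implicitly sends whatever lock type the value 0 denotes. Numbering LOCK_TYPE_NORMAL as 0 would therefore let old binaries keep requesting an exclusive lock without any change:

#include <stdint.h>
#include <string.h>

enum { LOCK_TYPE_NORMAL = 0, LOCK_TYPE_SHARED = 1 };  /* proposed numbering */

struct vdi_req_sketch {
    uint8_t  opcode;
    uint32_t type;   /* new field; an old QEMU driver leaves this as 0 */
};

static void old_qemu_builds_lock_request(struct vdi_req_sketch *req,
                                         uint8_t lock_opcode)
{
    memset(req, 0, sizeof(*req));  /* old driver: type stays 0 */
    req->opcode = lock_opcode;     /* with 0 == NORMAL, this is still an
                                    * exclusive lock request */
}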
Re: [Qemu-devel] [PATCH 1/2] sheepdog: adopting protocol update for VDI locking
On Thu, Aug 07, 2014 at 04:28:39PM +0900, Hitoshi Mitake wrote: The update is required for supporting iSCSI multipath. It doesn't affect behavior of QEMU driver but adding a new field to vdi request struct is required. Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Cc: Liu Yuan namei.u...@gmail.com Cc: MORITA Kazutaka morita.kazut...@lab.ntt.co.jp Signed-off-by: Hitoshi Mitake mitake.hito...@lab.ntt.co.jp --- block/sheepdog.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index 8d9350c..36f76f0 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -103,6 +103,9 @@ #define SD_INODE_SIZE (sizeof(SheepdogInode)) #define CURRENT_VDI_ID 0 +#define LOCK_TYPE_NORMAL 1 +#define LOCK_TYPE_SHARED 2 /* for iSCSI multipath */ How about #define LOCK_TYPE_NORMAL 0 #define LOCK_TYPE_SHARED 1 Then we don't need this patch. Since qemu won't make use of multipath for the near future, we should avoid adding stuff related to multipath to qemu driver. Thanks Yuan + typedef struct SheepdogReq { uint8_t proto_ver; uint8_t opcode; @@ -166,7 +169,8 @@ typedef struct SheepdogVdiReq { uint8_t copy_policy; uint8_t reserved[2]; uint32_t snapid; -uint32_t pad[3]; +uint32_t type; +uint32_t pad[2]; } SheepdogVdiReq; typedef struct SheepdogVdiRsp { @@ -1090,6 +1094,7 @@ static int find_vdi_name(BDRVSheepdogState *s, const char *filename, memset(hdr, 0, sizeof(hdr)); if (lock) { hdr.opcode = SD_OP_LOCK_VDI; +hdr.type = LOCK_TYPE_NORMAL; } else { hdr.opcode = SD_OP_GET_VDI_INFO; } @@ -1793,6 +1798,7 @@ static void sd_close(BlockDriverState *bs) memset(hdr, 0, sizeof(hdr)); hdr.opcode = SD_OP_RELEASE_VDI; +hdr.type = LOCK_TYPE_NORMAL; hdr.base_vdi_id = s-inode.vdi_id; wlen = strlen(s-name) + 1; hdr.data_length = wlen; -- 1.8.3.2
Re: [Qemu-devel] [PATCH 2/2] sheepdog: improve error handling for a case of failed lock
On Thu, Aug 07, 2014 at 04:28:40PM +0900, Hitoshi Mitake wrote: Recently, sheepdog revived its VDI locking functionality. This patch updates sheepdog driver of QEMU for this feature: 1. Improve error message when QEMU fails to acquire lock of VDI. Current sheepdog driver prints an error message VDI isn't locked when it fails to acquire lock. It is a little bit confusing because the mesage says VDI isn't locked but it is actually locked by other VM. This patch modifies this confusing message. 2. Change error code for a case of failed locking. -EBUSY is a suitable one. Reported-by: Valerio Pachera siri...@gmail.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Cc: Liu Yuan namei.u...@gmail.com Cc: MORITA Kazutaka morita.kazut...@lab.ntt.co.jp Signed-off-by: Hitoshi Mitake mitake.hito...@lab.ntt.co.jp --- block/sheepdog.c | 4 1 file changed, 4 insertions(+) diff --git a/block/sheepdog.c b/block/sheepdog.c index 36f76f0..0b3f86d 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -1112,9 +1112,13 @@ static int find_vdi_name(BDRVSheepdogState *s, const char *filename, if (rsp-result != SD_RES_SUCCESS) { error_setg(errp, cannot get vdi info, %s, %s % PRIu32 %s, + rsp-result == SD_RES_VDI_NOT_LOCKED ? I'm puzzled by this check. we use SD_RES_VDI_LOCKED to indicate vid is already locked, no? + VDI is already locked by other VM : sd_strerror(rsp-result), filename, snapid, tag); if (rsp-result == SD_RES_NO_VDI) { ret = -ENOENT; +} else if (rsp-result == SD_RES_VDI_NOT_LOCKED) { +ret = -EBUSY; } else { ret = -EIO; } It is better to use switch case to handle the result. Thanks Yuan
Re: [Qemu-devel] [PATCH] block/quorum: implement .bdrv_co_get_block_status
On Fri, Jul 18, 2014 at 03:35:57PM +0200, Paolo Bonzini wrote: On 17/07/2014 13:50, Liu Yuan wrote: - allow drive-mirror to create sparse mirror on images like qcow2 - allow qemu-img map to work as expected on quorum driver Cc: Benoit Canet ben...@irqsave.net Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block/quorum.c | 16 1 file changed, 16 insertions(+) diff --git a/block/quorum.c b/block/quorum.c index ebf5c71..f0d0a98 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -780,6 +780,21 @@ static coroutine_fn int quorum_co_flush(BlockDriverState *bs) return result; } +static int64_t coroutine_fn quorum_co_get_block_status(BlockDriverState *bs, + int64_t sector_num, + int nb_sectors, + int *pnum) +{ +BDRVQuorumState *s = bs->opaque; +BlockDriverState *child_bs = s->bs[0]; + +if (child_bs->drv->bdrv_co_get_block_status) +return child_bs->drv->bdrv_co_get_block_status(child_bs, sector_num, + nb_sectors, pnum); Is this if necessary? Yes, otherwise bdrv_get_block_status() will be called multiple times and the result for qcow2 won't return the desired value in my test. Or we can simply call bdrv_get_block_status() plus some tricks? +return bdrv_get_block_status(child_bs, sector_num, nb_sectors, pnum); Also, the definition of BDRV_BLOCK_OFFSET_VALID explicitly refers to bs->file, so you probably have to exclude it from the result as well as BDRV_BLOCK_RAW. Thanks for reminding. Yuan
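A stand-alone sketch of the masking Paolo suggests above (the flag values are placeholders, not QEMU's real BDRV_BLOCK_* definitions): a filter driver that forwards get_block_status from one of its children should strip the flags that are only meaningful relative to that child's bs->file before returning the status to its own caller.

#include <stdint.h>

#define BLOCK_RAW          (1u << 0)  /* placeholder for BDRV_BLOCK_RAW */
#define BLOCK_OFFSET_VALID (1u << 1)  /* placeholder for BDRV_BLOCK_OFFSET_VALID */

static int64_t filter_block_status(int64_t child_status)
{
    /* keep allocation/zero information, drop child-relative offset flags */
    return child_status & ~(int64_t)(BLOCK_RAW | BLOCK_OFFSET_VALID);
}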
[Qemu-devel] [PATCH] block/quorum: implement .bdrv_co_get_block_status
- allow drive-mirror to create sparse mirror on images like qcow2 - allow qemu-img map to work as expected on quorum driver Cc: Benoit Canet ben...@irqsave.net Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block/quorum.c | 16 1 file changed, 16 insertions(+) diff --git a/block/quorum.c b/block/quorum.c index ebf5c71..f0d0a98 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -780,6 +780,21 @@ static coroutine_fn int quorum_co_flush(BlockDriverState *bs) return result; } +static int64_t coroutine_fn quorum_co_get_block_status(BlockDriverState *bs, + int64_t sector_num, + int nb_sectors, + int *pnum) +{ +BDRVQuorumState *s = bs->opaque; +BlockDriverState *child_bs = s->bs[0]; + +if (child_bs->drv->bdrv_co_get_block_status) +return child_bs->drv->bdrv_co_get_block_status(child_bs, sector_num, + nb_sectors, pnum); + +return bdrv_get_block_status(child_bs, sector_num, nb_sectors, pnum); +} + static bool quorum_recurse_is_first_non_filter(BlockDriverState *bs, BlockDriverState *candidate) { @@ -1038,6 +1053,7 @@ static BlockDriver bdrv_quorum = { .bdrv_close = quorum_close, .bdrv_co_flush_to_disk = quorum_co_flush, +.bdrv_co_get_block_status = quorum_co_get_block_status, .bdrv_getlength = quorum_getlength, -- 1.9.1
[Qemu-devel] [PATCH v4 0/2] add read-pattern for block quorum
v4: - swap the patch order - update comment for fifo pattern in qaip - use qapi enumeration in quorum driver instead of manual parsing v3: - separate patch into two, one for quorum and one for qapi for easier review - add enumeration for quorum read pattern - remove unrelated blank line fix from this patch set v2: - rename single as 'fifo' - rename read_all_children as read_quorum_children - fix quorum_aio_finalize() for fifo pattern This patch adds single read pattern to quorum driver and quorum vote is default pattern. For now we do a quorum vote on all the reads, it is designed for unreliable underlying storage such as non-redundant NFS to make sure data integrity at the cost of the read performance. For some use cases as following: VM -- || vv AB Both A and B has hardware raid storage to justify the data integrity on its own. So it would help performance if we do a single read instead of on all the nodes. Further, if we run VM on either of the storage node, we can make a local read request for better performance. This patch generalize the above 2 nodes case in the N nodes. That is, vm - write to all the N nodes, read just one of them. If single read fails, we try to read next node in FIFO order specified by the startup command. The 2 nodes case is very similar to DRBD[1] though lack of auto-sync functionality in the single device/node failure for now. But compared with DRBD we still have some advantages over it: - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed storage. And if A crashes, we need to restart all the VMs on node B. But for practice case, we can't because B might not have enough resources to setup 20 VMs at once. So if we run our 20 VMs with quorum driver, and scatter the replicated images over the data center, we can very likely restart 20 VMs without any resource problem. After all, I think we can build a more powerful replicated image functionality on quorum and block jobs(block mirror) to meet various High Availibility needs. E.g, Enable single read pattern on 2 children, -drive driver=quorum,children.0.file.filename=0.qcow2,\ children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1 [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device Liu Yuan (2): qapi: add read-pattern support for quorum block/quorum: add simple read pattern support block/quorum.c | 181 +-- qapi/block-core.json | 19 +- 2 files changed, 149 insertions(+), 51 deletions(-) -- 1.9.1
[Qemu-devel] [PATCH v4 2/2] block/quorum: add simple read pattern support
This patch adds single read pattern to quorum driver and quorum vote is default pattern. For now we do a quorum vote on all the reads, it is designed for unreliable underlying storage such as non-redundant NFS to make sure data integrity at the cost of the read performance. For some use cases as following: VM -- || vv AB Both A and B has hardware raid storage to justify the data integrity on its own. So it would help performance if we do a single read instead of on all the nodes. Further, if we run VM on either of the storage node, we can make a local read request for better performance. This patch generalize the above 2 nodes case in the N nodes. That is, vm - write to all the N nodes, read just one of them. If single read fails, we try to read next node in FIFO order specified by the startup command. The 2 nodes case is very similar to DRBD[1] though lack of auto-sync functionality in the single device/node failure for now. But compared with DRBD we still have some advantages over it: - Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed storage. And if A crashes, we need to restart all the VMs on node B. But for practice case, we can't because B might not have enough resources to setup 20 VMs at once. So if we run our 20 VMs with quorum driver, and scatter the replicated images over the data center, we can very likely restart 20 VMs without any resource problem. After all, I think we can build a more powerful replicated image functionality on quorum and block jobs(block mirror) to meet various High Availibility needs. E.g, Enable single read pattern on 2 children, -drive driver=quorum,children.0.file.filename=0.qcow2,\ children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1 [1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device Cc: Benoit Canet ben...@irqsave.net Cc: Eric Blake ebl...@redhat.com Cc: Kevin Wolf kw...@redhat.com Cc: Stefan Hajnoczi stefa...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- block/quorum.c | 179 + 1 file changed, 131 insertions(+), 48 deletions(-) diff --git a/block/quorum.c b/block/quorum.c index d5ee9c0..ebf5c71 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -24,6 +24,7 @@ #define QUORUM_OPT_VOTE_THRESHOLD vote-threshold #define QUORUM_OPT_BLKVERIFY blkverify #define QUORUM_OPT_REWRITErewrite-corrupted +#define QUORUM_OPT_READ_PATTERN read-pattern /* This union holds a vote hash value */ typedef union QuorumVoteValue { @@ -74,6 +75,8 @@ typedef struct BDRVQuorumState { bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted * block if Quorum is reached. 
*/ + +QuorumReadPattern read_pattern; } BDRVQuorumState; typedef struct QuorumAIOCB QuorumAIOCB; @@ -117,6 +120,7 @@ struct QuorumAIOCB { bool is_read; int vote_ret; +int child_iter; /* which child to read in fifo pattern */ }; static bool quorum_vote(QuorumAIOCB *acb); @@ -154,8 +158,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb) if (acb-is_read) { for (i = 0; i s-num_children; i++) { -qemu_vfree(acb-qcrs[i].buf); -qemu_iovec_destroy(acb-qcrs[i].qiov); +if (i = acb-child_iter) { +qemu_vfree(acb-qcrs[i].buf); +qemu_iovec_destroy(acb-qcrs[i].qiov); +} } } @@ -256,6 +262,21 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret) quorum_aio_finalize(acb); } +static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb); + +static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source) +{ +int i; +assert(dest-niov == source-niov); +assert(dest-size == source-size); +for (i = 0; i source-niov; i++) { +assert(dest-iov[i].iov_len == source-iov[i].iov_len); +memcpy(dest-iov[i].iov_base, + source-iov[i].iov_base, + source-iov[i].iov_len); +} +} + static void quorum_aio_cb(void *opaque, int ret) { QuorumChildRequest *sacb = opaque; @@ -263,6 +284,21 @@ static void quorum_aio_cb(void *opaque, int ret) BDRVQuorumState *s = acb-common.bs-opaque; bool rewrite = false; +if (acb-is_read s-read_pattern == QUORUM_READ_PATTERN_FIFO) { +/* We try to read next child in FIFO order if we fail to read */ +if (ret 0 ++acb-child_iter s-num_children) { +read_fifo_child(acb); +return; +} + +if (ret == 0) { +quorum_copy_qiov(acb-qiov, acb-qcrs[acb-child_iter].qiov); +} +acb-vote_ret = ret; +quorum_aio_finalize(acb); +return; +} + sacb-ret = ret; acb-count++; if (ret == 0) { @@ -343,19 +379,6 @@ static bool quorum_rewrite_bad_versions(BDRVQuorumState *s, QuorumAIOCB *acb
[Qemu-devel] [PATCH v4 1/2] qapi: add read-pattern enum for quorum
Cc: Eric Blake ebl...@redhat.com Signed-off-by: Liu Yuan namei.u...@gmail.com --- qapi/block-core.json | 20 +++- 1 file changed, 19 insertions(+), 1 deletion(-) diff --git a/qapi/block-core.json b/qapi/block-core.json index e378653..42033d9 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -1384,6 +1384,19 @@ 'raw': 'BlockdevRef' } } ## +# @QuorumReadPattern +# +# An enumeration of quorum read patterns. +# +# @quorum: read all the children and do a quorum vote on reads +# +# @fifo: read only from the first child that has not failed +# +# Since: 2.2 +## +{ 'enum': 'QuorumReadPattern', 'data': [ 'quorum', 'fifo' ] } + +## # @BlockdevOptionsQuorum # # Driver specific block device options for Quorum @@ -1398,12 +1411,17 @@ # @rewrite-corrupted: #optional rewrite corrupted data when quorum is reached # (Since 2.1) # +# @read-pattern: #optional choose read pattern and set to quorum by default +#(Since 2.2) +# # Since: 2.0 ## { 'type': 'BlockdevOptionsQuorum', 'data': { '*blkverify': 'bool', 'children': [ 'BlockdevRef' ], -'vote-threshold': 'int', '*rewrite-corrupted': 'bool' } } +'vote-threshold': 'int', +'*rewrite-corrupted': 'bool', +'*read-pattern': 'QuorumReadPattern' } } ## # @BlockdevOptions -- 1.9.1
[Qemu-devel] [PATCH v3 2/2] qapi: add read-pattern support for quorum
Cc: Eric Blake ebl...@redhat.com
Signed-off-by: Liu Yuan namei.u...@gmail.com
---
 qapi/block-core.json | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index e378653..22491bc 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1384,6 +1384,18 @@
     'raw': 'BlockdevRef' } }
 
 ##
+# @QuorumReadPattern
+# An enumeration of quorum read pattern.
+#
+# @quorum: read all the children and do a quorum vote on reads
+#
+# @fifo: read a single child and try next child in FIFO order if read fails.
+#
+# Since: 2.2
+##
+{ 'enum': 'QuorumReadPattern', 'data': [ 'quorum', 'fifo' ] }
+
+##
 # @BlockdevOptionsQuorum
 #
 # Driver specific block device options for Quorum
@@ -1398,12 +1410,17 @@
 # @rewrite-corrupted: #optional rewrite corrupted data when quorum is reached
 #                     (Since 2.1)
 #
+# @read-pattern: #optional choose read pattern and set to quorum by default
+#                (Since 2.2)
+#
 # Since: 2.0
 ##
 { 'type': 'BlockdevOptionsQuorum',
   'data': { '*blkverify': 'bool',
             'children': [ 'BlockdevRef' ],
-            'vote-threshold': 'int', '*rewrite-corrupted': 'bool' } }
+            'vote-threshold': 'int',
+            '*rewrite-corrupted': 'bool',
+            '*read-pattern': 'QuorumReadPattern' } }
 
 ##
 # @BlockdevOptions
-- 
1.9.1
[Qemu-devel] [PATCH v3 1/2] block/quorum: add simple read pattern support
This patch adds a single-read pattern to the quorum driver; quorum vote remains the default pattern. Currently we do a quorum vote on all reads; this is designed for unreliable underlying storage such as non-redundant NFS, to ensure data integrity at the cost of read performance. Consider the following use case:

            VM
          ------
          |    |
          v    v
          A    B

Both A and B have hardware RAID storage that guarantees data integrity on its own, so it helps performance to issue a single read instead of reading from all the nodes. Further, if the VM runs on either of the storage nodes, the read can be served locally for better performance.

This patch generalizes the above 2-node case to N nodes: the VM writes to all N nodes and reads from just one of them. If the single read fails, we try the next node in the FIFO order specified by the startup command. The 2-node case is very similar to DRBD[1], though it lacks auto-sync functionality after a single device/node failure for now. But compared with DRBD we still have some advantages:

- Suppose we have 20 VMs running on one node (assume A) of 2-node DRBD-backed storage. If A crashes, we need to restart all the VMs on node B. In practice we often can't, because B might not have enough resources to run 20 VMs at once. If we instead run our 20 VMs with the quorum driver and scatter the replicated images over the data center, we can very likely restart the 20 VMs without any resource problem.

In the long run, I think we can build more powerful replicated-image functionality on top of quorum and block jobs (block mirror) to meet various High Availability needs.

E.g., enable the single-read pattern on 2 children:

-drive driver=quorum,children.0.file.filename=0.qcow2,\
children.1.file.filename=1.qcow2,read-pattern=single,vote-threshold=1

[1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device

Cc: Benoit Canet ben...@irqsave.net
Cc: Kevin Wolf kw...@redhat.com
Cc: Stefan Hajnoczi stefa...@redhat.com
Signed-off-by: Liu Yuan namei.u...@gmail.com
---
 block/quorum.c | 181 ++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 131 insertions(+), 50 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index d5ee9c0..1443fc5 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -24,6 +24,7 @@
 #define QUORUM_OPT_VOTE_THRESHOLD "vote-threshold"
 #define QUORUM_OPT_BLKVERIFY      "blkverify"
 #define QUORUM_OPT_REWRITE        "rewrite-corrupted"
+#define QUORUM_OPT_READ_PATTERN   "read-pattern"
 
 /* This union holds a vote hash value */
 typedef union QuorumVoteValue {
@@ -74,6 +75,16 @@ typedef struct BDRVQuorumState {
     bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
                             * block if Quorum is reached.
                             */
+
+#define READ_PATTERN_QUORUM 0 /* default */
+#define READ_PATTERN_FIFO   1
+    int read_pattern;          /* fifo: read a single child and try first one
+                                * first. If error, try next child in an
+                                * FIFO order specified by command line.
+                                * Return error if no child read succeeds.
+                                * quorum: read all the children and do a quorum
+                                * vote on reads.
+                                */
 } BDRVQuorumState;
 
 typedef struct QuorumAIOCB QuorumAIOCB;
@@ -117,6 +128,7 @@ struct QuorumAIOCB {
 
     bool is_read;
     int vote_ret;
+    int child_iter;             /* which child to read in fifo pattern */
 };
 
 static bool quorum_vote(QuorumAIOCB *acb);
@@ -154,8 +166,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
 
     if (acb->is_read) {
         for (i = 0; i < s->num_children; i++) {
-            qemu_vfree(acb->qcrs[i].buf);
-            qemu_iovec_destroy(&acb->qcrs[i].qiov);
+            if (i <= acb->child_iter) {
+                qemu_vfree(acb->qcrs[i].buf);
+                qemu_iovec_destroy(&acb->qcrs[i].qiov);
+            }
         }
     }
 
@@ -256,6 +270,21 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret)
     quorum_aio_finalize(acb);
 }
 
+static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb);
+
+static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
+{
+    int i;
+    assert(dest->niov == source->niov);
+    assert(dest->size == source->size);
+    for (i = 0; i < source->niov; i++) {
+        assert(dest->iov[i].iov_len == source->iov[i].iov_len);
+        memcpy(dest->iov[i].iov_base,
+               source->iov[i].iov_base,
+               source->iov[i].iov_len);
+    }
+}
+
 static void quorum_aio_cb(void *opaque, int ret)
 {
     QuorumChildRequest *sacb = opaque;
@@ -263,6 +292,21 @@ static void quorum_aio_cb(void *opaque, int ret)
     BDRVQuorumState *s = acb->common.bs->opaque;
     bool rewrite = false;
 
+    if (acb->is_read && s->read_pattern == READ_PATTERN_FIFO) {
+        /* We try to read next child in FIFO order if we fail to read
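The quorum_copy_qiov() helper added above simply copies the winning child's scatter/gather buffers into the guest-visible vector. A self-contained sketch of the same idea using plain struct iovec (rather than QEMU's QEMUIOVector) is shown below; it is an illustration, not the driver code.

#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>

/* Copy every segment of 'source' into the matching segment of 'dest'.
 * Both vectors must have the same shape, as the asserts in the patch enforce. */
static void copy_iov(struct iovec *dest, const struct iovec *source, int niov)
{
    for (int i = 0; i < niov; i++) {
        assert(dest[i].iov_len == source[i].iov_len);
        memcpy(dest[i].iov_base, source[i].iov_base, source[i].iov_len);
    }
}

int main(void)
{
    char src_buf[4] = "abc", dst_buf[4] = "";
    struct iovec src = { .iov_base = src_buf, .iov_len = sizeof(src_buf) };
    struct iovec dst = { .iov_base = dst_buf, .iov_len = sizeof(dst_buf) };

    copy_iov(&dst, &src, 1);
    printf("copied: %s\n", dst_buf);
    return 0;
}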
[Qemu-devel] [PATCH v3 0/2] add read-pattern for block quorum
v3:
 - separate the patch into two, one for quorum and one for qapi, for easier review
 - add an enumeration for the quorum read pattern
 - remove the unrelated blank-line fix from this patch set
v2:
 - rename 'single' as 'fifo'
 - rename read_all_children as read_quorum_children
 - fix quorum_aio_finalize() for the fifo pattern

This patch set adds a single-read pattern to the quorum driver; quorum vote remains the default pattern. Currently we do a quorum vote on all reads; this is designed for unreliable underlying storage such as non-redundant NFS, to ensure data integrity at the cost of read performance. Consider the following use case:

            VM
          ------
          |    |
          v    v
          A    B

Both A and B have hardware RAID storage that guarantees data integrity on its own, so it helps performance to issue a single read instead of reading from all the nodes. Further, if the VM runs on either of the storage nodes, the read can be served locally for better performance.

This patch set generalizes the above 2-node case to N nodes: the VM writes to all N nodes and reads from just one of them. If the single read fails, we try the next node in the FIFO order specified by the startup command. The 2-node case is very similar to DRBD[1], though it lacks auto-sync functionality after a single device/node failure for now. But compared with DRBD we still have some advantages:

- Suppose we have 20 VMs running on one node (assume A) of 2-node DRBD-backed storage. If A crashes, we need to restart all the VMs on node B. In practice we often can't, because B might not have enough resources to run 20 VMs at once. If we instead run our 20 VMs with the quorum driver and scatter the replicated images over the data center, we can very likely restart the 20 VMs without any resource problem.

In the long run, I think we can build more powerful replicated-image functionality on top of quorum and block jobs (block mirror) to meet various High Availability needs.

E.g., enable the single-read pattern on 2 children:

-drive driver=quorum,children.0.file.filename=0.qcow2,\
children.1.file.filename=1.qcow2,read-pattern=single,vote-threshold=1

[1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device

Liu Yuan (2):
  block/quorum: add simple read pattern support
  qapi: add read-pattern support for quorum

 block/quorum.c       | 181 ++++++++++++++++++++++++++++++++++++------------
 qapi/block-core.json |  19 +++++-
 2 files changed, 149 insertions(+), 51 deletions(-)

-- 
1.9.1
Re: [Qemu-devel] [PATCH v3 2/2] qapi: add read-pattern support for quorum
On Tue, Jul 15, 2014 at 06:22:14AM -0600, Eric Blake wrote:
> On 07/15/2014 12:34 AM, Liu Yuan wrote:
> > Cc: Eric Blake ebl...@redhat.com
> > Signed-off-by: Liu Yuan namei.u...@gmail.com
> > ---
> >  qapi/block-core.json | 19 ++++++++++++++++++-
> >  1 file changed, 18 insertions(+), 1 deletion(-)
> >
> > diff --git a/qapi/block-core.json b/qapi/block-core.json
> > index e378653..22491bc 100644
> > --- a/qapi/block-core.json
> > +++ b/qapi/block-core.json
> > @@ -1384,6 +1384,18 @@
> >      'raw': 'BlockdevRef' } }
> >
> >  ##
> > +# @QuorumReadPattern
> > +# An enumeration of quorum read pattern.
>
> s/pattern/patterns/
>
> > +#
> > +# @quorum: read all the children and do a quorum vote on reads
> > +#
> > +# @fifo: read a single child and try next child in FIFO order if read fails.
>
> Inconsistent on whether you are ending a line with '.'. I'm not quite
> sure if this reads well; maybe:
>
> @fifo: read only from the first child that has not failed
>
> I also wonder if 'single' might be a better name than 'fifo',
> especially if we later add a 'round-robin' for read load balancing.

Both fifo and round-robin in our context indicate a single read, but either
one looks okay to me. We need to reach agreement on the name before I send
the next version: single, fifo, or anything else suggested. Benoit, Kevin,
Stefan?

Thanks
Yuan
Re: [Qemu-devel] [PATCH v3 2/2] qapi: add read-pattern support for quorum
On Wed, Jul 16, 2014 at 10:22:20AM +0800, Liu Yuan wrote:
> On Tue, Jul 15, 2014 at 06:22:14AM -0600, Eric Blake wrote:
> > On 07/15/2014 12:34 AM, Liu Yuan wrote:
> > > Cc: Eric Blake ebl...@redhat.com
> > > Signed-off-by: Liu Yuan namei.u...@gmail.com
> > > ---
> > >  qapi/block-core.json | 19 ++++++++++++++++++-
> > >  1 file changed, 18 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/qapi/block-core.json b/qapi/block-core.json
> > > index e378653..22491bc 100644
> > > --- a/qapi/block-core.json
> > > +++ b/qapi/block-core.json
> > > @@ -1384,6 +1384,18 @@
> > >      'raw': 'BlockdevRef' } }
> > >
> > >  ##
> > > +# @QuorumReadPattern
> > > +# An enumeration of quorum read pattern.
> >
> > s/pattern/patterns/
> >
> > > +#
> > > +# @quorum: read all the children and do a quorum vote on reads
> > > +#
> > > +# @fifo: read a single child and try next child in FIFO order if read fails.
> >
> > Inconsistent on whether you are ending a line with '.'. I'm not quite
> > sure if this reads well; maybe:
> >
> > @fifo: read only from the first child that has not failed
> >
> > I also wonder if 'single' might be a better name than 'fifo',
> > especially if we later add a 'round-robin' for read load balancing.
>
> Both fifo and round-robin in our context indicate a single read, but either
> one looks okay to me. We need to reach agreement on the name before I send
> the next version: single, fifo, or anything else suggested. Benoit, Kevin,
> Stefan?

Cc others mentioned.
[Qemu-devel] [PATCH v2] block/quorum: add simple read pattern support
This patch adds a single-read pattern to the quorum driver; quorum vote remains the default pattern. Currently we do a quorum vote on all reads; this is designed for unreliable underlying storage such as non-redundant NFS, to ensure data integrity at the cost of read performance. Consider the following use case:

            VM
          ------
          |    |
          v    v
          A    B

Both A and B have hardware RAID storage that guarantees data integrity on its own, so it helps performance to issue a single read instead of reading from all the nodes. Further, if the VM runs on either of the storage nodes, the read can be served locally for better performance.

This patch generalizes the above 2-node case to N nodes: the VM writes to all N nodes and reads from just one of them. If the single read fails, we try the next node in the FIFO order specified by the startup command. The 2-node case is very similar to DRBD[1], though it lacks auto-sync functionality after a single device/node failure for now. But compared with DRBD we still have some advantages:

- Suppose we have 20 VMs running on one node (assume A) of 2-node DRBD-backed storage. If A crashes, we need to restart all the VMs on node B. In practice we often can't, because B might not have enough resources to run 20 VMs at once. If we instead run our 20 VMs with the quorum driver and scatter the replicated images over the data center, we can very likely restart the 20 VMs without any resource problem.

In the long run, I think we can build more powerful replicated-image functionality on top of quorum and block jobs (block mirror) to meet various High Availability needs.

E.g., enable the single-read pattern on 2 children:

-drive driver=quorum,children.0.file.filename=0.qcow2,\
children.1.file.filename=1.qcow2,read-pattern=single,vote-threshold=1

[1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device

Cc: Benoit Canet ben...@irqsave.net
Cc: Kevin Wolf kw...@redhat.com
Cc: Stefan Hajnoczi stefa...@redhat.com
Signed-off-by: Liu Yuan namei.u...@gmail.com
---
v2:
 - rename 'single' as 'fifo'
 - rename read_all_children as read_quorum_children
 - fix quorum_aio_finalize() for the fifo pattern

 block.c              |   2 +-
 block/quorum.c       | 181 ++++++++++++++++++++++++++++++++++++------------
 qapi/block-core.json |   7 +-
 3 files changed, 138 insertions(+), 52 deletions(-)

diff --git a/block.c b/block.c
index 8800a6b..3dfa27c 100644
--- a/block.c
+++ b/block.c
@@ -2212,7 +2212,7 @@ int bdrv_commit(BlockDriverState *bs)
 
     if (!drv)
         return -ENOMEDIUM;
-    
+
     if (!bs->backing_hd) {
         return -ENOTSUP;
     }
diff --git a/block/quorum.c b/block/quorum.c
index d5ee9c0..1443fc5 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -24,6 +24,7 @@
 #define QUORUM_OPT_VOTE_THRESHOLD "vote-threshold"
 #define QUORUM_OPT_BLKVERIFY      "blkverify"
 #define QUORUM_OPT_REWRITE        "rewrite-corrupted"
+#define QUORUM_OPT_READ_PATTERN   "read-pattern"
 
 /* This union holds a vote hash value */
 typedef union QuorumVoteValue {
@@ -74,6 +75,16 @@ typedef struct BDRVQuorumState {
     bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
                             * block if Quorum is reached.
                             */
+
+#define READ_PATTERN_QUORUM 0 /* default */
+#define READ_PATTERN_FIFO   1
+    int read_pattern;          /* fifo: read a single child and try first one
+                                * first. If error, try next child in an
+                                * FIFO order specified by command line.
+                                * Return error if no child read succeeds.
+                                * quorum: read all the children and do a quorum
+                                * vote on reads.
+                                */
 } BDRVQuorumState;
 
 typedef struct QuorumAIOCB QuorumAIOCB;
@@ -117,6 +128,7 @@ struct QuorumAIOCB {
 
     bool is_read;
     int vote_ret;
+    int child_iter;             /* which child to read in fifo pattern */
 };
 
 static bool quorum_vote(QuorumAIOCB *acb);
@@ -154,8 +166,10 @@ static void quorum_aio_finalize(QuorumAIOCB *acb)
 
     if (acb->is_read) {
         for (i = 0; i < s->num_children; i++) {
-            qemu_vfree(acb->qcrs[i].buf);
-            qemu_iovec_destroy(&acb->qcrs[i].qiov);
+            if (i <= acb->child_iter) {
+                qemu_vfree(acb->qcrs[i].buf);
+                qemu_iovec_destroy(&acb->qcrs[i].qiov);
+            }
         }
     }
 
@@ -256,6 +270,21 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret)
     quorum_aio_finalize(acb);
 }
 
+static BlockDriverAIOCB *read_fifo_child(QuorumAIOCB *acb);
+
+static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
+{
+    int i;
+    assert(dest->niov == source->niov);
+    assert(dest->size == source->size);
+    for (i = 0; i < source->niov; i++) {
+        assert(dest->iov[i].iov_len == source->iov[i].iov_len);
+        memcpy(dest->iov[i].iov_base
Re: [Qemu-devel] [PATCH v2] block/quorum: add simple read pattern support
On Mon, Jul 14, 2014 at 09:33:59PM -0600, Eric Blake wrote:
> On 07/14/2014 09:19 PM, Liu Yuan wrote:
> > This patch adds single read pattern to quorum driver and quorum vote is
> > default pattern.
> >
> > This patch generalize the above 2 nodes case in the N nodes. That is,
> > vm -> write to all the N nodes, read just one of them. If single read fails,
> > we try to read next node in FIFO order specified by the startup command.
> >
> > The 2 nodes case is very similar to DRBD[1] though lack of auto-sync
> > functionality in the single device/node failure for now. But compared with
> > DRBD we still have some advantages over it:
> >
> > +++ b/block.c
> > @@ -2212,7 +2212,7 @@ int bdrv_commit(BlockDriverState *bs)
> >
> >      if (!drv)
> >          return -ENOMEDIUM;
> > -    
> > +
> >      if (!bs->backing_hd) {
> >          return -ENOTSUP;
> >      }
>
> While this whitespace cleanup is nice, it doesn't belong in this patch,
> when there is no other change to this unrelated file.
>
> > +++ b/qapi/block-core.json
> > @@ -1398,12 +1398,17 @@
> >  # @rewrite-corrupted: #optional rewrite corrupted data when quorum is reached
> >  #                     (Since 2.1)
> >  #
> > +# @read-pattern: #optional choose quorum or fifo pattern for read
> > +#                set to quorum by default (Since 2.2)
> > +#
> >  # Since: 2.0
> >  ##
> >  { 'type': 'BlockdevOptionsQuorum',
> >    'data': { '*blkverify': 'bool',
> >              'children': [ 'BlockdevRef' ],
> > -            'vote-threshold': 'int', '*rewrite-corrupted': 'bool' } }
> > +            'vote-threshold': 'int',
> > +            '*rewrite-corrupted': 'bool',
> > +            '*read-pattern': 'str' } }
>
> Raw strings that encode a finite set of values are bad for type-safety.
> Please add an enum:
>
> { 'enum': 'QuorumReadPattern', 'data': [ 'quorum', 'fifo' ] }
>
> then use '*read-pattern': 'QuorumReadPattern'

For the startup command line, does this imply a change too? Sorry, I'm not
familiar with block-core.json. With your suggestion, how do we pass
read-pattern via QMP: read-pattern: fifo or read-pattern: quorum?

> Should we offer multiple modes in addition to 'quorum'? For example, I
> could see a difference between 'fifo' (favor read from the first quorum
> member always, unless it fails, good when the first member is local and
> other member is remote) and 'round-robin' (evenly distribute reads; each
> read goes to the next available quorum member, good when all members are
> equally distant).

I guess so. 'round-robin' in your context would benefit use cases seeking
better read load balance.

Thanks
Yuan
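The 'round-robin' pattern Eric mentions is only discussed in this thread and is not implemented by the series. If it were added later, the child selection could look roughly like this standalone sketch; RoundRobinState and round_robin_pick() are hypothetical names invented for the illustration.

#include <stdio.h>

/* Hypothetical round-robin selection for read load balancing: each read
 * goes to the next child, spreading the load evenly across all of them. */
typedef struct {
    int num_children;
    int next_child;          /* rotates so reads spread over all children */
} RoundRobinState;

static int round_robin_pick(RoundRobinState *s)
{
    int child = s->next_child;

    s->next_child = (s->next_child + 1) % s->num_children;
    return child;
}

int main(void)
{
    RoundRobinState s = { .num_children = 3, .next_child = 0 };

    for (int i = 0; i < 6; i++) {
        printf("read %d -> child %d\n", i, round_robin_pick(&s));
    }
    return 0;
}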
[Qemu-devel] [PATCH] block/quorum: make quorum_getlength error message user friendly
When starting the quorum driver with 2 different-sized images, we get:

qemu-system-x86_64: -drive if=virtio,driver=quorum...: Could not refresh total \
sector count: Input/output error

EIO would confuse users. With this patch, the error output becomes:

Children images are not in the same size
qemu-system-x86_64: -drive if=virtio,driver=quorum...: Could not refresh total \
sector count: Invalid argument

Cc: Benoit Canet ben...@irqsave.net
Cc: Kevin Wolf kw...@redhat.com
Cc: Stefan Hajnoczi stefa...@redhat.com
Signed-off-by: Liu Yuan namei.u...@gmail.com
---
 block/quorum.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/quorum.c b/block/quorum.c
index 2f18755..51437ad 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -736,7 +736,8 @@ static int64_t quorum_getlength(BlockDriverState *bs)
             return value;
         }
         if (value != result) {
-            return -EIO;
+            error_printf("Children images are not in the same size\n");
+            return -EINVAL;
         }
     }
 
-- 
1.9.1
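For context, the check this patch touches compares every child's length against the first child's length and refuses to continue on a mismatch. A simplified, self-contained sketch of that loop, with the QEMU bdrv_getlength() call replaced by a stub, might look like the following; it is illustrative only.

#include <errno.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_CHILDREN 2

/* Stub standing in for querying each child's image size, in bytes. */
static int64_t child_length(int i)
{
    static const int64_t sizes[NUM_CHILDREN] = { 1048576, 2097152 };
    return sizes[i];
}

/* Simplified consistency check: all children must report the same length,
 * otherwise fail with a clear error instead of a bare I/O error. */
static int64_t quorum_length_check(void)
{
    int64_t result = child_length(0);

    for (int i = 1; i < NUM_CHILDREN; i++) {
        int64_t value = child_length(i);

        if (value < 0) {
            return value;
        }
        if (value != result) {
            fprintf(stderr, "Children images are not in the same size\n");
            return -EINVAL;
        }
    }
    return result;
}

int main(void)
{
    printf("length check returned %lld\n", (long long)quorum_length_check());
    return 0;
}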
[Qemu-devel] [PATCH v2] configure: make libnfs not_found message user friendly
Cc: Kevin Wolf kw...@redhat.com
Signed-off-by: Liu Yuan namei.u...@gmail.com
---
 configure | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/configure b/configure
index 7dd43fd..78e7baf 100755
--- a/configure
+++ b/configure
@@ -3996,7 +3996,7 @@ if test "$libnfs" != "no" ; then
     LIBS="$LIBS $libnfs_libs"
   else
     if test "$libnfs" = "yes" ; then
-      feature_not_found "libnfs"
+      feature_not_found "libnfs" "Install libnfs-devel >= 1.9.3"
     fi
     libnfs="no"
   fi
-- 
1.9.1