Re: [RFC PATCH v4 3/3] block: add sheepdog driver for distributed storage support
At Fri, 04 Jun 2010 13:04:00 +0200,
Kevin Wolf wrote:
> Am 03.06.2010 18:23, MORITA Kazutaka wrote:
>>>> +static void sd_aio_cancel(BlockDriverAIOCB *blockacb)
>>>> +{
>>>> +    SheepdogAIOCB *acb = (SheepdogAIOCB *)blockacb;
>>>> +
>>>> +    acb->canceled = 1;
>>>> +}
>>>
>>> Does this provide the right semantics? You haven't really cancelled
>>> the request, but you pretend to. So you actually complete the request
>>> in the background and then throw the return code away.
>>>
>>> I seem to remember that posix-aio-compat.c waits at this point for
>>> completion of the requests, calls the callbacks and only afterwards
>>> returns from aio_cancel when no more requests are in flight. Or if
>>> you can really cancel requests, it would be the best option, of
>>> course.
>>
>> Sheepdog cannot cancel requests which are already sent to the servers.
>> So, as you say, we pretend to cancel the requests without waiting for
>> their completion. However, is there any situation where pretending to
>> cancel causes problems in practice?
>
> I'm not sure how often it would happen in practice, but if the guest OS
> thinks the old value is on disk when in fact the new one is, this could
> lead to corruption. I think if it can happen, even without evidence
> that it actually does, it's already relevant enough.

I agree.

>> To wait for completion of the requests here, we may need to create
>> another thread for processing I/O, like posix-aio-compat.c does.
>
> I don't think you need a thread to get the same behaviour, you just
> need to call the fd handlers like in the main loop. It would probably
> be the first driver doing this, though, and it's not an often used code
> path, so it might be a bad idea.
>
> Maybe it's reasonable to just complete the request with -EIO? This way
> the guest couldn't make any assumption about the data written. On the
> other hand, it could be unhappy about failed requests, but that's
> probably better than corruption.

Completing with -EIO looks good to me. Thanks for the advice. I'll send
an updated patch tomorrow.
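[Editor's note] The -EIO completion agreed on above can be sketched as follows. The struct fields and helper names here (`BlockCompletionFunc`, `finished`, `sd_response_arrived`) are simplified stand-ins for illustration, not the actual QEMU block-layer API:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-ins for the QEMU AIOCB types; only the fields
 * needed to illustrate the -EIO completion idea are kept. */
typedef void BlockCompletionFunc(void *opaque, int ret);

typedef struct SheepdogAIOCB {
    BlockCompletionFunc *cb;   /* guest's completion callback */
    void *opaque;
    int finished;              /* callback already invoked? */
} SheepdogAIOCB;

/* Cancel by failing the request immediately with -EIO: the request
 * itself cannot be recalled from the servers, but the guest can no
 * longer assume anything about the data that was being written. */
static void sd_aio_cancel(SheepdogAIOCB *acb)
{
    if (!acb->finished) {
        acb->finished = 1;
        acb->cb(acb->opaque, -EIO);
    }
}

/* When the (uncancellable) response eventually arrives from the
 * server, the result is dropped instead of completing twice. */
static void sd_response_arrived(SheepdogAIOCB *acb, int ret)
{
    if (!acb->finished) {
        acb->finished = 1;
        acb->cb(acb->opaque, ret);
    }
}
```

The key point is the `finished` flag: the callback fires exactly once, either at cancel time with -EIO or when the server's reply arrives, whichever comes first.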
Regards,

Kazutaka
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v4 3/3] block: add sheepdog driver for distributed storage support
Am 03.06.2010 18:23, MORITA Kazutaka wrote:
>>> +static void sd_aio_cancel(BlockDriverAIOCB *blockacb)
>>> +{
>>> +    SheepdogAIOCB *acb = (SheepdogAIOCB *)blockacb;
>>> +
>>> +    acb->canceled = 1;
>>> +}
>>
>> Does this provide the right semantics? You haven't really cancelled
>> the request, but you pretend to. So you actually complete the request
>> in the background and then throw the return code away.
>>
>> I seem to remember that posix-aio-compat.c waits at this point for
>> completion of the requests, calls the callbacks and only afterwards
>> returns from aio_cancel when no more requests are in flight. Or if you
>> can really cancel requests, it would be the best option, of course.
>
> Sheepdog cannot cancel requests which are already sent to the servers.
> So, as you say, we pretend to cancel the requests without waiting for
> their completion. However, is there any situation where pretending to
> cancel causes problems in practice?

I'm not sure how often it would happen in practice, but if the guest OS
thinks the old value is on disk when in fact the new one is, this could
lead to corruption. I think if it can happen, even without evidence that
it actually does, it's already relevant enough.

> To wait for completion of the requests here, we may need to create
> another thread for processing I/O, like posix-aio-compat.c does.

I don't think you need a thread to get the same behaviour, you just need
to call the fd handlers like in the main loop. It would probably be the
first driver doing this, though, and it's not an often used code path,
so it might be a bad idea.

Maybe it's reasonable to just complete the request with -EIO? This way
the guest couldn't make any assumption about the data written. On the
other hand, it could be unhappy about failed requests, but that's
probably better than corruption.

Kevin
Re: [RFC PATCH v4 3/3] block: add sheepdog driver for distributed storage support
At Wed, 02 Jun 2010 15:55:42 +0200,
Kevin Wolf wrote:
> Am 28.05.2010 04:44, MORITA Kazutaka wrote:
>> Sheepdog is a distributed storage system for QEMU. It provides highly
>> available block level storage volumes to VMs like Amazon EBS. This
>> patch adds a qemu block driver for Sheepdog.
>>
>> Sheepdog features are:
>> - No node in the cluster is special (no metadata node, no control node, etc)
>> - Linear scalability in performance and capacity
>> - No single point of failure
>> - Autonomous management (zero configuration)
>> - Useful volume management support such as snapshot and cloning
>> - Thin provisioning
>> - Autonomous load balancing
>>
>> More details are available at the project site:
>>
>>   http://www.osrg.net/sheepdog/
>>
>> Signed-off-by: MORITA Kazutaka <morita.kazut...@lab.ntt.co.jp>
>> ---
>>  Makefile.objs    |    2 +-
>>  block/sheepdog.c | 1835 ++
>>  2 files changed, 1836 insertions(+), 1 deletions(-)
>>  create mode 100644 block/sheepdog.c
>
> One general thing: The code uses some mix of spaces and tabs for
> indentation, with the greatest part using tabs. According to
> CODING_STYLE it should consistently use four spaces instead.

OK. I'll fix the indentation according to CODING_STYLE.

>> +
>> +typedef struct SheepdogInode {
>> +    char name[SD_MAX_VDI_LEN];
>> +    uint64_t ctime;
>> +    uint64_t snap_ctime;
>> +    uint64_t vm_clock_nsec;
>> +    uint64_t vdi_size;
>> +    uint64_t vm_state_size;
>> +    uint16_t copy_policy;
>> +    uint8_t  nr_copies;
>> +    uint8_t  block_size_shift;
>> +    uint32_t snap_id;
>> +    uint32_t vdi_id;
>> +    uint32_t parent_vdi_id;
>> +    uint32_t child_vdi_id[MAX_CHILDREN];
>> +    uint32_t data_vdi_id[MAX_DATA_OBJS];
>
> Wow, this is a huge array. :-) So Sheepdog has a fixed limit of 16 TB,
> right?

MAX_DATA_OBJS is (1 << 20), and the size of an object is 4 MB. So the
limit of the Sheepdog image size is 4 TB. These values are hard-coded,
and I guess they should be configurable.

>> +} SheepdogInode;
>> +
>> +static void sd_aio_cancel(BlockDriverAIOCB *blockacb)
>> +{
>> +    SheepdogAIOCB *acb = (SheepdogAIOCB *)blockacb;
>> +
>> +    acb->canceled = 1;
>> +}
>
> Does this provide the right semantics?
> You haven't really cancelled the request, but you pretend to. So you
> actually complete the request in the background and then throw the
> return code away.
>
> I seem to remember that posix-aio-compat.c waits at this point for
> completion of the requests, calls the callbacks and only afterwards
> returns from aio_cancel when no more requests are in flight. Or if you
> can really cancel requests, it would be the best option, of course.

Sheepdog cannot cancel requests which are already sent to the servers.
So, as you say, we pretend to cancel the requests without waiting for
their completion. However, is there any situation where pretending to
cancel causes problems in practice?

To wait for completion of the requests here, we may need to create
another thread for processing I/O, like posix-aio-compat.c does.

>> +static int do_send_recv(int sockfd, struct iovec *iov, int len,
>> +                        int offset, int write)
>
> I've spent at least 15 minutes figuring out what this function does. I
> think I've got it now more or less, but I've come to the conclusion
> that this code needs more comments. I'd suggest to add a header comment
> to all non-trivial functions and maybe somewhere at the top a general
> description of how things work.
>
> As far as I understood now, there are basically two parts of request
> handling:
>
> 1. The request is sent to the server. Its AIOCB is saved in a list in
>    the BDRVSheepdogState. It doesn't pass a callback or anything for
>    the completion.
>
> 2. aio_read_response is registered as a fd handler to the sheepdog
>    connection. When the server responds, it searches the right AIOCB
>    in the list and the second part of request handling starts.
>
> do_send_recv is the function that is used to do all communication with
> the server. The iov stuff looks like it's only used for some data, but
> it seems this is not true - it's also used for the metadata of the
> protocol.
>
> Did I understand it right so far?

Yes, exactly. I'll add comments to make the code more readable.
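[Editor's note] Kevin's two-part description can be sketched roughly as below. The bookkeeping here (the `pending` list, `record_request`, matching responses by a request id) is illustrative of the scheme he describes, not the patch's exact code:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch of the two-part request handling described
 * above: part 1 sends a request and records it; part 2 (the fd
 * handler) matches the server's response against the recorded list. */
typedef struct AIOReq {
    unsigned int id;       /* request id echoed back in the response */
    int done;
    struct AIOReq *next;
} AIOReq;

typedef struct BDRVSheepdogState {
    AIOReq *pending;       /* requests sent but not yet answered */
} BDRVSheepdogState;

/* Part 1: after writing the request to the socket, remember it;
 * no completion callback is handed to the socket layer. */
static void record_request(BDRVSheepdogState *s, AIOReq *req)
{
    req->next = s->pending;
    s->pending = req;
}

/* Part 2: runs as the fd handler when the connection becomes
 * readable; finds the matching request by id and completes it. */
static AIOReq *aio_read_response(BDRVSheepdogState *s, unsigned int rsp_id)
{
    AIOReq **p;
    for (p = &s->pending; *p != NULL; p = &(*p)->next) {
        if ((*p)->id == rsp_id) {
            AIOReq *req = *p;
            *p = req->next;    /* unlink from the pending list */
            req->done = 1;
            return req;
        }
    }
    return NULL;               /* response for an unknown request */
}
```

Because the socket layer gets no callback, the fd handler is the only place where completions happen, which is also why cancelling is awkward: a request already handed to `record_request` stays in flight until the server answers.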
>> +{
>> +    struct msghdr msg;
>> +    int ret, diff;
>> +
>> +    memset(&msg, 0, sizeof(msg));
>> +    msg.msg_iov = iov;
>> +    msg.msg_iovlen = 1;
>> +
>> +    len += offset;
>> +
>> +    while (iov->iov_len < len) {
>> +        len -= iov->iov_len;
>> +
>> +        iov++;
>> +        msg.msg_iovlen++;
>> +    }
>
> You're counting the number of elements in the iov here. qemu_iovec
> would already have these (and also len), wouldn't it make sense to use
> it as the abstraction? Though I'm not sure where these iovecs come
> from, so the answer might be no.

We use struct msghdr for sendmsg/recvmsg here, so using iovec directly
looks simpler.

>> +
>> +    diff = iov->iov_len - len;
>> +    iov->iov_len -= diff;
>> +
>> +    while (msg.msg_iov->iov_len =
Re: [RFC PATCH v4 3/3] block: add sheepdog driver for distributed storage support
Am 28.05.2010 04:44, MORITA Kazutaka wrote:
> Sheepdog is a distributed storage system for QEMU. It provides highly
> available block level storage volumes to VMs like Amazon EBS. This
> patch adds a qemu block driver for Sheepdog.
>
> Sheepdog features are:
> - No node in the cluster is special (no metadata node, no control node, etc)
> - Linear scalability in performance and capacity
> - No single point of failure
> - Autonomous management (zero configuration)
> - Useful volume management support such as snapshot and cloning
> - Thin provisioning
> - Autonomous load balancing
>
> More details are available at the project site:
>
>   http://www.osrg.net/sheepdog/
>
> Signed-off-by: MORITA Kazutaka <morita.kazut...@lab.ntt.co.jp>
> ---
>  Makefile.objs    |    2 +-
>  block/sheepdog.c | 1835 ++
>  2 files changed, 1836 insertions(+), 1 deletions(-)
>  create mode 100644 block/sheepdog.c

One general thing: The code uses some mix of spaces and tabs for
indentation, with the greatest part using tabs. According to
CODING_STYLE it should consistently use four spaces instead.

> diff --git a/Makefile.objs b/Makefile.objs
> index 1a942e5..527a754 100644
> --- a/Makefile.objs
> +++ b/Makefile.objs
> @@ -14,7 +14,7 @@
>  block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
>  block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o
>  block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o
> -block-nested-y += parallels.o nbd.o blkdebug.o
> +block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o
>  block-nested-$(CONFIG_WIN32) += raw-win32.o
>  block-nested-$(CONFIG_POSIX) += raw-posix.o
>  block-nested-$(CONFIG_CURL) += curl.o
> diff --git a/block/sheepdog.c b/block/sheepdog.c
> new file mode 100644
> index 000..68545e8
> --- /dev/null
> +++ b/block/sheepdog.c
> @@ -0,0 +1,1835 @@
> +/*
> + * Copyright (C) 2009-2010 Nippon Telegraph and Telephone Corporation.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License version
> + * 2 as published by the Free Software Foundation.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program. If not, see <http://www.gnu.org/licenses/>.
> + */
> +#include <netdb.h>
> +#include <netinet/tcp.h>
> +
> +#include "qemu-common.h"
> +#include "qemu-error.h"
> +#include "block_int.h"
> +
> +#define SD_PROTO_VER 0x01
> +
> +#define SD_DEFAULT_ADDR "localhost:7000"
> +
> +#define SD_OP_CREATE_AND_WRITE_OBJ  0x01
> +#define SD_OP_READ_OBJ        0x02
> +#define SD_OP_WRITE_OBJ       0x03
> +
> +#define SD_OP_NEW_VDI         0x11
> +#define SD_OP_LOCK_VDI        0x12
> +#define SD_OP_RELEASE_VDI     0x13
> +#define SD_OP_GET_VDI_INFO    0x14
> +#define SD_OP_READ_VDIS       0x15
> +
> +#define SD_FLAG_CMD_WRITE     0x01
> +#define SD_FLAG_CMD_COW       0x02
> +
> +#define SD_RES_SUCCESS        0x00 /* Success */
> +#define SD_RES_UNKNOWN        0x01 /* Unknown error */
> +#define SD_RES_NO_OBJ         0x02 /* No object found */
> +#define SD_RES_EIO            0x03 /* I/O error */
> +#define SD_RES_VDI_EXIST      0x04 /* Vdi exists already */
> +#define SD_RES_INVALID_PARMS  0x05 /* Invalid parameters */
> +#define SD_RES_SYSTEM_ERROR   0x06 /* System error */
> +#define SD_RES_VDI_LOCKED     0x07 /* Vdi is locked */
> +#define SD_RES_NO_VDI         0x08 /* No vdi found */
> +#define SD_RES_NO_BASE_VDI    0x09 /* No base vdi found */
> +#define SD_RES_VDI_READ       0x0A /* Cannot read requested vdi */
> +#define SD_RES_VDI_WRITE      0x0B /* Cannot write requested vdi */
> +#define SD_RES_BASE_VDI_READ  0x0C /* Cannot read base vdi */
> +#define SD_RES_BASE_VDI_WRITE 0x0D /* Cannot write base vdi */
> +#define SD_RES_NO_TAG         0x0E /* Requested tag is not found */
> +#define SD_RES_STARTUP        0x0F /* Sheepdog is starting up */
> +#define SD_RES_VDI_NOT_LOCKED 0x10 /* Vdi is not locked */
> +#define SD_RES_SHUTDOWN       0x11 /* Sheepdog is shutting down */
> +#define SD_RES_NO_MEM         0x12 /* Cannot allocate memory */
> +#define SD_RES_FULL_VDI       0x13 /* We already have the maximum vdis */
> +#define SD_RES_VER_MISMATCH   0x14 /* Protocol version mismatch */
> +#define SD_RES_NO_SPACE       0x15 /* Server has no room for new objects */
> +#define SD_RES_WAIT_FOR_FORMAT 0x16 /* Sheepdog is waiting for a format operation */
> +#define SD_RES_WAIT_FOR_JOIN  0x17 /* Sheepdog is waiting for other nodes joining */
> +#define SD_RES_JOIN_FAILED    0x18 /* Target node had failed to join sheepdog */
> +
> +/*
> + * Object ID rules
> + *
> + *  0 - 19 (20 bits): data object space
> + * 20 - 31 (12 bits): reserved data object space
> + * 32 - 55 (24 bits): vdi object space
> + * 56 - 59 ( 4 bits): reserved vdi object space
> + * 60 - 63 ( 4 bits): object type identifier space
> + */
> +
> +#define VDI_SPACE_SHIFT 32
> +#define VDI_BIT (UINT64_C(1) << 63)
> +#define VMSTATE_BIT (UINT64_C(1) << 62)
> +#define MAX_DATA_OBJS (1ULL << 20)
> +#define MAX_CHILDREN 1024
> +#define
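[Editor's note] The object ID layout quoted above can be illustrated with a couple of helpers. The function names and the `SD_OBJ_SIZE` constant below are chosen for illustration (a sketch, not the patch's exact code); the 4 MB object size comes from Kazutaka's reply earlier in the thread, and with 2^20 data objects it reproduces the 4 TB limit he mentions:

```c
#include <assert.h>
#include <stdint.h>

#define VDI_SPACE_SHIFT 32
#define VDI_BIT (UINT64_C(1) << 63)
#define MAX_DATA_OBJS (UINT64_C(1) << 20)
#define SD_OBJ_SIZE (UINT64_C(4) << 20)   /* 4 MB per object (from the thread) */

/* A vdi object id sets the type bit (bit 63) and puts the vdi id
 * into the vdi object space (bits 32-55). */
static uint64_t vid_to_vdi_oid(uint32_t vid)
{
    return VDI_BIT | ((uint64_t)vid << VDI_SPACE_SHIFT);
}

/* A data object id combines the vdi id with a 20-bit object index
 * in the data object space (bits 0-19). */
static uint64_t vid_to_data_oid(uint32_t vid, uint32_t idx)
{
    return ((uint64_t)vid << VDI_SPACE_SHIFT) | (idx & (MAX_DATA_OBJS - 1));
}
```

Multiplying out the limits: 2^20 data objects of 4 MB each gives 4 TB per image, matching the figure discussed in the review.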