[Qemu-devel] Re: [RFC PATCH v4 3/3] block: add sheepdog driver for distributed storage support

2010-06-06 Thread MORITA Kazutaka
At Fri, 04 Jun 2010 13:04:00 +0200,
Kevin Wolf wrote:
 
 On 03.06.2010 18:23, MORITA Kazutaka wrote:
  +static void sd_aio_cancel(BlockDriverAIOCB *blockacb)
  +{
  + SheepdogAIOCB *acb = (SheepdogAIOCB *)blockacb;
  +
  + acb->canceled = 1;
  +}
 
  Does this provide the right semantics? You haven't really cancelled the
  request, but you pretend to. So you actually complete the request in the
  background and then throw the return code away.
 
  I seem to remember that posix-aio-compat.c waits at this point for
  completion of the requests, calls the callbacks and only afterwards
  returns from aio_cancel when no more requests are in flight.
 
  Or if you can really cancel requests, it would be the best option, of
  course.
 
  
  Sheepdog cannot cancel the requests which are already sent to the
  servers.  So, as you say, we pretend to cancel the requests without
  waiting for completion of them.  However, is there any situation
  where pretending to cancel causes problems in practice?
 
 I'm not sure how often it would happen in practice, but if the guest OS
 thinks the old value is on disk when in fact the new one is, this could
 lead to corruption. I think if it can happen, even without evidence that
 it actually does, it's already relevant enough.
 

I agree.

  To wait for completion of the requests here, we may need to create
  another thread for processing I/O like posix-aio-compat.c.
 
 I don't think you need a thread to get the same behaviour, you just need
 to call the fd handlers like in the main loop. It would probably be the
 first driver doing this, though, and it's not an often used code path,
 so it might be a bad idea.
 
 Maybe it's reasonable to just complete the request with -EIO? This way
 the guest couldn't make any assumption about the data written. On the
 other hand, it could be unhappy about failed requests, but that's
 probably better than corruption.
 

Completing with -EIO looks good to me.  Thanks for the advice.
I'll send an updated patch tomorrow.

Regards,

Kazutaka
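
For illustration, here is a minimal sketch of a cancel handler that completes
with -EIO, as agreed above.  The field names (common.cb, common.opaque,
canceled) follow the patch excerpts in this thread; this is a sketch, not the
actual follow-up patch:

    static void sd_aio_cancel(BlockDriverAIOCB *blockacb)
    {
        SheepdogAIOCB *acb = (SheepdogAIOCB *)blockacb;

        /*
         * Sheepdog cannot cancel a request that is already on the wire,
         * so complete it with -EIO: the guest can then make no
         * assumption about the written data, which avoids silent
         * corruption.
         */
        acb->common.cb(acb->common.opaque, -EIO);
        acb->canceled = 1;
    }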



[Qemu-devel] Re: [RFC PATCH v4 3/3] block: add sheepdog driver for distributed storage support

2010-06-04 Thread Kevin Wolf
On 03.06.2010 18:23, MORITA Kazutaka wrote:
 +static void sd_aio_cancel(BlockDriverAIOCB *blockacb)
 +{
 +   SheepdogAIOCB *acb = (SheepdogAIOCB *)blockacb;
 +
 +   acb->canceled = 1;
 +}

 Does this provide the right semantics? You haven't really cancelled the
 request, but you pretend to. So you actually complete the request in the
 background and then throw the return code away.

 I seem to remember that posix-aio-compat.c waits at this point for
 completion of the requests, calls the callbacks and only afterwards
 returns from aio_cancel when no more requests are in flight.

 Or if you can really cancel requests, it would be the best option, of
 course.

 
 Sheepdog cannot cancel the requests which are already sent to the
 servers.  So, as you say, we pretend to cancel the requests without
 waiting for completion of them.  However, is there any situation
 where pretending to cancel causes problems in practice?

I'm not sure how often it would happen in practice, but if the guest OS
thinks the old value is on disk when in fact the new one is, this could
lead to corruption. I think if it can happen, even without evidence that
it actually does, it's already relevant enough.

 To wait for completion of the requests here, we may need to create
 another thread for processing I/O like posix-aio-compat.c.

I don't think you need a thread to get the same behaviour, you just need
to call the fd handlers like in the main loop. It would probably be the
first driver doing this, though, and it's not an often used code path,
so it might be a bad idea.

Maybe it's reasonable to just complete the request with -EIO? This way
the guest couldn't make any assumption about the data written. On the
other hand, it could be unhappy about failed requests, but that's
probably better than corruption.

Kevin



[Qemu-devel] Re: [RFC PATCH v4 3/3] block: add sheepdog driver for distributed storage support

2010-06-03 Thread MORITA Kazutaka
At Tue, 01 Jun 2010 09:58:04 -0500,
Chris Krumme wrote:

Thanks for your comments!
 
 On 05/27/2010 09:44 PM, MORITA Kazutaka wrote:
  Sheepdog is a distributed storage system for QEMU. It provides highly

  +
  +static int connect_to_sdog(const char *addr)
  +{
  +   char buf[64];
  +   char hbuf[NI_MAXHOST], sbuf[NI_MAXSERV];
  +   char name[256], *p;
  +   int fd, ret;
  +   struct addrinfo hints, *res, *res0;
  +   int port = 0;
  +
  +   if (!addr) {
  +   addr = SD_DEFAULT_ADDR;
  +   }
  +
  +   strcpy(name, addr);
 
 
 Can strlen(addr) be > sizeof(name)?
 

Yes, we should check the length of addr; without a check, strcpy could overflow name.
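
One possible shape of the fix, using QEMU's pstrcpy() (the length check and
error message are illustrative):

    /* Reject addresses that do not fit; pstrcpy() writes at most
     * sizeof(name) bytes and always NUL-terminates. */
    if (strlen(addr) >= sizeof(name)) {
        error_report("address too long: %s\n", addr);
        return -1;
    }
    pstrcpy(name, sizeof(name), addr);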

  +
  +   p = name;
  +   while (*p) {
  +   if (*p == ':') {
  +   *p++ = '\0';
 
 
 May also need to check for p < name + sizeof(name).
 

name is NUL-terminated, so the while (*p) loop stops at the trailing '\0'; the check is not required, I think.

  +   break;
  +   } else {
  +   p++;
  +   }
  +   }
  +
  +   if (*p == '\0') {
  +   error_report("cannot find a port number, %s\n", name);
  +   return -1;
  +   }
  +   port = strtol(p, NULL, 10);
 
 
 Are negative numbers valid here?
 

No. It is better to use strtoul.
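
A sketch of stricter parsing (the endptr and range checks are illustrative
additions, not part of the patch):

    char *endptr;
    unsigned long port;

    /* strtoul() still accepts a leading '-' by wrapping around, so it
     * is the range check below that actually rejects negative input;
     * endptr catches trailing junk. */
    port = strtoul(p, &endptr, 10);
    if (*endptr != '\0' || port == 0 || port > 65535) {
        error_report("invalid port number: %s\n", p);
        return -1;
    }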


  +
  +static int parse_vdiname(BDRVSheepdogState *s, const char *filename,
  +char *vdi, int vdi_len, uint32_t *snapid)
  +{
  +   char *p, *q;
  +   int nr_sep;
  +
  +   p = q = strdup(filename);
  +
  +   if (!p) {
 
 
 I think Qemu has a version of strdup that will not return NULL.
 

Yes. We can use qemu_strdup here.
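
That makes it a one-line change; qemu_strdup() aborts on allocation failure
instead of returning NULL, so the error path disappears:

    /* qemu_strdup() never returns NULL (it aborts on out-of-memory),
     * so the explicit NULL check can be dropped. */
    p = q = qemu_strdup(filename);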


  +
  +/* TODO: error cleanups */
  +static int sd_open(BlockDriverState *bs, const char *filename, int flags)
  +{
  +   int ret, fd;
  +   uint32_t vid = 0;
  +   BDRVSheepdogState *s = bs->opaque;
  +   char vdi[256];
  +   uint32_t snapid;
  +   int for_snapshot = 0;
  +   char *buf;
  +
  +   strstart(filename, "sheepdog:", (const char **)filename);
  +
  +   buf = qemu_malloc(SD_INODE_SIZE);
  +
  +   memset(vdi, 0, sizeof(vdi));
  +   if (parse_vdiname(s, filename, vdi, sizeof(vdi), &snapid) < 0) {
  +   goto out;
  +   }
  +   s->fd = get_sheep_fd(s);
  +   if (s->fd < 0) {
 
 
 buf is not freed, goto out maybe.
 

Yes, we should goto out here.
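
A sketch of the corrected error path, assuming the function keeps a single
out label that frees buf:

    s->fd = get_sheep_fd(s);
    if (s->fd < 0) {
        ret = -1;
        goto out;    /* "out:" does qemu_free(buf) before returning */
    }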


  +
  +static int do_sd_create(const char *addr, char *filename, char *tag,
  +   int64_t total_sectors, uint32_t base_vid,
  +   uint32_t *vdi_id, int snapshot)
  +{
  +   SheepdogVdiReq hdr;
  +   SheepdogVdiRsp *rsp = (SheepdogVdiRsp *)&hdr;
  +   int fd, ret;
  +   unsigned int wlen, rlen = 0;
  +   char buf[SD_MAX_VDI_LEN];
  +
  +   fd = connect_to_sdog(addr);
  +   if (fd < 0) {
  +   return -1;
  +   }
  +
  +   strncpy(buf, filename, SD_MAX_VDI_LEN);
  +
  +   memset(&hdr, 0, sizeof(hdr));
  +   hdr.opcode = SD_OP_NEW_VDI;
  +   hdr.base_vdi_id = base_vid;
  +
  +   wlen = SD_MAX_VDI_LEN;
  +
  +   hdr.flags = SD_FLAG_CMD_WRITE;
  +   hdr.snapid = snapshot;
  +
  +   hdr.data_length = wlen;
  +   hdr.vdi_size = total_sectors * 512;
 
 
 There is another patch on the list changing 512 to a define for sector size.
 

OK. We'll define SECTOR_SIZE.
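
That is, something like (the constant name is the one proposed above):

    #define SECTOR_SIZE 512    /* replaces the magic number 512 */

    hdr.vdi_size = total_sectors * SECTOR_SIZE;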


  +
  +   ret = do_req(fd, (SheepdogReq *)&hdr, buf, &wlen, &rlen);
  +
  +   close(fd);
  +
  +   if (ret) {
  +   return -1;
  +   }
  +
  +   if (rsp->result != SD_RES_SUCCESS) {
  +   error_report("%s, %s\n", sd_strerror(rsp->result), filename);
  +   return -1;
  +   }
  +
  +   if (vdi_id) {
  +   *vdi_id = rsp-vdi_id;
  +   }
  +
  +   return 0;
  +}
  +
  +static int sd_create(const char *filename, QEMUOptionParameter *options)
  +{
  +   int ret;
  +   uint32_t vid = 0;
  +   int64_t total_sectors = 0;
  +   char *backing_file = NULL;
  +
  +   strstart(filename, "sheepdog:", (const char **)filename);
  +
  +   while (options && options->name) {
  +   if (!strcmp(options->name, BLOCK_OPT_SIZE)) {
  +   total_sectors = options->value.n / 512;
 
 Use define.
  +   } else if (!strcmp(options->name, BLOCK_OPT_BACKING_FILE)) {
  +   backing_file = options->value.s;
  +   }
  +   options++;
  +   }
  +
  +   if (backing_file) {
  +   BlockDriverState bs;
  +   char vdi[SD_MAX_VDI_LEN];
  +   uint32_t snapid;
  +
  +   strstart(backing_file, "sheepdog:", (const char **)backing_file);
  +   memset(&bs, 0, sizeof(bs));
  +
  +   bs.opaque = qemu_malloc(sizeof(BDRVSheepdogState));
 
 
 bs seems to have a short life span, is opaque getting freed?
 

No, we should free it.
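
A sketch of the fix, pairing the allocation with qemu_free() once the
temporary BlockDriverState has served its purpose:

    bs.opaque = qemu_malloc(sizeof(BDRVSheepdogState));
    /* ... use the temporary bs to look up the backing VDI ... */
    qemu_free(bs.opaque);    /* free the short-lived state */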


  +
  +static int sd_snapshot_create(BlockDriverState *bs, QEMUSnapshotInfo *sn_info)
  +{
  +   BDRVSheepdogState *s = bs->opaque;
  +   int ret, fd;
  +   uint32_t new_vid;
  +   SheepdogInode *inode;
  +   unsigned int datalen;
  +   uint64_t offset;
  +
  +   dprintf("sn_info: name %s id_str %s s: name %s vm_state_size %d "
  +   "is_current %d\n", sn_info->name, sn_info->id_str,
  +   

[Qemu-devel] Re: [RFC PATCH v4 3/3] block: add sheepdog driver for distributed storage support

2010-06-03 Thread MORITA Kazutaka
At Wed, 02 Jun 2010 15:55:42 +0200,
Kevin Wolf wrote:
 
 On 28.05.2010 04:44, MORITA Kazutaka wrote:
  Sheepdog is a distributed storage system for QEMU. It provides highly
  available block level storage volumes to VMs like Amazon EBS.  This
  patch adds a qemu block driver for Sheepdog.
  
  Sheepdog features are:
  - No node in the cluster is special (no metadata node, no control
node, etc)
  - Linear scalability in performance and capacity
  - No single point of failure
  - Autonomous management (zero configuration)
  - Useful volume management support such as snapshot and cloning
  - Thin provisioning
  - Autonomous load balancing
  
  More details are available at the project site:
  http://www.osrg.net/sheepdog/
  
  Signed-off-by: MORITA Kazutaka morita.kazut...@lab.ntt.co.jp
  ---
   Makefile.objs    |    2 +-
   block/sheepdog.c | 1835 +++++++++++++++++++++++++++++++++++++++++++++
   2 files changed, 1836 insertions(+), 1 deletions(-)
   create mode 100644 block/sheepdog.c
 
 One general thing: The code uses some mix of spaces and tabs for
 indentation, with the greatest part using tabs. According to
 CODING_STYLE it should consistently use four spaces instead.
 

OK.  I'll fix the indentation according to CODING_STYLE.


  +
  +typedef struct SheepdogInode {
  +   char name[SD_MAX_VDI_LEN];
  +   uint64_t ctime;
  +   uint64_t snap_ctime;
  +   uint64_t vm_clock_nsec;
  +   uint64_t vdi_size;
  +   uint64_t vm_state_size;
  +   uint16_t copy_policy;
  +   uint8_t  nr_copies;
  +   uint8_t  block_size_shift;
  +   uint32_t snap_id;
  +   uint32_t vdi_id;
  +   uint32_t parent_vdi_id;
  +   uint32_t child_vdi_id[MAX_CHILDREN];
  +   uint32_t data_vdi_id[MAX_DATA_OBJS];
 
 Wow, this is a huge array. :-)
 
 So Sheepdog has a fixed limit of 16 TB, right?
 

MAX_DATA_OBJS is (1 << 20), and the size of an object is 4 MB.  So the
limit of the Sheepdog image size is 2^20 * 4 MB = 4 TB.

These values are hard-coded, and I guess they should be configurable.


 
  +} SheepdogInode;
  +

  +
  +static void sd_aio_cancel(BlockDriverAIOCB *blockacb)
  +{
  +   SheepdogAIOCB *acb = (SheepdogAIOCB *)blockacb;
  +
  +   acb->canceled = 1;
  +}
 
 Does this provide the right semantics? You haven't really cancelled the
 request, but you pretend to. So you actually complete the request in the
 background and then throw the return code away.
 
 I seem to remember that posix-aio-compat.c waits at this point for
 completion of the requests, calls the callbacks and only afterwards
 returns from aio_cancel when no more requests are in flight.
 
 Or if you can really cancel requests, it would be the best option, of
 course.
 

Sheepdog cannot cancel the requests which are already sent to the
servers.  So, as you say, we pretend to cancel the requests without
waiting for completion of them.  However, is there any situation
where pretending to cancel causes problems in practice?

To wait for completion of the requests here, we may need to create
another thread for processing I/O like posix-aio-compat.c.


  +
  +static int do_send_recv(int sockfd, struct iovec *iov, int len, int offset,
  +   int write)
 
 I've spent at least 15 minutes figuring out what this function does. I
 think I've got it now more or less, but I've come to the conclusion that
 this code needs more comments.
 
 I'd suggest to add a header comment to all non-trivial functions and
 maybe somewhere on the top a general description of how things work.
 
 As far as I understood now, there are basically two parts of request
 handling:
 
 1. The request is sent to the server. Its AIOCB is saved in a list in
 the BDRVSheepdogState. It doesn't pass a callback or anything for the
 completion.
 
 2. aio_read_response is registered as a fd handler to the sheepdog
 connection. When the server responds, it searches the right AIOCB in the
 list and the second part of request handling starts.
 
 do_send_recv is the function that is used to do all communication with
 the server. The iov stuff looks like it's only used for some data, but
 it seems this is not true - it's also used for the metadata of the protocol.
 
 Did I understand it right so far?
 

Yes, exactly.  I'll add comments to make the code more readable.
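
For reference, a sketch of the registration in part 2 of the description
above.  qemu_aio_set_fd_handler() is QEMU's fd-handler API of that era, but
its exact signature is recalled from memory, and s->addr is assumed:

    static int get_sheep_fd(BDRVSheepdogState *s)
    {
        int fd = connect_to_sdog(s->addr);
        if (fd < 0) {
            return -1;
        }
        /* aio_read_response() runs from the main loop whenever the
         * sheepdog connection becomes readable, and matches each
         * response to its pending AIOCB in the BDRVSheepdogState list. */
        qemu_aio_set_fd_handler(fd, aio_read_response, NULL,
                                aio_flush_request, NULL, s);
        return fd;
    }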


  +{
  +   struct msghdr msg;
  +   int ret, diff;
  +
  +   memset(&msg, 0, sizeof(msg));
  +   msg.msg_iov = iov;
  +   msg.msg_iovlen = 1;
  +
  +   len += offset;
  +
  +   while (iov->iov_len < len) {
  +   len -= iov->iov_len;
  +
  +   iov++;
  +   msg.msg_iovlen++;
  +   }
 
 You're counting the number of elements in the iov here. qemu_iovec would
 already have these (and also len), wouldn't it make sense to use it as
 the abstraction? Though I'm not sure where these iovecs come from, so
 the answer might be no.
 

We use struct msghdr for sendmsg/recvmsg here, so using iovec
directly looks simpler.
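
For context, the core of do_send_recv() presumably dispatches on the write
flag like this (a sketch consistent with the excerpt; MSG_WAITALL on the read
side is an assumption):

    /* One msghdr drives both directions over the iovec prepared
     * above: sendmsg() for writes, recvmsg() for reads. */
    if (write) {
        ret = sendmsg(sockfd, &msg, 0);
    } else {
        ret = recvmsg(sockfd, &msg, MSG_WAITALL);
    }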


  +
  +   diff = iov->iov_len - len;
  +   iov->iov_len -= diff;
  +
  +   while (msg.msg_iov->iov_len <=