Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-28 Thread Wenchao Xia

On 2012-11-27 18:37, Dietmar Maurer wrote:

  Just want to confirm something to understand it better:
you are backing up the block image, not including the VM memory state, right?
I am considering a way to do a live savevm including memory and device state,
so I wonder if you already have a solution for it.


Yes, I already have code for that.




  Is the code for live save/restore of VM memory and device state
included in this patch series? I quickly reviewed the patches but
did not find a hook to save VM memory state. I hope you can enlighten
me on your approach; my thought was to do live migration into a qcow2 file,
but your code does not seem to touch qcow2 images.

--
Best Regards

Wenchao Xia




Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-28 Thread Dietmar Maurer
Is the code for live save/restore of VM memory and device state
 included in this patch series? I quickly reviewed the patches but did not
 find a hook to save VM memory state. I hope you can enlighten me on your
 approach; my thought was to do live migration into a qcow2 file, but your
 code does not seem to touch qcow2 images.

I will post that code in a few days.


Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-27 Thread Kevin Wolf
On 27.11.2012 08:15, Dietmar Maurer wrote:
 The only solution I came up with is to add before/after hooks in the
 block job.  I agree with the criticism, but I think it's general
 enough and at the same time easy enough to implement.

 IMHO, the current implementation is quite simple and easy to maintain.

 No, if (bs->backup_info) simply doesn't belong in bdrv_co_writev.

 I do not really understand that argument, because the current
 COPY_ON_READ implementation also works that way:

 if (bs->copy_on_read) {
     flags |= BDRV_REQ_COPY_ON_READ;
 }
 if (flags & BDRV_REQ_COPY_ON_READ) {
     bs->copy_on_read_in_flight++;
 }

 if (bs->copy_on_read_in_flight) {
     wait_for_overlapping_requests(bs, sector_num, nb_sectors);
 }

 tracked_request_begin(&req, bs, sector_num, nb_sectors, false);

 if (flags & BDRV_REQ_COPY_ON_READ) { ...

 Or do you also want to move that to block job hooks?
 
 Just tried to move that code, but the copy-on-read feature is unrelated to
 block jobs, i.e. one can open a bdrv with BDRV_O_COPY_ON_READ, and that does
 not create a job.
 
 I already suggested to add those hooks to BDS instead - don't you think that 
 would work?

To which BDS? If it is the BDS that is being backed up, the problem is
that you could only have one implementation per BDS, i.e. you couldn't
use backup and copy on read or I/O throttling or whatever at the same time.

Kevin



Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-27 Thread Wenchao Xia
On 2012-11-21 17:01, Dietmar Maurer wrote:
 This series provides a way to efficiently backup VMs.
 
 * Backup to a single archive file
 * Backup contains all data to restore VM (full backup)
 * Do not depend on storage type or image format
 * Avoid use of temporary storage
 * store sparse images efficiently
 
 The file docs/backup-rfc.txt contains more details.
 
 Signed-off-by: Dietmar Maurer diet...@proxmox.com
 ---
   docs/backup-rfc.txt |  119 +++
   1 files changed, 119 insertions(+), 0 deletions(-)
   create mode 100644 docs/backup-rfc.txt
 
 diff --git a/docs/backup-rfc.txt b/docs/backup-rfc.txt
 new file mode 100644
 index 000..5b4b3df
 --- /dev/null
 +++ b/docs/backup-rfc.txt
 @@ -0,0 +1,119 @@
 +RFC: Efficient VM backup for qemu
 +
 +=Requirements=
 +
 +* Backup to a single archive file
 +* Backup needs to contain all data to restore VM (full backup)
 +* Do not depend on storage type or image format
 +* Avoid use of temporary storage
 +* store sparse images efficiently
 +
 +=Introduction=
 +
 +Most VM backup solutions use some kind of snapshot to get a consistent
 +VM view at a specific point in time. For example, we previously used
 +LVM to create a snapshot of all used VM images, which are then copied
 +into a tar file.
 +
 +That basically means that any data written during backup involves
 +considerable overhead. For LVM we get the following steps:
 +
 +1.) read original data (VM write)
 +2.) write original data into snapshot (VM write)
 +3.) write new data (VM write)
 +4.) read data from snapshot (backup)
 +5.) write data from snapshot into tar file (backup)
 +
 +Another approach to backup VM images is to create a new qcow2 image
 +which uses the old image as a base. During backup, writes are redirected
 +to the new image, so the old image represents a 'snapshot'. After
 +backup, data need to be copied back from new image into the old
 +one (commit). So a simple write during backup triggers the following
 +steps:
 +
 +1.) write new data to new image (VM write)
 +2.) read data from old image (backup)
 +3.) write data from old image into tar file (backup)
 +
 +4.) read data from new image (commit)
 +5.) write data to old image (commit)
 +
 +This is in fact the same overhead as before. Other tools like qemu
 +livebackup produce similar overhead (2 reads, 3 writes).
 +
 +Some storage types/formats support internal snapshots using some kind
 +of reference counting (rados, sheepdog, dm-thin, qcow2). It would be possible
 +to use that for backups, but for now we want to be storage-independent.
 +
 +Note: It turned out that taking a qcow2 snapshot can take a very long
 +time on larger files.
 +
 +=Make it more efficient=
 +
 +To be more efficient, we simply need to avoid unnecessary steps. The
 +following steps are always required:
 +
 +1.) read old data before it gets overwritten
 +2.) write that data into the backup archive
 +3.) write new data (VM write)
 +
 +As you can see, this involves only one read and two writes.
 +
 +To make that work, our backup archive needs to be able to store image
 +data 'out of order'. It is important to notice that this will not work
 +with traditional archive formats like tar.
 +
 +During backup we simply intercept writes, then read existing data and
 +store that directly into the archive. After that we can continue the
 +write.
 +
 +==Advantages==
 +
 +* very good performance (1 read, 2 writes)
 +* works on any storage type and image format.
 +* avoid usage of temporary storage
 +* we can define a new and simple archive format, which is able to
 +  store sparse files efficiently.
 +
 +Note: Storing sparse files is a mess with existing archive
 +formats. For example, tar requires information about holes at the
 +beginning of the archive.
 +
 +==Disadvantages==
 +
 +* we need to define a new archive format
 +
 +Note: Most existing archive formats are optimized to store small files
 +including file attributes. We simply do not need that for VM archives.
 +
 +* archive contains data 'out of order'
 +
 +If you want to access image data in sequential order, you need to
 +re-order archive data. It would be possible to do that on the fly,
 +using temporary files.
 +
 +Fortunately, a normal restore/extract works perfectly with 'out of
 +order' data, because the target files are seekable.
 +
 +* slow backup storage can slow down VM during backup
 +
 +It is important to note that we only do sequential writes to the
 +backup storage. Furthermore one can compress the backup stream. IMHO,
 +it is better to slow down the VM a bit. All other solutions create
 +large amounts of temporary data during backup.
 +
 +=Archive format requirements=
 +
 +The basic requirement for such a new format is that we can store image
 +data 'out of order'. It is also very likely that we have less than 256
 +drives/images per VM, and we want to be able to store VM configuration
 +files.
 +
 +We have defined a very simple format with those properties, see:

Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-27 Thread Dietmar Maurer
  The only solution I came up with is to add before/after hooks in the
  block job.  I agree with the criticism, but I think it's general
  enough and at the same time easy enough to implement.

Ok, here is another try - do you think that is better?

Note: This code is not tested - I just want to ask if I am moving in the
right direction.

I tried to move BackupInfo into the block job.

From 86c3ca6467e8d220369f5e7b235cdc0e88a79ff9 Mon Sep 17 00:00:00 2001
From: Dietmar Maurer diet...@proxmox.com
Date: Tue, 27 Nov 2012 11:05:23 +0100
Subject: [PATCH] add basic backup support to block driver

Function bdrv_backup_init() creates a block job to backup a block device.

We call bdrv_co_backup_cow() for each write during backup. That function
reads the original data and passes it to backup_dump_cb().

The tracked_request infrastructure is used to serialize access.

Currently backup cluster size is hardcoded to 65536 bytes.

Signed-off-by: Dietmar Maurer diet...@proxmox.com
---
 Makefile.objs |    1 +
 backup.c      |  302 +
 backup.h      |   30 ++
 block.c       |   71 -
 block.h       |    2 +
 blockjob.h    |   10 ++
 6 files changed, 410 insertions(+), 6 deletions(-)
 create mode 100644 backup.c
 create mode 100644 backup.h

diff --git a/Makefile.objs b/Makefile.objs
index 3c7abca..cb46be5 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -48,6 +48,7 @@ coroutine-obj-$(CONFIG_WIN32) += coroutine-win32.o
 block-obj-y = iov.o cache-utils.o qemu-option.o module.o async.o
 block-obj-y += nbd.o block.o blockjob.o aes.o qemu-config.o
 block-obj-y += thread-pool.o qemu-progress.o qemu-sockets.o uri.o notify.o
+block-obj-y += backup.o
 block-obj-y += $(coroutine-obj-y) $(qobject-obj-y) $(version-obj-y)
 block-obj-$(CONFIG_POSIX) += event_notifier-posix.o aio-posix.o
 block-obj-$(CONFIG_WIN32) += event_notifier-win32.o aio-win32.o
diff --git a/backup.c b/backup.c
new file mode 100644
index 000..e86b76e
--- /dev/null
+++ b/backup.c
@@ -0,0 +1,302 @@
+/*
+ * QEMU backup
+ *
+ * Copyright (C) 2012 Proxmox Server Solutions
+ *
+ * Authors:
+ *  Dietmar Maurer (diet...@proxmox.com)
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include <stdio.h>
+#include <errno.h>
+#include <unistd.h>
+
+#include "block.h"
+#include "block_int.h"
+#include "blockjob.h"
+#include "backup.h"
+
+#define DEBUG_BACKUP 0
+
+#define DPRINTF(fmt, ...) \
+    do { if (DEBUG_BACKUP) { printf("backup: " fmt, ## __VA_ARGS__); } } \
+    while (0)
+
+
+#define BITS_PER_LONG  (sizeof(unsigned long) * 8)
+
+typedef struct BackupBlockJob {
+BlockJob common;
+unsigned long *bitmap;
+int bitmap_size;
+BackupDumpFunc *backup_dump_cb;
+BlockDriverCompletionFunc *backup_complete_cb;
+void *opaque;
+} BackupBlockJob;
+
+static int backup_get_bitmap(BlockDriverState *bs, int64_t cluster_num)
+{
+    assert(bs);
+    BackupBlockJob *job = (BackupBlockJob *)bs->job;
+    assert(job);
+    assert(job->bitmap);
+
+    unsigned long val, idx, bit;
+
+    idx = cluster_num / BITS_PER_LONG;
+
+    assert(job->bitmap_size > idx);
+
+    bit = cluster_num % BITS_PER_LONG;
+    val = job->bitmap[idx];
+
+    return !!(val & (1UL << bit));
+}
+
+static void backup_set_bitmap(BlockDriverState *bs, int64_t cluster_num,
+                              int dirty)
+{
+    assert(bs);
+    BackupBlockJob *job = (BackupBlockJob *)bs->job;
+    assert(job);
+    assert(job->bitmap);
+
+    unsigned long val, idx, bit;
+
+    idx = cluster_num / BITS_PER_LONG;
+
+    assert(job->bitmap_size > idx);
+
+    bit = cluster_num % BITS_PER_LONG;
+    val = job->bitmap[idx];
+    if (dirty) {
+        if (!(val & (1UL << bit))) {
+            val |= 1UL << bit;
+        }
+    } else {
+        if (val & (1UL << bit)) {
+            val &= ~(1UL << bit);
+        }
+    }
+    job->bitmap[idx] = val;
+}
+
+static int backup_in_progress_count;
+
+static int coroutine_fn bdrv_co_backup_cow(BlockDriverState *bs,
+                                           int64_t sector_num, int nb_sectors)
+{
+    assert(bs);
+    BackupBlockJob *job = (BackupBlockJob *)bs->job;
+    assert(job);
+
+    BlockDriver *drv = bs->drv;
+    struct iovec iov;
+    QEMUIOVector bounce_qiov;
+    void *bounce_buffer = NULL;
+    int ret = 0;
+
+    backup_in_progress_count++;
+
+    int64_t start, end;
+
+    start = sector_num / BACKUP_BLOCKS_PER_CLUSTER;
+    end = (sector_num + nb_sectors + BACKUP_BLOCKS_PER_CLUSTER - 1) /
+        BACKUP_BLOCKS_PER_CLUSTER;
+
+    DPRINTF("bdrv_co_backup_cow enter %s C%zd %zd %d\n",
+            bdrv_get_device_name(bs), start, sector_num, nb_sectors);
+
+    for (; start < end; start++) {
+        if (backup_get_bitmap(bs, start)) {
+            DPRINTF("bdrv_co_backup_cow skip C%zd\n", start);
+            continue; /* already copied */
+        }
+
+        /* immediately set bitmap (avoid coroutine race) */
+

Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-27 Thread Dietmar Maurer
  Just want to confirm something to understand it better:
 you are backing up the block image, not including the VM memory state, right?
 I am considering a way to do a live savevm including memory and device state,
 so I wonder if you already have a solution for it.

Yes, I already have code for that.




Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-26 Thread Paolo Bonzini
On 26/11/2012 06:51, Dietmar Maurer wrote:
 Which raises the question of how to distinguish whether it's a new request to
 bs that must go through the filters or whether it actually comes from the
 last filter in the chain. As you can see, we don't have a well thought out
 plan yet, just rough ideas (otherwise it would probably be implemented
 already).
 
 The question is whether I should modify my backup patch (regarding block filters).

The only solution I came up with is to add before/after hooks in the
block job.  I agree with the criticism, but I think it's general enough
and at the same time easy enough to implement.

 IMHO, the current implementation is quite simple and easy to maintain.

No, if (bs->backup_info) simply doesn't belong in bdrv_co_writev.

Paolo



Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-26 Thread Dietmar Maurer
 The only solution I came up with is to add before/after hooks in the block
 job.  I agree with the criticism, but I think it's general enough and at the
 same time easy enough to implement.
 
  IMHO, the current implementation is quite simple and easy to maintain.
 
 No, if (bs->backup_info) simply doesn't belong in bdrv_co_writev.

I do not really understand that argument, because the current COPY_ON_READ
implementation also works that way:

if (bs->copy_on_read) {
    flags |= BDRV_REQ_COPY_ON_READ;
}
if (flags & BDRV_REQ_COPY_ON_READ) {
    bs->copy_on_read_in_flight++;
}

if (bs->copy_on_read_in_flight) {
    wait_for_overlapping_requests(bs, sector_num, nb_sectors);
}

tracked_request_begin(&req, bs, sector_num, nb_sectors, false);

if (flags & BDRV_REQ_COPY_ON_READ) {
...

Or do you also want to move that to block job hooks?




Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-26 Thread Dietmar Maurer
  The only solution I came up with is to add before/after hooks in the
  block job.  I agree with the criticism, but I think it's general
  enough and at the same time easy enough to implement.
 
   IMHO, the current implementation is quite simple and easy to maintain.
 
  No, if (bs->backup_info) simply doesn't belong in bdrv_co_writev.
 
 I do not really understand that argument, because the current
 COPY_ON_READ implementation also works that way:
 
 if (bs->copy_on_read) {
     flags |= BDRV_REQ_COPY_ON_READ;
 }
 if (flags & BDRV_REQ_COPY_ON_READ) {
     bs->copy_on_read_in_flight++;
 }

 if (bs->copy_on_read_in_flight) {
     wait_for_overlapping_requests(bs, sector_num, nb_sectors);
 }

 tracked_request_begin(&req, bs, sector_num, nb_sectors, false);

 if (flags & BDRV_REQ_COPY_ON_READ) { ...
 
 Or do you also want to move that to block job hooks?

Just tried to move that code, but the copy-on-read feature is unrelated to
block jobs, i.e. one can open a bdrv with BDRV_O_COPY_ON_READ, and that does
not create a job.

I already suggested to add those hooks to BDS instead - don't you think that 
would work?




Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-25 Thread Dietmar Maurer
 Which raises the question of how to distinguish whether it's a new request to
 bs that must go through the filters or whether it actually comes from the last
 filter in the chain. As you can see, we don't have a well thought out plan
 yet, just rough ideas (otherwise it would probably be implemented already).

The question is whether I should modify my backup patch (regarding block filters).

IMHO, the current implementation is quite simple and easy to maintain. We can
easily convert it if someone comes up with a full-featured 'block filter'
solution.







Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-23 Thread Dietmar Maurer
  Yup, it's already not too bad. I haven't looked into it in much
  detail, but I'd like to reduce it even a bit more. In particular, the
  backup_info field in the BlockDriverState feels wrong to me. In the
  long term the generic block layer shouldn't know at all what a backup
  is, and baking it into BDS couples it very tightly.
 
 My plan was to have something like bs->job->job_type->{before,after}_write.
 
int coroutine_fn (*before_write)(BlockDriverState *bs,
 int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
 void **cookie);
int coroutine_fn (*after_write)(BlockDriverState *bs,
 int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
 void *cookie);

I don't think that job is the right place. Instead I would put a list 
of filters into BDS:

typedef struct BlockFilter {
    void *opaque;
    int cluster_size;
    int coroutine_fn (*before_read)(BlockDriverState *bs,
        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
        BdrvRequestFlags flags, void **cookie);
    int coroutine_fn (*after_read)(BlockDriverState *bs,
        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
        BdrvRequestFlags flags, void *cookie);
    int coroutine_fn (*before_write)(BlockDriverState *bs,
        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
        void **cookie);
    int coroutine_fn (*after_write)(BlockDriverState *bs,
        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
        void *cookie);
} BlockFilter;

struct BlockDriverState {
    ...
    QLIST_HEAD(, BlockFilter) filters;
};

Would that work for you?




Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-23 Thread Dietmar Maurer
  My plan was to have something like bs->job->job_type->{before,after}_write.
 
 int coroutine_fn (*before_write)(BlockDriverState *bs,
  int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
  void **cookie);
 int coroutine_fn (*after_write)(BlockDriverState *bs,
  int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
  void *cookie);
 
 I don't think that job is the right place. Instead I would put a list of
 filters into BDS:

Well, I can also add it to job_type. Just tell me what you prefer, and I will 
write the patch.




Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-23 Thread Dietmar Maurer
   My plan was to have something like bs->job->job_type->{before,after}_write.
  
  int coroutine_fn (*before_write)(BlockDriverState *bs,
   int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
   void **cookie);
  int coroutine_fn (*after_write)(BlockDriverState *bs,
   int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
   void *cookie);
 
  I don't think that job is the right place. Instead I would put a list of
  filters into BDS:
 
 Well, I can also add it to job_type. Just tell me what you prefer, and I will
 write the patch.

BTW, will such filters work with the new virtio-blk-data-plane?




Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-23 Thread Kevin Wolf
On 23.11.2012 08:38, Dietmar Maurer wrote:
 In short, the idea is that you can stick filters on top of a 
 BlockDriverState, so
 that any read/writes (and possibly more requests, if necessary) are routed
 through the filter before they are passed to the block driver of this BDS.
 Filters would be implemented as BlockDrivers, i.e. you could implement
 .bdrv_co_write() in a filter to intercept all writes to an image.
 
 I am quite unsure if that makes things easier.

At least it would make for a much cleaner design compared to putting
code for every feature you can think of into bdrv_co_do_readv/writev().

Kevin



Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-23 Thread Paolo Bonzini
On 23/11/2012 10:05, Dietmar Maurer wrote:
 My plan was to have something like bs->job->job_type->{before,after}_write.

int coroutine_fn (*before_write)(BlockDriverState *bs,
 int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
 void **cookie);
int coroutine_fn (*after_write)(BlockDriverState *bs,
 int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
 void *cookie);

 I don't think that job is the right place. Instead I would put a list of
 filters into BDS:

 Well, I can also add it to job_type. Just tell me what you prefer, and I will
 write the patch.
 
 BTW, will such filters work with the new virtio-blk-data-plane?

No, virtio-blk-data-plane is a hack and will be slowly rewritten to
support all fancy features.

Paolo




Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-23 Thread Dietmar Maurer
  BTW, will such filters work with the new virtio-blk-data-plane?
 
No, virtio-blk-data-plane is a hack and will be slowly rewritten to support
all fancy features.

Ah, good to know ;-) thanks.




Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-23 Thread Paolo Bonzini
On 23/11/2012 08:42, Dietmar Maurer wrote:
 My plan was to have something like bs->job->job_type->{before,after}_write.
  
 int coroutine_fn (*before_write)(BlockDriverState *bs,
  int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
  void **cookie);
 int coroutine_fn (*after_write)(BlockDriverState *bs,
  int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
  void *cookie);
  
  
  The before_write could optionally return a cookie that is passed back to
  the after_write callback.
 I don't really understand why a filter is related to the job? This is
 sometimes useful, but not a generic filter infrastructure (maybe someone
 wants to use filters without a job).

See the part you snipped:

Actually this was plan B, as a poor-man implementation of the filter
infrastructure.  Plan A was that the block filters would materialize
suddenly in someone's git tree.

Paolo




Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-23 Thread Dietmar Maurer
  Filters would be implemented as BlockDrivers, i.e. you could
  implement
  .bdrv_co_write() in a filter to intercept all writes to an image.
 
  I am quite unsure if that makes things easier.
 
 At least it would make for a much cleaner design compared to putting code
 for every feature you can think of into bdrv_co_do_readv/writev().

So if you want to add a filter, you simply modify bs->drv to point to the
filter?




Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-23 Thread Dietmar Maurer
 Actually this was plan B, as a poor-man implementation of the filter
 infrastructure.  Plan A was that the block filters would materialize
 suddenly in someone's git tree.

OK, so let us summarize the options:

a.) wait until it materializes suddenly in someone's git tree.
b.) add BlockFilter inside BDS
c.) add filter callbacks to block jobs (job_type)
d.) use BlockDriver as filter
e.) use the current BackupInfo unless filters materialize suddenly in
someone's git tree.

more ideas?




Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-23 Thread Dietmar Maurer
   Filters would be implemented as BlockDrivers, i.e. you could
   implement
   .bdrv_co_write() in a filter to intercept all writes to an image.
  
   I am quite unsure if that make things easier.
 
  At least it would make for a much cleaner design compared to putting
  code for every feature you can think of into bdrv_co_do_readv/writev().
 
  So if you want to add a filter, you simply modify bs->drv to point to the
  filter?

Seems the BlockDriver struct does not contain any 'state' (I guess that is
by design), so where do you store filter-related dynamic data?




Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-23 Thread Kevin Wolf
On 23.11.2012 10:05, Dietmar Maurer wrote:
 My plan was to have something like bs->job->job_type->{before,after}_write.

int coroutine_fn (*before_write)(BlockDriverState *bs,
 int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
 void **cookie);
int coroutine_fn (*after_write)(BlockDriverState *bs,
 int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
 void *cookie);

 I don't think that job is the right place. Instead I would put a list of
 filters into BDS:

 Well, I can also add it to job_type. Just tell me what you prefer, and I will
 write the patch.

A block filter shouldn't be tied to a job, I think. We have things like
blkdebug that are really filters and aren't coupled with a job, and on
the other hand we want to generalise block jobs into just jobs, so
adding block specific things to job_type would be a step in the wrong
direction.

I also think that before_write/after_write isn't a convenient interface,
it brings back much of the callback-based AIO cruft and passing void*
isn't nice anyway. It's much nicer to have a single .bdrv_co_write
callback that somewhere in the middle calls the layer below with a
simple function call.

Also read/write aren't enough, for a full filter interface you
potentially also need flush, discard and probably most other operations.

This is why I suggested using a regular BlockDriver struct for filters,
it already has all functions that are needed.

 BTW, will such filters work with the new virtio-blk-data-plane?

Not initially, but I think as soon as data plane gets support for image
formats, filters would work as well.

Kevin



Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-23 Thread Kevin Wolf
On 23.11.2012 10:31, Dietmar Maurer wrote:
 Filters would be implemented as BlockDrivers, i.e. you could
 implement
 .bdrv_co_write() in a filter to intercept all writes to an image.

 I am quite unsure if that makes things easier.

 At least it would make for a much cleaner design compared to putting
 code for every feature you can think of into bdrv_co_do_readv/writev().

 So if you want to add a filter, you simply modify bs->drv to point to the
 filter?
 
 Seems the BlockDriver struct does not contain any 'state' (I guess that is
 by design), so where do you store filter-related dynamic data?

You wouldn't change bs->drv of the block device, you still need that one
after having processed the data in the filter.

Instead, you'd have some BlockDriverState *first_filter in bs to which
requests are forwarded. first_filter->file would point to either the
next filter or if there are no more filters to the real BlockDriverState.

Which raises the question of how to distinguish whether it's a new
request to bs that must go through the filters or whether it actually
comes from the last filter in the chain. As you can see, we don't have a
well thought out plan yet, just rough ideas (otherwise it would probably
be implemented already).

Kevin



Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-23 Thread Markus Armbruster
Kevin Wolf kw...@redhat.com writes:

 On 23.11.2012 10:05, Dietmar Maurer wrote:
 My plan was to have something like bs->job->job_type->{before,after}_write.

int coroutine_fn (*before_write)(BlockDriverState *bs,
 int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
 void **cookie);
int coroutine_fn (*after_write)(BlockDriverState *bs,
 int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
 void *cookie);

  I don't think that job is the right place. Instead I would put a list of
  filters into BDS:

  Well, I can also add it to job_type. Just tell me what you prefer, and I
  will write the patch.

 A block filter shouldn't be tied to a job, I think. We have things like
 blkdebug that are really filters and aren't coupled with a job, and on
 the other hand we want to generalise block jobs into just jobs, so
 adding block specific things to job_type would be a step in the wrong
 direction.

 I also think that before_write/after_write isn't a convenient interface,
 it brings back much of the callback-based AIO cruft and passing void*
 isn't nice anyway. It's much nicer to have a single .bdrv_co_write
 callback that somewhere in the middle calls the layer below with a
 simple function call.

 Also read/write aren't enough, for a full filter interface you
 potentially also need flush, discard and probably most other operations.

 This is why I suggested using a regular BlockDriver struct for filters,
 it already has all functions that are needed.

Let me elaborate a bit.

A block backend is a tree of block driver instances (BlockDriverState).
Common examples:

   raw        qcow2        qcow2
    |         /   \        /   \
   file    file   raw   file   qcow2
                   |           /   \
                 file       file   raw
                                    |
                                   file

A less common example:

    raw
     |
 blkdebug
     |
    file

Here, blkdebug acts as a filter, i.e. a block driver that can be put
between two adjacent tree nodes.  It injects errors by selectively
failing some bdrv_aio_readv() and bdrv_aio_writev() operations.

Actually, raw could also be viewed as a degenerate filter that does
nothing[*], but such a filter isn't particularly useful.

Except perhaps to serve as base for real filters, that do stuff.  To do
stuff in your filter, you'd replace raw's operations with your own.

Hmm, blkdebug implements much fewer operations than raw.  Makes me
wonder whether it works only in special places in the tree now.

[...]


[*] Except occasionally inject bugs when somebody adds new BlockDriver
operations without updating raw to forward them.



Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-22 Thread Stefan Hajnoczi
On Wed, Nov 21, 2012 at 10:01:00AM +0100, Dietmar Maurer wrote:
 +==Disadvantages==
 +
 +* we need to define a new archive format
 +
 +Note: Most existing archive formats are optimized to store small files
 +including file attributes. We simply do not need that for VM archives.

Did you look at the VMDK Stream-Optimized Compressed subformat?

http://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf?src=vmdk

It is a stream of compressed grains (data).  They are out-of-order and
each grain comes with the virtual disk lba where the data should be
visible to the guest.

The stream also contains grain tables and grain directories.  This
metadata makes random read access to the file possible once you have
downloaded the entire file (i.e. it is seekable).  Although tools can
choose to consume the stream in sequential order too and ignore the
metadata.

In other words, the format is an out-of-order stream of data chunks plus
random access lookup tables at the end.

QEMU's block/vmdk.c already has some support for this format although I
don't think we generate out-of-order yet.

The benefit of reusing this code is that existing tools can consume
these files.

Stefan



Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-22 Thread Dietmar Maurer
 Did you look at the VMDK Stream-Optimized Compressed subformat?
 
 http://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf?
 src=vmdk
 
 It is a stream of compressed grains (data).  They are out-of-order and each
 grain comes with the virtual disk lba where the data should be visible to the
 guest.
 

What kind of license is applied to that specification?




Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-22 Thread Dietmar Maurer
 Did you look at the VMDK Stream-Optimized Compressed subformat?
 
 http://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf?
 src=vmdk

Max file size 2TB?




Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-22 Thread Dietmar Maurer
 It is a stream of compressed grains (data).  They are out-of-order and each
 grain comes with the virtual disk lba where the data should be visible to the
 guest.
 
 The stream also contains grain tables and grain directories.  This
 metadata makes random read access to the file possible once you have
 downloaded the entire file (i.e. it is seekable).  Although tools can choose 
 to
 consume the stream in sequential order too and ignore the metadata.
 
 In other words, the format is an out-of-order stream of data chunks plus
 random access lookup tables at the end.
 
 QEMU's block/vmdk.c already has some support for this format although I
 don't think we generate out-of-order yet.
 
 The benefit of reusing this code is that existing tools can consume these 
 files.

The compression format is hardcoded to RFC 1951 (deflate). I think this is a major
disadvantage, because it is really slow (compared to lzop).




Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-22 Thread Dietmar Maurer
 Did you look at the VMDK Stream-Optimized Compressed subformat?
 
 http://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf?
 src=vmdk

And is that covered by any patents?




Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-22 Thread Stefan Hajnoczi
On Thu, Nov 22, 2012 at 11:26:21AM +, Dietmar Maurer wrote:
  Did you look at the VMDK Stream-Optimized Compressed subformat?
  
  http://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf?
  src=vmdk
  
  It is a stream of compressed grains (data).  They are out-of-order and 
  each
  grain comes with the virtual disk lba where the data should be visible to 
  the
  guest.
  
 
 What kind of license is applied to that specification?

The document I linked came straight from Google Search and you don't
need to agree to anything to view it.  The document doesn't seem to
impose restrictions.  QEMU has supported the VMDK format and so have
other open source tools for a number of years.

For anything more specific you could search VMware's website and/or
check with a lawyer.

Stefan



Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-22 Thread Dietmar Maurer
 -Original Message-
 From: Stefan Hajnoczi [mailto:stefa...@gmail.com]
 Sent: Donnerstag, 22. November 2012 13:45
 To: Dietmar Maurer
 Cc: qemu-devel@nongnu.org; kw...@redhat.com
 Subject: Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu
 (v1)
 
 On Thu, Nov 22, 2012 at 11:26:21AM +, Dietmar Maurer wrote:
   Did you look at the VMDK Stream-Optimized Compressed subformat?
  
  
 http://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf?
   src=vmdk
  
   It is a stream of compressed grains (data).  They are out-of-order
   and each grain comes with the virtual disk lba where the data should
   be visible to the guest.
  
 
  What kind of license is applied to that specification?
 
 The document I linked came straight from Google Search and you don't need
 to agree to anything to view it.  The document doesn't seem to impose
 restrictions.  QEMU has supported the VMDK format and so have other open
 source tools for a number of years.
 
 For anything more specific you could search VMware's website and/or check
 with a lawyer.

The document says: VMware products are covered by one or more patents listed 
at http://www.vmware.com/go/patents

I simply do not have the time to check all those things, which makes that format 
unusable for me.

Anyways, thanks for the link.




Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-22 Thread Stefan Hajnoczi
On Thu, Nov 22, 2012 at 1:55 PM, Dietmar Maurer diet...@proxmox.com wrote:

 On Thu, Nov 22, 2012 at 11:26:21AM +, Dietmar Maurer wrote:
   Did you look at the VMDK Stream-Optimized Compressed subformat?
  
  
 http://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf?
   src=vmdk
  
   It is a stream of compressed grains (data).  They are out-of-order
   and each grain comes with the virtual disk lba where the data should
   be visible to the guest.
  
 
  What kind of license is applied to that specification?

 The document I linked came straight from Google Search and you don't need
 to agree to anything to view it.  The document doesn't seem to impose
 restrictions.  QEMU has supported the VMDK format and so have other open
 source tools for a number of years.

 For anything more specific you could search VMware's website and/or check
 with a lawyer.

 The documents says: VMware products are covered by one or more patents listed 
 at http://www.vmware.com/go/patents

 I simply do not have the time to check all those things, which make that 
 format unusable for me.

I think proxmox ships the QEMU vmdk functionality today?  In that
case you should check this :).

Stefan



Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-22 Thread Stefan Hajnoczi
On Thu, Nov 22, 2012 at 12:40 PM, Dietmar Maurer diet...@proxmox.com wrote:
 Did you look at the VMDK Stream-Optimized Compressed subformat?

 http://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf?
 src=vmdk

 Max file size 2TB?

That is for a single .vmdk file.  But vmdks can also be split across
multiple files (extents) so you can get more than 2TB.

QEMU has some support for reading vmdks with extents, but I think we
never create files like this today.

Stefan



Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-22 Thread Stefan Hajnoczi
On Thu, Nov 22, 2012 at 1:00 PM, Dietmar Maurer diet...@proxmox.com wrote:
 It is a stream of compressed grains (data).  They are out-of-order and each
 grain comes with the virtual disk lba where the data should be visible to the
 guest.

 The stream also contains grain tables and grain directories.  This
 metadata makes random read access to the file possible once you have
 downloaded the entire file (i.e. it is seekable).  Although tools can choose 
 to
 consume the stream in sequential order too and ignore the metadata.

 In other words, the format is an out-of-order stream of data chunks plus
 random access lookup tables at the end.

 QEMU's block/vmdk.c already has some support for this format although I
 don't think we generate out-of-order yet.

 The benefit of reusing this code is that existing tools can consume these 
 files.

 Compression format is hardcoded to RFC 1951 (defalte). I think this is a 
 major disadvantage,
 because it is really slow (compared to lzop).

It's a naughty thing to do but we could simply pick a new constant and
support LZO as an incompatible option.  The file is then no longer
compatible with existing vmdk tools but at least we then have a choice
of using compatible deflate or the LZO extension.

VMDK already has 99% of what you need and we already have a bunch of
code to handle this format.  This seems like a good opportunity to
flesh out VMDK support and avoid reinventing the wheel.

Stefan



Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-22 Thread Dietmar Maurer
 It's a naughty thing to do but we could simply pick a new constant and
 support LZO as an incompatible option.  The file is then no longer compatible
 with existing vmdk tools but at least we then have a choice of using
 compatible deflate or the LZO extension.

To be 100% incompatible with existing tools? That would remove any advantage.
 
 VMDK already has 99% of what you need and we already have a bunch of
 code to handle this format.  This seems like a good opportunity to flesh out
 VMDK support and avoid reinventing the wheel.

Using 'probably' patented software is a bad idea - I will not go that way.




Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-22 Thread Dietmar Maurer
  The documents says: VMware products are covered by one or more
 patents
  listed at http://www.vmware.com/go/patents
 
  I simply do not have the time to check all those things, which make that
 format unusable for me.
 
 In think proxmox ships the QEMU vmdk functionality today?  In that case you
 should check this :).

Well, and also remove it from the qemu repository? Such things are not 
compatible with GPL?




Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-22 Thread Stefan Hajnoczi
On Thu, Nov 22, 2012 at 4:56 PM, Dietmar Maurer diet...@proxmox.com wrote:
 It's a naughty thing to do but we could simply pick a new constant and
 support LZO as an incompatible option.  The file is then no longer compatible
 with existing vmdk tools but at least we then have a choice of using
 compatible deflate or the LZO extension.

 To be 100% incompatible to existing tools? That would remove any advantage.

No, it should be an option.  Users who care about compatibility can use deflate.

Stefan



Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-22 Thread Stefan Hajnoczi
On Thu, Nov 22, 2012 at 4:58 PM, Dietmar Maurer diet...@proxmox.com wrote:
  The documents says: VMware products are covered by one or more
 patents
  listed at http://www.vmware.com/go/patents
 
  I simply do not have the time to check all those things, which make that
 format unusable for me.

 In think proxmox ships the QEMU vmdk functionality today?  In that case you
 should check this :).

 Well, and also remove it from the qemu repository? Such things are not 
 compatible with GPL?

If you are really concerned about this then submit a patch to add
./configure --disable-vmdk, ship QEMU without VMDK, and drop it from
your documentation/wiki.

If you're not really concerned, then let's accept that VMDK support is okay.

Stefan



Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-22 Thread Stefan Hajnoczi
On Thu, Nov 22, 2012 at 12:12 PM, Stefan Hajnoczi stefa...@gmail.com wrote:
 On Wed, Nov 21, 2012 at 10:01:00AM +0100, Dietmar Maurer wrote:
 +==Disadvantages==
 +
 +* we need to define a new archive format
 +
 +Note: Most existing archive formats are optimized to store small files
 +including file attributes. We simply do not need that for VM archives.

 Did you look at the VMDK Stream-Optimized Compressed subformat?

We've gone down several sub-threads discussing whether VMDK is
suitable.  I want to summarize why this is a good approach:

The VMDK format already allows for out-of-order data and is supported
by existing tools - this is very important for backups where people
are (rightfully) paranoid about putting their backups in an obscure
format.  They want to be able to access their data years later,
whether your tool is still around or not.

QEMU's implementation has partial support for Stream-Optimized
Compressed images.  If you complete the code for this subformat, not
only does this benefit the VM Backup feature, but it also makes
qemu-img convert more powerful for everyone.  I hope we can kill two
birds with one stone here.

Stefan



Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-22 Thread Dietmar Maurer


 
 On Thu, Nov 22, 2012 at 4:58 PM, Dietmar Maurer diet...@proxmox.com
 wrote:
   The documents says: VMware products are covered by one or more
  patents
   listed at http://www.vmware.com/go/patents
  
   I simply do not have the time to check all those things, which make
   that
  format unusable for me.
 
  In think proxmox ships the QEMU vmdk functionality today?  In that
  case you should check this :).
 
  Well, and also remove it from the qemu repository? Such things are not
 compatible with GPL?
 
 If you are really concerned about this then submit a patch to add ./configure
 --disable-vmdk, ship QEMU without VMDK, and drop it from your
 documentation/wiki.
 
 If you're not really concerned, then let's accept that VMDK support is okay.

I simply don't want to waste time on something with unclear license issues.
So I will not work on such a format.

Don't get me wrong - that is just my personal opinion.






Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-22 Thread Dietmar Maurer
  Did you look at the VMDK Stream-Optimized Compressed subformat?
 
 We've gone down several sub-threads discussing whether VMDK is suitable.
 I want to summarize why this is a good approach:
 
 The VMDK format already allows for out-of-order data and is supported by
 existing tools - this is very important for backups where people are
 (rightfully) paranoid about putting their backups in an obscure format.  They
 want to be able to access their data years later, whether your tool is still
 around or not.

The VMDK format has strong disadvantages:

- unclear License (the spec links to patents)
- they use a very slow compression algorithm (deflate), which makes it unusable 
for backup 







Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-22 Thread Dietmar Maurer
 The VMDK format has strong disadvantages:
 
 - unclear License (the spec links to patents)
 - they use a very slow compression algorithm (deflate), which makes it
 unusable for backup

Seems they do not support multiple configuration files. You can only store
a single text block, and that needs to contain vmware specific info.
So where do I add my qemu related config?






Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-22 Thread Dietmar Maurer
 QEMU's implementation has partial support for Stream-Optimized
 Compressed images.  If you complete the code for this subformat, not only
 does this benefit the VM Backup feature, but it also makes qemu-img convert
 more powerful for everyone.  I hope we can kill two birds with one stone

The doc contains the following link:

http://www.vmware.com/download/patents.html

I simply have no idea how to check all those patents. How can someone tell
that they do not cover things in the specs? I am really curious.







Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-22 Thread Dietmar Maurer
 The VMDK format already allows for out-of-order data and is supported by
 existing tools - this is very important for backups where people are
 (rightfully) paranoid about putting their backups in an obscure format.  They
 want to be able to access their data years later, whether your tool is still
 around or not.

Anything we add to the qemu source fulfills those properties. Or do you
really think qemu will disappear soon?

Besides, the VMA format is much simpler than the vmdk format. Thus I consider
it safer (and not 'obscure').






Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-22 Thread Stefan Hajnoczi
On Thu, Nov 22, 2012 at 7:05 PM, Dietmar Maurer diet...@proxmox.com wrote:
 QEMU's implementation has partial support for Stream-Optimized
 Compressed images.  If you complete the code for this subformat, not only
 does this benefit the VM Backup feature, but it also makes qemu-img convert
 more powerful for everyone.  I hope we can kill two birds with one stone

 The doc contain the following link:

 http://www.vmware.com/download/patents.html

 I simply have no idea how to check all those patents. How can someone tell
 that they do not cover things in the specs? I am really curios?

If you want to investigate it then you would look at each one or use
search engines to make it easier (skip all the non-disk image related
patents).

But keep in mind that any other company out there could have a patent
on out-of-order data in an image file or other aspects of what you're
proposing.  Reinventing the wheel may not stop you from infringing on
their patents so the fact that VMware may or may not have patents
doesn't change things if you're really trying to find possible issues.

This is why SQLite has a policy of only using algorithms that are
older than 17 years, see comment by drh:
http://www.sqlite.org/cvstrac/wiki?p=BlueSky

Stefan



Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-22 Thread Stefan Hajnoczi
On Thu, Nov 22, 2012 at 6:50 PM, Dietmar Maurer diet...@proxmox.com wrote:
 The VMDK format has strong disadvantages:

 - unclear License (the spec links to patents)
 - they use a very slow compression algorithm (deflate), which makes it
 unusable for backup

 Seems they do not support multiple configuration files. You can only
 a single text block, and that needs to contain vmware specific info.
 So where do I add my qemu related config?

This is true.  QEMU uses a VMDK as a single disk image.  To handle
multiple disks there would need to be multiple images plus a vmstate
or config file.

Stefan



Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-22 Thread Stefan Hajnoczi
On Thu, Nov 22, 2012 at 6:46 PM, Dietmar Maurer diet...@proxmox.com wrote:
  Did you look at the VMDK Stream-Optimized Compressed subformat?

 We've gone down several sub-threads discussing whether VMDK is suitable.
 I want to summarize why this is a good approach:

 The VMDK format already allows for out-of-order data and is supported by
 existing tools - this is very important for backups where people are
 (rightfully) paranoid about putting their backups in an obscure format.  They
 want to be able to access their data years later, whether your tool is still
 around or not.

 The VMDK format has strong disadvantages:

 - unclear License (the spec links to patents)

I've already pointed out that you're taking an inconsistent position
on this point.  It's FUD.

 - they use a very slow compression algorithm (deflate), which makes it 
 unusable for backup

I've already pointed out that we can optionally support other algorithms.

Stefan



Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-22 Thread Stefan Hajnoczi
On Fri, Nov 23, 2012 at 6:23 AM, Stefan Hajnoczi stefa...@gmail.com wrote:
 On Thu, Nov 22, 2012 at 6:46 PM, Dietmar Maurer diet...@proxmox.com wrote:
  Did you look at the VMDK Stream-Optimized Compressed subformat?

 We've gone down several sub-threads discussing whether VMDK is suitable.
 I want to summarize why this is a good approach:

 The VMDK format already allows for out-of-order data and is supported by
 existing tools - this is very important for backups where people are
 (rightfully) paranoid about putting their backups in an obscure format.  
 They
 want to be able to access their data years later, whether your tool is still
 around or not.

 The VMDK format has strong disadvantages:

 - unclear License (the spec links to patents)

 I've already pointed out that you're taking an inconsistent position
 on this point.  It's FUD.

 - they use a very slow compression algorithm (deflate), which makes it 
 unusable for backup

 I've already pointed out that we can optionally support other algorithms.

To make progress here I'll review the RFC patches.  VMDK or not isn't
the main thing, a backup feature like this looks interesting.

Stefan



Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-22 Thread Dietmar Maurer
 But keep in mind that any other company out there could have a patent on
 out-of-order data in an image file or other aspects of what you're proposing.

Sorry, but the vmware docs explicitly include a pointer to those patents. So 
this
is something completely different to me.






Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-22 Thread Dietmar Maurer
  The VMDK format has strong disadvantages:
 
  - unclear License (the spec links to patents)
 
 I've already pointed out that you're taking an inconsistent position on this
 point.  It's FUD.
 
  - they use a very slow compression algorithm (deflate), which makes it
  unusable for backup
 
 I've already pointed out that we can optionally support other algorithms.

Well, I guess we both pointed out our opinions. 

I will try to implement some kind of plugin architecture for backup formats.
That way we can implement/support more than one format.
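A minimal sketch of what such a format-plugin registry could look like (Python stand-in; the registry shape and the stub 'vma' writer are illustrative, not the real implementation):

```python
BACKUP_FORMATS = {}  # driver name -> writer class

def register_format(name):
    """Decorator: register a backup-format writer under a driver name."""
    def deco(cls):
        BACKUP_FORMATS[name] = cls
        return cls
    return deco

@register_format('vma')
class VmaWriter:
    """Stub writer: just collects (device, offset, data) records."""
    def __init__(self):
        self.chunks = []

    def put(self, dev_id, offset, data):
        self.chunks.append((dev_id, offset, data))  # out-of-order is fine

def open_backup(name):
    """Instantiate the writer for the requested format (KeyError if unknown)."""
    return BACKUP_FORMATS[name]()
```

The backup job would then talk only to the writer interface, so a second format (e.g. a VMDK-based one) could be registered without touching the job code.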





Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-22 Thread Dietmar Maurer
 To make progress here I'll review the RFC patches.  VMDK or not isn't the
 main thing, a backup feature like this looks interesting.

Yes, a 'review' would be great - thanks.

- Dietmar





Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-22 Thread Dietmar Maurer
 In short, the idea is that you can stick filters on top of a 
 BlockDriverState, so
 that any read/writes (and possibly more requests, if necessary) are routed
 through the filter before they are passed to the block driver of this BDS.
 Filters would be implemented as BlockDrivers, i.e. you could implement
 .bdrv_co_write() in a filter to intercept all writes to an image.

I am quite unsure if that makes things easier.





Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-22 Thread Dietmar Maurer
  Yup, it's already not too bad. I haven't looked into it in much
  detail, but I'd like to reduce it even a bit more. In particular, the
  backup_info field in the BlockDriverState feels wrong to me. In the
  long term the generic block layer shouldn't know at all what a backup
  is, and baking it into BDS couples it very tightly.
 
 My plan was to have something like bs->job->job_type->{before,after}_write.
 
int coroutine_fn (*before_write)(BlockDriverState *bs,
 int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
 void **cookie);
int coroutine_fn (*after_write)(BlockDriverState *bs,
 int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
 void *cookie);
 
 
 The before_write could optionally return a cookie that is passed back to
 the after_write callback.

I don't really understand why a filter is related to the job? This is sometimes
useful, but not a generic filter infrastructure (maybe someone wants to use filters
without a job).
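The before/after hook idea with a cookie could be modeled roughly like this (a Python stand-in for the C function pointers above; the class and method names are purely illustrative):

```python
class BackupJob:
    """Toy job: before_write returns a cookie, after_write consumes it."""
    def __init__(self):
        self.log = []

    def before_write(self, dev, offset, buf):
        # Snapshot the old contents; returned value is the cookie.
        return bytes(dev.data[offset:offset + len(buf)])

    def after_write(self, dev, offset, buf, cookie):
        self.log.append((offset, cookie))  # old data, paired via the cookie

class BlockDevice:
    def __init__(self, size, job=None):
        self.data = bytearray(size)
        self.job = job  # job may supply before/after write hooks

    def writev(self, offset, buf):
        cookie = None
        if self.job:
            cookie = self.job.before_write(self, offset, buf)  # pre-hook
        self.data[offset:offset + len(buf)] = buf              # actual write
        if self.job:
            self.job.after_write(self, offset, buf, cookie)    # post-hook
```

The cookie lets the two hooks share per-request state without the generic write path knowing anything about the job's internals.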





Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-21 Thread Kevin Wolf
Am 21.11.2012 10:01, schrieb Dietmar Maurer:
 +Some storage types/formats supports internal snapshots using some kind
 +of reference counting (rados, sheepdog, dm-thin, qcow2). It would be possible
 +to use that for backups, but for now we want to be storage-independent.
 +
 +Note: It turned out that taking a qcow2 snapshot can take a very long
 +time on larger files.

Hm, really? What are larger files? It has always been relatively quick
when I tested it, though internal snapshots are not my focus, so that
need not mean much.

If this is really an important use case for someone, I think qcow2
internal snapshots still have some potential for relatively easy
performance optimisations.

But that just as an aside...

 +
 +=Make it more efficient=
 +
 +To be more efficient, we simply need to avoid unnecessary steps. The
 +following steps are always required:
 +
 +1.) read old data before it gets overwritten
 +2.) write that data into the backup archive
 +3.) write new data (VM write)
 +
 +As you can see, this involves only one read, and two writes.
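The copy-before-write sequence quoted above can be sketched as a toy model (Python stand-in; the cluster size, bitmap representation and names are illustrative):

```python
CLUSTER = 4096  # illustrative cluster size

def backup_write(disk, archive, saved, offset, data):
    """Copy-before-write: the three steps from the text, per cluster."""
    first = offset // CLUSTER
    last = (offset + len(data) - 1) // CLUSTER
    for c in range(first, last + 1):
        if c not in saved:                                   # not yet backed up
            old = bytes(disk[c * CLUSTER:(c + 1) * CLUSTER])  # 1) read old data
            archive[c] = old                                  # 2) write it to the archive
            saved.add(c)                                      #    mark cluster as done
    disk[offset:offset + len(data)] = data                    # 3) write new data
```

Only the first write to a cluster pays the extra read+write; later writes to the same cluster go straight through, which is why the bitmap of saved clusters matters.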

Looks like a nice approach to backup indeed.

The question is how to fit this into the big picture of qemu's live
block operations. Much of it looks like an active mirror (which is still
to be implemented), with the difference that it doesn't write the new,
but the old data, and that it keeps a bitmap of clusters that should not
be mirrored.

I'm not sure if this means that code should be shared between these two
or if the differences are too big. However, both of them have things in
common regarding the design. For example, both have a background part
(copying the existing data) and an active part (mirroring/backing up
data on writes). Block jobs are the right tool for the background part.

The active part is a bit more tricky. You're putting some code into
block.c to achieve it, which is kind of ugly. We have been talking about
block filters previously that would provide a generic infrastructure,
and at least in the mid term the additions to block.c must disappear.
(Same for block.h and block_int.h - keep things as separated from the
core as possible) Maybe we should introduce this infrastructure now.

Another interesting point is how (or whether) to link block jobs with
block filters. I think when the job is started, the filter should be
inserted automatically, and when you cancel it, it should be stopped.
When you pause the job... no idea. :-)

 +
 +To make that work, our backup archive need to be able to store image
 +data 'out of order'. It is important to notice that this will not work
 +with traditional archive formats like tar.

 +* works on any storage type and image format.
 +* we can define a new and simple archive format, which is able to
 +  store sparse files efficiently.

 +
 +Note: Storing sparse files is a mess with existing archive
 +formats. For example, tar requires information about holes at the
 +beginning of the archive.

 +* we need to define a new archive format
 +
 +Note: Most existing archive formats are optimized to store small files
 +including file attributes. We simply do not need that for VM archives.
 +
 +* archive contains data 'out of order'
 +
 +If you want to access image data in sequential order, you need to
 +re-order archive data. It would be possible to do that on the fly,
 +using temporary files.
 +
 +Fortunately, a normal restore/extract works perfectly with 'out of
 +order' data, because the target files are seekable.
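That restore path can be illustrated with a minimal sketch (Python; it assumes nothing about the real archive layout, only that records carry their target offset):

```python
import io

def restore(chunks, image_size):
    """Replay out-of-order (offset, data) records into a seekable target."""
    target = io.BytesIO(b'\0' * image_size)
    for offset, data in chunks:   # arbitrary arrival order is fine: just seek
        target.seek(offset)
        target.write(data)
    return target.getvalue()
```

Because the target is seekable, no re-ordering pass or temporary files are needed on extract; only sequential consumers (pipes, tapes) would require them.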

 +=Archive format requirements=
 +
 +The basic requirement for such a new format is that we can store image
 +data 'out of order'. It is also very likely that we have less than 256
 +drives/images per VM, and we want to be able to store VM configuration
 +files.
 +
 +We have defined a very simple format with those properties, see:
 +
 +docs/specs/vma_spec.txt
 +
 +Please let us know if you know an existing format which provides the
 +same functionality.

Essentially, what you need is an image format. You want to be
independent from the source image formats, but you're okay with using a
specific format for the backup (or you wouldn't have defined a new
format for it).

The one special thing that you need is storing multiple images in one
file. There's something like this already in qemu: qcow2 with its
internal snapshots is basically a flat file system.

Not saying that this is necessarily the best option, but I think reusing
existing formats and implementation is always a good thing, so it's an
idea to consider.

Kevin



Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-21 Thread Dietmar Maurer
  +Note: It turned out that taking a qcow2 snapshot can take a very long
  +time on larger files.
 
 Hm, really? What are larger files? It has always been relatively quick when 
 I
 tested it, though internal snapshots are not my focus, so that need not mean
 much.

300GB or larger
 
 If this is really an important use case for someone, I think qcow2 internal
 snapshots still have some potential for relatively easy performance
 optimisations.

I guess the problem is the small cluster size, so the reference table gets 
quite large
(for example fvd uses 2GB to minimize table size).
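That hunch can be checked with quick back-of-the-envelope arithmetic (assuming here qcow2's default 64 KiB clusters, 16-bit refcounts, and 8-byte L2 entries; an internal snapshot has to bump the refcount of every allocated cluster):

```python
GiB = 1 << 30
KiB = 1 << 10

def qcow2_metadata_bytes(image_bytes, cluster=64 * KiB):
    """Rough qcow2 metadata sizes for a fully allocated image."""
    clusters = image_bytes // cluster
    refcount_bytes = clusters * 2   # one 16-bit refcount per cluster
    l2_bytes = clusters * 8         # one 8-byte L2 entry per data cluster
    return clusters, refcount_bytes, l2_bytes
```

For a 300 GiB image this gives roughly 4.9 million clusters, so a snapshot touches millions of refcounts spread over ~10 MiB of refcount blocks, which is consistent with snapshots getting slow on large images.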
 
 But that just as an aside...
 
  +
  +=Make it more efficient=
  +
  +The be more efficient, we simply need to avoid unnecessary steps. The
  +following steps are always required:
  +
  +1.) read old data before it gets overwritten
  +2.) write that data into the backup archive
  +3.) write new data (VM write)
  +
  +As you can see, this involves only one read, an two writes.
 
 Looks like a nice approach to backup indeed.
 
 The question is how to fit this into the big picture of qemu's live block
 operations. Much of it looks like an active mirror (which is still to be
 implemented), with the difference that it doesn't write the new, but the old
 data, and that it keeps a bitmap of clusters that should not be mirrored.
 
 I'm not sure if this means that code should be shared between these two or
 if the differences are too big. However, both of them have things in common
 regarding the design. For example, both have a background part (copying the
 existing data) and an active part (mirroring/backing up data on writes). Block
 jobs are the right tool for the background part.

I already use block jobs. Or do you want to share more?
 
 The active part is a bit more tricky. You're putting some code into block.c to
 achieve it, which is kind of ugly. 

yes. but I tried to keep that small ;-)

We have been talking about block filters
 previously that would provide a generic infrastructure, and at least in the 
 mid
 term the additions to block.c must disappear.
 (Same for block.h and block_int.h - keep things as separated from the core as
 possible) Maybe we should introduce this infrastructure now.

I have no idea what you are talking about. Can you point me to the relevant discussion?
 
 Another interesting point is how (or whether) to link block jobs with block
 filters. I think when the job is started, the filter should be inserted
 automatically, and when you cancel it, it should be stopped.
 When you pause the job... no idea. :-)
 
  +
  +To make that work, our backup archive need to be able to store image
  +data 'out of order'. It is important to notice that this will not
  +work with traditional archive formats like tar.
 
  +* works on any storage type and image format.
  +* we can define a new and simple archive format, which is able to
  +  store sparse files efficiently.
 
  +
  +Note: Storing sparse files is a mess with existing archive formats.
  +For example, tar requires information about holes at the beginning of
  +the archive.
 
  +* we need to define a new archive format
  +
  +Note: Most existing archive formats are optimized to store small
  +files including file attributes. We simply do not need that for VM 
  archives.
  +
  +* archive contains data 'out of order'
  +
  +If you want to access image data in sequential order, you need to
  +re-order archive data. It would be possible to to that on the fly,
  +using temporary files.
  +
  +Fortunately, a normal restore/extract works perfectly with 'out of
  +order' data, because the target files are seekable.
 
  +=Archive format requirements=
  +
  +The basic requirement for such new format is that we can store image
  +date 'out of order'. It is also very likely that we have less than
  +256 drives/images per VM, and we want to be able to store VM
  +configuration files.
  +
  +We have defined a very simply format with those properties, see:
  +
  +docs/specs/vma_spec.txt
  +
  +Please let us know if you know an existing format which provides the
  +same functionality.
 
 Essentially, what you need is an image format. You want to be independent
 from the source image formats, but you're okay with using a specific format
 for the backup (or you wouldn't have defined a new format for it).
 
 The one special thing that you need is storing multiple images in one file.
 There's something like this already in qemu: qcow2 with its internal
 snapshots is basically a flat file system.
 
 Not saying that this is necessarily the best option, but I think reusing 
 existing
 formats and implementation is always a good thing, so it's an idea to
 consider.

AFAIK a qcow2 file cannot store data out of order. In general, a backup fd is
not seekable, and we only want to do sequential writes. Does an image format
always require seekable fds?

Anyway, a qcow2 file is a really complex beast - I am quite unsure if I would
use that for backup even if it is possible.

That would require any 

Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-21 Thread Dietmar Maurer
 Not saying that this is necessarily the best option, but I think reusing 
 existing
 formats and implementation is always a good thing, so it's an idea to
 consider.

Yes, I would really like to reuse something. Our current backup software uses
'tar' files, but that is really inefficient. We also analyzed all other available
archive formats, but none of them is capable of storing sparse files efficiently.

And storing data out of order is beyond the scope of existing formats.





Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-21 Thread Kevin Wolf
Am 21.11.2012 12:10, schrieb Dietmar Maurer:
 +Note: It turned out that taking a qcow2 snapshot can take a very long
 +time on larger files.

 Hm, really? What are larger files? It has always been relatively quick when
 I tested it, though internal snapshots are not my focus, so that need not mean
 much.
 
 300GB or larger
  
 If this is really an important use case for someone, I think qcow2 internal
 snapshots still have some potential for relatively easy performance
 optimisations.
 
 I guess the problem is the small cluster size, so the reference table gets
 quite large (for example fvd uses 2GB to minimize table size).

qemu-img check gives an idea of what it costs to read in the whole
metadata of an image. Updating some of it should mean not more than a
factor of two. I'm seeing much bigger differences, so I suspect there's
something wrong.

Somebody should probably try tracing where the performance is lost.

 But that just as an aside...

 +
 +=Make it more efficient=
 +
 +To be more efficient, we simply need to avoid unnecessary steps. The
 +following steps are always required:
 +
 +1.) read old data before it gets overwritten
 +2.) write that data into the backup archive
 +3.) write new data (VM write)
 +
 +As you can see, this involves only one read, and two writes.
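Those three steps are classic copy-before-write; a minimal C sketch with invented names (buffers stand in for the image and the append-only archive, and 'archived' plays the role of the per-cluster bitmap):

```c
#include <assert.h>
#include <string.h>

#define CLUSTER 4   /* tiny cluster size for the demo */

typedef struct {
    char image[CLUSTER];    /* stands in for the guest-visible image */
    char archive[CLUSTER];  /* stands in for the backup archive */
    int archived;           /* cluster already saved? (per-cluster bitmap) */
} Cluster;

/* Intercepted guest write: save the old data first, then apply the new. */
static void guest_write(Cluster *c, const char *new_data)
{
    if (!c->archived) {
        memcpy(c->archive, c->image, CLUSTER);  /* 1. read old, 2. save it */
        c->archived = 1;                        /* never back it up twice */
    }
    memcpy(c->image, new_data, CLUSTER);        /* 3. apply the VM write */
}
```

The bitmap check is what keeps the backup consistent: the archive always holds the cluster's content as of the moment the backup started.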

 Looks like a nice approach to backup indeed.

 The question is how to fit this into the big picture of qemu's live block
 operations. Much of it looks like an active mirror (which is still to be
 implemented), with the difference that it doesn't write the new, but the old
 data, and that it keeps a bitmap of clusters that should not be mirrored.

 I'm not sure if this means that code should be shared between these two or
 if the differences are too big. However, both of them have things in common
 regarding the design. For example, both have a background part (copying the
 existing data) and an active part (mirroring/backing up data on writes).
 Block jobs are the right tool for the background part.
 
 I already use block jobs. Or do you want to share more?

I was thinking about sharing code between a future active mirror and the
backup job. Which may or may not make sense. I'm mostly hoping for input
from Paolo here.

 The active part is a bit more tricky. You're putting some code into block.c
 to achieve it, which is kind of ugly.
 
 yes. but I tried to keep that small ;-)

Yup, it's already not too bad. I haven't looked into it in much detail,
but I'd like to reduce it even a bit more. In particular, the
backup_info field in the BlockDriverState feels wrong to me. In the long
term the generic block layer shouldn't know at all what a backup is, and
baking it into BDS couples it very tightly.

 We have been talking about block filters
 previously that would provide a generic infrastructure, and at least in the
 mid term the additions to block.c must disappear.
 (Same for block.h and block_int.h - keep things as separated from the core as
 possible.) Maybe we should introduce this infrastructure now.
 
 I have no idea what you are talking about. Can you point me to the relevant
 discussion?

Not sure if a single discussion explains it, and I can't even find one
at the moment.

In short, the idea is that you can stick filters on top of a
BlockDriverState, so that any read/writes (and possibly more requests,
if necessary) are routed through the filter before they are passed to
the block driver of this BDS. Filters would be implemented as
BlockDrivers, i.e. you could implement .bdrv_co_write() in a filter to
intercept all writes to an image.
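A rough sketch of that interception idea, in generic C with invented types (the real qemu BlockDriver/BlockDriverState API differs):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative types only; not the real qemu block layer. */
typedef struct BDS BDS;
typedef int (*write_fn)(BDS *bs, long sector, int n, const void *buf);

struct BDS {
    write_fn write;       /* current top of the filter chain */
    write_fn base_write;  /* the underlying driver's write */
    int intercepted;      /* demo: count writes seen by the filter */
};

/* The underlying driver: pretend all sectors were written. */
static int raw_write(BDS *bs, long sector, int n, const void *buf)
{
    (void)bs; (void)sector; (void)buf;
    return n;
}

/* The filter's write: do extra work (e.g. back up old data), then forward. */
static int filter_write(BDS *bs, long sector, int n, const void *buf)
{
    bs->intercepted++;    /* backup/mirror logic would run here */
    return bs->base_write(bs, sector, n, buf);
}

/* Inserting the filter re-routes all writes through filter_write(). */
static void insert_filter(BDS *bs)
{
    bs->write = filter_write;
}
```

The point is that the generic write path only ever calls bs->write, so a filter can be inserted or removed without the core knowing what it does.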

 Another interesting point is how (or whether) to link block jobs with block
 filters. I think when the job is started, the filter should be inserted
 automatically, and when you cancel it, it should be stopped.
 When you pause the job... no idea. :-)

 Essentially, what you need is an image format. You want to be independent
 from the source image formats, but you're okay with using a specific format
 for the backup (or you wouldn't have defined a new format for it).

 The one special thing that you need is storing multiple images in one file.
 There's something like this already in qemu: qcow2 with its internal
 snapshots is basically a flat file system.

 Not saying that this is necessarily the best option, but I think reusing 
 existing
 formats and implementation is always a good thing, so it's an idea to
 consider.
 
 AFAIK a qcow2 file cannot store data out of order. In general, a backup fd is
 not seekable, and we only want to do sequential writes. Do image formats
 always require seekable fds?

Ah, this is what you mean by out of order. Just out of curiosity, what
are these non-seekable backup fds usually?

In principle even for this qcow2 could be used as an image format,
however the existing implementation wouldn't be of much use for you, so
it loses quite a bit of its attractiveness.

 Anyways, a qcow2 file is a really complex beast - I am quite unsure if I would
 use that for backup if it is possible.

Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-21 Thread Paolo Bonzini
Il 21/11/2012 13:37, Kevin Wolf ha scritto:
  The active part is a bit more tricky. You're putting some code into 
  block.c to
  achieve it, which is kind of ugly. 
  
  yes. but I tried to keep that small ;-)
 Yup, it's already not too bad. I haven't looked into it in much detail,
 but I'd like to reduce it even a bit more. In particular, the
 backup_info field in the BlockDriverState feels wrong to me. In the long
 term the generic block layer shouldn't know at all what a backup is, and
 baking it into BDS couples it very tightly.

My plan was to have something like bs->job->job_type->{before,after}_write.

    int coroutine_fn (*before_write)(BlockDriverState *bs,
            int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
            void **cookie);
    int coroutine_fn (*after_write)(BlockDriverState *bs,
            int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
            void *cookie);


The before_write could optionally return a cookie that is passed back
to the after_write callback.

Actually this was plan B, as a poor man's implementation of the filter
infrastructure.  Plan A was that the block filters would materialize
suddenly in someone's git tree.

Anyway, it should be very easy to convert Dietmar's code to something
like that, and the active mirror could use it as well.
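A minimal sketch of how such hooks could bracket a write (plain C with invented demo hooks; the real signatures would use BlockDriverState, QEMUIOVector and coroutines as above):

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative hook table; stands in for bs->job->job_type. */
typedef struct {
    int (*before_write)(long sector, int n, void **cookie);
    int (*after_write)(long sector, int n, void *cookie);
} JobHooks;

/* Demo before_write: stash the sector in a cookie, e.g. to remember
 * what must be backed up before the write lands. */
static int demo_before(long sector, int n, void **cookie)
{
    long *c = malloc(sizeof(*c));
    (void)n;
    if (!c) {
        return -1;
    }
    *c = sector;
    *cookie = c;
    return 0;
}

/* Demo after_write: check the cookie matches the completed request. */
static int demo_after(long sector, int n, void *cookie)
{
    long saved = *(long *)cookie;
    (void)n;
    free(cookie);
    return saved == sector ? 0 : -1;
}

/* The generic write path brackets the driver write with the hooks. */
static int do_write(const JobHooks *h, long sector, int n)
{
    void *cookie = NULL;
    if (h->before_write && h->before_write(sector, n, &cookie) < 0) {
        return -1;
    }
    /* ... the actual driver write would happen here ... */
    return h->after_write ? h->after_write(sector, n, cookie) : 0;
}
```

The cookie lets before_write pass per-request state to after_write without the generic layer knowing what it contains.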

  AFAIK a qcow2 file cannot store data out of order. In general, a backup fd
  is not seekable, and we only want to do sequential writes. Do image formats
  always require seekable fds?
 Ah, this is what you mean by out of order. Just out of curiosity, what
 are these non-seekable backup fds usually?

Perhaps I've been reading the SCSI standards too much lately, but tapes
come to mind. :)

Paolo



Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-21 Thread Dietmar Maurer
  AFAIK a qcow2 file cannot store data out of order. In general, a backup
  fd is not seekable, and we only want to do sequential writes. Do image formats
 always require seekable fds?
 
 Ah, this is what you mean by out of order. Just out of curiosity, what are
 these non-seekable backup fds usually?

/dev/nst0 ;-)

But there are better examples. Usually you want to use some kind of
compression, and you do that with existing tools:

# backup to stdout|gzip|...

A common usage scenario is to pipe a backup into a restore (copy)

# backup to stdout|ssh to remote host -c 'restore from stdin'

It is also a performance question. Seeks are terribly slow.

 In principle even for this qcow2 could be used as an image format, however
 the existing implementation wouldn't be of much use for you, so it loses
 quite a bit of its attractiveness.
 
  Anyways, a qcow2 file is really complex beast - I am quite unsure if I
  would use that for backup if it is possible.
 
  That would require any external tool to include =5 LOC
 
  The vma reader code is about 700 LOC (quite easy).
 
 So what? qemu-img is already there.

Anyways, you already pointed out that the existing implementation does not work.

But I already expected such a discussion. So maybe it is better if we simply
pipe all data to an external binary? We just need to define a minimal protocol.

In the future, could we produce different archivers as independent/external
binaries?








Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-21 Thread Kevin Wolf
Am 21.11.2012 14:25, schrieb Dietmar Maurer:
 AFAIK a qcow2 file cannot store data out of order. In general, a backup
 fd is not seekable, and we only want to do sequential writes. Do image formats
 always require seekable fds?

 Ah, this is what you mean by out of order. Just out of curiosity, what are
 these non-seekable backup fds usually?
 
 /dev/nst0 ;-)

Sure. :-)

 But there are better examples. Usually you want to use some kind of
 compression, and you do that with existing tools:
 
 # backup to stdout|gzip|...

When you use an image/archive format anyway, you could use a compression
mechanism that it already supports.

 A common usage scenario is to pipe a backup into a restore (copy)
 
 # backup to stdout|ssh to remote host -c 'restore from stdin'

This is a good one. I believe our usual solution would have been to
backup to a NBD server on the remote host instead.

In general I can see that being able to pipe it to other programs could
be nice. I'm not sure if it's an absolute requirement. Would your tools
for taking the backup employ any specific use of pipes?

 It is also a performance question. Seeks are terribly slow.

You wouldn't do it a lot. Only for metadata, and you would only write
out the metadata once the in-memory cache is full.

 In principle even for this qcow2 could be used as an image format, however
 the existing implementation wouldn't be of much use for you, so it loses
 quite a bit of its attractiveness.

 Anyways, a qcow2 file is really complex beast - I am quite unsure if I
 would use that for backup if it is possible.

 That would require any external tool to include =5 LOC

 The vma reader code is about 700 LOC (quite easy).

 So what? qemu-img is already there.
 
 Anyways, you already pointed out that the existing implementation does not 
 work.

I'm still trying to figure out the real requirements to think some more
about it. :-)

 But I already expected such discussion. So maybe it is better we simply pipe 
 all data to an external binary?
 We just need to define a minimal protocol. 
 
 In future we can produce different archivers as independent/external binaries?

You shouldn't look at discussions as a bad thing. We're not trying to
block your changes, but to understand and possibly improve them.

Yes, discussions mean that it takes a bit longer to get things merged,
but they also mean that usually something better is merged in the end
that actually fits well in qemu's design, is maintainable, generic and
so on. Evading the discussions by keeping code externally wouldn't
improve things.

Which doesn't mean that external archivers are completely out of the
question, but I would only consider them if there's a good technical
reason to do so.

So if eventually we come to the conclusion that vma (or for that matter,
anything else in your patches) is the right solution, let's take it. But
first please give us the chance to understand the reasons of why you did
things the way you did them, and to discuss the pros and cons of
alternative solutions.

Kevin



Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)

2012-11-21 Thread Dietmar Maurer
  Ah, this is what you mean by out of order. Just out of curiosity,
  what are these non-seekable backup fds usually?
 
  /dev/nst0 ;-)
 
 Sure. :-)
 
  But there are better examples. Usually you want to use some kind of
  compression, and you do that with existing tools:
 
  # backup to stdout|gzip|...
 
 When you use an image/archive format anyway, you could use a
 compression mechanism that it already supports.

Many archive formats do not support compression internally (tar, cpio, ..).
I also avoided including that in the 'vma' format. So you can use any
external tool.

Some users want to use compress, others bzip2, gzip -1, xz, pgzip, ...
Or maybe pipe into some kind of encryption tool ...

  A common usage scenario is to pipe a backup into a restore (copy)
 
  # backup to stdout|ssh to remote host -c 'restore from stdin'
 
 This is a good one. I believe our usual solution would have been to backup to
 a NBD server on the remote host instead.
 
 In general I can see that being able to pipe it to other programs could be 
 nice.
 I'm not sure if it's an absolute requirement. Would your tools for taking the
 backup employ any specific use of pipes?

Yes, we currently have that functionality, and I do not want to remove features.

  It is also a performance question. Seeks are terribly slow.
 
 You wouldn't do it a lot. Only for metadata, and you would only write out the
 metadata once the in-memory cache is full.

IMHO it is still much better to write sequentially, because that has 'zero'
overhead.

Besides, writing data sequentially is so much easier (on the implementation
side).

The current VMA code also uses checksums and special 'uuid' markers, which
make it possible to find and recover damaged archives. I guess such things
are quite impossible with qcow2, or very hard to do?
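The recovery idea can be sketched as a linear scan that resynchronizes on a per-block marker (the 4-byte magic below is made up for the demo; the actual format would embed something like a per-archive UUID plus a checksum in each block header):

```c
#include <assert.h>
#include <string.h>

/* Illustrative marker; a real archive would use a per-archive UUID. */
#define MAGIC "VMA1"
#define MAGIC_LEN 4

/* Return the offset of the next block marker at or after 'from',
 * or -1 if none is found. After hitting a corrupt block, a reader
 * can call this to skip ahead to the next intact block. */
static long find_next_block(const char *buf, long len, long from)
{
    for (long off = from; off + MAGIC_LEN <= len; off++) {
        if (memcmp(buf + off, MAGIC, MAGIC_LEN) == 0) {
            return off;
        }
    }
    return -1;
}
```

Because every block is self-delimiting, damage to one block costs only that block, not the rest of the archive.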

  In principle even for this qcow2 could be used as an image format,
  however the existing implementation wouldn't be of much use for you,
  so it loses quite a bit of its attractiveness.
 
  Anyways, a qcow2 file is really complex beast - I am quite unsure if
  I would use that for backup if it is possible.
 
  That would require any external tool to include =5 LOC
 
  The vma reader code is about 700 LOC (quite easy).
 
  So what? qemu-img is already there.
 
  Anyways, you already pointed out that the existing implementation does
 not work.
 
 I'm still trying to figure out the real requirements to think some more about
 it. :-)

Every existing archive format I know of works on pipes (without seeks).
Well, that does not really mean anything.

  But I already expected such discussion. So maybe it is better we simply pipe
 all data to an external binary?
  We just need to define a minimal protocol.
 
  In future we can produce different archivers as independent/external
 binaries?
 
 You shouldn't look at discussions as a bad thing. We're not trying to block
 your changes, but to understand and possibly improve them.

I do not consider your comments a 'bad thing' - the above idea was a real
suggestion ;-)

I already have plans to use Content Addressable Storage (instead of 'vma'), so
such a plugin architecture makes it easier to play around with different formats.
 
 Yes, discussions mean that it takes a bit longer to get things merged, but 
 they
 also mean that usually something better is merged in the end that actually
 fits well in qemu's design, is maintainable, generic and so on. Evading the
 discussions by keeping code externally wouldn't improve things.

sure.
 
 Which doesn't mean that external archivers are completely out of the
 question, but I would only consider them if there's a good technical reason to
 do so.

As noted above, I can see rooms for different format. 

1.) 'vma' is my proof of concept, easy to implement and use.
2.) CAS - very useful to sync backup data across datacenters (this
gives us deduplication and kind of 'incremental backups')
3.) support existing archive formats like 'tar' (this is possible if we
use temporary files to store out-of-order data)
4.) backup to some kind of external server
5.) plugins for existing backup tools (bacula, ...)?

 So if eventually we come to the conclusion that vma (or for that matter,
 anything else in your patches) is the right solution, let's take it. But first
 please give us the chance to understand the reasons of why you did things
 the way you did them, and to discuss the pros and cons of alternative
 solutions.

Sure. I was not aware that I wrote something negative in the previous reply - 
sorry for that.