Re: [Qemu-block] [RFC PATCH COLO v2 01/13] docs: block replication's description

2015-04-02 Thread Wen Congyang
On 03/26/2015 02:31 PM, Fam Zheng wrote:
 On Wed, 03/25 17:36, Wen Congyang wrote:
 Signed-off-by: Wen Congyang we...@cn.fujitsu.com
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 Signed-off-by: Yang Hongyang yan...@cn.fujitsu.com
 Signed-off-by: zhanghailiang zhang.zhanghaili...@huawei.com
 Signed-off-by: Gonglei arei.gong...@huawei.com
 ---
  docs/block-replication.txt | 147 
 +
  1 file changed, 147 insertions(+)
  create mode 100644 docs/block-replication.txt

 diff --git a/docs/block-replication.txt b/docs/block-replication.txt
 new file mode 100644
 index 000..874ed8e
 --- /dev/null
 +++ b/docs/block-replication.txt
 @@ -0,0 +1,147 @@
 +Block replication
 +
 +Copyright Fujitsu, Corp. 2015
 +Copyright (c) 2015 Intel Corporation
 +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO.,LTD.
 +
 +This work is licensed under the terms of the GNU GPL, version 2 or later.
 +See the COPYING file in the top-level directory.
 +
 +Block replication is used for continuous checkpoints. It is designed
 +for COLO, where the Secondary VM is running. It can also be applied to
 +FT/HA scenarios where the Secondary VM is not running.
 +
 +This document gives an overview of block replication's design.
 +
 +== Background ==
 +High availability solutions such as micro-checkpointing and COLO take
 +consecutive checkpoints. The VM state of the Primary VM and Secondary VM is
 +identical right after a VM checkpoint, but diverges as the VMs
 +execute until the next checkpoint. To support disk content checkpoints,
 +the modified disk contents in the Secondary VM must be buffered, and are
 +only dropped at the next checkpoint time. To reduce the network transportation
 +effort at the time of checkpoint, the disk modification operations of
 +Primary disk are asynchronously forwarded to the Secondary node.
 +
 +== Workflow ==
 +The following is the image of block replication workflow:
 +
 ++----------------------+            +------------------------+
 +|Primary Write Requests|            |Secondary Write Requests|
 ++----------------------+            +------------------------+
 +          |                                       |
 +          |                                      (4)
 +          |                                       V
 +          |                              /-------------\
 +          |     Copy and Forward        |               |
 +          |---------(1)----------+      |  Disk Buffer  |
 +          |                      |      |               |
 +          |                     (3)      \-------------/
 +          |                 speculative        ^
 +          |                write through      (2)
 +          |                      |             |
 +          V                      V             |
 +   +--------------+        +----------------+
 +   | Primary Disk |        | Secondary Disk |
 +   +--------------+        +----------------+
 +
 +1) Primary write requests will be copied and forwarded to Secondary
 +   QEMU.
 +2) Before Primary write requests are written to Secondary disk, the
 +   original sector content will be read from Secondary disk and
 +   buffered in the Disk buffer, but it will not overwrite the existing
 +   sector content in the Disk buffer.
 
 Could you elaborate a bit about the existing sector content here? IIUC, it
 could be from either Secondary Write Requests or previous COW of Primary
 Write Requests. Is that right?
 
 +3) Primary write requests will be written to Secondary disk.
 +4) Secondary write requests will be buffered in the Disk buffer and it
 +   will overwrite the existing sector content in the buffer.
 +
 +== Architecture ==
 +We are going to implement COLO block replication from many basic
 +blocks that are already in QEMU.
 +
 + virtio-blk                       ||
 +     ^                            ||                        .----------
 +     |                            ||                        | Secondary
 +1 Quorum                          ||                        '----------
 + /      \                         ||
 +/        \                        ||
 +   Primary      2 NBD  ------->  2 NBD
 + disk           client    ||     server             virtio-blk
 +                          ||        ^                   ^
 +--------.                 ||        |                   |
 +Primary |                 ||  Secondary disk <- hidden-disk 4 <- active-disk 3
 +--------'                 ||        |  backing      ^       backing
 +                          ||        |               |
 +                          ||        |               |
 +                          ||        '---------------'
 +
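The buffering rules in steps 2) and 4) above can be sketched as a toy Python model of the described semantics (not QEMU code): primary COW fills the Disk Buffer only for sectors not already buffered, while secondary writes always overwrite the buffered content.

```python
class DiskBuffer:
    """Toy model of the Secondary's Disk Buffer (assumed semantics)."""
    def __init__(self):
        self.buf = {}  # sector -> buffered content

    def primary_write(self, disk, sector, data):
        # Step 2: COW the current secondary content, but never overwrite
        # a sector that is already buffered.
        self.buf.setdefault(sector, disk[sector])
        # Step 3: the forwarded primary write reaches the secondary disk.
        disk[sector] = data

    def secondary_write(self, sector, data):
        # Step 4: secondary writes are buffered and DO overwrite.
        self.buf[sector] = data
```

At checkpoint time the buffer would simply be dropped, since the two sides are identical again.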

Re: [Qemu-block] [Qemu-devel] [PATCH v4 08/20] block: Add bitmap disabled status

2015-04-02 Thread Stefan Hajnoczi
On Fri, Mar 20, 2015 at 03:16:51PM -0400, John Snow wrote:
 Add a status indicating the enabled/disabled state of the bitmap.
 A bitmap is by default enabled, but you can lock the bitmap into
 a read-only state by setting disabled = true.
 
 A previous version of this patch added a QMP interface for changing
 the state of the bitmap, but it has since been removed for now until
 a use case emerges where this state must be revealed to the user.
 
 The disabled state WILL be used internally for bitmap migration and
 bitmap persistence.
 
 Signed-off-by: Fam Zheng f...@redhat.com
 Signed-off-by: John Snow js...@redhat.com
 Reviewed-by: Max Reitz mre...@redhat.com
 ---
  block.c   | 25 +
  include/block/block.h |  3 +++
  2 files changed, 28 insertions(+)

Reviewed-by: Stefan Hajnoczi stefa...@redhat.com




Re: [Qemu-block] [PATCH v4 10/20] qmp: Add support of dirty-bitmap sync mode for drive-backup

2015-04-02 Thread Stefan Hajnoczi
On Fri, Mar 20, 2015 at 03:16:53PM -0400, John Snow wrote:
 +    } else if (job->sync_mode == MIRROR_SYNC_MODE_DIRTY_BITMAP) {
 +        /* Dirty Bitmap sync has a slightly different iteration method */
 +        HBitmapIter hbi;
 +        int64_t sector;
 +        int64_t cluster;
 +        int64_t last_cluster = -1;
 +        bool polyrhythmic;
 +
 +        bdrv_dirty_iter_init(bs, job->sync_bitmap, &hbi);
 +        /* Does the granularity happen to match our backup cluster size? */
 +        polyrhythmic = (bdrv_dirty_bitmap_granularity(job->sync_bitmap) !=
 +                        BACKUP_CLUSTER_SIZE);
 +
 +        /* Find the next dirty /sector/ and copy that /cluster/ */
 +        while ((sector = hbitmap_iter_next(&hbi)) != -1) {
 +            cluster = sector / BACKUP_SECTORS_PER_CLUSTER;
 +
 +            /* Fake progress updates for any clusters we skipped,
 +             * excluding this current cluster. */
 +            if (cluster != last_cluster + 1) {
 +                job->common.offset += ((cluster - last_cluster - 1) *
 +                                       BACKUP_CLUSTER_SIZE);
 +            }
 +
 +            if (yield_and_check(job)) {
 +                goto leave;
 +            }
 +
 +            do {
 +                ret = backup_do_cow(bs, cluster * BACKUP_SECTORS_PER_CLUSTER,
 +                                    BACKUP_SECTORS_PER_CLUSTER, &error_is_read);
 +                if ((ret < 0) &&
 +                    backup_error_action(job, error_is_read, -ret) ==
 +                    BLOCK_ERROR_ACTION_REPORT) {
 +                    goto leave;
 +                }
 +            } while (ret < 0);
 +
 +            /* Advance (or rewind) our iterator if we need to. */
 +            if (polyrhythmic) {
 +                bdrv_set_dirty_iter(&hbi,
 +                                    (cluster + 1) * BACKUP_SECTORS_PER_CLUSTER);
 +            }
 +
 +            last_cluster = cluster;
 +        }

What happens when the dirty bitmap granularity is larger than
BACKUP_SECTORS_PER_CLUSTER?

|------------bitmap granularity------------|
|---backup cluster---|
                      ~~~~~~~~~~~~~~~~~~~~~
Will these sectors ever be copied?

I think this case causes an infinite loop:

  cluster = hbitmap_iter_next(&hbi) / BACKUP_SECTORS_PER_CLUSTER

The iterator is reset:

  bdrv_set_dirty_iter(&hbi, (cluster + 1) * BACKUP_SECTORS_PER_CLUSTER);

So we get the same cluster every time and never advance?
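The suspected loop can be demonstrated with a toy model (the numbers are illustrative, not QEMU's actual constants): one dirty bit covers a granule of sectors larger than a backup cluster, so seeking the iterator to the next cluster lands inside the same granule and it yields the same sector forever.

```python
GRANULARITY_SECTORS = 256          # one dirty bit covers this many sectors
BACKUP_SECTORS_PER_CLUSTER = 64    # smaller than the granule

def iter_next(dirty_granules, pos):
    """First dirty sector at/after pos, reported granule-aligned, or -1."""
    for g in sorted(dirty_granules):
        if (g + 1) * GRANULARITY_SECTORS > pos:
            return g * GRANULARITY_SECTORS
    return -1

dirty = {0}                        # granule 0 (sectors 0..255) is dirty
pos = 0
seen = []
for _ in range(4):                 # bounded loop standing in for "forever"
    sector = iter_next(dirty, pos)
    cluster = sector // BACKUP_SECTORS_PER_CLUSTER
    seen.append(cluster)
    pos = (cluster + 1) * BACKUP_SECTORS_PER_CLUSTER   # the iterator reset
# every iteration re-finds cluster 0: the loop never advances
```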




Re: [Qemu-block] block-commit dropping privs

2015-04-02 Thread Eric Blake
On 04/02/2015 06:04 AM, Michael Tokarev wrote:
 02.04.2015 14:24, Kevin Wolf wrote:
 []
 But overall, I think qemu-system should not modify backing
 file name in this case.

 So you would leave the backing file with the data that you just
 committed down one level in your backing file chain? Wouldn't that
 defeat the whole purpose of committing?

No, because there are multiple reasons for committing.  And we recently
updated qemu-img to use QMP for multiple styles of commit:

1: merge data from the active layer to the immediate backing, but keep
the active layer intact (by freeing its sectors). Useful for saying I
want to reset my point-in-time backup to now, but continue tracking the
delta since the point-in-time.
 starting with: A <- B <- C, with B created at point1, C created at
point2, and right now being point3. B contains point1..point2, C
contains point2..point3+
 qemu-img commit C
 ending with: A <- B <- C, B contains point1..point3, C contains
point3+; all three files still valid

2: merge data from the active layer to the immediate backing, and
discard the active layer (no need to waste time freeing its sectors).
Useful for saying I no longer need my point-in-time backup, revert to
the smaller chain
 starting with: A <- B <- C (as before)
 qemu-img commit -d C
 ending with: A <- B, B contains point1..point3+; C invalid the moment
something is written to B

3: merge data from the active layer to a specified backing (typically
more than one file) to really clean up a chain. Useful for saying I
want to keep current state but discard all intermediate points in time.
 When crossing multiple backing files, you WILL cause corruption to the
intermediate files.  Therefore, this mode implies -d, and there is no
point wasting time in clearing out sectors in the discarded part of the
chain.
 starting with: A <- B <- C (as before)
 qemu-img commit -b A C
 ending with: A; A contains point3+, B is immediately invalid, C is
invalid the moment something else is written to A

4: offline-only: note that 'qemu-img commit' can ONLY commit the changes
of the filename argument, while QMP can commit any arbitrary point in a
chain (note that there is a difference between active commit and
intermediate commit). But qemu-img can be used to do intermediate
commit, by virtue of the fact that since the image is not in active use,
you can specify any intermediate file in the chain as the point to
commit, coupled with rewriting the backing file of the rest of the chain
to match. Requires two qemu-img commands, so might be worth adding sugar
to do it in one command the way QMP can do, if desired.
 starting with: A <- B <- C (as before)
 qemu-img commit -d B
 qemu-img rebase -u A C
 ending with: A <- C; A contains point2, C contains point2+
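The four styles can be modelled on a chain listed base-to-active (a sketch of the semantics described above, not qemu-img itself; mode 4 folds in the follow-up rebase):

```python
def commit(chain, top, base=None, drop=False):
    """Merge data from 'top' down to 'base' (default: immediate backing).

    Returns the resulting valid chain.  Images between base and top are
    invalid afterwards, so they are removed unless top is kept (mode 1).
    """
    i = chain.index(top)
    j = chain.index(base) if base is not None else i - 1
    if drop or base is not None:
        # modes 2-4: the committed images leave the chain
        return chain[:j + 1] + chain[i + 1:]
    return list(chain)  # mode 1: top stays, but is now empty
```

For example, `commit(['A', 'B', 'C'], 'C', drop=True)` models `qemu-img commit -d C`, and `commit(['A', 'B', 'C'], 'B', drop=True)` models the two-command offline sequence of mode 4.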

[Proof that committing to more than the immediate backing file was in
the thread on converting qemu-img to use QMP; although I'm somewhat
repeating that argument down below]

 
 Um.  I don't think we understood each other.
 
 I experimented with the non-live HMP commit command.  This
 one effectively empties everything in the overlay file,
 propagating it to the backing file, but keeps the (now
 empty) overlay.  So from the stacking perspective nothing
 has changed.  Yet, together with the propagation, it also
 modifies the overlay file headers and writes a new name
 of the backing file -- the one it currently uses, which,
 in my case, is virtual /dev/fdset/foo.  It should keep
 the original file name in there, such as win.raw, unless
 explicitly asked to write a different name there.

Bug #1 - HMP doesn't use QMP commit - and therefore it can only commit
the active layer. I argued above that qemu-img has the same limitation,
except that qemu-img can be used anywhere in the chain by using multiple
qemu-img commands

Bug #2 - HMP changes the backing file when it shouldn't (active commit
should never alter backing information if the active file doesn't change)

 
 If the stack chain were to be modified by commit command,
 yes, the new name should be recorded of course, such as
 after rebase.  But since stack chain is not modified,
 filename should not be modified either.
 
 When performing commit, does qemu mark the areas in the
 overlay file as free after writing contents to the backing
 file, or will these areas be written again by a subsequent
 commit?  Somehow it smells like each next commit writes
 more and more data and completes in more and more time.

 With qcow2 and qcow, the committed data is discarded with HMP 'commit'.
 Other image formats keep the copy.
 
 Hm.  It is discarded, but the file isn't shrinked.  With non-live
 commit I don't see a reason why it can't be shrinked too?

Oh, you pointed out another difference.  I was going to say that QMP
active commit followed by aborting the transaction keeps the active
layer unchanged, but QMP does not yet expose a way to empty the active
layer.  That is, starting with these file contents (where '-' implies
deferring to the backing file):
B 
C 

Re: [Qemu-block] [PATCH v4 11/20] qmp: add block-dirty-bitmap-clear

2015-04-02 Thread Stefan Hajnoczi
On Fri, Mar 20, 2015 at 03:16:54PM -0400, John Snow wrote:
 Add bdrv_clear_dirty_bitmap and a matching QMP command,
 qmp_block_dirty_bitmap_clear that enables a user to reset
 the bitmap attached to a drive.
 
 This allows us to reset a bitmap in the event of a full
 drive backup.
 
 Signed-off-by: John Snow js...@redhat.com
 Reviewed-by: Max Reitz mre...@redhat.com
 ---
  block.c   |  8 
  blockdev.c| 34 ++
  include/block/block.h |  1 +
  qapi/block-core.json  | 14 ++
  qmp-commands.hx   | 29 +
  5 files changed, 86 insertions(+)

Reviewed-by: Stefan Hajnoczi stefa...@redhat.com




Re: [Qemu-block] [PATCH v4 14/20] block: Ensure consistent bitmap function prototypes

2015-04-02 Thread Stefan Hajnoczi
On Fri, Mar 20, 2015 at 03:16:57PM -0400, John Snow wrote:
 We often don't need the BlockDriverState for functions
 that operate on bitmaps. Remove it.
 
 Signed-off-by: John Snow js...@redhat.com
 Reviewed-by: Max Reitz mre...@redhat.com
 ---
  block.c   | 13 ++---
  block/backup.c|  2 +-
  block/mirror.c| 26 ++
  blockdev.c|  2 +-
  include/block/block.h | 11 +--
  migration/block.c |  7 +++
  6 files changed, 26 insertions(+), 35 deletions(-)

Reviewed-by: Stefan Hajnoczi stefa...@redhat.com




Re: [Qemu-block] [Qemu-devel] [PATCH v4 12/20] qmp: Add dirty bitmap status field in query-block

2015-04-02 Thread Stefan Hajnoczi
On Fri, Mar 20, 2015 at 03:16:55PM -0400, John Snow wrote:
 Add the frozen status booleans, to inform clients
 when a bitmap is occupied doing a task.
 
 Signed-off-by: Fam Zheng f...@redhat.com
 Signed-off-by: John Snow js...@redhat.com
 Reviewed-by: Max Reitz mre...@redhat.com
 ---
  block.c  | 1 +
  qapi/block-core.json | 5 -
  2 files changed, 5 insertions(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi stefa...@redhat.com




Re: [Qemu-block] [PATCH v4 07/20] hbitmap: add hbitmap_merge

2015-04-02 Thread Stefan Hajnoczi
On Fri, Mar 20, 2015 at 03:16:50PM -0400, John Snow wrote:
 We add a bitmap merge operation to assist in error cases
 where we wish to combine two bitmaps together.
 
 This is algorithmically O(bits) provided HBITMAP_LEVELS remains
 constant. For a full bitmap on a 64bit machine:
 sum(bits/64^k, k, 0, HBITMAP_LEVELS) ~= 1.01587 * bits
 
 We may be able to improve running speed for particularly sparse
 bitmaps by using iterators, but the running time for dense maps
 will be worse.
 
 We present the simpler solution first, and we can refine it later
 if needed.
 
 Signed-off-by: John Snow js...@redhat.com
 Reviewed-by: Max Reitz mre...@redhat.com
 ---
  include/qemu/hbitmap.h | 11 +++
  util/hbitmap.c | 29 +
  2 files changed, 40 insertions(+)

Reviewed-by: Stefan Hajnoczi stefa...@redhat.com
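The merge described in the commit message amounts to OR-ing the two bitmaps word by word at every level of the multi-level structure; a rough Python model (assuming equal size and granularity), whose cost is sum(bits/64^k for k in 0..levels) ≈ 1.016 * bits with 64-bit words:

```python
def hbitmap_merge(a_levels, b_levels):
    """Merge b into a.  Each argument is a list of word-lists, one per
    level of the hierarchical bitmap, coarsest level first."""
    assert len(a_levels) == len(b_levels)
    for a_words, b_words in zip(a_levels, b_levels):
        assert len(a_words) == len(b_words)
        for i, w in enumerate(b_words):
            a_words[i] |= w   # OR preserves every set bit from both maps
    return a_levels
```

An iterator-based merge could skip empty regions of a sparse source, but as the mail notes, the dense case would then be slower.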




Re: [Qemu-block] [PATCH v4 13/20] block: add BdrvDirtyBitmap documentation

2015-04-02 Thread Stefan Hajnoczi
On Fri, Mar 20, 2015 at 03:16:56PM -0400, John Snow wrote:
 Signed-off-by: John Snow js...@redhat.com
 Reviewed-by: Max Reitz mre...@redhat.com
 ---
  block.c | 10 +-
  1 file changed, 5 insertions(+), 5 deletions(-)

Reviewed-by: Stefan Hajnoczi stefa...@redhat.com




Re: [Qemu-block] block-commit dropping privs

2015-04-02 Thread Kevin Wolf
Am 02.04.2015 um 12:58 hat Michael Tokarev geschrieben:
 01.04.2015 15:34, Kevin Wolf wrote:
 []
  Overriding the backing file should work like this:
  
  -drive file=...,backing.file.filename=/dev/fdset/2
 
 Oh-ok, this works.  Sort of.
 
 Because after performing commit (is there a difference between
 commit hmp command and block-commit qmp command?  I used the
 former so far),

Yes, two completely different implementations. HMP 'commit' is a
non-live version that effectively pauses the VM while it runs.

Apparently the live version (which is QMP 'block-commit') isn't
available in HMP. Jeff, do you remember why?

qemu-img commit reuses the QMP code today.

 the new backing file name is recorded to the
 qcow2 file header, and now, qemu-img can't operate on this
 file anymore, telling me it can't access backing file due
 to bad file descriptor.

Right. That's why block-commit has an optional 'backing-file' argument
that contains the new backing file string to be written to the file.

 So in order to finally commit the file I used qemu-system
 again, running it like that specifying backing file, but
 with -S option, and issued `commit' command again.
 
 I noticed that I can't specify backing file for
 qemu-img commit.
 
 But overall, I think qemu-system should not modify backing
 file name in this case.

So you would leave the backing file with the data that you just
committed down one level in your backing file chain? Wouldn't that
defeat the whole purpose of committing?

 When performing commit, does qemu mark the areas in the
 overlay file as free after writing contents to the backing
 file, or will these areas be written again by a subsequent
 commit?  Somehow it smells like each next commit writes
 more and more data and completes in more and more time.

With qcow2 and qcow, the committed data is discarded with HMP 'commit'.
Other image formats keep the copy.

Live commit always keeps the data in the parent image, but it drops that
image from the backing file chain, so effectively it is dropped as well
(but you need to remove the file manually to reclaim the space).

Jeff, this sounds like we need to make some changes to unify this. That
might mean introducing an option that decides whether an image should be
dropped from the chain or emptied. Once live commit can provide the same
as HMP commit, we should change HMP commit to reuse the live commit
code.

Kevin



Re: [Qemu-block] [Qemu-devel] block-commit dropping privs

2015-04-02 Thread Jeff Cody
On Thu, Apr 02, 2015 at 01:24:02PM +0200, Kevin Wolf wrote:
 Am 02.04.2015 um 12:58 hat Michael Tokarev geschrieben:
  01.04.2015 15:34, Kevin Wolf wrote:
  []
   Overriding the backing file should work like this:
   
   -drive file=...,backing.file.filename=/dev/fdset/2
  
  Oh-ok, this works.  Sort of.
  
  Because after performing commit (is there a difference between
  commit hmp command and block-commit qmp command?  I used the
  former so far),
 
 Yes, two completely different implementations. HMP 'commit' is a
 non-live version that effectively pauses the VM while it runs.
 
 Apparently the live version (which is QMP 'block-commit') isn't
 available in HMP. Jeff, do you remember why?
 
 qemu-img commit reuses the QMP code today.


IIRC, it was because the live commit was, by definition, live.  And
the original HMP commit operated on the active layer (and the active
layer only), and the live commit originally did not.  So the HMP
commit provided functionality that was not in the QMP block-commit.

  the new backing file name is recorded to the
  qcow2 file header, and now, qemu-img can't operate on this
  file anymore, telling me it can't access backing file due
  to bad file descriptor.
 
 Right. That's why block-commit has an optional 'backing-file' argument
 that contains the new backing file string to be written to the file.
 
  So in order to finally commit the file I used qemu-system
  again, running it like that specifying backing file, but
  with -S option, and issued `commit' command again.
  
  I noticed that I can't specify backing file for
  qemu-img commit.
  
  But overall, I think qemu-system should not modify backing
  file name in this case.
 
 So you would leave the backing file with the data that you just
 committed down one level in your backing file chain? Wouldn't that
 defeat the whole purpose of committing?
 
  When performing commit, does qemu mark the areas in the
  overlay file as free after writing contents to the backing
  file, or will these areas be written again by a subsequent
  commit?  Somehow it smells like each next commit writes
  more and more data and completes in more and more time.
 
 With qcow2 and qcow, the committed data is discarded with HMP 'commit'.
 Other image formats keep the copy.
 
 Live commit always keeps the data in the parent image, but it drops that
 image from the backing file chain, so effectively it is dropped as well
 (but you need to remove the file manually to reclaim the space).
 
 Jeff, this sounds like we need to make some changes to unify this. That
 might mean introducing an option that decides whether an image should be
 dropped from the chain or emptied.

Yes, I think this would be a good idea.  I can think of scenarios
where we would want to keep the original overlay around (i.e., we
still want a snapshot overlay, but just want to consolidate data).
But I can't think of any reason we would want to keep a stale
populated overlay file around.

 Once live commit can provide the same
 as HMP commit, we should change HMP commit to reuse the live commit
 code.
 

Hmm. My concern here is there may be times we want the behavior that
HMP commit provides - faster offline active layer commit, rather than
a mirror-like operation.  I guess if we do want that behavior,
however, we could always introduce it as an option to block-commit.
So yes, let's provide an option to remove or empty the committed
overlays (applies to all overlays, if there are multiple overlays
being committed in the chain), and then have HMP commit reuse the live
commit code.




Re: [Qemu-block] [RFC PATCH COLO v2 01/13] docs: block replication's description

2015-04-02 Thread Fam Zheng
On Fri, 04/03 10:35, Wen Congyang wrote:
 On 03/26/2015 02:31 PM, Fam Zheng wrote:
  On Wed, 03/25 17:36, Wen Congyang wrote:
  Signed-off-by: Wen Congyang we...@cn.fujitsu.com
  Signed-off-by: Paolo Bonzini pbonz...@redhat.com
  Signed-off-by: Yang Hongyang yan...@cn.fujitsu.com
  Signed-off-by: zhanghailiang zhang.zhanghaili...@huawei.com
  Signed-off-by: Gonglei arei.gong...@huawei.com
  ---
   docs/block-replication.txt | 147 
  +
   1 file changed, 147 insertions(+)
   create mode 100644 docs/block-replication.txt
 
  diff --git a/docs/block-replication.txt b/docs/block-replication.txt
  new file mode 100644
  index 000..874ed8e
  --- /dev/null
  +++ b/docs/block-replication.txt
  @@ -0,0 +1,147 @@
  +Block replication
  +
  +Copyright Fujitsu, Corp. 2015
  +Copyright (c) 2015 Intel Corporation
  +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO.,LTD.
  +
  +This work is licensed under the terms of the GNU GPL, version 2 or later.
  +See the COPYING file in the top-level directory.
  +
  +The block replication is used for continuous checkpoints. It is designed
  +for COLO that Secondary VM is running. It can also be applied for FT/HA
  +scene that Secondary VM is not running.
  +
  +This document gives an overview of block replication's design.
  +
  +== Background ==
  +High availability solutions such as micro checkpoint and COLO will do
  +consecutive checkpoint. The VM state of Primary VM and Secondary VM is
  +identical right after a VM checkpoint, but becomes different as the VM
  +executes till the next checkpoint. To support disk contents checkpoint,
  +the modified disk contents in the Secondary VM must be buffered, and are
  +only dropped at next checkpoint time. To reduce the network transportation
  +effort at the time of checkpoint, the disk modification operations of
  +Primary disk are asynchronously forwarded to the Secondary node.
  +
  +== Workflow ==
  +The following is the image of block replication workflow:
  +
  ++----------------------+            +------------------------+
  +|Primary Write Requests|            |Secondary Write Requests|
  ++----------------------+            +------------------------+
  +          |                                       |
  +          |                                      (4)
  +          |                                       V
  +          |                              /-------------\
  +          |     Copy and Forward        |               |
  +          |---------(1)----------+      |  Disk Buffer  |
  +          |                      |      |               |
  +          |                     (3)      \-------------/
  +          |                 speculative        ^
  +          |                write through      (2)
  +          |                      |             |
  +          V                      V             |
  +   +--------------+        +----------------+
  +   | Primary Disk |        | Secondary Disk |
  +   +--------------+        +----------------+
  +
  +1) Primary write requests will be copied and forwarded to Secondary
  +   QEMU.
  +2) Before Primary write requests are written to Secondary disk, the
  +   original sector content will be read from Secondary disk and
  +   buffered in the Disk buffer, but it will not overwrite the existing
  +   sector content in the Disk buffer.
  
  Could you elaborate a bit about the existing sector content here? IIUC, it
  could be from either Secondary Write Requests or previous COW of Primary
  Write Requests. Is that right?
  
  +3) Primary write requests will be written to Secondary disk.
  +4) Secondary write requests will be buffered in the Disk buffer and it
  +   will overwrite the existing sector content in the buffer.
  +
  +== Architecture ==
  +We are going to implement COLO block replication from many basic
  +blocks that are already in QEMU.
  +
  + virtio-blk                       ||
  +     ^                            ||                        .----------
  +     |                            ||                        | Secondary
  +1 Quorum                          ||                        '----------
  + /      \                         ||
  +/        \                        ||
  +   Primary      2 NBD  ------->  2 NBD
  + disk           client    ||     server             virtio-blk
  +                          ||        ^                   ^
  +--------.                 ||        |                   |
  +Primary |                 ||  Secondary disk <- hidden-disk 4 <- active-disk 3
  +--------'                 ||        |  backing      ^       backing
  +                          ||        |

Re: [Qemu-block] [PATCH v4 15/20] block: Resize bitmaps on bdrv_truncate

2015-04-02 Thread John Snow



On 04/02/2015 09:37 AM, Stefan Hajnoczi wrote:

On Fri, Mar 20, 2015 at 03:16:58PM -0400, John Snow wrote:

+void hbitmap_truncate(HBitmap *hb, uint64_t size)
+{
+    bool shrink;
+    unsigned i;
+    uint64_t num_elements = size;
+    uint64_t old;
+
+    /* Size comes in as logical elements, adjust for granularity. */
+    size = (size + (1ULL << hb->granularity) - 1) >> hb->granularity;
+    assert(size <= ((uint64_t)1 << HBITMAP_LOG_MAX_SIZE));
+    shrink = size < hb->size;
+
+    /* bit sizes are identical; nothing to do. */
+    if (size == hb->size) {
+        return;
+    }
+
+    /* If we're losing bits, let's clear those bits before we invalidate all of
+     * our invariants. This helps keep the bitcount consistent, and will prevent
+     * us from carrying around garbage bits beyond the end of the map.
+     *
+     * Because clearing bits past the end of map might reset bits we care about
+     * within the array, record the current value of the last bit we're keeping.
+     */
+    if (shrink) {
+        bool set = hbitmap_get(hb, num_elements - 1);
+        uint64_t fix_count = (hb->size << hb->granularity) - num_elements;
+
+        assert(fix_count);
+        hbitmap_reset(hb, num_elements, fix_count);
+        if (set) {
+            hbitmap_set(hb, num_elements - 1, 1);
+        }


Why is it necessary to set the last bit (if it was set)?  The comment
isn't clear to me.


Why is it necessary to set the last bit (if it was set)?  The comment
isn't clear to me.



Sure. The granularity of the bitmap provides us with virtual bit groups:
for a granularity of, say, g=2, we have 2^2 virtual bits per every real bit:

101 in memory is treated, virtually, as 1111 0000 1111

The get/set calls operate on virtual bits, not concrete ones, so if we
were to reset virtual bits 2-11:

11|1100001111  (everything after the first two virtual bits)

we'd set the real bits to '000', because we clear or set the entire
virtual group.


This is probably not what we really want, so as a shortcut I just read
and then re-set the final bit.


It is programmatically avoidable (are we truncating into a granularity
group?) but in the case that we are, I'd need to read/reset the bit
anyway, so it seemed fine to just unconditionally apply the fix.
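A small model of these semantics (assumed behaviour, not the hbitmap code): each real bit covers 2^g virtual bits, and resetting a range clears every real bit whose group it touches, even partially, which is why the last kept bit must be read before the reset and re-set afterwards.

```python
G = 2  # granularity: 2**G virtual bits per real bit

def vget(bits, pos):
    """Read the virtual bit at pos (whole group shares one value)."""
    return bits[pos >> G]

def vreset(bits, start, count):
    """Clear every real bit whose virtual range overlaps [start, start+count)."""
    for i in range(start >> G, ((start + count - 1) >> G) + 1):
        bits[i] = 0

def vset(bits, start, count):
    """Set every real bit whose virtual range overlaps [start, start+count)."""
    for i in range(start >> G, ((start + count - 1) >> G) + 1):
        bits[i] = 1
```

Resetting virtual bits 2-11 of `[1, 0, 1]` (virtually 1111 0000 1111) wipes real bit 0 too; re-setting virtual bit 1 restores the group, mirroring the shortcut in hbitmap_truncate().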




[Qemu-block] [RFC] Intermediate block mirroring

2015-04-02 Thread Alberto Garcia
Hi,

I'm interested in adding the possibility to mirror an intermediate
node in a disk image chain, but I would like to have some feedback
before sending any patches.

The goal would be to convert this:

   [A] <- [B] <- [C] <- [D]

into this:

   [A] <- [B] <- [X] <- [D]

where [D] is the active image and [X] would be a copy of [C]. The
latter would be unlinked from the chain.

A use case would be to move disk images across different storage
backends.

My idea is to extend the drive-mirror command. Similar to what we
discussed in the case of the intermediate block streaming, I can reuse
the 'device' parameter to refer to a node name. So the API doesn't
need any changes other than the extended semantics for this parameter.

One difference with the current functionality is that once the block
job is completed, the node above the mirrored one would have to change
its backing image to point to the new one. One solution is to iterate
over all devices (bdrv_next()) and check which ones are connected
directly or indirectly to the mirrored node (bdrv_find_overlay()).

drive-mirror has three different sync modes: top, full and none. This
would be the chain from the example using each one of these modes:

  top:

 [A] <- [B] <- [X] <- [D]

  full:

 [X] <- [D]

  none:

 [A] <- [B] <- [C] <- [X] <- [D]

My understanding is that in the 'sync=full' case, [A] and [B] would
also need to be blocked during the operation since they are going to
disappear from the chain.
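The three sync modes above can be modelled as chain transformations (a sketch of the proposal's semantics; the helper name and the use of C' for the target X are made up for illustration):

```python
def mirror_intermediate(chain, node, sync):
    """chain: base-to-active list, e.g. ['A', 'B', 'C', 'D']."""
    i = chain.index(node)
    new = node + "'"  # stand-in for the mirror target X
    if sync == 'top':    # copy only the node's own data; keep its backing
        return chain[:i] + [new] + chain[i + 1:]
    if sync == 'full':   # copy the node plus everything below it
        return [new] + chain[i + 1:]
    if sync == 'none':   # start an empty overlay on top of the node
        return chain[:i + 1] + [new] + chain[i + 1:]
```

In the 'full' case A and B drop out of the result, which is why they would need to be blocked while the job runs.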

I have some code and in principle everything seems to be working fine,
but I'd like to test it a bit more.

What's anyway your opinion about this proposal?

Thanks,

Berto



Re: [Qemu-block] [PATCH v4 17/20] iotests: add invalid input incremental backup tests

2015-04-02 Thread Stefan Hajnoczi
On Fri, Mar 20, 2015 at 03:17:00PM -0400, John Snow wrote:
 Signed-off-by: John Snow js...@redhat.com
 ---
  tests/qemu-iotests/124 | 104 
 +
  tests/qemu-iotests/124.out |   5 +++
  tests/qemu-iotests/group   |   1 +
  3 files changed, 110 insertions(+)
  create mode 100644 tests/qemu-iotests/124
  create mode 100644 tests/qemu-iotests/124.out

Reviewed-by: Stefan Hajnoczi stefa...@redhat.com




Re: [Qemu-block] [PATCH v4 16/20] hbitmap: truncate tests

2015-04-02 Thread Stefan Hajnoczi
On Fri, Mar 20, 2015 at 03:16:59PM -0400, John Snow wrote:
 The general approach is to set bits close to the boundaries of
 where we are truncating and ensure that everything appears to
 have gone OK.
 
 We test growing and shrinking by different amounts:
 - Less than the granularity
 - Less than the granularity, but across a boundary
 - Less than sizeof(unsigned long)
 - Less than sizeof(unsigned long), but across a ulong boundary
 - More than sizeof(unsigned long)
 
 Signed-off-by: John Snow js...@redhat.com
 Reviewed-by: Max Reitz mre...@redhat.com
 ---
  tests/test-hbitmap.c | 255 
 +++
  1 file changed, 255 insertions(+)

Reviewed-by: Stefan Hajnoczi stefa...@redhat.com




Re: [Qemu-block] [Qemu-devel] [PATCH v4 19/20] iotests: add simple incremental backup case

2015-04-02 Thread Stefan Hajnoczi
On Fri, Mar 20, 2015 at 03:17:02PM -0400, John Snow wrote:
 Signed-off-by: John Snow js...@redhat.com
 ---
  tests/qemu-iotests/124 | 153 
 +
  tests/qemu-iotests/124.out |   4 +-
  2 files changed, 155 insertions(+), 2 deletions(-)
 
 diff --git a/tests/qemu-iotests/124 b/tests/qemu-iotests/124
 index 85675ec..ce2cda7 100644
 --- a/tests/qemu-iotests/124
 +++ b/tests/qemu-iotests/124
 @@ -28,6 +28,42 @@ def io_write_patterns(img, patterns):
  for pattern in patterns:
  iotests.qemu_io('-c', 'write -P%s %s %s' % pattern, img)
  
 +class Bitmap:
 +def __init__(self, name, drive):
 +self.name = name
 +self.drive = drive
 +self.pattern = os.path.join(iotests.test_dir.replace('%', '%%'),
 +'%s.%s.backup.%%i.img' % (drive['id'],
 +  name))

drive['id'] and name aren't escaped.

It is simpler and safer to format the string from scratch in
new_target() and drop the Bitmap.pattern field:

  def new_target(self, num=None):
  ...
  filename = '%s.%s.backup.%i.img' % (self.drive['id'],
  self.name,
  num)
  target = os.path.join(iotests.test_dir, filename)

 +self.num = 0
 +self.backups = list()
 +
 +def base_target(self):
 +return self.drive['backup']
 +
 +def new_target(self, num=None):
 +if num is None:
 +num = self.num
 +self.num = num + 1
 +target = self.pattern % num
 +self.backups.append(target)
 +return target
 +
 +def last_target(self):
 +if self.backups:
 +return self.backups[-1]
 +return self.base_target()
 +
 +def del_target(self):
 +os.remove(self.backups.pop())
 +self.num -= 1
 +
 +def __del__(self):
 +for backup in self.backups:
 +try:
 +os.remove(backup)
 +except OSError:
 +pass

From the language reference:

  It is not guaranteed that __del__() methods are called for objects
  that still exist when the interpreter exits.

https://docs.python.org/2.7/reference/datamodel.html#object.__del__

Relying on reference counts is risky.

The TestCase.tearDown() method is the place to clean up:
https://docs.python.org/2.7/library/unittest.html#unittest.TestCase.tearDown

It could iterate over self.bitmaps[] and call a cleanup() function.
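A tearDown()-based cleanup could look roughly like this (`Bitmap.cleanup()` and `self.bitmaps` are illustrative names, not from the patch):

```python
import os
import unittest

class Bitmap(object):
    def __init__(self):
        self.backups = []

    def cleanup(self):
        # Remove every backup file this bitmap created; tolerate files
        # a test already deleted itself.
        for backup in self.backups:
            try:
                os.remove(backup)
            except OSError:
                pass
        self.backups = []

class TestIncrementalBackup(unittest.TestCase):
    def setUp(self):
        self.bitmaps = []

    def tearDown(self):
        # Unlike __del__, tearDown() runs deterministically after every
        # test method, pass or fail, while the interpreter is healthy.
        for bitmap in self.bitmaps:
            bitmap.cleanup()
```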

  
  class TestIncrementalBackup(iotests.QMPTestCase):
  def setUp(self):
 @@ -73,6 +109,123 @@ class TestIncrementalBackup(iotests.QMPTestCase):
  iotests.qemu_img('create', '-f', fmt, img, size)
  self.files.append(img)
  
 +
 +def create_full_backup(self, drive=None):
 +if drive is None:
 +drive = self.drives[-1]
 +
 +res = self.vm.qmp('drive-backup', device=drive['id'],
 +  sync='full', format=drive['fmt'],
 +  target=drive['backup'])
 +self.assert_qmp(res, 'return', {})
 +self.wait_until_completed(drive['id'])
 +self.check_full_backup(drive)
 +self.files.append(drive['backup'])
 +return drive['backup']
 +
 +
 +def check_full_backup(self, drive=None):
 +if drive is None:
 +drive = self.drives[-1]
 +self.assertTrue(iotests.compare_images(drive['file'], drive['backup']))

I think QEMU still has at least drive['file'] open?  It's not safe to
access the file from another program while it is open.

The simplest solution is to terminate the VM before calling
iotests.compare_images().




Re: [Qemu-block] [PATCH v4 20/20] iotests: add incremental backup failure recovery test

2015-04-02 Thread Stefan Hajnoczi
On Fri, Mar 20, 2015 at 03:17:03PM -0400, John Snow wrote:
 Test the failure case for incremental backups.
 
 Signed-off-by: John Snow js...@redhat.com
 ---
  blockdev.c |  1 -
  tests/qemu-iotests/124 | 55 ++++++++++++++++++++++++++++++++++++++++++++++
  tests/qemu-iotests/124.out |  4 ++--
  3 files changed, 57 insertions(+), 3 deletions(-)

Reviewed-by: Stefan Hajnoczi stefa...@redhat.com




Re: [Qemu-block] [PATCH v4 15/20] block: Resize bitmaps on bdrv_truncate

2015-04-02 Thread Stefan Hajnoczi
On Fri, Mar 20, 2015 at 03:16:58PM -0400, John Snow wrote:
 +void hbitmap_truncate(HBitmap *hb, uint64_t size)
 +{
 +bool shrink;
 +unsigned i;
 +uint64_t num_elements = size;
 +uint64_t old;
 +
 +/* Size comes in as logical elements, adjust for granularity. */
 +size = (size + (1ULL << hb->granularity) - 1) >> hb->granularity;
 +assert(size <= ((uint64_t)1 << HBITMAP_LOG_MAX_SIZE));
 +shrink = size < hb->size;
 +
 +/* bit sizes are identical; nothing to do. */
 +if (size == hb->size) {
 +return;
 +}
 +
 +/* If we're losing bits, let's clear those bits before we invalidate all of
 + * our invariants. This helps keep the bitcount consistent, and will prevent
 + * us from carrying around garbage bits beyond the end of the map.
 + *
 + * Because clearing bits past the end of map might reset bits we care about
 + * within the array, record the current value of the last bit we're keeping.
 + */
 +if (shrink) {
 +bool set = hbitmap_get(hb, num_elements - 1);
 +uint64_t fix_count = (hb->size << hb->granularity) - num_elements;
 +
 +assert(fix_count);
 +hbitmap_reset(hb, num_elements, fix_count);
 +if (set) {
 +hbitmap_set(hb, num_elements - 1, 1);
 +}

Why is it necessary to set the last bit (if it was set)?  The comment
isn't clear to me.




Re: [Qemu-block] [PATCH v4 18/20] iotests: add QMP event waiting queue

2015-04-02 Thread Stefan Hajnoczi
On Fri, Mar 20, 2015 at 03:17:01PM -0400, John Snow wrote:
 +# Test if 'match' is a recursive subset of 'event'
 +def event_match(event, match = None):

Not worth respinning but PEP8 says there should be no spaces around the
'=' for keyword arguments:
https://www.python.org/dev/peps/pep-0008/#whitespace-in-expressions-and-statements

 +def event_wait(self, name='BLOCK_JOB_COMPLETED', maxtries=3, match=None):
 +# Search cached events
 +for event in self._events:
 +if (event['event'] == name) and event_match(event, match):
 +self._events.remove(event)
 +return event
 +
 +# Poll for new events
 +for _ in range(maxtries):
 +event = self._qmp.pull_event(wait=True)
 +if (event['event'] == name) and event_match(event, match):
 +return event
 +self._events.append(event)
 +
 +return None

I'm not sure if maxtries is useful.  Why is a particular number of
skipped events useful to the caller and how do they pick the magic
number?

If you added the argument because this is a blocking operation then we
should probably use timeouts instead.  Timeouts will bail out even if
QEMU is unresponsive.
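A timeout-based variant might look like this sketch. The `pull_event(wait)` callable is assumed to block for at most `wait` seconds and return None on timeout — a hypothetical API injected as a parameter here so the sketch is testable without a live QEMU:

```python
import time

def event_match(event, match=None):
    # Recursive subset test, mirroring the helper in the patch.
    if match is None:
        return True
    if isinstance(match, dict):
        if not isinstance(event, dict):
            return False
        return all(key in event and event_match(event[key], value)
                   for key, value in match.items())
    return event == match

def event_wait(pull_event, cache, name='BLOCK_JOB_COMPLETED',
               timeout=60.0, match=None):
    # Search cached events first.
    for event in cache:
        if event['event'] == name and event_match(event, match):
            cache.remove(event)
            return event

    # Poll for new events until the deadline expires: even an
    # unresponsive QEMU cannot stall the test indefinitely.
    deadline = time.time() + timeout
    while True:
        remaining = deadline - time.time()
        if remaining <= 0:
            return None
        event = pull_event(remaining)
        if event is None:
            return None
        if event['event'] == name and event_match(event, match):
            return event
        cache.append(event)
```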




Re: [Qemu-block] [PATCH v4 00/20] block: transactionless incremental backup series

2015-04-02 Thread Stefan Hajnoczi
On Fri, Mar 20, 2015 at 03:16:43PM -0400, John Snow wrote:
 I've run out of cheeky jokes for my cover letters.
 
 This patchset enables the in-memory part of the incremental backup
 feature, without transactional support.
 
 Support for transactions was separated into a separate series which
 is also now available on-list. Getting this portion of the series
 committed will help stabilize work on bitmap persistence and bitmap
 migration.
 
 Thanks to Fam Zheng for the original versions of this patchset,
 And thanks to Max for reviewing 2,395 versions of it since.

I'm happy with the design.  I posted a few questions about individual
patches.




Re: [Qemu-block] [PATCH v4 18/20] iotests: add QMP event waiting queue

2015-04-02 Thread John Snow



On 04/02/2015 09:57 AM, Stefan Hajnoczi wrote:

On Fri, Mar 20, 2015 at 03:17:01PM -0400, John Snow wrote:

+# Test if 'match' is a recursive subset of 'event'
+def event_match(event, match = None):


Not worth respinning but PEP8 says there should be no spaces around the
'=' for keyword arguments:
https://www.python.org/dev/peps/pep-0008/#whitespace-in-expressions-and-statements


+def event_wait(self, name='BLOCK_JOB_COMPLETED', maxtries=3, match=None):
+# Search cached events
+for event in self._events:
+if (event['event'] == name) and event_match(event, match):
+self._events.remove(event)
+return event
+
+# Poll for new events
+for _ in range(maxtries):
+event = self._qmp.pull_event(wait=True)
+if (event['event'] == name) and event_match(event, match):
+return event
+self._events.append(event)
+
+return None


I'm not sure if maxtries is useful.  Why is a particular number of
skipped events useful to the caller and how do they pick the magic
number?

If you added the argument because this is a blocking operation then we
should probably use timeouts instead.  Timeouts will bail out even if
QEMU is unresponsive.



Yeah, this was just a poor man's timeout.

I'll make it nicer.



Re: [Qemu-block] [Qemu-devel] [RFC] Intermediate block mirroring

2015-04-02 Thread Eric Blake
On 04/02/2015 07:28 AM, Alberto Garcia wrote:
 Hi,
 
 I'm interested in adding the possibility to mirror an intermediate
 node in a disk image chain, but I would like to have some feedback
 before sending any patches.
 
 The goal would be to convert this:
 
[A] <- [B] <- [C] <- [D]
 
 into this:
 
[A] <- [B] <- [X] <- [D]
 
 where [D] is the active image and [X] would be a copy of [C]. The
 latter would be unlinked from the chain.

Seems useful, if for no other reason than to be another tool in the
arsenal of low-level manipulations that can be strung together for cool
high-level operations.

 
 A use case would be to move disk images across different storage
 backends.
 
 My idea is to extend the drive-mirror command. Similar to what we
 discussed in the case of the intermediate block streaming, I can reuse
 the 'device' parameter to refer to a node name. So the API doesn't
 need any changes other than the extended semantics for this parameter.
 
 One difference with the current functionality is that once the block
 job is completed, the node above the mirrored one would have to change
 its backing image to point to the new one. One solution is to iterate
 over all devices (bdrv_next()) and check which ones are connected
 directly or indirectly to the mirrored node (bdrv_find_overlay()).
 
 drive-mirror has three different sync modes: top, full and none. This
 would be the chain from the example using each one of these modes:
 
   top:
 
 [A] <- [B] <- [X] <- [D]

That is, X becomes the mirror of C, and then a later command lets us
rebase D onto X (since we know the guest-visible contents accessible
from X and C are identical).

 
   full:
 
 [X] <- [D]

That is, X becomes the mirror of the full chain A through C, and then a
later command lets us rebase D onto X (since we know the guest-visible
contents accessible from X and C are identical).

 
   none:
 
 [A] <- [B] <- [C] <- [X] <- [D]

That is, X becomes a new file that tracks changes made since a point in
time which are also going into C; and if we desire we can issue a later
command to rebase D onto X (since we know the guest-visible contents
accessible from X and C are identical at that time), and even later
start cleaning up C (we could use dirty bitmaps to see what got moved
into X to clean those sectors out of C and reduce its size)

 
 My understanding is that in the 'sync=full' case, [A] and [B] would
 also need to be blocked during the operation since they are going to
 disappear from the chain.
 
 I have some code and in principle everything seems to be working fine,
 but I'd like to test it a bit more.
 
 What's anyway your opinion about this proposal?

Certainly seems like something worth having.  The devil may be in the
details, but we can get there when you post proposed patches.

 
 Thanks,
 
 Berto
 
 
 

-- 
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Qemu-block] [PATCH v4 10/20] qmp: Add support of dirty-bitmap sync mode for drive-backup

2015-04-02 Thread John Snow



On 04/02/2015 08:44 AM, Stefan Hajnoczi wrote:

On Fri, Mar 20, 2015 at 03:16:53PM -0400, John Snow wrote:

+} else if (job->sync_mode == MIRROR_SYNC_MODE_DIRTY_BITMAP) {
+/* Dirty Bitmap sync has a slightly different iteration method */
+HBitmapIter hbi;
+int64_t sector;
+int64_t cluster;
+int64_t last_cluster = -1;
+bool polyrhythmic;
+
+bdrv_dirty_iter_init(bs, job->sync_bitmap, &hbi);
+/* Does the granularity happen to match our backup cluster size? */
+polyrhythmic = (bdrv_dirty_bitmap_granularity(job->sync_bitmap) !=
+BACKUP_CLUSTER_SIZE);
+
+/* Find the next dirty /sector/ and copy that /cluster/ */
+while ((sector = hbitmap_iter_next(&hbi)) != -1) {
+cluster = sector / BACKUP_SECTORS_PER_CLUSTER;
+
+/* Fake progress updates for any clusters we skipped,
+ * excluding this current cluster. */
+if (cluster != last_cluster + 1) {
+job->common.offset += ((cluster - last_cluster - 1) *
+   BACKUP_CLUSTER_SIZE);
+}
+
+if (yield_and_check(job)) {
+goto leave;
+}
+
+do {
+ret = backup_do_cow(bs, cluster * BACKUP_SECTORS_PER_CLUSTER,
+BACKUP_SECTORS_PER_CLUSTER, &error_is_read);
+if ((ret < 0) &&
+backup_error_action(job, error_is_read, -ret) ==
+BLOCK_ERROR_ACTION_REPORT) {
+goto leave;
+}
+} while (ret < 0);
+
+/* Advance (or rewind) our iterator if we need to. */
+if (polyrhythmic) {
+bdrv_set_dirty_iter(&hbi,
+(cluster + 1) * BACKUP_SECTORS_PER_CLUSTER);
+}
+
+last_cluster = cluster;
+}


What happens when the dirty bitmap granularity is larger than
BACKUP_SECTORS_PER_CLUSTER?

|---------bitmap granularity---------|
|---backup cluster---|
                      ~~~~~~~~~~~~~~~
Will these sectors ever be copied?

I think this case causes an infinite loop:

   cluster = hbitmap_iter_next(hbi) / BACKUP_SECTORS_PER_CLUSTER

The iterator is reset:

   bdrv_set_dirty_iter(&hbi, (cluster + 1) * BACKUP_SECTORS_PER_CLUSTER);

So we get the same cluster ever time and never advance?



That is indeed a bug. Tracking to the middle of a granularity group will 
return the index of the group you're in the middle of, not the next group.


Thanks for catching this.
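The failure mode can be modeled with a toy bitmap (pure-Python sketch, not the HBitmap code; the 16/8 sector sizes are chosen for illustration): when a granularity group is wider than a backup cluster, seeking the iterator to the next cluster boundary lands inside the same dirty group, so the iterator reports the same sector again.

```python
# Toy model of the polyrhythmic case: the bitmap granularity group
# (16 sectors) is wider than the backup cluster (8 sectors).
GRANULARITY = 16        # sectors per bitmap group
CLUSTER = 8             # sectors per backup cluster
NUM_GROUPS = 4

dirty_groups = {0}      # group 0, i.e. sectors 0..15, is dirty

def iter_next(position):
    """First dirty sector at or after 'position', or -1 if none.

    Like HBitmapIter, a position in the middle of a dirty group
    reports that group's first sector.
    """
    for group in range(position // GRANULARITY, NUM_GROUPS):
        if group in dirty_groups:
            return group * GRANULARITY
    return -1

sector = iter_next(0)                 # first dirty sector
cluster = sector // CLUSTER           # cluster 0 gets copied
# The patch then seeks the iterator to the next cluster boundary...
next_pos = (cluster + 1) * CLUSTER    # sector 8, still inside group 0
# ...but the dirty group swallows the seek: the same sector comes
# back, so the copy loop revisits cluster 0 forever.
assert iter_next(next_pos) == sector == 0
```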