Re: [Qemu-devel] [RFC] block-trace Low Level Command Supporting Disk Introspection

2013-05-23 Thread Stefan Hajnoczi
On Wed, May 22, 2013 at 11:46:15PM +0200, Paolo Bonzini wrote:
 On 22/05/2013 22:47, Richard W.M. Jones wrote:
   
   I meant if there was interest in reading from a disk that isn't fully
   synchronized (yet) to the original disk (it might have old blocks).
   Or would you only want to connect once a (complete) snapshot is
   available (synchronized completely to some point-in-time).
  IIUC a disk which wasn't fully synchronized wouldn't necessarily be
  interpretable by libguestfs, so I guess we would need the complete
  snapshot.
 
 In the case of point-in-time backups (Stefan's block-backup) the plan is
 to have the snapshot complete from the beginning.

The way it will work is that the drive-backup target is a qcow2 image
with the guest's disk as its backing file.  When the guest writes to the
disk, drive-backup copies the original data to the qcow2 image.

The qcow2 image is exported over NBD so a client can connect to access
the read-only point-in-time snapshot.  It is not necessary to populate
the qcow2 file since it uses the guest disk as its backing file - all
reads to unpopulated clusters go to the backing file.
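Sketched as a command sequence (hedged: the file and device names here are made up, and whether the in-QEMU NBD server or qemu-nbd does the export is an assumption; the export mechanism is exactly what was still being settled in this thread):

```sh
# Create the backup target with the guest's disk as its backing file:
qemu-img create -f qcow2 -b guest-disk.img backup.qcow2

# Start the copy-before-write job via QMP ("sync": "none" copies the old
# data into backup.qcow2 only when the guest overwrites it):
#   { "execute": "drive-backup",
#     "arguments": { "device": "virtio0", "sync": "none", "mode": "existing",
#                    "target": "backup.qcow2", "format": "qcow2" } }

# Export the point-in-time view read-only for clients to connect to:
qemu-nbd -r -t backup.qcow2
```

A client such as guestfish can then connect to nbd://localhost, as in Rich's transcript elsewhere in the thread.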

Stefan



Re: [Qemu-devel] [RFC] block-trace Low Level Command Supporting Disk Introspection

2013-05-22 Thread Wolfgang Richter
On Wed, May 15, 2013 at 7:54 AM, Paolo Bonzini pbonz...@redhat.com wrote:

  But does this really cover all use cases a real synchronous active
  mirror would provide? I understood that Wolf wants to get every single
  guest request exposed e.g. on an NBD connection.

 He can use throttling to limit the guest's I/O speed to the size of the
 asynchronous mirror's buffer.


Throttling is fine for me, and actually what I do today (this is the
highest source of overhead for a system that wants to see everything),
just with the tracing framework.

-- 
Wolf


Re: [Qemu-devel] [RFC] block-trace Low Level Command Supporting Disk Introspection

2013-05-22 Thread Wolfgang Richter
On Thu, May 16, 2013 at 9:44 AM, Richard W.M. Jones rjo...@redhat.com wrote:

 Ideally I'd like to issue some QMP commands which would set up the
 point-in-time snapshot, and then connect to this snapshot over (eg)
 NBD, then when I'm done, send some more QMP commands to tear down the
 snapshot.


This is actually interesting.  Does the QEMU nbd server support multiple
readers?

Essentially, if you're RWMJ (not me), and you're keeping a full mirror,
it's clear that the mirror write stream goes to an nbd server, but is it
possible to attach a reader to that same nbd server and read things back
(read-only)?  I know it's possible to name the volumes you attach to, so
I think conceptually with the nbd protocol this should work.

I think this document would be better with one or more examples
 showing how this would be used.


I think the thread now has me looking at making the mirror command
'active' :-) rather than having a new QMP command.

-- 
Wolf


Re: [Qemu-devel] [RFC] block-trace Low Level Command Supporting Disk Introspection

2013-05-22 Thread Paolo Bonzini
On 22/05/2013 17:51, Wolfgang Richter wrote:
 On Thu, May 16, 2013 at 9:44 AM, Richard W.M. Jones rjo...@redhat.com wrote:
 
 Ideally I'd like to issue some QMP commands which would set up the
 point-in-time snapshot, and then connect to this snapshot over (eg)
 NBD, then when I'm done, send some more QMP commands to tear down the
 snapshot.
 
 
 This is actually interesting.  Does the QEMU nbd server support multiple
 readers?

Yes.

 Essentially, if you're RWMJ (not me), and you're keeping a full
 mirror, it's clear that the mirror write stream goes to an nbd server,
 but is it possible to attach a reader to that same nbd server and read
 things back (read-only)?

Yes, it can be done with both qemu-nbd and the QEMU nbd server commands.

Paolo




Re: [Qemu-devel] [RFC] block-trace Low Level Command Supporting Disk Introspection

2013-05-22 Thread Wolfgang Richter
On Wed, May 22, 2013 at 12:11 PM, Paolo Bonzini pbonz...@redhat.com wrote:

  Essentially, if you're RWMJ (not me), and you're keeping a full
  mirror, it's clear that the mirror write stream goes to an nbd server,
  but is it possible to attach a reader to that same nbd server and read
  things back (read-only)?

 Yes, it can be done with both qemu-nbd and the QEMU nbd server commands.


Then this means, if there was an active mirror (or snapshot being
created), it would be easy to attach an nbd client as a reader to it
even as it is being synchronized (perhaps dangerous?).

-- 
Wolf


Re: [Qemu-devel] [RFC] block-trace Low Level Command Supporting Disk Introspection

2013-05-22 Thread Richard W.M. Jones
On Wed, May 22, 2013 at 11:51:16AM -0400, Wolfgang Richter wrote:
 This is actually interesting.  Does the QEMU nbd server support multiple
 readers?

Yes.  qemu-nbd has a -e/--shared=N option which appears to do
exactly what it says in the man page.

$ guestfish -N fs exit
$ ls -lh test1.img 
-rw-rw-r--. 1 rjones rjones 100M May 22 17:37 test1.img
$ qemu-nbd -e 3 -r -t test1.img

From another shell:

$ guestfish --format=raw -a nbd://localhost

Welcome to guestfish, the guest filesystem shell for
editing virtual machine filesystems and disk images.

Type: 'help' for help on commands
  'man' to read the manual
  'quit' to quit the shell

fs run
fs list-filesystems 
/dev/sda1: ext2

Run up to two extra guestfish instances, with the same result.  The
fourth guestfish instance hangs at the 'run' command until one of the
first three is told to exit.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top



Re: [Qemu-devel] [RFC] block-trace Low Level Command Supporting Disk Introspection

2013-05-22 Thread Wolfgang Richter
On Wed, May 22, 2013 at 12:42 PM, Richard W.M. Jones rjo...@redhat.com wrote:

 Run up to two extra guestfish instances, with the same result.  The
 fourth guestfish instance hangs at the 'run' command until one of the
 first three is told to exit.


And you're interested in being notified when a snapshot is safe to read
from?  Or is it valuable to try reading immediately?

-- 
Wolf


Re: [Qemu-devel] [RFC] block-trace Low Level Command Supporting Disk Introspection

2013-05-22 Thread Richard W.M. Jones
On Wed, May 22, 2013 at 02:32:37PM -0400, Wolfgang Richter wrote:
 On Wed, May 22, 2013 at 12:42 PM, Richard W.M. Jones rjo...@redhat.com wrote:
 
  Run up to two extra guestfish instances, with the same result.  The
  fourth guestfish instance hangs at the 'run' command until one of the
  first three is told to exit.
 
 
 And you're interested in being notified when a snapshot is safe to read
 from?  Or is it valuable to try reading immediately?

I'm not sure I understand the question.

I assumed (maybe wrongly) that if we had an NBD address (i.e. Unix
socket or IP:port) then we'd just connect to that and go.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW



Re: [Qemu-devel] [RFC] block-trace Low Level Command Supporting Disk Introspection

2013-05-22 Thread Wolfgang Richter
On Wed, May 22, 2013 at 3:26 PM, Richard W.M. Jones rjo...@redhat.com wrote:

 On Wed, May 22, 2013 at 02:32:37PM -0400, Wolfgang Richter wrote:
  On Wed, May 22, 2013 at 12:42 PM, Richard W.M. Jones rjo...@redhat.com
 wrote:
 
   Run up to two extra guestfish instances, with the same result.  The
   fourth guestfish instance hangs at the 'run' command until one of the
   first three is told to exit.
 
 
  And you're interested in being notified when a snapshot is safe to read
  from?  Or is it valuable to try reading immediately?

 I'm not sure I understand the question.

 I assumed (maybe wrongly) that if we had an NBD address (i.e. Unix
 socket or IP:port) then we'd just connect to that and go.


I meant if there was interest in reading from a disk that isn't fully
synchronized (yet) to the original disk (it might have old blocks).  Or
would you only want to connect once a (complete) snapshot is available
(synchronized completely to some point-in-time).

-- 
Wolf


Re: [Qemu-devel] [RFC] block-trace Low Level Command Supporting Disk Introspection

2013-05-22 Thread Richard W.M. Jones
On Wed, May 22, 2013 at 03:38:33PM -0400, Wolfgang Richter wrote:
 On Wed, May 22, 2013 at 3:26 PM, Richard W.M. Jones rjo...@redhat.com wrote:
 
  On Wed, May 22, 2013 at 02:32:37PM -0400, Wolfgang Richter wrote:
   On Wed, May 22, 2013 at 12:42 PM, Richard W.M. Jones rjo...@redhat.com
  wrote:
  
Run up to two extra guestfish instances, with the same result.  The
fourth guestfish instance hangs at the 'run' command until one of the
first three is told to exit.
  
  
   And you're interested in being notified when a snapshot is safe to read
   from?  Or is it valuable to try reading immediately?
 
  I'm not sure I understand the question.
 
  I assumed (maybe wrongly) that if we had an NBD address (i.e. Unix
  socket or IP:port) then we'd just connect to that and go.
 
 
 I meant if there was interest in reading from a disk that isn't fully
 synchronized (yet) to the original disk (it might have old blocks).  Or
 would you only want to connect once a (complete) snapshot is available
 (synchronized completely to some point-in-time).

IIUC a disk which wasn't fully synchronized wouldn't necessarily be
interpretable by libguestfs, so I guess we would need the complete
snapshot.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org



Re: [Qemu-devel] [RFC] block-trace Low Level Command Supporting Disk Introspection

2013-05-22 Thread Paolo Bonzini
On 22/05/2013 22:47, Richard W.M. Jones wrote:
  
  I meant if there was interest in reading from a disk that isn't fully
  synchronized (yet) to the original disk (it might have old blocks).
  Or would you only want to connect once a (complete) snapshot is
  available (synchronized completely to some point-in-time).
 IIUC a disk which wasn't fully synchronized wouldn't necessarily be
 interpretable by libguestfs, so I guess we would need the complete
 snapshot.

In the case of point-in-time backups (Stefan's block-backup) the plan is
to have the snapshot complete from the beginning.

Paolo



Re: [Qemu-devel] [RFC] block-trace Low Level Command Supporting Disk Introspection

2013-05-16 Thread Richard W.M. Jones
[...]

From my point of view, what I'm missing here is how I would use it.

Ideally I'd like to issue some QMP commands which would set up the
point-in-time snapshot, and then connect to this snapshot over (eg)
NBD, then when I'm done, send some more QMP commands to tear down the
snapshot.

I think this document would be better with one or more examples
showing how this would be used.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming blog: http://rwmj.wordpress.com
Fedora now supports 80 OCaml packages (the OPEN alternative to F#)



Re: [Qemu-devel] [RFC] block-trace Low Level Command Supporting Disk Introspection

2013-05-15 Thread Kevin Wolf
On 14.05.2013 at 18:45, Paolo Bonzini wrote:
 On 14/05/2013 17:48, Wolfgang Richter wrote:
  On Tue, May 14, 2013 at 6:04 AM, Paolo Bonzini pbonz...@redhat.com wrote:
  
  On 14/05/2013 10:50, Kevin Wolf wrote:
   Or, to translate it into our existing terminology, drive-mirror
   implements a passive mirror, you're proposing an active one (which we
   do want to have).
  
   With an active mirror, we'll want to have another choice: The
  mirror can
   be synchronous (guest writes only complete after the mirrored
  write has
   completed) or asynchronous (completion is based only on the original
   image). It should be easy enough to support both once an active mirror
   exists.
  
  Right, I'm waiting for Stefan's block-backup to give me the right
  hooks for the active mirror.
  
  The bulk phase will always be passive, but an active-asynchronous mirror
  has some interesting properties and it makes sense to implement it.
  
  
  Do you mean you'd model the 'active' mode after 'block-backup,' or actually
  call functions provided by 'block-backup'?
 
 No, I'll just reuse the same hooks within block/mirror.c (almost... it
 looks like I need after_write too, not just before_write :( that's a
 pity).

Makes me wonder if using a real BlockDriver for the filter from the
beginning wouldn't be better than accumulating more and more hooks and
having to find ways to pass data from 'before' to 'after' hooks...

 Basically:
 
 1) before the write, if there is space in the job's buffers, allocate a
 MirrorOp and a data buffer for the write.  Also record whether the block
 was dirty before;
 
 2) after the write, do nothing if there was no room to allocate the data
 buffer.  Else clear the block from the dirty bitmap.  If the block was
 dirty, read the whole cluster from the source as in passive mirroring.
 If it wasn't, copy the data from guest memory to the preallocated buffer
 and write it to the destination;

Does the "if there was no room" part mean that the mirror is active only
sometimes?

And why even bother with a dirty bitmap for an active mirror? The
background job that sequentially processes the whole image only needs a
counter, no bitmap.

At which point it looks like implementing it separate from mirror.c
could make more sense.

Kevin



Re: [Qemu-devel] [RFC] block-trace Low Level Command Supporting Disk Introspection

2013-05-15 Thread Paolo Bonzini
On 15/05/2013 09:59, Kevin Wolf wrote:
 Do you mean you'd model the 'active' mode after 'block-backup,' or actually
 call functions provided by 'block-backup'?

 No, I'll just reuse the same hooks within block/mirror.c (almost... it
 looks like I need after_write too, not just before_write :( that's a
 pity).
 
 Makes me wonder if using a real BlockDriver for the filter from the
 beginning wouldn't be better than accumulating more and more hooks and
 having to find ways to pass data from 'before' to 'after' hooks...

We don't need a way to pass data from 'before' to 'after' hooks; a
simple scan of a linked list will do.

 Basically:

 1) before the write, if there is space in the job's buffers, allocate a
 MirrorOp and a data buffer for the write.  Also record whether the block
 was dirty before;

 2) after the write, do nothing if there was no room to allocate the data
 buffer.  Else clear the block from the dirty bitmap.  If the block was
 dirty, read the whole cluster from the source as in passive mirroring.
 If it wasn't, copy the data from guest memory to the preallocated buffer
 and write it to the destination;
 
 Does the "if there was no room" part mean that the mirror is active only
 sometimes?

Yes, otherwise the guest can allocate arbitrary amounts of memory in the
host just by starting a few very large I/O operations.

 And why even bother with a dirty bitmap for an active mirror? The
 background job that sequentially processes the whole image only needs a
 counter, no bitmap.

That's not enough for the case when the host crashes and you have to
restart the mirroring or complete it offline.

Paolo

 At which point it looks like implementing it separate from mirror.c
 could make more sense.
 
 Kevin
 




Re: [Qemu-devel] [RFC] block-trace Low Level Command Supporting Disk Introspection

2013-05-15 Thread Paolo Bonzini

  We don't need a way to pass data from 'before' to 'after' hooks; a
  simple scan of a linked list will do.
 
 So in this case the linked list is the way.

Point taken. :)

   Does the "if there was no room" part mean that the mirror is active only
   sometimes?
  
  Yes, otherwise the guest can allocate arbitrary amounts of memory in the
  host just by starting a few very large I/O operations.
 
 I think I would rather throttle I/O in this case, i.e. requests wait
 until they can get the space. At least for a synchronous mirror we
 have to do something like this.

Yes, but this is still asynchronous.  The active part is just an optimization
to avoid write amplification (where small random writes require I/O of an entire
block as big as the bitmap granularity).

   And why even bother with a dirty bitmap for an active mirror? The
   background job that sequentially processes the whole image only needs a
   counter, no bitmap.
  
  That's not enough for the case when the host crashes and you have to
  restart the mirroring or complete it offline.
 
 You're thinking of a persistent bitmap here? Makes sense then, I didn't
 think about that.

Yes.

Paolo



Re: [Qemu-devel] [RFC] block-trace Low Level Command Supporting Disk Introspection

2013-05-15 Thread Paolo Bonzini
On 15/05/2013 11:46, Kevin Wolf wrote:
 On 15.05.2013 at 11:16, Paolo Bonzini wrote:
 Does the "if there was no room" part mean that the mirror is active only
 sometimes?

 Yes, otherwise the guest can allocate arbitrary amounts of memory in the
 host just by starting a few very large I/O operations.
 
 On second thought, can't you do zero copy anyway for full cluster
 writes? This means that at most two clusters per request must be
 allocated, no matter how large it is, and you can probably reuse the
 same one-cluster buffer for both.

Only for a synchronous mirror.  For an asynchronous mirror, there's no
guarantee that the mirror finishes writing before the source.  When it
doesn't, the guest can touch the memory and the mirror diverges from
the source.

 I think I would rather throttle I/O in this case, i.e. requests wait
 until they can get the space. At least for a synchronous mirror we
 have to do something like this.

 Yes, but this is still asynchronous.  The active part is just an
 optimization to avoid write amplification (where small random writes
 require I/O of an entire block as big as the bitmap granularity).
 
 Yes, that sounds like a good use case.
 
 But does this really cover all use cases a real synchronous active
 mirror would provide? I understood that Wolf wants to get every single
 guest request exposed e.g. on an NBD connection.

He can use throttling to limit the guest's I/O speed to the size of the
asynchronous mirror's buffer.

Paolo



Re: [Qemu-devel] [RFC] block-trace Low Level Command Supporting Disk Introspection

2013-05-14 Thread Stefan Hajnoczi
On Mon, May 13, 2013 at 05:21:54PM -0400, Wolfgang Richter wrote:
 I'm working on a new patch series which will add a new QMP command,
 block-trace, which turns on tracing of writes for a specified block
 device and sends the stream unmodified to another block device.  The
 'trace' is meant to be precise, meaning that writes are not lost, which
 differentiates this command from others.  It can be turned on and off
 depending on when it is needed.
 
 
 
 How is this different from block-backup or drive-mirror?
 
 
 block-backup is designed to create point-in-time snapshots and not
 clone the entire write stream of a VM to a particular device.  It
 implements copy-on-write to create a snapshot.  Thus whenever a write
 occurs, block-backup is designed to send the original data and not the
 contents of the new write.
 
 drive-mirror is designed to mirror a disk to another location.  It
 operates by periodically scanning a dirty bitmap and cloning blocks
 when dirtied.  This is efficient as it allows for batching of writes,
 but it does not maintain the order in which guest writes occurred and
 it can miss intermediate writes when they go to the same location on
 disk.
 
 
 
 How can block-trace be used?
 
 
 (1) Disk introspection - systems which analyze the writes going to a
 disk for introspection require a perfect clone of the write stream to
 an original disk to stay in sync with updates to guest file systems.
 
 (2) Replicated block device - two block devices could be maintained as
 exact copies of each other up to a point in the disk write stream that
 has successfully been written to the destination block device.

CCed Benoit Canet, who implemented the quorum block driver to mirror I/O
to multiple images and verify data integrity.

QEMU is accumulating many different approaches to snapshots and
mirroring.  They all have their pros and cons so it's not possible to
support only one approach for all use cases.

The suggested approach is writing a BlockDriver which mirrors I/O to two
BlockDriverStates.  There has been discussion around breaking
BlockDriver into smaller interfaces, including a BlockFilter for
intercepting I/O, but this has not been implemented.  blkverify is an
example of a BlockDriver that manages two child BlockDriverStates and
may be a good starting point.



Re: [Qemu-devel] [RFC] block-trace Low Level Command Supporting Disk Introspection

2013-05-14 Thread Kevin Wolf
On 13.05.2013 at 23:21, Wolfgang Richter wrote:
 I'm working on a new patch series which will add a new QMP command,
 block-trace, which turns on tracing of writes for a specified block device and
 sends the stream unmodified to another block device.  The 'trace' is meant to
 be precise meaning that writes are not lost, which differentiates this command
 from others.  It can be turned on and off depending on when it is needed.
 
 
 
 How is this different from block-backup or drive-mirror?
 
 
 block-backup is designed to create point-in-time snapshots and not clone the
 entire write stream of a VM to a particular device.  It implements
 copy-on-write to create a snapshot.  Thus whenever a write occurs, 
 block-backup
 is designed to send the original data and not the contents of the new write.
 
 drive-mirror is designed to mirror a disk to another location.  It operates by
 periodically scanning a dirty bitmap and cloning blocks when dirtied.  This is
 efficient as it allows for batching of writes, but it does not maintain the
 order in which guest writes occurred and it can miss intermediate writes when
 they go to the same location on disk.

Or, to translate it into our existing terminology, drive-mirror
implements a passive mirror, you're proposing an active one (which we
do want to have).

With an active mirror, we'll want to have another choice: The mirror can
be synchronous (guest writes only complete after the mirrored write has
completed) or asynchronous (completion is based only on the original
image). It should be easy enough to support both once an active mirror
exists.

 How can block-trace be used?
 
 
 (1) Disk introspection - systems which analyze the writes going to a disk for
 introspection require a perfect clone of the write stream to an original disk
 to stay in-sync with updates to guest file systems.
 
 (2) Replicated block device - two block devices could be maintained as exact
 copies of each other up to a point in the disk write stream that has
 successfully been written to the destination block device.

You're leaving out the most interesting section: How should block-trace
be implemented?

The first question is what the API should look like, on the QMP level. I
think originally the idea was to use drive-mirror for all kinds of
mirrors, but maybe it makes more sense indeed to keep the active mirror
separate. I don't particularly like the name block-trace for a separate
command, but let's save the bikeshedding for later.

The other question is how to implement it internally. I don't think
adding specific code for each new block job into bdrv_co_do_writev() is
acceptable. We really need a generic way to intercept I/O operations.
The keyword from earlier discussions is block filters. Essentially the
idea is that the block job temporarily adds a BlockDriverState on top of
the format driver and becomes able to implement all callbacks it likes
to intercept. The bad news is that the infrastructure isn't there yet
to actually make this happen in a sane way.

Kevin



Re: [Qemu-devel] [RFC] block-trace Low Level Command Supporting Disk Introspection

2013-05-14 Thread Paolo Bonzini
On 14/05/2013 10:50, Kevin Wolf wrote:
 Or, to translate it into our existing terminology, drive-mirror
 implements a passive mirror, you're proposing an active one (which we
 do want to have).
 
 With an active mirror, we'll want to have another choice: The mirror can
 be synchronous (guest writes only complete after the mirrored write has
 completed) or asynchronous (completion is based only on the original
 image). It should be easy enough to support both once an active mirror
 exists.

Right, I'm waiting for Stefan's block-backup to give me the right
hooks for the active mirror.

The bulk phase will always be passive, but an active-asynchronous mirror
has some interesting properties and it makes sense to implement it.

Paolo



Re: [Qemu-devel] [RFC] block-trace Low Level Command Supporting Disk Introspection

2013-05-14 Thread Wolfgang Richter
On Tue, May 14, 2013 at 4:40 AM, Stefan Hajnoczi stefa...@redhat.com wrote:

 QEMU is accumulating many different approaches to snapshots and
 mirroring.  They all have their pros and cons so it's not possible to
 support only one approach for all use cases.

 The suggested approach is writing a BlockDriver which mirrors I/O to two
 BlockDriverStates.  There has been discussion around breaking
 BlockDriver into smaller interfaces, including a BlockFilter for
 intercepting I/O, but this has not been implemented.  blkverify is an
 example of a BlockDriver that manages two child BlockDriverStates and
 may be a good starting point.


BlockFilter sounds interesting.  The main reason I proposed 'block-trace'
is that it is almost identical to what I currently have implemented with
the tracing framework---I just didn't have a nice QMP command.

-- 
Wolf


Re: [Qemu-devel] [RFC] block-trace Low Level Command Supporting Disk Introspection

2013-05-14 Thread Wolfgang Richter
On Tue, May 14, 2013 at 4:50 AM, Kevin Wolf kw...@redhat.com wrote:

 Or, to translate it into our existing terminology, drive-mirror
 implements a passive mirror, you're proposing an active one (which we
 do want to have).

 With an active mirror, we'll want to have another choice: The mirror can
 be synchronous (guest writes only complete after the mirrored write has
 completed) or asynchronous (completion is based only on the original
 image). It should be easy enough to support both once an active mirror
 exists.


Yes! Active mirroring is precisely what is needed to implement block-level
introspection.


 You're leaving out the most interesting section: How should block-trace
 be implemented?


Noted, although maybe folding it into 'drive-mirror' as an 'active' option
might be best, now that Paolo has spoken up.


 The other question is how to implement it internally. I don't think
 adding specific code for each new block job into bdrv_co_do_writev() is
 acceptable. We really need a generic way to intercept I/O operations.
 The keyword from earlier discussions is block filters. Essentially the
 idea is that the block job temporarily adds a BlockDriverState on top of
 the format driver and becomes able to implement all callbacks it likes
 to intercept. The bad news is that the infrastructure isn't there yet
 to actually make this happen in a sane way.


Yeah, I'd also really love block filters, and I probably would have used
them instead of the tracing subsystem originally if they had existed.
They would make implementing all kinds of 'block-level' features much,
much easier.

-- 
Wolf


Re: [Qemu-devel] [RFC] block-trace Low Level Command Supporting Disk Introspection

2013-05-14 Thread Wolfgang Richter
On Tue, May 14, 2013 at 6:04 AM, Paolo Bonzini pbonz...@redhat.com wrote:

 On 14/05/2013 10:50, Kevin Wolf wrote:
  Or, to translate it into our existing terminology, drive-mirror
  implements a passive mirror, you're proposing an active one (which we
  do want to have).
 
  With an active mirror, we'll want to have another choice: The mirror can
  be synchronous (guest writes only complete after the mirrored write has
  completed) or asynchronous (completion is based only on the original
  image). It should be easy enough to support both once an active mirror
  exists.

 Right, I'm waiting for Stefan's block-backup to give me the right
 hooks for the active mirror.

 The bulk phase will always be passive, but an active-asynchronous mirror
 has some interesting properties and it makes sense to implement it.


Do you mean you'd model the 'active' mode after 'block-backup,' or
actually call functions provided by 'block-backup'?  If I knew more
about what you had in mind, I wouldn't mind trying to add this 'active'
mode to 'drive-mirror' and test it with my use case.  I want to avoid
duplicate work, so if you want to implement it yourself I can defer
this.

-- 
Wolf


Re: [Qemu-devel] [RFC] block-trace Low Level Command Supporting Disk Introspection

2013-05-14 Thread Paolo Bonzini
On 14/05/2013 17:48, Wolfgang Richter wrote:
 On Tue, May 14, 2013 at 6:04 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 
  On 14/05/2013 10:50, Kevin Wolf wrote:
  Or, to translate it into our existing terminology, drive-mirror
  implements a passive mirror, you're proposing an active one (which we
  do want to have).
 
  With an active mirror, we'll want to have another choice: The
 mirror can
  be synchronous (guest writes only complete after the mirrored
 write has
  completed) or asynchronous (completion is based only on the original
  image). It should be easy enough to support both once an active mirror
  exists.
 
 Right, I'm waiting for Stefan's block-backup to give me the right
 hooks for the active mirror.
 
 The bulk phase will always be passive, but an active-asynchronous mirror
 has some interesting properties and it makes sense to implement it.
 
 
 Do you mean you'd model the 'active' mode after 'block-backup,' or actually
 call functions provided by 'block-backup'?

No, I'll just reuse the same hooks within block/mirror.c (almost... it
looks like I need after_write too, not just before_write :( that's a
pity).  Basically:

1) before the write, if there is space in the job's buffers, allocate a
MirrorOp and a data buffer for the write.  Also record whether the block
was dirty before;

2) after the write, do nothing if there was no room to allocate the data
buffer.  Else clear the block from the dirty bitmap.  If the block was
dirty, read the whole cluster from the source as in passive mirroring.
If it wasn't, copy the data from guest memory to the preallocated buffer
and write it to the destination;

 If I knew more about what you had in mind, I wouldn't mind trying to
 add this 'active' mode to 'drive-mirror' and test it with my use case.
 I want to avoid duplicate work, so if you want to implement it yourself
 I can defer this.

Also the other way round.  If you want to give it a shot based on the
above spec just tell me.

It should require no changes to block.c except for adding after_write.

Paolo




Re: [Qemu-devel] [RFC] block-trace Low Level Command Supporting Disk Introspection

2013-05-14 Thread Wolfgang Richter
On Tue, May 14, 2013 at 12:45 PM, Paolo Bonzini pbonz...@redhat.com wrote:

 No, I'll just reuse the same hooks within block/mirror.c (almost... it
 looks like I need after_write too, not just before_write :( that's a
 pity).  Basically:

 1) before the write, if there is space in the job's buffers, allocate a
 MirrorOp and a data buffer for the write.  Also record whether the block
 was dirty before;

 2) after the write, do nothing if there was no room to allocate the data
 buffer.  Else clear the block from the dirty bitmap.  If the block was
 dirty, read the whole cluster from the source as in passive mirroring.
 If it wasn't, copy the data from guest memory to the preallocated buffer
 and write it to the destination;

  If I knew more about what you had in mind, I wouldn't mind trying to
  add this 'active' mode to 'drive-mirror' and test it with my use case.
  I want to avoid duplicate work, so if you want to implement it yourself
  I can defer this.

 Also the other way round.  If you want to give it a shot based on the
 above spec just tell me.


Talked with my group here as well.  I think I'd like to give it a shot
based on the above spec rather than refactor my code into a new command.
This way it will hopefully reduce duplicated efforts, and provide extra
testing for the active mirroring code.

I'll take a pass through the mirror code to make sure I understand it
better than I currently do.

Would you like to coordinate off-list until we have a patch?

-- 
Wolf