Re: [Qemu-devel] [PATCH 0/2] block: Write out internal caches even with cache=unsafe

2011-10-24 Thread Kevin Wolf
Am 21.10.2011 20:44, schrieb Paolo Bonzini:
 On 10/21/2011 07:08 PM, Kevin Wolf wrote:
 Avi complained that not even writing out qcow2's cache on bdrv_flush() made
 cache=unsafe too unsafe to be useful. He's got a point.
 
 Why? cache=unsafe is explicitly allowing to s/data/manure/ on crash.

It's surely expected on a host crash, but is it for a qemu crash?
cache=unsafe was introduced to avoid fsync() costs, which it still does
after this patch.

 If you do this for raw-posix, you need to do it for all protocols.

rbd could use it, too, right. Any other protocol I missed?

Kevin



Re: [Qemu-devel] [PATCH 0/2] block: Write out internal caches even with cache=unsafe

2011-10-24 Thread Paolo Bonzini

On 10/24/2011 09:37 AM, Kevin Wolf wrote:

Why? cache=unsafe is explicitly allowing to s/data/manure/ on
crash.


It's surely expected on a host crash, but is it for a qemu crash?
cache=unsafe was introduced to avoid fsync() costs, which it still
does after this patch.


I think it's not about why is it there, but rather about what is it 
useful for.  My interpretation of it is I do not need the image 
anymore unless the command exits cleanly: VM installations, qemu-img 
conversions, BDRV_O_SNAPSHOT (doesn't do it yet, but it could).  Even 
SIGINT and SIGTERM would be excluded from this definition, but they cost 
nothing so it's nice to include them.



If you do this for raw-posix, you need to do it for all protocols.


rbd could use it, too, right. Any other protocol I missed?


NBD could, but it doesn't support flush yet.

In general, even if it were useful to implement this, I'm not sure this 
is the best way to implement it.  For example you could simply clear 
BDRV_O_NO_FLUSH in qcow2_open.


Paolo



Re: [Qemu-devel] [PATCH 0/2] block: Write out internal caches even with cache=unsafe

2011-10-24 Thread Kevin Wolf
Am 23.10.2011 16:33, schrieb Paolo Bonzini:
 On 10/22/2011 05:07 PM, Alexander Graf wrote:

 On 21.10.2011, at 11:44, Paolo Bonzini wrote:

 On 10/21/2011 07:08 PM, Kevin Wolf wrote:
 Avi complained that not even writing out qcow2's cache on
 bdrv_flush() made cache=unsafe too unsafe to be useful. He's got
 a point.

 Why? cache=unsafe is explicitly allowing to s/data/manure/ on
 crash.

 Exactly, but not on kill. By not flushing internal caches you're
 almost guaranteed to get an inconsistent qcow2 image.
 
 This should be covered already by termsig_handler.  bdrv_close_all 
 closes all block devices, and qcow2_close does flush the caches.
 
 SIGKILL doesn't give any guarantee of course but it does not in general, 
 even without cache=unsafe; you might get a SIGKILL a moment before a 
 bdrv_flush even without cache=unsafe, and get unclean qcow2 metadata.

Unclean yes, in the sense that you may get cluster leaks. If getting
SIGKILL a moment before the flush led to real corruption however,
cache=none would be broken as well.

 By not flushing internal caches you're almost guaranteed to get an
 inconsistent qcow2 image.
 
 Of course the inconsistencies with cache=unsafe will be massive if you 
 don't have a clean exit, but that's expected.  If in some cases you want 
 a clean exit, but right now you don't, the place to fix those cases 
 doesn't seem to be the block layer, but the main loop.

I don't think there's much the main loop can do against SIGKILL,
segfaults or abort().

 Also,
 
 1) why should cache=unsafe differentiate an OS that sends a flush from 
 one that doesn't (e.g. MS-DOS), from the point of view of image metadata?
 
 2) why should the guest actually send a flush if cache=unsafe?  Currently
 
  if (flags  BDRV_O_CACHE_WB)
  bs-enable_write_cache = 1;
 
 covers cache=unsafe.  However, in the end write cache enable means do I 
 need to flush data, and the answer is no when cache=unsafe, because 
 the flushes are useless and guests are free to reorder requests.

 shot-in-the-darkPerhaps what you want is to make qcow2 caches 
 writethrough in cache=unsafe mode, so that at least a try is made to 
 write the metadata/shot-in-the-dark (even though the underlying raw 
 protocol won't flush it)?  I'm not sure that is particularly useful, but 
 maybe it can help me understanding the benefit of this change.

Yes, this is the intention. It's about flushing metadata, not guest
data. The semantics that I think cache=unsafe should have is that after
a bdrv_flush() we have flushed all caches in qemu (so that the image
survives a qemu crash), but we don't care about flushing the host page
cache.

Kevin



Re: [Qemu-devel] [PATCH 0/2] block: Write out internal caches even with cache=unsafe

2011-10-24 Thread Kevin Wolf
Am 24.10.2011 09:53, schrieb Paolo Bonzini:
 On 10/24/2011 09:37 AM, Kevin Wolf wrote:
 Why? cache=unsafe is explicitly allowing to s/data/manure/ on
 crash.

 It's surely expected on a host crash, but is it for a qemu crash?
 cache=unsafe was introduced to avoid fsync() costs, which it still
 does after this patch.
 
 I think it's not about why is it there, but rather about what is it 
 useful for.  My interpretation of it is I do not need the image 
 anymore unless the command exits cleanly: VM installations, qemu-img 
 conversions, BDRV_O_SNAPSHOT (doesn't do it yet, but it could).  Even 
 SIGINT and SIGTERM would be excluded from this definition, but they cost 
 nothing so it's nice to include them.

I think another common interpretation is: I don't run this VM in
production but for development. I want the VM to go faster and I can
recreate the image in the unlikely event that power fails during my
work. But it certainly would be nasty.

 If you do this for raw-posix, you need to do it for all protocols.

 rbd could use it, too, right. Any other protocol I missed?
 
 NBD could, but it doesn't support flush yet.
 
 In general, even if it were useful to implement this, I'm not sure this 
 is the best way to implement it.  For example you could simply clear 
 BDRV_O_NO_FLUSH in qcow2_open.

That could work, too. On the other hand I don't like block drivers to
modify their flags. For example, would query-block still provide the
correct cache mode then?

But I think that starting to make exceptions for single block drivers
isn't a good idea anyway. If we want bdrv_flush() to write out all
metadata internal to qemu, I think the approach with checking the flag
in drivers calling things like fsync() is better. The common thing is to
do the flush.

Kevin



Re: [Qemu-devel] [PATCH 0/2] block: Write out internal caches even with cache=unsafe

2011-10-24 Thread Paolo Bonzini

On 10/24/2011 10:17 AM, Kevin Wolf wrote:

  I think it's not about why is it there, but rather about what is it
  useful for.  My interpretation of it is I do not need the image
  anymore unless the command exits cleanly: VM installations, qemu-img
  conversions, BDRV_O_SNAPSHOT (doesn't do it yet, but it could).  Even
  SIGINT and SIGTERM would be excluded from this definition, but they cost
  nothing so it's nice to include them.

I think another common interpretation is: I don't run this VM in
production but for development. I want the VM to go faster and I can
recreate the image in the unlikely event that power fails during my
work. But it certainly would be nasty.


Fair enough.


But I think that starting to make exceptions for single block drivers
isn't a good idea anyway. If we want bdrv_flush() to write out all
metadata internal to qemu, I think the approach with checking the flag
in drivers calling things like fsync() is better. The common thing is to
do the flush.


I don't know... checking BDRV_O_NO_FLUSH in the drivers rather than in 
the generic code sounds like a layering violation.  Perhaps what you're 
after is a separation of bdrv_co_flush from bdrv_{,co_,aio_}fsync?  Then 
BDRV_O_NO_FLUSH (better renamed to BDRV_O_NO_FSYNC...) would only 
inhibit the latter.


Paolo



Re: [Qemu-devel] [PATCH 0/2] block: Write out internal caches even with cache=unsafe

2011-10-24 Thread Kevin Wolf
Am 24.10.2011 10:47, schrieb Paolo Bonzini:
 On 10/24/2011 10:17 AM, Kevin Wolf wrote:
  I think it's not about why is it there, but rather about what is it
  useful for.  My interpretation of it is I do not need the image
  anymore unless the command exits cleanly: VM installations, qemu-img
  conversions, BDRV_O_SNAPSHOT (doesn't do it yet, but it could).  Even
  SIGINT and SIGTERM would be excluded from this definition, but they cost
  nothing so it's nice to include them.

 I think another common interpretation is: I don't run this VM in
 production but for development. I want the VM to go faster and I can
 recreate the image in the unlikely event that power fails during my
 work. But it certainly would be nasty.
 
 Fair enough.
 
 But I think that starting to make exceptions for single block drivers
 isn't a good idea anyway. If we want bdrv_flush() to write out all
 metadata internal to qemu, I think the approach with checking the flag
 in drivers calling things like fsync() is better. The common thing is to
 do the flush.
 
 I don't know... checking BDRV_O_NO_FLUSH in the drivers rather than in 
 the generic code sounds like a layering violation.  Perhaps what you're 
 after is a separation of bdrv_co_flush from bdrv_{,co_,aio_}fsync?  Then 
 BDRV_O_NO_FLUSH (better renamed to BDRV_O_NO_FSYNC...) would only 
 inhibit the latter.

Why? All other cache related BDRV_O_* flags are interpreted by the block
drivers, so why should BDRV_O_NO_FLUSH be special?

Kevin



Re: [Qemu-devel] [PATCH 0/2] block: Write out internal caches even with cache=unsafe

2011-10-24 Thread Peter Maydell
On 24 October 2011 09:17, Kevin Wolf kw...@redhat.com wrote:
 Am 24.10.2011 09:53, schrieb Paolo Bonzini:
 I think it's not about why is it there, but rather about what is it
 useful for.  My interpretation of it is I do not need the image
 anymore unless the command exits cleanly: VM installations, qemu-img
 conversions, BDRV_O_SNAPSHOT (doesn't do it yet, but it could).  Even
 SIGINT and SIGTERM would be excluded from this definition, but they cost
 nothing so it's nice to include them.

 I think another common interpretation is: I don't run this VM in
 production but for development. I want the VM to go faster and I can
 recreate the image in the unlikely event that power fails during my
 work. But it certainly would be nasty.

So at the moment the documentation (qemu-doc.html) says:
Writeback caching will report data writes as completed as soon as the
data is present in the host page cache. This is safe as long as you trust
your host. If your host crashes or loses power, then the guest may
experience data corruption.

and also:
In case you don't care about data integrity over host failures, use
cache=unsafe. This option tells qemu that it never needs to write any
data to the disk but can instead keeps things in cache. If anything
goes wrong, like your host losing power, the disk storage getting
disconnected accidently, etc. you're image will most probably be
rendered unusable.

which to me reads in both cases as will not corrupt data if the
guest crashes but may do so if the host crashes and leaves me
with no idea what the difference is. It might be nice if we could
clarify that a bit...

-- PMM



Re: [Qemu-devel] [PATCH 0/2] block: Write out internal caches even with cache=unsafe

2011-10-24 Thread Paolo Bonzini

On 10/24/2011 10:54 AM, Kevin Wolf wrote:

  I don't know... checking BDRV_O_NO_FLUSH in the drivers rather than in
  the generic code sounds like a layering violation.  Perhaps what you're
  after is a separation of bdrv_co_flush from bdrv_{,co_,aio_}fsync?  Then
  BDRV_O_NO_FLUSH (better renamed to BDRV_O_NO_FSYNC...) would only
  inhibit the latter.

Why? All other cache related BDRV_O_* flags are interpreted by the block
drivers, so why should BDRV_O_NO_FLUSH be special?


You're changing the API and asking for possibly non-trivial changes in 
all protocol drivers, in order to accomodate semantics that all format 
drivers potentially could desire.  So I wonder if the problem is simply 
that the current API is not expressive enough.


Paolo



Re: [Qemu-devel] [PATCH 0/2] block: Write out internal caches even with cache=unsafe

2011-10-24 Thread Kevin Wolf
Am 24.10.2011 11:26, schrieb Paolo Bonzini:
 On 10/24/2011 10:54 AM, Kevin Wolf wrote:
  I don't know... checking BDRV_O_NO_FLUSH in the drivers rather than in
  the generic code sounds like a layering violation.  Perhaps what you're
  after is a separation of bdrv_co_flush from bdrv_{,co_,aio_}fsync?  Then
  BDRV_O_NO_FLUSH (better renamed to BDRV_O_NO_FSYNC...) would only
  inhibit the latter.

 Why? All other cache related BDRV_O_* flags are interpreted by the block
 drivers, so why should BDRV_O_NO_FLUSH be special?
 
 You're changing the API and asking for possibly non-trivial changes in 
 all protocol drivers, in order to accomodate semantics that all format 
 drivers potentially could desire.  So I wonder if the problem is simply 
 that the current API is not expressive enough.

Can you think of any cases where a caller would want to invoke
bdrv_flush, but not bdrv_fsync? (The other way round it makes even less
sense)

Kevin



Re: [Qemu-devel] [PATCH 0/2] block: Write out internal caches even with cache=unsafe

2011-10-24 Thread Paolo Bonzini

On 10/24/2011 11:36 AM, Kevin Wolf wrote:

  You're changing the API and asking for possibly non-trivial changes in
  all protocol drivers, in order to accomodate semantics that all format
  drivers potentially could desire.  So I wonder if the problem is simply
  that the current API is not expressive enough.

Can you think of any cases where a caller would want to invoke
bdrv_flush, but not bdrv_fsync? (The other way round it makes even less
sense)


I'm talking about the internal driver API only.  The external API is 
fine as is.


Paolo



Re: [Qemu-devel] [PATCH 0/2] block: Write out internal caches even with cache=unsafe

2011-10-24 Thread Kevin Wolf
Am 24.10.2011 11:40, schrieb Paolo Bonzini:
 On 10/24/2011 11:36 AM, Kevin Wolf wrote:
  You're changing the API and asking for possibly non-trivial changes in
  all protocol drivers, in order to accomodate semantics that all format
  drivers potentially could desire.  So I wonder if the problem is simply
  that the current API is not expressive enough.
 Can you think of any cases where a caller would want to invoke
 bdrv_flush, but not bdrv_fsync? (The other way round it makes even less
 sense)
 
 I'm talking about the internal driver API only.  The external API is 
 fine as is.

Ok, so external callers don't force us to do it.

Yes, we could split bdrv_flush internally into two functions for flush
one level to the OS and flush all the way down to the disk, but I'm
not sure if this really buys us anything or just adds complexity.

Kevin



Re: [Qemu-devel] [PATCH 0/2] block: Write out internal caches even with cache=unsafe

2011-10-24 Thread Paolo Bonzini
On 10/24/2011 11:53 AM, Kevin Wolf wrote:
   
   I'm talking about the internal driver API only.  The external API is
   fine as is.
 Ok, so external callers don't force us to do it.
 
 Yes, we could split bdrv_flush internally into two functions for flush
 one level to the OS and flush all the way down to the disk, but I'm
 not sure if this really buys us anything or just adds complexity.

I would say that:

1) the safe NBD and iSCSI drivers are not in-tree, but you would have to
convert those too.  iSCSI uses aio, so it would be non-trivial.

2) long-term you wanted to convert raw-posix to coroutines, but in the
meanwhile your patch introduces yet another partial transition;

3) your two patches are more complex compared to something like this:

diff --git a/block.c b/block.c
index 11c7f91..1e06a8a 100644
--- a/block.c
+++ b/block.c
@@ -2908,9 +2908,19 @@ static void coroutine_fn bdrv_flush_co_entry(void 
*opaque)
 
 int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
 {
-if (bs-open_flags  BDRV_O_NO_FLUSH) {
+if (!bs-drv) {
 return 0;
-} else if (!bs-drv) {
+}
+
+/* First send everything to the OS.  */
+if (bs-drv-bdrv_co_flush_metadata) {
+int ret = bs-drv-bdrv_co_flush_metadata(bs);
+if (ret  0) {
+return ret;
+}
+}
+
+/* Now flush the host cache to disk if we need to.  */
+if (bs-open_flags  BDRV_O_NO_FLUSH) {
 return 0;
 } else if (bs-drv-bdrv_co_flush) {
 return bs-drv-bdrv_co_flush(bs);
diff --git a/block/qcow2.c b/block/qcow2.c
index a181932..cd13452 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1105,7 +1105,7 @@ fail:
 return ret;
 }
 
-static int qcow2_co_flush(BlockDriverState *bs)
+static int qcow2_co_flush_metadata(BlockDriverState *bs)
 {
 BDRVQcowState *s = bs-opaque;
 int ret;
@@ -1121,7 +1121,11 @@ static int qcow2_co_flush(BlockDriverState *bs)
 return ret;
 }
 qemu_co_mutex_unlock(s-lock);
+return 0;
+}
 
+static int qcow2_co_flush(BlockDriverState *bs)
+{
 return bdrv_co_flush(bs-file);
 }
 
@@ -1244,6 +1248,7 @@ static BlockDriver bdrv_qcow2 = {
 .bdrv_co_readv  = qcow2_co_readv,
 .bdrv_co_writev = qcow2_co_writev,
 .bdrv_co_flush  = qcow2_co_flush,
+.bdrv_co_flush_metadata = qcow2_co_flush_metadata,
 
 .bdrv_co_discard= qcow2_co_discard,
 .bdrv_truncate  = qcow2_truncate,
diff --git a/block_int.h b/block_int.h
index dac00f5..ab4ceea 100644
--- a/block_int.h
+++ b/block_int.h
@@ -84,6 +84,7 @@ struct BlockDriver {
 int coroutine_fn (*bdrv_co_writev)(BlockDriverState *bs,
 int64_t sector_num, int nb_sectors, QEMUIOVector *qiov);
 int coroutine_fn (*bdrv_co_flush)(BlockDriverState *bs);
+int coroutine_fn (*bdrv_co_flush_metadata)(BlockDriverState *bs);
 int coroutine_fn (*bdrv_co_discard)(BlockDriverState *bs,
 int64_t sector_num, int nb_sectors);
 
Paolo



Re: [Qemu-devel] [PATCH 0/2] block: Write out internal caches even with cache=unsafe

2011-10-23 Thread Paolo Bonzini

On 10/22/2011 05:07 PM, Alexander Graf wrote:


On 21.10.2011, at 11:44, Paolo Bonzini wrote:


On 10/21/2011 07:08 PM, Kevin Wolf wrote:

Avi complained that not even writing out qcow2's cache on
bdrv_flush() made cache=unsafe too unsafe to be useful. He's got
a point.


Why? cache=unsafe is explicitly allowing to s/data/manure/ on
crash.


Exactly, but not on kill. By not flushing internal caches you're
almost guaranteed to get an inconsistent qcow2 image.


This should be covered already by termsig_handler.  bdrv_close_all 
closes all block devices, and qcow2_close does flush the caches.


SIGKILL doesn't give any guarantee of course but it does not in general, 
even without cache=unsafe; you might get a SIGKILL a moment before a 
bdrv_flush even without cache=unsafe, and get unclean qcow2 metadata.



By not flushing internal caches you're almost guaranteed to get an
inconsistent qcow2 image.


Of course the inconsistencies with cache=unsafe will be massive if you 
don't have a clean exit, but that's expected.  If in some cases you want 
a clean exit, but right now you don't, the place to fix those cases 
doesn't seem to be the block layer, but the main loop.


Also,

1) why should cache=unsafe differentiate an OS that sends a flush from 
one that doesn't (e.g. MS-DOS), from the point of view of image metadata?


2) why should the guest actually send a flush if cache=unsafe?  Currently

if (flags  BDRV_O_CACHE_WB)
bs-enable_write_cache = 1;

covers cache=unsafe.  However, in the end write cache enable means do I 
need to flush data, and the answer is no when cache=unsafe, because 
the flushes are useless and guests are free to reorder requests.


shot-in-the-darkPerhaps what you want is to make qcow2 caches 
writethrough in cache=unsafe mode, so that at least a try is made to 
write the metadata/shot-in-the-dark (even though the underlying raw 
protocol won't flush it)?  I'm not sure that is particularly useful, but 
maybe it can help me understanding the benefit of this change.


Paolo



Re: [Qemu-devel] [PATCH 0/2] block: Write out internal caches even with cache=unsafe

2011-10-22 Thread Alexander Graf

On 21.10.2011, at 11:44, Paolo Bonzini wrote:

 On 10/21/2011 07:08 PM, Kevin Wolf wrote:
 Avi complained that not even writing out qcow2's cache on bdrv_flush() made
 cache=unsafe too unsafe to be useful. He's got a point.
 
 Why? cache=unsafe is explicitly allowing to s/data/manure/ on crash.

Exactly, but not on kill. By not flushing internal caches you're almost 
guaranteed to get an inconsistent qcow2 image.

 If you do this for raw-posix, you need to do it for all protocols.

Only for the ones that actually do flushes to the block layer. The idea is that 
while QEMU dies the OS can still finish writing unflushed blocks.


Alex




Re: [Qemu-devel] [PATCH 0/2] block: Write out internal caches even with cache=unsafe

2011-10-21 Thread Paolo Bonzini

On 10/21/2011 07:08 PM, Kevin Wolf wrote:

Avi complained that not even writing out qcow2's cache on bdrv_flush() made
cache=unsafe too unsafe to be useful. He's got a point.


Why? cache=unsafe is explicitly allowing to s/data/manure/ on crash.

If you do this for raw-posix, you need to do it for all protocols.


Kevin Wolf (2):
   raw-posix: Convert to bdrv_co_flush
   block: Handle cache=unsafe only in raw-posix/win32


Paolo