Re: [Qemu-devel] [PATCH 0/2] block: Write out internal caches even with cache=unsafe
Am 21.10.2011 20:44, schrieb Paolo Bonzini: On 10/21/2011 07:08 PM, Kevin Wolf wrote: Avi complained that not even writing out qcow2's cache on bdrv_flush() made cache=unsafe too unsafe to be useful. He's got a point. Why? cache=unsafe is explicitly allowing to s/data/manure/ on crash. It's surely expected on a host crash, but is it for a qemu crash? cache=unsafe was introduced to avoid fsync() costs, which it still does after this patch. If you do this for raw-posix, you need to do it for all protocols. rbd could use it, too, right. Any other protocol I missed? Kevin
Re: [Qemu-devel] [PATCH 0/2] block: Write out internal caches even with cache=unsafe
On 10/24/2011 09:37 AM, Kevin Wolf wrote: Why? cache=unsafe is explicitly allowing to s/data/manure/ on crash. It's surely expected on a host crash, but is it for a qemu crash? cache=unsafe was introduced to avoid fsync() costs, which it still does after this patch. I think it's not about why is it there, but rather about what is it useful for. My interpretation of it is I do not need the image anymore unless the command exits cleanly: VM installations, qemu-img conversions, BDRV_O_SNAPSHOT (doesn't do it yet, but it could). Even SIGINT and SIGTERM would be excluded from this definition, but they cost nothing so it's nice to include them. If you do this for raw-posix, you need to do it for all protocols. rbd could use it, too, right. Any other protocol I missed? NBD could, but it doesn't support flush yet. In general, even if it were useful to implement this, I'm not sure this is the best way to implement it. For example you could simply clear BDRV_O_NO_FLUSH in qcow2_open. Paolo
Re: [Qemu-devel] [PATCH 0/2] block: Write out internal caches even with cache=unsafe
Am 23.10.2011 16:33, schrieb Paolo Bonzini: On 10/22/2011 05:07 PM, Alexander Graf wrote: On 21.10.2011, at 11:44, Paolo Bonzini wrote: On 10/21/2011 07:08 PM, Kevin Wolf wrote: Avi complained that not even writing out qcow2's cache on bdrv_flush() made cache=unsafe too unsafe to be useful. He's got a point. Why? cache=unsafe is explicitly allowing to s/data/manure/ on crash. Exactly, but not on kill. By not flushing internal caches you're almost guaranteed to get an inconsistent qcow2 image. This should be covered already by termsig_handler. bdrv_close_all closes all block devices, and qcow2_close does flush the caches. SIGKILL doesn't give any guarantee of course but it does not in general, even without cache=unsafe; you might get a SIGKILL a moment before a bdrv_flush even without cache=unsafe, and get unclean qcow2 metadata. Unclean yes, in the sense that you may get cluster leaks. If getting SIGKILL a moment before the flush led to real corruption however, cache=none would be broken as well. By not flushing internal caches you're almost guaranteed to get an inconsistent qcow2 image. Of course the inconsistencies with cache=unsafe will be massive if you don't have a clean exit, but that's expected. If in some cases you want a clean exit, but right now you don't, the place to fix those cases doesn't seem to be the block layer, but the main loop. I don't think there's much the main loop can do against SIGKILL, segfaults or abort(). Also, 1) why should cache=unsafe differentiate an OS that sends a flush from one that doesn't (e.g. MS-DOS), from the point of view of image metadata? 2) why should the guest actually send a flush if cache=unsafe? Currently if (flags BDRV_O_CACHE_WB) bs-enable_write_cache = 1; covers cache=unsafe. However, in the end write cache enable means do I need to flush data, and the answer is no when cache=unsafe, because the flushes are useless and guests are free to reorder requests. shot-in-the-darkPerhaps what you want is to make qcow2 caches writethrough in cache=unsafe mode, so that at least a try is made to write the metadata/shot-in-the-dark (even though the underlying raw protocol won't flush it)? I'm not sure that is particularly useful, but maybe it can help me understanding the benefit of this change. Yes, this is the intention. It's about flushing metadata, not guest data. The semantics that I think cache=unsafe should have is that after a bdrv_flush() we have flushed all caches in qemu (so that the image survives a qemu crash), but we don't care about flushing the host page cache. Kevin
Re: [Qemu-devel] [PATCH 0/2] block: Write out internal caches even with cache=unsafe
Am 24.10.2011 09:53, schrieb Paolo Bonzini: On 10/24/2011 09:37 AM, Kevin Wolf wrote: Why? cache=unsafe is explicitly allowing to s/data/manure/ on crash. It's surely expected on a host crash, but is it for a qemu crash? cache=unsafe was introduced to avoid fsync() costs, which it still does after this patch. I think it's not about why is it there, but rather about what is it useful for. My interpretation of it is I do not need the image anymore unless the command exits cleanly: VM installations, qemu-img conversions, BDRV_O_SNAPSHOT (doesn't do it yet, but it could). Even SIGINT and SIGTERM would be excluded from this definition, but they cost nothing so it's nice to include them. I think another common interpretation is: I don't run this VM in production but for development. I want the VM to go faster and I can recreate the image in the unlikely event that power fails during my work. But it certainly would be nasty. If you do this for raw-posix, you need to do it for all protocols. rbd could use it, too, right. Any other protocol I missed? NBD could, but it doesn't support flush yet. In general, even if it were useful to implement this, I'm not sure this is the best way to implement it. For example you could simply clear BDRV_O_NO_FLUSH in qcow2_open. That could work, too. On the other hand I don't like block drivers to modify their flags. For example, would query-block still provide the correct cache mode then? But I think that starting to make exceptions for single block drivers isn't a good idea anyway. If we want bdrv_flush() to write out all metadata internal to qemu, I think the approach with checking the flag in drivers calling things like fsync() is better. The common thing is to do the flush. Kevin
Re: [Qemu-devel] [PATCH 0/2] block: Write out internal caches even with cache=unsafe
On 10/24/2011 10:17 AM, Kevin Wolf wrote: I think it's not about why is it there, but rather about what is it useful for. My interpretation of it is I do not need the image anymore unless the command exits cleanly: VM installations, qemu-img conversions, BDRV_O_SNAPSHOT (doesn't do it yet, but it could). Even SIGINT and SIGTERM would be excluded from this definition, but they cost nothing so it's nice to include them. I think another common interpretation is: I don't run this VM in production but for development. I want the VM to go faster and I can recreate the image in the unlikely event that power fails during my work. But it certainly would be nasty. Fair enough. But I think that starting to make exceptions for single block drivers isn't a good idea anyway. If we want bdrv_flush() to write out all metadata internal to qemu, I think the approach with checking the flag in drivers calling things like fsync() is better. The common thing is to do the flush. I don't know... checking BDRV_O_NO_FLUSH in the drivers rather than in the generic code sounds like a layering violation. Perhaps what you're after is a separation of bdrv_co_flush from bdrv_{,co_,aio_}fsync? Then BDRV_O_NO_FLUSH (better renamed to BDRV_O_NO_FSYNC...) would only inhibit the latter. Paolo
Re: [Qemu-devel] [PATCH 0/2] block: Write out internal caches even with cache=unsafe
Am 24.10.2011 10:47, schrieb Paolo Bonzini: On 10/24/2011 10:17 AM, Kevin Wolf wrote: I think it's not about why is it there, but rather about what is it useful for. My interpretation of it is I do not need the image anymore unless the command exits cleanly: VM installations, qemu-img conversions, BDRV_O_SNAPSHOT (doesn't do it yet, but it could). Even SIGINT and SIGTERM would be excluded from this definition, but they cost nothing so it's nice to include them. I think another common interpretation is: I don't run this VM in production but for development. I want the VM to go faster and I can recreate the image in the unlikely event that power fails during my work. But it certainly would be nasty. Fair enough. But I think that starting to make exceptions for single block drivers isn't a good idea anyway. If we want bdrv_flush() to write out all metadata internal to qemu, I think the approach with checking the flag in drivers calling things like fsync() is better. The common thing is to do the flush. I don't know... checking BDRV_O_NO_FLUSH in the drivers rather than in the generic code sounds like a layering violation. Perhaps what you're after is a separation of bdrv_co_flush from bdrv_{,co_,aio_}fsync? Then BDRV_O_NO_FLUSH (better renamed to BDRV_O_NO_FSYNC...) would only inhibit the latter. Why? All other cache related BDRV_O_* flags are interpreted by the block drivers, so why should BDRV_O_NO_FLUSH be special? Kevin
Re: [Qemu-devel] [PATCH 0/2] block: Write out internal caches even with cache=unsafe
On 24 October 2011 09:17, Kevin Wolf kw...@redhat.com wrote: Am 24.10.2011 09:53, schrieb Paolo Bonzini: I think it's not about why is it there, but rather about what is it useful for. My interpretation of it is I do not need the image anymore unless the command exits cleanly: VM installations, qemu-img conversions, BDRV_O_SNAPSHOT (doesn't do it yet, but it could). Even SIGINT and SIGTERM would be excluded from this definition, but they cost nothing so it's nice to include them. I think another common interpretation is: I don't run this VM in production but for development. I want the VM to go faster and I can recreate the image in the unlikely event that power fails during my work. But it certainly would be nasty. So at the moment the documentation (qemu-doc.html) says: Writeback caching will report data writes as completed as soon as the data is present in the host page cache. This is safe as long as you trust your host. If your host crashes or loses power, then the guest may experience data corruption. and also: In case you don't care about data integrity over host failures, use cache=unsafe. This option tells qemu that it never needs to write any data to the disk but can instead keeps things in cache. If anything goes wrong, like your host losing power, the disk storage getting disconnected accidently, etc. you're image will most probably be rendered unusable. which to me reads in both cases as will not corrupt data if the guest crashes but may do so if the host crashes and leaves me with no idea what the difference is. It might be nice if we could clarify that a bit... -- PMM
Re: [Qemu-devel] [PATCH 0/2] block: Write out internal caches even with cache=unsafe
On 10/24/2011 10:54 AM, Kevin Wolf wrote: I don't know... checking BDRV_O_NO_FLUSH in the drivers rather than in the generic code sounds like a layering violation. Perhaps what you're after is a separation of bdrv_co_flush from bdrv_{,co_,aio_}fsync? Then BDRV_O_NO_FLUSH (better renamed to BDRV_O_NO_FSYNC...) would only inhibit the latter. Why? All other cache related BDRV_O_* flags are interpreted by the block drivers, so why should BDRV_O_NO_FLUSH be special? You're changing the API and asking for possibly non-trivial changes in all protocol drivers, in order to accomodate semantics that all format drivers potentially could desire. So I wonder if the problem is simply that the current API is not expressive enough. Paolo
Re: [Qemu-devel] [PATCH 0/2] block: Write out internal caches even with cache=unsafe
Am 24.10.2011 11:26, schrieb Paolo Bonzini: On 10/24/2011 10:54 AM, Kevin Wolf wrote: I don't know... checking BDRV_O_NO_FLUSH in the drivers rather than in the generic code sounds like a layering violation. Perhaps what you're after is a separation of bdrv_co_flush from bdrv_{,co_,aio_}fsync? Then BDRV_O_NO_FLUSH (better renamed to BDRV_O_NO_FSYNC...) would only inhibit the latter. Why? All other cache related BDRV_O_* flags are interpreted by the block drivers, so why should BDRV_O_NO_FLUSH be special? You're changing the API and asking for possibly non-trivial changes in all protocol drivers, in order to accomodate semantics that all format drivers potentially could desire. So I wonder if the problem is simply that the current API is not expressive enough. Can you think of any cases where a caller would want to invoke bdrv_flush, but not bdrv_fsync? (The other way round it makes even less sense) Kevin
Re: [Qemu-devel] [PATCH 0/2] block: Write out internal caches even with cache=unsafe
On 10/24/2011 11:36 AM, Kevin Wolf wrote: You're changing the API and asking for possibly non-trivial changes in all protocol drivers, in order to accomodate semantics that all format drivers potentially could desire. So I wonder if the problem is simply that the current API is not expressive enough. Can you think of any cases where a caller would want to invoke bdrv_flush, but not bdrv_fsync? (The other way round it makes even less sense) I'm talking about the internal driver API only. The external API is fine as is. Paolo
Re: [Qemu-devel] [PATCH 0/2] block: Write out internal caches even with cache=unsafe
Am 24.10.2011 11:40, schrieb Paolo Bonzini: On 10/24/2011 11:36 AM, Kevin Wolf wrote: You're changing the API and asking for possibly non-trivial changes in all protocol drivers, in order to accomodate semantics that all format drivers potentially could desire. So I wonder if the problem is simply that the current API is not expressive enough. Can you think of any cases where a caller would want to invoke bdrv_flush, but not bdrv_fsync? (The other way round it makes even less sense) I'm talking about the internal driver API only. The external API is fine as is. Ok, so external callers don't force us to do it. Yes, we could split bdrv_flush internally into two functions for flush one level to the OS and flush all the way down to the disk, but I'm not sure if this really buys us anything or just adds complexity. Kevin
Re: [Qemu-devel] [PATCH 0/2] block: Write out internal caches even with cache=unsafe
On 10/24/2011 11:53 AM, Kevin Wolf wrote: I'm talking about the internal driver API only. The external API is fine as is. Ok, so external callers don't force us to do it. Yes, we could split bdrv_flush internally into two functions for flush one level to the OS and flush all the way down to the disk, but I'm not sure if this really buys us anything or just adds complexity. I would say that: 1) the safe NBD and iSCSI drivers are not in-tree, but you would have to convert those too. iSCSI uses aio, so it would be non-trivial. 2) long-term you wanted to convert raw-posix to coroutines, but in the meanwhile your patch introduces yet another partial transition; 3) your two patches are more complex compared to something like this: diff --git a/block.c b/block.c index 11c7f91..1e06a8a 100644 --- a/block.c +++ b/block.c @@ -2908,9 +2908,19 @@ static void coroutine_fn bdrv_flush_co_entry(void *opaque) int coroutine_fn bdrv_co_flush(BlockDriverState *bs) { -if (bs-open_flags BDRV_O_NO_FLUSH) { +if (!bs-drv) { return 0; -} else if (!bs-drv) { +} + +/* First send everything to the OS. */ +if (bs-drv-bdrv_co_flush_metadata) { +int ret = bs-drv-bdrv_co_flush_metadata(bs); +if (ret 0) { +return ret; +} +} + +/* Now flush the host cache to disk if we need to. */ +if (bs-open_flags BDRV_O_NO_FLUSH) { return 0; } else if (bs-drv-bdrv_co_flush) { return bs-drv-bdrv_co_flush(bs); diff --git a/block/qcow2.c b/block/qcow2.c index a181932..cd13452 100644 --- a/block/qcow2.c +++ b/block/qcow2.c @@ -1105,7 +1105,7 @@ fail: return ret; } -static int qcow2_co_flush(BlockDriverState *bs) +static int qcow2_co_flush_metadata(BlockDriverState *bs) { BDRVQcowState *s = bs-opaque; int ret; @@ -1121,7 +1121,11 @@ static int qcow2_co_flush(BlockDriverState *bs) return ret; } qemu_co_mutex_unlock(s-lock); +return 0; +} +static int qcow2_co_flush(BlockDriverState *bs) +{ return bdrv_co_flush(bs-file); } @@ -1244,6 +1248,7 @@ static BlockDriver bdrv_qcow2 = { .bdrv_co_readv = qcow2_co_readv, .bdrv_co_writev = qcow2_co_writev, .bdrv_co_flush = qcow2_co_flush, +.bdrv_co_flush_metadata = qcow2_co_flush_metadata, .bdrv_co_discard= qcow2_co_discard, .bdrv_truncate = qcow2_truncate, diff --git a/block_int.h b/block_int.h index dac00f5..ab4ceea 100644 --- a/block_int.h +++ b/block_int.h @@ -84,6 +84,7 @@ struct BlockDriver { int coroutine_fn (*bdrv_co_writev)(BlockDriverState *bs, int64_t sector_num, int nb_sectors, QEMUIOVector *qiov); int coroutine_fn (*bdrv_co_flush)(BlockDriverState *bs); +int coroutine_fn (*bdrv_co_flush_metadata)(BlockDriverState *bs); int coroutine_fn (*bdrv_co_discard)(BlockDriverState *bs, int64_t sector_num, int nb_sectors); Paolo
Re: [Qemu-devel] [PATCH 0/2] block: Write out internal caches even with cache=unsafe
On 10/22/2011 05:07 PM, Alexander Graf wrote: On 21.10.2011, at 11:44, Paolo Bonzini wrote: On 10/21/2011 07:08 PM, Kevin Wolf wrote: Avi complained that not even writing out qcow2's cache on bdrv_flush() made cache=unsafe too unsafe to be useful. He's got a point. Why? cache=unsafe is explicitly allowing to s/data/manure/ on crash. Exactly, but not on kill. By not flushing internal caches you're almost guaranteed to get an inconsistent qcow2 image. This should be covered already by termsig_handler. bdrv_close_all closes all block devices, and qcow2_close does flush the caches. SIGKILL doesn't give any guarantee of course but it does not in general, even without cache=unsafe; you might get a SIGKILL a moment before a bdrv_flush even without cache=unsafe, and get unclean qcow2 metadata. By not flushing internal caches you're almost guaranteed to get an inconsistent qcow2 image. Of course the inconsistencies with cache=unsafe will be massive if you don't have a clean exit, but that's expected. If in some cases you want a clean exit, but right now you don't, the place to fix those cases doesn't seem to be the block layer, but the main loop. Also, 1) why should cache=unsafe differentiate an OS that sends a flush from one that doesn't (e.g. MS-DOS), from the point of view of image metadata? 2) why should the guest actually send a flush if cache=unsafe? Currently if (flags BDRV_O_CACHE_WB) bs-enable_write_cache = 1; covers cache=unsafe. However, in the end write cache enable means do I need to flush data, and the answer is no when cache=unsafe, because the flushes are useless and guests are free to reorder requests. shot-in-the-darkPerhaps what you want is to make qcow2 caches writethrough in cache=unsafe mode, so that at least a try is made to write the metadata/shot-in-the-dark (even though the underlying raw protocol won't flush it)? I'm not sure that is particularly useful, but maybe it can help me understanding the benefit of this change. Paolo
Re: [Qemu-devel] [PATCH 0/2] block: Write out internal caches even with cache=unsafe
On 21.10.2011, at 11:44, Paolo Bonzini wrote: On 10/21/2011 07:08 PM, Kevin Wolf wrote: Avi complained that not even writing out qcow2's cache on bdrv_flush() made cache=unsafe too unsafe to be useful. He's got a point. Why? cache=unsafe is explicitly allowing to s/data/manure/ on crash. Exactly, but not on kill. By not flushing internal caches you're almost guaranteed to get an inconsistent qcow2 image. If you do this for raw-posix, you need to do it for all protocols. Only for the ones that actually do flushes to the block layer. The idea is that while QEMU dies the OS can still finish writing unflushed blocks. Alex
Re: [Qemu-devel] [PATCH 0/2] block: Write out internal caches even with cache=unsafe
On 10/21/2011 07:08 PM, Kevin Wolf wrote: Avi complained that not even writing out qcow2's cache on bdrv_flush() made cache=unsafe too unsafe to be useful. He's got a point. Why? cache=unsafe is explicitly allowing to s/data/manure/ on crash. If you do this for raw-posix, you need to do it for all protocols. Kevin Wolf (2): raw-posix: Convert to bdrv_co_flush block: Handle cache=unsafe only in raw-posix/win32 Paolo