Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 01/11/2013 03:52 PM, MORITA Kazutaka wrote: At Thu, 10 Jan 2013 13:38:16 +0800, Liu Yuan wrote: On 01/09/2013 11:10 PM, Paolo Bonzini wrote: On 09/01/2013 14:04, Liu Yuan wrote: 2) The upper-layer software which relies on 'cache=xxx' to choose the cache mode will fail in its assumptions with new QEMU. Which assumptions do you mean? As far as I can tell the behaviour hasn't changed, except possibly for the performance. When users set 'cache=writethrough' to export only a writethrough cache to the Guest, but with new QEMU, it will actually get a writeback cache by default. They get a writeback cache implementation-wise, but they get a writethrough cache safety-wise. How the cache is implemented doesn't matter, as long as it looks like a writethrough cache. In fact, consider a local disk that doesn't support FUA. In old QEMU, images used to be opened with O_DSYNC and that splits each write into WRITE+FLUSH, just like new QEMU. All that changes is _where_ the flushes are created. Old QEMU changes it in the kernel, new QEMU changes it in userspace. We don't need to communicate to the guest. I think 'cache=xxx' means what kind of cache the users *expect* to export to the Guest OS. So if cache=writethrough is set, the Guest OS couldn't turn it into a writeback cache magically. This is like I bought a disk with a 'writethrough' cache built-in; I didn't expect that it would turn out to be a disk with a writeback cache under the hood which could possibly lose data when a power outage happened. It's not by magic. It's by explicitly requesting the disk to do this. Perhaps it's a bug that the cache mode is not reset when the machine is reset. I haven't checked that, but it would be a valid complaint. Ah, I didn't get the current implementation right. I tried the 3.7 kernel and it works as expected (cache=writethrough results in a 'writethrough' cache in the guest). 
It looks fine to me to emulate writethrough as writeback + flush, since the performance drop isn't big, though sheepdog itself supports a true writethrough cache (no flush). Can we drop the SD_FLAG_CMD_CACHE flag from sheepdog write requests when bdrv_enable_write_cache() is false? Then the requests behave like FUA writes and we can safely omit succeeding flush requests. Let's first implement an emulated writethrough cache and then look for a method to play tricks to implement a true writethrough cache. This would bring complexity: for example, before we drop SD_FLAG_CMD_CACHE, we need to flush beforehand. Moreover, bdrv_enable_write_cache() is always true for now; I guess this needs a generic change in the block layer to get the writethrough/writeback hints. Thanks, Yuan
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
bdrv_enable_write_cache() is always true for now; I guess this needs a generic change in the block layer to get the writethrough/writeback hints. That's not correct. Try hdparm with an IDE disk, or sysfs with a SCSI disk (recent kernels will also support sysfs with an IDE or virtio-blk disk) and you'll see that it changes. Paolo
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 01/11/2013 05:00 PM, Paolo Bonzini wrote: That's not correct. Try hdparm with an IDE disk, or sysfs with a SCSI disk (recent kernels will also support sysfs with an IDE or virtio-blk disk) and you'll see that it changes. Okay, at least at startup, even with cache=writethrough set, I saw bdrv_enable_write_cache() return true from sd_open(). So you mean this will be reset later (after sd_open())? Thanks, Yuan
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 11.01.2013 08:52, MORITA Kazutaka wrote: At Thu, 10 Jan 2013 13:38:16 +0800, Liu Yuan wrote: On 01/09/2013 11:10 PM, Paolo Bonzini wrote: On 09/01/2013 14:04, Liu Yuan wrote: 2) The upper-layer software which relies on 'cache=xxx' to choose the cache mode will fail in its assumptions with new QEMU. Which assumptions do you mean? As far as I can tell the behaviour hasn't changed, except possibly for the performance. When users set 'cache=writethrough' to export only a writethrough cache to the Guest, but with new QEMU, it will actually get a writeback cache by default. They get a writeback cache implementation-wise, but they get a writethrough cache safety-wise. How the cache is implemented doesn't matter, as long as it looks like a writethrough cache. In fact, consider a local disk that doesn't support FUA. In old QEMU, images used to be opened with O_DSYNC and that splits each write into WRITE+FLUSH, just like new QEMU. All that changes is _where_ the flushes are created. Old QEMU changes it in the kernel, new QEMU changes it in userspace. We don't need to communicate to the guest. I think 'cache=xxx' means what kind of cache the users *expect* to export to the Guest OS. So if cache=writethrough is set, the Guest OS couldn't turn it into a writeback cache magically. This is like I bought a disk with a 'writethrough' cache built-in; I didn't expect that it would turn out to be a disk with a writeback cache under the hood which could possibly lose data when a power outage happened. It's not by magic. It's by explicitly requesting the disk to do this. Perhaps it's a bug that the cache mode is not reset when the machine is reset. I haven't checked that, but it would be a valid complaint. Ah, I didn't get the current implementation right. I tried the 3.7 kernel and it works as expected (cache=writethrough results in a 'writethrough' cache in the guest). 
It looks fine to me to emulate writethrough as writeback + flush, since the performance drop isn't big, though sheepdog itself supports a true writethrough cache (no flush). Can we drop the SD_FLAG_CMD_CACHE flag from sheepdog write requests when bdrv_enable_write_cache() is false? Then the requests behave like FUA writes and we can safely omit succeeding flush requests. First we would need to make sure that on a writeback-to-writethrough switch a flush happens. But once this is implemented, I think your suggestion would work well. Kevin
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
----- Original Message ----- From: Liu Yuan namei.u...@gmail.com To: Paolo Bonzini pbonz...@redhat.com Cc: Kevin Wolf kw...@redhat.com, Stefan Hajnoczi stefa...@gmail.com, qemu-devel@nongnu.org, MORITA Kazutaka morita.kazut...@lab.ntt.co.jp Sent: Friday, 11 January 2013 10:04:23 Subject: Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics On 01/11/2013 05:00 PM, Paolo Bonzini wrote: That's not correct. Try hdparm with an IDE disk, or sysfs with a SCSI disk (recent kernels will also support sysfs with an IDE or virtio-blk disk) and you'll see that it changes. Okay, at least at startup, even with cache=writethrough set, I saw bdrv_enable_write_cache() return true from sd_open(). So you mean this will be reset later (after sd_open())? It is always true for the protocol. It is not always true for the format. This is correct. qcow2 can do some writes in writeback mode, but they will always be followed by a flush before the write is reported to the guest. It is a bit faster, and does not lose any correctness. Paolo
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 01/11/2013 05:34 PM, Paolo Bonzini wrote: It is always true for the protocol. It is not always true for the format. So you mean bdrv_enable_write_cache() always returns true for Sheepdog at any time? Thanks, Yuan
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 01/11/2013 05:34 PM, Paolo Bonzini wrote: It is always true for the protocol. It is not always true for the format. So you mean bdrv_enable_write_cache() always returns true for Sheepdog at any time? Yes, at least for now. You could in principle have a format that calls bdrv_set_enable_write_cache(bs->file, ...), but nobody does it yet. Paolo
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
Kevin Wolf wrote: On 08.01.2013 11:39, Liu Yuan wrote: On 01/08/2013 06:00 PM, Kevin Wolf wrote: On 08.01.2013 10:45, Liu Yuan wrote: On 01/08/2013 05:40 PM, Stefan Hajnoczi wrote: Otherwise use sheepdog writeback and let QEMU block.c decide when to flush. Never use sheepdog writethrough because it's redundant here. I don't get it. What do you mean by 'redundant'? If we use the virtio sheepdog block driver, how can we specify writethrough mode for the Sheepdog cache? Here 'writethrough' means using a pure read cache, which doesn't need flush at all. A writethrough cache is equivalent to a write-back cache where each write is followed by a flush. qemu makes sure to send these flushes, so there is no need to use Sheepdog's writethrough mode. Implementing writethrough as writeback + flush will cause considerable overhead for a network block device like Sheepdog: a single write request will be executed as two requests: write + flush. Yeah, maybe we should have some kind of a FUA flag with write requests instead of sending a separate flush. Note that write+FUA has different semantics than write+flush, at least with regular disks. write+FUA commits just what was written, while write+flush commits everything that was written before. -- Jamie
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 10.01.2013 16:12, Jamie Lokier wrote: Kevin Wolf wrote: On 08.01.2013 11:39, Liu Yuan wrote: On 01/08/2013 06:00 PM, Kevin Wolf wrote: On 08.01.2013 10:45, Liu Yuan wrote: On 01/08/2013 05:40 PM, Stefan Hajnoczi wrote: Otherwise use sheepdog writeback and let QEMU block.c decide when to flush. Never use sheepdog writethrough because it's redundant here. I don't get it. What do you mean by 'redundant'? If we use the virtio sheepdog block driver, how can we specify writethrough mode for the Sheepdog cache? Here 'writethrough' means using a pure read cache, which doesn't need flush at all. A writethrough cache is equivalent to a write-back cache where each write is followed by a flush. qemu makes sure to send these flushes, so there is no need to use Sheepdog's writethrough mode. Implementing writethrough as writeback + flush will cause considerable overhead for a network block device like Sheepdog: a single write request will be executed as two requests: write + flush. Yeah, maybe we should have some kind of a FUA flag with write requests instead of sending a separate flush. Note that write+FUA has different semantics than write+flush, at least with regular disks. write+FUA commits just what was written, while write+flush commits everything that was written before. True. However, when you use it for implementing a writethrough mode, i.e. every single write has the FUA flag set, then it ought to be the same. One thing to take care of might be doing one explicit flush when switching from writeback to writethrough. Kevin
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
Paolo Bonzini wrote: On 09/01/2013 14:04, Liu Yuan wrote: 2) The upper-layer software which relies on 'cache=xxx' to choose the cache mode will fail in its assumptions with new QEMU. Which assumptions do you mean? As far as I can tell the behaviour hasn't changed, except possibly for the performance. When users set 'cache=writethrough' to export only a writethrough cache to the Guest, but with new QEMU, it will actually get a writeback cache by default. They get a writeback cache implementation-wise, but they get a writethrough cache safety-wise. How the cache is implemented doesn't matter, as long as it looks like a writethrough cache. In fact, consider a local disk that doesn't support FUA. In old QEMU, images used to be opened with O_DSYNC and that splits each write into WRITE+FLUSH, just like new QEMU. All that changes is _where_ the flushes are created. Old QEMU changes it in the kernel, new QEMU changes it in userspace. We don't need to communicate to the guest. I think 'cache=xxx' means what kind of cache the users *expect* to export to the Guest OS. So if cache=writethrough is set, the Guest OS couldn't turn it into a writeback cache magically. This is like I bought a disk with a 'writethrough' cache built-in; I didn't expect that it would turn out to be a disk with a writeback cache under the hood which could possibly lose data when a power outage happened. It's not by magic. It's by explicitly requesting the disk to do this. Perhaps it's a bug that the cache mode is not reset when the machine is reset. I haven't checked that, but it would be a valid complaint. The question is, is cache=writeback/cache=writethrough an initial setting of guest-visible WCE that the guest is allowed to change, or is cache=writethrough a way of saying don't have a write cache (which may or may not be reflected in the guest-visible disk id)? I couldn't tell from the QEMU documentation which is intended. It would be a bit silly if it means different things for different backend storage. 
I have seen (obscure) guest code which toggled WCE to simulate FUA, and there is plenty of advice out there saying to set WCE=0 for certain kinds of databases because of its presumed crash safety. Even very ancient guests on Linux and Windows can set WCE=0 with IDE and SCSI. So from a guest point of view, I think the guest setting WCE=0 should mean exactly the same as FUA on every write, or a flush after every write, until the guest sets WCE=1. -- Jamie
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 10/01/2013 16:25, Jamie Lokier wrote: Perhaps it's a bug that the cache mode is not reset when the machine is reset. I haven't checked that, but it would be a valid complaint. The question is, is cache=writeback/cache=writethrough an initial setting of guest-visible WCE that the guest is allowed to change, or is cache=writethrough a way of saying don't have a write cache (which may or may not be reflected in the guest-visible disk id)? It used to be the latter (with reflection in the disk data), but now it is the former. I couldn't tell from the QEMU documentation which is intended. It would be a bit silly if it means different things for different backend storage. It means the same thing for IDE, SCSI and virtio-blk. Other backends, such as SD, do not even have flush, and are really slow with cache=writethrough because they write one sector at a time. For this reason they cannot really be used in a safe manner. I have seen (obscure) guest code which toggled WCE to simulate FUA, That's quite useless, since a WCE=1 → WCE=0 transition is documented to cause a flush (and it does). Might as well send a real flush. and there is plenty of advice out there saying to set WCE=0 for certain kinds of databases because of its presumed crash safety. Even very ancient guests on Linux and Windows can set WCE=0 with IDE and SCSI. This is supported in QEMU. So from a guest point of view, I think the guest setting WCE=0 should mean exactly the same as FUA on every write, or a flush after every write, until the guest sets WCE=1. Yes, that's indeed how it is implemented. Paolo
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
Paolo Bonzini wrote: On 10/01/2013 16:25, Jamie Lokier wrote: Perhaps it's a bug that the cache mode is not reset when the machine is reset. I haven't checked that, but it would be a valid complaint. The question is, is cache=writeback/cache=writethrough an initial setting of guest-visible WCE that the guest is allowed to change, or is cache=writethrough a way of saying don't have a write cache (which may or may not be reflected in the guest-visible disk id)? It used to be the latter (with reflection in the disk data), but now it is the former. Interesting. It could be worth a note in the manual. I couldn't tell from the QEMU documentation which is intended. It would be a bit silly if it means different things for different backend storage. It means the same thing for IDE, SCSI and virtio-blk. Other backends, such as SD, do not even have flush, and are really slow with cache=writethrough because they write one sector at a time. For this reason they cannot really be used in a safe manner. I have seen (obscure) guest code which toggled WCE to simulate FUA, That's quite useless, since a WCE=1 → WCE=0 transition is documented to cause a flush (and it does). Might as well send a real flush. It was because the ATA spec seemed to permit the combination of WCE with no flush command supported. So the WCE=1 → WCE=0 transition was used to flush, and WCE was kept at 0 for the subsequent logging write-FUA(s), until a non-FUA write was wanted. -- Jamie
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
At Thu, 10 Jan 2013 13:38:16 +0800, Liu Yuan wrote: On 01/09/2013 11:10 PM, Paolo Bonzini wrote: On 09/01/2013 14:04, Liu Yuan wrote: 2) The upper-layer software which relies on 'cache=xxx' to choose the cache mode will fail in its assumptions with new QEMU. Which assumptions do you mean? As far as I can tell the behaviour hasn't changed, except possibly for the performance. When users set 'cache=writethrough' to export only a writethrough cache to the Guest, but with new QEMU, it will actually get a writeback cache by default. They get a writeback cache implementation-wise, but they get a writethrough cache safety-wise. How the cache is implemented doesn't matter, as long as it looks like a writethrough cache. In fact, consider a local disk that doesn't support FUA. In old QEMU, images used to be opened with O_DSYNC and that splits each write into WRITE+FLUSH, just like new QEMU. All that changes is _where_ the flushes are created. Old QEMU changes it in the kernel, new QEMU changes it in userspace. We don't need to communicate to the guest. I think 'cache=xxx' means what kind of cache the users *expect* to export to the Guest OS. So if cache=writethrough is set, the Guest OS couldn't turn it into a writeback cache magically. This is like I bought a disk with a 'writethrough' cache built-in; I didn't expect that it would turn out to be a disk with a writeback cache under the hood which could possibly lose data when a power outage happened. It's not by magic. It's by explicitly requesting the disk to do this. Perhaps it's a bug that the cache mode is not reset when the machine is reset. I haven't checked that, but it would be a valid complaint. Ah, I didn't get the current implementation right. I tried the 3.7 kernel and it works as expected (cache=writethrough results in a 'writethrough' cache in the guest). It looks fine to me to emulate writethrough as writeback + flush, since the performance drop isn't big, though sheepdog itself supports a true writethrough cache (no flush). 
Can we drop the SD_FLAG_CMD_CACHE flag from sheepdog write requests when bdrv_enable_write_cache() is false? Then the requests behave like FUA writes and we can safely omit succeeding flush requests. Thanks, Kazutaka
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 08/01/2013 13:12, Kevin Wolf wrote: Okay, I see the code indeed supports WCE, but it requires the Guest kernel to support it. If the kernel doesn't support it, there is no way to disable the write cache, no? With Linux guests, it's possible for SCSI. I'm not sure about virtio-blk, but you might be right that you can't manually change it there. New enough kernels support it. Also, if you reboot from a system that supports writeback to one that doesn't (including old-enough Linux), the new QEMU will be able to use writethrough while running on the older system. Paolo
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 08/01/2013 14:18, Liu Yuan wrote: Maybe not, on second thought. See the following combination ('cache flag' means 'cache=xxx' specified at startup, and 'WCE toggled' means WCE toggled on the fly in the guest, supposing the guest kernel supports WCE control):

cache flag      WCE toggled and resulting behavior
writethrough    writethrough
writeback       writethrough (writeback + flush, as expected)

So the result is *not* broken. If we set cache=writethrough for sheepdog, then WCE won't take any effect because 'flush' requests will be ignored by the Sheepdog driver. And with cache=writeback, WCE toggling does disable the write cache and actually turns it into a writethrough cache by sending a flush request for every write. To conclude, letting Sheepdog interpret the cache flags won't cause trouble even with the current Guest WCE feature; the difference is that if we set cache=writethrough, the guest can't change it via WCE toggling. Is this behavior acceptable? But why is it useful to force-disable writeback caching? Do you have any performance numbers? Paolo
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 01/09/2013 06:25 PM, Paolo Bonzini wrote: But why is it useful to force-disable writeback caching? Do you have any performance numbers? This is not related to performance. What I want is to allow users to choose the Sheepdog cache mode (writethrough/writeback/directio) from cache=xxx. Obviously writeback mode shows much better write performance than a writethrough cache in Sheepdog. NOTE: the writethrough cache in Sheepdog is a read-only cache which is always consistent with cluster data and thus needs no flush at all. This is to boost read performance for read-intensive Guests over Sheepdog images. Thanks, Yuan
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 09/01/2013 11:36, Liu Yuan wrote: But why is it useful to force-disable writeback caching? Do you have any performance numbers? This is not related to performance. What I want is to allow users to choose the Sheepdog cache mode (writethrough/writeback/directio) from cache=xxx. Obviously writeback mode shows much better write performance than a writethrough cache in Sheepdog. NOTE: the writethrough cache in Sheepdog is a read-only cache which is always consistent with cluster data and thus needs no flush at all. This is to boost read performance for read-intensive Guests over Sheepdog images. Ok, so the questions are (compared to older QEMU's writethrough mode): 1) how much slower is QEMU's emulated-writethrough mode for writes, due to the extra requests? 2) how much slower is QEMU's writeback mode for reads, due to the different structure of the cache? Paolo
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 01/09/2013 06:40 PM, Paolo Bonzini wrote: 1) how much slower is QEMU's emulated-writethrough mode for writes, due to the extra requests? I'll collect some numbers on it. 2) how much slower is QEMU's writeback mode for reads, due to the different structure of the cache? Sorry, I don't get your question. I didn't say QEMU's writeback mode is slow for reads, did I? Thanks, Yuan
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 01/09/2013 06:46 PM, Liu Yuan wrote: I'll collect some numbers on it. Well, when I use 'hdparm -W 0 /dev/vdx' to choose the writethrough cache in the Guest (virtio disk), it says:

test@vm1:~$ sudo hdparm -W 0 /dev/vdb
/dev/vdb:
 setting drive write-caching to 0 (off)
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 HDIO_DRIVE_CMD(flushcache) failed: Inappropriate ioctl for device
 HDIO_DRIVE_CMD(setcache) failed: Inappropriate ioctl for device
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 HDIO_DRIVE_CMD(flushcache) failed: Inappropriate ioctl for device
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device

Do I need extra flags to enable WCE? Both Guest and Host kernels are 3.7 and I start QEMU with the following command: $ qemu-system-x86_64 --enable-kvm -drive file=~/images/test1,if=virtio,cache=writeback -smp 2 -cpu host -m 1024 -drive file=sheepdog:test,if=virtio,cache=writeback Thanks, Yuan
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 09.01.2013 11:36, Liu Yuan wrote: On 01/09/2013 06:25 PM, Paolo Bonzini wrote: But why is it useful to force-disable writeback caching? Do you have any performance numbers? This is not related to performance. What I want is to allow users to choose the Sheepdog cache mode (writethrough/writeback/directio) from cache=xxx. The question was what effects these modes have, and especially what the user-visible differences between a writeback+flush and a writethrough cache are. Obviously writeback mode shows much better write performance than a writethrough cache in Sheepdog. NOTE: the writethrough cache in Sheepdog is a read-only cache which is always consistent with cluster data and thus needs no flush at all. This is to boost read performance for read-intensive Guests over Sheepdog images. This is the definition of a writethrough cache. Kevin
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 09/01/2013 11:58, Liu Yuan wrote: On 01/09/2013 06:46 PM, Liu Yuan wrote: I'll collect some numbers on it. Well, when I use 'hdparm -W 0 /dev/vdx' to choose the writethrough cache in the Guest (virtio disk), it says: test@vm1:~$ sudo hdparm -W 0 /dev/vdb /dev/vdb: setting drive write-caching to 0 (off) HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device HDIO_DRIVE_CMD(flushcache) failed: Inappropriate ioctl for device HDIO_DRIVE_CMD(setcache) failed: Inappropriate ioctl for device HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device HDIO_DRIVE_CMD(flushcache) failed: Inappropriate ioctl for device HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device Do I need extra flags to enable WCE? hdparm is specific to IDE devices. sysfs is supported by SCSI (including IDE with libata, and virtio-scsi) and virtio-blk. You can use this for SCSI: echo 'write back' > /sys/block/sda/device/scsi_disk/*/cache_type and a similar command for virtio-blk. Paolo Both Guest and Host kernels are 3.7 and I start QEMU with the following command: $ qemu-system-x86_64 --enable-kvm -drive file=~/images/test1,if=virtio,cache=writeback -smp 2 -cpu host -m 1024 -drive file=sheepdog:test,if=virtio,cache=writeback Thanks, Yuan
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 01/09/2013 06:46 PM, Liu Yuan wrote: 1) how much slower is QEMU's emulated-writethrough mode for writes, due to the extra requests? I'll collect some numbers on it. Okay, I got some numbers. I run three sheep daemons on the same host to emulate a 3-node cluster; the Sheepdog image has 3 copies, and I put the Sheepdog client cache and the Sheepdog backend storage on a tmpfs. Guest and Host are both Linux 3.7. I start QEMU with the following command: $ qemu-system-x86_64 --enable-kvm -drive file=~/images/test1,if=virtio,cache=writeback -smp 2 -cpu host -m 1024 -drive file=sheepdog:test,if=virtio,cache=writeback I run 'dd if=/dev/urandom of=/dev/vdb bs=1M count=100 oflag=direct' 5 times and get the average number:

emulated (write + flush)   old impl (single write)
13.3 M/s                   13.7 M/s

boost percentage: (13.7 - 13.3)/13.3 = 3%. The boost isn't big, but if we run QEMU and the sheep daemon on separate boxes, we'd expect a bigger boost because the overhead of the extra 'flush' request is increased. Besides performance, I think backward compatibility is more important: 1) if we run an old-kernel host (quite possible for a long-running server) which doesn't support WCE, then we will never have a chance to choose a writethrough cache for the guest OS with new QEMU (most users tend to update user-space tools to exclude bugs); 2) the upper-layer software which relies on 'cache=xxx' to choose the cache mode will fail in its assumptions with new QEMU. And my proposal (adding another field to BlockDriverState to allow self-interpretation of the cache flag) can work well with the current block layer. Sheepdog driver behavior:

cache flag      WCE toggled and resulting behavior
writethrough    writethrough
writeback       writethrough (writeback + flush)

We can see that the writeback behavior for Guest-WCE toggling is the same as expected. The difference is that if we set cache=writethrough, the guest can't change it via WCE toggling. Thanks, Yuan
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 01/09/2013 08:07 PM, Liu Yuan wrote: $ qemu-system-x86_64 --enable-kvm -drive file=~/images/test1,if=virtio,cache=writeback -smp 2 -cpu host -m 1024 -drive file=sheepdog:test,if=virtio,cache=writeback To emulate the old implementation, I modified the sheepdog block driver as follows:

diff --git a/block/sheepdog.c b/block/sheepdog.c
index e821746..321a426 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -1118,6 +1118,7 @@ static int sd_open(BlockDriverState *bs, const char *filename, int flags)
         goto out;
     }
 
+    if (0) {
     s->cache_enabled = true;
     s->flush_fd = connect_to_sdog(s->addr, s->port);
     if (s->flush_fd < 0) {
@@ -1125,6 +1126,7 @@ static int sd_open(BlockDriverState *bs, const char *filename, int flags)
         ret = s->flush_fd;
         goto out;
     }
+    }
 
     if (snapid || tag[0] != '\0') {
         dprintf("%" PRIx32 " snapshot inode was open.\n", vid);
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 01/09/2013 08:07 PM, Liu Yuan wrote: And my proposal (adding another field to BlockDriverState to allow self-interpretation of the cache flag) can work well with the current block layer. Sheepdog driver behavior:

cache flag      WCE toggled and resulting behavior
writethrough    writethrough
writeback       writethrough (writeback + flush)

We can see that the writeback behavior for Guest-WCE toggling is the same as expected. The difference is that if we set cache=writethrough, the guest can't change it via WCE toggling. I am wondering if the above scheme can apply to the block layer, so that 'cache=xxx' can act as a stronger control over WCE toggling. Then we don't need the extra field. Thanks, Yuan
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 09.01.2013 13:07, Liu Yuan wrote: Besides performance, I think backward compatibility is more important: 1) if we run an old-kernel host (quite possible for a long-running server) which doesn't support WCE, then we will never have a chance to choose a writethrough cache for the guest OS with new QEMU (most users tend to update user-space tools to exclude bugs). How does the host kernel even play a role in this context? 2) The upper-layer software which relies on 'cache=xxx' to choose the cache mode will fail in its assumptions with new QEMU. Which assumptions do you mean? As far as I can tell the behaviour hasn't changed, except possibly for the performance. We can see that the writeback behavior for Guest-WCE toggling is the same as expected. The difference is that if we set cache=writethrough, the guest can't change it via WCE toggling. Then you need to make sure to communicate this to the guest. I'm not convinced that adding extra logic to each device for this is a good idea. Kevin
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 01/09/2013 08:42 PM, Kevin Wolf wrote: On 09.01.2013 13:07, Liu Yuan wrote: Besides performance, I think backward compatibility is more important: 1) if we run an old-kernel host (quite possible for a long-running server) which doesn't support WCE, then we will never have a chance to choose a writethrough cache for the guest OS with new QEMU (most users tend to update user-space tools to exclude bugs). How does the host kernel even play a role in this context? Oops, my mind was broken. Okay, a guest OS which doesn't support WCE can't change the cache type, as we discussed. 2) The upper-layer software which relies on 'cache=xxx' to choose the cache mode will fail in its assumptions with new QEMU. Which assumptions do you mean? As far as I can tell the behaviour hasn't changed, except possibly for the performance. When users set 'cache=writethrough' to export only a writethrough cache to the Guest, but with new QEMU, it will actually get a writeback cache by default. We can see that the writeback behavior for Guest-WCE toggling is the same as expected. The difference is that if we set cache=writethrough, the guest can't change it via WCE toggling. Then you need to make sure to communicate this to the guest. I'm not convinced that adding extra logic to each device for this is a good idea. We don't need to communicate to the guest. I think 'cache=xxx' means what kind of cache the users *expect* to export to the Guest OS. So if cache=writethrough is set, the Guest OS couldn't turn it into a writeback cache magically. This is like I bought a disk with a 'writethrough' cache built-in; I didn't expect that it would turn out to be a disk with a writeback cache under the hood which could possibly lose data when a power outage happened. So Guest-WCE only works when users provide us a writeback-capable disk. Thanks, Yuan
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
Il 09/01/2013 14:04, Liu Yuan ha scritto: 2 The upper layer software which relies on 'cache=xxx' to choose the cache mode will fail in its assumptions against new QEMU. Which assumptions do you mean? As far as I can say the behaviour hasn't changed, except possibly for the performance. When users set 'cache=writethrough' they expect to export only a writethrough cache to the guest, but with new QEMU it will actually be a writeback cache by default. They get a writeback cache implementation-wise, but they get a writethrough cache safety-wise. How the cache is implemented doesn't matter, as long as it looks like a writethrough cache. In fact, consider a local disk that doesn't support FUA. In old QEMU, images used to be opened with O_DSYNC and that splits each write into WRITE+FLUSH, just like new QEMU. All that changes is _where_ the flushes are created. Old QEMU changes it in the kernel, new QEMU changes it in userspace. We don't need to communicate to the guest. I think 'cache=xxx' means what kind of cache the users *expect* to export to the Guest OS. So if cache=writethrough is set, the Guest OS couldn't turn it into a writeback cache magically. This is like I bought a disk with a 'writethrough' cache built in; I wouldn't expect it to turn out to be a disk with a writeback cache under the hood which could possibly lose data when a power outage happened. It's not by magic. It's by explicitly requesting the disk to do this. Perhaps it's a bug that the cache mode is not reset when the machine is reset. I haven't checked that, but it would be a valid complaint. Paolo So Guest-WCE only works when users provide us a writeback-capable disk.
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 01/09/2013 11:10 PM, Paolo Bonzini wrote: Il 09/01/2013 14:04, Liu Yuan ha scritto: 2 The upper layer software which relies on 'cache=xxx' to choose the cache mode will fail in its assumptions against new QEMU. Which assumptions do you mean? As far as I can say the behaviour hasn't changed, except possibly for the performance. When users set 'cache=writethrough' they expect to export only a writethrough cache to the guest, but with new QEMU it will actually be a writeback cache by default. They get a writeback cache implementation-wise, but they get a writethrough cache safety-wise. How the cache is implemented doesn't matter, as long as it looks like a writethrough cache. In fact, consider a local disk that doesn't support FUA. In old QEMU, images used to be opened with O_DSYNC and that splits each write into WRITE+FLUSH, just like new QEMU. All that changes is _where_ the flushes are created. Old QEMU changes it in the kernel, new QEMU changes it in userspace. We don't need to communicate to the guest. I think 'cache=xxx' means what kind of cache the users *expect* to export to the Guest OS. So if cache=writethrough is set, the Guest OS couldn't turn it into a writeback cache magically. This is like I bought a disk with a 'writethrough' cache built in; I wouldn't expect it to turn out to be a disk with a writeback cache under the hood which could possibly lose data when a power outage happened. It's not by magic. It's by explicitly requesting the disk to do this. Perhaps it's a bug that the cache mode is not reset when the machine is reset. I haven't checked that, but it would be a valid complaint. Ah, I didn't get the current implementation right. I tried the 3.7 kernel and it works as expected (cache=writethrough results in a 'writethrough' cache in the guest). It looks fine to me to emulate writethrough as writeback + flush, since the performance drop isn't big, though sheepdog itself supports a true writethrough cache (no flush). I am going to send v2 of the directio patch for the sheepdog driver. Thanks, Yuan
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On Tue, Jan 08, 2013 at 01:42:35PM +0800, Liu Yuan wrote: On 01/07/2013 09:23 PM, Kevin Wolf wrote: No, and in theory they shouldn't have to care. Why do you want to handle writethrough semantics in the block driver rather than letting qemu care for sending the right flushes? Because Sheepdog itself provides the client-side cache implementation and we need a means to get the cache hint from users when starting up the VM. This cache supports: writethrough: provide a read cache for this VM, which is always consistent with cluster data; writeback: provide a writeback cache for this VM, which only flushes the dirty bits when the VM sends a 'flush' request; directio: disable cache completely for this VM. Old QEMU (before v1.2) allowed block drivers to get the cache flags, so Sheepdog could interpret these flags on its own to choose the right cache mode for each VM. Doesn't QEMU need to keep backward compatibility? This sounds fine - it fits together with QEMU like this: BDRV_O_NOCACHE means sheepdog directio. We're disabling the client-side sheepdog cache. Otherwise use sheepdog writeback and let QEMU block.c decide when to flush. Never use sheepdog writethrough because it's redundant here. Stefan
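Stefan's suggested mapping can be sketched as follows. This is an illustrative sketch only, not the actual QEMU or sheepdog code: the flag value and the names `sd_pick_cache_mode`/`sd_cache_mode` are invented for the example, mirroring the semantics of QEMU's real BDRV_O_NOCACHE flag.

```c
#include <assert.h>

/* Illustrative stand-in for QEMU's BDRV_O_NOCACHE (bypass host page
 * cache / O_DIRECT); the numeric value here is arbitrary, not QEMU's. */
enum { BDRV_O_NOCACHE = 0x20 };

typedef enum {
    SD_CACHE_WRITEBACK,  /* client-side cache on; block.c injects flushes
                            when writethrough safety is wanted */
    SD_CACHE_DIRECTIO,   /* client-side cache disabled entirely */
} sd_cache_mode;

/* BDRV_O_NOCACHE -> sheepdog directio; everything else -> sheepdog
 * writeback.  Sheepdog's own writethrough mode is never selected,
 * because it would be redundant with the flushes block.c sends. */
static sd_cache_mode sd_pick_cache_mode(int bdrv_flags)
{
    return (bdrv_flags & BDRV_O_NOCACHE) ? SD_CACHE_DIRECTIO
                                         : SD_CACHE_WRITEBACK;
}
```

The point of the design is that the driver only needs the host-side hint; the guest-visible cache safety stays entirely in generic block-layer code.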
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 01/08/2013 05:40 PM, Stefan Hajnoczi wrote: Otherwise use sheepdog writeback and let QEMU block.c decide when to flush. Never use sheepdog writethrough because it's redundant here. I don't get it. What do you mean by 'redundant'? If we use the virtio sheepdog block driver, how can we specify writethrough mode for the Sheepdog cache? Here 'writethrough' means using a pure read cache, which doesn't need flushes at all. Thanks, Yuan
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
Am 08.01.2013 10:45, schrieb Liu Yuan: On 01/08/2013 05:40 PM, Stefan Hajnoczi wrote: Otherwise use sheepdog writeback and let QEMU block.c decide when to flush. Never use sheepdog writethrough because it's redundant here. I don't get it. What do you mean by 'redundant'? If we use the virtio sheepdog block driver, how can we specify writethrough mode for the Sheepdog cache? Here 'writethrough' means using a pure read cache, which doesn't need flushes at all. A writethrough cache is equivalent to a write-back cache where each write is followed by a flush. qemu makes sure to send these flushes, so there is no need to use Sheepdog's writethrough mode. Kevin
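The equivalence Kevin describes can be sketched with a toy in-memory "disk". This is not QEMU's API; `ToyDisk`, `backend_write`, `backend_flush` and `blk_write` are invented names, and the flush-injection line stands in for what block.c effectively does in bdrv_co_do_writev() when the write cache is disabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define DISK_SIZE 64

typedef struct {
    char cache[DISK_SIZE];   /* volatile write cache */
    char stable[DISK_SIZE];  /* what survives a crash */
    bool dirty;              /* cache holds data not yet persisted */
} ToyDisk;

static void backend_write(ToyDisk *d, int off, const char *buf, int len)
{
    memcpy(d->cache + off, buf, len);
    d->dirty = true;
}

static void backend_flush(ToyDisk *d)
{
    memcpy(d->stable, d->cache, DISK_SIZE);
    d->dirty = false;
}

/* With the write cache disabled, every write is followed by a flush:
 * writethrough = writeback + flush. */
static void blk_write(ToyDisk *d, int off, const char *buf, int len,
                      bool enable_write_cache)
{
    backend_write(d, off, buf, len);
    if (!enable_write_cache) {
        backend_flush(d);
    }
}
```

With `enable_write_cache` false, data is on stable storage before the write completes, which is exactly the writethrough safety guarantee, regardless of how the backend caches internally.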
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 01/08/2013 06:00 PM, Kevin Wolf wrote: Am 08.01.2013 10:45, schrieb Liu Yuan: On 01/08/2013 05:40 PM, Stefan Hajnoczi wrote: Otherwise use sheepdog writeback and let QEMU block.c decide when to flush. Never use sheepdog writethrough because it's redundant here. I don't get it. What do you mean by 'redundant'? If we use the virtio sheepdog block driver, how can we specify writethrough mode for the Sheepdog cache? Here 'writethrough' means using a pure read cache, which doesn't need flushes at all. A writethrough cache is equivalent to a write-back cache where each write is followed by a flush. qemu makes sure to send these flushes, so there is no need to use Sheepdog's writethrough mode. Implementing writethrough as writeback + flush will cause considerable overhead for a network block device like Sheepdog: a single write request will be executed as two requests: write + flush. This also explains why I saw a regression in write performance: Old QEMU can issue multiple write requests in one go, but now the requests are sent one by one (even with cache=writeback set), which makes Sheepdog write performance drop a lot. Is it possible to issue multiple requests in one go as old QEMU does? It seems it is hard to restore the old semantics of the cache flags due to the new design of the QEMU block layer. So will you accept adding a 'flags' field into BlockDriverState which carries the 'cache flags' from the user, to keep backward compatibility? Thanks, Yuan
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
Am 08.01.2013 11:39, schrieb Liu Yuan: On 01/08/2013 06:00 PM, Kevin Wolf wrote: Am 08.01.2013 10:45, schrieb Liu Yuan: On 01/08/2013 05:40 PM, Stefan Hajnoczi wrote: Otherwise use sheepdog writeback and let QEMU block.c decide when to flush. Never use sheepdog writethrough because it's redundant here. I don't get it. What do you mean by 'redundant'? If we use the virtio sheepdog block driver, how can we specify writethrough mode for the Sheepdog cache? Here 'writethrough' means using a pure read cache, which doesn't need flushes at all. A writethrough cache is equivalent to a write-back cache where each write is followed by a flush. qemu makes sure to send these flushes, so there is no need to use Sheepdog's writethrough mode. Implementing writethrough as writeback + flush will cause considerable overhead for a network block device like Sheepdog: a single write request will be executed as two requests: write + flush Yeah, maybe we should have some kind of a FUA flag with write requests instead of sending a separate flush. This also explains why I saw a regression in write performance: Old QEMU can issue multiple write requests in one go, but now the requests are sent one by one (even with cache=writeback set), which makes Sheepdog write performance drop a lot. Is it possible to issue multiple requests in one go as old QEMU does? Huh? We didn't change anything in that respect, or at least not that I'm aware of. qemu always only had single-request bdrv_co_writev, so if anything that batching must have happened inside Sheepdog code? Do you know what makes it not batch requests any more? It seems it is hard to restore the old semantics of the cache flags due to the new design of the QEMU block layer. So will you accept adding a 'flags' field into BlockDriverState which carries the 'cache flags' from the user, to keep backward compatibility? No, going back to the old behaviour would break guest-toggled WCE. Kevin
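Kevin's FUA idea trades the separate flush round trip for a force-unit-access bit carried in the write request itself. The sketch below is purely illustrative: the `REQ_FUA` bit, the handler names and the round-trip counter are invented to show why FUA halves the number of network round trips per writethrough write.

```c
#include <assert.h>
#include <stdint.h>

#define REQ_FUA (1u << 0)   /* invented wire flag: persist before replying */

static int round_trips;

static void server_handle_write(uint32_t flags)
{
    round_trips++;          /* one trip for the write itself */
    if (flags & REQ_FUA) {
        /* server persists this write before replying:
         * no separate flush request is needed */
    }
}

static void server_handle_flush(void)
{
    round_trips++;          /* a whole extra trip just for the flush */
}

/* writethrough emulated as write + separate flush: two round trips */
static int writethrough_emulated(void)
{
    round_trips = 0;
    server_handle_write(0);
    server_handle_flush();
    return round_trips;
}

/* writethrough via a FUA write: one round trip */
static int writethrough_fua(void)
{
    round_trips = 0;
    server_handle_write(REQ_FUA);
    return round_trips;
}
```

For a networked backend the saved round trip is the whole point; on a local disk the same idea is what the kernel's FUA write support provides.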
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 01/08/2013 06:51 PM, Kevin Wolf wrote: Am 08.01.2013 11:39, schrieb Liu Yuan: On 01/08/2013 06:00 PM, Kevin Wolf wrote: Am 08.01.2013 10:45, schrieb Liu Yuan: On 01/08/2013 05:40 PM, Stefan Hajnoczi wrote: Otherwise use sheepdog writeback and let QEMU block.c decide when to flush. Never use sheepdog writethrough because it's redundant here. I don't get it. What do you mean by 'redundant'? If we use the virtio sheepdog block driver, how can we specify writethrough mode for the Sheepdog cache? Here 'writethrough' means using a pure read cache, which doesn't need flushes at all. A writethrough cache is equivalent to a write-back cache where each write is followed by a flush. qemu makes sure to send these flushes, so there is no need to use Sheepdog's writethrough mode. Implementing writethrough as writeback + flush will cause considerable overhead for a network block device like Sheepdog: a single write request will be executed as two requests: write + flush Yeah, maybe we should have some kind of a FUA flag with write requests instead of sending a separate flush. This also explains why I saw a regression in write performance: Old QEMU can issue multiple write requests in one go, but now the requests are sent one by one (even with cache=writeback set), which makes Sheepdog write performance drop a lot. Is it possible to issue multiple requests in one go as old QEMU does? Huh? We didn't change anything in that respect, or at least not that I'm aware of. qemu always only had single-request bdrv_co_writev, so if anything that batching must have happened inside Sheepdog code? Do you know what makes it not batch requests any more? QEMU v1.1.x works well with batched write requests. The Sheepdog block driver doesn't do any batching trick as far as I know; it just sends requests as they are fed. There aren't noticeable changes between v1.1.x and current master with regard to sheepdog.c.
To detail the different behavior, from the Sheepdog daemon which receives the requests from QEMU: old: can receive many requests at virtually the same time and handle them concurrently; now: only receives one request, handles it, replies and gets another. So I think the problem is that current QEMU will wait for the write response before sending another request. It seems it is hard to restore the old semantics of the cache flags due to the new design of the QEMU block layer. So will you accept adding a 'flags' field into BlockDriverState which carries the 'cache flags' from the user, to keep backward compatibility? No, going back to the old behaviour would break guest-toggled WCE. Guest-toggled WCE only works with IDE, and it seems that virtio-blk doesn't support it, no? And I think there are huge numbers of virtio-blk users. I didn't mean to break WCE. What I meant is to allow backward compatibility. E.g., the Sheepdog driver can make use of this dedicated cache flags field to implement its own cache control without affecting other drivers at all. Thanks, Yuan
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
Am 08.01.2013 12:08, schrieb Liu Yuan: On 01/08/2013 06:51 PM, Kevin Wolf wrote: Am 08.01.2013 11:39, schrieb Liu Yuan: This also explains why I saw a regression in write performance: Old QEMU can issue multiple write requests in one go, but now the requests are sent one by one (even with cache=writeback set), which makes Sheepdog write performance drop a lot. Is it possible to issue multiple requests in one go as old QEMU does? Huh? We didn't change anything in that respect, or at least not that I'm aware of. qemu always only had single-request bdrv_co_writev, so if anything that batching must have happened inside Sheepdog code? Do you know what makes it not batch requests any more? QEMU v1.1.x works well with batched write requests. The Sheepdog block driver doesn't do any batching trick as far as I know; it just sends requests as they are fed. There aren't noticeable changes between v1.1.x and current master with regard to sheepdog.c. To detail the different behavior, from the Sheepdog daemon which receives the requests from QEMU: old: can receive many requests at virtually the same time and handle them concurrently; now: only receives one request, handles it, replies and gets another. So I think the problem is that current QEMU will wait for the write response before sending another request. I can't see a reason why it would do that. Can you bisect this? It seems it is hard to restore the old semantics of the cache flags due to the new design of the QEMU block layer. So will you accept adding a 'flags' field into BlockDriverState which carries the 'cache flags' from the user, to keep backward compatibility? No, going back to the old behaviour would break guest-toggled WCE. Guest-toggled WCE only works with IDE, and it seems that virtio-blk doesn't support it, no? And I think there are huge numbers of virtio-blk users. It works with virtio-blk and SCSI as well. I didn't mean to break WCE. What I meant is to allow backward compatibility.
E.g., the Sheepdog driver can make use of this dedicated cache flags field to implement its own cache control without affecting other drivers at all. How would you do it? With a WCE that changes during runtime, the idea of a flag that is passed to bdrv_open() and stays valid as long as the BlockDriverState exists doesn't match reality any more. Kevin
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 01/08/2013 07:19 PM, Kevin Wolf wrote: I can't see a reason why it would do that. Can you bisect this? Sure, bisecting it is on my schedule, but I can't promise a deadline. It seems it is hard to restore the old semantics of the cache flags due to the new design of the QEMU block layer. So will you accept adding a 'flags' field into BlockDriverState which carries the 'cache flags' from the user, to keep backward compatibility? No, going back to the old behaviour would break guest-toggled WCE. Guest-toggled WCE only works with IDE, and it seems that virtio-blk doesn't support it, no? And I think there are huge numbers of virtio-blk users. It works with virtio-blk and SCSI as well. Okay, I see the code indeed supports WCE, but it requires the guest kernel to support it. For a kernel that doesn't support it, there is no way to disable the write cache, no? I didn't mean to break WCE. What I meant is to allow backward compatibility. E.g., the Sheepdog driver can make use of this dedicated cache flags field to implement its own cache control without affecting other drivers at all. How would you do it? With a WCE that changes during runtime, the idea of a flag that is passed to bdrv_open() and stays valid as long as the BlockDriverState exists doesn't match reality any more. I am not sure if I get it right. What I meant is to allow Sheepdog to control the cache based on the 'cache flags' at startup and ignore WCE at run time. So you mean, if I choose a writeback cache at startup, and then the guest disables it via WCE, then the block layer will never send flush requests down to the Sheepdog driver? Thanks, Yuan
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
Am 08.01.2013 12:35, schrieb Liu Yuan: On 01/08/2013 07:19 PM, Kevin Wolf wrote: I can't see a reason why it would do that. Can you bisect this? Sure, bisecting it is on my schedule, but I can't promise a deadline. Ok, thanks. It would be good if it was before the hard freeze for 1.4. It seems it is hard to restore the old semantics of the cache flags due to the new design of the QEMU block layer. So will you accept adding a 'flags' field into BlockDriverState which carries the 'cache flags' from the user, to keep backward compatibility? No, going back to the old behaviour would break guest-toggled WCE. Guest-toggled WCE only works with IDE, and it seems that virtio-blk doesn't support it, no? And I think there are huge numbers of virtio-blk users. It works with virtio-blk and SCSI as well. Okay, I see the code indeed supports WCE, but it requires the guest kernel to support it. For a kernel that doesn't support it, there is no way to disable the write cache, no? With Linux guests, it's possible for SCSI. I'm not sure about virtio-blk, but you might be right that you can't manually change it there. I didn't mean to break WCE. What I meant is to allow backward compatibility. E.g., the Sheepdog driver can make use of this dedicated cache flags field to implement its own cache control without affecting other drivers at all. How would you do it? With a WCE that changes during runtime, the idea of a flag that is passed to bdrv_open() and stays valid as long as the BlockDriverState exists doesn't match reality any more. I am not sure if I get it right. What I meant is to allow Sheepdog to control the cache based on the 'cache flags' at startup and ignore WCE at run time. If you start with cache=writeback and then the guest switches WCE off and you ignore that, then you're causing data corruption in the case of a crash. This is not an option. So you mean, if I choose a writeback cache at startup, and then the guest disables it via WCE, then the block layer will never send flush requests down to the Sheepdog driver? Yes, this is the problematic case.
Currently the qemu block layer makes sure to send flushes, but if you disable that logic for Sheepdog, you would get broken behaviour in this case. Kevin
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 01/08/2013 08:12 PM, Kevin Wolf wrote: Ok, thanks. It would be good if it was before the hard freeze for 1.4. Oops, sorry. It is a false alarm. Last time I was running the latest QEMU on tmpfs, which serviced the requests too fast. It seems it is hard to restore the old semantics of the cache flags due to the new design of the QEMU block layer. So will you accept adding a 'flags' field into BlockDriverState which carries the 'cache flags' from the user, to keep backward compatibility? No, going back to the old behaviour would break guest-toggled WCE. Guest-toggled WCE only works with IDE, and it seems that virtio-blk doesn't support it, no? And I think there are huge numbers of virtio-blk users. It works with virtio-blk and SCSI as well. Okay, I see the code indeed supports WCE, but it requires the guest kernel to support it. For a kernel that doesn't support it, there is no way to disable the write cache, no? With Linux guests, it's possible for SCSI. I'm not sure about virtio-blk, but you might be right that you can't manually change it there. I didn't mean to break WCE. What I meant is to allow backward compatibility. E.g., the Sheepdog driver can make use of this dedicated cache flags field to implement its own cache control without affecting other drivers at all. How would you do it? With a WCE that changes during runtime, the idea of a flag that is passed to bdrv_open() and stays valid as long as the BlockDriverState exists doesn't match reality any more. I am not sure if I get it right. What I meant is to allow Sheepdog to control the cache based on the 'cache flags' at startup and ignore WCE at run time. If you start with cache=writeback and then the guest switches WCE off and you ignore that, then you're causing data corruption in the case of a crash. This is not an option. So you mean, if I choose a writeback cache at startup, and then the guest disables it via WCE, then the block layer will never send flush requests down to the Sheepdog driver? Yes, this is the problematic case.
Currently the qemu block layer makes sure to send flushes, but if you disable that logic for Sheepdog, you would get broken behaviour in this case. Maybe not, on second thought. See the following combinations:

cache flags     WCE toggled and resulting behavior
writethrough    writethrough
writeback       writethrough (writeback + flush, as expected)

'cache flags' means the 'cache=xxx' specified at startup, and WCE is toggled on the fly in the guest (suppose the guest kernel supports WCE control). So the result is *not* broken. If we set cache=writethrough for sheepdog, then WCE won't take any effect because 'flush' requests will be ignored by the Sheepdog driver. And with cache=writeback, WCE does disable the write cache and actually turns it into a writethrough cache by sending a flush request after every write. To conclude, letting Sheepdog interpret the cache flags won't cause trouble even with the current guest WCE feature; the difference is that if we set cache=writethrough, the guest can't change it via WCE toggling. Is this behavior acceptable? Thanks, Yuan
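The combinations Yuan lists can be expressed as a tiny decision function. This is a sketch of the argument, not QEMU code: the enum names and `effective_mode` are invented, and the function encodes Yuan's proposal that a startup mode of writethrough is honoured by the driver regardless of the guest's WCE toggle.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { WRITETHROUGH, WRITEBACK } cache_mode;

/* Guest-visible safety resulting from the startup 'cache=' mode plus
 * the guest's WCE toggle.  startup=writethrough is already safe and
 * the guest cannot switch it to writeback; startup=writeback with WCE
 * off behaves as writethrough because block.c flushes after each write. */
static cache_mode effective_mode(cache_mode startup, bool guest_wce_on)
{
    if (startup == WRITETHROUGH) {
        return WRITETHROUGH;
    }
    return guest_wce_on ? WRITEBACK : WRITETHROUGH;
}
```

Only the writeback-at-startup row ever yields unsafe (writeback) behaviour, and only while the guest explicitly keeps WCE on, which is why no combination in the table is broken.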
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 01/08/2013 09:18 PM, Liu Yuan wrote: So the result is *not* broken. If we set cache=writethrough for sheepdog, then WCE won't take any effect because 'flush' requests will be ignored by the Sheepdog driver. The merit of this 'writethrough' is that a write request will never be interpreted as two requests (write + flush) in the Sheepdog driver. Thanks, Yuan
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On Sat, Jan 05, 2013 at 03:56:11PM +0800, Liu Yuan wrote: On 01/05/2013 01:29 PM, Liu Yuan wrote: On 01/05/2013 12:40 PM, Liu Yuan wrote: On 01/05/2013 12:38 AM, Stefan Hajnoczi wrote: Hi Yuan, BDRV_O_NOCACHE means bypass host page cache (O_DIRECT). BDRV_O_CACHE_WB specifies the cache semantics that the guest sees - that means whether the disk cache is writethrough or writeback. In other words, BDRV_O_NOCACHE is a host performance tweak while BDRV_O_CACHE_WB changes the cache safety of the BlockDriverState. A protocol driver like sheepdog doesn't need to look at BDRV_O_CACHE_WB because it is implemented in block.c (see bdrv_co_do_writev() where QEMU will flush after each write when !bs->enable_write_cache). Hi Stefan, Thanks for your explanation. But after more investigation, I find myself more confused: flags passed from the block layer: {writeback, writethrough} 0x2042, {directsync, off, none} 0x2062, {unsafe} 0x2242. So an underlying driver like Sheepdog can't depend on the 'flags' passed from .bdrv_file_open() to choose the right semantics (this was possible in old QEMU IIRC). If we can't rely on the 'flags' to get the cache indications of users, would you point me to how to implement tristate cache control for a network block driver like Sheepdog? E.g., I want to implement the following semantics: cache=writeback|none|off # enable writeback semantics for writes; cache=writethrough # enable writethrough semantics for writes; cache=directsync # disable cache completely. Thanks, Yuan I tried the old QEMU, v1.1.0 and v1.1.2, and they still worked as I expected. So I guess the generic block layer got changed a bit and the meaning of 'flags' turned out different from the old code, which did indeed allow block drivers to interpret the 'flags' passed from bdrv_file_open(). With the current upstream code, it seems that BDRV_O_CACHE_WB is always enabled, which makes 'flags' completely unusable for block drivers to get the indications of the user specified by the 'cache=' field.
So is there some other means block drivers can rely on in order to interpret the 'cache semantics'? I found commit e1e9b0ac 'always open drivers in writeback mode'. This is really undesired for network block drivers such as Sheepdog, which implement their own cache mechanism supporting writeback/writethrough/directio behavior and want to interpret these flags on their own. Is there any means for block drivers to get the semantics of 'cache=xxx' from users? Hi Yuan, Please explain what cache semantics the sheepdog server supports so I can understand better what you are trying to achieve. Stefan
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
Am 05.01.2013 08:56, schrieb Liu Yuan: On 01/05/2013 01:29 PM, Liu Yuan wrote: On 01/05/2013 12:40 PM, Liu Yuan wrote: On 01/05/2013 12:38 AM, Stefan Hajnoczi wrote: Hi Yuan, BDRV_O_NOCACHE means bypass host page cache (O_DIRECT). BDRV_O_CACHE_WB specifies the cache semantics that the guest sees - that means whether the disk cache is writethrough or writeback. In other words, BDRV_O_NOCACHE is a host performance tweak while BDRV_O_CACHE_WB changes the cache safety of the BlockDriverState. A protocol driver like sheepdog doesn't need to look at BDRV_O_CACHE_WB because it is implemented in block.c (see bdrv_co_do_writev() where QEMU will flush after each write when !bs->enable_write_cache). Hi Stefan, Thanks for your explanation. But after more investigation, I find myself more confused: flags passed from the block layer: {writeback, writethrough} 0x2042, {directsync, off, none} 0x2062, {unsafe} 0x2242. So an underlying driver like Sheepdog can't depend on the 'flags' passed from .bdrv_file_open() to choose the right semantics (this was possible in old QEMU IIRC). If we can't rely on the 'flags' to get the cache indications of users, would you point me to how to implement tristate cache control for a network block driver like Sheepdog? E.g., I want to implement the following semantics: cache=writeback|none|off # enable writeback semantics for writes; cache=writethrough # enable writethrough semantics for writes; cache=directsync # disable cache completely. Thanks, Yuan I tried the old QEMU, v1.1.0 and v1.1.2, and they still worked as I expected. So I guess the generic block layer got changed a bit and the meaning of 'flags' turned out different from the old code, which did indeed allow block drivers to interpret the 'flags' passed from bdrv_file_open(). With the current upstream code, it seems that BDRV_O_CACHE_WB is always enabled, which makes 'flags' completely unusable for block drivers to get the indications of the user specified by the 'cache=' field.
So is there some other means block drivers can rely on in order to interpret the 'cache semantics'? I found commit e1e9b0ac 'always open drivers in writeback mode'. This is really undesired for network block drivers such as Sheepdog, which implement their own cache mechanism supporting writeback/writethrough/directio behavior and want to interpret these flags on their own. Is there any means for block drivers to get the semantics of 'cache=xxx' from users? No, and in theory they shouldn't have to care. Why do you want to handle writethrough semantics in the block driver rather than letting qemu care for sending the right flushes? Kevin
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 01/07/2013 08:31 PM, Stefan Hajnoczi wrote: On Sat, Jan 05, 2013 at 03:56:11PM +0800, Liu Yuan wrote: On 01/05/2013 01:29 PM, Liu Yuan wrote: On 01/05/2013 12:40 PM, Liu Yuan wrote: On 01/05/2013 12:38 AM, Stefan Hajnoczi wrote: Hi Yuan, BDRV_O_NOCACHE means bypass host page cache (O_DIRECT). BDRV_O_CACHE_WB specifies the cache semantics that the guest sees - that means whether the disk cache is writethrough or writeback. In other words, BDRV_O_NOCACHE is a host performance tweak while BDRV_O_CACHE_WB changes the cache safety of the BlockDriverState. A protocol driver like sheepdog doesn't need to look at BDRV_O_CACHE_WB because it is implemented in block.c (see bdrv_co_do_writev() where QEMU will flush after each write when !bs->enable_write_cache). Hi Stefan, Thanks for your explanation. But after more investigation, I find myself more confused: flags passed from the block layer: {writeback, writethrough} 0x2042, {directsync, off, none} 0x2062, {unsafe} 0x2242. So an underlying driver like Sheepdog can't depend on the 'flags' passed from .bdrv_file_open() to choose the right semantics (this was possible in old QEMU IIRC). If we can't rely on the 'flags' to get the cache indications of users, would you point me to how to implement tristate cache control for a network block driver like Sheepdog? E.g., I want to implement the following semantics: cache=writeback|none|off # enable writeback semantics for writes; cache=writethrough # enable writethrough semantics for writes; cache=directsync # disable cache completely. Thanks, Yuan I tried the old QEMU, v1.1.0 and v1.1.2, and they still worked as I expected. So I guess the generic block layer got changed a bit and the meaning of 'flags' turned out different from the old code, which did indeed allow block drivers to interpret the 'flags' passed from bdrv_file_open().
With the current upstream code, it seems that BDRV_O_CACHE_WB is always enabled, which makes 'flags' completely unusable for block drivers to get the indications of the user specified by the 'cache=' field. So is there some other means block drivers can rely on in order to interpret the 'cache semantics'? I found commit e1e9b0ac 'always open drivers in writeback mode'. This is really undesired for network block drivers such as Sheepdog, which implement their own cache mechanism supporting writeback/writethrough/directio behavior and want to interpret these flags on their own. Is there any means for block drivers to get the semantics of 'cache=xxx' from users? Hi Yuan, Please explain what cache semantics the sheepdog server supports so I can understand better what you are trying to achieve. Hi Stefan, Sheepdog supports writethrough/writeback/directio (bypassing the cache), and I want to implement the following semantics: cache=writeback|none|off # enable writeback semantics for writes; cache=writethrough # enable writethrough semantics for writes; cache=directsync # disable cache completely. Thanks, Yuan
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 01/07/2013 09:23 PM, Kevin Wolf wrote: No, and in theory they shouldn't have to care. Why do you want to handle writethrough semantics in the block driver rather than letting qemu care for sending the right flushes? Because Sheepdog itself provides the client-side cache implementation and we need a means to get the cache hint from users when starting up the VM. This cache supports: writethrough: provide a read cache for this VM, which is always consistent with cluster data; writeback: provide a writeback cache for this VM, which only flushes the dirty bits when the VM sends a 'flush' request; directio: disable cache completely for this VM. Old QEMU (before v1.2) allowed block drivers to get the cache flags, so Sheepdog could interpret these flags on its own to choose the right cache mode for each VM. Doesn't QEMU need to keep backward compatibility? Thanks, Yuan
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On Thu, Jan 03, 2013 at 09:43:53PM +0800, Liu Yuan wrote: On 12/25/2012 04:45 PM, Liu Yuan wrote: Well, I found setting cache=directsync will contain 'BDRV_O_CACHE_WB'. Is this a bug in the current master? If not, my current scheme will be the only way to bypass the cache of sheepdog. Ping. Can anyone confirm it is a bug that 'cache=directsync' will pass 'BDRV_O_CACHE_WB' in the flags? Hi Yuan, BDRV_O_NOCACHE means bypass host page cache (O_DIRECT). BDRV_O_CACHE_WB specifies the cache semantics that the guest sees - that means whether the disk cache is writethrough or writeback. In other words, BDRV_O_NOCACHE is a host performance tweak while BDRV_O_CACHE_WB changes the cache safety of the BlockDriverState. A protocol driver like sheepdog doesn't need to look at BDRV_O_CACHE_WB because it is implemented in block.c (see bdrv_co_do_writev() where QEMU will flush after each write when !bs->enable_write_cache). Stefan
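Stefan's distinction between the two flags can be sketched as follows. The flag values and the cache= mapping shown are illustrative assumptions (the thread itself notes the real upstream values differ); the point is only that the host-side O_DIRECT hint and the guest-visible cache safety vary independently.

```c
#include <assert.h>
#include <stdbool.h>

enum {
    BDRV_O_NOCACHE  = 0x20,   /* host-side: bypass host page cache (O_DIRECT) */
    BDRV_O_CACHE_WB = 0x40,   /* guest-side: guest sees a writeback cache */
};

/* Host performance tweak: does the image get opened with O_DIRECT? */
static bool host_uses_o_direct(int flags)
{
    return flags & BDRV_O_NOCACHE;
}

/* Cache safety: does the guest see a writeback cache (flushes needed)? */
static bool guest_sees_writeback(int flags)
{
    return flags & BDRV_O_CACHE_WB;
}
```

Under the classic mapping, cache=none would set both bits, cache=directsync only BDRV_O_NOCACHE, and cache=writeback only BDRV_O_CACHE_WB, which is why a protocol driver only ever needs the NOCACHE bit.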
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 01/05/2013 12:38 AM, Stefan Hajnoczi wrote:

> Hi Yuan,
>
> BDRV_O_NOCACHE means bypass the host page cache (O_DIRECT). BDRV_O_CACHE_WB specifies the cache semantics that the guest sees - that is, whether the disk cache is writethrough or writeback. In other words, BDRV_O_NOCACHE is a host performance tweak while BDRV_O_CACHE_WB changes the cache safety of the BlockDriverState.
>
> A protocol driver like sheepdog doesn't need to look at BDRV_O_CACHE_WB because it is implemented in block.c (see bdrv_co_do_writev(), where QEMU flushes after each write when !bs->enable_write_cache).

Hi Stefan,

Thanks for your explanation. But after more investigation, I find myself more confused. These are the flags passed from the block layer:

{writeback, writethrough}  0x2042
{directsync, off, none}    0x2062
{unsafe}                   0x2242

So an underlying driver like Sheepdog can't depend on the 'flags' passed to .bdrv_file_open() to choose the right semantics (this was possible in old QEMU, IIRC). If we can't rely on the 'flags' to get the user's cache intent, would you point me to how to implement tri-state cache control for a network block driver like Sheepdog? E.g., I want to implement the following semantics:

cache=writeback|none|off  # enable writeback semantics for writes
cache=writethrough        # enable writethrough semantics for writes
cache=directsync          # disable the cache completely

Thanks, Yuan
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 01/05/2013 12:40 PM, Liu Yuan wrote:

[snip]

I tried old QEMU, v1.1.0 and v1.1.2; they still worked as I expected. So I guess the generic block layer changed a bit and the meaning of 'flags' turned out different from the old code, which did indeed allow block drivers to interpret the 'flags' passed to bdrv_file_open(). With the current upstream code, it seems that BDRV_O_CACHE_WB is always enabled, which makes 'flags' completely unusable for block drivers to learn what the user specified via the 'cache=' field.

Thanks, Yuan
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 01/05/2013 01:29 PM, Liu Yuan wrote:

[snip]
So is there any other means block drivers can rely on to interpret the cache semantics? I found commit e1e9b0ac, 'always open drivers in writeback mode'. This is really undesirable for network block drivers such as Sheepdog, which implements its own cache mechanism supporting writeback/writethrough/directio behaviour and wants to interpret these flags on its own. Is there any way for block drivers to get the semantics of 'cache=xxx' from users?

Thanks, Yuan
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 12/25/2012 04:45 PM, Liu Yuan wrote:

> Well, I found setting cache=directsync will contain 'BDRV_O_CACHE_WB'. Is this a bug in the current master? If not, my current scheme will be the only way to bypass the cache of sheepdog.

Ping. Can anyone confirm it is a bug that 'cache=directsync' will pass 'BDRV_O_CACHE_WB' in the flags?

Thanks, Yuan
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 12/25/2012 03:47 PM, MORITA Kazutaka wrote:

> I wonder if we should disable the cache when cache=none. Many management frontends use cache=none by default but, I think, users still expect that data is cached (e.g. by the disk write cache when a raw format is used). cache=none only means that the host page cache is not used for VM disk IO.

In that sense, I am fine with adopting these semantics for the option.

> @@ -1118,12 +1118,19 @@ static int sd_open(BlockDriverState *bs, const char *filename, int flags)
>          goto out;
>      }
>
> -    s->cache_enabled = true;
> -    s->flush_fd = connect_to_sdog(s->addr, s->port);
> -    if (s->flush_fd < 0) {
> -        error_report("failed to connect");
> -        ret = s->flush_fd;
> -        goto out;
> +    if (flags & BDRV_O_NOCACHE) {
> +        s->cache_flags = SD_FLAG_CMD_DIRECT;
> +    } else if (flags & BDRV_O_CACHE_WB) {
>
> 'else' should be removed, and

Seems it should not. We need this 'else if' to allow the writethrough flag.

> +        s->cache_flags = SD_FLAG_CMD_CACHE;
> +    }
> +
> +    if (s->cache_flags != SD_FLAG_CMD_DIRECT) {
>
> should be 's->cache_flags == SD_FLAG_CMD_CACHE'? Do we need to send a flush request when cache=writethrough?

Looks good to me. I'll change it in V2.

Thanks, Yuan
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
On 12/25/2012 03:47 PM, MORITA Kazutaka wrote:

> I wonder if we should disable the cache when cache=none. Many management frontends use cache=none by default but, I think, users still expect that data is cached (e.g. by the disk write cache when a raw format is used). cache=none only means that the host page cache is not used for VM disk IO. In that sense,

Well, I found setting cache=directsync will contain 'BDRV_O_CACHE_WB'. Is this a bug in the current master? If not, my current scheme will be the only way to bypass the cache of sheepdog.

Thanks, Yuan
Re: [Qemu-devel] [PATCH] sheepdog: implement direct write semantics
At Thu, 20 Dec 2012 02:29:31 +0800, Liu Yuan wrote:

> From: Liu Yuan tailai...@taobao.com
>
> Sheepdog supports both writeback and writethrough writes but does not yet support DIRECTIO semantics, which bypass the cache completely even if the Sheepdog daemon is set up with its cache enabled. Supposing the cache is enabled on the Sheepdog daemon side, the new cache control is:
>
> cache=writeback                  # enable the writeback semantics for writes
> cache=writethrough               # enable the writethrough semantics for writes
> cache='directsync | none | off'  # disable the cache completely

I wonder if we should disable the cache when cache=none. Many management frontends use cache=none by default but, I think, users still expect that data is cached (e.g. by the disk write cache when a raw format is used). cache=none only means that the host page cache is not used for VM disk IO. In that sense,

> @@ -1118,12 +1118,19 @@ static int sd_open(BlockDriverState *bs, const char *filename, int flags)
>          goto out;
>      }
>
> -    s->cache_enabled = true;
> -    s->flush_fd = connect_to_sdog(s->addr, s->port);
> -    if (s->flush_fd < 0) {
> -        error_report("failed to connect");
> -        ret = s->flush_fd;
> -        goto out;
> +    if (flags & BDRV_O_NOCACHE) {
> +        s->cache_flags = SD_FLAG_CMD_DIRECT;
> +    } else if (flags & BDRV_O_CACHE_WB) {

'else' should be removed, and

> +        s->cache_flags = SD_FLAG_CMD_CACHE;
> +    }
> +
> +    if (s->cache_flags != SD_FLAG_CMD_DIRECT) {

should be 's->cache_flags == SD_FLAG_CMD_CACHE'? Do we need to send a flush request when cache=writethrough?

Thanks, Kazutaka