Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
Rusty Russell wrote: On Wed, 5 May 2010 05:47:05 am Jamie Lokier wrote: Jens Axboe wrote: On Tue, May 04 2010, Rusty Russell wrote:

ISTR someone mentioning a desire for such an API years ago, so CC'ing the usual I/O suspects...

It would be nice to have a fuller API for this, but the reality is that only the flush approach is really workable. Even just strict ordering of requests could only be supported on SCSI, and even there the kernel still lacks proper guarantees on error handling to prevent reordering.

There are a few I/O scheduling differences that might be useful:

1. The I/O scheduler could freely move WRITEs before a FLUSH but not before a BARRIER. That might be useful for time-critical WRITEs, and those issued by high I/O priority.

This is only because no one actually wants flushes or barriers, though I/O people seem to only offer that. We really want "these writes must occur before this write". That offers maximum choice to the I/O subsystem and potentially to smart (virtual?) disks.

We do want flushes for the D in ACID: after receiving a mail or a blog update into a database file (could be TDB), and confirming that to the sender, we want high confidence that the update won't disappear on system crash or power failure.

Less obviously, it's also needed for the C in ACID when more than one file is involved. C is about differently updated things staying consistent with each other.

For example, imagine you have a TDB file mapping Samba usernames to passwords, and another mapping Samba usernames to local usernames. (I don't know if you do this; it's just an illustration.)

To rename a Samba user involves updating both. Let's ignore transient transactional issues :-) and just think about what happens with per-file barriers and no sync, when a crash happens long after the updates, and before the system has written out all data and issued low level cache flushes.

After restarting, due to lack of sync, the Samba username could be present in one file and not the other.

2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is only for data belonging to a particular file (e.g. fdatasync with no file size change, even on btrfs if O_DIRECT was used for the writes being committed). That would entail tagging FLUSHes and WRITEs with a fs-specific identifier (such as inode number), opaque to the scheduler which only checks equality.

This is closer. In userspace I'd be happy with an "all prior writes to this struct file before all future writes", even if the original guarantees were stronger (i.e. inode basis). We currently implement transactions using 4 fsync/msync pairs.

    write_recovery_data(fd); fsync(fd); msync(mmap);
    write_recovery_header(fd); fsync(fd); msync(mmap);
    overwrite_with_new_data(fd); fsync(fd); msync(mmap);
    remove_recovery_header(fd); fsync(fd); msync(mmap);

Yet we really only need ordering, not guarantees about it actually hitting disk before returning.

In other words, FLUSH can be more relaxed than BARRIER inside the kernel. It's ironic that we think of fsync as stronger than fbarrier outside the kernel :-)

It's an implementation detail; barrier has less flexibility because it has less information about what is required. I'm saying I want to give you as much information as I can, even if you don't use it yet.

I agree, and I've started a few threads about it over the last couple of years.

An fsync_range() system call would be very easy to use and (most importantly) easy to understand.

With optional flags to weaken it (into fdatasync, barrier without sync, sync without barrier, one-sided barrier, no low-level cache flush, don't rush, etc.), it would be very versatile, and still easy to understand.

With an AIO version, and another flag meaning "don't rush, just return when satisfied", I suspect it would be useful for the most demanding I/O apps.
-- Jamie ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
On Wed, 5 May 2010 14:28:41 +0930 Rusty Russell ru...@rustcorp.com.au wrote: On Wed, 5 May 2010 05:47:05 am Jamie Lokier wrote: Jens Axboe wrote: On Tue, May 04 2010, Rusty Russell wrote:

ISTR someone mentioning a desire for such an API years ago, so CC'ing the usual I/O suspects...

It would be nice to have a fuller API for this, but the reality is that only the flush approach is really workable. Even just strict ordering of requests could only be supported on SCSI, and even there the kernel still lacks proper guarantees on error handling to prevent reordering.

There are a few I/O scheduling differences that might be useful:

1. The I/O scheduler could freely move WRITEs before a FLUSH but not before a BARRIER. That might be useful for time-critical WRITEs, and those issued by high I/O priority.

This is only because no one actually wants flushes or barriers, though I/O people seem to only offer that. We really want "these writes must occur before this write". That offers maximum choice to the I/O subsystem and potentially to smart (virtual?) disks.

2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is only for data belonging to a particular file (e.g. fdatasync with no file size change, even on btrfs if O_DIRECT was used for the writes being committed). That would entail tagging FLUSHes and WRITEs with a fs-specific identifier (such as inode number), opaque to the scheduler which only checks equality.

This is closer. In userspace I'd be happy with an "all prior writes to this struct file before all future writes", even if the original guarantees were stronger (i.e. inode basis). We currently implement transactions using 4 fsync/msync pairs.

    write_recovery_data(fd); fsync(fd); msync(mmap);
    write_recovery_header(fd); fsync(fd); msync(mmap);
    overwrite_with_new_data(fd); fsync(fd); msync(mmap);
    remove_recovery_header(fd); fsync(fd); msync(mmap);

Seems over-zealous.
If the recovery_header held a strong checksum of the recovery_data you would not need the first fsync, and as long as you have two places to write recovery data, you don't need the 3rd and 4th syncs. Just:

    write_internally_checksummed_recovery_data_and_header_to_unused_log_space()
    fsync / msync
    overwrite_with_new_data()

To recover, you choose the most recent log_space and replay the content. That may be a redundant operation, but that is no loss.

I also cannot see the point of msync if you have already performed an fsync, and if there is a point, I would expect you to call msync before fsync... Maybe there is some subtlety there that I am not aware of.

Yet we really only need ordering, not guarantees about it actually hitting disk before returning.

In other words, FLUSH can be more relaxed than BARRIER inside the kernel. It's ironic that we think of fsync as stronger than fbarrier outside the kernel :-)

It's an implementation detail; barrier has less flexibility because it has less information about what is required. I'm saying I want to give you as much information as I can, even if you don't use it yet.

Only we know that approach doesn't work. People will learn that they don't need to give the extra information to still achieve the same result, just like they did with ext3 and fsync. Then when we improve the implementation to only provide the guarantees that you asked for, people will complain that they are getting empty files that they didn't expect.

The abstraction I would like to see is a simple 'barrier' that contains no data and has a filesystem-wide effect.

If a filesystem wanted a 'full' barrier such as the current BIO_RW_BARRIER, it would send an empty barrier, then the data, then another empty barrier. (However I suspect most filesystems don't really need barriers on both sides.) A low level driver might merge these together if the underlying hardware supported that combined operation (which I believe some do). I think this merging would be less complex than the current need to split a BIO_RW_BARRIER into the three separate operations when only a flush is possible (I know it would make md code a lot nicer :-).

I would probably expose this to user-space as extra flags to sync_file_range:

    SYNC_FILE_RANGE_BARRIER_BEFORE
    SYNC_FILE_RANGE_BARRIER_AFTER

This would make it clear that a barrier does *not* imply a sync; it only applies to data for which a sync has already been requested. So data that has already been 'synced' is stored strictly before data which has not yet been submitted with write() (or by changing a mmapped area). The barrier would still be filesystem wide in that if you SYNC_FILE_RANGE_WRITE one file, then SYNC_FILE_RANGE_BARRIER_BEFORE another file on the same filesystem, the pages scheduled in the first file would be affected by the barrier request on the second file.

Implementing
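Going back to the checksummed log suggested earlier in this message, the recovery side can be sketched roughly as follows. This is only an illustration under stated assumptions: the record layout, the names `log_record`, `log_record_seal`, `log_record_valid` and `log_pick_slot` are hypothetical and not taken from TDB, and FNV-1a stands in for whatever strong checksum a real implementation would choose. The point is that a torn or never-written slot fails its checksum, so recovery can simply pick the newest slot that validates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LOG_PAYLOAD_MAX 64

/* Hypothetical layout of one log slot; not TDB's on-disk format. */
struct log_record {
    uint64_t seq;                      /* increases with every transaction */
    uint32_t len;                      /* bytes used in payload */
    uint8_t  payload[LOG_PAYLOAD_MAX]; /* recovery data */
    uint64_t checksum;                 /* covers seq, len and payload */
};

/* FNV-1a: a placeholder, not a cryptographically strong checksum. */
static uint64_t fnv1a(uint64_t h, const void *buf, size_t n)
{
    const uint8_t *p = buf;
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

static uint64_t log_record_sum(const struct log_record *r)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    assert(r->len <= LOG_PAYLOAD_MAX); /* callers check len first */
    h = fnv1a(h, &r->seq, sizeof r->seq);
    h = fnv1a(h, &r->len, sizeof r->len);
    return fnv1a(h, r->payload, r->len);
}

/* Fill a slot in memory; returns 0 on success, -1 if the data does not fit.
 * The caller would then write the slot out and issue a single fsync. */
int log_record_seal(struct log_record *r, uint64_t seq,
                    const void *data, uint32_t len)
{
    if (len > LOG_PAYLOAD_MAX)
        return -1;
    memset(r, 0, sizeof *r);
    r->seq = seq;
    r->len = len;
    memcpy(r->payload, data, len);
    r->checksum = log_record_sum(r);
    return 0;
}

/* A slot is usable only if its stored checksum matches its contents. */
int log_record_valid(const struct log_record *r)
{
    return r->len <= LOG_PAYLOAD_MAX && r->checksum == log_record_sum(r);
}

/* Index (0 or 1) of the newest valid slot, or -1 if neither validates. */
int log_pick_slot(const struct log_record slots[2])
{
    int best = -1;
    for (int i = 0; i < 2; i++) {
        if (!log_record_valid(&slots[i]))
            continue;
        if (best < 0 || slots[i].seq > slots[best].seq)
            best = i;
    }
    return best;
}
```

On restart a caller would read both slots, call `log_pick_slot`, and replay the chosen payload; replaying a record whose data already reached the disk is redundant but harmless, as noted above.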
Re: [PATCH] virtio-spec: document block CMD and FLUSH
On 05/04/2010 07:38 AM, Rusty Russell wrote: On Fri, 19 Feb 2010 08:52:20 am Michael S. Tsirkin wrote:

I took a stab at documenting CMD and FLUSH request types in virtio block. Christoph, could you look over this please? I note that the interface seems full of warts to me, this might be a first step to cleaning them.

ISTR Christoph had withdrawn some patches in this area, and was waiting for him to resubmit?

I've given up on figuring out the block device. What seem to me to be sane semantics along the lines of memory barriers are foreign to disk people: they want (and depend on) flushing everywhere.

For example, tdb transactions do not require a flush, they only require what I would call a barrier: that prior data be written out before any future data. Surely that would be more efficient in general than a flush! In fact, TDB wants only writes to *that file* (and metadata) written out first; it has no ordering issues with other I/O on the same device.

I think that's SCSI ordered tags.

A generic I/O interface would allow you to specify "this request depends on these outstanding requests" and leave it at that. It might have some sync flush command for dumb applications and OSes. The userspace API might not be as precise and only allow such a barrier against all prior writes on this fd.

Depends on all previous requests, and will commit before all following requests. I.e. a full barrier.

ISTR someone mentioning a desire for such an API years ago, so CC'ing the usual I/O suspects...

I'd love to see TCQ exposed to user space.

--
error compiling committee.c: too many arguments to function
Re: [PATCH] virtio-spec: document block CMD and FLUSH
On Tue, May 04, 2010 at 02:08:24PM +0930, Rusty Russell wrote: On Fri, 19 Feb 2010 08:52:20 am Michael S. Tsirkin wrote:

I took a stab at documenting CMD and FLUSH request types in virtio block. Christoph, could you look over this please? I note that the interface seems full of warts to me, this might be a first step to cleaning them.

ISTR Christoph had withdrawn some patches in this area, and was waiting for him to resubmit?

Any patches I've withdrawn in this area are withdrawn for good. But what I really need to do is to review Michael's spec updates, sorry. I'll get back to it today.
Re: [PATCH] virtio-spec: document block CMD and FLUSH
On Wed, 5 May 2010 04:24:59 am Christoph Hellwig wrote: On Fri, Feb 19, 2010 at 12:22:20AM +0200, Michael S. Tsirkin wrote:

I took a stab at documenting CMD and FLUSH request types in virtio block. Christoph, could you look over this please? I note that the interface seems full of warts to me, this might be a first step to cleaning them.

The whole virtio-blk interface is full of warts. It has been extended rather ad-hoc, so that is rather expected.

One issue I struggled with especially is how the type field mixes bits and non-bit values. I ended up simply defining all legal values, so that we have CMD = 2, CMD_OUT = 3 and so on.

It's basically a complete mess without much logic behind it.

+\change_unchanged
+the high bit
+\change_inserted 0 1266497301
+ (VIRTIO_BLK_T_BARRIER)
+\change_unchanged
+ indicates that this request acts as a barrier and that all preceeding requests
+ must be complete before this one, and all following requests must not be
+ started until this is complete.
+
+\change_inserted 0 1266504385
+ Note that a barrier does not flush caches in the underlying backend device
+ in host, and thus does not serve as data consistency guarantee.
+ Driver must use FLUSH request to flush the host cache.
+\change_unchanged

I'm not sure it's even worth documenting it. I can't see any way to actually implement safe behaviour with the VIRTIO_BLK_T_BARRIER-style barriers.

Btw, did I mention that .lyx is a really horrible format to review diffs for? Plain latex would be a lot better..

Yeah, or export as text and post that diff for content review. I do like the patches to the lyx source though (I check all versions into revision control, before and after merging changes, which makes it easy to produce annotated versions).

Cheers,
Rusty.
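To make the "bits and non-bit values" complaint concrete, here is a minimal decoding sketch. The constant names are local to this example. CMD = 2, CMD_OUT = 3 and the barrier living in the high bit come from the thread; IN = 0, OUT = 1 and FLUSH = 4 are assumptions based on the Linux virtio_blk header and are not stated in this discussion. The low part of the field is an enumerated opcode, while the barrier is a flag OR-ed on top, so a decoder has to mask one before comparing the other.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Enumerated opcodes: these are plain values, not independent bits. */
enum {
    BLK_T_IN      = 0, /* assumed: read */
    BLK_T_OUT     = 1, /* assumed: write */
    BLK_T_CMD     = 2, /* from the thread */
    BLK_T_CMD_OUT = 3, /* from the thread */
    BLK_T_FLUSH   = 4, /* assumed */
};

/* The barrier is a flag in the high bit, OR-ed onto the opcode. */
#define BLK_T_BARRIER 0x80000000u

static_assert((BLK_T_BARRIER & BLK_T_FLUSH) == 0,
              "barrier flag must not overlap the opcode values");

struct blk_type {
    uint32_t op;      /* opcode with the barrier flag stripped */
    bool     barrier; /* true if the high bit was set */
    bool     known;   /* true if op is one of the values above */
};

/* Split a raw type field into its opcode and its barrier flag. */
struct blk_type blk_type_decode(uint32_t type)
{
    struct blk_type t;

    t.barrier = (type & BLK_T_BARRIER) != 0;
    t.op = type & ~BLK_T_BARRIER;
    switch (t.op) {
    case BLK_T_IN:
    case BLK_T_OUT:
    case BLK_T_CMD:
    case BLK_T_CMD_OUT:
    case BLK_T_FLUSH:
        t.known = true;
        break;
    default:
        t.known = false;
        break;
    }
    return t;
}
```

Note that CMD_OUT = 3 looks like CMD | OUT, which is presumably where the "mixes bits and non-bit values" confusion comes from; the sketch treats every legal value as an opaque enumerator, as the quoted text says the documentation ended up doing.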
Re: [Qemu-devel] Re: [PATCH] virtio-spec: document block CMD and FLUSH
On Wed, 5 May 2010 05:47:05 am Jamie Lokier wrote: Jens Axboe wrote: On Tue, May 04 2010, Rusty Russell wrote:

ISTR someone mentioning a desire for such an API years ago, so CC'ing the usual I/O suspects...

It would be nice to have a fuller API for this, but the reality is that only the flush approach is really workable. Even just strict ordering of requests could only be supported on SCSI, and even there the kernel still lacks proper guarantees on error handling to prevent reordering.

There are a few I/O scheduling differences that might be useful:

1. The I/O scheduler could freely move WRITEs before a FLUSH but not before a BARRIER. That might be useful for time-critical WRITEs, and those issued by high I/O priority.

This is only because no one actually wants flushes or barriers, though I/O people seem to only offer that. We really want "these writes must occur before this write". That offers maximum choice to the I/O subsystem and potentially to smart (virtual?) disks.

2. The I/O scheduler could move WRITEs after a FLUSH if the FLUSH is only for data belonging to a particular file (e.g. fdatasync with no file size change, even on btrfs if O_DIRECT was used for the writes being committed). That would entail tagging FLUSHes and WRITEs with a fs-specific identifier (such as inode number), opaque to the scheduler which only checks equality.

This is closer. In userspace I'd be happy with an "all prior writes to this struct file before all future writes", even if the original guarantees were stronger (i.e. inode basis). We currently implement transactions using 4 fsync/msync pairs.

    write_recovery_data(fd); fsync(fd); msync(mmap);
    write_recovery_header(fd); fsync(fd); msync(mmap);
    overwrite_with_new_data(fd); fsync(fd); msync(mmap);
    remove_recovery_header(fd); fsync(fd); msync(mmap);

Yet we really only need ordering, not guarantees about it actually hitting disk before returning.

In other words, FLUSH can be more relaxed than BARRIER inside the kernel. It's ironic that we think of fsync as stronger than fbarrier outside the kernel :-)

It's an implementation detail; barrier has less flexibility because it has less information about what is required. I'm saying I want to give you as much information as I can, even if you don't use it yet.

Thanks,
Rusty.
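For readers who want to see the four-sync sequence from this message spelled out, here is a rough, self-contained sketch. It is not TDB's code: the file layout (`HDR_OFF`, `REC_OFF`, `DATA_OFF`, `SLOT`) and the names `txn_open`, `put_synced` and `txn_overwrite` are invented for illustration, the recovery data is assumed to be the old contents of the slot, and the msync half of each pair is left out because the sketch does not mmap the file.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical file layout for this sketch only; not TDB's format. */
enum { HDR_OFF = 0, REC_OFF = 16, DATA_OFF = 64, SLOT = 32 };
static_assert(REC_OFF + SLOT <= DATA_OFF,
              "recovery area must not overlap the data");

/* Open (or create) the file the transaction works on. */
int txn_open(const char *path)
{
    return open(path, O_RDWR | O_CREAT, 0600);
}

/* Write n bytes at off, then wait until fsync reports them stable. */
static int put_synced(int fd, const void *buf, size_t n, off_t off)
{
    if (pwrite(fd, buf, n, off) != (ssize_t)n)
        return -1;
    return fsync(fd);
}

/* Replace the data slot with new_data; returns 0 on success, -1 on error. */
int txn_overwrite(int fd, const void *new_data, size_t n)
{
    char old[SLOT] = {0};
    const char armed = 1, clear = 0;

    if (n > SLOT || pread(fd, old, SLOT, DATA_OFF) < 0)
        return -1;
    if (put_synced(fd, old, SLOT, REC_OFF))      /* 1. recovery data */
        return -1;
    if (put_synced(fd, &armed, 1, HDR_OFF))      /* 2. recovery header */
        return -1;
    if (put_synced(fd, new_data, n, DATA_OFF))   /* 3. overwrite with new data */
        return -1;
    return put_synced(fd, &clear, 1, HDR_OFF);   /* 4. remove recovery header */
}
```

Each `fsync` here waits for the data to reach stable storage, which is stronger than what the transaction logically needs: steps 1 to 4 only have to become durable in that order. That gap is what the barrier discussion in this thread is about.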
Re: [PATCH] virtio-spec: document block CMD and FLUSH
On Fri, 19 Feb 2010 08:52:20 am Michael S. Tsirkin wrote:

I took a stab at documenting CMD and FLUSH request types in virtio block. Christoph, could you look over this please? I note that the interface seems full of warts to me, this might be a first step to cleaning them.

ISTR Christoph had withdrawn some patches in this area, and was waiting for him to resubmit?

I've given up on figuring out the block device. What seem to me to be sane semantics along the lines of memory barriers are foreign to disk people: they want (and depend on) flushing everywhere.

For example, tdb transactions do not require a flush, they only require what I would call a barrier: that prior data be written out before any future data. Surely that would be more efficient in general than a flush! In fact, TDB wants only writes to *that file* (and metadata) written out first; it has no ordering issues with other I/O on the same device.

A generic I/O interface would allow you to specify "this request depends on these outstanding requests" and leave it at that. It might have some sync flush command for dumb applications and OSes. The userspace API might not be as precise and only allow such a barrier against all prior writes on this fd.

ISTR someone mentioning a desire for such an API years ago, so CC'ing the usual I/O suspects...

Cheers,
Rusty.
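The "this request depends on these outstanding requests" idea from this message can be illustrated with a toy model. Nothing here corresponds to an existing kernel or virtio interface: `io_queue`, `io_submit`, `io_can_issue` and `io_complete` are hypothetical names, and the fixed-size arrays are only there to keep the sketch short. A request may be issued as soon as everything it names as a dependency has completed; unrelated requests stay free to be reordered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_REQS 16
#define MAX_DEPS 4

/* One queued request and the earlier requests it must wait for. */
struct io_req {
    bool   done;
    size_t ndeps;
    size_t deps[MAX_DEPS]; /* indices of earlier requests */
};

struct io_queue {
    size_t        n;
    struct io_req reqs[MAX_REQS];
};

/* Queue a request that depends on the given earlier requests.
 * Returns its index, or -1 if the queue is full or a dependency is
 * invalid. Only already-queued requests may be named, so no cycles. */
int io_submit(struct io_queue *q, const size_t *deps, size_t ndeps)
{
    struct io_req *r;

    if (q->n == MAX_REQS || ndeps > MAX_DEPS)
        return -1;
    for (size_t i = 0; i < ndeps; i++)
        if (deps[i] >= q->n)
            return -1;
    r = &q->reqs[q->n];
    r->done = false;
    r->ndeps = ndeps;
    for (size_t i = 0; i < ndeps; i++)
        r->deps[i] = deps[i];
    return (int)q->n++;
}

/* A request may go to the device once all of its dependencies are done. */
bool io_can_issue(const struct io_queue *q, size_t id)
{
    const struct io_req *r;

    assert(id < q->n);
    r = &q->reqs[id];
    if (r->done)
        return false;
    for (size_t i = 0; i < r->ndeps; i++)
        if (!q->reqs[r->deps[i]].done)
            return false;
    return true;
}

/* Mark a request as completed by the device. */
void io_complete(struct io_queue *q, size_t id)
{
    assert(id < q->n);
    q->reqs[id].done = true;
}
```

In this model, the coarser userspace form mentioned above ("a barrier against all prior writes on this fd") would be a request that lists every outstanding write on that fd as a dependency, and a flush becomes a special case rather than the only primitive.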
Re: [PATCH] virtio-spec: document block CMD and FLUSH
On Fri, Feb 19, 2010 at 12:22:20AM +0200, Michael S. Tsirkin wrote:

I took a stab at documenting CMD and FLUSH request types in virtio block.

Any comments?