Re: Proposal for proper durable fsync() and fdatasync()
Andrew Morton wrote: On Tue, 26 Feb 2008 07:26:50 + Jamie Lokier [EMAIL PROTECTED] wrote: (It would be nicer if sync_file_range() took a vector of ranges for better elevator scheduling, but let's ignore that :-) Two passes: Pass 1: shove each of the segments into the queue with SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE Pass 2: wait for them all to complete and return accumulated result with SYNC_FILE_RANGE_WAIT_AFTER Thanks. Seems ok, though being able to cork the I/O until the last one would be a bonus (like TCP_MORE... SYNC_FILE_RANGE_MORE?) I'm imagining I'd omit the SYNC_FILE_RANGE_WAIT_BEFORE. Is there a reason why you have it there? The man page isn't very enlightening. -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Proposal for proper durable fsync() and fdatasync()
On Tuesday 26 February 2008 18:59, Jamie Lokier wrote: Andrew Morton wrote: On Tue, 26 Feb 2008 07:26:50 + Jamie Lokier [EMAIL PROTECTED] wrote: (It would be nicer if sync_file_range() took a vector of ranges for better elevator scheduling, but let's ignore that :-) Two passes: Pass 1: shove each of the segments into the queue with SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE Pass 2: wait for them all to complete and return accumulated result with SYNC_FILE_RANGE_WAIT_AFTER Thanks. Seems ok, though being able to cork the I/O until the last one would be a bonus (like TCP_MORE... SYNC_FILE_RANGE_MORE?) I'm imagining I'd omit the SYNC_FILE_RANGE_WAIT_BEFORE. Is there a reason why you have it there? The man page isn't very enlightening. Yeah, sync_file_range has slightly unusual semantics and introduce the new concept, writeout, to userspace (does writeout include in drive cache? the kernel doesn't think so, but the only way to make sync_file_range safe is if you do consider it writeout). If it makes it any easier to understand, we can add in SYNC_FILE_ASYNC, SYNC_FILE_SYNC parts that just deal with safe/unsafe and sync/async semantics that is part of the normal POSIX api. Anyway, the idea of making fsync/fdatasync etc. safe by default is a good idea IMO, and is a bad bug that we don't do that :( - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Proposal for proper durable fsync() and fdatasync()
Jeff Garzik wrote: [snip huge long proposal] Rather than invent new APIs, we should fix the existing ones to _really_ flush data to physical media. Btw, one reason for the length is the current block request API isn't sufficient even to make fsync() durable with _no_ new APIs. It offers ordering barriers only, which aren't enough. I tried to explain, discuss some changes and then suggest optimisations. -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/37] Permit filesystem local caching
I need to respond to this in pieces... first the bit that is bugging me: * two new page flags I need to keep track of two bits of per-cached-page information: (1) This page is known by the cache, and that the cache must be informed if the page is going to go away. I still do not understand the life cycle of this bit. What does the cache do when it learns the page has gone away? How is it informed? Who owns the page cache in which such a page lives, the nfs client? Filesystem that hosts the page? A third page cache owned by the cache itself? (See my basic confusion about how many page cache levels you have, below.) Suppose one were to take a mundane approach to the persistent cache problem instead of layering filesystems. What you would do then is change NFS's -write_page and variants to fiddle the persistent cache as well as the network, instead of just the network as now. This fiddling could even consist of -write calls to another filesystem, though working directly with the bio interface would yield the fastest, and therefore to my mind, best result. In any case, you find out how to write the page to backing store by asking the filesystem, which in the naive approach would be nfs augmented with caching library calls. The filesystem keeps its own metadata around to know how to map the page to disk. So again naively, this metadata could tell the nfs client that the page is not mapped to disk at all. So I do not see what your per-page bit is for, obviously because I do not fully understand your caching scheme. Which I could eventually find out by reading all the patches but asking you is so much more fun :-) By the way, how many levels of page caching for the same data are there, is it: 1) nfs client 2) cache layer's own page cache 3) filesystem hosting the cache or just: 1) nfs client page cache 2) filesystem hosting the cache I think it is the second, but that is already double caching, which has got to hurt. Regards, Daniel - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 22/28] mm: add support for non block device backed swap files
Starting review in the middle, because this is the part I'm most familiar with. New addres_space_operations methods are added: int swapfile(struct address_space *, int); Separate -swapon() and -swapoff() methods would be so much cleaner IMO. Also is there a reason why 'struct file *' cannot be supplied to these functions? [snip] +int swap_set_page_dirty(struct page *page) +{ + struct swap_info_struct *sis = page_swap_info(page); + + if (sis-flags SWP_FILE) { + const struct address_space_operations *a_ops = + sis-swap_file-f_mapping-a_ops; + int (*spd)(struct page *) = a_ops-set_page_dirty; +#ifdef CONFIG_BLOCK + if (!spd) + spd = __set_page_dirty_buffers; +#endif This ifdef is not really needed. Just require -set_page_dirty() be filled in by filesystems which want swapfiles (and others too, in the longer term, the fallback is just historical crud). Here's an incremental patch addressing these issues and beautifying the new code. Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] Index: linux/mm/page_io.c === --- linux.orig/mm/page_io.c 2008-02-26 11:15:58.0 +0100 +++ linux/mm/page_io.c 2008-02-26 13:40:55.0 +0100 @@ -106,8 +106,10 @@ int swap_writepage(struct page *page, st } if (sis-flags SWP_FILE) { - ret = sis-swap_file-f_mapping- - a_ops-swap_out(sis-swap_file, page, wbc); + struct file *swap_file = sis-swap_file; + struct address_space *mapping = swap_file-f_mapping; + + ret = mapping-a_ops-swap_out(swap_file, page, wbc); if (!ret) count_vm_event(PSWPOUT); return ret; @@ -136,12 +138,13 @@ void swap_sync_page(struct page *page) struct swap_info_struct *sis = page_swap_info(page); if (sis-flags SWP_FILE) { - const struct address_space_operations *a_ops = - sis-swap_file-f_mapping-a_ops; - if (a_ops-sync_page) - a_ops-sync_page(page); - } else + struct address_space *mapping = sis-swap_file-f_mapping; + + if (mapping-a_ops-sync_page) + mapping-a_ops-sync_page(page); + } else { block_sync_page(page); + } } int swap_set_page_dirty(struct page *page) @@ -149,17 +152,12 @@ int swap_set_page_dirty(struct page *pag struct swap_info_struct *sis = page_swap_info(page); if (sis-flags SWP_FILE) { - const struct address_space_operations *a_ops = - sis-swap_file-f_mapping-a_ops; - int (*spd)(struct page *) = a_ops-set_page_dirty; -#ifdef CONFIG_BLOCK - if (!spd) - spd = __set_page_dirty_buffers; -#endif - return (*spd)(page); - } + struct address_space *mapping = sis-swap_file-f_mapping; - return __set_page_dirty_nobuffers(page); + return mapping-a_ops-set_page_dirty(page); + } else { + return __set_page_dirty_nobuffers(page); + } } int swap_readpage(struct file *file, struct page *page) @@ -172,8 +170,10 @@ int swap_readpage(struct file *file, str BUG_ON(PageUptodate(page)); if (sis-flags SWP_FILE) { - ret = sis-swap_file-f_mapping- - a_ops-swap_in(sis-swap_file, page); + struct file *swap_file = sis-swap_file; + struct address_space *mapping = swap_file-f_mapping; + + ret = mapping-a_ops-swap_in(swap_file, page); if (!ret) count_vm_event(PSWPIN); return ret; Index: linux/include/linux/fs.h === --- linux.orig/include/linux/fs.h 2008-02-26 11:15:58.0 +0100 +++ linux/include/linux/fs.h2008-02-26 13:29:40.0 +0100 @@ -485,7 +485,8 @@ struct address_space_operations { /* * swapfile support */ - int (*swapfile)(struct address_space *, int); + int (*swapon)(struct file *file); + int (*swapoff)(struct file *file); int (*swap_out)(struct file *file, struct page *page, struct writeback_control *wbc); int (*swap_in)(struct file *file, struct page *page); Index: linux/mm/swapfile.c === --- linux.orig/mm/swapfile.c2008-02-26 12:43:57.0 +0100 +++ linux/mm/swapfile.c 2008-02-26 13:34:57.0 +0100 @@ -1014,9 +1014,11 @@ static void destroy_swap_extents(struct } if (sis-flags SWP_FILE) { + struct file *swap_file = sis-swap_file; + struct address_space *mapping =
Re: [PATCH 22/28] mm: add support for non block device backed swap files
On Tue, 2008-02-26 at 13:45 +0100, Miklos Szeredi wrote: Starting review in the middle, because this is the part I'm most familiar with. New addres_space_operations methods are added: int swapfile(struct address_space *, int); Separate -swapon() and -swapoff() methods would be so much cleaner IMO. I'm ok with that, but its a_ops bloat, do we care about that? I guess since it has limited instances - typically one per filesystem - there is no issue here. Also is there a reason why 'struct file *' cannot be supplied to these functions? No real reason here. I guess its cleaner indeed. Thanks. +int swap_set_page_dirty(struct page *page) +{ + struct swap_info_struct *sis = page_swap_info(page); + + if (sis-flags SWP_FILE) { + const struct address_space_operations *a_ops = + sis-swap_file-f_mapping-a_ops; + int (*spd)(struct page *) = a_ops-set_page_dirty; +#ifdef CONFIG_BLOCK + if (!spd) + spd = __set_page_dirty_buffers; +#endif This ifdef is not really needed. Just require -set_page_dirty() be filled in by filesystems which want swapfiles (and others too, in the longer term, the fallback is just historical crud). Agreed. This is a good motivation to clean up that stuff. Here's an incremental patch addressing these issues and beautifying the new code. Thanks, I'll fold it into the patch and update the documentation. I'll put your creds in akpm style. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Proposal for proper durable fsync() and fdatasync()
On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote: Yeah, sync_file_range has slightly unusual semantics and introduce the new concept, writeout, to userspace (does writeout include in drive cache? the kernel doesn't think so, but the only way to make sync_file_range safe is if you do consider it writeout). If sync_file_range isn't safe, it should get replaced by a noop implementation. There really is no point in promising a little safety. One interesting aspect of this comes with COW filesystems like btrfs or logfs. Writing out data pages is not sufficient, because those will get lost unless their referencing metadata is written as well. So either we have to call fsync for those filesystems or add another callback and let filesystems override the default implementation. Jörn -- There is no worse hell than that provided by the regrets for wasted opportunities. -- Andre-Louis Moreau in Scarabouche - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/37] Permit filesystem local caching
Daniel Phillips [EMAIL PROTECTED] wrote: I need to respond to this in pieces... first the bit that is bugging me: * two new page flags I need to keep track of two bits of per-cached-page information: (1) This page is known by the cache, and that the cache must be informed if the page is going to go away. I still do not understand the life cycle of this bit. What does the cache do when it learns the page has gone away? That's up to the cache. CacheFS, for example, unpins some resources when all the pages managed by a pointer block are taken away from it. The cache may also reserve a block on disk to back this page, and that reservation may then be discarded by the netfs uncaching the page. The cache may also speculatively take copies of the page if the machine is idle. Documentation/filesystems/caching/netfs-api.txt describes the caching API as a process, including the presentation of netfs pages to the cache and their uncaching. How is it informed? [Documentation/filesystems/caching/netfs-api.txt] == PAGE UNCACHING == To uncache a page, this function should be called: void fscache_uncache_page(struct fscache_cookie *cookie, struct page *page); This function permits the cache to release any in-memory representation it might be holding for this netfs page. This function must be called once for each page on which the read or write page functions above have been called to make sure the cache's in-memory tracking information gets torn down. Note that pages can't be explicitly deleted from the data file. The whole data file must be retired (see the relinquish cookie function below). Furthermore, note that this does not cancel the asynchronous read or write operation started by the read/alloc and write functions. [/] Who owns the page cache in which such a page lives, the nfs client? Filesystem that hosts the page? A third page cache owned by the cache itself? (See my basic confusion about how many page cache levels you have, below.) [Documentation/filesystems/caching/fscache.txt] (7) Data I/O is done direct to and from the netfs's pages. The netfs indicates that page A is at index B of the data-file represented by cookie C, and that it should be read or written. The cache backend may or may not start I/O on that page, but if it does, a netfs callback will be invoked to indicate completion. The I/O may be either synchronous or asynchronous. [/] I should perhaps make the documentation more explicit: the pages passed to the routines defined in include/linux/fscache.h are netfs pages, normally belonging the pagecache of the appropriate netfs inode. This is, however, mentioned in the function banner comments in fscache.h. Suppose one were to take a mundane approach to the persistent cache problem instead of layering filesystems. What you would do then is change NFS's -write_page and variants to fiddle the persistent cache It is a requirement laid down by the Linux NFS fs maintainers that the writes to the cache be asynchronous, even if the writes to NFS aren't. Note further that NFS's write_page() != writing to the cache. Writing to the cache is typically done by NFS's readpages(). Besides, at the moment, caching is suppressed for any NFS file opened for writing due to coherency issues. This is something to be revisited later. as well as the network, instead of just the network as now. Not as now. See above. This fiddling could even consist of -write calls to another filesystem, though working directly with the bio interface would yield the fastest, and therefore to my mind, best result. You can't necessarily access the BIO interface, and even if you can, the cache is still a filesystem. Essentially, what cachefiles does is to do what you say: to perform -write calls on another filesystem. FS-Cache also protects the netfs against (a) there being no cache, (b) the cache suffering a fatal I/O error and (c) the cache being removed; and protects the cache against (d) the netfs uncaching pages that the cache is using and (e) conflicting operations from the netfs, some of which may be queued for asynchronous processing. FS-Cache also groups asynchronous netfs store requests together, which hopefully, one day, I'll be able to pass on to the backing fs. In any case, you find out how to write the page to backing store by asking the filesystem, which in the naive approach would be nfs augmented with caching library calls. NFS and AFS and CIFS and ISOFS, but yes, that's what fscache is, if you like, a caching library. The filesystem keeps its own metadata around to know how to map the page to disk. So again naively, this metadata could tell the nfs client that the page is not mapped to disk at all. The netfs should _not_ know about the metadata of a backing fs. Firstly, there are many different potential backing filesystems, and secondly if
Re: Proposal for proper durable fsync() and fdatasync()
Jörn Engel wrote: On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote: Yeah, sync_file_range has slightly unusual semantics and introduce the new concept, writeout, to userspace (does writeout include in drive cache? the kernel doesn't think so, but the only way to make sync_file_range safe is if you do consider it writeout). If sync_file_range isn't safe, it should get replaced by a noop implementation. There really is no point in promising a little safety. One interesting aspect of this comes with COW filesystems like btrfs or logfs. Writing out data pages is not sufficient, because those will get lost unless their referencing metadata is written as well. So either we have to call fsync for those filesystems or add another callback and let filesystems override the default implementation. fdatasync() is required to write data pages _and_ the necessary metadata to reference those changed pages (btrfs tree etc.), but not non-data metadata. It's the filesystem's responsibility to interpret that correctly. In-place writes don't need anything else. Phase-tree style writes do. Some kinds of logged writes don't. I'm under the impression that sync_file_range() is a sort of restricted-range asynchronous fdatasync(). By limiting the range of file date which must be written out, it becomes more refined for database and filesystem-in-a-file type applications. Just as fsync() is more refined than sync() - it's useful to sync less - same goes for syncing just part of a file. It's still the filesystem's responsibility to sync data access metadata appropriately. It can sync more if it wants, but not less. That's what I understand by sync_file_range(fd, start,length, SYNC_FILE_RANGE_WRITE_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WRITE_AFTER); Largely because the manual says to use that combination of flags for an equivalent to fdatasync(). The concept of write-out is not defined in the manual. I'm assuming it to mean this, as a reasonable guess: SYNC_FILE_RANGE_WRITE scans all pages in the range, looking for dirty pages which aren't already queued for write-out. It marks those with a write-out flag, and starts write I/Os at some unspecified time in the near future; it can be assumed writes for all the pages will complete eventually if there's no errors. When I/O completes on a page, it cleans the page and also clears the write-out flag. SYNC_FILE_RANGE_WAIT_AFTER waits until all pages in the range don't have the write-out flag set. SYNC_FILE_RANGE_WAIT_BEFORE does the same wait, but before marking pages for write-out. I don't actually see the point in this. Isn't a preceding call with SYNC_FILE_RANGE_WAIT_AFTER equivalent, making BEFORE a redundant flag? The manual says it is something to do with data-integrity, but it's not clear to me what that means. All this implies that write-out flag is a concept userspace can rely on. That's not so peculiar: WRITE seems to be equivalent to AIO-style fdatasync() on a limited range of offsets, and WAIT_AFTER seems to be equivalent to waiting for any previously issued such ops to complete. Any data access metadata updates that btrfs must make for fdatasync(), it must also make for sync_file_range(), for the limited range of offsets. -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Proposal for proper durable fsync() and fdatasync()
Jeff Garzik wrote: Jamie Lokier wrote: By durable, I mean that fsync() should actually commit writes to physical stable storage, Yes, it should. I was surprised that fsync() doesn't do this already. There was a lot of effort put into block I/O write barriers during 2.5, so that journalling filesystems can force correct write ordering, using disk flush cache commands. After all that effort, I was very surprised to notice that Linux 2.6.x doesn't use that capability to ensure fsync() flushes the disk cache onto stable storage. It's surprising you are surprised, given that this [lame] fsync behavior has remaining consistently lame throughout Linux's history. Maybe I am confused, but isn't this is what fsync() does today whenever barriers are enabled (the fsync() invalidates the drive's write cache). ric - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Proposal for proper durable fsync() and fdatasync()
Jörn Engel wrote: On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote: Yeah, sync_file_range has slightly unusual semantics and introduce the new concept, writeout, to userspace (does writeout include in drive cache? the kernel doesn't think so, but the only way to make sync_file_range safe is if you do consider it writeout). If sync_file_range isn't safe, it should get replaced by a noop implementation. There really is no point in promising a little safety. Sometimes there is a point in a little safety. There's a spectrum of durability (meaning how safely stored the data is). In the cases we're imagining, it's application - main memory cache - disk cache - disk surface. There are others. _None_ of those provide perfect safety for your data. They are a spectrum, and how far along you want data to be committed before you say fine, the data is safe enough for me depends on your application. For example, there are users who like to turn _off_ fdatasync() with their SQL database of choice. They prefer speed over safety, and they don't mind losing an hour's data and doing regular backups (we assume ;-) Some blogs fall into this category; who cares if a rare crash costs you a comment or two and a restore from backup; it's acceptable for the speed. There's users who would really like fdatasync() to commit data to the drive platters, so after their database says done, they are very confident that a power failure won't cause committed data to be lost. Accepting credit cards is more at this end. So should be anyone using a virtual machine of any kind without a journalling fs in the guest! And there's users who like it where it is right now: a compromise, where a system crash won't lose committed data; but a power failure might. (I'm making assumptions about drive behaviour on reset here.) My problem with fdatasync() at the moment is, I can't choose what I want from it, and there's no mechanism to give me the safest option. Most annoyingly, in-kernel filesystems _do_ have a mechanism; it just isn't exported to userspace. (A quick aside: fdatasync() et al. are actually used for two _different_ things. 1: A program says I've written it, it can say so with confidence, e.g. announcing email receipt. 2: It's used for write ordering with write-ahead logging: write, fdatasync, write. When you tease at the details, efficient implementations of them are different... Think SCSI tagged commands versus cache flushes.) One interesting aspect of this comes with COW filesystems like btrfs or logfs. Writing out data pages is not sufficient, because those will get lost unless their referencing metadata is written as well. So either we have to call fsync for those filesystems or add another callback and let filesystems override the default implementation. Doesn't the -fsync callback get called in the sys_fdatasync() case, with appropriate arguments? With barriers/flushes it certainly makes those a bit more complicated. You have to flush not just the disks with data pages, but the _other_ disks in a software RAID with data pointer metadata pages, but ideally not all of them (think database journal commit). That can be implemented with per-buffer pending-barrier/flush flags (like I described for pages in the first mail), which are equally useful when a database-like application uses a block device. -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Proposal for proper durable fsync() and fdatasync()
Ric Wheeler wrote: I was surprised that fsync() doesn't do this already. There was a lot of effort put into block I/O write barriers during 2.5, so that journalling filesystems can force correct write ordering, using disk flush cache commands. After all that effort, I was very surprised to notice that Linux 2.6.x doesn't use that capability to ensure fsync() flushes the disk cache onto stable storage. It's surprising you are surprised, given that this [lame] fsync behavior has remaining consistently lame throughout Linux's history. Maybe I am confused, but isn't this is what fsync() does today whenever barriers are enabled (the fsync() invalidates the drive's write cache). No, fsync() doesn't always flush the drive's write cache. It often does, any I think many people are under the impression it always does, but it doesn't. Try this code on ext3: fd = open (test_file, O_RDWR | O_CREAT | O_TRUNC, 0666); while (1) { char byte; usleep (10); pwrite (fd, byte, 1, 0); fsync (fd); } It will do just over 10 write ops per second on an idle system (13 on mine), and 1 flush op per second. That's because ext3 fsync() only does a journal commit when the inode has changed. The inode mtime is changed by write only with 1 second granularity. Without a journal commit, there's no barrier, which translates to not flushing disk write cache. If you add fchmod (fd, 0644); fchmod (fd, 0664); between the write and fsync, you'll see at least 20 write ops and 20 flush ops per second, and you'll here the disk seeking more. That's because the fchmod dirties the inode, so fsync() writes the inode with a journal commit. It turns out even _that_ is not sufficient according to the kernel internals. A journal commit uses an ordered request, which isn't the same as a flush potentially, it just happens to use flush in this instance. I'm not sure if ordered requests are actually implemented by any drivers at the moment. If not now, they will be one day. We could change ext3 fsync() to always do a journal commit, and depend on the non-existence of block drivers which do ordered (not flush) barrier requests. But there's lots of things wrong with that. Not least, it sucks performance for database-like applications and virtual machines, a lot due to unnecessary seeks. That way lies wrongness. Rightness is to make fdatasync() work well, with a genuine flush (or equivalent (see FUA), only when required, and not a mere ordered barrier), no inode write, and to make sync_file_range()[*] offer the fancier applications finer controls which reflect what they actually need. [*] - or whatever. -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Proposal for proper durable fsync() and fdatasync()
On Tue, 26 Feb 2008 15:07:45 + Jamie Lokier [EMAIL PROTECTED] wrote: SYNC_FILE_RANGE_WRITE scans all pages in the range, looking for dirty pages which aren't already queued for write-out. It marks those with a write-out flag, and starts write I/Os at some unspecified time in the near future; it can be assumed writes for all the pages will complete eventually if there's no errors. When I/O completes on a page, it cleans the page and also clears the write-out flag. SYNC_FILE_RANGE_WAIT_AFTER waits until all pages in the range don't have the write-out flag set. SYNC_FILE_RANGE_WAIT_BEFORE does the same wait, but before marking pages for write-out. I don't actually see the point in this. Isn't a preceding call with SYNC_FILE_RANGE_WAIT_AFTER equivalent, making BEFORE a redundant flag? Consider the case of pages which are dirty but are already under writeout. ie: someone redirtied the page after someone started writing the page out. For these pages the kernel needs to a) wait for the current writeout to complete b) start new writeout c) wait for that writeout to complete. those are the three stages of sync_file_range(). They are independently selectable and various combinations provide various results. The reason for providing b) only (SYNC_FILE_RANGE_WRITE) is so that userspace can get as much data into the queue as possible, to permit the kernel to optimise IO scheduling better. If you perform a) and b) together (SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE) then you are guaranteed that all data which was dirty when sync_file_range() executed will be sent into the queue, but you won't get as much data into the queue if the kernel encounters dirty, under-writeout pages. This is especially hurtful if you're trying to feed a lot of little segments into the queue. In that case perhaps userspace should do an asynchrnous pass (SYNC_FILE_RANGE_WRITE) to stuff as much data as poss into the queue, then a SYNC_FILE_RANGE_WAIT_AFTER pass then a SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER pass to clean up any stragglers. WHich mode is best very much depends on the application's file dirtying patterns. One would have to experiment with it, and tuning of sync_file_range() usage would occur alongside tuning of the application's write() design. It's an interesting problem, with potentially high payback. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Proposal for proper durable fsync() and fdatasync()
Nick Piggin wrote: Anyway, the idea of making fsync/fdatasync etc. safe by default is a good idea IMO, and is a bad bug that we don't do that :( Agreed... it's also disappointing that [unless I'm mistaken] you have to hack each filesystem to support barriers. It seems far easier to make sync_blkdev() Do The Right Thing, and magically make all filesystems data-safe. Jeff - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] ext3 freeze feature ver 0.2
Takashi Sato wrote: o Elevate XFS ioctl numbers (XFS_IOC_FREEZE and XFS_IOC_THAW) to the VFS As Andreas Dilger and Christoph Hellwig advised me, I have elevated them to include/linux/fs.h as below. #define FIFREEZE_IOWR('X', 119, int) #define FITHAW _IOWR('X', 120, int) The ioctl numbers used by XFS applications don't need to be changed. But my following ioctl for the freeze needs the parameter as the timeout period. So if XFS applications don't want the timeout feature as the current implementation, the parameter needs to be changed 1 (level?) into 0. So, existing xfs applications calling the xfs ioctl now will behave differently, right? We can only keep the same ioctl number if the calling semantics are the same. Keeping the same number but changing the semantics is harmful, IMHO -Eric - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Proposal for proper durable fsync() and fdatasync()
Jeff Garzik wrote: Nick Piggin wrote: Anyway, the idea of making fsync/fdatasync etc. safe by default is a good idea IMO, and is a bad bug that we don't do that :( Agreed... it's also disappointing that [unless I'm mistaken] you have to hack each filesystem to support barriers. It seems far easier to make sync_blkdev() Do The Right Thing, and magically make all filesystems data-safe. Well, you need ordered metadata writes, barriers _and_ flushes with some filesystems. Merely writing all the data pages than issuing a drive cache flush won't Do The Right Thing with those filesystems - someone already mentioned Btrfs, where it won't. But I agree that your suggestion would make a superb default, for filesystems which don't provide their own function. It's not optimal even then. Devices: On a software RAID, you ideally don't want to issue flushes to all drives if your database did a 1 block commit entry. (But they probably use O_DIRECT anyway, changing the rules again). But all that can be optimised in generic VFS code eventually. It doesn't need filesystem assistance in most cases. Apps: don't always want a full flush; sometimes a barrier would do. -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Proposal for proper durable fsync() and fdatasync()
On Tue, 26 February 2008 15:28:10 +, Jamie Lokier wrote: One interesting aspect of this comes with COW filesystems like btrfs or logfs. Writing out data pages is not sufficient, because those will get lost unless their referencing metadata is written as well. So either we have to call fsync for those filesystems or add another callback and let filesystems override the default implementation. Doesn't the -fsync callback get called in the sys_fdatasync() case, with appropriate arguments? My paragraph above was aimed at the sync_file_range() case. fsync and fdatasync do the right thing within the limitations you brought up in this thread. sync_file_range() without further changes will only write data pages, not the metadata required to actually access those data pages. This works just fine for non-COW filesystems, which covers all currently merged ones. With COW filesystems it is currently impossible to do sync_file_range() properly. The problem is orthogonal to your's, I just brought it up since you were already mentioning sync_file_range(). Jörn -- Joern's library part 10: http://blogs.msdn.com/David_Gristwood/archive/2004/06/24/164849.aspx - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] ext3 freeze feature ver 0.2
On Feb 26, 2008 08:39 -0800, Eric Sandeen wrote: Takashi Sato wrote: o Elevate XFS ioctl numbers (XFS_IOC_FREEZE and XFS_IOC_THAW) to the VFS As Andreas Dilger and Christoph Hellwig advised me, I have elevated them to include/linux/fs.h as below. #define FIFREEZE_IOWR('X', 119, int) #define FITHAW _IOWR('X', 120, int) The ioctl numbers used by XFS applications don't need to be changed. But my following ioctl for the freeze needs the parameter as the timeout period. So if XFS applications don't want the timeout feature as the current implementation, the parameter needs to be changed 1 (level?) into 0. So, existing xfs applications calling the xfs ioctl now will behave differently, right? We can only keep the same ioctl number if the calling semantics are the same. Keeping the same number but changing the semantics is harmful, IMHO Do we know what this parameter was supposed to mean? We could special case 1 if needed to keep compatibility (documenting this clearly), either making it == 0, or some very long timeout (1h or whatever). A relatively minor wart I think. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Proposal for proper durable fsync() and fdatasync()
Jörn Engel wrote: On Tue, 26 February 2008 15:28:10 +, Jamie Lokier wrote: One interesting aspect of this comes with COW filesystems like btrfs or logfs. Writing out data pages is not sufficient, because those will get lost unless their referencing metadata is written as well. So either we have to call fsync for those filesystems or add another callback and let filesystems override the default implementation. Doesn't the -fsync callback get called in the sys_fdatasync() case, with appropriate arguments? My paragraph above was aimed at the sync_file_range() case. fsync and fdatasync do the right thing within the limitations you brought up in this thread. sync_file_range() without further changes will only write data pages, not the metadata required to actually access those data pages. This works just fine for non-COW filesystems, which covers all currently merged ones. With COW filesystems it is currently impossible to do sync_file_range() properly. The problem is orthogonal to your's, I just brought it up since you were already mentioning sync_file_range(). You're right. Though, doesn't normal page writeback enqueue the COW metadata changes? If not, how do they get written in a timely fashion? -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Proposal for proper durable fsync() and fdatasync()
On Tue, 26 February 2008 17:29:13 +, Jamie Lokier wrote: You're right. Though, doesn't normal page writeback enqueue the COW metadata changes? If not, how do they get written in a timely fashion? It does. But this is not sufficient to guarantee that the pages in question have been safely committed to the device by the time sync_file_range() has returned. Jörn -- Joern's library part 5: http://www.faqs.org/faqs/compression-faq/part2/section-9.html - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Proposal for proper durable fsync() and fdatasync()
Jamie Lokier wrote: Jeff Garzik wrote: Nick Piggin wrote: Anyway, the idea of making fsync/fdatasync etc. safe by default is a good idea IMO, and is a bad bug that we don't do that :( Agreed... it's also disappointing that [unless I'm mistaken] you have to hack each filesystem to support barriers. It seems far easier to make sync_blkdev() Do The Right Thing, and magically make all filesystems data-safe. Well, you need ordered metadata writes, barriers _and_ flushes with some filesystems. Merely writing all the data pages than issuing a drive cache flush won't Do The Right Thing with those filesystems - someone already mentioned Btrfs, where it won't. Oh certainly. That's why we have a VFS :) fsync for NFS will look quite different, too. But I agree that your suggestion would make a superb default, for filesystems which don't provide their own function. Yep. That would immediately cover a bunch of filesystems. It's not optimal even then. Devices: On a software RAID, you ideally don't want to issue flushes to all drives if your database did a 1 block commit entry. (But they probably use O_DIRECT anyway, changing the rules again). But all that can be optimised in generic VFS code eventually. It doesn't need filesystem assistance in most cases. My own idea is that we create a FLUSH command for blkdev request queues, to exist alongside READ, WRITE, and the current barrier implementation. Then FLUSH could be passed down through MD or DM. Jeff - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/37] Permit filesystem local caching
On Tuesday 26 February 2008 06:33, David Howells wrote: Suppose one were to take a mundane approach to the persistent cache problem instead of layering filesystems. What you would do then is change NFS's -write_page and variants to fiddle the persistent cache It is a requirement laid down by the Linux NFS fs maintainers that the writes to the cache be asynchronous, even if the writes to NFS aren't. As it happens, I will be hanging out for the next few days with said NFS maintainers, it would help to be as informed as possible about your patch set. Note further that NFS's write_page() != writing to the cache. Writing to the cache is typically done by NFS's readpages(). Yes, of course. But also by -write_page no? Which I could eventually find out by reading all the patches but asking you is so much more fun :-) And a waste of my time. I've provided documentation in the main FS-Cache patch, both as text files and in comments in header files that answer your questions. Please read them first. 37 Patches, none of which has Documentation in the subject line, and you did not provide a diffstat in patch 0 for the patch set as a whole. If I had known it was there of course I would have read it. It is great to see this level of documentation. But I do not think it is fair to blame your (one) reader for missing it. See the smiley above? The _real_ reason I am asking you is that I do not think anybody understands your patch set, in spite of your considerable efforts to address that. Discussion in public, right or wrong, is the only way to fix that. It is counterproductive to drive readers away from the discussion for fear that they may miss some point obvious to the original author, or perhaps already discussed earlier on lkml, and get flamed for it. Obviously, the patch set is not going to be perfect when it goes in and it would be a silly abuse of the open source process to require that, but the parts where it touches the rest of the system have to be really well understood, and it is clear from the level of participation in the thread that they are not. One bit that already came out of this, which you have alluded to several times yourself but somehow seem to keep glossing over, is that you need a -direct_bio file operations method. So does loopback mount. It might be worth putting some effort into seeing how -direct_IO can be refactored to make that happen. You can get it in separately on the basis of helping loopback, and it will make your patches nicer. Daniel - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html