Re: Proposal for proper durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Andrew Morton wrote:
 On Tue, 26 Feb 2008 07:26:50 + Jamie Lokier [EMAIL PROTECTED] wrote:
 
  (It would be nicer if sync_file_range()
  took a vector of ranges for better elevator scheduling, but let's
  ignore that :-)
 
 Two passes:
 
 Pass 1: shove each of the segments into the queue with
 SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE
 
 Pass 2: wait for them all to complete and return accumulated result
 with SYNC_FILE_RANGE_WAIT_AFTER

Thanks.

Seems ok, though being able to cork the I/O until the last one would
be a bonus (like TCP_MORE...  SYNC_FILE_RANGE_MORE?)

I'm imagining I'd omit the SYNC_FILE_RANGE_WAIT_BEFORE.  Is there a
reason why you have it there?  The man page isn't very enlightening.

-- Jamie


Re: Proposal for proper durable fsync() and fdatasync()

2008-02-26 Thread Nick Piggin
On Tuesday 26 February 2008 18:59, Jamie Lokier wrote:
 Andrew Morton wrote:
  On Tue, 26 Feb 2008 07:26:50 + Jamie Lokier [EMAIL PROTECTED] 
wrote:
   (It would be nicer if sync_file_range()
   took a vector of ranges for better elevator scheduling, but let's
   ignore that :-)
 
  Two passes:
 
  Pass 1: shove each of the segments into the queue with
  SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE
 
  Pass 2: wait for them all to complete and return accumulated result
  with SYNC_FILE_RANGE_WAIT_AFTER

 Thanks.

 Seems ok, though being able to cork the I/O until the last one would
 be a bonus (like TCP_MORE...  SYNC_FILE_RANGE_MORE?)

 I'm imagining I'd omit the SYNC_FILE_RANGE_WAIT_BEFORE.  Is there a
 reason why you have it there?  The man page isn't very enlightening.


Yeah, sync_file_range has slightly unusual semantics and introduces a
new concept, writeout, to userspace (does writeout include data sitting
in the drive cache? the kernel doesn't think so, but the only way to
make sync_file_range safe is if you do consider it writeout).

If it makes it any easier to understand, we can add in
SYNC_FILE_ASYNC and SYNC_FILE_SYNC parts that just deal with the
safe/unsafe and sync/async semantics that are part of the normal
POSIX API.

Anyway, making fsync/fdatasync etc. safe by default is a good idea
IMO, and it is a bad bug that we don't do that :(



Re: Proposal for proper durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Jeff Garzik wrote:
 [snip huge long proposal]
 
 Rather than invent new APIs, we should fix the existing ones to _really_ 
 flush data to physical media.

Btw, one reason for the length is the current block request API isn't
sufficient even to make fsync() durable with _no_ new APIs.

It offers ordering barriers only, which aren't enough.  I tried to
explain, discuss some changes and then suggest optimisations.

-- Jamie


Re: [PATCH 00/37] Permit filesystem local caching

2008-02-26 Thread Daniel Phillips
I need to respond to this in pieces... first the bit that is bugging
me:

* two new page flags
 
 I need to keep track of two bits of per-cached-page information:
 
  (1) This page is known by the cache, and that the cache must be informed if
  the page is going to go away.

I still do not understand the life cycle of this bit.  What does the
cache do when it learns the page has gone away?  How is it informed?
Who owns the page cache in which such a page lives, the nfs client?
Filesystem that hosts the page?  A third page cache owned by the
cache itself?  (See my basic confusion about how many page cache
levels you have, below.)

Suppose one were to take a mundane approach to the persistent cache
problem instead of layering filesystems.  What you would do then is
change NFS's ->write_page and variants to fiddle the persistent
cache as well as the network, instead of just the network as now.
This fiddling could even consist of ->write calls to another
filesystem, though working directly with the bio interface would
yield the fastest, and therefore to my mind, best result.

In any case, you find out how to write the page to backing store by
asking the filesystem, which in the naive approach would be nfs
augmented with caching library calls.  The filesystem keeps its own
metadata around to know how to map the page to disk.  So again
naively, this metadata could tell the nfs client that the page is
not mapped to disk at all.  So I do not see what your per-page bit
is for, obviously because I do not fully understand your caching
scheme.  Which I could eventually find out by reading all the
patches but asking you is so much more fun :-)

By the way, how many levels of page caching for the same data are
there, is it:

  1) nfs client
  2) cache layer's own page cache
  3) filesystem hosting the cache

or just:

  1) nfs client page cache
  2) filesystem hosting the cache

I think it is the second, but that is already double caching, which
has got to hurt.

Regards,

Daniel


Re: [PATCH 22/28] mm: add support for non block device backed swap files

2008-02-26 Thread Miklos Szeredi
Starting review in the middle, because this is the part I'm most
familiar with.

 New address_space_operations methods are added:
   int swapfile(struct address_space *, int);

Separate ->swapon() and ->swapoff() methods would be so much cleaner IMO.

Also is there a reason why 'struct file *' cannot be supplied to these
functions?

[snip]

 +int swap_set_page_dirty(struct page *page)
 +{
 + struct swap_info_struct *sis = page_swap_info(page);
 +
 + if (sis->flags & SWP_FILE) {
 + const struct address_space_operations *a_ops =
 + sis->swap_file->f_mapping->a_ops;
 + int (*spd)(struct page *) = a_ops->set_page_dirty;
 +#ifdef CONFIG_BLOCK
 + if (!spd)
 + spd = __set_page_dirty_buffers;
 +#endif

This ifdef is not really needed.  Just require ->set_page_dirty() be
filled in by filesystems which want swapfiles (and others too, in the
longer term, the fallback is just historical crud).

Here's an incremental patch addressing these issues and beautifying
the new code.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]

Index: linux/mm/page_io.c
===
--- linux.orig/mm/page_io.c 2008-02-26 11:15:58.0 +0100
+++ linux/mm/page_io.c  2008-02-26 13:40:55.0 +0100
@@ -106,8 +106,10 @@ int swap_writepage(struct page *page, st
}
 
if (sis->flags & SWP_FILE) {
-   ret = sis->swap_file->f_mapping->
-   a_ops->swap_out(sis->swap_file, page, wbc);
+   struct file *swap_file = sis->swap_file;
+   struct address_space *mapping = swap_file->f_mapping;
+
+   ret = mapping->a_ops->swap_out(swap_file, page, wbc);
if (!ret)
count_vm_event(PSWPOUT);
return ret;
@@ -136,12 +138,13 @@ void swap_sync_page(struct page *page)
struct swap_info_struct *sis = page_swap_info(page);
 
if (sis->flags & SWP_FILE) {
-   const struct address_space_operations *a_ops =
-   sis->swap_file->f_mapping->a_ops;
-   if (a_ops->sync_page)
-   a_ops->sync_page(page);
-   } else
+   struct address_space *mapping = sis->swap_file->f_mapping;
+
+   if (mapping->a_ops->sync_page)
+   mapping->a_ops->sync_page(page);
+   } else {
block_sync_page(page);
+   }
 }
 
 int swap_set_page_dirty(struct page *page)
@@ -149,17 +152,12 @@ int swap_set_page_dirty(struct page *pag
struct swap_info_struct *sis = page_swap_info(page);
 
if (sis->flags & SWP_FILE) {
-   const struct address_space_operations *a_ops =
-   sis->swap_file->f_mapping->a_ops;
-   int (*spd)(struct page *) = a_ops->set_page_dirty;
-#ifdef CONFIG_BLOCK
-   if (!spd)
-   spd = __set_page_dirty_buffers;
-#endif
-   return (*spd)(page);
-   }
+   struct address_space *mapping = sis->swap_file->f_mapping;

-   return __set_page_dirty_nobuffers(page);
+   return mapping->a_ops->set_page_dirty(page);
+   } else {
+   return __set_page_dirty_nobuffers(page);
+   }
 }
 
 int swap_readpage(struct file *file, struct page *page)
@@ -172,8 +170,10 @@ int swap_readpage(struct file *file, str
BUG_ON(PageUptodate(page));
 
if (sis->flags & SWP_FILE) {
-   ret = sis->swap_file->f_mapping->
-   a_ops->swap_in(sis->swap_file, page);
+   struct file *swap_file = sis->swap_file;
+   struct address_space *mapping = swap_file->f_mapping;
+
+   ret = mapping->a_ops->swap_in(swap_file, page);
if (!ret)
count_vm_event(PSWPIN);
return ret;
Index: linux/include/linux/fs.h
===
--- linux.orig/include/linux/fs.h   2008-02-26 11:15:58.0 +0100
+++ linux/include/linux/fs.h2008-02-26 13:29:40.0 +0100
@@ -485,7 +485,8 @@ struct address_space_operations {
/*
 * swapfile support
 */
-   int (*swapfile)(struct address_space *, int);
+   int (*swapon)(struct file *file);
+   int (*swapoff)(struct file *file);
int (*swap_out)(struct file *file, struct page *page,
struct writeback_control *wbc);
int (*swap_in)(struct file *file, struct page *page);
Index: linux/mm/swapfile.c
===
--- linux.orig/mm/swapfile.c2008-02-26 12:43:57.0 +0100
+++ linux/mm/swapfile.c 2008-02-26 13:34:57.0 +0100
@@ -1014,9 +1014,11 @@ static void destroy_swap_extents(struct 
}
 
if (sis->flags & SWP_FILE) {
+   struct file *swap_file = sis->swap_file;
+   struct address_space *mapping = 

Re: [PATCH 22/28] mm: add support for non block device backed swap files

2008-02-26 Thread Peter Zijlstra

On Tue, 2008-02-26 at 13:45 +0100, Miklos Szeredi wrote:
 Starting review in the middle, because this is the part I'm most
 familiar with.
 
  New address_space_operations methods are added:
int swapfile(struct address_space *, int);
 
 Separate ->swapon() and ->swapoff() methods would be so much cleaner IMO.

I'm ok with that, but it's a_ops bloat, do we care about that? I guess
since it has limited instances - typically one per filesystem - there is
no issue here.

 Also is there a reason why 'struct file *' cannot be supplied to these
 functions?

No real reason here. I guess it's cleaner indeed. Thanks.

  +int swap_set_page_dirty(struct page *page)
  +{
  +   struct swap_info_struct *sis = page_swap_info(page);
  +
  +   if (sis->flags & SWP_FILE) {
  +   const struct address_space_operations *a_ops =
  +   sis->swap_file->f_mapping->a_ops;
  +   int (*spd)(struct page *) = a_ops->set_page_dirty;
  +#ifdef CONFIG_BLOCK
  +   if (!spd)
  +   spd = __set_page_dirty_buffers;
  +#endif
 
 This ifdef is not really needed.  Just require ->set_page_dirty() be
 filled in by filesystems which want swapfiles (and others too, in the
 longer term, the fallback is just historical crud).

Agreed. This is a good motivation to clean up that stuff.

 Here's an incremental patch addressing these issues and beautifying
 the new code.

Thanks, I'll fold it into the patch and update the documentation. I'll
put your creds in akpm style.




Re: Proposal for proper durable fsync() and fdatasync()

2008-02-26 Thread Jörn Engel
On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote:
 
 Yeah, sync_file_range has slightly unusual semantics and introduce
 the new concept, writeout, to userspace (does writeout include
 in drive cache? the kernel doesn't think so, but the only way to
 make sync_file_range safe is if you do consider it writeout).

If sync_file_range isn't safe, it should get replaced by a noop
implementation.  There really is no point in promising a little
safety.

One interesting aspect of this comes with COW filesystems like btrfs or
logfs.  Writing out data pages is not sufficient, because those will get
lost unless their referencing metadata is written as well.  So either we
have to call fsync for those filesystems or add another callback and let
filesystems override the default implementation.

Jörn

-- 
There is no worse hell than that provided by the regrets
for wasted opportunities.
-- Andre-Louis Moreau in Scarabouche


Re: [PATCH 00/37] Permit filesystem local caching

2008-02-26 Thread David Howells
Daniel Phillips [EMAIL PROTECTED] wrote:

 I need to respond to this in pieces... first the bit that is bugging
 me:
 
 * two new page flags
  
  I need to keep track of two bits of per-cached-page information:
  
   (1) This page is known by the cache, and that the cache must be informed if
   the page is going to go away.
 
 I still do not understand the life cycle of this bit.  What does the
 cache do when it learns the page has gone away?

That's up to the cache.  CacheFS, for example, unpins some resources when all
the pages managed by a pointer block are taken away from it.  The cache may
also reserve a block on disk to back this page, and that reservation may then
be discarded by the netfs uncaching the page.

The cache may also speculatively take copies of the page if the machine is
idle.

Documentation/filesystems/caching/netfs-api.txt describes the caching API as a
process, including the presentation of netfs pages to the cache and their
uncaching.

 How is it informed?

[Documentation/filesystems/caching/netfs-api.txt]
==
PAGE UNCACHING
==

To uncache a page, this function should be called:

void fscache_uncache_page(struct fscache_cookie *cookie,
  struct page *page);

This function permits the cache to release any in-memory representation it
might be holding for this netfs page.  This function must be called once for
each page on which the read or write page functions above have been called to
make sure the cache's in-memory tracking information gets torn down.

Note that pages can't be explicitly deleted from the data file.  The whole
data file must be retired (see the relinquish cookie function below).

Furthermore, note that this does not cancel the asynchronous read or write
operation started by the read/alloc and write functions.
[/]
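
For concreteness, here is a minimal sketch (not taken from the patch set; the
inode wrapper and cookie field are invented names) of how a netfs might drive
the uncaching call above from its own page-release path:

    /* Hypothetical netfs inode wrapper; only the FS-Cache cookie matters here. */
    struct my_netfs_inode {
            struct inode vfs_inode;
            struct fscache_cookie *cookie;  /* NULL if no cache is attached */
    };

    /* Called by the netfs when one of its pages is about to go away, so the
     * cache can drop its in-memory tracking for that page. */
    static void my_netfs_uncache_page(struct my_netfs_inode *ni, struct page *page)
    {
            if (ni->cookie)
                    fscache_uncache_page(ni->cookie, page);
    }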

 Who owns the page cache in which such a page lives, the nfs client?
 Filesystem that hosts the page?  A third page cache owned by the
 cache itself?  (See my basic confusion about how many page cache
 levels you have, below.)

[Documentation/filesystems/caching/fscache.txt]
 (7) Data I/O is done direct to and from the netfs's pages.  The netfs
 indicates that page A is at index B of the data-file represented by cookie
 C, and that it should be read or written.  The cache backend may or may
 not start I/O on that page, but if it does, a netfs callback will be
 invoked to indicate completion.  The I/O may be either synchronous or
 asynchronous.
[/]

I should perhaps make the documentation more explicit: the pages passed to the
routines defined in include/linux/fscache.h are netfs pages, normally belonging
to the pagecache of the appropriate netfs inode.  This is, however, mentioned in
the function banner comments in fscache.h.

 Suppose one were to take a mundane approach to the persistent cache
 problem instead of layering filesystems.  What you would do then is
 change NFS's ->write_page and variants to fiddle the persistent
 cache

It is a requirement laid down by the Linux NFS fs maintainers that the writes
to the cache be asynchronous, even if the writes to NFS aren't.

Note further that NFS's write_page() != writing to the cache.  Writing to the
cache is typically done by NFS's readpages().

Besides, at the moment, caching is suppressed for any NFS file opened for
writing due to coherency issues.  This is something to be revisited later.

 as well as the network, instead of just the network as now.

Not as now.  See above.

 This fiddling could even consist of ->write calls to another
 filesystem, though working directly with the bio interface would
 yield the fastest, and therefore to my mind, best result.

You can't necessarily access the BIO interface, and even if you can, the cache
is still a filesystem.

Essentially, what cachefiles does is to do what you say: to perform ->write
calls on another filesystem.

FS-Cache also protects the netfs against (a) there being no cache, (b) the
cache suffering a fatal I/O error and (c) the cache being removed; and protects
the cache against (d) the netfs uncaching pages that the cache is using and (e)
conflicting operations from the netfs, some of which may be queued for
asynchronous processing.

FS-Cache also groups asynchronous netfs store requests together, which
hopefully, one day, I'll be able to pass on to the backing fs.

 In any case, you find out how to write the page to backing store by
 asking the filesystem, which in the naive approach would be nfs
 augmented with caching library calls.

NFS and AFS and CIFS and ISOFS, but yes, that's what fscache is, if you like, a
caching library.

 The filesystem keeps its own metadata around to know how to map the page to
 disk.  So again naively, this metadata could tell the nfs client that the
 page is not mapped to disk at all.

The netfs should _not_ know about the metadata of a backing fs.  Firstly, there
are many different potential backing filesystems, and secondly if 

Re: Proposal for proper durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Jörn Engel wrote:
 On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote:
  
  Yeah, sync_file_range has slightly unusual semantics and introduce
  the new concept, writeout, to userspace (does writeout include
  in drive cache? the kernel doesn't think so, but the only way to
  make sync_file_range safe is if you do consider it writeout).
 
 If sync_file_range isn't safe, it should get replaced by a noop
 implementation.  There really is no point in promising a little
 safety.
 
 One interesting aspect of this comes with COW filesystems like btrfs or
 logfs.  Writing out data pages is not sufficient, because those will get
 lost unless their referencing metadata is written as well.  So either we
 have to call fsync for those filesystems or add another callback and let
 filesystems override the default implementation.

fdatasync() is required to write data pages _and_ the necessary
metadata to reference those changed pages (btrfs tree etc.), but not
non-data metadata.

It's the filesystem's responsibility to interpret that correctly.
In-place writes don't need anything else.  Phase-tree style writes do.
Some kinds of logged writes don't.

I'm under the impression that sync_file_range() is a sort of
restricted-range asynchronous fdatasync().

By limiting the range of file data which must be written out, it
becomes more refined for database and filesystem-in-a-file type
applications.  Just as fsync() is more refined than sync() - it's
useful to sync less - same goes for syncing just part of a file.

It's still the filesystem's responsibility to sync data access
metadata appropriately.  It can sync more if it wants, but not less.

That's what I understand by:

    sync_file_range(fd, start, length, SYNC_FILE_RANGE_WAIT_BEFORE
                                       | SYNC_FILE_RANGE_WRITE
                                       | SYNC_FILE_RANGE_WAIT_AFTER);

Largely because the manual says to use that combination of flags for
an equivalent to fdatasync().

The concept of write-out is not defined in the manual.  I'm assuming
it to mean this, as a reasonable guess:

SYNC_FILE_RANGE_WRITE scans all pages in the range, looking for dirty
pages which aren't already queued for write-out.  It marks those with
a write-out flag, and starts write I/Os at some unspecified time in
the near future; it can be assumed writes for all the pages will
complete eventually if there's no errors.  When I/O completes on a
page, it cleans the page and also clears the write-out flag.

SYNC_FILE_RANGE_WAIT_AFTER waits until all pages in the range don't
have the write-out flag set.

SYNC_FILE_RANGE_WAIT_BEFORE does the same wait, but before marking
pages for write-out.  I don't actually see the point in this.  Isn't a
preceding call with SYNC_FILE_RANGE_WAIT_AFTER equivalent, making
BEFORE a redundant flag?

The manual says it is something to do with data-integrity, but it's
not clear to me what that means.

All this implies that write-out flag is a concept userspace can rely
on.  That's not so peculiar: WRITE seems to be equivalent to AIO-style
fdatasync() on a limited range of offsets, and WAIT_AFTER seems to be
equivalent to waiting for any previously issued such ops to complete.
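
A minimal sketch of that reading, using only the documented flags (whether
this actually reaches the disk platter is exactly the open question in this
thread):

    /* Start writeout of a dirty range now, without waiting (async). */
    sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);

    /* ... do other work ... */

    /* Later, wait for that writeout (and anything redirtied since) to
     * finish; together these roughly mimic an fdatasync() restricted
     * to [off, off+len). */
    sync_file_range(fd, off, len, SYNC_FILE_RANGE_WAIT_BEFORE |
                                  SYNC_FILE_RANGE_WRITE |
                                  SYNC_FILE_RANGE_WAIT_AFTER);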

Any data access metadata updates that btrfs must make for fdatasync(),
it must also make for sync_file_range(), for the limited range of
offsets.

-- Jamie


Re: Proposal for proper durable fsync() and fdatasync()

2008-02-26 Thread Ric Wheeler

Jeff Garzik wrote:

Jamie Lokier wrote:

By durable, I mean that fsync() should actually commit writes to
physical stable storage,


Yes, it should.



I was surprised that fsync() doesn't do this already.  There was a lot
of effort put into block I/O write barriers during 2.5, so that
journalling filesystems can force correct write ordering, using disk
flush cache commands.

After all that effort, I was very surprised to notice that Linux 2.6.x
doesn't use that capability to ensure fsync() flushes the disk cache
onto stable storage.


It's surprising you are surprised, given that this [lame] fsync behavior
has remained consistently lame throughout Linux's history.


Maybe I am confused, but isn't this what fsync() does today whenever
barriers are enabled (the fsync() invalidates the drive's write cache)?


ric


Re: Proposal for proper durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Jörn Engel wrote:
 On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote:
  Yeah, sync_file_range has slightly unusual semantics and introduce
  the new concept, writeout, to userspace (does writeout include
  in drive cache? the kernel doesn't think so, but the only way to
  make sync_file_range safe is if you do consider it writeout).
 
 If sync_file_range isn't safe, it should get replaced by a noop
 implementation.  There really is no point in promising a little
 safety.

Sometimes there is a point in a little safety.

There's a spectrum of durability (meaning how safely stored the data
is).  In the cases we're imagining, it's application -> main memory
cache -> disk cache -> disk surface.  There are others.

_None_ of those provide perfect safety for your data.  They are a
spectrum, and how far along you want data to be committed before you
say "fine, the data is safe enough for me" depends on your application.

For example, there are users who like to turn _off_ fdatasync() with
their SQL database of choice.  They prefer speed over safety, and they
don't mind losing an hour's data and doing regular backups (we assume
;-) Some blogs fall into this category; who cares if a rare crash
costs you a comment or two and a restore from backup; it's acceptable
for the speed.

There's users who would really like fdatasync() to commit data to the
drive platters, so after their database says "done", they are very
confident that a power failure won't cause committed data to be lost.
Accepting credit cards is more at this end.  So should be anyone using
a virtual machine of any kind without a journalling fs in the guest!

And there's users who like it where it is right now: a compromise,
where a system crash won't lose committed data; but a power failure
might.  (I'm making assumptions about drive behaviour on reset here.)

My problem with fdatasync() at the moment is, I can't choose what I
want from it, and there's no mechanism to give me the safest option.
Most annoyingly, in-kernel filesystems _do_ have a mechanism; it just
isn't exported to userspace.

(A quick aside: fdatasync() et al. are actually used for two
_different_ things.  1: A program says "I've written it", it can say
so with confidence, e.g. announcing email receipt.  2: It's used for
write ordering with write-ahead logging: write, fdatasync, write.
When you tease at the details, efficient implementations of them are
different...  Think SCSI tagged commands versus cache flushes.)
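
A bare sketch of use 2, the write-ahead-logging ordering (the file
descriptors, buffers and offsets are illustrative, not from any real code):

    /* The journal record must be durable before the in-place overwrite
     * of the data is issued -- fdatasync() is the ordering point. */
    pwrite (log_fd, rec, rec_len, log_off);
    fdatasync (log_fd);
    pwrite (data_fd, blk, blk_len, blk_off);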

 One interesting aspect of this comes with COW filesystems like btrfs or
 logfs.  Writing out data pages is not sufficient, because those will get
 lost unless their referencing metadata is written as well.  So either we
 have to call fsync for those filesystems or add another callback and let
 filesystems override the default implementation.

Doesn't the ->fsync callback get called in the sys_fdatasync() case,
with appropriate arguments?

With barriers/flushes it certainly makes those a bit more complicated.
You have to flush not just the disks with data pages, but the _other_
disks in a software RAID with data pointer metadata pages, but ideally
not all of them (think database journal commit).

That can be implemented with per-buffer pending-barrier/flush flags
(like I described for pages in the first mail), which are equally
useful when a database-like application uses a block device.

-- Jamie


Re: Proposal for proper durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Ric Wheeler wrote:
 I was surprised that fsync() doesn't do this already.  There was a lot
 of effort put into block I/O write barriers during 2.5, so that
 journalling filesystems can force correct write ordering, using disk
 flush cache commands.
 
 After all that effort, I was very surprised to notice that Linux 2.6.x
 doesn't use that capability to ensure fsync() flushes the disk cache
 onto stable storage.
 
 It's surprising you are surprised, given that this [lame] fsync behavior
 has remained consistently lame throughout Linux's history.
 
 Maybe I am confused, but isn't this what fsync() does today whenever
 barriers are enabled (the fsync() invalidates the drive's write cache)?

No, fsync() doesn't always flush the drive's write cache.  It often
does, and I think many people are under the impression it always does,
but it doesn't.

Try this code on ext3:

fd = open ("test_file", O_RDWR | O_CREAT | O_TRUNC, 0666);
while (1) {
        char byte;
        usleep (10);
        pwrite (fd, &byte, 1, 0);
        fsync (fd);
}

It will do just over 10 write ops per second on an idle system (13 on
mine), and 1 flush op per second.

That's because ext3 fsync() only does a journal commit when the inode
has changed.  The inode mtime is changed by write only with 1 second
granularity.  Without a journal commit, there's no barrier, which
translates to not flushing disk write cache.

If you add "fchmod (fd, 0644); fchmod (fd, 0664);" between the write
and fsync, you'll see at least 20 write ops and 20 flush ops per
second, and you'll hear the disk seeking more.  That's because the
fchmod dirties the inode, so fsync() writes the inode with a journal
commit.

It turns out even _that_ is not sufficient according to the kernel
internals.  A journal commit uses an ordered request, which isn't
necessarily the same as a flush; it just happens to use a flush in this
instance.  I'm not sure if ordered requests are actually implemented
by any drivers at the moment.  If not now, they will be one day.

We could change ext3 fsync() to always do a journal commit, and depend
on the non-existence of block drivers which do ordered (not flush)
barrier requests.  But there's lots of things wrong with that.  Not
least, it sucks performance for database-like applications and virtual
machines, a lot due to unnecessary seeks.  That way lies wrongness.

Rightness is to make fdatasync() work well, with a genuine flush (or
equivalent, see FUA) only when required, and not a mere ordered
barrier, with no inode write, and to make sync_file_range()[*] offer the
fancier applications finer controls which reflect what they actually
need.

[*] - or whatever.

-- Jamie


Re: Proposal for proper durable fsync() and fdatasync()

2008-02-26 Thread Andrew Morton
On Tue, 26 Feb 2008 15:07:45 + Jamie Lokier [EMAIL PROTECTED] wrote:

 SYNC_FILE_RANGE_WRITE scans all pages in the range, looking for dirty
 pages which aren't already queued for write-out.  It marks those with
 a write-out flag, and starts write I/Os at some unspecified time in
 the near future; it can be assumed writes for all the pages will
 complete eventually if there's no errors.  When I/O completes on a
 page, it cleans the page and also clears the write-out flag.
 
 SYNC_FILE_RANGE_WAIT_AFTER waits until all pages in the range don't
 have the write-out flag set.
 
 SYNC_FILE_RANGE_WAIT_BEFORE does the same wait, but before marking
 pages for write-out.  I don't actually see the point in this.  Isn't a
 preceding call with SYNC_FILE_RANGE_WAIT_AFTER equivalent, making
 BEFORE a redundant flag?

Consider the case of pages which are dirty but are already under writeout. 
ie: someone redirtied the page after someone started writing the page out. 
For these pages the kernel needs to

a) wait for the current writeout to complete

b) start new writeout

c) wait for that writeout to complete.

those are the three stages of sync_file_range().  They are independently
selectable and various combinations provide various results.

The reason for providing b) only (SYNC_FILE_RANGE_WRITE) is so that
userspace can get as much data into the queue as possible, to permit the
kernel to optimise IO scheduling better.

If you perform a) and b) together
(SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE) then you are guaranteed
that all data which was dirty when sync_file_range() executed will be sent
into the queue, but you won't get as much data into the queue if the kernel
encounters dirty, under-writeout pages.  This is especially hurtful if
you're trying to feed a lot of little segments into the queue.  In that
case perhaps userspace should do an asynchronous pass
(SYNC_FILE_RANGE_WRITE) to stuff as much data as possible into the queue, then
a SYNC_FILE_RANGE_WAIT_AFTER pass, then a
SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER
pass to clean up any stragglers.  Which mode is best very much depends on
the application's file dirtying patterns.  One would have to experiment
with it, and tuning of sync_file_range() usage would occur alongside tuning
of the application's write() design.

It's an interesting problem, with potentially high payback.
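
To make the multi-segment recipe concrete, a minimal userspace sketch of the
two-pass pattern described earlier in the thread (the segment struct and the
error handling are illustrative, not a proposed API):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/types.h>

    struct seg { off64_t off; off64_t len; };

    /* Pass 1: push every segment into the queue; pass 2: wait for all of
     * them and return the accumulated result. */
    static int sync_segments(int fd, const struct seg *v, int n)
    {
            int i, err = 0;

            for (i = 0; i < n; i++)
                    if (sync_file_range(fd, v[i].off, v[i].len,
                                        SYNC_FILE_RANGE_WAIT_BEFORE |
                                        SYNC_FILE_RANGE_WRITE))
                            err = -1;

            for (i = 0; i < n; i++)
                    if (sync_file_range(fd, v[i].off, v[i].len,
                                        SYNC_FILE_RANGE_WAIT_AFTER))
                            err = -1;

            return err;
    }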


Re: Proposal for proper durable fsync() and fdatasync()

2008-02-26 Thread Jeff Garzik

Nick Piggin wrote:

Anyway, the idea of making fsync/fdatasync etc. safe by default is
a good idea IMO, and is a bad bug that we don't do that :(


Agreed...  it's also disappointing that [unless I'm mistaken] you have 
to hack each filesystem to support barriers.


It seems far easier to make sync_blkdev() Do The Right Thing, and 
magically make all filesystems data-safe.


Jeff




Re: [RFC] ext3 freeze feature ver 0.2

2008-02-26 Thread Eric Sandeen
Takashi Sato wrote:

 o Elevate XFS ioctl numbers (XFS_IOC_FREEZE and XFS_IOC_THAW) to the VFS
   As Andreas Dilger and Christoph Hellwig advised me, I have elevated
   them to include/linux/fs.h as below.
 #define FIFREEZE  _IOWR('X', 119, int)
    #define FITHAW  _IOWR('X', 120, int)
   The ioctl numbers used by XFS applications don't need to be changed.
   But my following ioctl for the freeze needs the parameter
   as the timeout period.  So if XFS applications don't want the timeout
   feature as the current implementation, the parameter needs to be
   changed from 1 (level?) to 0.

So, existing xfs applications calling the xfs ioctl now will behave
differently, right?  We can only keep the same ioctl number if the
calling semantics are the same.  Keeping the same number but changing
the semantics is harmful, IMHO
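
For illustration, a hedged sketch of the clash (the timeout unit and the
exact FIFREEZE argument semantics are assumptions based on this thread, not
settled API):

    int arg;

    /* Existing XFS applications: the argument has historically been a
     * freeze level, conventionally 1. */
    arg = 1;
    ioctl (fd, XFS_IOC_FREEZE, &arg);

    /* Proposed VFS ioctl: the same integer now means a timeout period,
     * so an unmodified caller passing 1 would silently request a very
     * short timeout instead of a plain freeze -- hence the concern. */
    arg = 0;    /* 0 = no timeout, per the proposal above */
    ioctl (fd, FIFREEZE, &arg);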

-Eric


Re: Proposal for proper durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Jeff Garzik wrote:
 Nick Piggin wrote:
 Anyway, the idea of making fsync/fdatasync etc. safe by default is
 a good idea IMO, and is a bad bug that we don't do that :(
 
 Agreed...  it's also disappointing that [unless I'm mistaken] you have 
 to hack each filesystem to support barriers.
 
 It seems far easier to make sync_blkdev() Do The Right Thing, and 
 magically make all filesystems data-safe.

Well, you need ordered metadata writes, barriers _and_ flushes with
some filesystems.

Merely writing all the data pages then issuing a drive cache flush
won't Do The Right Thing with those filesystems - someone already
mentioned Btrfs, where it won't.

But I agree that your suggestion would make a superb default, for
filesystems which don't provide their own function.

It's not optimal even then.

  Devices: On a software RAID, you ideally don't want to issue flushes
  to all drives if your database did a 1 block commit entry.  (But they
  probably use O_DIRECT anyway, changing the rules again).  But all that
  can be optimised in generic VFS code eventually.  It doesn't need
  filesystem assistance in most cases.

  Apps: don't always want a full flush; sometimes a barrier would do.

-- Jamie


Re: Proposal for proper durable fsync() and fdatasync()

2008-02-26 Thread Jörn Engel
On Tue, 26 February 2008 15:28:10 +, Jamie Lokier wrote:
 
  One interesting aspect of this comes with COW filesystems like btrfs or
  logfs.  Writing out data pages is not sufficient, because those will get
  lost unless their referencing metadata is written as well.  So either we
  have to call fsync for those filesystems or add another callback and let
  filesystems override the default implementation.
 
  Doesn't the ->fsync callback get called in the sys_fdatasync() case,
 with appropriate arguments?

My paragraph above was aimed at the sync_file_range() case.  fsync and
fdatasync do the right thing within the limitations you brought up in
this thread.  sync_file_range() without further changes will only write
data pages, not the metadata required to actually access those data
pages.  This works just fine for non-COW filesystems, which covers all
currently merged ones.

With COW filesystems it is currently impossible to do sync_file_range()
properly.  The problem is orthogonal to yours, I just brought it up
since you were already mentioning sync_file_range().


Jörn

-- 
Joern's library part 10:
http://blogs.msdn.com/David_Gristwood/archive/2004/06/24/164849.aspx


Re: [RFC] ext3 freeze feature ver 0.2

2008-02-26 Thread Andreas Dilger
On Feb 26, 2008  08:39 -0800, Eric Sandeen wrote:
 Takashi Sato wrote:
 
  o Elevate XFS ioctl numbers (XFS_IOC_FREEZE and XFS_IOC_THAW) to the VFS
As Andreas Dilger and Christoph Hellwig advised me, I have elevated
them to include/linux/fs.h as below.
  #define FIFREEZE  _IOWR('X', 119, int)
     #define FITHAW  _IOWR('X', 120, int)
The ioctl numbers used by XFS applications don't need to be changed.
But my following ioctl for the freeze needs the parameter
as the timeout period.  So if XFS applications don't want the timeout
feature as the current implementation, the parameter needs to be
    changed from 1 (level?) to 0.
 
 So, existing xfs applications calling the xfs ioctl now will behave
 differently, right?  We can only keep the same ioctl number if the
 calling semantics are the same.  Keeping the same number but changing
 the semantics is harmful, IMHO

Do we know what this parameter was supposed to mean?

We could special case 1 if needed to keep compatibility (documenting
this clearly), either making it == 0, or some very long timeout (1h
or whatever).  A relatively minor wart I think.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



Re: Proposal for proper durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Jörn Engel wrote:
 On Tue, 26 February 2008 15:28:10 +, Jamie Lokier wrote:
  
   One interesting aspect of this comes with COW filesystems like btrfs or
   logfs.  Writing out data pages is not sufficient, because those will get
   lost unless their referencing metadata is written as well.  So either we
   have to call fsync for those filesystems or add another callback and let
   filesystems override the default implementation.
  
   Doesn't the ->fsync callback get called in the sys_fdatasync() case,
  with appropriate arguments?
 
 My paragraph above was aimed at the sync_file_range() case.  fsync and
 fdatasync do the right thing within the limitations you brought up in
 this thread.  sync_file_range() without further changes will only write
 data pages, not the metadata required to actually access those data
 pages.  This works just fine for non-COW filesystems, which covers all
 currently merged ones.
 
 With COW filesystems it is currently impossible to do sync_file_range()
 properly.  The problem is orthogonal to your's, I just brought it up
 since you were already mentioning sync_file_range().

You're right.  Though, doesn't normal page writeback enqueue the COW
metadata changes?  If not, how do they get written in a timely
fashion?

-- Jamie


Re: Proposal for proper durable fsync() and fdatasync()

2008-02-26 Thread Jörn Engel
On Tue, 26 February 2008 17:29:13 +, Jamie Lokier wrote:
 
 You're right.  Though, doesn't normal page writeback enqueue the COW
 metadata changes?  If not, how do they get written in a timely
 fashion?

It does.  But this is not sufficient to guarantee that the pages in
question have been safely committed to the device by the time
sync_file_range() has returned.

Jörn

-- 
Joern's library part 5:
http://www.faqs.org/faqs/compression-faq/part2/section-9.html


Re: Proposal for proper durable fsync() and fdatasync()

2008-02-26 Thread Jeff Garzik

Jamie Lokier wrote:

Jeff Garzik wrote:

Nick Piggin wrote:

Anyway, the idea of making fsync/fdatasync etc. safe by default is
a good idea IMO, and is a bad bug that we don't do that :(
Agreed...  it's also disappointing that [unless I'm mistaken] you have 
to hack each filesystem to support barriers.


It seems far easier to make sync_blkdev() Do The Right Thing, and 
magically make all filesystems data-safe.


Well, you need ordered metadata writes, barriers _and_ flushes with
some filesystems.

Merely writing all the data pages then issuing a drive cache flush
won't Do The Right Thing with those filesystems - someone already
mentioned Btrfs, where it won't.


Oh certainly.  That's why we have a VFS :)  fsync for NFS will look 
quite different, too.




But I agree that your suggestion would make a superb default, for
filesystems which don't provide their own function.


Yep.  That would immediately cover a bunch of filesystems.



It's not optimal even then.

  Devices: On a software RAID, you ideally don't want to issue flushes
  to all drives if your database did a 1 block commit entry.  (But they
  probably use O_DIRECT anyway, changing the rules again).  But all that
  can be optimised in generic VFS code eventually.  It doesn't need
  filesystem assistance in most cases.


My own idea is that we create a FLUSH command for blkdev request queues, 
to exist alongside READ, WRITE, and the current barrier implementation. 
 Then FLUSH could be passed down through MD or DM.


Jeff




Re: [PATCH 00/37] Permit filesystem local caching

2008-02-26 Thread Daniel Phillips
On Tuesday 26 February 2008 06:33, David Howells wrote:
  Suppose one were to take a mundane approach to the persistent cache
  problem instead of layering filesystems.  What you would do then is
  change NFS's ->write_page and variants to fiddle the persistent
  cache
 
 It is a requirement laid down by the Linux NFS fs maintainers that the writes
 to the cache be asynchronous, even if the writes to NFS aren't.

As it happens, I will be hanging out for the next few days with said
NFS maintainers, it would help to be as informed as possible about
your patch set.

 Note further that NFS's write_page() != writing to the cache.  Writing to the
 cache is typically done by NFS's readpages().

Yes, of course.  But also by ->write_page, no?

  Which I could eventually find out by reading all the patches but asking you
  is so much more fun :-)
 
 And a waste of my time.  I've provided documentation in the main FS-Cache
 patch, both as text files and in comments in header files that answer your
 questions.  Please read them first.

37 Patches, none of which has "Documentation" in the subject line, and
you did not provide a diffstat in patch 0 for the patch set as a whole.
If I had known it was there of course I would have read it.  It is great
to see this level of documentation.  But I do not think it is fair to
blame your (one) reader for missing it.

See the smiley above?  The _real_ reason I am asking you is that I do
not think anybody understands your patch set, in spite of your
considerable efforts to address that.  Discussion in public, right or
wrong, is the only way to fix that.  It is counterproductive to drive
readers away from the discussion for fear that they may miss some point
obvious to the original author, or perhaps already discussed earlier on
lkml, and get flamed for it.

Obviously, the patch set is not going to be perfect when it goes in and
it would be a silly abuse of the open source process to require that,
but the parts where it touches the rest of the system have to be really
well understood, and it is clear from the level of participation in the
thread that they are not.

One bit that already came out of this, which you have alluded to
several times yourself but somehow seem to keep glossing over, is that
you need a ->direct_bio file operations method.  So does loopback mount.
It might be worth putting some effort into seeing how ->direct_IO can
be refactored to make that happen.  You can get it in separately on the
basis of helping loopback, and it will make your patches nicer.

Daniel