Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-20 Thread Joerg Schilling
Peter Schuller peter.schul...@infidyne.com wrote:

  fsync() is, indeed, expensive.  Lots of calls to fsync() that are not
  necessary for correct application operation EXCEPT as a workaround for
  lame filesystem re-ordering are a sure way to kill performance.

 IMO the fundamental problem is that the only way to achieve a write
 barrier is fsync() (disregarding direct I/O etc). Again I would just
 like an fbarrier() as I've mentioned on the list previously. It seems
 to me that if this were just adopted by some operating systems and
 applications could start using it, things would just sort themselves out
 when file system/block device layers start actually implementing the
 optimizations possible (instead of the naive fbarrier() == fsync()).

In addition, POSIX does not mention that close() needs to sync the file to 
disk.  If an application like star wants to verify whether files could be 
written to disk, in order to produce a correct exit code, the only way is to 
call fsync() before close().

With UFS, this creates a performance impact of approx. 10%; with ZFS it
was more than 10% the last time I checked.

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily


Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-19 Thread Bob Friesenhahn

On Thu, 19 Mar 2009, Miles Nordin wrote:


And the guarantees ARE minimal---just:

http://www.google.com/search?q=POSIX+%22crash+consistency%22

and you'll find that even people against T'so who want to change ext4
still agree POSIX is on T'so's side.


Clearly I am guilty of inflated expectations.  Regardless, POSIX 
specifications define a minimum set of expectations, and there is 
nothing to prevent vendors from offering more, or enhanced 
specifications (e.g. from the Open Group) from raising the bar.


Now that I am more aware of the situation, I can see that users of my 
software are likely to lose files if the system were to crash.  There 
is a fsync-safe mode for my software which should avoid this but 
application performance would suffer quite a lot if it was used on a 
large scale.


If ZFS does try to order its disk updates in chronological order 
without prioritizing metadata updates over data, then the risk is 
minimized.


While a number of esteemed Sun kernel engineers have expressed their 
views here, we have yet to hear an opinion/statement from a Sun ZFS 
development engineer.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-19 Thread Miles Nordin
 bf == Bob Friesenhahn bfrie...@simple.dallas.tx.us writes:

bf If ZFS does try to order its disk updates in chronological
bf order without prioritizing metadata updates over data, then
bf the risk is minimized.

AIUI it doesn't exactly order them, just puts them into 5-second
chunks.  So it rolls the on-disk representation forward in lurching
steps every 5 seconds, and the boundaries between each step are exact
representations of how the filesystem once looked to the userland.

I do not understand yet if fsync() will lurch forward the _entire_
filesystem, or just the inode being fsync()d.  Unless I'm mistaken,
``the property,'' as I described it, can only be achieved by lurching
forward the entire filesystem whenever you fsync anything, because
otherwise you will recover to an overall state through which you never
passed before the crash (with the fsync'd file being a little newer),
but it might be faster/better to violate the property and only sync
what was asked.

If it's the entire filesystem, then it might improve performance to
separate unrelated heavy writers into different filesystems---for
example it would be better to put a tape-emulation backup directory
and a mail queue directory into separate filesystems even if they have
to go in the same pool.  If it breaks the property and only syncs the
inode asked for, then two directories on one filesystem vs. two filesystems
should not change performance in that scenario, which is an advantage.




Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-19 Thread Peter Schuller
 fsync() is, indeed, expensive.  Lots of calls to fsync() that are not
 necessary for correct application operation EXCEPT as a workaround for
 lame filesystem re-ordering are a sure way to kill performance.

IMO the fundamental problem is that the only way to achieve a write
barrier is fsync() (disregarding direct I/O etc). Again I would just
like an fbarrier() as I've mentioned on the list previously. It seems
to me that if this were just adopted by some operating systems and
applications could start using it, things would just sort themselves out
when file system/block device layers start actually implementing the
optimizations possible (instead of the naive fbarrier() == fsync()).

As was noted previously in the previous thread on this topic, ZFS
effectively has an implicit fbarrier() in between each write. Imagine
now if all the applications out there were automatically massively
faster on ZFS... but this won't happen until operating systems start
exposing the necessary interface.
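
As an illustration only -- fbarrier() does not exist in any operating system
I'm aware of, so the call below is purely hypothetical, and dat/size/fail()
are placeholders -- the write-temp-then-rename pattern could then look like
this, with the barrier standing where fsync() has to stand today:

/* Hypothetical sketch: fbarrier(f) would only guarantee that the write()s
 * issued before it reach disk before anything issued after it; unlike
 * fsync() it would not force an immediate flush. */
f = open("foo.tmp", O_WRONLY|O_CREAT|O_TRUNC, 0666);
if (write(f, dat, size) != size)
	fail();
if (fbarrier(f) < 0)		/* ordering only, no flush (imaginary call) */
	fail();
if (close(f) < 0)
	fail();
if (rename("foo.tmp", "foo") < 0)
	fail();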

What does one need to do to get something happening here? Other than
whine on mailing lists...

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller peter.schul...@infidyne.com'
Key retrieval: Send an E-Mail to getpgp...@scode.org
E-Mail: peter.schul...@infidyne.com Web: http://www.scode.org





Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-19 Thread Peter Schuller
Uh, I should probably clarify some things (I was too quick to hit
send):

 IMO the fundamental problem is that the only way to achieve a write
 barrier is fsync() (disregarding direct I/O etc). Again I would just
 like an fbarrier() as I've mentioned on the list previously. It seems

Of course if fbarrier() is analogous to fsync() this does not actually
address the particular problem which is the main topic of this thread,
since there the fbarrier() would presumably apply only to I/O within
that file.

This particular case would only be helped if the fbarrier() were
global, or at least extending further than the particular file.

Fundamentally, I think a useful observation is that the only time you
ever care about persistence is when you make a contract with an
external party outside of your blackbox of I/O. Typical examples are
database commits and mail server queues. Anything within the blackbox
is only concerned with consistency.

In this particular case, the fsync()/fbarrier() operate on the black
box of the file, with the directory being an external party. The
rename() operation on the directory entry constitutes an operation
which depends on the state of the individual file blackbox, thus
constituting an external dependency and thus requiring persistence.

The question is whether it is necessarily a good idea to make the
blackbox be the entire file system. If it is, a lot of things would be
much much easier. On the other hand, it also makes optimization more
difficult in many cases. For example, the latency of persisting 8 KB of
data could be very, very significant if there are large amounts of bulk
I/O happening in the same file system. So I definitely see the
motivation behind having persistence guarantees be non-global.

Perhaps it boils down to the files+directory model not necessarily
being the best one in all cases. Perhaps one would like to define
subtrees which have global fsync()/fbarrier() type semantics within
each respective subtree.

On the other hand, that sounds a lot like a ZFS file system, other
than the fact that ZFS file system creation is not something which is
exposed to the application programmer.

How about having file-system global barrier/persistence semantics, but
having a well-defined API for creating child file systems rooted at
any point in a hierarchy? It would allow global semantics and what
that entails, while allowing the bulk I/O happening in your 1 TB
PostgreSQL database to be segregated, in terms of performance impact,
from your KDE settings file system.
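
Purely as a sketch of what such an interface might look like (nothing like
this exists today; both calls and their semantics are invented here for
illustration only):

/* Imaginary calls, not part of any real API:
 * mkchildfs() would carve a new child file system out of the parent pool,
 * rooted at an existing directory, and fsyncsubtree() would then provide
 * fsync()/fbarrier()-style semantics scoped to that subtree only. */
int mkchildfs(const char *dirpath, int flags);
int fsyncsubtree(const char *dirpath);

/* e.g. segregate the bulk database I/O from the settings files: */
mkchildfs("/export/home/me/pgdata", 0);
fsyncsubtree("/export/home/me/.kde");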

 What does one need to do to get something happening here? Other than
 whine on mailing lists...

And that came off much more rude than intended. Clearly it's not an
implementation-effort issue, since the naive fbarrier() is basically
calling fsync(). However I get the feeling there is little motivation
in the operating system community for addressing these concerns, for
whatever reason (IIRC it was only recently that some write
barrier/write caching issues started being seriously discussed in the
Linux kernel community for example).

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller peter.schul...@infidyne.com'
Key retrieval: Send an E-Mail to getpgp...@scode.org
E-Mail: peter.schul...@infidyne.com Web: http://www.scode.org





Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread Casper . Dik

Recently there's been discussion [1] in the Linux community about how
filesystems should deal with rename(2), particularly in the case of a crash.
ext4 was found to truncate files after a crash, that had been written with
open(foo.tmp), write(), close() and then rename(foo.tmp, foo). This is
 because ext4 uses delayed allocation and may not write the contents to disk
immediately, but commits metadata changes quite frequently. So when
rename(foo.tmp,foo) is committed to disk, it has a length of zero which
is later updated when the data is written to disk. This means after a crash,
foo is zero-length, and both the new and the old data has been lost, which
is undesirable. This doesn't happen when using ext3's default settings
because ext3 writes data to disk before metadata (which has performance
problems, see Firefox 3 and fsync[2])

The belief that, somehow, metadata is more important than other data
should have been put to rest with UFS.  Yes, it's easier to fsck the
filesystem when the metadata is correct, and that gets you a valid 
filesystem, but that doesn't mean that you get a filesystem with valid contents.

Ted T'so's (the main author of ext3 and ext4) response is that applications
which perform open(),write(),close(),rename() in the expectation that they
will either get the old data or the new data, but not no data at all, are
broken, and instead should call open(),write(),fsync(),close(),rename().
Most other people are arguing that POSIX says rename(2) is atomic, and while
POSIX doesn't specify crash recovery, returning no data at all after a crash
is clearly wrong, and excessive use of fsync is overkill and
counter-productive (Ted later proposes a yes-I-really-mean-it flag for
fsync). I've omitted a lot of detail, but I think this is the core of the
argument.


As long as POSIX believes that systems don't crash, then clearly there is
nothing in the standard which would help the argument on either side.

It is a quality-of-implementation property.  Apparently, T'so feels
that reordering filesystem operations is fine.


Now the question I have, is how does ZFS deal with
open(),write(),close(),rename() in the case of a crash? Will it always
return the new data or the old data, or will it sometimes return no data? Is
 returning no data defensible, either under POSIX or common sense? Comments
about other filesystems, eg UFS are also welcome. As a counter-point, XFS
(written by SGI) is notorious for data-loss after a crash, but its authors
defend the behaviour as POSIX-compliant.

I didn't know about the XFS behaviour on crash.  I don't know exactly how ZFS 
commits transaction groups; the ZFS authors can tell us and I hope they chime 
in.

The only time POSIX is in question is when the fileserver crashes and 
whether or not the NFS server keeps its promises.  Some typical Linux 
configurations would break some of those promises.

Casper



Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread Joerg Schilling
James Andrewartha jam...@daa.com.au wrote:

 Recently there's been discussion [1] in the Linux community about how
 filesystems should deal with rename(2), particularly in the case of a crash.
 ext4 was found to truncate files after a crash, that had been written with
 open(foo.tmp), write(), close() and then rename(foo.tmp, foo). This is
  because ext4 uses delayed allocation and may not write the contents to disk
 immediately, but commits metadata changes quite frequently. So when
 rename(foo.tmp,foo) is committed to disk, it has a length of zero which
 is later updated when the data is written to disk. This means after a crash,
 foo is zero-length, and both the new and the old data has been lost, which
 is undesirable. This doesn't happen when using ext3's default settings
 because ext3 writes data to disk before metadata (which has performance
 problems, see Firefox 3 and fsync[2])

 Ted T'so's (the main author of ext3 and ext4) response is that applications
 which perform open(),write(),close(),rename() in the expectation that they
 will either get the old data or the new data, but not no data at all, are
 broken, and instead should call open(),write(),fsync(),close(),rename().
 Most other people are arguing that POSIX says rename(2) is atomic, and while
 POSIX doesn't specify crash recovery, returning no data at all after a crash
 is clearly wrong, and excessive use of fsync is overkill and
 counter-productive (Ted later proposes a yes-I-really-mean-it flag for
 fsync). I've omitted a lot of detail, but I think this is the core of the
 argument.

The problem in this case is not whether rename() is atomic but whether the
file that replaces the old file in an atomic rename() operation is in a 
stable state on the disk before calling rename().

The calling sequence of the failing code was:

f = open("new", O_WRONLY|O_CREAT|O_TRUNC, 0666);
write(f, dat, size);
close(f);
rename("new", "old");

The only guaranteed way to have the file "new" in a stable state on the disk
is to call:

f = open("new", O_WRONLY|O_CREAT|O_TRUNC, 0666);
write(f, dat, size);
fsync(f);
close(f);

Do not forget to check error codes.

If the application would call:

f = open("new", O_WRONLY|O_CREAT|O_TRUNC, 0666);
if (write(f, dat, size) != size)
	fail();
if (fsync(f) < 0)
	fail();
if (close(f) < 0)
	fail();
if (rename("new", "old") < 0)
	fail();

and if after a crash there is neither the old file nor the
new file on the disk in a consistent state, then you may blame the
file system.
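
Put together as a complete helper (my own sketch, not from the thread; the
function name safe_replace() and the temporary-file argument are mine), the
checked sequence above looks roughly like this:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Replace the file at oldpath with the given contents by writing a
 * temporary file, forcing it to stable storage, and renaming it into
 * place.  After a crash the caller should see either the old contents
 * or the new contents, never an empty file.  Returns 0 on success,
 * -1 on any error.
 */
int
safe_replace(const char *oldpath, const char *tmppath,
    const void *buf, size_t size)
{
	int fd = open(tmppath, O_WRONLY|O_CREAT|O_TRUNC, 0666);

	if (fd < 0)
		return (-1);
	if (write(fd, buf, size) != (ssize_t)size || fsync(fd) < 0) {
		(void) close(fd);
		(void) unlink(tmppath);
		return (-1);
	}
	if (close(fd) < 0 || rename(tmppath, oldpath) < 0) {
		(void) unlink(tmppath);
		return (-1);
	}
	return (0);
}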


Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily


Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread Moore, Joe
Joerg Schilling wrote:
 James Andrewartha jam...@daa.com.au wrote:
  Recently there's been discussion [1] in the Linux community about how 
  filesystems should deal with rename(2), particularly in the case of a crash.
  ext4 was found to truncate files after a crash, that had been written with
  open(foo.tmp), write(), close() and then rename(foo.tmp, foo). This is
   because ext4 uses delayed allocation and may not write the contents to disk
  immediately, but commits metadata changes quite frequently. So when
  rename(foo.tmp,foo) is committed to disk, it has a length of zero which
  is later updated when the data is written to disk. This means after a crash,
  foo is zero-length, and both the new and the old data has been lost, which
  is undesirable. This doesn't happen when using ext3's default settings
  because ext3 writes data to disk before metadata (which has performance
  problems, see Firefox 3 and fsync[2])
 
  Ted T'so's (the main author of ext3 and ext4) response is that applications
  which perform open(),write(),close(),rename() in the expectation that they
  will either get the old data or the new data, but not no data at all, are
  broken, and instead should call open(),write(),fsync(),close(),rename().

 The only guaranteed way to have the file "new" in a stable state on the
 disk is to call:

 f = open("new", O_WRONLY|O_CREAT|O_TRUNC, 0666);
 write(f, dat, size);
 fsync(f);
 close(f);

AFAIUI, the ZFS transaction group maintains write ordering, at least as far as 
write()s to the file would be in the ZIL ahead of the rename() metadata updates.

So I think the atomicity is maintained without requiring the application to 
call fsync() before closing the file.  If the TXG is applied and the rename() 
is included, then the file writes have been too, so foo would have the new 
contents.  If the TXG containing the rename() isn't complete and on the ZIL 
device at crash time, foo would have the old contents.

Posix doesn't require the OS to sync() the file contents on close for local 
files like it does for NFS access?  How odd.

--Joe



Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread Casper . Dik


AFAIUI, the ZFS transaction group maintains write ordering, at least as far as
write()s to the file would be in the ZIL ahead of the rename() metadata updates.

So I think the atomicity is maintained without requiring the application to
call fsync() before closing the file.  If the TXG is applied and the rename()
is included, then the file writes have been too, so foo would have the new
contents.  If the TXG containing the rename() isn't complete and on the ZIL
device at crash time, foo would have the old contents.

Posix doesn't require the OS to sync() the file contents on close for local
files like it does for NFS access?  How odd.

perhaps sync() but not fsync().

But I'm not sure that that is the case.  UFS does that: it schedules 
writing the modified content when the file is closed, but only on the last 
close.

Casper



Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread Bob Friesenhahn

On Wed, 18 Mar 2009, Joerg Schilling wrote:


The problem in this case is not whether rename() is atomic but whether the
file that replaces the old file in an atomic rename() operation is in a
stable state on the disk before calling rename().


This topic is quite disturbing to me ...


The calling sequence of the failing code was:

f = open("new", O_WRONLY|O_CREAT|O_TRUNC, 0666);
write(f, dat, size);
close(f);
rename("new", "old");

The only guaranteed way to have the file "new" in a stable state on the disk
is to call:

f = open("new", O_WRONLY|O_CREAT|O_TRUNC, 0666);
write(f, dat, size);
fsync(f);
close(f);


But the problem is not that the file "new" is in an unstable state. 
The problem is that it seems that some filesystems are not preserving 
the ordering of requests.  Failing to preserve the ordering of 
requests is fraught with peril.


POSIX does not care about disks or filesystems.  The only correct 
behavior is for operations to be applied in the order that they are 
requested of the operating system.  This is a core function of any 
operating system.  It is therefore ok for some (or all) of the data 
which was written to new to be lost, or for the rename operation to 
be lost, but it is not ok for the rename to end up with a corrupted 
file with the new name.


In summary, I don't agree with you that the misbehavior is correct, 
but I do agree that copious expensive fsync()s should be assured to 
work around the problem.


As it happens, current versions of my own application should be safe 
from this Linux filesystem bug, but older versions are not.   There is 
even a way to request fsync() on every file close, but that could be 
quite expensive so it is not the default.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread David Dyer-Bennet

On Wed, March 18, 2009 05:08, Joerg Schilling wrote:

 The problem in this case is not whether rename() is atomic but whether the
 file that replaces the old file in an atomic rename() operation is in a
 stable state on the disk before calling rename().

Good, I was hoping somebody saw it that way.

People tend to assume that a successful close() guarantees the data
written to that file is on disk, and I don't believe that is actually
promised by POSIX (though I'm by no means a POSIX rules lawyer) or most
other modern systems.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info



Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread Richard Elling

Bob Friesenhahn wrote:
As it happens, current versions of my own application should be safe 
from this Linux filesystem bug, but older versions are not. There is 
even a way to request fsync() on every file close, but that could be 
quite expensive so it is not the default. 


Pragmatically, it is much easier to change the file system once, than
to test or change the zillions of applications that might be broken.
-- richard



Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread Nicolas Williams
On Wed, Mar 18, 2009 at 11:15:48AM -0400, Moore, Joe wrote:
 Posix doesn't require the OS to sync() the file contents on close for
 local files like it does for NFS access?  How odd.

Why should it?  If POSIX is agnostic as to system crashes / power
failures, then why should it say anything about when data should hit the
disk in the absence of explicit sync()/fsync() calls?

NFS is a different beast though.  Client cache coherency and other
issues come up.  So to maintain POSIX semantics a number of NFS
operations must be synchronous and close() on the client requires
flushing dirty buffers to the server.

Nico
-- 


Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread Nicolas Williams
On Wed, Mar 18, 2009 at 11:43:09AM -0500, Bob Friesenhahn wrote:
 In summary, I don't agree with you that the misbehavior is correct, 
 but I do agree that copious expensive fsync()s should be assured to 
 work around the problem.

fsync() is, indeed, expensive.  Lots of calls to fsync() that are not
necessary for correct application operation EXCEPT as a workaround for
lame filesystem re-ordering are a sure way to kill performance.

I'd rather the filesystems were fixed than end up with sync;sync;sync;
type folklore.  Or just don't use lame filesystems.

 As it happens, current versions of my own application should be safe 
 from this Linux filesystem bug, but older versions are not.   There is 
 even a way to request fsync() on every file close, but that could be 
 quite expensive so it is not the default.

So now you pepper your apps with an option to fsync() on close()?  Ouch.

Nico
-- 


Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread Bob Friesenhahn

On Wed, 18 Mar 2009, Richard Elling wrote:


Bob Friesenhahn wrote:
As it happens, current versions of my own application should be safe from 
this Linux filesystem bug, but older versions are not. There is even a way 
to request fsync() on every file close, but that could be quite expensive 
so it is not the default. 


Pragmatically, it is much easier to change the file system once, than
to test or change the zillions of applications that might be broken.


Yes, and particularly because fsync() can be very expensive.  At one 
time fsync() was the same as sync() for ZFS.  Presumably it is 
improved by now.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread Miles Nordin
 ja == James Andrewartha jam...@daa.com.au writes:

ja other people are arguing that POSIX says rename(2) is atomic,

Their statement is true but it's NOT an argument against T'so who is
100% right: the applications using that calling sequence for crash
consistency are not portable under POSIX.

atomic has nothing to do with crash consistency.  

It's about the view of the filesystem by other processes on the same
system, ex., the security vulnerabilities one can have with setuid
binaries that work in /tmp if said binaries don't take advantage of
certain guarantees of atomicity to avoid race conditions.  Obviously
/tmp has zero to do with what the filesystem looks like after a crash:
it always looks _empty_.

For ext4 the argument is settled, fix the app.  But a more productive
way to approach the problem would be to look at tradeoffs between
performance and crash consistency.  Maybe we need fbarrier() (which
could return faster---it sounds like on ZFS it could be a noop)
instead of fsync(), or maybe something more, something genuinely
post-Unix like limited filesystem-transactions that can open, commit,
rollback.  It's hard for a generation that grew up under POSIX to
think outside it.

A hypothetical new API ought to help balance performance/consistency
for networked filesystems, too, like NFS or Lustre/OCFS/...  For
example, networked filesystems often promise close-to-open
consistency, and the promise doesn't necessarily have to do with
crashing.  It means,

  client A                  client B
   write
   close
   sendmsg   ----------->   poll
                            open
                            read   (will see all A's writes)


  client A                  client B
   write
   wait a while
   sendmsg   ----------->   poll
                            read   (all bets are off)

This could stand obvious improvements in two ways.  First, if I'm
trying to send data to B using the filesystem 

 (monkey chorus: don't do that!  it won't work!  you
  have to send data between nodes with
  libgnetdatasender and its associated avahi-using
  setuid-nobody daemon!  just check it out of svn.  no
  it doesn't support IPv6 but the NEXT VERSION, what,
  1000 nodes? well then you definitely don't want
  to---

  DOWN, monkeychorus!  If I feel like writing in
  Python or Javurscript or even PHP, let me.  If I
  feel like sending data through a filesystem, find a
  way to let me!  why the hell not do it?  I said
  post-POSIX.)

send USING THE FILESYSTEM, then maybe I don't want to close the file
all the time because that's slow or just annoying.  Is there some
dance I can do using locks on B or A to say, ``I need B to see the
data, but I do not necessarily need, nor want to wait, for it to be
committed to disk---I just want it consistent on all clients''?  like,
suppose I keep the file open on A and B at the same time over NFS.
Will taking a write lock on A and a read lock on B actually flush the
client's cache and get the information moved from A to B faster?

Second, we've discussed before NFSv3 write-write-write-commit batching
doesn't work across close/open, so people need slogs to make their
servers fast for the task of writing thousands of tiny files while for
mounting VM disk images over NFS the slog might not be so badly
needed.  Even with the slog, the tiny-files scenario would be slowed
down by network roundtrips.  If we had a transaction API, we could
open a transaction, write 1000 files, then close it.  On a high-rtt
network this could be many orders of magnitude faster than what we
have now.  But it's hard to imagine a transactional API that doesn't
break the good things about POSIX-style like ``relatively simple'',
``apparently-stateless NFS client-server sessions'', ``advisory
locking only'', ...
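
Purely imaginary, but as a sketch of the kind of interface being described
here (none of these fs_txn_* calls exist anywhere; names, types and paths are
made up for illustration):

/* Imaginary API: fs_txn_begin() opens a transaction on a directory tree,
 * creates/writes inside it are buffered on the client, and fs_txn_commit()
 * pushes the whole batch to the server in one round trip, atomically. */
fs_txn_t *txn = fs_txn_begin("/export/queue");
for (int i = 0; i < 1000; i++) {
	char name[64];
	(void) snprintf(name, sizeof (name), "/export/queue/msg.%04d", i);
	fs_txn_write_file(txn, name, bufs[i], lens[i]);	/* no I/O on the wire yet */
}
if (fs_txn_commit(txn) < 0)	/* one round trip, all or nothing */
	fs_txn_abort(txn);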




Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread Miles Nordin
 c == Miles Nordin car...@ivy.net writes:

 c fbarrier()

on second thought that couldn't help this problem.  The goal is to
associate writing to the directory (rename) with writing to the file
referenced by that inode/handle (write/fsync/``fbarrier''), and in
POSIX these two things are pretty distant and unrelated to each other.
The POSIX way to associate these two things is to wait for fsync() to
return before asking for the rename.  The waiting is expressive---it's
an extremely simple, easy-to-understand API for associating one thing
with another.  I thought maybe this was so simple there was only one
thing, not two, so the wait could be skipped, but I am wrong.

It is too bad because as others have said it means these fsync()'s
will have to go in to make the app correct/portable with the API we
have to work under, even though ZFS has certain convenient quirks and
probably doesn't need them.

IMHO the best reaction to the KDE hysteria would be to make sure
SQLite and BerkeleyDB are as fast as possible and effortlessly correct on
ZFS, and anything that's slow because of too much synchronous writing
to tiny files should use a library instead.  This is not currently the
case because for high performance one has to manually match DB and ZFS
record sizes, which isn't practical for these tiny throwaway databases
that must share a filesystem with non-DB stuff, and there might be room
for improvement in terms of online defragmentation too.




Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread Casper . Dik

On Wed, Mar 18, 2009 at 11:43:09AM -0500, Bob Friesenhahn wrote:
 In summary, I don't agree with you that the misbehavior is correct, 
 but I do agree that copious expensive fsync()s should be assured to 
 work around the problem.

fsync() is, indeed, expensive.  Lots of calls to fsync() that are not
necessary for correct application operation EXCEPT as a workaround for
lame filesystem re-ordering are a sure way to kill performance.

I'd rather the filesystems were fixed than end up with sync;sync;sync;
type folklore.  Or just don't use lame filesystems.

 As it happens, current versions of my own application should be safe 
 from this Linux filesystem bug, but older versions are not.   There is 
 even a way to request fsync() on every file close, but that could be 
 quite expensive so it is not the default.

So now you pepper your apps with an option to fsync() on close()?  Ouch.


fsync() was always a wart.

Many of the Unix filesystem writers didn't think that it was a problem, but it
still is.  This is now part of the folklore: you must fsync.

But why do filesystem writers insist that the filesystem can reorder
all operations?  And why do they believe that metadata is more
important?

Clearly, that is false: how else can you rename files which the system 
hasn't written already?

I noticed that our old ufs code issued two synchronous writes when
creating a file.  Unfortunately, it should have used three even when we 
don't care what's in the file.

Casper





Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread David Dyer-Bennet

On Wed, March 18, 2009 11:43, Bob Friesenhahn wrote:
 On Wed, 18 Mar 2009, Joerg Schilling wrote:

 The problem in this case is not whether rename() is atomic but whether
 the
 file that replaces the old file in an atomic rename() operation is in a
 stable state on the disk before calling rename().

 This topic is quite disturbing to me ...

 The calling sequence of the failing code was:

 f = open("new", O_WRONLY|O_CREAT|O_TRUNC, 0666);
 write(f, dat, size);
 close(f);
 rename("new", "old");

 The only guaranteed way to have the file "new" in a stable state on the
 disk is to call:

 f = open("new", O_WRONLY|O_CREAT|O_TRUNC, 0666);
 write(f, dat, size);
 fsync(f);
 close(f);

 But the problem is not that the file new is in an unstable state.
 The problem is that it seems that some filesystems are not preserving
 the ordering of requests.  Failing to preserve the ordering of
 requests is fraught with peril.

Only in very limited cases.  For example, writing the blocks of a file can
occur in any order, so long as no block is written twice and so long as no
reads are performed.  It simply doesn't matter what order that goes to
disk in.  As soon as somebody reads one of the blocks written, then some
of the ordering becomes important.

You're trying, I think, to argue from first principles; may I suggest that
a lot is known about filesystem (and database) semantics, and that we will
get further if we work within what's already known about that, rather than
trying to reinvent the wheel from scratch?


 POSIX does not care about disks or filesystems.  The only correct
 behavior is for operations to be applied in the order that they are
 requested of the operating system.  This is a core function of any
 operating system.

Is this what it actually says in the POSIX documents?  Or in any other
filesystem formal definition?

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info



Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread David Dyer-Bennet

On Wed, March 18, 2009 11:59, Richard Elling wrote:
 Bob Friesenhahn wrote:
 As it happens, current versions of my own application should be safe
 from this Linux filesystem bug, but older versions are not. There is
 even a way to request fsync() on every file close, but that could be
 quite expensive so it is not the default.

 Pragmatically, it is much easier to change the file system once, than
 to test or change the zillions of applications that might be broken.

On the other hand, by doing so we've set limits on the behavior of all
future applications.

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info



Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread David Magda

On Mar 18, 2009, at 12:43, Bob Friesenhahn wrote:

POSIX does not care about disks or filesystems.  The only  
correct behavior is for operations to be applied in the order that  
they are requested of the operating system.  This is a core function  
of any operating system.  It is therefore ok for some (or all) of  
the data which was written to new to be lost, or for the rename  
operation to be lost, but it is not ok for the rename to end up with  
a corrupted file with the new name.


Out of curiosity, is this what POSIX actually specifies? If that is  
the case, wouldn't that mean that the behaviour of ext3/4 is  
incorrect? (Assuming that it does re-order operations.)




Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread James Litchfield

POSIX has a Synchronized I/O Data (and File) Integrity Completion
definition (line 115434 of the Issue 7 (POSIX.1-2008) specification).
What it says is that writes for a byte range in a file must complete
before any pending reads for that byte range are satisfied.

It does not say that if you have 3 pending writes and pending reads for
a byte range, the writes must complete in the order issued - simply that
they must all complete before any reads complete.  See lines 71371-71376
in the write() discussion.  The specification explicitly avoids discussing
the behavior of concurrent writes to a file from multiple processes, and
suggests that applications doing this should use some form of concurrency
control.

It is true that because of these semantics, many file system
implementations will use locks to ensure that no reads can occur in the
entire file while writes are happening, which has the side effect of
ensuring the writes are executed in the order they are issued.  This is
an implementation detail that can be complicated by async I/O as well.
The only guarantee POSIX offers is that all pending writes to the
relevant byte range in the file will be completed before a read to that
byte range is allowed.  An in-progress read is expected to block any
writes to the relevant byte range until the read completes.

The specification also does not say the bits for a file must end up on
the disk without an intervening fsync() operation unless you've
explicitly asked for data synchronization (O_SYNC, O_DSYNC) when you
opened the file.  The fsync() discussion (line 31956) says that the bits
must undergo a physical write of data from the buffer cache that should
be completed when the fsync() call returns.  If there are errors, the
return from the fsync() call should express the fact that one or more
errors occurred.  The only guarantee that the physical write happens is
if the system supports the _POSIX_SYNCHRONIZED_IO option.  If not, the
comment is to read the system's conformance documentation (if any) to
see what actually does happen.  In the case that _POSIX_SYNCHRONIZED_IO
is not supported, it's perfectly allowable for fsync() to be a no-op.
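
As a concrete illustration of the O_SYNC/O_DSYNC point above (my own sketch,
not from the thread; the file name and the record/record_len variables are
placeholders, and this only buys anything on a platform that supports
_POSIX_SYNCHRONIZED_IO):

/* With O_DSYNC, each write() returns only after the data (though not
 * necessarily all of the file's metadata) has reached stable storage --
 * roughly the per-write equivalent of calling fdatasync() afterwards. */
int fd = open("journal.dat", O_WRONLY|O_CREAT|O_APPEND|O_DSYNC, 0600);
if (fd >= 0) {
	if (write(fd, record, record_len) != (ssize_t)record_len) {
		/* the write, including its synchronization, failed */
	}
	(void) close(fd);
}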

Jim Litchfield
---
David Magda wrote:

On Mar 18, 2009, at 12:43, Bob Friesenhahn wrote:

POSIX does not care about disks or filesystems.  The only correct 
behavior is for operations to be applied in the order that they are 
requested of the operating system.  This is a core function of any 
operating system.  It is therefore ok for some (or all) of the data 
which was written to new to be lost, or for the rename operation to 
be lost, but it is not ok for the rename to end up with a corrupted 
file with the new name.


Out of curiosity, is this what POSIX actually specifies? If that is 
the case, wouldn't that mean that the behaviour of ext3/4 is 
incorrect? (Assuming that it does re-order operations.)




Re: [zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-18 Thread Miles Nordin
 dm == David Magda dma...@ee.ryerson.ca writes:

dm is this what POSIX actually specifies?

I doubt it.  If it did, it would basically mandate a log-structured /
COW filesystem, which, although not a _bad_ idea, is way too far from
a settled debate to be enshrining in a mandatory ``standard'' (ex.,
the database fragmentation problems with LFS, WAFL, ZFS, and the
large number of important deployed non-COW filesystems on POSIX
systems).

There's no other so-far-demonstrated way than log-structured/COW to
achieve this property which some people think they're entitled to take
for granted: ``after a reboot, the system must appear as though it did
not reorder any writes.  The filesystem must recover to some exact
state that it passed through in the minutes leading up to the crash,
some state as observed from the POSIX userland (above all write
caches).''

It's a nice property.  Nine years ago when I was trying to get Linux
users to try NetBSD, I flogged this as a great virtue of LFS.  And if
I were designing a non-POSIX operating system to replace Unix, I'd
probably promise developers this property.  But achieving it is just
too constraining to belong in POSIX.

If you can find some application that can safely disable some safety
feature when it knows it's running on ZFS that it needs to keep on
other filesystems and thus perform absurdly faster on ZFS with no
risk, then you can demonstrate the worth of promising this property.
The fsync() that I'm sure KDE will add into all their broken apps is
such an example, but I doubt it will be ``absurdly faster'' enough to
get ZFS any attention.  Maybe something to do with virtual disk
backing stores for VM's?

But I don't think pushing exaggerated expectations as ``obvious'' in
front of people who don't know the nasty details yet, nor overstating
POSIX's minimal crash requirements, is going to work.  There are just
too many smart people ready to defend the non-log-stuctured
write-in-place filesystems.  And I believe it *is* possible to write a
correct database or MTA, even with the level of guarantee those
systems provide (provide in practice, not provide as specified by
POSIX).

And the guarantees ARE minimal---just:

 http://www.google.com/search?q=POSIX+%22crash+consistency%22

and you'll find that even people against T'so who want to change ext4
still agree POSIX is on T'so's side.

My own opinion is that the apps are unportable and need to be fixed,
and that what the side against T'so wants changed is so poorly stated
it's no more than an ad-hoc ``make the apps not broken, because otherwise
anything which does the exact same thing as the broken app we just
found will also be broken!!!''---it's not a clearly articulable
guarantee like the one that, AIUI, is provided by transaction groups.

But Linux app developers never seem to give much of a flying shit
whether their apps work on notLinux, which is why they think it's
``practical'' to change ext4 rather than the nonconformant app, so
dragging out the POSIX horse for flogging in support of ``change
ext4'' looks highly hypocritical, while flogging the same horse to
support ``ZFS is the only POSIXly correct filesystem on the planet''
is flatly incorrect but at least not hypocritical. :)




[zfs-discuss] rename(2), atomicity, crashes and fsync()

2009-03-17 Thread James Andrewartha
Hi all,

Recently there's been discussion [1] in the Linux community about how
filesystems should deal with rename(2), particularly in the case of a crash.
ext4 was found to truncate files after a crash, that had been written with
open(foo.tmp), write(), close() and then rename(foo.tmp, foo). This is
 because ext4 uses delayed allocation and may not write the contents to disk
immediately, but commits metadata changes quite frequently. So when
rename(foo.tmp,foo) is committed to disk, it has a length of zero which
is later updated when the data is written to disk. This means after a crash,
foo is zero-length, and both the new and the old data has been lost, which
is undesirable. This doesn't happen when using ext3's default settings
because ext3 writes data to disk before metadata (which has performance
problems, see Firefox 3 and fsync[2])

Ted T'so's (the main author of ext3 and ext4) response is that applications
which perform open(),write(),close(),rename() in the expectation that they
will either get the old data or the new data, but not no data at all, are
broken, and instead should call open(),write(),fsync(),close(),rename().
Most other people are arguing that POSIX says rename(2) is atomic, and while
POSIX doesn't specify crash recovery, returning no data at all after a crash
is clearly wrong, and excessive use of fsync is overkill and
counter-productive (Ted later proposes a yes-I-really-mean-it flag for
fsync). I've omitted a lot of detail, but I think this is the core of the
argument.

Now the question I have, is how does ZFS deal with
open(),write(),close(),rename() in the case of a crash? Will it always
return the new data or the old data, or will it sometimes return no data? Is
 returning no data defensible, either under POSIX or common sense? Comments
about other filesystems, eg UFS are also welcome. As a counter-point, XFS
(written by SGI) is notorious for data-loss after a crash, but its authors
defend the behaviour as POSIX-compliant.

Note this is purely a technical discussion - I'm not interested in replies
saying ?FS is a better filesystem in general, or on GPL vs CDDL licensing.

[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/317781?comments=all
http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/
http://lwn.net/Articles/323169/
http://mjg59.livejournal.com/108257.html http://lwn.net/Articles/323464/
http://thunk.org/tytso/blog/2009/03/15/dont-fear-the-fsync/
http://lwn.net/Articles/323752/ *
http://lwn.net/Articles/322823/ *
* are currently subscriber-only, email me for a free link if you'd like to
read them
[2] http://lwn.net/Articles/283745/

-- 
James Andrewartha | Sysadmin
Data Analysis Australia Pty Ltd | STRATEGIC INFORMATION CONSULTANTS
97 Broadway, Nedlands, Western Australia, 6009
PO Box 3258, Broadway Nedlands, WA, 6009
T: +61 8 9386 3304 | F: +61 8 9386 3202 | I: http://www.daa.com.au