[Roché Compaan]
...
http://mail.zope.org/pipermail/zodb-dev/2004-July/007682.html
...


[Lennart Regebro]
I read this thread, and it seems to me that the ultimate solution
would be to have a setting for FSStorage, say "fsync-behaviour" with
the options of "single", "double", "none" or "interval". We'd need some
explanatory text too. Something like:

fsync-behaviour: Determines when fsync is called. Default: single.

Options:

Single: Calls fsync once per transaction. Gives you reasonable data
reliability in most cases. In a crash you should lose at most one
transaction.

Note that the original complaint was that single-fsync could
theoretically leave the last transaction /claiming/ it completed
successfully even though the transaction data wasn't fully written to
disk.  That's a case of Data.fs corruption (and possibly undetectable
by fstest.py or fsrefs.py, if you were extremely unlucky) rather than
of merely losing a transaction.

Double: Calls fsync before marking transaction as complete as well as
after marking it as complete.

Which theoretically worms around the above by saving all the
transaction's data before marking the transaction complete.  Note that
it's /still/ possible to lose the last transaction (if the box crashes
after writing the transaction's data but before marking the
transaction complete).  IOW, the failure mode attributed to the
single-fsync case above actually belongs to the double-fsync case too.
The theoretical failure mode for the single-fsync case is much worse.

This setting is only useful if you have configured the complete storage
chain (operating system, filesystem, drivers, controllers and disks) so that
it will not let fsync return until data is safely and completely written to
disk. In most cases, and without such configuration of the complete storage
chain, this setting will slow down FSStorage without actually increasing the
reliability of data written to disk.
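
To make the ordering in the double-fsync case concrete, here is a minimal
sketch. It assumes a made-up record layout (one status byte followed by the
payload) and a hypothetical function name; it is not FileStorage's actual
code or on-disk format, and it only helps if fsync really waits for the disk.

```python
import os


def commit_double_fsync(f, payload):
    """Append one record and fsync both before and after marking it complete.

    Hypothetical layout: a single status byte ("c" = still in progress,
    " " = complete) followed by the payload bytes. `f` must be a binary
    file opened for reading and writing.
    """
    start = f.seek(0, os.SEEK_END)
    f.write(b"c" + payload)      # record is written, but not yet marked complete
    f.flush()                    # push stdio-level buffers down to the OS
    os.fsync(f.fileno())         # first fsync: ask the OS to store the data

    f.seek(start)
    f.write(b" ")                # now mark the record complete
    f.flush()
    os.fsync(f.fileno())         # second fsync: ask the OS to store the marker

    f.seek(0, os.SEEK_END)
    return start
```

A crash between the two fsync calls leaves a record still marked "c", so the
last transaction is lost but nothing on disk claims to be complete when it
isn't.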

Interval: Will call fsync only every couple of transactions, with the
interval determined by the setting "fsync-interval". This is good for
write-intensive applications where you don't mind losing a couple of
transactions if the computer should crash.

There's again the worse possibility of corruption.  Disk controllers
typically reorder pending writes to minimize head movement, so
transaction bytes may not be stored to disk in the order written by
software.  If the box crashes "in the middle", there's no guessing
what's left on disk.

None: Will never call fsync. Good for applications with a high write load
where the data is not critical. Never ever use this setting on
Windows, as on Windows it makes it highly likely that data will not be
written to disk at all, and a crash could quite likely make you lose
all your changes.

In truth, I expect that Windows is in exactly the same boat here:  "it
depends" on a gazillion things few people know about and that are hard
to find out.  For example, there are knobs in Windows that purport to
disable write caching on a per-hard-drive basis.  God only knows what
that truly means.
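
For what it's worth, the four proposed options boil down to a small decision
made at the end of each transaction. A rough sketch, assuming a hypothetical
helper name and an arbitrary default interval of 10 (neither the setting nor
this function exists in ZODB):

```python
def should_fsync(mode, txn_number, interval=10):
    """Decide whether to fsync after transaction number `txn_number` (1-based).

    `mode` is one of the proposed "fsync-behaviour" values and `interval`
    stands in for the proposed "fsync-interval" setting. For "double" this
    only covers the fsync after marking the transaction complete; the
    extra fsync before marking it is a separate step.
    """
    if mode == "none":
        return False
    if mode == "interval":
        return txn_number % interval == 0
    if mode in ("single", "double"):
        return True
    raise ValueError("unknown fsync-behaviour: %r" % (mode,))
```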

It's a good proposal, so extend it :-)  FileStorage should also grow a
method saying "do all you can to ensure that everything done so far is
committed to disk".  This would consist of:

1. Flush and fsync the .fs file (note that ZODB works at stream (FILE *) level,
  not at filehandle ("little integer") level; flush() works at the former
  level and fsync() at the latter; it does no good to fsync if the stdio
  stream buffers still hold /some/ of the data, which is why flushing is
  necessary before an fsync).

2. Update the .index file (ZODB currently does that only "when it feels like
  it", roughly each 10000 data records written, and when a FileStorage is
  closed).

3. Flush and fsync the .index file (ZODB currently never fsyncs the .index
  file).
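
The three steps above could be sketched roughly like this. The function
names and the save_index callback are hypothetical stand-ins, not existing
FileStorage methods, and whether the bytes truly reach the platter still
depends on the whole storage chain.

```python
import os


def flush_and_fsync(f):
    """Flush stream-level buffers, then fsync the underlying file descriptor."""
    f.flush()                  # stream level: hand buffered bytes to the OS
    os.fsync(f.fileno())       # descriptor level: ask the OS to write them out


def sync_everything(data_file, index_path, save_index):
    """Do what can be done from software to get data and index onto disk.

    `data_file` is the open data file object, `index_path` is where the
    index should be written, and `save_index` is a hypothetical callback
    that writes the current index to the file object it is given.
    """
    flush_and_fsync(data_file)               # step 1: the .fs file
    with open(index_path, "wb") as index_file:
        save_index(index_file)               # step 2: update the .index file
        flush_and_fsync(index_file)          # step 3: flush and fsync it too
```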