So, I've been aware of this problem, and have been working on a
solution, but since I'm not subscribed to this bug, I wasn't aware of
the huge discussion going on here until Nullack prodded me and asked me
to "take another look at bug 317781".    The short answer is (a) yes,
I'm aware of it, (b) there is a (partial) solution, (c) it's not yet in
mainline, and as far as I know, not in an Ubuntu kernel, but it is
queued for integration at the next merge window, after 2.6.29 releases,
and (d) this is really more of an application design problem than
anything else.  The patches in question are:

http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=3bf3342f394d72ed2ec7e77b5b39e1b50fad8284
http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=6645f8c3bc3cdaa7de4aaa3d34d40c2e8e5f09ae
http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=dbc85aa9f11d8c13c15527d43a3def8d7beffdc8

So, what is the problem?    POSIX fundamentally says that what happens
if the system is not shut down cleanly is undefined.   If you want to
force things to be stored on disk, you must use fsync() or fdatasync().
There may be performance problems with this, which is what happened with
Firefox 3.0 [1] --- but that's why POSIX doesn't require that things be
synched to disk as soon as the file is closed.

[1] http://shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/
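
To make that concrete, here is a minimal C sketch (the filename is
purely illustrative) of what "use fsync()" means in practice; this is
the only behaviour POSIX actually promises will survive a crash:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* rewrite a config file in place (name is just an example) */
    int fd = open("settings.conf", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char buf[] = "key=value\n";
    if (write(fd, buf, sizeof(buf) - 1) < 0) { perror("write"); return 1; }

    /*
     * Without this, POSIX makes no promise about when the data
     * reaches the platter; fdatasync(fd) is the cheaper variant if
     * you don't need the timestamps flushed along with the data.
     */
    if (fsync(fd) < 0) { perror("fsync"); return 1; }

    return close(fd);
}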

So, why wasn't this a problem before?   Well, ext3 by
default has a commit interval of 5 seconds, and has data=ordered.   What
does this mean?   Well, every 5 seconds, the ext3 journal is committed;
this means that any changes since the last commit are now guaranteed
to survive an unclean shutdown.   The journalling mode data=ordered
means that only metadata is written in the journal, but data is ordered;
this means that before the commit takes place, any data blocks
associated with inodes that are about to be committed in that
transaction will be forced out to disk.   This is primarily done for
security reasons; if this were not done, then any newly allocated blocks
might still contain previous data belonging to some other file or user,
and after a crash, accessing that file might result in a user seeing
someone else's mail or p0rn, and that's unacceptable from a security
perspective.

However, this had the side effect of essentially guaranteeing that
anything that had been written would be on disk after at most 5
seconds.   (This is somewhat modified if you are running on batteries
and have enabled laptop mode, but we'll ignore that for the purposes of
this discussion.)  Since ext3 became the dominant filesystem for Linux,
application writers and users have started depending on this, and so
they become shocked and angry when their system locks up and they lose
data --- even though POSIX never really made any such guarantee.   (We
could be snide and point out that they should have been shocked and
angry about crappy proprietary, binary-only drivers that no one but the
manufacturer can debug, or angry at themselves for not installing a UPS,
but that's not helpful; expectations are expectations, and it's hard to
get people to change those expectations, even when they aren't good for
themselves or the environment --- such as Americans living in exurbs
driving SUVs getting shocked and angry when gasoline hit $4/gallon, and
their 90-minute daily commute started getting expensive.  :-)

OK, so enter ext4 and delayed allocation.    With delayed allocation, we
don't allocate a location on disk for the data block right away.  Since
there is no location on disk, there is no place to write the data on a
commit; but it also means that there is no security problem.  It also
results in a massive performance improvement; for example, if you
create a scratch file, and then delete it 20 seconds later, it will
probably never hit the disk.   Unfortunately, the default VM tuning
parameters, which can be controlled by
/proc/sys/vm/dirty_expire_centisecs and
/proc/sys/vm/dirty_writeback_centisecs, mean that in practice, a
newly created file won't hit disk until about 45-150 seconds later,
depending on how many dirty pages are in the page cache at the time.
(This isn't unique to ext4, by the way --- any advanced filesystem which
does delayed allocation, which includes xfs and, in the future,
btrfs, will have the same issue.)
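
The scratch-file case is easy to picture; here is a hypothetical C
sketch (filename and sizes invented) of the pattern that delayed
allocation turns into a purely in-memory operation:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* create a short-lived scratch file */
    int fd = open("scratch.tmp", O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0) { perror("open"); return 1; }

    char block[4096] = { 0 };
    for (int i = 0; i < 256; i++)       /* ~1MB of temporary data */
        if (write(fd, block, sizeof(block)) < 0) { perror("write"); break; }

    /*
     * The file is deleted long before the writeback window expires;
     * with delayed allocation, no blocks were ever assigned, so
     * nothing ever needs to be written to the disk at all.
     */
    unlink("scratch.tmp");
    return close(fd);
}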

So the difference between 5 seconds and 60 seconds (the normal time if
you're writing huge data sets) isn't *that* big for most workloads, but
it matters for certain crappy applications that apparently write huge
numbers of small files in users' home directories.   This appears to be
the case for both GNOME and KDE.   Since these applications are
rewriting existing files, and are apparently doing so *frequently*, the
chances that files will be lost are high.

So.... what are the solutions?    The patches which are queued for the
2.6.30-rc1 merge window are basically a hack which forces blocks that
had been delay-allocated to be allocated when either (a) the file
being written had previously been truncated using ftruncate() or opened
using O_TRUNC, in which case the blocks will be allocated when the file
is closed, or (b) a file containing blocks not yet allocated is renamed
using the rename(2) system call such that a previously existing file is
unlinked (i.e., the application has written the file "foo.new" and is
now calling rename("foo.new", "foo"), causing the file "foo" to be
unlinked), in which case the file's blocks will also be forcibly
allocated.   This solves the most common cases where some crappy desktop
framework is constantly rewriting large numbers of files in ~/.gnome or
~/.kde, since in those cases, where the files are constantly being
replaced, they will be forced out to disk, giving users the old ext3
behaviour.   However, large files that are being streamed out, or
large database files, in most cases won't meet criteria (a) or
(b) above, so we end up preserving most of the performance advantages of
ext4.
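
To make case (b) concrete, here is a minimal C sketch of the
replace-via-rename pattern (filenames are illustrative).  An
application which also calls fsync() before the rename, as shown
below, is safe on any POSIX filesystem; the queued patch effectively
supplies the missing flush for applications that omit it:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char data[] = "new contents\n";

    int fd = open("foo.new", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (write(fd, data, sizeof(data) - 1) < 0) { perror("write"); return 1; }

    /*
     * Flush the new contents before the rename, so that after a
     * crash we see either the old "foo" or the complete new one,
     * never a zero-length file.
     */
    if (fsync(fd) < 0) { perror("fsync"); return 1; }
    close(fd);

    /* atomically replace "foo"; the old inode gets unlinked */
    if (rename("foo.new", "foo") < 0) { perror("rename"); return 1; }
    return 0;
}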

Another solution is to make sure your system is reliable.  :-)    If you
have your server in a proper data center, with a UPS, and you're not
using any unreliable binary-only video drivers or network drivers, then
your system shouldn't be randomly locking up or crashing; in that case,
a further patch, which will also be merged during the 2.6.30-rc1 merge
window, will provide a mount option disabling the above-mentioned
kludge, since the kludge does cost some performance.

The final solution is that we need properly written applications and
desktop libraries.  The proper way of doing this sort of thing is not to
have hundreds of tiny files in private ~/.gnome2* and ~/.kde2*
directories.  Instead, the answer is to use a proper small database like
SQLite for application registries, but fixed up so that it allocates and
releases space for its database in chunks, and so that it uses
fdatasync() instead of fsync() to guarantee that data is written on
disk.  If SQLite had been properly written so that it grabbed new space
for its database storage in chunks of 16k or 64k, and released space
when it was no longer needed in similar large chunks via truncate(), and
if it used fdatasync() instead of fsync(), the performance problems with
Firefox 3 wouldn't have taken place.   Such a solution is also far more
efficient in terms of disk space utilization, and minimizes disk writes,
which is good for SSDs.   It is the ultimate correct answer, but it
means that you need someone with systems experience writing the
libraries used by application writers.
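
As a rough sketch of what "allocate in chunks, sync with fdatasync()"
might look like (the chunk size, filename, and append_record() helper
are all hypothetical illustrations, not SQLite's actual code):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define CHUNK (64 * 1024)    /* grow the file 64k at a time */

/*
 * Hypothetical helper: append a record, preallocating space in large
 * chunks so block allocation happens rarely and contiguously.
 */
static int append_record(int fd, off_t *alloc, off_t *used,
                         const void *rec, size_t len)
{
    if (*used + (off_t)len > *alloc) {
        if (posix_fallocate(fd, *alloc, CHUNK) != 0)
            return -1;
        *alloc += CHUNK;
    }
    if (pwrite(fd, rec, len, *used) != (ssize_t)len)
        return -1;
    *used += len;

    /*
     * fdatasync() flushes the data and only the metadata needed to
     * retrieve it (such as the file size), skipping things like
     * mtime, so it is cheaper than a full fsync().
     */
    return fdatasync(fd);
}

int main(void)
{
    int fd = open("registry.db", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    off_t alloc = 0, used = 0;
    if (append_record(fd, &alloc, &used, "key=value\n", 10) < 0)
        perror("append_record");
    return close(fd);
}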

-- 
Ext4 data loss
https://bugs.launchpad.net/bugs/317781
