Re: [reiserfs-list] ReiserFS data corruption in very simple configuration

2001-10-03 Thread Toby Dickenson

>Of course. If you want data to hit the disk, you have to use fsync. This
>does work with reiserfs and will ensure that the data hits the disk. If
>you don't do this then bad things might happen.

This is probably a naive question, but this thread has already proved
me wrong on one naive assumption.

If the sequence is:
1. append some data to file A
2. fsync(A)
3. append some further data to A
4. some writes to other files
5. power loss

Is it guaranteed that all the data written in step 1 will still be
intact?

The potential problem I can see is that some data from step 1 may have
been written in a tail, the tail moves during step 3, and then the
original tail is overwritten before the new tail (including data from
before the fsync) is safely on disk.
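The sequence above can be sketched in Python (the file names "A" and "B" are just placeholders for this illustration):

```python
import os

# Step 1: append some data to file A
fd = os.open("A", os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
os.write(fd, b"data written before the fsync\n")

# Step 2: fsync(A) -- the question is whether this data is now
# guaranteed to survive a crash, even after the later writes below
os.fsync(fd)

# Step 3: append some further data to A (may relocate A's tail)
os.write(fd, b"data written after the fsync\n")
os.close(fd)

# Step 4: some writes to other files
with open("B", "w") as f:
    f.write("unrelated data\n")

# Step 5: power loss happens here, before any further sync
```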

Thanks for your help,


Toby Dickenson
[EMAIL PROTECTED]



[reiserfs-list] ReiserFS data corruption in very simple configuration

2001-09-30 Thread foner-reiserfs

Date: Mon, 1 Oct 2001 03:26:27 +0200
From: <[EMAIL PROTECTED] ( Marc) (A.) (Lehmann )>

On Sun, Sep 30, 2001 at 09:00:49PM -0400, [EMAIL PROTECTED] wrote:
> extending a file, the metadata is written -last-, e.g., file blocks
> are allocated, file data is written, and -then- metadata is written.

this is almost impossible to achieve with existing hardware (witness the
many discussions about disk caching for example), and, without journaling,
might even be slow.

I think we may be talking past each other; let me try to clarify.

As I said earlier in this thread, this has nothing at all to do with
disk caching.  Let me state it again:  The scenario I'm discussing
is an otherwise-idle machine that had 2 (maybe 3) files modified, sat
idle for 30-60 seconds, and then had the reset button pushed.  I would
expect that either file data and metadata got written, or neither got
written, but not metadata without file data.  This is repeatable more
or less at will---I didn't -just- happen to catch it -just- as it
decided to frob the disks.  Instead, the problem seems to be that
reiserfs is perfectly happy to update the on-disk representation of
which disk blocks contain which files' data, and then -sit there- for
a long time (a minute? longer?) without -also- attempting to flush the
file data to the disk.  This then leads to corrupted files after the
reset.  It's not that the CPU sent data to the disk subsystem that
failed to be written by the time of the interruption; it's that the
data was still sitting in RAM and the CPU hadn't even decided to get
it out the IDE channel yet.  This means that there is -always- a giant
timing hole which can corrupt data, as opposed to just the much-tinier
hole that would be created if the file-bytes-to-disk-bytes correspondence
were updated immediately after the write that wrote the data---it
would be hard for me to accidentally hit such a hole.

> of wtmp had data from the -previous- copy of XFree86.0.log that had
> been freed (because it was unlinked when the next copy was written)
> but which had not actually had the wtmp data written to it yet

It's easily possible, but it could also be a bug. Let the reiserfs authors
decide.

However, if it is indeed "a bug" then fixing it would only lower the
frequency of occurrence.

True, but as long as it makes it only happen if the disk is -in
progress of writing stuff- when the reset or power failure happens,
the risk is -greatly- reduced.  Right now, it's an enormous timing
hole, and one that's likely to be hit---it's happened to me -every
single time- I've had to hit the reset button because (for example)
I wedged X while debugging, and even if I waited a minute after the
wedge-up to do so!  The way I've avoided it is by running a job that
syncs once a second while doing debugging that might possibly make me
unable to take the machine down cleanly.  This is a disgusting and
unreliable kluge.
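The kludge described above amounts to something like this small Python loop (a stand-in for running sync(1) from a shell; the interval and iteration cap are arbitrary):

```python
import os
import time

def sync_loop(interval=1.0, iterations=None):
    """Flush all dirty buffers to disk every `interval` seconds.

    A crude safety net while debugging: if the machine must be
    reset, at most `interval` seconds of writes are lost.
    """
    n = 0
    while iterations is None or n < iterations:
        os.sync()  # same effect as running the sync(1) command
        time.sleep(interval)
        n += 1

# Run a couple of iterations as a demonstration.
sync_loop(interval=0.1, iterations=2)
```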

Only ext3 (some modes) + turning off your harddisk's cache can ensure
this, at the moment.

Or ext3 (some modes) + assuming that the disk will at least write data
that's been sent to it, even if the CPU gets reset.  (I know it's
hopeless if power fails, but that can be made arbitrarily unlikely,
compared to a kernel panic or having to do a CPU reset.)

> to have that logfile in it (instead of zero bytes).  Is this what
> you're talking about when you say "*old* data"?  I think so, and that
> seems to match your comment below about file tails moving around
> rapidly.

appending to logfiles will result in a lot of writes. with other,
strictly block-based filesystems appends are just as frequent, but the
data will not usually move around. with reiserfs, tail movement is frequent.

Right.

> Wouldn't it make more sense to commit metadata to disk -after- the
> data blocks are written?

The problem is that there is currently no easy way to achieve that.

Why not?  (Ignore the disk-caching issue and concentrate on when the
kernel asks for data to be written to the disk.  I am -assuming that
the kernel either (a) writes the data in the order requested, or at
least (b) once it decides to write anything, keeps sending it to the
disk until its queue is completely empty.)

> file simply looks like the data was never added.  If the metadata is
> written -first-, the file can scoop up random trash from elsewhere in

Also, this is not a matter of metadata first or last. Sometimes you need
metadata first, sometimes you need it last. And in many cases, "metadata"
does not need to change, while data still changes.

I'm using "metadata" here as a shorthand for "how the filesystem knows
which byte on disk corresponds to which byte in the file", not just
things like atime, ctime, etc.

> the filesystem.  I contend that this is -much- worse, because it can
> render a previously-good file completely unparseable by tools that

Re: [reiserfs-list] ReiserFS data corruption in very simple configuration

2001-09-30 Thread pcg

On Sun, Sep 30, 2001 at 09:00:49PM -0400, [EMAIL PROTECTED] wrote:
> extending a file, the metadata is written -last-, e.g., file blocks
> are allocated, file data is written, and -then- metadata is written.

this is almost impossible to achieve with existing hardware (witness the
many discussions about disk caching for example), and, without journaling,
might even be slow.

> of wtmp had data from the -previous- copy of XFree86.0.log that had
> been freed (because it was unlinked when the next copy was written)
> but which had not actually had the wtmp data written to it yet

It's easily possible, but it could also be a bug. Let the reiserfs authors
decide.

However, if it is indeed "a bug" then fixing it would only lower the
frequency of occurrence.

Only ext3 (some modes) + turning off your harddisk's cache can ensure
this, at the moment.

> to have that logfile in it (instead of zero bytes).  Is this what
> you're talking about when you say "*old* data"?  I think so, and that
> seems to match your comment below about file tails moving around
> rapidly.

appending to logfiles will result in a lot of writes. with other,
strictly block-based filesystems appends are just as frequent, but the
data will not usually move around. with reiserfs, tail movement is frequent.

> Wouldn't it make more sense to commit metadata to disk -after- the
> data blocks are written?

The problem is that there is currently no easy way to achieve that.

> file simply looks like the data was never added.  If the metadata is
> written -first-, the file can scoop up random trash from elsewhere in

Also, this is not a matter of metadata first or last. Sometimes you need
metadata first, sometimes you need it last. And in many cases, "metadata"
does not need to change, while data still changes.

> the filesystem.  I contend that this is -much- worse, because it can
> render a previously-good file completely unparseable by tools that
> expect that -all- of the file is in a particular syntax.

It depends - with ext2 you frequently have garbled files, too. Basically, if
you write to a file and turn off the power, the outcome is unpredictable,
and always will be (unless you are ready to take the big speed hit).

> Unfortunately, this behavior meant that X -did- fall over, because my
> XF86Config file was trashed by being scrambled---I'd recently written
> out a new version, after all---and the trashed copy no longer made any

But the same thing can and does happen with ext2, depending on your editor
and your timing. It is not a reiserfs thing.

> But if you write the metadata first, you foil this attempt to be safe,
> because you might have this sequence at the actual disk:  [magnetic
> oxide updated w/rename][start updating magnetic oxide with tempfile
> data][power failure or reset]---ooops! original file gone, new file
> doesn't have its data yet, so sorry, thanks for playing.

Of course. If you want data to hit the disk, you have to use fsync. This
does work with reiserfs and will ensure that the data hits the disk. If
you don't do this then bad things might happen.

> By writing metadata first, it seems that reiserfs violates the
> idempotence of many filesystem operations, and does exactly the
> opposite of what "journalling" implies to anyone who understands
> databases, namely that either the operation completes entirely, or it
> is completely undone.

You are confusing databases with filesystems, however. Most journaling
filesystems work that way. Some (like ext3) are nice enough to let you
choose.

> journal the metadata, but how does this help when what it's essentially
> doing is trashing the -data- in unexpected ways exactly when such
> journalling is supposed to help, namely across a machine failure?

But ext2 works in the same way. It does happen more often with reiserfs
(especially with tails), but ignoring the problem for ext2 doesn't make it
right. If applications don't work reliably with reiserfs, they don't work
reliably with ext2. If you want reliability then mount synchronously.

> This seems like such an elementary design defect that I'm at a loss
> to understand why it's there.

Just about every filesystem has this "elementary design defect". If you
want data to hit the disk, sync it. It's that simple.

> There -must- be some excellent reason,
> right?  But what?  And if not, can it be fixed?

Speed is an excellent reason. The fix is to tell the kernel to write the
data out to the platters.
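"Tell the kernel to write the data out" means, in practice, an explicit fsync. A sketch in Python (the filename is made up): fsync the file for its data, and, on Linux, also fsync the containing directory so the directory entry survives a crash.

```python
import os

def durable_write(path, data):
    """Write data and force it to the platters before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)          # flush the file's data to disk
    finally:
        os.close(fd)
    # Also fsync the parent directory, so the name-to-inode
    # mapping (the metadata being argued about here) is durable.
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)

durable_write("example.log", b"committed for real\n")
```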

Anyway, this is a good time to review the various discussions on the
reiserfs list and the kernel list on how to teach the kernel (if it is
possible) to implement loose write-ordering.

-- 
  -==- |
  ==-- _   |
  ---==---(_)__  __   __   Marc Lehmann  +--
  --==---/ / _ \/ // /\ \/ /   [EMAIL PROTECTED]  |e|
  -=/_/_//_/\_,_/ /_/\_\   XX11-RIPE --+
The choice of a GNU generation   

[reiserfs-list] ReiserFS data corruption in very simple configuration

2001-09-30 Thread foner-reiserfs

Date: Sat, 29 Sep 2001 14:52:29 +0200
From: <[EMAIL PROTECTED] ( Marc) (A.) (Lehmann )>

Thanks for your response!  Bear with me, though, because I'm asking
a design question below that relates to this.

On Sat, Sep 29, 2001 at 12:44:59AM -0400, Lenny Foner 
<[EMAIL PROTECTED]> wrote:
> isn't fixed by fsck?  [See outcome (d) below.]  I'm having difficulty
> believing how this can be possible for a non-journalling filesystem.

If you have difficulties in believing this, may I ask you how you think it
is possible for a non-journaling filesystem to prevent this at all?

Naively, one would assume that any non-journalling FS that has written
correct metadata through to the disk would either have written updates
into files, or failed to write them, but would not have written new
(<60 second old) data into different files than the data was destined for.
(I suppose the assumption I'm making here is that, when creating or
extending a file, the metadata is written -last-, e.g., file blocks
are allocated, file data is written, and -then- metadata is written.
That way, a failure anywhere before the final step simply makes the
update seem to vanish, whereas writing metadata first seems to cause
the lossage below.)

> But what about written to the wrong files?  See below.

What you see is most probably *old* data, not data from another (still
existing) file.

I'm...  dubious, but maybe.  As mentioned earlier in this thread,
one of the failures I saw consisted of having several lines of my
XFree86.0.log file appended to wtmp---when I logged in after the
failure, I got "Last login: " followed by several lines from that file
instead of a date.  (Other failures scrambled other files worse.)

Now, it's -possible- that rsfs allocated an extra portion to the end
of wtmp for the last-login data (as a user of the fs, I don't care
whether officially this was a "block", an entry in a journal, etc),
login "wrote" to that region (but it wasn't committed yet 'cause no
sync), my XFree86.0.log file was "created" and "written" (again
uncommitted), I pushed reset, and then when it came back up, the end
of wtmp had data from the -previous- copy of XFree86.0.log that had
been freed (because it was unlinked when the next copy was written)
but which had not actually had the wtmp data written to it yet
(because a sync hadn't happened).  I have no way to verify this, since
one XFree86.0.log looks much like the other.  Conceptually, this would
imply that wtmp was extended into disk freespace, which just happened
to have that logfile in it (instead of zero bytes).  Is this what
you're talking about when you say "*old* data"?  I think so, and that
seems to match your comment below about file tails moving around
rapidly.

But it doesn't explain -why- it works this way in the first place.
Wouldn't it make more sense to commit metadata to disk -after- the
data blocks are written?  After all, if -either one- isn't written,
the file is incomplete.  But if the metadata is written -last-, the
file simply looks like the data was never added.  If the metadata is
written -first-, the file can scoop up random trash from elsewhere in
the filesystem.  I contend that this is -much- worse, because it can
render a previously-good file completely unparseable by tools that
expect that -all- of the file is in a particular syntax.  It's just
an accident, I guess, that login will accept any random trash when
it prints its "last-login" message, rather than falling over with a
coredump because it doesn't look like a date.  [And see * below.]

Unfortunately, this behavior meant that X -did- fall over, because my
XF86Config file was trashed by being scrambled---I'd recently written
out a new version, after all---and the trashed copy no longer made any
sense.  I would have been -much- happier to have had the -unmodified-,
-old- version than a scrambled "new" version!  Without Emacs ~ files,
this would have been much worse.  Consider an app that, "for reliability",
rewrites a file by creating a temp copy, writing it out, then renaming
the temp over the original [this is how Emacs typically saves files].
But if you write the metadata first, you foil this attempt to be safe,
because you might have this sequence at the actual disk:  [magnetic
oxide updated w/rename][start updating magnetic oxide with tempfile
data][power failure or reset]---ooops! original file gone, new file
doesn't have its data yet, so sorry, thanks for playing.
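The write-temp-then-rename pattern described above is only safe if the temp file's data is on disk before the rename. A sketch in Python (function and file names are hypothetical):

```python
import os

def safe_rewrite(path, data):
    """Replace path's contents so a crash leaves either the old
    file or the complete new file, never a scrambled mixture."""
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        # Crucial step: force the new data to disk *before* the
        # rename.  Without this, a metadata-first filesystem can
        # commit the rename while the data is still only in RAM.
        os.fsync(fd)
    finally:
        os.close(fd)
    # rename() atomically replaces the old file with the new one.
    os.rename(tmp, path)

safe_rewrite("XF86Config", b"new configuration\n")
safe_rewrite("XF86Config", b"newer configuration\n")
```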

By writing metadata first, it seems that reiserfs violates the
idempotence of many filesystem operations, and does exactly the
opposite of what "journalling" implies to anyone who understands
databases, namely that either the operation completes entirely, or it
is completely undone.  Yes, yes, I know (now!) that it claims to only
journal the metadata, but how does this help when what it's essentially
doing is trashing the -data- in unexpected ways exactly when such
journalling is supposed to help, namely across a machine failure

[reiserfs-list] ReiserFS data corruption in very simple configuration

2001-09-28 Thread Lenny Foner

[As before, please make sure you CC me on replies or I won't see them.  Tnx!]

Date: Tue, 25 Sep 2001 14:28:54 +0100
From: "Stephen C. Tweedie" <[EMAIL PROTECTED]>

Hi,

On Sat, Sep 22, 2001 at 04:44:21PM -0400, [EMAIL PROTECTED] wrote:

> Stock reiserfs only provides meta-data journalling. It guarantees that
> structure of you file-system will be correct after journal replay, not
> content of a files. It will never "trash" file that wasn't accessed at
> the moment of crash, though.
> 
> Thanks for clarifying this.  However, I should point out that the
> failure mode is quite serious---whereas ext2fs would simply fail
> to record data written to a file before a sync, reiserfs seems to
> have instead -swapped random pieces of one file with another-,
> which is -much- harder to detect and fix.

Not true.  ext2, ext3 in its "data=writeback" mode, and reiserfs can
all demonstrate this behaviour.  Reiserfs is being no worse than ext2
(the timings may make the race more or less likely in reiserfs, but
ext2 _is_ vulnerable.)

ext2fs can write parts of file A to file B, and vice versa, and this
isn't fixed by fsck?  [See outcome (d) below.]  I'm having difficulty
believing how this can be possible for a non-journalling filesystem.

e2fsck only restores metadata consistency on ext2 after a crash: it
can't possibly guarantee that all the data blocks have been written.

But what about written to the wrong files?  See below.

ext3 will let you do full data journaling, but also has a third mode
(the default), which doesn't journal data, but which does make sure
that data is flushed to disk before the transaction which allocated
that data is allowed to commit.  That gives you most of the
performance of ext3's fast-and-loose writeback mode, but with an
absolute guarantee that you never see stale blocks in a file after a
crash.

I've been getting a stream of private mail over the last few days
saying one thing or another about various filesystems with various
optional patches, so let me get this out in the open and see if we can
converge on an answer here.  [ext2fs, ext3fs, and reiserfs answers
should feel free to cite which mode they're talking about and URLs for
whatever patches are required to get to that mode; some impressions
about reliability and maturity would be useful, too.]

Let's take this scenario:  Files A and B have had blocks written to
them sometime in the recent past (30 to 60 seconds or so) and a sync
has not happened yet.  (I don't know how often reiserfs will be synced
by default; 60 seconds?  Longer?  Presumably running "sync" will force
it, but I don't know when else it will happen.)  File A may have been
completely rewritten or newly written (e.g., what Emacs does when it
saves a file), whereas file B may have simply been appended to (e.g.,
what happens when wtmp is updated).
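The two kinds of writes in this scenario can be sketched in Python (file names are placeholders; note that no sync or fsync is issued anywhere, so both updates may sit in RAM until the reset):

```python
import os

# File A: completely rewritten (roughly what Emacs does on save --
# write a new copy, then rename it over the original)
with open("fileA.tmp", "w") as f:
    f.write("entirely new contents of A\n")
os.rename("fileA.tmp", "fileA")

# File B: appended to (roughly what updating wtmp does)
with open("fileB", "a") as f:
    f.write("one appended record\n")

# ... 30-60 seconds of idleness, then the reset button is pushed.
```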

The CPU reset button is then pushed.  [See P.P.S. at end of this message.]

Now, we have the following possibilities for the outcome after the
system comes back up and has finished checking its filesystem:

(a) Metadata correctly written, file data correctly written.
(b) Metadata correctly written, file data partially written.
    (E.g., one or both files might have been partially or completely
    updated.)
(c) Metadata correctly written, file data completely unwritten.
    (Neither file got updated at all.)
(d) Metadata correctly written, FILE DATA INTERCHANGED BETWEEN A AND B.
    (E.g., file A gets some of file B written somewhere within it,
    and file B gets some of file A written somewhere within it---this
    is the behavior I observed, at least twice, with reiserfs.)
(e) Metadata corrupted in some fashion, file data undefined.
    ("Undefined" means could be any of (a) through (d) above; I don't care.)

Now, which filesystems can show each outcome?  I don't know.  I
contend that reiserfs does (d).  Stephen Tweedie talks above about
whether we can "guarantee that all the data blocks have been written",
but may be missing the point I was making, namely that THE BLOCKS HAVE
BEEN WRITTEN TO THE WRONG FILES.

It would be nice to know, for each of ext2fs, ext3fs, and reiserfs,
what the -intended- outcome is, and what the -actual- outcome is
(since implementation bugs might make the actual outcome different
from the intended outcome).  Any additional filesystems anyone would
like to toss into the pot would be welcome; maybe I'll post a matrix
of the results, if we get some.

I'm -assuming- that the intended outcome for reiserfs (without data
journalling) is one of (a), (b), or (c).  If the intended outcome for
reiserfs without data journalling [or -any- FS, really] is in fact
(d), then I don't understand how this filesystem can be intended for
any reliable service, since a failure will garble all files written in
the last several seconds in a fashion that is very, very difficult to
unscramble.  (-Per