Re: [reiserfs-list] ReiserFS data corruption in very simple configuration
> Of course. If you want data to hit the disk, you have to use fsync. This
> does work with reiserfs and will ensure that the data hits the disk. If
> you don't do this then bad things might happen.

This is probably a naive question, but this thread has already proved me
wrong on one naive assumption. If the sequence is:

1. append some data to file A
2. fsync(A)
3. append some further data to A
4. some writes to other files
5. power loss

Is it guaranteed that all the data written in step 1 will still be
intact? The potential problem I can see is that some data from step 1
may have been written in a tail, the tail moves during step 3, and then
the original tail is overwritten before the new tail (including data
from before the fsync) is safely on disk.

Thanks for your help,

Toby Dickenson
[EMAIL PROTECTED]
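The five-step sequence above can be sketched with the POSIX-style calls in Python's os module (a minimal sketch; the file name "A" is illustrative, and of course the power loss in step 5 cannot be simulated here):

```python
import os

path = "A"
open(path, "wb").close()  # start from an empty file for the sketch

# 1. Append some data to file A.
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
os.write(fd, b"first batch\n")

# 2. fsync(A): block until the kernel has pushed A's data toward the
#    platters (modulo the drive's own cache, discussed elsewhere in
#    this thread).
os.fsync(fd)

# 3. Append some further data to A.  The question raised above is
#    whether this append can relocate an already-fsynced tail, so that
#    a crash before the new tail is on disk endangers the step-1 data.
os.write(fd, b"second batch\n")
os.close(fd)
```

Nothing in the fsync contract itself addresses tail repacking; the question is whether the filesystem's internal block reuse can invalidate data that an earlier fsync already made durable.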
[reiserfs-list] ReiserFS data corruption in very simple configuration
Date: Mon, 1 Oct 2001 03:26:27 +0200
From: Marc A. Lehmann <[EMAIL PROTECTED]>

> On Sun, Sep 30, 2001 at 09:00:49PM -0400, [EMAIL PROTECTED] wrote:
>
> > extending a file, the metadata is written -last-, e.g., file blocks
> > are allocated, file data is written, and -then- metadata is written.
>
> this is almost impossible to achieve with existing hardware (witness
> the many discussions about disk caching for example), and, without
> journaling, might even be slow.

I think perhaps we may be talking past each other; let me try to
clarify. As I said earlier in this thread, this has nothing at all to
do with disk caching. Let me restate this again:

The scenario I'm discussing is an otherwise-idle machine that had 2
(maybe 3) files modified, sat idle for 30-60 seconds, and then had the
reset button pushed. I would expect that either file data and metadata
got written, or neither got written, but not metadata without file
data. This is repeatable more or less at will---I didn't -just- happen
to catch it -just- as it decided to frob the disks.

Instead, the problem seems to be that reiserfs is perfectly happy to
update the on-disk representation of which disk blocks contain which
files' data, and then -sit there- for a long time (a minute? longer?)
without -also- attempting to flush the file data to the disk. This
then leads to corrupted files after the reset. It's not that the CPU
sent data to the disk subsystem that failed to be written by the time
of the interruption; it's that the data was still sitting in RAM and
the CPU hadn't even decided to get it out the IDE channel yet.

This means that there is -always- a giant timing hole which can
corrupt data, as opposed to just the much-tinier hole that would be
created if the file-bytes-to-disk-bytes correspondence were updated
immediately after the write that wrote the data---it would be hard for
me to accidentally hit such a hole.
> > of wtmp had data from the -previous- copy of XFree86.0.log that had
> > been freed (because it was unlinked when the next copy was written)
> > but which had not actually had the wtmp data written to it yet
>
> It's easily possible, but it could also be a bug. Let the reiserfs
> authors decide. However, if it is indeed "a bug" then fixing it would
> only lower the frequency of occurrence.

True, but as long as it makes it only happen if the disk is -in
progress of writing stuff- when the reset or power failure happens,
the risk is -greatly- reduced. Right now, it's an enormous timing
hole, and one that's likely to be hit---it's happened to me -every
single time- I've had to hit the reset button because (for example) I
wedged X while debugging, and even if I waited a minute after the
wedge-up to do so! The way I've avoided it is by running a job that
syncs once a second while doing debugging that might possibly make me
unable to take the machine down cleanly. This is a disgusting and
unreliable kluge.

> Only ext3 (some modes) + turning off your harddisk's cache can
> ensure this, at the moment.

Or ext3 (some modes) + assuming that the disk will at least write data
that's been sent to it, even if the CPU gets reset. (I know it's
hopeless if power fails, but that can be made arbitrarily unlikely,
compared to a kernel panic or having to do a CPU reset.)

> > to have that logfile in it (instead of zero bytes). Is this what
> > you're talking about when you say "*old* data"? I think so, and
> > that seems to match your comment below about file tails moving
> > around rapidly.
>
> Appending to logfiles will result in a lot of movement. With other,
> strictly block-based filesystems this occurs relatively infrequently,
> and data will not usually move around. With reiserfs tail movement is
> frequent.

Right.

> > Wouldn't it make more sense to commit metadata to disk -after- the
> > data blocks are written?
>
> The problem is that there is currently no easy way to achieve that.

Why not?
(Ignore the disk-caching issue and concentrate on when the kernel asks
for data to be written to the disk. I am -assuming- that the kernel
either (a) writes the data in the order requested, or at least (b)
once it decides to write anything, keeps sending it to the disk until
its queue is completely empty.)

> > file simply looks like the data was never added. If the metadata is
> > written -first-, the file can scoop up random trash from elsewhere in
>
> Also, this is not a matter of metadata first or last. Sometimes you
> need metadata first, sometimes you need it last. And in many cases,
> "metadata" does not need to change, while data still changes.

I'm using "metadata" here as a shorthand for "how the filesystem knows
which byte on disk corresponds to which byte in the file", not just
things like atime, ctime, etc.

> > the filesystem. I contend that this is -much- worse, because it can
> > render a previously-good file completely unparseable by tools that
Re: [reiserfs-list] ReiserFS data corruption in very simple configuration
On Sun, Sep 30, 2001 at 09:00:49PM -0400, [EMAIL PROTECTED] wrote:

> extending a file, the metadata is written -last-, e.g., file blocks
> are allocated, file data is written, and -then- metadata is written.

this is almost impossible to achieve with existing hardware (witness
the many discussions about disk caching for example), and, without
journaling, might even be slow.

> of wtmp had data from the -previous- copy of XFree86.0.log that had
> been freed (because it was unlinked when the next copy was written)
> but which had not actually had the wtmp data written to it yet

It's easily possible, but it could also be a bug. Let the reiserfs
authors decide. However, if it is indeed "a bug" then fixing it would
only lower the frequency of occurrence.

Only ext3 (some modes) + turning off your harddisk's cache can ensure
this, at the moment.

> to have that logfile in it (instead of zero bytes). Is this what
> you're talking about when you say "*old* data"? I think so, and that
> seems to match your comment below about file tails moving around
> rapidly.

Appending to logfiles will result in a lot of movement. With other,
strictly block-based filesystems this occurs relatively infrequently,
and data will not usually move around. With reiserfs tail movement is
frequent.

> Wouldn't it make more sense to commit metadata to disk -after- the
> data blocks are written?

The problem is that there is currently no easy way to achieve that.

> file simply looks like the data was never added. If the metadata is
> written -first-, the file can scoop up random trash from elsewhere in

Also, this is not a matter of metadata first or last. Sometimes you
need metadata first, sometimes you need it last. And in many cases,
"metadata" does not need to change, while data still changes.

> the filesystem. I contend that this is -much- worse, because it can
> render a previously-good file completely unparseable by tools that
> expect that -all- of the file is in a particular syntax.
It depends - with ext2 you frequently have garbled files, too.
Basically, if you write to a file and turn off the power the outcome
is unexpected, and will always be (unless you are ready to take the
big speed hit).

> Unfortunately, this behavior meant that X -did- fall over, because my
> XF86Config file was trashed by being scrambled---I'd recently written
> out a new version, after all---and the trashed copy no longer made any

But the same thing can and does happen with ext2, depending on your
editor and your timing. It is not a reiserfs thing.

> But if you write the metadata first, you foil this attempt to be safe,
> because you might have this sequence at the actual disk: [magnetic
> oxide updated w/rename][start updating magnetic oxide with tempfile
> data][power failure or reset]---ooops! original file gone, new file
> doesn't have its data yet, so sorry, thanks for playing.

Of course. If you want data to hit the disk, you have to use fsync.
This does work with reiserfs and will ensure that the data hits the
disk. If you don't do this then bad things might happen.

> By writing metadata first, it seems that reiserfs violates the
> idempotence of many filesystem operations, and does exactly the
> opposite of what "journalling" implies to anyone who understands
> databases, namely that either the operation completes entirely, or it
> is completely undone.

You are confusing databases with filesystems, however. Most journaling
filesystems work that way. Some (like ext3) are nice enough to let you
choose.

> journal the metadata, but how does this help when what it's essentially
> doing is trashing the -data- in unexpected ways exactly when such
> journalling is supposed to help, namely across a machine failure?

But ext2 works in the same way. It does happen more often with
reiserfs (especially with tails), but ignoring the problem for ext2
doesn't make it right. If applications don't work reliably with
reiserfs, they don't work reliably with ext2.
If you want reliability then mount synchronous.

> This seems like such an elementary design defect that I'm at a loss
> to understand why it's there.

About every filesystem does have this "elementary design defect". If
you want data to hit the disk, sync it. It's that simple.

> There -must- be some excellent reason,
> right? But what? And if not, can it be fixed?

Speed is an excellent reason. The fix is to tell the kernel to write
the data out to the platters. Anyway, this is a good time to review
the various discussions on the reiserfs list and the kernel list on
how to teach the kernel (if it is possible) to implement loose
write-ordering.

--
      -==-                                    |
      ==--  _                                 |
      ---==---(_)__  __ __     Marc Lehmann  +--
      --==---/ / _ \/ // /\ \/ /  [EMAIL PROTECTED]  |e|
      -=/_/_//_/\_,_/ /_/\_\      XX11-RIPE  --+
    The choice of a GNU generation
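The discipline Marc describes ("if you want data to hit the disk, sync it") can be sketched as follows (a minimal sketch using Python's os module; the function name and file name are illustrative, not from any library discussed in the thread):

```python
import os

def write_durably(path, data):
    """Write data to path and do not return until the kernel has been
    asked to push the bytes toward the platters."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)  # flush this file's data (and inode) to disk
    finally:
        os.close(fd)
    # For a newly created file, the name itself lives in the directory,
    # which is separate metadata; on Linux the directory can be
    # fsynced too.
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)

write_durably("example.log", b"some log data\n")
```

Note that fsync only narrows the window to the drive's own write cache, which is the caveat raised earlier in the thread about disk caching.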
[reiserfs-list] ReiserFS data corruption in very simple configuration
Date: Sat, 29 Sep 2001 14:52:29 +0200
From: Marc A. Lehmann <[EMAIL PROTECTED]>

Thanks for your response! Bear with me, though, because I'm asking a
design question below that relates to this.

> On Sat, Sep 29, 2001 at 12:44:59AM -0400, Lenny Foner
> <[EMAIL PROTECTED]> wrote:
>
> > isn't fixed by fsck? [See outcome (d) below.] I'm having difficulty
> > believing how this can be possible for a non-journalling filesystem.
>
> If you have difficulties in believing this, may I ask you how you
> think it is possible for a non-journaling filesystem to prevent this
> at all?

Naively, one would assume that any non-journalling FS that has written
correct metadata through to the disk would either have written updates
into files, or failed to write them, but would not have written new
(<60 second old) data into different files than the data was destined
for. (I suppose the assumption I'm making here is that, when creating
or extending a file, the metadata is written -last-, e.g., file blocks
are allocated, file data is written, and -then- metadata is written.
That way, a failure anywhere before finality simply seems to vanish,
whereas writing metadata first seems to cause the lossage below.)

> > But what about written to the wrong files? See below.
>
> What you see is most probably *old* data, not data from another
> (still existing) file.

I'm... dubious, but maybe. As mentioned earlier in this thread, one of
the failures I saw consisted of having several lines of my
XFree86.0.log file appended to wtmp---when I logged in after the
failure, I got "Last login: " followed by several lines from that file
instead of a date. (Other failures scrambled other files worse.)
Now, it's -possible- that rsfs allocated an extra portion to the end
of wtmp for the last-login data (as a user of the fs, I don't care
whether officially this was a "block", an entry in a journal, etc),
login "wrote" to that region (but it wasn't committed yet 'cause no
sync), my XFree86.0.log file was "created" and "written" (again
uncommitted), I pushed reset, and then when it came back up, the end
of wtmp had data from the -previous- copy of XFree86.0.log that had
been freed (because it was unlinked when the next copy was written)
but which had not actually had the wtmp data written to it yet
(because a sync hadn't happened). I have no way to verify this, since
one XFree86.0.log looks much like the other.

Conceptually, this would imply that wtmp was extended into disk
freespace, which just happened to have that logfile in it (instead of
zero bytes). Is this what you're talking about when you say "*old*
data"? I think so, and that seems to match your comment below about
file tails moving around rapidly.

But it doesn't explain -why- it works this way in the first place.
Wouldn't it make more sense to commit metadata to disk -after- the
data blocks are written? After all, if -either one- isn't written, the
file is incomplete. But if the metadata is written -last-, the file
simply looks like the data was never added. If the metadata is written
-first-, the file can scoop up random trash from elsewhere in the
filesystem. I contend that this is -much- worse, because it can render
a previously-good file completely unparseable by tools that expect
that -all- of the file is in a particular syntax. It's just an
accident, I guess, that login will accept any random trash when it
prints its "last-login" message, rather than falling over with a
coredump because it doesn't look like a date. [And see * below.]
Unfortunately, this behavior meant that X -did- fall over, because my
XF86Config file was trashed by being scrambled---I'd recently written
out a new version, after all---and the trashed copy no longer made any
sense. I would have been -much- happier to have had the -unmodified-,
-old- version than a scrambled "new" version! Without Emacs ~ files,
this would have been much worse.

Consider an app that, "for reliability", rewrites a file by creating a
temp copy, writing it out, then renaming the temp over the original
[this is how Emacs typically saves files]. But if you write the
metadata first, you foil this attempt to be safe, because you might
have this sequence at the actual disk: [magnetic oxide updated
w/rename][start updating magnetic oxide with tempfile data][power
failure or reset]---ooops! original file gone, new file doesn't have
its data yet, so sorry, thanks for playing.

By writing metadata first, it seems that reiserfs violates the
idempotence of many filesystem operations, and does exactly the
opposite of what "journalling" implies to anyone who understands
databases, namely that either the operation completes entirely, or it
is completely undone. Yes, yes, I know (now!) that it claims to only
journal the metadata, but how does this help when what it's
essentially doing is trashing the -data- in unexpected ways exactly
when such journalling is supposed to help, namely across a machine
failure?
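The write-temp-then-rename pattern described above can be sketched like this (a sketch of the general technique, not of Emacs's actual save logic; the names are illustrative). The fsync before the rename is what makes the data reach the disk before the metadata change, closing the ordering hole being complained about:

```python
import os

def replace_atomically(path, data):
    """Rewrite path via a temp copy, so a crash should leave either the
    old contents or the new ones, never a scrambled mix."""
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)  # make sure the tempfile's DATA is on disk first
    finally:
        os.close(fd)
    # Only then commit the metadata change: the rename.  If the rename
    # reaches the platters before the tempfile's data (the ordering
    # complained about above), a crash leaves the name pointing at
    # unwritten blocks---and the old contents are already gone.
    os.rename(tmp, path)

replace_atomically("XF86Config", b"# new configuration\n")
```

Without the fsync, this pattern is only as safe as the filesystem's write ordering, which is precisely the point in dispute in this thread.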
[reiserfs-list] ReiserFS data corruption in very simple configuration
[As before, please make sure you CC me on replies or I won't see them.
Tnx!]

Date: Tue, 25 Sep 2001 14:28:54 +0100
From: "Stephen C. Tweedie" <[EMAIL PROTECTED]>

Hi,

On Sat, Sep 22, 2001 at 04:44:21PM -0400, [EMAIL PROTECTED] wrote:

> Stock reiserfs only provides meta-data journalling. It guarantees
> that the structure of your file-system will be correct after journal
> replay, not the content of the files. It will never "trash" a file
> that wasn't accessed at the moment of crash, though.
>
> > Thanks for clarifying this. However, I should point out that the
> > failure mode is quite serious---whereas ext2fs would simply fail
> > to record data written to a file before a sync, reiserfs seems to
> > have instead -swapped random pieces of one file with another-,
> > which is -much- harder to detect and fix.

Not true. ext2, ext3 in its "data=writeback" mode, and reiserfs can
all demonstrate this behaviour. Reiserfs is being no worse than ext2
(the timings may make the race more or less likely in reiserfs, but
ext2 _is_ vulnerable.)

ext2fs can write parts of file A to file B, and vice versa, and this
isn't fixed by fsck? [See outcome (d) below.] I'm having difficulty
believing how this can be possible for a non-journalling filesystem.

e2fsck only restores metadata consistency on ext2 after a crash: it
can't possibly guarantee that all the data blocks have been written.

But what about written to the wrong files? See below.

ext3 will let you do full data journaling, but also has a third mode
(the default), which doesn't journal data, but which does make sure
that data is flushed to disk before the transaction which allocated
that data is allowed to commit. That gives you most of the performance
of ext3's fast-and-loose writeback mode, but with an absolute
guarantee that you never see stale blocks in a file after a crash.
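The three ext3 behaviours Stephen describes correspond to the `data=` mount option (a config sketch; the device and mount point are illustrative):

```shell
# Full data journaling: both file data and metadata go through the journal.
mount -t ext3 -o data=journal /dev/hda1 /mnt

# "Ordered" mode (the ext3 default): data blocks are flushed to disk
# before the transaction that allocated them is allowed to commit, so
# no stale blocks can appear in a file after a crash.
mount -t ext3 -o data=ordered /dev/hda1 /mnt

# Writeback mode: metadata-only journaling, comparable to stock
# reiserfs---fast, but vulnerable to the stale-data problem above.
mount -t ext3 -o data=writeback /dev/hda1 /mnt
```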
I've been getting a stream of private mail over the last few days
saying one thing or another about various filesystems with various
optional patches, so let me get this out in the open and see if we can
converge on an answer here. [ext2fs, ext3fs, and reiserfs answers
should feel free to cite which mode they're talking about and URLs for
whatever patches are required to get to that mode; some impressions
about reliability and maturity would be useful, too.]

Let's take this scenario: Files A and B have had blocks written to
them sometime in the recent past (30 to 60 seconds or so) and a sync
has not happened yet. (I don't know how often reiserfs will be synced
by default; 60 seconds? Longer? Presumably running "sync" will force
it, but I don't know when else it will happen.) File A may have been
completely rewritten or newly written (e.g., what Emacs does when it
saves a file), whereas file B may have simply been appended to (e.g.,
what happens when wtmp is updated). The CPU reset button is then
pushed. [See P.P.S. at end of this message.]

Now, we have the following possibilities for the outcome after the
system comes back up and has finished checking its filesystem:

(a) Metadata correctly written, file data correctly written.

(b) Metadata correctly written, file data partially written. (E.g.,
    one or both files might have been partially or completely updated.)

(c) Metadata correctly written, file data completely unwritten.
    (Neither file got updated at all.)

(d) Metadata correctly written, FILE DATA INTERCHANGED BETWEEN A AND B.
    (E.g., file A gets some of file B written somewhere within it, and
    file B gets some of file A written somewhere within it---this is
    the behavior I observed, at least twice, with reiserfs.)

(e) Metadata corrupted in some fashion, file data undefined.
    ("Undefined" means could be any of (a) through (d) above; I don't
    care.)

Now, which filesystems can show each outcome? I don't know. I contend
that reiserfs does (d).
Stephen Tweedie talks above about whether we can "guarantee that all
the data blocks have been written", but may be missing the point I was
making, namely that THE BLOCKS HAVE BEEN WRITTEN TO THE WRONG FILES.

It would be nice to know, for each of ext2fs, ext3fs, and reiserfs,
what the -intended- outcome is, and what the -actual- outcome is
(since implementation bugs might make the actual outcome different
from the intended outcome). Any additional filesystems anyone would
like to toss into the pot would be welcome; maybe I'll post a matrix
of the results, if we get some.

I'm -assuming- that the intended outcome for reiserfs (without data
journalling) is one of (a), (b), or (c). If the intended outcome for
reiserfs without data journalling [or -any- FS, really] is in fact
(d), then I don't understand how this filesystem can be intended for
any reliable service, since a failure will garble all files written in
the last several seconds in a fashion that is very, very difficult to
unscramble. (-Per