Re: Lost file-system story
James Chacon <chacon.ja...@gmail.com> writes:

> On Tue, Dec 13, 2011 at 4:09 PM, Greg A. Woods <wo...@planix.ca> wrote:
>> At Wed, 14 Dec 2011 09:06:23 +1030, Brett Lymn <brett.l...@baesystems.com> wrote:
>>> On Tue, Dec 13, 2011 at 01:38:57PM +0100, Joerg Sonnenberger wrote:
>>>> fsck is supposed to handle *all* corruptions to the file system
>>>> that can occur as part of normal file system operation in the
>>>> kernel. It is doing best effort for others. It's a bug if it
>>>> doesn't do the former and a potential missing feature for the
>>>> latter.
>>>
>>> There are a lot of slips twixt cup and lip. If you are really
>>> unlucky you can get an outage at just the wrong time that will cause
>>> the filesystem to be hosed so badly that fsck cannot recover it.
>>> Sure, fsck can run to completion, but all you have is most of your
>>> FS in lost+found, which you have to be really, really desperate to
>>> sort through. I have been working with UNIX for over 20 years now
>>> and I have only seen this happen once, and it was with a commercial
>>> UNIX.
>>
>> I've seen that happen more than once, unfortunately. SunOS-4 once, I
>> think. I agree 100% with Joerg here though. I'm pretty sure at least
>> some of the times I've seen fsck do more damage than good it was due
>> to a kernel bug or something breaking assumptions about ordered
>> operations. There have of course also been some pretty serious bugs
>> in various fsck implementations across the years and vendors.
>
> I'd be suspicious of fsck failing on a regularly mounted disk with
> corruption that can't otherwise be traced to outside influences (bad
> RAM, bad disk cache, etc). I've seen some bizarre things happen on RAM
> errors over the years, for instance.

I once got an infinite sequence of nested subdirectories on new
hardware and stable FreeBSD 5.3. Something like http://xkcd.com/981/ --
fsck refused to work there.

-- HE CE3OH...
Re: Lost file-system story
At Wed, 14 Dec 2011 07:50:37 +0000 (UTC), mlel...@serpens.de (Michael van Elst) wrote:

> wo...@planix.ca (Greg A. Woods) writes:
>> easy, if not even easier, to do a "mount -u -r"
>
> Does this work again?

Not that I know of, and PR#30525 concurs, as does the commit mentioned
in that PR to prevent it from falsely appearing to work, a change which
remains in netbsd-5 and -current to date. See my discussion of this
issue earlier in this thread.

-- Greg A. Woods, Planix, Inc. <wo...@planix.com>, +1 250 762-7675, http://www.planix.com/
Re: Lost file-system story
On Thu, Dec 15, 2011 at 12:48:51AM +0400, Aleksej Saushev wrote:

>>> There have of course also been some pretty serious bugs in various
>>> fsck implementations across the years and vendors.
>>
>> I'd be suspicious of fsck failing on a regularly mounted disk with
>> corruption that can't otherwise be traced to outside influences (bad
>> RAM, bad disk cache, etc). I've seen some bizarre things happen on
>> RAM errors over the years, for instance.
>
> I once got an infinite sequence of nested subdirectories on new
> hardware and stable FreeBSD 5.3. Something like http://xkcd.com/981/
> -- fsck refused to work there.

At one point some time back, when pounding on rename, I got a test
volume into a state where, if you ran "fsck -fy", it would fix a ton of
stuff, run to completion, and mark the fs clean. Which was great,
except that if you did it again, it would do the same thing. Over and
over. I'm glad it was a test volume...

-- David A. Holland dholl...@netbsd.org
Fwd: Lost file-system story
I did it again. Gmail is trying to teach an old dog a new trick.

---------- Forwarded message ----------
From: Donald Allen <donaldcal...@gmail.com>
Date: Tue, Dec 13, 2011 at 10:04 AM
Subject: Re: Lost file-system story
To: David Holland <dholland-t...@netbsd.org>

On Tue, Dec 13, 2011 at 1:27 AM, David Holland <dholland-t...@netbsd.org> wrote:

> On Mon, Dec 12, 2011 at 03:31:09PM -0500, Donald Allen wrote:
>> Note that this bug *may* not worsen the probability of recovery after
>> a crash. It might even increase it! Think about it. If you boot
>> NetBSD and mount a filesystem async, it is going to be correctly
>> structured (or deemed to be by fsck) at boot time, or the system
>> wouldn't mount it. Assuming the system is happy with it, if you then
>> make changes to the filesystem, but, because of this bug, they are
>> all in the buffer cache and never get written out, and then the
>> system crashes -- you've got the filesystem you started with.
>
> Not necessarily;

I did say *may* (which I wrote because you could write a good book
about NetBSD internals with what I don't know about NetBSD internals).

> right off I can see two ways to get hosed:
>
> 1. Delete a large file. This causes the in-memory FS to believe the
>    indirect blocks from this file are free; then it can reallocate
>    them as data for some other file. That data then *does* get written
>    out, so after crashing and rebooting the indirect blocks contain
>    utter nonsense. The ffs fsck probably can't recover this.
>
> 2. Use a program that calls fsync(). This will write out some metadata
>    blocks and not others; in the relatively benign case it will just
>    update a previously-free inode, and after crashing fsck will place
>    the file in lost+found. In less benign cases it might do the
>    converse of (1), and e.g. overwrite file data with indirect blocks,
>    leading to crosslinked files or worse, and probably total fsck
>    failure.
>
> Not that any of this matters...

I agree. I was just indulging in some idle speculation, having some fun.

This bug should be fixed, and I think the fix, as I said before, should
include a knob to allow the user to control the sync frequency (maybe
the knob is already there in sysctl and getting ignored for some
reason?). I'm running NetBSD again on my test machine, and have a
sleep-sync loop started in rc.local.

/Don
Re: Lost file-system story
At Wed, 14 Dec 2011 09:06:23 +1030, Brett Lymn <brett.l...@baesystems.com> wrote:

> On Tue, Dec 13, 2011 at 01:38:57PM +0100, Joerg Sonnenberger wrote:
>> fsck is supposed to handle *all* corruptions to the file system that
>> can occur as part of normal file system operation in the kernel. It
>> is doing best effort for others. It's a bug if it doesn't do the
>> former and a potential missing feature for the latter.
>
> There are a lot of slips twixt cup and lip. If you are really unlucky
> you can get an outage at just the wrong time that will cause the
> filesystem to be hosed so badly that fsck cannot recover it. Sure,
> fsck can run to completion, but all you have is most of your FS in
> lost+found, which you have to be really, really desperate to sort
> through. I have been working with UNIX for over 20 years now and I
> have only seen this happen once, and it was with a commercial UNIX.

I've seen that happen more than once, unfortunately. SunOS-4 once, I
think.

I agree 100% with Joerg here though. I'm pretty sure at least some of
the times I've seen fsck do more damage than good it was due to a
kernel bug or something breaking assumptions about ordered operations.
There have of course also been some pretty serious bugs in various fsck
implementations across the years and vendors.

-- Greg A. Woods, Planix, Inc. <wo...@planix.com>, +1 250 762-7675, http://www.planix.com/
Re: Lost file-system story
On Tue, Dec 13, 2011 at 4:09 PM, Greg A. Woods <wo...@planix.ca> wrote:

> At Wed, 14 Dec 2011 09:06:23 +1030, Brett Lymn <brett.l...@baesystems.com> wrote:
>> On Tue, Dec 13, 2011 at 01:38:57PM +0100, Joerg Sonnenberger wrote:
>>> fsck is supposed to handle *all* corruptions to the file system that
>>> can occur as part of normal file system operation in the kernel. It
>>> is doing best effort for others. It's a bug if it doesn't do the
>>> former and a potential missing feature for the latter.
>>
>> There are a lot of slips twixt cup and lip. If you are really unlucky
>> you can get an outage at just the wrong time that will cause the
>> filesystem to be hosed so badly that fsck cannot recover it. Sure,
>> fsck can run to completion, but all you have is most of your FS in
>> lost+found, which you have to be really, really desperate to sort
>> through. I have been working with UNIX for over 20 years now and I
>> have only seen this happen once, and it was with a commercial UNIX.
>
> I've seen that happen more than once, unfortunately. SunOS-4 once, I
> think. I agree 100% with Joerg here though. I'm pretty sure at least
> some of the times I've seen fsck do more damage than good it was due
> to a kernel bug or something breaking assumptions about ordered
> operations. There have of course also been some pretty serious bugs in
> various fsck implementations across the years and vendors.

I'd be suspicious of fsck failing on a regularly mounted disk with
corruption that can't otherwise be traced to outside influences (bad
RAM, bad disk cache, etc). I've seen some bizarre things happen on RAM
errors over the years, for instance.

James
Re: Lost file-system story
At Mon, 12 Dec 2011 18:49:31 -0500 (EST), Matt W. Benjamin <m...@linuxbox.com> wrote:

> Why would sync not be effective under MNT_ASYNC? Use of sync is not
> required to lead to consistency except with respect to an arbitrary
> point in time, but I don't think anyone ever believed otherwise.
> However, there should be no question of metadata never being written
> out if sync was run?

Well, sync(2) _could_ be effective even in the face of MNT_ASYNC,
though I'm not sure it will, or indeed even should be required to, have
a guaranteed ongoing beneficial effect on the on-disk consistency of a
filesystem that was mounted with MNT_ASYNC while activity continues to
proceed on the filesystem. I.e. I don't expect sync(2) to suddenly
enforce order on the writes that it schedules to a MNT_ASYNC-mounted
filesystem. The ordering _may_ be a natural result of the
implementation, but if it's not then I wouldn't consider that to be a
bug, and I certainly wouldn't write any documentation that suggested it
might be a possible outcome.

MNT_ASYNC means, to me at least, that even sync(2) can get away with
doing writes to a filesystem mounted with that flag in an order other
than one which would guarantee on-disk consistency to a level where
fsck could repair it. I.e. sync(2) could possibly make things worse for
MNT_ASYNC-mounted filesystems before it makes them better, and I don't
see how that could be considered to be a bug.

I do agree that IFF the filesystem is made quiescent, AND all writes
necessary and scheduled by sync(2) are allowed to come to completion,
THEN the on-disk state of an MNT_ASYNC-mounted filesystem must be
consistent (and all data blocks must be flushed to the disk too).
However, if you're going to go to that trouble (i.e. close all files
open on the MNT_ASYNC-mounted filesystem and somehow prevent any other
file operations of any kind on that filesystem until such time that you
think the sync(2)-scheduled writes are all done), then it should be
just as easy, if not even easier, to do a "mount -u -r" (or "mount -u
-o noasync", or even umount), in which case you'll not only be sure
that the filesystem is consistent and secure, but you'll know when it
reaches this state (i.e. you won't have to guess about when sync(2)'s
scheduled work completes).

-- Greg A. Woods, Planix, Inc. <wo...@planix.com>, +1 250 762-7675, http://www.planix.com/
Re: Lost file-system story
wo...@planix.ca (Greg A. Woods) writes:

> easy, if not even easier, to do a "mount -u -r"

Does this work again?

-- Michael van Elst <mlel...@serpens.de> -- "A potential Snark may lurk in every tree."
Re: Lost file-system story
At Fri, 9 Dec 2011 22:12:25 -0500, Donald Allen <donaldcal...@gmail.com> wrote:

> On Fri, Dec 9, 2011 at 8:43 PM, Greg A. Woods <wo...@planix.ca> wrote:
>> At Fri, 9 Dec 2011 15:50:35 -0500, Donald Allen wrote:
>>> ... "does not guarantee to keep a consistent file system structure
>>> on the disk" is what I expected from NetBSD.
>
> From what I've been told in this discussion, NetBSD pretty much
> guarantees that if you use async and the system crashes, you *will*
> lose the filesystem if there's been any writing to it for an
> arbitrarily long period of time, since apparently meta-data for async
> filesystems doesn't get written as a matter of course.
>
>> I'm not sure what the difference is.
>
> You would be sure if you'd read my posts carefully. The difference is
> whether the probability of losing an async-mounted filesystem is near
> zero or near one.

I think perhaps the misunderstanding between you and everyone else is
because you haven't fully appreciated what everyone has been trying to
tell you about the true meaning of "async" in Unix-based filesystems,
and in particular about NetBSD's current implementation of Unix-based
filesystems, and what that all means to implementing algorithms that
can reliably repair the on-disk image of a filesystem after a crash.

I would have thought the warning given in the description of "async" in
mount(8) would be sufficient, but apparently you haven't read it that
way. Perhaps the problem is that the last occurrence of the word "or"
in the last sentence of that warning should be changed to "and". To me
that would at least make the warning a bit stronger.

>> And that's why by default, and by very strong recommendation,
>> filesystem metadata for Unix-based filesystems (sans WAPBL) should
>> always be written synchronously to the disk if you ever hope to even
>> try to use fsck(8).
>
> That's simply not true. Have you ever used Linux in all the years that
> ext2 was the predominant filesystem? ext2 filesystems were routinely
> mounted async for many years; everything -- data, meta-data -- was
> written asynchronously with no regard to ordering.

DO NOT confuse any Linux-based filesystem with any Unix-based
filesystem. They may have nearly identical semantics from the user
programming perspective (i.e. POSIX), but they're all entirely
different under the hood. Unix-based filesystems (sans WAPBL, and
ignoring the BSD-only LFS) have never ever Ever EVER given any
guarantee about the repairability of the filesystem after a crash if it
has been mounted with MNT_ASYNC. Indeed it is more or less _impossible_
by design for the system to make any such guarantee, given what
MNT_ASYNC actually means for Unix-based filesystems, and especially
what it means in the NetBSD implementation.

> Unix filesystems, including the Berkeley Fast File System variant,
> have never made any guarantees about the recoverability of an
> async-mounted filesystem after a crash. I never thought or asserted
> otherwise.

Well, from my perspective, especially after carefully reading your
posts, you do indeed seem to think that async-mounted Unix-based
filesystems should be able to be repaired, at least some of the time,
despite the documentation, and all the collected wisdom of those who've
replied to your posts so far, saying otherwise.

>> You seem to have inferred some impossible capability based on your
>> experience with other non-Unix filesystems that have a completely
>> different internal structure and implementation from the Unix-based
>> filesystems in NetBSD.
>
> Nonsense -- I have inferred no such thing. Instead of referring you to
> previous posts for a re-read, I'll give you a little summary. I am
> speaking about probabilities. I completely understand that no
> filesystem mounted async (or any other way, for that matter), whether
> Linux or NetBSD or OpenBSD, is GUARANTEED to survive a crash.

OK, let's try stating this once more in what I hope are the same terms
you're trying to use:

The probability of any Unix-based filesystem being repairable after a
crash is zero (0) if it has been mounted with MNT_ASYNC and there was
_any_ activity that affected its structure since mount time up to the
time of the crash. It still might survive after some types of changes,
but it _probably_ won't. There are no guarantees. Use newfs and restore
to recover.

Linux ext2 is not a Unix-based filesystem, and Linux itself is not a
Unix-based kernel. The meaning of "async" to ext2 is apparently very
different than it is to any Unix-based filesystem. NetBSD might be free
of UNIX(tm) code, but it and its progenitors, right back to the 7th
Edition of the original Unix, were all implemented by people firmly
entrenched in the original Unix heritage from the inside out.

For Unix-based filesystems and their repair tools, any probability of
recovery less than one is as good as if it were zero. Don't ever get
your hopes up. Use newfs
Re: Lost file-system story
On Sun, Dec 11, 2011 at 23:23:33 -0500, Donald Allen wrote:

> On Sun, Dec 11, 2011 at 9:53 PM, Greg A. Woods <wo...@planix.ca> wrote:
>> Perhaps this sentence from McKusick's memo about fsck will help you
>> to understand: "fsck is able to repair corrupted file systems using
>> procedures based upon the order in which UNIX honors these file
>> system update requests." This is true for all Unix-based filesystems.
>
> I'm not going to put words in McKusick's mouth, but I think you have
> misinterpreted this to mean that without ordering, recovery is
> impossible. If that's what you think (and you've said so, except when
> you've contradicted yourself), then you are wrong. Why? Because the
> evidence (e.g., my experiments) says that recovery *is* possible. Not
> guaranteed. Possible.

What you are arguing is effectively isomorphic to:

1. I have C code that does "i = i++ + i++;".
2. When I use compiler C1, it always gives me this specific result for i.
3. When I use compiler C2, it sometimes (or always) gives me some
   different result.
4. Because of #2, compiler C2 must be wrong.

-uwe
Re: Lost file-system story
On Sun, Dec 11, 2011 at 06:53:26PM -0800, Greg A. Woods wrote:

> You would be sure if you'd read my posts carefully. The difference is
> whether the probability of an async-mounted filesystem is near zero or
> near one.
>
> I think perhaps the misunderstanding between you and everyone else is
> because you haven't fully appreciated what everyone has been trying to
> tell you about the true meaning of "async" in Unix-based filesystems,
> and in particular about NetBSD's current implementation of Unix-based
> filesystems, and what that all means to implementing algorithms that
> can reliably repair the on-disk image of a filesystem after a crash.

No, as far as I can tell he understands perfectly well; he just doesn't
consider the behavior acceptable.

It appears that currently an ffs volume mounted -o async never writes
back metadata. I don't think this behavior is acceptable either. The
fact that mounting -o async violates assumptions made by fsck_ffs, with
the result that fsck may not be able to recover after a crash (either
without making a huge mess in lost+found, or not at all), is secondary
at the moment, because in the absence of the previous glaring bug it's
impossible to even estimate what the probability of it choking is.

(Note that with ext2 on Linux, from time to time fsck will not be able
to recover after a crash and will make a huge mess in lost+found. It
never happened all that often, and is probably less common now after 15
years or so of incremental work on e2fsck.)

> DO NOT confuse any Linux-based filesystem with any Unix-based
> filesystem. They may have nearly identical semantics from the user
> programming perspective (i.e. POSIX), but they're all entirely
> different under the hood. Unix-based filesystems (sans WAPBL, and
> ignoring the BSD-only LFS) have never ever Ever EVER given any
> guarantee about the repairability of the filesystem after a crash if
> it has been mounted with MNT_ASYNC.

What on earth do you mean by "Unix-based filesystems" such that this
statement is true?

> Perhaps this sentence from McKusick's memo about fsck will help you to
> understand: "fsck is able to repair corrupted file systems using
> procedures based upon the order in which UNIX honors these file system
> update requests." This is true for all Unix-based filesystems.

No, it is true for ffs, and possibly for our ext2 implementation (which
shares a lot of code with ffs), but nothing else.

-- David A. Holland dholl...@netbsd.org
Re: Lost file-system story
> [...], you do indeed seem to think that async-mounted Unix-based
> filesystems should be able to be repaired, at least some of the time,

There's a huge difference between "this isn't promised" and "this never
happens". They _can_ be repaired... some of the time. When they can, it
is because, by coincidence, it just so happens that the stuff that got
written produces a filesystem fsck can repair.

> The probability of any Unix-based filesystem being repairable after a
> crash is zero (0) if it has been mounted with MNT_ASYNC, and if there
> was _any_ activity that affected its structure since mount time up to
> the time of the crash.

This is simply false. I just tried it. On a 5.1 i386 system, I used
fdisk and disklabel to make a half-gig partition, newfsed it, mounted
it normally, copied a file into it, unmounted it, mounted it async,
removed the file, and hit the power switch. After the machine came back
up, I tried fsck on the filesystem. It said it was clean. I used fsck
-f. It was happy. I mounted it and, as far as I can tell, fsck was
correct in thinking the filesystem was OK.

So, there is an existence proof by example that there are circumstances
under which a filesystem mounted async can be changed and still be left
in a state fsck can repair.

> It still might survive after some types of changes, but it _probably_
> won't.

Right. But that's not "probability ... is zero (0)".

> Linux ext2 is not a Unix-based filesystem and Linux itself is not a
> Unix-based kernel.

It's about as Unix-based as NetBSD is. Unless you mean something
strange by "Unix-based" - what _do_ you mean by it?

> For Unix-based filesystems and their repair tools, any probability of
> recovery less than one is as good as if it were zero.

That's not how I feel about it when I've lost a filesystem. I'll take a
filesystem with a nonzero probability of recovering something useful
from over one that guarantees to trash everything any day (other things
being equal, of course).
/~\ The ASCII                     Mouse
\ / Ribbon Campaign
 X  Against HTML        mo...@rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
Re: Lost file-system story
At Mon, 12 Dec 2011 15:08:40 +0000, David Holland <dholland-t...@netbsd.org> wrote:

> No, as far as I can tell he understands perfectly well; he just
> doesn't consider the behavior acceptable. It appears that currently an
> ffs volume mounted -o async never writes back metadata. I don't think
> this behavior is acceptable either.

I agree there are conditions and operations which _should_ guarantee
that the on-disk state of the filesystem is identical to what the user
perceives, and thus that the filesystem is 100% consistent and secure.
It seems umount(2) works to make this guarantee, for example. The two
other most important of these that come to mind are:

	mount -u -r /async-mounted-fs

and

	mount -u -o noasync /async-mounted-fs

It is my understanding that neither works at the moment, and that this
is a known and reported and accepted bug, as I outlined in an earlier
post to this thread.

I think sync(2) should probably also work, but _only_ if the filesystem
is made entirely quiescent from before the time sync() is called until
after the time all the writes it has scheduled have completed, all the
way to the disk media. (And of course once activity starts on the
filesystem again, all guarantees are lost again.)

It might be nice if sync(2) could schedule all the needed writes to
happen in an order which would ensure consistency and repairability of
the on-disk image at any given time, but I'm guessing this might be too
much to ask, at least without some more significant effort. However,
without enforcing the synchronous ordering of writes, sync(2) is
effectively useless for the purposes Mr. Allen appears to have, though
perhaps his level of risk tolerance would still make it useful to him,
while others of us would still be unable to tolerate its dangers in any
scenario where we were not prepared to use newfs to recover.

Besides, the only way I know to guarantee a filesystem remains
quiescent is to unmount it, so if you do that first then there's
nothing for sync(2) to do afterwards, and thus nothing new to
implement. :-)

>> DO NOT confuse any Linux-based filesystem with any Unix-based
>> filesystem. They may have nearly identical semantics from the user
>> programming perspective (i.e. POSIX), but they're all entirely
>> different under the hood. Unix-based filesystems (sans WAPBL, and
>> ignoring the BSD-only LFS) have never ever Ever EVER given any
>> guarantee about the repairability of the filesystem after a crash if
>> it has been mounted with MNT_ASYNC.
>
> What on earth do you mean by "Unix-based filesystems" such that this
> statement is true?

I mean exactly what it sounds like -- nothing more. Having almost no
knowledge about ext2 or any other non-Unix-based filesystems, I'm
trying to be careful to avoid making any claims about those
non-Unix-based filesystems. I included FFS as a Unix-based filesystem
because I know for sure that it shares many of the attributes of the
original Unix filesystems with respect to the issues surrounding
MNT_ASYNC.

>> Perhaps this sentence from McKusick's memo about fsck will help you
>> to understand: "fsck is able to repair corrupted file systems using
>> procedures based upon the order in which UNIX honors these file
>> system update requests." This is true for all Unix-based filesystems.
>
> No, it is true for ffs, and possibly for our ext2 implementation
> (which shares a lot of code with ffs), but nothing else.

Well, if you follow what I mean by "Unix-based filesystems", and you
ignore LFS and options like WAPBL, as I've said, then I believe it is
entirely true, since within my definition that leaves just FFS. And V7,
though it didn't have MNT_ASYNC, would suffer the same if MNT_ASYNC
were implemented for it -- indeed I'm guessing that NetBSD's
reimplementation of v7fs will have the same problems with MNT_ASYNC.

As I say, I don't know enough about the non-Unix-based filesystems in
NetBSD, such as those compatible with AmigaDOS, Acorn, Windows NT, or
even MS-DOS, to know if they would be adversely affected by MNT_ASYNC.
Indeed I'm not even sure if they all have reasonable filesystem repair
tools (NetBSD has none, except maybe for ext2fs and msdos, though in my
experience NetBSD's MS-DOS filesystem implementation is very fragile
and it does not have a truly useful fsck_msdos, even without trying to
use MNT_ASYNC with it). SysVbfs may suffer too, but I don't know enough
about it either, despite it being by definition Unix-based, and we
don't have an fsck for it in any case. I'd also be guessing about EFS,
and I'm not sure I'd categorize it as Unix-based any more than I do
LFS.

-- Greg A. Woods, Planix, Inc. <wo...@planix.com>, +1 250 762-7675, http://www.planix.com/
Re: Lost file-system story
Andy Ruhl <acr...@gmail.com> writes:

> If solving your problem depends on sync frequency, I don't see why
> this shouldn't be managed by some knob to twiddle -- given that the
> crash scenario doesn't get worse depending on where the knob is, or if
> the crash happens while the knob is "working". If it does, it's
> pointless.

My sense is that Donald isn't complaining about "why is the sync
frequency 30s instead of 60s"; it's more bafflement at waiting 10-15
minutes with an idle disk and having the data not synced at all. There
is a historical sync period of 30s, and that seems both infrequent
enough not to cause trouble and frequent enough not to boggle users.

It may also make sense to have a syncer behavior that is low-rate, so
as not to overwhelm asked-for I/O, or -- for laptops -- one that uses
most of the disk bandwidth when the disk is spun up and lets it be
otherwise.

But a basic correctness property is almost certainly that if the disk
is spun up and is not in heavy use and lots of time passes, dirty
buffers (data and metadata) are written to disk.
Re: Lost file-system story
At Mon, 12 Dec 2011 11:09:44 -0500 (EST), Mouse <mo...@rodents-montreal.org> wrote:

> They _can_ be repaired... some of the time. When they can, it is
> because, by coincidence, it just so happens that the stuff that got
> written produces a filesystem fsck can repair.

That's totally irrelevant. Possibilities other than zero or one are not
useful in manual pages, and they are only useful to an end user as a
very last resort -- equivalent to calling out the army to put Humpty
Dumpty back together again. For all useful intents and purposes, any
probability of irreparable damage greater than zero is, for the end
user and for all planning purposes, as good as a probability of one.
Plan to use newfs and restore after every crash and you'll be OK. Plan
otherwise and you will eventually be disappointed.

> That's not how I feel about it when I've lost a filesystem. I'll take
> a filesystem with a nonzero probability of recovering something useful
> from over one that guarantees to trash everything any day (other
> things being equal, of course).

Heh. Yup, there are those of us who will find it a challenge to see
just how much we can recover from a damaged file system, no matter how
useful the outcome may be. You don't put that in the manual page
though, and you never give the end user that expectation (unless it's
already too late for them and they've got yolk all over their face).

-- Greg A. Woods, Planix, Inc. <wo...@planix.com>, +1 250 762-7675, http://www.planix.com/
Re: Lost file-system story
At Sun, 11 Dec 2011 23:23:33 -0500, Donald Allen <donaldcal...@gmail.com> wrote:

> How can you possibly say such a thing and hope to be taken seriously?
> What you just said means that P(survival) = .999 is the same as
> P(survival) = 0. There are a LOT of situations (e.g., mine) where
> P(survival) = .999 would be very acceptable and P(survival) = 0 would
> not.

The manual page must not give probabilities or even speak of
possibilities. So, as-is, you have been warned properly by the manual
page.

For planning purposes you _must_ expect that your filesystem will be
damaged beyond repair after a crash and that you will have to use newfs
and restore to recover. Learn these expectations well and you will be
happier in the long run. Fail to learn them and you have no recourse
but to wallow in your own sorrows. I.e. you can't come to the mailing
list and say that you expected something better just because you say
you can get something better from something else entirely different.
You have false expectations based on your experiences with entirely
foreign environments.

Maybe Humpty Dumpty can be put back together again, sometimes, but even
if you have all the King's horses and all the King's men on call to
respond to a disaster at a moment's notice, you must not expect that
the egg can be put back together successfully, even just once, even if
it does look like just a minor crack this time.

-- Greg A. Woods, Planix, Inc. <wo...@planix.com>, +1 250 762-7675, http://www.planix.com/
Re: Lost file-system story
On Mon, Dec 12, 2011 at 11:39:38AM -0800, Greg A. Woods wrote:

> Having almost no knowledge about ext2 or any other non-Unix-based
> filesystems, I'm trying to be careful to avoid making any claims about
> those non-Unix-based filesystems.

Hmm... so then how can you claim that it is "entirely different" (as
you did in an earlier email)? It sounds like you're talking out of
your, ahem... depth.

> I included FFS as a Unix-based filesystem because I know for sure that
> it shares many of the attributes of the original Unix filesystems with
> respect to the issues surrounding MNT_ASYNC.

Have you tried actually comparing the current NetBSD ffs sources
against whatever Unix sources you are talking about? While I'm sure
that there are many attributes that are shared, if you even compare the
current NetBSD sources with those from, say, 1994, you will find a ton
of differences.

eric
Re: Lost file-system story
On Mon, Dec 12, 2011 at 12:10:32PM -0800, Greg A. Woods wrote: At Sun, 11 Dec 2011 23:23:33 -0500, Donald Allen donaldcal...@gmail.com wrote: Subject: Re: Lost file-system story How can you possibly say such a thing and hope to be taken seriously? What you just said means that P(survival) = .999 is the same as P(survival) = 0. There are a LOT of situations (e.g., mine) where P(survival) = .999 would be very acceptable and P(survival) = 0 would not. The manual page must not give probabilities or even speak of possibilities. Oh really, Greg? I suppose you can believe that if you want to, while the rest of us can continue to live in the real world where knowing things like that is actually useful. recourse but to wallow in your own sorrows. I.e. you can't come to the mailing list and say that you expected something better just because you say you can get something better from something else entirely different. You have false expectations based on your experiences with entirely foreign environments. Donald, don't listen to Greg. Just in case it needs to be repeated, you're not the only one that thinks it is reasonable to expect a non-0 probability that things will be recoverable, even if something goes wrong. eric
Re: Lost file-system story
At Mon, 12 Dec 2011 14:23:40 -0600, Eric Haszlakiewicz e...@nimenees.com wrote: Subject: Re: Lost file-system story Donald, don't listen to Greg. Just in case it needs to be repeated, you're not the only one that thinks it is reasonable to expect a non-0 probability that things will be recoverable, even if something goes wrong. Eric, what part of MNT_ASYNC don't you understand? -- Greg A. Woods Planix, Inc. wo...@planix.com +1 250 762-7675 http://www.planix.com/
Re: Lost file-system story
On Mon, Dec 12, 2011 at 2:40 PM, Greg Troxel g...@ir.bbn.com wrote: Andy Ruhl acr...@gmail.com writes: If solving your problem depends on sync frequency, I don't see why this shouldn't be managed by some knob to twiddle. Given that the crash scenario doesn't get worse depending on where the knob is or if the crash happens while the knob is working. If it does, it's pointless. My sense is that Donald isn't complaining about why is the sync frequency 30s instead of 60s; That's right. The only thing I'm *really* complaining about is people who don't read what seems to me to be plain English (I exclude from my complaint those for whom English is not their native language). it's more bafflement at waiting 10-15 minutes with an idle disk and having the data not synced at all. There's a historical period of 30s, and that seems both often enough not to cause trouble and not so often as to boggle users. That's certainly an issue with NetBSD that David Holland, correctly in my view, identified as a bug. OpenBSD, per the experiments I've already described, does not exhibit this behavior. Note that this bug *may* not worsen the probability of recovery after a crash. It might even increase it! Think about it. If you boot NetBSD and mount a filesystem async, it is going to be correctly structured (or deemed to be by fsck) at boot time, or the system wouldn't mount it. Assuming the system is happy with it, if you then make changes to the filesystem, but, because of this bug they are all in the buffer cache and never get written out, and then the system crashes --- you've got the filesystem you started with. This bug more importantly affects, in my view, the amount of stuff you might lose in the event of a crash. 
If the system has been up for N hours and you've been working away, making changes, dutifully hitting ctrl-s in gnumeric to write out changes because people have told you that changes to a gnumeric spreadsheet aren't in the filesystem until saved, and the system crashes, you are in for a big surprise. Chances are good that you will not lose the filesystem, but chances are great that you will lose your N hours of work. It may also make sense to have a syncer behavior that is low rate, to not overwhelm asked-for IO, and to use most of the disk bandwidth when it is on, and to let it be otherwise, for laptops. But a basic correctness property is almost certainly that if the disk is spun up and is not in heavy use and lots of time passes, dirty buffers (data and metadata) are written to disk. Yep. Now, knowing about this bug, a simple sync-sleep loop takes care of it. But it should be fixed in the system, so the user doesn't have to remember to do this, or to install such a loop in one of the init-time files. /Don
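The sync-sleep loop Don mentions really is only a few lines of shell. Here is a sketch; the function name, the bounded iteration count, and the zero-second demo interval are illustrative only, and an init-time version (e.g. started from rc.local) would simply loop forever in the background:

```shell
# syncloop INTERVAL COUNT: call sync every INTERVAL seconds, COUNT times.
# A hypothetical unbounded rc.local entry would instead read:
#   while sleep 30; do sync; done &
syncloop() {
    interval=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        sync                              # ask the kernel to flush dirty data and metadata
        i=$((i + 1))
        [ "$i" -lt "$count" ] && sleep "$interval"
    done
    echo "performed $i sync(s)"
}

syncloop 0 3   # demo run: three back-to-back syncs
```

The iteration bound exists only so the sketch terminates; the point of the real loop is that an async-mounted filesystem then loses at most one interval's worth of unwritten metadata in a crash, provided the filesystem itself survives.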
Re: Lost file-system story
On Mon, Dec 12, 2011 at 3:10 PM, Greg A. Woods wo...@planix.ca wrote: At Sun, 11 Dec 2011 23:23:33 -0500, Donald Allen donaldcal...@gmail.com wrote: Subject: Re: Lost file-system story How can you possibly say such a thing and hope to be taken seriously? What you just said means that P(survival) = .999 is the same as P(survival) = 0. There are a LOT of situations (e.g., mine) where P(survival) = .999 would be very acceptable and P(survival) = 0 would not. The manual page must not give probabilities or even speak of possiblities. Even when the process the man page is describing is non-deterministic? So you want man pages that lie? So, as-is you have been warned properly by the manual page. For planning purposes you _must_ expect that your filesystem will be damaged beyond repair after a crash and that you will have to use newfs and restore to recover. Learn these expectations well and you will be happier in the long run. Fail to learn them and you have no recourse but to wallow in your own sorrows. I.e. you can't come to the mailing list and say that you expected something better just because you say you can get something better from something else entirely different. You have false expectations based on your experiences with entirely foreign environments. Maybe Humpty Dumpty can be put back together again, sometimes, but even if you have all the King's horses and all the King's men on call to respond to a disaster at a moment's notice, you must not expect that you can have the egg put back together successfully, even just once, even if it does look like just a minor crack this time. You seem to have some pre-conceived and incorrect notions, together with a don't-confuse-me-with-the-facts attitude. You've hit the Daily Double. You spoke about happier in the long run above. I'd suggest trying to give more weight to reading/input and less to writing/output, and you'll most likely be happier in the long run. No guarantees, of course. /Don -- Greg A. Woods Planix, Inc. 
wo...@planix.com +1 250 762-7675 http://www.planix.com/
Re: Lost file-system story
At Mon, 12 Dec 2011 14:17:35 -0600, Eric Haszlakiewicz e...@nimenees.com wrote: Subject: Re: Lost file-system story On Mon, Dec 12, 2011 at 11:39:38AM -0800, Greg A. Woods wrote: Having almost no knowledge about ext2 or any other non-Unix-based filesystems, I'm trying to be careful to avoid making any claims about those non-Unix-based filesystems. hmm.. so then how can you claim that it is entirely different (as you did in an earlier email)? It sounds like you're talking out of your, ahem.. depth. As I said, I'm trying to be careful to avoid making claims one way or another about non-Unix-based filesystems. I'm also trying to keep in mind that MNT_ASYNC can be an attribute of the OS implementation well above the filesystems and I'm also trying to avoid making claims about non-Unix filesystem structures which may be faced with this feature for the first time. Once upon a time I was quite familiar with the use of the tools that came before fsck. I have a great deal of experience with the on-disk structure of V7fs, SysVfs, and many of the minor variants of these filesystems. I'm experienced with many of the things that can go wrong with these filesystems and I'm moderately experienced with how they can be repaired as best as is humanly possible with low-level bit manipulating tools when bugs in either the kernel or fsck cause unexpected failures (not unlike what can happen when MNT_ASYNC is used). I'm moderately experienced with more modern filesystems such as with SysVr4's native FS and Berkeley FFS, though less experienced with low-level on-disk repair of those filesystems (since on these modern Unix-based filesystems the standard repair tools, especially fsck, have been vastly improved; and kernel bugs which destroy the ordered writing of metadata have effectively been eliminated). I included FFS as a Unix-based filesystem because I know for sure that it shares many of the attributes of the original Unix filesystems with respect to the issues surrounding MNT_ASYNC. 
Have you tried actually comparing the current NetBSD ffs sources against whatever Unix sources you are talking about? While I'm sure that there are many attributes that are shared, if you even compare the current NetBSD sources with those from, say, 1994, you will find a ton of differences. This has nothing to do with any given pile of source code per se. The issues that affect repairability of a Unix-based filesystem are higher level design considerations that are common to the implementations of fsck and the filesystems they can repair from the v7 addenda tape all the way through to the implementation of modern day NetBSD's fsck_ffs(8). You might find McKusick and Kowalski's paper about BSD FFS fsck enlightening. (I can supply a copy if you can't find it elsewhere. It would be nice if it could be included in the NetBSD distribution, even if not cleaned up to reflect the current implementation. It was in 4.4BSD-Lite2, after all.) Like I said earlier: Perhaps the superblock(s) should also record when a filesystem has been mounted with MNT_ASYNC so that fsck(8) can print a warning such as: FS is dirty and was mounted async. Demons will fly out of your nose -- Greg A. Woods Planix, Inc. wo...@planix.com +1 250 762-7675 http://www.planix.com/
Re: Lost file-system story
Greg A. Woods wo...@planix.ca writes: At Mon, 12 Dec 2011 14:23:40 -0600, Eric Haszlakiewicz e...@nimenees.com wrote: Subject: Re: Lost file-system story Donald, don't listen to Greg. Just in case it needs to be repeated, you're not the only one that thinks it is reasonable to expect a non-0 probability that things will be recoverable, even if something goes wrong. Eric, what part of MNT_ASYNC don't you understand? He seems to understand it quite well. Donald came here not complaining, just surprised that things were somewhat worse than one would have expected. And he's right - async doesn't mean "data might never be written, indefinitely"; it just means that there are no ordering or completion guarantees. I'm not 100% clear what is wrong, but it seems likely that this discussion has surfaced a bug or two.
Re: Lost file-system story
Hi, Why would sync not be effective under MNT_ASYNC? Use of sync is not required to lead to consistency except with respect to an arbitrary point in time, but I don't think anyone ever believed otherwise. However, there should be no question of metadata never being written out if sync was run? Matt - Greg A. Woods wo...@planix.ca wrote: (I am waffling though on whether I think sync(2) should have any beneficial effect on the consistency of MNT_ASYNC-mounted filesystems.) -- Matt Benjamin The Linux Box 206 South Fifth Ave. Suite 150 Ann Arbor, MI 48104 http://linuxbox.com tel. 734-761-4689 fax. 734-769-8938 cel. 734-216-5309
Re: Lost file-system story
They _can_ be repaired...some of the time. That's totally irrelevant. I don't think so, not when I'm replying to a claim otherwise. Possibilities other than zero or one are not useful in manual pages, Then we can throw away fsck, because there is always _some_ chance the filesystem will be irreparable. Memory, CPUs, disks, and the transports between them do fail, occasionally transiently. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTML mo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: Lost file-system story
On Fri, Dec 9, 2011 at 8:43 PM, Greg A. Woods wo...@planix.ca wrote: At Fri, 9 Dec 2011 15:50:35 -0500, Donald Allen donaldcal...@gmail.com wrote: Subject: Re: Lost file-system story does not guarantee to keep a consistent file system structure on the disk is what I expected from NetBSD. From what I've been told in this discussion, NetBSD pretty much guarantees that if you use async and the system crashes, you *will* lose the filesystem if there's been any writing to it for an arbitrarily long period of time, since apparently meta-data for async filesystems doesn't get written as a matter of course. I'm not sure what the difference is. You would be sure if you'd read my posts carefully. The difference is whether the probability of an async-mounted filesystem surviving a crash is near zero or near one. You seem to be quibbling over minor differences and perhaps one-off experiences. Having a crash almost certainly destroy your filesystem vs. having the filesystem almost certainly survive a crash is not a minor difference. Both OpenBSD and NetBSD also say that you should not use the async flag unless you are prepared to recreate the file system from scratch if your system crashes. That means use newfs(8) [and, by implication, something like restore(8)], not fsck(8), to recover after a crash. You got lucky with your test on OpenBSD. And then there's the matter of NetBSD fsck apparently not really being designed to cope with the mess left on the disk after such a crash. Please correct me if I've misinterpreted what's been said here (there have been a few different stories told, so I'm trying to compute the mean). That's been true of Unix (and many unix-like) filesystems and their fsck(8) commands since the beginning of Unix. fsck(8) is designed to rely on the possible states of on-disk filesystem metadata because that's how Unix-based filesystems have been guaranteed to work (barring use of MNT_ASYNC, obviously). 
And that's why by default, and by very strong recommendation, filesystem metadata for Unix-based filesystems (sans WAPBL) should always be written synchronously to the disk if you ever hope to even try to use fsck(8). That's simply not true. Have you ever used Linux in all the years that ext2 was the predominant filesystem? ext2 filesystems were routinely mounted async for many years; everything -- data, meta-data -- was written asynchronously with no regard to ordering. And yet, when those systems crashed, fsck generally, not always, but usually, restored the filesystem to working order. Of course, some data could be lost and was, but you rarely suffered the loss of an entire filesystem. That's a fact. I am not telling the OpenBSD story to rub NetBSD peoples' noses in it. I'm simply pointing out that that system appears to be an example of ffs doing what I thought it did and what I know ext2 and journal-less ext4 do -- do a very good job of putting the world into operating order (without offering an impossible guarantee to do so) after a crash when async is used, after having been told that ffs and its fsck were not designed to do this. You seem to be very confused about what MNT_ASYNC is and is not. :-) No, you don't understand what I've said. Unix filesystems, including Berkeley Fast File System variant, have never made any guarantees about the recoverability of an async-mounted filesystem after a crash. I never thought or asserted otherwise. You seem to have inferred some impossible capability based on your experience with other non-Unix filesystems that have a completely different internal structure and implementation from the Unix-based filesystems in NetBSD. Nonsense -- I have inferred no such thing. Instead of referring you to previous posts for a re-read, I'll give you a little summary. I am speaking about probabilities. 
I completely understand that no filesystem mounted async (or any other way, for that matter), whether Linux or NetBSD or OpenBSD, is GUARANTEED to survive a crash. The probability of surviving a crash for any of them is less than 1. But my experience with Linux ext2 over many years has been that the probability of survival is quite high, near 1. When I reported my experience with NetBSD ffs in this thread, I expressed surprise that the filesystem was a total loss, based on what preceded the crash. My surprise was a result of years of Linux experience. I then got some responses -- see the one from Thor Lancelot Simon, for example. In that message, he asserts that, in NetBSD, *nothing* pushes meta-data to the disk for a filesystem mounted async. Others said some contradictory things about that and I'm not sure what the truth is, but if Simon is right, then the probability of crash survival in NetBSD is indeed near zero. Another point that was made was that NetBSD ffs fsck was not designed to put a damaged filesystem back together, at least the kind of damage one might encounter with async mounting. The probability of an async
Re: Lost file-system story
On Fri, 9 Dec 2011 22:12:25 -0500 Donald Allen donaldcal...@gmail.com wrote: Linux systems do periodically write ext2 meta-data to the disk. And ext2 fsck has always been very good, and has gotten better over the years, due to the efforts of Ted T'so. I first installed Linux in 1993, almost 20 years ago, and have been using it continuously ever since. I have *never* lost an ext2 filesystem and I've never mounted one sync. I'm not sure if it's the case on Linux with ext2, but by default NetBSD FFS mounts are not sync, nor async; metadata is sync and data blocks are async. In async mode, all data is asynchronously written, including the metadata, and in sync mode everything is written synchronously (the default OpenBSD uses, if I recall). I just wanted to specify this as you mentioned not mounting your ext2 systems in sync mode, but a default NetBSD FFS mount will not be in sync mode either. Other available options with FFS are using soft dependencies (softdep) or WAPBL metadata journalling (log), with which it is possible to have increased performance vs. the default mode, without really sacrificing reliability, unlike with the fully async mode. In those modes, metadata is written asynchronously as well. Sorry if what I said is already obvious to you, -- Matt
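To make the taxonomy above concrete, the FFS modes Matt lists correspond roughly to the following mount options in /etc/fstab. The device and mount-point names here are placeholders, and the exact option spellings should be checked against mount(8) and mount_ffs(8) for your release:

```
/dev/wd0g /home ffs rw          1 2   # default: sync metadata, async data
/dev/wd0g /home ffs rw,sync     1 2   # everything written synchronously
/dev/wd0g /home ffs rw,async    1 2   # everything asynchronous (MNT_ASYNC)
/dev/wd0g /home ffs rw,softdep  1 2   # soft dependencies
/dev/wd0g /home ffs rw,log      1 2   # WAPBL metadata journalling
```

Only one of these lines would appear for a given filesystem, of course; they are shown side by side to contrast the modes.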
Re: Lost file-system story
On Tue, Dec 06, 2011 at 11:58:25AM -0500, Thor Lancelot Simon wrote: With the filesystem mounted async *nothing* pushes out most metadata updates, If this is really true, it's a bug and should be fixed. -- David A. Holland dholl...@netbsd.org
Fwd: Lost file-system story
I should have sent this to the mailing list as well as David. Google has fixed something that wasn't broke -- gmail. They've introduced a new UI that I haven't gotten used to yet ... -- Forwarded message -- From: Donald Allen donaldcal...@gmail.com Date: Sun, Dec 11, 2011 at 10:23 AM Subject: Re: Lost file-system story To: David Holland dholland-t...@netbsd.org On Sun, Dec 11, 2011 at 8:57 AM, David Holland dholland-t...@netbsd.org wrote: On Tue, Dec 06, 2011 at 11:58:25AM -0500, Thor Lancelot Simon wrote: With the filesystem mounted async *nothing* pushes out most metadata updates, If this is really true, it's a bug and should be fixed. It may very well be true. I just did the following: I brought up my test laptop, running 5.1 GENERIC, with /home mounted async,noatime. I created a new file in my home directory. I should note that when I ZZ'ed out of vi, the disk light flashed momentarily, and I could hear the disk doing something. I did an ls -lt | head and the new file was there. I waited just under a minute (to let syncs happen; this is longer than any of the sysctl vfs.sync.delays, which I assume are in seconds; the man page doesn't say) and then I pulled the plug (no battery in the machine). On restart, I got no fsck errors, but the new file was not in my home directory. I then repeated this test, waiting a little over a minute this time. Same result, the new file was gone (this time I got fsck errors). Then I did the test a third time, but this time I did a sync before pulling the plug. On restart, I still got some fsck errors that were fixed, but the new file was present. This does suggest that the meta-data is not being written, at least within a minute or so of creating a new file. One thing I think we have not discussed much or at all in this thread is that the filesystem surviving a crash and how much data you lose when it does survive are separate issues. 
The experiments I did yesterday demonstrate that a NetBSD ffs async-mounted filesystem, together with its fsck, *can* survive a crash in bad circumstances -- lots of write activity at the time of the crash. We don't know what the probability of survival is, just that it's not 0. What I did yesterday also does not address loss of data. If Simon is correct and NetBSD is just not writing metadata until sync is explicitly called, then you could have a system up for days or weeks and lose as many as all of the files created in an async filesystem since the last re-boot. We don't know definitively what it's doing yet, but I think I've demonstrated that it's not writing meta-data within one minute windows. I will do some more playing with this, waiting longer and will report what I find. We also know from this morning's tests that a user-called sync does get the meta-data written, reducing the amount of data lost in a crash that the filesystem survives. So those who advocated periodically calling sync in a loop (Christos first suggested this to me in a private email) are correct -- it's necessary if you are going to use async mounting. More later ... /Don -- David A. Holland dholl...@netbsd.org
Re: Lost file-system story
On Sun, Dec 11, 2011 at 10:25 AM, Donald Allen donaldcal...@gmail.com wrote: I should have sent this to the mailing list as well as David. Google has fixed something that wasn't broke -- gmail. They've introduced a new UI that I haven't gotten used to yet ... -- Forwarded message -- From: Donald Allen donaldcal...@gmail.com Date: Sun, Dec 11, 2011 at 10:23 AM Subject: Re: Lost file-system story To: David Holland dholland-t...@netbsd.org On Sun, Dec 11, 2011 at 8:57 AM, David Holland dholland-t...@netbsd.org wrote: On Tue, Dec 06, 2011 at 11:58:25AM -0500, Thor Lancelot Simon wrote: With the filesystem mounted async *nothing* pushes out most metadata updates, If this is really true, it's a bug and should be fixed. It may very well be true. I just did the following: I brought up my test laptop, running 5.1 GENERIC, with /home mounted async,noatime. I created a new file in my home directory. I should note that when I ZZ'ed out of vi, the disk light flashed momentarily, and I could hear the disk doing something. I did an ls -lt | head and the new file was there. I waited just under a minute (to let syncs happen; this is longer than any of the sysctl vfs.sync.delays, which I assume are in seconds; the man page doesn't say) and then I pulled the plug (no battery in the machine). On restart, I got no fsck errors, but the new file was not in my home directory. I then repeated this test, waiting a little over a minute this time. Same result, the new file was gone (this time I got fsck errors). Then I did the test a third time, but this time I did a sync before pulling the plug. On restart, I still got some fsck errors that were fixed, but the new file was present. This does suggest that the meta-data is not being written, at least within a minute or so of creating a new file. One thing I think we have not discussed much or at all in this thread is that the filesystem surviving a crash and how much data you lose when it does survive are separate issues. 
The experiments I did yesterday demonstrate that a NetBSD ffs async-mounted filesystem, together with its fsck, *can* survive a crash in bad circumstances -- lots of write activity at the time of the crash. We don't know what the probability of survival is, just that it's not 0. What I did yesterday also does not address loss of data. If Simon is correct and NetBSD is just not writing metadata until sync is explicitly called, then you could have a system up for days or weeks and lose as many as all of the files created in an async filesystem since the last re-boot. We don't know definitively what it's doing yet, but I think I've demonstrated that it's not writing meta-data within one minute windows. I will do some more playing with this, waiting longer and will report what I find. We also know from this morning's tests that a user-called sync does get the meta-data written, reducing the amount of data lost in a crash that the filesystem survives. So those who advocated periodically calling sync in a loop (Christos first suggested this to me in a private email) are correct -- it's necessary if you are going to use async mounting. I repeated the test without the sync, but waited 15 minutes after creating the new file before killing the power. When the system came up, I got fsck errors that were fixed, and the new file I created 15 minutes before pulling the plug was not present. Whether this is intentional or a bug, I agree with David Holland -- it's wrong and should be fixed. /Don More later ... /Don -- David A. Holland dholl...@netbsd.org
Re: Lost file-system story
On Sun, Dec 11, 2011 at 10:50:29AM -0500, Donald Allen wrote: I repeated the test without the sync, but waited 15 minutes after creating the new file before killing the power. When the system came up, I got fsck errors that were fixed, and the new file I created 15 minutes before pulling the plug was not present. Whether this is intentional or a bug, I agree with David Holland -- it's wrong and should be fixed. I disagree. It is exactly why I use FFS with -o async -- to get a disk backed storage, that doesn't waste resources, if everything fits into memory, but falls gracefully otherwise. Joerg
Re: Lost file-system story
On Sun, Dec 11, 2011 at 11:04 AM, Joerg Sonnenberger jo...@britannica.bec.de wrote: On Sun, Dec 11, 2011 at 10:50:29AM -0500, Donald Allen wrote: I repeated the test without the sync, but waited 15 minutes after creating the new file before killing the power. When the system came up, I got fsck errors that were fixed, and the new file I created 15 minutes before pulling the plug was not present. Whether this is intentional or a bug, I agree with David Holland -- it's wrong and should be fixed. I disagree. It is exactly why I use FFS with -o async -- to get a disk backed storage, that doesn't waste resources, if everything fits into memory, but falls gracefully otherwise. Certainly a valid requirement, but we haven't talked about what the fix should be. I think it should have an adjustable sync frequency, so that the user can turn a knob from I want to lose as little as possible to I want maximum performance. If I get my wish, you can use the latter, which should set the frequency to zero. /Don Joerg
Re: Lost file-system story
On Sun, Dec 11, 2011 at 11:32:51AM -0500, Donald Allen wrote: On Sun, Dec 11, 2011 at 11:04 AM, Joerg Sonnenberger jo...@britannica.bec.de wrote: On Sun, Dec 11, 2011 at 10:50:29AM -0500, Donald Allen wrote: I repeated the test without the sync, but waited 15 minutes after creating the new file before killing the power. When the system came up, I got fsck errors that were fixed, and the new file I created 15 minutes before pulling the plug was not present. Whether this is intentional or a bug, I agree with David Holland -- it's wrong and should be fixed. I disagree. It is exactly why I use FFS with -o async -- to get a disk backed storage, that doesn't waste resources, if everything fits into memory, but falls gracefully otherwise. Certainly a valid requirement, but we haven't talked about what the fix should be. I think it should have an adjustable sync frequency, so that the user can turn a knob from I want to lose as little as possible to I want maximum performance. If I get my wish, you can use the latter, which should set the frequency to zero. I don't see the point. Out of order meta updates can fry the file system at any point. Really, just don't use them if you can't recreate the file system freely. As has been mentioned elsewhere in the thread, the default mount option is *not* async. Joerg
Re: Lost file-system story
On Sun, Dec 11, 2011 at 11:44 AM, Joerg Sonnenberger jo...@britannica.bec.de wrote: On Sun, Dec 11, 2011 at 11:32:51AM -0500, Donald Allen wrote: On Sun, Dec 11, 2011 at 11:04 AM, Joerg Sonnenberger jo...@britannica.bec.de wrote: On Sun, Dec 11, 2011 at 10:50:29AM -0500, Donald Allen wrote: I repeated the test without the sync, but waited 15 minutes after creating the new file before killing the power. When the system came up, I got fsck errors that were fixed, and the new file I created 15 minutes before pulling the plug was not present. Whether this is intentional or a bug, I agree with David Holland -- it's wrong and should be fixed. I disagree. It is exactly why I use FFS with -o async -- to get a disk backed storage, that doesn't waste resources, if everything fits into memory, but falls gracefully otherwise. Certainly a valid requirement, but we haven't talked about what the fix should be. I think it should have an adjustable sync frequency, so that the user can turn a knob from I want to lose as little as possible to I want maximum performance. If I get my wish, you can use the latter, which should set the frequency to zero. I don't see the point. Out of order meta updates can fry the file system at any point. Really, just don't use them if you can't recreate the file system freely. As has been mentioned elsewhere in the thread, the default mount option is *not* async. Yes, they *can* destroy the filesystem, but in Linux ext2, they rarely do (see what I've said about this in previous messages in this thread), and I've started, in a small way, to build a case for NetBSD ffs and its fsck also having a reasonable probability of surviving a crash (what really matters is the joint probability of crashing -- very low in the case of Linux over the years -- *and* losing the filesystem on restart). As for the knob, it probably doesn't make sense to mount a filesystem async and then set the knob to sync every 50 milliseconds. 
One isn't going to get much of a performance benefit in return for incurring the risk of async mounting (I would guess that the risk goes down as the sync frequency goes up, but doesn't go to zero). If safety is one's orientation, it would probably be better to mount default, sync, or softdep, or use the new journaling option. But sync'ing every 5 minutes or 10 minutes might well give one the performance benefit that brought async to consideration in the first place, while likely limiting lost work to a 5- or 10-minute window. I say likely, because I emphasize again, for the umpteenth time in this discussion, that I completely understand that async incurs the risk of losing the whole filesystem. But if NetBSD/ffs/fsck turns out to exhibit the same behavior as Linux/ext2 has exhibited for years, the joint probability of crashing and incurring that loss is extremely low. And if it happens, I can and will deal with that. As an example, the machine I'm typing this on is running 5.1 with an /etc/fstab that looks like this:

# NetBSD /etc/fstab
# See /usr/share/examples/fstab/ for more examples.
/dev/wd0a  /         ffs     rw,noatime        1 1
/dev/wd0b  none      swap    sw,dp             0 0
/dev/wd0e  /usr      ffs     rw,noatime        1 2
/dev/wd0f  /var      ffs     rw,noatime        1 2
/dev/wd0g  /home     ffs     rw,noatime,async  1 2
/dev/wd0b  /tmp      mfs     rw,-s=205632
kernfs     /kern     kernfs  rw
ptyfs      /dev/pts  ptyfs   rw
procfs     /proc     procfs  rw
/dev/cd0a  /cdrom    cd9660  ro,noauto

So everything has the default mounting+noatime except /home, which is noatime,async. I routinely rsync my home directory among my many machines, so I've got N very up-to-date backups. If I lose /home, not that big a deal. But if the system crashes and the filesystem is recovered, I'd like to have the option to make it a smaller deal still, and be able to define a maximum-loss window, something smaller than the min(time since last normal reboot, time since last rsync). /Don Joerg
Re: Lost file-system story
More later ... I installed OpenBSD 5.0 on the same machine, similar setup (all filesystems noatime except /tmp and /home, which are both async,noatime). I repeated my experiment -- wrote a new file in my home directory, waited a few minutes, and killed the power. On reboot, there were complaints from the fscks, async and not, all fixed. The system came up without a manual fsck and the new file was present in my directory. So meta-data for async filesystems is being written within a window of a handful of minutes with OpenBSD. /Don
Re: Lost file-system story
On Fri 09 Dec 2011 at 17:40:29 -0500, Donald Allen wrote: If I can find the time, I'll do that. Even a little shell script would do: #!/bin/sh while sleep 30; do sync; done -Olaf. -- ___ Olaf 'Rhialto' Seibert -- There's no point being grown-up if you \X/ rhialto/at/xs4all.nl-- can't be childish sometimes. -The 4th Doctor
Re: Lost file-system story
On Sun, Dec 11, 2011 at 3:21 PM, Donald Allen donaldcal...@gmail.com wrote: More later ... I installed OpenBSD 5.0 on the same machine, similar setup (all filesystems noatime except /tmp and /home, which are both async,noatime). I repeated my experiment -- wrote a new file in my home directory, waited a few minutes, and killed the power. On reboot, there were complaints from the fscks, async and not, all fixed. The system came up without a manual fsck and the new file was present in my directory. So meta-data for async filesystems is being written within a window of a handful of minutes with OpenBSD. I haven't read every single word you've said about this subject, so I apologize if I'm missing something. I assume you're using async because you want better performance and you have some tolerance for data loss, otherwise this wouldn't even be a discussion, I think. We're just talking about probabilities of data loss then, correct? For some people (I suspect, a few that have already answered), this isn't something they are willing to discuss, even though we all know it's impossible to get to 1, as you said. But you can get really close these days. If solving your problem depends on sync frequency, I don't see why this shouldn't be managed by some knob to twiddle -- provided that the crash scenario doesn't get worse depending on where the knob is set, or on whether the crash happens while the sync is in progress. If it does, the knob is pointless. Why haven't other solutions been discussed? NetBSD supports ext2. And RAID. And all kinds of other stuff. Why not use it? Andy
Re: Lost file-system story
On Sun, Dec 11, 2011 at 05:04:23PM +0100, Joerg Sonnenberger wrote: I repeated the test without the sync, but waited 15 minutes after creating the new file before killing the power. When the system came up, I got fsck errors that were fixed, and the new file I created 15 minutes before pulling the plug was not present. Whether this is intentional or a bug, I agree with David Holland -- it's wrong and should be fixed. I disagree. It is exactly why I use FFS with -o async -- to get a disk backed storage, that doesn't waste resources, if everything fits into memory, but falls gracefully otherwise. That's as may be, but it's still wrong. The syncer should be writing out the metadata buffers as well as file data. (For your purpose, you'd want it to be writing out neither, btw.) Note the result from OpenBSD; we probably broke it with the UBC merge and never noticed. Don't we have at least one filesystem that doesn't support UBC? What happens to it? -- David A. Holland dholl...@netbsd.org
Re: Lost file-system story
My impression is that you are asking for the impossible. The underlying misconception (which I know very well, having suffered from it myself) is that a filesystem aims at keeping the on-disc metadata consistent and that tools like fsck are intended to repair any inconsistencies happening nonetheless. This, I learned, is not true. The point of synchronous metadata writes, soft dependency metadata write re-ordering, logging/journaling/WAPBL and whatnot is _not_ to keep the on-disc metadata consistent. The sole point is to, under all adverse conditions, leave that metadata in a state that can be _put back_ into a consistent state (preferably reflecting an in-memory state not too far back from the time of the crash) by fsck, on-mount journal replay or whatever. That difference becomes perfectly clear with journalling. After an unclean shutdown, the on-disc metadata need not be consistent. But the journal enables putting it back into a consistent state quite easily. So fsck is not aimed at, and does not claim to be able to, recover from random inconsistencies in the on-disc metadata. It is aimed at repairing those inconsistencies that can occur from a crash _given the metadata was written synchronously_. FreeBSD's background fsck, by the way, is aimed at repairing only those inconsistencies that can occur given the metadata was written with softdep's re-ordering. Of course, keeping the on-disc metadata in a ``repairable'' state incurs a performance penalty. So you seem to be asking for the File System Holy Grail: a file system that is as fast as asynchronous metadata writes, yet able to survive any possible kind of unclean shutdown. Such a thing, to my knowledge, doesn't exist.
Re: Lost file-system story
On Sat, Dec 10, 2011 at 1:14 PM, Edgar Fuß e...@math.uni-bonn.de wrote: My impression is that you are asking for the impossible. The underlying misconception (which I know very well, having suffered from it myself) is that a filesystem aims at keeping the on-disc metadata consistent and that tools like fsck are intended to repair any inconsistencies happening nonetheless. This, I learned, is not true. The point of synchronous metadata writes, soft dependency metadata write re-ordering, logging/journaling/WAPBL and whatnot is _not_ to keep the on-disc metadata consistent. The sole point is to, under all adverse conditions, leave that metadata in a state that can be _put back_ into a consistent state (preferably reflecting an in-memory state not too far back from the time of the crash) by fsck, on-mount journal replay or whatever. That difference becomes perfectly clear with journalling. After an unclean shutdown, the on-disc metadata need not be consistent. But the journal enables putting it back into a consistent state quite easily. So fsck is not aimed at, and does not claim to be able to, recover from random inconsistencies in the on-disc metadata. It is aimed at repairing those inconsistencies that can occur from a crash _given the metadata was written synchronously_. FreeBSD's background fsck, by the way, is aimed at repairing only those inconsistencies that can occur given the metadata was written with softdep's re-ordering. Of course, keeping the on-disc metadata in a ``repairable'' state incurs a performance penalty. So you seem to be asking for the File System Holy Grail: a file system that is as fast as asynchronous metadata writes, yet able to survive any possible kind of unclean shutdown. Such a thing, to my knowledge, doesn't exist. I'm sorry, I don't wish to be rude, but you, too, seem not to have read what I've written carefully. Or perhaps the fault is mine, in that I simply haven't made myself sufficiently clear.
I've talked at length about the behavior of Linux ext2 and that it was more than acceptable, both from a standpoint of performance and of reliability. I am not looking for something able to survive any possible kind of unclean shutdown. I'm looking for a reasonably low joint probability of a crash occurring *and* losing an async-mounted filesystem as a result. I simply want an async implementation where the benefit (performance) is not outweighed by the risk (lost filesystems), and I cited Linux ext2 as an example of that. If that's not clear to you, then I'm afraid I can't do better.
Re: Lost file-system story
On Fri, Dec 9, 2011 at 4:33 PM, Brian Buhrow buh...@lothlorien.nfbcal.org wrote: Hello. Just for your edification, it is possible to break out of fsck mid-way and reinvoke it with fsck -y to get it to do the cleaning on its own. This whole discussion, interesting though it may be, may have occurred simply because of my unfamiliarity with NetBSD and probably a mistake in not looking at the fsck man page for something like the -y option when I reached the point where continuing to feed 'y's to fsck after the original crash seemed like a losing battle. Had I thought about -y (I know that fscks typically have such an option, but in my experience it's an optional answer to fsck questions, as OpenBSD's is; for whatever reason, I didn't think of it), I'd have used it, since I had nothing to lose at that point. But it's possible you have put your finger on the real truth of what happened here. Read on. You suggested trying the experiment I did with OpenBSD with NetBSD, and so I did. Twice. I installed NetBSD with separate filesystems for /, /usr, /var, /tmp, and /home, a la OpenBSD's default setup. All, except /home and /tmp, were mounted softdep,noatime. /home was mounted async, and /tmp is an in-memory filesystem. The first time, I untarred the OpenBSD ports.tar.gz (I used it because it was what I used in the OpenBSD test, it's big, and I had it lying around) into a temporary directory in my home directory. With the battery removed from the laptop, I did an rm -rf ports and while that was happening, I pulled the power connector. On restart, fsck found a bunch of things it didn't like about the /home filesystem, but managed to fix things up to its satisfaction and declare the filesystem clean. My home directory survived this and, like OpenBSD, a fair amount of the ports directory was still present. I then removed it and re-did the untar, and while the untar was happening, I again pulled the plug.
This time, the automatic fsck got unhappy enough to drop me into single-user mode, where I ran fsck manually. I again encountered a seemingly never-ending sequence of requests to fix this and that. So I aborted and used the -y option. It charged through a bunch of trouble spots and completed. On reboot, I found the same situation as the first time -- home directory intact and some of the ports directory present. I have some thoughts about this: 1. Had I run fsck -y at the time of the first crash, I might well have found what I found today -- a repaired filesystem that was usable. So my assertion that the filesystem was lost may well have simply reflected my lack of skill as a NetBSD sysadmin. 2. Today's experiment shows that a NetBSD ffs filesystem mounted async, together with its fsck, *is* capable of surviving even a pretty brutal improper shutdown -- loss of power while a lot of writing was happening. Obviously I still don't have enough data to know if the probability of survival is comparable to Linux ext2, but what I found today is at least encouraging. I did one more experiment, and that was to untar the ports tarball and then wait about a minute. I then did a sync. The disk light blinked just for a brief moment. This is a *big* tar file, but it appears from this easy little test that there was not a huge amount of dirty stuff sitting in the buffer cache. This is obviously not definitive, but it does suggest that NetBSD is migrating stuff from the buffer cache back to the disk for async-mounted filesystems in a timely fashion. A look at the code is probably the final arbiter here. I also note that there are sysctl items, such as vfs.sync.metadelay, that I would like to understand. /Don Allen
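For anyone wanting to look at those knobs, a hedged sketch: vfs.sync.metadelay is the name cited above, but the full set of vfs.sync nodes and their semantics vary by release, so verify with sysctl(8) locally before relying on this.

```shell
# Inspect the syncer's delay tunables (NetBSD; the node names other
# than vfs.sync.metadelay are assumptions, not verified):
#   sysctl vfs.sync
#   sysctl -w vfs.sync.metadelay=10    # seconds; requires root
```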
Re: Lost file-system story
Donald Allen donaldcal...@gmail.com writes: On Sat, Dec 10, 2011 at 1:14 PM, Edgar Fuß e...@math.uni-bonn.de wrote: Of course, keeping the on-disc metadata in a ``repairable'' state incurs a performance penalty. So you seem to be asking for the File System Holy Grail: a file system that is as fast as asynchronous metadata writes, yet able to survive any possible kind of unclean shutdown. Such a thing, to my knowledge, doesn't exist. I'm sorry, I don't wish to be rude, but you, too, seem not to have read what I've written carefully. Or perhaps the fault is mine, in that I simply haven't made myself sufficiently clear. I've talked at length about the behavior of Linux ext2 and that it was more than acceptable, both from a standpoint of performance and of reliability. I am not looking for something able to survive any possible kind of unclean shutdown. I'm looking for a reasonably low joint probability of a crash occurring *and* losing an async-mounted filesystem as a result. I simply want an async implementation where the benefit (performance) is not outweighed by the risk (lost filesystems), and I cited Linux ext2 as an example of that. If that's not clear to you, then I'm afraid I can't do better. I think that it should be clear that async mount excludes what you want. Async mount basically means that you create a fresh file system after boot. In Linux it may mean another thing (e.g., it may be less asynchronous); in the BSDs it means exactly that. Thus, unless you really can afford to start the file system from scratch, don't mount it async. -- HE CE3OH...
Re: Lost file-system story
I just did a little experiment. I installed OpenBSD 5.0 on the same machine where I had my adventure with NetBSD. This time, I broke up the world into separate filesystems, which OpenBSD facilitates, mounting only /home and /tmp async,noatime. All the others were mounted softdep,noatime. I downloaded ports.tar.gz and un-tarred it into my home directory (I had previously un-tarred it into /usr). I then did

rm -rf ports

which takes a while. While that was going, I hit the power button (I can afford to lose a filesystem containing only my home directory; it's backed up thoroughly, because I rsync it from one machine to another; there are current copies on several other machines). The system did a rapid shutdown without sync'ing the filesystems. On restart, all the softdep-mounted filesystems had no errors in fsck, as one might expect (especially since there was no intensive write activity going on when I improperly shut the system down, as there was in /home), but I got an "Unexpected inconsistency" error on my home directory and a request for a manual fsck; the system dropped into single-user mode after the automatic fscks finished. I ran fsck on the filesystem that gets mounted as /home and there were a number of files and directories that were apparently half-deleted, and it asked me one-by-one if I wanted to delete them. I did with a few, but then used the 'F' option to do so without further interaction (I don't believe the NetBSD fsck gave me that option; it is not documented in the NetBSD fsck man page, while it *is* documented in the OpenBSD fsck man page). The fsck completed and marked the filesystem clean. I rebooted, everything mounted normally, and a check of my home directory shows everything intact, even most of the ports directory, whose deletion I deliberately interrupted. The async warning in the OpenBSD mount man page reads as follows:

     async   Metadata I/O to the file system should be done
             asynchronously. By default, only regular data is
             read/written asynchronously.

             This is a dangerous flag to set since it does not
             guarantee to keep a consistent file system structure
             on the disk. You should not use this flag unless you
             are prepared to recreate the file system should your
             system crash. The most common use of this flag is to
             speed up restore(8) where it can give a factor of two
             speed increase.

"does not guarantee to keep a consistent file system structure on the disk" is what I expected from NetBSD. From what I've been told in this discussion, NetBSD pretty much guarantees that if you use async and the system crashes, you *will* lose the filesystem if there's been any writing to it for an arbitrarily long period of time, since apparently meta-data for async filesystems doesn't get written as a matter of course. And then there's the matter of NetBSD fsck apparently not really being designed to cope with the mess left on the disk after such a crash. Please correct me if I've misinterpreted what's been said here (there have been a few different stories told, so I'm trying to compute the mean). I am not telling the OpenBSD story to rub NetBSD peoples' noses in it. I'm simply pointing out that that system appears to be an example of ffs doing what I thought it did, and what I know ext2 and journal-less ext4 do -- a very good job of putting the world into operating order (without offering an impossible guarantee to do so) after a crash when async is used -- after having been told that ffs and its fsck were not designed to do this. The reason I'm beating on this is that I would have liked to use NetBSD for the application I have in mind, but I need the performance improvement that async provides (my tests show this; the tests also show that NetBSD async is about as fast as Linux, and much faster than OpenBSD async, at least for doing a lot of writing, such as un-tarring a large tar file).
This is practical if the joint probability of the system crashing *and* losing the async filesystem is low. My one little data point was discouraging -- while using a wireless card with a common chipset (Atheros), I lost my network connection, and then the system crashed when a restart of networking was attempted (I had to use the Atheros card because the system didn't pick up the built-in Cisco wireless device, which I think is supposed to be served by the an(4) driver). The crash took out the filesystem, as we've been discussing. So I'd love it if my experience encourages someone to improve NetBSD ffs and fsck to make use of async practical, perhaps by drawing on what OpenBSD has done. I also realize that my situation is unusual, and with resources being scarce, there are a lot more important things to work on that will affect a lot more people. But I'd at least like to get it in the queue.
Re: Lost file-system story
Hello. Just for your edification, it is possible to break out of fsck mid-way and reinvoke it with fsck -y to get it to do the cleaning on its own. With regard to your notes on speed with NetBSD versus OpenBSD, I suspect the speed trade-off is where the difference is. OpenBSD is flushing buffers to disk more frequently than NetBSD is, and thus the filesystem is more complete with respect to what is on disk. Since you readily admit that you are a rare case, might I suggest that there may be an easy way for you to have your cake and eat it too. That is, get the speed and performance of NetBSD with the relative reliability (which may have been luck -- I'm not sure) of OpenBSD. You could write yourself a little program, or find an old version of update(8) from old source trees, which runs as a daemon and calls sync(2) every n seconds, where n is whatever comfort level you deem appropriate. I believe that when you call sync(2), even async-mounted filesystem data is flushed. With that program running, I'd be interested in having you retry your experiment with NetBSD and see if your results differ. -Brian On Dec 9, 3:50pm, Donald Allen wrote: } Subject: Re: Lost file-system story } I just did a little experiment. I installed OpenBSD 5.0 on the same } machine where I had my adventure with NetBSD. This time, I broke up } the world into separate filesystems, which OpenBSD facilitates, } mounting only /home and /tmp async, noatime. All the others were } mounted softdep,noatime. I downloaded ports.tar.gz and un-tarred it } into my home directory (I had previously un-tarred it into /usr). I } then did } } rm -rf ports } } which takes awhile. While that was going, I hit the power button (I } can afford to lose a filesystem containing only my home directory; } it's backed up thoroughly, because I rsync it from one machine to } another; there are current copies on several other machines). The } system did a rapid shutdown without sync'ing the filesystems.
} } On restart, all the softdep-mounted filesystems had no errors in fsck, } as one might expect (especially since there was no intensive } write-activity going on when I improperly shut the system down, as } there was in /home), but I got an Unexpected inconsistency error in } my home directory and requested a manual fsck; the system dropped into } single-user mode after the automatic fscks finished. I ran the fsck on } the filesystem that gets mounted as /home and there were a number of } files and directories that were apparently half-deleted and it asked } me one-by-one if I wanted to delete them. I did with a few, but then } used the 'F' option to do so without further interaction (I don't } believe the NetBSD fsck gave me that option; it is not documented in } the NetBSD fsck man page, while it *is* documented in the OpenBSD fsck } man page). The fsck completed and marked the filesystem clean. I } rebooted, everything mounted normally, and a check of my home } directory shows everything intact, even most of the ports directory, } whose deletion I deliberately interrupted. } } The async warning in the OpenBSD mount page reads as follows: } } async Metadata I/O to the file system should be done } asynchronously. By default, only regular data is } read/written asynchronously. } } This is a dangerous flag to set since it does not } guarantee to keep a consistent file system structure on } the disk. You should not use this flag unless you are } prepared to recreate the file system should your system } crash. The most common use of this flag is to speed up } restore(8) where it can give a factor of two speed } increase. } } does not guarantee to keep a consistent file system structure on the } disk is what I expected from NetBSD. 
From what I've been told in this } discussion, NetBSD pretty much guarantees that if you use async and } the system crashes, you *will* lose the filesystem if there's been any } writing to it for an arbitrarily long period of time, since apparently } meta-data for async filesystems doesn't get written as a matter of } course. And then there's the matter of NetBSD fsck apparently not } really being designed to cope with the mess left on the disk after } such a crash. Please correct me if I've misinterpreted what's been } said here (there have been a few different stories told, so I'm trying } to compute the mean). } } I am not telling the OpenBSD story to rub NetBSD peoples' noses in it. } I'm simply pointing out that that system appears to be an example of } ffs doing what I thought it did and what I know ext2 and journal-less } ext4 do -- do a very good job of putting the world into operating } order (without offering an impossible guarantee to do so) after a } crash when async is used, after
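The update(8)-style daemon Brian suggests can be sketched in a few lines of shell (the interval, the function names, and the bounded helper are all illustrative; a bounded variant is included so the loop can be exercised without running forever):

```shell
#!/bin/sh
# Periodic-sync sketch: flush dirty buffers every INTERVAL seconds,
# approximating the old update(8) daemon.

# sync_times COUNT INTERVAL -- bounded variant, handy for testing.
sync_times() {
    count=$1 interval=$2 i=0
    while [ "$i" -lt "$count" ]; do
        sync                    # schedule all dirty buffers for write-out
        sleep "$interval"
        i=$((i + 1))
    done
}

# sync_forever INTERVAL -- the daemon proper; run in the background, e.g.
#   sync_forever 30 &
sync_forever() {
    while :; do
        sync
        sleep "$1"
    done
}
```

Whether sync(2) actually flushes async-mounted metadata on the NetBSD release in question is exactly the open question in this thread, so treat this as an experiment, not a guarantee.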
Re: Lost file-system story
On Fri, 9 Dec 2011 15:50:35 -0500 Donald Allen donaldcal...@gmail.com wrote: were not designed to do this. The reason I'm beating on this is that I would have liked to use NetBSD for the application I have in mind, but I need the performance improvement that async provides (my tests show this; the tests also show that NetBSD async is about as fast as Linux, much faster than OpenBSD async, at least for doing a lot of writing, such as un-tarring a large tar file). This is practical if the joint The speed and reliability WAPBL provides have been enough for my uses personally; are the few seconds saved using async really worth the trouble? Also, if raw speed is needed to do many installations on identical systems, dd with large blocks to mirror the system might be a faster alternative... I'm not arguing that fsck shouldn't be able to recover, though; it ideally should, but the problem seems to be that too much metadata is missing when crashing while writing in async mode. OpenBSD's async mode would be slightly slower while flushing metadata more often, probably. Perhaps having a sysctl to control flushing would be a good thing, though. Thanks, -- Matt
Re: Lost file-system story
On Fri, Dec 9, 2011 at 4:33 PM, Brian Buhrow buh...@lothlorien.nfbcal.org wrote: Hello. Just for your edification, it is possible to break out of fsck mid-way and reinvoke it with fsck -y to get it to do the cleaning on its own. With regard to your notes on speed with NetBSD versus OpenBSD, I suspect the speed trade-off is where the difference is. OpenBSD is flushing buffers to disk more frequently than NetBSD is, and thus the filesystem is more complete with respect to what is on disk. I suspect that is due to OpenBSD's lack of a unified buffer cache, which NetBSD has. So they run out of space in the buffer cache, even though memory devoted to (empty) page-frames is available. Since you readily admit that you are a rare case, might I suggest that there may be an easy way for you to have your cake and eat it too. That is, get the speed and performance of NetBSD with the relative reliability (which may have been luck -- I'm not sure) of OpenBSD. You could write yourself a little program, or find an old version of update(8) from old source trees, which runs as a daemon and calls sync(2) every n seconds, where n is whatever comfort level you deem appropriate. I believe that when you call sync(2), even async-mounted filesystem data is flushed. With that program running, I'd be interested in having you retry your experiment with NetBSD and see if your results differ. If I can find the time, I'll do that. -Brian On Dec 9, 3:50pm, Donald Allen wrote: } Subject: Re: Lost file-system story } I just did a little experiment. I installed OpenBSD 5.0 on the same } machine where I had my adventure with NetBSD. This time, I broke up } the world into separate filesystems, which OpenBSD facilitates, } mounting only /home and /tmp async, noatime. All the others were } mounted softdep,noatime. I downloaded ports.tar.gz and un-tarred it } into my home directory (I had previously un-tarred it into /usr). I } then did } } rm -rf ports } } which takes awhile.
While that was going, I hit the power button (I } can afford to lose a filesystem containing only my home directory; } it's backed up thoroughly, because I rsync it from one machine to } another; there are current copies on several other machines). The } system did a rapid shutdown without sync'ing the filesystems. } } On restart, all the softdep-mounted filesystems had no errors in fsck, } as one might expect (especially since there was no intensive } write-activity going on when I improperly shut the system down, as } there was in /home), but I got an Unexpected inconsistency error in } my home directory and requested a manual fsck; the system dropped into } single-user mode after the automatic fscks finished. I ran the fsck on } the filesystem that gets mounted as /home and there were a number of } files and directories that were apparently half-deleted and it asked } me one-by-one if I wanted to delete them. I did with a few, but then } used the 'F' option to do so without further interaction (I don't } believe the NetBSD fsck gave me that option; it is not documented in } the NetBSD fsck man page, while it *is* documented in the OpenBSD fsck } man page). The fsck completed and marked the filesystem clean. I } rebooted, everything mounted normally, and a check of my home } directory shows everything intact, even most of the ports directory, } whose deletion I deliberately interrupted. } } The async warning in the OpenBSD mount page reads as follows: } } async Metadata I/O to the file system should be done } asynchronously. By default, only regular data is } read/written asynchronously. } } This is a dangerous flag to set since it does not } guarantee to keep a consistent file system structure on } the disk. You should not use this flag unless you are } prepared to recreate the file system should your system } crash. The most common use of this flag is to speed up } restore(8) where it can give a factor of two speed } increase. 
} } does not guarantee to keep a consistent file system structure on the } disk is what I expected from NetBSD. From what I've been told in this } discussion, NetBSD pretty much guarantees that if you use async and } the system crashes, you *will* lose the filesystem if there's been any } writing to it for an arbitrarily long period of time, since apparently } meta-data for async filesystems doesn't get written as a matter of } course. And then there's the matter of NetBSD fsck apparently not } really being designed to cope with the mess left on the disk after } such a crash. Please correct me if I've misinterpreted what's been } said here (there have been a few different stories told, so I'm trying } to compute the mean). } } I am
Re: Lost file-system story
At Tue, 6 Dec 2011 12:44:16 -0500, Donald Allen donaldcal...@gmail.com wrote: Subject: Re: Lost file-system story much more clear. When I read this before the fun started, I took it to mean, perhaps unjustifiably, what I know to be true -- there is some non-zero probability that fsck of an async file-system will not be able to verify and/or restore the filesystem to correctness after a crash. You are saying that the probability, in the case of NetBSD, is 1. If that's true, that there's no periodic sync, I would say that's *really* a mistake. It should be there with a knob the administrator can turn to adjust the sync frequency. Just to be clear: there is such a knob, or rather a binary switch. It's called umount(2). sync(2) might work too, but I seem to vaguely remember something about it not working for async-mounted filesystems, and some obscure reason why it wouldn't/couldn't work for them, though that doesn't seem logical to me any more. sync(2) should, IMHO, even go so far as to cause the dirty flag to be cleared on the disk once all the writes to flush all necessary updates have completed (assuming of course that no further changes of any kind are made to the filesystem after sync(2) has scheduled all the writes, and assuming of course that writes cached in the storage interface controller or in the drive controller will be written out in order). In theory mount -u -r should work too, but then there's PR#30525. Steve Bellovin asked a question some time ago on netbsd-users about why umount(2) works, but mount -u -r doesn't, and to the best of my understanding it hasn't been answered yet (though mention was made of a possible fix to be found in FreeBSD, followed by some musings on how hard it is to find and use such fixes in the diverging code bases of FreeBSD and NetBSD).
Perhaps sync(2) will fail for async-mounted filesystems, or even without MNT_ASYNC, for the same reason that mount -u -r fails, though that's pure speculation based on my vague ideas, and is not based on anything in the code. The question was asked in PR#30525 about mount -u -r vs. filesystems mounted with MNT_SYNC, but nobody knew if that would make any significant difference or not (and I would naively suspect not). Perhaps the superblock should also record when a filesystem has been mounted with MNT_ASYNC so that fsck(8) can print a warning such as: FS is dirty and was mounted async. Demons will fly out of your nose -- Greg A. Woods Planix, Inc. wo...@planix.com +1 250 762-7675http://www.planix.com/ pgppoVyhhnBug.pgp Description: PGP signature
Re: Lost file-system story
At Fri, 9 Dec 2011 15:50:35 -0500, Donald Allen donaldcal...@gmail.com wrote: Subject: Re: Lost file-system story does not guarantee to keep a consistent file system structure on the disk is what I expected from NetBSD. From what I've been told in this discussion, NetBSD pretty much guarantees that if you use async and the system crashes, you *will* lose the filesystem if there's been any writing to it for an arbitrarily long period of time, since apparently meta-data for async filesystems doesn't get written as a matter of course. I'm not sure what the difference is. You seem to be quibbling over minor differences and perhaps one-off experiences. Both OpenBSD and NetBSD also say that you should not use the async flag unless you are prepared to recreate the file system from scratch if your system crashes. That means use newfs(8) [and, by implication, something like restore(8)], not fsck(8), to recover after a crash. You got lucky with your test on OpenBSD. And then there's the matter of NetBSD fsck apparently not really being designed to cope with the mess left on the disk after such a crash. Please correct me if I've misinterpreted what's been said here (there have been a few different stories told, so I'm trying to compute the mean). That's been true of Unix (and many unix-like) filesystems and their fsck(8) commands since the beginning of Unix. fsck(8) is designed to rely on the possible states of on-disk filesystem metadata because that's how Unix-based filesystems have been guaranteed to work (barring use of MNT_ASYNC, obviously). And that's why, by default and by very strong recommendation, filesystem metadata for Unix-based filesystems (sans WAPBL) should always be written synchronously to the disk if you ever hope to even try to use fsck(8). I am not telling the OpenBSD story to rub NetBSD peoples' noses in it.
I'm simply pointing out that that system appears to be an example of ffs doing what I thought it did and what I know ext2 and journal-less ext4 do -- do a very good job of putting the world into operating order (without offering an impossible guarantee to do so) after a crash when async is used, after having been told that ffs and its fsck were not designed to do this. You seem to be very confused about what MNT_ASYNC is and is not. :-) Unix filesystems, including the Berkeley Fast File System variants, have never made any guarantees about the recoverability of an async-mounted filesystem after a crash. You seem to have inferred some impossible capability based on your experience with other non-Unix filesystems that have a completely different internal structure and implementation from the Unix-based filesystems in NetBSD. Perhaps the BSD manuals have assumed some knowledge of Unix history, but even the NetBSD-1.6 mount(8) manual, from 2002, is _extremely_ clear about the dangers of the async flag, with strong emphasis in the formatted text on the relevant warning:

     async   All I/O to the file system should be done asynchronously.
             In the event of a crash, _it_is_impossible_for_the_system_
             _to_verify_the_integrity_of_data_on_a_file_system_mounted_
             _with_this_option._ You should only use this option if you
             have an application-specific data recovery mechanism, or
             are willing to recreate the file system from scratch.

According to CVS that wording has not changed since October 1, 2002, and the emphasised text has been there unchanged since September 16, 1998. So I'd love it if my experience encourages someone to improve NetBSD ffs and fsck to make use of async practical As others have already said, this has already been done. It's called WAPBL. See wapbl(4) for more information. Use mount -o log to enable it.
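Since mount -o log is mentioned: WAPBL can also be enabled persistently via /etc/fstab rather than on the command line. A sketch only -- the device name wd0a is illustrative, so substitute the partition from your own disklabel:

```
# /etc/fstab -- enable WAPBL journaling on an ffs root (device name illustrative)
/dev/wd0a   /   ffs   rw,log   1 1
```

With the log option the journal is created automatically at mount time, and recovery after a crash then normally reduces to a journal replay rather than a full fsck.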
(BTW, I personally don't think you would want to use softdep -- it can suffer almost as badly as async after a crash, though perhaps without totally invalidating fsck(8)'s ability to at least recover files and directories which were static since mount; and it does also offer vastly improved performance in many use cases, but as the manual says, it should still be used with care, i.e. with recognition of the risks of less-tested, much more complex code, and of vastly changed internal implementation semantics implying radically different recovery modes.) -- Greg A. Woods Planix, Inc. wo...@planix.com +1 250 762-7675 http://www.planix.com/
Re: Lost file-system story
On Wed, Dec 07, 2011 at 10:21:14AM +1100, Simon Burge wrote: David Holland wrote: There is at least one known structural problem where atime/mtime updates do not get applied to buffers (but are instead saved up internally) so they don't get written out by the syncer. We believe this is what causes those unmount-time writes, or at least many of them. I understand the delayed atime writes were added to reduce the number of times a laptop harddisk spins up. I've often wondered if a simple sysctl could be added to control this. Unmounting my /home on my main machine takes approximately a minute. My Seconded. On an ftp server with a large filesystem (5TB, 5M inodes), shutdown takes a very long time too. -- Manuel Bouyer bou...@antioche.eu.org NetBSD: 26 ans d'experience feront toujours la difference --
Re: Lost file-system story
On Wed, Dec 07, 2011 at 10:54:40AM +0100, Manuel Bouyer wrote: On Wed, Dec 07, 2011 at 10:21:14AM +1100, Simon Burge wrote: David Holland wrote: There is at least one known structural problem where atime/mtime updates do not get applied to buffers (but are instead saved up internally) so they don't get written out by the syncer. We believe this is what causes those unmount-time writes, or at least many of them. I understand the delayed atime writes were added to reduce the number of times a laptop harddisk spins up. I've often wondered if a simple sysctl could be added to control this. Unmounting my /home on my main machine takes approximately a minute. My Seconded. On an ftp server with a large filesystem (5TB, 5M inodes), shutdown takes a very long time too. Isn't that more the issue of writing out the atime updates? Joerg
Re: Lost file-system story
On Wed, Dec 07, 2011 at 10:59:11AM +0100, Joerg Sonnenberger wrote: On Wed, Dec 07, 2011 at 10:54:40AM +0100, Manuel Bouyer wrote: On Wed, Dec 07, 2011 at 10:21:14AM +1100, Simon Burge wrote: David Holland wrote: There is at least one known structural problem where atime/mtime updates do not get applied to buffers (but are instead saved up internally) so they don't get written out by the syncer. We believe this is what causes those unmount-time writes, or at least many of them. I understand the delayed atime writes were added to reduce the number of times a laptop harddisk spins up. I've often wondered if a simple sysctl could be added to control this. Unmounting my /home on my main machine takes approximately a minute. My Seconded. On an ftp server with a large filesystem (5TB, 5M inodes), shutdown takes a very long time too. Isn't that more the issue of writing out the atime updates? Yes, that's it. Wasn't Simon talking about this? -- Manuel Bouyer bou...@antioche.eu.org NetBSD: 26 ans d'experience feront toujours la difference --
Re: Lost file-system story
Manuel Bouyer wrote: On Wed, Dec 07, 2011 at 10:59:11AM +0100, Joerg Sonnenberger wrote: On Wed, Dec 07, 2011 at 10:54:40AM +0100, Manuel Bouyer wrote: On Wed, Dec 07, 2011 at 10:21:14AM +1100, Simon Burge wrote: David Holland wrote: There is at least one known structural problem where atime/mtime updates do not get applied to buffers (but are instead saved up internally) so they don't get written out by the syncer. We believe this is what causes those unmount-time writes, or at least many of them. I understand the delayed atime writes were added to reduce the number of times a laptop harddisk spins up. I've often wondered if a simple sysctl could be added to control this. Unmounting my /home on my main machine takes approximately a minute. My Seconded. On an ftp server with a large filesystem (5TB, 5M inodes), shutdown takes a very long time too. Isn't that more the issue of writing out the atime updates? Yes, that's it. Wasn't Simon talking about this? Yes. We all appear to be in total agreement here :) Cheers, Simon.
Re: Lost file-system story
On Tue, Dec 06, 2011 at 12:44:16PM -0500, Donald Allen wrote: On Tue, Dec 6, 2011 at 11:58 AM, Thor Lancelot Simon t...@panix.com wrote: On Tue, Dec 06, 2011 at 11:10:44AM -0500, Donald Allen wrote: 2. I'm a little bit surprised that the filesystem was as much of a mess as it was. I'm not. You mounted the filesystem async and had a crash. With the filesystem mounted async *nothing* pushes out most metadata updates, with the result that the filesystem's metadata can quickly enter a fatally inconsistent state. The only way home safe is a clean unmount. So unwritten meta-data from an async filesystem can sit in the buffer cache for arbitrarily long periods of time in NetBSD? I just want to be sure I understand what you are saying. Because that essentially guarantees, as you imply above, that if the system crashes, you will lose the filesystem. That makes the following warning, in the mount(8) man page, in the description of the async option: In the event of a crash, it is impossible for the system to verify the integrity of data on a file system mounted with this option. much more clear. When I read this before the fun started, I took it to You left out part of the warning. From NetBSD 5.1:

     async   All I/O to the file system should be done asynchronously.
             In the event of a crash, it is impossible for the system
             to verify the integrity of data on a file system mounted
             with this option. You should only use this option if you
             have an application-specific data recovery mechanism, or
             are willing to recreate the file system from scratch.

Isn't the last sentence of that paragraph in your version?
Basically, there are two situations where -o async on ffs is sort of safe:

a) you're installing or restore(8)ing onto a freshly newfs'd filesystem, plan to unmount (or shut down) as soon as you're finished and before using the file system, and could do it all again, with the same source data, in the event of a power failure during the operation; you get the benefit of a fast installation/restore.

b) the file system is on volatile memory and would be gone anyway on shutdown, crash, or power failure.

FFS in its default mode has been designed to do part of its operations in an async fashion, but to guarantee enough writes that the remaining inconsistencies after a power failure can be cleaned up by fsck_ffs. fsck_ffs is designed for this task. It's not designed for arbitrary repairs. (When ffs does async, it really does it.) Regards, -is
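The distinction drawn here -- metadata writes that are guaranteed to reach the disk vs. writes that merely sit in the cache -- can be felt from userland too. A small, portable Python sketch (the filenames are illustrative): an ordinary buffered write may linger in the kernel's buffer cache, which is roughly the permanent condition of an async mount, while fsync(2) forces it to stable storage.

```python
import os
import tempfile

# Write a file and force it to stable storage with fsync(2).
# Without the fsync, the data could sit in the kernel's buffer
# cache until the syncer (or an unmount) pushes it out -- which
# is roughly the situation an async-mounted ffs is in for *all*
# of its metadata.
path = os.path.join(tempfile.mkdtemp(), "durable.txt")
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
try:
    os.write(fd, b"important data\n")
    os.fsync(fd)          # block until the data reaches the disk
finally:
    os.close(fd)

# Also fsync the containing directory so the new directory entry
# itself is durable (that entry is metadata, not file contents).
dfd = os.open(os.path.dirname(path), os.O_RDONLY)
try:
    os.fsync(dfd)
except OSError:
    pass                  # some platforms disallow fsync on directories
finally:
    os.close(dfd)

with open(path, "rb") as f:
    print(f.read())       # b'important data\n'
```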
Re: Lost file-system story
On Wed, Dec 7, 2011 at 9:58 AM, Ignatios Souvatzis pre...@ycm-bonn.de wrote: On Tue, Dec 06, 2011 at 12:44:16PM -0500, Donald Allen wrote: On Tue, Dec 6, 2011 at 11:58 AM, Thor Lancelot Simon t...@panix.com wrote: On Tue, Dec 06, 2011 at 11:10:44AM -0500, Donald Allen wrote: 2. I'm a little bit surprised that the filesystem was as much of a mess as it was. I'm not. You mounted the filesystem async and had a crash. With the filesystem mounted async *nothing* pushes out most metadata updates, with the result that the filesystem's metadata can quickly enter a fatally inconsistent state. The only way home safe is a clean unmount. So unwritten meta-data from an async filesystem can sit in the buffer cache for arbitrarily long periods of time in NetBSD? I just want to be sure I understand what you are saying. Because that essentially guarantees, as you imply above, that if the system crashes, you will lose the filesystem. That makes the following warning, in the mount(8) man page, in the description of the async option: In the event of a crash, it is impossible for the system to verify the integrity of data on a file system mounted with this option. much more clear. When I read this before the fun started, I took it to You left out part of the warning. From NetBSD 5.1:

     async   All I/O to the file system should be done asynchronously.
             In the event of a crash, it is impossible for the system
             to verify the integrity of data on a file system mounted
             with this option. You should only use this option if you
             have an application-specific data recovery mechanism, or
             are willing to recreate the file system from scratch.

Isn't the last sentence of that paragraph in your version? No. My version says "If you use this option and the system crashes, everything will be fine." /Don
Lost file-system story
I recently installed NetBSD 5.1 on an old Thinkpad T41 that I use for experimentation. I installed it with a single, monolithic filesystem, which I mounted async,noatime. Yes, I'm fully aware that's dangerous and was aware of it at the time. But I have a long history of running Linux systems with ext2 filesystems and now, journal-less ext4 filesystems, and in all the years of running those systems, where no particular care is taken to write file-system meta-data in ordered fashion, I have never lost a file-system. Linux crashes are extremely rare, my systems are either laptops or on UPSes, and I never do something as stupid as just whacking the power-button to shut them down. On the rare occasions when a file-system has suffered an improper shutdown, fsck has always been able to recover with little or no damage. (I should perhaps mention that I'm retired now, having had a long career in software development, with a lot of OS development experience -- IBM CP/67, Tenex, TOPS20, Unix (Mach), and a LOT of Linux sys-admin experience; less with the BSDs, but not zero). The T41 has built-in Aironet Wireless Communications MPI350 wireless hardware. The GENERIC 5.1 kernel did not see this device at boot time, so no wireless. To fix this, I stuck an Atheros-based PCMCIA card in the machine, which did work. I was attempting to build Gnucash via pkgsrc on the T41 and had left the machine grinding away overnight (webkit is one of Gnucash's dependencies, and it's huge). It had finished the build when I got up the following morning and I installed gnucash and then did a bunch of cleaning-up in /usr/pkgsrc. I then tried to use firefox and found that my network connection was dead. So I did a /etc/rc.d/network restart and the system froze, completely dead. Upon restart, the automatic fsck gave up and requested a manual fsck. 
I tried that, but there are just too many things broken, a consequence, I'm sure, of running async and having this crash occur just after having done a lot of filesystem writing. The situation was so bad, I had to abandon this install. There are two issues here:

1. It looks like there's a bug in the Atheros driver.

2. I'm a little bit surprised that the filesystem was as much of a mess as it was.

I mentioned all this to old friend Christos Zoulas and he suggested that I post this message. It is certainly true that I had done a lot of writing to the filesystem (as a result of my pkgsrc cleanup) and that had occurred within, say, 10 minutes of the crash, maybe less. So it wasn't hours. But it also wasn't seconds. My Linux experience, and this is strictly gut feel -- I have no hard evidence to back this up -- tells me that if this had happened on a Linux system with an async, unjournaled filesystem, the filesystem would have survived. In suggesting that I post this, Christos mentioned that he's seen situations where a lot of writing happened in a session (e.g., a kernel build) and then the sync at shutdown time took a long time, which has made him somewhat suspicious that there might be a problem with the trickle sync that the kernel is supposed to be doing. So my purpose in posting this is to ask: after doing 'make clean's of perhaps 15 or 20 packages and their dependencies, what is your estimate of the maximum time before everything gets safely written out of the buffer cache (this machine has a 1.6 GHz Pentium M, 2 GB of memory, and a 7200 rpm 60 GB PATA disk -- yes, not a normal configuration for a T41; I stuck in memory and a disk taken from another, dead Thinkpad I have)? Is it seconds? Tens of seconds? Minutes? If it's small, then I would suggest that a kernel wizard have a look at the trickle sync stuff.
I made the point to Christos that I'm probably one of a very small number, maybe one, who would mount the whole world async (and please, no lectures; I knew the risk going in; this was an experiment and I knew it could end badly; I did not have 10 years worth of un-backed-up financial data on this machine :-), and it is almost certainly true that if the filesystem had been mounted sync or softdep, it would have survived the crash. So if there's a problem with trickle sync, it would only have catastrophic consequences in the very rare case of someone doing what I did (mounting async, doing a lot of writing followed by a system crash). I'm trying to make the argument that there could be a problem that is benign in 99.99% of NetBSD setups, and so you haven't heard about it. /Don Allen
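The timing question above can at least be probed crudely from userland: do a burst of writes, then time an explicit sync(2); a long sync implies the syncer had not kept up. A rough sketch in Python -- the file count and sizes are arbitrary parameters chosen for illustration, and the measured time is obviously machine-dependent:

```python
import os
import tempfile
import time

# Burst of buffered writes, then measure how long sync(2) takes.
# If the trickle syncer has been keeping up, the sync should be
# nearly instantaneous; a long sync means dirty buffers had
# accumulated.  (100 files of 64 KiB are arbitrary parameters.)
workdir = tempfile.mkdtemp()
for i in range(100):
    with open(os.path.join(workdir, "f%03d" % i), "wb") as f:
        f.write(os.urandom(64 * 1024))

start = time.monotonic()
os.sync()                 # flush all dirty buffers system-wide
elapsed = time.monotonic() - start
print("sync took %.3f seconds" % elapsed)
```

Running this immediately after a large pkgsrc clean, and again after waiting 30-60 seconds, would give a direct hint whether dirty data is still lingering past the expected syncer interval.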
Re: Lost file-system story
Interesting situation. I agree that after 30s to a minute most things should have been flushed. As a side note, it would be interesting to benchmark async vs wapbl. I have never really looked, but it has always seemed that it would be nice to have:

- statistics visibility into the number of dirty buffers/etc. in various caches

- a way to force flushes and clear caches (individually)

Specifically, I think it would be great if 'systat vmstat' had a count of dirty buffers. Perhaps this is doable now and I just don't know how. Another question is if the disk had write caching enabled, but I would also expect it to flush the write cache quickly. It would be nice to have visibility into that cache, but I don't know if the ata interface supports it.
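For comparison only: the kind of counter being wished for here does exist on some other systems. On Linux, for instance, the system-wide dirty page-cache total is a single line in /proc/meminfo. A tiny sketch, Linux-only and with no claim that NetBSD exposes an equivalent:

```python
# Linux-only illustration: report the system-wide dirty page-cache
# total from /proc/meminfo.  NetBSD exposes no direct equivalent in
# systat/vmstat, which is exactly the gap being described above.
def dirty_kb():
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("Dirty:"):
                return int(line.split()[1])   # value is in kB
    raise RuntimeError("no Dirty: line found")

print("dirty pages: %d kB" % dirty_kb())
```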
Re: Lost file-system story
On Tue, Dec 6, 2011 at 11:10 AM, Donald Allen donaldcal...@gmail.com wrote: [deleted] catastrophic consequences in the very rare case of someone doing what I did (mounting async, doing a lot of writing followed by a system crash). I'm trying to make the argument that there could be a problem that is benign in 99.99% of the NetBSD setups, and so you haven't heard about. I should amend this a bit. By 'benign' above, I meant that you wouldn't lose the filesystem. But if trickle-sync is working too slowly or not at all, I would think that, in the event of a crash preceded by writes to a softdep-mounted filesystem, more data could be lost than if trickle-sync were working as intended. Which wouldn't feel so benign if it happened to you. /Don
Re: Lost file-system story
On Tue, Dec 06, 2011 at 11:10:44AM -0500, Donald Allen wrote: 2. I'm a little bit surprised that the filesystem was as much of a mess as it was. I'm not. You mounted the filesystem async and had a crash. With the filesystem mounted async *nothing* pushes out most metadata updates, with the result that the filesystem's metadata can quickly enter a fatally inconsistent state. The only way home safe is a clean unmount. If you mount an FFS filesystem async you are playing with fire. Sure, it can be useful, but asbestos clothing is not optional. Thor
Re: Lost file-system story
On Tue, Dec 06, 2011 at 11:10:44AM -0500, Donald Allen wrote: My Linux experience, and this is strictly gut feel -- I have no hard evidence to back this up -- tells me that if this had happened on a Linux system with an async, unjournaled filesystem, the filesystem would have survived. Yes, it likely would have, at least if that filesystem was ext2fs. There is at least one issue beyond bugs though: ext2's fsck is written to cope with this situation. The ffs fsck isn't, and so it makes unwarranted assumptions and gets itself into trouble, sometimes even into infinite repair loops. (That is, where you can 'fsck -fy' over and over again and it'll never reach a clean state.) The short answer is: don't do that. I have no idea, btw, if using our ext2fs this way, along with e2fsck from the Linux e2fsprogs, can be expected to work or not. I have doubts about our fsck_ext2fs though. In suggesting that I post this, Christos mentioned that he's seen situations where a lot of writing happened in a session (e.g., a kernel build) and then the sync at shutdown time took a long time, which has made him somewhat suspicious that there might be a problem with the trickle sync that the kernel is supposed to be doing. There is at least one known structural problem where atime/mtime updates do not get applied to buffers (but are instead saved up internally) so they don't get written out by the syncer. We believe this is what causes those unmount-time writes, or at least many of them. However, failure to update timestamps shouldn't result in a trashed fs. -- David A. Holland dholl...@netbsd.org
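For readers not steeped in inode metadata: the timestamps at issue are the per-file access and modification times visible through stat(2). A minimal Python illustration -- it sets them explicitly with utime(2), since whether a mere read touches atime at all depends on mount options such as noatime (the very behaviour whose delayed write-back is being discussed):

```python
import os
import tempfile

# The inode timestamps under discussion: st_atime (last access)
# and st_mtime (last modification).  We set them explicitly with
# utime(2) because whether a plain read updates atime depends on
# mount options like noatime, so reads are not a reliable demo.
path = os.path.join(tempfile.mkdtemp(), "stamped")
open(path, "w").close()

os.utime(path, (1000000000, 2000000000))   # (atime, mtime), arbitrary epochs
st = os.stat(path)
print(int(st.st_atime), int(st.st_mtime))  # 1000000000 2000000000
```

These are exactly the small metadata updates that, per the message above, are saved up internally rather than being applied to buffers the syncer would write out.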
Re: Lost file-system story
On Tue, Dec 6, 2011 at 11:58 AM, Thor Lancelot Simon t...@panix.com wrote: On Tue, Dec 06, 2011 at 11:10:44AM -0500, Donald Allen wrote: 2. I'm a little bit surprised that the filesystem was as much of a mess as it was. I'm not. You mounted the filesystem async and had a crash. With the filesystem mounted async *nothing* pushes out most metadata updates, with the result that the filesystem's metadata can quickly enter a fatally inconsistent state. The only way home safe is a clean unmount. So unwritten meta-data from an async filesystem can sit in the buffer cache for arbitrarily long periods of time in NetBSD? I just want to be sure I understand what you are saying. Because that essentially guarantees, as you imply above, that if the system crashes, you will lose the filesystem. That makes the following warning, in the mount(8) man page, in the description of the async option: In the event of a crash, it is impossible for the system to verify the integrity of data on a file system mounted with this option. much more clear. When I read this before the fun started, I took it to mean, perhaps unjustifiably, what I know to be true -- there is some non-zero probability that fsck of an async file-system will not be able to verify and/or restore the filesystem to correctness after a crash. You are saying that the probability, in the case of NetBSD, is 1. If that's true, that there's no periodic sync, I would say that's *really* a mistake. It should be there with a knob the administrator can turn to adjust the sync frequency. There are uses for async filesystems (hell, google used ext2 for years and now uses journal-less ext4) and, as I said in my original post, with the assumed periodic sync'ing, fsck can put the system back together after a crash, in my case that has been invariably true. If you mount an FFS filesystem async you are playing with fire. Sure, it can be useful, but asbestos clothing is not optional. Thor
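The tunable periodic sync being asked for here can be approximated entirely in userland, without kernel changes: a loop that issues sync(2) at an administrator-chosen interval, bounding how long dirty buffers can sit unwritten. A hypothetical sketch in Python -- the interval and the iteration bound are illustrative, and a real daemon would loop forever:

```python
import os
import time

def trickle_sync(interval, iterations=None):
    """Issue sync(2) every `interval` seconds.

    A crude userland stand-in for a tunable kernel syncer: it bounds
    how long dirty buffers can sit unwritten to roughly `interval`
    seconds.  `iterations` exists only so this sketch terminates.
    """
    done = 0
    while iterations is None or done < iterations:
        os.sync()         # flush all dirty buffers system-wide
        done += 1
        time.sleep(interval)
    return done

# Bounded demonstration: two syncs, 0.1 s apart.
print(trickle_sync(0.1, iterations=2))   # 2
```

This does not replicate the kernel syncer's staggered per-vnode behaviour, but it does give the administrator the frequency knob the post asks for.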
Re: Lost file-system story
David Holland wrote: There is at least one known structural problem where atime/mtime updates do not get applied to buffers (but are instead saved up internally) so they don't get written out by the syncer. We believe this is what causes those unmount-time writes, or at least many of them. I understand the delayed atime writes were added to reduce the number of times a laptop harddisk spins up. I've often wondered if a simple sysctl could be added to control this. Unmounting my /home on my main machine takes approximately a minute. My ups-nut script does

	sync
	( cd /; umount /home )
	sync

as soon as it gets an on-battery event so that hopefully the actual shutdown, if needed, happens quickly. Cheers, Simon.