Re: [ANNOUNCE] DualFS: File System with Meta-data and Data Separation
On Thu, 15 Feb 2007 14:46:34 -0500, Jan Engelhardt <[EMAIL PROTECTED]> wrote:
> On Feb 15 2007 21:38, Andi Kleen wrote:
> > Also I would expect your design to be slow for metadata read intensive
> > workloads. E.g. have you tried to boot a root partition with dual fs?
> > That's a very important IO benchmark for desktop Linux systems.
>
> Did someone say metadata intensive? Try kernel tarballs, they're a
> perfect workload. Tons of directories, and even more files here and
> there. Works wonders.

I just did now, per your request. To make things more relevant I created a file structure from the 2.4.19 kernel sources and repeated it recursively into the deepest dir level (10) 4 times, ending up with 7280 directories, 40 levels of directory depth and a 1 GB data set. I ran both tar and untar operations on the tree for ext3, reiserfs, JFS and DualFS. I remounted the FS before each test. Both the tar file and the directory tree were on the FS under test.

Here are the results (elapsed time in seconds):

             tar   untar
ext3:        144   143
reiserfs:    100   100
JFS:         196   140
DualFS:       63    54

Hope this helps.

> Jan

/Sorin

- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
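For anyone who wants to try a scaled-down version of the tar/untar test, here is a minimal sketch in Python. It is not the script used for the numbers above. The helper names (build_tree, time_tar_untar) and the tree shape are assumptions made for illustration, and it does not remount the filesystem between runs, so cached data will flatter the results unless you remount by hand.

```python
import os
import tarfile
import time


def build_tree(root, depth, files_per_dir, file_size):
    """Create a chain of nested directories, each holding a few small files."""
    current = root
    for level in range(depth):
        current = os.path.join(current, f"d{level}")
        os.makedirs(current)
        for i in range(files_per_dir):
            with open(os.path.join(current, f"f{i}"), "wb") as fh:
                fh.write(b"x" * file_size)


def time_tar_untar(tree_root, work_dir):
    """Return (tar_seconds, untar_seconds) for archiving and extracting tree_root.

    work_dir should live on the filesystem under test so that both the
    archive and the extracted tree hit that filesystem.
    """
    archive = os.path.join(work_dir, "tree.tar")

    start = time.perf_counter()
    with tarfile.open(archive, "w") as tar:
        tar.add(tree_root, arcname="tree")
    tar_seconds = time.perf_counter() - start

    out_dir = os.path.join(work_dir, "out")
    os.makedirs(out_dir)
    start = time.perf_counter()
    with tarfile.open(archive, "r") as tar:
        tar.extractall(out_dir)
    untar_seconds = time.perf_counter() - start

    return tar_seconds, untar_seconds
```

Point work_dir at a directory on the mounted filesystem you want to measure, build the tree there, and compare the two timings across filesystems.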
Re: Fix(es) for ext2 fsync bug
On Thu, Feb 15, 2007 at 11:28:46AM -0800, Junfeng Yang wrote:
>
> Actually, we found a crash-during-recovery bug in ext3 too. It's a race
> between resetting the journal super block and replay of the journal. This
> bug was fixed by Ted long time ago (3 years?).

That was found in your original work (using UML) not the more recent work using EXPLODE, correct?

- Ted
Re: Fix(es) for ext2 fsync bug
> It was my understanding from the presentation of Dawson that ext3 and jfs
> have same problem. It is not an ext2 only problem. Also whatever solution
> we adopt we need to be sure that we test it using the eXplode methodology.

apologies for dropping in randomly into the discussion: if this is about the crash-during-recovery bugs, the specific ones i discussed have been fixed in jfs and ext3 (junfeng: this is correct, right?). i should have made this clear in the talk (along with many other things: grabbing junfeng's slides and blathering about them w/o preparation is not the right algorithm for giving a good talk.)

the other error --- fsync of file data on ext2 that reuses a freed inode from a file that was not flushed to disk --- is still open.
Re: Fix(es) for ext2 fsync bug
On Thu, 2007-02-15 at 11:11 -0800, Junfeng Yang wrote:
> > Hmm. If jfs has the problem, it is a bug. jfs is designed to handle
> > this correctly. I'm pretty sure I've fixed at least one bug that
> > eXplode has uncovered in the past. I'm not sure what was mentioned in
> > the presentation though. I'd like any information about current
> > problems in jfs.
>
> I believe you have fixed the JFS fsync bug, Dave. It was caused by
> reusing a directory inode as a file inode. If the machine crashes
> later, fsck would think this file is a directory, and clear all its
> data.

Yeah. That one was fixed a while back. Thanks for clearing this up.

Shaggy
--
David Kleikamp
IBM Linux Technology Center
Re: Fix(es) for ext2 fsync bug
On Thu, 15 Feb 2007 12:15:59 -0500, Theodore Tso <[EMAIL PROTECTED]> wrote:
> On Thu, Feb 15, 2007 at 10:39:02AM -0600, Dave Kleikamp wrote:
> > > It was my understanding from the presentation of Dawson that ext3 and jfs
> > > have same problem.
> >
> > Hmm. If jfs has the problem, it is a bug. jfs is designed to handle
> > this correctly. I'm pretty sure I've fixed at least one bug that
> > eXplode has uncovered in the past. I'm not sure what was mentioned in
> > the presentation though. I'd like any information about current
> > problems in jfs.
>
> That was not my understanding of the charts that were presented earlier
> this week. Ext3 journaling code will deal with this case explicitly,
> just as jfs does.
>
> - Ted

My mistake: there were fsync bugs in JFS and ext2 that cannot be fixed by fsck. It is not the same bug for JFS and ext2. See quote:

"There were two interesting fsync errors, one in JFS and one in ext2. The ext2 bug is a case where an implementation error points out a deeper design problem."
...
"We found two bugs (one in JFS, one in Reiser4) where crashed disks cannot be recovered by fsck."
Re: Fix(es) for ext2 fsync bug
On Thu, Feb 15, 2007 at 10:39:02AM -0600, Dave Kleikamp wrote:
> > It was my understanding from the presentation of Dawson that ext3 and jfs
> > have same problem.
>
> Hmm. If jfs has the problem, it is a bug. jfs is designed to handle
> this correctly. I'm pretty sure I've fixed at least one bug that
> eXplode has uncovered in the past. I'm not sure what was mentioned in
> the presentation though. I'd like any information about current
> problems in jfs.

That was not my understanding of the charts that were presented earlier this week. Ext3 journaling code will deal with this case explicitly, just as jfs does.

- Ted
Re: Fix(es) for ext2 fsync bug
On Thu, 2007-02-15 at 10:59 -0500, sfaibish wrote:
> On Thu, 15 Feb 2007 10:09:22 -0500, Dave Kleikamp
> <[EMAIL PROTECTED]> wrote:
>
> > On Thu, 2007-02-15 at 09:20 -0500, Theodore Tso wrote:
> >
> >> Another very heavyweight approach would be to simply force a full sync
> >> of the filesystem whenever fsync() is called. Not pretty, and without
> >> the proper write ordering, the race is still potentially there.
> >
> > I don't think this race is an issue, in that it would require the crash
> > to happen before the fsync completed, so there would be no expectation
> > that the data is safe. It's a moot point, since I don't think this is
> > an acceptable solution anyway.
> >
> >> I'd say that the best way to handle this is in fsck, but quite frankly
> >> it's relatively low priority "bug" to handle, since a much simpler
> >> workaround is to tell people to use ext3 instead.
> >
> > Right. Who's still using ext2?
> It was my understanding from the presentation of Dawson that ext3 and jfs
> have same problem.

Hmm. If jfs has the problem, it is a bug. jfs is designed to handle this correctly. I'm pretty sure I've fixed at least one bug that eXplode has uncovered in the past. I'm not sure what was mentioned in the presentation though. I'd like any information about current problems in jfs.

> It is not an ext2 only problem. Also whatever solution we adopt
> we need to be sure that we test it using the eXplode methodology.
>
> /Sorin

--
David Kleikamp
IBM Linux Technology Center
Re: Fix(es) for ext2 fsync bug
On Thu, 15 Feb 2007 10:09:22 -0500, Dave Kleikamp <[EMAIL PROTECTED]> wrote:
> On Thu, 2007-02-15 at 09:20 -0500, Theodore Tso wrote:
>> Another very heavyweight approach would be to simply force a full sync
>> of the filesystem whenever fsync() is called. Not pretty, and without
>> the proper write ordering, the race is still potentially there.
>
> I don't think this race is an issue, in that it would require the crash
> to happen before the fsync completed, so there would be no expectation
> that the data is safe. It's a moot point, since I don't think this is
> an acceptable solution anyway.
>
>> I'd say that the best way to handle this is in fsck, but quite frankly
>> it's relatively low priority "bug" to handle, since a much simpler
>> workaround is to tell people to use ext3 instead.
>
> Right. Who's still using ext2?

It was my understanding from the presentation of Dawson that ext3 and jfs have the same problem. It is not an ext2 only problem. Also, whatever solution we adopt, we need to be sure that we test it using the eXplode methodology.

/Sorin
Re: Fix(es) for ext2 fsync bug
On Thu, 2007-02-15 at 09:20 -0500, Theodore Tso wrote:
> Another very heavyweight approach would be to simply force a full sync
> of the filesystem whenever fsync() is called. Not pretty, and without
> the proper write ordering, the race is still potentially there.

I don't think this race is an issue, in that it would require the crash to happen before the fsync completed, so there would be no expectation that the data is safe. It's a moot point, since I don't think this is an acceptable solution anyway.

> I'd say that the best way to handle this is in fsck, but quite frankly
> it's relatively low priority "bug" to handle, since a much simpler
> workaround is to tell people to use ext3 instead.

Right. Who's still using ext2?

--
David Kleikamp
IBM Linux Technology Center
Re: Fix(es) for ext2 fsync bug
On Wed, Feb 14, 2007 at 11:54:54AM -0800, Valerie Henson wrote:
> Background: The eXplode file system checker found a bug in ext2 fsync
> behavior. Do the following: truncate file A, create file B which
> reallocates one of A's old indirect blocks, fsync file B. If you then
> crash before file A's metadata is all written out, fsck will complete
> the truncate for file A... thereby deleting file B's data. So fsync
> file B doesn't guarantee data is on disk after a crash. Details:

It's actually not the case that fsck will complete the truncate for file A. The problem is that while e2fsck is processing indirect blocks in pass 1, the block which is marked as file A's indirect block (but which actually contains file B's data) gets "fixed" when e2fsck sees block numbers which look like illegal block numbers. So this ends up corrupting file B's data.

This is actually a legal end result, BTW, since POSIX states the result of fsync() is undefined if the system crashes. Technically fsync() did actually guarantee that file B's data is "on disk"; the problem is that e2fsck would corrupt the data afterwards.

Ironically, fsync()'ing file B actually makes it more likely that it might get corrupted afterwards, since normally filesystem metadata gets sync'ed out on 5 second intervals, while data gets sync'ed out at 30 second intervals.

> * Rearrange order of duplicate block checking and fixing file size in
> fsck. Not sure how hard this is. (Ted?)

It's not a matter of changing when we deal with fixing the file size, as described above. At fsck time, we would need to keep backup copies of any indirect blocks that get modified for whatever reason, and then in pass 1D, when we clone a block that has been claimed by multiple inodes, the inodes which claim the block as a data block should get a copy of the block as it was before it was modified by e2fsck.

> * Keep a set of "still allocated on disk" block bitmaps that gets
> flushed whenever a sync happens. Don't allocate these blocks.
> Journaling file systems already have to do this.

A list would be more efficient, as others have pointed out. That would work, although the difficulty is knowing when entries could be removed from the list. The machinery for knowing when metadata has been updated isn't present in ext2, and that's a fair amount of complexity. You could clear the list/bitmap after the 5 second metadata flush command has been kicked off, or if you associate a data block with the previous inode's owner, you could clear the entry when the inode's dirty bit has been cleared, but that doesn't completely get rid of the race unless you tie it to when the write has completed (and this assumes write barriers to make sure the block was actually flushed to the media).

Another very heavyweight approach would be to simply force a full sync of the filesystem whenever fsync() is called. Not pretty, and without the proper write ordering, the race is still potentially there.

I'd say that the best way to handle this is in fsck, but quite frankly it's a relatively low priority "bug" to handle, since a much simpler workaround is to tell people to use ext3 instead.

Regards,

- Ted
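To make the "still allocated on disk" list idea concrete, here is a toy user-space model in Python. It is only a sketch of the bookkeeping discussed above, not ext2 code: the class name DeferredFreeAllocator and its methods are invented for illustration, and the metadata_flushed() hook assumes the caller knows when the metadata write has really reached the media, which is exactly the part that is hard in ext2.

```python
class DeferredFreeAllocator:
    """Toy block allocator that never hands out a block freed since the last flush."""

    def __init__(self, num_blocks):
        # Blocks that are free both in memory and on disk.
        self.free = set(range(num_blocks))
        # Blocks freed in memory that on-disk metadata may still reference.
        self.pending = set()

    def allocate(self):
        """Return the lowest safely reusable block number."""
        if not self.free:
            raise RuntimeError("no safely reusable blocks; flush metadata first")
        block = min(self.free)
        self.free.remove(block)
        return block

    def release(self, block):
        """Free a block (e.g. on truncate) without making it reusable yet."""
        self.pending.add(block)

    def metadata_flushed(self):
        """Hypothetical hook: call once the metadata write has completed on the media."""
        self.free |= self.pending
        self.pending.clear()


def demo():
    """Replay the truncate-A / create-B sequence against the toy allocator."""
    alloc = DeferredFreeAllocator(num_blocks=2)
    a_indirect = alloc.allocate()   # file A's indirect block
    alloc.release(a_indirect)       # truncate A; old metadata not yet on disk
    b_data = alloc.allocate()       # file B's data lands on a different block
    alloc.metadata_flushed()        # A's truncate is now durable
    reused = alloc.allocate()       # only now may A's old block be reused
    return a_indirect, b_data, reused
```

With this bookkeeping, file B can never be given file A's old indirect block while the on-disk copy of A still points at it, so a crash in that window leaves nothing for e2fsck to "fix" inside B's data. The cost is that allocation can fail, or must wait for a flush, when the only free blocks are on the pending list.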
Re: [PATCH] Fix d_path for lazy unmounts
On Thursday 15 February 2007 04:53, Jan Engelhardt wrote:
> What's the point in changing pipefs... you can *never*
> reach it *anyway*, even if it was a /-style path, since
> pipefs is a NOMNT filesystem.

The point is that we could then get rid of the special case for MS_NOUSER filesystems like pipefs in __d_path(). (This special case caused the lazy unmounted dir bug in the first place.) It is likely not really worth it, though.

Andreas
Re: [PATCH] Fix d_path for lazy unmounts
Hi,

On Feb 14 2007 14:57, Andreas Gruenbacher wrote:
>[2]
>
>pipe: "pipe:[439336]" (or "pipe/[439336]")
>
>[3] Always make disconnected paths double-slashed:
>--
>pipe: "//pipe/[439336]"
>lazily umounted dir: "//dir/file"
>lazily unmounted fs: "//file"
>unreachable root: "//"
>
>Opinions?

As for [2]/[3]: What's the point in changing pipefs... you can *never* reach it *anyway*, even if it was a /-style path, since pipefs is a NOMNT filesystem.

That said, programs like lsof might break when it changes away from "pipe:[integer]" (same goes for socket:, etc.)

Jan
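To illustrate the compatibility worry, here is a sketch in Python of how a user-space tool might parse the link targets it reads back for anonymous objects (for example, what readlink returns for an entry under /proc/<pid>/fd). This is not lsof's actual code; the pattern and the function name parse_anon_target are assumptions for illustration only.

```python
import re

# Matches the current "kind:[inode]" form, e.g. "pipe:[439336]" or "socket:[1234]".
ANON_TARGET = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_anon_target(target):
    """Return (kind, inode) for a "kind:[inode]" link target, else None."""
    match = ANON_TARGET.match(target)
    if match is None:
        return None
    return match.group("kind"), int(match.group("inode"))
```

A parser written this way accepts "pipe:[439336]" but returns None for both "pipe/[439336]" and "//pipe/[439336]", so any tool relying on the existing format would stop recognising pipes if either alternative were adopted.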