Re: [PATCH] SMP race in ext2 - metadata corruption.
Hi, On Fri, May 11, 2001 at 04:54:44PM +0200, Daniel Phillips wrote: > The only reasonable way I can think of getting a block-coherent view > underneath a mounted fs is to have a reverse map, and update it each > time we map block into the page cache or unmap it. It's called the "buffer cache", and Ingo's early page-cache code in 2.3 actually did install page-cache backing buffers into the buffer cache as aliases, mainly for debugging purposes. Even without that, though, an application can achieve almost-coherency via invalidation of the buffer cache before reading it. And yes, this won't necessarily remain coherent over the lifetime of the application process, but then unless the filesystem is 100% quiescent then you don't get that on 2.2 either. Which is rather the point. If the filesystem is active, then coherency cannot be obtained at the block-device level in any case without knowledge of the fs transaction activity. If the filesystem is quiescent, then you can sync it and flush the buffer cache and you already get the coherency that you need. Cheers, Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Mon, 7 May 2001, Pavel Machek wrote: > OTOH with current way if you make mistake in kernel, fsck will not > automatically inherit it; therefore fsck is likely to work even if > kernel ext2 is b0rken [and that's fairly important] ... and by the same logics you should make fsck implement its own drivers - after all, right now b0rken driver affects both the kernel ext2 and fsck ;-) I'm not sure that fsck of fs mounted read/write is worth doing in the first place, but I'd rather do that via fs/ext2 exporting its metadata explicitly than by playing silly buggers with device/fs coherency. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Monday 07 May 2001 20:42, Pavel Machek wrote: > > > It's not exactly "kernel-based fsck". What I've been talking > > > about is secondary filesystem providing coherent access to > > > primary fs metadata. I.e. mount -t ext2meta -o master=/usr none > > > /mnt and then access through /mnt/super, /mnt/block_bitmap, etc. > > > > > > Call me stupid --- but what exactly does the above actually > > > achieve? Why would you do this? > > > > Coherent access to metadata? Well, for one thing, it allows stuff > > like tunefs and friends on mounted fs. What's more useful, it > > allows to do things like access to boot code, which is _not_ safe > > to do through device access - usually you have superblock in > > vicinity and no warranties about the things that will be > > overwritten on umount. Same for debugging stuff, IO stats, etc. - > > access through secondary tree is much saner than inventing tons of > > ioctls for dealing with that. Moreover, it allows fsck and friends > > to get rid of code duplication - while the repair logics, etc. > > stays in userland (where it belongs) layout information is already > > handled in the kernel. No need to duplicate it in userland... > > OTOH with current way if you make mistake in kernel, fsck will not > automatically inherit it; therefore fsck is likely to work even if > kernel ext2 is b0rken [and that's fairly important] Al's idea nicely dances around a big problem with the page cache: there is no efficient way to know which address_space a given physical block belongs to. It *might* be nice to have such a capability in a fs-independent way. We could do that now, very inefficiently, by searching all the address_spaces (i.e., inodes) for the physical block. We'd have to prevent further page cache operations while we did that, and when we add fs-private address_spaces some more mechanism would be required. So: slow, intrusive and fragile.
The only reasonable way I can think of getting a block-coherent view underneath a mounted fs is to have a reverse map, and update it each time we map a block into the page cache or unmap it. The reverse map would tell us if a given physical block is currently in the page cache, and if so, which address_space it belongs to. A block not currently mapped into any address_space could be mapped into an 'anonymous' space covering the entire partition and moved automatically to the correct address_space when the fs tries to map it.

The big problem with this mechanism is that it slows down the common case, which works perfectly well without any reverse map. Not to mention adding bloat. So the next question I thought about was: is there a way to switch on a page cache reverse map just when needed, and do that in a generic way? I convinced myself it wouldn't be too hard, but then there's another question: how badly do we need this?

Al's idea does let us get at some of the specific parts of the fs metadata but it has its problems too. We'd need to exhaustively enumerate every kind of filesystem metadata that could reasonably be accessed underneath the filesystem, and special-case it, which is not so nice. But I couldn't come up with any killer examples where we'd really need a generalized, coherent view underneath a mounted filesystem, so I put these thoughts on hold. Your borked-fs example sounds interesting, have you got more of those?

One more example I can suggest is: right now we have no way of detecting an error condition where the same fs block is mapped into more than one address_space. A page cache reverse map could detect this easily and would be a really useful debugging tool.

-- Daniel
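A minimal sketch of the reverse map described above, written as plain userspace C rather than kernel code. Everything in it is an assumption made for illustration: the names `rmap_map`, `rmap_unmap` and `rmap_lookup`, the fixed `RMAP_BLOCKS` size, and the use of a small integer id in place of a real address_space pointer. It is not taken from any kernel tree.

```c
#include <stddef.h>

/* Hypothetical sizes and ids, chosen only for this sketch. */
#define RMAP_BLOCKS 1024u   /* assumed number of blocks in the partition */
#define RMAP_ANON   0u      /* the 'anonymous' space: block not owned by any fs mapping */

/* One owner slot per physical block; the id stands in for an address_space. */
struct rmap {
    unsigned int owner[RMAP_BLOCKS];
};

/* Start with every block in the anonymous space. */
void rmap_init(struct rmap *r)
{
    size_t i;

    for (i = 0; i < RMAP_BLOCKS; i++)
        r->owner[i] = RMAP_ANON;
}

/*
 * Record that `block` is now cached on behalf of `space_id`.
 * Returns 0 on success, -1 for a bad argument, and -2 if the block is
 * already mapped into a different address space (the error condition
 * that currently goes undetected).
 */
int rmap_map(struct rmap *r, unsigned long block, unsigned int space_id)
{
    if (block >= RMAP_BLOCKS || space_id == RMAP_ANON)
        return -1;
    if (r->owner[block] != RMAP_ANON && r->owner[block] != space_id)
        return -2;
    r->owner[block] = space_id;
    return 0;
}

/* Return the block to the anonymous space when it leaves the page cache. */
void rmap_unmap(struct rmap *r, unsigned long block)
{
    if (block < RMAP_BLOCKS)
        r->owner[block] = RMAP_ANON;
}

/* Which space owns this block? RMAP_ANON if none, or if out of range. */
unsigned int rmap_lookup(const struct rmap *r, unsigned long block)
{
    return block < RMAP_BLOCKS ? r->owner[block] : RMAP_ANON;
}
```

The cost mentioned in the message shows up here as the extra store on every map and unmap, which is why switching the map on only when needed looks attractive.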
Re: [PATCH] SMP race in ext2 - metadata corruption.
Hi! > > It's not exactly "kernel-based fsck". What I've been talking about > > is secondary filesystem providing coherent access to primary fs > > metadata. I.e. mount -t ext2meta -o master=/usr none /mnt and > > then access through /mnt/super, /mnt/block_bitmap, etc. > > > > Call me stupid --- but what exactly does the above actually achieve? > > Why would you do this? > > Coherent access to metadata? Well, for one thing, it allows stuff like > tunefs and friends on mounted fs. What's more useful, it allows to > do things like access to boot code, which is _not_ safe to do through > device access - usually you have superblock in vicinity and no warranties > about the things that will be overwritten on umount. Same for debugging > stuff, IO stats, etc. - access through secondary tree is much saner > than inventing tons of ioctls for dealing with that. Moreover, it allows > fsck and friends to get rid of code duplication - while the repair > logics, etc. stays in userland (where it belongs) layout information > is already handled in the kernel. No need to duplicate it in userland... OTOH with current way if you make mistake in kernel, fsck will not automatically inherit it; therefore fsck is likely to work even if kernel ext2 is b0rken [and that's fairly important] > Besides, with moving bitmaps, etc. into pagecache it becomes trivial > to implement. > > BTW, we have another ugly chunk of code - duplicated between kernel > and userland and nasty in both incarnations. I mean handling of the > partition tables. Kernel should be able to read and parse them - > otherwise they are useless, right? Now, we have a bunch of userland No. You might want to see (via fdisk) the partition table, even though *your* kernel cannot read it. Pavel -- Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt, details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.
Re: [PATCH] SMP race in ext2 - metadata corruption.
[EMAIL PROTECTED] wrote: > > I have tried this approach too a couple of years ago. I came to the idea > that I want some kind of "event reporting" mechanism to know when > application faults and when other events (like I/O) occurs. Booting is > just the tip of the iceberg. MOST big apps are seeking on startup because >a) their code is spread out all over executable Page tuning can fix that. (Have the compiler & linker increase locality by stuffing related code in the same page. You want fast paths stuffed into as few pages as possible, regardless of which functions the instructions belong to.) This also cuts down on swapping and TLB misses. OS/2 gained some nice speedups by doing this. >b) don't forget shared libraries.. They can be page tuned too, and they're often partially in memory already when starting apps. >c) the practice of keeping configuration files in ~/.filename > implies - read a little, seek a little. >d) GUI apps tend to have a ton of icons. Putting several in a single file, or even the executable will help here. Helge Hafting
Re: [PATCH] SMP race in ext2 - metadata corruption.
I have tried this approach too a couple of years ago. I came to the idea that I want some kind of "event reporting" mechanism to know when application faults and when other events (like I/O) occurs. Booting is just the tip of the iceberg. MOST big apps are seeking on startup because a) their code is spread out all over executable b) don't forget shared libraries.. c) the practice of keeping configuration files in ~/.filename implies - read a little, seek a little. d) GUI apps tend to have a ton of icons. I wonder - is it possible to get this via ptrace ? - could not find this in the manpage. Vladimir Dergachev On Fri, 4 May 2001, Richard Gooch wrote: > Linus Torvalds writes: > > Now, if you want to speed up accesses, there are things you can > > do. You can lay out the filesystem in the access order - trace the > > IO accesses at bootup ("which file, which offset, which metadata > > block?") and lay out the blocks of the files in exactly the right > > order. Then you will get linear reads _without_ doing any "dd" at > > all. > > A year ago I came up with an alternative approach for cache warming, > but I see that it wouldn't work with our current infrastructure. > However, maybe there is still a way to use the basic technique. If so, > please make suggestions. > > The idea I had (motivated by the desire to eliminate random disc > seeks, which is the limiting factor in how fast my boxes boot) was: > > - init(8) issues an ioctl(2) on the root FS block device which turns > on recording of block reads (it records block numbers) > > - at the end of the bootup process, init(8) issues another ioctl(2) to > grab the buffered block numbers, and turn off recording > > - init(8) then sorts this list in ascending order and saves the result > in a file > > - next boot, init(8) checks the file, and if it exists, opens the root > FS block device and reads in each block listed in the file. The > effect is to warm the buffer cache extremely quickly. 
The head will > move in one direction, grabbing data as it flys by. I expect this > will take around 1 second > > - init(8) now continues the boot process (starting the magic ioctl(2) > again so as to get a fresh list of blocks, in case something has > changed) > > - booting is now super fast, thanks to no disc activity. > > The advantage of this scheme over blindly reading the first 50 MB is > that it only reads in what you *need*, and thus will work better on > low memory systems. It's also useful for other applications, not just > speeding up the boot process. > > However, doing an ioctl(2) on the block device won't help. So the > question is, where to add the hook? One possibility is the FS, and > record inum,bnum pairs. But of course we don't have a way of accessing > via inum in user-space. So that's no good. Besides, we want to get > block numbers on the block device, because that's the only meaningful > number to resort. > > So, what, then? Some kind of hook on the page cache? Ideas? > > Regards, > > Richard > Permanent: [EMAIL PROTECTED] > Current: [EMAIL PROTECTED] > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
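The sorting step in Richard's scheme can be sketched in a few lines of C, assuming the recorded block numbers have already been copied into an array. The function name `prepare_block_list` is made up for this sketch, and dropping duplicate block numbers is an addition of mine that the original proposal does not mention.

```c
#include <stdlib.h>

/* qsort comparator for block numbers, written to avoid overflow on subtraction. */
int cmp_block(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;

    return (x > y) - (x < y);
}

/*
 * Sort the recorded block numbers in ascending order so that the head
 * only ever moves in one direction, then drop duplicates in place
 * (an assumption of this sketch, not part of the proposal).
 * Returns the number of unique blocks left at the front of `blocks`.
 */
size_t prepare_block_list(unsigned long *blocks, size_t n)
{
    size_t i, out;

    if (n == 0)
        return 0;

    qsort(blocks, n, sizeof(blocks[0]), cmp_block);

    out = 1;
    for (i = 1; i < n; i++) {
        if (blocks[i] != blocks[out - 1])
            blocks[out++] = blocks[i];
    }
    return out;
}
```

On the next boot, init(8) would walk the resulting list and read each block from the root device in that order. How the block numbers get recorded in the first place is the open question in the message and is not addressed here.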
Re: [PATCH] SMP race in ext2 - metadata corruption.
Alan writes: > > Actually, the EVMS project does exactly this. All I/O is done on a full > > disk basis, and essentially does block remapping for each partition. This > > also solves the problem of cache inconsistency if accessing the parent > > device vs. accessing the partition. > > Interesting. Can EVMS handle the partition labels used by the LVM layer - ie > could it replace it as well ? Yes, they already support all current LVM volumes (including snapshots). However, the user-space tools to set up new LVM volumes and manage existing ones is not ready yet. The last I talked with the IBM folks (a week ago), they said they were starting to work on the user-space tools. Because the whole partition/volume code is modular in EVMS, they will be able to handle AIX LVM, HP/UX LVM, etc. volumes in addition to the normal DOS or other partitions. Cheers, Andreas -- Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto, \ would they cancel out, leaving him still hungry?" http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] SMP race in ext2 - metadata corruption.
> Actually, the EVMS project does exactly this. All I/O is done on a full > disk basis, and essentially does block remapping for each partition. This > also solves the problem of cache inconsistency if accessing the parent > device vs. accessing the partition. Interesting. Can EVMS handle the partition labels used by the LVM layer - ie could it replace it as well ?
Re: [PATCH] SMP race in ext2 - metadata corruption.
Alan writes: > > an interesting task when your root lives on /dev/sda1. Ditto for destroying > > a single partition (not mounted/used by swap/etc.) while you have some > > other partition in use. IWBNI we had a decent API for handling partition > > tables... > > Partitions are just very crude logical volumes, and ultimiately I believe > should be handled exactly that way Actually, the EVMS project does exactly this. All I/O is done on a full disk basis, and essentially does block remapping for each partition. This also solves the problem of cache inconsistency if accessing the parent device vs. accessing the partition. Cheers, Andreas -- Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto, \ would they cancel out, leaving him still hungry?" http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
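The block remapping mentioned here amounts to adding the partition's start offset to a partition-relative block number and issuing the I/O against the whole disk. A minimal sketch follows; the `struct partition` layout and the function name are assumptions for illustration and are not EVMS code.

```c
/* Hypothetical description of one partition, in units of blocks. */
struct partition {
    unsigned long long start;   /* first block of the partition on the disk */
    unsigned long long length;  /* number of blocks in the partition */
};

/*
 * Translate a partition-relative block number into a whole-disk block
 * number. Returns 0 and stores the result in *disk_block on success,
 * or -1 if the request falls outside the partition.
 */
int partition_to_disk_block(const struct partition *p,
                            unsigned long long rel_block,
                            unsigned long long *disk_block)
{
    if (rel_block >= p->length)
        return -1;
    *disk_block = p->start + rel_block;
    return 0;
}
```

Because every access ends up addressed by whole-disk block number, a block reached through the partition and the same block reached through the parent device share one cache entry, which is the coherency benefit described in the message.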
Re: [PATCH] SMP race in ext2 - metadata corruption.
> an interesting task when your root lives on /dev/sda1. Ditto for destroying > a single partition (not mounted/used by swap/etc.) while you have some > other partition in use. IWBNI we had a decent API for handling partition > tables... Partitions are just very crude logical volumes, and ultimately I believe should be handled exactly that way
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Sun, 6 May 2001, Chris Wedgwood wrote: > It's not exactly "kernel-based fsck". What I've been talking about > is secondary filesystem providing coherent access to primary fs > metadata. I.e. mount -t ext2meta -o master=/usr none /mnt and > then access through /mnt/super, /mnt/block_bitmap, etc. > > Call me stupid --- but what exactly does the above actually achieve? > Why would you do this? Coherent access to metadata? Well, for one thing, it allows stuff like tunefs and friends on mounted fs. What's more useful, it allows to do things like access to boot code, which is _not_ safe to do through device access - usually you have superblock in vicinity and no warranties about the things that will be overwritten on umount. Same for debugging stuff, IO stats, etc. - access through secondary tree is much saner than inventing tons of ioctls for dealing with that. Moreover, it allows fsck and friends to get rid of code duplication - while the repair logics, etc. stays in userland (where it belongs) layout information is already handled in the kernel. No need to duplicate it in userland... Besides, with moving bitmaps, etc. into pagecache it becomes trivial to implement. BTW, we have another ugly chunk of code - duplicated between kernel and userland and nasty in both incarnations. I mean handling of the partition tables. Kernel should be able to read and parse them - otherwise they are useless, right? Now, we have a bunch of userland utilities that do the same. Various fdisks, that is. If you look at how they work you'll see that on the read side they duplicate kernel code and on the write side... To put it quite mildly, they are not doing it in a graceful way. They write relevant sectors to disk and use BLKRRPART to tell the kernel that it should forget about all partitions on that disk and reread the partition tables. _Not_ a nice thing to do, since creation of new partition out of unused space on /dev/sda becomes an interesting task when your root lives on /dev/sda1.
Ditto for destroying a single partition (not mounted/used by swap/etc.) while you have some other partition in use. IWBNI we had a decent API for handling partition tables...
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Sun, May 06, 2001 at 03:00:58PM +1200, Chris Wedgwood wrote: > On Sun, May 06, 2001 at 04:50:01AM +0200, Andrea Arcangeli wrote: > > Moving e2fsck into the kernel is a completly different matter > than caching the blockdevice accesses with pagecache instead of > buffercache. > > No, I was takling about user space fsck using character devices. I misread your previous email sorry, I think you meant to fsck using rawio (not to move fsck into the kernel). You can do that just now but to get decent performance then fsck should do self caching, changing fsck to do self caching doesn't sound worthwhile either. Note also that rawio has nothing to do with the pagecache. Infact both rawio and O_DIRECT bypasses all the pagecache and its smp locks for example. > I'm not claiming it is... what I'm asking is _why_ do we need block > devices once 'everything' lives in the page cache? Where the cache of the blockdevice lives is a completly orthogonal problem with "why cached blockdevices are useful" which I addressed in the previous email. > It's just that by doing it in pagecache you can mmap it as well > and it will provide overall better performance and it's probably > cleaner design. The only visible change is that you will be able > to mmap a blockdevice as well. > > Why? What needs to mmap a block device? Since these are typically > larger than that you can mmap into a 32-bit address space (yes, I'm > ignoring the 5% or so of cases where this isn't true) I'm not aware > on many applications that do it. Last time I talked with the parted maintainer he was asking for that feature so that parted won't need to do its own anti-oom management in userspace, so he can simple mmap(MAP_SHARED) a quite large region of metadata of the blockdevice, read/write to the mmaped region and the kernel will take care of doing paging when it runs low on memory. right now it allocates the metadata in anonymous memory and loads it via read(). 
This memory will need to be swapped out if the working set doesn't fit in ram (and swap may not be available ;). > As I said, I'm not takling about kernel based fsck, although for > _VERY_ large filesystems even with journalling I suspect it will be > required one day (so it can run in the background and do consistency > checking when the machine is idle). Being able to fsck a live filesystem is yet another exotic feature and yes for that you would certainly need some additional kernel support. Andrea - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] SMP race in ext2 - metadata corruption.
Chris Wedgewood writes: > As I said, I'm not takling about kernel based fsck, although for > _VERY_ large filesystems even with journalling I suspect it will be > required one day (so it can run in the background and do consistency > checking when the machine is idle). Actually, I was talking with Ted about this, and we agreed that: a) kernel-based e2fsck is a pain in the a** (locking issues, etc) b) you can do an LVM snapshot of your live filesystem and do a read-only fsck on that to check if the filesystem is still OK. For journaled filesystems like reiserfs and ext3, they need to use the super method write_super_lockfs() to block I/O and flush everything to disk at the time of the snapshot, to ensure that they don't need recovery on a read-only device. This makes the LVM snapshot equivalent to unmount the filesystem, copy contents to a new device and remount the filesystem. While (b) doesn't let you fix a filesystem online, unless there is a kernel bug or hardware problem, you should not have a problem. If you have either of those, then fixing the filesystem online is just asking for more problems in the future. Cheers, Andreas -- Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto, \ would they cancel out, leaving him still hungry?" http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Sun, 6 May 2001, Chris Wedgwood wrote: > On Sun, May 06, 2001 at 04:50:01AM +0200, Andrea Arcangeli wrote: > About a kernel based fsck Alexander told me he likes it, I > personally don't care about it that much because I believe... > > As I said, I'm not takling about kernel based fsck, although for > _VERY_ large filesystems even with journalling I suspect it will be > required one day (so it can run in the background and do consistency > checking when the machine is idle). It's not exactly "kernel-based fsck". What I've been talking about is secondary filesystem providing coherent access to primary fs metadata. I.e. mount -t ext2meta -o master=/usr none /mnt and then access through /mnt/super, /mnt/block_bitmap, etc.
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Sun, May 06, 2001 at 02:14:37PM +1200, Chris Wedgwood wrote: > You don't need block device for fsck, in fact some OS require you use > character devices (e.g. Solaris). Moving e2fsck into the kernel is a completely different matter than caching the blockdevice accesses with pagecache instead of buffercache. And even if you move e2fsck or reiserfsck into the kernel (you could technically do that just now regardless of where the block_dev cache lives) there will still be parted that wants to mmap the blockdevice to get rid of part of the fat32 partition (right now it uses read/write of course because buffer cache cannot be mapped in userspace), there will still be mtools, not self caching dbms, od > I'm not saying we don't need block devices, but I really don't see > much of a use for them once everything in in the page cache... I > assume this is why others have got rid of them completely. I have no idea why/if others got rid of it completely, but the fact that block_dev is useful has nothing to do with whether it's in pagecache or in buffercache, really. It's just that by doing it in pagecache you can mmap it as well and it will provide overall better performance and it's probably cleaner design. The only visible change is that you will be able to mmap a blockdevice as well. About a kernel based fsck Alexander told me he likes it, I personally don't care about it that much because I believe there's not that much to share at the source level, fsck and real fs are quite different problems, and what can be shared can be copied and by not sharing we get the flexibility of not breaking fsck every time we change the kernel and more in general the flexibility of doing it in userspace, sharing such bytecode at runtime definitely doesn't matter. It also partly depends on the fs but the current ext2 situation is really fine to me and I wouldn't consider it a worthwhile project to move e2fsck into the kernel.
Andrea
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Sat, May 05, 2001 at 03:18:08PM +1200, Chris Wedgwood wrote: > On Fri, May 04, 2001 at 05:29:40PM +0200, Andrea Arcangeli wrote: > > once block_dev is in pagecache there will obviously be no-way to > share cache between the block device and the filesystem, because > all the caches will be in completly different address spaces. > > Once we are at this point... will there be any use in having block > devices? FreeBSD appears to have done without them completely about a moving block_dev in pagecache won't change anything from userspace point of view, it's a transparent change (if we ignore the total loss of cache coherency between block_dev and fs metadata that it implies, but as Linus said such loss of coherency will happen anyways eventually because metadata will go into its address space too). Basically there will still be a use for the block devices as far as there are fsck and other userspace applications that want to use it. Andrea SYNAPSE (very amusing movie ;) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] SMP race in ext2 - metadata corruption.
Rogier Wolff writes: > Richard Gooch wrote: > > > > - next boot, init(8) checks the file, and if it exists, opens the root > > FS block device and reads in each block listed in the file. The > > effect is to warm the buffer cache extremely quickly. The head will > > move in one direction, grabbing data as it flys by. I expect this > > will take around 1 second > > FYI: > > Around 1992 or 1993, I rewrote Minix-fsck to do this instead of > seeking all over the place. > > Cut the total time to fsck my filesystem from around 30 to 28 > seconds. (remember the days of small filesystems?) > > That's when I decided that this was NOT an interesting project: there > was very little to be gained. > > The explanation is: A seek over a few tracks isn't much slower than a > seek over hundreds of tracks. Almost any "skip" in linear access > incurs the average 6ms rotational latency anyway. Hm. I think the access patterns between boot-up and fsck are quite different. An fsck has to seek to a large number of tracks. During bootup, I think the number of tracks accessed is much lower, and there is probably more data locality as well. Still, only one way to be sure. I haven't had time to look closely at this, but one thing that bothers me is how to find out what is being accessed in the first place. A C-library wrapper to intercept read(2) calls isn't any good, because a lot of stuff is memory-mapped (in particular shared libraries). Anyone have a clean way to do this? Regards, Richard Permanent: [EMAIL PROTECTED] Current: [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Sat, 5 May 2001, Albert D. Cahalan wrote: > case P_SWAP: > sprintf(tmp, "%4.4s ", > scale_k(((task->size - task->resident) << CL_pg_shift), 4, 1)); > break; Albert, you can't be serious. The system had demand-loading for almost ten years. ->size - ->resident can be huge with no swap at all. As in, "box had never been subjected to swapon(8)". That value is a mix of amount of stuff we hadn't paged in, amount of stuff we had paged in but then dropped (e.g. code that had never been touched for two weeks, since application only uses it on startup) and amount of stuff that had been swapped out _and_ wasn't swapped in (it may very well stay in swap). BTW, "shared" is also bogus - page_count(page) can be raised by any number of things. > > * makes stuff like top(1) _walk_ _whole_ _page_ _tables_ _of_ _all_ > > _processes_ each 5 seconds. No wonder it's slow like hell and eats > > tons of CPU time. > > On my system, "statm" takes 50% longer than "stat" or "status". > Maybe there is a significant difference with Oracle on a 32 GB box? Depends on that applications mix. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
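To make the objection concrete, here is a small sketch of why `size - resident` cannot be read as swap usage. The breakdown into the three non-resident categories follows the message above; the struct and its field names are hypothetical, since the kernel does not export the pieces separately like this.

```c
/*
 * Hypothetical per-task page accounting, split the way the message
 * describes it. All counts are in pages.
 */
struct task_pages {
    unsigned long resident;              /* pages currently in RAM */
    unsigned long never_paged_in;        /* mapped but never touched (demand loading) */
    unsigned long paged_in_then_dropped; /* e.g. startup-only code evicted later */
    unsigned long swapped_out;           /* pages that really live in swap */
};

/* Total virtual size: every mapped page, resident or not. */
unsigned long task_size(const struct task_pages *t)
{
    return t->resident + t->never_paged_in +
           t->paged_in_then_dropped + t->swapped_out;
}

/* What the quoted P_SWAP case computes (before scaling to kilobytes). */
unsigned long top_swap_column(const struct task_pages *t)
{
    return task_size(t) - t->resident;
}
```

For a task with 50 resident pages, 900 never touched, 50 dropped and none swapped, `top_swap_column` returns 950 even though the box may never have had swap enabled.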
Re: [PATCH] SMP race in ext2 - metadata corruption.
Richard Gooch wrote: > > - next boot, init(8) checks the file, and if it exists, opens the root > FS block device and reads in each block listed in the file. The > effect is to warm the buffer cache extremely quickly. The head will > move in one direction, grabbing data as it flys by. I expect this > will take around 1 second FYI: Around 1992 or 1993, I rewrote Minix-fsck to do this instead of seeking all over the place. Cut the total time to fsck my filesystem from around 30 to 28 seconds. (remember the days of small filesystems?) That's when I decided that this was NOT an interesting project: there was very little to be gained. The explanation is: A seek over a few tracks isn't much slower than a seek over hundreds of tracks. Almost any "skip" in linear access incurs the average 6ms rotational latency anyway. Roger. -- ** [EMAIL PROTECTED] ** http://www.BitWizard.nl/ ** +31-15-2137555 ** *-- BitWizard writes Linux device drivers for any device you may have! --* * There are old pilots, and there are bold pilots. * There are also old, bald pilots. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
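Rogier's explanation can be turned into a rough cost model: each non-contiguous read pays a seek plus the rotational latency, and sorting only shrinks the seek part. The sketch below assumes this simple additive model. The 6 ms rotational figure comes from the message, while any seek times passed in are the caller's own assumptions, not measurements.

```c
/*
 * Rough estimate of the time spent on `skips` non-contiguous reads.
 * All times are in microseconds to keep the arithmetic in integers.
 * Model (an assumption of this sketch): every skip costs one seek plus
 * the average rotational latency, and transfer time is ignored.
 */
unsigned long total_read_us(unsigned long skips,
                            unsigned long seek_us,
                            unsigned long rotational_us)
{
    return skips * (seek_us + rotational_us);
}
```

With an assumed 4 ms long seek versus an assumed 3 ms short seek and the 6 ms rotational latency, 1000 skips cost 10 s unsorted and 9 s sorted. The rotational term is untouched by the ordering, which is consistent with the small gain reported above.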
Re: [PATCH] SMP race in ext2 - metadata corruption.
Alexander Viro writes:
>> On Fri, 4 May 2001, Alexander Viro wrote:

>>> Ehh... There _is_ a way to deal with that, but it's deeply Albertesque:
>                                                              ^^^

Ah, you learn from the master.

> ObProcfs: I don't think that walking the page tables is a good way to
> compute RSS, especially since VM maintains the thing. Mind if I rip

Handling of mapped device memory should not change. For example there is the X server with mapped video memory. There is another RSS value provided elsewhere in case one does not want to include mapped device memory.

Currently top uses the statm file in the following manner:

    case P_SIZE:
        sprintf(tmp, "%5.5s ", scale_k((task->size << CL_pg_shift), 5, 1));
        break;
    case P_TRS:
        sprintf(tmp, "%4.4s ", scale_k((task->trs << CL_pg_shift), 4, 1));
        break;
    case P_SWAP:
        sprintf(tmp, "%4.4s ",
            scale_k(((task->size - task->resident) << CL_pg_shift), 4, 1));
        break;
    case P_SHARE:
        sprintf(tmp, "%5.5s ", scale_k((task->share << CL_pg_shift), 5, 1));
        break;
    case P_DT:
        sprintf(tmp, "%3.3s ", scale_k(task->dt, 3, 0));
        break;
    case P_RSS: /* rss, not resident (which includes IO memory) */
        sprintf(tmp, "%4.4s ", scale_k((task->rss << CL_pg_shift), 4, 1));

> it out? In effect, implementation of /proc/<pid>/statm
> * produces extremely bogus values (VMA is from library if it goes
> beyond 0x6000? Might be even true 7 years ago...) and nobody
> had cared about them for 6-7 years

One could count pages that are mapped executable and do not come from the main executable... but this is pretty worthless and does not consider non-executable library sections. The latest "top" does not bother to display this value.

> * makes stuff like top(1) _walk_ _whole_ _page_ _tables_ _of_ _all_
> _processes_ each 5 seconds. No wonder it's slow like hell and eats
> tons of CPU time.

On my system, "statm" takes 50% longer than "stat" or "status". Maybe there is a significant difference with Oracle on a 32 GB box?

I'd rather top didn't have to read the file at all.
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Fri, 4 May 2001, Alan Cox wrote:
>
> iso9660 alas doesn't allow you to do that. You can speed it up by reading
> the entire file into memory rather than paging it in (or reading it in and
> then executing it). iso9660 layout is pretty constrained and designed for
> linear file reads

Note that you can do this for any filesystem, including ext2.

If, instead of trying to remember what _blocks_ the bootup process reads, you keep the trace at a higher level, and then sort the _high_level_ trace and re-do that with some program, then you can obviously populate the virtual caches properly with any filesystem.

The advantage of that approach is that it will continue to work forever, because there will never be any cache aliasing issues. You're always "pre-caching" using the same operation that you'll actually use when you do the real reads.

Now, that still leaves the question of how to sort the virtual cache accesses, and you might want to know what the low-level layout of the filesystem is to actually create the "sort". You might not want to sort alphabetically on the file-name, but use a "where on the disk is this file", and use _that_ as the sort order function.

That's easy to do, actually. Just use the "bmap()" ioctl.

Now, you won't be able to use "dd" to populate the caches: you'd have to have your own program that walks your sorted action list and populates the caches that way (and you might want to take kernel read-ahead etc heuristics into account).

SMOP.

		Linus
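A rough sketch of what such a program might look like, assuming the "bmap()" ioctl meant here is FIBMAP from <linux/fs.h> (which normally requires root). The trace_entry struct and all function names are hypothetical. Only the first block of each file is used as the sort key, and read-ahead heuristics are ignored.

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/fs.h>     /* FIBMAP */
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* One entry of the high-level boot trace: a file and where it starts on disk. */
struct trace_entry {
    const char *path;
    long first_block;     /* physical block of logical block 0, or -1 if unknown */
};

/* Ask the filesystem where logical block 0 of the file lives. */
static long first_physical_block(const char *path)
{
    int block = 0;        /* in: logical block, out: physical block */
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (ioctl(fd, FIBMAP, &block) < 0)
        block = -1;
    close(fd);
    return block;
}

static int by_disk_position(const void *a, const void *b)
{
    const struct trace_entry *x = a, *y = b;
    return (x->first_block > y->first_block) - (x->first_block < y->first_block);
}

/* Order the trace by on-disk position instead of by name. */
static void sort_trace(struct trace_entry *t, size_t n)
{
    assert(t != NULL || n == 0);
    qsort(t, n, sizeof *t, by_disk_position);
}

/* Pre-cache one file through the normal read path, so the data ends up
 * in the same cache the real reads will use. Returns bytes read or -1. */
static long warm_file(const char *path)
{
    char buf[65536];
    long total = 0;
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        total += n;
    close(fd);
    return n < 0 ? -1 : total;
}
```

A driver would fill first_block with first_physical_block() for every traced path, call sort_trace(), and then call warm_file() on each entry in order.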
Re: [PATCH] SMP race in ext2 - metadata corruption.
> Now, if you want to speed up accesses, there are things you can do. You
> can lay out the filesystem in the access order - trace the IO accesses at
> bootup ("which file, which offset, which metadata block?") and lay out the
> blocks of the files in exactly the right order. Then you will get linear
> reads _without_ doing any "dd" at all.

iso9660 alas doesn't allow you to do that. You can speed it up by reading the entire file into memory rather than paging it in (or reading it in and then executing it). iso9660 layout is pretty constrained and designed for linear file reads.
Re: [PATCH] SMP race in ext2 - metadata corruption.
Alexander Viro writes:
>
> On Fri, 4 May 2001, Richard Gooch wrote:
>
> > I don't bother splitting /usr off /. I gave up doing that when disc
> > became cheap. There's no point anymore. And since I have a lightweight
>
> Yes, there is. Locality. Resistance to fs fuckups. Resistance to
> disk fuckups. Easier to restore from tape. Different tunefs optimum
> (higher inodes/blocks ratio, for one thing). Ability to keep /usr
> read-only. Enough?

The correct solution to avoiding fs fuckups is to keep /tmp, /var and /home separate. Basically, anything that gets written to for reasons other than sysadmin/upgrades.

However, my point is not that it's always a bad idea to split /usr, simply that the converse is not true. IOW, it is not true to say that /usr *should* be split off. For a generic workstation, splitting /usr is not useful. Importantly, it is most certainly entirely valid to keep /usr on /.

> > distribution (500 MiB and I get X, LaTeX, emacs, compilers, netscrap
> > and a pile of other things), it makes even less sense to split /usr
> > off. Sorry, I don't have those fancy desktops. Don't need 'em. I spend
> > most of my day in emacs and xterm.
>
> What desktops? None of that crap on my boxen either. EMACS? What EMACS?
> LaTeX is unfortunately needed (I prefer troff and AMSTeX on the TeX side).
> Netrape? No chance in hell. bash is there, but I prefer to use
> rc.
>
> I don't see what does it have to keeping root on a separate
> filesystem, though - the reasons have nothing to bloat in /usr/bin.

In any case, my point is that splitting /usr wouldn't help, because I'd want to preload stuff from there as well. Splitting /usr doesn't address the problem.

				Regards,

					Richard
Permanent: [EMAIL PROTECTED]
Current:   [EMAIL PROTECTED]
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Fri, 4 May 2001, Richard Gooch wrote:

> I don't bother splitting /usr off /. I gave up doing that when disc
> became cheap. There's no point anymore. And since I have a lightweight

Yes, there is. Locality. Resistance to fs fuckups. Resistance to disk fuckups. Easier to restore from tape. Different tunefs optimum (higher inodes/blocks ratio, for one thing). Ability to keep /usr read-only. Enough?

> distribution (500 MiB and I get X, LaTeX, emacs, compilers, netscrap
> and a pile of other things), it makes even less sense to split /usr
> off. Sorry, I don't have those fancy desktops. Don't need 'em. I spend
> most of my day in emacs and xterm.

What desktops? None of that crap on my boxen either. EMACS? What EMACS? LaTeX is unfortunately needed (I prefer troff and AMSTeX on the TeX side). Netrape? No chance in hell. bash is there, but I prefer to use rc.

I don't see what does it have to keeping root on a separate filesystem, though - the reasons have nothing to bloat in /usr/bin.
Re: [PATCH] SMP race in ext2 - metadata corruption.
Alexander Viro writes:
>
> On Fri, 4 May 2001, Richard Gooch wrote:
>
> > > Two of them: use less bloated shell (and link it statically) and
> > > clean your rc scripts.
> >
> > No, because I'm not using the latest bloated version of bash, and I'm
>
> Umm... Last version of bash I could call not bloated was _long_ time
> ago. Something like ash(1) might be a better idea for /bin/sh.

The shell is irrelevant. I can easily preload that too, if I wanted to, since it's just one thing. But it's not practical to preload all files used by name, because it's just too hard to find out all that is needed. Too much people time required, and it is specific to one distribution (and a particular revision at that).

> > The problem is all the various daemons and system utilities (mount,
> > hwclock, ifconfig and so on) that turn a kernel into a useful system.
> > And then of course there's X...
>
> How do you partition the thing? I.e. what's the size of your root
> partition? I'm usually doing something from 10Mb to 30Mb - that may
> be the reason of differences.

I don't bother splitting /usr off /. I gave up doing that when disc became cheap. There's no point anymore. And since I have a lightweight distribution (500 MiB and I get X, LaTeX, emacs, compilers, netscrap and a pile of other things), it makes even less sense to split /usr off. Sorry, I don't have those fancy desktops. Don't need 'em. I spend most of my day in emacs and xterm.

And even if I did split /usr off, that would just mean I'd want to record block accesses for that device as well. This is because part of my boot process requires stuff in /usr. And after that, firing up xdm.

				Regards,

					Richard
Permanent: [EMAIL PROTECTED]
Current:   [EMAIL PROTECTED]
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Fri, 4 May 2001, Richard Gooch wrote:

> > Two of them: use less bloated shell (and link it statically) and
> > clean your rc scripts.
>
> No, because I'm not using the latest bloated version of bash, and I'm

Umm... Last version of bash I could call not bloated was _long_ time ago. Something like ash(1) might be a better idea for /bin/sh.

> The problem is all the various daemons and system utilities (mount,
> hwclock, ifconfig and so on) that turn a kernel into a useful system.
> And then of course there's X...

How do you partition the thing? I.e. what's the size of your root partition? I'm usually doing something from 10Mb to 30Mb - that may be the reason of differences.
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Fri, 4 May 2001, Alexander Viro wrote:
>
> ObProcfs: I don't think that walking the page tables is a good way to
> compute RSS, especially since VM maintains the thing.

Well, the VM didn't always use to maintain the stuff it does now, so I bet that most of the code is just old code that still works.

Feel free to rip it out.

		Linus
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Fri, May 04 2001, Richard Gooch wrote:
> The idea I had (motivated by the desire to eliminate random disc
> seeks, which is the limiting factor in how fast my boxes boot) was:
>
> - init(8) issues an ioctl(2) on the root FS block device which turns
>   on recording of block reads (it records block numbers)
>
> - at the end of the bootup process, init(8) issues another ioctl(2) to
>   grab the buffered block numbers, and turn off recording
>
> - init(8) then sorts this list in ascending order and saves the result
>   in a file
>
> - next boot, init(8) checks the file, and if it exists, opens the root
>   FS block device and reads in each block listed in the file. The
>   effect is to warm the buffer cache extremely quickly. The head will
>   move in one direction, grabbing data as it flys by. I expect this
>   will take around 1 second
>
> - init(8) now continues the boot process (starting the magic ioctl(2)
>   again so as to get a fresh list of blocks, in case something has
>   changed)
>
> - booting is now super fast, thanks to no disc activity.

I did 95% of what you need sometime last year, to do I/O scheduler profiling (blocks requested, merge stats, request sent to disk). It was a pretty gross hack, requiring a pretty big ring buffer of kernel memory to be able to log at a sufficiently fast rate (you'd be amazed how much output a single dbench 48 run produces :-). A user space app would read data from a simple char device, save for later inspection. A better approach would be to map the ring buffer from the user app, but it was just a quick fix.

--
Jens Axboe
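The logging side of such a hack can be pictured with a small fixed-size ring buffer. This is only a single-threaded userspace sketch under assumed names (block_ring, ring_log, ring_drain) and a tiny hypothetical capacity. It is not the code described in the message above, and a kernel version would need locking and a far larger buffer.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SLOTS 8    /* hypothetical capacity; must be a power of two */

struct block_ring {
    unsigned long slot[RING_SLOTS];
    size_t head;        /* total entries ever written */
    size_t tail;        /* total entries ever read */
    size_t dropped;     /* entries lost because the reader fell behind */
};

/* Producer: log one block number. Returns -1 and counts a drop if full. */
static int ring_log(struct block_ring *r, unsigned long block)
{
    if (r->head - r->tail == RING_SLOTS) {
        r->dropped++;
        return -1;
    }
    r->slot[r->head++ & (RING_SLOTS - 1)] = block;
    return 0;
}

/* Consumer: copy up to max logged entries into out, oldest first. */
static size_t ring_drain(struct block_ring *r, unsigned long *out, size_t max)
{
    size_t n = 0;

    assert(out != NULL || max == 0);
    while (n < max && r->tail != r->head)
        out[n++] = r->slot[r->tail++ & (RING_SLOTS - 1)];
    return n;
}
```

The dropped counter shows why a large buffer was needed: once the producer outruns the reader, entries are simply lost.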
Re: [PATCH] SMP race in ext2 - metadata corruption.
Alexander Viro writes:
>
> On Fri, 4 May 2001, Richard Gooch wrote:
>
> > However, doing an ioctl(2) on the block device won't help. So the
> > question is, where to add the hook? One possibility is the FS, and
> > record inum,bnum pairs. But of course we don't have a way of accessing
> > via inum in user-space. So that's no good. Besides, we want to get
> > block numbers on the block device, because that's the only meaningful
> > number to resort.
> >
> > So, what, then? Some kind of hook on the page cache? Ideas?
>
> Two of them: use less bloated shell (and link it statically) and
> clean your rc scripts.

No, because I'm not using the latest bloated version of bash, and I'm not using the slow and bloated RedHat boot scripts. My boot scripts are lean and mean. Oh. And I already have init(8) warming the cache with these scripts.

The problem is all the various daemons and system utilities (mount, hwclock, ifconfig and so on) that turn a kernel into a useful system. And then of course there's X...

Sorry. A "don't do that then" answer isn't appropriate for this problem space.

				Regards,

					Richard
Permanent: [EMAIL PROTECTED]
Current:   [EMAIL PROTECTED]
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Fri, 4 May 2001, Richard Gooch wrote:

> However, doing an ioctl(2) on the block device won't help. So the
> question is, where to add the hook? One possibility is the FS, and
> record inum,bnum pairs. But of course we don't have a way of accessing
> via inum in user-space. So that's no good. Besides, we want to get
> block numbers on the block device, because that's the only meaningful
> number to resort.
>
> So, what, then? Some kind of hook on the page cache? Ideas?

Two of them: use less bloated shell (and link it statically) and clean your rc scripts.
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Fri, 4 May 2001, Linus Torvalds wrote:

> On Fri, 4 May 2001, Alexander Viro wrote:
> >
> > Ehh... There _is_ a way to deal with that, but it's deeply Albertesque:
                                                               ^^^
> > * add pagecache access for block device
> > * put your "real" root on /dev/loop0 (setup from initrd)
> > * dd
>
> You're one sick puppy.
[snip]

/me bows

Nice to see that imitation was good enough ;-) Seriously, I half-expected Albert to show up at that point of thread and tried to anticipate what he'd produce.

ObProcfs: I don't think that walking the page tables is a good way to compute RSS, especially since VM maintains the thing. Mind if I rip it out? In effect, implementation of /proc/<pid>/statm
	* produces extremely bogus values (VMA is from library if it goes beyond 0x6000? Might be even true 7 years ago...) and nobody had cared about them for 6-7 years
	* makes stuff like top(1) _walk_ _whole_ _page_ _tables_ _of_ _all_ _processes_ each 5 seconds. No wonder it's slow like hell and eats tons of CPU time.
Re: [PATCH] SMP race in ext2 - metadata corruption.
Linus Torvalds writes:
> Now, if you want to speed up accesses, there are things you can
> do. You can lay out the filesystem in the access order - trace the
> IO accesses at bootup ("which file, which offset, which metadata
> block?") and lay out the blocks of the files in exactly the right
> order. Then you will get linear reads _without_ doing any "dd" at
> all.

A year ago I came up with an alternative approach for cache warming, but I see that it wouldn't work with our current infrastructure. However, maybe there is still a way to use the basic technique. If so, please make suggestions.

The idea I had (motivated by the desire to eliminate random disc seeks, which is the limiting factor in how fast my boxes boot) was:

- init(8) issues an ioctl(2) on the root FS block device which turns on recording of block reads (it records block numbers)

- at the end of the bootup process, init(8) issues another ioctl(2) to grab the buffered block numbers, and turn off recording

- init(8) then sorts this list in ascending order and saves the result in a file

- next boot, init(8) checks the file, and if it exists, opens the root FS block device and reads in each block listed in the file. The effect is to warm the buffer cache extremely quickly. The head will move in one direction, grabbing data as it flys by. I expect this will take around 1 second

- init(8) now continues the boot process (starting the magic ioctl(2) again so as to get a fresh list of blocks, in case something has changed)

- booting is now super fast, thanks to no disc activity.

The advantage of this scheme over blindly reading the first 50 MB is that it only reads in what you *need*, and thus will work better on low memory systems. It's also useful for other applications, not just speeding up the boot process.

However, doing an ioctl(2) on the block device won't help. So the question is, where to add the hook? One possibility is the FS, and record inum,bnum pairs. But of course we don't have a way of accessing via inum in user-space. So that's no good. Besides, we want to get block numbers on the block device, because that's the only meaningful number to resort.

So, what, then? Some kind of hook on the page cache? Ideas?

				Regards,

					Richard
Permanent: [EMAIL PROTECTED]
Current:   [EMAIL PROTECTED]
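The "sort this list in ascending order" step can be sketched as below. The sketch goes slightly beyond the message by also dropping duplicates and merging adjacent block numbers into runs, on the assumption that a replay tool would want to issue one sequential read per run. The names blocks_to_runs and struct run are hypothetical. As other messages in this thread point out, replaying against the block device would warm the buffer cache only, so this illustrates the bookkeeping, not a working boot speedup.

```c
#include <assert.h>
#include <stdlib.h>

/* A contiguous range of blocks: start, start+1, ..., start+count-1. */
struct run {
    unsigned long start;
    unsigned long count;
};

static int cmp_block(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;
    return (x > y) - (x < y);
}

/*
 * Sort the recorded block numbers in place, drop duplicates and merge
 * neighbours into runs. runs must have room for n entries.
 * Returns the number of runs written.
 */
static size_t blocks_to_runs(unsigned long *blocks, size_t n, struct run *runs)
{
    size_t nruns = 0;
    size_t i;

    if (n == 0)
        return 0;
    assert(blocks != NULL && runs != NULL);
    qsort(blocks, n, sizeof *blocks, cmp_block);

    for (i = 0; i < n; i++) {
        if (nruns > 0) {
            struct run *last = &runs[nruns - 1];
            if (blocks[i] < last->start + last->count)
                continue;               /* duplicate of a block already covered */
            if (blocks[i] == last->start + last->count) {
                last->count++;          /* extends the current run */
                continue;
            }
        }
        runs[nruns].start = blocks[i];
        runs[nruns].count = 1;
        nruns++;
    }
    return nruns;
}
```

For the recorded sequence 7, 3, 4, 5, 10, 3 this yields three runs: blocks 3 to 5, block 7, and block 10.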
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Fri, 4 May 2001, Alexander Viro wrote:
>
> Ehh... There _is_ a way to deal with that, but it's deeply Albertesque:
> * add pagecache access for block device
> * put your "real" root on /dev/loop0 (setup from initrd)
> * dd

You're one sick puppy.

Now, the above is basically equivalent to using and populating a dynamically sized ramdisk.

If you really want to go this way, I'd much rather see you using a real ram-disk (that you populate at startup with something like a compressed tar-file). THAT is definitely going to speed up booting - thanks to compression you'll not only get linear reads, but you will get fewer reads than the amount of data you need would imply.

Couple that with tmpfs, or possibly something like coda (to dynamically move things between the ramdisk and the "backing store" filesystem), and you can get a ramdisk approach that actually shrinks (and, in the case of coda or whatever, truly grows) dynamically.

Think of it as an exercise in multi-level filesystems and filesystem management. Others have done it before (usually between disk and tape, or disk and network), and in these days of ever-growing memory it might just make sense to do it on that level too.

(No, I don't seriously think it makes sense today. But if RAM keeps growing and becoming ever cheaper, it might some day. At the point where everybody has multi-gigabyte memories, and doesn't really need them for anything but caching, you could think of it as just moving the caching to a higher level - you don't cache blocks, you cache parts of the filesystem).

> Al, feeling sadistic today...

Sadistic you are.

		Linus
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Fri, 4 May 2001, Linus Torvalds wrote:

> Now, if you want to speed up accesses, there are things you can do. You
> can lay out the filesystem in the access order - trace the IO accesses at
> bootup ("which file, which offset, which metadata block?") and lay out the
> blocks of the files in exactly the right order. Then you will get linear
> reads _without_ doing any "dd" at all.
>
> Now, laying out the filesystem that way is _hard_. No question about it.
> It's kind of equivalent to doing a filesystem "defragment" operation,
> except you use a different sorting function (instead of sorting blocks
> linearly within each file, you sort according to access order).

Ehh... There _is_ a way to deal with that, but it's deeply Albertesque:
	* add pagecache access for block device
	* put your "real" root on /dev/loop0 (setup from initrd)
	* dd

The last step will populate pagecache for underlying device and later access to root fs will ultimately hit said pagecache, be it from page cache of files or buffer cache of /dev/loop0 - loop_make_request() will take care of that, by copying data from pagecache of /dev/.

							Al, feeling sadistic today...
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Fri, 4 May 2001, Andrea Arcangeli wrote:

> On Fri, May 04, 2001 at 01:56:14PM +0200, Jens Axboe wrote:
> > Or you can rewrite block_read/write to use the page cache, in which case
> > you'd have more luck doing the above.
>
> once block_dev is in pagecache there will obviously be no-way to share
> cache between the block device and the filesystem, because all the
> caches will be in completely different address spaces.

They already pretty much are.

I do want to re-write block_read/write to use the page cache, but not because it would impact anything in this discussion. I want to do it early in 2.5.x, because:

 - it will speed up accesses
 - it will re-use existing code better and conceptualize things more cleanly (ie it would turn a disk into a _really_ simple filesystem with just one big file ;).
 - it will make MM handling much better for things like fsck - the memory pressure is designed to work on page cache things.
 - it will be one less thing that uses the buffer cache as a "cache" (I want people to think of, and use, the buffer cache as an _IO_ entity, not a cache).

It will not make the "cache at bootup" thing change at all (because even in the page cache, there is no commonality between a virtual mapping of a _file_ (or metadata) and a virtual mapping of a _disk_).

It would have hidden the problem with "dd" or "dump" touching buffer cache blocks that the filesystem was using, so the original metadata corruption that started this thread would not happen. But that's not a design issue or a design goal, that would just have been a random result.

		Linus
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Fri, 4 May 2001, Rogier Wolff wrote:
>
> Linus Torvalds wrote:
> >
> > Ehh. Doing that would be extremely stupid, and would slow down your boot
> > and nothing more.
>
> Ehhh, Linus, Linearly reading my harddisk goes at 26Mb per second.

You obviously didn't read my explanation of _why_ it is stupid.

> By analyzing my boot process I determine that 50M of my disk is used
> during boot. I can then reshuffle my disk to have that 50M of data at
> the beginning and reading all that into 50M of cache, I can save
> thousands of 10ms seeks.

No. Have you _tried_ this?

What the above would do is to move 50M of the disk into the buffer cache.

Then, a second later, when the boot proceeds, Linux would start filling the page cache.

BY READING THE CONTENTS FROM DISK AGAIN!

In short, by doing a "dd" from the disk, you would _not_ help anything at all. You would only make things slower, by reading things twice.

The Linux buffer cache and page cache are two separate entities. They are not synchronized, and they are indexed through totally different means. The page cache is virtually indexed by <inode,offset>, while the buffer cache is indexed by <dev,blocknr,blocksize>.

> Is this simply: Don't try this then?

Try it. You will see.

You _can_ actually try to optimize certain things with 2.4.x: all meta-data is still in the buffer cache in 2.4.x, so what you could do is to lay out the image so that the metadata is at the front of the disk, and do the "dd" to cache just the metadata. Even then you need to be careful, and make sure that the "dd" uses the same block size as the filesystem will use.

And even that will largely stop working very early in 2.5.x when the directory contents and possibly inode and bitmap metadata moves into the page cache.

Now, you may ask "why use the page cache at all then"? The answer is that the page cache is a _lot_ faster to look up, exactly because of the virtual indexing (and also because the data structure is much better designed - fixed-size entities with none of the complexities of the buffer cache. The buffer cache needs to be able to do IO, while the page cache is _only_ a cache and does that one thing really well - doing IO is a completely separate issue with the page cache).

Now, if you want to speed up accesses, there are things you can do. You can lay out the filesystem in the access order - trace the IO accesses at bootup ("which file, which offset, which metadata block?") and lay out the blocks of the files in exactly the right order. Then you will get linear reads _without_ doing any "dd" at all.

Now, laying out the filesystem that way is _hard_. No question about it. It's kind of equivalent to doing a filesystem "defragment" operation, except you use a different sorting function (instead of sorting blocks linearly within each file, you sort according to access order).

		Linus
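A toy model may help show why a "dd" from the block device does nothing for later file reads. It assumes the 2.4-era arrangement in which the page cache is looked up by a (file, page index) pair and the buffer cache by (device, block number, block size). The structs, field names and the flat arrays are purely illustrative and bear no relation to the real kernel data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical key types; field names are illustrative, not kernel definitions. */
struct page_key   { unsigned long inode; unsigned long index; };
struct buffer_key { unsigned int dev; unsigned long block; unsigned int size; };

#define SLOTS 16

struct toy_caches {
    struct page_key   pages[SLOTS];
    size_t            npages;
    struct buffer_key buffers[SLOTS];
    size_t            nbuffers;
};

/* What reading the raw device does in this model: fills the buffer cache only. */
static void buffer_insert(struct toy_caches *c, unsigned int dev,
                          unsigned long block, unsigned int size)
{
    assert(c->nbuffers < SLOTS);
    c->buffers[c->nbuffers++] = (struct buffer_key){ dev, block, size };
}

/* What a file read does after a miss: fills the page cache only. */
static void page_insert(struct toy_caches *c, unsigned long inode,
                        unsigned long index)
{
    assert(c->npages < SLOTS);
    c->pages[c->npages++] = (struct page_key){ inode, index };
}

static bool buffer_cached(const struct toy_caches *c, unsigned int dev,
                          unsigned long block, unsigned int size)
{
    size_t i;
    for (i = 0; i < c->nbuffers; i++)
        if (c->buffers[i].dev == dev && c->buffers[i].block == block &&
            c->buffers[i].size == size)
            return true;
    return false;
}

static bool page_cached(const struct toy_caches *c, unsigned long inode,
                        unsigned long index)
{
    size_t i;
    for (i = 0; i < c->npages; i++)
        if (c->pages[i].inode == inode && c->pages[i].index == index)
            return true;
    return false;
}
```

After buffer_insert() for some device block, page_cached() for the file that happens to live in that block still reports a miss, so the data would be read from disk a second time. The size field also shows why a "dd" with a different block size would miss even in the buffer cache.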
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Fri, May 04, 2001 at 01:56:14PM +0200, Jens Axboe wrote:
> Or you can rewrite block_read/write to use the page cache, in which case
> you'd have more luck doing the above.

once block_dev is in pagecache there will obviously be no-way to share cache between the block device and the filesystem, because all the caches will be in completely different address spaces.

Andrea
Re: [PATCH] SMP race in ext2 - metadata corruption.
Rogier Wolff <[EMAIL PROTECTED]> wrote:
> during boot. I can then reshuffle my disk to have that 50M of data at
> the beginning and reading all that into 50M of cache, I can save

Wasn't that one of the goals of the LVM project, along with snapshots and block-level HSM?
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Fri, May 04 2001, Rogier Wolff wrote:
> > On Thu, 3 May 2001, Alan Cox wrote:
> > > Ditto for some CD based stuff. You burn the important binaries to the front
> > > of the CD, then at boot dd 64Mb to /dev/null to prime the libraries and
> > > avoid a lot of seeking during boot up from the CD-ROM.
> > >
> > > However I could do that from an initrd before mounting
> >
> > Ehh. Doing that would be extremely stupid, and would slow down your boot
> > and nothing more.
>
> Ehhh, Linus, Linearly reading my harddisk goes at 26Mb per second. By
> analyzing my boot process I determine that 50M of my disk is used
> during boot. I can then reshuffle my disk to have that 50M of data at
> the beginning and reading all that into 50M of cache, I can save
> thousands of 10ms seeks. Boot time would go from several tens of
> seconds to 2 seconds worth of DISK IO plus several seconds of pure CPU
> time.

Provided that the buffer cache and page cache are coherent, which they are not. So at most you'll cache fs meta data by doing the dd trick, which is hardly worth the effort.

Or you can rewrite block_read/write to use the page cache, in which case you'd have more luck doing the above.

--
Jens Axboe
Re: [PATCH] SMP race in ext2 - metadata corruption.
Linus Torvalds wrote:
>
> On Thu, 3 May 2001, Alan Cox wrote:
> > Ditto for some CD based stuff. You burn the important binaries to the front
> > of the CD, then at boot dd 64Mb to /dev/null to prime the libraries and
> > avoid a lot of seeking during boot up from the CD-ROM.
> >
> > However I could do that from an initrd before mounting
>
> Ehh. Doing that would be extremely stupid, and would slow down your boot
> and nothing more.

Ehhh, Linus, Linearly reading my harddisk goes at 26Mb per second.

By analyzing my boot process I determine that 50M of my disk is used during boot. I can then reshuffle my disk to have that 50M of data at the beginning and reading all that into 50M of cache, I can save thousands of 10ms seeks. Boot time would go from several tens of seconds to 2 seconds worth of DISK IO plus several seconds of pure CPU time.

This doesn't work if I don't have the memory to cache 50M of disk-blocks.

Is this simply: Don't try this then?

			Roger.

--
** [EMAIL PROTECTED] ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* There are old pilots, and there are bold pilots.
* There are also old, bald pilots.
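The arithmetic behind that estimate can be written out as a small sketch. The 50 MB, 26 MB/s and 10 ms figures come from the message. The count of 3000 seeks used in the example is an assumed stand-in for "thousands", and the function names are hypothetical. The estimate also presumes that the pre-read data is actually reused, which other messages in this thread dispute.

```c
#include <assert.h>

/* Time to read a contiguous region at a given streaming rate. */
static double linear_read_seconds(double megabytes, double mb_per_second)
{
    assert(mb_per_second > 0.0);
    return megabytes / mb_per_second;
}

/* Time spent on seeks alone, ignoring transfer time. */
static double seek_seconds(unsigned long seeks, double ms_per_seek)
{
    return (double)seeks * ms_per_seek / 1000.0;
}
```

With these inputs, linear_read_seconds(50, 26) is a little under 2 seconds, while seek_seconds(3000, 10) is 30 seconds, matching the "several tens of seconds" versus "2 seconds" contrast in the message.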
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Thu, 3 May 2001, Alan Cox wrote:
>
> > > discussion in itself), and there really are no valid uses for opening a
> > > block device that is already mounted. More importantly, I don't think
> > > anybody actually does.
> >
> > Actually I did. I might do it again :) The point was to get the kernel to
> > cache certain blocks in the RAM.
>
> Ditto for some CD based stuff. You burn the important binaries to the front
> of the CD, then at boot dd 64Mb to /dev/null to prime the libraries and
> avoid a lot of seeking during boot up from the CD-ROM.
>
> However I could do that from an initrd before mounting

Ehh. Doing that would be extremely stupid, and would slow down your boot
and nothing more.

The page cache is _not_ coherent with the buffer cache. For any
filesystem that uses the page cache for data caching (which pretty much
all of them do, because it's the only way to get sane mmap semantics,
and it's a lot faster than the old buffer cache ever was), the above
will do _nothing_ but spend time doing IO that the page cache will just
end up doing again.

Currently it can help to pre-load the meta-data, but quite frankly, even
that is suspect, and won't work in 2.5.x when Al's metadata page-cache
stuff is merged (at least directories, and likely inodes too).

In short, don't do it. It doesn't work reliably (and hasn't since
2.0.x), and it will only get more and more unreliable.

		Linus
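A toy model of the point made above, as a minimal sketch and not kernel code: it assumes two independent caches in front of one disk, one filled by reads of the block device and one filled by reads through the filesystem. All names here are hypothetical. Priming through the device only adds I/O, because the later file reads never look in the cache that was primed.

```c
#define NBLOCKS 4

/* Hypothetical model: two caches that do not know about each other. */
struct toy_caches {
    int buffer_cached[NBLOCKS]; /* filled by reads of the block device */
    int page_cached[NBLOCKS];   /* filled by reads through the filesystem */
    int disk_reads;             /* physical I/Os issued so far */
};

/* Read a block via the device node: consults only the buffer cache. */
static void read_via_device(struct toy_caches *c, int blk)
{
    if (!c->buffer_cached[blk]) {
        c->disk_reads++;
        c->buffer_cached[blk] = 1;
    }
}

/* Read the same block as file data: consults only the page cache. */
static void read_via_file(struct toy_caches *c, int blk)
{
    if (!c->page_cached[blk]) {
        c->disk_reads++;
        c->page_cached[blk] = 1;
    }
}

/* dd-style priming of every block, followed by normal file reads. */
static int disk_reads_with_priming(void)
{
    struct toy_caches c = {{0}, {0}, 0};
    for (int b = 0; b < NBLOCKS; b++)
        read_via_device(&c, b);
    for (int b = 0; b < NBLOCKS; b++)
        read_via_file(&c, b);
    return c.disk_reads;
}

/* The same file reads with no priming step. */
static int disk_reads_without_priming(void)
{
    struct toy_caches c = {{0}, {0}, 0};
    for (int b = 0; b < NBLOCKS; b++)
        read_via_file(&c, b);
    return c.disk_reads;
}
```

In this model the primed run issues twice as many disk reads as the unprimed one. Real hardware has its own drive cache and readahead, which the sketch ignores.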
Re: [PATCH] SMP race in ext2 - metadata corruption.
> > discussion in itself), and there really are no valid uses for opening a
> > block device that is already mounted. More importantly, I don't think
> > anybody actually does.
>
> Actually I did. I might do it again :) The point was to get the kernel to
> cache certain blocks in the RAM.

Ditto for some CD based stuff. You burn the important binaries to the front
of the CD, then at boot dd 64Mb to /dev/null to prime the libraries and
avoid a lot of seeking during boot up from the CD-ROM.

However I could do that from an initrd before mounting

Alan
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Thu, 26 Apr 2001, Linus Torvalds wrote:

>
>
> On Thu, 26 Apr 2001, Alexander Viro wrote:
> > On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
> > >
> > > how can the read in progress see a branch that we didn't spliced yet? We
> >
> > 	fd = open("/dev/hda1", O_RDONLY);
> > 	read(fd, buf, sizeof(buf));
>
> Note that I think all these arguments are fairly bogus. Doing things like
> "dump" on a live filesystem is stupid and dangerous (in my opinion it is
> stupid and dangerous to use "dump" at _all_, but that's a whole 'nother
> discussion in itself), and there really are no valid uses for opening a
> block device that is already mounted. More importantly, I don't think
> anybody actually does.

Actually I did. I might do it again :) The point was to get the kernel to
cache certain blocks in the RAM.

                          Vladimir Dergachev

>
> The fact that you _can_ do so makes the patch valid, and I do agree with
> Al on the "least surprise" issue. I've already applied the patch, in fact.
> But the fact is that nobody should ever do the thing that could cause
> problems.
>
> 		Linus
Re: [PATCH] SMP race in ext2 - metadata corruption.
Hiya.

Linus Torvalds wrote:
> So anybody who depends on "dump" getting backups right is already playing
> Russian roulette with their backups. It's not at all guaranteed to get the
> right results - you may end up having stale data in the buffer cache that
> ends up being "backed up".
>
> Dump was a stupid program in the first place. Leave it behind.

Ouch. I just re-read the man page and it doesn't caution (*) against
using it on mounted filesystems. That probably means that there are
thousands of other losers like me using it on production machines.

Volunteers to (a) change the man page, (b) talk to the distros about
dumping "dump"?

> However, it may be that in the long run it would be advantageous to have a
> "filesystem maintenance interface" for doing things like backups and
> defragmentation..

Yup, sounds good.

Neil

(*) The KNOWNBUGS file mentions "possible" problems while dumping active
mounted filesystems, but I've (elsewhere) seen these characterised as no
real problem; also, this falls a long way short of discouraging use in
this fashion.
Re: [PATCH] SMP race in ext2 - metadata corruption.
> or such. tar/cpio and friends don't deal properly with
> a. holes inside of files.
> b. hardlinks between files.

GNU tar handles both of these. (Not particularly efficiently in the case
of sparse files, but that's a minor issue in this case.) See the -S flag.

Perhaps more importantly, for a _robust_ backup solution which can deal
with partially unreadable tapes, you have pretty much no option other
than tar for the actual storage.

Olaf
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Sat, Apr 28 2001, Albert D. Cahalan wrote:
> Linus Torvalds writes:
>
> > The buffer cache is "virtual" in the sense that /dev/hda is a
> > completely separate name-space from /dev/hda1, even if there
> > is some physical overlap.
>
> So the aliasing problems and elevator algorithm confusion remain?

At least for the I/O scheduler confusion, requests to partitions will
remap the buffer location and this problem disappears nicely. It's not
a big issue, really.

> Is this ever likely to change, and what is with the 1 kB assumptions?
> (Hmmm, cruft left over from the 1 kB Minix filesystem blocks?)

What 1kB assumption?

--
Jens Axboe
Re: [PATCH] SMP race in ext2 - metadata corruption.
Martin Dalecki :
> tar/cpio and friends don't deal properly with
>
> a. holes inside of files.
> b. hardlinks between files.
>

??? GNU tar does both. The only thing it currently cannot handle is Not
Changing Anything: either atime or ctime _will_ be updated.
Re: [PATCH] SMP race in ext2 - metadata corruption.
Linus Torvalds writes:

> The buffer cache is "virtual" in the sense that /dev/hda is a
> completely separate name-space from /dev/hda1, even if there
> is some physical overlap.

So the aliasing problems and elevator algorithm confusion remain?

Is this ever likely to change, and what is with the 1 kB assumptions?
(Hmmm, cruft left over from the 1 kB Minix filesystem blocks?)
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Fri, Apr 27, 2001 at 09:52:19AM -0700, Linus Torvalds wrote:
>
> On Fri, 27 Apr 2001, Vojtech Pavlik wrote:
> >
> > Actually this is done quite often, even on mounted fs's:
> >
> > hdparm -t /dev/hda
>
> Note that this one happens to be ok.
>
> The buffer cache is "virtual" in the sense that /dev/hda is a completely
> separate name-space from /dev/hda1, even if there is some physical
> overlap.

Wouldn't something like "hdparm -t /dev/md0" trigger it though? It is
the same device that gets mounted, as md devices aren't partitioned.

Shane

--
Shane Wegner: [EMAIL PROTECTED]
  http://www.cm.nu/~shane/
PGP: 1024D/FFE3035D
  A0ED DAC4 77EC D674 5487
  5B5C 4F89 9A4E FFE3 035D
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Fri, Apr 27, 2001 at 11:02:17AM -0700, LA Walsh wrote:
> Andrzej Krzysztofowicz wrote:
>
> > I know a few people that often do:
> >
> > dd if=/dev/hda1 of=/dev/hdc1
> > e2fsck /dev/hdc1
> >
> > to make an "exact" copy of a currently working system.
>
> ---
> Presumably this isn't a problem if the source disks are either unmounted or
> mounted 'read-only' ?
>

I thought the known best solution on this was to use COW snapshots,
because then you copy the filesystem in exactly the state it was in when
the snapshot was made, without impacting the writability of the
filesystem while the (potentially very long) dump is made?

I tried using this on LVM, but after seeing a few messages on the list
about kernel oopses happening with snapshots of filesystems with heavy
write activity, as well as experiencing serious problems with the LVM
userspace tools (they would core dump on startup if the LVM filesystem
had any sort of corruption or integrity failure), I decided to put it
away until the LVM folks managed to get a production version ready.
Re: [PATCH] SMP race in ext2 - metadata corruption.
Andrzej Krzysztofowicz wrote:

> I know a few people that often do:
>
> dd if=/dev/hda1 of=/dev/hdc1
> e2fsck /dev/hdc1
>
> to make an "exact" copy of a currently working system.

---
Presumably this isn't a problem if the source disks are either unmounted
or mounted 'read-only' ?

--
The above thoughts and           | They may have nothing to do with
writings are my own.             | the opinions of my employer. :-)
L A Walsh                        | Trust Technology, Core Linux, SGI
[EMAIL PROTECTED]                | Voice: (650) 933-5338
Re: [PATCH] SMP race in ext2 - metadata corruption.
Linus Torvalds wrote:
> On Fri, 27 Apr 2001, Neil Conway wrote:
> >
> > I'm surprised that dump is deprecated (by you at least ;-)). What to
> > use instead for backups on machines that can't umount disks regularly?
>
> Note that dump simply won't work reliably at all even in 2.4.x: the buffer
> cache and the page cache (where all the actual data is) are not
> coherent. This is only going to get even worse in 2.5.x, when the
> directories are moved into the page cache as well.

> Dump was a stupid program in the first place. Leave it behind.

Dump/restore are useful, on-line dump is silly. I am personally amazed
that on-line, mounted dump was -ever- supported. I guess it would work
if mounted ro...

dump is still the canonical solution, IMHO, for saving and restoring
filesystem metadata OFFLINE. tar/cpio can be taught to do stuff like
security ACLs and EAs and such, but such code and formats are not yet
standardized, and they do not approach dump when it comes to taking an
accurate snapshot of the filesystem.

--
Jeff Garzik      | Disbelief, that's why you fail.
Building 1024    |
MandrakeSoft     |
Re: [PATCH] SMP race in ext2 - metadata corruption.
Linus Torvalds wrote:

> Dump was a stupid program in the first place. Leave it behind.

Not quite, Linus - dump/restore are nice tools for creating, for example,
automatic over-the-network installation servers, i.e. efficient system
images or such. tar/cpio and friends don't deal properly with

a. holes inside of files.
b. hardlinks between files.

Really, they are not useless. However, I wouldn't recommend them for
backup practices either.

Please see for example:

http://www.systime-solutions.de/index.php?topic=produkte&subtopic=setupserver

Well yes, if you understand German...
Re: [PATCH] SMP race in ext2 - metadata corruption.
[ linux-kernel added back as a cc ]

On Fri, 27 Apr 2001, Neil Conway wrote:
>
> I'm surprised that dump is deprecated (by you at least ;-)). What to
> use instead for backups on machines that can't umount disks regularly?

Note that dump simply won't work reliably at all even in 2.4.x: the buffer
cache and the page cache (where all the actual data is) are not
coherent. This is only going to get even worse in 2.5.x, when the
directories are moved into the page cache as well.

So anybody who depends on "dump" getting backups right is already playing
Russian roulette with their backups. It's not at all guaranteed to get the
right results - you may end up having stale data in the buffer cache that
ends up being "backed up".

Dump was a stupid program in the first place. Leave it behind.

> I've always thought "tar" was a bit undesirable (updates atimes or
> ctimes for example).

Right now, the cpio/tar/xxx solutions are definitely the best ones, and
will work on multiple filesystems (another limitation of "dump"). Whatever
problems they have, they are still better than the _guaranteed_(*) data
corruptions of "dump".

However, it may be that in the long run it would be advantageous to have a
"filesystem maintenance interface" for doing things like backups and
defragmentation..

		Linus

(*) Dump may work fine for you a thousand times. But it _will_ fail under
the right circumstances. And there is nothing you can do about it.
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Fri, 27 Apr 2001, Vojtech Pavlik wrote:
>
> Actually this is done quite often, even on mounted fs's:
>
> hdparm -t /dev/hda

Note that this one happens to be ok.

The buffer cache is "virtual" in the sense that /dev/hda is a completely
separate name-space from /dev/hda1, even if there is some physical
overlap.

		Linus
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Fri, Apr 27, 2001 at 09:23:57AM -0400, you [Alexander Viro] claimed:
>
>
> On Fri, 27 Apr 2001, Vojtech Pavlik wrote:
>
> > Actually this is done quite often, even on mounted fs's:
> >
> > hdparm -t /dev/hda
>
> You would need either hdparm -t /dev/hda or mounting the
> whole /dev/hda.
>
> Buffer cache for the disk is unrelated to buffer cache for partitions.

Well, I for one have been running

  hdparm -t /dev/md0

or

  time head -c 1000m /dev/md0 > /dev/null

while /dev/md0 was mounted, without realizing that this could be "stupid"
or that it could eat my data.

  /dev/md0 on /backup-versioned type ext2 (rw)

I often cat(1) or head(1) partitions or devices (even mounted ones) if I
need dummy randomish test data for compression or tape drives (that I've
been having trouble with).

BTW: is 2.2 affected? 2.0?

-- v --

[EMAIL PROTECTED]
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Thu, Apr 26, 2001 at 01:08:25PM -0700, Linus Torvalds wrote:
> Note that I think all these arguments are fairly bogus. Doing things like
> "dump" on a live filesystem is stupid and dangerous (in my opinion it is
> stupid and dangerous to use "dump" at _all_, but that's a whole 'nother
> discussion in itself), and there really are no valid uses for opening a
> block device that is already mounted. More importantly, I don't think
> anybody actually does.

You can use LVM snapshot volumes to do it safely.

-Andi
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Fri, 27 Apr 2001, Vojtech Pavlik wrote:

> Actually this is done quite often, even on mounted fs's:
>
> hdparm -t /dev/hda

You would need either hdparm -t /dev/hda or mounting the
whole /dev/hda.

Buffer cache for the disk is unrelated to buffer cache for partitions.
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Thu, Apr 26, 2001 at 01:08:25PM -0700, Linus Torvalds wrote:

> On Thu, 26 Apr 2001, Alexander Viro wrote:
> > On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
> > >
> > > how can the read in progress see a branch that we didn't spliced yet? We
> >
> > 	fd = open("/dev/hda1", O_RDONLY);
> > 	read(fd, buf, sizeof(buf));
>
> Note that I think all these arguments are fairly bogus. Doing things like
> "dump" on a live filesystem is stupid and dangerous (in my opinion it is
> stupid and dangerous to use "dump" at _all_, but that's a whole 'nother
> discussion in itself), and there really are no valid uses for opening a
> block device that is already mounted. More importantly, I don't think
> anybody actually does.

Actually this is done quite often, even on mounted fs's:

hdparm -t /dev/hda

> The fact that you _can_ do so makes the patch valid, and I do agree with
> Al on the "least surprise" issue. I've already applied the patch, in fact.
> But the fact is that nobody should ever do the thing that could cause
> problems.

--
Vojtech Pavlik
SuSE Labs
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Thu, 26 Apr 2001, Alexander Viro wrote:

>
>
> On Thu, 26 Apr 2001, Richard B. Johnson wrote:
>
> > The disk image, raw.bin, does NOT contain the image of the floppy.
> > Most of boot stuff added by lilo is missing. It will eventually
> > get there, but it's not there now, even though the floppy was
> > un-mounted!
>
> I doubt that you can reproduce that on anything remotely recent.
> All buffers are flushed when last user closes device.
>

2.4.3

Buffers are not flushed (actually written) to disk. The floppy
continues to be written for 20 to 30 seconds after `umount` returns
to the shell. A program like `cp`, accessing the raw device during
this time, does not get what will eventually be written.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (799.53 BogoMips).

"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Thu, Apr 26, 2001 at 05:19:26PM -0700, Linus Torvalds wrote:
> Detail nit: don't do this. Use "current_text_addr()" instead. Simpler to
> read, and gcc will actually do the right thing wrt inlining etc.

Agreed, thanks for the info.

Andrea
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Fri, 27 Apr 2001, Andrea Arcangeli wrote:
> +		__label__ here;
> +	here:
> +		printk(KERN_ERR "IO error or racy use of wait_on_buffer() from %p\n", &&here);

Detail nit: don't do this. Use "current_text_addr()" instead. Simpler to
read, and gcc will actually do the right thing wrt inlining etc.

		Linus
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Thu, Apr 26, 2001 at 07:25:23PM -0400, Alexander Viro wrote:
>
>
> On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
>
> > > How about adding
> > > 	if (!buffer_uptodate(bh)) {
> > > 		printk(KERN_ERR "IO error or racy use of wait_on_buffer()");
> > > 		show_task(current);
> > > 	}
> > > in the end of wait_on_buffer() for a while?
> >
> > At the _top_ of wait_on_buffer would be better then at the end.
>
> In that case ll_rw_block() + wait_on_buffer() (absolutely legitimate
> combination) will scream at you.

--- 2.4.4pre7/include/linux/locks.h	Thu Apr 26 05:22:11 2001
+++ 2.4.4pre7aa1/include/linux/locks.h	Fri Apr 27 01:52:31 2001
@@ -18,6 +18,11 @@
 {
 	if (test_bit(BH_Lock, &bh->b_state))
 		__wait_on_buffer(bh);
+	else if (!buffer_uptodate(bh)) {
+		__label__ here;
+	here:
+		printk(KERN_ERR "IO error or racy use of wait_on_buffer() from %p\n", &&here);
+	}
 }
 
 extern inline void lock_buffer(struct buffer_head * bh)

Andrea
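Why the check ends up in the else branch can be shown with a small userspace model. This is a minimal sketch under stated assumptions: struct model_bh and every function below are hypothetical stand-ins, not kernel interfaces, and the simulated wait always ends with a successful read.

```c
/* Hypothetical userspace model of a buffer head; not the kernel struct. */
struct model_bh {
    int locked;    /* I/O in flight (models BH_Lock) */
    int uptodate;  /* contents valid (models BH_Uptodate) */
};

/* Pretend to sleep until the in-flight read completes successfully. */
static void model_wait(struct model_bh *bh)
{
    bh->locked = 0;
    bh->uptodate = 1;
}

/* Diagnostic placed at the top, before waiting. Returns 1 if it warns. */
static int wait_check_at_top(struct model_bh *bh)
{
    int warned = !bh->uptodate;

    if (bh->locked)
        model_wait(bh);
    return warned;
}

/* Diagnostic placed in the else branch, as in the patch above. */
static int wait_check_in_else(struct model_bh *bh)
{
    if (bh->locked) {
        model_wait(bh);
        return 0;
    }
    return !bh->uptodate;
}
```

A buffer with a read in flight (locked, not yet uptodate) models the legitimate ll_rw_block() + wait_on_buffer() sequence: the top-placed check warns on it, the else-placed check does not. A buffer that is neither locked nor uptodate is the suspicious case, and both variants flag it.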
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Thu, 26 Apr 2001, Andrea Arcangeli wrote:

> > How about adding
> > 	if (!buffer_uptodate(bh)) {
> > 		printk(KERN_ERR "IO error or racy use of wait_on_buffer()");
> > 		show_task(current);
> > 	}
> > in the end of wait_on_buffer() for a while?
>
> At the _top_ of wait_on_buffer would be better then at the end.

In that case ll_rw_block() + wait_on_buffer() (absolutely legitimate
combination) will scream at you.
Re: [PATCH] SMP race in ext2 - metadata corruption.
>
> Note that I think all these arguments are fairly bogus. Doing things like
> "dump" on a live filesystem is stupid and dangerous (in my opinion it is
> stupid and dangerous to use "dump" at _all_, but that's a whole 'nother
> discussion in itself), and there really are no valid uses for opening a
> block device that is already mounted. More importantly, I don't think
> anybody actually does.

I know a few people that often do:

dd if=/dev/hda1 of=/dev/hdc1
e2fsck /dev/hdc1

to make an "exact" copy of a currently working system.

Maybe it is stupid, but they do.
Fortunately, their systems are not SMP...

Andrzej
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Thu, Apr 26, 2001 at 01:26:15PM -0700, Linus Torvalds wrote:
>
>
> On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
> >
> > What I'm saying above is that even without the wait_on_buffer ext2 can
> > screwup itself because the splice happens after the buffer are just all
> > uptodate so any "reader" (I mean any reader through ext2 not through
> > block_dev) will never try to do a bread on that blocks before they're
> > just zeroed and uptodate.
>
> I assume you meant "..can _not_ screw up itself..", otherwise the rest of

yes, it was a typo sorry.

Andrea
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Thu, Apr 26, 2001 at 04:49:20PM -0400, Alexander Viro wrote:
> 	getblk(); if (!buffer_uptodate) wait_on_buffer();
> is not in that class. It _is_ OK on UP as long as we don't block, but on
> SMP it doesn't guarantee anything - buffer_head can be in any state
> when we are done. IMO all such places require fixing.

Yes, actually it's probably ok for most of the other "fs" against "fs"
cases, because those fses still hold the big lock while handling
metadata, but they should really use the lock_buffer way if they want to
protect against the block_dev accesses too.

> How about adding
> 	if (!buffer_uptodate(bh)) {
> 		printk(KERN_ERR "IO error or racy use of wait_on_buffer()");
> 		show_task(current);
> 	}
> in the end of wait_on_buffer() for a while?

At the _top_ of wait_on_buffer would be better than at the end.

Andrea
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Thu, Apr 26, 2001 at 10:11:09PM +0200, Andrea Arcangeli wrote:
> On Thu, Apr 26, 2001 at 03:55:19PM -0400, Alexander Viro wrote:
> >
> >
> > On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
> >
> > > On Thu, Apr 26, 2001 at 03:34:00PM -0400, Alexander Viro wrote:
> > > > Same scenario, but with read-in-progress started before we do getblk(). BTW,
> > >
> > > how can the read in progress see a branch that we didn't spliced yet? We
> >
> > 	fd = open("/dev/hda1", O_RDONLY);
> > 	read(fd, buf, sizeof(buf));
>
> You misunderstood the context of what I said, I perfectly know the race
> you are talking about, I was answering Linus's question "the
> wait_on_buffer isn't even necessary to protect ext2 against ext2". You
> are talking about the other race that is "ext2" against "block_dev", and
> I obviously agree on that one since the first place as I immediatly
> answered you "correct".
>
> What I'm saying above is that even without the wait_on_buffer ext2 can
								       ^^^ "cannot" of course
> screwup itself because the splice happens after the buffer are just all
> uptodate so any "reader" (I mean any reader through ext2 not through
> block_dev) will never try to do a bread on that blocks before they're
> just zeroed and uptodate.
>
> Andrea

Andrea
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Thu, Apr 26, 2001 at 01:08:25PM -0700, Linus Torvalds wrote:
> But the fact is that nobody should ever do the thing that could cause
> problems.

dump in 2.4 also gets an incoherent view of the data, which makes things
even worse than in 2.2 (to change that we should hash in the buffer
hashtable all the bh overlapped in the pagecache, and no, I'm not
suggesting that, relax). The only reason it has a chance to work with
ext2 is because ext2 is very dumb and it misses an inode map, and in turn
inodes are at a predictable location on disk, so it cannot run totally
out of control.

Andrea
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Thu, Apr 26, 2001 at 03:55:19PM -0400, Alexander Viro wrote:
>
>
> On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
>
> > On Thu, Apr 26, 2001 at 03:34:00PM -0400, Alexander Viro wrote:
> > > Same scenario, but with read-in-progress started before we do getblk(). BTW,
> >
> > how can the read in progress see a branch that we didn't spliced yet? We
>
> 	fd = open("/dev/hda1", O_RDONLY);
> 	read(fd, buf, sizeof(buf));

You misunderstood the context of what I said. I perfectly know the race
you are talking about; I was answering Linus's question "the
wait_on_buffer isn't even necessary to protect ext2 against ext2". You
are talking about the other race, that is "ext2" against "block_dev", and
I obviously agreed on that one from the first place, as I immediately
answered you "correct".

What I'm saying above is that even without the wait_on_buffer ext2 can
screwup itself because the splice happens after the buffer are just all
uptodate so any "reader" (I mean any reader through ext2 not through
block_dev) will never try to do a bread on that blocks before they're
just zeroed and uptodate.

Andrea
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Thu, 26 Apr 2001, Richard B. Johnson wrote:

> The disk image, raw.bin, does NOT contain the image of the floppy.
> Most of boot stuff added by lilo is missing. It will eventually
> get there, but it's not there now, even though the floppy was
> un-mounted!

I doubt that you can reproduce that on anything remotely recent.
All buffers are flushed when last user closes device.
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Thu, 26 Apr 2001, Linus Torvalds wrote:

> Note that I think all these arguments are fairly bogus. Doing things like
> "dump" on a live filesystem is stupid and dangerous (in my opinion it is
> stupid and dangerous to use "dump" at _all_, but that's a whole 'nother
> discussion in itself), and there really are no valid uses for opening a
> block device that is already mounted. More importantly, I don't think
> anybody actually does.

Agreed.

> The fact that you _can_ do so makes the patch valid, and I do agree with
> Al on the "least surprise" issue. I've already applied the patch, in fact.
> But the fact is that nobody should ever do the thing that could cause
> problems.

IMO the real issue is in fuzzy rules for use of wait_on_buffer(). There is
one case when it's 100% correct - when we had done ll_rw_block() on that
bh at earlier point and want to make sure that it's completed. And there
is a lot of uses that are kinda-sorta correct for UP, but break on SMP.
unmap_buffer() was similar to that race. So are races in minix, sysvfs
and ufs. So is one in block_write() and here the problem is quite real -
it's not as idiotic as device/mounted fs races.

Basically, all legitimate cases are ones where we would be very unhappy
about buffer being not uptodate afterwards.
	getblk(); if (!buffer_uptodate) wait_on_buffer();
is not in that class. It _is_ OK on UP as long as we don't block, but on
SMP it doesn't guarantee anything - buffer_head can be in any state
when we are done. IMO all such places require fixing.

How about adding
	if (!buffer_uptodate(bh)) {
		printk(KERN_ERR "IO error or racy use of wait_on_buffer()");
		show_task(current);
	}
in the end of wait_on_buffer() for a while?

								Al
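The getblk()/wait_on_buffer() pattern described above can be walked through in a single-threaded userspace model. This is a minimal sketch, not kernel code: struct model_bh, the reader steps and the two interleavings are hypothetical stand-ins, and the reader is assumed to skip buffers that are locked or already uptodate. It plays out one bad ordering by hand and compares it with a variant that holds the buffer lock while initialising the block.

```c
#include <string.h>

#define BLKSZ 8

/* Hypothetical userspace stand-in for a buffer head; NOT the kernel struct. */
struct model_bh {
    int locked;          /* models BH_Lock */
    int uptodate;        /* models BH_Uptodate */
    char data[BLKSZ];
};

/* Old on-disk contents of the block that is being reused. */
static const char stale_disk[BLKSZ] = "STALE!!";

/* Device reader, step 1: skip locked or uptodate buffers, else start a read. */
static int reader_start(struct model_bh *bh)
{
    if (bh->locked || bh->uptodate)
        return 0;
    bh->locked = 1;
    return 1;
}

/* Device reader, step 2: I/O completion copies disk data and unlocks. */
static void reader_finish(struct model_bh *bh)
{
    memcpy(bh->data, stale_disk, BLKSZ);
    bh->uptodate = 1;
    bh->locked = 0;
}

/* Racy pattern: the fs saw an unlocked buffer, so its wait returned at once. */
static char run_racy_interleaving(void)
{
    struct model_bh bh = {0};
    int io_started;

    io_started = reader_start(&bh);  /* reader slips in after the fs check */
    memset(bh.data, 0, BLKSZ);       /* fs zeroes the new metadata block */
    bh.uptodate = 1;                 /* fs declares it valid */
    if (io_started)
        reader_finish(&bh);          /* read completes over the zeroes */
    return bh.data[0];
}

/* Locked pattern: the fs holds the buffer lock while it initialises. */
static char run_locked_interleaving(void)
{
    struct model_bh bh = {0};
    int io_started;

    bh.locked = 1;                   /* fs takes the lock first */
    io_started = reader_start(&bh);  /* reader sees the lock and backs off */
    memset(bh.data, 0, BLKSZ);
    bh.uptodate = 1;
    bh.locked = 0;                   /* fs releases the lock */
    if (io_started)
        reader_finish(&bh);
    if (reader_start(&bh))           /* later read: already uptodate, no I/O */
        reader_finish(&bh);
    return bh.data[0];
}
```

In the racy ordering the block ends up holding the stale disk contents even though the filesystem zeroed it; in the locked ordering the zeroes survive. On real SMP hardware the ordering is a matter of timing, which the model replaces with a fixed sequence.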
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Thu, 26 Apr 2001, Alexander Viro wrote:

> On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
>
> > > the wait-on-buffer is not strictly necessary: it's probably there to make
> >
> > maybe not but I need to check some more bit to be sure.
>
> Same scenario, but with read-in-progress started before we do getblk(). BTW,
> old writeback is harmless - we will overwrite anyway. And _that_ can happen
> without direct access to device - truncate() doesn't terminate writeout of
> the indirect blocks it frees (IMO it should, but that's another story).
>

This seems to be the problem reported about a year ago, but never fixed.
It exists, even in early kernels.

mke2fs /dev/fd0
mount /dev/fd0 /mnt
cp stuff /mnt
lilo -C -
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
>
> What I'm saying above is that even without the wait_on_buffer ext2 can
> screwup itself because the splice happens after the buffer are just all
> uptodate so any "reader" (I mean any reader through ext2 not through
> block_dev) will never try to do a bread on that blocks before they're
> just zeroed and uptodate.

I assume you meant "..can _not_ screw up itself..", otherwise the rest of
the sentence doesn't seem to make much sense.

Linus
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Thu, Apr 26, 2001 at 03:17:54PM -0400, Alexander Viro wrote:
>
> On Thu, 26 Apr 2001, I wrote:
>
> > On Thu, 26 Apr 2001, Linus Torvalds wrote:
> >
> > > I see the race, but I don't see how you can actually trigger it.
> > >
> > > Exactly _who_ does the "read from device" part? Somebody doing a
> > > "fsck" while the filesystem is mounted read-write and actively written
> > > to? Yeah, you'd get disk corruption that way, but you'll get it regardless
> > > of this bug.
>
> OK, I think I've a better explanation now:
>
> Suppose /dev/hda1 is owned by root.disks and permissions are 640.
> It is mounted read-write.
>
> Process foo belongs to pfy.staff. PFY is included into disks, but doesn't
> have root. I claim that he should be unable to cause fs corruption on
> /dev/hda1.
>
> Currently foo _can_ cause such corruption, even though it has nothing
> resembling write permissions for device in question.
>
> IMO it is wrong. I'm not saying that it's a real security problem. I'm
> not saying that PFY is not idiot or that his actions make any sense.
> However, I think that situation when he can do that without write
> access to device is just plain wrong.
>
> Does the above make sense?

Sure. And as said `dump` has the same issues.

Andrea
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Thu, 26 Apr 2001, Alexander Viro wrote:
> On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
> >
> > how can the read in progress see a branch that we didn't spliced yet? We
>
> fd = open("/dev/hda1", O_RDONLY);
> read(fd, buf, sizeof(buf));

Note that I think all these arguments are fairly bogus. Doing things like
"dump" on a live filesystem is stupid and dangerous (in my opinion it is
stupid and dangerous to use "dump" at _all_, but that's a whole 'nother
discussion in itself), and there really are no valid uses for opening a
block device that is already mounted. More importantly, I don't think
anybody actually does.

The fact that you _can_ do so makes the patch valid, and I do agree with
Al on the "least surprise" issue. I've already applied the patch, in fact.
But the fact is that nobody should ever do the thing that could cause
problems.

Linus
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Thu, 26 Apr 2001, Andrea Arcangeli wrote:

> On Thu, Apr 26, 2001 at 03:34:00PM -0400, Alexander Viro wrote:
> > Same scenario, but with read-in-progress started before we do getblk(). BTW,
>
> how can the read in progress see a branch that we didn't spliced yet? We

        fd = open("/dev/hda1", O_RDONLY);
        read(fd, buf, sizeof(buf));
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Thu, Apr 26, 2001 at 03:34:00PM -0400, Alexander Viro wrote:
> Same scenario, but with read-in-progress started before we do getblk(). BTW,

how can the read in progress see a branch that we didn't spliced yet? We
clear and mark uptodate the new part of the branch before it's visible to
any reader no? then in splice we write the key into the where->p and the
branch become visible to the readers but by that time the reader won't
start I/O because the buffer are just uptodate.

I only had a short look now and to verify Ingo's fix, so maybe I
overlooked something.

> without direct access to device - truncate() doesn't terminate writeout of
> the indirect blocks it frees (IMO it should, but that's another story).

If the block is under I/O or dirty that's another story, the only issue
here is when the buffer block is new and not uptodate.

Andrea
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Thu, Apr 26, 2001 at 09:15:57PM +0200, Andrea Arcangeli wrote:
> maybe not but I need to check some more bit to be sure.

yes we probably don't need it for fs against fs in 2.4 because we make the
new metadata block visible to a reader (splice) only after they're all
uptodate and all directory operations are serialized by the vfs (with the
i_zombie).

Andrea
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Thu, 26 Apr 2001, Andrea Arcangeli wrote:

> > the wait-on-buffer is not strictly necessary: it's probably there to make
>
> maybe not but I need to check some more bit to be sure.

Same scenario, but with read-in-progress started before we do getblk(). BTW,
old writeback is harmless - we will overwrite anyway. And _that_ can happen
without direct access to device - truncate() doesn't terminate writeout of
the indirect blocks it frees (IMO it should, but that's another story).
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Thu, Apr 26, 2001 at 11:49:14AM -0700, Linus Torvalds wrote:
>
> On Thu, Apr 26, 2001 at 11:45:47AM -0400, Alexander Viro wrote:
> >
> > Ext2 does getblk+wait_on_buffer for new metadata blocks before
> > filling them with zeroes. While that is enough for single-processor,
> > on SMP we have the following race:
> >
> > getblk gives us unlocked, non-uptodate bh
> > wait_on_buffer() does nothing
> > read from device locks it and starts IO
>
> I see the race, but I don't see how you can actually trigger it.
>
> Exactly _who_ does the "read from device" part? Somebody doing a

/sbin/dump

> the wait-on-buffer is not strictly necessary: it's probably there to make

maybe not but I need to check some more bit to be sure.

Andrea
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Thu, 26 Apr 2001, I wrote:

> On Thu, 26 Apr 2001, Linus Torvalds wrote:
>
> > I see the race, but I don't see how you can actually trigger it.
> >
> > Exactly _who_ does the "read from device" part? Somebody doing a
> > "fsck" while the filesystem is mounted read-write and actively written
> > to? Yeah, you'd get disk corruption that way, but you'll get it regardless
> > of this bug.

OK, I think I've a better explanation now:

Suppose /dev/hda1 is owned by root.disks and permissions are 640.
It is mounted read-write.

Process foo belongs to pfy.staff. PFY is included into disks, but doesn't
have root. I claim that he should be unable to cause fs corruption on
/dev/hda1.

Currently foo _can_ cause such corruption, even though it has nothing
resembling write permissions for device in question.

IMO it is wrong. I'm not saying that it's a real security problem. I'm
not saying that PFY is not idiot or that his actions make any sense.
However, I think that situation when he can do that without write
access to device is just plain wrong.

Does the above make sense?

Al
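[Archive note] The permission setup in the scenario above can be spelled out as the classic owner/group/other mode check. This is a simplified sketch, not the kernel's permission code: the struct and function names are hypothetical, any numeric IDs passed in are placeholders, and the superuser override is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical process credentials: uid, primary gid, supplementary groups. */
struct sketch_cred {
    unsigned uid;
    unsigned gid;
    const unsigned *groups;
    size_t ngroups;
};

/* True if gid is the primary group or one of the supplementary groups. */
bool sketch_in_group(const struct sketch_cred *c, unsigned gid)
{
    if (c->gid == gid)
        return true;
    for (size_t i = 0; i < c->ngroups; i++)
        if (c->groups[i] == gid)
            return true;
    return false;
}

/*
 * Classic Unix mode check. want is a mask of 4 (read), 2 (write), 1 (exec).
 * Exactly one class applies: owner, else group, else other.
 */
bool sketch_may_access(const struct sketch_cred *c, unsigned owner,
                       unsigned group, unsigned mode, unsigned want)
{
    unsigned bits;

    assert(c != NULL);
    if (c->uid == owner)
        bits = (mode >> 6) & 7;
    else if (sketch_in_group(c, group))
        bits = (mode >> 3) & 7;
    else
        bits = mode & 7;
    return (bits & want) == want;
}
```

With mode 0640, a process that is not the owner but is a member of the device's group gets read access (4) and is refused write access (2). That is the position foo is in: it may open the device read-only, yet under the race it can still end up changing what is written back.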
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Thu, 26 Apr 2001, Linus Torvalds wrote:

> I see the race, but I don't see how you can actually trigger it.
>
> Exactly _who_ does the "read from device" part? Somebody doing a
> "fsck" while the filesystem is mounted read-write and actively written
> to? Yeah, you'd get disk corruption that way, but you'll get it regardless
> of this bug.
> There's nothing else that should be using that block at that stage. And if
> there were, that would be a bug in itself, as far as I can tell. We've
> just allocated it, and we're the only and exclusive owners of that block
> on the disk. Anybody else who touches it is seriously broken.
> Now, I don't disagree with your patch (it's just obviously cleaner to lock
> it properly), but I don't think this is a real bug. I suspect that even
> the wait-on-buffer is not strictly necessary: it's probably there to make
> sure old write-backs have completed, but that doesn't really matter
> either.
>
> We used to have "breada()" do physical read-ahead that could have
> triggered this, but we've long since gotten rid of that.
>
> Or am I overlooking something?

Somebody doing dd(1) _from_ that disk. Sure, he's bound to get crap. But
I really don't think that opening device for read should be able to affect
its contents in any way.

BTW, same race exists between block_read() and block_write(). And that one
is even more obviously wrong:

xterm A:                            xterm B:
dd if=/dev/hda of=/dev/hdb          dd if=/dev/hdb of=/dev/null

result: some blocks on hdb retaining their old contents.

IMO "no matter what you read, you don't affect the contents" is a good
general principle. Sure, you can get crap if you read in the middle of
write. That's expected and sane. However, the final contents of file
depends only on the things done by writers.

Al
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Thursday, April 26, 2001 02:24:26 PM -0400 Alexander Viro <[EMAIL PROTECTED]> wrote:

>
> On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
>
>> correct. I bet other fs are affected as well btw.
>
> If only... block_read() vs. block_write() has the same race. I'm going
> through the list of all wait_on_buffer() users right now.
>

Looks like reiserfs has it too when allocating tree blocks, but it should
be harder to hit. The fix should be small but it will take me a bit to
make sure it doesn't affect the rest of the balancing code.

-chris
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Thu, Apr 26, 2001 at 11:45:47AM -0400, Alexander Viro wrote:
>
> Ext2 does getblk+wait_on_buffer for new metadata blocks before
> filling them with zeroes. While that is enough for single-processor,
> on SMP we have the following race:
>
> getblk gives us unlocked, non-uptodate bh
> wait_on_buffer() does nothing
> read from device locks it and starts IO

I see the race, but I don't see how you can actually trigger it.

Exactly _who_ does the "read from device" part? Somebody doing a
"fsck" while the filesystem is mounted read-write and actively written
to? Yeah, you'd get disk corruption that way, but you'll get it regardless
of this bug.

There's nothing else that should be using that block at that stage. And if
there were, that would be a bug in itself, as far as I can tell. We've
just allocated it, and we're the only and exclusive owners of that block
on the disk. Anybody else who touches it is seriously broken.

Now, I don't disagree with your patch (it's just obviously cleaner to lock
it properly), but I don't think this is a real bug. I suspect that even
the wait-on-buffer is not strictly necessary: it's probably there to make
sure old write-backs have completed, but that doesn't really matter
either.

We used to have "breada()" do physical read-ahead that could have
triggered this, but we've long since gotten rid of that.

Or am I overlooking something?

Linus
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Thu, 26 Apr 2001, Andrea Arcangeli wrote:

> correct. I bet other fs are affected as well btw.

If only... block_read() vs. block_write() has the same race. I'm going
through the list of all wait_on_buffer() users right now.
Re: [PATCH] SMP race in ext2 - metadata corruption.
On Thu, Apr 26, 2001 at 11:45:47AM -0400, Alexander Viro wrote:
> Ext2 does getblk+wait_on_buffer for new metadata blocks before
> filling them with zeroes. While that is enough for single-processor,
> on SMP we have the following race:
>
> getblk gives us unlocked, non-uptodate bh
> wait_on_buffer() does nothing
> read from device locks it and starts IO
> we zero it out.
> on-disk data overwrites our zeroes.
> we mark it dirty
> bdflush writes the old data (_not_ zeroes) back to disk.
>
> Result: crap in metadata block. Proposed fix: lock_buffer()/unlock_buffer()
> around memset()/mark_buffer_uptodate() instead of wait_on_buffer() before
> them.
>
> Patch against 2.4.4-pre7 follows. Please, apply.

correct. I bet other fs are affected as well btw.

Andrea
[PATCH] SMP race in ext2 - metadata corruption.
Ext2 does getblk+wait_on_buffer for new metadata blocks before
filling them with zeroes. While that is enough for single-processor,
on SMP we have the following race:

getblk gives us unlocked, non-uptodate bh
wait_on_buffer() does nothing
read from device locks it and starts IO
we zero it out.
on-disk data overwrites our zeroes.
we mark it dirty
bdflush writes the old data (_not_ zeroes) back to disk.

Result: crap in metadata block. Proposed fix: lock_buffer()/unlock_buffer()
around memset()/mark_buffer_uptodate() instead of wait_on_buffer() before
them.

Patch against 2.4.4-pre7 follows. Please, apply.

Al

--- S4-pre7/fs/ext2/inode.c Wed Apr 25 20:43:08 2001
+++ S4-pre7-ext2/fs/ext2/inode.c Thu Apr 26 11:36:11 2001
@@ -397,13 +397,13 @@
                  * the pointer to new one, then send parent to disk.
                  */
                 bh = getblk(inode->i_dev, parent, blocksize);
-                if (!buffer_uptodate(bh))
-                        wait_on_buffer(bh);
+                lock_buffer(bh);
                 memset(bh->b_data, 0, blocksize);
                 branch[n].bh = bh;
                 branch[n].p = (u32*) bh->b_data + offsets[n];
                 *branch[n].p = branch[n].key;
                 mark_buffer_uptodate(bh, 1);
+                unlock_buffer(bh);
                 mark_buffer_dirty_inode(bh, inode);
                 if (IS_SYNC(inode) || inode->u.ext2_i.i_osync) {
                         ll_rw_block (WRITE, 1, &bh);
@@ -587,10 +587,10 @@
         struct buffer_head *bh;
         bh = getblk(dummy.b_dev, dummy.b_blocknr, inode->i_sb->s_blocksize);
         if (buffer_new(&dummy)) {
-                if (!buffer_uptodate(bh))
-                        wait_on_buffer(bh);
+                lock_buffer(bh);
                 memset(bh->b_data, 0, inode->i_sb->s_blocksize);
                 mark_buffer_uptodate(bh, 1);
+                unlock_buffer(bh);
                 mark_buffer_dirty_inode(bh, inode);
         }
         return bh;
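[Archive note] The interleaving listed in this message can be replayed step by step in a userspace model. What follows is a sketch under stated assumptions rather than kernel code: `struct model_bh`, `reader_start_io()`, `reader_complete_io()`, `replay_old()` and `replay_fixed()` are hypothetical names, the second CPU's device read is split into two explicit calls so the ordering is deterministic, and a blocked reader is modelled as a refused attempt that is retried later.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define BLOCKSIZE 8

/* Hypothetical userspace stand-in for a buffer_head; not the kernel struct. */
struct model_bh {
    bool locked;
    bool uptodate;
    bool io_pending;                /* a device read has been started */
    unsigned char data[BLOCKSIZE];
};

/* Stale bytes left on disk in the freshly allocated block. */
static const unsigned char stale_on_disk[BLOCKSIZE] = {
    0xde, 0xad, 0xbe, 0xef, 0xde, 0xad, 0xbe, 0xef
};

/* Reader, step 1: start a read only if bh is unlocked and not uptodate. */
bool reader_start_io(struct model_bh *bh)
{
    if (bh->locked || bh->uptodate)
        return false;
    bh->locked = true;
    bh->io_pending = true;
    return true;
}

/* Reader, step 2: completion copies the on-disk bytes in and unlocks. */
void reader_complete_io(struct model_bh *bh)
{
    if (!bh->io_pending)
        return;
    memcpy(bh->data, stale_on_disk, BLOCKSIZE);
    bh->uptodate = true;
    bh->io_pending = false;
    bh->locked = false;
}

bool is_zeroed(const struct model_bh *bh)
{
    for (int i = 0; i < BLOCKSIZE; i++)
        if (bh->data[i] != 0)
            return false;
    return true;
}

/* Old sequence: getblk + wait_on_buffer, then memset. */
void replay_old(struct model_bh *bh)
{
    memset(bh, 0, sizeof(*bh));       /* getblk: unlocked, non-uptodate bh */
    assert(!bh->locked);              /* wait_on_buffer() has nothing to wait for */
    reader_start_io(bh);              /* read from device locks it, starts IO */
    memset(bh->data, 0, BLOCKSIZE);   /* we zero it out */
    reader_complete_io(bh);           /* on-disk data overwrites our zeroes */
    bh->uptodate = true;              /* mark uptodate; it would then go out dirty */
}

/* Patched sequence: lock_buffer()/unlock_buffer() around the memset. */
void replay_fixed(struct model_bh *bh)
{
    memset(bh, 0, sizeof(*bh));       /* getblk */
    bh->locked = true;                /* lock_buffer() */
    bool started = reader_start_io(bh);
    assert(!started);                 /* reader finds it locked and must wait */
    memset(bh->data, 0, BLOCKSIZE);
    bh->uptodate = true;              /* mark_buffer_uptodate() */
    bh->locked = false;               /* unlock_buffer() */
    reader_start_io(bh);              /* retry: already uptodate, no IO issued */
    reader_complete_io(bh);           /* nothing pending, nothing overwritten */
}
```

After `replay_old()` the buffer holds the stale bytes even though the writer zeroed it; after `replay_fixed()` it holds zeroes. The model does not cover the other ordering, where the read starts before lock_buffer() is called; there the writer would simply sleep in lock_buffer() until the IO finished and then zero the block.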