Re: inode leak in 2.6.24?
On Wed, Feb 20, 2008 at 03:36:53PM +0100, Ferenc Wagner wrote:
> David Chinner <[EMAIL PROTECTED]> writes:
>
> > The xfs inodes are clearly pinned by the dentry cache, so the issue
> > is dentries, not inodes. What's causing dentries not to be
> > reclaimed? I can't see anything that could pin them (e.g. no filps
> > that would indicate open files being responsible), so my initial
> > thought is that memory reclaim may have changed behaviour.
> >
> > I guess the first thing to find out is whether memory pressure
> > results in freeing the dentries. To simulate memory pressure causing
> > slab cache reclaim, can you run:
> >
> > # echo 2 > /proc/sys/vm/drop_caches
> >
> > and see if the number of dentries and inodes drops. If the number
> > goes down significantly, then we aren't leaking dentries and there's
> > been a change in memory reclaim behaviour. If it stays the same, then
> > we probably are leaking dentries.
>
> Hi Dave,
>
> Thanks for looking into this. There's no real conclusion yet: the
> simulated memory pressure sent the numbers down all right, but
> meanwhile it turned out that this is a different case: on this machine
> the increase wasn't constant growth, but related to the daily
> updatedb job. I'll reload the original kernel on the original
> machine and collect the same info if the problem reappears.

OK, let me know how it goes when you get a chance.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
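The drop_caches test above can be scripted so the before/after numbers are easy to compare. A minimal sketch (the helper name and file handling are my own, not from the thread): extract the active-object counts for the caches in question from /proc/slabinfo on either side of the reclaim trigger; `echo 2` drops only reclaimable dentries and inodes, leaving the page cache alone.

```shell
#!/bin/sh
# Print active object counts for the dentry and XFS inode caches from a
# slabinfo snapshot. Pass /proc/slabinfo on a live machine; a saved copy
# works too, which keeps before/after samples comparable.
slab_counts() {
    awk '$1 == "dentry" || $1 == "xfs_inode" || $1 == "xfs_vnode" {
        printf "%s %s\n", $1, $2    # cache name, active objects
    }' "$1"
}

# On the affected machine (as root):
#   slab_counts /proc/slabinfo > before.txt
#   echo 2 > /proc/sys/vm/drop_caches    # reclaim dentries and inodes only
#   slab_counts /proc/slabinfo > after.txt
#   diff before.txt after.txt
```

If `after.txt` shows the counts collapsing back toward the fresh-boot values, the dentries were reclaimable all along and the question shifts to why normal memory pressure wasn't reclaiming them.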
Re: inode leak in 2.6.24?
On Tue, Feb 19, 2008 at 05:57:08PM +0100, Ferenc Wagner wrote:
> David Chinner <[EMAIL PROTECTED]> writes:
>
> > On Sat, Feb 16, 2008 at 12:18:58AM +0100, Ferenc Wagner wrote:
>
> So, I loaded the same kernel on a different machine, but that seems to
> exhibit a very similar behaviour. The machine is absolutely idle,
> nobody logged in during this period, though an updatedb ran during
> this period. However, the increase seems steady, not correlated to
> cron.daily.
>
> The contents of /proc/sys/fs/inode-nr after reboot was:
> 4421 95
>
> and now, 13h35m later, it's:
> 1461820
>
> Find the two slabinfo outputs attached.

Content-Description: slabinfo output immediately after reboot
> xfs_inode    791    800    384   10  1 : tunables   54   27  8 : slabdata     80     80      0
> xfs_vnode    790    790    384   10  1 : tunables   54   27  8 : slabdata     79     79      0
> dentry      5133   5133    132   29  1 : tunables  120   60  8 : slabdata    177    177      0

Content-Description: slabinfo output 13h35m after reboot
> xfs_inode 142548 142550    384   10  1 : tunables   54   27  8 : slabdata  14255  14255      0
> xfs_vnode 142548 142550    384   10  1 : tunables   54   27  8 : slabdata  14255  14255      0
> dentry    148003 148074    132   29  1 : tunables  120   60  8 : slabdata   5106   5106      0

> The xfs_inode, xfs_vnode and dentry lines show a significant increase.
> The machine indeed uses XFS as its root filesystem. Hope this gives
> enough clues to narrow down the problem. I can try other kernels if
> needed.

The xfs inodes are clearly pinned by the dentry cache, so the issue
is dentries, not inodes. What's causing dentries not to be
reclaimed? I can't see anything that could pin them (e.g. no filps
that would indicate open files being responsible), so my initial
thought is that memory reclaim may have changed behaviour.

I guess the first thing to find out is whether memory pressure
results in freeing the dentries. To simulate memory pressure causing
slab cache reclaim, can you run:

# echo 2 > /proc/sys/vm/drop_caches

and see if the number of dentries and inodes drops. If the number
goes down significantly, then we aren't leaking dentries and there's
been a change in memory reclaim behaviour. If it stays the same, then
we probably are leaking dentries.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [PATCH] Implement barrier support for single device DM devices
On Tue, Feb 19, 2008 at 02:39:00AM +0000, Alasdair G Kergon wrote:
> > For example, how safe
> > is xfs if barriers are not supported or turned off?
>
> The last time we tried xfs with dm it didn't seem to notice -EOPNOTSUPP
> everywhere it should => recovery may find corruption.

Bug reports, please. What we don't know about, we can't fix.

As of this commit:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=0bfefc46dc028df60120acdb92062169c9328769

XFS should be handling all cases of -EOPNOTSUPP for barrier I/Os.
If you are still having problems, please let us know.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
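When XFS gets -EOPNOTSUPP back from a barrier write, it logs a notice and falls back to non-barrier I/O, so the kernel log is the place to confirm which path a given device took. A small helper for that (the function is my own, and the exact message wording varies across kernel versions, so it matches loosely):

```shell
#!/bin/sh
# Scan a saved kernel log for XFS barrier-fallback notices. Pass a file
# containing `dmesg` output (or /var/log/kern.log).
barrier_fallback() {
    grep -i 'disabling barriers' "$1" && return 0
    echo "no barrier fallback logged"
    return 1
}

# On a live machine:
#   dmesg > /tmp/kern.txt
#   barrier_fallback /tmp/kern.txt
```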
Re: xfsaild causing 30+ wakeups/s on an idle system since 2.6.25-rcX
On Mon, Feb 18, 2008 at 03:22:02PM -0800, Linda Walsh wrote:
> Not to look excessively dumb, but what's xfsaild?

AIL = Active Item List. It is a sorted list of all the logged metadata
objects that have not yet been written back to disk. The xfsaild is
responsible for tail pushing the log, i.e. writing back objects in the
AIL in the most efficient manner possible.

Why a thread? Because allowing parallelism in tail pushing is a
scalability problem, and moving this to its own thread completely
isolates it from parallelism. Tail pushing only requires a small
amount of CPU time, but it requires a global scope spinlock. Isolating
the pushing to a single CPU means the spinlock is not contended across
every CPU in the machine.

How much did it improve scalability? On a 2048p machine with an MPI
job that did a synchronised close of 12,000 files (6 per CPU), the
close time went from ~5400s without the thread to 9s with the xfsaild.
That's only about 600x faster. ;)

> xfs seems to be sprouting daemons at a more rapid pace
> these days...xfsbufd, xfssyncd, xfsdatad, xfslogd, xfs_mru_cache, and
> now xfsaild?

Why not? Got to make use of all those cores machines have these days. ;)

Fundamentally, threads are cheap and simple. We'll keep adding threads
where it makes sense, as long as it improves performance and
scalability.

> Are there any design docs (scribbles?) saying what these do and why
> they were added so I can just go read 'em myself? I'm sure they
> were added for good reason...just am curious more than anything.

'git log' is your friend. The commits that introduce the new threads
explain why they are necessary. ;)

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [dm-devel] Re: [PATCH] Implement barrier support for single device DM devices
On Tue, Feb 19, 2008 at 02:56:43AM +0000, Alasdair G Kergon wrote:
> On Tue, Feb 19, 2008 at 09:16:44AM +1100, David Chinner wrote:
> > Surely any hardware that doesn't support barrier
> > operations can emulate them with cache flushes when they receive a
> > barrier I/O from the filesystem.
>
> My complaint about having to support them within dm when more than one
> device is involved is because any efficiencies disappear: you can't send
> further I/O to any one device until all the other devices have completed
> their barrier (or else later I/O to that device could overtake the
> barrier on another device).

Right - it's a horrible performance hit. But - how is what you
describe any different to the filesystem doing:

	- flush block device
	- issue I/O
	- wait for completion
	- flush block device

around any I/O that it would otherwise simply tag as a barrier?
That serialisation at the filesystem layer is a horrible, horrible
performance hit.

And then there's the fact that we can't implement that in XFS because
all the barrier I/Os we issue are asynchronous. We'd basically have
to serialise all metadata operations, and now we are talking about
far worse performance hits than implementing barrier emulation in the
block device.

Also, it's instructive to look at the implementation of
blkdev_issue_flush() - the API one is supposed to use to trigger a
full block device flush. It doesn't work on DM/MD either, because it
uses a no-I/O barrier bio:

	bio->bi_end_io = bio_end_empty_barrier;
	bio->bi_private = &wait;
	bio->bi_bdev = bdev;
	submit_bio(1 << BIO_RW_BARRIER, bio);

	wait_for_completion(&wait);

So, if the underlying block device doesn't support barriers, there's
no point in changing the filesystem to issue flushes, either...

> And then I argue that it would be better
> for the filesystem to have the information that these are not hardware
> barriers so it has the opportunity of tuning its behaviour (e.g.
> flushing less often because it's a more expensive operation).

There is generally no option from the filesystem POV to "flush less".
Either we use barrier I/Os where we need to and are safe with
volatile caches, or we corrupt filesystems with volatile caches when
power loss occurs. There is no in-between where "flushing less" will
save us from corruption.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [dm-devel] Re: [PATCH] Implement barrier support for single device DM devices
On Mon, Feb 18, 2008 at 04:24:27PM +0300, Michael Tokarev wrote:
> First, I still don't understand why in God's sake barriers are "working"
> while regular cache flushes are not. Almost no consumer-grade hard drive
> supports write barriers, but they all support regular cache flushes, and
> the latter should be enough (while not the most speed-optimal) to ensure
> data safety. Why require write cache disable (like in the XFS FAQ)
> instead of going the flush-cache-when-appropriate (as opposed to
> write-barrier-when-appropriate) way?

Devil's advocate: why should we need to support multiple different
block layer APIs to do the same thing? Surely any hardware that
doesn't support barrier operations can emulate them with cache
flushes when they receive a barrier I/O from the filesystem.

Also, given that disabling the write cache still allows CTQ/NCQ to
operate effectively, and that in most cases WCD+CTQ is as fast as
WCE+barriers, the simplest thing to do is turn off volatile write
caches and not require any extra software kludges for safe operation.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: inode leak in 2.6.24?
On Sat, Feb 16, 2008 at 12:18:58AM +0100, Ferenc Wagner wrote:
> Hi,
>
> 5 days ago I pulled the git tree (HEAD was
> 25f666300625d894ebe04bac2b4b3aadb907c861), added two minor patches
> (the vmsplice fix and the GFS1 exports), compiled and booted the
> kernel. Things are working OK, but I noticed that inode usage has
> been steadily rising since then (see attached graph, unless lost in
> transit). The real filesystems used by the machine are XFS. I wonder
> if it may be some kind of bug and if yes, whether it has been fixed
> already. Feel free to ask for any missing information.

Output of /proc/slabinfo, please. If you could get a sample when the
machine has just booted, one at the typical post-boot steady state,
as well as one after it has increased well past normal, it would help
identify what type of inode is increasing differently.

Also, can you tell us what metrics you are graphing (i.e. where in
/proc or /sys you are getting them from)?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
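Collecting the requested samples can be done with a small snapshot script. This is only a sketch; the function name, output location, and the choice to keep just the inode-related slab lines are my own, not anything asked for in the thread:

```shell
#!/bin/sh
# Take one timestamped snapshot of the inode accounting this thread
# cares about. Arguments let the sources be overridden (e.g. saved
# copies, for testing offline); defaults read the live /proc files.
snapshot() {
    inode_nr=${1:-/proc/sys/fs/inode-nr}
    slabinfo=${2:-/proc/slabinfo}
    outdir=${3:-/var/tmp/inode-debug}   # assumed location
    mkdir -p "$outdir" || return 1
    ts=$(date +%Y%m%d-%H%M%S)
    cp "$inode_nr" "$outdir/inode-nr.$ts" || return 1
    # Keep only the caches of interest so the samples stay small.
    grep -E '^(xfs_inode|xfs_vnode|dentry) ' "$slabinfo" \
        > "$outdir/slab.$ts"
    echo "$outdir/inode-nr.$ts"
}

# Usage on the affected machine: run once just after boot, once at the
# post-boot steady state, and once when the count is well past normal:
#   snapshot
```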
Re: xfsaild causing 30+ wakeups/s on an idle system since 2.6.25-rcX
On Mon, Feb 18, 2008 at 11:41:39AM +0200, Török Edwin wrote:
> David Chinner wrote:
> > On Sun, Feb 17, 2008 at 05:51:08PM +0100, Oliver Pinter wrote:
> >> On 2/17/08, Török Edwin <[EMAIL PROTECTED]> wrote:
> >>> Hi,
> >>>
> >>> xfsaild is causing many wakeups, a quick investigation shows
> >>> xfsaild_push is always returning a 30 msec timeout value.
> >
> > That's a bug
>
> Ok. Your patch fixes the 30+ wakeups :)

Good. I'll push it out for review then.

> > , and has nothing to do with power consumption. ;)
>
> I suggest using a sysctl value (such as
> /proc/sys/vm/dirty_writeback_centisecs), instead of a hardcoded
> default of 1000.

No, too magic. I dislike adding knobs to work around issues that
really should be fixed by having sane default behaviour.

Further down the track, as we correct known issues with the AIL push
code, we'll be able to increase this idle timeout or even make it
purely wakeup driven once we get back to an idle state. However,
right now it still needs that once-a-second wakeup to work around a
nasty corner case that can hang the filesystem.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [PATCH] Implement barrier support for single device DM devices
On Fri, Feb 15, 2008 at 04:07:54PM +0300, Michael Tokarev wrote:
> Alasdair G Kergon wrote:
> > On Fri, Feb 15, 2008 at 01:08:21PM +0100, Andi Kleen wrote:
> >> Implement barrier support for single device DM devices
> >
> > Thanks. We've got some (more-invasive) dm patches in the works that
> > attempt to use flushing to emulate barriers where we can't just
> > pass them down like that.
>
> I wonder if it's worth the effort to try to implement this.
>
> As far as I understand (*), if a filesystem realizes that the
> underlying block device does not support barriers, it will
> switch to using regular flushes instead

No, typically the filesystems won't issue flushes, either.

> - isn't it the same
> thing as you're trying to do on an MD level?
>
> Note that a filesystem must understand barriers/flushes on the
> underlying block device, since many disk drives don't support
> barriers anyway.
>
> (*) this is, in fact, an interesting question. I still can't
> find complete information about this. For example, how safe
> is xfs if barriers are not supported or turned off? Is it
> "less safe" than with barriers? Will it use regular cache
> flushes if barriers are not there?

Try reading the XFS FAQ:

http://oss.sgi.com/projects/xfs/faq/#wcache

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: xfsaild causing 30+ wakeups/s on an idle system since 2.6.25-rcX
On Sun, Feb 17, 2008 at 05:51:08PM +0100, Oliver Pinter wrote:
> On 2/17/08, Török Edwin <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > xfsaild is causing many wakeups, a quick investigation shows
> > xfsaild_push is always returning a 30 msec timeout value.

That's a bug, and has nothing to do with power consumption. ;)

I can see that there is a dirty item in the filesystem:

Entering kdb (current=0xe0b8f4fe, pid 30046) on processor 3 due to
Breakpoint @ 0xa00100454fc0
[3]kdb> bt
Stack traceback for pid 30046
0xe0b8f4fe 30046 2 13 R 0xe0b8f4fe0340 *xfsaild
 0xa00100454fc0 xfsaild_push args (0xe0b89090, 0xe0b8f4fe7e30, 0x31b)
[3]kdb> xail 0xe0b89090
AIL for mp 0xe0b89090, oldest first
[0] type buf flags: 0x1 in ail  lsn [1:13133]
    buf 0xe0b880258800 blkno 0x0 flags: 0x2 dirty
    Superblock (at 0xe0b8f9b3c000)
[3]kdb>

The superblock is dirty, and the lsn is well beyond the target of the
xfsaild, hence it *should* be idling. However, it isn't idling,
because there is a dirty item in the list and the idle trigger of
"is list empty" is not tripping.

I only managed to reproduce this on a lazy superblock counter
filesystem (i.e. new mkfs and recent kernel), as it logs the
superblock every so often, and that is probably what is keeping the
fs dirty like this.

Can you see if the patch below fixes the problem.

---

Idle state is not being detected properly by the xfsaild push code.
The current idle state is detected by an empty list, which may never
happen with a mostly idle filesystem or one using lazy superblock
counters. A single dirty item in the list will result in repeated
looping to push everything past the target, because it fails to check
whether we managed to push anything. Fix this by considering a dirty
list with everything past the target as an idle state and setting the
timeout appropriately.

Signed-off-by: Dave Chinner <[EMAIL PROTECTED]>
---
 fs/xfs/xfs_trans_ail.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/xfs_trans_ail.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_trans_ail.c	2008-02-18 09:14:34.000000000 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_trans_ail.c	2008-02-18 09:18:52.070682570 +1100
@@ -261,14 +261,17 @@ xfsaild_push(
 		xfs_log_force(mp, (xfs_lsn_t)0, XFS_LOG_FORCE);
 	}
 
-	/*
-	 * We reached the target so wait a bit longer for I/O to complete and
-	 * remove pushed items from the AIL before we start the next scan from
-	 * the start of the AIL.
-	 */
-	if ((XFS_LSN_CMP(lsn, target) >= 0)) {
+	if (count && (XFS_LSN_CMP(lsn, target) >= 0)) {
+		/*
+		 * We reached the target so wait a bit longer for I/O to
+		 * complete and remove pushed items from the AIL before we
+		 * start the next scan from the start of the AIL.
+		 */
 		tout += 20;
 		last_pushed_lsn = 0;
+	} else if (!count) {
+		/* We're past our target or empty, so idle */
+		tout = 1000;
 	} else if ((restarts > XFS_TRANS_PUSH_AIL_RESTARTS) ||
 		   (count && ((stuck * 100) / count > 90))) {
 		/*
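With the patch applied, the wakeup rate can be rechecked without powertop by sampling /proc/timer_stats (CONFIG_TIMER_STATS, the interface of kernels from this era). The helper below is illustrative, not from the thread; it sums the event counts of every timer_stats line mentioning a given thread name, so dividing by the sample period gives wakeups per second.

```shell
#!/bin/sh
# Sum timer events for one thread from a /proc/timer_stats capture.
# Capture procedure (as root):
#   echo 1 > /proc/timer_stats; sleep 60; echo 0 > /proc/timer_stats
#   cat /proc/timer_stats > /tmp/timer_stats.txt
wakeups_for() {   # $1 = captured timer_stats file, $2 = thread name
    # timer_stats lines start with "<count>,"; awk's numeric coercion
    # strips the trailing comma when adding.
    awk -v comm="$2" '$0 ~ comm { n += $1 } END { print n + 0 }' "$1"
}

# Example: wakeups_for /tmp/timer_stats.txt xfsaild
```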
Re: xfsaild causing 30+ wakeups/s on an idle system since 2.6.25-rcX
Re: [PATCH] Implement barrier support for single device DM devices
On Fri, Feb 15, 2008 at 04:07:54PM +0300, Michael Tokarev wrote:
> Alasdair G Kergon wrote:
> > On Fri, Feb 15, 2008 at 01:08:21PM +0100, Andi Kleen wrote:
> > > Implement barrier support for single device DM devices
> >
> > Thanks. We've got some (more-invasive) dm patches in the works that
> > attempt to use flushing to emulate barriers where we can't just pass
> > them down like that.
>
> I wonder if it's worth the effort to try to implement this. As far as
> I understand (*), if a filesystem realizes that the underlying block
> device does not support barriers, it will switch to using regular
> flushes instead

No, typically the filesystems won't issue flushes, either.

> - isn't it the same thing as you're trying to do on an MD level?
> Note that a filesystem must understand barriers/flushes on the
> underlying block device, since many disk drives don't support
> barriers anyway.
>
> (*) this is, in fact, an interesting question. I still can't find
> complete information about this. For example, how safe is xfs if
> barriers are not supported or turned off? Is it less safe than with
> barriers? Will it use regular cache flushes if barriers are not here?

Try reading the XFS FAQ:

http://oss.sgi.com/projects/xfs/faq/#wcache

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: linux-next: first tree
On Fri, Feb 15, 2008 at 11:50:40AM +1100, Stephen Rothwell wrote:
> Hi David,
>
> On Fri, 15 Feb 2008 10:17:02 +1100 David Chinner <[EMAIL PROTECTED]> wrote:
> >
> > The current XFS tree that goes into -mm is:
> >
> > git://oss.sgi.com:8090/xfs/xfs-2.6.git master
>
> Added, thanks.
>
> I have put you as the contact point - is this correct?

Not really ;)

> I notice that the MAINTAINERS file has a different contact for XFS.

Yup - [EMAIL PROTECTED]

If you want a real person on the end of problem reports, feel free to leave me there, but you should also cc the xfs-masters address to guarantee that the problem is seen and dealt with promptly, as I'm not always available.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: linux-next: first tree
On Fri, Feb 15, 2008 at 12:35:37AM +1100, Stephen Rothwell wrote:
> Also, more trees please ... :-)

The current XFS tree that goes into -mm is:

git://oss.sgi.com:8090/xfs/xfs-2.6.git master

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: xfs [_fsr] probs in 2.6.24.0
On Tue, Feb 12, 2008 at 01:02:05PM -0800, Linda Walsh wrote:
> David Chinner wrote:
> > Perhaps by running xfs_fsr manually you could reproduce the
> > problem while you are sitting in front of the machine...
>
> Um...yeah, AND with multiple "cp"s of multi-gig files going on at the
> same time, both local, by a sister machine via NFS, and a 3rd machine
> tapping (not banging) away via CIFS. These were on top of normal
> server duties. Whenever I stress it on *purpose* and watch it, it
> works fine. GRRR...I HATE THAT!!!

I feel your pain.

> The file system(s) going "offline" due to xfs-detected filesystem
> errors has only happened *once* on asa, the 64-bit machine. It's a
> fairly new machine w/o added hardware -- but this only happened in
> 2.6.24.0 when I added the tickless clock option, which sure seems like
> a remote possibility for causing an xfs error, but could be.

Well, tickless is new and shiny and I doubt anyone has done much testing with XFS on tickless kernels. Still, if that's a new config option you set, change it back to what you had for .23 on that hardware and try again.

> > Looking at it in terms of i_mutex, other filesystems hold i_mutex
> > over dio_get_page() (all those that use DIO_LOCKING), so the
> > question is whether we are allowed to take the i_mutex in
> > ->release. I note that both reiserfs and hfsplus take i_mutex in
> > ->release as well as use DIO_LOCKING, so this problem is not
> > isolated to XFS.
> >
> > However, it would appear that mmap_sem -> i_mutex is illegal
> > according to the comment at the head of mm/filemap.c. While we are
> > not using i_mutex in this case, the inversion would seem to be
> > equivalent in nature.
> >
> > There's not going to be a quick fix for this.
>
> What could be the consequences of this locking anomaly?

If you have a multithreaded application that mixes mmap and direct I/O, and you have a simultaneous munmap() call and read() to the same file, you might be able to deadlock access to that file.
However, you'd have to be certifiably insane to write an application that did this (mix mmap and direct I/O to the same file at the same time), so I think exposure is pretty limited.

> I.e., for example, in NFS, I have enabled "allow direct I/O on NFS
> files".

That's client side direct I/O, which is not what the server does. Client side direct I/O results in synchronous buffered I/O on the server, which will thrash your disks pretty hard. The config option help does warn you about this. ;)

> > And the other one:
> >
> > > Feb 7 02:01:50 kern: ---
> > > Feb 7 02:01:50 kern: xfs_fsr/6313 is trying to acquire lock:
> > > Feb 7 02:01:50 kern: (&(ip->i_lock)->mr_lock/2){}, at: [803c1b22] xfs_ilock+0x82/0xc0
> > > Feb 7 02:01:50 kern:
> > > Feb 7 02:01:50 kern: but task is already holding lock:
> > > Feb 7 02:01:50 kern: (&(ip->i_iolock)->mr_lock/3){--..}, at: [803c1b45] xfs_ilock+0xa5/0xc0
> > > Feb 7 02:01:50 kern:
> > > Feb 7 02:01:50 kern: which lock already depends on the new lock.
> >
> > Looks like yet another false positive. Basically we do this in
> > xfs_swap_extents:
> >
> > inode A: i_iolock class 2
> > inode A: i_ilock class 2
> > inode B: i_iolock class 3
> > inode B: i_ilock class 3
> > .....
> > inode A: unlock ilock
> > inode B: unlock ilock
> > .....
> >>>>>> inode A: ilock class 2
> > inode B: ilock class 3
> >
> > And lockdep appears to be complaining about the relocking of inode A
> > as class 2 because we've got a class 3 iolock still held, hence
> > violating the order it saw initially. There's no possible deadlock
> > here so we'll just have to add more hacks to the annotation code to
> > make lockdep happy.
>
> Is there a reason to unlock and relock the same inode while the
> level 3 lock is held -- i.e. does 'unlocking ilock' allow some
> increased 'throughput' for some other potential process to access
> the same inode?

It prevents a single thread deadlock when doing transaction reservation. I.e.
the process of setting up a transaction can require the ilock to be taken, and hence we have to drop it before and pick it back up after the transaction reservation. We hold on to the iolock to prevent the inode from having new I/O started while we do the transaction reservation, so it's in the same state after the reservation as it was before. We have to hold both locks to guarantee exclusive access to the inode, so once we have the reservation we need to pick the ilocks back up. The way we do it here does not violate lock ordering at all (iolock before ilock on a single inode, and ascending inode number order for multiple inodes), but lockdep is not smart enough to know that. Hence we need more complex annotations to shut it up.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: xfs [_fsr] probs in 2.6.24.0
On Mon, Feb 11, 2008 at 05:02:05PM -0800, Linda Walsh wrote:
>
> I'm getting similar errors on an x86-32 & x86-64 kernel. The x86-64
> system (2nd log below w/date+times) was unusable this morning: one or
> more of the xfs file systems had "gone off line" due to some unknown
> error (upon reboot, no errors were indicated; all partitions on the
> same physical disk).
>
> I keep ending up with random failures on two systems. The 32-bit sys
> more often than not just "locks-up" -- no messages, no keyboard
> response...etc. Nothing to do except reset -- and of course, nothing
> in the logs.

Filesystem bugs rarely hang systems hard like that - a hardware or driver problem is more likely. And neither of the lockdep reports below is likely to be responsible for a system wide, no-response hang.

[cut a bunch of speculation and stuff about hardware problems causing XFS problems]

> Lemme know if I can provide any info in addressing these problems. I
> don't know if they are related to the other system crashes, but usage
> patterns indicate it could be disk-activity related. So eliminating
> known bugs & problems in that area might help (?)...

If your hardware or drivers are unstable, then XFS cannot be expected to work reliably. Given that xfs_fsr apparently triggers the hangs, I'd suggest putting lots of I/O load on your disk subsystem by copying files around with direct I/O (just like xfs_fsr does) to try to reproduce the problem. Perhaps by running xfs_fsr manually you could reproduce the problem while you are sitting in front of the machine...
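A rough way to generate that kind of load is parallel streams of direct-I/O writes. This is a sketch only - the directory, file sizes, and stream count are placeholders, not anything prescribed above; point it at the filesystem that misbehaves and scale it up.

```shell
dir=$(mktemp -d)
for i in 1 2 3 4; do
    # oflag=direct bypasses the page cache the way xfs_fsr's direct I/O
    # does; fall back to buffered writes if the filesystem rejects O_DIRECT
    dd if=/dev/zero of="$dir/f$i" bs=1M count=8 oflag=direct 2>/dev/null ||
        dd if=/dev/zero of="$dir/f$i" bs=1M count=8 2>/dev/null &
done
wait
n=$(ls "$dir" | wc -l)
echo "wrote $n test files"
rm -rf "$dir"
```

Running several such loops concurrently with normal workload approximates what xfs_fsr does while defragmenting, without having to wait for the nightly run.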
Looking at the lockdep reports:

> The first set of errors that I have some diagnostics for, I could
> only find in dmesg:
> (ish - 32-bit machine)
> ===
> [ INFO: possible circular locking dependency detected ]
> 2.6.24-ish #1
> ---
> xfs_fsr/2119 is trying to acquire lock:
> (&mm->mmap_sem){}, at: [b01a3432] dio_get_page+0x62/0x160
>
> but task is already holding lock:
> (&(ip->i_iolock)->mr_lock){}, at: [b02651db] xfs_ilock+0x5b/0xb0

dio_get_page() takes the mmap_sem of the process's vma that has the pages we do I/O into. That's not new. We're holding the xfs inode iolock at this point to protect against truncate and simultaneous buffered I/O races, and this is also unchanged. I.e. this is normal.

> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:

munmap() dropping the last reference to its vm_file and calling ->release(), which causes a truncate of speculatively allocated space to take place. IOWs, ->release() is called with the mmap_sem held. Hmmm.

Looking at it in terms of i_mutex, other filesystems hold i_mutex over dio_get_page() (all those that use DIO_LOCKING), so the question is whether we are allowed to take the i_mutex in ->release. I note that both reiserfs and hfsplus take i_mutex in ->release as well as use DIO_LOCKING, so this problem is not isolated to XFS.

However, it would appear that mmap_sem -> i_mutex is illegal according to the comment at the head of mm/filemap.c. While we are not using i_mutex in this case, the inversion would seem to be equivalent in nature.

There's not going to be a quick fix for this.
And the other one:

> Feb 7 02:01:50 kern: ---
> Feb 7 02:01:50 kern: xfs_fsr/6313 is trying to acquire lock:
> Feb 7 02:01:50 kern: (&(ip->i_lock)->mr_lock/2){}, at: [803c1b22] xfs_ilock+0x82/0xc0
> Feb 7 02:01:50 kern:
> Feb 7 02:01:50 kern: but task is already holding lock:
> Feb 7 02:01:50 kern: (&(ip->i_iolock)->mr_lock/3){--..}, at: [803c1b45] xfs_ilock+0xa5/0xc0
> Feb 7 02:01:50 kern:
> Feb 7 02:01:50 kern: which lock already depends on the new lock.

Looks like yet another false positive. Basically we do this in xfs_swap_extents:

inode A: i_iolock class 2
inode A: i_ilock class 2
inode B: i_iolock class 3
inode B: i_ilock class 3
.....
inode A: unlock ilock
inode B: unlock ilock
.....
>>>>>> inode A: ilock class 2
inode B: ilock class 3

And lockdep appears to be complaining about the relocking of inode A as class 2 because we've got a class 3 iolock still held, hence violating the order it saw initially. There's no possible deadlock here, so we'll just have to add more hacks to the annotation code to make lockdep happy.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [PATCH] xfs: convert beX_add to beX_add_cpu (new common API)
On Sun, Feb 10, 2008 at 11:18:09AM +0100, Marcin Slusarz wrote:
> This patch was in Andrew's tree, but it was incomplete.
> Here is the updated version.
>
> ---
> remove beX_add functions and replace all uses with beX_add_cpu
>
> Signed-off-by: Marcin Slusarz <[EMAIL PROTECTED]>
> ---

Looks good. You can add a:

Reviewed-by: Dave Chinner <[EMAIL PROTECTED]>

to the patch.

Andrew - feel free to push this to Linus with the other 2 patches in this series when you are ready.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: IO queuing and complete affinity with threads (was Re: [PATCH 0/8] IO queuing and complete affinity)
On Fri, Feb 08, 2008 at 08:59:55AM +0100, Jens Axboe wrote:
> > > > At least they reported it to be the most efficient scheme in
> > > > their testing, and Dave thought that migrating completions out
> > > > to submitters might be a bottleneck in some cases.
> > >
> > > More so than migrating submitters to completers? The advantage of
> > > only moving submitters is that you get rid of the completion
> > > locking. Apart from that, the cost should be the same, especially
> > > for the thread based solution.
> >
> > Not specifically for the block layer, but higher layers like xfs.
>
> True, but that's parallel to the initial statement - that migrating
> completers is more costly than migrating submitters. So I'd like Dave
> to expand on why he thinks that migrating completers is more costly
> than submitters, APART from the locking associated with adding the
> request to a remote CPU list.

What I think Nick is referring to is the comments I made that at a higher layer (e.g. filesystems) migrating completions to the submitter CPU may be exactly the wrong thing to do.

I don't recall making any comments on migrating submitters - I think others have already commented on that, so I'll ignore that for the moment and try to explain why completion on the submitter CPU /may/ be bad.

For example, in the case of XFS it is fine for data I/O but it is wrong for transaction I/O completion. We want to direct all transaction completions to as few CPUs as possible (one, ideally) so that all the completion processing happens on the same CPU, rather than bouncing global cachelines and locks between all the CPUs taking completion interrupts.

In more detail, the XFS transaction subsystem is asynchronous. We submit the transaction I/O on the CPU that creates the transaction, so the I/O can come from any CPU in the system.
If we then farm the completion processing out to the submission CPU, that will push it all over the machine and guarantee that we bounce all of the XFS transaction log structures and locks all over the machine on completion as well as submission (right now it's lots of submission CPUs, few completion CPUs).

An example of how bad this can get - this patch:

http://oss.sgi.com/archives/xfs/2007-11/msg00217.html

which prevents simultaneous access to the items tracked in the log during transaction reservation. Having several hundred CPUs trying to hit this list at once is really bad for performance - the test app on the 2048p machine that saw this problem went from ~5500s runtime down to 9s with the above patch.

I use this example because the transaction I/O completion touches exactly the same list, locks and structures, but is limited in distribution (and therefore contention) by the number of simultaneous I/O completion CPUs. Doing completion on the submitter CPU would cause much wider distribution of completion processing and introduce exactly the same issues as the transaction reservation side had.

As it goes, with large, multi-device volumes (e.g. a big stripe) we already see issues with simultaneous completion processing (e.g. the 8p machine mentioned in the above link), so I'd really like to avoid making these problems worse.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: IO queuing and complete affinity with threads (was Re: [PATCH 0/8] IO queuing and complete affinity)
On Fri, Feb 08, 2008 at 08:59:55AM +0100, Jens Axboe wrote: At least they reported it to be the most efficient scheme in their testing, and Dave thought that migrating completions out to submitters might be a bottleneck in some cases. More so than migrating submitters to completers? The advantage of only movign submitters is that you get rid of the completion locking. Apart from that, the cost should be the same, especially for the thread based solution. Not specifically for the block layer, but higher layers like xfs. True, but that's parallel to the initial statement - that migrating completers is most costly than migrating submitters. So I'd like Dave to expand on why he thinks that migrating completers it more costly than submitters, APART from the locking associated with adding the request to a remote CPU list. What I think Nick is referring to is the comments I made that at a higher layer (e.g. filesystems) migrating completions to the submitter CPU may be exactly the wrong thing to do. I don't recall making any comments on migrating submitters - I think others have already commented on that so I'll ignore that for the moment and try to explain why completion on submitter CPU /may/ be bad. For example, in the case of XFS it is fine for data I/O but it is wrong for transaction I/O completion. We want to direct all transaction completions to as few CPUs as possible (one, ideally) so that all the completion processing happens on the same CPU, rather than bouncing global cachelines and locks between all the CPUs taking completion interrupts. In more detail, the XFS transaction subsystem is asynchronous. We submit the transaction I/O on the CPU that creates the transaction so the I/O can come from any CPU in the system. 
If we then farm the completion processing out to the submission CPU, that will push it all over the machine and guarantee that we bounce all of the XFS transaction log structures and locks all over the machine on completion as well as submission (right now it's lots of submission CPUs, few completion CPUs).

An example of how bad this can get - this patch:

http://oss.sgi.com/archives/xfs/2007-11/msg00217.html

which prevents simultaneous access to the items tracked in the log during transaction reservation. Having several hundred CPUs trying to hit this list at once is really bad for performance - the test app on the 2048p machine that saw this problem went from ~5500s runtime down to 9s with the above patch.

I use this example because the transaction I/O completion touches exactly the same list, locks and structures, but is limited in distribution (and therefore contention) by the number of simultaneous I/O completion CPUs. Doing completion on the submitter CPU will cause much wider distribution of completion processing and introduce exactly the same issues as the transaction reservation side had.

As it goes, with large, multi-device volumes (e.g. big stripe) we already see issues with simultaneous completion processing (e.g. the 8p machine mentioned in the above link), so I'd really like to avoid making these problems worse.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [PATCH] xfs: convert beX_add to beX_add_cpu (new common API)
On Sun, Feb 10, 2008 at 11:18:09AM +0100, Marcin Slusarz wrote:
> This patch was in Andrew's tree, but it was incomplete. Here is the
> updated version.
>
> ---
> remove beX_add functions and replace all uses with beX_add_cpu
>
> Signed-off-by: Marcin Slusarz [EMAIL PROTECTED]
> ---

Looks good. You can add a:

Reviewed-by: Dave Chinner [EMAIL PROTECTED]

to the patch.

Andrew - feel free to push this to Linus with the other 2 patches in this series when you are ready.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [stable] [patch 00/45] 2.6.24-stable review
On Thu, Feb 07, 2008 at 05:12:30PM -0800, Greg KH wrote:
> On Fri, Feb 08, 2008 at 11:44:30AM +1100, David Chinner wrote:
> > Greg,
> >
> > Is there any reason why the XFS patch I sent to the stable list a
> > couple of days ago is not included in this series?
> >
> > http://oss.sgi.com/archives/xfs/2008-02/msg00027.html
> >
> > We've had multiple reports of it, and multiple confirmations that
> > the patch in the link above fixes the problem.
>
> I didn't think it was in Linus's tree yet. Is it?

Not yet - it's in the pipeline. I'll see if that can be sped up (someone else usually takes care of the pushes to Linus).

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [patch 00/45] 2.6.24-stable review
Greg,

Is there any reason why the XFS patch I sent to the stable list a couple of days ago is not included in this series?

http://oss.sgi.com/archives/xfs/2008-02/msg00027.html

We've had multiple reports of it, and multiple confirmations that the patch in the link above fixes the problem.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: remount-ro & umount & quota interaction
On Thu, Feb 07, 2008 at 03:10:18PM +0100, Jan Engelhardt wrote:
>
> On Feb 7 2008 15:04, Jan Kara wrote:
> >On Thu 07-02-08 13:49:52, Michael Tokarev wrote:
> >> Jan Kara wrote:
> >> [deadlock after remount-ro followed with umount when
> >> quota is enabled]
> >>
> >> Hmm. While that will prevent the lockup, maybe it's better to
> >> perform an equivalent of quotaoff on mount-ro instead? [...]
> >
> > We couldn't leave quota on when the filesystem is remounted ro
> >because we need to modify the quota file when quota is being turned
> >off. We could turn off quotas when remounting read-only. As we turn
> >them off during umount, it probably makes sense to turn them off on
> >remount-ro as well.
>
> Objection. XFS handles quotas differently, in a way that does not
> involve modifying a file on the fs, so quotas could stay on (even if
> it does not make much sense) while the fs is ro.

Wrong and right. XFS uses files for the backing store for quota information, though it does handle quotas very differently to every other Linux filesystem.

On remount-ro, we simply flush all the dirty dquots to their backing store so the dquots can then be treated as ro just like every other object in the filesystem. You don't need to turn off quotas to do this.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [rfc] direct IO submission and completion scalability issues
On Mon, Feb 04, 2008 at 11:09:59AM +0100, Nick Piggin wrote:
> You get better behaviour in the slab and page allocators and locality
> and cache hotness of memory. For example, I guess in a filesystem /
> pagecache heavy workload, you have to touch each struct page, buffer
> head, fs private state, and also often have to wake the thread for
> completion. Much of this data has just been touched at submit time,
> so doing this on the same CPU is nice...

[...]

> I'm surprised that the xfs global state bouncing would outweigh the
> bouncing of all the per-page/block/bio/request/etc data that gets
> touched during completion. We'll see.

Per-page/block/bio/request/etc state is local to a single I/O; the only penalty is a cacheline bounce for each of the structures from one CPU to another. That is, there is no global state modified by these completions.

The real issue is metadata. The transaction log I/O completion funnels through a state machine protected by a single lock, which means completions on different CPUs pull that lock to all completion CPUs. Given that the same lock is used during transaction completion for other state transitions (in task context, not intr), the more CPUs active at once, the worse the problem gets.

Then there's metadata I/O completion, which funnels through a larger set of global locks in the transaction subsystem (e.g. the active item list lock, the log reservation locks, the log state lock, etc.), which once again means the more CPUs we have delivering I/O completions, the worse the problem gets.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [rfc] direct IO submission and completion scalability issues
On Sun, Feb 03, 2008 at 08:14:45PM -0800, Arjan van de Ven wrote:
> David Chinner wrote:
> > Hi Nick,
> >
> > When Matthew was describing this work at an LCA presentation (not
> > sure whether you were at that presentation or not), Zach came up
> > with the idea that allowing the submitting application to control
> > the CPU that the io completion processing occurs on would be a good
> > approach to try. That is, we submit a "completion cookie" with the
> > bio that indicates where we want completion to run, rather than
> > dictating that completion runs on the submission CPU.
> >
> > The reasoning is that only the higher level context really knows
> > what is optimal, and that changes from application to application.
>
> well.. kinda. One of the really hard parts of the submit/completion
> stuff is that the slab/slob/slub/slib allocator ends up basically
> "cycling" memory through the system; there's a sink of free memory on
> all the submission cpus and a source of free memory on the completion
> cpu. I don't think applications are capable of working out what is
> best in this scenario..

Applications as in "anything that calls submit_bio()". i.e. direct I/O, filesystems, etc. - not userspace, but in-kernel applications.

In XFS, simultaneous io completion on multiple CPUs can contribute greatly to contention of global structures in XFS. By controlling where completions are delivered, we can greatly reduce this contention, especially on large, multipathed devices that deliver interrupts to multiple CPUs that may be far distant from each other. We have all the state and intelligence necessary to control this sort of policy decision effectively.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [rfc] direct IO submission and completion scalability issues
On Sun, Feb 03, 2008 at 10:52:52AM +0100, Nick Piggin wrote:
> On Fri, Jul 27, 2007 at 06:21:28PM -0700, Suresh B wrote:
> >
> > The second experiment which we did was migrating the IO submission
> > to the IO completion cpu. Instead of submitting the IO on the same
> > cpu where the request arrived, in this experiment the IO submission
> > gets migrated to the cpu that is processing IO completions
> > (interrupt). This will minimize the access to remote cachelines
> > (that happens in timers, slab, scsi layers). The IO submission
> > request is forwarded to the kblockd thread on the cpu receiving the
> > interrupts. As part of this, we also made the kblockd thread on each
> > cpu the highest priority thread, so that IO gets submitted as soon
> > as possible on the interrupt cpu without any delay. On an x86_64 SMP
> > platform with 16 cores, this resulted in a 2% performance
> > improvement, and a 3.3% improvement on a two node ia64 platform.
> >
> > A quick and dirty prototype patch (not meant for inclusion) for this
> > io migration experiment is appended to this e-mail.
> >
> > Observation #1 mentioned above is also applicable to this
> > experiment. CPUs processing interrupts will now have to cater for IO
> > submission/processing load as well.
> >
> > Observation #2: This introduces some migration overhead during IO
> > submission. With the current prototype, every incoming IO request
> > results in an IPI and context switch (to the kblockd thread) on the
> > interrupt processing cpu. This issue needs to be addressed, and the
> > main challenge is an efficient mechanism for doing this IO migration
> > (how much batching to do and when to send the migrate request?), so
> > that we don't delay the IO much and, at the same point, don't cause
> > much overhead during migration.
>
> Hi guys,
>
> Just had another way we might do this. Migrate the completions out to
> the submitting CPUs rather than migrate submission into the completing
> CPU.
Hi Nick,

When Matthew was describing this work at an LCA presentation (not sure whether you were at that presentation or not), Zach came up with the idea that allowing the submitting application to control the CPU that the io completion processing occurs on would be a good approach to try. That is, we submit a "completion cookie" with the bio that indicates where we want completion to run, rather than dictating that completion runs on the submission CPU.

The reasoning is that only the higher level context really knows what is optimal, and that changes from application to application. The "complete on the submission CPU" policy _may_ be more optimal for database workloads, but it is definitely suboptimal for XFS and transaction I/O completion handling, because it simply drags a bunch of global filesystem state around between all the CPUs running completions. In that case, we really only want a single CPU to be handling the completions.

(Zach - please correct me if I've missed anything)

Looking at your patch - if you turn it around so that the "submission CPU" field can be specified as the "completion cpu", then I think the patch will expose the policy knobs needed to do the above. Add the bio -> rq linkage to enable filesystems and DIO to control the completion CPU field and we're almost done ;)

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: XFS oops in vanilla 2.6.24
On Thu, Jan 31, 2008 at 12:23:11PM +0100, Peter Zijlstra wrote:
> Lets CC the XFS maintainer..

Adding the xfs list and hch. It might be a couple of days before I get to this - I've got a week of backlog to catch up on after LCA.

> On Wed, 2008-01-30 at 20:23 +0000, Sven Geggus wrote:
> > Hi there,
> >
> > I get the following with 2.6.24:
> >
> > Ending clean XFS mount for filesystem: dm-0
> > BUG: unable to handle kernel paging request at virtual address f2134000

How long after mount does this happen? Does it happen when listing a specific directory? i.e. do you have a reproducible test case for it?

Cheers,

Dave.

> > printing eip: c021a13a *pde = 010b5067 *pte = 32134000
> > Oops: [#1] PREEMPT DEBUG_PAGEALLOC
> > Modules linked in: radeon drm rfcomm l2cap sym53c8xx
> >   scsi_transport_spi snd_via82xx 8139too snd_mpu401_uart snd_ens1371
> >   snd_rawmidi snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss
> >   snd_pcm snd_timer snd_page_alloc via_agp agpgart
> >
> > Pid: 3889, comm: bash Not tainted (2.6.24 #3)
> > EIP: 0060:[<c021a13a>] EFLAGS: 00010282 CPU: 0
> > EIP is at xfs_file_readdir+0xfa/0x18c
> > EAX: EBX: 02f5 ECX: 0020 EDX: ESI: EDI: f2133ff8
> > EBP: f227ff68 ESP: f227ff10
> > DS: 007b ES: 007b FS: GS: 0033 SS: 0068
> > Process bash (pid: 3889, ti=f227e000 task=f7205a80 task.ti=f227e000)
> > Stack: 02f5 2c000137 c0165358 f227ff94 f221c810
> >        f2d85e48 02f5 f2133000 1000
> >        0ff8 02f9 c0421c80 f221c810 f1cdbe48 f227ff88 c0165543
> > Call Trace:
> >  [<c0104da3>] show_trace_log_lvl+0x1a/0x2f
> >  [<c0104e53>] show_stack_log_lvl+0x9b/0xa3
> >  [<c0104efb>] show_registers+0xa0/0x1e2
> >  [<c010514c>] die+0x10f/0x1dd
> >  [<c0113555>] do_page_fault+0x43a/0x519
> >  [<c040fe52>] error_code+0x6a/0x70
> >  [<c0165543>] vfs_readdir+0x5d/0x89
> >  [<c01655cd>] sys_getdents64+0x5e/0xa0
> >  [<c0103f06>] syscall_call+0x7/0xb
> > ===
> > Code: 89 74 24 04 81 e3 ff ff ff 7f 89 1c 24 ff 55 bc 85 c0 0f 85 82
> >   00 00 00 8b 4f 10 31 d2 83 c1 1f 83 e1 f8 29 4d d0 19 55 d4 01 cf
> >   <8b> 57 08 8b 4f 0c 89 55 d8 89 4d dc 83 7d d4 00 7f a1 7c 06 83
> > EIP: [<c021a13a>] xfs_file_readdir+0xfa/0x18c SS:ESP 0068:f227ff10
> > ---[ end trace e518e1370efb695e ]---
> >
> > Sven

--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [RFC] ext3 freeze feature
On Sat, Jan 26, 2008 at 04:35:26PM +1100, David Chinner wrote:
> On Fri, Jan 25, 2008 at 07:59:38PM +0900, Takashi Sato wrote:
> > The points of the implementation are the following:
> > - Add calls of the freeze function (freeze_bdev) and
> >   the unfreeze function (thaw_bdev) in ext3_ioctl().
> >
> > - ext3_freeze_timeout() which calls the unfreeze function (thaw_bdev)
> >   is registered to the delayed work queue to unfreeze the filesystem
> >   automatically after the lapse of the specified time.
>
> Seems like pointless complexity to me - what happens if a
> timeout occurs while the filesystem is still freezing?
>
> It's not uncommon for a freeze to take minutes if memory
> is full of dirty data that needs to be flushed out, esp. if
> dm-snap is doing COWs for every write issued.

Sorry, ignore this bit - I just realised the timer is set up after the freeze has occurred.

Still, that makes it potentially dangerous to whatever is being done while the filesystem is frozen.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [RFC] ext3 freeze feature
On Fri, Jan 25, 2008 at 07:59:38PM +0900, Takashi Sato wrote:
> The points of the implementation are the following:
> - Add calls of the freeze function (freeze_bdev) and
>   the unfreeze function (thaw_bdev) in ext3_ioctl().
>
> - ext3_freeze_timeout() which calls the unfreeze function (thaw_bdev)
>   is registered to the delayed work queue to unfreeze the filesystem
>   automatically after the lapse of the specified time.

Seems like pointless complexity to me - what happens if a timeout occurs while the filesystem is still freezing?

It's not uncommon for a freeze to take minutes if memory is full of dirty data that needs to be flushed out, esp. if dm-snap is doing COWs for every write issued.

> + case EXT3_IOC_FREEZE: {
> +         if (inode->i_sb->s_frozen != SB_UNFROZEN)
> +                 return -EINVAL;
> +         freeze_bdev(inode->i_sb->s_bdev);

> + case EXT3_IOC_THAW: {
> +         if (!capable(CAP_SYS_ADMIN))
> +                 return -EPERM;
> +         if (inode->i_sb->s_frozen == SB_UNFROZEN)
> +                 return -EINVAL;
[...]
> +         /* Unfreeze */
> +         thaw_bdev(inode->i_sb->s_bdev, inode->i_sb);

That's inherently unsafe - you can have multiple unfreezes running in parallel, which seriously screws with the bdev semaphore count that is used to lock the device, due to doing multiple up()s for every down(). Your timeout thingy guarantees that at some point you will get multiple up()s occurring due to the timer firing racing with a thaw ioctl.

If this interface is to be more widely exported, then it needs a complete revamp of how the bdev is locked while it is frozen, so that there is no chance of a double up() ever occurring on the bd_mount_sem due to racing thaws.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [RFC] ext3 freeze feature
On Fri, Jan 25, 2008 at 09:42:30PM +0900, Takashi Sato wrote:
> > I am also wondering whether we should have system call(s) for these:
> >
> > On Jan 25, 2008 12:59 PM, Takashi Sato <[EMAIL PROTECTED]> wrote:
> > > + case EXT3_IOC_FREEZE: {
> >
> > > + case EXT3_IOC_THAW: {
> >
> > And just convert XFS to use them too?
>
> I think it is reasonable to implement it as the generic system call,
> as you said. Do XFS folks think so?

Sure. Note that we can't immediately remove the XFS ioctls, otherwise we'd break userspace utilities that use them.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [patch 01/26] mount options: add documentation
> In message <[EMAIL PROTECTED]>, Miklos Szeredi writes: > > From: Miklos Szeredi <[EMAIL PROTECTED]> > > > > This series addresses the problem of showing mount options in > > /proc/mounts. [...] > > The following filesystems still need fixing: CIFS, NFS, XFS, Unionfs, > > Reiser4. For CIFS, NFS and XFS I wasn't able to understand how some > > of the options are used. The last two are not yet in mainline, so I > > leave fixing those to their respective maintainers out of pure > > laziness. XFS has already been updated. The fix is in the XFS git tree that Andrew picks up for -mm releases, and it will be merged in the 2.6.25-rc1 window. The commit is here: http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs-2.6.git;a=commit;h=8c33fb6ca99aa17373bd3d5a507ac0eaefb7abb4 Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
Re: [RFC] ext3 freeze feature
On Sat, Jan 26, 2008 at 04:35:26PM +1100, David Chinner wrote: > On Fri, Jan 25, 2008 at 07:59:38PM +0900, Takashi Sato wrote: > > The points of the implementation are followings. > > - Add calls of the freeze function (freeze_bdev) and > > the unfreeze function (thaw_bdev) in ext3_ioctl(). > > > > - ext3_freeze_timeout() which calls the unfreeze function (thaw_bdev) > > is registered to the delayed work queue to unfreeze the filesystem > > automatically after the lapse of the specified time. > > Seems like pointless complexity to me - what happens if a timeout > occurs while the filesystem is still freezing? It's not uncommon > for a freeze to take minutes if memory is full of dirty data that > needs to be flushed out, esp. if dm-snap is doing COWs for every > write issued. Sorry, ignore this bit - I just realised the timer is set up after the freeze has occurred. Still, that makes it potentially dangerous to whatever is being done while the filesystem is frozen. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
Re: do_remount_sb(RDONLY) race? (was: XFS oops under 2.6.23.9)
On Wed, Jan 23, 2008 at 04:24:33PM +1030, Jonathan Woithe wrote: > > On Wed, Jan 23, 2008 at 03:00:48PM +1030, Jonathan Woithe wrote: > > > Last night my laptop suffered an oops during closedown. The full oops > > > reports can be downloaded from > > > > > > http://www.atrad.com.au/~jwoithe/xfs_oops/ > > > > Assertion failed: atomic_read(&mp->m_active_trans) == 0, file: > > fs/xfs/xfs_vfsops.c, line 689. > > > > The remount read-only of the root drive supposedly completed > > while there was still active modification of the filesystem > > taking place. . > > The read only flag only gets set *after* we've made the filesystem > > readonly, which means before we are truly read only, we can race > > with other threads opening files read/write or filesystem > > modifications can take place. > > > > The result of that race (if it is really unsafe) will be the assert you > > see. The patch I wrote a couple of months ago to fix the problem > > is attached below > > Thanks for the patch. I will apply it and see what happens. > > Will this be in 2.6.24? No - because hitting the problem is so rare that I'm not even sure it's a problem. One of the VFS gurus will need to comment on whether this really is a problem, and if so the correct fix is a change to do_remount_sb() so that it closes the hole for everyone. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
do_remount_sb(RDONLY) race? (was: XFS oops under 2.6.23.9)
On Wed, Jan 23, 2008 at 03:00:48PM +1030, Jonathan Woithe wrote: > Last night my laptop suffered an oops during closedown. The full oops > reports can be downloaded from > > http://www.atrad.com.au/~jwoithe/xfs_oops/ Assertion failed: atomic_read(&mp->m_active_trans) == 0, file: fs/xfs/xfs_vfsops.c, line 689. The remount read-only of the root drive supposedly completed while there was still active modification of the filesystem taking place. > Kernel version was kernel.org 2.6.23.9 compiled as a low latency desktop. The patch in 2.6.23 that introduced this check was: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=516b2e7c2661615ba5d5ad9fb584f068363502d3 Basically, the remount-readonly path was not flushing things properly, so we changed it to flush things properly and to ensure we got bug reports if it wasn't. Yours is the second report of not shutting down correctly since this change went in (we've seen it once in ~8 months in a QA environment). I've had suspicions of a race in the remount-ro code in do_remount_sb() w.r.t. the fs_may_remount_ro() check. That is, we do an unlocked check to see if we can remount readonly and then fail to check again once we've locked the superblock out and started the remount. The read only flag only gets set *after* we've made the filesystem readonly, which means before we are truly read only, we can race with other threads opening files read/write or filesystem modifications can take place. The result of that race (if it is really unsafe) will be the assert you see. The patch I wrote a couple of months ago to fix the problem is attached below. Cheers, Dave. --- Set MS_RDONLY before we check to see if we can remount read only, so that we close the race between checking that remount is ok and setting the superblock flag, a race that allows other processes to start modifying the filesystem while it is being remounted.
Signed-off-by: Dave Chinner <[EMAIL PROTECTED]>
---
 fs/xfs/linux-2.6/xfs_super.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_super.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_super.c	2008-01-22 14:57:07.753782292 +1100
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_super.c	2008-01-23 16:22:16.940279351 +1100
@@ -1222,6 +1222,22 @@ xfs_fs_remount(
 	struct xfs_mount_args	*args = xfs_args_allocate(sb, 0);
 	int			error;
 
+	/*
+	 * We need to have the MS_RDONLY flag set on the filesystem before we
+	 * try to quiesce it down to a sane state. If we don't set the
+	 * MS_RDONLY before we check the fs_may_remount_ro(sb) state, we have a
+	 * race where write operations can start after we've checked it is OK
+	 * to remount read only. This results in assert failures due to being
+	 * unable to quiesce the transaction subsystem correctly.
+	 */
+	if (!(sb->s_flags & MS_RDONLY) && (*flags & MS_RDONLY)) {
+		sb->s_flags |= MS_RDONLY;
+		if (!fs_may_remount_ro(sb)) {
+			sb->s_flags &= ~MS_RDONLY;
+			return -EBUSY;
+		}
+	}
+
 	error = xfs_parseargs(mp, options, args, 1);
 	if (!error)
 		error = xfs_mntupdate(mp, flags, args);
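The check-then-set race that the patch closes can be sketched in userspace. This is a toy model, not the kernel code: Super, open_for_write and the remount helpers are invented names that mirror the roles of the superblock flags, file opens, and the two orderings being compared:

```python
import threading

class Super:
    """Toy superblock: a read-only flag plus a count of active writers."""
    def __init__(self):
        self.rdonly = False
        self.writers = 0
        self.lock = threading.Lock()

    def open_for_write(self):
        with self.lock:
            if self.rdonly:
                return False        # new writes blocked once flag is set
            self.writers += 1
            return True

def remount_ro_racy(sb):
    # Unlocked check first, flag set later: a writer can slip in
    # between the two, which is the race behind the assert failure.
    if sb.writers == 0:             # fs_may_remount_ro()
        sb.rdonly = True            # window for a racing writer is above
        return True
    return False

def remount_ro_fixed(sb):
    # The patch's ordering: fail new opens first, then re-check.
    sb.rdonly = True
    if sb.writers != 0:
        sb.rdonly = False           # roll back and report -EBUSY
        return False
    return True

sb = Super()
sb.open_for_write()                 # one writer in flight
print(remount_ro_fixed(sb))         # False -- remount refused
print(sb.rdonly)                    # False -- flag rolled back
```

With the fixed ordering, any writer that starts after the flag flips is rejected, so the quiesce that follows can assume no new modifications arrive.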
Re: [PATCH 0/6] IO context sharing
On Tue, Jan 22, 2008 at 10:49:15AM +0100, Jens Axboe wrote: > Hi, > > Today io contexts are per-process and define the (surprise) io context > of that process. In some situations it would be handy if several > processes share an IO context. I think that the nfsd threads should probably share as well. It should probably provide an io context per thread pool. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
Re: [RFC] Parallelize IO for e2fsck
On Tue, Jan 22, 2008 at 12:05:11AM -0700, Andreas Dilger wrote: > On Jan 22, 2008 14:38 +1100, David Chinner wrote: > > On Mon, Jan 21, 2008 at 04:00:41PM -0700, Andreas Dilger wrote: > > > I discussed this with Ted at one point also. This is a generic problem, > > > not just for readahead, because "fsck" can run multiple e2fsck in parallel > > > and in case of many large filesystems on a single node this can cause > > > memory usage problems also. > > > > > > What I was proposing is that "fsck.{fstype}" be modified to return an > > > estimated minimum amount of memory needed, and some "desired" amount of > > > memory (i.e. readahead) to fsck the filesystem, using some parameter like > > > "fsck.{fstype} --report-memory-needed /dev/XXX". If this does not > > > return the output in the expected format, or returns an error then fsck > > > will assume some amount of memory based on the device size and continue > > > as it does today. > > > > And while fsck is running, some other program runs that uses > > memory and blows your carefully calculated parameters to smithereens? > > Well, fsck has a rather restricted working environment, because it is > run before most other processes start (i.e. single-user mode). For fsck > initiated by an admin in other runlevels the admin would need to specify > the upper limit of memory usage. My proposal was only for the single-user > fsck at boot time. The simple case. ;) Because XFS has shutdown features, it's not uncommon to hear about people running xfs_repair on an otherwise live system. e.g. XFS detects a corrupted block, shuts down the filesystem, the admin unmounts it, runs xfs_repair, and puts it back online. Meanwhile, all the other filesystems and users continue unaffected. In this use case, getting feedback about memory usage is, IMO, very worthwhile. Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group
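The "--report-memory-needed" protocol is only a proposal in this thread; as a sketch of how a top-level fsck driver might use such estimates to cap parallelism against a memory budget, here is a greedy batching helper. All names are hypothetical, and the per-device figures would come from the proposed fsck.{fstype} query:

```python
def plan_fsck_schedule(estimates, mem_budget):
    """Greedy batching: run filesystem checks in parallel only while
    the sum of their reported minimum-memory estimates (in MB) fits
    in the budget.  'estimates' maps device -> MB, as a hypothetical
    'fsck.<fstype> --report-memory-needed' round might report."""
    batches, current, used = [], [], 0
    # Largest consumers first, so each big check claims a fresh batch
    for dev, need in sorted(estimates.items(), key=lambda kv: -kv[1]):
        if current and used + need > mem_budget:
            batches.append(current)
            current, used = [], 0
        current.append(dev)
        used += need
    if current:
        batches.append(current)
    return batches

print(plan_fsck_schedule({"sda1": 600, "sdb1": 500, "sdc1": 200}, 800))
# [['sda1'], ['sdb1', 'sdc1']] -- sdb1 and sdc1 fit together under 800MB
```

As Dave notes, a static plan like this says nothing about other workloads consuming memory while the checks run; that is the gap the feedback mechanism discussed below is meant to fill.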
Re: [RFC] Parallelize IO for e2fsck
On Mon, Jan 21, 2008 at 04:00:41PM -0700, Andreas Dilger wrote: > On Jan 16, 2008 13:30 -0800, Valerie Henson wrote: > > I have a partial solution that sort of blindly manages the buffer > > cache. First, the user passes e2fsck a parameter saying how much > > memory is available as buffer cache. The readahead thread reads > > things in and immediately throws them away so they are only in buffer > > cache (no double-caching). Then readahead and e2fsck work together so > > that readahead only reads in new blocks when the main thread is done > > with earlier blocks. The already-used blocks get kicked out of buffer > > cache to make room for the new ones. > > > > What would be nice is to take into account the current total memory > > usage of the whole fsck process and factor that in. I don't think it > > would be hard to add to the existing cache management framework. > > Thoughts? > > I discussed this with Ted at one point also. This is a generic problem, > not just for readahead, because "fsck" can run multiple e2fsck in parallel > and in case of many large filesystems on a single node this can cause > memory usage problems also. > > What I was proposing is that "fsck.{fstype}" be modified to return an > estimated minimum amount of memory needed, and some "desired" amount of > memory (i.e. readahead) to fsck the filesystem, using some parameter like > "fsck.{fstype} --report-memory-needed /dev/XXX". If this does not > return the output in the expected format, or returns an error then fsck > will assume some amount of memory based on the device size and continue > as it does today. And while fsck is running, some other program runs that uses memory and blows your carefully calculated parameters to smithereens? I think there is a clear need for applications to be able to register a callback from the kernel to indicate that the machine as a whole is running out of memory and that the application should trim its caches to reduce memory utilisation. 
Perhaps instead of swapping immediately, a SIGLOWMEM could be sent to processes that aren't masking the signal, followed by a short grace period to allow the processes to free up some memory before swapping out pages from that process? With this sort of feedback, the fsck process can scale back its readahead and remove cached info that is not critical to what it is currently doing, and thereby prevent readahead thrashing as memory usage of the fsck process itself grows. Another example where this could be useful is to tell browsers to release some of their cache rather than having the VM swap it out. IMO, a scheme like this will be far more reliable than trying to guess what the optimal settings are going to be over the whole lifetime of a process. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
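SIGLOWMEM does not exist in Linux; as a sketch of the idea from the application's side, here SIGUSR1 stands in for the proposed notification, and the process responds by trimming its own cache rather than letting the VM swap it out:

```python
import os
import signal

readahead_cache = {}   # stand-in for fsck's readahead buffers

def on_low_memory(signum, frame):
    # On a pressure notification, drop the least-critical half of the
    # cache instead of having those pages swapped out from under us.
    for key in sorted(readahead_cache)[: len(readahead_cache) // 2]:
        del readahead_cache[key]

# SIGUSR1 stands in for the hypothetical SIGLOWMEM in this sketch.
signal.signal(signal.SIGUSR1, on_low_memory)

readahead_cache.update({n: b"x" * 4096 for n in range(8)})
os.kill(os.getpid(), signal.SIGUSR1)   # simulate the kernel's notification
print(len(readahead_cache))            # half of the cached blocks released
```

The grace-period part of the proposal would live on the kernel side: only after the signal has been delivered and a short delay elapsed would reclaim fall back to swapping that process's pages.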
Re: 2.6.24-rc8: possible circular locking dependency detected
On Fri, Jan 18, 2008 at 10:45:17PM +0100, Christian Kujau wrote: > Hi, > > just FYI, upgrading to -rc8 gave the following messages in kern.log in > the morning hours, when the backups were run: > > === > [ INFO: possible circular locking dependency detected ] > 2.6.24-rc8 #2 > --- > rsync/23295 is trying to acquire lock: > (iprune_mutex){--..}, at: [<c017a552>] shrink_icache_memory+0x72/0x220 > > but task is already holding lock: > (&(ip->i_iolock)->mr_lock){----}, at: [<c0275056>] xfs_ilock+0x96/0xb0 > > which lock already depends on the new lock. Memory reclaim can occur when an inode lock is held, causing i_iolock -> iprune_mutex to occur. This is quite common. During reclaim, while holding iprune_mutex, we lock a different inode to complete the cleaning up of it, resulting in iprune_mutex -> i_iolock. At this point, lockdep gets upset and blats out a warning. But there's no problem here, as it is always safe for us to take the i_iolock in inode reclaim because it can never be the same as the i_iolock that we've taken prior to memory reclaim being entered. Therefore this is a false positive. Lockdep folk - we really need an annotation to prevent this false positive from being reported, because we are getting reports at least once a week. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
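Why lockdep reports a false positive here can be modeled with a toy lock-order checker: lockdep keys locks by class, so the i_iolock held on entry to reclaim and the different inode's i_iolock taken inside reclaim look like the same lock, and the two orderings form a cycle. This is a sketch, not the lockdep implementation; the real fix would be an annotation (for example a separate lock subclass for the reclaim-side acquisition, in the spirit of the kernel's *_nested locking APIs):

```python
def find_order_violations(acquisition_sequences):
    """Tiny lockdep-style checker: record which lock CLASS is taken
    while another is held; any pair seen in both orders is reported."""
    edges = set()
    for seq in acquisition_sequences:
        for i, held in enumerate(seq):
            for taken in seq[i + 1:]:
                edges.add((held, taken))
    return [(a, b) for (a, b) in edges if (b, a) in edges]

# Keyed by class, two *different* inodes' i_iolock look identical:
by_class = [("i_iolock", "iprune_mutex"),    # reclaim entered with inode locked
            ("iprune_mutex", "i_iolock")]    # prune locks a different inode
print(bool(find_order_violations(by_class)))     # True -- the bogus report

# Giving the reclaim-side acquisition its own subclass breaks the cycle:
annotated = [("i_iolock", "iprune_mutex"),
             ("iprune_mutex", "i_iolock/reclaim")]
print(find_order_violations(annotated))          # [] -- no report
```

The annotation does not weaken checking for real deadlocks between the same two classes in the un-annotated order; it only tells the checker that the reclaim-side i_iolock is a distinct acquisition context.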
Re: [patch] Converting writeback linked lists to a tree based data structure
On Fri, Jan 18, 2008 at 01:41:33PM +0800, Fengguang Wu wrote: > > That is, think of large file writes like process scheduler batch > > jobs - bulk throughput is what matters, so the larger the time slice > > you give them the higher the throughput. > > > > IMO, the sort of result we should be looking at is a > > writeback design that results in cycling somewhat like: > > > > slice 1: iterate over small files > > slice 2: flush large file 1 > > slice 3: iterate over small files > > slice 4: flush large file 2 > > .. > > slice n-1: flush large file N > > slice n: iterate over small files > > slice n+1: flush large file N+1 > > > > So that we keep the disk busy with a relatively fair mix of > > small and large I/Os while both are necessary. > > If we can sync fast enough, the lower layer would be able to merge > those 4MB requests. No, not necessarily - think of a stripe with a chunk size of 512k. That 4MB will be split into 8x512k chunks and sent to different devices (and hence elevator queues). The only way you get elevator merging in this sort of config is if you send multiple stripe *width* sized amounts to the device in a very short time period. I see quite a few filesystems with stripe widths in the tens of MB range. > > Put simply: > > > > The higher the bandwidth of the device, the more frequently > > we need to be servicing the inodes with large amounts of > > dirty data to be written to maintain write throughput at a > > significant percentage of the device capability. > > > > The writeback algorithm needs to take this into account for it > > to be able to scale effectively for high throughput devices. > > Slow queues go full first. Currently the writeback code will skip > _and_ congestion_wait() for congested filesystems. The better policy > is to congestion_wait() _after_ all other writable pages have been > synced. Agreed. The comments I've made are mainly concerned with getting efficient flushing of a single device occurring. 
Interactions between multiple devices are a separable issue.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
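To make the chunk-splitting argument above concrete, here is a small sketch of how one 4MB writeback slice maps onto a striped volume. The stripe geometry (512k chunk, 8 data disks) is an assumed example, not taken from any real array:

```python
# Sketch: how one 4MB writeback slice maps onto a striped volume.
# Geometry (512k chunk, 8 data disks) is a hypothetical example.
CHUNK = 512 * 1024              # stripe chunk (unit) size per disk
NDISKS = 8                      # data disks in the stripe

def split_io(offset, length, chunk=CHUNK, ndisks=NDISKS):
    """Split one contiguous write into (disk, offset_on_disk, len) pieces."""
    pieces = []
    while length > 0:
        within = offset % chunk              # position inside current chunk
        n = min(chunk - within, length)      # bytes left in this chunk
        stripe_no, disk = divmod(offset // chunk, ndisks)
        pieces.append((disk, stripe_no * chunk + within, n))
        offset += n
        length -= n
    return pieces

pieces = split_io(0, 4 * 1024 * 1024)
# The 4MB write becomes eight 512k requests, one per disk: each per-disk
# elevator queue sees a single request with nothing adjacent to merge
# against unless more stripe-width-sized IO arrives very soon after.
```

This is why merging only happens when multiple stripe-width-sized amounts reach the device in a short window: only then does a second request land adjacent to the first in the same per-disk queue.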
Re: [patch] Converting writeback linked lists to a tree based data structure
On Thu, Jan 17, 2008 at 09:38:24PM -0800, Michael Rubin wrote:
> On Jan 17, 2008 9:01 PM, David Chinner <[EMAIL PROTECTED]> wrote:
>
> First off thank you for the very detailed reply. This rocks and gives me much to think about.
>
> > On Thu, Jan 17, 2008 at 01:07:05PM -0800, Michael Rubin wrote:
> > This seems suboptimal for large files. If you keep feeding in new least recently dirtied files, the large files will never get an unimpeded go at the disk and hence we'll struggle to get decent bandwidth under anything but pure large file write loads.
>
> You're right. I understand now. I just changed a dial on my tests, ran it and found pdflush not keeping up like it should. I need to address this.
>
> > Switching inodes during writeback implies a seek to the new write location, while continuing to write the same inode has no seek penalty because the writeback is sequential. It follows from this that allowing large files a disproportionate amount of data writeback is desirable.
> >
> > Also, cycling rapidly through all the large files to write 4MB to each is going to cause us to spend time seeking rather than writing compared to cycling slower and writing 40MB from each large file at a time.
> >
> > i.e. servicing one large file for 100ms is going to result in higher writeback throughput than servicing 10 large files for 10ms each because there's going to be less seeking and more writing done by the disks.
> >
> > That is, think of large file writes like process scheduler batch jobs - bulk throughput is what matters, so the larger the time slice you give them the higher the throughput.
> >
> > IMO, the sort of result we should be looking at is a writeback design that results in cycling somewhat like:
> >
> > slice 1: iterate over small files
> > slice 2: flush large file 1
> > slice 3: iterate over small files
> > slice 4: flush large file 2
> > ....
> > slice n-1: flush large file N
> > slice n: iterate over small files
> > slice n+1: flush large file N+1
> >
> > So that we keep the disk busy with a relatively fair mix of small and large I/Os while both are necessary.
>
> I am getting where you are coming from. But if we are going to make changes to optimize for seeks maybe we need to be more aggressive in write back in how we organize both time and location. Right now AFAIK there is no attention to location in the writeback path.

True. But IMO, locality ordering really only impacts the small file data writes and the inodes themselves because there is typically lots of seeking in doing that. For large sequential writes to a file, writing a significant chunk of data gives that bit of writeback its own locality because it does not cause seeks. Hence simply writing large enough chunks avoids any need to order the writeback by locality, and writeback ordering by locality is more a function of optimising the "iterate over small files" aspect of the writeback.

> > The higher the bandwidth of the device, the more frequently we need to be servicing the inodes with large amounts of dirty data to be written to maintain write throughput at a significant percentage of the device capability.
>
> Could you expand that to say it's not the inodes of large files but the ones with data that we can exploit locality?

Not sure I understand what you mean. Can you rephrase that?

> Often large files are fragmented.

Then the filesystem is not doing its job. Fragmentation does not happen very frequently in XFS for large files - that is one of the reasons it is extremely good for large files and high throughput applications...

> Would it make more sense to pursue cracking the inodes and grouping their blocks' locations? Or is this all overkill and should be handled at a lower level like the elevator?

For large files it is overkill. For filesystems that do delayed allocation, it is often impossible (no block mapping until the writeback is executed unless it's an overwrite). At this point, I'd say it is best to leave it to the filesystem and the elevator to do their jobs properly.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
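The slice cycling Dave sketches in the thread above (alternate a pass over the small-file queue with one large file per slice) can be modelled as a trivial generator. The function name and slice granularity are illustrative only, not from any kernel code:

```python
# Toy model of the proposed slice schedule: alternate a pass over the
# small-file queue with a long flush of one large file. Purely
# illustrative; real writeback would size slices in pages or time.
from itertools import islice

def writeback_slices(large_files):
    """Yield an endless schedule: small-file pass, large file, repeat."""
    i = 0
    while True:
        yield "iterate over small files"
        yield f"flush large file {large_files[i % len(large_files)]}"
        i += 1

schedule = list(islice(writeback_slices([1, 2, 3]), 6))
# Alternates: small files, file 1, small files, file 2, small files, file 3
```

The point of the interleave is that neither class of I/O can starve the other: large files get long sequential runs, while small files get serviced between them.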
Re: [patch] Converting writeback linked lists to a tree based data structure
On Thu, Jan 17, 2008 at 01:07:05PM -0800, Michael Rubin wrote:
> > Michael, could you sort out and document the new starvation prevention schemes?
>
> The basic idea behind the writeback algorithm is to handle starvation. The overarching idea is that we want to preserve the order of writeback based on when an inode was dirtied and also preserve the dirtied_when contents until the inode has been written back (partially or fully).
>
> Every sync_sb_inodes we find the least recently dirtied inodes. To deal with large or small starvation we have a s_flush_gen for each iteration of sync_sb_inodes; every time we issue a writeback we mark that the inode cannot be processed until the next s_flush_gen. This way we don't process one inode and never get to the rest, since we keep pushing them into subsequent s_flush_gens.

This seems suboptimal for large files. If you keep feeding in new least recently dirtied files, the large files will never get an unimpeded go at the disk and hence we'll struggle to get decent bandwidth under anything but pure large file write loads.

Fairness is a tradeoff between seeks and bandwidth. Ideally, we want to spend 50% of the *disk* time servicing sequential writes and 50% of the time servicing seeky writes - that way neither gets penalised unfairly by the other type of workload.

Switching inodes during writeback implies a seek to the new write location, while continuing to write the same inode has no seek penalty because the writeback is sequential. It follows from this that allowing large files a disproportionate amount of data writeback is desirable.

Also, cycling rapidly through all the large files to write 4MB to each is going to cause us to spend time seeking rather than writing compared to cycling slower and writing 40MB from each large file at a time.

i.e. servicing one large file for 100ms is going to result in higher writeback throughput than servicing 10 large files for 10ms each because there's going to be less seeking and more writing done by the disks.

That is, think of large file writes like process scheduler batch jobs - bulk throughput is what matters, so the larger the time slice you give them the higher the throughput.

IMO, the sort of result we should be looking at is a writeback design that results in cycling somewhat like:

slice 1: iterate over small files
slice 2: flush large file 1
slice 3: iterate over small files
slice 4: flush large file 2
....
slice n-1: flush large file N
slice n: iterate over small files
slice n+1: flush large file N+1

So that we keep the disk busy with a relatively fair mix of small and large I/Os while both are necessary.

Furthermore, as disk bandwidth goes up, the relationship between large file and small file writes changes if we want to maintain writeback at a significant percentage of the maximum bandwidth of the drive (which is extremely desirable). So if we take a 4k page machine and a 1024-page writeback slice, for different disks, we get a bandwidth slice in terms of disk seeks like:

slow disk: 20MB/s, 10ms seek (say a laptop drive)
	- 4MB write takes 200ms, or equivalent of 20 seeks

normal disk: 60MB/s, 8ms seek (desktop)
	- 4MB write takes 66ms, or equivalent of 8 seeks

fast disk: 120MB/s, 5ms seek (15krpm SAS)
	- 4MB write takes 33ms, or equivalent of 6 seeks

small RAID5 lun: 200MB/s, 4ms seek
	- 4MB write takes 20ms, or equivalent of 5 seeks

large RAID5 lun: 1GB/s, 2ms seek
	- 4MB write takes 4ms, or equivalent of 2 seeks

Put simply:

	The higher the bandwidth of the device, the more frequently we need to be servicing the inodes with large amounts of dirty data to be written to maintain write throughput at a significant percentage of the device capability.

The writeback algorithm needs to take this into account for it to be able to scale effectively for high throughput devices.

BTW, it needs to be recognised that if we are under memory pressure we can clean much more memory in a short period of time by writing out all the large files first. This would clearly benefit the system as a whole as we'd get the most pages available for reclaim as possible in as short a time as possible. The writeback algorithm should really have a mode that allows this sort of flush ordering to occur.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
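The bandwidth-slice table above is just two divisions per device; a short sketch reproduces it (device numbers copied from the email; note that by this arithmetic the laptop-drive line works out to 20 seek-equivalents):

```python
# Reproduce the bandwidth-slice arithmetic: a 4MB writeback slice
# (1024 pages x 4k) expressed as an equivalent number of average seeks.
SLICE = 4  # MB per writeback slice

devices = {
    "slow disk":       (20,   10.0),  # (MB/s, avg seek in ms)
    "normal disk":     (60,    8.0),
    "fast disk":       (120,   5.0),
    "small RAID5 lun": (200,   4.0),
    "large RAID5 lun": (1000,  2.0),
}

def seek_equivalents(bw_mb_s, seek_ms, slice_mb=SLICE):
    write_ms = slice_mb / bw_mb_s * 1000     # time to stream the slice
    return int(write_ms // seek_ms)          # whole seeks in that window

table = {name: seek_equivalents(bw, s) for name, (bw, s) in devices.items()}
# -> {'slow disk': 20, 'normal disk': 8, 'fast disk': 6,
#     'small RAID5 lun': 5, 'large RAID5 lun': 2}
```

The trend is the point: the faster the device, the cheaper a fixed-size slice is in seek terms, so inodes with lots of dirty data must be revisited more often to keep the device busy.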
Re: [RFC] Parallelize IO for e2fsck
On Wed, Jan 16, 2008 at 01:30:43PM -0800, Valerie Henson wrote:
> Hi y'all,
>
> This is a request for comments on the rewrite of the e2fsck IO parallelization patches I sent out a few months ago. The mechanism is totally different. Previously IO was parallelized by issuing IOs from multiple threads; now a single thread issues fadvise(WILLNEED) and then uses read() to complete the IO.

Interesting. We ultimately rejected a similar patch to xfs_repair (pre-populating the kernel block device cache) mainly because of low memory performance issues, and because it doesn't really enable you to do anything particularly smart with optimising I/O patterns for larger, high performance RAID arrays.

The low memory problems were particularly bad; the readahead thrashing caused a slowdown of 2-3x compared to the baseline, and often it was due to the repair process requiring all of memory to cache stuff it would need later. IIRC, multi-terabyte ext3 filesystems have similar memory usage problems to XFS, so there's a good chance that this patch will see the same sorts of issues.

> Single disk performance doesn't change, but elapsed time drops by about 50% on a big RAID-5 box. Passes 1 and 2 are parallelized. Pass 5 is left as an exercise for the reader.

Promising results, though.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
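The single-threaded fadvise(WILLNEED)-then-read() pattern Valerie describes can be sketched in a few lines. The block size and offsets below are made-up; e2fsck's real pass structure is far more involved, and this only shows the I/O idiom:

```python
# Sketch of single-threaded readahead: hint the kernel with
# fadvise(WILLNEED) for all upcoming blocks, then issue blocking reads.
import os
import tempfile

BLOCK = 4096

def prefetch_then_read(fd, offsets):
    # Queue asynchronous readahead hints first (skipped where the
    # platform lacks posix_fadvise, e.g. macOS)...
    if hasattr(os, "posix_fadvise"):
        for off in offsets:
            os.posix_fadvise(fd, off, BLOCK, os.POSIX_FADV_WILLNEED)
    # ...then do the reads, which should now mostly hit the page cache.
    return [os.pread(fd, BLOCK, off) for off in offsets]

with tempfile.TemporaryFile() as f:
    f.write(bytes(8 * BLOCK))
    f.flush()
    blocks = prefetch_then_read(f.fileno(), [0, 2 * BLOCK, 5 * BLOCK])
```

The overlap between I/O and computation happens inside the kernel: by the time the blocking read()s are issued, the earlier hints have (ideally) already pulled the blocks into cache.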
Re: [patch] Converting writeback linked lists to a tree based data structure
On Thu, Jan 17, 2008 at 11:16:00AM +0800, Fengguang Wu wrote:
> On Thu, Jan 17, 2008 at 09:35:10AM +1100, David Chinner wrote:
> > On Wed, Jan 16, 2008 at 05:07:20PM +0800, Fengguang Wu wrote:
> > > On Tue, Jan 15, 2008 at 09:51:49PM -0800, Andrew Morton wrote:
> > > > > Then to do better ordering by adopting radix tree (or rbtree if radix tree is not enough),
> > > >
> > > > ordering of what?
> > >
> > > Switch from time to location.
> >
> > Note that data writeback may be adversely affected by location based writeback rather than time based writeback - think of the effect of location based data writeback on an app that creates lots of short term (<30s) temp files and then removes them before they are written back.
>
> A small (e.g. 5s) time window can still be enforced, but...

Yes, you could, but that will then result in non-deterministic performance for repeated workloads because the order of file writeback will not be consistent. e.g. the first run is fast because the output file is at a lower offset than the temp file, meaning the temp file gets deleted without being written. The second run is slow because the location of the files is reversed and the temp file is written to disk before the final output file, and hence the run is much slower because it writes much more. The third run is also slow, but the files are laid out like the first, fast run. However, pdflush tries to write the temp file back within 5s of it being dirtied, so it skips it and writes the output file first.

The difference between the first and second cases can be found by knowing that inode number determines writeback order, but there is no obvious clue as to why the first and third runs are different. This is exactly the sort of non-deterministic behaviour we want to avoid in a writeback algorithm.

> > Hmmm - I'm wondering if we'd do better to split data writeback from inode writeback. i.e. we do two passes. The first pass writes all the data back in time order, the second pass writes all the inodes back in location order.
> >
> > Right now we interleave data and inode writeback (i.e. we do data, inode, data, inode, data, inode, ....). I'd much prefer to see all data written out first, then the inodes. ->writepage often dirties the inode and hence if we need to do multiple do_writepages() calls on an inode to flush all the data (e.g. congestion, large amounts of data to be written, etc), we really shouldn't be calling write_inode() after every do_writepages() call. The inode should not be written until all the data is written.
>
> That may do good to XFS. Another case is documented as follows: "the write_inode() function of a typical fs will perform no I/O, but will mark buffers in the blockdev mapping as dirty."

Yup, but in that situation ->write_inode() does not do any I/O, so it will work with any high level inode writeback ordering or timing scheme equally well. As a result, that's not the case we need to optimise at all.

FWIW, the NFS client is likely to work better with split data/inode writeback as it also has to mark the inode dirty on async write completion (to get ->write_inode called to issue a commit RPC). Hence delaying the inode write until after all the data is written makes sense there as well.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
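The two-pass scheme discussed above (data in dirtied-time order first, then inodes in location order) can be modelled with a toy flush log. Field names here are illustrative, not from the kernel's struct inode:

```python
# Toy model of split data/inode writeback: flush data oldest-dirtied
# first, then write the inodes themselves in location order.
from dataclasses import dataclass

@dataclass
class Inode:
    ino: int            # stands in for on-disk location
    dirtied_when: int   # logical time the inode was first dirtied

def writeback(inodes):
    log = []
    # Pass 1: flush data, oldest-dirtied first (time order).
    for i in sorted(inodes, key=lambda i: i.dirtied_when):
        log.append(("data", i.ino))
    # Pass 2: write the inodes in location order to minimise seeks,
    # and only after all their data has gone out.
    for i in sorted(inodes, key=lambda i: i.ino):
        log.append(("inode", i.ino))
    return log

log = writeback([Inode(7, 100), Inode(3, 50), Inode(9, 75)])
# Every ("data", ...) entry precedes every ("inode", ...) entry, so
# write_inode() happens once per inode, not after each do_writepages().
```

The interleaved scheme Dave is arguing against would emit data and inode entries alternately, writing some inodes repeatedly as ->writepage redirties them.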
Re: [patch] Converting writeback linked lists to a tree based data structure
On Wed, Jan 16, 2008 at 05:07:20PM +0800, Fengguang Wu wrote:
> On Tue, Jan 15, 2008 at 09:51:49PM -0800, Andrew Morton wrote:
> > > Then to do better ordering by adopting radix tree (or rbtree if radix tree is not enough),
> >
> > ordering of what?
>
> Switch from time to location.

Note that data writeback may be adversely affected by location based writeback rather than time based writeback - think of the effect of location based data writeback on an app that creates lots of short term (<30s) temp files and then removes them before they are written back.

Also, data writeback location cannot be easily derived from the inode number in pretty much all cases. "Near" in terms of XFS means the same AG, which means the data could be up to a TB away from the inode, and if you have >1TB filesystems using the default inode32 allocator, file data is *never* placed near the inode - the inodes are in the first TB of the filesystem, and the data is rotored around the rest of the filesystem.

And with delayed allocation, you don't know where the data is even going to be written ahead of the filesystem ->writepage call, so you can't do optimal location ordering for data in this case.

> > > and lastly get rid of the list_heads to avoid locking. Does it sound like a good path?
> >
> > I'd have thought that replacing list_heads with another data structure would be a single commit.
>
> That would be easy. s_more_io and s_more_io_wait can all be converted to radix trees.

Makes sense for location based writeback of the inodes themselves, but not for data.

Hmmm - I'm wondering if we'd do better to split data writeback from inode writeback. i.e. we do two passes. The first pass writes all the data back in time order, the second pass writes all the inodes back in location order.

Right now we interleave data and inode writeback (i.e. we do data, inode, data, inode, data, inode, ....). I'd much prefer to see all data written out first, then the inodes. ->writepage often dirties the inode and hence if we need to do multiple do_writepages() calls on an inode to flush all the data (e.g. congestion, large amounts of data to be written, etc), we really shouldn't be calling write_inode() after every do_writepages() call. The inode should not be written until all the data is written.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [PATCH 09/13] writeback: requeue_io() on redirtied inode
On Tue, Jan 15, 2008 at 08:36:46PM +0800, Fengguang Wu wrote:
> Redirtied inodes could be seen in really fast writes.
> They should really be synced as soon as possible.
>
> redirty_tail() could delay the inode for up to 30s.
> Kill the delay by using requeue_io() instead.

That's actually bad for anything that does delayed allocation or updates state on data I/O completion. e.g. XFS when writing past EOF doing delalloc dirties the inode during writeout (allocation) and then updates the file size on data I/O completion, hence dirtying the inode again.

With this change, writing the last pages out would result in hitting this code and causing the inode to be flushed very soon after the data write. Then, after the inode write is issued, we get data I/O completion which dirties the inode again, resulting in needing to write the inode again to clean it. i.e. it introduces a potential new and useless inode write I/O.

Also, the immediate inode write may be useless for XFS because the inode may be pinned in memory due to async transactions still in flight (e.g. from delalloc), so we've got two situations where flushing the inode immediately is suboptimal. Hence I don't think this is an optimisation that should be made in the generic writeback code.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [patch] Converting writeback linked lists to a tree based data structure
On Tue, Jan 15, 2008 at 07:44:15PM -0800, Andrew Morton wrote: > On Wed, 16 Jan 2008 11:01:08 +0800 Fengguang Wu <[EMAIL PROTECTED]> wrote: > > > On Tue, Jan 15, 2008 at 09:53:42AM -0800, Michael Rubin wrote: > > > On Jan 15, 2008 12:46 AM, Peter Zijlstra <[EMAIL PROTECTED]> wrote: > > > > Just a quick question, how does this interact/depend-upon etc.. with > > > > Fengguangs patches I still have in my mailbox? (Those from Dec 28th) > > > > > > They don't. They apply to a 2.6.24rc7 tree. This is a candidate for 2.6.25. > > > > > > This work was done before Fengguang's patches. I am trying to test > > > Fengguang's for comparison but am having problems with getting mm1 to > > > boot on my systems. > > > > Yeah, they are independent ones. The initial motivation is to fix the > > bug "sluggish writeback on small+large files". Michael introduced > > a new rbtree, and I introduced a new list (s_more_io_wait). > > > > Basically I think rbtree is an overkill to do time based ordering. > > Sorry, Michael. But s_dirty would be enough for that. Plus, s_more_io > > provides fair queuing between small/large files, and s_more_io_wait > > provides waiting mechanism for blocked inodes. > > > > The time ordered rbtree may delay io for a blocked inode simply by > > modifying its dirtied_when and reinsert it. But it would no longer be > > that easy if it is to be ordered by location. > > What does the term "ordered by location" mean? Attempting to sort inodes by > physical disk address? By using their i_ino as a key? > > That sounds optimistic. In XFS, inode number is an encoding of its location on disk, so ordering inode writeback by inode number *does* make sense. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
On Tue, Jan 15, 2008 at 09:16:53PM +0100, Pavel Machek wrote: > Hi! > > > > What are ext3 expectations of disk (is there doc somewhere)? For > > > example... if disk does not lie, but powerfail during write damages > > > the sector -- is ext3 still going to work properly? > > > > Nope. However the few disks that did this rapidly got firmware updates > > because there are other OS's that can't cope. > > > > > If disk does not lie, but powerfail during write may cause random > > > numbers to be returned on read -- can fsck handle that? > > > > most of the time. and fsck knows about writing sectors to remove read > > errors in metadata blocks. > > > > > What about disk that kills 5 sectors around sector being written during > > > powerfail; can ext3 survive that? > > > > generally. Note btw that for added fun there is nothing that guarantees > > the blocks around a block on the media are sequentially numbered. They > > usually are but you never know. > > Ok, should something like this be added to the documentation? > > It would be cool to be able to include few examples (modern SATA disks > support barriers so are safe, any IDE from 1989 is unsafe), but I do > not know enough about hw... ext3 is not the only filesystem that will have trouble due to volatile write caches. We see problems often enough with XFS due to volatile write caches that it's in our FAQ: http://oss.sgi.com/projects/xfs/faq.html#wcache Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Why is deleting (or reading) files not counted as IO-Wait in top?
On Wed, Jan 02, 2008 at 08:35:03PM +0100, Matthias Schniedermeyer wrote: > Hi > > > Currently I'm deleting about 500.000 files on an XFS-filesystem which > takes a few minutes, as I had a top open I saw that 'wa' is shown as > 0.0% (Nothing else running currently) and everything except 'id' is near > the bottom too. Kernel is 2.6.23.11. Simply because the only I/O that XFS does during a delete is to the log, and the log does async I/O and hence the process never blocks in I/O. Instead, it blocks in a far more complex space reservation that may or may not be related to I/O wait. > So, as 'rm -rf' is essentially an IO (or seek, to be more correct)-bound > task, shouldn't that count as "Waiting for IO"? rm -rf is not seek bound on XFS - it's generally determined by the sequential write speed of the block device or how fast your CPU is. > The man-page of top says: > 'Amount of time the CPU has been waiting for I/O to complete.' Async I/O means that typically your CPU does not get held up waiting for I/O to complete. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: xfs|loop|raid: attempt to access beyond end of device
On Sun, Dec 23, 2007 at 08:21:08PM +0100, Janos Haar wrote: > Hello, list, > > I have a little problem on one of my production systems. > > The system sometimes crashes, like this: > > Dec 23 08:53:05 Albohacen-global kernel: attempt to access beyond end of > device > Dec 23 08:53:05 Albohacen-global kernel: loop0: rw=1, want=50552830649176, > limit=3085523200 > Dec 23 08:53:05 Albohacen-global kernel: Buffer I/O error on device loop0, > logical block 6319103831146 > Dec 23 08:53:05 Albohacen-global kernel: lost page write due to I/O error on > loop0 So a long way beyond the end of the device. [snip soft lockup warnings] > Dec 23 09:08:19 Albohacen-global kernel: Filesystem "loop0": Access to block > zero in inode 397821447 start_block: 0 start_off: 0 blkcnt: 0 extent-state: > 0 lastx: e4 And that's to block zero of the filesystem. Sure signs of a corrupted inode extent btree. We've seen a few of these corruptions on loopback device reported recently. You'll need to unmount and repair the filesystem to make this go away, but it's hard to know what is causing the btree corruption.
> Dec 23 09:08:22 Albohacen-global last message repeated 19 times > > some more info: > > [EMAIL PROTECTED] ~]# uname -a > Linux Albohacen-global 2.6.21.1 #3 SMP Thu May 3 04:33:36 CEST 2007 x86_64 > x86_64 x86_64 GNU/Linux > [EMAIL PROTECTED] ~]# cat /proc/mdstat > Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] > [multipath] [faulty] > md1 : active raid4 sdf2[1] sde2[0] sdd2[5] sdc2[4] sdb2[3] sda2[2] > 19558720 blocks level 4, 64k chunk, algorithm 0 [6/6] [UU] > bitmap: 8/239 pages [32KB], 8KB chunk > > md2 : active raid4 sdf3[1] sde3[0] sdd3[5] sdc3[4] sdb3[3] sda3[2] > 1542761600 blocks level 4, 64k chunk, algorithm 0 [6/6] [UU] > bitmap: 0/148 pages [0KB], 1024KB chunk > > md0 : active raid1 sdb1[1] sda1[0] > 104320 blocks [2/2] [UU] > > unused devices: <none> > [EMAIL PROTECTED] ~]# losetup /dev/loop0 > /dev/loop0: [0010]:6598 (/dev/md2), encryption blowfish (type 18) You're using an encrypted block device? What mechanism are you using for encryption (doesn't appear to be dmcrypt)? Does it handle readahead bio cancellation correctly? We had similar XFS corruption problems on dmcrypt between 2.6.14 and ~2.6.20 due to a bug in dmcrypt's failure to handle aborted readahead I/O correctly. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ia64] BUG: sleeping in atomic
On Wed, Dec 19, 2007 at 11:42:04AM -0500, Kyle McMartin wrote: > On Wed, Dec 19, 2007 at 04:54:30PM +1100, David Chinner wrote: > > [ 5667.086055] BUG: sleeping function called from invalid context at > > kernel/fork.c:401 > > > > The problem is that mmput is called under the read_lock by > find_thread_for_addr... The comment above seems to indicate that gdb > needs to be able to access any child task's register backing store > memory... This seems pretty broken. > > cheers, Kyle > > --- > > Who knows, maybe gdb is saner now? > > diff --git a/arch/ia64/kernel/ptrace.c b/arch/ia64/kernel/ptrace.c > index 2e96f17..b609704 100644 > --- a/arch/ia64/kernel/ptrace.c > +++ b/arch/ia64/kernel/ptrace.c > @@ -1418,7 +1418,7 @@ asmlinkage long > sys_ptrace (long request, pid_t pid, unsigned long addr, unsigned long data) > { > struct pt_regs *pt; > - unsigned long urbs_end, peek_or_poke; > + unsigned long urbs_end; > struct task_struct *child; > struct switch_stack *sw; > long ret; > @@ -1430,23 +1430,12 @@ sys_ptrace (long request, pid_t pid, unsigned long > addr, unsigned long data) > goto out; > } > > - peek_or_poke = (request == PTRACE_PEEKTEXT > - || request == PTRACE_PEEKDATA > - || request == PTRACE_POKETEXT > - || request == PTRACE_POKEDATA); > - ret = -ESRCH; > - read_lock(&tasklist_lock); > - { > - child = find_task_by_pid(pid); > - if (child) { > - if (peek_or_poke) > - child = find_thread_for_addr(child, addr); > - get_task_struct(child); > - } > - } > - read_unlock(&tasklist_lock); > - if (!child) > + child = ptrace_get_task_struct(pid); > + if (IS_ERR(child)) { > + ret = PTR_ERR(child); > goto out; > + } > + > ret = -EPERM; > if (pid == 1) /* no messing around with init! */ > goto out_tsk; Yes, this patch fixes the problem (though I haven't tried to use gdb yet). Cheers, Dave.
-- Dave Chinner Principal Engineer SGI Australian Software Group -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch, rfc] mm.h, security.h, key.h and preventing namespace poisoning
On Thu, Dec 20, 2007 at 11:07:01AM +1100, James Morris wrote: > On Wed, 19 Dec 2007, David Chinner wrote: > > > Folks, > > > > I just updated a git tree and started getting errors on a > > "copy_keys" macro warning. > > > > The code I've been working on uses a ->copy_keys() method for > > copying the keys in a btree block from one place to another. I've > > been working on this code for a while > > (http://oss.sgi.com/archives/xfs/2007-11/msg00046.html) and keep the > > tree I'm working in relatively up to date (lags Linus by a couple of > > weeks at most). The update I did this afternoon gave a conflict > > warning with the macro in include/linux/key.h. > > > > Given that I'm not directly including key.h anywhere in the XFS > > code, I'm getting the namespace polluted indirectly from some other > > include that is necessary. > > > > As it turns out, this commit from 13 days ago: > > > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7cd94146cd504016315608e297219f9fb7b1413b > > > > included security.h in mm.h and that is how I'm seeing the namespace > > poisoning coming from key.h when !CONFIG_KEY. > > > > Including security.h in mm.h means much wider includes for pretty > > much the entire kernel, and it opens up namespace issues like this > > that never previously existed. > > > > The patch below (only tested for !CONFIG_KEYS && !CONFIG_SECURITY) > > moves security.h into the mmap.c and nommu.c files that need it so > > it doesn't end up with kernel wide scope. > > > > Comments? > > The idea with this placement was to keep memory management code with other > similar code, rather than pushing it into security.h, where it does not > functionally belong. > > Something to note also is that you can't "depend" on security.h not being > included all over the place, as LSM does touch a lot of the kernel. > Unnecessarily including it is bad, of course. Which is what including it in mm.h does.
It also pulls in a lot of other header files, as has already been noted. > I'm not sure I understand your namespace pollution issue, either. doing this globally:

#ifdef CONFIG_SOMETHING
extern int some_common_name(int a, int b, int c);
#else
#define some_common_name(a,b,c) 0
#endif

means that no-one can use some_common_name *anywhere* in the kernel. In this case, I have a completely *private* use of some_common_name and now I can't use that because of the wonderful define above that now has effectively global scope because it gets included from key.h via security.h via mm.h. > In any case, I think the right solution is not to include security.h at > all in mm.h, as it is only being done to get a declaration for > mmap_min_addr. > > How about this, instead ? > > Signed-off-by: James Morris <[EMAIL PROTECTED]> > --- > > mm.h | 5 ++++- > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 1b7b95c..02fbac7 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -12,7 +12,6 @@ > #include <linux/prio_tree.h> > #include <linux/debug_locks.h> > #include <linux/mm_types.h> > -#include <linux/security.h> > > struct mempolicy; > struct anon_vma; > @@ -34,6 +33,10 @@ extern int sysctl_legacy_va_layout; > #define sysctl_legacy_va_layout 0 > #endif > > +#ifdef CONFIG_SECURITY > +extern unsigned long mmap_min_addr; > +#endif > + > #include > #include > #include Fine by me. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Important regression with XFS update for 2.6.24-rc6
On Wed, Dec 19, 2007 at 12:17:30PM +0100, Damien Wyart wrote: > * David Chinner <[EMAIL PROTECTED]> [071219 11:45]: > > Can someone pass me a brown paper bag, please? > > My first impression on this bug was not so wrong, after all ;-) > > > That also explains why we haven't seen it - it requires the user buffer to > > fill on the first entry of a backing buffer and so it is largely dependent > > on the pattern of name lengths, page size and filesystem block size > > aligning just right to trigger the problem. > > I guess I was lucky to trigger it quite easily... > > > Can you test this patch, Damien? > > Works fine, all the bad symptoms have disappeared and strace output is > normal. > > So you can add: > > Tested-by: Damien Wyart <[EMAIL PROTECTED]> Thanks for reporting the bug and testing the fix so quickly, Damien. I'll give it some more QA before I push it, though. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Important regression with XFS update for 2.6.24-rc6
On Wed, Dec 19, 2007 at 02:19:47AM +1100, David Chinner wrote: > On Tue, Dec 18, 2007 at 03:30:31PM +0100, Damien Wyart wrote: > > * David Chinner <[EMAIL PROTECTED]> [071218 13:24]: > > > Ok. I haven't noticed anything wrong with directories up to about > > > 250,000 files in the last few days. The ls -l I just did on > > > a directory with 15000 entries (btree format) used about 5MB of RAM. > > > extent format directories appear to work fine as well (tested 500 > > > entries). > > > > Ok, nice to know the problem is not so frequent. > > . > > > I have put the files at http://damien.wyart.free.fr/xfs/ > > > > strace_xfs_problem.1.gz and strace_xfs_problem.2.gz have been created > > with the problematic kernel, and are quite bigger than > > strace_xfs_problem.normal.gz, which has been created with the vanilla > > rc5-git5. There is also xfs_info. > > Looks like after several getdents() calls through the directory, the getdents() > call starts outputting the first files again. It gets to a certain > point and always goes back to the beginning. However, it appears to > get to the end eventually (without ever getting past the bad offset). UML and a bunch of printk's to the rescue. So we went back to double buffering, which then screwed up the d_off of the dirents. I changed the temporary dirents to point to the current offset so that filldir got what it expected when filling the user buffer. Except it appears that I didn't initialise the current offset for the first dirent read from the temporary buffer so filldir occasionally got an uninitialised offset. Can someone pass me a brown paper bag, please? In my local testing, more often than not, that uninitialised offset reads as zero which is where the looping comes from. Sometimes it points off into wacko-land, which is probably how we eventually get the looping terminating before you run out of memory.
That also explains why we haven't seen it - it requires the user buffer to fill on the first entry of a backing buffer and so it is largely dependent on the pattern of name lengths, page size and filesystem block size aligning just right to trigger the problem. Can you test this patch, Damien? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group

---
 fs/xfs/linux-2.6/xfs_file.c | 1 +
 1 file changed, 1 insertion(+)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_file.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_file.c	2007-12-19 00:26:40.0 +1100
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_file.c	2007-12-19 21:26:38.701143555 +1100
@@ -348,6 +348,7 @@ xfs_file_readdir(
 		size = buf.used;
 		de = (struct hack_dirent *)buf.dirent;
+		curr_offset = de->offset /* & 0x7fff */;
 		while (size > 0) {
 			if (filldir(dirent, de->name, de->namlen,
 				    curr_offset & 0x7fff,

-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch, rfc] mm.h, security.h, key.h and preventing namespace poisoning
Folks, I just updated a git tree and started getting errors on a "copy_keys" macro warning. The code I've been working on uses a ->copy_keys() method for copying the keys in a btree block from one place to another. I've been working on this code for a while (http://oss.sgi.com/archives/xfs/2007-11/msg00046.html) and keep the tree I'm working in relatively up to date (lags Linus by a couple of weeks at most). The update I did this afternoon gave a conflict warning with the macro in include/linux/key.h. Given that I'm not directly including key.h anywhere in the XFS code, I'm getting the namespace polluted indirectly from some other include that is necessary. As it turns out, this commit from 13 days ago: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7cd94146cd504016315608e297219f9fb7b1413b included security.h in mm.h and that is how I'm seeing the namespace poisoning coming from key.h when !CONFIG_KEY. Including security.h in mm.h means much wider includes for pretty much the entire kernel, and it opens up namespace issues like this that never previously existed. The patch below (only tested for !CONFIG_KEYS && !CONFIG_SECURITY) moves security.h into the mmap.c and nommu.c files that need it so it doesn't end up with kernel wide scope. Comments?
---
 include/linux/mm.h       | 16 ----------------
 include/linux/security.h | 26 +++++++++++++++++++++++++-
 mm/mmap.c                |  1 +
 mm/nommu.c               |  1 +
 4 files changed, 27 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1b7b95c..520238c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -12,7 +12,6 @@
 #include <linux/prio_tree.h>
 #include <linux/debug_locks.h>
 #include <linux/mm_types.h>
-#include <linux/security.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -514,21 +513,6 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
 }
 
 /*
- * If a hint addr is less than mmap_min_addr change hint to be as
- * low as possible but still greater than mmap_min_addr
- */
-static inline unsigned long round_hint_to_min(unsigned long hint)
-{
-#ifdef CONFIG_SECURITY
-	hint &= PAGE_MASK;
-	if (((void *)hint != NULL) &&
-	    (hint < mmap_min_addr))
-		return PAGE_ALIGN(mmap_min_addr);
-#endif
-	return hint;
-}
-
-/*
  * Some inline functions in vmstat.h depend on page_zone()
  */
 #include <linux/vmstat.h>
diff --git a/include/linux/security.h b/include/linux/security.h
index ac05083..e9ba391 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -2568,6 +2568,19 @@ void security_key_free(struct key *key);
 int security_key_permission(key_ref_t key_ref, struct task_struct *context,
 			    key_perm_t perm);
 
+/*
+ * If a hint addr is less than mmap_min_addr change hint to be as
+ * low as possible but still greater than mmap_min_addr
+ */
+static inline unsigned long round_hint_to_min(unsigned long hint)
+{
+	hint &= PAGE_MASK;
+	if (((void *)hint != NULL) &&
+	    (hint < mmap_min_addr))
+		return PAGE_ALIGN(mmap_min_addr);
+	return hint;
+}
+
 #else
 
 static inline int security_key_alloc(struct key *key,
@@ -2588,7 +2601,18 @@ static inline int security_key_permission(key_ref_t key_ref,
 	return 0;
 }
 
-#endif
+static inline unsigned long round_hint_to_min(unsigned long hint)
+{
+	return hint;
+}
+
+#endif /* CONFIG_SECURITY */
+
+#else /* !CONFIG_KEYS */
+static inline unsigned long round_hint_to_min(unsigned long hint)
+{
+	return hint;
+}
 #endif /* CONFIG_KEYS */
 
 #endif /* ! __LINUX_SECURITY_H */
diff --git a/mm/mmap.c b/mm/mmap.c
index 15678aa..0d666de 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -26,6 +26,7 @@
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
+#include <linux/security.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
diff --git a/mm/nommu.c b/mm/nommu.c
index b989cb9..99702d1 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -28,6 +28,7 @@
 #include <linux/personality.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
+#include <linux/security.h>
 
 #include <asm/uaccess.h>
 #include <asm/tlb.h>
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch, rfc] mm.h, security.h, key.h and preventing namespace poisoning
Folks, I just updated a git tree and started getting errors on a copy_keys macro warning. The code I've been working on uses a -copy_keys() method for copying the keys in a btree block from one place to another. I've been working on this code for a while (http://oss.sgi.com/archives/xfs/2007-11/msg00046.html) and keep the tree I'm working in reletively up to date (lags linus by a couple of weeks at most). The update I did this afternoon gave a conflict warning with the macro in include/linux/key.h. Given that I'm not directly including key.h anywhere in the XFS code, I'm getting the namespace polluted indirectly from some other include that is necessary. As it turns out, this commit from 13 days ago: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7cd94146cd504016315608e297219f9fb7b1413b included security.h in mm.h and that is how I'm seeing the namespace poisoning coming from key.h when !CONFIG_KEY. Including security.h in mm.h means much wider includes for pretty much the entire kernel, and it opens up namespace issues like this that never previously existed. The patch below (only tested for !CONFIG_KEYS !CONFIG_SECURITY) moves security.h into the mmap.c and nommu.c files that need it so it doesn't end up with kernel wide scope. Comments? 
---
 include/linux/mm.h       |   16 ----------------
 include/linux/security.h |   26 +++++++++++++++++++++++++-
 mm/mmap.c                |    1 +
 mm/nommu.c               |    1 +
 4 files changed, 27 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1b7b95c..520238c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -12,7 +12,6 @@
 #include <linux/prio_tree.h>
 #include <linux/debug_locks.h>
 #include <linux/mm_types.h>
-#include <linux/security.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -514,21 +513,6 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
 }
 
 /*
- * If a hint addr is less than mmap_min_addr change hint to be as
- * low as possible but still greater than mmap_min_addr
- */
-static inline unsigned long round_hint_to_min(unsigned long hint)
-{
-#ifdef CONFIG_SECURITY
-	hint &= PAGE_MASK;
-	if (((void *)hint != NULL) &&
-	    (hint < mmap_min_addr))
-		return PAGE_ALIGN(mmap_min_addr);
-#endif
-	return hint;
-}
-
-/*
  * Some inline functions in vmstat.h depend on page_zone()
  */
 #include <linux/vmstat.h>

diff --git a/include/linux/security.h b/include/linux/security.h
index ac05083..e9ba391 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -2568,6 +2568,19 @@ void security_key_free(struct key *key);
 int security_key_permission(key_ref_t key_ref, struct task_struct *context,
 			    key_perm_t perm);
 
+/*
+ * If a hint addr is less than mmap_min_addr change hint to be as
+ * low as possible but still greater than mmap_min_addr
+ */
+static inline unsigned long round_hint_to_min(unsigned long hint)
+{
+	hint &= PAGE_MASK;
+	if (((void *)hint != NULL) &&
+	    (hint < mmap_min_addr))
+		return PAGE_ALIGN(mmap_min_addr);
+	return hint;
+}
+
 #else
 
 static inline int security_key_alloc(struct key *key,
@@ -2588,7 +2601,18 @@ static inline int security_key_permission(key_ref_t key_ref,
 	return 0;
 }
 
-#endif
+static inline unsigned long round_hint_to_min(unsigned long hint)
+{
+	return hint;
+}
+
+#endif /* CONFIG_SECURITY */
+
+#else /* !CONFIG_KEYS */
+static inline unsigned long round_hint_to_min(unsigned long hint)
+{
+	return hint;
+}
 #endif /* CONFIG_KEYS */
 
 #endif /* ! __LINUX_SECURITY_H */

diff --git a/mm/mmap.c b/mm/mmap.c
index 15678aa..0d666de 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -26,6 +26,7 @@
 #include <linux/mount.h>
 #include <linux/mempolicy.h>
 #include <linux/rmap.h>
+#include <linux/security.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>

diff --git a/mm/nommu.c b/mm/nommu.c
index b989cb9..99702d1 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -28,6 +28,7 @@
 #include <linux/personality.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
+#include <linux/security.h>
 
 #include <asm/uaccess.h>
 #include <asm/tlb.h>

-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Important regression with XFS update for 2.6.24-rc6
On Wed, Dec 19, 2007 at 02:19:47AM +1100, David Chinner wrote: On Tue, Dec 18, 2007 at 03:30:31PM +0100, Damien Wyart wrote: * David Chinner [EMAIL PROTECTED] [071218 13:24]: Ok. I haven't noticed anything wrong with directories up to about 250,000 files in the last few days. The ls -l I just did on a directory with 15000 entries (btree format) used about 5MB of RAM. Extent format directories appear to work fine as well (tested 500 entries). Ok, nice to know the problem is not so frequent. I have put the files at http://damien.wyart.free.fr/xfs/ strace_xfs_problem.1.gz and strace_xfs_problem.2.gz have been created with the problematic kernel, and are quite a bit bigger than strace_xfs_problem.normal.gz, which has been created with the vanilla rc5-git5. There is also xfs_info. Looks like after several getdents() calls through the directory, the getdents() call starts outputting the first files again. It gets to a certain point and always goes back to the beginning. However, it appears to get to the end eventually (without ever getting past the bad offset). UML and a bunch of printk's to the rescue. So we went back to double buffering, which then screwed up the d_off of the dirents. I changed the temporary dirents to point to the current offset so that filldir got what it expected when filling the user buffer. Except it appears that I didn't initialise the current offset for the first dirent read from the temporary buffer, so filldir occasionally got an uninitialised offset. Can someone pass me a brown paper bag, please? In my local testing, more often than not, that uninitialised offset reads as zero, which is where the looping comes from. Sometimes it points off into wacko-land, which is probably how we eventually get the looping terminating before you run out of memory.
That also explains why we haven't seen it - it requires the user buffer to fill on the first entry of a backing buffer and so it is largely dependent on the pattern of name lengths, page size and filesystem block size aligning just right to trigger the problem. Can you test this patch, Damien? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group

---
 fs/xfs/linux-2.6/xfs_file.c |    1 +
 1 file changed, 1 insertion(+)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_file.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_file.c	2007-12-19 00:26:40.000000000 +1100
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_file.c	2007-12-19 21:26:38.701143555 +1100
@@ -348,6 +348,7 @@ xfs_file_readdir(
 		size = buf.used;
 		de = (struct hack_dirent *)buf.dirent;
 
+		curr_offset = de->offset /* & 0x7fffffff */;
 		while (size > 0) {
 			if (filldir(dirent, de->name, de->namlen,
 				    curr_offset & 0x7fffffff,

-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Important regression with XFS update for 2.6.24-rc6
On Wed, Dec 19, 2007 at 12:17:30PM +0100, Damien Wyart wrote: * David Chinner [EMAIL PROTECTED] [071219 11:45]: Can someone pass me a brown paper bag, please? My first impression on this bug was not so wrong, after all ;-) That also explains why we haven't seen it - it requires the user buffer to fill on the first entry of a backing buffer and so it is largely dependent on the pattern of name lengths, page size and filesystem block size aligning just right to trigger the problem. I guess I was lucky to trigger it quite easily... Can you test this patch, Damien? Works fine, all the bad symptoms have disappeared and strace output is normal. So you can add: Tested-by: Damien Wyart [EMAIL PROTECTED] Thanks for reporting the bug and testing the fix so quickly, Damien. I'll give it some more QA before I push it, though. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch, rfc] mm.h, security.h, key.h and preventing namespace poisoning
On Thu, Dec 20, 2007 at 11:07:01AM +1100, James Morris wrote: On Wed, 19 Dec 2007, David Chinner wrote: Folks, I just updated a git tree and started getting errors on a copy_keys macro warning. The code I've been working on uses a ->copy_keys() method for copying the keys in a btree block from one place to another. I've been working on this code for a while (http://oss.sgi.com/archives/xfs/2007-11/msg00046.html) and keep the tree I'm working in relatively up to date (lags linus by a couple of weeks at most). The update I did this afternoon gave a conflict warning with the macro in include/linux/key.h. Given that I'm not directly including key.h anywhere in the XFS code, I'm getting the namespace polluted indirectly from some other include that is necessary. As it turns out, this commit from 13 days ago: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7cd94146cd504016315608e297219f9fb7b1413b included security.h in mm.h and that is how I'm seeing the namespace poisoning coming from key.h when !CONFIG_KEYS. Including security.h in mm.h means much wider includes for pretty much the entire kernel, and it opens up namespace issues like this that never previously existed. The patch below (only tested for !CONFIG_KEYS && !CONFIG_SECURITY) moves security.h into the mmap.c and nommu.c files that need it so it doesn't end up with kernel-wide scope. Comments? The idea with this placement was to keep memory management code with other similar code, rather than pushing it into security.h, where it does not functionally belong. Something to note also is that you can't depend on security.h not being included all over the place, as LSM does touch a lot of the kernel. Unnecessarily including it is bad, of course. Which is what including it in mm.h does. It also pulls in a lot of other header files, as has already been noted. I'm not sure I understand your namespace pollution issue, either.
doing this globally:

#ifdef CONFIG_SOMETHING
extern int some_common_name(int a, int b, int c);
#else
#define some_common_name(a,b,c)	0
#endif

means that no-one can use some_common_name *anywhere* in the kernel. In this case, I have a completely *private* use of some_common_name and now I can't use it, because the wonderful define above now has effectively global scope, since it gets included from key.h via security.h via mm.h. In any case, I think the right solution is not to include security.h at all in mm.h, as it is only being done to get a declaration for mmap_min_addr. How about this, instead? Signed-off-by: James Morris [EMAIL PROTECTED]
---
 mm.h |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1b7b95c..02fbac7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -12,7 +12,6 @@
 #include <linux/prio_tree.h>
 #include <linux/debug_locks.h>
 #include <linux/mm_types.h>
-#include <linux/security.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -34,6 +33,10 @@ extern int sysctl_legacy_va_layout;
 #define sysctl_legacy_va_layout 0
 #endif
 
+#ifdef CONFIG_SECURITY
+extern unsigned long mmap_min_addr;
+#endif
+
 #include <asm/page.h>
 #include <asm/pgtable.h>
 #include <asm/processor.h>

Fine by me. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ia64] BUG: sleeping in atomic
On Wed, Dec 19, 2007 at 11:42:04AM -0500, Kyle McMartin wrote: On Wed, Dec 19, 2007 at 04:54:30PM +1100, David Chinner wrote: [ 5667.086055] BUG: sleeping function called from invalid context at kernel/fork.c:401 The problem is that mmput is called under the read_lock by find_thread_for_addr... The comment above seems to indicate that gdb needs to be able to access any child task's register backing store memory... This seems pretty broken. cheers, Kyle --- Who knows, maybe gdb is saner now?

diff --git a/arch/ia64/kernel/ptrace.c b/arch/ia64/kernel/ptrace.c
index 2e96f17..b609704 100644
--- a/arch/ia64/kernel/ptrace.c
+++ b/arch/ia64/kernel/ptrace.c
@@ -1418,7 +1418,7 @@ asmlinkage long
 sys_ptrace (long request, pid_t pid, unsigned long addr, unsigned long data)
 {
 	struct pt_regs *pt;
-	unsigned long urbs_end, peek_or_poke;
+	unsigned long urbs_end;
 	struct task_struct *child;
 	struct switch_stack *sw;
 	long ret;
@@ -1430,23 +1430,12 @@ sys_ptrace (long request, pid_t pid, unsigned long addr, unsigned long data)
 		goto out;
 	}
 
-	peek_or_poke = (request == PTRACE_PEEKTEXT
-			|| request == PTRACE_PEEKDATA
-			|| request == PTRACE_POKETEXT
-			|| request == PTRACE_POKEDATA);
-	ret = -ESRCH;
-	read_lock(&tasklist_lock);
-	{
-		child = find_task_by_pid(pid);
-		if (child) {
-			if (peek_or_poke)
-				child = find_thread_for_addr(child, addr);
-			get_task_struct(child);
-		}
-	}
-	read_unlock(&tasklist_lock);
-	if (!child)
+	child = ptrace_get_task_struct(pid);
+	if (IS_ERR(child)) {
+		ret = PTR_ERR(child);
 		goto out;
+	}
+
 	ret = -EPERM;
 	if (pid == 1)	/* no messing around with init! */
 		goto out_tsk;

Yes, this patch fixes the problem (though I haven't tried to use gdb yet). Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[ia64] BUG: sleeping in atomic
Just saw this again:

[ 5667.086055] BUG: sleeping function called from invalid context at kernel/fork.c:401
[ 5667.087314] in_atomic():1, irqs_disabled():0
[ 5667.088210]
[ 5667.088212] Call Trace:
[ 5667.089104] [] show_stack+0x80/0xa0
[ 5667.089106]     sp=e038f6507a00 bsp=e038f6500f48
[ 5667.116025] [] dump_stack+0x30/0x60
[ 5667.116028]     sp=e038f6507bd0 bsp=e038f6500f30
[ 5667.118317] [] __might_sleep+0x1e0/0x200
[ 5667.118320]     sp=e038f6507bd0 bsp=e038f6500f08
[ 5667.142316] [] mmput+0x20/0x220
[ 5667.142319]     sp=e038f6507bd0 bsp=e038f6500ee0
[ 5667.164201] [] sys_ptrace+0x460/0x15c0
[ 5667.164203]     sp=e038f6507bd0 bsp=e038f6500dd0
[ 5667.175488] [] ia64_ret_from_syscall+0x0/0x20
[ 5667.175490]     sp=e038f6507e30 bsp=e038f6500dd0
[ 5667.199324] [] __kernel_syscall_via_break+0x0/0x20
[ 5667.199327]     sp=e038f6508000 bsp=e038f6500dd0
[ 5682.626704] BUG: sleeping function called from invalid context at kernel/fork.c:401

When stracing a process on 2.6.24-rc3 on ia64. Command line to reproduce:

# strace ls -l /mnt/scratch/test

ISTR reporting this some time ago (maybe 2.6.22?) Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [GIT PULL] XFS update for 2.6.24-rc6
On Tue, Dec 18, 2007 at 05:19:04PM -0800, Linus Torvalds wrote: > > > On Wed, 19 Dec 2007, David Chinner wrote: > > > > On Tue, Dec 18, 2007 at 05:59:11PM +1100, Lachlan McIlroy wrote: > > > Please pull from the for-linus branch: > > > git pull git://oss.sgi.com:8090/xfs/xfs-2.6.git for-linus > > > > Linus, please don't pull this yet. A problem has been found in > > the dirent fix, and we've just fixed another mknod related regression > > so we've got another couple of fixes still to go in XFS for 2.6.24. > > Too late, it's already long since pulled. Ok, we'll just have to get the fixes to you ASAP. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [GIT PULL] XFS update for 2.6.24-rc6
On Tue, Dec 18, 2007 at 05:59:11PM +1100, Lachlan McIlroy wrote: > Please pull from the for-linus branch: > git pull git://oss.sgi.com:8090/xfs/xfs-2.6.git for-linus Linus, please don't pull this yet. A problem has been found in the dirent fix, and we've just fixed another mknod related regression so we've got another couple of fixes still to go in XFS for 2.6.24. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/