Re: inode leak in 2.6.24?

2008-02-20 Thread David Chinner
On Wed, Feb 20, 2008 at 03:36:53PM +0100, Ferenc Wagner wrote:
> David Chinner <[EMAIL PROTECTED]> writes:
> > The xfs inodes are clearly pinned by the dentry cache, so the issue
> > is dentries, not inodes. What's causing dentries not to be
> > reclaimed?  I can't see anything that could pin them (e.g. no filp's
> > that would indicate open files being responsible), so my initial
> > thoughts are that memory reclaim may have changed behaviour.
> >
> > I guess the first thing to find out is whether memory pressure
> > results in freeing the dentries. To simulate memory pressure causing
> > slab cache reclaim, can you run:
> >
> > # echo 2 > /proc/sys/vm/drop_caches
> >
> > and see if the number of dentries and inodes drops. If the number
> > goes down significantly, then we aren't leaking dentries and there's
> > been a change in memory reclaim behaviour. If it stays the same, then
> > we probably are leaking dentries.
> 
> Hi Dave,
> 
> Thanks for looking into this.  There's no real conclusion yet: the
> simulated memory pressure sent the numbers down all right, but
> meanwhile it turned out that this is a different case: on this machine
> the increase wasn't a constant growth, but related to the daily
> updatedb job.  I'll reload the original kernel on the original
> machine, and collect the same info if the problem reappears.

Ok, let me know how it goes when you get a chance.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: inode leak in 2.6.24?

2008-02-19 Thread David Chinner
On Tue, Feb 19, 2008 at 05:57:08PM +0100, Ferenc Wagner wrote:
> David Chinner <[EMAIL PROTECTED]> writes:
> > On Sat, Feb 16, 2008 at 12:18:58AM +0100, Ferenc Wagner wrote:
> So, I loaded the same kernel on a different machine, but that seems to
> exhibit a very similar behaviour.  The machine is absolutely idle,
> nobody logged in during this period, though an updatedb ran during
> this period.  However, the increase seems steady, not correlated to
> cron.daily.
> 
> The contents of /proc/sys/fs/inode-nr after reboot was:
> 4421  95
> 
> and now, 13h35m later it's:
> 146182  0
> 
> Find the two slabinfo outputs attached.

Content-Description: slabinfo output immediately after reboot
> xfs_inode            791    800    384   10    1 : tunables   54   27    8 : slabdata     80     80      0
> xfs_vnode            790    790    384   10    1 : tunables   54   27    8 : slabdata     79     79      0
> dentry              5133   5133    132   29    1 : tunables  120   60    8 : slabdata    177    177      0

Content-Description: slabinfo output 13h35m after reboot
> xfs_inode         142548 142550    384   10    1 : tunables   54   27    8 : slabdata  14255  14255      0
> xfs_vnode         142548 142550    384   10    1 : tunables   54   27    8 : slabdata  14255  14255      0
> dentry            148003 148074    132   29    1 : tunables  120   60    8 : slabdata   5106   5106      0

> The xfs_inode, xfs_vnode and dentry lines show significant increase.
> The machine indeed uses XFS as its root filesystem.  Hope this gives
> enough clues to narrow down the problem.  I can try other kernels if
> needed.

The xfs inodes are clearly pinned by the dentry cache, so the issue
is dentries, not inodes. What's causing dentries not to be
reclaimed?  I can't see anything that could pin them (e.g. no filp's
that would indicate open files being responsible), so my initial
thoughts are that memory reclaim may have changed behaviour.

I guess the first thing to find out is whether memory pressure
results in freeing the dentries. To simulate memory pressure causing
slab cache reclaim, can you run:

# echo 2 > /proc/sys/vm/drop_caches

and see if the number of dentries and inodes drops. If the number
goes down significantly, then we aren't leaking dentries and there's
been a change in memory reclaim behaviour. If it stays the same, then
we probably are leaking dentries.
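
A rough before-and-after check along these lines is enough (assuming
the usual /proc layout; the slab names to watch are the ones above):

grep -E '^(xfs_inode|xfs_vnode|dentry) ' /proc/slabinfo
cat /proc/sys/fs/inode-nr
echo 2 > /proc/sys/vm/drop_caches
grep -E '^(xfs_inode|xfs_vnode|dentry) ' /proc/slabinfo
cat /proc/sys/fs/inode-nr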

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [PATCH] Implement barrier support for single device DM devices

2008-02-19 Thread David Chinner
On Tue, Feb 19, 2008 at 02:39:00AM +, Alasdair G Kergon wrote:
> > For example, how safe
> > xfs is if barriers are not supported or turned off?  
> 
> The last time we tried xfs with dm it didn't seem to notice -EOPNOTSUPP
> everywhere it should => recovery may find corruption.

Bug reports, please. What we don't know about, we can't fix.

As of this commit:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=0bfefc46dc028df60120acdb92062169c9328769

XFS should be handling all cases of -EOPNOTSUPP for barrier
I/Os. If you are still having problems, please let us know.
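
A quick way to confirm what the filesystem decided is to check the
kernel log after mounting; the exact wording varies between versions,
but it is along these lines:

dmesg | grep -i barrier
# expect a "Disabling barriers ..." message from XFS if the trial
# barrier write or flush was rejected by the device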

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: xfsaild causing 30+ wakeups/s on an idle system since 2.6.25-rcX

2008-02-19 Thread David Chinner
On Mon, Feb 18, 2008 at 03:22:02PM -0800, Linda Walsh wrote:
> Not to look excessively dumb, but what's xfsaild?

AIL = Active Item List

It is a sorted list of all the logged metadata objects that have not
yet been written back to disk.  The xfsaild is responsible for tail
pushing the log.  i.e.  writing back objects in the AIL in the most
efficient manner possible.

Why a thread? Because allowing parallelism in tail pushing is a
scalability problem and moving this to its own thread completely
isolates it from parallelism. Tail pushing only requires a small
amount of CPU time, but it requires a global scope spinlock.
Isolating the pushing to a single CPU means the spinlock is not
contended across every CPU in the machine.

How much did it improve scalability? On a 2048p machine with an
MPI job that did a synchronised close of 12,000 files (6 per CPU),
the close time went from ~5400s without the thread to 9s with the
xfsaild. That's only about 600x faster. ;)

> xfs seems to be sprouting daemons at a more rapid pace
> these days...xfsbufd, xfssyncd, xfsdatad, xfslogd, xfs_mru_cache, and
> now xfsaild?

Why not? Got to make use of all those cores machines have these
days. ;)

Fundamentally, threads are cheap and simple. We'll keep adding
threads where it makes sense as long as it improves performance and
scalability.

> Are there any design docs (scribbles?) saying what these do and why
> they were added so I can just go read 'em myself?  I'm sure they
> were added for good reason...just am curious more than anything.

'git log' is your friend. The commits that introduce the new threads
explain why they are necessary. ;)
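
For example, starting from a checkout of the mainline tree:

git log -- fs/xfs/xfs_trans_ail.c       # history of the AIL push code
git log --pretty=oneline --grep=xfsaild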

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [dm-devel] Re: [PATCH] Implement barrier support for single device DM devices

2008-02-18 Thread David Chinner
On Tue, Feb 19, 2008 at 02:56:43AM +, Alasdair G Kergon wrote:
> On Tue, Feb 19, 2008 at 09:16:44AM +1100, David Chinner wrote:
> > Surely any hardware that doesn't support barrier
> > operations can emulate them with cache flushes when they receive a
> > barrier I/O from the filesystem
>  
> My complaint about having to support them within dm when more than one
> device is involved is because any efficiencies disappear: you can't send
> further I/O to any one device until all the other devices have completed
> their barrier (or else later I/O to that device could overtake the
> barrier on another device).

Right - it's a horrible performance hit.

But - how is what you describe any different to the filesystem doing:

- flush block device
- issue I/O
- wait for completion
- flush block device

around any I/O that it would otherwise simply tag as a barrier?
That serialisation at the filesystem layer is a horrible, horrible
performance hit.

And then there's the fact that we can't implement that in XFS
because all the barrier I/Os we issue are asynchronous.  We'd
basically have to serialise all metadata operations and now we
are talking about far worse performance hits than implementing
barrier emulation in the block device.

Also, it's instructive to look at the implementation of
blkdev_issue_flush() - the API one is supposed to use to trigger a
full block device flush. It doesn't work on DM/MD either, because
it uses a no-I/O barrier bio:

bio->bi_end_io = bio_end_empty_barrier;
bio->bi_private = &wait;
bio->bi_bdev = bdev;
submit_bio(1 << BIO_RW_BARRIER, bio);

wait_for_completion(&wait);

So, if the underlying block device doesn't support barriers,
there's no point in changing the filesystem to issue flushes,
either...

> And then I argue that it would be better
> for the filesystem to have the information that these are not hardware
> barriers so it has the opportunity of tuning its behaviour (e.g.
> flushing less often because it's a more expensive operation).

There is generally no option from the filesystem POV to "flush
less". Either we use barrier I/Os where we need to and are safe with
volatile caches or we corrupt filesystems with volatile caches when
power loss occurs. There is no in-between where "flushing less"
will save us from corruption.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [dm-devel] Re: [PATCH] Implement barrier support for single device DM devices

2008-02-18 Thread David Chinner
On Mon, Feb 18, 2008 at 04:24:27PM +0300, Michael Tokarev wrote:
> First, I still don't understand why in God's sake barriers are "working"
> while regular cache flushes are not.  Almost no consumer-grade hard drive
> supports write barriers, but they all support regular cache flushes, and
> the latter should be enough (while not the most speed-optimal) to ensure
> data safety.  Why to require write cache disable (like in XFS FAQ) instead
> of going the flush-cache-when-appropriate (as opposed to write-barrier-
> when-appropriate) way?

Devil's advocate:

Why should we need to support multiple different block layer APIs
to do the same thing? Surely any hardware that doesn't support barrier
operations can emulate them with cache flushes when they receive a
barrier I/O from the filesystem

Also, given that disabling the write cache still allows CTQ/NCQ to
operate effectively and that in most cases WCD+CTQ is as fast as
WCE+barriers, the simplest thing to do is turn off volatile write
caches and not require any extra software kludges for safe
operation.
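
For a single SATA/PATA disk that is just (device name is only an
example):

hdparm -W0 /dev/sda       # turn off the drive's volatile write cache
hdparm -W  /dev/sda       # check the current write-cache setting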

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: inode leak in 2.6.24?

2008-02-18 Thread David Chinner
On Sat, Feb 16, 2008 at 12:18:58AM +0100, Ferenc Wagner wrote:
> Hi,
> 
> 5 days ago I pulled the git tree (HEAD was
> 25f666300625d894ebe04bac2b4b3aadb907c861), added two minor patches
> (the vmsplice fix and the GFS1 exports), compiled and booted the
> kernel.  Things are working OK, but I noticed that inode usage has
> been steadily rising since then (see attached graph, unless lost in
> transit).  The real filesystems used by the machine are XFS.  I wonder
> if it may be some kind of bug and if yes, whether it has been fixed
> already.  Feel free to ask for any missing information.

Output of /proc/slabinfo, please. If you could get a sample when the
machine has just booted, one at the typical post-boot steady state
as well as one after it has increased well past normal, it would
help identify what type of inode is increasing differently.

Also, can you tell us what metrics you are graphing (i.e. where
in /proc or /sys you are getting them from)?
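
Something as simple as this, run at each of those points, is fine
(the output file name is just an example):

date >> slab-samples.txt
cat /proc/slabinfo >> slab-samples.txt
cat /proc/sys/fs/inode-nr >> slab-samples.txt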

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: xfsaild causing 30+ wakeups/s on an idle system since 2.6.25-rcX

2008-02-18 Thread David Chinner
On Mon, Feb 18, 2008 at 11:41:39AM +0200, Török Edwin wrote:
> David Chinner wrote:
> > On Sun, Feb 17, 2008 at 05:51:08PM +0100, Oliver Pinter wrote:
> >   
> >> On 2/17/08, Török Edwin <[EMAIL PROTECTED]> wrote:
> >> 
> >>> Hi,
> >>>
> >>> xfsaild is causing many wakeups, a quick investigation shows
> >>> xfsaild_push is always
> >>> returning 30 msecs timeout value.
> >>>   
> >
> > That's a bug
> 
> Ok. Your patches fixes the 30+ wakeups :)

Good. I'll push it out for review then.

> > , and has nothing to do with power consumption. ;)
> >   
> 
> I suggest using a sysctl value (such as
> /proc/sys/vm/dirty_writeback_centisecs), instead of a hardcoded default
> 1000.

No, too magic. I dislike adding knobs to work around issues that
really should be fixed by having sane default behaviour.  Further
down the track, as we correct known issues with the AIL push code,
we'll be able to increase this idle timeout or even make it purely
wakeup driven once we get back to an idle state. However, right now
it still needs that once-a-second wakeup to work around a nasty
corner case that can hang the filesystem.
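
If you want to double-check the wakeup rate once the patch is
applied, and your kernel has CONFIG_TIMER_STATS enabled (it's what
powertop reads), something like this does it:

echo 1 > /proc/timer_stats        # start collecting
sleep 10
grep xfsaild /proc/timer_stats    # wakeup events attributed to xfsaild
echo 0 > /proc/timer_stats        # stop collecting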

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [PATCH] Implement barrier support for single device DM devices

2008-02-17 Thread David Chinner
On Fri, Feb 15, 2008 at 04:07:54PM +0300, Michael Tokarev wrote:
> Alasdair G Kergon wrote:
> > On Fri, Feb 15, 2008 at 01:08:21PM +0100, Andi Kleen wrote:
> >> Implement barrier support for single device DM devices
> >  
> > Thanks.  We've got some (more-invasive) dm patches in the works that
> > attempt to use flushing to emulate barriers where we can't just
> > pass them down like that.
> 
> I wonder if it's worth the effort to try to implement this.
> 
> As far as I understand (*), if a filesystem realizes that the
> underlying block device does not support barriers, it will
> switch to using regular flushes instead

No, typically the filesystems won't issue flushes, either.

> - isn't it the same
> thing as you're trying to do on an MD level?
> 
> Note that a filesystem must understand barriers/flushes on
> underlying block device, since many disk drives don't support
> barriers anyway.
> 
> (*) this is, in fact, an interesting question.  I still can't
> find complete information about this.  For example, how safe
> xfs is if barriers are not supported or turned off?  Is it
> "less safe" than with barriers?  Will it use regular cache
> flushes if barriers are not here?

Try reading the XFS FAQ:

http://oss.sgi.com/projects/xfs/faq/#wcache
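
The short version: either leave barriers enabled, or turn barriers
and the drive's volatile write cache off together, e.g. (device and
mount point are only examples):

mount -t xfs -o nobarrier /dev/sdb1 /data
hdparm -W0 /dev/sdb       # no barriers, so no volatile write cache either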

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: xfsaild causing 30+ wakeups/s on an idle system since 2.6.25-rcX

2008-02-17 Thread David Chinner
On Sun, Feb 17, 2008 at 05:51:08PM +0100, Oliver Pinter wrote:
> On 2/17/08, Török Edwin <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > xfsaild is causing many wakeups, a quick investigation shows
> > xfsaild_push is always
> > returning 30 msecs timeout value.

That's a bug, and has nothing to do with power consumption. ;)

I can see that there is a dirty item in the filesystem:

Entering kdb (current=0xe0b8f4fe, pid 30046) on processor 3 due to 
Breakpoint @ 0xa00100454fc0
[3]kdb> bt
Stack traceback for pid 30046
0xe0b8f4fe300462  13   R  0xe0b8f4fe0340 *xfsaild
0xa00100454fc0 xfsaild_push
args (0xe0b89090, 0xe0b8f4fe7e30, 0x31b)

[3]kdb> xail 0xe0b89090
AIL for mp 0xe0b89090, oldest first
[0] type buf flags: 0x1 <in ail>  lsn [1:13133]
   buf 0xe0b880258800 blkno 0x0 flags: 0x2 <dirty>
   Superblock (at 0xe0b8f9b3c000)
[3]kdb>

the superblock is dirty, and the lsn is well beyond the target
of the xfsaild hence it *should* be idling. However, it isn't
idling because there is a dirty item in the list and the idle trigger
of "is list empty" is not tripping.

I only managed to reproduce this on a lazy superblock counter
filesystem (i.e.  new mkfs and recent kernel), as it logs the
superblock every so often, and that is probably what is keeping the
fs dirty like this.

Can you see if the patch below fixes the problem?

---

Idle state is not being detected properly by the xfsaild push
code. The current idle state is detected by an empty list
which may never happen with a mostly idle filesystem or one
using lazy superblock counters. A single dirty item in the
list will result in repeated looping trying to push everything
past the target, because it fails to check whether we managed
to push anything.

Fix by considering a dirty list with everything past the target
as an idle state and set the timeout appropriately.

Signed-off-by: Dave Chinner <[EMAIL PROTECTED]>
---
 fs/xfs/xfs_trans_ail.c |   15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/xfs_trans_ail.c
===
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_trans_ail.c   2008-02-18 09:14:34.0 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_trans_ail.c        2008-02-18 09:18:52.070682570 +1100
@@ -261,14 +261,17 @@ xfsaild_push(
xfs_log_force(mp, (xfs_lsn_t)0, XFS_LOG_FORCE);
}
 
-   /*
-* We reached the target so wait a bit longer for I/O to complete and
-* remove pushed items from the AIL before we start the next scan from
-* the start of the AIL.
-*/
-   if ((XFS_LSN_CMP(lsn, target) >= 0)) {
+   if (count && (XFS_LSN_CMP(lsn, target) >= 0)) {
+   /*
+* We reached the target so wait a bit longer for I/O to
+* complete and remove pushed items from the AIL before we
+* start the next scan from the start of the AIL.
+*/
tout += 20;
last_pushed_lsn = 0;
+   } else if (!count) {
+   /* We're past our target or empty, so idle */
+   tout = 1000;
} else if ((restarts > XFS_TRANS_PUSH_AIL_RESTARTS) ||
   (count && ((stuck * 100) / count > 90))) {
/*


Re: linux-next: first tree

2008-02-14 Thread David Chinner
On Fri, Feb 15, 2008 at 11:50:40AM +1100, Stephen Rothwell wrote:
> Hi David,
> 
> On Fri, 15 Feb 2008 10:17:02 +1100 David Chinner <[EMAIL PROTECTED]> wrote:
> >
> > The current XFS tree that goes into -mm is:
> > 
> > git://oss.sgi.com:8090/xfs/xfs-2.6.git master
> 
> Added, thanks.
> 
> I have put you as the contact point - is this correct?

Not really ;)

> I notice that the
> MAINTAINERS file has a different contact for XFS.

Yup - [EMAIL PROTECTED]. If you want a real person
on the end of problem reports feel free to leave me there,
but you should really also cc the xfs-masters address as well
to guarantee that the problem is seen and dealt with promptly
as I'm not always available.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: linux-next: first tree

2008-02-14 Thread David Chinner
On Fri, Feb 15, 2008 at 12:35:37AM +1100, Stephen Rothwell wrote:
> Also, more trees please ... :-)

The current XFS tree that goes into -mm is:

git://oss.sgi.com:8090/xfs/xfs-2.6.git master
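
i.e. to follow it from an existing clone of the mainline tree:

git remote add xfs git://oss.sgi.com:8090/xfs/xfs-2.6.git
git fetch xfs
git log --pretty=oneline xfs/master    # what is queued up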

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: xfs [_fsr] probs in 2.6.24.0

2008-02-12 Thread David Chinner
On Tue, Feb 12, 2008 at 01:02:05PM -0800, Linda Walsh wrote:
> David Chinner wrote:
> >Perhaps by running xfs_fsr manually you could reproduce the
> >problem while you are sitting in front of the machine...
> 
>   Um...yeah, AND with multiple cp's of multi-gig files
> going on at same time, both local, by a sister machine via NFS,
> and a 3rd machine tapping (not banging) away via CIFS.
> These were on top of normal server duties.  Whenever I stress
> it on *purpose* and watch it, works fine.   GRRRI HATE
>  THAT!!!

I feel your pain.

>   The file system(s) going "offline" due
> to xfs-detected filesystem errors has only happened *once* on
> asa, the 64-bit machine.  It's a fairly new machine w/o added
> hardware -- but this only happened in 2.24.0 when I added the
> tickless clock option, which sure seems like a remote possibility for
> causing an xfs error, but could be.

Well, tickless is new and shiny and I doubt anyone has done
much testing with XFS on tickless kernels. Still, if that's a new
config option you set, change it back to what you had for .23 on
that hardware and try again.
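
e.g. compare the tickless-related options (CONFIG_NO_HZ and friends)
between the two configs, assuming they are kept under /boot:

grep -E 'CONFIG_NO_HZ|CONFIG_HZ' /boot/config-2.6.23* /boot/config-2.6.24*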

> >Looking at it in terms of i_mutex, other filesystems hold
> >i_mutex over dio_get_page() (all those that use DIO_LOCKING)
> >so question is whether we are allowed to take the i_mutex
> >in ->release. I note that both reiserfs and hfsplus take 
> >i_mutex in ->release as well as use DIO_LOCKING, so this
> >problem is not isolated to XFS.
> >
> >However, it would appear that mmap_sem -> i_mutex is illegal
> >according to the comment at the head of mm/filemap.c. While we are
> >not using i_mutex in this case, the inversion would seem to be
> >equivalent in nature.
> >
> >There's not going to be a quick fix for this.
> 
>   What could be the consequences of this locking anomaly?

If you have a multithreaded application that mixes mmap and
direct I/O, and you have a simultaneous munmap() call and
read() to the same file, you might be able to deadlock access
to that file. However, you'd have to be certifiably insane
to write an application that did this (mix mmap and direct I/O
to the same file at the same time), so I think exposure is
pretty limited.

> I.e., for example, in NFS, I have enabled "allow direct I/O on NFS
> files".

That's client side direct I/O, which is not what the server
does. Client side direct I/O results in synchronous buffered
I/O on the server, which will thrash your disks pretty hard.
The config option help does warn you about this. ;)

> >And the other one:
> >
> >>Feb  7 02:01:50 kern: 
> >>---
> >>Feb  7 02:01:50 kern: xfs_fsr/6313 is trying to acquire lock:
> >>Feb  7 02:01:50 kern:  (&(&ip->i_lock)->mr_lock/2){}, at: 
> >>[<803c1b22>] xfs_ilock+0x82/0xc0
> >>Feb  7 02:01:50 kern:
> >>Feb  7 02:01:50 kern: but task is already holding lock:
> >>Feb  7 02:01:50 kern:  (&(&ip->i_iolock)->mr_lock/3){--..}, at: 
> >>[<803c1b45>] xfs_ilock+0xa5/0xc0
> >>Feb  7 02:01:50 kern:
> >>Feb  7 02:01:50 kern: which lock already depends on the new lock.
> >
> >Looks like yet another false positive. Basically we do this
> >in xfs_swap_extents:
> >
> > inode A: i_iolock class 2
> > inode A: i_ilock class 2
> > inode B: i_iolock class 3
> > inode B: i_ilock class 3
> > .
> > inode A: unlock ilock
> > inode B: unlock ilock
> > .
> >>>>>>inode A: ilock class 2
> > inode B: ilock class 3
> >
> >And lockdep appears to be complaining about the relocking of inode A
> >as class 2 because we've got a class 3 iolock still held, hence
> >violating the order it saw initially.  There's no possible deadlock
> >here so we'll just have to add more hacks to the annotation code to make
> >lockdep happy.
> 
>   Is there a reason to unlock and relock the same inode while
> the level 3 lock is held -- i.e. does 'unlocking ilock' allow some
> increased 'throughput' for some other potential process to access
> the same inode? 

It prevents a single thread deadlock when doing transaction reservation.
i.e. the process of setting up a transaction can require the ilock
to be taken, and hence we have to drop it before and pick it back up
after the transaction reservation.

We hold on to the iolock to prevent the inode from having new I/O
started while we do the transaction reservation, so it's in the
same state after the reservation as it was before.

We have to hold both locks to guarantee exclusive access to the
inode, so once we have the reservation we need to pick the ilocks
back up. The way we do it here does not violate lock ordering at all
(iolock before ilock on a single inode, and ascending inode number
order for multiple inodes), but lockdep is not smart enough to know
that. Hence we need more complex annotations to shut it up.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

Re: xfs [_fsr] probs in 2.6.24.0

2008-02-12 Thread David Chinner
On Mon, Feb 11, 2008 at 05:02:05PM -0800, Linda Walsh wrote:
> 
> I'm getting similar errors on an x86-32 & x86-64 kernel.  The x86-64 system
> (2nd log below w/date+times) was unusable this morning: one or more of the
> xfs file systems had "gone off line" due to some unknown error (upon reboot,
> no errors were indicated; all partitions on the same physical disk).
> 
> I keep ending up with random failures on two systems.  The 32-bit sys more
> often than not just "locks-up" -- no messages, no keyboard response...etc.
> Nothing to do except reset -- and of course, nothing in the logs

Filesystem bugs rarely hang systems hard like that - more likely is
a hardware or driver problem. And neither of the lockdep reports
below are likely to be responsible for a system wide, no-response
hang.

[cut a bunch of speculation and stuff about hardware problems
causing XFS problems]

> Lemme know if I can provide any info in addressing these problems.  I don't
> know if they are related to the other system crashes, but usage patterns
> indicate it could be disk-activity related.  So eliminating known bugs &
> problems in that area might help (?)...

If your hardware or drivers are unstable, then XFS cannot be
expected to reliably work. Given that xfs_fsr apparently triggers
the hangs, I'd suggest putting lots of I/O load on your disk subsystem
by copying files around with direct I/O (just like xfs_fsr does) to
try to reproduce the problem.

Perhaps by running xfs_fsr manually you could reproduce the
problem while you are sitting in front of the machine...
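
e.g. something along these lines, with the mount point, file names
and sizes picked to suit your setup:

xfs_fsr -v /bigdisk                   # defragment one mount point, verbosely
dd if=/dev/zero of=/bigdisk/testfile bs=1M count=4096 oflag=direct
dd if=/bigdisk/testfile of=/dev/null bs=1M iflag=direct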

Looking at the lockdep reports:

> The first set of errors the I have some diagnostics for, I could
> only find in dmesg:
> (ish-32bit machine)
> ===
> [ INFO: possible circular locking dependency detected ]
> 2.6.24-ish #1
> ---
> xfs_fsr/2119 is trying to acquire lock:
> (&mm->mmap_sem){}, at: [<b01a3432>] dio_get_page+0x62/0x160
> 
> but task is already holding lock:
> (&(&ip->i_iolock)->mr_lock){}, at: [<b02651db>] xfs_ilock+0x5b/0xb0

dio_get_page() takes the mmap_sem of the process's
vma that has the pages we do I/O into. That's not new.
We're holding the xfs inode iolock at this point to protect
against truncate  and simultaneous buffered I/O races and
this is also unchanged.  i.e. this is normal.

> which lock already depends on the new lock.
> 
> 
> the existing dependency chain (in reverse order) is:

munmap() dropping the last reference to its vm_file and
calling ->release() which causes a truncate of speculatively
allocated space to take place. IOWs, ->release() is called
with the mmap_sem held. Hmmm

Looking at it in terms of i_mutex, other filesystems hold
i_mutex over dio_get_page() (all those that use DIO_LOCKING)
so question is whether we are allowed to take the i_mutex
in ->release. I note that both reiserfs and hfsplus take 
i_mutex in ->release as well as use DIO_LOCKING, so this
problem is not isolated to XFS.

However, it would appear that mmap_sem -> i_mutex is illegal
according to the comment at the head of mm/filemap.c. While we are
not using i_mutex in this case, the inversion would seem to be
equivalent in nature.

There's not going to be a quick fix for this.

And the other one:

> Feb  7 02:01:50 kern: 
> ---
> Feb  7 02:01:50 kern: xfs_fsr/6313 is trying to acquire lock:
> Feb  7 02:01:50 kern:  (&(&ip->i_lock)->mr_lock/2){}, at: 
> [<803c1b22>] xfs_ilock+0x82/0xc0
> Feb  7 02:01:50 kern:
> Feb  7 02:01:50 kern: but task is already holding lock:
> Feb  7 02:01:50 kern:  (&(&ip->i_iolock)->mr_lock/3){--..}, at: 
> [<803c1b45>] xfs_ilock+0xa5/0xc0
> Feb  7 02:01:50 kern:
> Feb  7 02:01:50 kern: which lock already depends on the new lock.

Looks like yet another false positive. Basically we do this
in xfs_swap_extents:

inode A: i_iolock class 2
inode A: i_ilock class 2
inode B: i_iolock class 3
inode B: i_ilock class 3
.
inode A: unlock ilock
inode B: unlock ilock
.
>   inode A: ilock class 2
inode B: ilock class 3

And lockdep appears to be complaining about the relocking of inode A
as class 2 because we've got a class 3 iolock still held, hence
violating the order it saw initially.  There's no possible deadlock
here so we'll just have to add more hacks to the annotation code to make
lockdep happy.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: xfs [_fsr] probs in 2.6.24.0

2008-02-12 Thread David Chinner
On Mon, Feb 11, 2008 at 05:02:05PM -0800, Linda Walsh wrote:
 
 I'm getting similar errors on an x86-32  x86-64 kernel.  The x86-64 system
 (2nd log below w/date+times) was unusable this morning: one or more of the
 xfs file systems had gone off line due to some unknown error (upon reboot,
 no errors were indicated; all partitions on the same physical disk).
 
 I keep ending up with random failures on two systems.  The 32-bit sys more
 often than not just locks-up -- no messages, no keyboard response...etc.
 Nothing to do except reset -- and of course, nothing in the logs

Filesystem bugs rarely hang systems hard like that - more likely is
a hardware or driver problem. And neither of the lockdep reports
below are likely to be responsible for a system wide, no-response
hang.

[cut a bunch of speculation and stuff about hardware problems
causing XFS problems]

 Lemme know if I can provide any info in addressing these problems.  I don't
 know if they are related to the other system crashes, but usage patterns
 indicate it could be disk-activity related.  So eliminating known bugs 
 problems in that area might help (?)...

If your hardware or drivers are unstable, then XFS cannot be
expected to reliably work. Given that xfs_fsr apparently triggers
the hangs, I'd suggest putting lots of I/O load on your disk subsystem
by copying files around with direct I/O (just like xfs_fsr does) to
try to reproduce the problem.

Perhaps by running xfs_fsr manually you could reproduce the
problem while you are sitting in front of the machine...

Looking at the lockdep reports:

 The first set of errors the I have some diagnostics for, I could
 only find in dmesg:
 (ish-32bit machine)
 ===
 [ INFO: possible circular locking dependency detected ]
 2.6.24-ish #1
 ---
 xfs_fsr/2119 is trying to acquire lock:
 (mm-mmap_sem){}, at: [b01a3432] dio_get_page+0x62/0x160
 
 but task is already holding lock:
 ((ip-i_iolock)-mr_lock){}, at: [b02651db] xfs_ilock+0x5b/0xb0

dio_get_page() takes the mmap_sem of the processes
vma that has the pages we do I/O into. That's not new.
We're holding the xfs inode iolock at this point to protect
against truncate  and simultaneous buffered I/O races and
this is also unchanged.  i.e. this is normal.

 which lock already depends on the new lock.
 
 
 the existing dependency chain (in reverse order) is:

munmap() dropping the last reference to it's vm_file and
calling -release() which causes a truncate of speculatively
allocated space to take place. IOWs, -release() is called
with the mmap_sem held. Hmmm

Looking at it in terms of i_mutex, other filesystems hold
i_mutex over dio_get_page() (all those that use DIO_LOCKING)
so question is whether we are allowed to take the i_mutex
in -release. I note that both reiserfs and hfsplus take 
i_mutex in -release as well as use DIO_LOCKING, so this
problem is not isolated to XFS.

However, it would appear that mmap_sem - i_mutex is illegal
according to the comment at the head of mm/filemap.c. While we are
not using i_mutex in this case, the inversion would seem to be
equivalent in nature.

There's not going to be a quick fix for this.

And the other one:

 Feb  7 02:01:50 kern: 
 ---
 Feb  7 02:01:50 kern: xfs_fsr/6313 is trying to acquire lock:
 Feb  7 02:01:50 kern:  ((ip-i_lock)-mr_lock/2){}, at: 
 [803c1b22] xfs_ilock+0x82/0xc0
 Feb  7 02:01:50 kern:
 Feb  7 02:01:50 kern: but task is already holding lock:
 Feb  7 02:01:50 kern:  ((ip-i_iolock)-mr_lock/3){--..}, at: 
 [803c1b45] xfs_ilock+0xa5/0xc0
 Feb  7 02:01:50 kern:
 Feb  7 02:01:50 kern: which lock already depends on the new lock.

Looks like yet another false positive. Basically we do this
in xfs_swap_extents:

inode A: i_iolock class 2
inode A: i_ilock class 2
inode B: i_iolock class 3
inode B: i_ilock class 3
.
inode A: unlock ilock
inode B: unlock ilock
.
   inode A: ilock class 2
inode B: ilock class 3

And lockdep appears to be complaining about the relocking of inode A
as class 2 because we've got a class 3 iolock still held, hence
violating the order it saw initially.  There's no possible deadlock
here so we'll just have to add more hacks to the annotation code to make
lockdep happy.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: xfs [_fsr] probs in 2.6.24.0

2008-02-12 Thread David Chinner
On Tue, Feb 12, 2008 at 01:02:05PM -0800, Linda Walsh wrote:
 David Chinner wrote:
 Perhaps by running xfs_fsr manually you could reproduce the
 problem while you are sitting in front of the machine...
 
   Um...yeah, AND with multiple cp's of multi-gig files
 going on at same time, both local, by a sister machine via NFS,
 and and a 3rd machine tapping (not banging) away via CIFS.
 These were on top of normal server duties.  Whenever I stress
 it on *purpose* and watch it, works fine.   GRRRI HATE
  THAT!!!

I feel your pain.

   The file system(s) going offline due
 to xfs-detected filesystem errors has only happened *once* on
 asa, the 64-bit machine.  It's a fairly new machine w/o added
 hardware -- but this only happened in 2.24.0 when I added the
 tickless clock option, which sure seems like a remote possibility for
 causing an xfs error, but could be.

Well, tickless is new and shiny and I doubt anyone has done
much testing with XFS on tickless kernels. Still, if that's a new
config option you set, change it back to what you had for .23 on
that hardware and try again.

 Looking at it in terms of i_mutex, other filesystems hold
 i_mutex over dio_get_page() (all those that use DIO_LOCKING)
 so the question is whether we are allowed to take the i_mutex
 in ->release. I note that both reiserfs and hfsplus take
 i_mutex in ->release as well as use DIO_LOCKING, so this
 problem is not isolated to XFS.
 
 However, it would appear that mmap_sem -> i_mutex is illegal
 according to the comment at the head of mm/filemap.c. While we are
 not using i_mutex in this case, the inversion would seem to be
 equivalent in nature.
 
 There's not going to be a quick fix for this.
 
   What could be the consequences of this locking anomaly?

If you have a multithreaded application that mixes mmap and
direct I/O, and you have a simultaneous munmap() call and
read() to the same file, you might be able to deadlock access
to that file. However, you'd have to be certifiably insane
to write an application that did this (mix mmap and direct I/O
to the same file at the same time), so I think exposure is
pretty limited.
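
For reference, the kind of (insane) access pattern I'm talking about
looks something like the sketch below. It's a rough userspace
illustration of mixing mmap and direct I/O on the same file with a
racing munmap() - not a guaranteed deadlock reproducer - and the test
file path is made up; it assumes the file already exists and is at
least 1MB long:

/*
 * Thread 1 does O_DIRECT reads of a file while thread 2 repeatedly
 * mmap()s, dirties and munmap()s the same file. The munmap() can drop
 * the last reference to the vma's file and run ->release() with the
 * mmap_sem held, while the direct I/O path is busy faulting in user
 * pages. Error checking is omitted for brevity.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>

#define SZ	(1024 * 1024)

static const char *path = "/mnt/test/blah";	/* made-up test file */

static void *dio_reader(void *arg)
{
	void *buf;
	int fd = open(path, O_RDONLY | O_DIRECT);

	posix_memalign(&buf, 4096, SZ);
	for (;;)
		pread(fd, buf, SZ, 0);
	return NULL;
}

static void *mapper(void *arg)
{
	for (;;) {
		int fd = open(path, O_RDWR);
		char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);

		close(fd);	/* vma now holds the last file reference */
		p[0] = 1;	/* dirty the mapping */
		munmap(p, SZ);	/* ->release() runs under mmap_sem */
	}
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, dio_reader, NULL);
	pthread_create(&t2, NULL, mapper, NULL);
	pthread_join(t1, NULL);
	return 0;
}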

 I.e., for example, in NFS, I have enabled allow direct I/O on NFS
 files.

That's client side direct I/O, which is not what the server
does. Client side direct I/O results in synchronous buffered
I/O on the server, which will thrash your disks pretty hard.
The config option help does warn you about this. ;)

 And the other one:
 
 Feb  7 02:01:50 kern: 
 ---
 Feb  7 02:01:50 kern: xfs_fsr/6313 is trying to acquire lock:
  Feb  7 02:01:50 kern:  (&(&ip->i_lock)->mr_lock/2){}, at: 
 [803c1b22] xfs_ilock+0x82/0xc0
 Feb  7 02:01:50 kern:
 Feb  7 02:01:50 kern: but task is already holding lock:
  Feb  7 02:01:50 kern:  (&(&ip->i_iolock)->mr_lock/3){--..}, at: 
 [803c1b45] xfs_ilock+0xa5/0xc0
 Feb  7 02:01:50 kern:
 Feb  7 02:01:50 kern: which lock already depends on the new lock.
 
 Looks like yet another false positive. Basically we do this
 in xfs_swap_extents:
 
  inode A: i_iolock class 2
  inode A: i_ilock class 2
  inode B: i_iolock class 3
  inode B: i_ilock class 3
  .
  inode A: unlock ilock
  inode B: unlock ilock
  .
 inode A: ilock class 2
  inode B: ilock class 3
 
 And lockdep appears to be complaining about the relocking of inode A
 as class 2 because we've got a class 3 iolock still held, hence
 violating the order it saw initially.  There's no possible deadlock
 here so we'll just have to add more hacks to the annotation code to make
 lockdep happy.
 
   Is there a reason to unlock and relock the same inode while
 the level 3 lock is held -- i.e. does 'unlocking ilock' allow some
 increased 'throughput' for some other potential process to access
 the same inode? 

It prevents a single thread deadlock when doing transaction reservation.
i.e. the process of setting up a transaction can require the ilock
to be taken, and hence we have to drop it before and pick it back up
after the transaction reservation.

We hold on to the iolock to prevent the inode from having new I/O
started while we do the transaction reservation, so it's in the
same state after the reservation as it was before

We have to hold both locks to guarantee exclusive access to the
inode, so once we have the reservation we need to pick the ilocks
back up. The way we do it here does not violate lock ordering at all
(iolock before ilock on a single inode, and ascending inode number
order for multiple inodes), but lockdep is not smart enough to know
that. Hence we need more complex annotations to shut it up.
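
If it helps to see the shape of it, here's the pattern as a userspace
sketch, with pthread rwlocks standing in for the iolock/ilock pairs -
illustration only, not the real xfs_swap_extents() code:

#include <pthread.h>

/*
 * The "iolock" is held across the whole operation; the "ilock" pair is
 * dropped and retaken around the (potentially blocking) transaction
 * reservation, which may itself need an ilock internally. Ordering is
 * always iolock before ilock on an inode, and inode A before inode B.
 */
struct fake_inode {
	pthread_rwlock_t iolock;
	pthread_rwlock_t ilock;
};

static void transaction_reserve(void)
{
	/* stand-in for the reservation: may block, must not be called
	 * with either ilock held */
}

static void swap_extents(struct fake_inode *a, struct fake_inode *b)
{
	pthread_rwlock_wrlock(&a->iolock);	/* iolock A */
	pthread_rwlock_wrlock(&a->ilock);	/* ilock A  */
	pthread_rwlock_wrlock(&b->iolock);	/* iolock B */
	pthread_rwlock_wrlock(&b->ilock);	/* ilock B  */

	/* ... validate that the swap is legal ... */

	pthread_rwlock_unlock(&b->ilock);	/* drop only the ilocks */
	pthread_rwlock_unlock(&a->ilock);

	transaction_reserve();			/* iolocks still held, so no
						   new I/O can start on A or B */

	pthread_rwlock_wrlock(&a->ilock);	/* retake in the same order */
	pthread_rwlock_wrlock(&b->ilock);

	/* ... do the swap, commit, then unlock everything ... */
	pthread_rwlock_unlock(&b->ilock);
	pthread_rwlock_unlock(&a->ilock);
	pthread_rwlock_unlock(&b->iolock);
	pthread_rwlock_unlock(&a->iolock);
}

int main(void)
{
	struct fake_inode a, b;

	pthread_rwlock_init(&a.iolock, NULL);
	pthread_rwlock_init(&a.ilock, NULL);
	pthread_rwlock_init(&b.iolock, NULL);
	pthread_rwlock_init(&b.ilock, NULL);
	swap_extents(&a, &b);
	return 0;
}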

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] xfs: convert beX_add to beX_add_cpu (new common API)

2008-02-10 Thread David Chinner
On Sun, Feb 10, 2008 at 11:18:09AM +0100, Marcin Slusarz wrote:
> This patch was in Andrew's tree, but it was incomplete.
> Here is updated version.
>
> ---
> remove beX_add functions and replace all uses with beX_add_cpu
> 
> Signed-off-by: Marcin Slusarz <[EMAIL PROTECTED]>
> ---

Looks good. You can add a:

Reviewed-by: Dave Chinner <[EMAIL PROTECTED]>

To the patch.

Andrew - feel free to push this to Linus with the other 2 
patches in this series when you are ready.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: IO queuing and complete affinity with threads (was Re: [PATCH 0/8] IO queuing and complete affinity)

2008-02-10 Thread David Chinner
On Fri, Feb 08, 2008 at 08:59:55AM +0100, Jens Axboe wrote:
> > > > At least they reported it to be the most efficient scheme in their
> > > > testing, and Dave thought that migrating completions out to submitters
> > > > might be a bottleneck in some cases.
> > > 
> > > More so than migrating submitters to completers? The advantage of only
> > > moving submitters is that you get rid of the completion locking. Apart
> > > from that, the cost should be the same, especially for the thread based
> > > solution.
> > 
> > Not specifically for the block layer, but higher layers like xfs.
> 
> True, but that's parallel to the initial statement - that migrating
> completers is more costly than migrating submitters. So I'd like Dave to
> expand on why he thinks that migrating completers is more costly than
> submitters, APART from the locking associated with adding the request to
> a remote CPU list.

What I think Nick is referring to is the comments I made that at a
higher layer (e.g. filesystems) migrating completions to the
submitter CPU may be exactly the wrong thing to do. I don't recall
making any comments on migrating submitters - I think others have
already commented on that so I'll ignore that for the moment and
try to explain why completion on submitter CPU /may/ be bad.

For example, in the case of XFS it is fine for data I/O but it is
wrong for transaction I/O completion. We want to direct all
transaction completions to as few CPUs as possible (one, ideally) so
that all the completion processing happens on the same CPU, rather
than bouncing global cachelines and locks between all the CPUs
taking completion interrupts.

In more detail, the XFS transaction subsystem is asynchronous.
We submit the transaction I/O on the CPU that creates the
transaction so the I/O can come from any CPU in the system.  If we
then farm the completion processing out to the submission CPU, that
will push it all over the machine and guarantee that we bounce all
of the XFS transaction log structures and locks all over the machine
on completion as well as submission (right now it's lots of
submission CPUs, few completion CPUs).

An example of how bad this can get - this patch:

http://oss.sgi.com/archives/xfs/2007-11/msg00217.html

which prevents simultaneous access to the items tracked in the log
during transaction reservation. Having several hundred CPUs trying
to hit this list at once is really bad for performance - the test
app on the 2048p machine that saw this problem went from ~5500s
runtime down to 9s with the above patch.

I use this example because the transaction I/O completion touches
exactly the same list, locks and structures but is limited in
distribution (and therefore contention) by the number of
simultaneous I/O completion CPUs. Doing completion on the submitter
CPU will cause much wider distribution of completion processing and
introduce exactly the same issues as the transaction reservation
side had.

As it goes, with large, multi-device volumes (e.g. big stripe) we
already see issues with simultaneous completion processing (e.g. the
8p machine mentioned in the above link), so I'd really like to avoid
making these problems worse

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [stable] [patch 00/45] 2.6.24-stable review

2008-02-07 Thread David Chinner
On Thu, Feb 07, 2008 at 05:12:30PM -0800, Greg KH wrote:
> On Fri, Feb 08, 2008 at 11:44:30AM +1100, David Chinner wrote:
> > Greg,
> > 
> > Is there any reason why the XFS patch I sent to the stable list a
> > couple of days ago is not included in this series?
> > 
> > http://oss.sgi.com/archives/xfs/2008-02/msg00027.html
> > 
> > We've had multiple reports of it, and multiple confirmations that
> > the patch in the link above fixes the problem.
> 
> I didn't think it was in Linus's tree yet.  Is it?

Not yet - it's in the pipeline. I'll see if that can be sped
up (someone else usually takes care of the pushes to Linus).

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 00/45] 2.6.24-stable review

2008-02-07 Thread David Chinner
Greg,

Is there any reason why the XFS patch I sent to the stable list a
couple of days ago is not included in this series?

http://oss.sgi.com/archives/xfs/2008-02/msg00027.html

We've had multiple reports of it, and multiple confirmations that
the patch in the link above fixes the problem.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: remount-ro & umount & quota interaction

2008-02-07 Thread David Chinner
On Thu, Feb 07, 2008 at 03:10:18PM +0100, Jan Engelhardt wrote:
> 
> On Feb 7 2008 15:04, Jan Kara wrote:
> >On Thu 07-02-08 13:49:52, Michael Tokarev wrote:
> >> Jan Kara wrote:
> >> [deadlock after remount-ro followed with umount when
> >>  quota is enabled]
> >> 
> >> Hmm.  While that will prevent the lockup, maybe it's better to
> >> perform an equivalent of quotaoff on mount-ro instead? [...]
> >
> >  We couldn't leave quota on when filesystem is remounted ro because we
> >need to modify quotafile when quota is being turned off. We could turn off
> >quotas when remounting read-only. As we turn them off during umount, it
> >probably makes sense to turn them off on remount-ro as well.
> 
> Objection. XFS handles quotas differently that does not involve
> modifying a file on the fs, so quotas could stay on (even if it does
> not make much sense) while the fs is ro.

Wrong and right. XFS uses files for the backing store for quota
information, though it does handle quotas very differently to every
other Linux filesystem.

On remount-ro, we simply flush all the dirty dquots to their backing
store so the dquots can then be treated as ro just like every other object
in the filesystem. You don't need to turn off quotas to do this

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc] direct IO submission and completion scalability issues

2008-02-04 Thread David Chinner
On Mon, Feb 04, 2008 at 11:09:59AM +0100, Nick Piggin wrote:
> You get better behaviour in the slab and page allocators and locality
> and cache hotness of memory. For example, I guess in a filesystem /
> pagecache heavy workload, you have to touch each struct page, buffer head,
> fs private state, and also often have to wake the thread for completion.
> Much of this data has just been touched at submit time, so doing this on
> the same CPU is nice...

[]

> I'm surprised that the xfs global state bouncing would outweigh the
> bouncing of all the per-page/block/bio/request/etc data that gets touched
> during completion. We'll see.

per-page/block/bio/request/etc is local to a single I/O. The only
penalty is a cacheline bounce for each of the structures from one
CPU to another.  That is, there is no global state modified by these
completions.

The real issue is metadata. The transaction log I/O completion
funnels through a state machine protected by a single lock, which
means completions on different CPUs pulls that lock to all
completion CPUs. Given that the same lock is used during transaction
completion for other state transitions (in task context, not intr),
the more CPUs touching it at once, the worse the problem gets.

Then there's metadata I/O completion, which funnels through a larger
set of global locks in the transaction subsystem (e.g. the active
item list lock, the log reservation locks, the log state lock, etc)
which once again means the more CPUs we have delivering I/O
completions, the worse the problem gets.
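
For illustration, the structure we want for that kind of completion is
roughly the following: get out of interrupt context as quickly as
possible and funnel the heavyweight work through a single thread, so
the global log state is only ever touched from one CPU. This is a
sketch with made-up names, not the actual XFS code:

#include <linux/workqueue.h>
#include <linux/bio.h>
#include <linux/slab.h>
#include <linux/kernel.h>
#include <linux/errno.h>

/* one single-threaded workqueue => log completion processing (and hence
 * the log state machine lock) only ever runs on one CPU at a time,
 * regardless of which CPU took the completion interrupt */
static struct workqueue_struct *log_done_wq;

struct log_io {				/* allocated by the submitter */
	struct work_struct	work;
	struct bio		*bio;
	int			error;
};

/* hypothetical - stands in for the real log state machine processing */
void process_log_io_completion(struct bio *bio, int error);

static void log_io_done_worker(struct work_struct *work)
{
	struct log_io *lio = container_of(work, struct log_io, work);

	/* process context, single thread: safe to bang on global log state */
	process_log_io_completion(lio->bio, lio->error);
	kfree(lio);
}

static void log_bio_end_io(struct bio *bio, int error)
{
	struct log_io *lio = bio->bi_private;

	/* interrupt context: don't touch global state, just hand off */
	lio->bio = bio;
	lio->error = error;
	INIT_WORK(&lio->work, log_io_done_worker);
	queue_work(log_done_wq, &lio->work);
}

static int log_io_setup(void)
{
	log_done_wq = create_singlethread_workqueue("log-io-done");
	return log_done_wq ? 0 : -ENOMEM;
}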

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc] direct IO submission and completion scalability issues

2008-02-03 Thread David Chinner
On Sun, Feb 03, 2008 at 08:14:45PM -0800, Arjan van de Ven wrote:
> David Chinner wrote:
> >Hi Nick,
> >
> >When Matthew was describing this work at an LCA presentation (not
> >sure whether you were at that presentation or not), Zach came up
> >with the idea that allowing the submitting application control the
> >CPU that the io completion processing was occurring would be a good
> >approach to try.  That is, we submit a "completion cookie" with the
> >bio that indicates where we want completion to run, rather than
> >dictating that completion runs on the submission CPU.
> >
> >The reasoning is that only the higher level context really knows
> >what is optimal, and that changes from application to application.
> 
> well.. kinda. One of the really hard parts of the submit/completion stuff 
> is that
> the slab/slob/slub/slib allocator ends up basically "cycling" memory 
> through the system;
> there's a sink of free memory on all the submission cpus and a source of 
> free memory
> on the completion cpu. I don't think applications are capable of working 
> out what is
> best in this scenario..

Applications as in "anything that calls submit_bio()". i.e., direct I/O,
filesystems, etc. i.e. not userspace but in-kernel applications.

In XFS, simultaneous io completion on multiple CPUs can contribute greatly to
contention of global structures in XFS. By controlling where completions are
delivered, we can greatly reduce this contention, especially on large,
multipathed devices that deliver interrupts to multiple CPUs that may be far
distant from each other.  We have all the state and intelligence necessary
to control this sort of policy decision effectively.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [rfc] direct IO submission and completion scalability issues

2008-02-03 Thread David Chinner
On Sun, Feb 03, 2008 at 10:52:52AM +0100, Nick Piggin wrote:
> On Fri, Jul 27, 2007 at 06:21:28PM -0700, Suresh B wrote:
> > 
> > Second experiment which we did was migrating the IO submission to the
> > IO completion cpu. Instead of submitting the IO on the same cpu where the
> > request arrived, in this experiment  the IO submission gets migrated to the
> > cpu that is processing IO completions(interrupt). This will minimize the
> > access to remote cachelines (that happens in timers, slab, scsi layers). The
> > IO submission request is forwarded to the kblockd thread on the cpu 
> > receiving
> > the interrupts. As part of this, we also made kblockd thread on each cpu as 
> > the
> > highest priority thread, so that IO gets submitted as soon as possible on 
> > the
> > interrupt cpu with out any delay. On x86_64 SMP platform with 16 cores, this
> > resulted in 2% performance improvement and 3.3% improvement on two node ia64
> > platform.
> > 
> > Quick and dirty prototype patch(not meant for inclusion) for this io 
> > migration
> > experiment is appended to this e-mail.
> > 
> > Observation #1 mentioned above is also applicable to this experiment. CPU's
> > processing interrupts will now have to cater IO submission/processing
> > load aswell.
> > 
> > Observation #2: This introduces some migration overhead during IO 
> > submission.
> > With the current prototype, every incoming IO request results in an IPI and
> > context switch(to kblockd thread) on the interrupt processing cpu.
> > This issue needs to be addressed and main challenge to address is
> > the efficient mechanism of doing this IO migration(how much batching to do 
> > and
> > when to send the migrate request?), so that we don't delay the IO much and 
> > at
> > the same point, don't cause much overhead during migration.
> 
> Hi guys,
> 
> Just had another way we might do this. Migrate the completions out to
> the submitting CPUs rather than migrate submission into the completing
> CPU.

Hi Nick,

When Matthew was describing this work at an LCA presentation (not
sure whether you were at that presentation or not), Zach came up
with the idea that allowing the submitting application to control the
CPU that the io completion processing occurs on would be a good
approach to try.  That is, we submit a "completion cookie" with the
bio that indicates where we want completion to run, rather than
dictating that completion runs on the submission CPU.

The reasoning is that only the higher level context really knows
what is optimal, and that changes from application to application.
The "complete on the submission CPU" policy _may_ be more optimal
for database workloads, but it is definitely suboptimal for XFS and
transaction I/O completion handling because it simply drags a bunch
of global filesystem state around between all the CPUs running
completions. In that case, we really only want a single CPU to be
handling the completions.

(Zach - please correct me if I've missed anything)

Looking at your patch - if you turn it around so that the
"submission CPU" field can be specified as the "completion cpu" then
I think the patch will expose the policy knobs needed to do the
above. Add the bio -> rq linkage to enable filesystems and DIO to
control the completion CPU field and we're almost done ;)
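
Something like the following is the shape I have in mind. The cookie
structure and the wrapper are entirely hypothetical - nothing here
exists in the kernel - it's just to show where the policy knob would
sit (submit_bio() is used with its current two-argument form):

#include <linux/bio.h>
#include <linux/fs.h>

/*
 * Hypothetical "completion cookie" attached to a bio by the submitter.
 * The block layer would use it to steer where the completion work runs
 * (the steering mechanism is deliberately not shown here).
 */
struct completion_cookie {
	int	completion_cpu;		/* -1 == complete wherever the IRQ lands */
	void	(*complete)(struct bio *bio, int error, void *data);
	void	*data;
};

static void submit_bio_with_cookie(int rw, struct bio *bio,
				   struct completion_cookie *cookie)
{
	bio->bi_private = cookie;
	submit_bio(rw, bio);
}

/*
 * Policy then lives entirely with the higher layers, e.g.:
 *   - XFS transaction/log I/O: completion_cpu = one designated CPU per
 *     filesystem, so all log completions stay on that CPU;
 *   - data I/O or database-style direct I/O: completion_cpu = the CPU
 *     we're submitting from, i.e. complete on the submitter.
 */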

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: XFS oops in vanilla 2.6.24

2008-02-03 Thread David Chinner
On Thu, Jan 31, 2008 at 12:23:11PM +0100, Peter Zijlstra wrote:
> Lets CC the XFS maintainer..

Adding the xfs list and hch.

It might be a couple of days before I get to this - I've got a
week of backlog to catch up on after LCA

> On Wed, 2008-01-30 at 20:23 +, Sven Geggus wrote:
> > Hi there,
> > 
> > I get the following with 2.6.24:
> > 
> > Ending clean XFS mount for filesystem: dm-0
> > BUG: unable to handle kernel paging request at virtual address f2134000

How long after mount does this happen? Does it happen when listing a specific
directory? i.e. do you have a reproducible test case for it?

Cheers,

Dave.

> > printing eip: c021a13a *pde = 010b5067 *pte = 32134000 
> > Oops:  [#1] PREEMPT DEBUG_PAGEALLOC
> > Modules linked in: radeon drm rfcomm l2cap sym53c8xx scsi_transport_spi 
> > snd_via82xx 8139too snd_mpu401_uart snd_ens1371 snd_rawmidi snd_ac97_codec 
> > ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc via_agp 
> > agpgart
> > 
> > Pid: 3889, comm: bash Not tainted (2.6.24 #3)
> > EIP: 0060:[<c021a13a>] EFLAGS: 00010282 CPU: 0
> > EIP is at xfs_file_readdir+0xfa/0x18c
> > EAX:  EBX: 02f5 ECX: 0020 EDX: 
> > ESI:  EDI: f2133ff8 EBP: f227ff68 ESP: f227ff10
> >  DS: 007b ES: 007b FS:  GS: 0033 SS: 0068
> > Process bash (pid: 3889, ti=f227e000 task=f7205a80 task.ti=f227e000)
> > Stack: 02f5  2c000137   c0165358 f227ff94 
> > f221c810 
> >f2d85e48    02f5  f2133000 
> > 1000 
> >0ff8 02f9  c0421c80 f221c810 f1cdbe48 f227ff88 
> > c0165543 
> > Call Trace:
> >  [<c0104da3>] show_trace_log_lvl+0x1a/0x2f
> >  [<c0104e53>] show_stack_log_lvl+0x9b/0xa3
> >  [<c0104efb>] show_registers+0xa0/0x1e2
> >  [<c010514c>] die+0x10f/0x1dd
> >  [<c0113555>] do_page_fault+0x43a/0x519
> >  [<c040fe52>] error_code+0x6a/0x70
> >  [<c0165543>] vfs_readdir+0x5d/0x89
> >  [<c01655cd>] sys_getdents64+0x5e/0xa0
> >  [<c0103f06>] syscall_call+0x7/0xb
> >  ===
> > Code: 89 74 24 04 81 e3 ff ff ff 7f 89 1c 24 ff 55 bc 85 c0 0f 85 82 00 00 
> > 00 8b 4f 10 31 d2 83 c1 1f 83 e1 f8 29 4d d0 19 55 d4 01 cf <8b> 57 08 8b 
> > 4f 0c 89 55 d8 89 4d dc 83 7d d4 00 7f a1 7c 06 83 
> > EIP: [<c021a13a>] xfs_file_readdir+0xfa/0x18c SS:ESP 0068:f227ff10
> > ---[ end trace e518e1370efb695e ]---
> > 
> > Sven
> > 

-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] ext3 freeze feature

2008-01-25 Thread David Chinner
On Sat, Jan 26, 2008 at 04:35:26PM +1100, David Chinner wrote:
> On Fri, Jan 25, 2008 at 07:59:38PM +0900, Takashi Sato wrote:
> > The points of the implementation are followings.
> > - Add calls of the freeze function (freeze_bdev) and
> >   the unfreeze function (thaw_bdev) in ext3_ioctl().
> > 
> > - ext3_freeze_timeout() which calls the unfreeze function (thaw_bdev)
> >   is registered to the delayed work queue to unfreeze the filesystem
> >   automatically after the lapse of the specified time.
> 
> Seems like pointless complexity to me - what happens if a
> timeout occurs while the filesystem is still freezing?
> 
> It's not uncommon for a freeze to take minutes if memory
> is full of dirty data that needs to be flushed out, esp. if
> dm-snap is doing COWs for every write issued

Sorry, ignore this bit - I just realised the timer is set
up after the freeze has occurred

Still, that makes it potentially dangerous to whatever is being
done while the filesystem is frozen

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] ext3 freeze feature

2008-01-25 Thread David Chinner
On Fri, Jan 25, 2008 at 07:59:38PM +0900, Takashi Sato wrote:
> The points of the implementation are followings.
> - Add calls of the freeze function (freeze_bdev) and
>   the unfreeze function (thaw_bdev) in ext3_ioctl().
> 
> - ext3_freeze_timeout() which calls the unfreeze function (thaw_bdev)
>   is registered to the delayed work queue to unfreeze the filesystem
>   automatically after the lapse of the specified time.

Seems like pointless complexity to me - what happens if a
timeout occurs while the filesystem is still freezing?

It's not uncommon for a freeze to take minutes if memory
is full of dirty data that needs to be flushed out, esp. if
dm-snap is doing COWs for every write issued

> + case EXT3_IOC_FREEZE: {

> + if (inode->i_sb->s_frozen != SB_UNFROZEN)
> + return -EINVAL;

> + freeze_bdev(inode->i_sb->s_bdev);

> + case EXT3_IOC_THAW: {
> + if (!capable(CAP_SYS_ADMIN))
> + return -EPERM;
> + if (inode->i_sb->s_frozen == SB_UNFROZEN)
> + return -EINVAL;
.
> + /* Unfreeze */
> + thaw_bdev(inode->i_sb->s_bdev, inode->i_sb);

That's inherently unsafe - you can have multiple unfreezes
running in parallel which seriously screws with the bdev semaphore
count that is used to lock the device due to doing multiple up()s
for every down.

Your timeout thingy guarantees that at some point you will get
multiple up()s occurring due to the timer firing racing with
a thaw ioctl.

If this interface is to be more widely exported, then it needs
a complete revamp of the way the bdev is locked while it is frozen so
that there is no chance of a double up() ever occurring on the
bd_mount_sem due to racing thaws.
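
The failure mode is just semaphore arithmetic. A trivial userspace
illustration (sem_t standing in for bd_mount_sem) of why an unpaired
extra up() is fatal for a mutual-exclusion semaphore:

#include <semaphore.h>
#include <stdio.h>

int main(void)
{
	sem_t mount_sem;
	int val;

	sem_init(&mount_sem, 0, 1);	/* mutex-style semaphore, like bd_mount_sem */

	sem_wait(&mount_sem);		/* freeze: 1 -> 0, device now locked */

	sem_post(&mount_sem);		/* thaw ioctl: 0 -> 1 */
	sem_post(&mount_sem);		/* racing timer-driven thaw: 1 -> 2 */

	sem_getvalue(&mount_sem, &val);
	printf("count is now %d: two subsequent freezes can both succeed\n", val);
	return 0;
}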

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] ext3 freeze feature

2008-01-25 Thread David Chinner
On Fri, Jan 25, 2008 at 09:42:30PM +0900, Takashi Sato wrote:
> >I am also wondering whether we should have system call(s) for these:
> >
> >On Jan 25, 2008 12:59 PM, Takashi Sato <[EMAIL PROTECTED]> wrote:
> >>+   case EXT3_IOC_FREEZE: {
> >
> >>+   case EXT3_IOC_THAW: {
> >
> >And just convert XFS to use them too?
> 
> I think it is reasonable to implement it as the generic system call, as you
> said.  Does XFS folks think so?

Sure.

Note that we can't immediately remove the XFS ioctls otherwise
we'd break userspace utilities that use them

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 01/26] mount options: add documentation

2008-01-25 Thread David Chinner

> In message <[EMAIL PROTECTED]>, Miklos Szeredi writes:
> > From: Miklos Szeredi <[EMAIL PROTECTED]>
> > 
> > This series addresses the problem of showing mount options in
> > /proc/mounts.
[...]
> > The following filesystems still need fixing: CIFS, NFS, XFS, Unionfs,
> > Reiser4.  For CIFS, NFS and XFS I wasn't able to understand how some
> > of the options are used.  The last two are not yet in mainline, so I
> > leave fixing those to their respective maintainers out of pure
> > laziness.

XFS has already been updated. The fix is in the XFS git tree that
Andrew picks up for -mm releases and will be merged in the 2.6.25-rc1
window. Commit is here:

http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs-2.6.git;a=commit;h=8c33fb6ca99aa17373bd3d5a507ac0eaefb7abb4

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: do_remount_sb(RDONLY) race? (was: XFS oops under 2.6.23.9)

2008-01-22 Thread David Chinner
On Wed, Jan 23, 2008 at 04:24:33PM +1030, Jonathan Woithe wrote:
> > On Wed, Jan 23, 2008 at 03:00:48PM +1030, Jonathan Woithe wrote:
> > > Last night my laptop suffered an oops during closedown.  The full oops
> > > reports can be downloaded from
> > > 
> > >   http://www.atrad.com.au/~jwoithe/xfs_oops/
> > 
> > Assertion failed: atomic_read(&mp->m_active_trans) == 0, file:
> > fs/xfs/xfs_vfsops.c, line 689.
> > 
> > The remount read-only of the root drive supposedly completed
> > while there was still active modification of the filesystem
> > taking place.
.
> > The read only flag only gets set *after* we've made the filesystem
> > readonly, which means before we are truly read only, we can race
> > with other threads opening files read/write or filesystem
> > modifications can take place.
> > 
> > The result of that race (if it is really unsafe) will be the assert you
> > see. The patch I wrote a couple of months ago to fix the problem
> > is attached below
> 
> Thanks for the patch.  I will apply it and see what happens.
> 
> Will this be in 2.6.24?

No - because hitting the problem is so rare that I'm not even
sure it's a problem. One of the VFS gurus will need to comment
on whether this really is a problem, and if so the correct fix
belongs in do_remount_sb() so that it closes the hole for everyone.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


do_remount_sb(RDONLY) race? (was: XFS oops under 2.6.23.9)

2008-01-22 Thread David Chinner
On Wed, Jan 23, 2008 at 03:00:48PM +1030, Jonathan Woithe wrote:
> Last night my laptop suffered an oops during closedown.  The full oops
> reports can be downloaded from
> 
>   http://www.atrad.com.au/~jwoithe/xfs_oops/

Assertion failed: atomic_read(&mp->m_active_trans) == 0, file:
fs/xfs/xfs_vfsops.c, line 689.

The remount read-only of the root drive supposedly completed
while there was still active modification of the filesystem
taking place.

> Kernel version was kernel.org 2.6.23.9 compiled as a low latency desktop. 

The patch in 2.6.23 that introduced this check was:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=516b2e7c2661615ba5d5ad9fb584f068363502d3

Basically, the remount-readonly path was not flushing things
properly, so we changed it to flush things properly and to ensure we
get bug reports if it doesn't. Yours is the second report of not
shutting down correctly since this change went in (we've seen it
once in ~8 months in a QA environment).

I've had suspicions of a race in the remount-ro code in
do_remount_sb() w.r.t to the fs_may_remount_ro() check.  That is, we
do an unlocked check to see if we can remount readonly and then fail
to check again once we've locked the superblock out and start the
remount.

The read only flag only gets set *after* we've made the filesystem
readonly, which means before we are truly read only, we can race
with other threads opening files read/write or filesystem
modifications can take place.

The result of that race (if it is really unsafe) will be the assert you
see. The patch I wrote a couple of months ago to fix the problem
is attached below

Cheers,

Dave.

---

Set the MS_RDONLY before we check to see if we can remount
read only so that we close a race between checking remount
is ok and setting the superblock flag that allows other
processes to start modifying the filesystem while it is
being remounted.

Signed-off-by: Dave Chinner <[EMAIL PROTECTED]>
---
 fs/xfs/linux-2.6/xfs_super.c |   16 
 1 file changed, 16 insertions(+)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_super.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_super.c	2008-01-22 14:57:07.753782292 +1100
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_super.c	2008-01-23 16:22:16.940279351 +1100
@@ -1222,6 +1222,22 @@ xfs_fs_remount(
struct xfs_mount_args   *args = xfs_args_allocate(sb, 0);
int error;
 
+   /*
+* We need to have the MS_RDONLY flag set on the filesystem before we
+* try to quiesce it down to a sane state. If we don't set the
+* MS_RDONLY before we check the fs_may_remount_ro(sb) state, we have a
+* race where write operations can start after we've checked it is OK
+* to remount read only. This results in assert failures due to being
+* unable to quiesce the transaction subsystem correctly.
+*/
+   if (!(sb->s_flags & MS_RDONLY) && (*flags & MS_RDONLY)) {
+   sb->s_flags |= MS_RDONLY;
+   if (!fs_may_remount_ro(sb)) {
+   sb->s_flags &= ~MS_RDONLY;
+   return -EBUSY;
+   }
+   }
+
error = xfs_parseargs(mp, options, args, 1);
if (!error)
error = xfs_mntupdate(mp, flags, args);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/6] IO context sharing

2008-01-22 Thread David Chinner
On Tue, Jan 22, 2008 at 10:49:15AM +0100, Jens Axboe wrote:
> Hi,
> 
> Today io contexts are per-process and define the (surprise) io context
> of that process. In some situations it would be handy if several
> processes share an IO context.

I think that the nfsd threads should probably share as
well. It should probably provide an io context per thread
pool

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] Parallelize IO for e2fsck

2008-01-22 Thread David Chinner
On Tue, Jan 22, 2008 at 12:05:11AM -0700, Andreas Dilger wrote:
> On Jan 22, 2008  14:38 +1100, David Chinner wrote:
> > On Mon, Jan 21, 2008 at 04:00:41PM -0700, Andreas Dilger wrote:
> > > I discussed this with Ted at one point also.  This is a generic problem,
> > > not just for readahead, because "fsck" can run multiple e2fsck in parallel
> > > and in case of many large filesystems on a single node this can cause
> > > memory usage problems also.
> > > 
> > > What I was proposing is that "fsck.{fstype}" be modified to return an
> > > estimated minimum amount of memory needed, and some "desired" amount of
> > > memory (i.e. readahead) to fsck the filesystem, using some parameter like
> > > "fsck.{fstype} --report-memory-needed /dev/XXX".  If this does not
> > > return the output in the expected format, or returns an error then fsck
> > > will assume some amount of memory based on the device size and continue
> > > as it does today.
> > 
> > And while fsck is running, some other program runs that uses
> > memory and blows your carefully calculated parameters to smithereens?
> 
> Well, fsck has a rather restricted working environment, because it is
> run before most other processes start (i.e. single-user mode).  For fsck
> initiated by an admin in other runlevels the admin would need to specify
> the upper limit of memory usage.  My proposal was only for the single-user
> fsck at boot time.

The simple case. ;)

Because XFS has shutdown features, it's not uncommon to hear about
people running xfs_repair on an otherwise live system. e.g. XFS
detects a corrupted block, shuts down the filesystem, the admin
unmounts it, runs xfs_repair, puts it back online. meanwhile, all
the other filesystems and users continue unaffected. In this use
case, getting feedback about memory usage is, IMO, very worthwhile.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


do_remount_sb(RDONLY) race? (was: XFS oops under 2.6.23.9)

2008-01-22 Thread David Chinner
On Wed, Jan 23, 2008 at 03:00:48PM +1030, Jonathan Woithe wrote:
> Last night my laptop suffered an oops during closedown.  The full oops
> reports can be downloaded from
> 
>   http://www.atrad.com.au/~jwoithe/xfs_oops/

Assertion failed: atomic_read(&mp->m_active_trans) == 0, file:
fs/xfs/xfs_vfsops.c, line 689.

The remount read-only of the root drive supposedly completed
while there was still active modification of the filesystem
taking place.

> Kernel version was kernel.org 2.6.23.9 compiled as a low latency desktop.

The patch in 2.6.23 that introduced this check was:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=516b2e7c2661615ba5d5ad9fb584f068363502d3

Basically, the remount-readonly path was not flushing things
properly, so we changed it to flush things properly and to ensure we
got bug reports if it wasn't. Yours is the second report of not
shutting down correctly since this change went in (we've seen it
once in ~8 months in a QA environment).

I've had suspicions of a race in the remount-ro code in
do_remount_sb() w.r.t. the fs_may_remount_ro() check.  That is, we
do an unlocked check to see if we can remount readonly and then fail
to check again once we've locked the superblock out and start the
remount.

The read only flag only gets set *after* we've made the filesystem
readonly, which means that before we are truly read only we can race
with other threads opening files read/write, or filesystem
modifications can take place.

The result of that race (if it is really unsafe) will be the assert you
see. The patch I wrote a couple of months ago to fix the problem
is attached below.

Cheers,

Dave.

---

Set the MS_RDONLY flag before we check to see if we can remount
read only, so that we close the race between checking that the
remount is ok and setting the superblock flag, a window in which
other processes can start modifying the filesystem while it is
being remounted.

Signed-off-by: Dave Chinner <[EMAIL PROTECTED]>
---
 fs/xfs/linux-2.6/xfs_super.c |   16 
 1 file changed, 16 insertions(+)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_super.c
===
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_super.c 2008-01-22 
14:57:07.753782292 +1100
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_super.c  2008-01-23 16:22:16.940279351 
+1100
@@ -1222,6 +1222,22 @@ xfs_fs_remount(
struct xfs_mount_args   *args = xfs_args_allocate(sb, 0);
int error;
 
+   /*
+* We need to have the MS_RDONLY flag set on the filesystem before we
+* try to quiesce it down to a sane state. If we don't set the
+* MS_RDONLY before we check the fs_may_remount_ro(sb) state, we have a
+* race where write operations can start after we've checked it is OK
+* to remount read only. This results in assert failures due to being
+* unable to quiesce the transaction subsystem correctly.
+*/
+   if (!(sb->s_flags & MS_RDONLY) && (*flags & MS_RDONLY)) {
+   sb->s_flags |= MS_RDONLY;
+   if (!fs_may_remount_ro(sb)) {
+   sb->s_flags &= ~MS_RDONLY;
+   return -EBUSY;
+   }
+   }
+
error = xfs_parseargs(mp, options, args, 1);
if (!error)
error = xfs_mntupdate(mp, flags, args);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: do_remount_sb(RDONLY) race? (was: XFS oops under 2.6.23.9)

2008-01-22 Thread David Chinner
On Wed, Jan 23, 2008 at 04:24:33PM +1030, Jonathan Woithe wrote:
> > On Wed, Jan 23, 2008 at 03:00:48PM +1030, Jonathan Woithe wrote:
> > > Last night my laptop suffered an oops during closedown.  The full oops
> > > reports can be downloaded from
> > > 
> > >   http://www.atrad.com.au/~jwoithe/xfs_oops/
> > 
> > Assertion failed: atomic_read(&mp->m_active_trans) == 0, file:
> > fs/xfs/xfs_vfsops.c, line 689.
> > 
> > The remount read-only of the root drive supposedly completed
> > while there was still active modification of the filesystem
> > taking place.
.
> > The read only flag only gets set *after* we've made the filesystem
> > readonly, which means that before we are truly read only we can race
> > with other threads opening files read/write, or filesystem
> > modifications can take place.
> > 
> > The result of that race (if it is really unsafe) will be the assert you
> > see. The patch I wrote a couple of months ago to fix the problem
> > is attached below.
> 
> Thanks for the patch.  I will apply it and see what happens.
> 
> Will this be in 2.6.24?

No - because hitting the problem is so rare that I'm not even
sure it's a problem. One of the VFS gurus will need to comment
on whether this really is a problem, and if so the correct fix
is to do_remount_sb() so that it closes the hole for everyone.
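
A sketch of what that generic fix might look like, as a helper that
do_remount_sb() could call before invoking ->remount_fs() (the helper name
is made up and this is untested):

/* sketch only: a helper do_remount_sb() could use before ->remount_fs() */
static int sb_begin_ro_remount(struct super_block *sb, int flags, int force)
{
	if (!(flags & MS_RDONLY) || (sb->s_flags & MS_RDONLY))
		return 0;		/* not a rw -> ro transition */

	/*
	 * Mark the superblock read only *before* checking for writers, so
	 * nothing can open a file for write or start a new modification
	 * between the fs_may_remount_ro() check and the filesystem
	 * quiescing itself in ->remount_fs().
	 */
	sb->s_flags |= MS_RDONLY;
	if (!force && !fs_may_remount_ro(sb)) {
		sb->s_flags &= ~MS_RDONLY;
		return -EBUSY;
	}
	return 0;
}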

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] Parallelize IO for e2fsck

2008-01-21 Thread David Chinner
On Mon, Jan 21, 2008 at 04:00:41PM -0700, Andreas Dilger wrote:
> On Jan 16, 2008  13:30 -0800, Valerie Henson wrote:
> > I have a partial solution that sort of blindly manages the buffer
> > cache.  First, the user passes e2fsck a parameter saying how much
> > memory is available as buffer cache.  The readahead thread reads
> > things in and immediately throws them away so they are only in buffer
> > cache (no double-caching).  Then readahead and e2fsck work together so
> > that readahead only reads in new blocks when the main thread is done
> > with earlier blocks.  The already-used blocks get kicked out of buffer
> > cache to make room for the new ones.
> >
> > What would be nice is to take into account the current total memory
> > usage of the whole fsck process and factor that in.  I don't think it
> > would be hard to add to the existing cache management framework.
> > Thoughts?
> 
> I discussed this with Ted at one point also.  This is a generic problem,
> not just for readahead, because "fsck" can run multiple e2fsck in parallel
> and in case of many large filesystems on a single node this can cause
> memory usage problems also.
> 
> What I was proposing is that "fsck.{fstype}" be modified to return an
> estimated minimum amount of memory needed, and some "desired" amount of
> memory (i.e. readahead) to fsck the filesystem, using some parameter like
> "fsck.{fstype} --report-memory-needed /dev/XXX".  If this does not
> return the output in the expected format, or returns an error then fsck
> will assume some amount of memory based on the device size and continue
> as it does today.

And while fsck is running, some other program runs that uses
memory and blows your carefully calculated parameters to smithereens?

I think there is a clear need for applications to be able to
register a callback from the kernel to indicate that the machine as
a whole is running out of memory and that the application should
trim its caches to reduce memory utilisation.

Perhaps instead of swapping immediately, a SIGLOWMEM could be sent
to processes that aren't masking the signal, followed by a short
grace period to allow the processes to free up some memory before
swapping out pages from that process?

With this sort of feedback, the fsck process can scale back its
readahead and remove cached info that is not critical to what it
is currently doing and thereby prevent readahead thrashing as
memory usage of the fsck process itself grows.

Another example where this could be useful is to tell browsers to
release some of their cache rather than having the VM swap it out.

IMO, a scheme like this will be far more reliable than trying to
guess what the optimal settings are going to be over the whole
lifetime of a process.
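
To make that concrete, the application side of such a scheme might look
like the sketch below (SIGLOWMEM doesn't exist, so a real-time signal
stands in for it here, and the cache trimming is just a placeholder):

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* stand-in for the proposed SIGLOWMEM; it does not exist today */
#define SIGLOWMEM (SIGRTMIN + 1)

static volatile sig_atomic_t low_memory;

static void lowmem_handler(int sig)
{
	(void)sig;
	low_memory = 1;		/* flag it; do the real work outside the handler */
}

int main(void)
{
	struct sigaction sa = { 0 };

	sa.sa_handler = lowmem_handler;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGLOWMEM, &sa, NULL);

	for (;;) {
		if (low_memory) {
			low_memory = 0;
			/* e.g. shrink the readahead window, drop cached
			 * metadata that can be re-read later */
			fprintf(stderr, "low memory: trimming caches\n");
		}
		/* placeholder for the real fsck/readahead work */
		pause();
	}
}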

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-rc8: possible circular locking dependency detected

2008-01-20 Thread David Chinner
On Fri, Jan 18, 2008 at 10:45:17PM +0100, Christian Kujau wrote:
> Hi,
> 
> just FYI, upgrading to -rc8 gave the following messages in kern.log in
> the morning hours, when the backups were run:
> 
> ===
> [ INFO: possible circular locking dependency detected ]
> 2.6.24-rc8 #2
> ---
> rsync/23295 is trying to acquire lock:
>  (iprune_mutex){--..}, at: [<c017a552>] shrink_icache_memory+0x72/0x220
> 
> but task is already holding lock:
>  (&(&ip->i_iolock)->mr_lock){}, at: [<c0275056>] xfs_ilock+0x96/0xb0
> 
> which lock already depends on the new lock.

memory reclaim can occur when an inode lock is held,
causing i_iolock -> iprune_mutex to occur. This is quite
common.

During reclaim, while holding iprune_mutex, we lock a
different inode to complete the cleaning up of it,
resulting in iprune_mutex -> i_iolock.

At this point, lockdep gets upset and blats out a warning.

But, there's no problem here as it is always safe for us
to take the i_iolock in inode reclaim because it can never
be the same as the i_iolock that we've taken prior to memory
reclaim being entered. Therefore false positive.

Lockdep folk - we really need an annotation to prevent this false
positive from being reported because we are getting reports at
least once a week.
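
The sort of annotation being asked for is a lockdep nesting subclass on
the reclaim-side acquisition, something along the lines of the fragment
below (illustrative only, not the actual fix; the function and subclass
names are made up):

/* illustrative fragment only */
#define XFS_IOLOCK_RECLAIM	1	/* made-up lockdep subclass */

static void xfs_reclaim_ilock_io(xfs_inode_t *ip)
{
	/*
	 * Called under iprune_mutex during inode reclaim.  The inode being
	 * reclaimed can never be the inode whose iolock was already held
	 * when memory reclaim was entered, so tell lockdep this is a
	 * different nesting level and the reported i_iolock -> iprune_mutex
	 * vs iprune_mutex -> i_iolock "cycle" is not a real inversion.
	 */
	down_write_nested(&ip->i_iolock.mr_lock, XFS_IOLOCK_RECLAIM);
}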

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] Converting writeback linked lists to a tree based data structure

2008-01-18 Thread David Chinner
On Fri, Jan 18, 2008 at 01:41:33PM +0800, Fengguang Wu wrote:
> > That is, think of large file writes like process scheduler batch
> > jobs - bulk throughput is what matters, so the larger the time slice
> > you give them the higher the throughput.
> > 
> > IMO, the sort of result we should be looking at is a
> > writeback design that results in cycling somewhat like:
> > 
> > slice 1: iterate over small files
> > slice 2: flush large file 1
> > slice 3: iterate over small files
> > slice 4: flush large file 2
> > ..
> > slice n-1: flush large file N
> > slice n: iterate over small files
> > slice n+1: flush large file N+1
> > 
> > So that we keep the disk busy with a relatively fair mix of
> > small and large I/Os while both are necessary.
> 
> If we can sync fast enough, the lower layer would be able to merge
> those 4MB requests.

No, not necessarily - think of a stripe with a chunk size of 512k.
That 4MB will be split into 8x512k chunks and sent to different
devices (and hence elevator queues). The only way you get elevator
merging in this sort of config is that if you send multiple stripe
*width* sized amounts to the device in a very short time period.
I see quite a few filesystems with stripe widths in the tens of MB
range.
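
Putting numbers on that (chunk size and disk count here are illustrative):

#include <stdio.h>

int main(void)
{
	long chunk  = 512 * 1024;		/* stripe chunk size */
	long ndisks = 8;			/* disks in the stripe */
	long slice  = 4 * 1024 * 1024;		/* one writeback slice */
	long width  = chunk * ndisks;		/* full stripe width */

	printf("a %ldMB slice splits into %ld chunks, one per disk\n",
	       slice >> 20, slice / chunk);
	printf("stripe width is %ldMB, so a disk only sees two contiguous\n"
	       "chunks it can merge once several widths are issued together\n",
	       width >> 20);
	return 0;
}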

> > Put simply:
> > 
> > The higher the bandwidth of the device, the more frequently
> > we need to be servicing the inodes with large amounts of
> > dirty data to be written to maintain write throughput at a
> > significant percentage of the device capability.
> > 
> > The writeback algorithm needs to take this into account for it
> > to be able to scale effectively for high throughput devices.
> 
> Slow queues go full first. Currently the writeback code will skip
> _and_ congestion_wait() for congested filesystems. The better policy
> is to congestion_wait() _after_ all other writable pages have been
> synced.

Agreed.

The comments I've made are mainly concerned with getting efficient
flushing of a single device occurring. Interactions between multiple
devices are a separable issue.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] Converting writeback linked lists to a tree based data structure

2008-01-18 Thread David Chinner
On Thu, Jan 17, 2008 at 09:38:24PM -0800, Michael Rubin wrote:
> On Jan 17, 2008 9:01 PM, David Chinner <[EMAIL PROTECTED]> wrote:
> 
> First off thank you for the very detailed reply. This rocks and gives
> me much to think about.
> 
> > On Thu, Jan 17, 2008 at 01:07:05PM -0800, Michael Rubin wrote:
> > This seems suboptimal for large files. If you keep feeding in
> > new least recently dirtied files, the large files will never
> > get an unimpeded go at the disk and hence we'll struggle to
> > get decent bandwidth under anything but pure large file
> > write loads.
> 
> You're right. I understand now. I just  changed a dial on my tests,
> ran it and found pdflush not keeping up like it should. I need to
> address this.
> 
> > Switching inodes during writeback implies a seek to the new write
> > location, while continuing to write the same inode has no seek
> > penalty because the writeback is sequential.  It follows from this
> > that allowing large files a disproportionate amount of data
> > writeback is desirable.
> >
> > Also, cycling rapidly through all the large files to write 4MB to each is
> > going to cause us to spend time seeking rather than writing compared
> > to cycling slower and writing 40MB from each large file at a time.
> >
> > i.e. servicing one large file for 100ms is going to result in higher
> > writeback throughput than servicing 10 large files for 10ms each
> > because there's going to be less seeking and more writing done by
> > the disks.
> >
> > That is, think of large file writes like process scheduler batch
> > jobs - bulk throughput is what matters, so the larger the time slice
> > you give them the higher the throughput.
> >
> > IMO, the sort of result we should be looking at is a
> > writeback design that results in cycling somewhat like:
> >
> > slice 1: iterate over small files
> > slice 2: flush large file 1
> > slice 3: iterate over small files
> > slice 4: flush large file 2
> > ..
> > slice n-1: flush large file N
> > slice n: iterate over small files
> > slice n+1: flush large file N+1
> >
> > So that we keep the disk busy with a relatively fair mix of
> > small and large I/Os while both are necessary.
> 
> I am getting where you are coming from. But if we are going to make
> changes to optimize for seeks maybe we need to be more aggressive in
> write back in how we organize both time and location. Right now AFAIK
> there is no attention to location in the writeback path.

True. But IMO, locality ordering really only impacts the small file
data writes and the inodes themselves because there is typically
lots of seeks in doing that.

For large sequential writes to a file, writing a significant
chunk of data gives that bit of writeback its own locality
because it does not cause seeks. Hence simply writing large
enough chunks avoids any need to order the writeback by locality.

Hence I see writeback ordering by locality as more a function of
optimising the "iterate over small files" aspect of the writeback.

> > The higher the bandwidth of the device, the more frequently
> > we need to be servicing the inodes with large amounts of
> > dirty data to be written to maintain write throughput at a
> > significant percentage of the device capability.
> >
> 
> Could you expand that to say it's not the inodes of large files but
> the ones with data that we can exploit locality?

Not sure I understand what you mean. Can you rephrase that?

> Often large files are fragmented.

Then the filesystem is not doing its job. Fragmentation does
not happen very frequently in XFS for large files - that is one
of the reasons it is extremely good for large files and high
throughput applications...

> Would it make more sense to pursue cracking the inodes and
> grouping their blocks's locations? Or is this all overkill and should
> be handled at a lower level like the elevator?

For large files it is overkill. For filesystems that do delayed
allocation, it is often impossible (no block mapping until
the writeback is executed unless it's an overwrite).

At this point, I'd say it is best to leave it to the filesystem and
the elevator to do their jobs properly.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] Converting writeback linked lists to a tree based data structure

2008-01-17 Thread David Chinner
On Thu, Jan 17, 2008 at 01:07:05PM -0800, Michael Rubin wrote:
> > Michael, could you sort out and document the new starvation prevention 
> > schemes?
> 
> The basic idea behind the writeback algorithm to handle starvation.
> The over arching idea is that we want to preserve order of writeback
> based on when an inode was dirtied and also preserve the dirtied_when
> contents until the inode has been written back (partially or fully)
> 
> Every sync_sb_inodes we find the least recent inodes dirtied. To deal
> with large or small starvation we have a s_flush_gen for each
> iteration of sync_sb_inodes every time we issue a writeback we mark
> that the inode cannot be processed until the next s_flush_gen. This
> way we don't process one get to the rest since we keep pushing them
> into subsequent s_fush_gen's.

This seems suboptimal for large files. If you keep feeding in
new least recently dirtied files, the large files will never
get an unimpeded go at the disk and hence we'll struggle to
get decent bandwidth under anything but pure large file
write loads.

Fairness is a tradeoff between seeks and bandwidth.  Ideally, we
want to spend 50% of the *disk* time servicing sequential writes and
50% of the time servicing seeky writes - that way neither get
penalised unfairly by the other type of workload.

Switching inodes during writeback implies a seek to the new write
location, while continuing to write the same inode has no seek
penalty because the writeback is sequential.  It follows from this
that allowing large files a disproportionate amount of data
writeback is desirable.

Also, cycling rapidly through all the large files to write 4MB to each is
going to cause us to spend time seeking rather than writing compared
to cycling slower and writing 40MB from each large file at a time.

i.e. servicing one large file for 100ms is going to result in higher
writeback throughput than servicing 10 large files for 10ms each
because there's going to be less seeking and more writing done by
the disks.

That is, think of large file writes like process scheduler batch
jobs - bulk throughput is what matters, so the larger the time slice
you give them the higher the throughput.

IMO, the sort of result we should be looking at is a
writeback design that results in cycling somewhat like:

slice 1: iterate over small files
slice 2: flush large file 1
slice 3: iterate over small files
slice 4: flush large file 2
..
slice n-1: flush large file N
slice n: iterate over small files
slice n+1: flush large file N+1

So that we keep the disk busy with a relatively fair mix of
small and large I/Os while both are necessary.

Furthermore, as disk bandwidth goes up, the relationship
between large file and small file writes changes if we want
to maintain writeback at a significant percentage of the
maximum bandwidth of the drive (which is extremely desirable).
So if we take a 4k page machine and a 1024-page writeback slice,
for different disks, we get a bandwidth slice in terms of disk
seeks like:

slow disk: 20MB/s, 10ms seek (say a laptop drive)
- 4MB write takes 200ms, or equivalent of 20 seeks

normal disk: 60MB/s, 8ms seek (desktop)
- 4MB write takes 66ms, or equivalent of 8 seeks

fast disk: 120MB/s, 5ms seek (15krpm SAS)
- 4MB write takes 33ms,  or equivalent of 6 seeks

small RAID5 lun: 200MB/s, 4ms seek
- 4MB write takes 20ms, or equivalent of 5 seeks

Large RAID5 lun: 1GB/s, 2ms seek
- 4MB write takes 4ms, or equivalent of 2 seeks

Put simply:

The higher the bandwidth of the device, the more frequently
we need to be servicing the inodes with large amounts of
dirty data to be written to maintain write throughput at a
significant percentage of the device capability.

The writeback algorithm needs to take this into account for it
to be able to scale effectively for high throughput devices.
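
The numbers in that table are just (slice size / bandwidth) expressed in
seek times; for other devices they can be recalculated with something
like:

#include <stdio.h>

struct dev {
	const char *name;
	double mb_per_sec;	/* sustained bandwidth */
	double seek_ms;		/* average seek time */
};

int main(void)
{
	/* 4k pages * 1024-page writeback slice = 4MB per slice */
	const double slice_mb = 4.0;
	struct dev devs[] = {
		{ "laptop disk",      20.0, 10.0 },
		{ "desktop disk",     60.0,  8.0 },
		{ "15krpm SAS",      120.0,  5.0 },
		{ "small RAID5 lun", 200.0,  4.0 },
		{ "large RAID5 lun", 1000.0, 2.0 },
	};

	for (unsigned i = 0; i < sizeof(devs) / sizeof(devs[0]); i++) {
		double write_ms = slice_mb / devs[i].mb_per_sec * 1000.0;
		printf("%-16s %6.1fms per 4MB slice ~= %4.1f seeks\n",
		       devs[i].name, write_ms, write_ms / devs[i].seek_ms);
	}
	return 0;
}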

BTW, it needs to be recognised that if we are under memory pressure
we can clean much more memory in a short period of time by writing
out all the large files first. This would clearly benefit the system
as a whole as we'd get the most pages available for reclaim as
possible in as short a time as possible. The writeback algorithm
should really have a mode that allows this sort of flush ordering to
occur.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] Parallelize IO for e2fsck

2008-01-17 Thread David Chinner
On Wed, Jan 16, 2008 at 01:30:43PM -0800, Valerie Henson wrote:
> Hi y'all,
> 
> This is a request for comments on the rewrite of the e2fsck IO
> parallelization patches I sent out a few months ago.  The mechanism is
> totally different.  Previously IO was parallelized by issuing IOs from
> multiple threads; now a single thread issues fadvise(WILLNEED) and
> then uses read() to complete the IO.

Interesting.

We ultimately rejected a similar patch to xfs_repair (pre-populating
the kernel block device cache) mainly because of low memory
performance issues and it doesn't really enable you to do anything
particularly smart with optimising I/O patterns for larger, high
performance RAID arrays.

The low memory problems were particularly bad; the readahead
thrashing caused a slowdown of 2-3x compared to the baseline and
often it was due to the repair process requiring all of memory
to cache stuff it would need later. IIRC, multi-terabyte ext3
filesystems have similar memory usage problems to XFS, so there's
a good chance that this patch will see the same sorts of issues.

> Single disk performance doesn't change, but elapsed time drops by
> about 50% on a big RAID-5 box.  Passes 1 and 2 are parallelized.  Pass
> 5 is left as an exercise for the reader.

Promising results, though.
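
For reference, the mechanism described above boils down to the pattern
below (file name and chunk size are placeholders, error handling trimmed):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK	(1024 * 1024)

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/dev/sdb1";	/* placeholder */
	char *buf = malloc(CHUNK);
	int fd = open(path, O_RDONLY);
	off_t off = 0;

	if (fd < 0 || !buf)
		return 1;

	/* ask the kernel to start reading the next chunk ... */
	posix_fadvise(fd, off + CHUNK, CHUNK, POSIX_FADV_WILLNEED);

	/* ... then consume the current one; the read hits the page cache
	 * that the asynchronous readahead has (hopefully) already filled */
	while (pread(fd, buf, CHUNK, off) == CHUNK) {
		off += CHUNK;
		posix_fadvise(fd, off + CHUNK, CHUNK, POSIX_FADV_WILLNEED);
		/* process(buf) ... */
	}

	close(fd);
	free(buf);
	return 0;
}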

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] Converting writeback linked lists to a tree based data structure

2008-01-16 Thread David Chinner
On Thu, Jan 17, 2008 at 11:16:00AM +0800, Fengguang Wu wrote:
> On Thu, Jan 17, 2008 at 09:35:10AM +1100, David Chinner wrote:
> > On Wed, Jan 16, 2008 at 05:07:20PM +0800, Fengguang Wu wrote:
> > > On Tue, Jan 15, 2008 at 09:51:49PM -0800, Andrew Morton wrote:
> > > > > Then to do better ordering by adopting radix tree(or rbtree
> > > > > if radix tree is not enough),
> > > > 
> > > > ordering of what?
> > > 
> > > Switch from time to location.
> > 
> > Note that data writeback may be adversely affected by location
> > based writeback rather than time based writeback - think of
> > the effect of location based data writeback on an app that
> > creates lots of short term (<30s) temp files and then removes
> > them before they are written back.
> 
> A small(e.g. 5s) time window can still be enforced, but...

Yes, you could, but that will then result in non-deterministic
performance for repeated workloads because the order of file
writeback will not be consistent.

e.g.  the first run is fast because the output file is at lower
offset than the temp file meaning the temp file gets deleted
without being written.

The second run is slow because the location of the files is
reversed and the temp file is written to disk before the
final output file and hence the run is much slower because
it writes much more.

The third run is also slow, but the files are like the first
fast run. However, pdflush tries to write the temp file back
within 5s of it being dirtied so it skips it and writes
the output file first.

The difference between the first+second case can be found by
knowing that inode number determines writeback order, but
there is no obvious clue as to why the first+third runs are
different.

This is exactly the sort of non-deterministic behaviour we 
want to avoid in a writeback algorithm.

> > Hmmm - I'm wondering if we'd do better to split data writeback from
> > inode writeback. i.e. we do two passes.  The first pass writes all
> > the data back in time order, the second pass writes all the inodes
> > back in location order.
> > 
> > Right now we interleave data and inode writeback, (i.e.  we do data,
> > inode, data, inode, data, inode, ...). I'd much prefer to see all
> > data written out first, then the inodes. ->writepage often dirties
> > the inode and hence if we need to do multiple do_writepages() calls
> > on an inode to flush all the data (e.g. congestion, large amounts of
> > data to be written, etc), we really shouldn't be calling
> > write_inode() after every do_writepages() call. The inode
> > should not be written until all the data is written
> 
> That may do good to XFS. Another case is documented as follows:
> "the write_inode() function of a typical fs will perform no I/O, but
> will mark buffers in the blockdev mapping as dirty."

Yup, but in that situation ->write_inode() does not do any I/O, so
it will work with any high level inode writeback ordering or timing
scheme equally well.  As a result, that's not the case we need to
optimise at all.

FWIW, the NFS client is likely to work better with split data/
inode writeback as it also has to mark the inode dirty on async
write completion (to get ->write_inode called to issue a commit
RPC). Hence delaying the inode write until after all the data
is written makes sense there as well

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] Converting writeback linked lists to a tree based data structure

2008-01-16 Thread David Chinner
On Wed, Jan 16, 2008 at 05:07:20PM +0800, Fengguang Wu wrote:
> On Tue, Jan 15, 2008 at 09:51:49PM -0800, Andrew Morton wrote:
> > > Then to do better ordering by adopting radix tree(or rbtree
> > > if radix tree is not enough),
> > 
> > ordering of what?
> 
> Switch from time to location.

Note that data writeback may be adversely affected by location
based writeback rather than time based writeback - think of
the effect of location based data writeback on an app that
creates lots of short term (<30s) temp files and then removes
them before they are written back.

Also, data writeback location cannot be easily derived from
the inode number in pretty much all cases. "near" in terms
of XFS means the same AG which means the data could be up to
a TB away from the inode, and if you have >1TB filesystems
using the default inode32 allocator, file data is *never*
placed near the inode - the inodes are in the first TB of
the filesystem, the data is rotored around the rest of the
filesystem.

And with delayed allocation, you don't know where the data is even
going to be written ahead of the filesystem ->writepage call, so you
can't do optimal location ordering for data in this case.

> > > and lastly get rid of the list_heads to
> > > avoid locking. Does it sound like a good path?
> > 
> > I'd have thought that replacing list_heads with another data structure
> > would be a single commit.
> 
> That would be easy. s_more_io and s_more_io_wait can all be converted
> to radix trees.

Makes sense for location based writeback of the inodes themselves,
but not for data.

Hmmm - I'm wondering if we'd do better to split data writeback from
inode writeback. i.e. we do two passes.  The first pass writes all
the data back in time order, the second pass writes all the inodes
back in location order.

Right now we interleave data and inode writeback (i.e. we do data,
inode, data, inode, data, inode, ...). I'd much prefer to see all
data written out first, then the inodes. ->writepage often dirties
the inode and hence if we need to do multiple do_writepages() calls
on an inode to flush all the data (e.g. congestion, large amounts of
data to be written, etc), we really shouldn't be calling
write_inode() after every do_writepages() call. The inode
should not be written until all the data is written.
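
In rough pseudo-code, the split would look something like this (a sketch
only; the per-sb lists named here don't exist in this form):

/* sketch of split data/inode writeback -- the two lists are hypothetical */
static void sync_sb_split(struct super_block *sb,
			  struct writeback_control *wbc)
{
	struct inode *inode;

	/* pass 1: flush data, oldest-dirtied inodes first */
	list_for_each_entry(inode, &sb->s_dirty_by_time, i_list)
		do_writepages(inode->i_mapping, wbc);

	/*
	 * pass 2: only now write the inodes themselves, ordered by on-disk
	 * location, so ->writepage redirtying the inode (delalloc, i_size
	 * updates) doesn't force multiple inode writes per flush.
	 */
	list_for_each_entry(inode, &sb->s_dirty_by_location, i_list)
		write_inode_now(inode, 0);
}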

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 09/13] writeback: requeue_io() on redirtied inode

2008-01-16 Thread David Chinner
On Tue, Jan 15, 2008 at 08:36:46PM +0800, Fengguang Wu wrote:
> Redirtied inodes could be seen in really fast writes.
> They should really be synced as soon as possible.
> 
> redirty_tail() could delay the inode for up to 30s.
> Kill the delay by using requeue_io() instead.

That's actually bad for anything that does delayed allocation
or updates state on data I/O completion.

e.g. XFS when writing past EOF doing delalloc dirties the inode
during writeout (allocation) and then updates the file size on data
I/O completion, hence dirtying the inode again.

With this change, writing the last pages out would result
in hitting this code and causing the inode to be flushed very
soon after the data write. Then, after the inode write is issued,
we get data I/O completion which dirties the inode again,
resulting in needing to write the inode again to clean it.
i.e. it introduces a potential new and useless inode write
I/O.

Also, the immediate inode write may be useless for XFS because the
inode may be pinned in memory due to async transactions
still in flight (e.g. from delalloc) so we've got two
situations where flushing the inode immediately is suboptimal.

Hence I don't think this is an optimisation that should be made
in the generic writeback code.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 09/13] writeback: requeue_io() on redirtied inode

2008-01-16 Thread David Chinner
On Tue, Jan 15, 2008 at 08:36:46PM +0800, Fengguang Wu wrote:
 Redirtied inodes could be seen in really fast writes.
 They should really be synced as soon as possible.
 
 redirty_tail() could delay the inode for up to 30s.
 Kill the delay by using requeue_io() instead.

That's actually bad for anything that does delayed allocation
or updates state on data I/o completion.

e.g. XFS when writing past EOF doing delalloc dirties the inode
during writeout (allocation) and then updates the file size on data
I/o completion hence dirtying the inode again.

With this change, writing the last pages out would result
in hitting this code and causing the inode to be flushed very
soon after the data write. Then, after the inode write is issued,
we get data I/o completion which dirties the inode again,
resulting in needing to write the inode again to clean it.
i.e. it introduces a potential new and useless inode write
I/O.

Also, the immediate inode write may be useless for XFS because the
inode may be pinned in memory due to async transactions
still in flight (e.g. from delalloc) so we've got two
situations where flushing the inode immediately is suboptimal.

Hence I don't think this is an optimisation that should be made
in the generic writeback code.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] Converting writeback linked lists to a tree based data structure

2008-01-16 Thread David Chinner
On Wed, Jan 16, 2008 at 05:07:20PM +0800, Fengguang Wu wrote:
 On Tue, Jan 15, 2008 at 09:51:49PM -0800, Andrew Morton wrote:
   Then to do better ordering by adopting radix tree(or rbtree
   if radix tree is not enough),
  
  ordering of what?
 
 Switch from time to location.

Note that data writeback may be adversely affected by location
based writeback rather than time based writeback - think of
the effect of location based data writeback on an app that
creates lots of short term (30s) temp files and then removes
them before they are written back.

Also, data writeback locatio cannot be easily derived from
the inode number in pretty much all cases. near in terms
of XFS means the same AG which means the data could be up to
a TB away from the inode, and if you have 1TB filesystems
usingthe default inode32 allocator, file data is *never*
placed near the inode - the inodes are in the first TB of
the filesystem, the data is rotored around the rest of the
filesystem.

And with delayed allocation, you don't know where the data is even
going to be written ahead of the filesystem -writepage call, so you
can't do optimal location ordering for data in this case.

   and lastly get rid of the list_heads to
   avoid locking. Does it sound like a good path?
  
  I'd have thaought that replacing list_heads with another data structure
  would be a simgle commit.
 
 That would be easy. s_more_io and s_more_io_wait can all be converted
 to radix trees.

Makes sense for location based writeback of the inodes themselves,
but not for data.

H - I'm wondering if we'd do better to split data writeback from
inode writeback. i.e. we do two passes.  The first pass writes all
the data back in time order, the second pass writes all the inodes
back in location order.

Right now we interleave data and inode writeback, (i.e.  we do data,
inode, data, inode, data, inode, ). I'd much prefer to see all
data written out first, then the inodes. -writepage often dirties
the inode and hence if we need to do multiple do_writepages() calls
on an inode to flush all the data (e.g. congestion, large amounts of
data to be written, etc), we really shouldn't be calling
write_inode() after every do_writepages() call. The inode
should not be written until all the data is written

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] Converting writeback linked lists to a tree based data structure

2008-01-16 Thread David Chinner
On Thu, Jan 17, 2008 at 11:16:00AM +0800, Fengguang Wu wrote:
 On Thu, Jan 17, 2008 at 09:35:10AM +1100, David Chinner wrote:
  On Wed, Jan 16, 2008 at 05:07:20PM +0800, Fengguang Wu wrote:
   On Tue, Jan 15, 2008 at 09:51:49PM -0800, Andrew Morton wrote:
 Then to do better ordering by adopting radix tree(or rbtree
 if radix tree is not enough),

ordering of what?
   
   Switch from time to location.
  
  Note that data writeback may be adversely affected by location
  based writeback rather than time based writeback - think of
  the effect of location based data writeback on an app that
  creates lots of short term (30s) temp files and then removes
  them before they are written back.
 
 A small(e.g. 5s) time window can still be enforced, but...

Yes, you could, but that will then result in non-deterministic
performance for repeated workloads because the order of file
writeback will not be consistent.

e.g.  the first run is fast because the output file is at lower
offset than the temp file meaning the temp file gets deleted
without being written.

The second run is slow because the location of the files is
reversed and the temp file is written to disk before the
final output file and hence the run is much slower because
it writes much more.

The third run is also slow, but the files are like the first
fast run. However, pdflush tries to write the temp file back
within 5s of it being dirtied so it skips it and writes
the output file first.

The difference between the first+second case can be found by
knowing that inode number determines writeback order, but
there is no obvious clue as to why the first+third runs are
different.

This is exactly the sort of non-deterministic behaviour we 
want to avoid in a writeback algorithm.

  H - I'm wondering if we'd do better to split data writeback from
  inode writeback. i.e. we do two passes.  The first pass writes all
  the data back in time order, the second pass writes all the inodes
  back in location order.
  
  Right now we interleave data and inode writeback, (i.e.  we do data,
  inode, data, inode, data, inode, ). I'd much prefer to see all
  data written out first, then the inodes. -writepage often dirties
  the inode and hence if we need to do multiple do_writepages() calls
  on an inode to flush all the data (e.g. congestion, large amounts of
  data to be written, etc), we really shouldn't be calling
  write_inode() after every do_writepages() call. The inode
  should not be written until all the data is written
 
 That may do good to XFS. Another case is documented as follows:
 the write_inode() function of a typical fs will perform no I/O, but
 will mark buffers in the blockdev mapping as dirty.

Yup, but in that situation -write_inode() does not do any I/O, so
it will work with any high level inode writeback ordering or timing
scheme equally well.  As a result, that's not the case we need to
optimise at all.

FWIW, the NFS client is likely to work better with split data/
inode writeback as it also has to mark the inode dirty on async
write completion (to get ->write_inode called to issue a commit
RPC). Hence delaying the inode write until after all the data
is written makes sense there as well
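
As a standalone illustration of that two-pass split - a userspace sketch
only, with made-up names (dirty_inode_rec, cmp_time, cmp_location) rather
than the kernel's real data structures - something like the following
orders a set of dirty-inode records by dirtied time for the data pass and
then by on-disk location for the inode pass:

/* Userspace model of split data/inode writeback ordering (illustrative only). */
#include <stdio.h>
#include <stdlib.h>

struct dirty_inode_rec {
        unsigned long ino;              /* stands in for on-disk location */
        unsigned long dirtied_when;     /* jiffies-like timestamp */
};

static int cmp_time(const void *a, const void *b)
{
        const struct dirty_inode_rec *x = a, *y = b;
        return (x->dirtied_when > y->dirtied_when) - (x->dirtied_when < y->dirtied_when);
}

static int cmp_location(const void *a, const void *b)
{
        const struct dirty_inode_rec *x = a, *y = b;
        return (x->ino > y->ino) - (x->ino < y->ino);
}

int main(void)
{
        struct dirty_inode_rec inodes[] = {
                { .ino = 9000, .dirtied_when = 100 },
                { .ino =   42, .dirtied_when = 300 },
                { .ino = 5000, .dirtied_when = 200 },
        };
        size_t n = sizeof(inodes) / sizeof(inodes[0]);
        size_t i;

        /* Pass 1: push data back oldest-first (time order). */
        qsort(inodes, n, sizeof(inodes[0]), cmp_time);
        for (i = 0; i < n; i++)
                printf("data pass:  ino %lu (dirtied %lu)\n",
                       inodes[i].ino, inodes[i].dirtied_when);

        /* Pass 2: write the inodes themselves back in location order. */
        qsort(inodes, n, sizeof(inodes[0]), cmp_location);
        for (i = 0; i < n; i++)
                printf("inode pass: ino %lu\n", inodes[i].ino);

        return 0;
}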

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] Converting writeback linked lists to a tree based data structure

2008-01-15 Thread David Chinner
On Tue, Jan 15, 2008 at 07:44:15PM -0800, Andrew Morton wrote:
> On Wed, 16 Jan 2008 11:01:08 +0800 Fengguang Wu <[EMAIL PROTECTED]> wrote:
> 
> > On Tue, Jan 15, 2008 at 09:53:42AM -0800, Michael Rubin wrote:
> > > On Jan 15, 2008 12:46 AM, Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> > > > Just a quick question, how does this interact/depend-uppon etc.. with
> > > > Fengguangs patches I still have in my mailbox? (Those from Dec 28th)
> > > 
> > > They don't. They apply to a 2.6.24rc7 tree. This is a candidte for 2.6.25.
> > > 
> > > This work was done before Fengguang's patches. I am trying to test
> > > Fengguang's for comparison but am having problems with getting mm1 to
> > > boot on my systems.
> > 
> > Yeah, they are independent ones. The initial motivation is to fix the
> > bug "sluggish writeback on small+large files". Michael introduced
> > a new rbtree, and me introduced a new list(s_more_io_wait).
> > 
> > Basically I think rbtree is an overkill to do time based ordering.
> > Sorry, Michael. But s_dirty would be enough for that. Plus, s_more_io
> > provides fair queuing between small/large files, and s_more_io_wait
> > provides waiting mechanism for blocked inodes.
> > 
> > The time ordered rbtree may delay io for a blocked inode simply by
> > modifying its dirtied_when and reinsert it. But it would no longer be
> > that easy if it is to be ordered by location.
> 
> What does the term "ordered by location" mean?  Attemting to sort inodes by
> physical disk address?  By using their i_ino as a key?
> 
> That sounds optimistic.

In XFS, inode number is an encoding of its location on disk, so
ordering inode writeback by inode number *does* make sense.
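
Purely as an illustration of that encoding, here is a userspace sketch of
unpacking an inode number into its location components; the field widths
normally come from the superblock (sb_inopblog, sb_agblklog), and the values
below are assumptions for the example, not read from a real filesystem:

/* Userspace sketch only: unpack an XFS-style inode number into its
 * on-disk location components.  Field widths are assumed, not real. */
#include <stdio.h>

#define SB_INOPBLOG     4U      /* log2(inodes per filesystem block), assumed */
#define SB_AGBLKLOG     16U     /* log2(blocks per allocation group), assumed */

int main(void)
{
        unsigned long long ino = 0x1234567ULL;          /* arbitrary example */
        unsigned long long agno, agbno, offset;

        offset = ino & ((1ULL << SB_INOPBLOG) - 1);
        agbno  = (ino >> SB_INOPBLOG) & ((1ULL << SB_AGBLKLOG) - 1);
        agno   = ino >> (SB_INOPBLOG + SB_AGBLKLOG);

        /* Sorting by inode number therefore sorts by (AG, block within AG),
         * i.e. roughly by physical position on the disk. */
        printf("ino 0x%llx -> AG %llu, AG block %llu, inode offset %llu\n",
               ino, agno, agbno, offset);
        return 0;
}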

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-15 Thread David Chinner
On Tue, Jan 15, 2008 at 09:16:53PM +0100, Pavel Machek wrote:
> Hi!
> 
> > > What are ext3 expectations of disk (is there doc somewhere)? For
> > > example... if disk does not lie, but powerfail during write damages
> > > the sector -- is ext3 still going to work properly?
> > 
> > Nope. However the few disks that did this rapidly got firmware updates
> > because there are other OS's that can't cope.
> > 
> > > If disk does not lie, but powerfail during write may cause random
> > > numbers to be returned on read -- can fsck handle that?
> > 
> > most of the time. and fsck knows about writing sectors to remove read
> > errors in metadata blocks.
> > 
> > > What abou disk that kills 5 sectors around sector being written during
> > > powerfail; can ext3 survive that?
> > 
> > generally. Note btw that for added fun there is nothing that guarantees
> > the blocks around a block on the media are sequentially numbered. The
> > usually are but you never know.
> 
> Ok, should something like this be added to the documentation?
> 
> It would be cool to be able to include few examples (modern SATA disks
> support bariers so are safe, any IDE from 1989 is unsafe), but I do
> not know enough about hw...

ext3 is not the only filesystem that will have trouble due to
volatile write caches. We see problems often enough with XFS
due to volatile write caches that it's in our FAQ:

http://oss.sgi.com/projects/xfs/faq.html#wcache
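
As a minimal userspace illustration (the scratch file name "testfile" is just
an assumption for the example): even after fsync() returns, a drive with a
volatile write cache and no barrier or cache-flush support may still be
holding the blocks in its cache, so they can be lost on power failure:

/* Write-and-fsync sketch: correct application code, yet not power-fail safe
 * on its own if the drive's write cache is volatile and never flushed. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const char buf[] = "critical metadata\n";
        int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (write(fd, buf, strlen(buf)) != (ssize_t)strlen(buf))
                perror("write");
        if (fsync(fd) < 0)      /* "stable storage" only if the cache gets flushed */
                perror("fsync");
        close(fd);
        return 0;
}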

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Why is deleting (or reading) files not counted as IO-Wait in top?

2008-01-13 Thread David Chinner
On Wed, Jan 02, 2008 at 08:35:03PM +0100, Matthias Schniedermeyer wrote:
> Hi
> 
> 
> Currently i'm deleting about 500.000 files on a XFS-filesystem which 
> takes a few minutes, as i had a top open i saw that 'wa' is shown as 
> 0.0% (Nothing else running currently) and everything except 'id' is near 
> the bottom too. Kernel is 2.6.23.11.

Simply because the only I/O that XFS does during a delete is
to the log and the log does async I/O and hence the process
never blocks in I/O.

Instead, it blocks in a far more complex space reservation that
may or may not be related to I/O wait

> So, as 'rm -rf' is essentially a IO (or seek, to be more correct)-bound 
> task, shouldn't that count as "Waiting for IO"?

rm -rf is not seek bound on XFS - it's generally determined by
the sequential write speed of the block device or how fast your
CPU is

> The man-page of top says:
> 'Amount of time the CPU has been waiting for I/O to complete.'

Async I/O means that typically your CPU does not get held up
waiting for I/O to complete

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: xfs|loop|raid: attempt to access beyond end of device

2007-12-25 Thread David Chinner
On Sun, Dec 23, 2007 at 08:21:08PM +0100, Janos Haar wrote:
> Hello, list,
> 
> I have a little problem on one of my productive system.
> 
> The system sometimes crashed, like this:
> 
> Dec 23 08:53:05 Albohacen-global kernel: attempt to access beyond end of
> device
> Dec 23 08:53:05 Albohacen-global kernel: loop0: rw=1, want=50552830649176,
> limit=3085523200
> Dec 23 08:53:05 Albohacen-global kernel: Buffer I/O error on device loop0,
> logical block 6319103831146
> Dec 23 08:53:05 Albohacen-global kernel: lost page write due to I/O error on
> loop0

So a long way beyond the end of the device.

[snip soft lockup warnings]

> Dec 23 09:08:19 Albohacen-global kernel: Filesystem "loop0": Access to block
> zero in inode 397821447 start_block: 0 start_off: 0 blkcnt: 0 extent-state:
> 0 lastx: e4

And that's to block zero of the filesystem. Sure signs of a corrupted inode
extent btree. We've seen a few of these corruptions on loopback device
reported recently.

You'll need to unmount and repair the filesystem to make this go away,
but it's hard to know what is causing the btree corruption.

> Dec 23 09:08:22 Albohacen-global last message repeated 19 times
> 
> some more info:
> 
> [EMAIL PROTECTED] ~]# uname -a
> Linux Albohacen-global 2.6.21.1 #3 SMP Thu May 3 04:33:36 CEST 2007 x86_64
> x86_64 x86_64 GNU/Linux
> [EMAIL PROTECTED] ~]# cat /proc/mdstat
> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
> [multipath] [faulty]
> md1 : active raid4 sdf2[1] sde2[0] sdd2[5] sdc2[4] sdb2[3] sda2[2]
>   19558720 blocks level 4, 64k chunk, algorithm 0 [6/6] [UU]
>   bitmap: 8/239 pages [32KB], 8KB chunk
> 
> md2 : active raid4 sdf3[1] sde3[0] sdd3[5] sdc3[4] sdb3[3] sda3[2]
>   1542761600 blocks level 4, 64k chunk, algorithm 0 [6/6] [UU]
>   bitmap: 0/148 pages [0KB], 1024KB chunk
> 
> md0 : active raid1 sdb1[1] sda1[0]
>   104320 blocks [2/2] [UU]
> 
> unused devices: 
> [EMAIL PROTECTED] ~]# losetup /dev/loop0
> /dev/loop0: [0010]:6598 (/dev/md2), encryption blowfish (type 18)

You're using an encrypted block device? What mechanism are you using for
encryption (doesn't appear to be dmcrypt)? Does it handle readahead bio
cancellation correctly? We had similar XFS corruption problems on dmcrypt
between 2.6.14 and ~2.6.20 due to a bug in dmcrypt's failure to handle
aborted readahead I/O correctly

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ia64] BUG: sleeping in atomic

2007-12-19 Thread David Chinner
On Wed, Dec 19, 2007 at 11:42:04AM -0500, Kyle McMartin wrote:
> On Wed, Dec 19, 2007 at 04:54:30PM +1100, David Chinner wrote:
> > [ 5667.086055] BUG: sleeping function called from invalid context at 
> > kernel/fork.c:401
> > 
> 
> The problem is that mmput is called under the read_lock by
> find_thread_for_addr... The comment above seems to indicate that gdb
> needs to be able to access any child tasks register backing store
> memory... This seems pretty broken.
> 
> cheers, Kyle
> 
> ---
> 
> Who knows, maybe gdb is saner now?
> 
> diff --git a/arch/ia64/kernel/ptrace.c b/arch/ia64/kernel/ptrace.c
> index 2e96f17..b609704 100644
> --- a/arch/ia64/kernel/ptrace.c
> +++ b/arch/ia64/kernel/ptrace.c
> @@ -1418,7 +1418,7 @@ asmlinkage long
>  sys_ptrace (long request, pid_t pid, unsigned long addr, unsigned long data)
>  {
>   struct pt_regs *pt;
> - unsigned long urbs_end, peek_or_poke;
> + unsigned long urbs_end;
>   struct task_struct *child;
>   struct switch_stack *sw;
>   long ret;
> @@ -1430,23 +1430,12 @@ sys_ptrace (long request, pid_t pid, unsigned long 
> addr, unsigned long data)
>   goto out;
>   }
>  
> - peek_or_poke = (request == PTRACE_PEEKTEXT
> - || request == PTRACE_PEEKDATA
> - || request == PTRACE_POKETEXT
> - || request == PTRACE_POKEDATA);
> - ret = -ESRCH;
> - read_lock(_lock);
> - {
> - child = find_task_by_pid(pid);
> - if (child) {
> - if (peek_or_poke)
> - child = find_thread_for_addr(child, addr);
> - get_task_struct(child);
> - }
> - }
> - read_unlock(_lock);
> - if (!child)
> + child = ptrace_get_task_struct(pid);
> + if (IS_ERR(child)) {
> + ret = PTR_ERR(child);
>   goto out;
> + }
> +
>   ret = -EPERM;
>   if (pid == 1)   /* no messing around with init! */
>   goto out_tsk;

Yes, this patch fixes the problem (though I haven't tried to
use gdb yet).

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch, rfc] mm.h, security.h, key.h and preventing namespace poisoning

2007-12-19 Thread David Chinner
On Thu, Dec 20, 2007 at 11:07:01AM +1100, James Morris wrote:
> On Wed, 19 Dec 2007, David Chinner wrote:
> 
> > Folks,
> > 
> > I just updated a git tree and started getting errors on a
> > "copy_keys" macro warning.
> > 
> > The code I've been working on uses a ->copy_keys() method for
> > copying the keys in a btree block from one place to another. I've
> > been working on this code for a while
> > (http://oss.sgi.com/archives/xfs/2007-11/msg00046.html) and keep the
> > tree I'm working in reletively up to date (lags linus by a couple of
> > weeks at most). The update I did this afternoon gave a conflict
> > warning with the macro in include/linux/key.h.
> > 
> > Given that I'm not directly including key.h anywhere in the XFS
> > code, I'm getting the namespace polluted indirectly from some other
> > include that is necessary.
> > 
> > As it turns out, this commit from 13 days ago:
> > 
> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7cd94146cd504016315608e297219f9fb7b1413b
> > 
> > included security.h in mm.h and that is how I'm seeing the namespace
> > poisoning coming from key.h when !CONFIG_KEY.
> > 
> > Including security.h in mm.h means much wider includes for pretty
> > much the entire kernel, and it opens up namespace issues like this
> > that never previously existed.
> > 
> > The patch below (only tested for !CONFIG_KEYS && !CONFIG_SECURITY)
> > moves security.h into the mmap.c and nommu.c files that need it so
> > it doesn't end up with kernel wide scope.
> > 
> > Comments?
> 
> The idea with this placement was to keep memory management code with other 
> similar code, rather than pushing it into security.h, where it does not 
> functionally belong.
> 
> Something to not also is that you can't "depend" on security.h not being 
> included all over the place, as LSM does touch a lot of the kernel.  
> Unecessarily including it is bad, of course.

Which is what including it in mm.h does. It also pulls in a lot of
other header files, as has already been noted.

> I'm not sure I understand your namespace pollution issue, either.

doing this globally:

#ifdef CONFIG_SOMETHING
extern int  some_common_name(int a, int b, int c);
#else
#define some_common_name(a,b,c) 0
#endif

means that no-one can use some_common_name *anywhere* in the kernel.
In this case, I have a completely *private* use of some_common_name
and now I can't use that because the wonderful define above
now has effectively global scope because it gets included from key.h
via security.h via mm.h.
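
A standalone sketch of that kind of collision - the btree_ops struct and
my_copy_keys helper are hypothetical, and the stub macro merely mirrors the
sort of !CONFIG_KEYS definition the conflict came from:

/* Illustration only: a private method pointer whose name collides with a
 * kernel-wide function-like macro stub pulled in via indirect includes. */
#include <stdio.h>

#define copy_keys(t, f) 0       /* stub with effectively global scope */

struct btree_ops {
        int (*copy_keys)(void *dst, void *src); /* declaration still compiles */
};

static int my_copy_keys(void *dst, void *src)
{
        (void)dst; (void)src;
        return 1;
}

int main(void)
{
        struct btree_ops ops = { .copy_keys = my_copy_keys };

        /* A call site written as ops.copy_keys(dst, src) is seen by the
         * preprocessor as the function-like macro and expands to "ops.0",
         * which does not compile - hence the conflict warning. */
        printf("method set: %d\n", ops.copy_keys != NULL);
        return 0;
}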

> In any case, I think the right solution is not to include security.h at 
> all in mm.h, as it is only being done to get a declaration for 
> mmap_min_addr.
> 
> How about this, instead ?
> 
> Signed-off-by: James Morris <[EMAIL PROTECTED]>
> ---
> 
>  mm.h |5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1b7b95c..02fbac7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -12,7 +12,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  
>  struct mempolicy;
>  struct anon_vma;
> @@ -34,6 +33,10 @@ extern int sysctl_legacy_va_layout;
>  #define sysctl_legacy_va_layout 0
>  #endif
>  
> +#ifdef CONFIG_SECURITY
> +extern unsigned long mmap_min_addr;
> +#endif
> +
>  #include 
>  #include 
>  #include 

Fine by me.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Important regression with XFS update for 2.6.24-rc6

2007-12-19 Thread David Chinner
On Wed, Dec 19, 2007 at 12:17:30PM +0100, Damien Wyart wrote:
> * David Chinner <[EMAIL PROTECTED]> [071219 11:45]:
> > Can someone pass me a brown paper bag, please?
> 
> My first impression on this bug was not so wrong, after all ;-)
> 
> > That also explains why we haven't seen it - it requires the user buffer to
> > fill on the first entry of a backing buffer and so it is largely dependent
> > on the pattern of name lengths, page size and filesystem block size
> > aligning just right to trigger the problem.
> 
> I guess I was lucky to trigger it quite easily...
> 
> > Can you test this patch, Damien?
> 
> Works fine, all the bad symptoms have disappeared and strace output is
> normal.
> 
> So you can add:
> 
> Tested-by: Damien Wyart <[EMAIL PROTECTED]>

Thanks for reporting the bug and testing the fix so quickly, Damien.
I'll give it some more QA before I push it, though.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Important regression with XFS update for 2.6.24-rc6

2007-12-19 Thread David Chinner
On Wed, Dec 19, 2007 at 02:19:47AM +1100, David Chinner wrote:
> On Tue, Dec 18, 2007 at 03:30:31PM +0100, Damien Wyart wrote:
> > * David Chinner <[EMAIL PROTECTED]> [071218 13:24]:
> > > Ok. I haven't noticed anything wrong with directories up to about
> > > 250,000 files in the last few days. The ls -l I just did on
> > > a directory with 15000 entries (btree format) used about 5MB of RAM.
> > > extent format directories appear to work fine as well (tested 500
> > > entries).
> > 
> > Ok, nice to know the problem is not so frequent.
> 
> .
> 
> > I have put the files at http://damien.wyart.free.fr/xfs/
> > 
> > strace_xfs_problem.1.gz and strace_xfs_problem.2.gz have been created
> > with the problematic kernel, and are quite bigger than
> > strace_xfs_problem.normal.gz, which has been created with the vanilla
> > rc5-git5. There is also xfs_info.
> 
> Looks like several getdents() through the directory the getdents()
> call starts outputting the first files again. It gets to a certain
> point and always goes back to the beginning. However, it appears to
> get to the end eventually (without ever getting past the bad offset).

UML and a bunch of printk's to the rescue.

So we went back to double buffering, which then screwed up the d_off
of the dirents. I changed the temporary dirents to point to the current
offset so that filldir got what it expected when filling the user buffer.

Except it appears that I didn't initialise the current
offset for the first dirent read from the temporary buffer so filldir
occasionally got an uninitialised offset. Can someone pass me a
brown paper bag, please?

In my local testing, more often than not, that uninitialised offset
reads as zero which is where the looping comes from. Sometimes it
points off into wacko-land, which is probably how we eventually get
the looping terminating before you run out of memory.

That also explains why we haven't seen it - it requires the user buffer
to fill on the first entry of a backing buffer and so it is largely
dependent on the pattern of name lengths, page size and filesystem
block size aligning just right to trigger the problem.

Can you test this patch, Damien?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

---
 fs/xfs/linux-2.6/xfs_file.c |1 +
 1 file changed, 1 insertion(+)

Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_file.c
===
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_file.c  2007-12-19 
00:26:40.0 +1100
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_file.c   2007-12-19 21:26:38.701143555 
+1100
@@ -348,6 +348,7 @@ xfs_file_readdir(
 
size = buf.used;
de = (struct hack_dirent *)buf.dirent;
+   curr_offset = de->offset /* & 0x7fff */;
while (size > 0) {
if (filldir(dirent, de->name, de->namlen,
curr_offset & 0x7fff,
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch, rfc] mm.h, security.h, key.h and preventing namespace poisoning

2007-12-19 Thread David Chinner
Folks,

I just updated a git tree and started getting errors on a
"copy_keys" macro warning.

The code I've been working on uses a ->copy_keys() method for
copying the keys in a btree block from one place to another. I've
been working on this code for a while
(http://oss.sgi.com/archives/xfs/2007-11/msg00046.html) and keep the
tree I'm working in reletively up to date (lags linus by a couple of
weeks at most). The update I did this afternoon gave a conflict
warning with the macro in include/linux/key.h.

Given that I'm not directly including key.h anywhere in the XFS
code, I'm getting the namespace polluted indirectly from some other
include that is necessary.

As it turns out, this commit from 13 days ago:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7cd94146cd504016315608e297219f9fb7b1413b

included security.h in mm.h and that is how I'm seeing the namespace
poisoning coming from key.h when !CONFIG_KEY.

Including security.h in mm.h means much wider includes for pretty
much the entire kernel, and it opens up namespace issues like this
that never previously existed.

The patch below (only tested for !CONFIG_KEYS && !CONFIG_SECURITY)
moves security.h into the mmap.c and nommu.c files that need it so
it doesn't end up with kernel wide scope.

Comments?

---
 include/linux/mm.h   |   16 
 include/linux/security.h |   26 +-
 mm/mmap.c|1 +
 mm/nommu.c   |1 +
 4 files changed, 27 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1b7b95c..520238c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -12,7 +12,6 @@
 #include 
 #include 
 #include 
-#include 
 
 struct mempolicy;
 struct anon_vma;
@@ -514,21 +513,6 @@ static inline void set_page_links(struct page *page, enum 
zone_type zone,
 }
 
 /*
- * If a hint addr is less than mmap_min_addr change hint to be as
- * low as possible but still greater than mmap_min_addr
- */
-static inline unsigned long round_hint_to_min(unsigned long hint)
-{
-#ifdef CONFIG_SECURITY
-   hint &= PAGE_MASK;
-   if (((void *)hint != NULL) &&
-   (hint < mmap_min_addr))
-   return PAGE_ALIGN(mmap_min_addr);
-#endif
-   return hint;
-}
-
-/*
  * Some inline functions in vmstat.h depend on page_zone()
  */
 #include 
diff --git a/include/linux/security.h b/include/linux/security.h
index ac05083..e9ba391 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -2568,6 +2568,19 @@ void security_key_free(struct key *key);
 int security_key_permission(key_ref_t key_ref,
struct task_struct *context, key_perm_t perm);
 
+/*
+ * If a hint addr is less than mmap_min_addr change hint to be as
+ * low as possible but still greater than mmap_min_addr
+ */
+static inline unsigned long round_hint_to_min(unsigned long hint)
+{
+   hint &= PAGE_MASK;
+   if (((void *)hint != NULL) &&
+   (hint < mmap_min_addr))
+   return PAGE_ALIGN(mmap_min_addr);
+   return hint;
+}
+
 #else
 
 static inline int security_key_alloc(struct key *key,
@@ -2588,7 +2601,18 @@ static inline int security_key_permission(key_ref_t 
key_ref,
return 0;
 }
 
-#endif
+static inline unsigned long round_hint_to_min(unsigned long hint)
+{
+   return hint;
+}
+
+#endif /* CONFIG_SECURITY */
+
+#else /* !CONFIG_KEYS */
+static inline unsigned long round_hint_to_min(unsigned long hint)
+{
+   return hint;
+}
 #endif /* CONFIG_KEYS */
 
 #endif /* ! __LINUX_SECURITY_H */
diff --git a/mm/mmap.c b/mm/mmap.c
index 15678aa..0d666de 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
diff --git a/mm/nommu.c b/mm/nommu.c
index b989cb9..99702d1 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[ia64] BUG: sleeping in atomic

2007-12-18 Thread David Chinner
Just saw this again:

[ 5667.086055] BUG: sleeping function called from invalid context at 
kernel/fork.c:401
[ 5667.087314] in_atomic():1, irqs_disabled():0
[ 5667.088210]
[ 5667.088212] Call Trace:
[ 5667.089104]  [] show_stack+0x80/0xa0
[ 5667.089106] sp=e038f6507a00 
bsp=e038f6500f48
[ 5667.116025]  [] dump_stack+0x30/0x60
[ 5667.116028] sp=e038f6507bd0 
bsp=e038f6500f30
[ 5667.118317]  [] __might_sleep+0x1e0/0x200
[ 5667.118320] sp=e038f6507bd0 
bsp=e038f6500f08
[ 5667.142316]  [] mmput+0x20/0x220
[ 5667.142319] sp=e038f6507bd0 
bsp=e038f6500ee0
[ 5667.164201]  [] sys_ptrace+0x460/0x15c0
[ 5667.164203] sp=e038f6507bd0 
bsp=e038f6500dd0
[ 5667.175488]  [] ia64_ret_from_syscall+0x0/0x20
[ 5667.175490] sp=e038f6507e30 
bsp=e038f6500dd0
[ 5667.199324]  [] __kernel_syscall_via_break+0x0/0x20
[ 5667.199327] sp=e038f6508000 
bsp=e038f6500dd0
[ 5682.626704] BUG: sleeping function called from invalid context at 
kernel/fork.c:401

When stracing a process on 2.6.24-rc3 on ia64. commandline to reproduce:

# strace ls -l /mnt/scratch/test

ISTR reporting this some time ago (maybe 2.6.22?)

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] XFS update for 2.6.24-rc6

2007-12-18 Thread David Chinner
On Tue, Dec 18, 2007 at 05:19:04PM -0800, Linus Torvalds wrote:
> 
> 
> On Wed, 19 Dec 2007, David Chinner wrote:
> >
> > On Tue, Dec 18, 2007 at 05:59:11PM +1100, Lachlan McIlroy wrote:
> > > Please pull from the for-linus branch:
> > > git pull git://oss.sgi.com:8090/xfs/xfs-2.6.git for-linus
> > 
> > Linus, please don't pull this yet. A problem has been found in
> > the dirent fix, and we've just fixed another mknod related regression
> > so we've got another couple of fixes still to go in XFS for 2.6.24.
> 
> Too late, it's already long since pulled.

Ok, we'll just have to get the fixes to you ASAP.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] XFS update for 2.6.24-rc6

2007-12-18 Thread David Chinner
On Tue, Dec 18, 2007 at 05:59:11PM +1100, Lachlan McIlroy wrote:
> Please pull from the for-linus branch:
> git pull git://oss.sgi.com:8090/xfs/xfs-2.6.git for-linus

Linus, please don't pull this yet. A problem has been found in
the dirent fix, and we've just fixed another mknod related regression
so we've got another couple of fixes still to go in XFS for 2.6.24.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

