Re: dirty balancing deadlock

2007-02-22 Thread Miklos Szeredi
> > On Thu, 22 Feb 2007 08:42:26 +0100 Miklos Szeredi <[EMAIL PROTECTED]> wrote:
> > > > 
> > > > Index: linux/mm/page-writeback.c
> > > > ===
> > > > --- linux.orig/mm/page-writeback.c  2007-02-19 17:32:41.0 +0100
> > > > +++ linux/mm/page-writeback.c   2007-02-19 18:05:28.0 +0100
> > > > @@ -198,6 +198,25 @@ static void balance_dirty_pages(struct a
> > > > dirty_thresh)
> > > > break;
> > > >  
> > > > +   /*
> > > > +* Acquit this producer if there's little or nothing
> > > > +* to write back to this particular queue
> > > > +*
> > > > +* Without this check a deadlock is possible in the
> > > > +* following case:
> > > > +*
> > > > +* - filesystem A writes data through filesystem B
> > > > +* - filesystem A has dirty pages over dirty_thresh
> > > > +* - writeback is started, this triggers a write in B
> > > > +* - balance_dirty_pages() is called synchronously
> > > > +* - the write to B blocks
> > > > +* - the writeback completes, but dirty is still over threshold
> > > > +* - the blocking write prevents further writes from happening
> > > > +*/
> > > > +   if (atomic_long_read(&bdi->nr_dirty) +
> > > > +   atomic_long_read(&bdi->nr_writeback) < 16)
> > > > +   break;
> > > > +
> > > 
> > > The problem seems to be that little "- the write to B blocks".
> > > 
> > > How come it blocks?  I mean, if we cannot retire writes to that filesystem
> > > then we're screwed anyway.
> > 
> > Sorry about the sloppy description.  I mean, it's not the low-level
> > write that will block, but rather the VFS one
> > (generic_file_aio_write).  It will block (or rather loop forever with
> > 0.1 second sleeps) in balance_dirty_pages().  That means that for
> > this inode, i_mutex is held and no other writer can continue the work.
> 
> "this inode" I assume is the inode against filesystem A?

No, the one in B.

> Why does holding that inode's i_mutex prevent further writeback of
> pages in A?

It is generic_file_aio_write() that is holding the mutex.

Here's the stack for the filesystem daemon trying to write back a page:

08dcfb40:  [<08182fe6>] schedule+0x246/0x547
08dcfb98:  [<08183a03>] schedule_timeout+0x4e/0xb6
08dcfbcc:  [<08183991>] io_schedule_timeout+0x11/0x20
08dcfbd4:  [<080a0cf2>] congestion_wait+0x72/0x87
08dcfc04:  [<0809c693>] balance_dirty_pages+0xa8/0x153
08dcfc5c:  [<0809c7bf>] balance_dirty_pages_ratelimited_nr+0x43/0x45
08dcfc68:  [<080992b5>] generic_file_buffered_write+0x3e3/0x6f5
08dcfd20:  [<0809988e>] __generic_file_aio_write_nolock+0x2c7/0x5dd
08dcfda8:  [<08099cb6>] generic_file_aio_write+0x55/0xc7
08dcfddc:  [<080ea1e6>] ext3_file_write+0x39/0xaf
08dcfe04:  [<080b060b>] do_sync_write+0xd8/0x10e
08dcfebc:  [<080b06e3>] vfs_write+0xa2/0x1cb
08dcfeec:  [<080b09b8>] sys_pwrite64+0x65/0x69

Miklos
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dirty balancing deadlock

2007-02-21 Thread Andrew Morton
> On Thu, 22 Feb 2007 08:42:26 +0100 Miklos Szeredi <[EMAIL PROTECTED]> wrote:
> > > 
> > > Index: linux/mm/page-writeback.c
> > > ===
> > > --- linux.orig/mm/page-writeback.c 2007-02-19 17:32:41.0 +0100
> > > +++ linux/mm/page-writeback.c 2007-02-19 18:05:28.0 +0100
> > > @@ -198,6 +198,25 @@ static void balance_dirty_pages(struct a
> > >   dirty_thresh)
> > >   break;
> > >  
> > > + /*
> > > +  * Acquit this producer if there's little or nothing
> > > +  * to write back to this particular queue
> > > +  *
> > > +  * Without this check a deadlock is possible in the
> > > +  * following case:
> > > +  *
> > > +  * - filesystem A writes data through filesystem B
> > > +  * - filesystem A has dirty pages over dirty_thresh
> > > +  * - writeback is started, this triggers a write in B
> > > +  * - balance_dirty_pages() is called synchronously
> > > +  * - the write to B blocks
> > > +  * - the writeback completes, but dirty is still over threshold
> > > +  * - the blocking write prevents further writes from happening
> > > +  */
> > > + if (atomic_long_read(&bdi->nr_dirty) +
> > > + atomic_long_read(&bdi->nr_writeback) < 16)
> > > + break;
> > > +
> > 
> > The problem seems to be that little "- the write to B blocks".
> > 
> > How come it blocks?  I mean, if we cannot retire writes to that filesystem
> > then we're screwed anyway.
> 
> Sorry about the sloppy description.  I mean, it's not the low-level
> write that will block, but rather the VFS one
> (generic_file_aio_write).  It will block (or rather loop forever with
> 0.1 second sleeps) in balance_dirty_pages().  That means that for
> this inode, i_mutex is held and no other writer can continue the work.

"this inode" I assume is the inode against filesystem A?

Why does holding that inode's i_mutex prevent further writeback of pages in A?




Re: dirty balancing deadlock

2007-02-21 Thread Miklos Szeredi
> > How about this?
> 
> I still don't understand this bug.
> 
> > Solves the FUSE deadlock, but not the throttle_vm_writeout() one.
> > I'll try to tackle that one as well.
> > 
> > If the per-bdi dirty counter goes below 16, balance_dirty_pages()
> > returns.
> > 
> > Does the constant need to be tunable?  If it's too large, then the global
> > threshold is more easily exceeded.  If it's too small, then in a tight
> > situation progress will be slower.
> > 
> > Thanks,
> > Miklos
> > 
> > Index: linux/mm/page-writeback.c
> > ===
> > --- linux.orig/mm/page-writeback.c  2007-02-19 17:32:41.0 +0100
> > +++ linux/mm/page-writeback.c   2007-02-19 18:05:28.0 +0100
> > @@ -198,6 +198,25 @@ static void balance_dirty_pages(struct a
> > dirty_thresh)
> > break;
> >  
> > +   /*
> > +* Acquit this producer if there's little or nothing
> > +* to write back to this particular queue
> > +*
> > +* Without this check a deadlock is possible in the
> > +* following case:
> > +*
> > +* - filesystem A writes data through filesystem B
> > +* - filesystem A has dirty pages over dirty_thresh
> > +* - writeback is started, this triggers a write in B
> > +* - balance_dirty_pages() is called synchronously
> > +* - the write to B blocks
> > +* - the writeback completes, but dirty is still over threshold
> > +* - the blocking write prevents further writes from happening
> > +*/
> > +   if (atomic_long_read(&bdi->nr_dirty) +
> > +   atomic_long_read(&bdi->nr_writeback) < 16)
> > +   break;
> > +
> 
> The problem seems to be that little "- the write to B blocks".
> 
> How come it blocks?  I mean, if we cannot retire writes to that filesystem
> then we're screwed anyway.

Sorry about the sloppy description.  I mean, it's not the low-level
write that will block, but rather the VFS one
(generic_file_aio_write).  It will block (or rather loop forever with
0.1 second sleeps) in balance_dirty_pages().  That means that for
this inode, i_mutex is held and no other writer can continue the work.

Miklos


Re: dirty balancing deadlock

2007-02-21 Thread Andrew Morton
On Mon, 19 Feb 2007 18:11:55 +0100
Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> How about this?

I still don't understand this bug.

> Solves the FUSE deadlock, but not the throttle_vm_writeout() one.
> I'll try to tackle that one as well.
> 
> If the per-bdi dirty counter goes below 16, balance_dirty_pages()
> returns.
> 
> Does the constant need to be tunable?  If it's too large, then the global
> threshold is more easily exceeded.  If it's too small, then in a tight
> situation progress will be slower.
> 
> Thanks,
> Miklos
> 
> Index: linux/mm/page-writeback.c
> ===
> --- linux.orig/mm/page-writeback.c 2007-02-19 17:32:41.0 +0100
> +++ linux/mm/page-writeback.c 2007-02-19 18:05:28.0 +0100
> @@ -198,6 +198,25 @@ static void balance_dirty_pages(struct a
>   dirty_thresh)
>   break;
>  
> + /*
> +  * Acquit this producer if there's little or nothing
> +  * to write back to this particular queue
> +  *
> +  * Without this check a deadlock is possible in the
> +  * following case:
> +  *
> +  * - filesystem A writes data through filesystem B
> +  * - filesystem A has dirty pages over dirty_thresh
> +  * - writeback is started, this triggers a write in B
> +  * - balance_dirty_pages() is called synchronously
> +  * - the write to B blocks
> +  * - the writeback completes, but dirty is still over threshold
> +  * - the blocking write prevents further writes from happening
> +  */
> + if (atomic_long_read(&bdi->nr_dirty) +
> + atomic_long_read(&bdi->nr_writeback) < 16)
> + break;
> +

The problem seems to be that little "- the write to B blocks".

How come it blocks?  I mean, if we cannot retire writes to that filesystem
then we're screwed anyway.

Anyway, I think I'll think about this issue a little later on.  You might
as well prepare full changelogs for your proposed changes, because we'll be
needing them anyway.



Re: dirty balancing deadlock

2007-02-20 Thread Chris Mason
On Tue, Feb 20, 2007 at 09:47:11AM +0100, Miklos Szeredi wrote:
> > > How about this?
> > > 
> > > Solves the FUSE deadlock, but not the throttle_vm_writeout() one.
> > > I'll try to tackle that one as well.
> > > 
> > > If the per-bdi dirty counter goes below 16, balance_dirty_pages()
> > > returns.
> > > 
> > > > Does the constant need to be tunable?  If it's too large, then the global
> > > threshold is more easily exceeded.  If it's too small, then in a tight
> > > situation progress will be slower.
> > 
> > Ok, what is supposed to happen here is that filesystems are supposed to
> > be throttled from making more dirty pages when the system is over the
> > threshold.  Even if filesystem A doesn't have much to contribute, and
> > filesystem B is the cause of 99% of the dirty pages, the goal of the
> > threshold is to prevent more dirty data from happening, and filesystem A
> > should block.
> 
> Which is the cause of the current deadlock.  But if we allow
> filesystem A to go into the red just a little, the deadlock is
> avoided, because it can continue to make progress with cleaning the
> dirtiness produced by B.
> 
> The maximum that filesystems can go over the limit will be
> 
>   (16 + epsilon) * number-of-queues

Right, even for thousands of mounted filesystems ~16 pages per FS
effectively pinned is not horrible.

-chris


Re: dirty balancing deadlock

2007-02-20 Thread Miklos Szeredi
> > > > > In general, writepage is supposed to do work without blocking on
> > > > > expensive locks that will get pdflush and dirty reclaim stuck in this
> > > > > fashion.  You'll probably have to take the same approach reiserfs does
> > > > > in data=journal mode, which is leaving the page dirty if fuse_get_req_wp
> > > > > is going to block without making progress.
> > > > 
> > > > Pdflush, and dirty reclaim set wbc->nonblocking to true.
> > > > balance_dirty_pages and fsync don't.  The problem here is that
> > > > Andrew's patch is wrong to let balance_dirty_pages() try to write back
> > > > pages from a different queue.
> > > 
> > > async or sync, writepage is supposed to either make progress or bail.
> > > loopback aside, if the fuse call is blocking long term, you're going to
> > > run into problems.
> > 
> > Hmm, like what?
> 
> Something a little different from what you're seeing.  Basically if the
> PF_MEMALLOC paths end up waiting on a filesystem transaction, and that
> transaction is waiting for more ram, the system will eventually grind to
> a halt.  data=journal is the easiest way to hit it, since writepage
> always logs at least 4k.
> 
> WB_SYNC_NONE and wbc->nonblocking aren't a great test, in reiser I
> resorted to testing PF_MEMALLOC.

I'm not pretending to understand how journaling filesystems work, but
this shouldn't be an issue with fuse.  Can you show me a call path,
where PF_MEMALLOC is set and .nonblocking is not?

Thanks,
Miklos


Re: dirty balancing deadlock

2007-02-20 Thread Miklos Szeredi
> > How about this?
> > 
> > Solves the FUSE deadlock, but not the throttle_vm_writeout() one.
> > I'll try to tackle that one as well.
> > 
> > If the per-bdi dirty counter goes below 16, balance_dirty_pages()
> > returns.
> > 
> > Does the constant need to be tunable?  If it's too large, then the global
> > threshold is more easily exceeded.  If it's too small, then in a tight
> > situation progress will be slower.
> 
> Ok, what is supposed to happen here is that filesystems are supposed to
> be throttled from making more dirty pages when the system is over the
> threshold.  Even if filesystem A doesn't have much to contribute, and
> filesystem B is the cause of 99% of the dirty pages, the goal of the
> threshold is to prevent more dirty data from happening, and filesystem A
> should block.

Which is the cause of the current deadlock.  But if we allow
filesystem A to go into the red just a little, the deadlock is
avoided, because it can continue to make progress with cleaning the
dirtiness produced by B.

The maximum that filesystems can go over the limit will be

  (16 + epsilon) * number-of-queues

This is usually insignificant compared to the limit itself (~2000
pages on a machine with 32MB).

However with thousands of fuse mounts this may become a problem, as
each filesystem gets a separate queue.  In theory, just 2 pages are
enough to always make progress, but current dirty balancing can't
enforce this, as the ratelimit is at least 8 pages.

So there may have to be some more strict page accounting within fuse
itself, but that doesn't change the overall concept I think.
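
To put numbers on the bound above, here is a small worked sketch. The function names are hypothetical, and taking epsilon to be the 8-page ratelimit is an assumption made for illustration; the thread only says the ratelimit is at least 8 pages.

```c
#include <assert.h>

/* Worst-case pages over dirty_thresh: (16 + epsilon) * number-of-queues. */
static long max_overshoot_pages(long nr_queues, long epsilon)
{
    assert(nr_queues >= 0 && epsilon >= 0);
    return (16 + epsilon) * nr_queues;
}

/* The same overshoot as a whole-number percentage of the threshold. */
static long overshoot_percent(long nr_queues, long epsilon, long dirty_thresh)
{
    assert(dirty_thresh > 0);
    return 100 * max_overshoot_pages(nr_queues, epsilon) / dirty_thresh;
}
```

With 4 queues and an epsilon of 8 the bound is 96 pages, under 5% of a 2000-page threshold. With 1000 queues it is 24000 pages, twelve times that threshold, which is why the many-fuse-mounts case needs stricter accounting.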

Miklos


Re: dirty balancing deadlock

2007-02-19 Thread Chris Mason
On Mon, Feb 19, 2007 at 02:14:15AM +0100, Miklos Szeredi wrote:
> > > > In general, writepage is supposed to do work without blocking on
> > > > expensive locks that will get pdflush and dirty reclaim stuck in this
> > > > fashion.  You'll probably have to take the same approach reiserfs does
> > > > in data=journal mode, which is leaving the page dirty if fuse_get_req_wp
> > > > is going to block without making progress.
> > > 
> > > Pdflush, and dirty reclaim set wbc->nonblocking to true.
> > > balance_dirty_pages and fsync don't.  The problem here is that
> > > Andrew's patch is wrong to let balance_dirty_pages() try to write back
> > > pages from a different queue.
> > 
> > async or sync, writepage is supposed to either make progress or bail.
> > loopback aside, if the fuse call is blocking long term, you're going to
> > run into problems.
> 
> Hmm, like what?

Something a little different from what you're seeing.  Basically if the
PF_MEMALLOC paths end up waiting on a filesystem transaction, and that
transaction is waiting for more ram, the system will eventually grind to
a halt.  data=journal is the easiest way to hit it, since writepage
always logs at least 4k.

WB_SYNC_NONE and wbc->nonblocking aren't a great test, in reiser I
resorted to testing PF_MEMALLOC.

-chris



Re: dirty balancing deadlock

2007-02-19 Thread Chris Mason
On Mon, Feb 19, 2007 at 06:11:55PM +0100, Miklos Szeredi wrote:
> How about this?
> 
> Solves the FUSE deadlock, but not the throttle_vm_writeout() one.
> I'll try to tackle that one as well.
> 
> If the per-bdi dirty counter goes below 16, balance_dirty_pages()
> returns.
> 
> Does the constant need to be tunable?  If it's too large, then the global
> threshold is more easily exceeded.  If it's too small, then in a tight
> situation progress will be slower.

Ok, what is supposed to happen here is that filesystems are supposed to
be throttled from making more dirty pages when the system is over the
threshold.  Even if filesystem A doesn't have much to contribute, and
filesystem B is the cause of 99% of the dirty pages, the goal of the
threshold is to prevent more dirty data from happening, and filesystem A
should block.

But, with the producer consumer setup of fuse, I think this is a pretty
good compromise.  16 dirty/writeback pages shouldn't hurt the overall
limits too badly.

-chris



Re: dirty balancing deadlock

2007-02-19 Thread Miklos Szeredi
> Solves the FUSE deadlock, but not the throttle_vm_writeout() one.
> I'll try to tackle that one as well.
> 
> If the per-bdi dirty counter goes below 16, balance_dirty_pages()
> returns.
> 
> Does the constant need to be tunable?  If it's too large, then the global
> threshold is more easily exceeded.  If it's too small, then in a tight
> situation progress will be slower.

Similar in spirit, this should solve the deadlock on throttle_vm_writeout().
Totally untested.

Does this approach look workable?

Thanks,
Miklos


Index: linux/include/linux/swap.h
===
--- linux.orig/include/linux/swap.h 2007-02-19 23:39:36.0 +0100
+++ linux/include/linux/swap.h  2007-02-20 00:03:38.0 +0100
@@ -277,10 +277,14 @@ static inline void disable_swap_token(vo
put_swap_token(swap_token_mm);
 }
 
+#define nr_swap_writeback \
+   atomic_long_read(&swapper_space.backing_dev_info->nr_writeback)
+
 #else /* CONFIG_SWAP */
 
 #define total_swap_pages   0
 #define total_swapcache_pages  0UL
+#define nr_swap_writeback  0UL
 
 #define si_swapinfo(val) \
do { (val)->freeswap = (val)->totalswap = 0; } while (0)
Index: linux/mm/page-writeback.c
===
--- linux.orig/mm/page-writeback.c  2007-02-19 23:43:03.0 +0100
+++ linux/mm/page-writeback.c   2007-02-20 00:03:49.0 +0100
@@ -33,6 +33,7 @@
 #include <linux/syscalls.h>
 #include <linux/buffer_head.h>
 #include <linux/pagevec.h>
+#include <linux/swap.h>
 
 /*
  * The maximum number of pages to writeout in a single bdflush/kupdate
@@ -332,6 +333,9 @@ void throttle_vm_writeout(void)
 if (global_page_state(NR_UNSTABLE_NFS) +
global_page_state(NR_WRITEBACK) <= dirty_thresh)
break;
+
+   if (nr_swap_writeback < 16)
+   break;
 congestion_wait(WRITE, HZ/10);
 }
 }
Index: linux/mm/page_io.c
===
--- linux.orig/mm/page_io.c 2007-02-19 23:24:23.0 +0100
+++ linux/mm/page_io.c  2007-02-19 23:42:21.0 +0100
@@ -70,6 +70,7 @@ static int end_swap_bio_write(struct bio
ClearPageReclaim(page);
}
end_page_writeback(page);
+   atomic_long_dec(&swapper_space.backing_dev_info->nr_writeback);
bio_put(bio);
return 0;
 }
@@ -121,6 +122,7 @@ int swap_writepage(struct page *page, st
if (wbc->sync_mode == WB_SYNC_ALL)
rw |= (1 << BIO_RW_SYNC);
count_vm_event(PSWPOUT);
+   atomic_long_inc(&swapper_space.backing_dev_info->nr_writeback);
set_page_writeback(page);
unlock_page(page);
submit_bio(rw, bio);


Re: dirty balancing deadlock

2007-02-19 Thread Miklos Szeredi
How about this?

Solves the FUSE deadlock, but not the throttle_vm_writeout() one.
I'll try to tackle that one as well.

If the per-bdi dirty counter goes below 16, balance_dirty_pages()
returns.

Does the constant need to be tunable?  If it's too large, then the global
threshold is more easily exceeded.  If it's too small, then in a tight
situation progress will be slower.

Thanks,
Miklos

Index: linux/mm/page-writeback.c
===
--- linux.orig/mm/page-writeback.c  2007-02-19 17:32:41.0 +0100
+++ linux/mm/page-writeback.c   2007-02-19 18:05:28.0 +0100
@@ -198,6 +198,25 @@ static void balance_dirty_pages(struct a
dirty_thresh)
break;
 
+   /*
+* Acquit this producer if there's little or nothing
+* to write back to this particular queue
+*
+* Without this check a deadlock is possible in the
+* following case:
+*
+* - filesystem A writes data through filesystem B
+* - filesystem A has dirty pages over dirty_thresh
+* - writeback is started, this triggers a write in B
+* - balance_dirty_pages() is called synchronously
+* - the write to B blocks
+* - the writeback completes, but dirty is still over threshold
+* - the blocking write prevents further writes from happening
+*/
+   if (atomic_long_read(&bdi->nr_dirty) +
+   atomic_long_read(&bdi->nr_writeback) < 16)
+   break;
+
if (!dirty_exceeded)
dirty_exceeded = 1;
 


Re: dirty balancing deadlock

2007-02-18 Thread Miklos Szeredi
> > > In general, writepage is supposed to do work without blocking on
> > > expensive locks that will get pdflush and dirty reclaim stuck in this
> > > fashion.  You'll probably have to take the same approach reiserfs does
> > > in data=journal mode, which is leaving the page dirty if fuse_get_req_wp
> > > is going to block without making progress.
> > 
> > Pdflush, and dirty reclaim set wbc->nonblocking to true.
> > balance_dirty_pages and fsync don't.  The problem here is that
> > Andrew's patch is wrong to let balance_dirty_pages() try to write back
> > pages from a different queue.
> 
> async or sync, writepage is supposed to either make progress or bail.
> loopback aside, if the fuse call is blocking long term, you're going to
> run into problems.

Hmm, like what?

Thanks,
Miklos


Re: dirty balancing deadlock

2007-02-18 Thread Chris Mason
On Mon, Feb 19, 2007 at 01:54:31AM +0100, Miklos Szeredi wrote:
> > > > > > If so, writes to B will decrease the dirty memory threshold.
> > > > > 
> > > > > Yes, but not by enough.  Say A dirties a 1100 pages, limit is 1000.
> > > > > Some pages queued for writeback (doesn't matter how much).  B writes
> > > > > back 1, 1099 dirty remain in A, zero in B.  balance_dirty_pages() for
> > > > > B doesn't know that there's nothing more to write back for B, it's
> > > > > just waiting there for those 1099, which'll never get written.
> > > > 
> > > > hm, OK, arguable.  I guess something like this..
> > > 
> > > Doesn't help the fuse case, but does seem to help the loopback mount
> > > one.
> > > 
> > > For fuse it's worse with the patch: now the write triggered by the
> > > balance recurses into fuse, with disastrous results, since the fuse
> > > writeback is now blocked on the userspace queue.
> > > 
> > > fusexmp_fh_no D 40136678 0   505494   506   504 (NOTLB)
> > > 08982b78 0001  08f9f9b4 0805d8cb 089a75f8 08982b78 08f98000
> > >    08f98000 08f9f9dc 0805a38a 089a7100 08982680 08f9f9cc 08f98000 08f98000
> > >    085d8300 08982680 089a7100 08f9fa34 08183006 089a7100 08982680 089a7100
> > > Call Trace:
> > > 08f9f9a0:  [<0805d8cb>] switch_to_skas+0x3b/0x83
> > > 08f9f9b8:  [<0805a38a>] _switch_to+0x49/0x99
> > > 08f9f9e0:  [<08183006>] schedule+0x246/0x547
> > > 08f9fa38:  [<08103c7e>] fuse_get_req_wp+0xe9/0x14a
> > > 08f9fa70:  [<08103d2e>] fuse_writepage+0x4f/0x12c
> > 
> > In general, writepage is supposed to do work without blocking on
> > expensive locks that will get pdflush and dirty reclaim stuck in this
> > fashion.  You'll probably have to take the same approach reiserfs does
> > in data=journal mode, which is leaving the page dirty if fuse_get_req_wp
> > is going to block without making progress.
> 
> Pdflush, and dirty reclaim set wbc->nonblocking to true.
> balance_dirty_pages and fsync don't.  The problem here is that
> Andrew's patch is wrong to let balance_dirty_pages() try to write back
> pages from a different queue.

async or sync, writepage is supposed to either make progress or bail.
loopback aside, if the fuse call is blocking long term, you're going to
run into problems.

-chris


Re: dirty balancing deadlock

2007-02-18 Thread Miklos Szeredi
> > > > > If so, writes to B will decrease the dirty memory threshold.
> > > > 
> > > > Yes, but not by enough.  Say A dirties a 1100 pages, limit is 1000.
> > > > Some pages queued for writeback (doesn't matter how much).  B writes
> > > > back 1, 1099 dirty remain in A, zero in B.  balance_dirty_pages() for
> > > > B doesn't know that there's nothing more to write back for B, it's
> > > > just waiting there for those 1099, which'll never get written.
> > > 
> > > hm, OK, arguable.  I guess something like this..
> > 
> > Doesn't help the fuse case, but does seem to help the loopback mount
> > one.
> > 
> > For fuse it's worse with the patch: now the write triggered by the
> > balance recurses into fuse, with disastrous results, since the fuse
> > writeback is now blocked on the userspace queue.
> > 
> > fusexmp_fh_no D 40136678 0   505494   506   504 (NOTLB)
> > 08982b78 0001  08f9f9b4 0805d8cb 089a75f8 08982b78 08f98000
> >    08f98000 08f9f9dc 0805a38a 089a7100 08982680 08f9f9cc 08f98000 08f98000
> >    085d8300 08982680 089a7100 08f9fa34 08183006 089a7100 08982680 089a7100
> > Call Trace:
> > 08f9f9a0:  [<0805d8cb>] switch_to_skas+0x3b/0x83
> > 08f9f9b8:  [<0805a38a>] _switch_to+0x49/0x99
> > 08f9f9e0:  [<08183006>] schedule+0x246/0x547
> > 08f9fa38:  [<08103c7e>] fuse_get_req_wp+0xe9/0x14a
> > 08f9fa70:  [<08103d2e>] fuse_writepage+0x4f/0x12c
> 
> In general, writepage is supposed to do work without blocking on
> expensive locks that will get pdflush and dirty reclaim stuck in this
> fashion.  You'll probably have to take the same approach reiserfs does
> in data=journal mode, which is leaving the page dirty if fuse_get_req_wp
> is going to block without making progress.

Pdflush, and dirty reclaim set wbc->nonblocking to true.
balance_dirty_pages and fsync don't.  The problem here is that
Andrew's patch is wrong to let balance_dirty_pages() try to write back
pages from a different queue.

Miklos


Re: dirty balancing deadlock

2007-02-18 Thread Chris Mason
On Mon, Feb 19, 2007 at 01:25:21AM +0100, Miklos Szeredi wrote:
> > > > If so, writes to B will decrease the dirty memory threshold.
> > > 
> > > Yes, but not by enough.  Say A dirties a 1100 pages, limit is 1000.
> > > Some pages queued for writeback (doesn't matter how much).  B writes
> > > back 1, 1099 dirty remain in A, zero in B.  balance_dirty_pages() for
> > > B doesn't know that there's nothing more to write back for B, it's
> > > just waiting there for those 1099, which'll never get written.
> > 
> > hm, OK, arguable.  I guess something like this..
> 
> Doesn't help the fuse case, but does seem to help the loopback mount
> one.
> 
> For fuse it's worse with the patch: now the write triggered by the
> balance recurses into fuse, with disastrous results, since the fuse
> writeback is now blocked on the userspace queue.
> 
> fusexmp_fh_no D 40136678 0   505494   506   504 (NOTLB)
> 08982b78 0001  08f9f9b4 0805d8cb 089a75f8 08982b78 08f98000
>    08f98000 08f9f9dc 0805a38a 089a7100 08982680 08f9f9cc 08f98000 08f98000
>    085d8300 08982680 089a7100 08f9fa34 08183006 089a7100 08982680 089a7100
> Call Trace:
> 08f9f9a0:  [<0805d8cb>] switch_to_skas+0x3b/0x83
> 08f9f9b8:  [<0805a38a>] _switch_to+0x49/0x99
> 08f9f9e0:  [<08183006>] schedule+0x246/0x547
> 08f9fa38:  [<08103c7e>] fuse_get_req_wp+0xe9/0x14a
> 08f9fa70:  [<08103d2e>] fuse_writepage+0x4f/0x12c

In general, writepage is supposed to do work without blocking on
expensive locks that will get pdflush and dirty reclaim stuck in this
fashion.  You'll probably have to take the same approach reiserfs does
in data=journal mode, which is leaving the page dirty if fuse_get_req_wp
is going to block without making progress.

Queue it somewhere else (i.e. an internal fs cleaning thread) and leave
the page dirty so that we can move on to other pages that have a chance
of being cleaned.
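
The "make progress or bail" rule can be sketched in userspace as
follows.  This is only an illustration under assumptions: struct
model_queue, struct model_page and model_writepage_or_bail() are
hypothetical names, and the deferred counter stands in for handing the
page to an internal cleaning thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request queue with a fixed number of slots. */
struct model_queue {
	size_t in_flight;	/* requests currently submitted */
	size_t capacity;	/* slots available in total */
	size_t deferred;	/* pages left for a cleaning thread */
};

struct model_page {
	bool dirty;
	bool writeback;
};

/*
 * Either submit the page (progress) or leave it dirty and return
 * at once (bail).  The function never waits for a free slot.
 * Returns true if the page was submitted.
 */
bool model_writepage_or_bail(struct model_queue *q, struct model_page *page)
{
	assert(q != NULL && page != NULL);

	if (q->in_flight >= q->capacity) {
		page->dirty = true;	/* redirty so it is retried later */
		q->deferred++;
		return false;
	}

	page->dirty = false;
	page->writeback = true;
	q->in_flight++;
	return true;
}
```

The caller can then move on to other pages that have a chance of being
cleaned, instead of sleeping inside writepage.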

-chris


Re: dirty balancing deadlock

2007-02-18 Thread Miklos Szeredi
> > > > If so, writes to B will decrease the dirty memory threshold.
> > > 
> > > Yes, but not by enough.  Say A dirties a 1100 pages, limit is 1000.
> > > Some pages queued for writeback (doesn't matter how much).  B writes
> > > back 1, 1099 dirty remain in A, zero in B.  balance_dirty_pages() for
> > > B doesn't know that there's nothing more to write back for B, it's
> > > just waiting there for those 1099, which'll never get written.
> > 
> > hm, OK, arguable.  I guess something like this..
> 
> Doesn't help the fuse case, but does seem to help the loopback mount
> one.

No sorry, it doesn't even help the loopback deadlock.  It sometimes
takes quite a while to trigger...

Miklos


Re: dirty balancing deadlock

2007-02-18 Thread Miklos Szeredi
> --- a/fs/fs-writeback.c~a
> +++ a/fs/fs-writeback.c
> @@ -356,7 +356,7 @@ int generic_sync_sb_inodes(struct super_
>   continue;   /* Skip a congested blockdev */
>   }
>  
> - if (wbc->bdi && bdi != wbc->bdi) {
> + if (wbc->bdi && bdi != wbc->bdi && bdi_write_congested(bdi)) {
>   if (!sb_is_blkdev_sb(sb))
>   break;  /* fs has the wrong queue */
>   list_move(&inode->i_list, &sb->s_dirty);

Checking bdi_write_congested(bdi) is not reliable, since the queue can
become congested _after_ the check is done.
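
A deterministic userspace sketch of that check-then-act window.  The
names (struct model_bdi_state, model_congested,
model_check_then_submit) are hypothetical, and the "other writer" is
simulated by a parameter rather than by a real second thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical queue state: congested once queued reaches limit. */
struct model_bdi_state {
	int queued;
	int limit;
};

bool model_congested(const struct model_bdi_state *b)
{
	assert(b != NULL);
	return b->queued >= b->limit;
}

/*
 * Check for congestion, then let another writer add requests, then
 * act on the (now possibly stale) answer.
 * Returns true if the submission would block even though the check
 * said the queue was not congested.
 */
bool model_check_then_submit(struct model_bdi_state *b,
			     int other_writer_requests)
{
	bool congested_at_check = model_congested(b);

	/* The window: the queue fills up after the check. */
	b->queued += other_writer_requests;

	if (congested_at_check)
		return false;	/* writeback was skipped, nothing blocks */

	return model_congested(b);
}
```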

Miklos


Re: dirty balancing deadlock

2007-02-18 Thread Miklos Szeredi
> > > If so, writes to B will decrease the dirty memory threshold.
> > 
> > Yes, but not by enough.  Say A dirties a 1100 pages, limit is 1000.
> > Some pages queued for writeback (doesn't matter how much).  B writes
> > back 1, 1099 dirty remain in A, zero in B.  balance_dirty_pages() for
> > B doesn't know that there's nothing more to write back for B, it's
> > just waiting there for those 1099, which'll never get written.
> 
> hm, OK, arguable.  I guess something like this..

Doesn't help the fuse case, but does seem to help the loopback mount
one.

For fuse it's worse with the patch: now the write triggered by the
balance recurses into fuse, with disastrous results, since the fuse
writeback is now blocked on the userspace queue.

fusexmp_fh_no D 40136678 0   505494   506   504 (NOTLB)
08982b78 0001  08f9f9b4 0805d8cb 089a75f8 08982b78 08f98000
   08f98000 08f9f9dc 0805a38a 089a7100 08982680 08f9f9cc 08f98000 08f98000
   085d8300 08982680 089a7100 08f9fa34 08183006 089a7100 08982680 089a7100 
Call Trace:
08f9f9a0:  [<0805d8cb>] switch_to_skas+0x3b/0x83
08f9f9b8:  [<0805a38a>] _switch_to+0x49/0x99
08f9f9e0:  [<08183006>] schedule+0x246/0x547
08f9fa38:  [<08103c7e>] fuse_get_req_wp+0xe9/0x14a
08f9fa70:  [<08103d2e>] fuse_writepage+0x4f/0x12c
08f9faac:  [<0809ce3f>] __writepage+0x1e/0x3d
08f9fac0:  [<0809cd39>] write_cache_pages+0x222/0x30a
08f9fb44:  [<0809ce8d>] generic_writepages+0x2f/0x35
08f9fb5c:  [<0809ced6>] do_writepages+0x43/0x45
08f9fb70:  [<080cb8d2>] __writeback_single_inode+0xbc/0x173
08f9fbb8:  [<080cbb30>] sync_sb_inodes+0x1a7/0x260
08f9fbe8:  [<080cbc54>] writeback_inodes+0x6b/0x81
08f9fc04:  [<0809c640>] balance_dirty_pages+0x55/0x153
08f9fc5c:  [<0809c7bf>] balance_dirty_pages_ratelimited_nr+0x43/0x45
08f9fc68:  [<080992b5>] generic_file_buffered_write+0x3e3/0x6f5
08f9fd20:  [<0809988e>] __generic_file_aio_write_nolock+0x2c7/0x5dd
08f9fda8:  [<08099cb6>] generic_file_aio_write+0x55/0xc7
08f9fddc:  [<080ea206>] ext3_file_write+0x39/0xaf
08f9fe04:  [<080b060b>] do_sync_write+0xd8/0x10e
08f9febc:  [<080b06e3>] vfs_write+0xa2/0x1cb
08f9feec:  [<080b09b8>] sys_pwrite64+0x65/0x69
08f9ff10:  [<0805dd54>] handle_syscall+0x90/0xbc
08f9ff64:  [<0806d56c>] handle_trap+0x27/0x121
08f9ff8c:  [<0806dc65>] userspace+0x1de/0x226
08f9ffe4:  [<0805da19>] fork_handler+0x76/0x88
08f9fffc:  [<>] nosmp+0xf7fb7000/0x14


> but where's pdflush?  It should be busily transferring dirtiness from A to
> B.

The transfer of dirtiness from A to B goes through the narrow channel
of i_mutex.  And once that is plugged by the stuck balance_dirty_pages()
nothing else can pass through.

> > > The writeout code _should_ just sit there transferring dirtiness from A to
> > > B and cleaning pages via B, looping around, alternating between both.
> > > 
> > > What does sysrq-t say?
> > 
> > This is the fuse daemon thread that got stuck.
> 
> Where's pdflush?

Doing nothing I guess.  The request queue for the fuse filesystem is
full, so writepage with wbc->nonblocking=1 will be skipped.

pdflush   D 40045401 023  52412 (L-TLB)
088d5bf8 0001  08907df8 0805d8cb 088d55f8 088d5bf8 0890
   0890 08907e20 0805a38a 088d5100 088d5700 08907e10 0890 0890
   0847c300 088d5700 088d5100 08907e78 08182fe6 088d5100 088d5700 088d5100 
Call Trace:
08907de4:  [<0805d8cb>] switch_to_skas+0x3b/0x83
08907dfc:  [<0805a38a>] _switch_to+0x49/0x99
08907e24:  [<08182fe6>] schedule+0x246/0x547
08907e7c:  [<08183a03>] schedule_timeout+0x4e/0xb6
08907eb0:  [<08183991>] io_schedule_timeout+0x11/0x20
08907eb8:  [<080a0cf2>] congestion_wait+0x72/0x87
08907ee8:  [<0809c860>] background_writeout+0x35/0xa4
08907f38:  [<0809d41e>] __pdflush+0xae/0x152
08907f54:  [<0809d4f5>] pdflush+0x33/0x39
08907f84:  [<0808a03a>] kthread+0xa7/0xab
08907fb4:  [<0806a0f1>] run_kernel_thread+0x41/0x50
08907fe0:  [<0805d975>] new_thread_handler+0x62/0x8b
08907ffc:  [<>] nosmp+0xf7fb7000/0x14

pdflush   D 40045401 024  52523 (L-TLB)
081e1458 0001  088ffe00 0805d8cb 088d5bf8 081e1458 088f8000
   088f8000 088ffe28 0805a38a 088d5700 081e0f60 088ffe18 088f8000 088f8000
   0847c300 081e0f60 088d5700 088ffe80 08182fe6 088d5700 081e0f60 088d5700 
Call Trace:
088ffdec:  [<0805d8cb>] switch_to_skas+0x3b/0x83
088ffe04:  [<0805a38a>] _switch_to+0x49/0x99
088ffe2c:  [<08182fe6>] schedule+0x246/0x547
088ffe84:  [<08183a03>] schedule_timeout+0x4e/0xb6
088ffeb8:  [<08183991>] io_schedule_timeout+0x11/0x20
088ffec0:  [<080a0cf2>] congestion_wait+0x72/0x87
088ffef0:  [<0809c98c>] wb_kupdate+0x93/0xd9
088fff38:  [<0809d41e>] __pdflush+0xae/0x152
088fff54:  [<0809d4f5>] pdflush+0x33/0x39
088fff84:  [<0808a03a>] kthread+0xa7/0xab
088fffb4:  [<0806a0f1>] run_kernel_thread+0x41/0x50
088fffe0:  [<0805d975>] new_thread_handler+0x62/0x8b
088c:  [<>] nosmp+0xf7fb7000/0x14

Re: dirty balancing deadlock

2007-02-18 Thread Andrew Morton
On Mon, 19 Feb 2007 00:22:11 +0100 Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> > If so, writes to B will decrease the dirty memory threshold.
> 
> Yes, but not by enough.  Say A dirties a 1100 pages, limit is 1000.
> Some pages queued for writeback (doesn't matter how much).  B writes
> back 1, 1099 dirty remain in A, zero in B.  balance_dirty_pages() for
> B doesn't know that there's nothing more to write back for B, it's
> just waiting there for those 1099, which'll never get written.

hm, OK, arguable.  I guess something like this..

--- a/fs/fs-writeback.c~a
+++ a/fs/fs-writeback.c
@@ -356,7 +356,7 @@ int generic_sync_sb_inodes(struct super_
continue;   /* Skip a congested blockdev */
}
 
-   if (wbc->bdi && bdi != wbc->bdi) {
+   if (wbc->bdi && bdi != wbc->bdi && bdi_write_congested(bdi)) {
if (!sb_is_blkdev_sb(sb))
break;  /* fs has the wrong queue */
list_move(&inode->i_list, &sb->s_dirty);
_

but where's pdflush?  It should be busily transferring dirtiness from A to
B.

> > The writeout code _should_ just sit there transferring dirtiness from A to
> > B and cleaning pages via B, looping around, alternating between both.
> > 
> > What does sysrq-t say?
> 
> This is the fuse daemon thread that got stuck.

Where's pdflush?


Re: dirty balancing deadlock

2007-02-18 Thread Miklos Szeredi
> > > > I was testing the new fuse shared writable mmap support, and finding
> > > > that bash-shared-mapping deadlocks (which isn't so strange ;).  What
> > > > is more strange is that this is not an OOM situation at all, with
> > > > plenty of free and cached pages.
> > > > 
> > > > A little more investigation shows that a similar deadlock happens
> > > > reliably with bash-shared-mapping on a loopback mount, even if only
> > > > half the total memory is used.
> > > > 
> > > > The cause is slightly different in the two cases:
> > > > 
> > > >   - loopback mount: allocation by the underlying filesystem is stalled
> > > > on throttle_vm_writeout()
> > > > 
> > > >   - fuse-loop: page dirtying on the underlying filesystem is stalled on
> > > > balance_dirty_pages()
> > > > 
> > > > In both cases the underlying fs is totally innocent, with no
> > > > dirty/writeback pages, yet it's waiting for the global dirty+writeback
> > > > to go below the threshold, which obviously won't, until the
> > > > allocation/dirtying succeeds.
> > > > 
> > > > I'm not quite sure what the solution is, and asking for thoughts.
> > > 
> > > But these things don't just throttle.  They also perform large amounts
> > > of writeback, which causes the dirty levels to subside.
> > > 
> > > From your description it appears that this writeback isn't happening, or
> > > isn't working.  How come?
> > 
> >  - filesystems A and B
> >  - write to A will end up as write to B
> >  - dirty pages in A manage to go over dirty_threshold
> >  - page writeback is started from A
> >  - this triggers writeback for a couple of pages in B
> >  - writeback finishes normally, but dirty+writeback pages are still
> >over threshold
> >  - balance_dirty_pages in B gets stuck, nothing ever moves after this
> > 
> > At least this is my theory for what happens.
> > 
> 
> Is B a real filesystem?

Yes.

> If so, writes to B will decrease the dirty memory threshold.

Yes, but not by enough.  Say A dirties 1100 pages, limit is 1000.
Some pages queued for writeback (doesn't matter how much).  B writes
back 1, 1099 dirty remain in A, zero in B.  balance_dirty_pages() for
B doesn't know that there's nothing more to write back for B, it's
just waiting there for those 1099, which'll never get written.
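
The arithmetic can be replayed with a tiny userspace model.  It is a
sketch only: model_balance_rounds() is a hypothetical name, and it
assumes, as described above, that the stuck writer can only clean
pages belonging to B while A's pages stay dirty.

```c
#include <assert.h>

/*
 * Model of the wait loop for a writer on filesystem B.
 * Each round checks the global count against the limit and, if it is
 * still over, writes back everything B owns.  A's pages never shrink
 * because the stuck writer blocks further progress on A.
 * Returns the number of rounds needed, or -1 if the limit is still
 * exceeded after max_rounds.
 */
int model_balance_rounds(long dirty_a, long dirty_b, long limit,
			 int max_rounds)
{
	assert(max_rounds > 0);

	for (int round = 0; round < max_rounds; round++) {
		if (dirty_a + dirty_b <= limit)
			return round;
		dirty_b = 0;	/* B can only clean its own pages */
	}
	return -1;
}
```

For the case above (1099 pages in A, 1 in B, limit 1000) the loop never
gets under the limit, however many rounds are allowed.  If A held only
990 pages and B held 20, one round of writeback on B would be enough.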

> The writeout code _should_ just sit there transferring dirtiness from A to
> B and cleaning pages via B, looping around, alternating between both.
> 
> What does sysrq-t say?

This is the fuse daemon thread that got stuck.  There are lots of
others that are stuck on some ext3 mutex as a result of this.

fusexmp_fh_no D 40045401 0   527493   533   495 (NOTLB)
088d55f8 0001  08dcfb14 0805d8cb 08a09b78 088d55f8 08dc8000
   08dc8000 08dcfb3c 0805a38a 08a09680 088d5100 08dcfb2c 08dc8000 08dc8000
   0847c300 088d5100 08a09680 08dcfb94 08182fe6 08a09680 088d5100 08a09680 
Call Trace:
08dcfb00:  [<0805d8cb>] switch_to_skas+0x3b/0x83
08dcfb18:  [<0805a38a>] _switch_to+0x49/0x99
08dcfb40:  [<08182fe6>] schedule+0x246/0x547
08dcfb98:  [<08183a03>] schedule_timeout+0x4e/0xb6
08dcfbcc:  [<08183991>] io_schedule_timeout+0x11/0x20
08dcfbd4:  [<080a0cf2>] congestion_wait+0x72/0x87
08dcfc04:  [<0809c693>] balance_dirty_pages+0xa8/0x153
08dcfc5c:  [<0809c7bf>] balance_dirty_pages_ratelimited_nr+0x43/0x45
08dcfc68:  [<080992b5>] generic_file_buffered_write+0x3e3/0x6f5
08dcfd20:  [<0809988e>] __generic_file_aio_write_nolock+0x2c7/0x5dd
08dcfda8:  [<08099cb6>] generic_file_aio_write+0x55/0xc7
08dcfddc:  [<080ea1e6>] ext3_file_write+0x39/0xaf
08dcfe04:  [<080b060b>] do_sync_write+0xd8/0x10e
08dcfebc:  [<080b06e3>] vfs_write+0xa2/0x1cb
08dcfeec:  [<080b09b8>] sys_pwrite64+0x65/0x69
08dcff10:  [<0805dd54>] handle_syscall+0x90/0xbc
08dcff64:  [<0806d56c>] handle_trap+0x27/0x121
08dcff8c:  [<0806dc65>] userspace+0x1de/0x226
08dcffe4:  [<0805da19>] fork_handler+0x76/0x88
08dcfffc:  [] 0xd4cf0007

/proc/vmstat:

nr_anon_pages 668
nr_mapped 3168
nr_file_pages 5191
nr_slab_reclaimable 173
nr_slab_unreclaimable 494
nr_page_table_pages 65
nr_dirty 2174
nr_writeback 10
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
pgpgin 10955
pgpgout 421091
pswpin 0
pswpout 0
pgalloc_dma 0
pgalloc_normal 268761
pgfree 269709
pgactivate 128287
pgdeactivate 31253
pgfault 237350
pgmajfault 4340
pgrefill_dma 0
pgrefill_normal 127899
pgsteal_dma 0
pgsteal_normal 46892
pgscan_kswapd_dma 0
pgscan_kswapd_normal 47104
pgscan_direct_dma 0
pgscan_direct_normal 36544
pginodesteal 0
slabs_scanned 2048
kswapd_steal 25083
kswapd_inodesteal 335
pageoutrun 656
allocstall 423
pgrotated 0

Breakpoint 3, balance_dirty_pages (mapping=0xa01feb0)
at mm/page-writeback.c:202
202 dirty_exceeded = 1;
(gdb) p dirty_thresh
$1 = 2113
(gdb)
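
A quick sanity check of the numbers above, as a small sketch rather than kernel code. It assumes the throttling loop exits once dirty + writeback pages drop to dirty_thresh or below; the helper names are made up for illustration, the constants are copied from the vmstat dump and the gdb session.

```c
#include <stdbool.h>

/* Values copied from the /proc/vmstat dump and the gdb output above. */
enum { NR_DIRTY = 2174, NR_WRITEBACK = 10, DIRTY_THRESH = 2113 };

/*
 * Hypothetical helper: true while the global counters would keep a
 * writer inside the throttling loop (assumed exit test: sum <= thresh).
 */
bool over_dirty_thresh(long nr_dirty, long nr_writeback, long thresh)
{
	return nr_dirty + nr_writeback > thresh;
}

/* Pages that still have to be cleaned before the writer may leave. */
long pages_over(long nr_dirty, long nr_writeback, long thresh)
{
	long excess = nr_dirty + nr_writeback - thresh;

	return excess > 0 ? excess : 0;
}
```

With the values shown, 2174 + 10 = 2184 is above 2113, so the writer stays throttled until 71 pages are cleaned somewhere.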

For completeness' sake, here's the backtrace for the stuck loopback as
well:

loop0 D BFFFE101 0   499  5   50059 (L-TLB)
088cc578 0001  09197c4c 0805d8cb 084fe6f8 088cc578 0919

Re: dirty balancing deadlock

2007-02-18 Thread Andrew Morton
On Sun, 18 Feb 2007 23:50:14 +0100 Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> > > I was testing the new fuse shared writable mmap support, and finding
> > > that bash-shared-mapping deadlocks (which isn't so strange ;).  What
> > > is more strange is that this is not an OOM situation at all, with
> > > plenty of free and cached pages.
> > > 
> > > A little more investigation shows that a similar deadlock happens
> > > reliably with bash-shared-mapping on a loopback mount, even if only
> > > half the total memory is used.
> > > 
> > > The cause is slightly different in the two cases:
> > > 
> > >   - loopback mount: allocation by the underlying filesystem is stalled
> > > on throttle_vm_writeout()
> > > 
> > >   - fuse-loop: page dirtying on the underlying filesystem is stalled on
> > > balance_dirty_pages()
> > > 
> > > In both cases the underlying fs is totally innocent, with no
> > > dirty/writeback pages, yet it's waiting for the global dirty+writeback
> > > to go below the threshold, which obviously won't, until the
> > > allocation/dirtying succeeds.
> > > 
> > > I'm not quite sure what the solution is, and asking for thoughts.
> > 
> > But these things don't just throttle.  They also perform large amounts
> > of writeback, which causes the dirty levels to subside.
> > 
> > From your description it appears that this writeback isn't happening, or
> > isn't working.  How come?
> 
>  - filesystems A and B
>  - write to A will end up as write to B
>  - dirty pages in A manage to go over dirty_threshold
>  - page writeback is started from A
>  - this triggers writeback for a couple of pages in B
>  - writeback finishes normally, but dirty+writeback pages are still
>    over threshold
>  - balance_dirty_pages in B gets stuck, nothing ever moves after this
> 
> At least this is my theory for what happens.
> 

Is B a real filesystem?  If so, writes to B will decrease the dirty memory
threshold.

The writeout code _should_ just sit there transferring dirtiness from A to
B and cleaning pages via B, looping around, alternating between both.

What does sysrq-t say?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dirty balancing deadlock

2007-02-18 Thread Miklos Szeredi
> Andrew Morton wrote:
> > On Sun, 18 Feb 2007 19:28:18 +0100 Miklos Szeredi <[EMAIL PROTECTED]> wrote:
> > 
> >> I was testing the new fuse shared writable mmap support, and finding
> >> that bash-shared-mapping deadlocks (which isn't so strange ;).  What
> >> is more strange is that this is not an OOM situation at all, with
> >> plenty of free and cached pages.
> >>
> >> A little more investigation shows that a similar deadlock happens
> >> reliably with bash-shared-mapping on a loopback mount, even if only
> >> half the total memory is used.
> >>
> >> The cause is slightly different in the two cases:
> >>
> >>   - loopback mount: allocation by the underlying filesystem is stalled
> >> on throttle_vm_writeout()
> >>
> >>   - fuse-loop: page dirtying on the underlying filesystem is stalled on
> >> balance_dirty_pages()
> >>
> >> In both cases the underlying fs is totally innocent, with no
> >> dirty/writeback pages, yet it's waiting for the global dirty+writeback
> >> to go below the threshold, which obviously won't, until the
> >> allocation/dirtying succeeds.
> >>
> >> I'm not quite sure what the solution is, and asking for thoughts.
> > 
> > But these things don't just throttle.  They also perform large amounts
> > of writeback, which causes the dirty levels to subside.
> > 
> > From your description it appears that this writeback isn't happening, or
> > isn't working.  How come?
> 
> Is the fuse daemon trying to do writeback to itself, perhaps?
> 
> That is, trying to write out data to the FUSE filesystem, for which
> it is also the server.

No.  It's trying to write out data to a different filesystem.

Trying to write out data to itself very obviously deadlocks, but that
doesn't affect anything beside the stupid filesystem itself, and there
are mechanisms for aborting such a situation (forced umount, abort
through fuse-control filesystem).

Miklos


Re: dirty balancing deadlock

2007-02-18 Thread Miklos Szeredi
> > I was testing the new fuse shared writable mmap support, and finding
> > that bash-shared-mapping deadlocks (which isn't so strange ;).  What
> > is more strange is that this is not an OOM situation at all, with
> > plenty of free and cached pages.
> > 
> > A little more investigation shows that a similar deadlock happens
> > reliably with bash-shared-mapping on a loopback mount, even if only
> > half the total memory is used.
> > 
> > The cause is slightly different in the two cases:
> > 
> >   - loopback mount: allocation by the underlying filesystem is stalled
> > on throttle_vm_writeout()
> > 
> >   - fuse-loop: page dirtying on the underlying filesystem is stalled on
> > balance_dirty_pages()
> > 
> > In both cases the underlying fs is totally innocent, with no
> > dirty/writeback pages, yet it's waiting for the global dirty+writeback
> > to go below the threshold, which obviously won't, until the
> > allocation/dirtying succeeds.
> > 
> > I'm not quite sure what the solution is, and asking for thoughts.
> 
> But these things don't just throttle.  They also perform large amounts
> of writeback, which causes the dirty levels to subside.
> 
> From your description it appears that this writeback isn't happening, or
> isn't working.  How come?

 - filesystems A and B
 - write to A will end up as write to B
 - dirty pages in A manage to go over dirty_threshold
 - page writeback is started from A
 - this triggers writeback for a couple of pages in B
 - writeback finishes normally, but dirty+writeback pages are still
   over threshold
 - balance_dirty_pages in B gets stuck, nothing ever moves after this

At least this is my theory for what happens.
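
The sequence above can be played through in a toy model. This is only a sketch under stated assumptions: two filesystems share one global threshold, the writer throttled on behalf of one filesystem can only clean that filesystem's pages, and the exit test looks at the global total. struct toy_fs and toy_balance() are hypothetical names, not kernel interfaces.

```c
#include <stdbool.h>

/* Toy model: a filesystem is just a count of its own dirty pages. */
struct toy_fs {
	long dirty;
};

/*
 * Hypothetical stand-in for the throttling loop.  The writer to `self`
 * can only write back pages of `self`, but the exit test uses the
 * global total (self + other).  Returns true if the writer gets out
 * within max_iter rounds, false if it would still be looping.
 */
bool toy_balance(struct toy_fs *self, const struct toy_fs *other,
		 long thresh, int max_iter)
{
	for (int i = 0; i < max_iter; i++) {
		if (self->dirty + other->dirty <= thresh)
			return true;	/* below threshold: writer continues */
		if (self->dirty > 0)
			self->dirty--;	/* write back one of our own pages */
		/* nothing of ours left: real code would sleep and retry */
	}
	return false;
}
```

For instance, with 1100 dirty pages in A, a single dirty page in B and a threshold of 1000 (illustrative numbers), toy_balance() called for B cleans B's one page and then never reaches the exit condition, whatever max_iter is.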

Miklos


Re: dirty balancing deadlock

2007-02-18 Thread Rik van Riel

Andrew Morton wrote:
> On Sun, 18 Feb 2007 19:28:18 +0100 Miklos Szeredi <[EMAIL PROTECTED]> wrote:
> 
>> I was testing the new fuse shared writable mmap support, and finding
>> that bash-shared-mapping deadlocks (which isn't so strange ;).  What
>> is more strange is that this is not an OOM situation at all, with
>> plenty of free and cached pages.
>>
>> A little more investigation shows that a similar deadlock happens
>> reliably with bash-shared-mapping on a loopback mount, even if only
>> half the total memory is used.
>>
>> The cause is slightly different in the two cases:
>>
>>   - loopback mount: allocation by the underlying filesystem is stalled
>>     on throttle_vm_writeout()
>>
>>   - fuse-loop: page dirtying on the underlying filesystem is stalled on
>>     balance_dirty_pages()
>>
>> In both cases the underlying fs is totally innocent, with no
>> dirty/writeback pages, yet it's waiting for the global dirty+writeback
>> to go below the threshold, which obviously won't, until the
>> allocation/dirtying succeeds.
>>
>> I'm not quite sure what the solution is, and asking for thoughts.
> 
> But these things don't just throttle.  They also perform large amounts
> of writeback, which causes the dirty levels to subside.
> 
> From your description it appears that this writeback isn't happening, or
> isn't working.  How come?

Is the fuse daemon trying to do writeback to itself, perhaps?

That is, trying to write out data to the FUSE filesystem, for which
it is also the server.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: dirty balancing deadlock

2007-02-18 Thread Andrew Morton
On Sun, 18 Feb 2007 19:28:18 +0100 Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> I was testing the new fuse shared writable mmap support, and finding
> that bash-shared-mapping deadlocks (which isn't so strange ;).  What
> is more strange is that this is not an OOM situation at all, with
> plenty of free and cached pages.
> 
> A little more investigation shows that a similar deadlock happens
> reliably with bash-shared-mapping on a loopback mount, even if only
> half the total memory is used.
> 
> The cause is slightly different in the two cases:
> 
>   - loopback mount: allocation by the underlying filesystem is stalled
> on throttle_vm_writeout()
> 
>   - fuse-loop: page dirtying on the underlying filesystem is stalled on
> balance_dirty_pages()
> 
> In both cases the underlying fs is totally innocent, with no
> dirty/writeback pages, yet it's waiting for the global dirty+writeback
> to go below the threshold, which obviously won't, until the
> allocation/dirtying succeeds.
> 
> I'm not quite sure what the solution is, and asking for thoughts.

But these things don't just throttle.  They also perform large amounts
of writeback, which causes the dirty levels to subside.

From your description it appears that this writeback isn't happening, or
isn't working.  How come?



Re: dirty balancing deadlock

2007-02-18 Thread Andrew Morton
On Mon, 19 Feb 2007 00:22:11 +0100 Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> > If so, writes to B will decrease the dirty memory threshold.
> 
> Yes, but not by enough.  Say A dirties 1100 pages, limit is 1000.
> Some pages queued for writeback (doesn't matter how much).  B writes
> back 1, 1099 dirty remain in A, zero in B.  balance_dirty_pages() for
> B doesn't know that there's nothing more to write back for B, it's
> just waiting there for those 1099, which'll never get written.

hm, OK, arguable.  I guess something like this..

--- a/fs/fs-writeback.c~a
+++ a/fs/fs-writeback.c
@@ -356,7 +356,7 @@ int generic_sync_sb_inodes(struct super_
 				continue;		/* Skip a congested blockdev */
 			}
 
-		if (wbc->bdi && bdi != wbc->bdi) {
+		if (wbc->bdi && bdi != wbc->bdi && bdi_write_congested(bdi)) {
 			if (!sb_is_blkdev_sb(sb))
 				break;		/* fs has the wrong queue */
 			list_move(&inode->i_list, &sb->s_dirty);
_

but where's pdflush?  It should be busily transferring dirtiness from A to
B.

> > The writeout code _should_ just sit there transferring dirtiness from A to
> > B and cleaning pages via B, looping around, alternating between both.
> > 
> > What does sysrq-t say?
> 
> This is the fuse daemon thread that got stuck.

Where's pdflush?


Re: dirty balancing deadlock

2007-02-18 Thread Miklos Szeredi
> > > If so, writes to B will decrease the dirty memory threshold.
> > 
> > Yes, but not by enough.  Say A dirties 1100 pages, limit is 1000.
> > Some pages queued for writeback (doesn't matter how much).  B writes
> > back 1, 1099 dirty remain in A, zero in B.  balance_dirty_pages() for
> > B doesn't know that there's nothing more to write back for B, it's
> > just waiting there for those 1099, which'll never get written.
> 
> hm, OK, arguable.  I guess something like this..

Doesn't help the fuse case, but does seem to help the loopback mount
one.

For fuse it's worse with the patch: now the write triggered by the
balance recurses into fuse, with disastrous results, since the fuse
writeback is now blocked on the userspace queue.

fusexmp_fh_no D 40136678 0   505494   506   504 (NOTLB)
08982b78 0001  08f9f9b4 0805d8cb 089a75f8 08982b78 08f98000
   08f98000 08f9f9dc 0805a38a 089a7100 08982680 08f9f9cc 08f98000 08f98000
   085d8300 08982680 089a7100 08f9fa34 08183006 089a7100 08982680 089a7100 
Call Trace:
08f9f9a0:  [<0805d8cb>] switch_to_skas+0x3b/0x83
08f9f9b8:  [<0805a38a>] _switch_to+0x49/0x99
08f9f9e0:  [<08183006>] schedule+0x246/0x547
08f9fa38:  [<08103c7e>] fuse_get_req_wp+0xe9/0x14a
08f9fa70:  [<08103d2e>] fuse_writepage+0x4f/0x12c
08f9faac:  [<0809ce3f>] __writepage+0x1e/0x3d
08f9fac0:  [<0809cd39>] write_cache_pages+0x222/0x30a
08f9fb44:  [<0809ce8d>] generic_writepages+0x2f/0x35
08f9fb5c:  [<0809ced6>] do_writepages+0x43/0x45
08f9fb70:  [<080cb8d2>] __writeback_single_inode+0xbc/0x173
08f9fbb8:  [<080cbb30>] sync_sb_inodes+0x1a7/0x260
08f9fbe8:  [<080cbc54>] writeback_inodes+0x6b/0x81
08f9fc04:  [<0809c640>] balance_dirty_pages+0x55/0x153
08f9fc5c:  [<0809c7bf>] balance_dirty_pages_ratelimited_nr+0x43/0x45
08f9fc68:  [<080992b5>] generic_file_buffered_write+0x3e3/0x6f5
08f9fd20:  [<0809988e>] __generic_file_aio_write_nolock+0x2c7/0x5dd
08f9fda8:  [<08099cb6>] generic_file_aio_write+0x55/0xc7
08f9fddc:  [<080ea206>] ext3_file_write+0x39/0xaf
08f9fe04:  [<080b060b>] do_sync_write+0xd8/0x10e
08f9febc:  [<080b06e3>] vfs_write+0xa2/0x1cb
08f9feec:  [<080b09b8>] sys_pwrite64+0x65/0x69
08f9ff10:  [<0805dd54>] handle_syscall+0x90/0xbc
08f9ff64:  [<0806d56c>] handle_trap+0x27/0x121
08f9ff8c:  [<0806dc65>] userspace+0x1de/0x226
08f9ffe4:  [<0805da19>] fork_handler+0x76/0x88
08f9fffc:  [] nosmp+0xf7fb7000/0x14


> but where's pdflush?  It should be busily transferring dirtiness from A to
> B.

The transfer of dirtiness from A to B goes through the narrow channel
of i_mutex.  And once that is plugged by the stuck balance_dirty_pages()
nothing else can pass through.
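
The "narrow channel" can be illustrated with a toy lock model. This is only a sketch: struct toy_mutex is a plain flag standing in for the inode's i_mutex, not a real lock, and a writer that fails toy_trylock() stands for one that would block. All names are hypothetical.

```c
#include <stdbool.h>

/* Toy stand-in for i_mutex: a flag, with no real blocking. */
struct toy_mutex {
	bool held;
};

bool toy_trylock(struct toy_mutex *m)
{
	if (m->held)
		return false;	/* a real writer would block here */
	m->held = true;
	return true;
}

void toy_unlock(struct toy_mutex *m)
{
	m->held = false;
}

/*
 * Run n_writers writers against one inode in turn.  If first_gets_stuck
 * is set, writer 0 takes the mutex and never releases it, as it would
 * while looping in the throttling code.  Returns how many writers
 * complete their write.
 */
int toy_writers_completed(int n_writers, bool first_gets_stuck)
{
	struct toy_mutex i_mutex = { .held = false };
	int completed = 0;

	for (int w = 0; w < n_writers; w++) {
		if (!toy_trylock(&i_mutex))
			continue;	/* stuck behind the holder */
		if (w == 0 && first_gets_stuck)
			continue;	/* throttled with the mutex held */
		completed++;
		toy_unlock(&i_mutex);
	}
	return completed;
}
```

In this model a single throttled holder is enough for every later writer to the same inode to make no progress, which is the situation described above.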

> > > The writeout code _should_ just sit there transferring dirtiness from A
> > > to B and cleaning pages via B, looping around, alternating between both.
> > > 
> > > What does sysrq-t say?
> > 
> > This is the fuse daemon thread that got stuck.
> 
> Where's pdflush?

Doing nothing I guess.  The request queue for the fuse filesystem is
full, so writepage with wbc->nonblocking=1 will be skipped.

pdflush   D 40045401 023  52412 (L-TLB)
088d5bf8 0001  08907df8 0805d8cb 088d55f8 088d5bf8 0890
   0890 08907e20 0805a38a 088d5100 088d5700 08907e10 0890 0890
   0847c300 088d5700 088d5100 08907e78 08182fe6 088d5100 088d5700 088d5100 
Call Trace:
08907de4:  [<0805d8cb>] switch_to_skas+0x3b/0x83
08907dfc:  [<0805a38a>] _switch_to+0x49/0x99
08907e24:  [<08182fe6>] schedule+0x246/0x547
08907e7c:  [<08183a03>] schedule_timeout+0x4e/0xb6
08907eb0:  [<08183991>] io_schedule_timeout+0x11/0x20
08907eb8:  [<080a0cf2>] congestion_wait+0x72/0x87
08907ee8:  [<0809c860>] background_writeout+0x35/0xa4
08907f38:  [<0809d41e>] __pdflush+0xae/0x152
08907f54:  [<0809d4f5>] pdflush+0x33/0x39
08907f84:  [<0808a03a>] kthread+0xa7/0xab
08907fb4:  [<0806a0f1>] run_kernel_thread+0x41/0x50
08907fe0:  [<0805d975>] new_thread_handler+0x62/0x8b
08907ffc:  [] nosmp+0xf7fb7000/0x14

pdflush   D 40045401 024  52523 (L-TLB)
081e1458 0001  088ffe00 0805d8cb 088d5bf8 081e1458 088f8000
   088f8000 088ffe28 0805a38a 088d5700 081e0f60 088ffe18 088f8000 088f8000
   0847c300 081e0f60 088d5700 088ffe80 08182fe6 088d5700 081e0f60 088d5700 
Call Trace:
088ffdec:  [<0805d8cb>] switch_to_skas+0x3b/0x83
088ffe04:  [<0805a38a>] _switch_to+0x49/0x99
088ffe2c:  [<08182fe6>] schedule+0x246/0x547
088ffe84:  [<08183a03>] schedule_timeout+0x4e/0xb6
088ffeb8:  [<08183991>] io_schedule_timeout+0x11/0x20
088ffec0:  [<080a0cf2>] congestion_wait+0x72/0x87
088ffef0:  [<0809c98c>] wb_kupdate+0x93/0xd9
088fff38:  [<0809d41e>] __pdflush+0xae/0x152
088fff54:  [<0809d4f5>] pdflush+0x33/0x39
088fff84:  [<0808a03a>] kthread+0xa7/0xab
088fffb4:  [<0806a0f1>] run_kernel_thread+0x41/0x50
088fffe0:  [<0805d975>] new_thread_handler+0x62/0x8b
088c:  [] nosmp+0xf7fb7000/0x14

Re: dirty balancing deadlock

2007-02-18 Thread Miklos Szeredi
> --- a/fs/fs-writeback.c~a
> +++ a/fs/fs-writeback.c
> @@ -356,7 +356,7 @@ int generic_sync_sb_inodes(struct super_
>  				continue;		/* Skip a congested blockdev */
>  			}
>  
> -		if (wbc->bdi && bdi != wbc->bdi) {
> +		if (wbc->bdi && bdi != wbc->bdi && bdi_write_congested(bdi)) {
>  			if (!sb_is_blkdev_sb(sb))
>  				break;		/* fs has the wrong queue */
>  			list_move(&inode->i_list, &sb->s_dirty);

Checking bdi_write_congested(bdi) is not reliable, since the queue can
become congested _after_ the check is done.
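
The check-then-act problem can be sketched with a toy queue. This is an illustration only: struct toy_queue and its helpers are hypothetical, and the "other submitter" is simulated by a parameter rather than a second thread, so the interleaving is deterministic.

```c
#include <stdbool.h>

/* Toy request queue with a fixed number of slots. */
struct toy_queue {
	int in_flight;
	int limit;
};

/* Stand-in for a congestion check on the queue. */
bool toy_congested(const struct toy_queue *q)
{
	return q->in_flight >= q->limit;
}

/* Returns true if the request was accepted without blocking. */
bool toy_submit(struct toy_queue *q)
{
	if (q->in_flight >= q->limit)
		return false;	/* a real submitter would block here */
	q->in_flight++;
	return true;
}

/*
 * Check first, then submit, while `racing` requests from another
 * context slip in between the two steps.  Returns true when the check
 * said "not congested" and the submit would nevertheless block.
 */
bool toy_check_was_stale(struct toy_queue *q, int racing)
{
	if (toy_congested(q))
		return false;	/* caller skips this queue, no surprise */
	for (int i = 0; i < racing; i++)
		toy_submit(q);	/* someone else fills the queue */
	return !toy_submit(q);
}
```

With three of four slots in use and one racing request, the check passes and the following submit still finds the queue full.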

Miklos


Re: dirty balancing deadlock

2007-02-18 Thread Miklos Szeredi
> > > > If so, writes to B will decrease the dirty memory threshold.
> > >
> > > Yes, but not by enough.  Say A dirties a 1100 pages, limit is 1000.
> > > Some pages queued for writeback (doesn't matter how much).  B writes
> > > back 1, 1099 dirty remain in A, zero in B.  balance_dirty_pages() for
> > > B doesn't know that there's nothing more to write back for B, it's
> > > just waiting there for those 1099, which'll never get written.
> >
> > hm, OK, arguable.  I guess something like this..
>
> Doesn't help the fuse case, but does seem to help the loopback mount
> one.

No sorry, it doesn't even help the loopback deadlock.  It sometimes
takes quite a while to trigger...
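
The 1100/1000 arithmetic quoted above can be modelled in user space.
This is a rough sketch under stated assumptions, not the kernel's
balance_dirty_pages(): fake_queue and fake_balance_dirty_pages() are
hypothetical, the writer can only clean its own queue, and the optional
exit condition mirrors the "nr_dirty + nr_writeback < 16" check proposed
elsewhere in this thread.

```c
#include <stdbool.h>

/* Hypothetical per-queue counters; not the kernel's data structures. */
struct fake_queue {
	long nr_dirty;
	long nr_writeback;
};

/*
 * Simplified throttling loop for a writer on queue `mine`.
 * `other` holds dirty pages that this writer cannot clean.
 * Returns the iteration at which the writer is released, or -1 if it is
 * still over the threshold after max_iters (the "loops forever" case).
 */
static int fake_balance_dirty_pages(struct fake_queue *mine,
				    const struct fake_queue *other,
				    long dirty_thresh,
				    bool per_queue_check,
				    int max_iters)
{
	for (int i = 0; i < max_iters; i++) {
		long total = mine->nr_dirty + mine->nr_writeback +
			     other->nr_dirty + other->nr_writeback;

		if (total <= dirty_thresh)
			return i;	/* global count is fine again */

		/* Optional check: little or nothing left on my own queue. */
		if (per_queue_check &&
		    mine->nr_dirty + mine->nr_writeback < 16)
			return i;

		/* Write back whatever is dirty on my own queue only. */
		mine->nr_dirty = 0;
		/* A real loop would sleep here before retrying. */
	}
	return -1;
}
```

With A at 1099 dirty pages, B at 1 and a threshold of 1000, the loop
without the per-queue check never releases the writer on B; with the
check it is released on the first pass.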

Miklos


Re: dirty balancing deadlock

2007-02-18 Thread Chris Mason
On Mon, Feb 19, 2007 at 01:25:21AM +0100, Miklos Szeredi wrote:
> > > > If so, writes to B will decrease the dirty memory threshold.
> > >
> > > Yes, but not by enough.  Say A dirties a 1100 pages, limit is 1000.
> > > Some pages queued for writeback (doesn't matter how much).  B writes
> > > back 1, 1099 dirty remain in A, zero in B.  balance_dirty_pages() for
> > > B doesn't know that there's nothing more to write back for B, it's
> > > just waiting there for those 1099, which'll never get written.
> >
> > hm, OK, arguable.  I guess something like this..
>
> Doesn't help the fuse case, but does seem to help the loopback mount
> one.
>
> For fuse it's worse with the patch: now the write triggered by the
> balance recurses into fuse, with disastrous results, since the fuse
> writeback is now blocked on the userspace queue.
>
> fusexmp_fh_no D 40136678 0   505494   506   504 (NOTLB)
> 08982b78 0001  08f9f9b4 0805d8cb 089a75f8 08982b78 08f98000
> 08f98000 08f9f9dc 0805a38a 089a7100 08982680 08f9f9cc 08f98000 08f98000
> 085d8300 08982680 089a7100 08f9fa34 08183006 089a7100 08982680 089a7100
> Call Trace:
> 08f9f9a0:  [0805d8cb] switch_to_skas+0x3b/0x83
> 08f9f9b8:  [0805a38a] _switch_to+0x49/0x99
> 08f9f9e0:  [08183006] schedule+0x246/0x547
> 08f9fa38:  [08103c7e] fuse_get_req_wp+0xe9/0x14a
> 08f9fa70:  [08103d2e] fuse_writepage+0x4f/0x12c

In general, writepage is supposed to do work without blocking on
expensive locks that will get pdflush and dirty reclaim stuck in this
fashion.  You'll probably have to take the same approach reiserfs does
in data=journal mode, which is leaving the page dirty if fuse_get_req_wp
is going to block without making progress.

Queue it somewhere else (i.e. an internal fs cleaning thread) and leave
the page dirty so that we can move on to other pages that have a chance
of being cleaned.
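
A rough user-space sketch of that "make progress or bail" shape, under
stated assumptions.  fake_page, fake_req_pool, try_get_req() and
fake_writepage() are hypothetical names and not fuse or VFS interfaces;
the sketch only shows a handler that redirties the page and defers it
instead of sleeping when no request can be obtained.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; none of these are kernel or fuse types. */
struct fake_page {
	bool dirty;
	bool queued_for_cleaner;
};

struct fake_req_pool {
	int free_slots;	/* requests that can be taken without sleeping */
};

enum wp_result { WP_SUBMITTED, WP_REDIRTIED };

/* Non-blocking request allocation: fails instead of sleeping. */
static bool try_get_req(struct fake_req_pool *pool)
{
	if (pool->free_slots == 0)
		return false;
	pool->free_slots--;
	return true;
}

/*
 * writepage-style handler that either makes progress or bails.
 * If no request is available it leaves the page dirty, hands it to an
 * internal cleaner, and returns so the caller can move on to other pages.
 */
static enum wp_result fake_writepage(struct fake_page *page,
				     struct fake_req_pool *pool)
{
	if (!try_get_req(pool)) {
		page->dirty = true;		  /* still needs writing */
		page->queued_for_cleaner = true;  /* deferred to a cleaner */
		return WP_REDIRTIED;
	}
	page->dirty = false;			  /* write submitted */
	return WP_SUBMITTED;
}
```

With one free slot and two dirty pages, the first call submits and the
second returns WP_REDIRTIED with the page left dirty, so the caller is
never put to sleep.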

-chris


Re: dirty balancing deadlock

2007-02-18 Thread Miklos Szeredi
> > > > > If so, writes to B will decrease the dirty memory threshold.
> > > >
> > > > Yes, but not by enough.  Say A dirties a 1100 pages, limit is 1000.
> > > > Some pages queued for writeback (doesn't matter how much).  B writes
> > > > back 1, 1099 dirty remain in A, zero in B.  balance_dirty_pages() for
> > > > B doesn't know that there's nothing more to write back for B, it's
> > > > just waiting there for those 1099, which'll never get written.
> > >
> > > hm, OK, arguable.  I guess something like this..
> >
> > Doesn't help the fuse case, but does seem to help the loopback mount
> > one.
> >
> > For fuse it's worse with the patch: now the write triggered by the
> > balance recurses into fuse, with disastrous results, since the fuse
> > writeback is now blocked on the userspace queue.
> >
> > fusexmp_fh_no D 40136678 0   505494   506   504 (NOTLB)
> > 08982b78 0001  08f9f9b4 0805d8cb 089a75f8 08982b78 08f98000
> > 08f98000 08f9f9dc 0805a38a 089a7100 08982680 08f9f9cc 08f98000 08f98000
> > 085d8300 08982680 089a7100 08f9fa34 08183006 089a7100 08982680 089a7100
> > Call Trace:
> > 08f9f9a0:  [0805d8cb] switch_to_skas+0x3b/0x83
> > 08f9f9b8:  [0805a38a] _switch_to+0x49/0x99
> > 08f9f9e0:  [08183006] schedule+0x246/0x547
> > 08f9fa38:  [08103c7e] fuse_get_req_wp+0xe9/0x14a
> > 08f9fa70:  [08103d2e] fuse_writepage+0x4f/0x12c
>
> In general, writepage is supposed to do work without blocking on
> expensive locks that will get pdflush and dirty reclaim stuck in this
> fashion.  You'll probably have to take the same approach reiserfs does
> in data=journal mode, which is leaving the page dirty if fuse_get_req_wp
> is going to block without making progress.

Pdflush, and dirty reclaim set wbc->nonblocking to true.
balance_dirty_pages and fsync don't.  The problem here is that
Andrew's patch is wrong to let balance_dirty_pages() try to write back
pages from a different queue.
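
To make the "different queue" point concrete, here is a small sketch
that assumes a much-simplified model.  fake_inode, ANY_BDI and
fake_sync_inodes() are hypothetical and not the kernel's
generic_sync_sb_inodes(); the real function also distinguishes blockdev
superblocks and requeues inodes, which is left out here.

```c
#include <stddef.h>

/* Hypothetical inode model: which queue it belongs to, how much is dirty. */
struct fake_inode {
	int  bdi_id;
	long dirty_pages;
};

#define ANY_BDI (-1)

/*
 * Write back dirty pages, restricted to one queue when target_bdi is set.
 * Returns the number of pages written.  Inodes on other queues are
 * skipped, so the caller never issues writes it might block on elsewhere.
 */
static long fake_sync_inodes(struct fake_inode *inodes, size_t n,
			     int target_bdi)
{
	long written = 0;

	for (size_t i = 0; i < n; i++) {
		if (target_bdi != ANY_BDI && inodes[i].bdi_id != target_bdi)
			continue;	/* wrong queue: leave it alone */
		written += inodes[i].dirty_pages;
		inodes[i].dirty_pages = 0;
	}
	return written;
}
```

A throttled writer on queue 1 that calls fake_sync_inodes(inodes, n, 1)
only touches its own pages; passing ANY_BDI models a caller that is
allowed to write to every queue.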

Miklos


Re: dirty balancing deadlock

2007-02-18 Thread Chris Mason
On Mon, Feb 19, 2007 at 01:54:31AM +0100, Miklos Szeredi wrote:
> > > > > > If so, writes to B will decrease the dirty memory threshold.
> > > > >
> > > > > Yes, but not by enough.  Say A dirties a 1100 pages, limit is 1000.
> > > > > Some pages queued for writeback (doesn't matter how much).  B writes
> > > > > back 1, 1099 dirty remain in A, zero in B.  balance_dirty_pages() for
> > > > > B doesn't know that there's nothing more to write back for B, it's
> > > > > just waiting there for those 1099, which'll never get written.
> > > >
> > > > hm, OK, arguable.  I guess something like this..
> > >
> > > Doesn't help the fuse case, but does seem to help the loopback mount
> > > one.
> > >
> > > For fuse it's worse with the patch: now the write triggered by the
> > > balance recurses into fuse, with disastrous results, since the fuse
> > > writeback is now blocked on the userspace queue.
> > >
> > > fusexmp_fh_no D 40136678 0   505494   506   504 (NOTLB)
> > > 08982b78 0001  08f9f9b4 0805d8cb 089a75f8 08982b78 08f98000
> > > 08f98000 08f9f9dc 0805a38a 089a7100 08982680 08f9f9cc 08f98000 08f98000
> > > 085d8300 08982680 089a7100 08f9fa34 08183006 089a7100 08982680 089a7100
> > > Call Trace:
> > > 08f9f9a0:  [0805d8cb] switch_to_skas+0x3b/0x83
> > > 08f9f9b8:  [0805a38a] _switch_to+0x49/0x99
> > > 08f9f9e0:  [08183006] schedule+0x246/0x547
> > > 08f9fa38:  [08103c7e] fuse_get_req_wp+0xe9/0x14a
> > > 08f9fa70:  [08103d2e] fuse_writepage+0x4f/0x12c
> >
> > In general, writepage is supposed to do work without blocking on
> > expensive locks that will get pdflush and dirty reclaim stuck in this
> > fashion.  You'll probably have to take the same approach reiserfs does
> > in data=journal mode, which is leaving the page dirty if fuse_get_req_wp
> > is going to block without making progress.
>
> Pdflush, and dirty reclaim set wbc->nonblocking to true.
> balance_dirty_pages and fsync don't.  The problem here is that
> Andrew's patch is wrong to let balance_dirty_pages() try to write back
> pages from a different queue.

async or sync, writepage is supposed to either make progress or bail.
loopback aside, if the fuse call is blocking long term, you're going to
run into problems.

-chris


Re: dirty balancing deadlock

2007-02-18 Thread Miklos Szeredi
> > > In general, writepage is supposed to do work without blocking on
> > > expensive locks that will get pdflush and dirty reclaim stuck in this
> > > fashion.  You'll probably have to take the same approach reiserfs does
> > > in data=journal mode, which is leaving the page dirty if fuse_get_req_wp
> > > is going to block without making progress.
> >
> > Pdflush, and dirty reclaim set wbc->nonblocking to true.
> > balance_dirty_pages and fsync don't.  The problem here is that
> > Andrew's patch is wrong to let balance_dirty_pages() try to write back
> > pages from a different queue.
>
> async or sync, writepage is supposed to either make progress or bail.
> loopback aside, if the fuse call is blocking long term, you're going to
> run into problems.

Hmm, like what?

Thanks,
Miklos