Re: dirty balancing deadlock
> > On Thu, 22 Feb 2007 08:42:26 +0100 Miklos Szeredi <[EMAIL PROTECTED]> wrote:
> >
> > > > Index: linux/mm/page-writeback.c
> > > > ===
> > > > --- linux.orig/mm/page-writeback.c 2007-02-19 17:32:41.0 +0100
> > > > +++ linux/mm/page-writeback.c 2007-02-19 18:05:28.0 +0100
> > > > @@ -198,6 +198,25 @@ static void balance_dirty_pages(struct a
> > > >                          dirty_thresh)
> > > >                                  break;
> > > >
> > > > +                /*
> > > > +                 * Acquit this producer if there's little or nothing
> > > > +                 * to write back to this particular queue
> > > > +                 *
> > > > +                 * Without this check a deadlock is possible in the
> > > > +                 * following case:
> > > > +                 *
> > > > +                 * - filesystem A writes data through filesystem B
> > > > +                 * - filesystem A has dirty pages over dirty_thresh
> > > > +                 * - writeback is started, this triggers a write in B
> > > > +                 * - balance_dirty_pages() is called synchronously
> > > > +                 * - the write to B blocks
> > > > +                 * - the writeback completes, but dirty is still over threshold
> > > > +                 * - the blocking write prevents further writes from happening
> > > > +                 */
> > > > +                if (atomic_long_read(&bdi->nr_dirty) +
> > > > +                    atomic_long_read(&bdi->nr_writeback) < 16)
> > > > +                        break;
> > > > +
> > >
> > > The problem seems to be that little "- the write to B blocks".
> > >
> > > How come it blocks?  I mean, if we cannot retire writes to that filesystem
> > > then we're screwed anyway.
> >
> > Sorry about the sloppy description.  I mean, it's not the low-level
> > write that will block, but rather the VFS one
> > (generic_file_aio_write).  It will block (or rather loop forever with
> > 0.1 second sleeps) in balance_dirty_pages().  That means that for
> > this inode, i_mutex is held and no other writer can continue the work.
>
> "this inode" I assume is the inode against filesystem A?

No, the one in B.

> Why does holding that inode's i_mutex prevent further writeback of
> pages in A?

It is generic_file_aio_write() that is holding the mutex.

Here's the stack for the filesystem daemon trying to write back a page:

08dcfb40:  [<08182fe6>] schedule+0x246/0x547
08dcfb98:  [<08183a03>] schedule_timeout+0x4e/0xb6
08dcfbcc:  [<08183991>] io_schedule_timeout+0x11/0x20
08dcfbd4:  [<080a0cf2>] congestion_wait+0x72/0x87
08dcfc04:  [<0809c693>] balance_dirty_pages+0xa8/0x153
08dcfc5c:  [<0809c7bf>] balance_dirty_pages_ratelimited_nr+0x43/0x45
08dcfc68:  [<080992b5>] generic_file_buffered_write+0x3e3/0x6f5
08dcfd20:  [<0809988e>] __generic_file_aio_write_nolock+0x2c7/0x5dd
08dcfda8:  [<08099cb6>] generic_file_aio_write+0x55/0xc7
08dcfddc:  [<080ea1e6>] ext3_file_write+0x39/0xaf
08dcfe04:  [<080b060b>] do_sync_write+0xd8/0x10e
08dcfebc:  [<080b06e3>] vfs_write+0xa2/0x1cb
08dcfeec:  [<080b09b8>] sys_pwrite64+0x65/0x69

Miklos
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
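The decision made by the quoted check can be modelled outside the kernel.
What follows is a minimal userspace sketch, not the kernel code: struct
queue_state, must_throttle() and QUEUE_MIN_PAGES are hypothetical names
introduced only for illustration.  The two conditions (global count against
dirty_thresh, per-queue dirty plus writeback against 16) are the only parts
taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-queue (bdi) counters used by the patch. */
struct queue_state {
    long nr_dirty;      /* dirty pages queued against this backing device */
    long nr_writeback;  /* pages currently under writeback on this device */
};

#define QUEUE_MIN_PAGES 16  /* the constant proposed in the patch */

/*
 * Return true if a writer to queue q has to keep waiting.
 * global_dirty and dirty_thresh model the existing system-wide check;
 * the second test models the per-queue escape added by the patch.
 */
bool must_throttle(long global_dirty, long dirty_thresh,
                   const struct queue_state *q)
{
    assert(q != NULL);

    if (global_dirty <= dirty_thresh)
        return false;   /* under the global limit: nothing to wait for */

    if (q->nr_dirty + q->nr_writeback < QUEUE_MIN_PAGES)
        return false;   /* this queue has almost nothing pending: acquit */

    return true;        /* over the limit and this queue contributes */
}
```

With a global count of 3000 pages against a threshold of 2000, a queue that
holds all 3000 dirty pages (filesystem A) is still throttled, while a queue
with nothing pending (filesystem B) is let through, so the write into B can
complete and release i_mutex.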
Re: dirty balancing deadlock
> On Thu, 22 Feb 2007 08:42:26 +0100 Miklos Szeredi <[EMAIL PROTECTED]> wrote:
>
> > > Index: linux/mm/page-writeback.c
> > > ===
> > > --- linux.orig/mm/page-writeback.c 2007-02-19 17:32:41.0 +0100
> > > +++ linux/mm/page-writeback.c 2007-02-19 18:05:28.0 +0100
> > > @@ -198,6 +198,25 @@ static void balance_dirty_pages(struct a
> > >                          dirty_thresh)
> > >                                  break;
> > >
> > > +                /*
> > > +                 * Acquit this producer if there's little or nothing
> > > +                 * to write back to this particular queue
> > > +                 *
> > > +                 * Without this check a deadlock is possible in the
> > > +                 * following case:
> > > +                 *
> > > +                 * - filesystem A writes data through filesystem B
> > > +                 * - filesystem A has dirty pages over dirty_thresh
> > > +                 * - writeback is started, this triggers a write in B
> > > +                 * - balance_dirty_pages() is called synchronously
> > > +                 * - the write to B blocks
> > > +                 * - the writeback completes, but dirty is still over threshold
> > > +                 * - the blocking write prevents further writes from happening
> > > +                 */
> > > +                if (atomic_long_read(&bdi->nr_dirty) +
> > > +                    atomic_long_read(&bdi->nr_writeback) < 16)
> > > +                        break;
> > > +
> >
> > The problem seems to be that little "- the write to B blocks".
> >
> > How come it blocks?  I mean, if we cannot retire writes to that filesystem
> > then we're screwed anyway.
>
> Sorry about the sloppy description.  I mean, it's not the low-level
> write that will block, but rather the VFS one
> (generic_file_aio_write).  It will block (or rather loop forever with
> 0.1 second sleeps) in balance_dirty_pages().  That means that for
> this inode, i_mutex is held and no other writer can continue the work.

"this inode" I assume is the inode against filesystem A?

Why does holding that inode's i_mutex prevent further writeback of pages in A?
Re: dirty balancing deadlock
> > How about this?
>
> I still don't understand this bug.
>
> > Solves the FUSE deadlock, but not the throttle_vm_writeout() one.
> > I'll try to tackle that one as well.
> >
> > If the per-bdi dirty counter goes below 16, balance_dirty_pages()
> > returns.
> >
> > Does the constant need to be tunable?  If it's too large, then the global
> > threshold is more easily exceeded.  If it's too small, then in a tight
> > situation progress will be slower.
> >
> > Thanks,
> > Miklos
> >
> > Index: linux/mm/page-writeback.c
> > ===
> > --- linux.orig/mm/page-writeback.c 2007-02-19 17:32:41.0 +0100
> > +++ linux/mm/page-writeback.c 2007-02-19 18:05:28.0 +0100
> > @@ -198,6 +198,25 @@ static void balance_dirty_pages(struct a
> >                          dirty_thresh)
> >                                  break;
> >
> > +                /*
> > +                 * Acquit this producer if there's little or nothing
> > +                 * to write back to this particular queue
> > +                 *
> > +                 * Without this check a deadlock is possible in the
> > +                 * following case:
> > +                 *
> > +                 * - filesystem A writes data through filesystem B
> > +                 * - filesystem A has dirty pages over dirty_thresh
> > +                 * - writeback is started, this triggers a write in B
> > +                 * - balance_dirty_pages() is called synchronously
> > +                 * - the write to B blocks
> > +                 * - the writeback completes, but dirty is still over threshold
> > +                 * - the blocking write prevents further writes from happening
> > +                 */
> > +                if (atomic_long_read(&bdi->nr_dirty) +
> > +                    atomic_long_read(&bdi->nr_writeback) < 16)
> > +                        break;
> > +
>
> The problem seems to be that little "- the write to B blocks".
>
> How come it blocks?  I mean, if we cannot retire writes to that filesystem
> then we're screwed anyway.

Sorry about the sloppy description.  I mean, it's not the low-level
write that will block, but rather the VFS one
(generic_file_aio_write).  It will block (or rather loop forever with
0.1 second sleeps) in balance_dirty_pages().  That means that for
this inode, i_mutex is held and no other writer can continue the work.

Miklos
Re: dirty balancing deadlock
On Mon, 19 Feb 2007 18:11:55 +0100 Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> How about this?

I still don't understand this bug.

> Solves the FUSE deadlock, but not the throttle_vm_writeout() one.
> I'll try to tackle that one as well.
>
> If the per-bdi dirty counter goes below 16, balance_dirty_pages()
> returns.
>
> Does the constant need to be tunable?  If it's too large, then the global
> threshold is more easily exceeded.  If it's too small, then in a tight
> situation progress will be slower.
>
> Thanks,
> Miklos
>
> Index: linux/mm/page-writeback.c
> ===
> --- linux.orig/mm/page-writeback.c 2007-02-19 17:32:41.0 +0100
> +++ linux/mm/page-writeback.c 2007-02-19 18:05:28.0 +0100
> @@ -198,6 +198,25 @@ static void balance_dirty_pages(struct a
>                          dirty_thresh)
>                                  break;
>
> +                /*
> +                 * Acquit this producer if there's little or nothing
> +                 * to write back to this particular queue
> +                 *
> +                 * Without this check a deadlock is possible in the
> +                 * following case:
> +                 *
> +                 * - filesystem A writes data through filesystem B
> +                 * - filesystem A has dirty pages over dirty_thresh
> +                 * - writeback is started, this triggers a write in B
> +                 * - balance_dirty_pages() is called synchronously
> +                 * - the write to B blocks
> +                 * - the writeback completes, but dirty is still over threshold
> +                 * - the blocking write prevents further writes from happening
> +                 */
> +                if (atomic_long_read(&bdi->nr_dirty) +
> +                    atomic_long_read(&bdi->nr_writeback) < 16)
> +                        break;
> +

The problem seems to be that little "- the write to B blocks".

How come it blocks?  I mean, if we cannot retire writes to that filesystem
then we're screwed anyway.

Anyway, I think I'll think about this issue a little later on.  You might
as well prepare full changelogs for your proposed changes, because we'll be
needing them anyway.
Re: dirty balancing deadlock
On Tue, Feb 20, 2007 at 09:47:11AM +0100, Miklos Szeredi wrote:
> > > How about this?
> > >
> > > Solves the FUSE deadlock, but not the throttle_vm_writeout() one.
> > > I'll try to tackle that one as well.
> > >
> > > If the per-bdi dirty counter goes below 16, balance_dirty_pages()
> > > returns.
> > >
> > > Does the constant need to be tunable?  If it's too large, then the global
> > > threshold is more easily exceeded.  If it's too small, then in a tight
> > > situation progress will be slower.
> >
> > Ok, what is supposed to happen here is that filesystems are supposed to
> > be throttled from making more dirty pages when the system is over the
> > threshold.  Even if filesystem A doesn't have much to contribute, and
> > filesystem B is the cause of 99% of the dirty pages, the goal of the
> > threshold is to prevent more dirty data from happening, and filesystem A
> > should block.
>
> Which is the cause of the current deadlock.  But if we allow
> filesystem A to go into the red just a little, the deadlock is
> avoided, because it can continue to make progress with cleaning the
> dirtiness produced by B.
>
> The maximum that filesystems can go over the limit will be
>
>   (16 + epsilon) * number-of-queues

Right, even for thousands of mounted filesystems ~16 pages per FS
effectively pinned is not horrible.

-chris
Re: dirty balancing deadlock
> > > > > In general, writepage is supposed to do work without blocking on
> > > > > expensive locks that will get pdflush and dirty reclaim stuck in this
> > > > > fashion.  You'll probably have to take the same approach reiserfs does
> > > > > in data=journal mode, which is leaving the page dirty if
> > > > > fuse_get_req_wp is going to block without making progress.
> > > >
> > > > Pdflush, and dirty reclaim set wbc->nonblocking to true.
> > > > balance_dirty_pages and fsync don't.  The problem here is that
> > > > Andrew's patch is wrong to let balance_dirty_pages() try to write back
> > > > pages from a different queue.
> > >
> > > async or sync, writepage is supposed to either make progress or bail.
> > > loopback aside, if the fuse call is blocking long term, you're going to
> > > run into problems.
> >
> > Hmm, like what?
>
> Something a little different from what you're seeing.  Basically if the
> PF_MEMALLOC paths end up waiting on a filesystem transaction, and that
> transaction is waiting for more ram, the system will eventually grind to
> a halt.  data=journal is the easiest way to hit it, since writepage
> always logs at least 4k.
>
> WB_SYNC_NONE and wbc->nonblocking aren't a great test, in reiser I
> resorted to testing PF_MEMALLOC.

I'm not pretending to understand how journaling filesystems work, but
this shouldn't be an issue with fuse.

Can you show me a call path, where PF_MEMALLOC is set and .nonblocking
is not?

Thanks,
Miklos
Re: dirty balancing deadlock
> > How about this?
> >
> > Solves the FUSE deadlock, but not the throttle_vm_writeout() one.
> > I'll try to tackle that one as well.
> >
> > If the per-bdi dirty counter goes below 16, balance_dirty_pages()
> > returns.
> >
> > Does the constant need to be tunable?  If it's too large, then the global
> > threshold is more easily exceeded.  If it's too small, then in a tight
> > situation progress will be slower.
>
> Ok, what is supposed to happen here is that filesystems are supposed to
> be throttled from making more dirty pages when the system is over the
> threshold.  Even if filesystem A doesn't have much to contribute, and
> filesystem B is the cause of 99% of the dirty pages, the goal of the
> threshold is to prevent more dirty data from happening, and filesystem A
> should block.

Which is the cause of the current deadlock.  But if we allow
filesystem A to go into the red just a little, the deadlock is
avoided, because it can continue to make progress with cleaning the
dirtiness produced by B.

The maximum that filesystems can go over the limit will be

  (16 + epsilon) * number-of-queues

This is usually insignificant compared to the limit itself (~2000
pages on a machine with 32MB).

However, with thousands of fuse mounts this may become a problem, as
each filesystem gets a separate queue.  In theory, just 2 pages are
enough to always make progress, but current dirty balancing can't
enforce this, as the ratelimit is at least 8 pages.

So there may have to be some more strict page accounting within fuse
itself, but that doesn't change the overall concept I think.

Miklos
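The bound in this message is easy to put into numbers.  The sketch below
is only an illustration: the function names are hypothetical, epsilon is
ignored, and the 2000-page limit used in the note after the code is the
figure quoted above for a 32MB machine.

```c
#include <assert.h>

#define PAGES_PER_QUEUE 16  /* the constant from the patch; epsilon ignored */

/* Worst-case number of pages by which all queues together can exceed the limit. */
long max_overshoot_pages(long nr_queues)
{
    assert(nr_queues >= 0);
    return PAGES_PER_QUEUE * nr_queues;
}

/* The same overshoot as a whole-number percentage of the dirty limit. */
long overshoot_percent(long nr_queues, long dirty_limit_pages)
{
    assert(dirty_limit_pages > 0);
    return 100 * max_overshoot_pages(nr_queues) / dirty_limit_pages;
}
```

For 4 queues against a 2000-page limit this gives 64 pages, about 3% of
the limit.  For 1000 queues it gives 16000 pages, eight times the limit,
which is the "thousands of fuse mounts" case where the overshoot stops
being insignificant.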
Re: dirty balancing deadlock
On Mon, Feb 19, 2007 at 02:14:15AM +0100, Miklos Szeredi wrote:
> > > > In general, writepage is supposed to do work without blocking on
> > > > expensive locks that will get pdflush and dirty reclaim stuck in this
> > > > fashion.  You'll probably have to take the same approach reiserfs does
> > > > in data=journal mode, which is leaving the page dirty if fuse_get_req_wp
> > > > is going to block without making progress.
> > >
> > > Pdflush, and dirty reclaim set wbc->nonblocking to true.
> > > balance_dirty_pages and fsync don't.  The problem here is that
> > > Andrew's patch is wrong to let balance_dirty_pages() try to write back
> > > pages from a different queue.
> >
> > async or sync, writepage is supposed to either make progress or bail.
> > loopback aside, if the fuse call is blocking long term, you're going to
> > run into problems.
>
> Hmm, like what?

Something a little different from what you're seeing.  Basically if the
PF_MEMALLOC paths end up waiting on a filesystem transaction, and that
transaction is waiting for more ram, the system will eventually grind to
a halt.  data=journal is the easiest way to hit it, since writepage
always logs at least 4k.

WB_SYNC_NONE and wbc->nonblocking aren't a great test, in reiser I
resorted to testing PF_MEMALLOC.

-chris
Re: dirty balancing deadlock
On Mon, Feb 19, 2007 at 06:11:55PM +0100, Miklos Szeredi wrote:
> How about this?
>
> Solves the FUSE deadlock, but not the throttle_vm_writeout() one.
> I'll try to tackle that one as well.
>
> If the per-bdi dirty counter goes below 16, balance_dirty_pages()
> returns.
>
> Does the constant need to be tunable?  If it's too large, then the global
> threshold is more easily exceeded.  If it's too small, then in a tight
> situation progress will be slower.

Ok, what is supposed to happen here is that filesystems are supposed to
be throttled from making more dirty pages when the system is over the
threshold.  Even if filesystem A doesn't have much to contribute, and
filesystem B is the cause of 99% of the dirty pages, the goal of the
threshold is to prevent more dirty data from happening, and filesystem A
should block.

But, with the producer-consumer setup of fuse, I think this is a pretty
good compromise.  16 dirty/writeback pages shouldn't hurt the overall
limits too badly.

-chris
Re: dirty balancing deadlock
> Solves the FUSE deadlock, but not the throttle_vm_writeout() one.
> I'll try to tackle that one as well.
>
> If the per-bdi dirty counter goes below 16, balance_dirty_pages()
> returns.
>
> Does the constant need to be tunable?  If it's too large, then the global
> threshold is more easily exceeded.  If it's too small, then in a tight
> situation progress will be slower.

Similar in spirit, this should solve the deadlock on
throttle_vm_writeout().  Totally untested.

Does this approach look workable?

Thanks,
Miklos

Index: linux/include/linux/swap.h
===
--- linux.orig/include/linux/swap.h	2007-02-19 23:39:36.0 +0100
+++ linux/include/linux/swap.h	2007-02-20 00:03:38.0 +0100
@@ -277,10 +277,14 @@ static inline void disable_swap_token(vo
 	put_swap_token(swap_token_mm);
 }
 
+#define nr_swap_writeback \
+	atomic_long_read(&swapper_space.backing_dev_info->nr_writeback)
+
 #else /* CONFIG_SWAP */
 
 #define total_swap_pages			0
 #define total_swapcache_pages			0UL
+#define nr_swap_writeback			0UL
 
 #define si_swapinfo(val) \
 	do { (val)->freeswap = (val)->totalswap = 0; } while (0)
Index: linux/mm/page-writeback.c
===
--- linux.orig/mm/page-writeback.c	2007-02-19 23:43:03.0 +0100
+++ linux/mm/page-writeback.c	2007-02-20 00:03:49.0 +0100
@@ -33,6 +33,7 @@
 #include <linux/syscalls.h>
 #include <linux/buffer_head.h>
 #include <linux/pagevec.h>
+#include <linux/swap.h>
 
 /*
  * The maximum number of pages to writeout in a single bdflush/kupdate
@@ -332,6 +333,9 @@ void throttle_vm_writeout(void)
 		if (global_page_state(NR_UNSTABLE_NFS) +
 			global_page_state(NR_WRITEBACK) <= dirty_thresh)
 				break;
+
+		if (nr_swap_writeback < 16)
+			break;
 		congestion_wait(WRITE, HZ/10);
 	}
 }
Index: linux/mm/page_io.c
===
--- linux.orig/mm/page_io.c	2007-02-19 23:24:23.0 +0100
+++ linux/mm/page_io.c	2007-02-19 23:42:21.0 +0100
@@ -70,6 +70,7 @@ static int end_swap_bio_write(struct bio
 		ClearPageReclaim(page);
 	}
 	end_page_writeback(page);
+	atomic_long_dec(&swapper_space.backing_dev_info->nr_writeback);
 	bio_put(bio);
 	return 0;
 }
@@ -121,6 +122,7 @@ int swap_writepage(struct page *page, st
 	if (wbc->sync_mode == WB_SYNC_ALL)
 		rw |= (1 << BIO_RW_SYNC);
 	count_vm_event(PSWPOUT);
+	atomic_long_inc(&swapper_space.backing_dev_info->nr_writeback);
 	set_page_writeback(page);
 	unlock_page(page);
 	submit_bio(rw, bio);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
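[Editor's sketch] The swap patch above boils down to an in-flight counter: incremented when a swap write is submitted, decremented when it completes, and consulted by the throttle loop as a second exit condition. The following is a rough user-space model of that idea using C11 atomics, not kernel code. Every name in it (swap_writeback_model, model_swap_write_submit(), model_swap_write_done(), model_throttle_may_exit()) is made up for illustration; only the constant 16 and the shape of the check are taken from the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for swapper_space.backing_dev_info->nr_writeback. */
static atomic_long swap_writeback_model;

#define SWAP_WRITEBACK_MIN 16	/* constant used in the patch */

/* Model of the point where swap_writepage() submits the bio. */
void model_swap_write_submit(void)
{
	atomic_fetch_add(&swap_writeback_model, 1);
}

/* Model of the point where end_swap_bio_write() completes the write. */
void model_swap_write_done(void)
{
	long prev = atomic_fetch_sub(&swap_writeback_model, 1);

	/* A completion without a matching submit would be a bug. */
	assert(prev > 0);
	(void)prev;
}

/*
 * Model of the exit test in the throttle_vm_writeout() loop: leave when
 * the global count is at or under the threshold, or when almost nothing
 * is in flight to swap.
 */
bool model_throttle_may_exit(long global_writeback, long dirty_thresh)
{
	if (global_writeback <= dirty_thresh)
		return true;

	return atomic_load(&swap_writeback_model) < SWAP_WRITEBACK_MIN;
}
```

With nothing in flight the model lets the caller go even when the global count is over the threshold; once 16 writes are outstanding it throttles again until at least one of them completes.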
Re: dirty balancing deadlock
How about this?

Solves the FUSE deadlock, but not the throttle_vm_writeout() one.
I'll try to tackle that one as well.

If the per-bdi dirty counter goes below 16, balance_dirty_pages()
returns.

Does the constant need to be tunable?  If it's too large, then the global
threshold is more easily exceeded.  If it's too small, then in a tight
situation progress will be slower.

Thanks,
Miklos

Index: linux/mm/page-writeback.c
===
--- linux.orig/mm/page-writeback.c	2007-02-19 17:32:41.0 +0100
+++ linux/mm/page-writeback.c	2007-02-19 18:05:28.0 +0100
@@ -198,6 +198,25 @@ static void balance_dirty_pages(struct a
 					dirty_thresh)
 				break;
 
+		/*
+		 * Acquit this producer if there's little or nothing
+		 * to write back to this particular queue
+		 *
+		 * Without this check a deadlock is possible in the
+		 * following case:
+		 *
+		 * - filesystem A writes data through filesystem B
+		 * - filesystem A has dirty pages over dirty_thresh
+		 * - writeback is started, this triggers a write in B
+		 * - balance_dirty_pages() is called synchronously
+		 * - the write to B blocks
+		 * - the writeback completes, but dirty is still over threshold
+		 * - the blocking write prevents further writes from happening
+		 */
+		if (atomic_long_read(&bdi->nr_dirty) +
+		    atomic_long_read(&bdi->nr_writeback) < 16)
+			break;
+
 		if (!dirty_exceeded)
 			dirty_exceeded = 1;
 
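[Editor's sketch] To make the case in the patch comment concrete, here is a small user-space model of the throttling loop. It is not kernel code and it simplifies the real loop considerably. The names struct queue_model and throttle_writer() and the max_sleeps parameter are invented for illustration; the figures 1100, 1000 and 16 come from this thread. The model assumes that A's pages cannot be cleaned while the writer in B is throttled, which is the premise of the deadlock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model, not kernel code. */
struct queue_model {
	long other_dirty;	/* dirty pages on other queues (filesystem A) */
	long own_dirty;		/* dirty + writeback pages on this queue (B) */
	long dirty_thresh;	/* global dirty threshold */
};

#define PER_QUEUE_MIN 16	/* constant proposed in the patch */

/*
 * Model of the balance_dirty_pages() loop for a writer on queue B.
 * Returns the number of congestion_wait()-style sleeps taken before the
 * writer is released, or -1 if it is still throttled after max_sleeps.
 */
long throttle_writer(struct queue_model *q, bool per_queue_check,
		     long max_sleeps)
{
	long sleeps;

	assert(max_sleeps >= 0);

	for (sleeps = 0; ; sleeps++) {
		/* Existing exit: global dirty count is under the limit. */
		if (q->other_dirty + q->own_dirty <= q->dirty_thresh)
			return sleeps;

		/* Proposed exit: this queue has next to nothing to write. */
		if (per_queue_check && q->own_dirty < PER_QUEUE_MIN)
			return sleeps;

		if (sleeps == max_sleeps)
			return -1;

		/*
		 * Writeback started from here can only clean this queue's
		 * pages; A's pages stay dirty because cleaning them needs
		 * the very writer that is being throttled.
		 */
		q->own_dirty = 0;
		/* congestion_wait(WRITE, HZ/10) would sleep here. */
	}
}
```

With other_dirty = 1100, own_dirty = 1 and dirty_thresh = 1000, the model never releases the writer unless the per-queue check is enabled, in which case it is released immediately.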
Re: dirty balancing deadlock
On Mon, Feb 19, 2007 at 06:11:55PM +0100, Miklos Szeredi wrote:
> How about this?
>
> Solves the FUSE deadlock, but not the throttle_vm_writeout() one.
> I'll try to tackle that one as well.
>
> If the per-bdi dirty counter goes below 16, balance_dirty_pages()
> returns.
>
> Does the constant need to be tunable?  If it's too large, then the global
> threshold is more easily exceeded.  If it's too small, then in a tight
> situation progress will be slower.

Ok, what is supposed to happen here is that filesystems are supposed to
be throttled from making more dirty pages when the system is over the
threshold.  Even if filesystem A doesn't have much to contribute, and
filesystem B is the cause of 99% of the dirty pages, the goal of the
threshold is to prevent more dirty data from happening, and filesystem A
should block.

But, with the producer consumer setup of fuse, I think this is a pretty
good compromise.  16 dirty/writeback pages shouldn't hurt the overall
limits too badly.

-chris
Re: dirty balancing deadlock
On Mon, Feb 19, 2007 at 02:14:15AM +0100, Miklos Szeredi wrote:
> > > > In general, writepage is supposed to do work without blocking on
> > > > expensive locks that will get pdflush and dirty reclaim stuck in this
> > > > fashion.  You'll probably have to take the same approach reiserfs does
> > > > in data=journal mode, which is leaving the page dirty if fuse_get_req_wp
> > > > is going to block without making progress.
> > >
> > > Pdflush, and dirty reclaim set wbc->nonblocking to true.
> > > balance_dirty_pages and fsync don't.  The problem here is that
> > > Andrew's patch is wrong to let balance_dirty_pages() try to write back
> > > pages from a different queue.
> >
> > async or sync, writepage is supposed to either make progress or bail.
> > loopback aside, if the fuse call is blocking long term, you're going to
> > run into problems.
>
> Hmm, like what?

Something a little different from what you're seeing.  Basically if the
PF_MEMALLOC paths end up waiting on a filesystem transaction, and that
transaction is waiting for more ram, the system will eventually grind to
a halt.  data=journal is the easiest way to hit it, since writepage
always logs at least 4k.

WB_SYNC_NONE and wbc->nonblocking aren't a great test, in reiser I
resorted to testing PF_MEMALLOC.

-chris
Re: dirty balancing deadlock
> > > In general, writepage is supposed to do work without blocking on
> > > expensive locks that will get pdflush and dirty reclaim stuck in this
> > > fashion.  You'll probably have to take the same approach reiserfs does
> > > in data=journal mode, which is leaving the page dirty if fuse_get_req_wp
> > > is going to block without making progress.
> >
> > Pdflush, and dirty reclaim set wbc->nonblocking to true.
> > balance_dirty_pages and fsync don't.  The problem here is that
> > Andrew's patch is wrong to let balance_dirty_pages() try to write back
> > pages from a different queue.
>
> async or sync, writepage is supposed to either make progress or bail.
> loopback aside, if the fuse call is blocking long term, you're going to
> run into problems.

Hmm, like what?

Thanks,
Miklos
Re: dirty balancing deadlock
On Mon, Feb 19, 2007 at 01:54:31AM +0100, Miklos Szeredi wrote: > > > > > > If so, writes to B will decrease the dirty memory threshold. > > > > > > > > > > Yes, but not by enough. Say A dirties a 1100 pages, limit is 1000. > > > > > Some pages queued for writeback (doesn't matter how much). B writes > > > > > back 1, 1099 dirty remain in A, zero in B. balance_dirty_pages() for > > > > > B doesn't know that there's nothing more to write back for B, it's > > > > > just waiting there for those 1099, which'll never get written. > > > > > > > > hm, OK, arguable. I guess something like this.. > > > > > > Doesn't help the fuse case, but does seem to help the loopback mount > > > one. > > > > > > For fuse it's worse with the patch: now the write triggered by the > > > balance recurses into fuse, with disastrous results, since the fuse > > > writeback is now blocked on the userspace queue. > > > > > > fusexmp_fh_no D 40136678 0 505494 506 504 (NOTLB) > > > 08982b78 0001 08f9f9b4 0805d8cb 089a75f8 08982b78 08f98000 > > >08f98000 08f9f9dc 0805a38a 089a7100 08982680 08f9f9cc 08f98000 > > > 08f98000 > > >085d8300 08982680 089a7100 08f9fa34 08183006 089a7100 08982680 > > > 089a7100 Call Trace: > > > 08f9f9a0: [<0805d8cb>] switch_to_skas+0x3b/0x83 > > > 08f9f9b8: [<0805a38a>] _switch_to+0x49/0x99 > > > 08f9f9e0: [<08183006>] schedule+0x246/0x547 > > > 08f9fa38: [<08103c7e>] fuse_get_req_wp+0xe9/0x14a > > > 08f9fa70: [<08103d2e>] fuse_writepage+0x4f/0x12c > > > > In general, writepage is supposed to do work without blocking on > > expensive locks that will get pdflush and dirty reclaim stuck in this > > fashion. You'll probably have to take the same approach reiserfs does > > in data=journal mode, which is leaving the page dirty if fuse_get_req_wp > > is going to block without making progress. > > Pdflush, and dirty reclaim set wbc->nonblocking to true. > balance_dirty_pages and fsync don't. 
The problem here is that > Andrew's patch is wrong to let balance_dirty_pages() try to write back > pages from a different queue. async or sync, writepage is supposed to either make progress or bail. loopback aside, if the fuse call is blocking long term, you're going to run into problems. -chris - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dirty balancing deadlock
> > > > > If so, writes to B will decrease the dirty memory threshold. > > > > > > > > Yes, but not by enough. Say A dirties a 1100 pages, limit is 1000. > > > > Some pages queued for writeback (doesn't matter how much). B writes > > > > back 1, 1099 dirty remain in A, zero in B. balance_dirty_pages() for > > > > B doesn't know that there's nothing more to write back for B, it's > > > > just waiting there for those 1099, which'll never get written. > > > > > > hm, OK, arguable. I guess something like this.. > > > > Doesn't help the fuse case, but does seem to help the loopback mount > > one. > > > > For fuse it's worse with the patch: now the write triggered by the > > balance recurses into fuse, with disastrous results, since the fuse > > writeback is now blocked on the userspace queue. > > > > fusexmp_fh_no D 40136678 0 505494 506 504 (NOTLB) > > 08982b78 0001 08f9f9b4 0805d8cb 089a75f8 08982b78 08f98000 > >08f98000 08f9f9dc 0805a38a 089a7100 08982680 08f9f9cc 08f98000 > > 08f98000 > >085d8300 08982680 089a7100 08f9fa34 08183006 089a7100 08982680 > > 089a7100 Call Trace: > > 08f9f9a0: [<0805d8cb>] switch_to_skas+0x3b/0x83 > > 08f9f9b8: [<0805a38a>] _switch_to+0x49/0x99 > > 08f9f9e0: [<08183006>] schedule+0x246/0x547 > > 08f9fa38: [<08103c7e>] fuse_get_req_wp+0xe9/0x14a > > 08f9fa70: [<08103d2e>] fuse_writepage+0x4f/0x12c > > In general, writepage is supposed to do work without blocking on > expensive locks that will get pdflush and dirty reclaim stuck in this > fashion. You'll probably have to take the same approach reiserfs does > in data=journal mode, which is leaving the page dirty if fuse_get_req_wp > is going to block without making progress. Pdflush, and dirty reclaim set wbc->nonblocking to true. balance_dirty_pages and fsync don't. The problem here is that Andrew's patch is wrong to let balance_dirty_pages() try to write back pages from a different queue. 
Miklos
Re: dirty balancing deadlock
On Mon, Feb 19, 2007 at 01:25:21AM +0100, Miklos Szeredi wrote: > > > > If so, writes to B will decrease the dirty memory threshold. > > > > > > Yes, but not by enough. Say A dirties a 1100 pages, limit is 1000. > > > Some pages queued for writeback (doesn't matter how much). B writes > > > back 1, 1099 dirty remain in A, zero in B. balance_dirty_pages() for > > > B doesn't know that there's nothing more to write back for B, it's > > > just waiting there for those 1099, which'll never get written. > > > > hm, OK, arguable. I guess something like this.. > > Doesn't help the fuse case, but does seem to help the loopback mount > one. > > For fuse it's worse with the patch: now the write triggered by the > balance recurses into fuse, with disastrous results, since the fuse > writeback is now blocked on the userspace queue. > > fusexmp_fh_no D 40136678 0 505494 506 504 (NOTLB) > 08982b78 0001 08f9f9b4 0805d8cb 089a75f8 08982b78 08f98000 >08f98000 08f9f9dc 0805a38a 089a7100 08982680 08f9f9cc 08f98000 08f98000 >085d8300 08982680 089a7100 08f9fa34 08183006 089a7100 08982680 > 089a7100 Call Trace: > 08f9f9a0: [<0805d8cb>] switch_to_skas+0x3b/0x83 > 08f9f9b8: [<0805a38a>] _switch_to+0x49/0x99 > 08f9f9e0: [<08183006>] schedule+0x246/0x547 > 08f9fa38: [<08103c7e>] fuse_get_req_wp+0xe9/0x14a > 08f9fa70: [<08103d2e>] fuse_writepage+0x4f/0x12c In general, writepage is supposed to do work without blocking on expensive locks that will get pdflush and dirty reclaim stuck in this fashion. You'll probably have to take the same approach reiserfs does in data=journal mode, which is leaving the page dirty if fuse_get_req_wp is going to block without making progress. Queue it somewhere else (ie an internal Fs cleaning thread) and leave the page dirty so that we can move on to other pages that have a chance of being cleaned. 
-chris
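[Editor's sketch] The advice above, "either make progress or bail", can be pictured with a toy model: a writepage-like function that starts the write when the request queue has room and otherwise leaves the page dirty so the caller can move on to other pages. This is only a user-space illustration under simplifying assumptions, not fuse or reiserfs code. All names (struct model_page, struct model_queue, model_writepage(), model_write_pages()) are hypothetical, and the step of handing the deferred page to an internal cleaning thread is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* All names here are hypothetical; this is a user-space model. */
struct model_page {
	bool dirty;
	bool writeback;
};

struct model_queue {
	int in_flight;		/* requests currently queued */
	int max_in_flight;	/* queue capacity */
};

enum { MODEL_WRITE_STARTED = 0, MODEL_WRITE_DEFERRED = 1 };

/* Start the write if the queue has room, otherwise bail without blocking. */
int model_writepage(struct model_queue *q, struct model_page *page)
{
	assert(page->dirty);

	if (q->in_flight >= q->max_in_flight) {
		/* Would block: leave the page dirty for a later pass. */
		return MODEL_WRITE_DEFERRED;
	}

	page->dirty = false;
	page->writeback = true;
	q->in_flight++;
	return MODEL_WRITE_STARTED;
}

/* One writeback pass over n pages; returns how many writes were started. */
int model_write_pages(struct model_queue *q, struct model_page *pages, int n)
{
	int started = 0;

	for (int i = 0; i < n; i++) {
		if (!pages[i].dirty)
			continue;
		if (model_writepage(q, &pages[i]) == MODEL_WRITE_STARTED)
			started++;
	}
	return started;
}
```

The point of the design is that a full queue costs the caller nothing but a skipped page; the pass always terminates, and the deferred pages are still dirty and can be picked up once the queue drains.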
Re: dirty balancing deadlock
> > > > If so, writes to B will decrease the dirty memory threshold.
> > >
> > > Yes, but not by enough.  Say A dirties a 1100 pages, limit is 1000.
> > > Some pages queued for writeback (doesn't matter how much).  B writes
> > > back 1, 1099 dirty remain in A, zero in B.  balance_dirty_pages() for
> > > B doesn't know that there's nothing more to write back for B, it's
> > > just waiting there for those 1099, which'll never get written.
> >
> > hm, OK, arguable.  I guess something like this..
>
> Doesn't help the fuse case, but does seem to help the loopback mount
> one.

No sorry, it doesn't even help the loopback deadlock.  It sometimes
takes quite a while to trigger...

Miklos
Re: dirty balancing deadlock
> --- a/fs/fs-writeback.c~a
> +++ a/fs/fs-writeback.c
> @@ -356,7 +356,7 @@ int generic_sync_sb_inodes(struct super_
>  			continue;		/* Skip a congested blockdev */
>  		}
>  
> -		if (wbc->bdi && bdi != wbc->bdi) {
> +		if (wbc->bdi && bdi != wbc->bdi && bdi_write_congested(bdi)) {
>  			if (!sb_is_blkdev_sb(sb))
>  				break;		/* fs has the wrong queue */
>  			list_move(&inode->i_list, &sb->s_dirty);

Checking bdi_write_congested(bdi) is not reliable, since the queue can
become congested _after_ the check is done.

Miklos
Re: dirty balancing deadlock
> > > If so, writes to B will decrease the dirty memory threshold. > > > > Yes, but not by enough. Say A dirties a 1100 pages, limit is 1000. > > Some pages queued for writeback (doesn't matter how much). B writes > > back 1, 1099 dirty remain in A, zero in B. balance_dirty_pages() for > > B doesn't know that there's nothing more to write back for B, it's > > just waiting there for those 1099, which'll never get written. > > hm, OK, arguable. I guess something like this.. Doesn't help the fuse case, but does seem to help the loopback mount one. For fuse it's worse with the patch: now the write triggered by the balance recurses into fuse, with disastrous results, since the fuse writeback is now blocked on the userspace queue. fusexmp_fh_no D 40136678 0 505494 506 504 (NOTLB) 08982b78 0001 08f9f9b4 0805d8cb 089a75f8 08982b78 08f98000 08f98000 08f9f9dc 0805a38a 089a7100 08982680 08f9f9cc 08f98000 08f98000 085d8300 08982680 089a7100 08f9fa34 08183006 089a7100 08982680 089a7100 Call Trace: 08f9f9a0: [<0805d8cb>] switch_to_skas+0x3b/0x83 08f9f9b8: [<0805a38a>] _switch_to+0x49/0x99 08f9f9e0: [<08183006>] schedule+0x246/0x547 08f9fa38: [<08103c7e>] fuse_get_req_wp+0xe9/0x14a 08f9fa70: [<08103d2e>] fuse_writepage+0x4f/0x12c 08f9faac: [<0809ce3f>] __writepage+0x1e/0x3d 08f9fac0: [<0809cd39>] write_cache_pages+0x222/0x30a 08f9fb44: [<0809ce8d>] generic_writepages+0x2f/0x35 08f9fb5c: [<0809ced6>] do_writepages+0x43/0x45 08f9fb70: [<080cb8d2>] __writeback_single_inode+0xbc/0x173 08f9fbb8: [<080cbb30>] sync_sb_inodes+0x1a7/0x260 08f9fbe8: [<080cbc54>] writeback_inodes+0x6b/0x81 08f9fc04: [<0809c640>] balance_dirty_pages+0x55/0x153 08f9fc5c: [<0809c7bf>] balance_dirty_pages_ratelimited_nr+0x43/0x45 08f9fc68: [<080992b5>] generic_file_buffered_write+0x3e3/0x6f5 08f9fd20: [<0809988e>] __generic_file_aio_write_nolock+0x2c7/0x5dd 08f9fda8: [<08099cb6>] generic_file_aio_write+0x55/0xc7 08f9fddc: [<080ea206>] ext3_file_write+0x39/0xaf 08f9fe04: [<080b060b>] do_sync_write+0xd8/0x10e 
08f9febc: [<080b06e3>] vfs_write+0xa2/0x1cb 08f9feec: [<080b09b8>] sys_pwrite64+0x65/0x69 08f9ff10: [<0805dd54>] handle_syscall+0x90/0xbc 08f9ff64: [<0806d56c>] handle_trap+0x27/0x121 08f9ff8c: [<0806dc65>] userspace+0x1de/0x226 08f9ffe4: [<0805da19>] fork_handler+0x76/0x88 08f9fffc: [<>] nosmp+0xf7fb7000/0x14 > but where's pdflush? It should be busily transferring dirtiness from A to > B. The transfer of dirtyness from A to B goes through the narrow channel of i_mutex. And once that is plugged by the stuck balance_dirty_pages() nothing else can pass through. > > > The writeout code _should_ just sit there transferring dirtyiness from A > > > to > > > B and cleaning pages via B, looping around, alternating between both. > > > > > > What does sysrq-t say? > > > > This is the fuse daemon thread that got stuck. > > Where's pdflsuh? Doing nothing I guess. The request queue for the fuse filesystem is full, so writepage with wbc->nonblocking=1 will be skipped. pdflush D 40045401 023 52412 (L-TLB) 088d5bf8 0001 08907df8 0805d8cb 088d55f8 088d5bf8 0890 0890 08907e20 0805a38a 088d5100 088d5700 08907e10 0890 0890 0847c300 088d5700 088d5100 08907e78 08182fe6 088d5100 088d5700 088d5100 Call Trace: 08907de4: [<0805d8cb>] switch_to_skas+0x3b/0x83 08907dfc: [<0805a38a>] _switch_to+0x49/0x99 08907e24: [<08182fe6>] schedule+0x246/0x547 08907e7c: [<08183a03>] schedule_timeout+0x4e/0xb6 08907eb0: [<08183991>] io_schedule_timeout+0x11/0x20 08907eb8: [<080a0cf2>] congestion_wait+0x72/0x87 08907ee8: [<0809c860>] background_writeout+0x35/0xa4 08907f38: [<0809d41e>] __pdflush+0xae/0x152 08907f54: [<0809d4f5>] pdflush+0x33/0x39 08907f84: [<0808a03a>] kthread+0xa7/0xab 08907fb4: [<0806a0f1>] run_kernel_thread+0x41/0x50 08907fe0: [<0805d975>] new_thread_handler+0x62/0x8b 08907ffc: [<>] nosmp+0xf7fb7000/0x14 pdflush D 40045401 024 52523 (L-TLB) 081e1458 0001 088ffe00 0805d8cb 088d5bf8 081e1458 088f8000 088f8000 088ffe28 0805a38a 088d5700 081e0f60 088ffe18 088f8000 088f8000 0847c300 081e0f60 
088d5700 088ffe80 08182fe6 088d5700 081e0f60 088d5700 Call Trace: 088ffdec: [<0805d8cb>] switch_to_skas+0x3b/0x83 088ffe04: [<0805a38a>] _switch_to+0x49/0x99 088ffe2c: [<08182fe6>] schedule+0x246/0x547 088ffe84: [<08183a03>] schedule_timeout+0x4e/0xb6 088ffeb8: [<08183991>] io_schedule_timeout+0x11/0x20 088ffec0: [<080a0cf2>] congestion_wait+0x72/0x87 088ffef0: [<0809c98c>] wb_kupdate+0x93/0xd9 088fff38: [<0809d41e>] __pdflush+0xae/0x152 088fff54: [<0809d4f5>] pdflush+0x33/0x39 088fff84: [<0808a03a>] kthread+0xa7/0xab 088fffb4: [<0806a0f1>] run_kernel_thread+0x41/0x50 088fffe0: [<0805d975>] new_thread_handler+0x62/0x8b 088c: [<>] nosmp+0xf7fb7000/0x14 - To unsubscribe from this list: send the line
Re: dirty balancing deadlock
On Mon, 19 Feb 2007 00:22:11 +0100 Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> > If so, writes to B will decrease the dirty memory threshold.
>
> Yes, but not by enough.  Say A dirties a 1100 pages, limit is 1000.
> Some pages queued for writeback (doesn't matter how much).  B writes
> back 1, 1099 dirty remain in A, zero in B.  balance_dirty_pages() for
> B doesn't know that there's nothing more to write back for B, it's
> just waiting there for those 1099, which'll never get written.

hm, OK, arguable.  I guess something like this..

--- a/fs/fs-writeback.c~a
+++ a/fs/fs-writeback.c
@@ -356,7 +356,7 @@ int generic_sync_sb_inodes(struct super_
 			continue;		/* Skip a congested blockdev */
 		}
 
-		if (wbc->bdi && bdi != wbc->bdi) {
+		if (wbc->bdi && bdi != wbc->bdi && bdi_write_congested(bdi)) {
 			if (!sb_is_blkdev_sb(sb))
 				break;		/* fs has the wrong queue */
 			list_move(&inode->i_list, &sb->s_dirty);
_

but where's pdflush?  It should be busily transferring dirtiness from A to
B.

> > The writeout code _should_ just sit there transferring dirtiness from A to
> > B and cleaning pages via B, looping around, alternating between both.
> >
> > What does sysrq-t say?
>
> This is the fuse daemon thread that got stuck.

Where's pdflush?
Re: dirty balancing deadlock
> > > > I was testing the new fuse shared writable mmap support, and finding > > > > that bash-shared-mapping deadlocks (which isn't so strange ;). What > > > > is more strange is that this is not an OOM situation at all, with > > > > plenty of free and cached pages. > > > > > > > > A little more investigation shows that a similar deadlock happens > > > > reliably with bash-shared-mapping on a loopback mount, even if only > > > > half the total memory is used. > > > > > > > > The cause is slightly different in the two cases: > > > > > > > > - loopback mount: allocation by the underlying filesystem is stalled > > > > on throttle_vm_writeout() > > > > > > > > - fuse-loop: page dirtying on the underlying filesystem is stalled on > > > > balance_dirty_pages() > > > > > > > > In both cases the underlying fs is totally innocent, with no > > > > dirty/writback pages, yet it's waiting for the global dirty+writeback > > > > to go below the threshold, which obviously won't, until the > > > > allocation/dirtying succeeds. > > > > > > > > I'm not quite sure what the solution is, and asking for thoughts. > > > > > > But these things don't just throttle. They also perform large > > > amounts > > > of writeback, which causes the dirty levels to subside. > > > > > > >From your description it appears that this writeback isn't happening, or > > > isn't working. How come? > > > > - filesystems A and B > > - write to A will end up as write to B > > - dirty pages in A manage to go over dirty_threshold > > - page writeback is started from A > > - this triggers writeback for a couple of pages in B > > - writeback finishes normally, but dirty+writeback pages are still > >over threshold > > - balance_dirty_pages in B gets stuck, nothing ever moves after this > > > > At least this is my theory for what happens. > > > > Is B a real filesystem? Yes. > If so, writes to B will decrease the dirty memory threshold. Yes, but not by enough. Say A dirties a 1100 pages, limit is 1000. 
Some pages queued for writeback (doesn't matter how much). B writes back 1, 1099 dirty remain in A, zero in B. balance_dirty_pages() for B doesn't know that there's nothing more to write back for B, it's just waiting there for those 1099, which'll never get written. > The writeout code _should_ just sit there transferring dirtyiness from A to > B and cleaning pages via B, looping around, alternating between both. > > What does sysrq-t say? This is the fuse daemon thread that got stuck. There are lots of others that are stuck on some ext3 mutex as a result of this. fusexmp_fh_no D 40045401 0 527493 533 495 (NOTLB) 088d55f8 0001 08dcfb14 0805d8cb 08a09b78 088d55f8 08dc8000 08dc8000 08dcfb3c 0805a38a 08a09680 088d5100 08dcfb2c 08dc8000 08dc8000 0847c300 088d5100 08a09680 08dcfb94 08182fe6 08a09680 088d5100 08a09680 Call Trace: 08dcfb00: [<0805d8cb>] switch_to_skas+0x3b/0x83 08dcfb18: [<0805a38a>] _switch_to+0x49/0x99 08dcfb40: [<08182fe6>] schedule+0x246/0x547 08dcfb98: [<08183a03>] schedule_timeout+0x4e/0xb6 08dcfbcc: [<08183991>] io_schedule_timeout+0x11/0x20 08dcfbd4: [<080a0cf2>] congestion_wait+0x72/0x87 08dcfc04: [<0809c693>] balance_dirty_pages+0xa8/0x153 08dcfc5c: [<0809c7bf>] balance_dirty_pages_ratelimited_nr+0x43/0x45 08dcfc68: [<080992b5>] generic_file_buffered_write+0x3e3/0x6f5 08dcfd20: [<0809988e>] __generic_file_aio_write_nolock+0x2c7/0x5dd 08dcfda8: [<08099cb6>] generic_file_aio_write+0x55/0xc7 08dcfddc: [<080ea1e6>] ext3_file_write+0x39/0xaf 08dcfe04: [<080b060b>] do_sync_write+0xd8/0x10e 08dcfebc: [<080b06e3>] vfs_write+0xa2/0x1cb 08dcfeec: [<080b09b8>] sys_pwrite64+0x65/0x69 08dcff10: [<0805dd54>] handle_syscall+0x90/0xbc 08dcff64: [<0806d56c>] handle_trap+0x27/0x121 08dcff8c: [<0806dc65>] userspace+0x1de/0x226 08dcffe4: [<0805da19>] fork_handler+0x76/0x88 08dcfffc: [] 0xd4cf0007 /proc/vmstat: nr_anon_pages 668 nr_mapped 3168 nr_file_pages 5191 nr_slab_reclaimable 173 nr_slab_unreclaimable 494 nr_page_table_pages 65 nr_dirty 2174 nr_writeback 10 
nr_unstable 0 nr_bounce 0 nr_vmscan_write 0 pgpgin 10955 pgpgout 421091 pswpin 0 pswpout 0 pgalloc_dma 0 pgalloc_normal 268761 pgfree 269709 pgactivate 128287 pgdeactivate 31253 pgfault 237350 pgmajfault 4340 pgrefill_dma 0 pgrefill_normal 127899 pgsteal_dma 0 pgsteal_normal 46892 pgscan_kswapd_dma 0 pgscan_kswapd_normal 47104 pgscan_direct_dma 0 pgscan_direct_normal 36544 pginodesteal 0 slabs_scanned 2048 kswapd_steal 25083 kswapd_inodesteal 335 pageoutrun 656 allocstall 423 pgrotated 0 Breakpoint 3, balance_dirty_pages (mapping=0xa01feb0) at mm/page-writeback.c:202 202 dirty_exceeded = 1; (gdb) p dirty_thresh $1 = 2113 (gdb) For completeness' sake, here's the backtrace for the stuck loopback as well: loop0 D BFFFE101 0 499 5 50059 (L-TLB) 088cc578 0001 09197c4c 0805d8cb 084fe6f8 088cc578 0919
Re: dirty balancing deadlock
On Sun, 18 Feb 2007 23:50:14 +0100 Miklos Szeredi <[EMAIL PROTECTED]> wrote: > > > I was testing the new fuse shared writable mmap support, and finding > > > that bash-shared-mapping deadlocks (which isn't so strange ;). What > > > is more strange is that this is not an OOM situation at all, with > > > plenty of free and cached pages. > > > > > > A little more investigation shows that a similar deadlock happens > > > reliably with bash-shared-mapping on a loopback mount, even if only > > > half the total memory is used. > > > > > > The cause is slightly different in the two cases: > > > > > > - loopback mount: allocation by the underlying filesystem is stalled > > > on throttle_vm_writeout() > > > > > > - fuse-loop: page dirtying on the underlying filesystem is stalled on > > > balance_dirty_pages() > > > > > > In both cases the underlying fs is totally innocent, with no > > > dirty/writback pages, yet it's waiting for the global dirty+writeback > > > to go below the threshold, which obviously won't, until the > > > allocation/dirtying succeeds. > > > > > > I'm not quite sure what the solution is, and asking for thoughts. > > > > But these things don't just throttle. They also perform large amounts > > of writeback, which causes the dirty levels to subside. > > > > >From your description it appears that this writeback isn't happening, or > > isn't working. How come? > > - filesystems A and B > - write to A will end up as write to B > - dirty pages in A manage to go over dirty_threshold > - page writeback is started from A > - this triggers writeback for a couple of pages in B > - writeback finishes normally, but dirty+writeback pages are still >over threshold > - balance_dirty_pages in B gets stuck, nothing ever moves after this > > At least this is my theory for what happens. > Is B a real filesystem? If so, writes to B will decrease the dirty memory threshold. 
The writeout code _should_ just sit there transferring dirtyiness from A to B and cleaning pages via B, looping around, alternating between both. What does sysrq-t say? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dirty balancing deadlock
> Andrew Morton wrote:
> > On Sun, 18 Feb 2007 19:28:18 +0100 Miklos Szeredi <[EMAIL PROTECTED]> wrote:
> >
> >> I was testing the new fuse shared writable mmap support, and finding
> >> that bash-shared-mapping deadlocks (which isn't so strange ;). What
> >> is more strange is that this is not an OOM situation at all, with
> >> plenty of free and cached pages.
> >>
> >> A little more investigation shows that a similar deadlock happens
> >> reliably with bash-shared-mapping on a loopback mount, even if only
> >> half the total memory is used.
> >>
> >> The cause is slightly different in the two cases:
> >>
> >> - loopback mount: allocation by the underlying filesystem is stalled
> >>   on throttle_vm_writeout()
> >>
> >> - fuse-loop: page dirtying on the underlying filesystem is stalled on
> >>   balance_dirty_pages()
> >>
> >> In both cases the underlying fs is totally innocent, with no
> >> dirty/writeback pages, yet it's waiting for the global dirty+writeback
> >> to go below the threshold, which obviously won't, until the
> >> allocation/dirtying succeeds.
> >>
> >> I'm not quite sure what the solution is, and asking for thoughts.
> >
> > But these things don't just throttle. They also perform large amounts
> > of writeback, which causes the dirty levels to subside.
> >
> > From your description it appears that this writeback isn't happening, or
> > isn't working. How come?
>
> Is the fuse daemon trying to do writeback to itself, perhaps?
>
> That is, trying to write out data to the FUSE filesystem, for which
> it is also the server.

No. It's trying to write out data to a different filesystem.

Trying to write out data to itself very obviously deadlocks, but that
doesn't affect anything beside the stupid filesystem itself, and there
are mechanisms for aborting such a situation (forced umount, abort
through fuse-control filesystem).

Miklos
Re: dirty balancing deadlock
> > I was testing the new fuse shared writable mmap support, and finding
> > that bash-shared-mapping deadlocks (which isn't so strange ;). What
> > is more strange is that this is not an OOM situation at all, with
> > plenty of free and cached pages.
> >
> > A little more investigation shows that a similar deadlock happens
> > reliably with bash-shared-mapping on a loopback mount, even if only
> > half the total memory is used.
> >
> > The cause is slightly different in the two cases:
> >
> >  - loopback mount: allocation by the underlying filesystem is stalled
> >    on throttle_vm_writeout()
> >
> >  - fuse-loop: page dirtying on the underlying filesystem is stalled on
> >    balance_dirty_pages()
> >
> > In both cases the underlying fs is totally innocent, with no
> > dirty/writeback pages, yet it's waiting for the global dirty+writeback
> > to go below the threshold, which obviously won't, until the
> > allocation/dirtying succeeds.
> >
> > I'm not quite sure what the solution is, and asking for thoughts.
>
> But these things don't just throttle. They also perform large amounts
> of writeback, which causes the dirty levels to subside.
>
> From your description it appears that this writeback isn't happening, or
> isn't working. How come?

 - filesystems A and B
 - write to A will end up as write to B
 - dirty pages in A manage to go over dirty_threshold
 - page writeback is started from A
 - this triggers writeback for a couple of pages in B
 - writeback finishes normally, but dirty+writeback pages are still
   over threshold
 - balance_dirty_pages in B gets stuck, nothing ever moves after this

At least this is my theory for what happens.

Miklos
Re: dirty balancing deadlock
Andrew Morton wrote:
> On Sun, 18 Feb 2007 19:28:18 +0100 Miklos Szeredi <[EMAIL PROTECTED]> wrote:
>
>> I was testing the new fuse shared writable mmap support, and finding
>> that bash-shared-mapping deadlocks (which isn't so strange ;). What
>> is more strange is that this is not an OOM situation at all, with
>> plenty of free and cached pages.
>>
>> A little more investigation shows that a similar deadlock happens
>> reliably with bash-shared-mapping on a loopback mount, even if only
>> half the total memory is used.
>>
>> The cause is slightly different in the two cases:
>>
>> - loopback mount: allocation by the underlying filesystem is stalled
>>   on throttle_vm_writeout()
>>
>> - fuse-loop: page dirtying on the underlying filesystem is stalled on
>>   balance_dirty_pages()
>>
>> In both cases the underlying fs is totally innocent, with no
>> dirty/writeback pages, yet it's waiting for the global dirty+writeback
>> to go below the threshold, which obviously won't, until the
>> allocation/dirtying succeeds.
>>
>> I'm not quite sure what the solution is, and asking for thoughts.
>
> But these things don't just throttle. They also perform large amounts
> of writeback, which causes the dirty levels to subside.
>
> From your description it appears that this writeback isn't happening, or
> isn't working. How come?

Is the fuse daemon trying to do writeback to itself, perhaps?

That is, trying to write out data to the FUSE filesystem, for which
it is also the server.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
Re: dirty balancing deadlock
On Sun, 18 Feb 2007 19:28:18 +0100 Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> I was testing the new fuse shared writable mmap support, and finding
> that bash-shared-mapping deadlocks (which isn't so strange ;). What
> is more strange is that this is not an OOM situation at all, with
> plenty of free and cached pages.
>
> A little more investigation shows that a similar deadlock happens
> reliably with bash-shared-mapping on a loopback mount, even if only
> half the total memory is used.
>
> The cause is slightly different in the two cases:
>
>  - loopback mount: allocation by the underlying filesystem is stalled
>    on throttle_vm_writeout()
>
>  - fuse-loop: page dirtying on the underlying filesystem is stalled on
>    balance_dirty_pages()
>
> In both cases the underlying fs is totally innocent, with no
> dirty/writeback pages, yet it's waiting for the global dirty+writeback
> to go below the threshold, which obviously won't, until the
> allocation/dirtying succeeds.
>
> I'm not quite sure what the solution is, and asking for thoughts.

But these things don't just throttle. They also perform large amounts
of writeback, which causes the dirty levels to subside.

From your description it appears that this writeback isn't happening, or
isn't working. How come?
Re: dirty balancing deadlock
On Sun, 18 Feb 2007 23:50:14 +0100 Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> > > I was testing the new fuse shared writable mmap support, and finding
> > > that bash-shared-mapping deadlocks (which isn't so strange ;). What
> > > is more strange is that this is not an OOM situation at all, with
> > > plenty of free and cached pages.
> > >
> > > A little more investigation shows that a similar deadlock happens
> > > reliably with bash-shared-mapping on a loopback mount, even if only
> > > half the total memory is used.
> > >
> > > The cause is slightly different in the two cases:
> > >
> > >  - loopback mount: allocation by the underlying filesystem is stalled
> > >    on throttle_vm_writeout()
> > >
> > >  - fuse-loop: page dirtying on the underlying filesystem is stalled on
> > >    balance_dirty_pages()
> > >
> > > In both cases the underlying fs is totally innocent, with no
> > > dirty/writeback pages, yet it's waiting for the global dirty+writeback
> > > to go below the threshold, which obviously won't, until the
> > > allocation/dirtying succeeds.
> > >
> > > I'm not quite sure what the solution is, and asking for thoughts.
> >
> > But these things don't just throttle. They also perform large amounts
> > of writeback, which causes the dirty levels to subside.
> >
> > From your description it appears that this writeback isn't happening, or
> > isn't working. How come?
>
>  - filesystems A and B
>  - write to A will end up as write to B
>  - dirty pages in A manage to go over dirty_threshold
>  - page writeback is started from A
>  - this triggers writeback for a couple of pages in B
>  - writeback finishes normally, but dirty+writeback pages are still
>    over threshold
>  - balance_dirty_pages in B gets stuck, nothing ever moves after this
>
> At least this is my theory for what happens.

Is B a real filesystem?

If so, writes to B will decrease the dirty memory threshold.

The writeout code _should_ just sit there transferring dirtiness from A to B
and cleaning pages via B, looping around, alternating between both.

What does sysrq-t say?
Re: dirty balancing deadlock
> > > > I was testing the new fuse shared writable mmap support, and finding
> > > > that bash-shared-mapping deadlocks (which isn't so strange ;). What
> > > > is more strange is that this is not an OOM situation at all, with
> > > > plenty of free and cached pages.
> > > >
> > > > A little more investigation shows that a similar deadlock happens
> > > > reliably with bash-shared-mapping on a loopback mount, even if only
> > > > half the total memory is used.
> > > >
> > > > The cause is slightly different in the two cases:
> > > >
> > > >  - loopback mount: allocation by the underlying filesystem is stalled
> > > >    on throttle_vm_writeout()
> > > >
> > > >  - fuse-loop: page dirtying on the underlying filesystem is stalled on
> > > >    balance_dirty_pages()
> > > >
> > > > In both cases the underlying fs is totally innocent, with no
> > > > dirty/writeback pages, yet it's waiting for the global dirty+writeback
> > > > to go below the threshold, which obviously won't, until the
> > > > allocation/dirtying succeeds.
> > > >
> > > > I'm not quite sure what the solution is, and asking for thoughts.
> > >
> > > But these things don't just throttle. They also perform large amounts
> > > of writeback, which causes the dirty levels to subside.
> > >
> > > From your description it appears that this writeback isn't happening, or
> > > isn't working. How come?
> >
> >  - filesystems A and B
> >  - write to A will end up as write to B
> >  - dirty pages in A manage to go over dirty_threshold
> >  - page writeback is started from A
> >  - this triggers writeback for a couple of pages in B
> >  - writeback finishes normally, but dirty+writeback pages are still
> >    over threshold
> >  - balance_dirty_pages in B gets stuck, nothing ever moves after this
> >
> > At least this is my theory for what happens.
>
> Is B a real filesystem?

Yes.

> If so, writes to B will decrease the dirty memory threshold.

Yes, but not by enough. Say A dirties 1100 pages, limit is 1000.
Some pages queued for writeback (doesn't matter how much). B writes
back 1, 1099 dirty remain in A, zero in B. balance_dirty_pages() for
B doesn't know that there's nothing more to write back for B, it's
just waiting there for those 1099, which'll never get written.

> The writeout code _should_ just sit there transferring dirtiness from A to B
> and cleaning pages via B, looping around, alternating between both.
>
> What does sysrq-t say?

This is the fuse daemon thread that got stuck. There are lots of
others that are stuck on some ext3 mutex as a result of this.

fusexmp_fh_no D 40045401 0 527493 533 495 (NOTLB)
088d55f8 0001 08dcfb14 0805d8cb 08a09b78 088d55f8 08dc8000 08dc8000
08dcfb3c 0805a38a 08a09680 088d5100 08dcfb2c 08dc8000 08dc8000 0847c300
088d5100 08a09680 08dcfb94 08182fe6 08a09680 088d5100 08a09680
Call Trace:
08dcfb00: [<0805d8cb>] switch_to_skas+0x3b/0x83
08dcfb18: [<0805a38a>] _switch_to+0x49/0x99
08dcfb40: [<08182fe6>] schedule+0x246/0x547
08dcfb98: [<08183a03>] schedule_timeout+0x4e/0xb6
08dcfbcc: [<08183991>] io_schedule_timeout+0x11/0x20
08dcfbd4: [<080a0cf2>] congestion_wait+0x72/0x87
08dcfc04: [<0809c693>] balance_dirty_pages+0xa8/0x153
08dcfc5c: [<0809c7bf>] balance_dirty_pages_ratelimited_nr+0x43/0x45
08dcfc68: [<080992b5>] generic_file_buffered_write+0x3e3/0x6f5
08dcfd20: [<0809988e>] __generic_file_aio_write_nolock+0x2c7/0x5dd
08dcfda8: [<08099cb6>] generic_file_aio_write+0x55/0xc7
08dcfddc: [<080ea1e6>] ext3_file_write+0x39/0xaf
08dcfe04: [<080b060b>] do_sync_write+0xd8/0x10e
08dcfebc: [<080b06e3>] vfs_write+0xa2/0x1cb
08dcfeec: [<080b09b8>] sys_pwrite64+0x65/0x69
08dcff10: [<0805dd54>] handle_syscall+0x90/0xbc
08dcff64: [<0806d56c>] handle_trap+0x27/0x121
08dcff8c: [<0806dc65>] userspace+0x1de/0x226
08dcffe4: [<0805da19>] fork_handler+0x76/0x88
08dcfffc: [<d4cf0007>] 0xd4cf0007

/proc/vmstat:

nr_anon_pages 668
nr_mapped 3168
nr_file_pages 5191
nr_slab_reclaimable 173
nr_slab_unreclaimable 494
nr_page_table_pages 65
nr_dirty 2174
nr_writeback 10
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
pgpgin 10955
pgpgout 421091
pswpin 0
pswpout 0
pgalloc_dma 0
pgalloc_normal 268761
pgfree 269709
pgactivate 128287
pgdeactivate 31253
pgfault 237350
pgmajfault 4340
pgrefill_dma 0
pgrefill_normal 127899
pgsteal_dma 0
pgsteal_normal 46892
pgscan_kswapd_dma 0
pgscan_kswapd_normal 47104
pgscan_direct_dma 0
pgscan_direct_normal 36544
pginodesteal 0
slabs_scanned 2048
kswapd_steal 25083
kswapd_inodesteal 335
pageoutrun 656
allocstall 423
pgrotated 0

Breakpoint 3, balance_dirty_pages (mapping=0xa01feb0)
    at mm/page-writeback.c:202
202             dirty_exceeded = 1;
(gdb) p dirty_thresh
$1 = 2113
(gdb)

For completeness' sake, here's the backtrace for the stuck loopback as
well:

loop0 D BFFFE101 0 499 5 50059 (L-TLB)
088cc578 0001 09197c4c 0805d8cb 084fe6f8 088cc578 0919 0919
09197c74 0805a38a 084fe200 088cc080 09197c64 0919 0919 086d9c80
088cc080 084fe200 09197ccc 08182ab6 084fe200 088cc080 084fe200
Call Trace:
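
The arithmetic in this message (1100 pages dirtied in A against a limit of 1000, with B writing back its single page) can be replayed as a small user-space model. This is only a sketch under stated assumptions: `struct queue_model`, `throttle_writer()`, `MAX_ROUNDS` and `QUEUE_IDLE_PAGES` are hypothetical names, not kernel code, and the optional per-queue escape is modelled loosely on the `nr_dirty + nr_writeback < 16` check proposed elsewhere in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one backing queue; not a kernel structure. */
struct queue_model {
    long nr_dirty;      /* dirty pages that belong to this queue */
    long nr_writeback;  /* pages of this queue currently under writeback */
};

enum {
    MAX_ROUNDS = 100,       /* stand-in for "loops forever" */
    QUEUE_IDLE_PAGES = 16   /* assumed cutoff, taken from the proposed patch */
};

/*
 * Model a writer to queue `mine` being throttled against a global limit.
 * Each round it can only clean pages on its own queue; pages on `other`
 * stay dirty because cleaning them would need this very writer to proceed.
 *
 * Returns the round in which the writer is released, or -1 if it is still
 * throttled after MAX_ROUNDS.
 */
static int throttle_writer(struct queue_model *mine,
                           const struct queue_model *other,
                           long dirty_thresh, bool per_queue_check)
{
    for (int round = 0; round < MAX_ROUNDS; round++) {
        /* Writeback of our own queue completes normally. */
        mine->nr_dirty = 0;
        mine->nr_writeback = 0;

        long global = mine->nr_dirty + mine->nr_writeback +
                      other->nr_dirty + other->nr_writeback;
        if (global <= dirty_thresh)
            return round;   /* global total dropped below the limit */

        /* Optional escape: nothing left that this queue could clean. */
        if (per_queue_check &&
            mine->nr_dirty + mine->nr_writeback < QUEUE_IDLE_PAGES)
            return round;

        /* The real code would sleep in congestion_wait() here. */
    }
    return -1;
}
```

With A holding 1099 dirty pages and B holding 1, calling `throttle_writer(&b, &a, 1000, false)` never releases the writer in this model, while the same call with the per-queue check enabled releases it in the first round.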
Re: dirty balancing deadlock
On Mon, 19 Feb 2007 00:22:11 +0100 Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> > If so, writes to B will decrease the dirty memory threshold.
>
> Yes, but not by enough. Say A dirties 1100 pages, limit is 1000.
> Some pages queued for writeback (doesn't matter how much). B writes
> back 1, 1099 dirty remain in A, zero in B. balance_dirty_pages() for
> B doesn't know that there's nothing more to write back for B, it's
> just waiting there for those 1099, which'll never get written.

hm, OK, arguable. I guess something like this..

--- a/fs/fs-writeback.c~a
+++ a/fs/fs-writeback.c
@@ -356,7 +356,7 @@ int generic_sync_sb_inodes(struct super_
 			continue;		/* Skip a congested blockdev */
 		}
 
-		if (wbc->bdi && bdi != wbc->bdi) {
+		if (wbc->bdi && bdi != wbc->bdi && bdi_write_congested(bdi)) {
 			if (!sb_is_blkdev_sb(sb))
 				break;		/* fs has the wrong queue */
 			list_move(&inode->i_list, &sb->s_dirty);
_

but where's pdflush? It should be busily transferring dirtiness from A to B.

> > The writeout code _should_ just sit there transferring dirtiness from A to B
> > and cleaning pages via B, looping around, alternating between both.
> >
> > What does sysrq-t say?
>
> This is the fuse daemon thread that got stuck.

Where's pdflush?
Re: dirty balancing deadlock
> > > If so, writes to B will decrease the dirty memory threshold.
> >
> > Yes, but not by enough. Say A dirties 1100 pages, limit is 1000.
> > Some pages queued for writeback (doesn't matter how much). B writes
> > back 1, 1099 dirty remain in A, zero in B. balance_dirty_pages() for
> > B doesn't know that there's nothing more to write back for B, it's
> > just waiting there for those 1099, which'll never get written.
>
> hm, OK, arguable. I guess something like this..

Doesn't help the fuse case, but does seem to help the loopback mount
one.

For fuse it's worse with the patch: now the write triggered by the
balance recurses into fuse, with disastrous results, since the fuse
writeback is now blocked on the userspace queue.

fusexmp_fh_no D 40136678 0 505494 506 504 (NOTLB)
08982b78 0001 08f9f9b4 0805d8cb 089a75f8 08982b78 08f98000 08f98000
08f9f9dc 0805a38a 089a7100 08982680 08f9f9cc 08f98000 08f98000 085d8300
08982680 089a7100 08f9fa34 08183006 089a7100 08982680 089a7100
Call Trace:
08f9f9a0: [<0805d8cb>] switch_to_skas+0x3b/0x83
08f9f9b8: [<0805a38a>] _switch_to+0x49/0x99
08f9f9e0: [<08183006>] schedule+0x246/0x547
08f9fa38: [<08103c7e>] fuse_get_req_wp+0xe9/0x14a
08f9fa70: [<08103d2e>] fuse_writepage+0x4f/0x12c
08f9faac: [<0809ce3f>] __writepage+0x1e/0x3d
08f9fac0: [<0809cd39>] write_cache_pages+0x222/0x30a
08f9fb44: [<0809ce8d>] generic_writepages+0x2f/0x35
08f9fb5c: [<0809ced6>] do_writepages+0x43/0x45
08f9fb70: [<080cb8d2>] __writeback_single_inode+0xbc/0x173
08f9fbb8: [<080cbb30>] sync_sb_inodes+0x1a7/0x260
08f9fbe8: [<080cbc54>] writeback_inodes+0x6b/0x81
08f9fc04: [<0809c640>] balance_dirty_pages+0x55/0x153
08f9fc5c: [<0809c7bf>] balance_dirty_pages_ratelimited_nr+0x43/0x45
08f9fc68: [<080992b5>] generic_file_buffered_write+0x3e3/0x6f5
08f9fd20: [<0809988e>] __generic_file_aio_write_nolock+0x2c7/0x5dd
08f9fda8: [<08099cb6>] generic_file_aio_write+0x55/0xc7
08f9fddc: [<080ea206>] ext3_file_write+0x39/0xaf
08f9fe04: [<080b060b>] do_sync_write+0xd8/0x10e
08f9febc: [<080b06e3>] vfs_write+0xa2/0x1cb
08f9feec: [<080b09b8>] sys_pwrite64+0x65/0x69
08f9ff10: [<0805dd54>] handle_syscall+0x90/0xbc
08f9ff64: [<0806d56c>] handle_trap+0x27/0x121
08f9ff8c: [<0806dc65>] userspace+0x1de/0x226
08f9ffe4: [<0805da19>] fork_handler+0x76/0x88
08f9fffc: [] nosmp+0xf7fb7000/0x14

> but where's pdflush? It should be busily transferring dirtiness from A to B.

The transfer of dirtiness from A to B goes through the narrow channel
of i_mutex. And once that is plugged by the stuck
balance_dirty_pages() nothing else can pass through.

> > > The writeout code _should_ just sit there transferring dirtiness from A to B
> > > and cleaning pages via B, looping around, alternating between both.
> > >
> > > What does sysrq-t say?
> >
> > This is the fuse daemon thread that got stuck.
>
> Where's pdflush?

Doing nothing I guess. The request queue for the fuse filesystem is
full, so writepage with wbc->nonblocking=1 will be skipped.

pdflush D 40045401 023 52412 (L-TLB)
088d5bf8 0001 08907df8 0805d8cb 088d55f8 088d5bf8 0890 0890
08907e20 0805a38a 088d5100 088d5700 08907e10 0890 0890 0847c300
088d5700 088d5100 08907e78 08182fe6 088d5100 088d5700 088d5100
Call Trace:
08907de4: [<0805d8cb>] switch_to_skas+0x3b/0x83
08907dfc: [<0805a38a>] _switch_to+0x49/0x99
08907e24: [<08182fe6>] schedule+0x246/0x547
08907e7c: [<08183a03>] schedule_timeout+0x4e/0xb6
08907eb0: [<08183991>] io_schedule_timeout+0x11/0x20
08907eb8: [<080a0cf2>] congestion_wait+0x72/0x87
08907ee8: [<0809c860>] background_writeout+0x35/0xa4
08907f38: [<0809d41e>] __pdflush+0xae/0x152
08907f54: [<0809d4f5>] pdflush+0x33/0x39
08907f84: [<0808a03a>] kthread+0xa7/0xab
08907fb4: [<0806a0f1>] run_kernel_thread+0x41/0x50
08907fe0: [<0805d975>] new_thread_handler+0x62/0x8b
08907ffc: [] nosmp+0xf7fb7000/0x14

pdflush D 40045401 024 52523 (L-TLB)
081e1458 0001 088ffe00 0805d8cb 088d5bf8 081e1458 088f8000 088f8000
088ffe28 0805a38a 088d5700 081e0f60 088ffe18 088f8000 088f8000 0847c300
081e0f60 088d5700 088ffe80 08182fe6 088d5700 081e0f60 088d5700
Call Trace:
088ffdec: [<0805d8cb>] switch_to_skas+0x3b/0x83
088ffe04: [<0805a38a>] _switch_to+0x49/0x99
088ffe2c: [<08182fe6>] schedule+0x246/0x547
088ffe84: [<08183a03>] schedule_timeout+0x4e/0xb6
088ffeb8: [<08183991>] io_schedule_timeout+0x11/0x20
088ffec0: [<080a0cf2>] congestion_wait+0x72/0x87
088ffef0: [<0809c98c>] wb_kupdate+0x93/0xd9
088fff38: [<0809d41e>] __pdflush+0xae/0x152
088fff54: [<0809d4f5>] pdflush+0x33/0x39
088fff84: [<0808a03a>] kthread+0xa7/0xab
088fffb4: [<0806a0f1>] run_kernel_thread+0x41/0x50
088fffe0: [<0805d975>] new_thread_handler+0x62/0x8b
088c: [] nosmp+0xf7fb7000/0x14
Re: dirty balancing deadlock
> --- a/fs/fs-writeback.c~a
> +++ a/fs/fs-writeback.c
> @@ -356,7 +356,7 @@ int generic_sync_sb_inodes(struct super_
>  			continue;		/* Skip a congested blockdev */
>  		}
>  
> -		if (wbc->bdi && bdi != wbc->bdi) {
> +		if (wbc->bdi && bdi != wbc->bdi && bdi_write_congested(bdi)) {
>  			if (!sb_is_blkdev_sb(sb))
>  				break;		/* fs has the wrong queue */
>  			list_move(&inode->i_list, &sb->s_dirty);

Checking bdi_write_congested(bdi) is not reliable, since the queue can
become congested _after_ the check is done.

Miklos
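
The objection above is the usual check-then-act race: the congestion test and the later submit are two separate steps, and other writers can fill the queue in between. A minimal user-space sketch of that window follows. All names here (`struct req_queue`, `queue_congested()`, `submit_would_block()`, `check_then_submit()`) are hypothetical illustrations, not the kernel's bdi API, and the "racing" writers are simulated sequentially so the outcome is deterministic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical request queue with a fixed number of slots. */
struct req_queue {
    int in_flight;  /* requests currently queued */
    int limit;      /* queue counts as congested at or above this */
};

static bool queue_congested(const struct req_queue *q)
{
    return q->in_flight >= q->limit;
}

/* Try to queue one request; returns true if the caller would have to block. */
static bool submit_would_block(struct req_queue *q)
{
    if (queue_congested(q))
        return true;
    q->in_flight++;
    return false;
}

/*
 * Check for congestion first, then submit. `racing` requests from other
 * writers arrive between the check and the submit. Returns true if the
 * submit ends up blocking even though the earlier check passed.
 */
static bool check_then_submit(struct req_queue *q, int racing)
{
    if (queue_congested(q))
        return false;   /* skipped: the write is never attempted */

    /* Window between check and use: other writers take the free slots. */
    for (int i = 0; i < racing && !queue_congested(q); i++)
        q->in_flight++;

    return submit_would_block(q);
}
```

With one free slot and one racing writer, the check passes but the submit still blocks; with no racing writer the same call goes through.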
Re: dirty balancing deadlock
> > > > If so, writes to B will decrease the dirty memory threshold.
> > >
> > > Yes, but not by enough. Say A dirties 1100 pages, limit is 1000.
> > > Some pages queued for writeback (doesn't matter how much). B writes
> > > back 1, 1099 dirty remain in A, zero in B. balance_dirty_pages() for
> > > B doesn't know that there's nothing more to write back for B, it's
> > > just waiting there for those 1099, which'll never get written.
> >
> > hm, OK, arguable. I guess something like this..
>
> Doesn't help the fuse case, but does seem to help the loopback mount
> one.

No sorry, it doesn't even help the loopback deadlock. It sometimes
takes quite a while to trigger...

Miklos
Re: dirty balancing deadlock
On Mon, Feb 19, 2007 at 01:25:21AM +0100, Miklos Szeredi wrote:
> > > > If so, writes to B will decrease the dirty memory threshold.
> > >
> > > Yes, but not by enough. Say A dirties 1100 pages, limit is 1000.
> > > Some pages queued for writeback (doesn't matter how much). B writes
> > > back 1, 1099 dirty remain in A, zero in B. balance_dirty_pages() for
> > > B doesn't know that there's nothing more to write back for B, it's
> > > just waiting there for those 1099, which'll never get written.
> >
> > hm, OK, arguable. I guess something like this..
>
> Doesn't help the fuse case, but does seem to help the loopback mount
> one.
>
> For fuse it's worse with the patch: now the write triggered by the
> balance recurses into fuse, with disastrous results, since the fuse
> writeback is now blocked on the userspace queue.
>
> fusexmp_fh_no D 40136678 0 505494 506 504 (NOTLB)
> 08982b78 0001 08f9f9b4 0805d8cb 089a75f8 08982b78 08f98000 08f98000
> 08f9f9dc 0805a38a 089a7100 08982680 08f9f9cc 08f98000 08f98000 085d8300
> 08982680 089a7100 08f9fa34 08183006 089a7100 08982680 089a7100
> Call Trace:
> 08f9f9a0: [<0805d8cb>] switch_to_skas+0x3b/0x83
> 08f9f9b8: [<0805a38a>] _switch_to+0x49/0x99
> 08f9f9e0: [<08183006>] schedule+0x246/0x547
> 08f9fa38: [<08103c7e>] fuse_get_req_wp+0xe9/0x14a
> 08f9fa70: [<08103d2e>] fuse_writepage+0x4f/0x12c

In general, writepage is supposed to do work without blocking on
expensive locks that will get pdflush and dirty reclaim stuck in this
fashion. You'll probably have to take the same approach reiserfs does
in data=journal mode, which is leaving the page dirty if
fuse_get_req_wp is going to block without making progress.

Queue it somewhere else (i.e. an internal fs cleaning thread) and leave
the page dirty so that we can move on to other pages that have a chance
of being cleaned.

-chris
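
The "make progress or leave the page dirty" rule described above can be sketched as a user-space model. Everything in it is an assumption made for illustration: `struct page_model`, `struct req_pool`, `writepage_nonblocking()` and `write_pages()` are hypothetical names, not fuse or VFS code. In the kernel, the step of putting a page back into the dirty state from inside writepage is what `redirty_page_for_writepage()` is for.

```c
#include <assert.h>
#include <stdbool.h>

enum page_state { PAGE_CLEAN, PAGE_DIRTY, PAGE_WRITEBACK };

/* Hypothetical stand-ins for a page and a bounded pool of request slots. */
struct page_model { enum page_state state; };
struct req_pool   { int free_slots; };

/*
 * Start writeback for one page if a request slot is free. If none is free,
 * leave the page dirty and return instead of sleeping, so the caller can
 * move on to pages that still have a chance of being cleaned.
 *
 * Returns true if writeback was started.
 */
static bool writepage_nonblocking(struct page_model *page,
                                  struct req_pool *pool)
{
    if (pool->free_slots == 0) {
        page->state = PAGE_DIRTY;   /* redirty rather than block */
        return false;
    }
    pool->free_slots--;
    page->state = PAGE_WRITEBACK;
    return true;
}

/* Walk all pages once; returns how many had writeback started. */
static int write_pages(struct page_model *pages, int n, struct req_pool *pool)
{
    int started = 0;

    for (int i = 0; i < n; i++) {
        if (pages[i].state != PAGE_DIRTY)
            continue;
        if (writepage_nonblocking(&pages[i], pool))
            started++;
    }
    return started;
}
```

With three dirty pages and two free slots, two pages move to writeback and the third stays dirty for a later pass; nothing in the walk sleeps.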
Re: dirty balancing deadlock
> > > > > If so, writes to B will decrease the dirty memory threshold.
> > > >
> > > > Yes, but not by enough. Say A dirties 1100 pages, limit is 1000.
> > > > Some pages queued for writeback (doesn't matter how much). B writes
> > > > back 1, 1099 dirty remain in A, zero in B. balance_dirty_pages() for
> > > > B doesn't know that there's nothing more to write back for B, it's
> > > > just waiting there for those 1099, which'll never get written.
> > >
> > > hm, OK, arguable. I guess something like this..
> >
> > Doesn't help the fuse case, but does seem to help the loopback mount
> > one.
> >
> > For fuse it's worse with the patch: now the write triggered by the
> > balance recurses into fuse, with disastrous results, since the fuse
> > writeback is now blocked on the userspace queue.
> >
> > fusexmp_fh_no D 40136678 0 505494 506 504 (NOTLB)
> > 08982b78 0001 08f9f9b4 0805d8cb 089a75f8 08982b78 08f98000 08f98000
> > 08f9f9dc 0805a38a 089a7100 08982680 08f9f9cc 08f98000 08f98000 085d8300
> > 08982680 089a7100 08f9fa34 08183006 089a7100 08982680 089a7100
> > Call Trace:
> > 08f9f9a0: [<0805d8cb>] switch_to_skas+0x3b/0x83
> > 08f9f9b8: [<0805a38a>] _switch_to+0x49/0x99
> > 08f9f9e0: [<08183006>] schedule+0x246/0x547
> > 08f9fa38: [<08103c7e>] fuse_get_req_wp+0xe9/0x14a
> > 08f9fa70: [<08103d2e>] fuse_writepage+0x4f/0x12c
>
> In general, writepage is supposed to do work without blocking on
> expensive locks that will get pdflush and dirty reclaim stuck in this
> fashion. You'll probably have to take the same approach reiserfs does
> in data=journal mode, which is leaving the page dirty if
> fuse_get_req_wp is going to block without making progress.

Pdflush and dirty reclaim set wbc->nonblocking to true.
balance_dirty_pages and fsync don't. The problem here is that
Andrew's patch is wrong to let balance_dirty_pages() try to write back
pages from a different queue.

Miklos
Re: dirty balancing deadlock
On Mon, Feb 19, 2007 at 01:54:31AM +0100, Miklos Szeredi wrote:
> > > > > > If so, writes to B will decrease the dirty memory threshold.
> > > > >
> > > > > Yes, but not by enough. Say A dirties 1100 pages, limit is 1000.
> > > > > Some pages queued for writeback (doesn't matter how much). B writes
> > > > > back 1, 1099 dirty remain in A, zero in B. balance_dirty_pages() for
> > > > > B doesn't know that there's nothing more to write back for B, it's
> > > > > just waiting there for those 1099, which'll never get written.
> > > >
> > > > hm, OK, arguable. I guess something like this..
> > >
> > > Doesn't help the fuse case, but does seem to help the loopback mount
> > > one.
> > >
> > > For fuse it's worse with the patch: now the write triggered by the
> > > balance recurses into fuse, with disastrous results, since the fuse
> > > writeback is now blocked on the userspace queue.
> > >
> > > fusexmp_fh_no D 40136678 0 505494 506 504 (NOTLB)
> > > 08982b78 0001 08f9f9b4 0805d8cb 089a75f8 08982b78 08f98000 08f98000
> > > 08f9f9dc 0805a38a 089a7100 08982680 08f9f9cc 08f98000 08f98000 085d8300
> > > 08982680 089a7100 08f9fa34 08183006 089a7100 08982680 089a7100
> > > Call Trace:
> > > 08f9f9a0: [<0805d8cb>] switch_to_skas+0x3b/0x83
> > > 08f9f9b8: [<0805a38a>] _switch_to+0x49/0x99
> > > 08f9f9e0: [<08183006>] schedule+0x246/0x547
> > > 08f9fa38: [<08103c7e>] fuse_get_req_wp+0xe9/0x14a
> > > 08f9fa70: [<08103d2e>] fuse_writepage+0x4f/0x12c
> >
> > In general, writepage is supposed to do work without blocking on
> > expensive locks that will get pdflush and dirty reclaim stuck in this
> > fashion. You'll probably have to take the same approach reiserfs does
> > in data=journal mode, which is leaving the page dirty if
> > fuse_get_req_wp is going to block without making progress.
>
> Pdflush and dirty reclaim set wbc->nonblocking to true.
> balance_dirty_pages and fsync don't. The problem here is that
> Andrew's patch is wrong to let balance_dirty_pages() try to write back
> pages from a different queue.

async or sync, writepage is supposed to either make progress or bail.
loopback aside, if the fuse call is blocking long term, you're going to
run into problems.

-chris
Re: dirty balancing deadlock
> > > In general, writepage is supposed to do work without blocking on
> > > expensive locks that will get pdflush and dirty reclaim stuck in this
> > > fashion. You'll probably have to take the same approach reiserfs does
> > > in data=journal mode, which is leaving the page dirty if
> > > fuse_get_req_wp is going to block without making progress.
> >
> > Pdflush, and dirty reclaim set wbc->nonblocking to true.
> > balance_dirty_pages and fsync don't. The problem here is that
> > Andrew's patch is wrong to let balance_dirty_pages() try to write back
> > pages from a different queue.
>
> async or sync, writepage is supposed to either make progress or bail.
> loopback aside, if the fuse call is blocking long term, you're going to
> run into problems.

Hmm, like what?

Thanks,
Miklos