Re: [patch 03/22] fix deadlock in balance_dirty_pages

2007-03-01 Thread Miklos Szeredi
> I'm just not going to apply weird hacks to work around a bug which
> I do not understand, and I have spent way too much time trying to understand
> this one.

I suggest you apply patch #5 "balance dirty pages from loop device"
and see for yourself the same deadlock with a simple loopback mount.

I believe _not_ doing balance_dirty_pages() in loop is rather a bigger
hack than mine ;)

> So let us persist.
> 
> Please fully describe the role of i_mutex in this hang.

OK.  Added description of the multithreaded case, with i_mutex's role:

+ What if Pr_b is multithreaded?  The first thread will enter
+ balance_dirty_pages() and loop there as shown above.  It will hold
+ i_mutex for the inode, taken in generic_file_aio_write().
+ 
+ The other threads now try to write back more data into the same file,
+ but will block on i_mutex.  So even with an unlimited number of threads
+ no progress is made.

Thanks,
Miklos


From: Miklos Szeredi <[EMAIL PROTECTED]>

This deadlock happens when dirty pages from one filesystem are
written back through another filesystem.  It is easiest to demonstrate
with fuse, although it could affect loopback mounts as well (see the
following patches).

Let's call the filesystems A(bove) and B(elow).  Process Pr_a is
writing to A, and process Pr_b is writing to B.

Pr_a is bash-shared-mapping.  Pr_b is the fuse filesystem daemon
(fusexmp_fh), for simplicity let's assume that Pr_b is single
threaded.

These are the simplified stack traces of these processes after the
deadlock:

Pr_a (bash-shared-mapping):

  (block on queue)
  fuse_writepage
  generic_writepages
  writeback_inodes
  balance_dirty_pages
  balance_dirty_pages_ratelimited_nr
  set_page_dirty_mapping_balance
  do_no_page


Pr_b (fusexmp_fh):

  io_schedule_timeout
  congestion_wait
  balance_dirty_pages
  balance_dirty_pages_ratelimited_nr
  generic_file_buffered_write
  generic_file_aio_write
  ext3_file_write
  do_sync_write
  vfs_write
  sys_pwrite64


Thanks to the aggressive nature of Pr_a, it can happen that

  nr_file_dirty > dirty_thresh + margin

This is due to both nr_dirty growing and dirty_thresh shrinking, which
in turn is due to nr_file_mapped rapidly growing.  The exact size of
the margin at which the deadlock happens is not known, but it's around
100 pages.
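
For reference, dirty_thresh shrinks because the dirty limits are scaled
by the fraction of memory that is not mapped.  A condensed sketch of
that calculation (simplified names, not the verbatim get_dirty_limits()
source; total_pages, nr_mapped and vm_dirty_ratio stand in for the real
counters and sysctl):

long dirty_thresh_sketch(long total_pages, long nr_mapped,
			 int vm_dirty_ratio)
{
	/* percentage of memory not occupied by mapped pages */
	int unmapped_ratio = 100 - (nr_mapped * 100) / total_pages;
	int dirty_ratio = vm_dirty_ratio;

	/* the effective dirty ratio is capped by the unmapped fraction,
	 * so a rapidly growing nr_file_mapped pushes dirty_thresh down
	 * while nr_file_dirty keeps growing */
	if (dirty_ratio > unmapped_ratio / 2)
		dirty_ratio = unmapped_ratio / 2;

	return total_pages * dirty_ratio / 100;
}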

At this point Pr_a enters balance_dirty_pages and starts to write back
some of its dirty pages.  After submitting some requests, it blocks
on the request queue.

The first write request will trigger Pr_b to perform a write()
syscall.  This will submit a write request to the block device and
then may enter balance_dirty_pages().
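
The write() in question is the daemon serving A's writeback by writing
to the underlying ext3 file.  Roughly, in the spirit of fusexmp_fh's
write handler (a sketch, not its exact code):

#include <errno.h>
#include <unistd.h>
#include <fuse.h>

static int xmp_write(const char *path, const char *buf, size_t size,
		     off_t offset, struct fuse_file_info *fi)
{
	/* each FUSE write request is served by a plain pwrite() on the
	 * backing file; this is the sys_pwrite64 at the bottom of
	 * Pr_b's stack trace above */
	ssize_t res = pwrite(fi->fh, buf, size, offset);

	return res == -1 ? -errno : res;
}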

The condition for exiting balance_dirty_pages() is

 - either that write_chunk pages have been written

 - or nr_file_dirty + nr_writeback < dirty_thresh

It is entirely possible that fewer than write_chunk pages were written,
in which case balance_dirty_pages() will not exit even after all the
submitted requests have been successfully completed.

Which means that the write() syscall does not return.

Which means that no more dirty pages from A will be written back, and
neither nr_writeback nor nr_file_dirty will decrease.

Which means that balance_dirty_pages() will loop forever.
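
To see why, recall the shape of the balance_dirty_pages() wait loop (a
condensed sketch with abbreviated names, not the verbatim
mm/page-writeback.c source):

	for (;;) {
		struct writeback_control wbc = { .nr_to_write = write_chunk };

		/* exit 1: the global counters fell below the threshold;
		 * cannot happen here, since A's dirty pages can only be
		 * cleaned by Pr_b, and Pr_b is stuck in this very loop */
		if (nr_file_dirty + nr_writeback < dirty_thresh)
			break;

		writeback_inodes(&wbc);		/* submit more pages */

		/* exit 2: this caller has written write_chunk pages;
		 * cannot happen either, if fewer than write_chunk pages
		 * were available for writeback in the first place */
		pages_written += write_chunk - wbc.nr_to_write;
		if (pages_written >= write_chunk)
			break;

		congestion_wait(WRITE, HZ/10);	/* as in Pr_b's trace */
	}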

What if Pr_b is multithreaded?  The first thread will enter
balance_dirty_pages() and loop there as shown above.  It will hold
i_mutex for the inode, taken in generic_file_aio_write().

The other threads now try to write back more data into the same file,
but will block on i_mutex.  So even with an unlimited number of threads
no progress is made.

Q.E.D.

The solution is to exit balance_dirty_pages() on the condition that
there are only a few dirty + writeback pages for this backing dev.  This
makes sure that there is always some progress with this setup.

The number of outstanding dirty + writeback pages is limited to 8, which
means that when over the threshold (dirty_exceeded == 1), each
filesystem may only effectively pin a maximum of 16 (+8 because of
ratelimiting) extra pages.

Note: a similar safety valve is always needed if there's a global limit
for the dirty+writeback pages, even if some per-queue (or other) soft
limit is added in the future.

Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
---

Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c	2007-02-27 14:41:07.000000000 +0100
+++ linux/mm/page-writeback.c	2007-02-27 14:41:07.000000000 +0100
@@ -201,6 +201,17 @@ static void balance_dirty_pages(struct a
 		if (!dirty_exceeded)
 			dirty_exceeded = 1;
 
+		/*
+		 * Acquit producer of dirty pages if there's little or
+		 * nothing to write back to this particular queue.
+		 *
+		 * Without this check a deadlock is possible if
+		 * one filesystem is writing data through another.
+		 */
+		if (atomic_long_read(&bdi->nr_dirty) +
+		    atomic_long_read(&bdi->nr_writeback) < 8)
+			break;
+
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
 		 * filesystems (i.e. NFS) in which data may have been

Re: [patch 03/22] fix deadlock in balance_dirty_pages

2007-03-01 Thread Andrew Morton
On Thu, 01 Mar 2007 09:37:06 +0100 Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> > Sigh.  What's this about i_mutex?  That appears to be some critical
> > information which _still_ isn't being communicated.
> > 
> 
> This:
> 
> ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
> 		unsigned long nr_segs, loff_t pos)
> {
> 	struct file *file = iocb->ki_filp;
> 	struct address_space *mapping = file->f_mapping;
> 	struct inode *inode = mapping->host;
> 	ssize_t ret;
> 
> 	BUG_ON(iocb->ki_pos != pos);
> 
> 	mutex_lock(&inode->i_mutex);
> 	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
> 			&iocb->ki_pos);
> 	mutex_unlock(&inode->i_mutex);
> 
> 
> It's in the stack trace.  I thought it was obvious.

No, it is not obvious.

I'm just not going to apply weird hacks to work around a bug which
I do not understand, and I have spent way too much time trying to understand
this one.

So let us persist.

Please fully describe the role of i_mutex in this hang.


Re: [patch 03/22] fix deadlock in balance_dirty_pages

2007-03-01 Thread Miklos Szeredi
> > > > This deadlock happens when dirty pages from one filesystem are
> > > > written back through another filesystem.  It is easiest to demonstrate
> > > > with fuse, although it could affect loopback mounts as well (see the
> > > > following patches).
> > > > 
> > > > Let's call the filesystems A(bove) and B(elow).  Process Pr_a is
> > > > writing to A, and process Pr_b is writing to B.
> > > > 
> > > > Pr_a is bash-shared-mapping.  Pr_b is the fuse filesystem daemon
> > > > (fusexmp_fh), for simplicity let's assume that Pr_b is single
> > > > threaded.
> > > > 
> > > > These are the simplified stack traces of these processes after the
> > > > deadlock:
> > > > 
> > > > Pr_a (bash-shared-mapping):
> > > > 
> > > >   (block on queue)
> > > >   fuse_writepage
> > > >   generic_writepages
> > > >   writeback_inodes
> > > >   balance_dirty_pages
> > > >   balance_dirty_pages_ratelimited_nr
> > > >   set_page_dirty_mapping_balance
> > > >   do_no_page
> > > > 
> > > > 
> > > > Pr_b (fusexmp_fh):
> > > > 
> > > >   io_schedule_timeout
> > > >   congestion_wait
> > > >   balance_dirty_pages
> > > >   balance_dirty_pages_ratelimited_nr
> > > >   generic_file_buffered_write
> > > >   generic_file_aio_write
> > > >   ext3_file_write
> > > >   do_sync_write
> > > >   vfs_write
> > > >   sys_pwrite64
> > > > 
> > > > 
> > > > Thanks to the aggressive nature of Pr_a, it can happen that
> > > > 
> > > >   nr_file_dirty > dirty_thresh + margin
> > > > 
> > > > This is due to both nr_dirty growing and dirty_thresh shrinking, which
> > > > in turn is due to nr_file_mapped rapidly growing.  The exact size of
> > > > the margin at which the deadlock happens is not known, but it's around
> > > > 100 pages.
> > > > 
> > > > At this point Pr_a enters balance_dirty_pages and starts to write back
> > > > some of its dirty pages.  After submitting some requests, it blocks
> > > > on the request queue.
> > > > 
> > > > The first write request will trigger Pr_b to perform a write()
> > > > syscall.  This will submit a write request to the block device and
> > > > then may enter balance_dirty_pages().
> > > > 
> > > > The condition for exiting balance_dirty_pages() is
> > > > 
> > > >  - either that write_chunk pages have been written
> > > > 
> > > >  - or nr_file_dirty + nr_writeback < dirty_thresh
> > > > 
> > > > It is entirely possible that fewer than write_chunk pages were written,
> > > > in which case balance_dirty_pages() will not exit even after all the
> > > > submitted requests have been successfully completed.
> > > > 
> > > > Which means that the write() syscall does not return.
> > > 
> > > But the balance_dirty_pages() loop does more than just wait for those two
> > > conditions.  It will also submit _more_ dirty pages for writeout.  ie: it
> > > should be feeding more of file A's pages into writepage.
> > > 
> > > Why isn't that happening?
> > 
> > All of A's data is actually written by B.  So just submitting more
> > pages to some queue doesn't help; it will just make the queue longer.
> > 
> > If the queue length were not limited, B had limitless threads, and
> > the write() didn't exclude other writes to the same file (i_mutex),
> > then there would be no deadlock.
> > 
> > But for fuse neither the first nor the last condition is met.
> > 
> > For the loop device the second condition isn't met: loop is single
> > threaded.
> 
> Sigh.  What's this about i_mutex?  That appears to be some critical
> information which _still_ isn't being communicated.
> 

This:

ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
		unsigned long nr_segs, loff_t pos)
{
	struct file *file = iocb->ki_filp;
	struct address_space *mapping = file->f_mapping;
	struct inode *inode = mapping->host;
	ssize_t ret;

	BUG_ON(iocb->ki_pos != pos);

	mutex_lock(&inode->i_mutex);
	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
			&iocb->ki_pos);
	mutex_unlock(&inode->i_mutex);


It's in the stack trace.  I thought it was obvious.

Miklos


Re: [patch 03/22] fix deadlock in balance_dirty_pages

2007-03-01 Thread Andrew Morton
On Thu, 01 Mar 2007 08:35:28 +0100 Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> > > This deadlock happens when dirty pages from one filesystem are
> > > written back through another filesystem.  It is easiest to demonstrate
> > > with fuse, although it could affect loopback mounts as well (see the
> > > following patches).
> > > 
> > > Let's call the filesystems A(bove) and B(elow).  Process Pr_a is
> > > writing to A, and process Pr_b is writing to B.
> > > 
> > > Pr_a is bash-shared-mapping.  Pr_b is the fuse filesystem daemon
> > > (fusexmp_fh), for simplicity let's assume that Pr_b is single
> > > threaded.
> > > 
> > > These are the simplified stack traces of these processes after the
> > > deadlock:
> > > 
> > > Pr_a (bash-shared-mapping):
> > > 
> > >   (block on queue)
> > >   fuse_writepage
> > >   generic_writepages
> > >   writeback_inodes
> > >   balance_dirty_pages
> > >   balance_dirty_pages_ratelimited_nr
> > >   set_page_dirty_mapping_balance
> > >   do_no_page
> > > 
> > > 
> > > Pr_b (fusexmp_fh):
> > > 
> > >   io_schedule_timeout
> > >   congestion_wait
> > >   balance_dirty_pages
> > >   balance_dirty_pages_ratelimited_nr
> > >   generic_file_buffered_write
> > >   generic_file_aio_write
> > >   ext3_file_write
> > >   do_sync_write
> > >   vfs_write
> > >   sys_pwrite64
> > > 
> > > 
> > > Thanks to the aggressive nature of Pr_a, it can happen that
> > > 
> > >   nr_file_dirty > dirty_thresh + margin
> > > 
> > > This is due to both nr_dirty growing and dirty_thresh shrinking, which
> > > in turn is due to nr_file_mapped rapidly growing.  The exact size of
> > > the margin at which the deadlock happens is not known, but it's around
> > > 100 pages.
> > > 
> > > At this point Pr_a enters balance_dirty_pages and starts to write back
> > > some of its dirty pages.  After submitting some requests, it blocks
> > > on the request queue.
> > > 
> > > The first write request will trigger Pr_b to perform a write()
> > > syscall.  This will submit a write request to the block device and
> > > then may enter balance_dirty_pages().
> > > 
> > > The condition for exiting balance_dirty_pages() is
> > > 
> > >  - either that write_chunk pages have been written
> > > 
> > >  - or nr_file_dirty + nr_writeback < dirty_thresh
> > > 
> > > It is entirely possible that fewer than write_chunk pages were written,
> > > in which case balance_dirty_pages() will not exit even after all the
> > > submitted requests have been successfully completed.
> > > 
> > > Which means that the write() syscall does not return.
> > 
> > But the balance_dirty_pages() loop does more than just wait for those two
> > conditions.  It will also submit _more_ dirty pages for writeout.  ie: it
> > should be feeding more of file A's pages into writepage.
> > 
> > Why isn't that happening?
> 
> All of A's data is actually written by B.  So just submitting more
> pages to some queue doesn't help; it will just make the queue longer.
> 
> If the queue length were not limited, B had limitless threads, and
> the write() didn't exclude other writes to the same file (i_mutex),
> then there would be no deadlock.
> 
> But for fuse neither the first nor the last condition is met.
> 
> For the loop device the second condition isn't met: loop is single
> threaded.

Sigh.  What's this about i_mutex?  That appears to be some critical
information which _still_ isn't being communicated.


Re: [patch 03/22] fix deadlock in balance_dirty_pages

2007-02-28 Thread Miklos Szeredi
> > This deadlock happens when dirty pages from one filesystem are
> > written back through another filesystem.  It is easiest to demonstrate
> > with fuse, although it could affect loopback mounts as well (see the
> > following patches).
> > 
> > Let's call the filesystems A(bove) and B(elow).  Process Pr_a is
> > writing to A, and process Pr_b is writing to B.
> > 
> > Pr_a is bash-shared-mapping.  Pr_b is the fuse filesystem daemon
> > (fusexmp_fh), for simplicity let's assume that Pr_b is single
> > threaded.
> > 
> > These are the simplified stack traces of these processes after the
> > deadlock:
> > 
> > Pr_a (bash-shared-mapping):
> > 
> >   (block on queue)
> >   fuse_writepage
> >   generic_writepages
> >   writeback_inodes
> >   balance_dirty_pages
> >   balance_dirty_pages_ratelimited_nr
> >   set_page_dirty_mapping_balance
> >   do_no_page
> > 
> > 
> > Pr_b (fusexmp_fh):
> > 
> >   io_schedule_timeout
> >   congestion_wait
> >   balance_dirty_pages
> >   balance_dirty_pages_ratelimited_nr
> >   generic_file_buffered_write
> >   generic_file_aio_write
> >   ext3_file_write
> >   do_sync_write
> >   vfs_write
> >   sys_pwrite64
> > 
> > 
> > Thanks to the aggressive nature of Pr_a, it can happen that
> > 
> >   nr_file_dirty > dirty_thresh + margin
> > 
> > This is due to both nr_dirty growing and dirty_thresh shrinking, which
> > in turn is due to nr_file_mapped rapidly growing.  The exact size of
> > the margin at which the deadlock happens is not known, but it's around
> > 100 pages.
> > 
> > At this point Pr_a enters balance_dirty_pages and starts to write back
> > some of its dirty pages.  After submitting some requests, it blocks
> > on the request queue.
> > 
> > The first write request will trigger Pr_b to perform a write()
> > syscall.  This will submit a write request to the block device and
> > then may enter balance_dirty_pages().
> > 
> > The condition for exiting balance_dirty_pages() is
> > 
> >  - either that write_chunk pages have been written
> > 
> >  - or nr_file_dirty + nr_writeback < dirty_thresh
> > 
> > It is entirely possible that fewer than write_chunk pages were written,
> > in which case balance_dirty_pages() will not exit even after all the
> > submitted requests have been successfully completed.
> > 
> > Which means that the write() syscall does not return.
> 
> But the balance_dirty_pages() loop does more than just wait for those two
> conditions.  It will also submit _more_ dirty pages for writeout.  ie: it
> should be feeding more of file A's pages into writepage.
> 
> Why isn't that happening?

All of A's data is actually written by B.  So just submitting more
pages to some queue doesn't help; it will just make the queue longer.
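
Schematically, once B's bounded request queue fills up, any further
submitter just sleeps (queue_is_full() below is a hypothetical stand-in
for the block layer's bookkeeping):

	/* Pr_a, the "(block on queue)" line in its trace: it sleeps
	 * until B completes some requests, and B completes nothing
	 * because its only thread is stuck in balance_dirty_pages() */
	while (queue_is_full(q))
		io_schedule();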

If the queue length were not limited, B had limitless threads, and
the write() didn't exclude other writes to the same file (i_mutex),
then there would be no deadlock.

But for fuse neither the first nor the last condition is met.

For the loop device the second condition isn't met: loop is single
threaded.

Thanks,
Miklos


Re: [patch 03/22] fix deadlock in balance_dirty_pages

2007-02-28 Thread Andrew Morton
On Tue, 27 Feb 2007 23:38:12 +0100 Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> From: Miklos Szeredi <[EMAIL PROTECTED]>
> 
> This deadlock happens when dirty pages from one filesystem are
> written back through another filesystem.  It is easiest to demonstrate
> with fuse, although it could affect loopback mounts as well (see the
> following patches).
> 
> Let's call the filesystems A(bove) and B(elow).  Process Pr_a is
> writing to A, and process Pr_b is writing to B.
> 
> Pr_a is bash-shared-mapping.  Pr_b is the fuse filesystem daemon
> (fusexmp_fh), for simplicity let's assume that Pr_b is single
> threaded.
> 
> These are the simplified stack traces of these processes after the
> deadlock:
> 
> Pr_a (bash-shared-mapping):
> 
>   (block on queue)
>   fuse_writepage
>   generic_writepages
>   writeback_inodes
>   balance_dirty_pages
>   balance_dirty_pages_ratelimited_nr
>   set_page_dirty_mapping_balance
>   do_no_page
> 
> 
> Pr_b (fusexmp_fh):
> 
>   io_schedule_timeout
>   congestion_wait
>   balance_dirty_pages
>   balance_dirty_pages_ratelimited_nr
>   generic_file_buffered_write
>   generic_file_aio_write
>   ext3_file_write
>   do_sync_write
>   vfs_write
>   sys_pwrite64
> 
> 
> Thanks to the aggressive nature of Pr_a, it can happen that
> 
>   nr_file_dirty > dirty_thresh + margin
> 
> This is due to both nr_dirty growing and dirty_thresh shrinking, which
> in turn is due to nr_file_mapped rapidly growing.  The exact size of
> the margin at which the deadlock happens is not known, but it's around
> 100 pages.
> 
> At this point Pr_a enters balance_dirty_pages and starts to write back
> some of its dirty pages.  After submitting some requests, it blocks
> on the request queue.
> 
> The first write request will trigger Pr_b to perform a write()
> syscall.  This will submit a write request to the block device and
> then may enter balance_dirty_pages().
> 
> The condition for exiting balance_dirty_pages() is
> 
>  - either that write_chunk pages have been written
> 
>  - or nr_file_dirty + nr_writeback < dirty_thresh
> 
> It is entirely possible that fewer than write_chunk pages were written,
> in which case balance_dirty_pages() will not exit even after all the
> submitted requests have been successfully completed.
> 
> Which means that the write() syscall does not return.

But the balance_dirty_pages() loop does more than just wait for those two
conditions.  It will also submit _more_ dirty pages for writeout.  ie: it
should be feeding more of file A's pages into writepage.

Why isn't that happening?

> Which means that no more dirty pages from A will be written back, and
> neither nr_writeback nor nr_file_dirty will decrease.
> 
> Which means that balance_dirty_pages() will loop forever.
> 
> Q.E.D.
> 
> The solution is to exit balance_dirty_pages() on the condition that
> there are only a few dirty + writeback pages for this backing dev.  This
> makes sure that there is always some progress with this setup.
> 
> The number of outstanding dirty + writeback pages is limited to 8, which
> means that when over the threshold (dirty_exceeded == 1), each
> filesystem may only effectively pin a maximum of 16 (+8 because of
> ratelimiting) extra pages.
> 
> Note: a similar safety valve is always needed if there's a global limit
> for the dirty+writeback pages, even if some per-queue (or other) soft
> limit is added in the future.
> 
> Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
> ---
> 
> Index: linux/mm/page-writeback.c
> ===================================================================
> --- linux.orig/mm/page-writeback.c	2007-02-27 14:41:07.000000000 +0100
> +++ linux/mm/page-writeback.c	2007-02-27 14:41:07.000000000 +0100
> @@ -201,6 +201,17 @@ static void balance_dirty_pages(struct a
>  		if (!dirty_exceeded)
>  			dirty_exceeded = 1;
>  
> +		/*
> +		 * Acquit producer of dirty pages if there's little or
> +		 * nothing to write back to this particular queue.
> +		 *
> +		 * Without this check a deadlock is possible if
> +		 * one filesystem is writing data through another.
> +		 */
> +		if (atomic_long_read(&bdi->nr_dirty) +
> +		    atomic_long_read(&bdi->nr_writeback) < 8)
> +			break;
> +
>  		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
>  		 * Unstable writes are a feature of certain networked
>  		 * filesystems (i.e. NFS) in which data may have been
> 
> --