Re: 2.6.23-rc6-mm1 -- mkfs stuck in 'D'

2007-09-24 Thread Fengguang Wu
On Mon, Sep 24, 2007 at 09:35:23AM +0200, Peter Zijlstra wrote:
> On Mon, 24 Sep 2007 11:01:10 +0800 Fengguang Wu <[EMAIL PROTECTED]>
> wrote:
> 
> > > That is an interesting idea, how about this:
> > 
> > It looks like a workaround, but it does solve the most important problem.
> > And it is a good logic by itself.  So I'd vote for it.
> > 
> > The fundamental problem is that the per-bdi-writeback-completion based
> > estimation is not accurate under light loads. The problem remains for
> > a light-load sda when there is a heavy-load sdb. 
> 
> Well, sure, in that case sda would get to write out a lot of small
> things. But in that case it would be fair wrt the other writers.

Hmm, I cannot agree that it's fair - but it's pretty acceptable ;-)
Your patch already brings great improvements in the multi-bdi case.

> > One more workaround
> > could be to grant bdi(s) a minimal bdi_thresh. 
> 
> Ah, no, that is no good. For if there were a lot of BDIs this might
> happen:
>   nr_bdis * min_thresh > dirty_limit.

Sure, it is in the extreme case. However, the limit could still be
ensured if we really want it (which I'm really not sure we do ;-):

	if (nr_reclaimable + nr_writeback < dirty_thresh &&
	    bdi_nr_reclaimable + bdi_nr_writeback <= bdi_min_thresh)
		break;

> > Or better to adjust the estimation logic?
> 
> Not sure what we can do here. The current thing is simple, fast and fair.

Agreed.

> > > + /*
> > > +  * break out early when:
> > > +  *  - we're below the bdi limit
> > > +  *  - we're below half the total limit
> > > +  *
> > > +  * we let the numbers exceed the strict bdi limit if the total
> > > +  * numbers are too low, this avoids (excessive) small writeouts.
> > > +  */
> > > + if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh ||
> > > + nr_reclaimable + nr_writeback < dirty_thresh / 2)
> > >   break;
> > 
> > This may be slightly better:
> > 
> > 	if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> > 		break;
> > 	/*
> > 	 * Throttle it only when the background writeback cannot
> > 	 * catch up.
> > 	 */
> > 	if (nr_reclaimable + nr_writeback <
> > 			(background_thresh + dirty_thresh) / 2)
> > 		break;
> 
> Ah, indeed. Good idea.

Thank you :-)


Re: 2.6.23-rc6-mm1 -- mkfs stuck in 'D'

2007-09-24 Thread Peter Zijlstra
On Mon, 24 Sep 2007 11:01:10 +0800 Fengguang Wu <[EMAIL PROTECTED]>
wrote:

> > That is an interesting idea, how about this:
> 
> It looks like a workaround, but it does solve the most important problem.
> And it is a good logic by itself.  So I'd vote for it.
> 
> The fundamental problem is that the per-bdi-writeback-completion based
> estimation is not accurate under light loads. The problem remains for
> a light-load sda when there is a heavy-load sdb. 

Well, sure, in that case sda would get to write out a lot of small
things. But in that case it would be fair wrt the other writers.

> One more workaround
> could be to grant bdi(s) a minimal bdi_thresh. 

Ah, no, that is no good. For if there were a lot of BDIs this might
happen:
  nr_bdis * min_thresh > dirty_limit.
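
(For illustration, with made-up numbers: at a dirty_limit of 50000
pages, a per-bdi floor of just 1000 pages would exceed the global
limit as soon as the system has more than 50 BDIs.)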

> Or better to adjust the estimation logic?

Not sure what we can do here. The current thing is simple, fast and fair.

> > +   /*
> > +* break out early when:
> > +*  - we're below the bdi limit
> > +*  - we're below half the total limit
> > +*
> > +* we let the numbers exceed the strict bdi limit if the total
> > +* numbers are too low, this avoids (excessive) small writeouts.
> > +*/
> > +   if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh ||
> > +   nr_reclaimable + nr_writeback < dirty_thresh / 2)
> > break;
> 
> This may be slightly better:
> 
> 	if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> 		break;
> 	/*
> 	 * Throttle it only when the background writeback cannot
> 	 * catch up.
> 	 */
> 	if (nr_reclaimable + nr_writeback <
> 			(background_thresh + dirty_thresh) / 2)
> 		break;

Ah, indeed. Good idea.
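
[ As a sketch of how the two pieces fit together - the early per-bdi
  check plus the background/dirty midpoint test - the resulting logic
  in balance_dirty_pages() would read roughly as follows; variable
  names are those of the -mm patches quoted in this thread, not
  necessarily the exact merged hunk: ]

	if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
		break;

	/*
	 * Throttle it only when the background writeback cannot
	 * catch up. This lets the numbers exceed the strict bdi
	 * limit while the totals are low, avoiding (excessive)
	 * small writeouts during ramp-up.
	 */
	if (nr_reclaimable + nr_writeback <
			(background_thresh + dirty_thresh) / 2)
		break;
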

Re: 2.6.23-rc6-mm1 -- mkfs stuck in 'D'

2007-09-23 Thread Fengguang Wu
On Sun, Sep 23, 2007 at 03:02:35PM +0200, Peter Zijlstra wrote:
> On Sun, 23 Sep 2007 09:20:49 +0800 Fengguang Wu <[EMAIL PROTECTED]>
> wrote:
> 
> > On Sat, Sep 22, 2007 at 03:16:22PM +0200, Peter Zijlstra wrote:
> > > On Sat, 22 Sep 2007 09:55:09 +0800 Fengguang Wu <[EMAIL PROTECTED]>
> > > wrote:
> > > 
> > > > --- linux-2.6.22.orig/mm/page-writeback.c
> > > > +++ linux-2.6.22/mm/page-writeback.c
> > > > @@ -426,6 +426,14 @@ static void balance_dirty_pages(struct a
> > > > bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> > > > }
> > > >  
> > > > +   printk(KERN_DEBUG "balance_dirty_pages written %lu %lu 
> > > > congested %d limits %lu %lu %lu %lu %lu %ld\n",
> > > > +   pages_written,
> > > > +   write_chunk - wbc.nr_to_write,
> > > > +   bdi_write_congested(bdi),
> > > > +   background_thresh, dirty_thresh,
> > > > +   bdi_thresh, bdi_nr_reclaimable, 
> > > > bdi_nr_writeback,
> > > > +   bdi_thresh - bdi_nr_reclaimable - 
> > > > bdi_nr_writeback);
> > > > +
> > > > if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> > > > break;
> > > > if (pages_written >= write_chunk)
> > > > 
> > > 
> > > > [ 1305.361511] balance_dirty_pages written 0 0 congested 0 limits 48869 
> > > > 195477 5801 5760 288 -247
> > > 
> > > <snip long series of mostly identical lines>
> > > 
> > > Could you perhaps instrument the writeback_inodes() path to see why
> > > nothing is written out? - the attached patch would be a nice start.
> > 
> > Curiously the lockup problem disappeared after upgrading to 2.6.23-rc6-mm1.
> > (need to watch it in a longer time window).
> > 
> > Anyway here's the output of your patch:
> > sb_locked 0
> > sb_empty 97011
> 
> Is this the delta during one of these lockups? If so, it would seem

delta since boot time, for 2.6.23-rc6-mm1, no lockups ;-)

> that although dirty pages are reported against the BDI, no actual dirty
> inodes could be found.

no lockups, therefore not necessarily.
There are many other calls into writeback_inodes().

> [ note to self: writeback_inodes() seems to write out to any superblock
>   in the system. Might want to limit that to superblocks on wbc->bdi ]

generic_sync_sb_inodes() does have something like:

	if (wbc->bdi && bdi != wbc->bdi)
		continue;

> You say that switching to .23-rc6-mm1 solved it in your case. You are
> developing in the writeback_inodes() path, right? Could it be one of
> your local changes that confused it here?

There are a lot of changes between them:
- bdi-v9 vs bdi-v10;
- a lot of writeback patches in -mm
- some writeback patches maintained locally
I just rebased my patches to .23-rc6-mm1...

> > > Most peculiar. It seems writeback_inodes() doesn't even attempt to
> > > write out stuff. Nor are outstanding writeback pages completed.
> > 
> > Still true. Another problem is that balance_dirty_pages() is being called 
> > even
> > when there are only 54 dirty pages. That could slow down writers 
> > unnecessarily.
> > 
> > balance_dirty_pages() should not be entered at all with small nr_dirty.
> > 
> > Look at these lines:
> > [  197.471619] balance_dirty_pages for tar written 405 405 congested 0 
> > global 196554 54 403 196097 bdi 0 0 398 -398
> > [  197.472196] balance_dirty_pages for tar written 405 0 congested 0 global 
> > 196554 54 372 196128 bdi 0 0 380 -380
> > [  197.472893] balance_dirty_pages for tar written 405 0 congested 0 global 
> > 196554 54 372 196128 bdi 23 0 369 -346
> > [  197.473158] balance_dirty_pages for tar written 405 0 congested 0 global 
> > 196554 54 372 196128 bdi 23 0 366 -343
> > [  197.473403] balance_dirty_pages for tar written 405 0 congested 0 global 
> > 196554 54 372 196128 bdi 23 0 365 -342
> > [  197.473674] balance_dirty_pages for tar written 405 0 congested 0 global 
> > 196554 54 372 196128 bdi 23 0 364 -341
> > [  197.474265] balance_dirty_pages for tar written 405 0 congested 0 global 
> > 196554 54 372 196128 bdi 23 0 362 -339
> > [  197.475440] balance_dirty_pages for tar written 405 0 congested 0 global 
> > 196554 54 341 196159 bdi 47 0 327 -280
> > [  197.476970] balance_dirty_pages for tar written 405 0 congested 0 global 
> > 196546 54 279 196213 bdi 95 0 279 -184
> > [  197.43] balance_dirty_pages for tar written 405 0 congested 0 global 
> > 196546 54 248 196244 bdi 95 0 255 -160
> > [  197.479463] balance_dirty_pages for tar written 405 0 congested 0 global 
> > 196546 54 217 196275 bdi 143 0 210 -67
> > [  197.479656] balance_dirty_pages for tar written 405 0 congested 0 global 
> > 196546 54 217 196275 bdi 143 0 209 -66
> > [  197.481159] balance_dirty_pages for tar written 405 0 congested 0 global 
> > 196546 54 155 196337 bdi 167 0 163 4
> 
> That is an interesting idea, how about this:

It looks like a workaround, but it does solve the most important problem.
And it is a good logic by itself.  So I'd vote for it.

The fundamental problem is that the per-bdi-writeback-completion based
estimation is not accurate under light loads. The problem remains for
a light-load sda when there is a heavy-load sdb.

Re: 2.6.23-rc6-mm1 -- mkfs stuck in 'D'

2007-09-23 Thread Peter Zijlstra
On Sun, 23 Sep 2007 09:20:49 +0800 Fengguang Wu <[EMAIL PROTECTED]>
wrote:

> On Sat, Sep 22, 2007 at 03:16:22PM +0200, Peter Zijlstra wrote:
> > On Sat, 22 Sep 2007 09:55:09 +0800 Fengguang Wu <[EMAIL PROTECTED]>
> > wrote:
> > 
> > > --- linux-2.6.22.orig/mm/page-writeback.c
> > > +++ linux-2.6.22/mm/page-writeback.c
> > > @@ -426,6 +426,14 @@ static void balance_dirty_pages(struct a
> > >   bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> > >   }
> > >  
> > > + printk(KERN_DEBUG "balance_dirty_pages written %lu %lu 
> > > congested %d limits %lu %lu %lu %lu %lu %ld\n",
> > > + pages_written,
> > > + write_chunk - wbc.nr_to_write,
> > > + bdi_write_congested(bdi),
> > > + background_thresh, dirty_thresh,
> > > + bdi_thresh, bdi_nr_reclaimable, 
> > > bdi_nr_writeback,
> > > + bdi_thresh - bdi_nr_reclaimable - 
> > > bdi_nr_writeback);
> > > +
> > >   if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> > >   break;
> > >   if (pages_written >= write_chunk)
> > > 
> > 
> > > [ 1305.361511] balance_dirty_pages written 0 0 congested 0 limits 48869 
> > > 195477 5801 5760 288 -247
> > 
> > <snip long series of mostly identical lines>
> > 
> > Could you perhaps instrument the writeback_inodes() path to see why
> > nothing is written out? - the attached patch would be a nice start.
> 
> Curiously the lockup problem disappeared after upgrading to 2.6.23-rc6-mm1.
> (need to watch it in a longer time window).
> 
> Anyway here's the output of your patch:
> sb_locked 0
> sb_empty 97011

Is this the delta during one of these lockups? If so, it would seem
that although dirty pages are reported against the BDI, no actual dirty
inodes could be found.

[ note to self: writeback_inodes() seems to write out to any superblock
  in the system. Might want to limit that to superblocks on wbc->bdi ]

You say that switching to .23-rc6-mm1 solved it in your case. You are
developing in the writeback_inodes() path, right? Could it be one of
your local changes that confused it here?

> > Most peculiar. It seems writeback_inodes() doesn't even attempt to
> > write out stuff. Nor are outstanding writeback pages completed.
> 
> Still true. Another problem is that balance_dirty_pages() is being called even
> when there are only 54 dirty pages. That could slow down writers 
> unnecessarily.
> 
> balance_dirty_pages() should not be entered at all with small nr_dirty.
> 
> Look at these lines:
> [  197.471619] balance_dirty_pages for tar written 405 405 congested 0 global 
> 196554 54 403 196097 bdi 0 0 398 -398
> [  197.472196] balance_dirty_pages for tar written 405 0 congested 0 global 
> 196554 54 372 196128 bdi 0 0 380 -380
> [  197.472893] balance_dirty_pages for tar written 405 0 congested 0 global 
> 196554 54 372 196128 bdi 23 0 369 -346
> [  197.473158] balance_dirty_pages for tar written 405 0 congested 0 global 
> 196554 54 372 196128 bdi 23 0 366 -343
> [  197.473403] balance_dirty_pages for tar written 405 0 congested 0 global 
> 196554 54 372 196128 bdi 23 0 365 -342
> [  197.473674] balance_dirty_pages for tar written 405 0 congested 0 global 
> 196554 54 372 196128 bdi 23 0 364 -341
> [  197.474265] balance_dirty_pages for tar written 405 0 congested 0 global 
> 196554 54 372 196128 bdi 23 0 362 -339
> [  197.475440] balance_dirty_pages for tar written 405 0 congested 0 global 
> 196554 54 341 196159 bdi 47 0 327 -280
> [  197.476970] balance_dirty_pages for tar written 405 0 congested 0 global 
> 196546 54 279 196213 bdi 95 0 279 -184
> [  197.43] balance_dirty_pages for tar written 405 0 congested 0 global 
> 196546 54 248 196244 bdi 95 0 255 -160
> [  197.479463] balance_dirty_pages for tar written 405 0 congested 0 global 
> 196546 54 217 196275 bdi 143 0 210 -67
> [  197.479656] balance_dirty_pages for tar written 405 0 congested 0 global 
> 196546 54 217 196275 bdi 143 0 209 -66
> [  197.481159] balance_dirty_pages for tar written 405 0 congested 0 global 
> 196546 54 155 196337 bdi 167 0 163 4

That is an interesting idea, how about this:

---
Subject: mm: speed up writeback ramp-up on clean systems

We allow violation of bdi limits if there is a lot of room on the
system. Once we hit half the total limit we start enforcing bdi limits
and bdi ramp-up should happen. Doing it this way avoids many small
writeouts on an otherwise idle system and should also speed up the
ramp-up.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---

Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -355,8 +355,8 @@ get_dirty_limits(long *pbackground, long
  */
 static void balance_dirty_pages(struct address_space *mapping)
 {
-   long bdi_nr_reclaimable;
-   long bdi_nr_writeback;
+   long nr_reclaimable, bdi_nr_reclaimable;
+   long nr_writeback, bdi_nr_writeback;
    long background_thresh;


Re: 2.6.23-rc6-mm1 -- mkfs stuck in 'D'

2007-09-22 Thread Fengguang Wu
On Sat, Sep 22, 2007 at 03:16:22PM +0200, Peter Zijlstra wrote:
> On Sat, 22 Sep 2007 09:55:09 +0800 Fengguang Wu <[EMAIL PROTECTED]>
> wrote:
> 
> > --- linux-2.6.22.orig/mm/page-writeback.c
> > +++ linux-2.6.22/mm/page-writeback.c
> > @@ -426,6 +426,14 @@ static void balance_dirty_pages(struct a
> > bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> > }
> >  
> > +   printk(KERN_DEBUG "balance_dirty_pages written %lu %lu 
> > congested %d limits %lu %lu %lu %lu %lu %ld\n",
> > +   pages_written,
> > +   write_chunk - wbc.nr_to_write,
> > +   bdi_write_congested(bdi),
> > +   background_thresh, dirty_thresh,
> > +   bdi_thresh, bdi_nr_reclaimable, 
> > bdi_nr_writeback,
> > +   bdi_thresh - bdi_nr_reclaimable - 
> > bdi_nr_writeback);
> > +
> > if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> > break;
> > if (pages_written >= write_chunk)
> > 
> 
> > [ 1305.361511] balance_dirty_pages written 0 0 congested 0 limits 48869 
> > 195477 5801 5760 288 -247
> 
> <snip long series of mostly identical lines>
> 
> Could you perhaps instrument the writeback_inodes() path to see why
> nothing is written out? - the attached patch would be a nice start.

Curiously the lockup problem disappeared after upgrading to 2.6.23-rc6-mm1.
(need to watch it in a longer time window).

Anyway here's the output of your patch:
sb_locked 0
sb_empty 97011

> Most peculiar. It seems writeback_inodes() doesn't even attempt to
> write out stuff. Nor are outstanding writeback pages completed.

Still true. Another problem is that balance_dirty_pages() is being called even
when there are only 54 dirty pages. That could slow down writers unnecessarily.

balance_dirty_pages() should not be entered at all with small nr_dirty.

Look at these lines:
[  197.471619] balance_dirty_pages for tar written 405 405 congested 0 global 
196554 54 403 196097 bdi 0 0 398 -398
[  197.472196] balance_dirty_pages for tar written 405 0 congested 0 global 
196554 54 372 196128 bdi 0 0 380 -380
[  197.472893] balance_dirty_pages for tar written 405 0 congested 0 global 
196554 54 372 196128 bdi 23 0 369 -346
[  197.473158] balance_dirty_pages for tar written 405 0 congested 0 global 
196554 54 372 196128 bdi 23 0 366 -343
[  197.473403] balance_dirty_pages for tar written 405 0 congested 0 global 
196554 54 372 196128 bdi 23 0 365 -342
[  197.473674] balance_dirty_pages for tar written 405 0 congested 0 global 
196554 54 372 196128 bdi 23 0 364 -341
[  197.474265] balance_dirty_pages for tar written 405 0 congested 0 global 
196554 54 372 196128 bdi 23 0 362 -339
[  197.475440] balance_dirty_pages for tar written 405 0 congested 0 global 
196554 54 341 196159 bdi 47 0 327 -280
[  197.476970] balance_dirty_pages for tar written 405 0 congested 0 global 
196546 54 279 196213 bdi 95 0 279 -184
[  197.43] balance_dirty_pages for tar written 405 0 congested 0 global 
196546 54 248 196244 bdi 95 0 255 -160
[  197.479463] balance_dirty_pages for tar written 405 0 congested 0 global 
196546 54 217 196275 bdi 143 0 210 -67
[  197.479656] balance_dirty_pages for tar written 405 0 congested 0 global 
196546 54 217 196275 bdi 143 0 209 -66
[  197.481159] balance_dirty_pages for tar written 405 0 congested 0 global 
196546 54 155 196337 bdi 167 0 163 4

The trace messages are generated by the following code:

--- linux-2.6.23-rc6-mm1.orig/mm/page-writeback.c
+++ linux-2.6.23-rc6-mm1/mm/page-writeback.c
@@ -421,6 +421,18 @@ static void balance_dirty_pages(struct a
bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
}

+   printk(KERN_DEBUG "balance_dirty_pages for %s written %lu %lu 
congested %d "
+   "global %lu %lu %lu %ld bdi %lu %lu %lu %ld\n",
+   current->comm,
+   pages_written, write_chunk - wbc.nr_to_write,
+   bdi_write_congested(bdi),
+   dirty_thresh,
+   global_dirty_unstable_pages(), 
global_page_state(NR_WRITEBACK),
+   dirty_thresh -
+   global_dirty_unstable_pages() - 
global_page_state(NR_WRITEBACK),
+   bdi_thresh, bdi_nr_reclaimable, 
bdi_nr_writeback,
+   bdi_thresh - bdi_nr_reclaimable - 
bdi_nr_writeback);
+   
if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
break;
if (pages_written >= write_chunk)
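
[ Reading the first trace line above against this printk format (an
  illustrative decode): "written 405 405" -> pages_written and pages
  written this pass; "congested 0" -> bdi_write_congested(bdi);
  "global 196554 54 403 196097" -> dirty_thresh, dirty+unstable pages,
  writeback pages, remaining global headroom; "bdi 0 0 398 -398" ->
  bdi_thresh, bdi_nr_reclaimable, bdi_nr_writeback, remaining bdi
  headroom. So only 54 pages are dirty system-wide, yet the task is
  throttled because bdi_nr_writeback (398) alone exceeds the still
  tiny bdi_thresh (here 0). ]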


Re: 2.6.23-rc6-mm1 -- mkfs stuck in 'D'

2007-09-22 Thread Peter Zijlstra
On Sat, 22 Sep 2007 09:55:09 +0800 Fengguang Wu <[EMAIL PROTECTED]>
wrote:

> --- linux-2.6.22.orig/mm/page-writeback.c
> +++ linux-2.6.22/mm/page-writeback.c
> @@ -426,6 +426,14 @@ static void balance_dirty_pages(struct a
>   bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
>   }
>  
> + printk(KERN_DEBUG "balance_dirty_pages written %lu %lu 
> congested %d limits %lu %lu %lu %lu %lu %ld\n",
> + pages_written,
> + write_chunk - wbc.nr_to_write,
> + bdi_write_congested(bdi),
> + background_thresh, dirty_thresh,
> + bdi_thresh, bdi_nr_reclaimable, 
> bdi_nr_writeback,
> + bdi_thresh - bdi_nr_reclaimable - 
> bdi_nr_writeback);
> +
>   if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
>   break;
>   if (pages_written >= write_chunk)
> 

> [ 1305.361511] balance_dirty_pages written 0 0 congested 0 limits 48869 
> 195477 5801 5760 288 -247

<snip long series of mostly identical lines>

Most peculiar. It seems writeback_inodes() doesn't even attempt to
write out stuff. Nor are outstanding writeback pages completed.

Could you perhaps instrument the writeback_inodes() path to see why
nothing is written out? - the attached patch would be a nice start.

> Here are some messages when doing large dds:

> [  511.733791] balance_dirty_pages written 1540 1540 congested 0 limits 49728 
> 198913 10999 18288 0 -7289
> [  511.735048] balance_dirty_pages written 1540 1540 congested 0 limits 49728 
> 198913 12012 16752 0 -4740
> [  511.736506] balance_dirty_pages written 1540 1540 congested 0 limits 49728 
> 198913 12306 15192 1056 -3942
> [  512.002169] balance_dirty_pages written 1547 1547 congested 2 limits 49726 
> 198905 13471 12624 1848 -1001
> [  512.003795] balance_dirty_pages written 1540 1540 congested 2 limits 49723 
> 198892 13470 11088 3384 -1002
> [  512.083517] balance_dirty_pages written 1540 1540 congested 2 limits 49712 
> 198850 13572 9336 4992 -756
> [  512.085145] balance_dirty_pages written 1540 1540 congested 2 limits 49706 
> 198825 13569 7800 6528 -759
> [  512.086773] balance_dirty_pages written 1540 1540 congested 2 limits 49702 
> 198808 13568 6240 8064 -736
> [  512.184267] balance_dirty_pages written 1539 1539 congested 2 limits 49697 
> 198791 13649 5592 8592 -535
> [  512.185903] balance_dirty_pages written 1540 1540 congested 2 limits 49694 
> 198778 13649 4056 10152 -559
> [  512.187506] balance_dirty_pages written 1540 1540 congested 2 limits 49688 
> 198753 13647 2496 11688 -537
> [  512.259848] balance_dirty_pages written 1546 1546 congested 2 limits 49682 
> 198728 13769 744 13248 -223
> [  512.554646] balance_dirty_pages written 618 618 congested 2 limits 49678 
> 198712 14242 1 13368 873
> [  512.585204] balance_dirty_pages written 794 794 congested 2 limits 49657 
> 198630 14500 1 12936 1563
> [  527.714294] balance_dirty_pages written 1540 1540 congested 0 limits 49608 
> 198432 29502 28080 0 1422

This looks like a sane series: we have a surplus of reclaimable pages,
start writeout, which increases writeback pages and congests the
device; eventually it all subsides, congestion clears, and we quit.
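
[ Decoding the first line of the large-dd trace against the printk
  format in the quoted patch (illustrative): "written 1540 1540" ->
  pages_written and pages written this pass; "limits 49728 198913
  10999 18288 0 -7289" -> background_thresh, dirty_thresh, bdi_thresh,
  bdi_nr_reclaimable, bdi_nr_writeback, bdi headroom. With 18288
  reclaimable pages against a bdi_thresh of 10999 the bdi is well over
  its share, so continued writeout is exactly right. ]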

> [  529.298022] balance_dirty_pages written 1540 1540 congested 0 limits 49579 
> 198318 30307 34704 0 -4397
> [  529.304975] balance_dirty_pages written 1539 1539 congested 0 limits 49575 
> 198302 32451 30600 0 1851
> [  529.305205] balance_dirty_pages written 1538 1538 congested 0 limits 49576 
> 198306 32571 30384 0 2187
> [  529.306988] balance_dirty_pages written 1537 1537 congested 0 limits 49580 
> 198320 32702 30120 0 2582
> [  530.893830] balance_dirty_pages written 1541 1541 congested 0 limits 49553 
> 198214 34216 35352 0 -1136
> [  530.893970] balance_dirty_pages written 1538 1538 congested 0 limits 49553 
> 198214 34290 35160 0 -870
> [  530.899873] balance_dirty_pages written 1546 1546 congested 0 limits 49556 
> 198227 36248 31248 0 5000
> [  530.900282] balance_dirty_pages written 1546 1546 congested 0 limits 49557 
> 198231 36442 30864 0 5578
> [  530.900586] balance_dirty_pages written 1539 1539 congested 0 limits 49558 
> 198235 36601 30552 0 6049
> [  532.343097] balance_dirty_pages written 1541 1541 congested 0 limits 49530 
> 198120 37862 36072 0 1790
> [  532.343595] balance_dirty_pages written 1547 1547 congested 0 limits 49533 
> 198132 38081 35640 0 2441
> [  533.872355] balance_dirty_pages written 1540 1540 congested 0 limits 49502 
> 198009 41148 37224 0 3924
> [  542.566083] balance_dirty_pages written 1542 1542 congested 0 limits 49367 
> 197469 51948 52680 0 -732
> [  542.567093] balance_dirty_pages written 1537 1537 congested 0 limits 49370 
> 197482 52663 50952 0 1711
> [  542.586552] balance_dirty_pages written 1540 1540 congested 0 limits 49352 
> 197410 54545 46032 0 8513
> [  542.606002] balance_dirty_pages written 1540 1540 congested 0 limits 49337 
> 197350 55132 44520 0 10612


Re: 2.6.23-rc6-mm1 -- mkfs stuck in 'D'

2007-09-21 Thread Fengguang Wu
On Thu, Sep 20, 2007 at 12:31:39PM +0100, Hugh Dickins wrote:
> On Wed, 19 Sep 2007, Peter Zijlstra wrote:
> > On Wed, 19 Sep 2007 21:03:19 +0100 (BST) Hugh Dickins
> > <[EMAIL PROTECTED]> wrote:
> > 
> > > On Wed, 19 Sep 2007, Andy Whitcroft wrote:
> > > > Seems I have a case of a largish i386 NUMA (NUMA-Q) which has a mkfs
> > > > stuck in a 'D' wait:
> > > > 
> > > >  ===
> > > > mkfs.ext2 D c10220f4 0  6233   6222
> > > >  [<c12194da>] io_schedule_timeout+0x1e/0x28
> > > >  [<c10454b4>] congestion_wait+0x62/0x7a
> > > >  [<c10402af>] get_dirty_limits+0x16a/0x172
> > > >  [<c104040b>] balance_dirty_pages+0x154/0x1be
> > > >  [<c103bda3>] generic_perform_write+0x168/0x18a
> > > >  [<c103be38>] generic_file_buffered_write+0x73/0x107
> > > >  [<c103c346>] __generic_file_aio_write_nolock+0x47a/0x4a5
> > > >  [<c103c3b9>] generic_file_aio_write_nolock+0x48/0x9b
> > > >  [<c105d2d6>] do_sync_write+0xbf/0xfc
> > > >  [<c105d3a0>] vfs_write+0x8d/0x108
> > > >  [<c105d4c3>] sys_write+0x41/0x67
> > > >  [<c100260a>] syscall_call+0x7/0xb
> > > >  ===
> > > 
> > > [edited out some bogus lines from stale stack]
> > > 
> > > > This machine and others have run numerous test runs on this kernel and
> > > > this is the first time I've see a hang like this.
> > > 
> > > I've been seeing something like that on 4-way PPC64: in my case I've
> > > shells hanging in D state trying to append to kernel build log on ext3
> > > (the builds themselves going on elsewhere, in tmpfs): one of the shells
> > > holding i_mutex and stuck doing congestion_waits from balance_dirty_pages.
> > > 
> > > > I wonder if this is the ultimate cause of the couple of mainline hangs
> > > > which were seen, but not diagnosed.
> > > 
> > > My *guess* is that this is peculiar to 2.6.23-rc6-mm1, and from Peter's
> > > mm-per-device-dirty-threshold.patch.  printks showed bdi_nr_reclaimable
> > > 0, bdi_nr_writeback 24, bdi_thresh 1 in balance_dirty_pages (though I've
> > > not done enough to check if those really correlate with the hangs),
> > > and I'm wondering if the bdi_stat_sum business is needed on the
> > > !nr_reclaimable path.
> > 
> > FWIW my tired brain seems to think the !nr_reclaimable path needs it
> > just the same. So this change seems to make sense for now :-)
> 
> Thanks.
> 
> > > So I'm running now with the patch below, good so far, but can't judge
> > > until tomorrow whether it has actually addressed the problem seen.
> 
> Last night's run went well: that patch does indeed seem to have fixed it.
> Looking at the timings (some variance but _very_ much less than the night
> before), there does appear to be some other occasional slight slowdown -
> but I've no reason to suspect your patch for it, nor to suppose it's
> something new: it may just be an artifact of my heavy swap thrashing.
> 
> 
> [PATCH mm] mm per-device dirty threshold fix
> 
> Fix occasional hang when a task couldn't get out of balance_dirty_pages:
> mm-per-device-dirty-threshold.patch needs to reevaluate bdi_nr_writeback
> across all cpus when bdi_thresh is low, even in the case when there was
> no bdi_nr_reclaimable.
> 
> Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]>

Thank you, Hugh. I ran into similar problems with many dd (large file)
operations.  This patch seems to fix it.

But now my desktop was locked up again when writing a lot of small
files. The problem is repeatable with the command
 $ ketchup 2.6.23-rc6-mm1

I wrote up two debug patches:

---
 mm/page-writeback.c |9 +
 1 file changed, 9 insertions(+)

--- linux-2.6.22.orig/mm/page-writeback.c
+++ linux-2.6.22/mm/page-writeback.c
@@ -426,6 +426,14 @@ static void balance_dirty_pages(struct a
bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
}
 
+   printk(KERN_DEBUG "balance_dirty_pages written %lu %lu 
congested %d limits %lu %lu %lu %lu %lu %ld\n",
+   pages_written,
+   write_chunk - wbc.nr_to_write,
+   bdi_write_congested(bdi),
+   background_thresh, dirty_thresh,
+   bdi_thresh, bdi_nr_reclaimable, 
bdi_nr_writeback,
+   bdi_thresh - bdi_nr_reclaimable - 
bdi_nr_writeback);
+
if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
break;
if (pages_written >= write_chunk)

---
 mm/page-writeback.c |5 +
 1 file changed, 5 insertions(+)

--- linux-2.6.22.orig/mm/page-writeback.c
+++ linux-2.6.22/mm/page-writeback.c
@@ -373,6 +373,7 @@ static void balance_dirty_pages(struct a
long bdi_thresh;
unsigned long pages_written = 0;
unsigned long write_chunk = sync_writeback_pages();
+   int i = 0;
 
struct backing_dev_info *bdi = mapping->backing_dev_info;
 
@@ -434,6 +435,10 @@ static void balance_dirty_pages(struct a
bdi_thresh, bdi_nr_reclaimable, 
bdi_nr_writeback,
bdi_thresh - bdi_nr_reclaimable - 
bdi_nr_writeback);
 
+   if (i++ > 200) {


Re: 2.6.23-rc6-mm1 -- mkfs stuck in 'D'

2007-09-20 Thread Peter Zijlstra
On Thu, 20 Sep 2007 12:31:39 +0100 (BST) Hugh Dickins
<[EMAIL PROTECTED]> wrote:


Thanks Hugh!

> [PATCH mm] mm per-device dirty threshold fix
> 
> Fix occasional hang when a task couldn't get out of balance_dirty_pages:
> mm-per-device-dirty-threshold.patch needs to reevaluate bdi_nr_writeback
> across all cpus when bdi_thresh is low, even in the case when there was
> no bdi_nr_reclaimable.
> 
> Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]>

Acked-by: Peter Zijlstra <[EMAIL PROTECTED]>

> ---
>  mm/page-writeback.c |   53 +++---
>  1 file changed, 24 insertions(+), 29 deletions(-)
> 
> --- 2.6.23-rc6-mm1/mm/page-writeback.c	2007-09-18 12:28:25.000000000 +0100
> +++ linux/mm/page-writeback.c	2007-09-19 20:00:46.000000000 +0100
> @@ -379,7 +379,7 @@ static void balance_dirty_pages(struct a
>   bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
>   bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
>   if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> - break;
> + break;
>  
>   if (!bdi->dirty_exceeded)
>   bdi->dirty_exceeded = 1;
> @@ -392,39 +392,34 @@ static void balance_dirty_pages(struct a
>*/
>   if (bdi_nr_reclaimable) {
>   writeback_inodes();
> -
> + pages_written += write_chunk - wbc.nr_to_write;
> 		get_dirty_limits(&background_thresh, &dirty_thresh,
> 				 &bdi_thresh, bdi);
> + }
>  
> - /*
> -  * In order to avoid the stacked BDI deadlock we need
> -  * to ensure we accurately count the 'dirty' pages when
> -  * the threshold is low.
> -  *
> -  * Otherwise it would be possible to get thresh+n pages
> -  * reported dirty, even though there are thresh-m pages
> -  * actually dirty; with m+n sitting in the percpu
> -  * deltas.
> -  */
> - if (bdi_thresh < 2*bdi_stat_error(bdi)) {
> - bdi_nr_reclaimable =
> - bdi_stat_sum(bdi, BDI_RECLAIMABLE);
> - bdi_nr_writeback =
> - bdi_stat_sum(bdi, BDI_WRITEBACK);
> - } else {
> - bdi_nr_reclaimable =
> - bdi_stat(bdi, BDI_RECLAIMABLE);
> - bdi_nr_writeback =
> - bdi_stat(bdi, BDI_WRITEBACK);
> - }
> + /*
> +  * In order to avoid the stacked BDI deadlock we need
> +  * to ensure we accurately count the 'dirty' pages when
> +  * the threshold is low.
> +  *
> +  * Otherwise it would be possible to get thresh+n pages
> +  * reported dirty, even though there are thresh-m pages
> +  * actually dirty; with m+n sitting in the percpu
> +  * deltas.
> +  */
> + if (bdi_thresh < 2*bdi_stat_error(bdi)) {
> + bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
> + bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
> + } else if (bdi_nr_reclaimable) {
> + bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> + bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> + }
>  
> - if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> - break;
> + if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> + break;
> + if (pages_written >= write_chunk)
> + break;  /* We've done our duty */
>  
> - pages_written += write_chunk - wbc.nr_to_write;
> - if (pages_written >= write_chunk)
> - break;  /* We've done our duty */
> - }
>   congestion_wait(WRITE, HZ/10);
>   }
>  
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
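
(A note on the accessors the guard above relies on. Their semantics follow
from the patch comments quoted in this thread; the prototypes themselves are
not quoted here, so take this sketch as assumed rather than authoritative:

	bdi_stat(bdi, item)      cheap read; each cpu batches updates in a
	                         local delta, so the result can deviate from
	                         the true count by up to about nr_cpus * batch
	bdi_stat_sum(bdi, item)  exact read; folds every cpu's unflushed
	                         delta back in, at the cost of touching all cpus
	bdi_stat_error(bdi)      the worst-case deviation of bdi_stat()

Hence "bdi_thresh < 2*bdi_stat_error(bdi)": only pay for the expensive exact
read when the threshold is small enough that the cheap read's error could
straddle it.)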


Re: 2.6.23-rc6-mm1 -- mkfs stuck in 'D'

2007-09-20 Thread Hugh Dickins
On Wed, 19 Sep 2007, Peter Zijlstra wrote:
> On Wed, 19 Sep 2007 21:03:19 +0100 (BST) Hugh Dickins
> <[EMAIL PROTECTED]> wrote:
> 
> > On Wed, 19 Sep 2007, Andy Whitcroft wrote:
> > > Seems I have a case of a largish i386 NUMA (NUMA-Q) which has a mkfs
> > > stuck in a 'D' wait:
> > > 
> > >  ===
> > > mkfs.ext2 D c10220f4 0  6233   6222
> > >  [<c12194da>] io_schedule_timeout+0x1e/0x28
> > >  [<c10454b4>] congestion_wait+0x62/0x7a
> > >  [<c10402af>] get_dirty_limits+0x16a/0x172
> > >  [<c104040b>] balance_dirty_pages+0x154/0x1be
> > >  [<c103bda3>] generic_perform_write+0x168/0x18a
> > >  [<c103be38>] generic_file_buffered_write+0x73/0x107
> > >  [<c103c346>] __generic_file_aio_write_nolock+0x47a/0x4a5
> > >  [<c103c3b9>] generic_file_aio_write_nolock+0x48/0x9b
> > >  [<c105d2d6>] do_sync_write+0xbf/0xfc
> > >  [<c105d3a0>] vfs_write+0x8d/0x108
> > >  [<c105d4c3>] sys_write+0x41/0x67
> > >  [<c100260a>] syscall_call+0x7/0xb
> > >  ===
> > 
> > [edited out some bogus lines from stale stack]
> > 
> > > This machine and others have run numerous test runs on this kernel and
> > > this is the first time I've seen a hang like this.
> > 
> > I've been seeing something like that on 4-way PPC64: in my case I've
> > shells hanging in D state trying to append to kernel build log on ext3
> > (the builds themselves going on elsewhere, in tmpfs): one of the shells
> > holding i_mutex and stuck doing congestion_waits from balance_dirty_pages.
> > 
> > > I wonder if this is the ultimate cause of the couple of mainline hangs
> > > which were seen, but not diagnosed.
> > 
> > My *guess* is that this is peculiar to 2.6.23-rc6-mm1, and from Peter's
> > mm-per-device-dirty-threshold.patch.  printks showed bdi_nr_reclaimable
> > 0, bdi_nr_writeback 24, bdi_thresh 1 in balance_dirty_pages (though I've
> > not done enough to check if those really correlate with the hangs),
> > and I'm wondering if the bdi_stat_sum business is needed on the
> > !nr_reclaimable path.
> 
> FWIW my tired brain seems to think that the !nr_reclaimable path needs it
> just the same. So this change seems to make sense for now :-)

Thanks.

> > So I'm running now with the patch below, good so far, but can't judge
> > until tomorrow whether it has actually addressed the problem seen.

Last night's run went well: that patch does indeed seem to have fixed it.
Looking at the timings (some variance but _very_ much less than the night
before), there does appear to be some other occasional slight slowdown -
but I've no reason to suspect your patch for it, nor to suppose it's
something new: it may just be an artifact of my heavy swap thrashing.


[PATCH mm] mm per-device dirty threshold fix

Fix occasional hang when a task couldn't get out of balance_dirty_pages:
mm-per-device-dirty-threshold.patch needs to reevaluate bdi_nr_writeback
across all cpus when bdi_thresh is low, even in the case when there was
no bdi_nr_reclaimable.

Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]>
---
 mm/page-writeback.c |   53 +++---
 1 file changed, 24 insertions(+), 29 deletions(-)

--- 2.6.23-rc6-mm1/mm/page-writeback.c  2007-09-18 12:28:25.0 +0100
+++ linux/mm/page-writeback.c   2007-09-19 20:00:46.0 +0100
@@ -379,7 +379,7 @@ static void balance_dirty_pages(struct a
bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
-   break;
+   break;
 
if (!bdi->dirty_exceeded)
bdi->dirty_exceeded = 1;
@@ -392,39 +392,34 @@ static void balance_dirty_pages(struct a
 */
if (bdi_nr_reclaimable) {
		writeback_inodes(&wbc);
-
+		pages_written += write_chunk - wbc.nr_to_write;
		get_dirty_limits(&background_thresh, &dirty_thresh,
				 &bdi_thresh, bdi);
+   }
 
-   /*
-* In order to avoid the stacked BDI deadlock we need
-* to ensure we accurately count the 'dirty' pages when
-* the threshold is low.
-*
-* Otherwise it would be possible to get thresh+n pages
-* reported dirty, even though there are thresh-m pages
-* actually dirty; with m+n sitting in the percpu
-* deltas.
-*/
-   if (bdi_thresh < 2*bdi_stat_error(bdi)) {
-   bdi_nr_reclaimable =
-   bdi_stat_sum(bdi, BDI_RECLAIMABLE);
-   bdi_nr_writeback =
-   bdi_stat_sum(bdi, BDI_WRITEBACK);
-   } else {
-   bdi_nr_reclaimable =
-   bdi_stat(bdi, 
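
(The patch is truncated here by the archive; Peter's ack earlier in the
thread quotes it in full. The control-flow change it makes can be modelled
in userspace. Everything in the sketch below is invented except the stale 24
vs. exact 0 scenario reported in the thread, with approx_read() standing in
for bdi_stat() and exact_read() for bdi_stat_sum():

	#include <stdio.h>

	static long approx_read(void) { return 24; }	/* bdi_stat()     */
	static long exact_read(void)  { return  0; }	/* bdi_stat_sum() */

	static int loop(int fixed)
	{
		long thresh = 1, reclaimable = 0;

		for (int iter = 0; iter < 1000; iter++) {
			long seen = approx_read();
			if (seen <= thresh)
				return iter;
			/*
			 * The old code re-read the counters exactly only
			 * inside if (reclaimable) { ... }; the fix does it
			 * whenever the threshold is within the counter
			 * error, even with nothing reclaimable.
			 */
			if (fixed || reclaimable)
				seen = exact_read();
			if (seen <= thresh)
				return iter;
			/* congestion_wait(WRITE, HZ/10) would sleep here */
		}
		return -1;	/* never broke out: the observed hang */
	}

	int main(void)
	{
		/* prints: old=-1 (stuck), new=0 (breaks immediately) */
		printf("old=%d new=%d\n", loop(0), loop(1));
		return 0;
	}
)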

Re: 2.6.23-rc6-mm1 -- mkfs stuck in 'D'

2007-09-19 Thread Peter Zijlstra
On Wed, 19 Sep 2007 21:03:19 +0100 (BST) Hugh Dickins
<[EMAIL PROTECTED]> wrote:

> On Wed, 19 Sep 2007, Andy Whitcroft wrote:
> > Seems I have a case of a largish i386 NUMA (NUMA-Q) which has a mkfs
> > stuck in a 'D' wait:
> > 
> >  ===
> > mkfs.ext2 D c10220f4 0  6233   6222
> >  [<c12194da>] io_schedule_timeout+0x1e/0x28
> >  [<c10454b4>] congestion_wait+0x62/0x7a
> >  [<c10402af>] get_dirty_limits+0x16a/0x172
> >  [<c104040b>] balance_dirty_pages+0x154/0x1be
> >  [<c103bda3>] generic_perform_write+0x168/0x18a
> >  [<c103be38>] generic_file_buffered_write+0x73/0x107
> >  [<c103c346>] __generic_file_aio_write_nolock+0x47a/0x4a5
> >  [<c103c3b9>] generic_file_aio_write_nolock+0x48/0x9b
> >  [<c105d2d6>] do_sync_write+0xbf/0xfc
> >  [<c105d3a0>] vfs_write+0x8d/0x108
> >  [<c105d4c3>] sys_write+0x41/0x67
> >  [<c100260a>] syscall_call+0x7/0xb
> >  ===
> 
> [edited out some bogus lines from stale stack]
> 
> > This machine and others have run numerous test runs on this kernel and
> > this is the first time I've seen a hang like this.
> 
> I've been seeing something like that on 4-way PPC64: in my case I've
> shells hanging in D state trying to append to kernel build log on ext3
> (the builds themselves going on elsewhere, in tmpfs): one of the shells
> holding i_mutex and stuck doing congestion_waits from balance_dirty_pages.
> 
> > I wonder if this is the ultimate cause of the couple of mainline hangs
> > which were seen, but not diagnosed.
> 
> My *guess* is that this is peculiar to 2.6.23-rc6-mm1, and from Peter's
> mm-per-device-dirty-threshold.patch.  printks showed bdi_nr_reclaimable
> 0, bdi_nr_writeback 24, bdi_thresh 1 in balance_dirty_pages (though I've
> not done enough to check if those really correlate with the hangs),
> and I'm wondering if the bdi_stat_sum business is needed on the
> !nr_reclaimable path.

FWIW my tired brain seems to think that the !nr_reclaimable path needs it
just the same. So this change seems to make sense for now :-)

> So I'm running now with the patch below, good so far, but can't judge
> until tomorrow whether it has actually addressed the problem seen.
> 
> Not-yet-Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]>
> ---
>  mm/page-writeback.c |   53 +++---
>  1 file changed, 24 insertions(+), 29 deletions(-)
> 
> --- 2.6.23-rc6-mm1/mm/page-writeback.c  2007-09-18 12:28:25.0 +0100
> +++ linux/mm/page-writeback.c 2007-09-19 20:00:46.0 +0100
> @@ -379,7 +379,7 @@ static void balance_dirty_pages(struct a
>   bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
>   bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
>   if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> - break;
> + break;
>  
>   if (!bdi->dirty_exceeded)
>   bdi->dirty_exceeded = 1;
> @@ -392,39 +392,34 @@ static void balance_dirty_pages(struct a
>*/
>   if (bdi_nr_reclaimable) {
>   writeback_inodes(&wbc);
> -
> + pages_written += write_chunk - wbc.nr_to_write;
>   get_dirty_limits(&background_thresh, &dirty_thresh,
>  &bdi_thresh, bdi);
> + }
>  
> - /*
> -  * In order to avoid the stacked BDI deadlock we need
> -  * to ensure we accurately count the 'dirty' pages when
> -  * the threshold is low.
> -  *
> -  * Otherwise it would be possible to get thresh+n pages
> -  * reported dirty, even though there are thresh-m pages
> -  * actually dirty; with m+n sitting in the percpu
> -  * deltas.
> -  */
> - if (bdi_thresh < 2*bdi_stat_error(bdi)) {
> - bdi_nr_reclaimable =
> - bdi_stat_sum(bdi, BDI_RECLAIMABLE);
> - bdi_nr_writeback =
> - bdi_stat_sum(bdi, BDI_WRITEBACK);
> - } else {
> - bdi_nr_reclaimable =
> - bdi_stat(bdi, BDI_RECLAIMABLE);
> - bdi_nr_writeback =
> - bdi_stat(bdi, BDI_WRITEBACK);
> - }
> + /*
> +  * In order to avoid the stacked BDI deadlock we need
> +  * to ensure we accurately count the 'dirty' pages when
> +  * the threshold is low.
> +  *
> +  * Otherwise it would be possible to get thresh+n pages
> +  * reported dirty, even though there are thresh-m pages
> +  * actually dirty; with m+n sitting in the percpu
> +  * deltas.
> +  */
> + if (bdi_thresh < 2*bdi_stat_error(bdi)) {
> + bdi_nr_reclaimable = 

Re: 2.6.23-rc6-mm1 -- mkfs stuck in 'D'

2007-09-19 Thread Hugh Dickins
On Wed, 19 Sep 2007, Andy Whitcroft wrote:
> Seems I have a case of a largish i386 NUMA (NUMA-Q) which has a mkfs
> stuck in a 'D' wait:
> 
>  ===
> mkfs.ext2 D c10220f4 0  6233   6222
>  [<c12194da>] io_schedule_timeout+0x1e/0x28
>  [<c10454b4>] congestion_wait+0x62/0x7a
>  [<c10402af>] get_dirty_limits+0x16a/0x172
>  [<c104040b>] balance_dirty_pages+0x154/0x1be
>  [<c103bda3>] generic_perform_write+0x168/0x18a
>  [<c103be38>] generic_file_buffered_write+0x73/0x107
>  [<c103c346>] __generic_file_aio_write_nolock+0x47a/0x4a5
>  [<c103c3b9>] generic_file_aio_write_nolock+0x48/0x9b
>  [<c105d2d6>] do_sync_write+0xbf/0xfc
>  [<c105d3a0>] vfs_write+0x8d/0x108
>  [<c105d4c3>] sys_write+0x41/0x67
>  [<c100260a>] syscall_call+0x7/0xb
>  ===

[edited out some bogus lines from stale stack]

> This machine and others have run numerous test runs on this kernel and
> this is the first time I've seen a hang like this.

I've been seeing something like that on 4-way PPC64: in my case I've
shells hanging in D state trying to append to kernel build log on ext3
(the builds themselves going on elsewhere, in tmpfs): one of the shells
holding i_mutex and stuck doing congestion_waits from balance_dirty_pages.

> I wonder if this is the ultimate cause of the couple of mainline hangs
> which were seen, but not diagnosed.

My *guess* is that this is peculiar to 2.6.23-rc6-mm1, and from Peter's
mm-per-device-dirty-threshold.patch.  printks showed bdi_nr_reclaimable
0, bdi_nr_writeback 24, bdi_thresh 1 in balance_dirty_pages (though I've
not done enough to check if those really correlate with the hangs),
and I'm wondering if the bdi_stat_sum business is needed on the
!nr_reclaimable path.
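
(To put numbers on that guess: with bdi_thresh at 1, the cheap bdi_stat()
read only has to be wrong by a couple of pages for the break test to fail
forever. A minimal userspace sketch of the arithmetic, where cpu count,
batch deltas and the counter values are invented and only the 24 vs. 0
split mirrors the printks above:

	#include <stdio.h>

	#define NR_CPUS 4

	int main(void)
	{
		/* global counter is stale; cpus hold unflushed deltas */
		long global = 24;
		long delta[NR_CPUS] = { -8, -8, -8, 0 };
		long bdi_thresh = 1;		/* as in the printks */

		long approx = global;		/* bdi_stat() view     */
		long exact  = global;		/* bdi_stat_sum() view */
		for (int i = 0; i < NR_CPUS; i++)
			exact += delta[i];

		/* approx=24 > thresh, so the loop spins; exact=0 <= thresh,
		 * so an exact re-read would break out at once */
		printf("approx=%ld exact=%ld thresh=%ld\n",
		       approx, exact, bdi_thresh);
		return 0;
	}
)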

So I'm running now with the patch below, good so far, but can't judge
until tomorrow whether it has actually addressed the problem seen.

Not-yet-Signed-off-by: Hugh Dickins <[EMAIL PROTECTED]>
---
 mm/page-writeback.c |   53 +++---
 1 file changed, 24 insertions(+), 29 deletions(-)

--- 2.6.23-rc6-mm1/mm/page-writeback.c  2007-09-18 12:28:25.0 +0100
+++ linux/mm/page-writeback.c   2007-09-19 20:00:46.0 +0100
@@ -379,7 +379,7 @@ static void balance_dirty_pages(struct a
bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
-   break;
+   break;
 
if (!bdi->dirty_exceeded)
bdi->dirty_exceeded = 1;
@@ -392,39 +392,34 @@ static void balance_dirty_pages(struct a
 */
if (bdi_nr_reclaimable) {
		writeback_inodes(&wbc);
-
+		pages_written += write_chunk - wbc.nr_to_write;
		get_dirty_limits(&background_thresh, &dirty_thresh,
				 &bdi_thresh, bdi);
+   }
 
-   /*
-* In order to avoid the stacked BDI deadlock we need
-* to ensure we accurately count the 'dirty' pages when
-* the threshold is low.
-*
-* Otherwise it would be possible to get thresh+n pages
-* reported dirty, even though there are thresh-m pages
-* actually dirty; with m+n sitting in the percpu
-* deltas.
-*/
-   if (bdi_thresh < 2*bdi_stat_error(bdi)) {
-   bdi_nr_reclaimable =
-   bdi_stat_sum(bdi, BDI_RECLAIMABLE);
-   bdi_nr_writeback =
-   bdi_stat_sum(bdi, BDI_WRITEBACK);
-   } else {
-   bdi_nr_reclaimable =
-   bdi_stat(bdi, BDI_RECLAIMABLE);
-   bdi_nr_writeback =
-   bdi_stat(bdi, BDI_WRITEBACK);
-   }
+   /*
+* In order to avoid the stacked BDI deadlock we need
+* to ensure we accurately count the 'dirty' pages when
+* the threshold is low.
+*
+* Otherwise it would be possible to get thresh+n pages
+* reported dirty, even though there are thresh-m pages
+* actually dirty; with m+n sitting in the percpu
+* deltas.
+*/
+   if (bdi_thresh < 2*bdi_stat_error(bdi)) {
+   bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
+   bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
+   } else if (bdi_nr_reclaimable) {
+   bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
+   bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+   }
 
-   

2.6.23-rc6-mm1 -- mkfs stuck in 'D'

2007-09-19 Thread Andy Whitcroft
Seems I have a case of a largish i386 NUMA (NUMA-Q) which has a mkfs
stuck in a 'D' wait:

 ===
mkfs.ext2 D c10220f4 0  6233   6222
   c344fc80 0082 0286 c10220f4 c344fc90 002ed099 c2963340 c2b9f640
   c142bce0 c2b9f640 c344fc90 002ed099 c344fcfc c344fcc0 c1219563 c1109bf2
   c344fcc4 c186e4d4 c186e4d4 002ed099 c1022612 c2b9f640 c186e000 c104000c
Call Trace:
 [<c10220f4>] lock_timer_base+0x19/0x35
 [<c1219563>] schedule_timeout+0x70/0x8d
 [<c1109bf2>] prop_fraction_single+0x37/0x5d
 [<c1022612>] process_timeout+0x0/0x5
 [<c104000c>] task_dirty_limit+0x3a/0xb5
 [<c12194da>] io_schedule_timeout+0x1e/0x28
 [<c10454b4>] congestion_wait+0x62/0x7a
 [<c102b021>] autoremove_wake_function+0x0/0x33
 [<c10402af>] get_dirty_limits+0x16a/0x172
 [<c102b021>] autoremove_wake_function+0x0/0x33
 [<c104040b>] balance_dirty_pages+0x154/0x1be
 [<c103bda3>] generic_perform_write+0x168/0x18a
 [<c103be38>] generic_file_buffered_write+0x73/0x107
 [<c103c346>] __generic_file_aio_write_nolock+0x47a/0x4a5
 [<c11b0fef>] do_sock_write+0x92/0x99
 [<c11b1048>] sock_aio_write+0x52/0x5e
 [<c103c3b9>] generic_file_aio_write_nolock+0x48/0x9b
 [<c105d2d6>] do_sync_write+0xbf/0xfc
 [<c102b021>] autoremove_wake_function+0x0/0x33
 [<c1010311>] do_page_fault+0x2cc/0x739
 [<c105d3a0>] vfs_write+0x8d/0x108
 [<c105d4c3>] sys_write+0x41/0x67
 [<c100260a>] syscall_call+0x7/0xb
 ===

This machine and others have run numerous test runs on this kernel and
this is the first time I've seen a hang like this.

I wonder if this is the ultimate cause of the couple of mainline hangs
which were seen, but not diagnosed.

-apw
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
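
(For anyone trying to catch similar reports: if the magic SysRq key is
enabled, a trace like the one above can be pulled from a wedged machine
with

	echo t > /proc/sysrq-trigger

which dumps the state and stack of every task to the kernel log.)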

