Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-23 Thread Martin Knoblauch
- Original Message 
> From: Alasdair G Kergon <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: Linus Torvalds <[EMAIL PROTECTED]>; Mel Gorman <[EMAIL PROTECTED]>; 
> Fengguang Wu <[EMAIL PROTECTED]>; Mike Snitzer <[EMAIL PROTECTED]>; Peter 
> Zijlstra <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; Ingo Molnar <[EMAIL 
> PROTECTED]>; linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL 
> PROTECTED]>; [EMAIL PROTECTED]; Jens Axboe <[EMAIL PROTECTED]>; Milan Broz 
> <[EMAIL PROTECTED]>; Neil Brown <[EMAIL PROTECTED]>
> Sent: Wednesday, January 23, 2008 12:40:52 AM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> On Tue, Jan 22, 2008 at 07:25:15AM -0800, Martin Knoblauch wrote:
> >   [EMAIL PROTECTED] ~]# dmsetup table
> > VolGroup00-LogVol02: 0 350945280 linear 104:2 67109248
> > VolGroup00-LogVol01: 0 8388608 linear 104:2 418054528
> > VolGroup00-LogVol00: 0 67108864 linear 104:2 384
> 
> The IO should pass straight through simple linear targets like
> that without needing to get broken up, so I wouldn't expect those patches to
> make any difference in this particular case.
> 

Alasdair,

 LVM/DM are off the hook :-) I converted one box to use partitions directly, and
the performance is the same disappointment as with LVM/DM. Thanks anyway for
looking at my problem.

 I will now move the discussion to a new thread, targeting CCISS directly.

Cheers
Martin




Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-22 Thread Martin Knoblauch
- Original Message 
> From: Alasdair G Kergon <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: Linus Torvalds <[EMAIL PROTECTED]>; Mel Gorman <[EMAIL PROTECTED]>; 
> Fengguang Wu <[EMAIL PROTECTED]>; Mike Snitzer <[EMAIL PROTECTED]>; Peter 
> Zijlstra <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; Ingo Molnar <[EMAIL 
> PROTECTED]>; linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL 
> PROTECTED]>; [EMAIL PROTECTED]; Jens Axboe <[EMAIL PROTECTED]>; Milan Broz 
> <[EMAIL PROTECTED]>; Neil Brown <[EMAIL PROTECTED]>
> Sent: Tuesday, January 22, 2008 3:39:33 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> 
> See if these patches make any difference:
> 
>   http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/
> 
> dm-md-merge_bvec_fn-with-separate-bdev-and-sector.patch
> dm-introduce-merge_bvec_fn.patch
> dm-linear-add-merge.patch
> dm-table-remove-merge_bvec-sector-restriction.patch
>  


 nope. Exactly the same poor results. To rule out LVM/DM I really have to see
what happens if I set up a system with filesystems directly on partitions. Might
take some time though.

Cheers
Martin





Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-22 Thread Martin Knoblauch

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de

- Original Message 
> From: Alasdair G Kergon <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: Linus Torvalds <[EMAIL PROTECTED]>; Mel Gorman <[EMAIL PROTECTED]>; 
> Fengguang Wu <[EMAIL PROTECTED]>; Mike Snitzer <[EMAIL PROTECTED]>; Peter 
> Zijlstra <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; Ingo Molnar <[EMAIL 
> PROTECTED]>; linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL 
> PROTECTED]>; [EMAIL PROTECTED]; Jens Axboe <[EMAIL PROTECTED]>; Milan Broz 
> <[EMAIL PROTECTED]>; Neil Brown <[EMAIL PROTECTED]>
> Sent: Tuesday, January 22, 2008 3:39:33 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> On Fri, Jan 18, 2008 at 11:01:11AM -0800, Martin Knoblauch wrote:
> >  At least, rc1-rc5 have shown that the CCISS system can do well. Now
> > the question is which part of the system does not cope well with the
> > larger IO sizes? Is it the CCISS controller, LVM or both. I am open to
> > suggestions on how to debug that. 
> 
> What is your LVM device configuration?
>   E.g. 'dmsetup table' and 'dmsetup info -c' output.
> Some configurations lead to large IOs getting split up on the way through
> device-mapper.
>
Hi Alasdair,

 here is the output, the filesystem in question is on LogVol02:

  [EMAIL PROTECTED] ~]# dmsetup table
VolGroup00-LogVol02: 0 350945280 linear 104:2 67109248
VolGroup00-LogVol01: 0 8388608 linear 104:2 418054528
VolGroup00-LogVol00: 0 67108864 linear 104:2 384
[EMAIL PROTECTED] ~]# dmsetup info -c
Name                Maj Min Stat Open Targ Event  UUID
VolGroup00-LogVol02 253   1 L--w    1    1      0
LVM-IV4PeE8cdxA3piC1qk79GY9PE9OC4OgmOZ4OzOgGQIdF3qDx6fJmlZukXXLIy39R
VolGroup00-LogVol01 253   2 L--w    1    1      0
LVM-IV4PeE8cdxA3piC1qk79GY9PE9OC4Ogmfn2CcAd2Fh7i48twe8PZc2XK5bSOe1Fq
VolGroup00-LogVol00 253   0 L--w    1    1      0
LVM-IV4PeE8cdxA3piC1qk79GY9PE9OC4OgmfYjxQKFP3zw2fGsezJN7ypSrfmP7oSvE
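
 For readers decoding the tables: each dmsetup table line above is a dm-linear
mapping of the form sketched below, with all values in 512-byte sectors. Block
major 104 belongs to the cciss driver, so 104:2 is most likely /dev/cciss/c0d0p2;
the exact partition name is an assumption, it is not stated in the thread.

  # dm-linear table line:  <lv_start> <num_sectors> linear <dest_dev> <dest_start>
  #                        (all values are 512-byte sectors)
  VolGroup00-LogVol02: 0 350945280 linear 104:2 67109248
  #  -> maps LV sectors 0..350945279 (~167 GiB) straight onto device 104:2,
  #     starting at sector 67109248 of that device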

> See if these patches make any difference:
> 
>   http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/
> 
> dm-md-merge_bvec_fn-with-separate-bdev-and-sector.patch
> dm-introduce-merge_bvec_fn.patch
> dm-linear-add-merge.patch
> dm-table-remove-merge_bvec-sector-restriction.patch
>  

 thanks for the suggestion. Are they supposed to apply to mainline?

Cheers
Martin





Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-19 Thread Martin Knoblauch
- Original Message 
> From: Mike Snitzer <[EMAIL PROTECTED]>
> To: Linus Torvalds <[EMAIL PROTECTED]>
> Cc: Mel Gorman <[EMAIL PROTECTED]>; Martin Knoblauch <[EMAIL PROTECTED]>; 
> Fengguang Wu <[EMAIL PROTECTED]>; Peter Zijlstra <[EMAIL PROTECTED]>; [EMAIL 
> PROTECTED]; Ingo Molnar <[EMAIL PROTECTED]>; linux-kernel@vger.kernel.org; 
> "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; [EMAIL PROTECTED]
> Sent: Friday, January 18, 2008 11:47:02 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> > I can fire up 2.6.24-rc8 in short order to see if things are vastly
> > improved (as Martin seems to indicate that he is happy with
> > AACRAID on 2.6.24-rc8).  Although even Martin's AACRAID
> > numbers from 2.6.19.2 are still quite good (relative to mine).  Martin
> > can you share any tuning you may have done to get AACRAID to where it
> > is for you right now?
Mike,

 I have always been happy with the AACRAID box compared to the CCISS system.
Even with the "regression" in 2.6.24-rc1..rc5 it was more than acceptable to
me. For me the differences between 2.6.19 and 2.6.24-rc8 on the AACRAID setup
are:

- 11% (single stream) to 25% (dual/triple stream) regression in DIO. Something
I do not care much about. I just measure it for reference.
+ the very nice behaviour when writing to different targets (mix3), which I
attribute to Peter's per-BDI stuff.

 And until -rc6 I was extremely pleased with the cool speedup I saw on my CCISS
boxes. This would have been the next "production" kernel for me. But let's
discuss this under a separate topic. It has nothing to do with the original
io-wait issue.

 Oh, before I forget: there has been no tuning for the AACRAID. The system is
an IBM x3650 with built-in AACRAID and battery-backed write cache. The disks
are 6x142GB/15krpm in a RAID5 setup. I see one big difference between your and
my tests: I do 1MB writes to simulate the behaviour of the real applications,
while yours seem to be much smaller.
 
Cheers
Martin
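
 For context, the dd1/dd2/dd3 tests referred to throughout this thread are one,
two, or three parallel streaming writers using 1MB blocks, buffered or with
O_DIRECT for the "-dir" variants, and mix3 combines writers to different targets.
The exact script is never shown in the thread, so the sketch below is only an
illustration of the pattern; file names and counts are made up.

  # dd1: one buffered streaming writer, 1MB blocks
  dd if=/dev/zero of=/scratch/f1 bs=1M count=5000
  # dd1-dir: the same, but bypassing the page cache
  dd if=/dev/zero of=/scratch/f1 bs=1M count=5000 oflag=direct
  # dd2/dd3: two or three such writers run in parallel
  dd if=/dev/zero of=/scratch/f1 bs=1M count=5000 &
  dd if=/dev/zero of=/scratch/f2 bs=1M count=5000 &
  wait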





Re: [PATCH] writeback: speed up writeback of big dirty files

2008-01-19 Thread Martin Knoblauch
 Original Message 
> From: Fengguang Wu <[EMAIL PROTECTED]>
> To: Linus Torvalds <[EMAIL PROTECTED]>
> Cc: Mike Snitzer <[EMAIL PROTECTED]>; Martin Knoblauch <[EMAIL PROTECTED]>; 
> Peter Zijlstra <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; Ingo Molnar <[EMAIL 
> PROTECTED]>; linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL 
> PROTECTED]>
> Sent: Thursday, January 17, 2008 6:28:18 AM
> Subject: [PATCH] writeback: speed up writeback of big dirty files
> 
> On Jan 16, 2008 9:15 AM, Martin Knoblauch wrote:
> > Fengguang's latest writeback patch applies cleanly, builds, boots on
> > 2.6.24-rc8.
> 
> Linus, if possible, I'd suggest this patch be merged for 2.6.24.
> 
> It's a safer version of the reverted patch. It was tested on
> ext2/ext3/jfs/xfs/reiserfs and won't 100% iowait even without the
> other bug fixing patches.
> 
> Fengguang
> ---
> 
> writeback: speed up writeback of big dirty files
> 
> After making dirty a 100M file, the normal behavior is to
> start the writeback for all data after 30s delays. But
> sometimes the following happens instead:
> 
> - after 30s:~4M
> - after 5s: ~4M
> - after 5s: all remaining 92M
> 
> Analysis shows that the internal IO dispatch queues go like this:
> 
>    s_io        s_more_io
>    -------------------------
> 1) 100M,1K     0
> 2) 1K          96M
> 3) 0           96M
> 1) initial state with a 100M file and a 1K file
> 2) 4M written, nr_to_write <= 0, so write more
> 3) 1K written, nr_to_write > 0, no more writes(BUG)
> nr_to_write > 0 in (3) fools the upper layer to think that data have
> all been written out. The big dirty file is actually still sitting in
> s_more_io. We cannot simply splice s_more_io back to s_io as soon as
> s_io becomes empty, and let the loop in generic_sync_sb_inodes()
> continue: this may starve newly expired inodes in s_dirty.  It is also
> not an option to draw inodes from both s_more_io and s_dirty, and let
> the loop go on: this might lead to live locks, and might also starve
> other superblocks in sync time (well, kupdate may still starve some
> superblocks, that's another bug).
> We have to return when a full scan of s_io completes. So nr_to_write > 0
> does not necessarily mean that "all data are written". This patch
> introduces a flag writeback_control.more_io to indicate that more io
> should be done. With it the big dirty file no longer has to wait for the
> next kupdate invocation 5s later.
> 
> In sync_sb_inodes() we only set more_io on super_blocks we actually
> visited. This avoids the interaction between two pdflush daemons.
> 
> Also in __sync_single_inode() we don't blindly keep requeuing the io if
> the filesystem cannot progress. Failing to do so may lead to 100% iowait.
> 
> Tested-by: Mike Snitzer 
> Signed-off-by: Fengguang Wu 
> ---
>  fs/fs-writeback.c |   18 --
>  include/linux/writeback.h |1 +
>  mm/page-writeback.c   |9 ++---
>  3 files changed, 23 insertions(+), 5 deletions(-)
> 
> --- linux.orig/fs/fs-writeback.c
> +++ linux/fs/fs-writeback.c
> @@ -284,7 +284,17 @@ __sync_single_inode(struct inode *inode,
>   * soon as the queue becomes uncongested.
>   */
>  inode->i_state |= I_DIRTY_PAGES;
> -requeue_io(inode);
> +if (wbc->nr_to_write <= 0) {
> +/*
> + * slice used up: queue for next turn
> + */
> +requeue_io(inode);
> +} else {
> +/*
> + * somehow blocked: retry later
> + */
> +redirty_tail(inode);
> +}
>  } else {
>  /*
>   * Otherwise fully redirty the inode so that
> @@ -479,8 +489,12 @@ sync_sb_inodes(struct super_block *sb, s
>  iput(inode);
>  cond_resched();
>  spin_lock(&inode_lock);
> -if (wbc->nr_to_write <= 0)
> +if (wbc->nr_to_write <= 0) {
> +wbc->more_io = 1;
>  break;
> +}
> +if (!list_empty(&sb->s_more_io))
> +wbc->more_io = 1;
>  }
>  return;/* Leave any unwritten inodes on s_io */
>  }
> --- linux.orig/include/linux/writeback.h
> +++ linux/include/linux/writeback.h
> @@ -62,6 +62,7 @@ struct writeback_control {
>  unsigned for_reclaim:1;/* Invoked from the page allocator */
>  unsigned for_writepages:1;/* This is a writepages() call */
>  unsigned range_cyclic:1;/* range_start is cyclic */
> +unsigned more_io:1;/* more io to be dispatched */
>  };
>  
>  /*
> --- linux.orig/mm/page-writeback.c
> +++ linux/mm/page-writeback.c
> @@ -558,6 +558,7 @@ static void background_writeout(unsigned
>  global_page_state(NR_UNSTABLE_NFS) < background_thresh
>   && min_pages <= 0)
>  break;
> +wbc.more_io = 0;
>  wbc.encountered_congestion = 0;
>  wbc.nr_to_write

Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-18 Thread Martin Knoblauch

--- Linus Torvalds <[EMAIL PROTECTED]> wrote:

> 
> 
> On Fri, 18 Jan 2008, Mel Gorman wrote:
> > 
> > Right, and this is consistent with other complaints about the PFN
> > of the page mattering to some hardware.
> 
> I don't think it's actually the PFN per se.
> 
> I think it's simply that some controllers (quite probably affected by
> both  driver and hardware limits) have some subtle interactions with
> the size of  the IO commands.
> 
> For example, let's say that you have a controller that has some limit
> X on  the size of IO in flight (whether due to hardware or driver
> issues doesn't  really matter) in addition to a limit on the size
> of the scatter-gather  size. They all tend to have limits, and
> they differ.
> 
> Now, the PFN doesn't matter per se, but the allocation pattern
> definitely  matters for whether the IO's are physically
> contiguous, and thus matters  for the size of the scatter-gather
> thing.
> 
> Now, generally the rule-of-thumb is that you want big commands, so 
> physical merging is good for you, but I could well imagine that the
> IO  limits interact, and end up hurting each other. Let's say that a
> better  allocation order allows for bigger contiguous physical areas,
> and thus  fewer scatter-gather entries.
> 
> What does that result in? The obvious answer is
> 
>   "Better performance obviously, because the controller needs to do
> fewer scatter-gather lookups, and the requests are bigger, because
> there are fewer IO's that hit scatter-gather limits!"
> 
> Agreed?
> 
> Except maybe the *real* answer for some controllers end up being
> 
>   "Worse performance, because individual commands grow because they
> don't  hit the per-command limits, but now we hit the global
> size-in-flight limits and have many fewer of these good commands in
> flight. And while the commands are larger, it means that there
> are fewer outstanding commands, which can mean that the disk
> cannot scheduling things as well, or makes high latency of command
> generation by the controller much more visible because there aren't
> enough concurrent requests queued up to hide it"
> 
> Is this the reason? I have no idea. But somebody who knows the
> AACRAID hardware and driver limits might think about interactions
> like that. Sometimes you actually might want to have smaller 
> individual commands if there is some other limit that means that
> it can be more advantageous to have many small requests over a
> few big ones.
> 
> RAID might well make it worse. Maybe small requests work better
> because they are simpler to schedule because they only hit one
> disk (eg if you have simple striping)! So that's another reason
> why one *large* request may actually be slower than two requests
> half the size, even if it's against the "normal rule".
> 
> And it may be that that AACRAID box takes a big hit on DIO
> exactly because DIO has been optimized almost purely for making
> one command as big as possible.
> 
> Just a theory.
> 
>   Linus

 just to make one thing clear - I am not so much concerned about the
performance of AACRAID. It is OK with or without Mel's patch. It is
better with Mel's patch. The regression in DIO compared to 2.6.19.2 is
completely independent of Mel's stuff.

 What interests me much more is the behaviour of the CCISS+LVM based
system. Here I see a huge benefit of reverting Mel's patch.

 I dirtied the system after reboot as Mel suggested (24 parallel kernel
builds) and repeated the tests. The dirtying did not make any
difference. Here are the results (MB/s):

Test      -rc8      -rc8-without-Mels-Patch
dd1       57        94
dd1-dir   87        86
dd2       2x8.5     2x45
dd2-dir   2x43      2x43
dd3       3x7       3x30
dd3-dir   3x28.5    3x28.5
mix3      59,2x25   98,2x24

 The big IO size with Mel's patch really has a devastating effect on
the parallel write. Nowhere near the value one would expect, while the
numbers are perfect without Mel's patch as in rc1-rc5. Too bad I did not
see this earlier. Maybe we could have found a solution for .24.

 At least, rc1-rc5 have shown that the CCISS system can do well. Now
the question is which part of the system does not cope well with the
larger IO sizes? Is it the CCISS controller, LVM, or both? I am open to
suggestions on how to debug that. 

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
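
 One way to follow up on the debugging question above would be to watch the
request sizes that actually reach the cciss device while the dd tests run.
None of this was discussed in the thread; the commands below are only a sketch
using standard tools (iostat from sysstat, blktrace), and the trace file name
is made up.

  # watch the average request size (avgrq-sz, in 512-byte sectors) per second
  iostat -x 1
  # or trace the request stream for 60s (needs CONFIG_BLK_DEV_IO_TRACE and debugfs)
  mount -t debugfs none /sys/kernel/debug
  blktrace -d /dev/cciss/c0d0 -o dd1-trace -w 60
  blkparse dd1-trace | less
  # queue limits the driver advertises:
  cat '/sys/block/cciss!c0d0/queue/max_sectors_kb'
  cat '/sys/block/cciss!c0d0/queue/max_hw_sectors_kb'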


Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-18 Thread Martin Knoblauch
- Original Message 
> From: Mel Gorman <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: Fengguang Wu <[EMAIL PROTECTED]>; Mike Snitzer <[EMAIL PROTECTED]>; Peter 
> Zijlstra <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; Ingo Molnar <[EMAIL 
> PROTECTED]>; linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL 
> PROTECTED]>; Linus Torvalds <[EMAIL PROTECTED]>; [EMAIL PROTECTED]
> Sent: Thursday, January 17, 2008 11:12:21 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> On (17/01/08 13:50), Martin Knoblauch didst pronounce:
> > 
> > The effect is definitely depending on the IO hardware. I performed the
> > same tests on a different box with an AACRAID controller and there
> > things look different.
> 
> I take it different also means it does not show this odd performance
> behaviour and is similar whether the patch is applied or not?
>

Here are the numbers (MB/s) from the AACRAID box, after a fresh boot:

Test      2.6.19.2   2.6.24-rc6   2.6.24-rc6-81eabcbe0b991ddef5216f30ae91c4b226d54b6d
dd1       325        350          290
dd1-dir   180        160          160
dd2       2x90       2x113        2x110
dd2-dir   2x120      2x92         2x93
dd3       3x54       3x70         3x70
dd3-dir   3x83       3x64         3x64
mix3      55,2x30    400,2x25     310,2x25

 What we are seeing here is that:

a) DIRECT IO takes a much bigger hit (2.6.19 vs. 2.6.24) on this IO system
compared to the CCISS box
b) Reverting your patch hurts single stream
c) dual/triple stream are not affected by your patch and are improved over
2.6.19
d) the mix3 performance is improved compared to 2.6.19.
d1) reverting your patch hurts the local-disk part of mix3
e) the AACRAID setup is definitely faster than the CCISS.

 So, on this box your patch is definitely needed to get the pre-2.6.24
performance when writing a single big file.

 Actually, things on the CCISS box might be even more complicated. I forgot
the fact that on that box we have ext2/LVM/DM/Hardware, while on the AACRAID
box we have ext2/Hardware. Do you think that LVM/DM are sensitive to the page
order/coloring?

 Anyway: does your patch only address this performance issue, or are there
also data integrity concerns without it? I may consider reverting the patch
for my production environment. It really helps two thirds of my boxes big
time, while it does not hurt the other third that much :-)

> > 
> >  I can certainly stress the box before doing the tests. Please
> > define "many" for the kernel compiles :-)
> > 
> 
> With 8GiB of RAM, try making 24 copies of the kernel and compiling them
> all simultaneously. Running that for 20-30 minutes should be enough
> to randomise the freelists affecting what color of page is used for the
> dd test.
> 

 ouch :-) OK, I will try that.

Martin
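
 What Mel suggests above could look roughly like the following. This is purely
illustrative; the directory names, tarball name and -j value are made up.

  # unpack 24 copies of the kernel tree, then build them all in parallel
  for i in $(seq 1 24); do
      mkdir -p /scratch/stress$i
      tar xf linux-2.6.24-rc8.tar -C /scratch/stress$i
  done
  for i in $(seq 1 24); do
      ( cd /scratch/stress$i/linux-2.6.24-rc8 && make defconfig && make -j2 ) \
          > /dev/null 2>&1 &
  done
  wait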





Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-17 Thread Martin Knoblauch
- Original Message 
> From: Mel Gorman <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: Fengguang Wu <[EMAIL PROTECTED]>; Mike Snitzer <[EMAIL PROTECTED]>; Peter 
> Zijlstra <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; Ingo Molnar <[EMAIL 
> PROTECTED]>; linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL 
> PROTECTED]>; Linus Torvalds <[EMAIL PROTECTED]>; [EMAIL PROTECTED]
> Sent: Thursday, January 17, 2008 9:23:57 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> On (17/01/08 09:44), Martin Knoblauch didst pronounce:
> > > > > > > > On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > > > > > For those interested in using your writeback improvements in
> > > > > > > production sooner rather than later (primarily with ext3); what
> > > > > > > recommendations do you have?  Just heavily test our own 2.6.24 +
> > > > > > > your evolving "close, but not ready for merge" -mm writeback patchset?
> > > > > > > 
> > > > > > 
> > > > >  I can add myself to Mikes question. It would be good to know a
> > > > > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has
> > > > > been showing quite nice improvement of the overall writeback situation
> > > > > and it would be sad to see this [partially] gone in 2.6.24-final.
> > > > > Linus apparently already has reverted  "...2250b". I will definitely
> > > > > repeat my tests with -rc8. and report.
> > > > > 
> > > > Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> > > > Maybe we can push it to 2.6.24 after your testing.
> > > > 
> > > Hi Fengguang,
> > > 
> > > something really bad has happened between -rc3 and -rc6.
> > > Embarrassingly I did not catch that earlier :-(
> > > Compared to the numbers I posted in
> > > http://lkml.org/lkml/2007/10/26/208 , dd1 is now at 60 MB/sec
> > > (slight plus), while dd2/dd3 suck the same way as in pre 2.6.24.
> > > The only test that is still good is mix3, which I attribute to
> > > the per-BDI stuff.
> 
> I suspect that the IO hardware you have is very sensitive to the
> color of the physical page. I wonder, do you boot the system cleanly
> and then run these tests? If so, it would be interesting to know what
> happens if you stress the system first (many kernel compiles for example,
> basically anything that would use a lot of memory in different ways for some
> time) to randomise the free lists a bit and then run your test. You'd need
> to run the test three times for 2.6.23, 2.6.24-rc8 and 2.6.24-rc8 with the
> patch you identified reverted.
>

 The effect is definitely depending on the IO hardware. I performed the same
tests on a different box with an AACRAID controller and there things look
different. Basically the "offending" commit helps single stream performance on
that box, while dual/triple stream are not affected. So I suspect that the
CCISS is just not behaving well.

 And yes, the tests are usually done on a freshly booted box. Of course, I
repeat them a few times. On the CCISS box the numbers are very constant. On
the AACRAID box they vary quite a bit.

 I can certainly stress the box before doing the tests. Please define "many"
for the kernel compiles :-)

> > 
> >  OK, the change happened between rc5 and rc6. Just following a
> > gut feeling, I reverted
> > 
> > #commit 81eabcbe0b991ddef5216f30ae91c4b226d54b6d
> > #Author: Mel Gorman 
> > #Date:   Mon Dec 17 16:20:05 2007 -0800
> > #

> > 
> > This has brought back the good results I observed and reported.
> > I do not know what to make out of this. At least on the systems
> > I care about (HP/DL380g4, dual CPUs, HT-enabled, 8 GB Memory,
> > SmartArray 6i controller with 4x72GB SCSI disks as RAID5 (battery
> > protected writeback cache enabled) and gigabit networking (tg3)) this
> > optimisation is a disaster.
> > 
> 
> That patch was not an optimisation, it was a regression fix
> against 2.6.23 and I don't believe reverting it is an option. Other IO
> hardware benefits from having the allocator supply pages in PFN order.

 I think this late in the 2.6.24 game we should just leave things as they are.
But we should try to find a way to make CCISS faster, as it apparently can be
faster.

>
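
 For anyone who wants to reproduce the comparison, backing out just that one
commit for a test build can be done along these lines. This is only a sketch:
the patch file name is made up (save the diff quoted further down in this
archive into it first), and the commit id is the one identified in this thread.

  # in a git checkout of 2.6.24-rc8:
  git revert 81eabcbe0b991ddef5216f30ae91c4b226d54b6d
  # or, on a plain source tree, apply the quoted diff in reverse:
  patch -p1 -R < mm-fix-page-allocation-for-larger-io-segments.patch
  make oldconfig && make -j4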

Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-17 Thread Martin Knoblauch
- Original Message 
> From: Mike Snitzer <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: Fengguang Wu <[EMAIL PROTECTED]>; Peter Zijlstra <[EMAIL PROTECTED]>; 
> [EMAIL PROTECTED]; Ingo Molnar <[EMAIL PROTECTED]>; 
> linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; Linus 
> Torvalds <[EMAIL PROTECTED]>
> Sent: Thursday, January 17, 2008 5:11:50 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> 
> I've backported Peter's perbdi patchset to 2.6.22.x.  I can share it
> with anyone who might be interested.
> 
> As expected, it has yielded 2.6.24-rcX level scaling.  Given the test
> result matrix you previously posted, 2.6.22.x+perbdi might give you
> what you're looking for (sans improved writeback that 2.6.24 was
> thought to be providing).  That is, much improved scaling with better
> O_DIRECT and network throughput.  Just a thought...
> 
> Unfortunately, my priorities (and computing resources) have shifted
> and I won't be able to thoroughly test Fengguang's new writeback patch
> on 2.6.24-rc8... whereby missing out on providing
> justification/testing to others on _some_ improved writeback being
> included in 2.6.24 final.
> 
> Not to mention the window for writeback improvement is all but closed
> considering the 2.6.24-rc8 announcement's 2.6.24 final release
> timetable.
> 
Mike,

 thanks for the offer, but the improved throughput is my #1 priority nowadays.
And while the better scaling for different targets is nothing to frown upon, the
much better scaling when writing to the same target would have been the big
winner for me.

 Anyway, I located the "offending" commit. Let's see what the experts say.


Cheers
Martin





Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-17 Thread Martin Knoblauch
- Original Message 
> From: Martin Knoblauch <[EMAIL PROTECTED]>
> To: Fengguang Wu <[EMAIL PROTECTED]>
> Cc: Mike Snitzer <[EMAIL PROTECTED]>; Peter Zijlstra <[EMAIL PROTECTED]>; 
> [EMAIL PROTECTED]; Ingo Molnar <[EMAIL PROTECTED]>; 
> linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; Linus 
> Torvalds <[EMAIL PROTECTED]>
> Sent: Thursday, January 17, 2008 2:52:58 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> - Original Message 
> > From: Fengguang Wu 
> > To: Martin Knoblauch 
> > Cc: Mike Snitzer ; Peter Zijlstra ; [EMAIL PROTECTED]; Ingo Molnar ;
> > linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" ; Linus Torvalds 
> > Sent: Wednesday, January 16, 2008 1:00:04 PM
> > Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> > 
> > On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > > For those interested in using your writeback improvements in
> > > > production sooner rather than later (primarily with ext3); what
> > > > recommendations do you have?  Just heavily test our own 2.6.24 +
> > > > your evolving "close, but not ready for merge" -mm writeback patchset?
> > > > 
> > > Hi Fengguang, Mike,
> > > 
> > >  I can add myself to Mikes question. It would be good to know a
> > > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has been
> > > showing quite nice improvement of the overall writeback situation and it
> > > would be sad to see this [partially] gone in 2.6.24-final. Linus
> > > apparently already has reverted  "...2250b". I will definitely repeat my
> > > tests with -rc8. and report.
> > > 
> > Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> > Maybe we can push it to 2.6.24 after your testing.
> > 
> Hi Fengguang,
> 
>  something really bad has happened between -rc3 and -rc6.
> Embarrassingly I did not catch that earlier :-(
> 
>  Compared to the numbers I posted in
> http://lkml.org/lkml/2007/10/26/208 , dd1 is now at 60 MB/sec (slight
> plus), while dd2/dd3 suck the same way as in pre 2.6.24. The only test
> that is still good is mix3, which I attribute to the per-BDI stuff.
> 
>  At the moment I am frantically trying to find when things went down. I
> did run -rc8 and rc8+yourpatch. No difference to what I see with -rc6.
> Sorry that I cannot provide any input to your patch.
> 

 OK, the change happened between rc5 and rc6. Just following a gut feeling, I 
reverted

#commit 81eabcbe0b991ddef5216f30ae91c4b226d54b6d
#Author: Mel Gorman <[EMAIL PROTECTED]>
#Date:   Mon Dec 17 16:20:05 2007 -0800
#
#mm: fix page allocation for larger I/O segments
#
#In some cases the IO subsystem is able to merge requests if the pages are
#adjacent in physical memory.  This was achieved in the allocator by having
#expand() return pages in physically contiguous order in situations were a
#large buddy was split.  However, list-based anti-fragmentation changed the
#order pages were returned in to avoid searching in buffered_rmqueue() for a
#page of the appropriate migrate type.
#
#This patch restores behaviour of rmqueue_bulk() preserving the physical
#order of pages returned by the allocator without incurring increased search
#costs for anti-fragmentation.
#
#Signed-off-by: Mel Gorman <[EMAIL PROTECTED]>
#Cc: James Bottomley <[EMAIL PROTECTED]>
#Cc: Jens Axboe <[EMAIL PROTECTED]>
#Cc: Mark Lord <[EMAIL PROTECTED]>
#Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
#Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]>
diff -urN linux-2.6.24-rc5/mm/page_alloc.c linux-2.6.24-rc6/mm/page_alloc.c
--- linux-2.6.24-rc5/mm/page_alloc.c2007-12-21 04:14:11.305633890 +
+++ linux-2.6.24-rc6/mm/page_alloc.c2007-12-21 04:14:17.746985697 +
@@ -847,8 +847,19 @@
struct page *page = __rmqueue(zone, order, migratetype);
if (unlikely(page == NULL))
break;
+
+   /*
+* Split buddy pages returned by expand() are received here
+* in physical page order. The page is added to the callers and
+* list and the list head then moves forward. From the callers
+* perspective, the linked list is ordered by page number in
+* some conditions. This is useful for IO devices that can
+* merge IO requests if the physical pages are ordered
+* properly.
+*/
list_add(&page->lru, list);
set_page_private(page, migratetype);
+   list = &page->lru;
}
spin_unlock(&zone->lock);
return i;

This has brought back the good results I observed and reported.
I do not know what to make out of this. At least on the systems I care
about (HP/DL380g4, dual CPUs, HT-enabled, 8 GB Memory, SmartArray 6i
controller with 4x72GB SCSI disks as RAID5 (battery protected writeback
cache enabled) and gigabit networking (tg3)) this optimisation is a disaster.

 On the other hand, it is not a regression

Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-17 Thread Martin Knoblauch
- Original Message 
> From: Fengguang Wu <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: Mike Snitzer <[EMAIL PROTECTED]>; Peter Zijlstra <[EMAIL PROTECTED]>; 
> [EMAIL PROTECTED]; Ingo Molnar <[EMAIL PROTECTED]>; 
> linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; Linus 
> Torvalds <[EMAIL PROTECTED]>
> Sent: Wednesday, January 16, 2008 1:00:04 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > For those interested in using your writeback improvements in
> > > production sooner rather than later (primarily with ext3); what
> > > recommendations do you have?  Just heavily test our own 2.6.24 +
> > > your evolving "close, but not ready for merge" -mm writeback patchset?
> > > 
> > Hi Fengguang, Mike,
> > 
> >  I can add myself to Mikes question. It would be good to know a
> > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has been
> > showing quite nice improvement of the overall writeback situation and it
> > would be sad to see this [partially] gone in 2.6.24-final. Linus
> > apparently already has reverted  "...2250b". I will definitely repeat my
> > tests with -rc8. and report.
> 
> Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> Maybe we can push it to 2.6.24 after your testing.
> 
Hi Fengguang,

 something really bad has happened between -rc3 and -rc6. Embarrassingly I did 
not catch that earlier :-(

 Compared to the numbers I posted in http://lkml.org/lkml/2007/10/26/208 , dd1 
is now at 60 MB/sec (slight plus), while dd2/dd3 suck the same way as in pre 
2.6.24. The only test that is still good is mix3, which I attribute to the 
per-BDI stuff.

 At the moment I am frantically trying to find when things went down. I did run 
-rc8 and rc8+yourpatch. No difference to what I see with -rc6. Sorry that I 
cannot provide any input to your patch.

Depressed
Martin



Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-17 Thread Martin Knoblauch
- Original Message 
 From: Fengguang Wu [EMAIL PROTECTED]
 To: Martin Knoblauch [EMAIL PROTECTED]
 Cc: Mike Snitzer [EMAIL PROTECTED]; Peter Zijlstra [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]; Ingo Molnar [EMAIL PROTECTED]; 
 linux-kernel@vger.kernel.org; [EMAIL PROTECTED] [EMAIL PROTECTED]; Linus 
 Torvalds [EMAIL PROTECTED]
 Sent: Wednesday, January 16, 2008 1:00:04 PM
 Subject: Re: regression: 100% io-wait with 2.6.24-rcX
 
 On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
   For those interested in using your writeback improvements in
   production sooner rather than later (primarily with ext3); what
   recommendations do you have?  Just heavily test our own 2.6.24
 +
 
 your
   evolving close, but not ready for merge -mm writeback patchset?
   
  Hi Fengguang, Mike,
  
   I can add myself to Mikes question. It would be good to know
 a
 
 roadmap for the writeback changes. Testing 2.6.24-rcX so far has
 been
 
 showing quite nice improvement of the overall writeback situation and
 it
 
 would be sad to see this [partially] gone in 2.6.24-final.
 Linus
 
 apparently already has reverted  ...2250b. I will definitely repeat my
 tests
 
 with -rc8. and report.
 
 Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
 Maybe we can push it to 2.6.24 after your testing.
 
Hi Fengguang,

 something really bad has happened between -rc3 and -rc6. Embarrassingly I did 
not catch that earlier :-(

 Compared to the numbers I posted in http://lkml.org/lkml/2007/10/26/208 , dd1 
is now at 60 MB/sec (slight plus), while dd2/dd3 suck the same way as in pre 
2.6.24. The only test that is still good is mix3, which I attribute to the 
per-BDI stuff.

 At the moment I am frantically trying to find when things went down. I did run 
-rc8 and rc8+yourpatch. No difference to what I see with -rc6. Sorry that I 
cannot provide any input to your patch.

Depressed
Martin

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-17 Thread Martin Knoblauch
- Original Message 
 From: Martin Knoblauch [EMAIL PROTECTED]
 To: Fengguang Wu [EMAIL PROTECTED]
 Cc: Mike Snitzer [EMAIL PROTECTED]; Peter Zijlstra [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]; Ingo Molnar [EMAIL PROTECTED]; 
 linux-kernel@vger.kernel.org; [EMAIL PROTECTED] [EMAIL PROTECTED]; Linus 
 Torvalds [EMAIL PROTECTED]
 Sent: Thursday, January 17, 2008 2:52:58 PM
 Subject: Re: regression: 100% io-wait with 2.6.24-rcX
 
 - Original Message 
  From: Fengguang Wu 
  To: Martin Knoblauch 
  Cc: Mike Snitzer ; Peter
 Zijlstra
 
 ; [EMAIL PROTECTED]; Ingo Molnar
 ;
 
 linux-kernel@vger.kernel.org;
 [EMAIL PROTECTED]
 
 ; Linus
 Torvalds
 
 
  Sent: Wednesday, January 16, 2008 1:00:04 PM
  Subject: Re: regression: 100% io-wait with 2.6.24-rcX
  
  On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
For those interested in using your writeback improvements in
production sooner rather than later (primarily with ext3); what
recommendations do you have?  Just heavily test our own 2.6.24
  +
  
  your
evolving close, but not ready for merge -mm writeback patchset?

   Hi Fengguang, Mike,
   
I can add myself to Mikes question. It would be good to know
  a
  
  roadmap for the writeback changes. Testing 2.6.24-rcX so far has
  been
  
  showing quite nice improvement of the overall writeback situation and
  it
  
  would be sad to see this [partially] gone in 2.6.24-final.
  Linus
  
  apparently already has reverted  ...2250b. I will definitely
 repeat
 
 my
  tests
  
  with -rc8. and report.
  
  Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
  Maybe we can push it to 2.6.24 after your testing.
  
 Hi Fengguang,
 
  something really bad has happened between -rc3 and
 -rc6.
 
 Embarrassingly I did not catch that earlier :-(
 
  Compared to the numbers I posted
 in
 
 http://lkml.org/lkml/2007/10/26/208 , dd1 is now at 60 MB/sec
 (slight
 
 plus), while dd2/dd3 suck the same way as in pre 2.6.24. The only
 test
 
 that is still good is mix3, which I attribute to the per-BDI stuff.
 
  At the moment I am frantically trying to find when things went down.
 I
 
 did run -rc8 and rc8+yourpatch. No difference to what I see with
 -rc6.
 
 Sorry that I cannot provide any input to your patch.
 

 OK, the change happened between rc5 and rc6. Just following a gut feeling, I 
reverted

#commit 81eabcbe0b991ddef5216f30ae91c4b226d54b6d
#Author: Mel Gorman [EMAIL PROTECTED]
#Date:   Mon Dec 17 16:20:05 2007 -0800
#
#mm: fix page allocation for larger I/O segments
#
#In some cases the IO subsystem is able to merge requests if the pages are
#adjacent in physical memory.  This was achieved in the allocator by having
#expand() return pages in physically contiguous order in situations were a
#large buddy was split.  However, list-based anti-fragmentation changed the
#order pages were returned in to avoid searching in buffered_rmqueue() for a
#page of the appropriate migrate type.
#
#This patch restores behaviour of rmqueue_bulk() preserving the physical
#order of pages returned by the allocator without incurring increased search
#costs for anti-fragmentation.
#
#Signed-off-by: Mel Gorman [EMAIL PROTECTED]
#Cc: James Bottomley [EMAIL PROTECTED]
#Cc: Jens Axboe [EMAIL PROTECTED]
#Cc: Mark Lord [EMAIL PROTECTED]
#Signed-off-by: Andrew Morton [EMAIL PROTECTED]
#Signed-off-by: Linus Torvalds [EMAIL PROTECTED]
diff -urN linux-2.6.24-rc5/mm/page_alloc.c linux-2.6.24-rc6/mm/page_alloc.c
--- linux-2.6.24-rc5/mm/page_alloc.c2007-12-21 04:14:11.305633890 +
+++ linux-2.6.24-rc6/mm/page_alloc.c2007-12-21 04:14:17.746985697 +
@@ -847,8 +847,19 @@
struct page *page = __rmqueue(zone, order, migratetype);
if (unlikely(page == NULL))
break;
+
+   /*
+* Split buddy pages returned by expand() are received here
+* in physical page order. The page is added to the callers and
+* list and the list head then moves forward. From the callers
+* perspective, the linked list is ordered by page number in
+* some conditions. This is useful for IO devices that can
+* merge IO requests if the physical pages are ordered
+* properly.
+*/
list_add(page-lru, list);
set_page_private(page, migratetype);
+   list = page-lru;
}
spin_unlock(zone-lock);
return i;



This has brought back the good results I observed and reported.
I do not know what to make of this. At least on the systems I care
about (HP DL380 G4, dual CPUs, HT enabled, 8 GB memory, SmartArray 6i
controller with 4x72GB SCSI disks as RAID5 (battery protected writeback
cache enabled) and gigabit networking (tg3)) this optimisation is a disaster.

 On the other hand, it is not a regression against 2.6.22/23. Those had bad
IO scaling too. It would just be a shame to lose an apparently great
performance win.
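
 For what it is worth, one way to see whether the block layer is actually
merging adjacent requests - which is what this commit is supposed to enable -
is to watch the merge columns of "iostat -x" while a test runs. A rough
sketch, assuming the sysstat iostat is installed (just look at the row for
the cciss device):

$ iostat -x 5

 The wrqm/s column is write requests merged per second and w/s is writes
actually issued to the device; a high wrqm/s relative to w/s means the
requests coming out of the allocator are being coalesced.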

Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-17 Thread Martin Knoblauch
- Original Message 
> From: Mike Snitzer <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: Fengguang Wu <[EMAIL PROTECTED]>; Peter Zijlstra <[EMAIL PROTECTED]>; 
> [EMAIL PROTECTED]; Ingo Molnar <[EMAIL PROTECTED]>; 
> linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; Linus 
> Torvalds <[EMAIL PROTECTED]>
> Sent: Thursday, January 17, 2008 5:11:50 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> 
> I've backported Peter's perbdi patchset to 2.6.22.x.  I can share it
> with anyone who might be interested.
> 
> As expected, it has yielded 2.6.24-rcX level scaling.  Given the test
> result matrix you previously posted, 2.6.22.x+perbdi might give you
> what you're looking for (sans improved writeback that 2.6.24 was
> thought to be providing).  That is, much improved scaling with better
> O_DIRECT and network throughput.  Just a thought...
> 
> Unfortunately, my priorities (and computing resources) have shifted
> and I won't be able to thoroughly test Fengguang's new writeback patch
> on 2.6.24-rc8... whereby missing out on providing
> justification/testing to others on _some_ improved writeback being
> included in 2.6.24 final.
> 
> Not to mention the window for writeback improvement is all but closed
> considering the 2.6.24-rc8 announcement's 2.6.24 final release
> timetable.
 
Mike,

 thanks for the offer, but the improved throughput is my #1 priority nowadays.
And while the better scaling for different targets is nothing to frown upon, the
much better scaling when writing to the same target would have been the big
winner for me.

 Anyway, I located the offending commit. Let's see what the experts say.


Cheers
Martin



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-17 Thread Martin Knoblauch
- Original Message 
> From: Mel Gorman <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: Fengguang Wu <[EMAIL PROTECTED]>; Mike Snitzer <[EMAIL PROTECTED]>; Peter 
> Zijlstra <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; Ingo Molnar <[EMAIL 
> PROTECTED]>; linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL 
> PROTECTED]>; Linus Torvalds <[EMAIL PROTECTED]>; [EMAIL PROTECTED]
> Sent: Thursday, January 17, 2008 9:23:57 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> On (17/01/08 09:44), Martin Knoblauch didst pronounce:
> > > > On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > > > > For those interested in using your writeback improvements in
> > > > > > production sooner rather than later (primarily with ext3); what
> > > > > > recommendations do you have?  Just heavily test our own 2.6.24
> > > > > > + your evolving "close, but not ready for merge" -mm writeback
> > > > > > patchset?
> > > > > 
> > > > >  I can add myself to Mikes question. It would be good to know a
> > > > > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has
> > > > > been showing quite nice improvement of the overall writeback
> > > > > situation and it would be sad to see this [partially] gone in
> > > > > 2.6.24-final. Linus apparently already has reverted "...2250b". I
> > > > > will definitely repeat my tests with -rc8. and report.
> > > > 
> > > > Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> > > > Maybe we can push it to 2.6.24 after your testing.
> > > 
> > > Hi Fengguang,
> > > 
> > >  something really bad has happened between -rc3 and -rc6.
> > > Embarrassingly I did not catch that earlier :-(
> > >  Compared to the numbers I posted in
> > > http://lkml.org/lkml/2007/10/26/208 , dd1 is now at 60 MB/sec
> > > (slight plus), while dd2/dd3 suck the same way as in pre 2.6.24.
> > > The only test that is still good is mix3, which I attribute to
> > > the per-BDI stuff.
> 
> I suspect that the IO hardware you have is very sensitive to the
> color of the physical page. I wonder, do you boot the system cleanly
> and then run these tests? If so, it would be interesting to know what
> happens if you stress the system first (many kernel compiles for example,
> basically anything that would use a lot of memory in different ways for some
> time) to randomise the free lists a bit and then run your test. You'd need to
> run the test three times for 2.6.23, 2.6.24-rc8 and 2.6.24-rc8 with the patch
> you identified reverted.


 The effect is definitely depending on the IO hardware. I performed the same
tests on a different box with an AACRAID controller and there things look
different. Basically the offending commit helps single stream performance on
that box, while dual/triple stream are not affected. So I suspect that the
CCISS is just not behaving well.

 And yes, the tests are usually done on a freshly booted box. Of course, I
repeat them a few times. On the CCISS box the numbers are very constant. On
the AACRAID box they vary quite a bit.

 I can certainly stress the box before doing the tests. Please define "many"
for the kernel compiles :-)
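
 Something along these lines is what I would run to churn the allocator
before repeating the dd tests (just a sketch - the tree location and the five
iterations are arbitrary):

$ cd /usr/src/linux-2.6.24-rc8
$ NCPU=$(grep -c ^processor /proc/cpuinfo)
$ for i in 1 2 3 4 5; do make -s clean; make -s -j$NCPU; done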

  
> >  OK, the change happened between rc5 and rc6. Just following a
> > gut feeling, I reverted
> > 
> > #commit 81eabcbe0b991ddef5216f30ae91c4b226d54b6d
> > #Author: Mel Gorman [EMAIL PROTECTED]
> > #Date:   Mon Dec 17 16:20:05 2007 -0800
> > #
> > 
> > This has brought back the good results I observed and reported.
> > I do not know what to make of this. At least on the systems I care
> > about (HP DL380 G4, dual CPUs, HT enabled, 8 GB memory, SmartArray 6i
> > controller with 4x72GB SCSI disks as RAID5 (battery protected writeback
> > cache enabled) and gigabit networking (tg3)) this optimisation is a
> > disaster.
> 
> That patch was not an optimisation, it was a regression fix
> against 2.6.23 and I don't believe reverting it is an option. Other IO
> hardware benefits from having the allocator supply pages in PFN order.

 I think this late in the 2.6.24 game we just should leave things as they are.
But we should try to find a way to make CCISS faster, as it apparently can be
faster.

> Your controller would seem to suffer when presented with the same situation
> but I don't know why that is. I've added James to the cc in case he has seen
> this sort of situation before.
> 
> >  On the other hand, it is not a regression against 2.6.22/23. Those
> > had bad IO scaling too. It would just be a shame to lose an apparently
> > great performance win.
> 
> Could you try running your tests again when the system has been
> stressed with some other workload first?
 

 Will do.

Cheers
Martin



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-16 Thread Martin Knoblauch
- Original Message 
> From: Fengguang Wu <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: Mike Snitzer <[EMAIL PROTECTED]>; Peter Zijlstra <[EMAIL PROTECTED]>; 
> [EMAIL PROTECTED]; Ingo Molnar <[EMAIL PROTECTED]>; 
> linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; Linus 
> Torvalds <[EMAIL PROTECTED]>
> Sent: Wednesday, January 16, 2008 1:00:04 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > For those interested in using your writeback improvements in
> > > production sooner rather than later (primarily with ext3); what
> > > recommendations do you have?  Just heavily test our own 2.6.24 + your
> > > evolving "close, but not ready for merge" -mm writeback patchset?
> > > 
> > Hi Fengguang, Mike,
> > 
> >  I can add myself to Mikes question. It would be good to know a
> > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has been
> > showing quite nice improvement of the overall writeback situation and it
> > would be sad to see this [partially] gone in 2.6.24-final. Linus
> > apparently already has reverted  "...2250b". I will definitely repeat my
> > tests with -rc8. and report.
> 
> Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> Maybe we can push it to 2.6.24 after your testing.
> 

 Will do tomorrow or friday. Actually a patch against -rc8 would be nicer for 
me, as I have not looked at -rc7 due to holidays and some of the reported 
problems with it.

Cheers
Martin

> Fengguang
> ---
>  fs/fs-writeback.c |   17 +++--
>  include/linux/writeback.h |1 +
>  mm/page-writeback.c   |9 ++---
>  3 files changed, 22 insertions(+), 5 deletions(-)
> 
> --- linux.orig/fs/fs-writeback.c
> +++ linux/fs/fs-writeback.c
> @@ -284,7 +284,16 @@ __sync_single_inode(struct inode *inode,
>   * soon as the queue becomes uncongested.
>   */
>  inode->i_state |= I_DIRTY_PAGES;
> -requeue_io(inode);
> +if (wbc->nr_to_write <= 0)
> +/*
> + * slice used up: queue for next turn
> + */
> +requeue_io(inode);
> +else
> +/*
> + * somehow blocked: retry later
> + */
> +redirty_tail(inode);
>  } else {
>  /*
>   * Otherwise fully redirty the inode so that
> @@ -479,8 +488,12 @@ sync_sb_inodes(struct super_block *sb, s
>  iput(inode);
>  cond_resched();
>  spin_lock(_lock);
> -if (wbc->nr_to_write <= 0)
> +if (wbc->nr_to_write <= 0) {
> +wbc->more_io = 1;
>  break;
> +}
> +if (!list_empty(>s_more_io))
> +wbc->more_io = 1;
>  }
>  return;/* Leave any unwritten inodes on s_io */
>  }
> --- linux.orig/include/linux/writeback.h
> +++ linux/include/linux/writeback.h
> @@ -62,6 +62,7 @@ struct writeback_control {
>  unsigned for_reclaim:1;/* Invoked from the page allocator */
>  unsigned for_writepages:1;/* This is a writepages() call */
>  unsigned range_cyclic:1;/* range_start is cyclic */
> +unsigned more_io:1;/* more io to be dispatched */
>  };
>  
>  /*
> --- linux.orig/mm/page-writeback.c
> +++ linux/mm/page-writeback.c
> @@ -558,6 +558,7 @@ static void background_writeout(unsigned
>  global_page_state(NR_UNSTABLE_NFS) < background_thresh
>  && min_pages <= 0)
>  break;
> +wbc.more_io = 0;
>  wbc.encountered_congestion = 0;
>  wbc.nr_to_write = MAX_WRITEBACK_PAGES;
>  wbc.pages_skipped = 0;
> @@ -565,8 +566,9 @@ static void background_writeout(unsigned
>  min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
>  if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
>  /* Wrote less than expected */
> -congestion_wait(WRITE, HZ/10);
> -if (!wbc.encountered_congestion)
> +if (wbc.encountered_congestion || wbc.more_io)
> +congestion_wait(WRITE, HZ/10);
> +else
>  break;
>  }
>  }
> @@ -631,11 +633,12 @@ static void wb_kupdate(unsigned long arg
>  global_page_state(NR_UNSTABLE_NFS) +
>  (inodes_stat.nr_inodes - inodes_stat.nr_unused);
>  while (nr_to_write > 0) {
> +wbc.more_io = 0;
>  wbc.encountered_congestion = 0;
>  wbc.nr_to_write = MAX_WRITEBACK_PAGES;
>  writeback_inodes(&wbc);
>  if (wbc.nr_to_write > 0) {
> -if (wbc.encountered_congestion)
> +if (wbc.encountered_congestion || wbc.more_io)
>  congestion_wait(WRITE, HZ/10);
>  else
>  break;/* All the old data is written */
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: regression: 100% io-wait with 2.6.24-rcX

2008-01-16 Thread Martin Knoblauch
- Original Message 
> From: Mike Snitzer <[EMAIL PROTECTED]>
> To: Fengguang Wu <[EMAIL PROTECTED]>
> Cc: Peter Zijlstra <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; Ingo Molnar 
> <[EMAIL PROTECTED]>; linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" 
> <[EMAIL PROTECTED]>; Linus Torvalds <[EMAIL PROTECTED]>; Andrew Morton 
> <[EMAIL PROTECTED]>
> Sent: Tuesday, January 15, 2008 10:13:22 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> 
> On Jan 14, 2008 7:50 AM, Fengguang Wu  wrote:
> > On Mon, Jan 14, 2008 at 12:41:26PM +0100, Peter Zijlstra wrote:
> > >
> > > On Mon, 2008-01-14 at 12:30 +0100, Joerg Platte wrote:
> > > > Am Montag, 14. Januar 2008 schrieb Fengguang Wu:
> > > >
> > > > > Joerg, this patch fixed the bug for me :-)
> > > >
> > > > Fengguang, congratulations, I can confirm that your patch fixed
> > > > the bug! With previous kernels the bug showed up after each reboot.
> > > > Now, when booting the patched kernel everything is fine and there is
> > > > no longer any suspicious iowait!
> > > >
> > > > Do you have an idea why this problem appeared in 2.6.24? Did
> > > > somebody change the ext2 code or is it related to the changes in the
> > > > scheduler?
> > >
> > > It was Fengguang who changed the inode writeback code, and I guess the
> > > new and improved code was less able do deal with these funny corner
> > > cases. But he has been very good in tracking them down and solving them,
> > > kudos to him for that work!
> >
> > Thank you.
> >
> > In particular the bug is triggered by the patch named:
> > "writeback: introduce writeback_control.more_io to indicate more io"
> > That patch means to speed up writeback, but unfortunately its
> > aggressiveness has disclosed bugs in reiserfs, jfs and now ext2.
> >
> > Linus, given the number of bugs it triggered, I'd recommend revert
> > this patch(git commit 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b). Let's
> > push it back to -mm tree for more testings?
> 
> Fengguang,
> 
> I'd like to better understand where your writeback work stands
> relative to 2.6.24-rcX and -mm.  To be clear, your changes in
> 2.6.24-rc7 have been benchmarked to provide a ~33% sequential write
> performance improvement with ext3 (as compared to 2.6.22, CFS could be
> helping, etc but...).  Very impressive!
> 
> Given this improvement it is unfortunate to see your request to revert
> 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b but it is understandable if
> you're not confident in it for 2.6.24.
> 
> That said, you recently posted an -mm patchset that first reverts
> 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b and then goes on to address
> the "slow writes for concurrent large and small file writes" bug:
> http://lkml.org/lkml/2008/1/15/132
> 
> For those interested in using your writeback improvements in
> production sooner rather than later (primarily with ext3); what
> recommendations do you have?  Just heavily test our own 2.6.24 + your
> evolving "close, but not ready for merge" -mm writeback patchset?
> 
Hi Fengguang, Mike,

 I can add myself to Mikes question. It would be good to know a "roadmap" for 
the writeback changes. Testing 2.6.24-rcX so far has been showing quite nice 
improvement of the overall writeback situation and it would be sad to see this 
[partially] gone in 2.6.24-final. Linus apparently already has reverted  
"...2250b". I will definitely repeat my tests with -rc8. and report.

 Cheers
Martin




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW related

2008-01-14 Thread Martin Knoblauch
- Original Message 
> From: Martin Knoblauch <[EMAIL PROTECTED]>
> To: Chris Snook <[EMAIL PROTECTED]>
> Cc: linux-kernel@vger.kernel.org; [EMAIL PROTECTED]; spam trap <[EMAIL 
> PROTECTED]>
> Sent: Saturday, December 29, 2007 12:11:08 PM
> Subject: Re: Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW 
> related
> 
> - Original Message 
> > From: Chris Snook 
> > To: Martin Knoblauch 
> > Cc: linux-kernel@vger.kernel.org; [EMAIL PROTECTED]
> > Sent: Friday, December 28, 2007 7:45:13 PM
> > Subject: Re: Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW 
> > related
> > 
> > Martin Knoblauch wrote:
> > > Hi,
> > > 
> > > currently I am tracking down an "interesting" effect when writing
> 
> > 3) It sounds like the bottleneck is the vxfs filesystem.  It
> > only *appears* on  the client side because writes up until dirty_ratio
> > get buffered on the client. 
> >   If you can confirm that the server is actually writing stuff to
> > disk slower  when the client is in writeback mode, then it's possible
> > the Linux NFSclient is  doing something inefficient in writeback mode.
> > 
> 
> so, is the output of "iostat -d -l1 d111" during two runs. The first
> run is with 750 MB, the second with 850MB.
> 
> // 750MB
> $ iostat -d -l 1 md111 2
>md111
> kps tps serv
>  22   0   14
>   0   00
>   0   0   13
> 29347 468   12
> 37040 593   17
> 30938 492   25
> 30421 491   25
> 41626 676   16
> 42913 703   14
> 39890 647   15
> 9009 1417
> 8963 1417
> 5143  817
> 34814 547   10
> 49323 775   12
> 28624 4516
>  22   16
>  finish
>   0   00
>   0   00
> 
>  Here it seems that the disk is writing for 26-28 seconds with avg. 29
> MB/sec. Fine.
> 
> // 850MB
> $ iostat -d -l 1 md111 2
>md111
> kps tps serv
>   0   00
> 11275 180   10
> 39874 635   14
> 37403 587   17
> 24341 392   30
> 25989 423   26
> 22464 375   30
> 21922 361   32
> 27924 450   26
> 21507 342   21
> 9217 153   15
> 9260 150   15
> 9544 155   15
> 9298 150   14
> 10118 162   11
> 15505 250   12
> 27513 448   14
> 26698 436   15
> 26144 431   15
> 25201 412   14
>  38 seconds in run
>   0   00
>   0   00
> 579  17   12
>   0   00
>   0   00
>   0   00
>   0   00
> 518   9   16
> 485   86
>   9   17
> 514   97
>   0   00
>   0   00
> 541   98
> 532  106
>   0   00
>   0   00
> 650  127
>   0   00
> 242   89
> 1023  185
> 304   56
> 418   87
> 283   55
> 303   58
> 527  106
>   0   00
>   0   00
>   0   00
>   5   1   13
>   0   00
>   0   00
>   0   00
>   0   00
>   0   00
>   0   0   11
>   0   00
>   0   00
>   0   00
>   1   0   15
>   0   00
>  96   2   15
> 138   3   10
> 11057 1756
> 17549 2806
> 351   85
>   0   00
> # 218 seconds in run, finish.
> 
>  So, for the first 38 seconds everything looks similar to the 750
> MB case. For the next about 180 seconds most time nothing happens.
> Averaging 4.1 MB/sec.
> 
> Maybe it is time to capture the traffic. What are the best
> tcpdump parameters for NFS? I always forget :-(
> 
> Cheers
> Martin
> 
> 
Hi,

 now that the seasonal festivities are over - Happy New Year btw. - any 
comments/suggestions on my problem?

Cheers
Martin



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 00/11] writeback bug fixes and simplifications

2008-01-11 Thread Martin Knoblauch
- Original Message 
> From: WU Fengguang <[EMAIL PROTECTED]>
> To: Hans-Peter Jansen <[EMAIL PROTECTED]>
> Cc: Sascha Warner <[EMAIL PROTECTED]>; Andrew Morton <[EMAIL PROTECTED]>; 
> linux-kernel@vger.kernel.org; Peter Zijlstra <[EMAIL PROTECTED]>
> Sent: Wednesday, January 9, 2008 4:33:32 AM
> Subject: Re: [PATCH 00/11] writeback bug fixes and simplifications
> 
> On Sat, Dec 29, 2007 at 03:56:59PM +0100, Hans-Peter Jansen wrote:
> > Am Freitag, 28. Dezember 2007 schrieb Sascha Warner:
> > > Andrew Morton wrote:
> > > > On Thu, 27 Dec 2007 23:08:40 +0100 Sascha Warner <[EMAIL PROTECTED]>
> > > > wrote:
> > > >> Hi,
> > > >>
> > > >> I applied your patches to 2.6.24-rc6-mm1, but now I am faced with one
> > > >> pdflush often using 100% CPU for a long time. There seem to be some
> > > >> rare pauses from its 100% usage, however.
> > > >>
> > > >> On ~23 minutes uptime i have ~19 minutes pdflush runtime.
> > > >>
> > > >> This is on E6600, x86_64, 2 Gig RAM, SATA HDD, running on gentoo
> > > >> ~x64_64
> > > >>
> > > >> Let me know if you need more info.
> > > >
> > > > (some) cc's restored.  Please, always do reply-to-all.
> > >
> > > Hi Wu,
> > 
> > Sascha, if you want to address Fengguang by his first name, note that 
> > chinese and bavarians (and some others I forgot now, too) typically use the 
> > order:
> >   
> >   lastname firstname 
> > 
> > when they spell their names. Another evidence is, that the name Wu is a 
> > pretty common chinese family name.
> > 
> > Fengguang, if it's the other way around, correct me please (and I'm going to 
> > wear a big brown paper bag for the rest of the day..). 
> 
> You are right. We normally do "Fengguang" or "Mr. Wu" :-)
> For LKML the first name is less ambiguous.
> 
> Thanks,
> Fengguang
> 

 Just cannot resist. Hans-Peter mentions Bavarian using Lastname-Givenname as 
well. This is only true in a folklore context (or when you are very deep in the 
countryside). Officially the bavarians use the usual German Given/Lastname. 
Although they will never admit to be Germans, of course :-)


Cheers
Martin

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW related

2007-12-29 Thread Martin Knoblauch
- Original Message 
> From: Chris Snook <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: linux-kernel@vger.kernel.org; [EMAIL PROTECTED]
> Sent: Friday, December 28, 2007 7:45:13 PM
> Subject: Re: Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW 
> related
> 
> Martin Knoblauch wrote:
> > Hi,
> > 
> > currently I am tracking down an "interesting" effect when writing

> 3) It sounds like the bottleneck is the vxfs filesystem.  It
> only *appears* on  the client side because writes up until dirty_ratio
> get buffered on the client. 
>   If you can confirm that the server is actually writing stuff to
> disk slower  when the client is in writeback mode, then it's possible
> the Linux NFSclient is  doing something inefficient in writeback mode.
> 

so, is the output of "iostat -d -l1 d111" during two runs. The first run is 
with 750 MB, the second with 850MB.

// 750MB
$ iostat -d -l 1 md111 2
   md111
kps tps serv
 22   0   14
  0   00
  0   0   13
29347 468   12
37040 593   17
30938 492   25
30421 491   25
41626 676   16
42913 703   14
39890 647   15
9009 1417
8963 1417
5143  817
34814 547   10
49323 775   12
28624 4516
 22   16
 finish
  0   00
  0   00

 Here it seems that the disk is writing for 26-28 seconds with avg. 29 MB/sec. 
Fine.

// 850MB
$ iostat -d -l 1 md111 2
   md111
kps tps serv
  0   00
11275 180   10
39874 635   14
37403 587   17
24341 392   30
25989 423   26
22464 375   30
21922 361   32
27924 450   26
21507 342   21
9217 153   15
9260 150   15
9544 155   15
9298 150   14
10118 162   11
15505 250   12
27513 448   14
26698 436   15
26144 431   15
25201 412   14
 38 seconds in run
  0   00
  0   00
579  17   12
  0   00
  0   00
  0   00
  0   00
518   9   16
485   86
  9   17
514   97
  0   00
  0   00
541   98
532  106
  0   00
  0   00
650  127
  0   00
242   89
1023  185
304   56
418   87
283   55
303   58
527  106
  0   00
  0   00
  0   00
  5   1   13
  0   00
  0   00
  0   00
  0   00
  0   00
  0   0   11
  0   00
  0   00
  0   00
  1   0   15
  0   00
 96   2   15
138   3   10
11057 1756
17549 2806
351   85
  0   00
# 218 seconds in run, finish.

 So, for the first 38 seconds everything looks similar to the 750 MB case. For 
the next about 180 seconds most time nothing happens. Averaging 4.1 MB/sec.

Maybe it is time to capture the traffic. What are the best tcpdump parameters 
for NFS? I always forget :-(
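
 Probably something along these lines, i.e. full frames into a file with a
filter on the server and the NFS port (the interface name and capture file
are of course just placeholders):

$ tcpdump -i eth0 -s 0 -w /tmp/nfs-vxfs.pcap host spsdm5 and tcp port 2049

 The capture can then be decoded offline with wireshark/ethereal.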

Cheers
Martin



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW related

2007-12-29 Thread Martin Knoblauch
- Original Message 
> From: Chris Snook <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: linux-kernel@vger.kernel.org; [EMAIL PROTECTED]
> Sent: Friday, December 28, 2007 7:45:13 PM
> Subject: Re: Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW 
> related
> 
> Martin Knoblauch wrote:
> > Hi,
> > 
> > currently I am tracking down an "interesting" effect when writing to a
> > Solaris-10/Sparc based server. The server exports two filesystems. One UFS,
> > one VXFS. The filesystems are mounted NFS3/TCP, no special options. Linux
> > kernel in question is 2.6.24-rc6, but it happens with earlier kernels
> > (2.6.19.2, 2.6.22.6) as well. The client is x86_64 with 8 GB of ram.
> > 
> > The problem: when writing to the VXFS based filesystem, performance drops
> > dramatically when the filesize reaches or exceeds "dirty_ratio". For a
> > dirty_ratio of 10% (about 800MB) files below 750 MB are transferred with
> > about 30 MB/sec. Anything above 770 MB drops down to below 10 MB/sec. If I
> > perform the same tests on the UFS based FS, performance stays at about
> > 30 MB/sec until 3GB and likely larger (I just stopped at 3 GB).
> > 
> > Any ideas what could cause this difference? Any suggestions on debugging it?
> 
> 1) Try normal NFS tuning, such as rsize/wsize tuning.
>

  rsize/wsize only have minimal effect. The negotiated  size seems to be 
optimal.

> 2) You're entering synchronous writeback mode, so you can delay the
> problem by raising dirty_ratio to 100, or reduce the size of the problem
> by lowering  dirty_ratio to 1.  Either one could help.
> 

 For experiments, sure. But I do not think that I want to have 8 GB of dirty 
pages [potentially] laying around. Are you sure that 1% is a useful value for 
dirty_ratio? Looking at the code, it seems a minimum of 5% is  enforced  in 
"page-writeback.c:get_dirty_limits":

dirty_ratio = vm_dirty_ratio;
if (dirty_ratio > unmapped_ratio / 2)
dirty_ratio = unmapped_ratio / 2;

if (dirty_ratio < 5)
dirty_ratio = 5;
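
 In numbers, for an 8 GB client that clamp works out roughly as follows
(a back-of-the-envelope sketch that ignores the unmapped_ratio term, which
depends on how much memory is currently mapped):

$ mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
$ echo "dirty_ratio=10 -> $((mem_kb * 10 / 100 / 1024)) MB dirty limit"
$ echo "dirty_ratio=1  -> clamped to 5 -> $((mem_kb * 5 / 100 / 1024)) MB"

 So going below 5% buys nothing; on this box the dirty limit can only be
pushed down to roughly 400 MB.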


> 3) It sounds like the bottleneck is the vxfs filesystem.  It only
> *appears* on  the client side because writes up until dirty_ratio get
> buffered on the client. 

 Sure, the fact that a UFS (or SAM-FS) based FS behaves well in the same 
situation points in that direction.

>   If you can confirm that the server is actually writing stuff to disk
> slower  when the client is in writeback mode, then it's possible the Linux
> NFS client is  doing something inefficient in writeback mode.
> 

 I will try to get an iostat trace from the Sun side. Thanks for the suggestion.

Cheers
Martin
PS: Happy Year 2008 to all Kernel Hackers and their families



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW related

2007-12-28 Thread Martin Knoblauch
Hi,

 currently I am tracking down an "interesting" effect when writing to a 
Solaris-10/Sparc based server. The server exports two filesystems. One UFS, one 
VXFS. The filesystems are mounted NFS3/TCP, no special options. Linux kernel in 
question is 2.6.24-rc6, but it happens with earlier kernels (2.6.19.2, 
2.6.22.6) as well. The client is x86_64 with 8 GB of ram. 

 The problem: when writing to the VXFS based filesystem, performance drops 
dramatically when the filesize reaches or exceeds "dirty_ratio". For a 
dirty_ratio of 10% (about 800MB) files below 750 MB are transferred with about 
30 MB/sec. Anything above 770 MB drops down to below 10 MB/sec. If I perform 
the same tests on the UFS based FS, performance stays at about 30 MB/sec until 
3GB and likely larger (I just stopped at 3 GB).
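
 A plain sequential write should be enough to reproduce it - something like
the following, repeated with sizes just below and above the dirty limit (path
and sizes here are only examples):

$ dd if=/dev/zero of=/mnt/test_vxfs/testfile bs=1M count=750
$ dd if=/dev/zero of=/mnt/test_vxfs/testfile bs=1M count=850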

 Any ideas what could cause this difference? Any suggestions on debugging it?

spsdm5:/lfs/test_ufs on /mnt/test_ufs type nfs 
(rw,proto=tcp,nfsvers=3,hard,intr,addr=160.50.118.37)
spsdm5:/lfs/test_vxfs on /mnt/test_vxfs type nfs 
(rw,proto=tcp,nfsvers=3,hard,intr,addr=160.50.118.37)

Cheers
Martin
PS: Please CC me, as I am not subscribed. Don't worry about the spamtrap name 
:-)

----------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


What is the unit of "nr_writeback"?

2007-12-04 Thread Martin Knoblauch
Hi,

 forgive the stupid question. What is the unit of "nr_writeback"? One would 
usually assume a rate, but looking at the code I see it added together with 
nr_dirty and nr_unstable, somehow defeating the assumption.
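
 (For reference, the counters I mean are the ones visible in /proc/vmstat:

$ grep -E 'nr_dirty|nr_writeback|nr_unstable' /proc/vmstat )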

Cheers
Martin
------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Stack warning from 2.6.24-rc

2007-12-04 Thread Martin Knoblauch
- Original Message 
> From: Ingo Molnar <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: linux-kernel@vger.kernel.org
> Sent: Tuesday, December 4, 2007 12:52:23 PM
> Subject: Re: Stack warning from 2.6.24-rc
> 
> 
> * Martin Knoblauch  wrote:
> 
> >  I see the following stack warning(s) on a IBM x3650 (2xDual-Core, 8 
> >  GB, AACRAID with 6x146GB RAID5) running 2.6.24-rc3/rc4:
> > 
> > [  180.739846] mount.nfs used greatest stack depth: 3192 bytes left
> > [  666.121007] bash used greatest stack depth: 3160 bytes left
> > 
> >  Nothing bad has happened so far. The message does not show on a 
> >  similarly configured HP/DL-380g4 (CCISS instead of AACRAID) running 
> >  rc3. Anything to worry? Anything I can do to help debugging?
> 
> those are generated by:
> 
>   CONFIG_DEBUG_STACKOVERFLOW=y
>   CONFIG_DEBUG_STACK_USAGE=y
> 
> and look quite harmless. If they were much closer to zero it would be
> a problem.
> 
> Ingo
> 

 OK, I will ignore it then. I was just surprised to see it.

Thanks
Martin

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: iozone write 50% regression in kernel 2.6.24-rc1

2007-11-12 Thread Martin Knoblauch
- Original Message 
> From: "Zhang, Yanmin" <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: [EMAIL PROTECTED]; LKML 
> Sent: Monday, November 12, 2007 1:45:57 AM
> Subject: Re: iozone write 50% regression in kernel 2.6.24-rc1
> 
> On Fri, 2007-11-09 at 04:36 -0800, Martin Knoblauch wrote:
> > - Original Message 
> > > From: "Zhang, Yanmin" 
> > > To: [EMAIL PROTECTED]
> > > Cc: LKML 
> > > Sent: Friday, November 9, 2007 10:47:52 AM
> > > Subject: iozone write 50% regression in kernel 2.6.24-rc1
> > > 
> > > Comparing with 2.6.23, iozone sequential write/rewrite (512M) has 50%
> > > regression
> > > in kernel 2.6.24-rc1. 2.6.24-rc2 has the same regression.
> > > 
> > > My machine has 8 processor cores and 8GB memory.
> > > 
> > > By bisect, I located patch
> > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=
> > > 04fbfdc14e5f48463820d6b9807daa5e9c92c51f.
> > > 
> > > 
> > > Another behavior: with kernel 2.6.23, if I run iozone for many times
> > > after rebooting machine,
> > > the result looks stable. But with 2.6.24-rc1, the first run of iozone
> > > got a very small result and
> > > following run has 4Xorig_result.
> > > 
> > > What I reported is the regression of 2nd/3rd run, because first run has
> > > bigger regression.
> > > 
> > > I also tried to change /proc/sys/vm/dirty_ratio,dirty_backgroud_ratio
> > > and didn't get improvement.
> >  could you tell us the exact iozone command you are using?
> iozone -i 0 -r 4k -s 512m
> 

 OK, I definitely do not see the reported effect.  On a HP Proliant with a 
RAID5 on CCISS I get:

2.6.19.2: 654-738 MB/sec write, 1126-1154 MB/sec rewrite
2.6.24-rc2: 772-820 MB/sec write, 1495-1539 MB/sec rewrite

 The first run is always slowest, all subsequent runs are faster and the same 
speed.

> 
> >  I would like to repeat it on my setup, because I definitely see the
> > opposite behaviour in 2.6.24-rc1/rc2. The speed there is much better
> > than in 2.6.22 and before (I skipped 2.6.23, because I was waiting for
> > the per-bdi changes). I definitely do not see the difference between 1st
> > and subsequent runs. But then, I do my tests with 5GB file sizes like:
> > 
> > iozone3_283/src/current/iozone -t 5 -F /scratch/X1 /scratch/X2
> > /scratch/X3 /scratch/X4 /scratch/X5 -s 5000M -r 1024 -c -e -i 0 -i 1
> My machine uses SATA (AHCI) disk.
> 

 4x72GB SCSI disks building a RAID5 on a CCISS controller with battery backed 
write cache. Systems are 2 CPUs (64-bit) with 8 GB memory. I could test on some 
IBM boxes (2x dual core, 8 GB) with RAID5 on "aacraid", but I need some time to 
free up one of the boxes.

Cheers
Martin



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: iozone write 50% regression in kernel 2.6.24-rc1

2007-11-09 Thread Martin Knoblauch
- Original Message 
> From: "Zhang, Yanmin" <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Cc: LKML 
> Sent: Friday, November 9, 2007 10:47:52 AM
> Subject: iozone write 50% regression in kernel 2.6.24-rc1
> 
> Comparing with 2.6.23, iozone sequential write/rewrite (512M) has
> 50%
> 
 regression
> in kernel 2.6.24-rc1. 2.6.24-rc2 has the same regression.
> 
> My machine has 8 processor cores and 8GB memory.
> 
> By bisect, I located patch
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=
> 04fbfdc14e5f48463820d6b9807daa5e9c92c51f.
> 
> 
> Another behavior: with kernel 2.6.23, if I run iozone for many
> times
> 
 after rebooting machine,
> the result looks stable. But with 2.6.24-rc1, the first run of
> iozone
> 
 got a very small result and
> following run has 4Xorig_result.
> 
> What I reported is the regression of 2nd/3rd run, because first run
> has
> 
 bigger regression.
> 
> I also tried to change
> /proc/sys/vm/dirty_ratio,dirty_backgroud_ratio
> 
 and didn't get improvement.
> 
> -yanmin
> -
Hi Yanmin,

 could you tell us the exact iozone command you are using? I would like to 
repeat it on my setup, because I definitely see the opposite behaviour in 
2.6.24-rc1/rc2. The speed there is much better than in 2.6.22 and before (I 
skipped 2.6.23, because I was waiting for the per-bdi changes). I definitely do 
not see the difference between 1st and subsequent runs. But then, I do my tests 
with 5GB file sizes like:

iozone3_283/src/current/iozone -t 5 -F /scratch/X1 /scratch/X2 /scratch/X3 
/scratch/X4 /scratch/X5 -s 5000M -r 1024 -c -e -i 0 -i 1

Kind regards
Martin



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-rc1: First impressions

2007-10-29 Thread Martin Knoblauch
- Original Message 
> From: Ingo Molnar <[EMAIL PROTECTED]>
> To: Andrew Morton <[EMAIL PROTECTED]>
> Cc: [EMAIL PROTECTED]; linux-kernel@vger.kernel.org; [EMAIL PROTECTED]; 
> [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Sent: Friday, October 26, 2007 9:33:40 PM
> Subject: Re: 2.6.24-rc1: First impressions
> 
> 
> * Andrew Morton  wrote:
> 
> > > > dd1 - copy 16 GB from /dev/zero to local FS
> > > > dd1-dir - same, but using O_DIRECT for output
> > > > dd2/dd2-dir - copy 2x7.6 GB in parallel from /dev/zero to
> local
> 
 FS
> > > > dd3/dd3-dir - copy 3x5.2 GB in parallel from /dev/zero lo
> local
> 
 FS
> > > > net1 - copy 5.2 GB from NFS3 share to local FS
> > > > mix3 - copy 3x5.2 GB from /dev/zero to local disk and two
> NFS3
> 
 shares
> > > > 
> > > >  I did the numbers for 2.6.19.2, 2.6.22.6 and 2.6.24-rc1.
> All
> 
 units 
> > > >  are MB/sec.
> > > > 
> > > > test   2.6.19.2 2.6.22.62.6.24.-rc1
> > > > 
> > > > dd1  28   50 96
> > > > dd1-dir  88   88 86
> > > > dd2  2x16.5 2x11 2x44.5
> > > > dd2-dir2x44 2x44   2x43
> > > > dd3   3x9.83x8.7   3x30
> > > > dd3-dir  3x29.5   3x29.5 3x28.5
> > > > net1  30-3350-55  37-52
> > > > mix3  17/3225/50 
> 96/35
> 
 (disk/combined-network)
> > > 
> > > wow, really nice results!
> > 
> > Those changes seem suspiciously large to me.  I wonder if
> there's
> 
 less 
> > physical IO happening during the timed run, and correspondingly more 
> > afterwards.
> 
> so a final 'sync' should be added to the test too, and the time
> it
> 
 takes 
> factored into the bandwidth numbers?
> 

 One of the reasons I do 15 GB transfers is to make sure that I am well above 
the possible page cache size. And of course I am doing a final sync to finish 
the runs :-) The sync is also running faster in 2.6.24-rc1.

 If I factor it in, the results for dd1/dd3 are:

test        2.6.19.2    2.6.22.6    2.6.24-rc1
sync time   18sec       19sec       6sec
dd1         27.5        47.5        92
dd3         3x9.1       3x8.5       3x29

So basically, including the sync time makes 2.6.24-rc1 look even more promising. Now, 
I know that my benchmark numbers are crude and show only a very small aspect 
of system performance. But it is an aspect I care about a lot, and those 
benchmarks match my use case pretty well.
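
 In case anybody wants to redo the "sync included" numbers: the idea is simply
to time the dd and the trailing sync as one unit. A minimal sketch (the real
test scripts are not shown here; path and size are only illustrative):

# time a 16 GB buffered write including the final sync, then compute MB/sec
SIZE_MB=16000
START=$(date +%s)
dd if=/dev/zero of=/scratch/ddtest bs=1M count=$SIZE_MB
sync
END=$(date +%s)
echo "effective throughput: $(( SIZE_MB / (END - START) )) MB/sec"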

Cheers
Martin





-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.24-rc1: First impressions

2007-10-29 Thread Martin Knoblauch
- Original Message 
> From: Andrew Morton <[EMAIL PROTECTED]>
> To: Arjan van de Ven <[EMAIL PROTECTED]>
> Cc: Ingo Molnar <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; 
> linux-kernel@vger.kernel.org; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL 
> PROTECTED]; [EMAIL PROTECTED]
> Sent: Saturday, October 27, 2007 7:59:51 AM
> Subject: Re: 2.6.24-rc1: First impressions
> 
> On Fri, 26 Oct 2007 22:46:57 -0700 Arjan van de
> Ven
> 
  wrote:
> 
> > > > > dd1 - copy 16 GB from /dev/zero to local FS
> > > > > dd1-dir - same, but using O_DIRECT for output
> > > > > dd2/dd2-dir - copy 2x7.6 GB in parallel from /dev/zero to
> local
> 
 FS
> > > > > dd3/dd3-dir - copy 3x5.2 GB in parallel from /dev/zero lo
> local
> 
 FS
> > > > > net1 - copy 5.2 GB from NFS3 share to local FS
> > > > > mix3 - copy 3x5.2 GB from /dev/zero to local disk and two NFS3
> > > > > shares
> > > > > 
> > > > >  I did the numbers for 2.6.19.2, 2.6.22.6 and 2.6.24-rc1. All
> > > > > units are MB/sec.
> > > > > 
> > > > > test   2.6.19.2 2.6.22.62.6.24.-rc1
> > > >
> >
> 
 
> > > > > dd1  28   50 96
> > > > > dd1-dir  88   88 86
> > > > > dd2  2x16.5 2x11 2x44.5
> > > > > dd2-dir2x44 2x44   2x43
> > > > > dd3   3x9.83x8.7   3x30
> > > > > dd3-dir  3x29.5   3x29.5 3x28.5
> > > > > net1  30-3350-55  37-52
> > > > > mix3  17/3225/50  96/35
> > > > > (disk/combined-network)
> > > > 
> > > > wow, really nice results!
> > > 
> > > Those changes seem suspiciously large to me.  I wonder if
> there's
> 
 less
> > > physical IO happening during the timed run, and
> correspondingly
> 
 more
> > > afterwards.
> > > 
> > 
> > another option... this is ext2.. didn't the ext2 reservation
> stuff
> 
 get
> > merged into -rc1? for ext3 that gave a 4x or so speed boost (much
> > better sequential allocation pattern)
> > 
> 
> Yes, one would expect that to make a large difference in
> dd2/dd2-dir
> 
 and
> dd3/dd3-dir - but only on SMP.  On UP there's not enough concurrency
> in the fs block allocator for any damage to occur.
>

 Just for the record, the tests are done on SMP.
 
> Reservations won't affect dd1 though, and that went faster too.
>

 This is the one result that surprised me most, as I did not really expect any 
big moves here. I am not complaining :-), but it would definitely be nice to 
understand why.

Cheers
Martin
> 


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


2.6.24-rc1: First impressions

2007-10-26 Thread Martin Knoblauch
Hi,

 just to give some feedback on 2.6.24-rc1. For some time I have been tracking 
IO/writeback problems that hurt system responsiveness big-time. I tested Peter's 
stuff together with Fengguang's additions and it looked promising. Therefore I 
was very happy to see Peter's stuff going into 2.6.24 and waited eagerly for 
rc1. In short, I am impressed. This really looks good. IO throughput is great 
and I could not reproduce the responsiveness problems so far.

 Below are some numbers from my brute-force I/O tests that I can use to bring 
responsiveness down. My platform is an HP DL380g4, dual CPUs, HT-enabled, 8 GB 
memory, Smart Array 6i controller with 4x72GB SCSI disks as RAID5 
(battery-protected writeback cache enabled) and gigabit networking (tg3). 
User space is 64-bit RHEL4.3.

 I am basically doing copies using "dd" with 1MB blocksize (a sketch of the 
likely invocations follows the test list below). The local filesystem is ext2 
(noatime). The IO scheduler is deadline, as it tends to give the best results. 
The NFS3 server is a Sun/T2000/Solaris10. The tests are:

dd1 - copy 16 GB from /dev/zero to local FS
dd1-dir - same, but using O_DIRECT for output
dd2/dd2-dir - copy 2x7.6 GB in parallel from /dev/zero to local FS
dd3/dd3-dir - copy 3x5.2 GB in parallel from /dev/zero lo local FS
net1 - copy 5.2 GB from NFS3 share to local FS
mix3 - copy 3x5.2 GB from /dev/zero to local disk and two NFS3 shares
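
 The exact scripts are not reproduced in this mail; a rough sketch of what the
dd1, dd1-dir and dd3 cases look like (paths are illustrative, and dd's
oflag=direct is an assumption - the actual runs may have used a different
O_DIRECT wrapper):

# dd1: single 16 GB buffered writer; dd1-dir: the same write with O_DIRECT
dd if=/dev/zero of=/scratch/big bs=1M count=16000
dd if=/dev/zero of=/scratch/big bs=1M count=16000 oflag=direct

# dd3: three parallel 5.2 GB writers, followed by a final sync
for i in 1 2 3; do
    dd if=/dev/zero of=/scratch/par$i bs=1M count=5200 &
done
wait
sync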

 I did the numbers for 2.6.19.2, 2.6.22.6 and 2.6.24-rc1. All units are MB/sec.

test      2.6.19.2   2.6.22.6   2.6.24-rc1
dd1       28         50         96
dd1-dir   88         88         86
dd2       2x16.5     2x11       2x44.5
dd2-dir   2x44       2x44       2x43
dd3       3x9.8      3x8.7      3x30
dd3-dir   3x29.5     3x29.5     3x28.5
net1      30-33      50-55      37-52
mix3      17/32      25/50      96/35  (disk/combined-network)


 Some observations:

- single threaded disk speed really went up with 2.6.24-rc1. It is now even 
better than O_DIRECT
- O_DIRECT took a slight hit compared to the older kernels. Not an issue for 
me, but maybe others care
- multi threaded non-O_DIRECT scales for the first time ever! Almost no 
loss compared to single threaded!
- network throughput took a hit compared to 2.6.22.6 and is not as repeatable. 
Still better than 2.6.19.2 though

 What actually surprises me most is the big performance win on the single 
threaded non O_DIRECT dd test. I did not expect that :-) What I had hoped for 
was of course the scalability.

 So, this looks great and most likely I will push 2.6.24 (maybe .X) into my 
environment.

Happy weekend
Martin

------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/5] sluggish writeback fixes

2007-10-03 Thread Martin Knoblauch

--- Fengguang Wu <[EMAIL PROTECTED]> wrote:

> Andrew,
> 
> The following patches fix the sluggish writeback behavior.
> They are well understood and well tested - but not yet widely tested.
> 
> The first patch reverts the debugging -mm only
> check_dirty_inode_list.patch -
> which is no longer necessary.
> 
> The following 4 patches do the real jobs:
> 
> [PATCH 2/5] writeback: fix time ordering of the per superblock inode
> lists 8
> [PATCH 3/5] writeback: fix ntfs with sb_has_dirty_inodes()
> [PATCH 4/5] writeback: remove pages_skipped accounting in
> __block_write_full_page()
> [PATCH 5/5] writeback: introduce writeback_control.more_io to
> indicate more io
> 
> They share the same goal as the following patches in -mm. Therefore
> I'd
> recommend to put the last 4 new ones after them:
> 
>
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists.patch
>
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-2.patch
>
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-3.patch
>
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-4.patch
> writeback-fix-comment-use-helper-function.patch
>
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-5.patch
>
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-6.patch
>
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-7.patch
> writeback-fix-periodic-superblock-dirty-inode-flushing.patch
> 
> Regards,
> Fengguang
Hi Fengguang,

 now that Peter's stuff seems to be making it into mainline, do you think
your fixes should go in as well? It would definitely help to broaden the
tester base, by at least one very interested tester :-)

Keep up the good work
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-03 Thread Martin Knoblauch

--- Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> On Mon, 2007-10-01 at 14:22 -0700, Andrew Morton wrote:
> 
> > nfs-remove-congestion_end.patch
> > lib-percpu_counter_add.patch
> > lib-percpu_counter_sub.patch
> > lib-percpu_counter-variable-batch.patch
> > lib-make-percpu_counter_add-take-s64.patch
> > lib-percpu_counter_set.patch
> > lib-percpu_counter_sum_positive.patch
> > lib-percpu_count_sum.patch
> > lib-percpu_counter_init-error-handling.patch
> > lib-percpu_counter_init_irq.patch
> > mm-bdi-init-hooks.patch
> > mm-scalable-bdi-statistics-counters.patch
> > mm-count-reclaimable-pages-per-bdi.patch
> > mm-count-writeback-pages-per-bdi.patch
> 
> This one:
> > mm-expose-bdi-statistics-in-sysfs.patch
> 
> > lib-floating-proportions.patch
> > mm-per-device-dirty-threshold.patch
> > mm-per-device-dirty-threshold-warning-fix.patch
> > mm-per-device-dirty-threshold-fix.patch
> > mm-dirty-balancing-for-tasks.patch
> > mm-dirty-balancing-for-tasks-warning-fix.patch
> 
> And, this one:
> > debug-sysfs-files-for-the-current-ratio-size-total.patch
> 
> 
> I'm not sure polluting /sys/block/<foo>/queue/ like that is The Right
> Thing. These patches sure were handy when debugging this, but not
> sure
> they want to move to mainline.
> 
> Maybe we want /sys/bdi/<foo>/ or maybe /debug/bdi/<foo>/
> 
> Opinions?
> 
Hi Peter,

 my only opinion is that it is great to see that stuff moving into
mainline. If it really goes in, there will be one more very interested
rc-tester :-)

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: huge improvement with per-device dirty throttling

2007-09-06 Thread Martin Knoblauch

--- Martin Knoblauch <[EMAIL PROTECTED]> wrote:

> 
> --- Leroy van Logchem <[EMAIL PROTECTED]> wrote:
> 
> > Andrea Arcangeli wrote:
> > > On Wed, Aug 22, 2007 at 01:05:13PM +0200, Andi Kleen wrote:
> > >> Ok perhaps the new adaptive dirty limits helps your single disk
> > >> a lot too. But your improvements seem to be more "collateral
> > damage" @)
> > >>
> > >> But if that was true it might be enough to just change the dirty
> > limits
> > >> to get the same effect on your system. You might want to play
> with
> > >> /proc/sys/vm/dirty_*
> > > 
> > > The adaptive dirty limit is per task so it can't be reproduced
> with
> > > global sysctl. It made quite some difference when I researched
> into
> > it
> > > in function of time. This isn't in function of time but it
> > certainly
> > > makes a lot of difference too, actually it's the most important
> > part
> > > of the patchset for most people, the rest is for the corner cases
> > that
> > > aren't handled right currently (writing to a slow device with
> > > writeback cache has always been hanging the whole thing).
> > 
> > 
> > Self-tuning > static sysctl's. The last years we needed to use very
> 
> > small values for dirty_ratio and dirty_background_ratio to soften
> the
> > 
> > latency problems we have during sustained writes. Imo these patches
> 
> > really help in many cases, please commit to mainline.
> > 
> > -- 
> > Leroy
> > 
> 
>  while it helps in some situations, I did some tests today with
> 2.6.22.6+bdi-v9 (Peter was so kind) which seem to indicate that it
> hurts NFS writes. Anyone seen similar effects?
> 
>  Otherwise I would just second your request. It definitely helps the
> problematic performance of my CCISS based RAID5 volume.
> 

 please disregard my comment about NFS write performance. What I have
seen is caused by some other stuff I am toying with.

 So, I second your request to push this forward.

Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: huge improvement with per-device dirty throttling

2007-09-05 Thread Martin Knoblauch

--- Andrea Arcangeli <[EMAIL PROTECTED]> wrote:

> On Wed, Aug 22, 2007 at 01:05:13PM +0200, Andi Kleen wrote:
> > Ok perhaps the new adaptive dirty limits helps your single disk
> > a lot too. But your improvements seem to be more "collateral
> damage" @)
> > 
> > But if that was true it might be enough to just change the dirty
> limits
> > to get the same effect on your system. You might want to play with
> > /proc/sys/vm/dirty_*
> 
> The adaptive dirty limit is per task so it can't be reproduced with
> global sysctl. It made quite some difference when I researched into
> it
> in function of time. This isn't in function of time but it certainly
> makes a lot of difference too, actually it's the most important part
> of the patchset for most people, the rest is for the corner cases
> that

> aren't handled right currently (writing to a slow device with
> writeback cache has always been hanging the whole thing).

 I didn't see that remark before. I just realized that "slow device with
writeback cache" pretty well describes the CCISS controller in the
DL380g4. Could you elaborate on why that is a problematic case?

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: huge improvement with per-device dirty throttling

2007-09-04 Thread Martin Knoblauch

--- Leroy van Logchem <[EMAIL PROTECTED]> wrote:

> Andrea Arcangeli wrote:
> > On Wed, Aug 22, 2007 at 01:05:13PM +0200, Andi Kleen wrote:
> >> Ok perhaps the new adaptive dirty limits helps your single disk
> >> a lot too. But your improvements seem to be more "collateral
> damage" @)
> >>
> >> But if that was true it might be enough to just change the dirty
> limits
> >> to get the same effect on your system. You might want to play with
> >> /proc/sys/vm/dirty_*
> > 
> > The adaptive dirty limit is per task so it can't be reproduced with
> > global sysctl. It made quite some difference when I researched into
> it
> > in function of time. This isn't in function of time but it
> certainly
> > makes a lot of difference too, actually it's the most important
> part
> > of the patchset for most people, the rest is for the corner cases
> that
> > aren't handled right currently (writing to a slow device with
> > writeback cache has always been hanging the whole thing).
> 
> 
> Self-tuning > static sysctl's. The last years we needed to use very 
> small values for dirty_ratio and dirty_background_ratio to soften the
> 
> latency problems we have during sustained writes. Imo these patches 
> really help in many cases, please commit to mainline.
> 
> -- 
> Leroy
> 

 while it helps in some situations, I did some tests today with
2.6.22.6+bdi-v9 (Peter was so kind) which seem to indicate that it
hurts NFS writes. Has anyone seen similar effects?

 Otherwise I would just second your request. It definitely helps the
problematic performance of my CCISS-based RAID5 volume.

Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RFC: [PATCH] Small patch on top of per device dirty throttling -v9

2007-09-03 Thread Martin Knoblauch

--- Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> On Thu, 2007-08-23 at 08:59 -0700, Martin Knoblauch wrote:
> > --- Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> > 
> > > On Thu, 2007-08-16 at 05:49 -0700, Martin Knoblauch wrote:
> > > 
> > > > Peter,
> > > > 
> > > >  any chance to get a rollup against 2.6.22-stable?
> > > > 
> > > >  The 2.6.23 series may not be usable for me due to the
> > > > nosharedcache changes for NFS (the new default will massively
> > > > disturb the user-space automounter).
> > > 
> > > I'll see what I can do, bit busy with other stuff atm, hopefully
> > > after
> > > the weekend.
> > > 
> > Hi Peter,
> > 
> >  any progress on a version against 2.6.22.5? I have seen the very
> > positive report from Jeffrey W. Baker and would really love to test
> > your patch. But as I said, anything newer than 2.6.22.x might not
> be an
> > option due to the NFS changes.
> 
> mindless port, seems to compile and boot on my test box ymmv.
> 
Hi Peter,

 while doing my tests I observed that setting dirty_ratio below 5% did
not make a difference at all. Just by chance I found that this
apparently is a limit enforced in mm/page-writeback.c.

 With the patch below I have lowered the limit to 2%. With that, things
look a lot better on my systems. Load during writes stays below 1.5 for
one writer, and responsiveness is good.

This may even help without the throttling patch. I am not sure that this
is the right thing to do, but it helps :-)
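
 For reference, the ratios can be changed at runtime via /proc; a minimal
sketch (values purely illustrative). Note that on an unpatched kernel the
clamp addressed by the patch below means that any dirty_ratio setting under 5
has no additional effect, which is exactly the behaviour I observed:

echo 2 > /proc/sys/vm/dirty_ratio
echo 1 > /proc/sys/vm/dirty_background_ratio
cat /proc/sys/vm/dirty_ratio /proc/sys/vm/dirty_background_ratio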

Cheers
Martin

--- linux-2.6.22.5-bdi-v9/mm/page-writeback.c
+++ linux-2.6.22.6+bdi-v9/mm/page-writeback.c
@@ -311,8 +311,11 @@
if (dirty_ratio > unmapped_ratio / 2)
dirty_ratio = unmapped_ratio / 2;

-   if (dirty_ratio < 5)
-   dirty_ratio = 5;
+/*
+** MKN: Lower enforced limit from 5% to 2%
+*/
+   if (dirty_ratio < 2)
+   dirty_ratio = 2;

background_ratio = dirty_background_ratio;
if (background_ratio >= dirty_ratio)


--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: recent nfs change causes autofs regression

2007-09-03 Thread Martin Knoblauch

--- Jakob Oestergaard <[EMAIL PROTECTED]> wrote:

> On Fri, Aug 31, 2007 at 09:43:29AM -0700, Linus Torvalds wrote:
> ...
> > This is *not* a security hole. In order to make it a security hole,
> you 
> > need to be root in the first place.
> 
> Non-root users can write to places where root might believe they
> cannot write
> because he might be under the mistaken assumption that ro means ro.
> 
> I am under the impression that that could have implications in some
> setups.
>

 That was never in question.
 
> ...
> > 
> >  - it's a misfeature that people are used to, and has been around
> forever.
> 
> Sure, they're used it it, but I doubt they are aware of it.
>

 So, the right thing to do (tm) is to make them aware without breaking
their setup.

 Log any detected inconsistencies in the dmesg buffer and to syslog. If
the sysadmin is not competent enough to notice, too bad.
 
Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: recent nfs change causes autofs regression

2007-08-31 Thread Martin Knoblauch

--- Ian Kent <[EMAIL PROTECTED]> wrote:

> On Thu, 30 Aug 2007, Linus Torvalds wrote:
> > 
> > 
> > On Fri, 31 Aug 2007, Trond Myklebust wrote:
> > > 
> > > It did not. The previous behaviour was to always silently
> override the
> > > user mount options.
> > 
> > ..so it still worked for any sane setup, at least.
> > 
> > You broke that. Hua gave good reasons for why he cannot use the
> current 
> > kernel. It's a regression.
> > 
> > In other words, the new behaviour is *worse* than the behaviour you
> 
> > consider to be the incorrect one.
> > 
> 
> This all came about due to complains about not being able to mount
> the 
> same server file system with different options, most commonly ro vs.
> rw 
> which I think was due to the shared super block changes some time
> ago. 
> And, to some extent, I have to plead guilty for not complaining
> enough 
> about this default in the beginning, which is basically unacceptable
> for 
> sure.
> 
> We have seen breakage in Fedora with the introduction of the patches
> and 
> this is typical of it. It also breaks amd and admins have no way of 
> altering this that I'm aware of (help us here Ion).
> 
> I understand Tronds concerns but the fact remains that other Unixs
> allow 
> this behaviour but don't assert cache coherancy and many sysadmin
> don't 
> realize this. So the broken behavior is expected to work and we can't
> 
> simply stop allowing it unless we want to attend a public hanging
> with us 
> as the paticipants.
> 
> There is no question that the new behavior is worse and this change
> is 
> unacceptable as a solution to the original problem.
> 
> I really think that reversing the default, as has been suggested, 
> documenting the risk in the mount.nfs man page and perhaps issuing a 
> warning from the kernel is a better way to handle this. At least we
> will 
> be doing more to raise public awareness of the issue than others.
> 

 I can only second that. Changing the default behavior in this way is
really bad.

 Not that I am disagreeing with the technical reasons, but the change
breaks working setups. And -EBUSY is not very helpful as a message
here. It does not matter that the user tools may handle the breakage
incorrectly. The users (admins) had working setups for years, and they
were obviously working "good enough".

 And one should not forget that it will take considerable time until
"nosharecache" trickles down into distributions.

 If the situation stays this way, quite a few people will not be able
to move beyond 2.6.22 for some time. E.g. I am working for a
company that operates some Linux "clusters" at a few German automotive
companies. For certain reasons everything there is based on
automounter maps (both autofs and amd style). We have almost zero
influence on that setup. The maps are a mess - we will run into the
sharecache problem. At the same time I am trying to fight the notorious
"system turns into frozen molasses on moderate I/O load" problem. There may be
some interesting developments coming after 2.6.22. Not good :-(

 What I would like to see done for the situation at hand is:

- make "nosharecache" the default for the foreseeable future (see the sketch
after this list for what the option allows)
- log any attempt to mount option-inconsistent NFS filesystems to dmesg
and syslog (apparently the NFS client is able to detect them :-). Do
this regardless of the "nosharecache" option. This way admins will at
least be made aware of the situation.
- In a year or so we can talk about making the default safe. With
proper advertising.
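
 For illustration, what "nosharecache" permits (server and export names are
made up): the same export mounted twice with different options, instead of the
second mount failing with -EBUSY:

mount -o ro,nosharecache server:/export /mnt/a
mount -o rw,nosharecache server:/export /mnt/b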

 Just my € 0.02.

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour - next try

2007-08-30 Thread Martin Knoblauch

--- Jens Axboe <[EMAIL PROTECTED]> wrote:

> 
> Try limiting the queue depth on the cciss device, some of those are
> notoriously bad at starving commands. Something like the below hack,
> see
> if it makes a difference (and please verify in dmesg that it prints
> the
> message about limiting depth!):
> 
> diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> index 084358a..257e1c3 100644
> --- a/drivers/block/cciss.c
> +++ b/drivers/block/cciss.c
> @@ -2992,7 +2992,12 @@ static int cciss_pci_init(ctlr_info_t *c,
> struct pci_dev *pdev)
>   if (board_id == products[i].board_id) {
>   c->product_name = products[i].product_name;
>   c->access = *(products[i].access);
> +#if 0
>   c->nr_cmds = products[i].nr_cmds;
> +#else
> + c->nr_cmds = 2;
> + printk("cciss: limited max commands to 2\n");
> +#endif
>   break;
>   }
>   }
> 
> -- 
> Jens Axboe
> 
> 
Hi Jens,

 how exactly is the queue depth related to the max # of commands? I
ask because with the 2.6.22 kernel the "maximum queue depth since
init" never seems to go higher than 16, even with many more
outstanding commands. On a 2.6.19 kernel, the maximum queue depth is much
higher, just a bit below "max # of commands since init".

[2.6.22]# cat /proc/driver/cciss/cciss0
cciss0: HP Smart Array 6i Controller
Board ID: 0x40910e11
Firmware Version: 2.76
IRQ: 51
Logical drives: 1
Max sectors: 2048
Current Q depth: 0
Current # commands on controller: 145
Max Q depth since init: 16
Max # commands on controller since init: 204
Max SG entries since init: 31
Sequential access devices: 0

[2.6.19] cat /proc/driver/cciss/cciss0
cciss0: HP Smart Array 6i Controller
Board ID: 0x40910e11
Firmware Version: 2.76
IRQ: 51
Logical drives: 1
Current Q depth: 0
Current # commands on controller: 0
Max Q depth since init: 197
Max # commands on controller since init: 198
Max SG entries since init: 31
Sequential access devices: 0

Cheers
Martin




--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour - next try

2007-08-30 Thread Martin Knoblauch

--- Robert Hancock <[EMAIL PROTECTED]> wrote:

> 
> I saw a bulletin from HP recently that sugggested disabling the 
> write-back cache on some Smart Array controllers as a workaround
> because 
> it reduced performance in applications that did large bulk writes. 
> Presumably they are planning on releasing some updated firmware that 
> fixes this eventually..
> 
> -- 
> Robert Hancock  Saskatoon, SK, Canada
> To email, remove "nospam" from [EMAIL PROTECTED]
> Home Page: http://www.roberthancock.com/
> 
Robert,

 just checked it out. At least with the "6i", you do not want to
disable the WBC :-) Performance really goes down the toilet for all
cases.

 Do you still have a pointer to that bulletin?

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: regression of autofs for current git?

2007-08-30 Thread Martin Knoblauch
On Wed, 2007-08-29 at 20:09 -0700, Ian Kent wrote:
>
>http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=75180df2ed467866ada839fe73cf7cc7d75c0a22
>
>This (and it's related patches) may be the problem.
>I can probably tell if you post your map or if you strace the
automount
>process managing the a problem mount point and look for mount
returning
>EBUSY when it should succeed.

 Likely. That is the one that will break the user-space automounter as
well (and keeps me from .23). I don't care very much about what the
default is, but it would be great if the new behaviour could be
globally changed at run- (or boot-) time. It will be some time until
the new mount option makes it into the distros.

Cheers
Martin
PS: Sorry, but I likely killed the CC list


------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour - next try

2007-08-29 Thread Martin Knoblauch

--- Chuck Ebbert <[EMAIL PROTECTED]> wrote:

> On 08/28/2007 11:53 AM, Martin Knoblauch wrote:
> > 
> >  The basic setup is a dual x86_64 box with 8 GB of memory. The
> DL380
> > has a HW RAID5, made from 4x72GB disks and about 100 MB write
> cache.
> > The performance of the block device with O_DIRECT is about 90
> MB/sec.
> > 
> >  The problematic behaviour comes when we are moving large files
> through
> > the system. The file usage in this case is mostly "use once" or
> > streaming. As soon as the amount of file data is larger than 7.5
> GB, we
> > see occasional unresponsiveness of the system (e.g. no more ssh
> > connections into the box) of more than 1 or 2 minutes (!) duration
> > (kernels up to 2.6.19). Load goes up, mainly due to pdflush threads
> and
> > some other poor guys being in "D" state.
> 
> Try booting with "mem=4096M", "mem=2048M", ...
> 
> 

 hmm. I tried 1024M a while ago and IIRC did not see a lot [any]
difference. But as it is no big deal, I will repeat it tomorrow.

 Just curious - what are you expecting? Why should it help?

Thanks
Martin


--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour - next try

2007-08-29 Thread Martin Knoblauch

--- Jens Axboe <[EMAIL PROTECTED]> wrote:

> On Tue, Aug 28 2007, Martin Knoblauch wrote:
> > Keywords: I/O, bdi-v9, cfs
> > 
> 
> Try limiting the queue depth on the cciss device, some of those are
> notoriously bad at starving commands. Something like the below hack,
> see
> if it makes a difference (and please verify in dmesg that it prints
> the
> message about limiting depth!):
> 
> diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> index 084358a..257e1c3 100644
> --- a/drivers/block/cciss.c
> +++ b/drivers/block/cciss.c
> @@ -2992,7 +2992,12 @@ static int cciss_pci_init(ctlr_info_t *c,
> struct pci_dev *pdev)
>   if (board_id == products[i].board_id) {
>   c->product_name = products[i].product_name;
>   c->access = *(products[i].access);
> +#if 0
>   c->nr_cmds = products[i].nr_cmds;
> +#else
> + c->nr_cmds = 2;
> + printk("cciss: limited max commands to 2\n");
> +#endif
>   break;
>   }
>   }
> 
> -- 
> Jens Axboe
> 
> 
>
Hi Jens,

 thanks for the suggestion. Unfortunately the non-direct [parallel]
writes to the device got considerably slower. I guess the "6i"
controller copes better with higher values.

 Can nr_cmds be changed at runtime? Maybe there is an optimal setting.

[   69.438851] SCSI subsystem initialized
[   69.442712] HP CISS Driver (v 3.6.14)
[   69.442871] ACPI: PCI Interrupt :04:03.0[A] -> GSI 51 (level,
low) -> IRQ 51
[   69.442899] cciss: limited max commands to 2 (Smart Array 6i)
[   69.482370] cciss0: <0x46> at PCI :04:03.0 IRQ 51 using DAC
[   69.494352]   blocks= 426759840 block_size= 512
[   69.498350]   heads=255, sectors=32, cylinders=52299
[   69.498352]
[   69.498509]   blocks= 426759840 block_size= 512
[   69.498602]   heads=255, sectors=32, cylinders=52299
[   69.498604]
[   69.498608]  cciss/c0d0: p1 p2

Cheers
Martin


--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour - next try

2007-08-29 Thread Martin Knoblauch

--- Fengguang Wu <[EMAIL PROTECTED]> wrote:

> On Wed, Aug 29, 2007 at 01:15:45AM -0700, Martin Knoblauch wrote:
> > 
> > --- Fengguang Wu <[EMAIL PROTECTED]> wrote:
> > 
> > > You are apparently running into the sluggish kupdate-style
> writeback
> > > problem with large files: huge amount of dirty pages are getting
> > > accumulated and flushed to the disk all at once when dirty
> background
> > > ratio is reached. The current -mm tree has some fixes for it, and
> > > there are some more in my tree. Martin, I'll send you the patch
> if
> > > you'd like to try it out.
> > >
> > Hi Fengguang,
> > 
> >  Yeah, that pretty much describes the situation we end up. Although
> > "sluggish" is much to friendly if we hit the situation :-)
> > 
> >  Yes, I am very interested  to check out your patch. I saw your
> > postings on LKML already and was already curious. Any chance you
> have
> > something agains 2.6.22-stable? I have reasons not to move to -23
> or
> > -mm.
> 
> Well, they are a dozen patches from various sources.  I managed to
> back-port them. It compiles and runs, however I cannot guarantee
> more...
>

 Thanks. I understand the limited scope of the warranty :-) I will give
it a spin today.
 
> > > >  Another thing I saw during my tests is that when writing to
> NFS,
> > > the
> > > > "dirty" or "nr_dirty" numbers are always 0. Is this a
> conceptual
> > > thing,
> > > > or a bug?
> > > 
> > > What are the nr_unstable numbers?
> > >
> > 
> >  Ahh. Yes, they go up to 80-90k pages. Comparable to the nr_dirty
> > numbers for the disk case. Good to know.
> > 
> >  For NFS, the nr_writeback numbers seem surprisingly high. They
> also go
> > to 80-90k (pages ?). In the disk case they rarely go over 12k.
> 
> Maybe the difference of throttling one single 'cp' and a dozen
> 'nfsd'?
>

 No "nfsd" running on that box. It is just a client.

Cheers
Martin
 

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour - next try

2007-08-29 Thread Martin Knoblauch

--- Fengguang Wu <[EMAIL PROTECTED]> wrote:

> On Tue, Aug 28, 2007 at 08:53:07AM -0700, Martin Knoblauch wrote:
> [...]
> >  The basic setup is a dual x86_64 box with 8 GB of memory. The
> DL380
> > has a HW RAID5, made from 4x72GB disks and about 100 MB write
> cache.
> > The performance of the block device with O_DIRECT is about 90
> MB/sec.
> > 
> >  The problematic behaviour comes when we are moving large files
> through
> > the system. The file usage in this case is mostly "use once" or
> > streaming. As soon as the amount of file data is larger than 7.5
> GB, we
> > see occasional unresponsiveness of the system (e.g. no more ssh
> > connections into the box) of more than 1 or 2 minutes (!) duration
> > (kernels up to 2.6.19). Load goes up, mainly due to pdflush threads
> and
> > some other poor guys being in "D" state.
> [...]
> >  Just by chance I found out that doing all I/O inc sync-mode does
> > prevent the load from going up. Of course, I/O throughput is not
> > stellar (but not much worse than the non-O_DIRECT case). But the
> > responsiveness seem OK. Maybe a solution, as this can be controlled
> via
> > mount (would be great for O_DIRECT :-).
> > 
> >  In general 2.6.22 seems to bee better that 2.6.19, but this is
> highly
> > subjective :-( I am using the following setting in /proc. They seem
> to
> > provide the smoothest responsiveness:
> > 
> > vm.dirty_background_ratio = 1
> > vm.dirty_ratio = 1
> > vm.swappiness = 1
> > vm.vfs_cache_pressure = 1
> 
> You are apparently running into the sluggish kupdate-style writeback
> problem with large files: huge amount of dirty pages are getting
> accumulated and flushed to the disk all at once when dirty background
> ratio is reached. The current -mm tree has some fixes for it, and
> there are some more in my tree. Martin, I'll send you the patch if
> you'd like to try it out.
>
Hi Fengguang,

 Yeah, that pretty much describes the situation we end up in. Although
"sluggish" is much too friendly if we hit the situation :-)

 Yes, I am very interested to check out your patch. I saw your
postings on LKML and was already curious. Any chance you have
something against 2.6.22-stable? I have reasons not to move to -23 or
-mm.

> >  Another thing I saw during my tests is that when writing to NFS,
> the
> > "dirty" or "nr_dirty" numbers are always 0. Is this a conceptual
> thing,
> > or a bug?
> 
> What are the nr_unstable numbers?
>

 Ahh. Yes, they go up to 80-90k pages. Comparable to the nr_dirty
numbers for the disk case. Good to know.

 For NFS, the nr_writeback numbers seem surprisingly high. They also go
to 80-90k (pages ?). In the disk case they rarely go over 12k.
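
 (For anyone reproducing this: the counters can be sampled from
/proc/vmstat while the copy runs; a rough sketch, with field names as
they appear on the 2.6.22 kernels used here:)

# print dirty/writeback/unstable page counts once per second
while sleep 1; do
    grep -E '^nr_(dirty|writeback|unstable)' /proc/vmstat
    echo ---
done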

Cheers
Martin
> Fengguang
> 
> 


--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Understanding I/O behaviour - next try

2007-08-28 Thread Martin Knoblauch
Keywords: I/O, bdi-v9, cfs

Hi,

 a while ago I asked a few questions about Linux I/O behaviour,
because I was (and still am) fighting some "misbehaviour" related to
heavy I/O.

 The basic setup is a dual x86_64 HP DL380 box with 8 GB of memory. It
has a HW RAID5, made from 4x72GB disks and about 100 MB of write
cache. The performance of the block device with O_DIRECT is about 90
MB/sec.

 The problematic behaviour comes when we are moving large files through
the system. The file usage in this case is mostly "use once" or
streaming. As soon as the amount of file data is larger than 7.5 GB, we
see occasional unresponsiveness of the system (e.g. no more ssh
connections into the box) of more than 1 or 2 minutes (!) duration
(kernels up to 2.6.19). Load goes up, mainly due to pdflush threads and
some other poor guys being in "D" state.

 The data flows in basically three modes. All of them are affected:

local-disk -> NFS
NFS -> local-disk
NFS -> NFS

 NFS is V3/TCP.

 So, I made a few experiments in the last few days, using three
different kernels: 2.6.22.5, 2.6.22.5+cfs20.4 and 2.6.22.5+bdi-v9.

 The first observation (independent of the kernel) is that we *should*
use O_DIRECT, at least for output to the local disk. Here we see about
90 MB/sec write performance. A simple "dd" using 1, 2 and 3 parallel
threads to the same block device (through an ext2 FS) gives:

O_DIRECT: 88 MB/s, 2x44, 3x29.5
non-O_DIRECT: 51 MB/s, 2x19, 3x12.5
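
 (For reference, the numbers come from plain "dd" runs along these
lines; a sketch, with a made-up test file on the ext2 FS and assuming
a GNU dd that understands oflag=direct:)

# buffered write through the page cache
dd if=/dev/zero of=/scratch/ddtest bs=1M count=4096
# same amount of data with O_DIRECT, for comparison
dd if=/dev/zero of=/scratch/ddtest bs=1M count=4096 oflag=direct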

- Observation 1a: IO schedulers are mostly equivalent, with CFQ
slightly worse than AS and DEADLINE
- Observation 1b: when using a 2.6.22.5+cfs20.4, the non-O_DIRECT
performance goes [slightly] down. With three threads it is 3x10 MB/s.
Ingo?
- Observation 1c: bdi-v9 does not help in this case, which is not
surprising.

 The real question here is why the non-O_DIRECT case is so slow. Is
this a general thing? Is this related to the CCISS controller? Using
O_DIRECT is unfortunately not an option for us.

 When using three different targets (local disk plus two different NFS
Filesystems) bdi-v9 is a big winner. Without it, all threads are [seem
to be] limited to the speed of the slowest FS. With bdi-v9 we see a
considerable speedup.

 Just by chance I found out that doing all I/O in sync mode does
prevent the load from going up. Of course, I/O throughput is not
stellar (but not much worse than the non-O_DIRECT case). But the
responsiveness seems OK. Maybe this is a solution, as it can be
controlled via mount (would be great for O_DIRECT :-).
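
 (What "sync via mount" means here, as a minimal sketch with a made-up
/data mount point on the cciss device:)

# switch the target filesystem to synchronous writes for the bulk copy
mount -o remount,sync /data
# and back to normal afterwards
mount -o remount,async /data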

 In general 2.6.22 seems to be better than 2.6.19, but this is highly
subjective :-( I am using the following settings in /proc (applied as
sketched below). They seem to provide the smoothest responsiveness:

vm.dirty_background_ratio = 1
vm.dirty_ratio = 1
vm.swappiness = 1
vm.vfs_cache_pressure = 1
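
 (Applied at runtime roughly like this; the same values can of course
go into /etc/sysctl.conf:)

# tighten writeback and cache behaviour for the streaming workload
echo 1 > /proc/sys/vm/dirty_background_ratio
echo 1 > /proc/sys/vm/dirty_ratio
echo 1 > /proc/sys/vm/swappiness
echo 1 > /proc/sys/vm/vfs_cache_pressure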

 Another thing I saw during my tests is that when writing to NFS, the
"dirty" or "nr_dirty" numbers are always 0. Is this a conceptual thing,
or a bug?

 In any case, view this as a report for one specific load case that
does not behave very well. It seems there are ways to make things
better (sync, per-device throttling, ...), but nothing "perfect" yet.
"Use once" does seem to be a problem.

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 00/23] per device dirty throttling -v9

2007-08-24 Thread Martin Knoblauch

--- Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> On Thu, 2007-08-23 at 08:59 -0700, Martin Knoblauch wrote:
> > --- Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> > 
> > > On Thu, 2007-08-16 at 05:49 -0700, Martin Knoblauch wrote:
> > > 
> > > > Peter,
> > > > 
> > > >  any chance to get a rollup against 2.6.22-stable?
> > > > 
> > > >  The 2.6.23 series may not be usable for me due to the
> > > > nosharedcache changes for NFS (the new default will massively
> > > > disturb the user-space automounter).
> > > 
> > > I'll see what I can do, bit busy with other stuff atm, hopefully
> > > after
> > > the weekend.
> > > 
> > Hi Peter,
> > 
> >  any progress on a version against 2.6.22.5? I have seen the very
> > positive report from Jeffrey W. Baker and would really love to test
> > your patch. But as I said, anything newer than 2.6.22.x might not
> be an
> > option due to the NFS changes.
> 
> mindless port, seems to compile and boot on my test box ymmv.
> 
> I think .5 should not present anything other than trivial rejects if
> anything. But I'm not keeping -stable in my git remotes so I can't
> say
> for sure.

Hi Peter,

 thanks a lot. It applies to 2.6.22.5 almost cleanly, with just one
8-line offset in readahead.c.
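
 (For the record, applying it went roughly like this; the patch file
name is made up here:)

cd linux-2.6.22.5
# check first, then apply; only readahead.c needed an 8-line offset
patch -p1 --dry-run < per_bdi_dirty_pages-2.6.22.patch
patch -p1 < per_bdi_dirty_pages-2.6.22.patch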

 I will report testing-results separately.

Thanks
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 00/23] per device dirty throttling -v9

2007-08-23 Thread Martin Knoblauch

--- Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> On Thu, 2007-08-16 at 05:49 -0700, Martin Knoblauch wrote:
> 
> > Peter,
> > 
> >  any chance to get a rollup against 2.6.22-stable?
> > 
> >  The 2.6.23 series may not be usable for me due to the
> > nosharedcache changes for NFS (the new default will massively
> > disturb the user-space automounter).
> 
> I'll see what I can do, bit busy with other stuff atm, hopefully
> after
> the weekend.
> 
Hi Peter,

 any progress on a version against 2.6.22.5? I have seen the very
positive report from Jeffrey W. Baker and would really love to test
your patch. But as I said, anything newer than 2.6.22.x might not be an
option due to the NFS changes.

Kind regards
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 00/23] per device dirty throttling -v9

2007-08-16 Thread Martin Knoblauch

--- Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> On Thu, 2007-08-16 at 05:49 -0700, Martin Knoblauch wrote:
> 
> > Peter,
> > 
> >  any chance to get a rollup against 2.6.22-stable?
> > 
> >  The 2.6.23 series may not be usable for me due to the
> > nosharedcache changes for NFS (the new default will massively
> > disturb the user-space automounter).
> 
> I'll see what I can do, bit busy with other stuff atm, hopefully
> after the weekend.
> 
Hi Peter,

 that would be highly appreciated. Thanks a lot in advance.

Martin


--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 00/23] per device dirty throttling -v9

2007-08-16 Thread Martin Knoblauch
>Per device dirty throttling patches
>
>These patches aim to improve balance_dirty_pages() and directly
>address three issues:
>1) inter device starvation
>2) stacked device deadlocks
>3) inter process starvation
>
>1 and 2 are a direct result from removing the global dirty
>limit and using per device dirty limits. By giving each device
>its own dirty limit is will no longer starve another device,
>and the cyclic dependancy on the dirty limit is broken.
>
>In order to efficiently distribute the dirty limit across
>the independant devices a floating proportion is used, this
>will allocate a share of the total limit proportional to the
>device's recent activity.
>
>3 is done by also scaling the dirty limit proportional to the
>current task's recent dirty rate.
>
>Changes since -v8:
>- cleanup of the proportion code
>- fix percpu_counter_add(, -(unsigned long))
>- fix per task dirty rate code
>- fwd port to .23-rc2-mm2

Peter,

 any chance to get a rollup against 2.6.22-stable?

 The 2.6.23 series may not be usable for me due to the
nosharedcache changes for NFS (the new default will massively
disturb the user-space automounter).

Cheers
Martin 


--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 00/17] per device dirty throttling -v7

2007-07-18 Thread Martin Knoblauch
Miklos Szeredi wrote:

>> Latest version of the per bdi dirty throttling patches.
>>
>> Most of the changes since last time are little cleanups and more
>> detail in the split out of the floating proportion into their
>> own little lib.
>>
>> Patches are against 2.6.22-rc4-mm2
>>
>> A rollup of all this against 2.6.21 is available here:
>>
http://programming.kicks-ass.net/kernel-patches/balance_dirty_pages/2.6.21-per_bdi_dirty_pages.patch
>>
>> This patch-set passes the starve an USB stick test..
>
>I've done some testing of several problem cases.

 just curious - what are the plans towards inclusion in mainline?

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour

2007-07-09 Thread Martin Knoblauch

--- Jesper Juhl <[EMAIL PROTECTED]> wrote:

> On 05/07/07, Jesper Juhl <[EMAIL PROTECTED]> wrote:
> > On 05/07/07, Martin Knoblauch <[EMAIL PROTECTED]> wrote:
> > > Hi,
> > >
> >
> > I'd suspect you can't get both at 100%.
> >
> > I'd guess you are probably using a 100Hz no-preempt kernel.  Have
> you
> > tried a 1000Hz + preempt kernel?   Sure, you'll get a bit lower
> > overall throughput, but interactive responsiveness should be better
> -
> > if it is, then you could experiment with various combinations of
> > CONFIG_PREEMPT, CONFIG_PREEMPT_VOLUNTARY, CONFIG_PREEMPT_NONE and
> > CONFIG_HZ_1000, CONFIG_HZ_300, CONFIG_HZ_250, CONFIG_HZ_100 to see
> > what gives you the best balance between throughput and interactive
> > responsiveness (you could also throw CONFIG_PREEMPT_BKL and/or
> > CONFIG_NO_HZ, but I don't think the impact will be as significant
> as
> > with the other options, so to keep things simple I'd leave those
> out
> > at first) .
> >
> > I'd guess that something like CONFIG_PREEMPT_VOLUNTARY +
> CONFIG_HZ_300
> > would probably be a good compromise for you, but just to see if
> > there's any effect at all, start out with CONFIG_PREEMPT +
> > CONFIG_HZ_1000.
> >
> 
> I'm currious, did you ever try playing around with CONFIG_PREEMPT*
> and
> CONFIG_HZ* to see if that had any noticable impact on interactive
> performance and stuff like logging into the box via ssh etc...?
> 
> -- 
> Jesper Juhl <[EMAIL PROTECTED]>
> Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
> Plain text mails only, please  http://www.expita.com/nomime.html
> 
> 
Hi Jesper,

 my initial kernel was [EMAIL PROTECTED] I have switched to 300HZ, but
have not observed much difference. The config is now:

config-2.6.22-rc7:# CONFIG_PREEMPT_NONE is not set
config-2.6.22-rc7:CONFIG_PREEMPT_VOLUNTARY=y
config-2.6.22-rc7:# CONFIG_PREEMPT is not set
config-2.6.22-rc7:CONFIG_PREEMPT_BKL=y

Cheers


--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour

2007-07-06 Thread Martin Knoblauch

--- Daniel J Blueman <[EMAIL PROTECTED]> wrote:

> On 5 Jul, 16:50, Martin Knoblauch <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> >  for a customer we are operating a rackful of HP/DL380/G4 boxes
> that
> > have given us some problems with system responsiveness under [I/O
> > triggered] system load.
> [snip]
> 
> IIRC, the locking in the CCISS driver was pretty heavy until later in
> the 2.6 series (2.6.16?) kernels; I don't think they were backported
> to the 1000 or so patches that comprise RH EL 4 kernels.
> 
> With write performance being really poor on the Smartarray
> controllers
> without the battery-backed write cache, and with less-good locking,
> performance can really suck.
> 
> On a total quiescent hp DL380 G2 (dual PIII, 1.13GHz Tualatin 512KB
> L2$) running RH EL 5 (2.6.18) with a 32MB SmartArray 5i controller
> with 6x36GB 10K RPM SCSI disks and all latest firmware:
> 
> # dd if=/dev/cciss/c0d0p2 of=/dev/zero bs=1024k count=1000
> 509+1 records in
> 509+1 records out
> 534643200 bytes (535 MB) copied, 11.6336 seconds, 46.0 MB/s
> 
> # dd if=/dev/zero of=/dev/cciss/c0d0p2 bs=1024k count=100
> 100+0 records in
> 100+0 records out
> 104857600 bytes (105 MB) copied, 22.3091 seconds, 4.7 MB/s
> 
> Oh dear! There are internal performance problems with this
> controller.
> The SmartArray 5i in the newer DL380 G3 (dual P4 2.8GHz, 512KB L2$)
> is
> perhaps twice the read performance (PCI-X helps some) but still
> sucks.
> 
> I'd get the BBWC in or install another controller.
> 
Hi Daniel,

 thanks for the suggestion. The DL380g4 boxes have the "6i" and all
systems are equipped with the BBWC (192 MB, split 50/50).

 The thing is not really a speed demon, but sufficient for the task.

 The problem really seems to be related to the VM system not writing
out dirty pages early enough and then getting into trouble when the
pressure gets too high.

Cheers
Martin



--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour

2007-07-06 Thread Martin Knoblauch
Brice Figureau wrote:

>> CFQ gives less (about 10-15%) throughput except for the kernel
>> with the
>> cfs cpu scheduler, where CFQ is on par with the other IO
>> schedulers.
>>
>
>Please have a look to kernel bug #7372:
>http://bugzilla.kernel.org/show_bug.cgi?id=7372
>
>It seems I encountered the almost same issue.
>
>The fix on my side, beside running 2.6.17 (which was working fine
>for me) was to:
>1) have /proc/sys/vm/vfs_cache_pressure=1
>2) have /proc/sys/vm/dirty_ratio=1 and 
> /proc/sys/vm/dirty_background_ratio=1
>3) have /proc/sys/vm/swappiness=2
>4) run Peter Zijlstra: per dirty device throttling patch on the
> top of 2.6.21.5:
>http://www.ussg.iu.edu/hypermail/linux/kernel/0706.1/2776.html

Brice,

 is any one of them sufficient on its own, or are all of them needed
together? Just to avoid confusion.

Cheers
Martin


--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour

2007-07-06 Thread Martin Knoblauch
Martin Knoblauch wrote:
>--- Robert Hancock <[EMAIL PROTECTED]> wrote:
>
>>
>> Try playing with reducing /proc/sys/vm/dirty_ratio and see how that
>> helps. This workload will fill up memory with dirty data very
>> quickly,
>> and it seems like system responsiveness often goes down the toilet
>> when
>> this happens and the system is going crazy trying to write it all
>> out.
>>
>
>Definitely the "going crazy" part is the worst problem I see with 2.6
>based kernels (late 2.4 was really better in this corner case).
>
>I am just now playing with dirty_ratio. Anybody knows what the lower
>limit is? "0" seems acceptabel, but does it actually imply "write out
>immediatelly"?
>
>Another problem, the VM parameters are not really well documented in
>their behaviour and interdependence.

 Lowering dirty_ratio just leads to more imbalanced write speed for
the three dd's. Even when lowering the number to 0, the high load
stays.

 Now, in another experiment, I mounted the FS with "sync". And now the
load stays below/around 3. No more "pdflush" daemons going wild. And
the responsiveness is good, with no drops.

 My question now is: is there a parameter that one can use to force
immediate writeout for every process? This may hurt overall
performance of the system, but might really help my situation.
Setting dirty_ratio to 0 does not seem to do it.
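
 (What I can do per process, where I control the copy tool, is to ask
for synchronous writes from the tool itself; a sketch with GNU dd,
paths made up, assuming the oflag/conv flags are available in the
installed coreutils:)

# O_SYNC-style write of every block
dd if=/nfs/src/bigfile of=/data/bigfile bs=1M oflag=sync
# or: write through the cache, but flush before dd exits
dd if=/nfs/src/bigfile of=/data/bigfile bs=1M conv=fdatasync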

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour

2007-07-06 Thread Martin Knoblauch
>>b) any ideas how to optimize the settings of the /proc/sys/vm/
>>parameters? The documentation is a bit thin here.
>>
>>
>I cant offer any advice there, but is raid-5 really the best choice
>for your needs? I would not choose raid-5 for a system that is
>regularly performing lots of large writes at the same time, dont
>forget that each write can require several reads to recalculate the
>partity.
>
>Does the raid card have much cache ram?
>

 192 MB, split 50/50 between read and write.

>If you can afford to loose some space raid-10 would probably perform
>better.

 RAID5 most likely is not the best solution and I would not use it if
the described use-case was happening all the time. It happens a few
times a day and then things go down when all memory is filled with
page-cache.

 And the same also happens when copying large amounts of data from one
NFS-mounted FS to another NFS-mounted FS. No local disk is involved
there. Memory fills with page-cache until it reaches a ceiling and
then for some time responsiveness is really, really bad.

 I am just now playing with the dirty_* stuff. Maybe it helps.

Cheers
Martin



------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour

2007-07-06 Thread Martin Knoblauch

--- Robert Hancock <[EMAIL PROTECTED]> wrote:

> 
> Try playing with reducing /proc/sys/vm/dirty_ratio and see how that 
> helps. This workload will fill up memory with dirty data very
> quickly, 
> and it seems like system responsiveness often goes down the toilet
> when 
> this happens and the system is going crazy trying to write it all
> out.
> 

 Definitely the "going crazy" part is the worst problem I see with
2.6-based kernels (late 2.4 was really better in this corner case).

 I am just now playing with dirty_ratio. Does anybody know what the
lower limit is? "0" seems acceptable, but does it actually imply
"write out immediately"?

 Another problem: the VM parameters are not really well documented in
their behaviour and interdependence.

Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Understanding I/O behaviour

2007-07-06 Thread Martin Knoblauch

--- Jesper Juhl <[EMAIL PROTECTED]> wrote:

> On 06/07/07, Robert Hancock <[EMAIL PROTECTED]> wrote:
> [snip]
> >
> > Try playing with reducing /proc/sys/vm/dirty_ratio and see how that
> > helps. This workload will fill up memory with dirty data very
> quickly,
> > and it seems like system responsiveness often goes down the toilet
> when
> > this happens and the system is going crazy trying to write it all
> out.
> >
> 
> Perhaps trying out a different elevator would also be worthwhile.
> 

 AS seems to be the best one (NOOP and Deadline seem to be equally OK).
CFQ gives about 10-15% less throughput, except for the kernel with the
cfs cpu scheduler, where CFQ is on par with the other IO schedulers.
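
 (The schedulers were switched at runtime per device, roughly like
this; a sketch assuming the cciss/c0d0 device from the other mails:)

# show the available elevators, the current one is in brackets
cat /sys/block/cciss!c0d0/queue/scheduler
# switch to deadline (or anticipatory, noop, cfq) for the next run
echo deadline > /sys/block/cciss!c0d0/queue/scheduler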

Thanks
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

