Re: regression: 100% io-wait with 2.6.24-rcX
----- Original Message -----
> From: Alasdair G Kergon <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: Linus Torvalds <[EMAIL PROTECTED]>; Mel Gorman <[EMAIL PROTECTED]>;
>     Fengguang Wu <[EMAIL PROTECTED]>; Mike Snitzer <[EMAIL PROTECTED]>;
>     Peter Zijlstra <[EMAIL PROTECTED]>; [EMAIL PROTECTED];
>     Ingo Molnar <[EMAIL PROTECTED]>; linux-kernel@vger.kernel.org;
>     "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; [EMAIL PROTECTED];
>     Jens Axboe <[EMAIL PROTECTED]>; Milan Broz <[EMAIL PROTECTED]>;
>     Neil Brown <[EMAIL PROTECTED]>
> Sent: Wednesday, January 23, 2008 12:40:52 AM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
> On Tue, Jan 22, 2008 at 07:25:15AM -0800, Martin Knoblauch wrote:
> > [EMAIL PROTECTED] ~]# dmsetup table
> > VolGroup00-LogVol02: 0 350945280 linear 104:2 67109248
> > VolGroup00-LogVol01: 0 8388608 linear 104:2 418054528
> > VolGroup00-LogVol00: 0 67108864 linear 104:2 384
>
> The IO should pass straight through simple linear targets like that
> without needing to get broken up, so I wouldn't expect those patches
> to make any difference in this particular case.

Alasdair,

 LVM/DM are off the hook :-) I converted one box to use partitions
directly, and the performance is the same disappointment as with
LVM/DM. Thanks anyway for looking at my problem. I will now move the
discussion to a new thread targeting CCISS directly.

Cheers
Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: regression: 100% io-wait with 2.6.24-rcX
----- Original Message -----
> From: Alasdair G Kergon <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: Linus Torvalds <[EMAIL PROTECTED]>; Mel Gorman <[EMAIL PROTECTED]>;
>     Fengguang Wu <[EMAIL PROTECTED]>; Mike Snitzer <[EMAIL PROTECTED]>;
>     Peter Zijlstra <[EMAIL PROTECTED]>; [EMAIL PROTECTED];
>     Ingo Molnar <[EMAIL PROTECTED]>; linux-kernel@vger.kernel.org;
>     "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; [EMAIL PROTECTED];
>     Jens Axboe <[EMAIL PROTECTED]>; Milan Broz <[EMAIL PROTECTED]>;
>     Neil Brown <[EMAIL PROTECTED]>
> Sent: Tuesday, January 22, 2008 3:39:33 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
> See if these patches make any difference:
> http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/
>
>   dm-md-merge_bvec_fn-with-separate-bdev-and-sector.patch
>   dm-introduce-merge_bvec_fn.patch
>   dm-linear-add-merge.patch
>   dm-table-remove-merge_bvec-sector-restriction.patch

 Nope. Exactly the same poor results. To rule out LVM/DM I really have
to see what happens if I set up a system with filesystems directly on
partitions. That might take some time, though.

Cheers
Martin
Re: regression: 100% io-wait with 2.6.24-rcX
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de

----- Original Message -----
> From: Alasdair G Kergon <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: Linus Torvalds <[EMAIL PROTECTED]>; Mel Gorman <[EMAIL PROTECTED]>;
>     Fengguang Wu <[EMAIL PROTECTED]>; Mike Snitzer <[EMAIL PROTECTED]>;
>     Peter Zijlstra <[EMAIL PROTECTED]>; [EMAIL PROTECTED];
>     Ingo Molnar <[EMAIL PROTECTED]>; linux-kernel@vger.kernel.org;
>     "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; [EMAIL PROTECTED];
>     Jens Axboe <[EMAIL PROTECTED]>; Milan Broz <[EMAIL PROTECTED]>;
>     Neil Brown <[EMAIL PROTECTED]>
> Sent: Tuesday, January 22, 2008 3:39:33 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
> On Fri, Jan 18, 2008 at 11:01:11AM -0800, Martin Knoblauch wrote:
> > At least, rc1-rc5 have shown that the CCISS system can do well. Now
> > the question is which part of the system does not cope well with the
> > larger IO sizes? Is it the CCISS controller, LVM or both. I am open
> > to suggestions on how to debug that.
>
> What is your LVM device configuration?
> E.g. 'dmsetup table' and 'dmsetup info -c' output.
> Some configurations lead to large IOs getting split up on the way
> through device-mapper.

Hi Alasdair,

 here is the output; the filesystem in question is on LogVol02:

[EMAIL PROTECTED] ~]# dmsetup table
VolGroup00-LogVol02: 0 350945280 linear 104:2 67109248
VolGroup00-LogVol01: 0 8388608 linear 104:2 418054528
VolGroup00-LogVol00: 0 67108864 linear 104:2 384

[EMAIL PROTECTED] ~]# dmsetup info -c
Name                Maj Min Stat Open Targ Event  UUID
VolGroup00-LogVol02 253   1 L--w    1    1      0 LVM-IV4PeE8cdxA3piC1qk79GY9PE9OC4OgmOZ4OzOgGQIdF3qDx6fJmlZukXXLIy39R
VolGroup00-LogVol01 253   2 L--w    1    1      0 LVM-IV4PeE8cdxA3piC1qk79GY9PE9OC4Ogmfn2CcAd2Fh7i48twe8PZc2XK5bSOe1Fq
VolGroup00-LogVol00 253   0 L--w    1    1      0 LVM-IV4PeE8cdxA3piC1qk79GY9PE9OC4OgmfYjxQKFP3zw2fGsezJN7ypSrfmP7oSvE

> See if these patches make any difference:
> http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/
>
>   dm-md-merge_bvec_fn-with-separate-bdev-and-sector.patch
>   dm-introduce-merge_bvec_fn.patch
>   dm-linear-add-merge.patch
>   dm-table-remove-merge_bvec-sector-restriction.patch

 Thanks for the suggestion. Are they supposed to apply to mainline?

Cheers
Martin
Re: regression: 100% io-wait with 2.6.24-rcX
----- Original Message -----
> From: Mike Snitzer <[EMAIL PROTECTED]>
> To: Linus Torvalds <[EMAIL PROTECTED]>
> Cc: Mel Gorman <[EMAIL PROTECTED]>; Martin Knoblauch <[EMAIL PROTECTED]>;
>     Fengguang Wu <[EMAIL PROTECTED]>; Peter Zijlstra <[EMAIL PROTECTED]>;
>     [EMAIL PROTECTED]; Ingo Molnar <[EMAIL PROTECTED]>;
>     linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>;
>     [EMAIL PROTECTED]
> Sent: Friday, January 18, 2008 11:47:02 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
> I can fire up 2.6.24-rc8 in short order to see if things are vastly
> improved (as Martin seems to indicate that he is happy with AACRAID
> on 2.6.24-rc8). Although even Martin's AACRAID numbers from 2.6.19.2
> are still quite good (relative to mine). Martin, can you share any
> tuning you may have done to get AACRAID to where it is for you right
> now?

Mike,

 I have always been happy with the AACRAID box compared to the CCISS
system. Even with the "regression" in 2.6.24-rc1..rc5 it was more than
acceptable to me. For me the differences between 2.6.19 and 2.6.24-rc8
on the AACRAID setup are:

- an 11% (single stream) to 25% (dual/triple stream) regression in DIO.
  Something I do not care much about; I just measure it for reference.
+ the very nice behaviour when writing to different targets (mix3),
  which I attribute to Peter's per-bdi stuff.

 And until -rc6 I was extremely pleased with the cool speedup I saw on
my CCISS boxes. This would have been the next "production" kernel for
me. But let's discuss that under a separate topic; it has nothing to do
with the original wait-io issue.

 Oh, before I forget: there has been no tuning for the AACRAID. The
system is an IBM x3650 with built-in AACRAID and battery-backed write
cache. The disks are 6x142GB/15krpm in a RAID5 setup.

 I see one big difference between your and my tests: I do 1MB writes to
simulate the behaviour of the real applications, while yours seem to be
much smaller.

Cheers
Martin
Re: [PATCH] writeback: speed up writeback of big dirty files
----- Original Message -----
> From: Fengguang Wu <[EMAIL PROTECTED]>
> To: Linus Torvalds <[EMAIL PROTECTED]>
> Cc: Mike Snitzer <[EMAIL PROTECTED]>; Martin Knoblauch <[EMAIL PROTECTED]>;
>     Peter Zijlstra <[EMAIL PROTECTED]>; [EMAIL PROTECTED];
>     Ingo Molnar <[EMAIL PROTECTED]>; linux-kernel@vger.kernel.org;
>     "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> Sent: Thursday, January 17, 2008 6:28:18 AM
> Subject: [PATCH] writeback: speed up writeback of big dirty files
>
> On Jan 16, 2008 9:15 AM, Martin Knoblauch wrote:
> > Fengguang's latest writeback patch applies cleanly, builds, boots
> > on 2.6.24-rc8.
>
> Linus, if possible, I'd suggest this patch be merged for 2.6.24.
>
> It's a safer version of the reverted patch. It was tested on
> ext2/ext3/jfs/xfs/reiserfs and won't 100% iowait even without the
> other bug fixing patches.
>
> Fengguang
> ---
>
> writeback: speed up writeback of big dirty files
>
> After making dirty a 100M file, the normal behavior is to start the
> writeback for all data after 30s delays. But sometimes the following
> happens instead:
>
> - after 30s:  ~4M
> - after 5s:   ~4M
> - after 5s:   all remaining 92M
>
> Some analysis shows that the internal io dispatch queues go like this:
>
>            s_io       s_more_io
>            -------------------------
> 1)         100M,1K    0
> 2)         1K         96M
> 3)         0          96M
>
> 1) initial state with a 100M file and a 1K file
> 2) 4M written, nr_to_write <= 0, so write more
> 3) 1K written, nr_to_write > 0, no more writes (BUG)
>
> nr_to_write > 0 in (3) fools the upper layer to think that data have
> all been written out. The big dirty file is actually still sitting in
> s_more_io. We cannot simply splice s_more_io back to s_io as soon as
> s_io becomes empty, and let the loop in generic_sync_sb_inodes()
> continue: this may starve newly expired inodes in s_dirty. It is also
> not an option to draw inodes from both s_more_io and s_dirty, and let
> the loop go on: this might lead to livelocks, and might also starve
> other superblocks in sync time (well, kupdate may still starve some
> superblocks, that's another bug).
>
> We have to return when a full scan of s_io completes. So nr_to_write > 0
> does not necessarily mean that "all data are written". This patch
> introduces a flag writeback_control.more_io to indicate that more io
> should be done. With it the big dirty file no longer has to wait for
> the next kupdate invocation 5s later.
>
> In sync_sb_inodes() we only set more_io on super_blocks we actually
> visited. This avoids the interaction between two pdflush daemons.
>
> Also in __sync_single_inode() we don't blindly keep requeuing the io
> if the filesystem cannot progress. Failing to do so may lead to 100%
> iowait.
>
> Tested-by: Mike Snitzer
> Signed-off-by: Fengguang Wu
> ---
>  fs/fs-writeback.c         |   18 ++++++++++++++++--
>  include/linux/writeback.h |    1 +
>  mm/page-writeback.c       |    9 ++++++---
>  3 files changed, 23 insertions(+), 5 deletions(-)
>
> --- linux.orig/fs/fs-writeback.c
> +++ linux/fs/fs-writeback.c
> @@ -284,7 +284,17 @@ __sync_single_inode(struct inode *inode,
>  		 * soon as the queue becomes uncongested.
>  		 */
>  		inode->i_state |= I_DIRTY_PAGES;
> -		requeue_io(inode);
> +		if (wbc->nr_to_write <= 0) {
> +			/*
> +			 * slice used up: queue for next turn
> +			 */
> +			requeue_io(inode);
> +		} else {
> +			/*
> +			 * somehow blocked: retry later
> +			 */
> +			redirty_tail(inode);
> +		}
>  	} else {
>  		/*
>  		 * Otherwise fully redirty the inode so that
> @@ -479,8 +489,12 @@ sync_sb_inodes(struct super_block *sb, s
>  		iput(inode);
>  		cond_resched();
>  		spin_lock(&inode_lock);
> -		if (wbc->nr_to_write <= 0)
> +		if (wbc->nr_to_write <= 0) {
> +			wbc->more_io = 1;
>  			break;
> +		}
> +		if (!list_empty(&sb->s_more_io))
> +			wbc->more_io = 1;
>  	}
>  	return;		/* Leave any unwritten inodes on s_io */
>  }
> --- linux.orig/include/linux/writeback.h
> +++ linux/include/linux/writeback.h
> @@ -62,6 +62,7 @@ struct writeback_control {
>  	unsigned for_reclaim:1;		/* Invoked from the page allocator */
>  	unsigned for_writepages:1;	/* This is a writepages() call */
>  	unsigned range_cyclic:1;	/* range_start is cyclic */
> +	unsigned more_io:1;		/* more io to be dispatched */
>  };
>
> --- linux.orig/mm/page-writeback.c
> +++ linux/mm/page-writeback.c
> @@ -558,6 +558,7 @@ static void background_writeout(unsigned
>  		    global_page_state(NR_UNSTABLE_NFS) < background_thresh
>  		    && min_pages <= 0)
>  			break;
> +		wbc.more_io = 0;
>  		wbc.encountered_congestion = 0;
>  		wbc.nr_to_write
Re: regression: 100% io-wait with 2.6.24-rcX
--- Linus Torvalds <[EMAIL PROTECTED]> wrote:

> On Fri, 18 Jan 2008, Mel Gorman wrote:
> >
> > Right, and this is consistent with other complaints about the PFN
> > of the page mattering to some hardware.
>
> I don't think it's actually the PFN per se.
>
> I think it's simply that some controllers (quite probably affected by
> both driver and hardware limits) have some subtle interactions with
> the size of the IO commands.
>
> For example, let's say that you have a controller that has some limit
> X on the size of IO in flight (whether due to hardware or driver
> issues doesn't really matter) in addition to a limit on the size of
> the scatter-gather size. They all tend to have limits, and they
> differ.
>
> Now, the PFN doesn't matter per se, but the allocation pattern
> definitely matters for whether the IO's are physically contiguous,
> and thus matters for the size of the scatter-gather thing.
>
> Now, generally the rule-of-thumb is that you want big commands, so
> physical merging is good for you, but I could well imagine that the
> IO limits interact, and end up hurting each other. Let's say that a
> better allocation order allows for bigger contiguous physical areas,
> and thus fewer scatter-gather entries.
>
> What does that result in? The obvious answer is
>
>   "Better performance obviously, because the controller needs to do
>    fewer scatter-gather lookups, and the requests are bigger, because
>    there are fewer IO's that hit scatter-gather limits!"
>
> Agreed?
>
> Except maybe the *real* answer for some controllers ends up being
>
>   "Worse performance, because individual commands grow because they
>    don't hit the per-command limits, but now we hit the global
>    size-in-flight limits and have many fewer of these good commands
>    in flight. And while the commands are larger, it means that there
>    are fewer outstanding commands, which can mean that the disk
>    cannot schedule things as well, or makes high latency of command
>    generation by the controller much more visible because there
>    aren't enough concurrent requests queued up to hide it"
>
> Is this the reason? I have no idea. But somebody who knows the
> AACRAID hardware and driver limits might think about interactions
> like that. Sometimes you actually might want to have smaller
> individual commands if there is some other limit that means that it
> can be more advantageous to have many small requests over a few big
> ones.
>
> RAID might well make it worse. Maybe small requests work better
> because they are simpler to schedule because they only hit one disk
> (e.g. if you have simple striping)! So that's another reason why one
> *large* request may actually be slower than two requests half the
> size, even if it's against the "normal rule".
>
> And it may be that that AACRAID box takes a big hit on DIO exactly
> because DIO has been optimized almost purely for making one command
> as big as possible.
>
> Just a theory.
>
> 		Linus

Linus,

 just to make one thing clear: I am not so much concerned about the
performance of AACRAID. It is OK with or without Mel's patch, and it is
better with Mel's patch. The regression in DIO compared to 2.6.19.2 is
completely independent of Mel's stuff.

 What interests me much more is the behaviour of the CCISS+LVM based
system. Here I see a huge benefit from reverting Mel's patch. I dirtied
the system after reboot as Mel suggested (24 parallel kernel builds)
and repeated the tests. The dirtying did not make any difference. Here
are the results:

Test      -rc8      -rc8 without Mel's patch
dd1        57        94
dd1-dir    87        86
dd2       2x8.5     2x45
dd2-dir   2x43      2x43
dd3       3x7       3x30
dd3-dir   3x28.5    3x28.5
mix3      59,2x25   98,2x24

 The big IO size with Mel's patch really has a devastating effect on
the parallel writes: nowhere near the value one would expect, while the
numbers without Mel's patch are as perfect as in rc1-rc5. Too bad I did
not see this earlier; maybe we could have found a solution for .24.

 At least, rc1-rc5 have shown that the CCISS system can do well. Now
the question is which part of the system does not cope well with the
larger IO sizes: the CCISS controller, LVM, or both? I am open to
suggestions on how to debug that.

Cheers
Martin
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
Re: regression: 100% io-wait with 2.6.24-rcX
----- Original Message -----
> From: Mel Gorman <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: Fengguang Wu <[EMAIL PROTECTED]>; Mike Snitzer <[EMAIL PROTECTED]>;
>     Peter Zijlstra <[EMAIL PROTECTED]>; [EMAIL PROTECTED];
>     Ingo Molnar <[EMAIL PROTECTED]>; linux-kernel@vger.kernel.org;
>     "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>;
>     Linus Torvalds <[EMAIL PROTECTED]>; [EMAIL PROTECTED]
> Sent: Thursday, January 17, 2008 11:12:21 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
> On (17/01/08 13:50), Martin Knoblauch didst pronounce:
> >
> > The effect is definitely depending on the IO hardware. I performed
> > the same tests on a different box with an AACRAID controller, and
> > there things look different.
>
> I take it "different" also means it does not show this odd performance
> behaviour and is similar whether the patch is applied or not?

 Here are the numbers (MB/s) from the AACRAID box, after a fresh boot
(the third column is 2.6.24-rc6 with your patch, commit
81eabcbe0b991ddef5216f30ae91c4b226d54b6d, reverted):

Test      2.6.19.2   2.6.24-rc6   2.6.24-rc6-81eabcbe
dd1         325         350          290
dd1-dir     180         160          160
dd2        2x90        2x113        2x110
dd2-dir   2x120        2x92         2x93
dd3        3x54        3x70         3x70
dd3-dir    3x83        3x64         3x64
mix3      55,2x30     400,2x25     310,2x25

 What we are seeing here is that:

a) DIRECT IO takes a much bigger hit (2.6.19 vs. 2.6.24) on this IO
   system compared to the CCISS box
b) reverting your patch hurts single stream
c) dual/triple stream are not affected by your patch and are improved
   over 2.6.19
d) the mix3 performance is improved compared to 2.6.19
d1) reverting your patch hurts the local-disk part of mix3
e) the AACRAID setup is definitely faster than the CCISS.

 So, on this box your patch is definitely needed to get the pre-2.6.24
performance when writing a single big file.

 Actually, things on the CCISS box might be even more complicated. I
forgot the fact that on that box we have ext2/LVM/DM/hardware, while on
the AACRAID box we have ext2/hardware. Do you think that LVM/MD are
sensitive to the page order/coloring?

 Anyway: does your patch only address this performance issue, or are
there also data integrity concerns without it? I may consider reverting
the patch for my production environment. It really helps two thirds of
my boxes big time, while it does not hurt the other third that much :-)

> > I can certainly stress the box before doing the tests. Please
> > define "many" for the kernel compiles :-)
>
> With 8GiB of RAM, try making 24 copies of the kernel and compiling
> them all simultaneously. Running that for 20-30 minutes should be
> enough to randomise the freelists, affecting what color of page is
> used for the dd test.

 Ouch :-) OK, I will try that.

Martin
Re: regression: 100% io-wait with 2.6.24-rcX
--- Linus Torvalds [EMAIL PROTECTED] wrote: On Fri, 18 Jan 2008, Mel Gorman wrote: Right, and this is consistent with other complaints about the PFN of the page mattering to some hardware. I don't think it's actually the PFN per se. I think it's simply that some controllers (quite probably affected by both driver and hardware limits) have some subtle interactions with the size of the IO commands. For example, let's say that you have a controller that has some limit X on the size of IO in flight (whether due to hardware or driver issues doesn't really matter) in addition to a limit on the size of the scatter-gather size. They all tend to have limits, and they differ. Now, the PFN doesn't matter per se, but the allocation pattern definitely matters for whether the IO's are physically contiguous, and thus matters for the size of the scatter-gather thing. Now, generally the rule-of-thumb is that you want big commands, so physical merging is good for you, but I could well imagine that the IO limits interact, and end up hurting each other. Let's say that a better allocation order allows for bigger contiguous physical areas, and thus fewer scatter-gather entries. What does that result in? The obvious answer is Better performance obviously, because the controller needs to do fewer scatter-gather lookups, and the requests are bigger, because there are fewer IO's that hit scatter-gather limits! Agreed? Except maybe the *real* answer for some controllers end up being Worse performance, because individual commands grow because they don't hit the per-command limits, but now we hit the global size-in-flight limits and have many fewer of these good commands in flight. 
And while the commands are larger, it means that there are fewer outstanding commands, which can mean that the disk cannot schedule things as well, or makes high latency of command generation by the controller much more visible because there aren't enough concurrent requests queued up to hide it. Is this the reason? I have no idea. But somebody who knows the AACRAID hardware and driver limits might think about interactions like that. Sometimes you actually might want to have smaller individual commands, if there is some other limit that means that it can be more advantageous to have many small requests over a few big ones. RAID might well make it worse. Maybe small requests work better because they are simpler to schedule, because they only hit one disk (eg if you have simple striping)! So that's another reason why one *large* request may actually be slower than two requests half the size, even if it's against the normal rule. And it may be that that AACRAID box takes a big hit on DIO exactly because DIO has been optimized almost purely for making one command as big as possible. Just a theory. Linus

Just to make one thing clear - I am not so much concerned about the performance of AACRAID. It is OK with or without Mel's patch. It is better with Mel's patch. The regression in DIO compared to 2.6.19.2 is completely independent of Mel's stuff. What interests me much more is the behaviour of the CCISS+LVM based system. Here I see a huge benefit of reverting Mel's patch. I dirtied the system after reboot as Mel suggested (24 parallel kernel builds) and repeated the tests. The dirtying did not make any difference. Here are the results:

Test      -rc8      -rc8 without Mel's patch
dd1       57        94
dd1-dir   87        86
dd2       2x8.5     2x45
dd2-dir   2x43      2x43
dd3       3x7       3x30
dd3-dir   3x28.5    3x28.5
mix3      59,2x25   98,2x24

The big IO size with Mel's patch really has a devastating effect on the parallel writes. Nowhere near the values one would expect, while the numbers are perfect without Mel's patch, as in rc1-rc5.
Too bad I did not see this earlier. Maybe we could have found a solution for .24. At least, rc1-rc5 have shown that the CCISS system can do well. Now the question is which part of the system does not cope well with the larger IO sizes? Is it the CCISS controller, LVM, or both? I am open to suggestions on how to debug that. Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: regression: 100% io-wait with 2.6.24-rcX
- Original Message > From: Mike Snitzer <[EMAIL PROTECTED]> > To: Martin Knoblauch <[EMAIL PROTECTED]> > Cc: Fengguang Wu <[EMAIL PROTECTED]>; Peter Zijlstra <[EMAIL PROTECTED]>; > [EMAIL PROTECTED]; Ingo Molnar <[EMAIL PROTECTED]>; > linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; Linus > Torvalds <[EMAIL PROTECTED]> > Sent: Thursday, January 17, 2008 5:11:50 PM > Subject: Re: regression: 100% io-wait with 2.6.24-rcX > > > I've backported Peter's perbdi patchset to 2.6.22.x. I can share it > with anyone who might be interested. > > As expected, it has yielded 2.6.24-rcX level scaling. Given the test > result matrix you previously posted, 2.6.22.x+perbdi might give you > what you're looking for (sans improved writeback that 2.6.24 was > thought to be providing). That is, much improved scaling with better > O_DIRECT and network throughput. Just a thought... > > Unfortunately, my priorities (and computing resources) have shifted > and I won't be able to thoroughly test Fengguang's new writeback patch > on 2.6.24-rc8... whereby missing out on providing > justification/testing to others on _some_ improved writeback being > included in 2.6.24 final. > > Not to mention the window for writeback improvement is all but closed > considering the 2.6.24-rc8 announcement's 2.6.24 final release > timetable. > Mike, thanks for the offer, but the improved throughput is my #1 priority nowadays. And while the better scaling for different targets is nothing to frown upon, the much better scaling when writing to the same target would have been the big winner for me. Anyway, I located the "offending" commit. Let's see what the experts say. Cheers Martin
Re: regression: 100% io-wait with 2.6.24-rcX
- Original Message > From: Fengguang Wu <[EMAIL PROTECTED]> > To: Martin Knoblauch <[EMAIL PROTECTED]> > Cc: Mike Snitzer <[EMAIL PROTECTED]>; Peter Zijlstra <[EMAIL PROTECTED]>; > [EMAIL PROTECTED]; Ingo Molnar <[EMAIL PROTECTED]>; > linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; Linus > Torvalds <[EMAIL PROTECTED]> > Sent: Wednesday, January 16, 2008 1:00:04 PM > Subject: Re: regression: 100% io-wait with 2.6.24-rcX > > On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote: > > > For those interested in using your writeback improvements in > > > production sooner rather than later (primarily with ext3); what > > > recommendations do you have? Just heavily test our own 2.6.24 > + > your > > > evolving "close, but not ready for merge" -mm writeback patchset? > > > > > Hi Fengguang, Mike, > > > > I can add myself to Mikes question. It would be good to know > a > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has > been > showing quite nice improvement of the overall writeback situation and > it > would be sad to see this [partially] gone in 2.6.24-final. > Linus > apparently already has reverted "...2250b". I will definitely repeat my > tests > with -rc8. and report. > > Thank you, Martin. Can you help test this patch on 2.6.24-rc7? > Maybe we can push it to 2.6.24 after your testing. > Hi Fengguang, something really bad has happened between -rc3 and -rc6. Embarrassingly I did not catch that earlier :-( Compared to the numbers I posted in http://lkml.org/lkml/2007/10/26/208 , dd1 is now at 60 MB/sec (slight plus), while dd2/dd3 suck the same way as in pre 2.6.24. The only test that is still good is mix3, which I attribute to the per-BDI stuff. At the moment I am frantically trying to find when things went down. I did run -rc8 and rc8+yourpatch. No difference to what I see with -rc6. Sorry that I cannot provide any input to your patch. 
Depressed Martin
Re: regression: 100% io-wait with 2.6.24-rcX
- Original Message From: Martin Knoblauch [EMAIL PROTECTED] To: Fengguang Wu [EMAIL PROTECTED] Cc: Mike Snitzer [EMAIL PROTECTED]; Peter Zijlstra [EMAIL PROTECTED]; [EMAIL PROTECTED]; Ingo Molnar [EMAIL PROTECTED]; linux-kernel@vger.kernel.org; [EMAIL PROTECTED] [EMAIL PROTECTED]; Linus Torvalds [EMAIL PROTECTED] Sent: Thursday, January 17, 2008 2:52:58 PM Subject: Re: regression: 100% io-wait with 2.6.24-rcX - Original Message From: Fengguang Wu To: Martin Knoblauch Cc: Mike Snitzer ; Peter Zijlstra ; [EMAIL PROTECTED]; Ingo Molnar ; linux-kernel@vger.kernel.org; [EMAIL PROTECTED] ; Linus Torvalds Sent: Wednesday, January 16, 2008 1:00:04 PM Subject: Re: regression: 100% io-wait with 2.6.24-rcX On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote: For those interested in using your writeback improvements in production sooner rather than later (primarily with ext3); what recommendations do you have? Just heavily test our own 2.6.24 + your evolving close, but not ready for merge -mm writeback patchset? Hi Fengguang, Mike, I can add myself to Mikes question. It would be good to know a roadmap for the writeback changes. Testing 2.6.24-rcX so far has been showing quite nice improvement of the overall writeback situation and it would be sad to see this [partially] gone in 2.6.24-final. Linus apparently already has reverted ...2250b. I will definitely repeat my tests with -rc8. and report. Thank you, Martin. Can you help test this patch on 2.6.24-rc7? Maybe we can push it to 2.6.24 after your testing. Hi Fengguang, something really bad has happened between -rc3 and -rc6. Embarrassingly I did not catch that earlier :-( Compared to the numbers I posted in http://lkml.org/lkml/2007/10/26/208 , dd1 is now at 60 MB/sec (slight plus), while dd2/dd3 suck the same way as in pre 2.6.24. The only test that is still good is mix3, which I attribute to the per-BDI stuff. At the moment I am frantically trying to find when things went down. I did run -rc8 and rc8+yourpatch. 
No difference to what I see with -rc6. Sorry that I cannot provide any input to your patch. OK, the change happened between rc5 and rc6. Just following a gut feeling, I reverted

#commit 81eabcbe0b991ddef5216f30ae91c4b226d54b6d
#Author: Mel Gorman [EMAIL PROTECTED]
#Date: Mon Dec 17 16:20:05 2007 -0800
#
#mm: fix page allocation for larger I/O segments
#
#In some cases the IO subsystem is able to merge requests if the pages are
#adjacent in physical memory. This was achieved in the allocator by having
#expand() return pages in physically contiguous order in situations were a
#large buddy was split. However, list-based anti-fragmentation changed the
#order pages were returned in to avoid searching in buffered_rmqueue() for a
#page of the appropriate migrate type.
#
#This patch restores behaviour of rmqueue_bulk() preserving the physical
#order of pages returned by the allocator without incurring increased search
#costs for anti-fragmentation.
#
#Signed-off-by: Mel Gorman [EMAIL PROTECTED]
#Cc: James Bottomley [EMAIL PROTECTED]
#Cc: Jens Axboe [EMAIL PROTECTED]
#Cc: Mark Lord [EMAIL PROTECTED]
#Signed-off-by: Andrew Morton [EMAIL PROTECTED]
#Signed-off-by: Linus Torvalds [EMAIL PROTECTED]

diff -urN linux-2.6.24-rc5/mm/page_alloc.c linux-2.6.24-rc6/mm/page_alloc.c
--- linux-2.6.24-rc5/mm/page_alloc.c	2007-12-21 04:14:11.305633890 +
+++ linux-2.6.24-rc6/mm/page_alloc.c	2007-12-21 04:14:17.746985697 +
@@ -847,8 +847,19 @@
 		struct page *page = __rmqueue(zone, order, migratetype);
 		if (unlikely(page == NULL))
 			break;
+
+		/*
+		 * Split buddy pages returned by expand() are received here
+		 * in physical page order. The page is added to the callers and
+		 * list and the list head then moves forward. From the callers
+		 * perspective, the linked list is ordered by page number in
+		 * some conditions. This is useful for IO devices that can
+		 * merge IO requests if the physical pages are ordered
+		 * properly.
+		 */
 		list_add(&page->lru, list);
 		set_page_private(page, migratetype);
+		list = &page->lru;
 	}
 	spin_unlock(&zone->lock);
 	return i;

This has brought back the good results I observed and reported. I do not know what to make out of this. At least on the systems I care about (HP/DL380g4, dual CPUs, HT-enabled, 8 GB memory, SmartArray 6i controller with 4x72GB SCSI disks as RAID5 (battery-protected writeback cache enabled) and gigabit networking (tg3)), this optimisation is a disaster. On the other hand, it is not a regression against 2.6.22/23.
Re: regression: 100% io-wait with 2.6.24-rcX
- Original Message From: Mel Gorman [EMAIL PROTECTED] To: Martin Knoblauch [EMAIL PROTECTED] Cc: Fengguang Wu [EMAIL PROTECTED]; Mike Snitzer [EMAIL PROTECTED]; Peter Zijlstra [EMAIL PROTECTED]; [EMAIL PROTECTED]; Ingo Molnar [EMAIL PROTECTED]; linux-kernel@vger.kernel.org; [EMAIL PROTECTED] [EMAIL PROTECTED]; Linus Torvalds [EMAIL PROTECTED]; [EMAIL PROTECTED] Sent: Thursday, January 17, 2008 9:23:57 PM Subject: Re: regression: 100% io-wait with 2.6.24-rcX On (17/01/08 09:44), Martin Knoblauch didst pronounce: On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote: For those interested in using your writeback improvements in production sooner rather than later (primarily with ext3); what recommendations do you have? Just heavily test our own 2.6.24 evolving close, but not ready for merge -mm writeback patchset? I can add myself to Mikes question. It would be good to know a roadmap for the writeback changes. Testing 2.6.24-rcX so far has been showing quite nice improvement of the overall writeback situation and it would be sad to see this [partially] gone in 2.6.24-final. Linus apparently already has reverted ...2250b. I will definitely repeat my tests with -rc8. and report. Thank you, Martin. Can you help test this patch on 2.6.24-rc7? Maybe we can push it to 2.6.24 after your testing. Hi Fengguang, something really bad has happened between -rc3 and -rc6. Embarrassingly I did not catch that earlier :-( Compared to the numbers I posted in http://lkml.org/lkml/2007/10/26/208 , dd1 is now at 60 MB/sec (slight plus), while dd2/dd3 suck the same way as in pre 2.6.24. The only test that is still good is mix3, which I attribute to the per-BDI stuff. I suspect that the IO hardware you have is very sensitive to the color of the physical page. I wonder, do you boot the system cleanly and then run these tests? 
If so, it would be interesting to know what happens if you stress the system first (many kernel compiles for example, basically anything that would use a lot of memory in different ways for some time) to randomise the free lists a bit and then run your test. You'd need to run the test three times for 2.6.23, 2.6.24-rc8 and 2.6.24-rc8 with the patch you identified reverted. The effect definitely depends on the IO hardware. I performed the same tests on a different box with an AACRAID controller and there things look different. Basically the "offending" commit helps single-stream performance on that box, while dual/triple stream are not affected. So I suspect that the CCISS is just not behaving well. And yes, the tests are usually done on a freshly booted box. Of course, I repeat them a few times. On the CCISS box the numbers are very constant. On the AACRAID box they vary quite a bit. I can certainly stress the box before doing the tests. Please define "many" for the kernel compiles :-) OK, the change happened between rc5 and rc6. Just following a gut feeling, I reverted #commit 81eabcbe0b991ddef5216f30ae91c4b226d54b6d #Author: Mel Gorman #Date: Mon Dec 17 16:20:05 2007 -0800 # This has brought back the good results I observed and reported. I do not know what to make out of this. At least on the systems I care about (HP/DL380g4, dual CPUs, HT-enabled, 8 GB memory, SmartArray 6i controller with 4x72GB SCSI disks as RAID5 (battery-protected writeback cache enabled) and gigabit networking (tg3)), this optimisation is a disaster. That patch was not an optimisation, it was a regression fix against 2.6.23 and I don't believe reverting it is an option. Other IO hardware benefits from having the allocator supply pages in PFN order. I think this late in the 2.6.24 game we just should leave things as they are. But we should try to find a way to make CCISS faster, as it apparently can be faster.
Your controller would seem to suffer when presented with the same situation, but I don't know why that is. I've added James to the cc in case he has seen this sort of situation before. On the other hand, it is not a regression against 2.6.22/23. Those had bad IO scaling too. It would just be a shame to lose an apparently great performance win. Could you try running your tests again when the system has been stressed with some other workload first? Will do. Cheers Martin
Re: regression: 100% io-wait with 2.6.24-rcX
- Original Message > From: Fengguang Wu <[EMAIL PROTECTED]> > To: Martin Knoblauch <[EMAIL PROTECTED]> > Cc: Mike Snitzer <[EMAIL PROTECTED]>; Peter Zijlstra <[EMAIL PROTECTED]>; > [EMAIL PROTECTED]; Ingo Molnar <[EMAIL PROTECTED]>; > linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; Linus > Torvalds <[EMAIL PROTECTED]> > Sent: Wednesday, January 16, 2008 1:00:04 PM > Subject: Re: regression: 100% io-wait with 2.6.24-rcX > > On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote: > > > For those interested in using your writeback improvements in > > > production sooner rather than later (primarily with ext3); what > > > recommendations do you have? Just heavily test our own 2.6.24 > + > your > > > evolving "close, but not ready for merge" -mm writeback patchset? > > > > > Hi Fengguang, Mike, > > > > I can add myself to Mikes question. It would be good to know > a > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has > been > showing quite nice improvement of the overall writeback situation and > it > would be sad to see this [partially] gone in 2.6.24-final. > Linus > apparently already has reverted "...2250b". I will definitely repeat my > tests > with -rc8. and report. > > Thank you, Martin. Can you help test this patch on 2.6.24-rc7? > Maybe we can push it to 2.6.24 after your testing. > Will do tomorrow or friday. Actually a patch against -rc8 would be nicer for me, as I have not looked at -rc7 due to holidays and some of the reported problems with it. Cheers Martin > Fengguang > --- > fs/fs-writeback.c | 17 +++-- > include/linux/writeback.h |1 + > mm/page-writeback.c |9 ++--- > 3 files changed, 22 insertions(+), 5 deletions(-) > > --- linux.orig/fs/fs-writeback.c > +++ linux/fs/fs-writeback.c > @@ -284,7 +284,16 @@ __sync_single_inode(struct inode *inode, > * soon as the queue becomes uncongested. 
> 	 */
> 		inode->i_state |= I_DIRTY_PAGES;
> -		requeue_io(inode);
> +		if (wbc->nr_to_write <= 0)
> +			/*
> +			 * slice used up: queue for next turn
> +			 */
> +			requeue_io(inode);
> +		else
> +			/*
> +			 * somehow blocked: retry later
> +			 */
> +			redirty_tail(inode);
> 	} else {
> 		/*
> 		 * Otherwise fully redirty the inode so that
> @@ -479,8 +488,12 @@ sync_sb_inodes(struct super_block *sb, s
> 		iput(inode);
> 		cond_resched();
> 		spin_lock(&inode_lock);
> -		if (wbc->nr_to_write <= 0)
> +		if (wbc->nr_to_write <= 0) {
> +			wbc->more_io = 1;
> 			break;
> +		}
> +		if (!list_empty(&sb->s_more_io))
> +			wbc->more_io = 1;
> 	}
> 	return;		/* Leave any unwritten inodes on s_io */
> }
> --- linux.orig/include/linux/writeback.h
> +++ linux/include/linux/writeback.h
> @@ -62,6 +62,7 @@ struct writeback_control {
> 	unsigned for_reclaim:1;		/* Invoked from the page allocator */
> 	unsigned for_writepages:1;	/* This is a writepages() call */
> 	unsigned range_cyclic:1;	/* range_start is cyclic */
> +	unsigned more_io:1;		/* more io to be dispatched */
> };
>
> /*
> --- linux.orig/mm/page-writeback.c
> +++ linux/mm/page-writeback.c
> @@ -558,6 +558,7 @@ static void background_writeout(unsigned
> 			global_page_state(NR_UNSTABLE_NFS) < background_thresh
> 				&& min_pages <= 0)
> 			break;
> +		wbc.more_io = 0;
> 		wbc.encountered_congestion = 0;
> 		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
> 		wbc.pages_skipped = 0;
> @@ -565,8 +566,9 @@ static void background_writeout(unsigned
> 		min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
> 		if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
> 			/* Wrote less than expected */
> -			congestion_wait(WRITE, HZ/10);
> -			if (!wbc.encountered_congestion)
> +			if (wbc.encountered_congestion || wbc.more_io)
> +				congestion_wait(WRITE, HZ/10);
> +			else
> 				break;
> 		}
> 	}
> @@ -631,11 +633,12 @@ static void wb_kupdate(unsigned long arg
> 		global_page_state(NR_UNSTABLE_NFS)
Re: regression: 100% io-wait with 2.6.24-rcX
- Original Message -
> From: Mike Snitzer <[EMAIL PROTECTED]>
> To: Fengguang Wu <[EMAIL PROTECTED]>
> Cc: Peter Zijlstra <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; Ingo Molnar <[EMAIL PROTECTED]>; linux-kernel@vger.kernel.org; "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; Linus Torvalds <[EMAIL PROTECTED]>; Andrew Morton <[EMAIL PROTECTED]>
> Sent: Tuesday, January 15, 2008 10:13:22 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
> On Jan 14, 2008 7:50 AM, Fengguang Wu wrote:
> > On Mon, Jan 14, 2008 at 12:41:26PM +0100, Peter Zijlstra wrote:
> > > On Mon, 2008-01-14 at 12:30 +0100, Joerg Platte wrote:
> > > > Am Montag, 14. Januar 2008 schrieb Fengguang Wu:
> > > > > Joerg, this patch fixed the bug for me :-)
> > > >
> > > > Fengguang, congratulations, I can confirm that your patch fixed
> > > > the bug! With previous kernels the bug showed up after each
> > > > reboot. Now, when booting the patched kernel everything is fine
> > > > and there is no longer any suspicious iowait!
> > > >
> > > > Do you have an idea why this problem appeared in 2.6.24? Did
> > > > somebody change the ext2 code or is it related to the changes in
> > > > the scheduler?
> > >
> > > It was Fengguang who changed the inode writeback code, and I guess
> > > the new and improved code was less able to deal with these funny
> > > corner cases. But he has been very good in tracking them down and
> > > solving them, kudos to him for that work!
> >
> > Thank you.
> >
> > In particular the bug is triggered by the patch named:
> >     "writeback: introduce writeback_control.more_io to indicate more io"
> > That patch means to speed up writeback, but unfortunately its
> > aggressiveness has disclosed bugs in reiserfs, jfs and now ext2.
> >
> > Linus, given the number of bugs it triggered, I'd recommend reverting
> > this patch (git commit 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b).
> > Let's push it back to the -mm tree for more testing?
>
> Fengguang,
>
> I'd like to better understand where your writeback work stands
> relative to 2.6.24-rcX and -mm. To be clear, your changes in
> 2.6.24-rc7 have been benchmarked to provide a ~33% sequential write
> performance improvement with ext3 (as compared to 2.6.22; CFS could be
> helping, etc., but...). Very impressive!
>
> Given this improvement it is unfortunate to see your request to revert
> 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b, but it is understandable if
> you're not confident in it for 2.6.24.
>
> That said, you recently posted an -mm patchset that first reverts
> 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b and then goes on to address
> the "slow writes for concurrent large and small file writes" bug:
> http://lkml.org/lkml/2008/1/15/132
>
> For those interested in using your writeback improvements in
> production sooner rather than later (primarily with ext3); what
> recommendations do you have? Just heavily test our own 2.6.24 + your
> evolving "close, but not ready for merge" -mm writeback patchset?

Hi Fengguang, Mike,

I can add myself to Mike's question. It would be good to know a
"roadmap" for the writeback changes. Testing 2.6.24-rcX so far has been
showing quite nice improvement of the overall writeback situation and it
would be sad to see this [partially] gone in 2.6.24-final. Linus
apparently already has reverted "...2250b". I will definitely repeat my
tests with -rc8 and report.

Cheers
Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW related
- Original Message -
> From: Martin Knoblauch <[EMAIL PROTECTED]>
> To: Chris Snook <[EMAIL PROTECTED]>
> Cc: linux-kernel@vger.kernel.org; [EMAIL PROTECTED]; spam trap <[EMAIL PROTECTED]>
> Sent: Saturday, December 29, 2007 12:11:08 PM
> Subject: Re: Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW related
>
> - Original Message -
> > From: Chris Snook
> > To: Martin Knoblauch
> > Cc: linux-kernel@vger.kernel.org; [EMAIL PROTECTED]
> > Sent: Friday, December 28, 2007 7:45:13 PM
> > Subject: Re: Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW related
> >
> > Martin Knoblauch wrote:
> > > Hi,
> > >
> > > currently I am tracking down an "interesting" effect when writing
> >
> > 3) It sounds like the bottleneck is the vxfs filesystem. It only
> > *appears* on the client side because writes up until dirty_ratio
> > get buffered on the client.
> > If you can confirm that the server is actually writing stuff to
> > disk slower when the client is in writeback mode, then it's possible
> > the Linux NFS client is doing something inefficient in writeback mode.
>
> So, this is the output of "iostat -d -l 1 md111" during two runs. The
> first run is with 750 MB, the second with 850 MB.
>
> // 750MB
> $ iostat -d -l 1 md111 2
> md111
> kps tps serv
> 22 0 14
> 0 0 0
> 0 0 13
> 29347 468 12
> 37040 593 17
> 30938 492 25
> 30421 491 25
> 41626 676 16
> 42913 703 14
> 39890 647 15
> 9009 1417
> 8963 1417
> 5143 817
> 34814 547 10
> 49323 775 12
> 28624 4516
> 22 16
> finish
> 0 0 0
> 0 0 0
>
> Here it seems that the disk is writing for 26-28 seconds with avg. 29
> MB/sec. Fine.
>
> // 850MB
> $ iostat -d -l 1 md111 2
> md111
> kps tps serv
> 0 0 0
> 11275 180 10
> 39874 635 14
> 37403 587 17
> 24341 392 30
> 25989 423 26
> 22464 375 30
> 21922 361 32
> 27924 450 26
> 21507 342 21
> 9217 153 15
> 9260 150 15
> 9544 155 15
> 9298 150 14
> 10118 162 11
> 15505 250 12
> 27513 448 14
> 26698 436 15
> 26144 431 15
> 25201 412 14
> 38 seconds in run
> 0 0 0
> 0 0 0
> 579 17 12
> 0 0 0
> 0 0 0
> 0 0 0
> 0 0 0
> 518 9 16
> 485 86
> 9 17
> 514 97
> 0 0 0
> 0 0 0
> 541 98
> 532 106
> 0 0 0
> 0 0 0
> 650 127
> 0 0 0
> 242 89
> 1023 185
> 304 56
> 418 87
> 283 55
> 303 58
> 527 106
> 0 0 0
> 0 0 0
> 0 0 0
> 5 1 13
> 0 0 0
> 0 0 0
> 0 0 0
> 0 0 0
> 0 0 0
> 0 0 11
> 0 0 0
> 0 0 0
> 0 0 0
> 1 0 15
> 0 0 0
> 96 2 15
> 138 3 10
> 11057 1756
> 17549 2806
> 351 85
> 0 0 0
> # 218 seconds in run, finish.
>
> So, for the first 38 seconds everything looks similar to the 750 MB
> case. For the next about 180 seconds most of the time nothing happens.
> Averaging 4.1 MB/sec.
>
> Maybe it is time to capture the traffic. What are the best tcpdump
> parameters for NFS? I always forget :-(
>
> Cheers
> Martin

Hi,

now that the seasonal festivities are over - Happy New Year btw. - any
comments/suggestions on my problem?

Cheers
Martin
Re: [PATCH 00/11] writeback bug fixes and simplifications
- Original Message -
> From: WU Fengguang <[EMAIL PROTECTED]>
> To: Hans-Peter Jansen <[EMAIL PROTECTED]>
> Cc: Sascha Warner <[EMAIL PROTECTED]>; Andrew Morton <[EMAIL PROTECTED]>; linux-kernel@vger.kernel.org; Peter Zijlstra <[EMAIL PROTECTED]>
> Sent: Wednesday, January 9, 2008 4:33:32 AM
> Subject: Re: [PATCH 00/11] writeback bug fixes and simplifications
>
> On Sat, Dec 29, 2007 at 03:56:59PM +0100, Hans-Peter Jansen wrote:
> > Am Freitag, 28. Dezember 2007 schrieb Sascha Warner:
> > > Andrew Morton wrote:
> > > > On Thu, 27 Dec 2007 23:08:40 +0100 Sascha Warner wrote:
> > > >> Hi,
> > > >>
> > > >> I applied your patches to 2.6.24-rc6-mm1, but now I am faced
> > > >> with one pdflush often using 100% CPU for a long time. There
> > > >> seem to be some rare pauses from its 100% usage, however.
> > > >>
> > > >> On ~23 minutes uptime I have ~19 minutes pdflush runtime.
> > > >>
> > > >> This is on E6600, x86_64, 2 GB RAM, SATA HDD, running on
> > > >> gentoo ~x86_64.
> > > >>
> > > >> Let me know if you need more info.
> > > >
> > > > (some) cc's restored. Please, always do reply-to-all.
> > >
> > > Hi Wu,
> >
> > Sascha, if you want to address Fengguang by his first name, note that
> > Chinese and Bavarians (and some others I forgot now, too) typically
> > use the order:
> >
> >     lastname firstname
> >
> > when they spell their names. Further evidence is that the name Wu is
> > a pretty common Chinese family name.
> >
> > Fengguang, if it's the other way around, correct me please (and I'm
> > going to wear a big brown paper bag for the rest of the day..).
>
> You are right. We normally do "Fengguang" or "Mr. Wu" :-)
> For LKML the first name is less ambiguous.
>
> Thanks,
> Fengguang

Just cannot resist. Hans-Peter mentions Bavarians using the
lastname-firstname order as well. This is only true in a folklore
context (or when you are very deep in the countryside). Officially the
Bavarians use the usual German firstname/lastname order.
Although they will never admit to being Germans, of course :-)

Cheers
Martin
Re: Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW related
- Original Message -
> From: Chris Snook <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: linux-kernel@vger.kernel.org; [EMAIL PROTECTED]
> Sent: Friday, December 28, 2007 7:45:13 PM
> Subject: Re: Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW related
>
> Martin Knoblauch wrote:
> > Hi,
> >
> > currently I am tracking down an "interesting" effect when writing
>
> 3) It sounds like the bottleneck is the vxfs filesystem. It only
> *appears* on the client side because writes up until dirty_ratio get
> buffered on the client.
> If you can confirm that the server is actually writing stuff to disk
> slower when the client is in writeback mode, then it's possible the
> Linux NFS client is doing something inefficient in writeback mode.

So, this is the output of "iostat -d -l 1 md111" during two runs. The
first run is with 750 MB, the second with 850 MB.

// 750MB
$ iostat -d -l 1 md111 2
md111
kps tps serv
22 0 14
0 0 0
0 0 13
29347 468 12
37040 593 17
30938 492 25
30421 491 25
41626 676 16
42913 703 14
39890 647 15
9009 1417
8963 1417
5143 817
34814 547 10
49323 775 12
28624 4516
22 16
finish
0 0 0
0 0 0

Here it seems that the disk is writing for 26-28 seconds with avg. 29
MB/sec. Fine.

// 850MB
$ iostat -d -l 1 md111 2
md111
kps tps serv
0 0 0
11275 180 10
39874 635 14
37403 587 17
24341 392 30
25989 423 26
22464 375 30
21922 361 32
27924 450 26
21507 342 21
9217 153 15
9260 150 15
9544 155 15
9298 150 14
10118 162 11
15505 250 12
27513 448 14
26698 436 15
26144 431 15
25201 412 14
38 seconds in run
0 0 0
0 0 0
579 17 12
0 0 0
0 0 0
0 0 0
0 0 0
518 9 16
485 86
9 17
514 97
0 0 0
0 0 0
541 98
532 106
0 0 0
0 0 0
650 127
0 0 0
242 89
1023 185
304 56
418 87
283 55
303 58
527 106
0 0 0
0 0 0
0 0 0
5 1 13
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 11
0 0 0
0 0 0
0 0 0
1 0 15
0 0 0
96 2 15
138 3 10
11057 1756
17549 2806
351 85
0 0 0
# 218 seconds in run, finish.

So, for the first 38 seconds everything looks similar to the 750 MB
case. For the next about 180 seconds most of the time nothing happens.
Averaging 4.1 MB/sec.

Maybe it is time to capture the traffic. What are the best tcpdump
parameters for NFS? I always forget :-(

Cheers
Martin
Re: Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW related
- Original Message -
> From: Chris Snook <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: linux-kernel@vger.kernel.org; [EMAIL PROTECTED]
> Sent: Friday, December 28, 2007 7:45:13 PM
> Subject: Re: Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW related
>
> Martin Knoblauch wrote:
> > Hi,
> >
> > currently I am tracking down an "interesting" effect when writing to a
> > Solaris-10/Sparc based server. The server exports two filesystems. One
> > UFS, one VXFS. The filesystems are mounted NFS3/TCP, no special
> > options. The Linux kernel in question is 2.6.24-rc6, but it happens
> > with earlier kernels (2.6.19.2, 2.6.22.6) as well. The client is
> > x86_64 with 8 GB of RAM.
> >
> > The problem: when writing to the VXFS based filesystem, performance
> > drops dramatically when the filesize reaches or exceeds "dirty_ratio".
> > For a dirty_ratio of 10% (about 800 MB) files below 750 MB are
> > transferred with about 30 MB/sec. Anything above 770 MB drops down to
> > below 10 MB/sec. If I perform the same tests on the UFS based FS,
> > performance stays at about 30 MB/sec until 3 GB and likely larger (I
> > just stopped at 3 GB).
> >
> > Any ideas what could cause this difference? Any suggestions on
> > debugging it?
>
> 1) Try normal NFS tuning, such as rsize/wsize tuning.

rsize/wsize only have minimal effect. The negotiated size seems to be
optimal.

> 2) You're entering synchronous writeback mode, so you can delay the
> problem by raising dirty_ratio to 100, or reduce the size of the
> problem by lowering dirty_ratio to 1. Either one could help.

For experiments, sure. But I do not think that I want to have 8 GB of
dirty pages [potentially] lying around. Are you sure that 1% is a useful
value for dirty_ratio?
Looking at the code, it seems a minimum of 5% is enforced in
"page-writeback.c:get_dirty_limits":

	dirty_ratio = vm_dirty_ratio;
	if (dirty_ratio > unmapped_ratio / 2)
		dirty_ratio = unmapped_ratio / 2;
	if (dirty_ratio < 5)
		dirty_ratio = 5;

> 3) It sounds like the bottleneck is the vxfs filesystem. It only
> *appears* on the client side because writes up until dirty_ratio get
> buffered on the client.

Sure, the fact that a UFS (or SAM-FS) based FS behaves well in the same
situation points in that direction.

> If you can confirm that the server is actually writing stuff to disk
> slower when the client is in writeback mode, then it's possible the
> Linux NFS client is doing something inefficient in writeback mode.

I will try to get an iostat trace from the Sun side. Thanks for the
suggestion.

Cheers
Martin

PS: Happy Year 2008 to all Kernel Hackers and their families
Strange NFS write performance Linux->Solaris-10/VXFS, maybe VW related
Hi, currently I am tracking down an "interesting" effect when writing to a Solaris-10/Sparc based server. The server exports two filesystems: one UFS, one VXFS. The filesystems are mounted NFS3/TCP, no special options. The Linux kernel in question is 2.6.24-rc6, but it happens with earlier kernels (2.6.19.2, 2.6.22.6) as well. The client is x86_64 with 8 GB of RAM.

The problem: when writing to the VXFS based filesystem, performance drops dramatically when the file size reaches or exceeds "dirty_ratio". For a dirty_ratio of 10% (about 800 MB), files below 750 MB are transferred with about 30 MB/sec. Anything above 770 MB drops down to below 10 MB/sec. If I perform the same tests on the UFS based FS, performance stays at about 30 MB/sec up to 3 GB and likely beyond (I just stopped at 3 GB).

Any ideas what could cause this difference? Any suggestions on debugging it?

spsdm5:/lfs/test_ufs on /mnt/test_ufs type nfs (rw,proto=tcp,nfsvers=3,hard,intr,addr=160.50.118.37)
spsdm5:/lfs/test_vxfs on /mnt/test_vxfs type nfs (rw,proto=tcp,nfsvers=3,hard,intr,addr=160.50.118.37)

Cheers
Martin

PS: Please CC me, as I am not subscribed. Don't worry about the spamtrap name :-)
----------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
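The dirty_ratio arithmetic referred to above can be checked directly. A rough sketch follows; the kernel's actual accounting also involves dirty_background_ratio and excludes unreclaimable memory, so this is only an approximation of the throttling threshold:

```shell
# Approximate the write-throttling threshold: dirty_ratio percent of
# total memory. On the 8 GB client with a ratio of 10 this works out
# to roughly the ~800 MB figure from the report.
ratio=$(cat /proc/sys/vm/dirty_ratio)
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
echo "dirty_ratio=${ratio}%, threshold ~ $(( total_kb * ratio / 100 / 1024 )) MB"
```

Raising dirty_ratio (e.g. `echo 40 > /proc/sys/vm/dirty_ratio`) moves the point at which the client switches into synchronous writeback, which is one way to confirm the correlation.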
What is the unit of "nr_writeback"?
Hi, forgive the stupid question. What is the unit of "nr_writeback"? One would usually assume a rate, but looking at the code I see it added together with nr_dirty and nr_unstable, somehow defeating that assumption.

Cheers
Martin
------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
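For what it's worth, nr_writeback is not a rate but a count: the number of pages currently under writeback. That is why summing it with nr_dirty and nr_unstable (NFS pages written to the server but not yet committed) makes sense; the sum is the total of pages not yet safely on disk. The current values can be inspected directly:

```shell
# nr_dirty, nr_writeback (and, on kernels of this era, nr_unstable)
# are all page counts exported via /proc/vmstat:
grep -E '^nr_(dirty|writeback|unstable) ' /proc/vmstat
```

Multiplying by the page size (usually 4 KB) gives the amount of data in flight.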
Re: Stack warning from 2.6.24-rc
- Original Message
> From: Ingo Molnar <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: linux-kernel@vger.kernel.org
> Sent: Tuesday, December 4, 2007 12:52:23 PM
> Subject: Re: Stack warning from 2.6.24-rc
>
> * Martin Knoblauch wrote:
>
> > I see the following stack warning(s) on a IBM x3650 (2xDual-Core, 8
> > GB, AACRAID with 6x146GB RAID5) running 2.6.24-rc3/rc4:
> >
> > [ 180.739846] mount.nfs used greatest stack depth: 3192 bytes left
> > [ 666.121007] bash used greatest stack depth: 3160 bytes left
> >
> > Nothing bad has happened so far. The message does not show on a
> > similarly configured HP/DL-380g4 (CCISS instead of AACRAID) running
> > rc3. Anything to worry? Anything I can do to help debugging?
>
> those are generated by:
>
> CONFIG_DEBUG_STACKOVERFLOW=y
> CONFIG_DEBUG_STACK_USAGE=y
>
> and look quite harmless. If they were much closer to zero it would be
> a problem.
>
> Ingo

OK, I will ignore it then. I was just surprised to see it.

Thanks
Martin
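The two options Ingo names can be checked against the running kernel's configuration. A sketch (the config file location varies by distro; `/proc/config.gz` is an alternative source when enabled):

```shell
# Look for the stack-debugging options in the installed kernel config,
# falling back gracefully if no config file exists under /boot:
cfg="/boot/config-$(uname -r)"
if [ -r "$cfg" ]; then
    grep -E 'CONFIG_DEBUG_STACK(OVERFLOW|_USAGE)' "$cfg" || echo "options not set in $cfg"
else
    echo "no readable $cfg on this system"
fi
```

With both options set, the kernel prints a message whenever a process sets a new record for stack usage, which is exactly the harmless noise seen above.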
Re: iozone write 50% regression in kernel 2.6.24-rc1
- Original Message
> From: "Zhang, Yanmin" <[EMAIL PROTECTED]>
> To: Martin Knoblauch <[EMAIL PROTECTED]>
> Cc: [EMAIL PROTECTED]; LKML
> Sent: Monday, November 12, 2007 1:45:57 AM
> Subject: Re: iozone write 50% regression in kernel 2.6.24-rc1
>
> On Fri, 2007-11-09 at 04:36 -0800, Martin Knoblauch wrote:
> > - Original Message
> > > From: "Zhang, Yanmin"
> > > To: [EMAIL PROTECTED]
> > > Cc: LKML
> > > Sent: Friday, November 9, 2007 10:47:52 AM
> > > Subject: iozone write 50% regression in kernel 2.6.24-rc1
> > >
> > > Comparing with 2.6.23, iozone sequential write/rewrite (512M) has 50%
> > > regression in kernel 2.6.24-rc1. 2.6.24-rc2 has the same regression.
> > >
> > > My machine has 8 processor cores and 8GB memory.
> > >
> > > By bisect, I located patch
> > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=04fbfdc14e5f48463820d6b9807daa5e9c92c51f.
> > >
> > > Another behavior: with kernel 2.6.23, if I run iozone for many times
> > > after rebooting machine, the result looks stable. But with 2.6.24-rc1,
> > > the first run of iozone got a very small result and following run has
> > > 4Xorig_result.
> > >
> > > What I reported is the regression of 2nd/3rd run, because first run
> > > has bigger regression.
> > >
> > > I also tried to change /proc/sys/vm/dirty_ratio,dirty_backgroud_ratio
> > > and didn't get improvement.
> >
> > could you tell us the exact iozone command you are using?
>
> iozone -i 0 -r 4k -s 512m

OK, I definitely do not see the reported effect. On a HP Proliant with a RAID5 on CCISS I get:

2.6.19.2: 654-738 MB/sec write, 1126-1154 MB/sec rewrite
2.6.24-rc2: 772-820 MB/sec write, 1495-1539 MB/sec rewrite

The first run is always slowest, all subsequent runs are faster and the same speed.

> > I would like to repeat it on my setup, because I definitely see the
> > opposite behaviour in 2.6.24-rc1/rc2. The speed there is much better
> > than in 2.6.22 and before (I skipped 2.6.23, because I was waiting
> > for the per-bdi changes). I definitely do not see the difference
> > between 1st and subsequent runs. But then, I do my tests with 5GB
> > file sizes like:
> >
> > iozone3_283/src/current/iozone -t 5 -F /scratch/X1 /scratch/X2 /scratch/X3 /scratch/X4 /scratch/X5 -s 5000M -r 1024 -c -e -i 0 -i 1
>
> My machine uses SATA (AHCI) disk.

4x72GB SCSI disks building a RAID5 on a CCISS controller with battery backed write cache. Systems are 2 CPUs (64-bit) with 8 GB memory. I could test on some IBM boxes (2x dual core, 8 GB) with RAID5 on "aacraid", but I need some time to free up one of the boxes.

Cheers
Martin
Re: iozone write 50% regression in kernel 2.6.24-rc1
- Original Message
> From: "Zhang, Yanmin" <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Cc: LKML
> Sent: Friday, November 9, 2007 10:47:52 AM
> Subject: iozone write 50% regression in kernel 2.6.24-rc1
>
> Comparing with 2.6.23, iozone sequential write/rewrite (512M) has 50%
> regression in kernel 2.6.24-rc1. 2.6.24-rc2 has the same regression.
>
> My machine has 8 processor cores and 8GB memory.
>
> By bisect, I located patch
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=04fbfdc14e5f48463820d6b9807daa5e9c92c51f.
>
> Another behavior: with kernel 2.6.23, if I run iozone for many times
> after rebooting machine, the result looks stable. But with 2.6.24-rc1,
> the first run of iozone got a very small result and following run has
> 4Xorig_result.
>
> What I reported is the regression of 2nd/3rd run, because first run
> has bigger regression.
>
> I also tried to change /proc/sys/vm/dirty_ratio,dirty_backgroud_ratio
> and didn't get improvement.
>
> -yanmin

Hi Yanmin,

could you tell us the exact iozone command you are using? I would like to repeat it on my setup, because I definitely see the opposite behaviour in 2.6.24-rc1/rc2. The speed there is much better than in 2.6.22 and before (I skipped 2.6.23, because I was waiting for the per-bdi changes). I definitely do not see the difference between 1st and subsequent runs. But then, I do my tests with 5GB file sizes like:

iozone3_283/src/current/iozone -t 5 -F /scratch/X1 /scratch/X2 /scratch/X3 /scratch/X4 /scratch/X5 -s 5000M -r 1024 -c -e -i 0 -i 1

Kind regards
Martin
Re: 2.6.24-rc1: First impressions
- Original Message
> From: Ingo Molnar <[EMAIL PROTECTED]>
> To: Andrew Morton <[EMAIL PROTECTED]>
> Cc: [EMAIL PROTECTED]; linux-kernel@vger.kernel.org; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Sent: Friday, October 26, 2007 9:33:40 PM
> Subject: Re: 2.6.24-rc1: First impressions
>
> * Andrew Morton wrote:
>
> > > > dd1 - copy 16 GB from /dev/zero to local FS
> > > > dd1-dir - same, but using O_DIRECT for output
> > > > dd2/dd2-dir - copy 2x7.6 GB in parallel from /dev/zero to local FS
> > > > dd3/dd3-dir - copy 3x5.2 GB in parallel from /dev/zero to local FS
> > > > net1 - copy 5.2 GB from NFS3 share to local FS
> > > > mix3 - copy 3x5.2 GB from /dev/zero to local disk and two NFS3 shares
> > > >
> > > > I did the numbers for 2.6.19.2, 2.6.22.6 and 2.6.24-rc1. All units
> > > > are MB/sec.
> > > >
> > > > test     2.6.19.2  2.6.22.6  2.6.24-rc1
> > > > dd1      28        50        96
> > > > dd1-dir  88        88        86
> > > > dd2      2x16.5    2x11      2x44.5
> > > > dd2-dir  2x44      2x44      2x43
> > > > dd3      3x9.8     3x8.7     3x30
> > > > dd3-dir  3x29.5    3x29.5    3x28.5
> > > > net1     30-33     50-55     37-52
> > > > mix3     17/32     25/50     96/35   (disk/combined-network)
> > >
> > > wow, really nice results!
> >
> > Those changes seem suspiciously large to me. I wonder if there's less
> > physical IO happening during the timed run, and correspondingly more
> > afterwards.
>
> so a final 'sync' should be added to the test too, and the time it
> takes factored into the bandwidth numbers?

One of the reasons I do 15 GB transfers is to make sure that I am well above the possible page cache size. And of course I am doing a final sync to finish the runs :-) The sync is also running faster in 2.6.24-rc1. If I factor it in, the results for dd1/dd3 are:

test       2.6.19.2  2.6.22.6  2.6.24-rc1
sync time  18 sec    19 sec    6 sec
dd1        27.5      47.5      92
dd3        3x9.1     3x8.5     3x29

So basically, including the sync time makes 2.6.24-rc1 even more promising.

Now, I know that my benchmark numbers are crude and show only a very small aspect of system performance. But it is an aspect I care about a lot. And those benchmarks match my use-case pretty well.

Cheers
Martin
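The "time the copy including the final sync" measurement Ingo asks for can be scripted directly. A scaled-down sketch (the runs above deliberately use 15-16 GB so the page cache cannot hide unwritten data; the tiny size here is only for illustration):

```shell
# Time dd plus the final sync as one unit, so buffered-but-unwritten
# data is not credited as completed I/O (tiny size for illustration):
f=/tmp/ddsync.$$
start=$(date +%s)
dd if=/dev/zero of="$f" bs=1M count=16 2>/dev/null && sync
elapsed=$(( $(date +%s) - start ))
size=$(stat -c %s "$f")
rm -f "$f"
echo "wrote $size bytes in ${elapsed}s (including sync)"
```

Dividing bytes written by the elapsed time including sync gives the honest bandwidth figure used for the corrected dd1/dd3 numbers above.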
Re: 2.6.24-rc1: First impressions
- Original Message
> From: Andrew Morton <[EMAIL PROTECTED]>
> To: Arjan van de Ven <[EMAIL PROTECTED]>
> Cc: Ingo Molnar <[EMAIL PROTECTED]>; [EMAIL PROTECTED]; linux-kernel@vger.kernel.org; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Sent: Saturday, October 27, 2007 7:59:51 AM
> Subject: Re: 2.6.24-rc1: First impressions
>
> On Fri, 26 Oct 2007 22:46:57 -0700 Arjan van de Ven wrote:
>
> > > > > dd1 - copy 16 GB from /dev/zero to local FS
> > > > > dd1-dir - same, but using O_DIRECT for output
> > > > > dd2/dd2-dir - copy 2x7.6 GB in parallel from /dev/zero to local FS
> > > > > dd3/dd3-dir - copy 3x5.2 GB in parallel from /dev/zero to local FS
> > > > > net1 - copy 5.2 GB from NFS3 share to local FS
> > > > > mix3 - copy 3x5.2 GB from /dev/zero to local disk and two NFS3 shares
> > > > >
> > > > > I did the numbers for 2.6.19.2, 2.6.22.6 and 2.6.24-rc1. All
> > > > > units are MB/sec.
> > > > >
> > > > > test     2.6.19.2  2.6.22.6  2.6.24-rc1
> > > > > dd1      28        50        96
> > > > > dd1-dir  88        88        86
> > > > > dd2      2x16.5    2x11      2x44.5
> > > > > dd2-dir  2x44      2x44      2x43
> > > > > dd3      3x9.8     3x8.7     3x30
> > > > > dd3-dir  3x29.5    3x29.5    3x28.5
> > > > > net1     30-33     50-55     37-52
> > > > > mix3     17/32     25/50     96/35   (disk/combined-network)
> > > >
> > > > wow, really nice results!
> > >
> > > Those changes seem suspiciously large to me. I wonder if there's less
> > > physical IO happening during the timed run, and correspondingly more
> > > afterwards.
> >
> > another option... this is ext2.. didn't the ext2 reservation stuff get
> > merged into -rc1? for ext3 that gave a 4x or so speed boost (much
> > better sequential allocation pattern)
>
> Yes, one would expect that to make a large difference in dd2/dd2-dir and
> dd3/dd3-dir - but only on SMP. On UP there's not enough concurrency
> in the fs block allocator for any damage to occur.

Just for the record: the tests are done on SMP.

> Reservations won't affect dd1 though, and that went faster too.

This is the one result that surprised me most, as I did not really expect any big moves here. I am not complaining :-), but it would definitely be nice to understand the why.

Cheers
Martin
2.6.24-rc1: First impressions
Hi, just to give some feedback on 2.6.24-rc1.

For some time I have been tracking IO/writeback problems that hurt system responsiveness big-time. I tested Peter's stuff together with Fengguang's additions and it looked promising. Therefore I was very happy to see Peter's stuff going into 2.6.24 and waited eagerly for rc1.

In short, I am impressed. This really looks good. IO throughput is great and I could not reproduce the responsiveness problems so far. Below are some numbers from my brute-force I/O tests that I can use to bring responsiveness down.

My platform is an HP/DL380g4, dual CPUs, HT-enabled, 8 GB memory, SmartArray 6i controller with 4x72GB SCSI disks as RAID5 (battery protected writeback cache enabled) and gigabit networking (tg3). User space is 64-bit RHEL4.3. I am basically doing copies using "dd" with 1MB blocksize. The local filesystem is ext2 (noatime). The IO scheduler is deadline, as it tends to give the best results. The NFS3 server is a Sun/T2000/Solaris10.

The tests are:

dd1 - copy 16 GB from /dev/zero to local FS
dd1-dir - same, but using O_DIRECT for output
dd2/dd2-dir - copy 2x7.6 GB in parallel from /dev/zero to local FS
dd3/dd3-dir - copy 3x5.2 GB in parallel from /dev/zero to local FS
net1 - copy 5.2 GB from NFS3 share to local FS
mix3 - copy 3x5.2 GB from /dev/zero to local disk and two NFS3 shares

I did the numbers for 2.6.19.2, 2.6.22.6 and 2.6.24-rc1. All units are MB/sec.

test     2.6.19.2  2.6.22.6  2.6.24-rc1
dd1      28        50        96
dd1-dir  88        88        86
dd2      2x16.5    2x11      2x44.5
dd2-dir  2x44      2x44      2x43
dd3      3x9.8     3x8.7     3x30
dd3-dir  3x29.5    3x29.5    3x28.5
net1     30-33     50-55     37-52
mix3     17/32     25/50     96/35   (disk/combined-network)

Some observations:

- single threaded disk speed really went up with 2.6.24-rc1. It is now even better than O_DIRECT
- O_DIRECT took a slight hit compared to the older kernels. Not an issue for me, but maybe others care
- multi threaded non-O_DIRECT scales for the first time ever. Almost no loss compared to single threaded !!
- network throughput took a hit from 2.6.22.6 and is not as repeatable. Still better than 2.6.19.2 though

What actually surprises me most is the big performance win on the single threaded non-O_DIRECT dd test. I did not expect that :-) What I had hoped for was of course the scalability.

So, this looks great and most likely I will push 2.6.24 (maybe .X) into my environment.

Happy weekend
Martin
------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
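The dd1/dd1-dir pair described above can be sketched as follows. Paths and sizes are scaled down for illustration (the real runs copy 16 GB with bs=1M); O_DIRECT is not supported on every filesystem (tmpfs, for instance), hence the fallback message:

```shell
out=/tmp/dd1.$$
# dd1: ordinary buffered write through the page cache, flushed at the end
dd if=/dev/zero of="$out" bs=1M count=8 conv=fsync 2>/dev/null
buffered_size=$(stat -c %s "$out")
# dd1-dir: the same copy with O_DIRECT, bypassing the page cache entirely
dd if=/dev/zero of="$out" bs=1M count=8 oflag=direct 2>/dev/null \
    || echo "O_DIRECT not supported on this filesystem"
rm -f "$out"
echo "buffered write produced $buffered_size bytes"
```

The interesting comparison is exactly the one in the table: buffered throughput (which the writeback changes affect) against O_DIRECT throughput (which they do not).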
Re: [PATCH 0/5] sluggish writeback fixes
--- Fengguang Wu <[EMAIL PROTECTED]> wrote: > Andrew, > > The following patches fix the sluggish writeback behavior. > They are well understood and well tested - but not yet widely tested. > > The first patch reverts the debugging -mm only > check_dirty_inode_list.patch - > which is no longer necessary. > > The following 4 patches do the real jobs: > > [PATCH 2/5] writeback: fix time ordering of the per superblock inode > lists 8 > [PATCH 3/5] writeback: fix ntfs with sb_has_dirty_inodes() > [PATCH 4/5] writeback: remove pages_skipped accounting in > __block_write_full_page() > [PATCH 5/5] writeback: introduce writeback_control.more_io to > indicate more io > > They share the same goal as the following patches in -mm. Therefore > I'd > recommend to put the last 4 new ones after them: > > writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists.patch > writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-2.patch > writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-3.patch > writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-4.patch > writeback-fix-comment-use-helper-function.patch > writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-5.patch > writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-6.patch > writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-7.patch > writeback-fix-periodic-superblock-dirty-inode-flushing.patch > > Regards, > Fengguang Hi Fenguang, now that Peters stuff seems to make it into mainline, do you think your fixes should go in as well? Would definitely help to broaden the tester base. 
Definitely by one very interested tester :-) Keep up the good work Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)
--- Peter Zijlstra <[EMAIL PROTECTED]> wrote: > On Mon, 2007-10-01 at 14:22 -0700, Andrew Morton wrote: > > > nfs-remove-congestion_end.patch > > lib-percpu_counter_add.patch > > lib-percpu_counter_sub.patch > > lib-percpu_counter-variable-batch.patch > > lib-make-percpu_counter_add-take-s64.patch > > lib-percpu_counter_set.patch > > lib-percpu_counter_sum_positive.patch > > lib-percpu_count_sum.patch > > lib-percpu_counter_init-error-handling.patch > > lib-percpu_counter_init_irq.patch > > mm-bdi-init-hooks.patch > > mm-scalable-bdi-statistics-counters.patch > > mm-count-reclaimable-pages-per-bdi.patch > > mm-count-writeback-pages-per-bdi.patch > > This one: > > mm-expose-bdi-statistics-in-sysfs.patch > > > lib-floating-proportions.patch > > mm-per-device-dirty-threshold.patch > > mm-per-device-dirty-threshold-warning-fix.patch > > mm-per-device-dirty-threshold-fix.patch > > mm-dirty-balancing-for-tasks.patch > > mm-dirty-balancing-for-tasks-warning-fix.patch > > And, this one: > > debug-sysfs-files-for-the-current-ratio-size-total.patch > > > I'm not sure polluting /sys/block/<device>/queue/ like that is The Right > Thing. These patches sure were handy when debugging this, but not > sure > they want to move to mainline. > > Maybe we want /sys/bdi/<bdi>/ or maybe /debug/bdi/<bdi>/ > > Opinions? > Hi Peter, my only opinion is that it is great to see that stuff moving into mainline. If it really goes in, there will be one more very interested rc-tester :-) Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
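The per-device threshold patches listed above divide the global dirty limit among backing devices according to their measured share of recently completed writeback (the "floating proportions" work). A toy sketch of that apportioning in shell arithmetic — all numbers are invented, and the kernel's actual fixed-point code is considerably more subtle:

```shell
# Invented example: an 800 MB global dirty limit, where this device
# completed 300 of the last 1200 writeback pages the kernel tracked.
global_limit_mb=800
bdi_completed=300
total_completed=1200

# The device's own threshold is roughly its proportional share.
bdi_thresh_mb=$(( global_limit_mb * bdi_completed / total_completed ))
echo "bdi threshold: ${bdi_thresh_mb} MB"   # prints "bdi threshold: 200 MB"
```

A fast device that completes most of the writeback thus earns most of the dirty budget, while a slow device (USB stick, congested NFS mount) gets throttled early instead of being allowed to fill the page cache.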
Re: huge improvement with per-device dirty throttling
--- Martin Knoblauch <[EMAIL PROTECTED]> wrote: > > --- Leroy van Logchem <[EMAIL PROTECTED]> wrote: > > > Andrea Arcangeli wrote: > > > On Wed, Aug 22, 2007 at 01:05:13PM +0200, Andi Kleen wrote: > > >> Ok perhaps the new adaptive dirty limits helps your single disk > > >> a lot too. But your improvements seem to be more "collateral > > damage" @) > > >> > > >> But if that was true it might be enough to just change the dirty > > limits > > >> to get the same effect on your system. You might want to play > with > > >> /proc/sys/vm/dirty_* > > > > > > The adaptive dirty limit is per task so it can't be reproduced > with > > > global sysctl. It made quite some difference when I researched > into > > it > > > in function of time. This isn't in function of time but it > > certainly > > > makes a lot of difference too, actually it's the most important > > part > > > of the patchset for most people, the rest is for the corner cases > > that > > > aren't handled right currently (writing to a slow device with > > > writeback cache has always been hanging the whole thing). > > > > > > Self-tuning > static sysctl's. The last years we needed to use very > > > small values for dirty_ratio and dirty_background_ratio to soften > the > > > > latency problems we have during sustained writes. Imo these patches > > > really help in many cases, please commit to mainline. > > > > -- > > Leroy > > > > while it helps in some situations, I did some tests today with > 2.6.22.6+bdi-v9 (Peter was so kind) which seem to indicate that it > hurts NFS writes. Anyone seen similar effects? > > Otherwise I would just second your request. It definitely helps the > problematic performance of my CCISS based RAID5 volume. > please disregard my comment about NFS write performance. What I have seen is caused by some other stuff I am toying with. So, I second your request to push this forward. 
Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: huge improvement with per-device dirty throttling
--- Andrea Arcangeli <[EMAIL PROTECTED]> wrote: > On Wed, Aug 22, 2007 at 01:05:13PM +0200, Andi Kleen wrote: > > Ok perhaps the new adaptive dirty limits helps your single disk > > a lot too. But your improvements seem to be more "collateral > damage" @) > > > > But if that was true it might be enough to just change the dirty > limits > > to get the same effect on your system. You might want to play with > > /proc/sys/vm/dirty_* > > The adaptive dirty limit is per task so it can't be reproduced with > global sysctl. It made quite some difference when I researched into > it > in function of time. This isn't in function of time but it certainly > makes a lot of difference too, actually it's the most important part > of the patchset for most people, the rest is for the corner cases > that > aren't handled right currently (writing to a slow device with > writeback cache has always been hanging the whole thing). didn't see that remark before. I just realized that "slow device with writeback cache" pretty well describes the CCISS controller in the DL380g4. Could you elaborate why that is a problematic case? Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
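Andrea's point that a per-task adaptive limit "can't be reproduced with a global sysctl" is easiest to see with a toy model: each task gets its own effective threshold, lowered according to how much of the recent dirtying it did. The scaling below is invented purely for illustration — it is not the patchset's formula:

```shell
# Toy model: global threshold of 100 units; task A did 80% of recent
# dirtying, task B did 5%. Each loses a slice proportional to its share.
global_thresh=100
a_share=80
b_share=5

a_thresh=$(( global_thresh - a_share * global_thresh / 200 ))
b_thresh=$(( global_thresh - b_share * global_thresh / 200 ))
echo "A throttled at $a_thresh, B at $b_thresh"   # prints "A throttled at 60, B at 98"
```

The heavy dirtier hits its wall first, so a light writer (say, an editor saving a file) stays responsive — an asymmetry no single vm.dirty_ratio value can express.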
Re: huge improvement with per-device dirty throttling
--- Leroy van Logchem <[EMAIL PROTECTED]> wrote: > Andrea Arcangeli wrote: > > On Wed, Aug 22, 2007 at 01:05:13PM +0200, Andi Kleen wrote: > >> Ok perhaps the new adaptive dirty limits helps your single disk > >> a lot too. But your improvements seem to be more "collateral > damage" @) > >> > >> But if that was true it might be enough to just change the dirty > limits > >> to get the same effect on your system. You might want to play with > >> /proc/sys/vm/dirty_* > > > > The adaptive dirty limit is per task so it can't be reproduced with > > global sysctl. It made quite some difference when I researched into > it > > in function of time. This isn't in function of time but it > certainly > > makes a lot of difference too, actually it's the most important > part > > of the patchset for most people, the rest is for the corner cases > that > > aren't handled right currently (writing to a slow device with > > writeback cache has always been hanging the whole thing). > > > Self-tuning > static sysctl's. The last years we needed to use very > small values for dirty_ratio and dirty_background_ratio to soften the > > latency problems we have during sustained writes. Imo these patches > really help in many cases, please commit to mainline. > > -- > Leroy > while it helps in some situations, I did some tests today with 2.6.22.6+bdi-v9 (Peter was so kind) which seem to indicate that it hurts NFS writes. Anyone seen similar effects? Otherwise I would just second your request. It definitely helps the problematic performance of my CCISS based RAID5 volume. Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
RFC: [PATCH] Small patch on top of per device dirty throttling -v9
--- Peter Zijlstra <[EMAIL PROTECTED]> wrote: > On Thu, 2007-08-23 at 08:59 -0700, Martin Knoblauch wrote: > > --- Peter Zijlstra <[EMAIL PROTECTED]> wrote: > > > > > On Thu, 2007-08-16 at 05:49 -0700, Martin Knoblauch wrote: > > > > > > > Peter, > > > > > > > > any chance to get a rollup against 2.6.22-stable? > > > > > > > > The 2.6.23 series may not be usable for me due to the > > > > nosharedcache changes for NFS (the new default will massively > > > > disturb the user-space automounter). > > > > > > I'll see what I can do, bit busy with other stuff atm, hopefully > > > after > > > the weekend. > > > > > Hi Peter, > > > > any progress on a version against 2.6.22.5? I have seen the very > > positive report from Jeffrey W. Baker and would really love to test > > your patch. But as I said, anything newer than 2.6.22.x might not > be an > > option due to the NFS changes. > > mindless port, seems to compile and boot on my test box ymmv. > Hi Peter, while doing my tests I observed that setting dirty_ratio below 5% did not make a difference at all. Just by chance I found that this apparently is an enforced limit in mm/page-writeback.c. With below patch I have lowered the limit to 2%. With that, things look a lot better on my systems. Load during write stays below 1.5 for one writer. Responsiveness is good. This may even help without the throttling patch. 
Not sure that this is the right thing to do, but it helps :-) Cheers Martin

--- linux-2.6.22.5-bdi-v9/mm/page-writeback.c
+++ linux-2.6.22.6+bdi-v9/mm/page-writeback.c
@@ -311,8 +311,11 @@
 	if (dirty_ratio > unmapped_ratio / 2)
 		dirty_ratio = unmapped_ratio / 2;
 
-	if (dirty_ratio < 5)
-		dirty_ratio = 5;
+/*
+** MKN: Lower enforced limit from 5% to 2%
+*/
+	if (dirty_ratio < 2)
+		dirty_ratio = 2;
 
 	background_ratio = dirty_background_ratio;
 	if (background_ratio >= dirty_ratio)

-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
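The effect of the enforced floor can be replayed in plain shell arithmetic; this mirrors the clamping order of the code above for illustration only, with invented input values:

```shell
# An admin asks for vm.dirty_ratio=1 on a box whose unmapped ratio is 40%.
requested=1
unmapped_ratio=40

dirty_ratio=$requested
# First clamp: never more than half the unmapped ratio.
if [ "$dirty_ratio" -gt $(( unmapped_ratio / 2 )) ]; then
	dirty_ratio=$(( unmapped_ratio / 2 ))
fi
# Then the floor: 5 in mainline, 2 with the patch above.
floor=2
if [ "$dirty_ratio" -lt "$floor" ]; then
	dirty_ratio=$floor
fi
echo "effective dirty_ratio: $dirty_ratio"   # prints "effective dirty_ratio: 2"
```

With the mainline floor of 5, the same request would silently be raised to 5 — which is why sysctl settings below 5% appeared to do nothing at all.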
Re: recent nfs change causes autofs regression
--- Jakob Oestergaard <[EMAIL PROTECTED]> wrote: > On Fri, Aug 31, 2007 at 09:43:29AM -0700, Linus Torvalds wrote: > ... > > This is *not* a security hole. In order to make it a security hole, > you > > need to be root in the first place. > > Non-root users can write to places where root might believe they > cannot write > because he might be under the mistaken assumption that ro means ro. > > I am under the impression that that could have implications in some > setups. > That was never in question. > ... > > > > - it's a misfeature that people are used to, and has been around > forever. > > Sure, they're used to it, but I doubt they are aware of it. > So, the right thing to do (tm) is to make them aware without breaking their setup. Log any detected inconsistencies to the dmesg buffer and to syslog. If the sysadmin is not competent enough to notice, too bad. Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: recent nfs change causes autofs regression
--- Ian Kent <[EMAIL PROTECTED]> wrote: > On Thu, 30 Aug 2007, Linus Torvalds wrote: > > > > > > On Fri, 31 Aug 2007, Trond Myklebust wrote: > > > > > > It did not. The previous behaviour was to always silently > override the > > > user mount options. > > > > ..so it still worked for any sane setup, at least. > > > > You broke that. Hua gave good reasons for why he cannot use the > current > > kernel. It's a regression. > > > > In other words, the new behaviour is *worse* than the behaviour you > > > consider to be the incorrect one. > > > > This all came about due to complaints about not being able to mount > the > same server file system with different options, most commonly ro vs. > rw > which I think was due to the shared super block changes some time > ago. > And, to some extent, I have to plead guilty for not complaining > enough > about this default in the beginning, which is basically unacceptable > for > sure. > > We have seen breakage in Fedora with the introduction of the patches > and > this is typical of it. It also breaks amd and admins have no way of > altering this that I'm aware of (help us here Ion). > > I understand Trond's concerns but the fact remains that other Unixes > allow > this behaviour but don't assert cache coherency and many sysadmins > don't > realize this. So the broken behavior is expected to work and we can't > > simply stop allowing it unless we want to attend a public hanging > with us > as the participants. > > There is no question that the new behavior is worse and this change > is > unacceptable as a solution to the original problem. > > I really think that reversing the default, as has been suggested, > documenting the risk in the mount.nfs man page and perhaps issuing a > warning from the kernel is a better way to handle this. At least we > will > be doing more to raise public awareness of the issue than others. > I can only second that. Changing the default behavior in this way is really bad.
Not that I am disagreeing with the technical reasons, but the change breaks working setups. And -EBUSY is not very helpful as a message here. It does not matter that the user tools may handle the breakage incorrectly. The users (admins) had working setups for years. And they were obviously working "good enough". And one should not forget that there will be a considerable time until "nosharecache" trickles down into distributions. If the situation stays this way, quite a few people will not be able to move beyond 2.6.22 for some time. E.g. I am working for a company that operates some linux "clusters" at a few German automotive companies. For certain reasons everything there is based on automounter maps (both autofs and amd style). We have almost zero influence on that setup. The maps are a mess - we will run into the sharecache problem. At the same time I am trying to fight the notorious "system turns into frozen molasses on moderate I/O load" problem. There may be some interesting developments coming forth after 2.6.22. Not good :-(

What I would like to see done for the situation at hand is:

- make "nosharecache" the default for the foreseeable future
- log any attempt to mount option-inconsistent NFS filesystems to dmesg and syslog (apparently the NFS client is able to detect them :-). Do this regardless of the "nosharecache" option. This way admins will at least be made aware of the situation.
- in a year or so we can talk about making the default safe. With proper advertising.

Just my 0.02. Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
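For completeness, the workaround spelled out as commands — the server name and mount points below are placeholders, while `nosharecache` itself is the real NFS mount option introduced alongside the 2.6.23 change:

```shell
# Each mount gets its own superblock, so differing options no longer
# collide with -EBUSY (at the cost of cache coherency between the two).
mount -t nfs -o ro,nosharecache filer:/export /mnt/ro
mount -t nfs -o rw,nosharecache filer:/export /mnt/rw
```

Until automounters and distribution mount helpers pass the option themselves, admins have to add it to every map entry by hand — which is exactly the deployment gap described above.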
Re: Understanding I/O behaviour - next try
--- Jens Axboe <[EMAIL PROTECTED]> wrote: > > Try limiting the queue depth on the cciss device, some of those are > notoriously bad at starving commands. Something like the below hack, see > if it makes a difference (and please verify in dmesg that it prints the > message about limiting depth!):
>
> diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> index 084358a..257e1c3 100644
> --- a/drivers/block/cciss.c
> +++ b/drivers/block/cciss.c
> @@ -2992,7 +2992,12 @@ static int cciss_pci_init(ctlr_info_t *c, struct pci_dev *pdev)
>  	if (board_id == products[i].board_id) {
>  		c->product_name = products[i].product_name;
>  		c->access = *(products[i].access);
> +#if 0
>  		c->nr_cmds = products[i].nr_cmds;
> +#else
> +		c->nr_cmds = 2;
> +		printk("cciss: limited max commands to 2\n");
> +#endif
>  		break;
>  	}
>  }
>
> --
> Jens Axboe

Hi Jens, how exactly is the queue depth related to the max # of commands? I ask, because with the 2.6.22 kernel the "maximum queue depth since init" seems to be never higher than 16, even with much higher outstanding commands. On a 2.6.19 kernel, maximum queue depth is much higher, just a bit below "max # of commands since init".
[2.6.22]# cat /proc/driver/cciss/cciss0
cciss0: HP Smart Array 6i Controller
Board ID: 0x40910e11
Firmware Version: 2.76
IRQ: 51
Logical drives: 1
Max sectors: 2048
Current Q depth: 0
Current # commands on controller: 145
Max Q depth since init: 16
Max # commands on controller since init: 204
Max SG entries since init: 31
Sequential access devices: 0

[2.6.19] cat /proc/driver/cciss/cciss0
cciss0: HP Smart Array 6i Controller
Board ID: 0x40910e11
Firmware Version: 2.76
IRQ: 51
Logical drives: 1
Current Q depth: 0
Current # commands on controller: 0
Max Q depth since init: 197
Max # commands on controller since init: 198
Max SG entries since init: 31
Sequential access devices: 0

Cheers
Martin
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
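For reference, the two snapshots can be compared mechanically. The sketch below is only an illustration (the sample values are copied from the listings above, and the `parse` helper is a hypothetical name, not part of the cciss driver); it pulls out the two counters and shows how far apart they are on each kernel:

```shell
#!/bin/sh
# Compare "Max Q depth" vs "Max # commands" from two cciss /proc snapshots.
# Sample values copied from the 2.6.22 and 2.6.19 listings above.

parse() {
  # $1 = snapshot text; prints "<max_q_depth> <max_commands>"
  printf '%s\n' "$1" | awk -F': ' '
    /Max Q depth since init/             { q = $2 }
    /Max # commands on controller since/ { c = $2 }
    END { print q, c }'
}

snap_2622="Max Q depth since init: 16
Max # commands on controller since init: 204"

snap_2619="Max Q depth since init: 197
Max # commands on controller since init: 198"

set -- $(parse "$snap_2622")
echo "2.6.22: max Q depth=$1  max commands=$2  difference=$(( $2 - $1 ))"
# -> 2.6.22: max Q depth=16  max commands=204  difference=188

set -- $(parse "$snap_2619")
echo "2.6.19: max Q depth=$1  max commands=$2  difference=$(( $2 - $1 ))"
# -> 2.6.19: max Q depth=197  max commands=198  difference=1
```

On 2.6.19 the two counters track each other closely, while on 2.6.22 the controller accumulates far more outstanding commands than the queue depth ever reaches, which is exactly the asymmetry the question above is about.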
Re: Understanding I/O behaviour - next try
--- Robert Hancock <[EMAIL PROTECTED]> wrote:
>
> I saw a bulletin from HP recently that suggested disabling the
> write-back cache on some Smart Array controllers as a workaround because
> it reduced performance in applications that did large bulk writes.
> Presumably they are planning on releasing some updated firmware that
> fixes this eventually..
>
> --
> Robert Hancock      Saskatoon, SK, Canada
> To email, remove "nospam" from [EMAIL PROTECTED]
> Home Page: http://www.roberthancock.com/

Robert,

just checked it out. At least with the "6i", you do not want to disable the WBC :-) Performance really goes down the toilet for all cases. Do you still have a pointer to that bulletin?

Cheers
Martin
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
RE: regression of autofs for current git?
On Wed, 2007-08-29 at 20:09 -0700, Ian Kent wrote:
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=75180df2ed467866ada839fe73cf7cc7d75c0a22
>
> This (and its related patches) may be the problem.
>
> I can probably tell if you post your map, or if you strace the automount
> process managing a problem mount point and look for mount returning
> EBUSY when it should succeed.

Likely. That is the one that will break the user-space automounter as well (and keeps me from .23). I don't care very much about what the default is, but it would be great if the new behaviour could be globally changed at run- (or boot-) time. It will be some time until the new mount option makes it into the distros.

Cheers
Martin

PS: Sorry, but I likely killed the CC list
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
Re: Understanding I/O behaviour - next try
--- Chuck Ebbert <[EMAIL PROTECTED]> wrote:
> On 08/28/2007 11:53 AM, Martin Knoblauch wrote:
> >
> > The basic setup is a dual x86_64 box with 8 GB of memory. The DL380
> > has a HW RAID5, made from 4x72GB disks and about 100 MB write cache.
> > The performance of the block device with O_DIRECT is about 90 MB/sec.
> >
> > The problematic behaviour comes when we are moving large files through
> > the system. The file usage in this case is mostly "use once" or
> > streaming. As soon as the amount of file data is larger than 7.5 GB, we
> > see occasional unresponsiveness of the system (e.g. no more ssh
> > connections into the box) of more than 1 or 2 minutes (!) duration
> > (kernels up to 2.6.19). Load goes up, mainly due to pdflush threads and
> > some other poor guys being in "D" state.
>
> Try booting with "mem=4096M", "mem=2048M", ...

Hmm. I tried 1024M a while ago and IIRC did not see a lot [any] difference. But as it is no big deal, I will repeat it tomorrow. Just curious - what are you expecting? Why should it help?

Thanks
Martin
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
Re: Understanding I/O behaviour - next try
--- Jens Axboe <[EMAIL PROTECTED]> wrote:
> On Tue, Aug 28 2007, Martin Knoblauch wrote:
> > Keywords: I/O, bdi-v9, cfs
>
> Try limiting the queue depth on the cciss device, some of those are
> notoriously bad at starving commands. Something like the below hack, see
> if it makes a difference (and please verify in dmesg that it prints the
> message about limiting depth!):
>
> diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
> index 084358a..257e1c3 100644
> --- a/drivers/block/cciss.c
> +++ b/drivers/block/cciss.c
> @@ -2992,7 +2992,12 @@ static int cciss_pci_init(ctlr_info_t *c, struct pci_dev *pdev)
>  		if (board_id == products[i].board_id) {
>  			c->product_name = products[i].product_name;
>  			c->access = *(products[i].access);
> +#if 0
>  			c->nr_cmds = products[i].nr_cmds;
> +#else
> +			c->nr_cmds = 2;
> +			printk("cciss: limited max commands to 2\n");
> +#endif
>  			break;
>  		}
>  	}
>
> --
> Jens Axboe

Hi Jens,

thanks for the suggestion. Unfortunately the non-direct [parallel] writes to the device got considerably slower. I guess the "6i" controller copes better with higher values. Can nr_cmds be changed at runtime? Maybe there is an optimal setting.
[ 69.438851] SCSI subsystem initialized
[ 69.442712] HP CISS Driver (v 3.6.14)
[ 69.442871] ACPI: PCI Interrupt :04:03.0[A] -> GSI 51 (level, low) -> IRQ 51
[ 69.442899] cciss: limited max commands to 2 (Smart Array 6i)
[ 69.482370] cciss0: <0x46> at PCI :04:03.0 IRQ 51 using DAC
[ 69.494352]       blocks= 426759840 block_size= 512
[ 69.498350]  heads=255, sectors=32, cylinders=52299
[ 69.498352]
[ 69.498509]       blocks= 426759840 block_size= 512
[ 69.498602]  heads=255, sectors=32, cylinders=52299
[ 69.498604]
[ 69.498608]  cciss/c0d0: p1 p2

Cheers
Martin
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
Re: Understanding I/O behaviour - next try
--- Fengguang Wu <[EMAIL PROTECTED]> wrote:
> On Wed, Aug 29, 2007 at 01:15:45AM -0700, Martin Knoblauch wrote:
> >
> > --- Fengguang Wu <[EMAIL PROTECTED]> wrote:
> >
> > > You are apparently running into the sluggish kupdate-style writeback
> > > problem with large files: huge amount of dirty pages are getting
> > > accumulated and flushed to the disk all at once when dirty background
> > > ratio is reached. The current -mm tree has some fixes for it, and
> > > there are some more in my tree. Martin, I'll send you the patch if
> > > you'd like to try it out.
> >
> > Hi Fengguang,
> >
> > Yeah, that pretty much describes the situation we end up in. Although
> > "sluggish" is much too friendly if we hit the situation :-)
> >
> > Yes, I am very interested to check out your patch. I saw your
> > postings on LKML already and was already curious. Any chance you have
> > something against 2.6.22-stable? I have reasons not to move to -23 or
> > -mm.
>
> Well, they are a dozen patches from various sources. I managed to
> back-port them. It compiles and runs, however I cannot guarantee
> more...

Thanks. I understand the limited scope of the warranty :-) I will give it a spin today.

> > > > Another thing I saw during my tests is that when writing to NFS, the
> > > > "dirty" or "nr_dirty" numbers are always 0. Is this a conceptual thing,
> > > > or a bug?
> > >
> > > What are the nr_unstable numbers?
> >
> > Ahh. Yes, they go up to 80-90k pages. Comparable to the nr_dirty
> > numbers for the disk case. Good to know.
> >
> > For NFS, the nr_writeback numbers seem surprisingly high. They also go
> > to 80-90k (pages ?). In the disk case they rarely go over 12k.
>
> Maybe the difference of throttling one single 'cp' and a dozen 'nfsd'?

No "nfsd" running on that box. It is just a client.

Cheers
Martin
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
Re: Understanding I/O behaviour - next try
--- Fengguang Wu <[EMAIL PROTECTED]> wrote:
> On Tue, Aug 28, 2007 at 08:53:07AM -0700, Martin Knoblauch wrote:
> [...]
> > The basic setup is a dual x86_64 box with 8 GB of memory. The DL380
> > has a HW RAID5, made from 4x72GB disks and about 100 MB write cache.
> > The performance of the block device with O_DIRECT is about 90 MB/sec.
> >
> > The problematic behaviour comes when we are moving large files through
> > the system. The file usage in this case is mostly "use once" or
> > streaming. As soon as the amount of file data is larger than 7.5 GB, we
> > see occasional unresponsiveness of the system (e.g. no more ssh
> > connections into the box) of more than 1 or 2 minutes (!) duration
> > (kernels up to 2.6.19). Load goes up, mainly due to pdflush threads and
> > some other poor guys being in "D" state.
> [...]
> > Just by chance I found out that doing all I/O in sync-mode does
> > prevent the load from going up. Of course, I/O throughput is not
> > stellar (but not much worse than the non-O_DIRECT case). But the
> > responsiveness seems OK. Maybe a solution, as this can be controlled via
> > mount (would be great for O_DIRECT :-).
> >
> > In general 2.6.22 seems to be better than 2.6.19, but this is highly
> > subjective :-( I am using the following settings in /proc. They seem to
> > provide the smoothest responsiveness:
> >
> > vm.dirty_background_ratio = 1
> > vm.dirty_ratio = 1
> > vm.swappiness = 1
> > vm.vfs_cache_pressure = 1
>
> You are apparently running into the sluggish kupdate-style writeback
> problem with large files: huge amount of dirty pages are getting
> accumulated and flushed to the disk all at once when dirty background
> ratio is reached. The current -mm tree has some fixes for it, and
> there are some more in my tree. Martin, I'll send you the patch if
> you'd like to try it out.

Hi Fengguang,

Yeah, that pretty much describes the situation we end up in. Although "sluggish" is much too friendly if we hit the situation :-)

Yes, I am very interested to check out your patch. I saw your postings on LKML already and was already curious. Any chance you have something against 2.6.22-stable? I have reasons not to move to -23 or -mm.

> > Another thing I saw during my tests is that when writing to NFS, the
> > "dirty" or "nr_dirty" numbers are always 0. Is this a conceptual thing,
> > or a bug?
>
> What are the nr_unstable numbers?

Ahh. Yes, they go up to 80-90k pages. Comparable to the nr_dirty numbers for the disk case. Good to know.

For NFS, the nr_writeback numbers seem surprisingly high. They also go to 80-90k (pages ?). In the disk case they rarely go over 12k.

Cheers
Martin

> Fengguang

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
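As a back-of-the-envelope check on the counters discussed above (not part of the original exchange): with the 4 KiB page size of x86_64, the reported page counts translate into megabytes as follows:

```shell
#!/bin/sh
# Convert the nr_writeback / nr_unstable page counts quoted above into MB,
# assuming the 4096-byte page size of x86_64.
page_size=4096
for pages in 80000 90000 12000; do
  mb=$(( pages * page_size / 1024 / 1024 ))
  echo "${pages} pages = ${mb} MB"
done
# -> 80000 pages = 312 MB
# -> 90000 pages = 351 MB
# -> 12000 pages = 46 MB
```

So the NFS case keeps roughly 300-350 MB in flight under writeback, versus well under 50 MB for the local-disk case.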
Understanding I/O behaviour - next try
Keywords: I/O, bdi-v9, cfs

Hi,

a while ago I asked a few questions on the Linux I/O behaviour, because I was (and still am) fighting some "misbehaviour" related to heavy I/O.

The basic setup is a dual x86_64 box with 8 GB of memory. The DL380 has a HW RAID5, made from 4x72GB disks and about 100 MB write cache. The performance of the block device with O_DIRECT is about 90 MB/sec.

The problematic behaviour comes when we are moving large files through the system. The file usage in this case is mostly "use once" or streaming. As soon as the amount of file data is larger than 7.5 GB, we see occasional unresponsiveness of the system (e.g. no more ssh connections into the box) of more than 1 or 2 minutes (!) duration (kernels up to 2.6.19). Load goes up, mainly due to pdflush threads and some other poor guys being in "D" state.

The data flows in basically three modes. All of them are affected:

local-disk -> NFS
NFS -> local-disk
NFS -> NFS

NFS is V3/TCP.

So, I made a few experiments in the last few days, using three different kernels: 2.6.22.5, 2.6.22.5+cfs20.4 and 2.6.22.5+bdi-v9.

The first observation (independent of the kernel) is that we *should* use O_DIRECT, at least for output to the local disk. Here we see about 90 MB/sec write performance. A simple "dd" using 1, 2 and 3 parallel threads to the same block device (through an ext2 FS) gives:

O_DIRECT: 88 MB/s, 2x44, 3x29.5
non-O_DIRECT: 51 MB/s, 2x19, 3x12.5

- Observation 1a: IO schedulers are mostly equivalent, with CFQ slightly worse than AS and DEADLINE
- Observation 1b: when using 2.6.22.5+cfs20.4, the non-O_DIRECT performance goes [slightly] down. With three threads it is 3x10 MB/s. Ingo?
- Observation 1c: bdi-v9 does not help in this case, which is not surprising.

The real question here is why the non-O_DIRECT case is so slow. Is this a general thing? Is this related to the CCISS controller? Using O_DIRECT is unfortunately not an option for us.

When using three different targets (local disk plus two different NFS filesystems), bdi-v9 is a big winner. Without it, all threads are [seem to be] limited to the speed of the slowest FS. With bdi-v9 we see a considerable speedup.

Just by chance I found out that doing all I/O in sync-mode does prevent the load from going up. Of course, I/O throughput is not stellar (but not much worse than the non-O_DIRECT case). But the responsiveness seems OK. Maybe a solution, as this can be controlled via mount (would be great for O_DIRECT :-).

In general 2.6.22 seems to be better than 2.6.19, but this is highly subjective :-( I am using the following settings in /proc. They seem to provide the smoothest responsiveness:

vm.dirty_background_ratio = 1
vm.dirty_ratio = 1
vm.swappiness = 1
vm.vfs_cache_pressure = 1

Another thing I saw during my tests is that when writing to NFS, the "dirty" or "nr_dirty" numbers are always 0. Is this a conceptual thing, or a bug?

In any case, view this as a report for one specific load case that does not behave very well. It seems there are ways to make things better (sync, per-device throttling, ...), but nothing "perfect" yet. "Use once" does seem to be a problem.

Cheers
Martin
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
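To put those settings in perspective (my note, not in the original mail): on the 8 GB box described above, a dirty ratio of 1% caps dirty memory in the same ballpark as the controller's 100 MB write cache. The kernel actually computes the ratio against dirtyable memory rather than total RAM, so the real threshold is somewhat lower; this is only a rough sketch:

```shell
#!/bin/sh
# Approximate dirty-page thresholds implied by the /proc settings above,
# on the 8 GB box described in this report. (Ballpark only: the kernel
# uses dirtyable memory, not total RAM, as the base.)
mem_mb=8192          # total RAM in MB
dirty_ratio=1        # vm.dirty_ratio (percent)
bg_ratio=1           # vm.dirty_background_ratio (percent)

dirty_mb=$(( mem_mb * dirty_ratio / 100 ))
bg_mb=$(( mem_mb * bg_ratio / 100 ))

echo "dirty_ratio threshold:            ${dirty_mb} MB"
echo "dirty_background_ratio threshold: ${bg_mb} MB"
# -> both come out at ~81 MB, just under the 100 MB controller write cache.
```

Which may be why dirty_ratio=1 feels smoothest here: writeback is forced before much more than one controller-cache-worth of dirty data can pile up.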
RE: [PATCH 00/23] per device dirty throttling -v9
--- Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> On Thu, 2007-08-23 at 08:59 -0700, Martin Knoblauch wrote:
> > --- Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> >
> > > On Thu, 2007-08-16 at 05:49 -0700, Martin Knoblauch wrote:
> > >
> > > > Peter,
> > > >
> > > > any chance to get a rollup against 2.6.22-stable?
> > > >
> > > > The 2.6.23 series may not be usable for me due to the
> > > > nosharedcache changes for NFS (the new default will massively
> > > > disturb the user-space automounter).
> > >
> > > I'll see what I can do, bit busy with other stuff atm, hopefully
> > > after the weekend.
> >
> > Hi Peter,
> >
> > any progress on a version against 2.6.22.5? I have seen the very
> > positive report from Jeffrey W. Baker and would really love to test
> > your patch. But as I said, anything newer than 2.6.22.x might not be an
> > option due to the NFS changes.
>
> mindless port, seems to compile and boot on my test box ymmv.
>
> I think .5 should not present anything other than trivial rejects if
> anything. But I'm not keeping -stable in my git remotes so I can't say
> for sure.

Hi Peter,

thanks a lot. It applies to 2.6.22.5 almost cleanly, with just one 8-line offset in readahead.c. I will report testing-results separately.

Thanks
Martin
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
RE: [PATCH 00/23] per device dirty throttling -v9
--- Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> On Thu, 2007-08-16 at 05:49 -0700, Martin Knoblauch wrote:
> > Peter,
> >
> > any chance to get a rollup against 2.6.22-stable?
> >
> > The 2.6.23 series may not be usable for me due to the
> > nosharedcache changes for NFS (the new default will massively
> > disturb the user-space automounter).
>
> I'll see what I can do, bit busy with other stuff atm, hopefully
> after the weekend.

Hi Peter,

any progress on a version against 2.6.22.5? I have seen the very
positive report from Jeffrey W. Baker and would really love to test
your patch. But as I said, anything newer than 2.6.22.x might not be
an option due to the NFS changes.

Kind regards
Martin
RE: [PATCH 00/23] per device dirty throttling -v9
--- Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> On Thu, 2007-08-16 at 05:49 -0700, Martin Knoblauch wrote:
> > Peter,
> >
> > any chance to get a rollup against 2.6.22-stable?
> >
> > The 2.6.23 series may not be usable for me due to the
> > nosharedcache changes for NFS (the new default will massively
> > disturb the user-space automounter).
>
> I'll see what I can do, bit busy with other stuff atm, hopefully
> after the weekend.

Hi Peter,

that would be highly appreciated. Thanks a lot in advance.

Martin
RE: [PATCH 00/23] per device dirty throttling -v9
>Per device dirty throttling patches
>
>These patches aim to improve balance_dirty_pages() and directly
>address three issues:
> 1) inter-device starvation
> 2) stacked device deadlocks
> 3) inter-process starvation
>
>1 and 2 are a direct result of removing the global dirty
>limit and using per-device dirty limits. By giving each device
>its own dirty limit, it will no longer starve another device,
>and the cyclic dependency on the dirty limit is broken.
>
>In order to efficiently distribute the dirty limit across
>the independent devices a floating proportion is used; this
>allocates a share of the total limit proportional to the
>device's recent activity.
>
>3 is done by also scaling the dirty limit proportional to the
>current task's recent dirty rate.
>
>Changes since -v8:
>- cleanup of the proportion code
>- fix percpu_counter_add(counter, -(unsigned long))
>- fix per-task dirty rate code
>- fwd port to .23-rc2-mm2

Peter,

any chance to get a rollup against 2.6.22-stable?

The 2.6.23 series may not be usable for me due to the
nosharedcache changes for NFS (the new default will massively
disturb the user-space automounter).

Cheers
Martin
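[Editor's note: a toy sketch of the "floating proportion" idea described in the quoted text — this is illustrative user-space Python, not Peter's kernel implementation. Each device earns a share of the global dirty limit proportional to its recent write activity, tracked as an exponentially decaying count:]

```python
# Toy model: per-device dirty limits as floating proportions of a
# global limit, so a slow device (e.g. a USB stick) only ever claims
# a share proportional to its own recent writeback activity and can
# no longer pin the global limit and starve faster devices.

DECAY = 0.5  # halve remembered activity each accounting period

class Device:
    def __init__(self, name):
        self.name = name
        self.activity = 0.0  # decayed count of recently written pages

    def account_write(self, pages):
        self.activity += pages

def end_period(devices):
    # age everyone's history so the shares "float" with recent load
    for d in devices:
        d.activity *= DECAY

def dirty_limit(dev, devices, global_limit):
    total = sum(d.activity for d in devices)
    if total == 0:
        return global_limit / len(devices)
    return global_limit * dev.activity / total

disk = Device("sda")
stick = Device("sdb")
disk.account_write(900)   # busy local disk
stick.account_write(100)  # slow USB stick
print(dirty_limit(disk, [disk, stick], 1000))   # 900.0
print(dirty_limit(stick, [disk, stick], 1000))  # 100.0
```

The real patches track the proportion with per-cpu counters and shifts rather than floats, but the allocation principle is the same.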
Re: [PATCH 00/17] per device dirty throttling -v7
Miklos Szeredi wrote:
>> Latest version of the per bdi dirty throttling patches.
>>
>> Most of the changes since last time are little cleanups and more
>> detail in the split out of the floating proportion into their
>> own little lib.
>>
>> Patches are against 2.6.22-rc4-mm2
>>
>> A rollup of all this against 2.6.21 is available here:
>> http://programming.kicks-ass.net/kernel-patches/balance_dirty_pages/2.6.21-per_bdi_dirty_pages.patch
>>
>> This patch-set passes the "starve a USB stick" test..
>
>I've done some testing of several problem cases.

Just curious - what are the plans towards inclusion in mainline?

Cheers
Martin
Re: Understanding I/O behaviour
--- Jesper Juhl <[EMAIL PROTECTED]> wrote:
> On 05/07/07, Jesper Juhl <[EMAIL PROTECTED]> wrote:
> > On 05/07/07, Martin Knoblauch <[EMAIL PROTECTED]> wrote:
> > > Hi,
> >
> > I'd suspect you can't get both at 100%.
> >
> > I'd guess you are probably using a 100Hz no-preempt kernel. Have you
> > tried a 1000Hz + preempt kernel? Sure, you'll get a bit lower
> > overall throughput, but interactive responsiveness should be better -
> > if it is, then you could experiment with various combinations of
> > CONFIG_PREEMPT, CONFIG_PREEMPT_VOLUNTARY, CONFIG_PREEMPT_NONE and
> > CONFIG_HZ_1000, CONFIG_HZ_300, CONFIG_HZ_250, CONFIG_HZ_100 to see
> > what gives you the best balance between throughput and interactive
> > responsiveness (you could also throw in CONFIG_PREEMPT_BKL and/or
> > CONFIG_NO_HZ, but I don't think the impact will be as significant as
> > with the other options, so to keep things simple I'd leave those out
> > at first).
> >
> > I'd guess that something like CONFIG_PREEMPT_VOLUNTARY + CONFIG_HZ_300
> > would probably be a good compromise for you, but just to see if
> > there's any effect at all, start out with CONFIG_PREEMPT +
> > CONFIG_HZ_1000.
>
> I'm curious, did you ever try playing around with CONFIG_PREEMPT* and
> CONFIG_HZ* to see if that had any noticeable impact on interactive
> performance and stuff like logging into the box via ssh etc...?
>
> --
> Jesper Juhl <[EMAIL PROTECTED]>
> Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
> Plain text mails only, please http://www.expita.com/nomime.html

Hi Jesper,

my initial kernel was [EMAIL PROTECTED] I have switched to 300HZ, but
have not observed much difference.

The config is now:

config-2.6.22-rc7:# CONFIG_PREEMPT_NONE is not set
config-2.6.22-rc7:CONFIG_PREEMPT_VOLUNTARY=y
config-2.6.22-rc7:# CONFIG_PREEMPT is not set
config-2.6.22-rc7:CONFIG_PREEMPT_BKL=y

Cheers
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
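[Editor's note: a hypothetical helper, not from the thread, that classifies which preemption model a kernel .config selects — fed here with the four lines quoted above. In .config syntax, `# CONFIG_FOO is not set` means the option is disabled:]

```python
# Hypothetical helper: report the preemption model chosen in a kernel
# .config fragment. Only "CONFIG_FOO=y" lines count as enabled.
def preempt_model(config_text):
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        if line.endswith("=y"):
            enabled.add(line.split("=")[0])
    if "CONFIG_PREEMPT" in enabled:
        return "full preemption"
    if "CONFIG_PREEMPT_VOLUNTARY" in enabled:
        return "voluntary preemption"
    return "no forced preemption"

cfg = """\
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_BKL=y
"""
print(preempt_model(cfg))  # voluntary preemption
```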
Re: Understanding I/O behaviour
--- Daniel J Blueman <[EMAIL PROTECTED]> wrote:
> On 5 Jul, 16:50, Martin Knoblauch <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > for a customer we are operating a rackful of HP/DL380/G4 boxes that
> > have given us some problems with system responsiveness under [I/O
> > triggered] system load.
> [snip]
>
> IIRC, the locking in the CCISS driver was pretty heavy until later in
> the 2.6 series (2.6.16?) kernels; I don't think they were backported
> to the 1000 or so patches that comprise RH EL 4 kernels.
>
> With write performance being really poor on the SmartArray controllers
> without the battery-backed write cache, and with less-good locking,
> performance can really suck.
>
> On a totally quiescent HP DL380 G2 (dual PIII, 1.13GHz Tualatin 512KB
> L2$) running RH EL 5 (2.6.18) with a 32MB SmartArray 5i controller
> with 6x36GB 10K RPM SCSI disks and all latest firmware:
>
> # dd if=/dev/cciss/c0d0p2 of=/dev/zero bs=1024k count=1000
> 509+1 records in
> 509+1 records out
> 534643200 bytes (535 MB) copied, 11.6336 seconds, 46.0 MB/s
>
> # dd if=/dev/zero of=/dev/cciss/c0d0p2 bs=1024k count=100
> 100+0 records in
> 100+0 records out
> 104857600 bytes (105 MB) copied, 22.3091 seconds, 4.7 MB/s
>
> Oh dear! There are internal performance problems with this controller.
> The SmartArray 5i in the newer DL380 G3 (dual P4 2.8GHz, 512KB L2$) is
> perhaps twice the read performance (PCI-X helps some) but still sucks.
>
> I'd get the BBWC in or install another controller.

Hi Daniel,

thanks for the suggestion. The DL380 G4 boxes have the "6i" and all
systems are equipped with the BBWC (192 MB, split 50/50). The thing is
not really a speed demon, but sufficient for the task. The problem
really seems to be related to the VM system not writing out dirty
pages early enough and then getting into trouble when the pressure
gets too high.

Cheers
Martin
Re: Understanding I/O behaviour
Brice Figureau wrote:
>> CFQ gives less (about 10-15%) throughput except for the kernel with
>> the cfs cpu scheduler, where CFQ is on par with the other IO
>> schedulers.
>
>Please have a look at kernel bug #7372:
>http://bugzilla.kernel.org/show_bug.cgi?id=7372
>
>It seems I encountered almost the same issue.
>
>The fix on my side, besides running 2.6.17 (which was working fine
>for me) was to:
>1) have /proc/sys/vm/vfs_cache_pressure=1
>2) have /proc/sys/vm/dirty_ratio=1 and
>   /proc/sys/vm/dirty_background_ratio=1
>3) have /proc/sys/vm/swappiness=2
>4) run Peter Zijlstra's per-device dirty throttling patch on
>   top of 2.6.21.5:
>http://www.ussg.iu.edu/hypermail/linux/kernel/0706.1/2776.html

Brice,

any of them sufficient, or all together needed? Just to avoid
confusion.

Cheers
Martin
Re: Understanding I/O behaviour
Martin Knoblauch wrote:
>--- Robert Hancock <[EMAIL PROTECTED]> wrote:
>>
>> Try playing with reducing /proc/sys/vm/dirty_ratio and see how that
>> helps. This workload will fill up memory with dirty data very
>> quickly, and it seems like system responsiveness often goes down the
>> toilet when this happens and the system is going crazy trying to
>> write it all out.
>
>Definitely the "going crazy" part is the worst problem I see with 2.6
>based kernels (late 2.4 was really better in this corner case).
>
>I am just now playing with dirty_ratio. Anybody know what the lower
>limit is? "0" seems acceptable, but does it actually imply "write out
>immediately"?
>
>Another problem: the VM parameters are not really well documented in
>their behaviour and interdependence.

Lowering dirty_ratio just leads to more imbalanced write speed for the
three dd's. Even when lowering the number to 0, the high load stays.

Now, in another experiment I mounted the FS with "sync". And now the
load stays below/around 3. No more "pdflush" daemons going wild. And
the responsiveness is good, with no drops.

My question is now: is there a parameter that one can use to force
immediate writeout for every process? This may hurt overall
performance of the system, but might really help my situation. Setting
dirty_ratio to 0 does not seem to do it.

Cheers
Martin
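[Editor's note: there is no single /proc knob for forcing synchronous writeout system-wide in these kernels; per process, the usual mechanism is O_SYNC at open time (or an explicit fsync()). A minimal sketch, assuming a Linux host and a hypothetical scratch path:]

```python
import os
import tempfile

# Sketch: force synchronous writeout for one process's writes by
# opening with O_SYNC. Each write() then returns only after the data
# has been pushed toward the device, instead of merely dirtying page
# cache -- the per-process analogue of mounting the FS with "sync".
path = os.path.join(tempfile.mkdtemp(), "scratch.dat")  # hypothetical path
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o600)
try:
    os.write(fd, b"x" * 4096)  # completes after writeback, not just dirtying
finally:
    os.close(fd)
print(os.path.getsize(path))  # 4096
```

An explicit os.fsync(fd) after ordinary buffered writes achieves the same durability at chosen points, without paying the O_SYNC cost on every write().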
Re: Understanding I/O behaviour
>> b) any ideas how to optimize the settings of the /proc/sys/vm/
>> parameters? The documentation is a bit thin here.
>
>I can't offer any advice there, but is raid-5 really the best choice
>for your needs? I would not choose raid-5 for a system that is
>regularly performing lots of large writes at the same time; don't
>forget that each write can require several reads to recalculate the
>parity.
>
>Does the raid card have much cache ram?

192 MB, split 50/50 between read and write.

>If you can afford to lose some space raid-10 would probably perform
>better.

RAID5 most likely is not the best solution and I would not use it if
the described use-case was happening all the time. It happens a few
times a day, and then things go down when all memory is filled with
page-cache.

The same also happens when copying large amounts of data from one
NFS-mounted FS to another NFS-mounted FS. No disk involved there.
Memory fills with page-cache until it reaches a ceiling and then for
some time responsiveness is really, really bad.

I am just now playing with the dirty_* stuff. Maybe it helps.

Cheers
Martin
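[Editor's note: a toy back-of-the-envelope illustration of the RAID-5 small-write penalty mentioned above — not a model of the actual controller, just the classic read-modify-write arithmetic:]

```python
# Classic RAID-5 read-modify-write: updating one data chunk costs
#   read(old data) + read(old parity) + write(new data) + write(new parity)
# = 4 disk ops, versus 2 ops (data + mirror) for a RAID-10 small write.
def raid5_small_write_ops(chunks_updated):
    return 4 * chunks_updated

def raid10_small_write_ops(chunks_updated):
    return 2 * chunks_updated

print(raid5_small_write_ops(100))   # 400
print(raid10_small_write_ops(100))  # 200
```

A battery-backed write cache hides much of this by coalescing the parity updates, which is why the BBWC matters so much on these controllers.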
Re: Understanding I/O behaviour
--- Robert Hancock <[EMAIL PROTECTED]> wrote:
>
> Try playing with reducing /proc/sys/vm/dirty_ratio and see how that
> helps. This workload will fill up memory with dirty data very
> quickly, and it seems like system responsiveness often goes down the
> toilet when this happens and the system is going crazy trying to
> write it all out.

Definitely the "going crazy" part is the worst problem I see with 2.6
based kernels (late 2.4 was really better in this corner case).

I am just now playing with dirty_ratio. Anybody know what the lower
limit is? "0" seems acceptable, but does it actually imply "write out
immediately"?

Another problem: the VM parameters are not really well documented in
their behaviour and interdependence.

Cheers
Martin
Re: Understanding I/O behaviour
--- Jesper Juhl <[EMAIL PROTECTED]> wrote:
> On 06/07/07, Robert Hancock <[EMAIL PROTECTED]> wrote:
> [snip]
> >
> > Try playing with reducing /proc/sys/vm/dirty_ratio and see how that
> > helps. This workload will fill up memory with dirty data very
> > quickly, and it seems like system responsiveness often goes down
> > the toilet when this happens and the system is going crazy trying
> > to write it all out.
>
> Perhaps trying out a different elevator would also be worthwhile.

AS seems to be the best one (NOOP and Deadline seem to be equally OK).
CFQ gives less (about 10-15%) throughput, except for the kernel with
the cfs cpu scheduler, where CFQ is on par with the other IO
schedulers.

Thanks
Martin