Re: writeout stalls in current -git
On Tue, Nov 20, 2007 at 02:16:17PM +0100, Damien Wyart wrote:
> Hello,
>
> > > > Ok, so it's not synchronous writes that we are doing - we're just
> > > > submitting bio's tagged as WRITE_SYNC to get the I/O issued quickly.
> > > > The "synchronous" nature appears to be coming from higher-level
> > > > locking when reclaiming inodes (on the flush lock). It appears that
> > > > inode write clustering is failing completely, so we are writing the
> > > > same block multiple times, i.e. once for each inode in the cluster we
> > > > have to write.
>
> > > Works for me. The only remaining stalls are sub-second and look
> > > completely valid, considering the number of files being removed.
>
> > > Tested-by: Torsten Kaiser <[EMAIL PROTECTED]>
>
> * David Chinner <[EMAIL PROTECTED]> [2007-11-08 11:38]:
> > Great - thanks for reporting the problem and testing the fix.
>
> This patch has not yet made its way into 2.6.24 (rc3). Is it intended?
> Maybe the fix can wait for 2.6.25, but I wanted to make sure...

The patch is in the XFS dev tree being QA'd, and we will push it to
2.6.24-rcX in the next few days.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: writeout stalls in current -git
Hello,

> > > Ok, so it's not synchronous writes that we are doing - we're just
> > > submitting bio's tagged as WRITE_SYNC to get the I/O issued quickly.
> > > The "synchronous" nature appears to be coming from higher-level
> > > locking when reclaiming inodes (on the flush lock). It appears that
> > > inode write clustering is failing completely, so we are writing the
> > > same block multiple times, i.e. once for each inode in the cluster we
> > > have to write.

> > Works for me. The only remaining stalls are sub-second and look
> > completely valid, considering the number of files being removed.

> > Tested-by: Torsten Kaiser <[EMAIL PROTECTED]>

* David Chinner <[EMAIL PROTECTED]> [2007-11-08 11:38]:
> Great - thanks for reporting the problem and testing the fix.

This patch has not yet made its way into 2.6.24 (rc3). Is it intended?
Maybe the fix can wait for 2.6.25, but I wanted to make sure...

--
Damien Wyart
Re: writeout stalls in current -git
On Wed, Nov 07, 2007 at 08:15:06AM +0100, Torsten Kaiser wrote:
> On 11/7/07, David Chinner <[EMAIL PROTECTED]> wrote:
> > Ok, so it's not synchronous writes that we are doing - we're just
> > submitting bio's tagged as WRITE_SYNC to get the I/O issued quickly.
> > The "synchronous" nature appears to be coming from higher-level
> > locking when reclaiming inodes (on the flush lock). It appears that
> > inode write clustering is failing completely, so we are writing the
> > same block multiple times, i.e. once for each inode in the cluster we
> > have to write.
>
> Works for me. The only remaining stalls are sub-second and look
> completely valid, considering the number of files being removed.
>
> Tested-by: Torsten Kaiser <[EMAIL PROTECTED]>

Great - thanks for reporting the problem and testing the fix.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: writeout stalls in current -git
On 11/7/07, David Chinner <[EMAIL PROTECTED]> wrote:
> Ok, so it's not synchronous writes that we are doing - we're just
> submitting bio's tagged as WRITE_SYNC to get the I/O issued quickly.
> The "synchronous" nature appears to be coming from higher-level
> locking when reclaiming inodes (on the flush lock). It appears that
> inode write clustering is failing completely, so we are writing the
> same block multiple times, i.e. once for each inode in the cluster we
> have to write.

Works for me. The only remaining stalls are sub-second and look
completely valid, considering the number of files being removed.

iostat 10 from this test:

 3 0 0 3500192332 20495600 105 8512 1809 6473 6 10 83 1
 0 0 0 3500200332 20457600 0 4367 1355 3712 2 6 92 0
 2 0 0 3504264332 20352800 0 6805 1912 4967 4 8 88 0
 0 0 0 3511632332 20352800 0 2843 805 1791 2 4 94 0
 0 0 0 3516852332 20351600 0 3375 879 2712 3 5 93 0
 0 0 0 3530544332 20266800 186 776 488 1152 4 2 89 4
 0 0 0 3574788332 20496000 226 326 358 787 0 1 98 0
 0 0 0 3576820332 20496000 0 376 332 737 0 0 99 0
 0 0 0 3578432332 20496000 0 356 293 606 1 1 99 0
 0 0 0 3580192332 20496000 0 101 104 384 0 0 99 0

I'm pleased to note that this is now much faster again. Thanks!

Tested-by: Torsten Kaiser <[EMAIL PROTECTED]>

CC's please note: It looks like this was really a different problem
than the 100% iowait that was seen with reiserfs. Also, the one
complete stall I have seen is probably something else. But I have not
been able to reproduce this again with -mm and have never seen this on
mainline, so I will just ignore that single event until I see it again.

Torsten

> ---
>  fs/xfs/xfs_iget.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: 2.6.x-xfs-new/fs/xfs/xfs_iget.c
> ===================================================================
> --- 2.6.x-xfs-new.orig/fs/xfs/xfs_iget.c	2007-11-02 13:44:46.0 +1100
> +++ 2.6.x-xfs-new/fs/xfs/xfs_iget.c	2007-11-07 13:08:42.534440675 +1100
> @@ -248,7 +248,7 @@ finish_inode:
>  	icl = NULL;
>  	if (radix_tree_gang_lookup(&pag->pag_ici_root, (void**)&iq,
>  				first_index, 1)) {
> -		if ((iq->i_ino & mask) == first_index)
> +		if ((XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) == first_index)
>  			icl = iq->i_cluster;
>  	}
Re: writeout stalls in current -git
On Wed, Nov 07, 2007 at 10:31:14AM +1100, David Chinner wrote:
> On Tue, Nov 06, 2007 at 10:53:25PM +0100, Torsten Kaiser wrote:
> > On 11/6/07, David Chinner <[EMAIL PROTECTED]> wrote:
> > > Rather than vmstat, can you use something like iostat to show how busy
> > > your disks are? i.e. are we seeing RMW cycles in the raid5 or some such
> > > issue.
> >
> > Both "vmstat 10" and "iostat -x 10" output from this test:
> > procs ---memory-- ---swap-- -io -system-- cpu
> > r b swpd free buff cache si so bi bo in cs us sy id wa
> > 2 0 0 3700592 0 85424003183 108 244 2 1 95 1
> > -> emerge reads something, don't know for sure what...
> > 1 0 0 3665352 0 8794000 239 2 343 585 2 1 97 0
> >
> > The last 20% of the btrace looks more or less completely like this; no
> > other programs do any IO...
> >
> > 253,03 104626 526.293450729 974 C WS 79344288 + 8 [0]
> > 253,03 104627 526.293455078 974 C WS 79344296 + 8 [0]
> > 253,0136469 444.513863133 1068 Q WS 154998480 + 8 [xfssyncd]
> > 253,0136470 444.513863135 1068 Q WS 154998488 + 8 [xfssyncd]
>                   ^^
> Apparently we are doing synchronous writes. That would explain why
> it is slow. We shouldn't be doing synchronous writes here. I'll see if
> I can reproduce this.
>
> ....
>
> Yes, I can reproduce the sync writes coming out of xfssyncd. I'll
> look into this further and send a patch when I have something concrete.

Ok, so it's not synchronous writes that we are doing - we're just
submitting bio's tagged as WRITE_SYNC to get the I/O issued quickly.
The "synchronous" nature appears to be coming from higher-level
locking when reclaiming inodes (on the flush lock). It appears that
inode write clustering is failing completely, so we are writing the
same block multiple times, i.e. once for each inode in the cluster we
have to write.

This must be a side effect of some other change, as we haven't changed
anything in the reclaim code recently.

/me scurries off to run some tests

Indeed it is. The patch below should fix the problem - the inode
clusters weren't getting set up properly when inodes were being read
in or allocated. This is a regression, introduced by this mod:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=da353b0d64e070ae7c5342a0d56ec20ae9ef5cfb

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

---
 fs/xfs/xfs_iget.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: 2.6.x-xfs-new/fs/xfs/xfs_iget.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_iget.c	2007-11-02 13:44:46.0 +1100
+++ 2.6.x-xfs-new/fs/xfs/xfs_iget.c	2007-11-07 13:08:42.534440675 +1100
@@ -248,7 +248,7 @@ finish_inode:
 	icl = NULL;
 	if (radix_tree_gang_lookup(&pag->pag_ici_root, (void**)&iq,
 				first_index, 1)) {
-		if ((iq->i_ino & mask) == first_index)
+		if ((XFS_INO_TO_AGINO(mp, iq->i_ino) & mask) == first_index)
 			icl = iq->i_cluster;
 	}
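To make the one-line fix above concrete, here is a small userspace
sketch of the comparison it changes. The 64-inode cluster mask and the
32-bit AG split used here are illustrative assumptions, not the real
XFS geometry macros:

#include <stdio.h>
#include <stdint.h>

/* Illustrative stand-in: assume the absolute inode number packs the
 * AG number above bit 32 and the AG-relative inode number (agino)
 * in the low 32 bits. */
static uint32_t ino_to_agino(uint64_t ino)
{
        return (uint32_t)ino;
}

int main(void)
{
        uint64_t ino = ((uint64_t)1 << 32) | 0x1234;    /* an inode in AG 1 */
        uint64_t mask = ~(uint64_t)(64 - 1);            /* 64-inode cluster mask */
        uint64_t first_index = ino_to_agino(ino) & mask;

        /* Old check: the absolute inode number still carries the AG
         * bits, so for any AG > 0 it can never equal the AG-relative
         * first_index -- the cluster lookup always misses. */
        printf("old: %s\n",
               ((ino & mask) == first_index) ? "match" : "miss");

        /* Fixed check, as in the patch: strip the AG bits first. */
        printf("new: %s\n",
               ((ino_to_agino(ino) & mask) == first_index) ? "match" : "miss");
        return 0;
}

With the old comparison the lookup can only match in AG 0; everywhere
else icl stays NULL and every inode gets flushed - and its cluster
buffer written - on its own, which is exactly the repeated-block
writeout described above.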
Re: writeout stalls in current -git
On Tue, Nov 06, 2007 at 10:53:25PM +0100, Torsten Kaiser wrote:
> On 11/6/07, David Chinner <[EMAIL PROTECTED]> wrote:
> > Rather than vmstat, can you use something like iostat to show how busy
> > your disks are? i.e. are we seeing RMW cycles in the raid5 or some such
> > issue.
>
> Both "vmstat 10" and "iostat -x 10" output from this test:
> procs ---memory-- ---swap-- -io -system-- cpu
> r b swpd free buff cache si so bi bo in cs us sy id wa
> 2 0 0 3700592 0 85424003183 108 244 2 1 95 1
> -> emerge reads something, don't know for sure what...
> 1 0 0 3665352 0 8794000 239 2 343 585 2 1 97 0
>
> The last 20% of the btrace looks more or less completely like this; no
> other programs do any IO...
>
> 253,03 104626 526.293450729 974 C WS 79344288 + 8 [0]
> 253,03 104627 526.293455078 974 C WS 79344296 + 8 [0]
> 253,0136469 444.513863133 1068 Q WS 154998480 + 8 [xfssyncd]
> 253,0136470 444.513863135 1068 Q WS 154998488 + 8 [xfssyncd]
                 ^^
Apparently we are doing synchronous writes. That would explain why
it is slow. We shouldn't be doing synchronous writes here. I'll see if
I can reproduce this.

....

Yes, I can reproduce the sync writes coming out of xfssyncd. I'll
look into this further and send a patch when I have something concrete.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
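A side note on reading the trace: the "WS" in the btrace lines above
is the block layer's mark for a bio submitted as a synchronous write.
A minimal sketch of the 2.6.23-era submission call (the wrapper
function here is hypothetical, not the actual XFS path):

#include <linux/bio.h>
#include <linux/fs.h>

/*
 * Sketch only (2.6.23-era API): WRITE_SYNC tags the bio so the block
 * layer unplugs and issues it immediately -- that is what blkparse
 * renders as "WS".  submit_bio() still returns without waiting for
 * completion.
 */
static void submit_write(struct bio *bio, int sync)
{
        submit_bio(sync ? WRITE_SYNC : WRITE, bio);
}

Note that WRITE_SYNC only expedites issue of the I/O; it does not make
the submitter wait, which matches Dave's later finding that the actual
blocking comes from inode-reclaim locking rather than the writes
themselves.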
Re: writeout stalls in current -git
On 11/6/07, Fengguang Wu <[EMAIL PROTECTED]> wrote:
> --
> Subject: writeback: remove pages_skipped accounting in __block_write_full_page()
> From: Fengguang Wu <[EMAIL PROTECTED]>
>
> Miklos Szeredi <[EMAIL PROTECTED]> and me identified a writeback bug:
[snip]
>  fs/buffer.c                 |    1 -
>  fs/xfs/linux-2.6/xfs_aops.c |    5 ++---
>  2 files changed, 2 insertions(+), 4 deletions(-)

I have now tested v2.6.24-rc1-748-g2655e2c with the above patch
reverted. It still stalls.

On 11/6/07, David Chinner <[EMAIL PROTECTED]> wrote:
> Rather than vmstat, can you use something like iostat to show how busy
> your disks are? i.e. are we seeing RMW cycles in the raid5 or some such
> issue.

Both "vmstat 10" and "iostat -x 10" output from this test:

procs ---memory-- ---swap-- -io -system-- cpu
 r b swpd free buff cache si so bi bo in cs us sy id wa
 2 0 0 3700592 0 85424003183 108 244 2 1 95 1
-> emerge reads something, don't know for sure what...
 1 0 0 3665352 0 8794000 239 2 343 585 2 1 97 0
 0 0 0 3657728 0 9122800 32235 445 833 0 0 99 0
 1 0 0 3653136 0 9469200 33033 455 844 1 1 98 0
 0 0 0 3646836 0 9772000 289 3 422 751 1 1 98 0
 0 0 0 3616468 0 9969200 18533 399 614 9 3 87 1
-> starts to remove the kernel tree
 0 0 0 3610452 0 10259200 138 3598 1398 3945 3 6 90 1
 0 0 0 3607136 0 10454800 2 5962 1919 6070 4 9 87 0
 0 0 0 3606636 0 10508000 0 1539 810 2200 1 2 97 0
-> first stall 28 sec.
 0 0 0 3606592 0 10529200 0 698 679 1390 0 1 99 0
 0 0 0 3606440 0 10553200 0 658 690 1457 0 0 99 0
 0 0 0 3606068 0 10612800 1 1780 947 1982 1 3 96 0
-> second stall 24 sec.
 0 0 0 3606036 0 10646400 4 858 758 1457 0 1 98 0
 0 0 0 3605380 0 10687200 0 1173 807 1880 1 2 97 0
 0 0 0 3605000 0 10774800 1 2413 1103 2996 2 4 94 0
-> third stall 38 sec.
 0 0 0 3604488 0 1084720045 897 748 1577 0 1 98 0
 0 0 0 3604176 0 10876400 0 824 752 1700 0 1 98 0
 0 0 0 3604012 0 10898800 0 660 643 1237 0 1 99 0
 0 0 0 3608936 0 11012000 1 3490 1232 3455 3 5 91 0
-> fourth stall 64 sec.
 1 0 0 3609060 0 11029600 0 568 669 1222 0 1 99 0
 0 0 0 3609464 0 11049600 0 604 638 1366 0 1 99 0
 0 0 0 3609244 0 11074000 0 844 714 1282 0 1 99 0
 0 0 0 3609508 0 11091200 0 552 584 1185 1 1 99 0
 2 0 0 3609436 0 3200 0 658 643 1442 0 1 99 0
 0 0 0 3609212 0 11134800 0 714 637 1382 0 0 99 0
 0 0 0 3619132 0 11049200 130 1086 736 1870 4 3 91 2
 0 0 0 3657016 0 11549600 466 589 718 1367 1 1 98 0
-> emerge finishes; dirty data was <1Mb the whole time, stays now below 300kb (btrace running...)
 0 0 0 3657844 0 11566000 0 564 635 1226 1 1 99 0
 0 0 0 3658236 0 11584000 0 582 600 1248 1 0 99 0
 0 0 0 3658296 0 11601200 0 566 606 1232 1 1 99 0
 0 0 0 3657924 0 11621200 0 688 596 1321 1 0 99 0
 0 0 0 3658252 0 11641600 0 631 642 1356 1 0 98 0
 0 0 0 3658184 0 11659200 0 566 575 1273 0 0 99 0
procs ---memory-- ---swap-- -io -system-- cpu
 r b swpd free buff cache si so bi bo in cs us sy id wa
 2 0 0 3658344 0 11677200 0 649 606 1301 0 0 99 0
 0 0 0 3658548 0 11697600 0 617 624 1345 0 0 99 0
 0 0 0 3659204 0 11716000 0 550 576 1223 1 1 99 0
 0 0 0 3659944 0 11734400 0 620 583 1272 0 0 99 0
 0 0 0 3660548 0 11754000 0 605 611 1338 0 0 99 0
 0 0 0 3661236 0 11773200 0 582 569 1275 0 0 99 0
 0 0 0 3662420 0 11788800 0 590 571 1157 0 0 99 0
 0 0 0 3664324 0 11806800 0 566 553 1222 1 1 99 0
 0 0 0 3665240 0 11816800 0 401 574 862 0 0 99 0
 0 0 0 3666984 0 11828000 0 454 574 958 1 1 99 0
 0 0 0 3668664 0 11840000 0 396 559 946 0 0 99 0
 0 0 0 3670628 0 11849600 0 296 495 784 0 0 99 0
 0 0 0 3671316 0 118496
Re: writeout stalls in current -git
On 11/6/07, Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> On Tue, 2007-11-06 at 15:25 +1100, David Chinner wrote:
>
> > I'm struggling to understand what possibly changed in XFS or writeback
> > that would lead to stalls like this, esp. as you appear to be removing
> > files when the stalls occur.
>
> Just a crazy idea,..
>
> Could there be a set_page_dirty() that doesn't have a
> balance_dirty_pages() call near? For example, modifying metadata in
> unlink?
>
> Such a situation could lead to an excess of dirty pages, and the next
> call to balance_dirty_pages() would appear to stall, as it would
> desperately try to get below the limit again.

Only if accounting of the dirty pages is also broken. In the unmerge
testcase I see most of the time only <200kb of dirty data in
/proc/meminfo.

The system has 4Gb of RAM, so I'm not sure if it should ever be valid
to stall even the emerge/install testcase.

Torsten

Now building a kernel with the skipped-pages-accounting patch reverted...
Re: writeout stalls in current -git
On Tue, 2007-11-06 at 15:25 +1100, David Chinner wrote:
> I'm struggling to understand what possibly changed in XFS or writeback
> that would lead to stalls like this, esp. as you appear to be removing
> files when the stalls occur.

Just a crazy idea,..

Could there be a set_page_dirty() that doesn't have a
balance_dirty_pages() call near? For example, modifying metadata in
unlink?

Such a situation could lead to an excess of dirty pages, and the next
call to balance_dirty_pages() would appear to stall, as it would
desperately try to get below the limit again.
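A minimal sketch of the pairing Peter is asking about, using the
generic 2.6.23-era writeback API (the helper name is hypothetical;
this is not any particular filesystem's code):

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>

/*
 * Hypothetical helper: dirty a page and immediately give the throttle
 * a chance to run.  If a path only did the first half, dirty pages
 * would accumulate without ever being charged against the limits.
 */
static void dirty_one_page(struct address_space *mapping, struct page *page)
{
        set_page_dirty(page);                       /* account page as dirty */
        balance_dirty_pages_ratelimited(mapping);   /* throttle if over limit */
}

If some metadata path called set_page_dirty() without ever reaching
the second call, the dirty debt would be paid by whichever unrelated
task next entered balance_dirty_pages() - which is the stall pattern
being discussed.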
Re: writeout stalls in current -git
On Mon, 2007-11-05 at 15:57 -0800, Andrew Morton wrote:
> > > Subject: mm: speed up writeback ramp-up on clean systems
> > >
> > > We allow violation of bdi limits if there is a lot of room on the
> > > system. Once we hit half the total limit we start enforcing bdi limits
> > > and bdi ramp-up should happen. Doing it this way avoids many small
> > > writeouts on an otherwise idle system and should also speed up the
> > > ramp-up.
>
> Given the problems we're having in there I'm a bit reluctant to go tossing
> hastily put together and inadequately tested stuff onto the fire. And
> that's what this patch looks like to me.

Not really hastily; I think it was written before the stuff hit
mainline. Inadequately tested, perhaps - it's been in my and probably
Wu's kernels for a while. Granted, that's not a lot of testing in the
face of those who have problems atm.

> Wanna convince me otherwise?

I'm perfectly happy with this patch earning its credits in -mm for a
while and maybe going in around -rc4 or something like that (hoping
that by then we've fixed these nagging issues).

Another patch I did come up with yesterday - not driven by any
problems in that area - could perhaps join this one on that path:

---
Subject: mm: bdi: tweak task dirty penalty

Penalizing heavy dirtiers with 1/8-th the total dirty limit might be
rather excessive on large memory machines. Use sqrt to scale it
sub-linearly.

Update the comment while we're there.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 mm/page-writeback.c |   12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

Index: linux-2.6-2/mm/page-writeback.c
===================================================================
--- linux-2.6-2.orig/mm/page-writeback.c
+++ linux-2.6-2/mm/page-writeback.c
@@ -213,17 +213,21 @@ static inline void task_dirties_fraction
 }
 
 /*
- * scale the dirty limit
+ * Task specific dirty limit:
  *
- * task specific dirty limit:
+ *	dirty -= 8 * sqrt(dirty) * p_{t}
  *
- *	dirty -= (dirty/8) * p_{t}
+ * Penalize tasks that dirty a lot of pages by lowering their dirty limit. This
+ * avoids infrequent dirtiers from getting stuck in this other guys dirty
+ * pages.
+ *
+ * Use a sub-linear function to scale the penalty, we only need a little room.
  */
 void task_dirty_limit(struct task_struct *tsk, long *pdirty)
 {
 	long numerator, denominator;
 	long dirty = *pdirty;
-	u64 inv = dirty >> 3;
+	u64 inv = 8*int_sqrt(dirty);
 
 	task_dirties_fraction(tsk, &numerator, &denominator);
 	inv *= numerator;
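To put numbers on the sqrt scaling, here is a small userspace sketch
with made-up values (the dirty limit is in pages; p_t = 1.0 models the
worst case of a single task doing all the dirtying):

#include <stdio.h>
#include <math.h>

int main(void)
{
        long dirty = 100000;    /* assumed global dirty limit, in pages */
        double p_t = 1.0;       /* worst case: one task dirties everything */

        /* Old penalty: (dirty/8) * p_t -- up to 1/8th of the limit. */
        long old_limit = dirty - (long)((dirty / 8) * p_t);

        /* New penalty: 8 * sqrt(dirty) * p_t -- sub-linear. */
        long new_limit = dirty - (long)(8 * sqrt((double)dirty) * p_t);

        /* Prints: old task limit 87500, new task limit 97471 -- the
         * heavy dirtier keeps far more headroom under the sqrt rule. */
        printf("old task limit %ld, new task limit %ld\n",
               old_limit, new_limit);
        return 0;
}

The larger the machine's dirty limit, the smaller the relative penalty
becomes, which is exactly the "we only need a little room" design
choice stated in the patch comment.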
Re: writeout stalls in current -git
On Fri, Nov 02, 2007 at 08:22:10PM +0100, Torsten Kaiser wrote:
> [ 547.20] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 58858
> global 12829 72 0 wc __ tw 0 sk 0
> [ 550.48] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 57834
> global 12017 62 0 wc __ tw 0 sk 0
> [ 552.71] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 56810
> global 11133 83 0 wc __ tw 0 sk 0
> [ 558.66] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 55786
> global 10470 33 0 wc _M tw 0 sk 0 4s
> [ 562.75] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 54762
> global 10555 69 0 wc _M tw 0 sk 0 3s
> [ 565.15] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 53738
> global 9562 498 0 wc _M tw -2 sk 0 4s
> [ 569.49] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 52712
> global 8960 2 0 wc _M tw 0 sk 0 3s
> [ 572.91] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 51688
> global 8088 205 0 wc _M tw -13 sk 0 2s
> [ 574.61] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 50651
> global 7114 188 0 wc _M tw -1 sk 0 10s
> [ 584.27] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 49626
> global 14544 0 0 wc _M tw -1 sk 0 9s
> [ 593.05] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 48601
> global 24583 736 0 wc _M tw -1 sk 0 7s
> [ 600.18] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47576
> global 27004 6 0 wc _M tw 587 sk 0
> [ 600.18] mm/page-writeback.c 676 wb_kupdate: pdflush(285) 47139
> global 27004 6 0 wc __ tw 1014 sk 0

The above messages and the below 'D' state pdflush indicate that one
single writeback_inodes(4MB) call takes a long time (up to 10s!) to
complete. Let's try reverting the below patch with `patch -R`? It
looks like the most relevant change - if it's not a low-level bug.

> [note] first stall, the output from emerge stops, so it seems it can
> not start processing the next file until the stall ends
> [ 630.00] SysRq : Emergency Sync
> [ 630.12] Emergency Sync complete
> [ 632.85] SysRq : Show Blocked State
> [ 632.85]   task                        PC  stack   pid father
> [ 632.85] pdflush       D 81000f091788     0   285      2
> [ 632.85]  810005d4da80 0046 0800 0071
> [ 632.85]  81000fd52400 8022d61c 80819b00 80819b00
> [ 632.85]  80815f40 80819b00 810100316f98
> [ 632.85] Call Trace:
> [ 632.85]  [] task_rq_lock+0x4c/0x90
> [ 632.85]  [] __wake_up_common+0x5a/0x90
> [ 632.85]  [] __down+0xa7/0x11e
> [ 632.85]  [] default_wake_function+0x0/0x10
> [ 632.85]  [] __down_failed+0x35/0x3a
> [ 632.85]  [] xfs_buf_lock+0x3e/0x40
> [ 632.85]  [] _xfs_buf_find+0x13e/0x240
> [ 632.85]  [] xfs_buf_get_flags+0x6f/0x190
> [ 632.85]  [] xfs_buf_read_flags+0x12/0xa0
> [ 632.85]  [] xfs_trans_read_buf+0x64/0x340
> [ 632.85]  [] xfs_itobp+0x81/0x1e0
> [ 632.85]  [] write_cache_pages+0x123/0x330
> [ 632.85]  [] xfs_iflush+0xfe/0x520
> [ 632.85]  [] __down_read_trylock+0x42/0x60
> [ 632.85]  [] xfs_inode_flush+0x179/0x1b0
> [ 632.85]  [] xfs_fs_write_inode+0x2f/0x90
> [ 632.85]  [] __writeback_single_inode+0x2ac/0x380
> [ 632.85]  [] dm_table_any_congested+0x2e/0x80
> [ 632.85]  [] generic_sync_sb_inodes+0x20d/0x330
> [ 632.85]  [] writeback_inodes+0xa2/0xe0
> [ 632.85]  [] wb_kupdate+0xa6/0x140
> [ 632.85]  [] pdflush+0x0/0x1e0
> [ 632.85]  [] pdflush+0x110/0x1e0
> [ 632.85]  [] wb_kupdate+0x0/0x140
> [ 632.85]  [] kthread+0x4b/0x80
> [ 632.85]  [] child_rip+0xa/0x12
> [ 632.85]  [] kthread+0x0/0x80
> [ 632.85]  [] child_rip+0x0/0x12
> [ 632.85]
> [ 632.85] emerge        D     0  6220   6129
> [ 632.85]  810103ced9f8 0086 0071
> [ 632.85]  81000fd52cf8 80819b00 80819b00
> [ 632.85]  80815f40 80819b00 810103ced9b8 810103ced9a8
> [ 632.85] Call Trace:
> [ 632.85]  [] __down+0xa7/0x11e
> [ 632.85]  [] default_wake_function+0x0/0x10
> [ 632.85]  [] __down_failed+0x35/0x3a
> [ 632.85]  [] xfs_buf_lock+0x3e/0x40
> [ 632.85]  [] _xfs_buf_find+0x13e/0x240
> [ 632.85]  [] xfs_buf_get_flags+0x6f/0x190
> [ 632.85]  [] xfs_buf_read_flags+0x12/0xa0
> [ 632.85]  [] xfs_trans_read_buf+0x64/0x340
> [ 632.85]  [] xfs_itobp+0x81/0x1e0
> [ 632.85]  [] xfs_buf_rele+0x2e/0xd0
> [ 632.85]  [] xfs_iflush+0xfe/0x520
> [ 632.85]  [] __down_read_trylock+0x42/0x60
> [ 632.85]  [] xfs_inode_item_push+0x12/0x20
> [ 632.85]  [] xfs_trans_push_ail+0x267/0x2b0
> [ 632.85]  [] xfs_log_reserve+0x72/0x120
> [ 632.85]  [] xfs_trans_reserve+0xa8/0x210
> [ 63
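For reference, the 4MB figure in Fengguang's analysis above is the
chunk size periodic writeback uses per call. A simplified sketch of
the shape of the 2.6.23-era wb_kupdate() inner loop (not the verbatim
kernel code):

#include <linux/writeback.h>

/*
 * Simplified shape of the periodic-writeback loop: old dirty data is
 * pushed in fixed chunks of MAX_WRITEBACK_PAGES, so one
 * writeback_inodes() call that blocks for ~10s (as in the trace
 * above) stalls the entire kupdate pass.
 */
#define MAX_WRITEBACK_PAGES	1024	/* 1024 x 4KB pages = 4MB per call */

static void kupdate_pass(long nr_to_write, struct writeback_control *wbc)
{
        while (nr_to_write > 0) {
                long written;

                wbc->nr_to_write = MAX_WRITEBACK_PAGES;
                writeback_inodes(wbc);          /* may block on fs locks */
                written = MAX_WRITEBACK_PAGES - wbc->nr_to_write;
                if (written == 0)
                        break;                  /* nothing left to write */
                nr_to_write -= written;
        }
}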
Re: writeout stalls in current -git
On 11/6/07, David Chinner <[EMAIL PROTECTED]> wrote:
> On Mon, Nov 05, 2007 at 07:27:16PM +0100, Torsten Kaiser wrote:
> > On 11/5/07, David Chinner <[EMAIL PROTECTED]> wrote:
> > > Ok, so it's probably a side effect of the writeback changes.
> > >
> > > Attached are two patches (two because one was in a separate patchset as
> > > a standalone change) that should prevent async writeback from blocking
> > > on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first.
> > > Can you see if this fixes the problem?
> >
> > Now testing v2.6.24-rc1-650-gb55d1b1 plus the fix for the misapplied
> > raid5 patch. Applying your two patches on top of that does not fix the
> > stalls.
>
> So you are having RAID5 problems as well?

The first 2.6.24-rc1 git kernel that I patched with your patches did
not boot for me. (Oops sent in one of my previous mails.) But given
that the stacktrace was not XFS-related and I had seen this patch on
the lkml, I tried to fix that Oops this way. I did not have troubles
with the RAID5 otherwise.

> I'm struggling to understand what possibly changed in XFS or writeback
> that would lead to stalls like this, esp. as you appear to be removing
> files when the stalls occur. Rather than vmstat, can you use something
> like iostat to show how busy your disks are? i.e. are we seeing RMW
> cycles in the raid5 or some such issue.

Will do this this evening.

> OOC, what is the 'xfs_info ' output for your filesystem?

meta-data=/dev/mapper/root       isize=256    agcount=32, agsize=4731132 blks
         =                       sectsz=512   attr=1
data     =                       bsize=4096   blocks=151396224, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=32768, version=1
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0

> > vmstat 10 output from unmerging (uninstalling) a kernel:
> > 1 0 0 3512188332 19264400 18512 368 735 10 3 85 1
> > -> emerge starts to remove the kernel source files
> > 3 0 0 3506624332 1928360015 9825 2458 8307 7 12 81 0
> > 0 0 0 3507212332 19283600 0 554 630 1233 0 1 99 0
> > 0 0 0 3507292332 19283600 0 537 580 1328 0 1 99 0
> > 0 0 0 3507168332 19283600 0 633 626 1380 0 1 99 0
> > 0 0 0 3507116332 19283600 0 1510 768 2030 1 2 97 0
> > 0 0 0 3507596332 19283600 0 524 540 1544 0 0 99 0
> > procs ---memory-- ---swap-- -io -system-- cpu
> > r b swpd free buff cache si so bi bo in cs us sy id wa
> > 0 0 0 3507540332 19283600 0 489 551 1293 0 0 99 0
> > 0 0 0 3507528332 19283600 0 527 510 1432 1 1 99 0
> > 0 0 0 3508052332 19284000 0 2088 910 2964 2 3 95 0
> > 0 0 0 3507888332 19284000 0 442 565 1383 1 1 99 0
> > 0 0 0 3508704332 19284000 0 497 529 1479 0 0 99 0
> > 0 0 0 3508704332 19284000 0 594 595 1458 0 0 99 0
> > 0 0 0 3511492332 19284000 0 2381 1028 2941 2 3 95 0
> > 0 0 0 3510684332 19284000 0 699 600 1390 0 0 99 0
> > 0 0 0 3511636332 19284000 0 741 661 1641 0 0 100 0
> > 0 0 0 3524020332 19284000 0 2452 1080 3910 2 3 95 0
> > 0 0 0 3524040332 19284400 0 530 617 1297 0 0 99 0
> > 0 0 0 3524128332 19284400 0 812 674 1667 0 1 99 0
> > 0 0 0 3527000332 19367200 339 721 754 1681 3 2 93 1
> > -> emerge is finished, no dirty or writeback data in /proc/meminfo
>
> At this point, can you run a "sync" and see how long that takes to
> complete?

Already tried that: http://lkml.org/lkml/2007/11/2/178
See the logs from the second unmerge in the second half of the mail.
The sync did not stop this writeout, but returned immediately.
> The only thing I can think of that would be written out after
> this point is inodes, but even then it seems to go on for a long,
> long time and it really doesn't seem like XFS is holding up the
> inode writes.

Yes, I completely agree that this is much too long. That's why I
included the after-emerge-finished parts of the logs. But I still
partly suspect XFS, because the xfssyncd shows up when I hit SysRq+W.

> Another option is to use blktrace/blkparse to determine which process is
> issuing this I/O.
>
> > 0 0 0 3583780332 19506000 0 494 555 1080 0 1 99 0
> > 0 0
Re: writeout stalls in current -git
On Mon, Nov 05, 2007 at 07:27:16PM +0100, Torsten Kaiser wrote:
> On 11/5/07, David Chinner <[EMAIL PROTECTED]> wrote:
> > Ok, so it's probably a side effect of the writeback changes.
> >
> > Attached are two patches (two because one was in a separate patchset as
> > a standalone change) that should prevent async writeback from blocking
> > on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first.
> > Can you see if this fixes the problem?
>
> Now testing v2.6.24-rc1-650-gb55d1b1 plus the fix for the misapplied
> raid5 patch. Applying your two patches on top of that does not fix the
> stalls.

So you are having RAID5 problems as well?

I'm struggling to understand what possibly changed in XFS or writeback
that would lead to stalls like this, esp. as you appear to be removing
files when the stalls occur. Rather than vmstat, can you use something
like iostat to show how busy your disks are? i.e. are we seeing RMW
cycles in the raid5 or some such issue.

OOC, what is the 'xfs_info ' output for your filesystem?

> vmstat 10 output from unmerging (uninstalling) a kernel:
> 1 0 0 3512188332 19264400 18512 368 735 10 3 85 1
> -> emerge starts to remove the kernel source files
> 3 0 0 3506624332 1928360015 9825 2458 8307 7 12 81 0
> 0 0 0 3507212332 19283600 0 554 630 1233 0 1 99 0
> 0 0 0 3507292332 19283600 0 537 580 1328 0 1 99 0
> 0 0 0 3507168332 19283600 0 633 626 1380 0 1 99 0
> 0 0 0 3507116332 19283600 0 1510 768 2030 1 2 97 0
> 0 0 0 3507596332 19283600 0 524 540 1544 0 0 99 0
> procs ---memory-- ---swap-- -io -system-- cpu
> r b swpd free buff cache si so bi bo in cs us sy id wa
> 0 0 0 3507540332 19283600 0 489 551 1293 0 0 99 0
> 0 0 0 3507528332 19283600 0 527 510 1432 1 1 99 0
> 0 0 0 3508052332 19284000 0 2088 910 2964 2 3 95 0
> 0 0 0 3507888332 19284000 0 442 565 1383 1 1 99 0
> 0 0 0 3508704332 19284000 0 497 529 1479 0 0 99 0
> 0 0 0 3508704332 19284000 0 594 595 1458 0 0 99 0
> 0 0 0 3511492332 19284000 0 2381 1028 2941 2 3 95 0
> 0 0 0 3510684332 19284000 0 699 600 1390 0 0 99 0
> 0 0 0 3511636332 19284000 0 741 661 1641 0 0 100 0
> 0 0 0 3524020332 19284000 0 2452 1080 3910 2 3 95 0
> 0 0 0 3524040332 19284400 0 530 617 1297 0 0 99 0
> 0 0 0 3524128332 19284400 0 812 674 1667 0 1 99 0
> 0 0 0 3527000332 19367200 339 721 754 1681 3 2 93 1
> -> emerge is finished, no dirty or writeback data in /proc/meminfo

At this point, can you run a "sync" and see how long that takes to
complete? The only thing I can think of that would be written out after
this point is inodes, but even then it seems to go on for a long, long
time and it really doesn't seem like XFS is holding up the inode
writes.

Another option is to use blktrace/blkparse to determine which process
is issuing this I/O.

> 0 0 0 3583780332 19506000 0 494 555 1080 0 1 99 0
> 0 0 0 3584352332 19506000 099 347 559 0 0 99 0
> 0 0 0 3585232332 19506000 011 301 621 0 0 99 0
> -> disks go idle.

Well, the patches I posted should prevent blocking in the places where
it was seen, so if that does not stop the slowdowns then either the
writeback code is not feeding us inodes fast enough or the block
device below is having some kind of problem.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: writeout stalls in current -git
On Fri, 2 Nov 2007 18:33:29 +0800 Fengguang Wu <[EMAIL PROTECTED]> wrote:
> On Fri, Nov 02, 2007 at 11:15:32AM +0100, Peter Zijlstra wrote:
> > On Fri, 2007-11-02 at 10:21 +0800, Fengguang Wu wrote:
> >
> > > Interestingly, no background_writeout() appears, but only
> > > balance_dirty_pages() and wb_kupdate. Obviously wb_kupdate won't
> > > block the process.
> >
> > Yeah, the background threshold is not (yet) scaled. So it can happen
> > that the bdi_dirty limit is below the background limit.
> >
> > I'm curious, though, as to these stalls; I can't seem to think of
> > what goes wrong.. esp since most writeback seems to happen from pdflush.
>
> Me confused too. The new debug patch will confirm whether emerge is
> waiting in balance_dirty_pages().
>
> > (or I'm totally misreading it - quite possible as I'm still recovering
> > from a serious cold and not all the green stuff has yet figured out its
> > proper place wrt brain cells 'n stuff)
>
> Do take care of yourself.
>
> > I still have this patch floating around:
>
> I think this patch is OK for 2.6.24 :-)
>
> Reviewed-by: Fengguang Wu <[EMAIL PROTECTED]>

I would prefer Tested-by: :(

> > ---
> > Subject: mm: speed up writeback ramp-up on clean systems
> >
> > We allow violation of bdi limits if there is a lot of room on the
> > system. Once we hit half the total limit we start enforcing bdi limits
> > and bdi ramp-up should happen. Doing it this way avoids many small
> > writeouts on an otherwise idle system and should also speed up the
> > ramp-up.

Given the problems we're having in there I'm a bit reluctant to go
tossing hastily put together and inadequately tested stuff onto the
fire. And that's what this patch looks like to me.

Wanna convince me otherwise?
Re: writeout stalls in current -git
On 11/5/07, David Chinner <[EMAIL PROTECTED]> wrote:
> Ok, so it's probably a side effect of the writeback changes.
>
> Attached are two patches (two because one was in a separate patchset as
> a standalone change) that should prevent async writeback from blocking
> on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first.
> Can you see if this fixes the problem?

Now testing v2.6.24-rc1-650-gb55d1b1 plus the fix for the misapplied
raid5 patch. Applying your two patches on top of that does not fix the
stalls.

vmstat 10 output from unmerging (uninstalling) a kernel:
 1 0 0 3512188332 19264400 18512 368 735 10 3 85 1
-> emerge starts to remove the kernel source files
 3 0 0 3506624332 1928360015 9825 2458 8307 7 12 81 0
 0 0 0 3507212332 19283600 0 554 630 1233 0 1 99 0
 0 0 0 3507292332 19283600 0 537 580 1328 0 1 99 0
 0 0 0 3507168332 19283600 0 633 626 1380 0 1 99 0
 0 0 0 3507116332 19283600 0 1510 768 2030 1 2 97 0
 0 0 0 3507596332 19283600 0 524 540 1544 0 0 99 0
procs ---memory-- ---swap-- -io -system-- cpu
 r b swpd free buff cache si so bi bo in cs us sy id wa
 0 0 0 3507540332 19283600 0 489 551 1293 0 0 99 0
 0 0 0 3507528332 19283600 0 527 510 1432 1 1 99 0
 0 0 0 3508052332 19284000 0 2088 910 2964 2 3 95 0
 0 0 0 3507888332 19284000 0 442 565 1383 1 1 99 0
 0 0 0 3508704332 19284000 0 497 529 1479 0 0 99 0
 0 0 0 3508704332 19284000 0 594 595 1458 0 0 99 0
 0 0 0 3511492332 19284000 0 2381 1028 2941 2 3 95 0
 0 0 0 3510684332 19284000 0 699 600 1390 0 0 99 0
 0 0 0 3511636332 19284000 0 741 661 1641 0 0 100 0
 0 0 0 3524020332 19284000 0 2452 1080 3910 2 3 95 0
 0 0 0 3524040332 19284400 0 530 617 1297 0 0 99 0
 0 0 0 3524128332 19284400 0 812 674 1667 0 1 99 0
 0 0 0 3527000332 19367200 339 721 754 1681 3 2 93 1
-> emerge is finished, no dirty or writeback data in /proc/meminfo
 0 0 0 3571056332 19476800 111 639 632 1344 0 1 99 0
 0 0 0 3571260332 19476800 0 757 688 1405 1 0 99 0
 0 0 0 3571156332 19476800 0 753 641 1361 0 0 99 0
 0 0 0 3571404332 19476800 0 766 653 1389 0 0 99 0
 1 0 0 3571136332 19476800 6 764 669 1488 0 0 99 0
 0 0 0 3571668332 19482400 0 764 657 1482 0 0 99 0
 0 0 0 3571848332 19482400 0 673 659 1406 0 0 99 0
 0 0 0 3571908332 1950520022 753 638 1500 0 1 99 0
 0 0 0 3573052332 19505200 0 765 631 1482 0 1 99 0
 0 0 0 3574144332 19505200 0 771 640 1497 0 0 99 0
 0 0 0 3573468332 19505200 0 458 485 1251 0 0 99 0
 0 0 0 3574184332 19505200 0 427 474 1192 0 0 100 0
 0 0 0 3575092332 19505200 0 461 482 1235 0 0 99 0
 0 0 0 3576368332 19505600 0 582 556 1310 0 0 99 0
 0 0 0 3579300332 19505600 0 695 571 1402 0 0 99 0
 0 0 0 3580376332 19505600 0 417 568 906 0 0 99 0
 0 0 0 3581212332 19505600 0 421 559 977 0 1 99 0
 0 0 0 3583780332 19506000 0 494 555 1080 0 1 99 0
 0 0 0 3584352332 19506000 099 347 559 0 0 99 0
 0 0 0 3585232332 19506000 011 301 621 0 0 99 0
-> disks go idle.

So these patches do not seem to be the source of these excessive disk
writes...

Torsten
Re: writeout stalls in current -git
On 11/5/07, David Chinner <[EMAIL PROTECTED]> wrote:
> On Sun, Nov 04, 2007 at 12:19:19PM +0100, Torsten Kaiser wrote:
> > I can now confirm that I see this also with the current mainline git
> > version. I used 2.6.24-rc1-git-b4f555081fdd27d13e6ff39d455d5aefae9d2c0c
> > plus the fix for the sg changes in ieee1394.
>
> Ok, so it's probably a side effect of the writeback changes.
>
> Attached are two patches (two because one was in a separate patchset as
> a standalone change) that should prevent async writeback from blocking
> on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first.
> Can you see if this fixes the problem?

Applied both patches against the kernel mentioned above. This blows up
at boot:

[   80.807589] Filesystem "dm-0": Disabling barriers, not supported by the underlying device
[   80.820241] XFS mounting filesystem dm-0
[   80.913144] [ cut here ]
[   80.914932] kernel BUG at drivers/md/raid5.c:143!
[   80.916751] invalid opcode:  [1] SMP
[   80.918338] CPU 3
[   80.919142] Modules linked in:
[   80.920345] Pid: 974, comm: md1_raid5 Not tainted 2.6.24-rc1 #3
[   80.922628] RIP: 0010:[] [] __release_stripe+0x164/0x170
[   80.925935] RSP: 0018:8100060e7dd0 EFLAGS: 00010002
[   80.927987] RAX:  RBX: 81010141c288 RCX: 
[   80.930738] RDX:  RSI: 81010141c288 RDI: 810004fb3200
[   80.933488] RBP: 810004fb3200 R08:  R09: 0005
[   80.936240] R10: 0e00 R11: e200038465e8 R12: 81010141c298
[   80.938990] R13: 0286 R14: 810004fb3330 R15: 
[   80.941741] FS: 0060c870() GS:810100313700() knlGS:
[   80.944861] CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b
[   80.947080] CR2: 7fff7b295000 CR3: 000101842000 CR4: 06e0
[   80.949830] DR0:  DR1:  DR2: 
[   80.952580] DR3:  DR6: 0ff0 DR7: 0400
[   80.955332] Process md1_raid5 (pid: 974, threadinfo 8100060e6000, task 81000645c730)
[   80.958584] Stack: 81010141c288 01f4 810004fb3200 804b6f2d
[   80.961761]  01f4 81010141c288 804c8bd0
[   80.964681]  8100060e7ee8 804bd094 81000645c730 8100060e7e70
[   80.967518] Call Trace:
[   80.968558]  [] release_stripe+0x3d/0x60
[   80.970677]  [] md_thread+0x0/0x100
[   80.972629]  [] raid5d+0x344/0x450
[   80.974549]  [] process_timeout+0x0/0x10
[   80.976668]  [] schedule_timeout+0x5a/0xd0
[   80.978855]  [] md_thread+0x0/0x100
[   80.980807]  [] md_thread+0x30/0x100
[   80.982794]  [] autoremove_wake_function+0x0/0x30
[   80.985214]  [] md_thread+0x0/0x100
[   80.987167]  [] kthread+0x4b/0x80
[   80.989054]  [] child_rip+0xa/0x12
[   80.990972]  [] kthread+0x0/0x80
[   80.992824]  [] child_rip+0x0/0x12
[   80.994743]
[   80.995588]
[   80.995588] Code: 0f 0b eb fe 0f 1f 84 00 00 00 00 00 48 83 ec 28 48 89 5c 24
[   80.999307] RIP [] __release_stripe+0x164/0x170
[   81.001711] RSP

Switching back to unpatched 2.6.23-mm1 boots successfully...

Torsten
Re: writeout stalls in current -git
On Sun, Nov 04, 2007 at 12:19:19PM +0100, Torsten Kaiser wrote:
> On 11/2/07, David Chinner <[EMAIL PROTECTED]> wrote:
> > That's stalled waiting on the inode cluster buffer lock. That implies
> > that the inode cluster is already being written out and the inode has
> > been redirtied during writeout.
> >
> > Does the kernel you are testing have the "flush inodes in ascending
> > inode number order" patches applied? If so, can you remove that
> > patch and see if the problem goes away?
>
> I can now confirm that I see this also with the current mainline git
> version. I used 2.6.24-rc1-git-b4f555081fdd27d13e6ff39d455d5aefae9d2c0c
> plus the fix for the sg changes in ieee1394.

Ok, so it's probably a side effect of the writeback changes.

Attached are two patches (two because one was in a separate patchset as
a standalone change) that should prevent async writeback from blocking
on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first.
Can you see if this fixes the problem?

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

---
 fs/xfs/xfs_inode.c |  283 -
 1 file changed, 129 insertions(+), 154 deletions(-)

Index: 2.6.x-xfs-new/fs/xfs/xfs_inode.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/xfs_inode.c	2007-09-12 15:41:22.0 +1000
+++ 2.6.x-xfs-new/fs/xfs/xfs_inode.c	2007-09-13 08:57:06.395641940 +1000
@@ -124,6 +124,126 @@ xfs_inobp_check(
 #endif
 
 /*
+ * Simple wrapper for calling xfs_imap() that includes error
+ * and bounds checking
+ */
+STATIC int
+xfs_ino_to_imap(
+	xfs_mount_t	*mp,
+	xfs_trans_t	*tp,
+	xfs_ino_t	ino,
+	xfs_imap_t	*imap,
+	uint		imap_flags)
+{
+	int		error;
+
+	error = xfs_imap(mp, tp, ino, imap, imap_flags);
+	if (error) {
+		cmn_err(CE_WARN, "xfs_ino_to_imap: xfs_imap() returned an "
+				"error %d on %s.  Returning error.",
+				error, mp->m_fsname);
+		return error;
+	}
+
+	/*
+	 * If the inode number maps to a block outside the bounds
+	 * of the file system then return NULL rather than calling
+	 * read_buf and panicing when we get an error from the
+	 * driver.
+	 */
+	if ((imap->im_blkno + imap->im_len) >
+	    XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks)) {
+		xfs_fs_cmn_err(CE_ALERT, mp, "xfs_ino_to_imap: "
+			"(imap->im_blkno (0x%llx) + imap->im_len (0x%llx)) > "
+			" XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) (0x%llx)",
+			(unsigned long long) imap->im_blkno,
+			(unsigned long long) imap->im_len,
+			XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks));
+		return XFS_ERROR(EINVAL);
+	}
+	return 0;
+}
+
+/*
+ * Find the buffer associated with the given inode map
+ * We do basic validation checks on the buffer once it has been
+ * retrieved from disk.
+ */
+STATIC int
+xfs_imap_to_bp(
+	xfs_mount_t	*mp,
+	xfs_trans_t	*tp,
+	xfs_imap_t	*imap,
+	xfs_buf_t	**bpp,
+	uint		buf_flags,
+	uint		imap_flags)
+{
+	int		error;
+	int		i;
+	int		ni;
+	xfs_buf_t	*bp;
+
+	error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp, imap->im_blkno,
+				   (int)imap->im_len, XFS_BUF_LOCK, &bp);
+	if (error) {
+		cmn_err(CE_WARN, "xfs_imap_to_bp: xfs_trans_read_buf() returned "
+				"an error %d on %s.  Returning error.",
+				error, mp->m_fsname);
+		return error;
+	}
+
+	/*
+	 * Validate the magic number and version of every inode in the buffer
+	 * (if DEBUG kernel) or the first inode in the buffer, otherwise.
+	 */
+#ifdef DEBUG
+	ni = BBTOB(imap->im_len) >> mp->m_sb.sb_inodelog;
+#else	/* usual case */
+	ni = 1;
+#endif
+
+	for (i = 0; i < ni; i++) {
+		int		di_ok;
+		xfs_dinode_t	*dip;
+
+		dip = (xfs_dinode_t *)xfs_buf_offset(bp,
+					(i << mp->m_sb.sb_inodelog));
+		di_ok = be16_to_cpu(dip->di_core.di_magic) == XFS_DINODE_MAGIC &&
+			    XFS_DINODE_GOOD_VERSION(dip->di_core.di_version);
+		if (unlikely(XFS_TEST_ERROR(!di_ok, mp,
+						XFS_ERRTAG_ITOBP_INOTOBP,
+						XFS_RANDOM_ITOBP_INOTOBP))) {
+			if (imap_flags & XFS_IMAP_BULKSTAT) {
+				xfs_trans_brelse(tp, bp);
+				return XFS_ERROR(EINVAL)
Re: writeout stalls in current -git
On 11/2/07, David Chinner <[EMAIL PROTECTED]> wrote:
> That's stalled waiting on the inode cluster buffer lock. That implies
> that the inode cluster is already being written out and the inode has
> been redirtied during writeout.
>
> Does the kernel you are testing have the "flush inodes in ascending
> inode number order" patches applied? If so, can you remove that
> patch and see if the problem goes away?

I can now confirm that I see this also with the current mainline git
version. I used 2.6.24-rc1-git-b4f555081fdd27d13e6ff39d455d5aefae9d2c0c
plus the fix for the sg changes in ieee1394.

Bisecting would be troublesome, as the sg changes prevent mainline from
booting with my normal config / kill my network.

treogen ~ # vmstat 10
procs ---memory-- ---swap-- -io -system-- cpu
 r b swpd free buff cache si so bi bo in cs us sy id wa
-> starting emerge
 1 0 0 3627072332 157724009713 41 189 2 2 94 2
 0 0 0 3607240332 16373600 59910 332 951 2 1 93 4
 0 0 0 3601920332 16759200 380 2 218 870 1 1 98 0
 0 0 0 3596356332 17164800 40421 182 818 0 0 99 0
 0 0 0 3579328332 18043600 87812 147 912 1 1 97 2
 0 0 0 3575376332 18277600 236 4 244 953 1 1 95 3
 2 1 0 3571792332 18508400 232 7 256 1003 2 1 95 2
 0 0 0 3564844332 18736400 228 605 246 1167 2 1 93 4
 0 0 0 3562128332 18978400 230 4 527 1238 2 1 93 4
 0 1 0 3558764332 19196400 21624 438 1059 1 1 93 6
 0 0 0 3555120332 19386800 19936 406 959 0 0 92 8
 0 0 0 3552008332 19592800 19711 458 1023 1 1 90 8
 0 0 0 3548728332 19766000 183 7 496 1086 1 1 90 8
 0 0 0 3545560332 19937200 170 8 483 1017 1 1 90 9
 0 1 0 3542124332 20125600 190 1 544 1137 1 1 88 10
 1 0 0 3536924332 20329600 195 7 637 1209 2 1 89 8
 1 1 0 3485096332 24918400 10116 10372 4537 13 3 76 8
 2 0 0 3442004332 27972800 108640 219 1349 7 3 87 4
-> emerge is done reading its package database
 1 0 0 3254796332 44863600 027 128 8360 24 6 70 0
 2 0 0 3143304332 554016004733 213 4480 16 11 72 1
-> kernel unpacked
 1 0 0 3125700332 56041600 120 122 1675 24 1 75 0
 1 0 0 3117356332 56796800 0 674 157 2975 24 2 73 1
 2 0 0 3111636332 57373600 0 1143 151 1924 23 1 75 1
 2 0 0 3102836332 58133200 0 890 153 1330 24 1 75 0
 1 0 0 3097236332 58736000 0 656 194 1593 24 1 74 0
 1 0 0 3086824332 59548000 0 812 235 2657 25 1 74 0
-> tar.bz2 created, installing starts now
 0 0 0 3091612332 6010240082 708 499 2397 17 4 78 1
 0 0 0 3086088332 6021800069 2459 769 2237 3 4 88 6
 0 0 0 3085916332 60223600 2 1752 693 949 1 2 96 1
 0 0 0 3084544332 6035640066 4057 1176 2850 3 6 91 0
 0 0 0 3078780332 6055720098 3194 1169 3288 5 6 89 0
 0 0 0 3077940332 6059240017 1139 823 1547 1 2 97 0
 0 0 0 3078268332 60592400 0 888 807 1329 0 1 99 0
-> first short stall
procs ---memory-- ---swap-- -io -system-- cpu
 r b swpd free buff cache si so bi bo in cs us sy id wa
 0 0 0 3077040332 60592400 0 1950 785 1495 0 2 89 8
 0 0 0 3076588332 60589600 2 3807 925 2046 1 4 95 0
 0 0 0 3076900332 6060520011 2564 768 1471 1 3 95 1
 0 0 0 3071584332 6079280087 2499 1108 3433 4 6 90 0
-> second longer stall (emerge was not able to complete a single file move until the 'resume' line)
 0 0 0 3071592332 60792800 0 693 692 1289 0 0 99 0
 0 0 0 3072584332 60792800 0 792 731 1507 0 1 99 0
 0 0 0 3072840332 60792800 0 806 707 1521 0 1 99 0
 0 0 0 3072724332 60792800 0 782 695 1372 0 0 99 0
 0 0 0 3072972332 60792800 0 677 612 1301 0 0 99 0
 0 0 0 3072772332 60792800 0 738 681 1352 1 1 99 0
 0 0 0 3073020332 60792800 0 785 708 1328 0 1 99 0
 0 0 0 3072896332 60792800 0 833 722 1383 0 0 99 0
-> emerge resumed
 0 0 0 3069476332 60797200 2 4885 812 2062 1 4 90 5
 1 0 0 3069648332 60806800 4
Re: writeout stalls in current -git
On 11/2/07, David Chinner <[EMAIL PROTECTED]> wrote:
> On Fri, Nov 02, 2007 at 08:22:10PM +0100, Torsten Kaiser wrote:
> > [ 630.00] SysRq : Emergency Sync
> > [ 630.12] Emergency Sync complete
> > [ 632.85] SysRq : Show Blocked State
> > [ 632.85]   task                        PC  stack   pid father
> > [ 632.85] pdflush       D 81000f091788     0   285      2
> > [ 632.85]  810005d4da80 0046 0800 0071
> > [ 632.85]  81000fd52400 8022d61c 80819b00 80819b00
> > [ 632.85]  80815f40 80819b00 810100316f98
> > [ 632.85] Call Trace:
> > [ 632.85]  [] task_rq_lock+0x4c/0x90
> > [ 632.85]  [] __wake_up_common+0x5a/0x90
> > [ 632.85]  [] __down+0xa7/0x11e
> > [ 632.85]  [] default_wake_function+0x0/0x10
> > [ 632.85]  [] __down_failed+0x35/0x3a
> > [ 632.85]  [] xfs_buf_lock+0x3e/0x40
> > [ 632.85]  [] _xfs_buf_find+0x13e/0x240
> > [ 632.85]  [] xfs_buf_get_flags+0x6f/0x190
> > [ 632.85]  [] xfs_buf_read_flags+0x12/0xa0
> > [ 632.85]  [] xfs_trans_read_buf+0x64/0x340
> > [ 632.85]  [] xfs_itobp+0x81/0x1e0
> > [ 632.85]  [] write_cache_pages+0x123/0x330
> > [ 632.85]  [] xfs_iflush+0xfe/0x520
>
> That's stalled waiting on the inode cluster buffer lock. That implies
> that the inode cluster is already being written out and the inode has
> been redirtied during writeout.
>
> Does the kernel you are testing have the "flush inodes in ascending
> inode number order" patches applied? If so, can you remove that
> patch and see if the problem goes away?

It's 2.6.23-mm1 with only some small fixes. In its broken-out directory
I see:

git-xfs.patch

and

writeback-fix-periodic-superblock-dirty-inode-flushing.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-2.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-3.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-4.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-5.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-6.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists-7.patch
writeback-fix-time-ordering-of-the-per-superblock-dirty-inode-lists.patch
writeback-fix-time-ordering-of-the-per-superblock-inode-lists-8.patch
writeback-introduce-writeback_controlmore_io-to-indicate-more-io.patch

I don't know if the patch you mentioned is part of that version of the
mm-patchset.

Torsten
Re: writeout stalls in current -git
On Fri, Nov 02, 2007 at 08:22:10PM +0100, Torsten Kaiser wrote:
> [ 630.00] SysRq : Emergency Sync
> [ 630.12] Emergency Sync complete
> [ 632.85] SysRq : Show Blocked State
> [ 632.85]   task                        PC  stack   pid father
> [ 632.85] pdflush       D 81000f091788     0   285      2
> [ 632.85]  810005d4da80 0046 0800 0071
> [ 632.85]  81000fd52400 8022d61c 80819b00 80819b00
> [ 632.85]  80815f40 80819b00 810100316f98
> [ 632.85] Call Trace:
> [ 632.85]  [] task_rq_lock+0x4c/0x90
> [ 632.85]  [] __wake_up_common+0x5a/0x90
> [ 632.85]  [] __down+0xa7/0x11e
> [ 632.85]  [] default_wake_function+0x0/0x10
> [ 632.85]  [] __down_failed+0x35/0x3a
> [ 632.85]  [] xfs_buf_lock+0x3e/0x40
> [ 632.85]  [] _xfs_buf_find+0x13e/0x240
> [ 632.85]  [] xfs_buf_get_flags+0x6f/0x190
> [ 632.85]  [] xfs_buf_read_flags+0x12/0xa0
> [ 632.85]  [] xfs_trans_read_buf+0x64/0x340
> [ 632.85]  [] xfs_itobp+0x81/0x1e0
> [ 632.85]  [] write_cache_pages+0x123/0x330
> [ 632.85]  [] xfs_iflush+0xfe/0x520

That's stalled waiting on the inode cluster buffer lock. That implies
that the inode cluster is already being written out and the inode has
been redirtied during writeout.

Does the kernel you are testing have the "flush inodes in ascending
inode number order" patches applied? If so, can you remove that
patch and see if the problem goes away?

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
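The fix direction that follows from this analysis - and that the
patches later in the thread (earlier in this newest-first archive)
take - is to stop non-blocking writeback from sleeping on the cluster
buffer lock. A rough sketch of the idea, with purely hypothetical
names rather than the real XFS functions:

#include <linux/errno.h>

/* All names below are hypothetical illustrations, not real XFS code. */
struct my_inode;
struct my_buf;
struct my_buf *cluster_buf_lock(struct my_inode *ip);    /* may sleep */
struct my_buf *cluster_buf_trylock(struct my_inode *ip); /* never sleeps */
void write_cluster(struct my_buf *bp);
#define FLUSH_SYNC 0x1

/*
 * Async flushers trylock the inode cluster buffer; if it is busy,
 * leave the inode dirty for a later pass instead of stalling pdflush
 * behind whoever holds the lock.
 */
static int flush_inode_cluster(struct my_inode *ip, int flags)
{
        struct my_buf *bp;

        if (flags & FLUSH_SYNC) {
                bp = cluster_buf_lock(ip);
        } else {
                bp = cluster_buf_trylock(ip);
                if (!bp)
                        return -EAGAIN; /* skip; retry on a later pass */
        }
        write_cluster(bp);
        return 0;
}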
Re: writeout stalls in current -git
On 11/2/07, Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> On Fri, 2007-11-02 at 10:21 +0800, Fengguang Wu wrote:
>
> > Interestingly, no background_writeout() appears, but only
> > balance_dirty_pages() and wb_kupdate. Obviously wb_kupdate won't
> > block the process.
>
> Yeah, the background threshold is not (yet) scaled. So it can happen
> that the bdi_dirty limit is below the background limit.

I still have not seen a trigger of the "throttle_vm_writeout". This
time, installing 2.6.24-rc1 again, it did not even trigger any of the
other debug printks apart from the one in wb_kupdate. But 300Mb of new
files might still not trigger this with 4Gb of RAM. I'm currently
testing 2.6.23-mm1 with this patch and the second writeback-debug
patch.

> I'm curious, though, as to these stalls; I can't seem to think of
> what goes wrong.. esp since most writeback seems to happen from pdflush.

I also don't know. But looking at the time the system takes to write
out 8kb, I'm starting to suspect that something is writing this out,
but not marking it clean... (Or redirtying it immediately?)

> (or I'm totally misreading it - quite possible as I'm still recovering
> from a serious cold and not all the green stuff has yet figured out its
> proper place wrt brain cells 'n stuff)

Get well soon!

> I still have this patch floating around:

Applied, but it did not fix the stalls.

Here are the complete logs from vmstat 10 and the syslog from an
install of vanilla 2.6.24-rc1. (Please note: I installed the source of
vanilla 2.6.24-rc1, but I am still using 2.6.23-mm1!) All lines with
[note] are my comments about what the system was doing; both logs are
from the same run, so the notes should be more or less in sync. I used
SysRq+L to insert the SysRq help text into the syslog as a marker...

The visible effects are similar to the unmerge run, but the stalls
during the moving only started later. And the same effect was visible
after emerge finished and almost all dirty data was written: I can
still hear the disks and see the hdd light flickering (mostly on) for
much, much longer than it should take to write 8kb.
vmstat 10:
[note] emerge start
 1 0 0 3668496332 18774800 029 39 491 3 0 96 0
 1 0 0 3623940332 10008317 1724 3893 15 2 81 1
 0 0 0 3559488332 25243200 102148 11719 4536 9 4 74 13
 2 0 0 3482220332 311916007060 93 3818 11 3 86 0
 1 0 0 3289352332 48693200 235 33 11997 25 3 72 0
 1 0 0 3174036332 596412001033 35 3937 21 4 75 0
 2 0 0 3215756332 55529200 628 85 742 12 12 76 0
 2 0 0 3202128332 5597920032 9 34 1566 31 1 68 0
 2 0 0 3192804332 568072006046 172 4206 30 2 67 1
 3 0 0 3202424332 57262000 020 111 2223 27 1 72 0
 1 0 0 3196112332 57890000 0 1649 149 2763 25 2 73 0
 1 0 0 3190004332 58495600 017 110 2270 25 1 74 0
 1 0 0 3183952332 59084000 011 104 2553 25 1 74 0
 1 0 0 3176952332 59706800 0 2153 124 2886 25 2 72 0
 1 0 0 3171044332 60259200 022 109 2580 26 1 73 0
 1 0 0 3174896332 60549600 173 1441 312 2249 9 6 84 1
 1 0 0 3165204332 61185600 569 3221 606 4236 4 7 87 1
 0 0 0 3160856332 61351600 116 2281 570 3077 3 5 92 0
procs ---memory-- ---swap-- -io -system-- cpu
 r b swpd free buff cache si so bi bo in cs us sy id wa
 0 0 0 3154712332 61520000 108 2166 528 3038 3 4 93 0
 0 0 0 3156008332 6154200018 1941 537 1015 0 2 97 0
 0 0 0 3156652332 61550400 8 2232 547 900 0 2 98 0
 0 0 0 3156748332 6156720012 1932 537 947 0 2 98 0
 0 0 0 3154720332 6159000014 2204 584 1256 1 2 97 0
 0 0 0 3154256332 6160600010 2676 610 1317 1 3 96 0
 1 0 0 3152488332 61628400 9 1994 573 1024 1 2 97 0
 0 0 0 3152404332 61640800 4 2218 540 904 0 2 97 0
 0 0 0 3151244332 6171560044 2198 598 1921 2 4 94 0
 0 0 0 3147224332 61867200 110 1802 644 2575 3 4 93 0
 0 0 0 3144608332 6198240080 1590 543 1900 2 4 95 0
 0 0 0 3140768332 62144800 111 1758 657 2735 3 4 93 0
 0 0 0 3140816332 6218960026 801 531 1667 1 2 98 0
[note] first stall, SysRq+W
 1 0 0 3127620332 62189600 0 640 490 1381 2 1 97 0
 0 0 0 3127780332 62190000 0 627 475 1531 2 1 98
Re: writeout stalls in current -git
On 11/2/07, Fengguang Wu <[EMAIL PROTECTED]> wrote:
> I guess the new debug printks will provide more hints on it.

The "throttle_vm_writeout" did not trigger for my new workload. Except for
one (the first), the "balance_dirty_pages" calls came from line 445, the
newly added one.

But I found another workload that looks much more ... hmm ... 'mad'.
If I do an unmerge, the emerge program will read all files to revalidate
their checksums and then delete them. If I do this unmerge, the progress
of emerge stalls periodically for ~47 seconds. (Two times I used a
stopwatch to get this value. I think all the other stalls were identical;
at least in KSysGuard they looked evenly spaced.)

What really counts as 'mad' is this output from vmstat 10:
 0  0      0 3639044  332 177420    0    0    2  9220   101   618  1  1 98  0
 1  0      0 3624068  332 180628    0    0    3  2322   137   663  5  2 93  0
 0  0      0 3602456  332 183972    0    0   30   123   159   641  9  3 87  2
-> this was emerge collecting its package database
 0  0      0 3600052  332 184264    0    0   19  7743   823  5543  3  8 89  0
 0  0      0 3599332  332 184280    0    0    1  2532   517  2341  1  2 97  0
-> normal removing, now the emerge stalls
 0  0      0 3599404  332 184280    0    0    0   551   323  1290  0  0 99  0
 0  0      0 3599648  332 184280    0    0    0   644   314  1222  0  1 99  0
 0  0      0 3599648  332 184284    0    0    0   569   296  1242  0  0 99  0
 0  0      0 3599868  332 184288    0    0    0  2362   320  2735  1  2 97  0
-> resumes for a short time, then stalls again
 0  0      0 3599488  332 184288    0    0    0   584   292  1395  0  0 99  0
 0  0      0 3600216  332 184288    0    0    0   550   301  1361  0  0 99  0
 0  0      0 3594176  332 184296    0    0    0   562   300  1373  2  1 97  0
 0  0      0 3594648  332 184296    0    0    0  1278   336  1881  1  1 98  0
 0  0      0 3594172  332 184308    0    0    1  2812   421  2840  1  4 95  0
-> and again
 0  0      0 3594296  332 184308    0    0    0   545   342  1283  0  0 99  0
 0  0      0 3594376  332 184308    0    0    0   561   319  1314  0  1 99  0
 0  0      0 3594340  332 184308    0    0    0   586   327  1258  0  1 99  0
 0  0      0 3594644  332 184308    0    0    0   498   248  1376  0  0 99  0
 0  0      0 3595116  332 184348    0    0    0  3519   565  3452  2  4 95  0
-> and again
 0  0      0 3595320  332 184348    0    0    0   483   284  1163  0  0 99  0
 3  0      0 3595444  332 184352    0    0    0   498   247  1173  3  0 97  0
 1  0      0 3585108  332 184600    0    0    0  1298   644  2394  1  1 98  0
 1  0      0 3588152  332 184608    0    0    0  3154   520  3221  2  4 94  0
-> and again
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd    free  buff  cache   si   so   bi    bo    in    cs us sy id wa
 2  0      0 3588540  332 184608    0    0    0   574   268  1332  0  1 99  0
 1  0      0 3588744  332 184608    0    0    0   546   335  1289  0  0 99  0
 1  0      0 3588628  332 184608    0    0    0   638   348  1257  0  1 99  0
 1  0      0 3588952  332 184608    0    0    0   567   310  1226  0  1 99  0
 1  0      0 3603644  332 184972    0    0   59  2821   531  2419  3  4 91  1
 1  0      0 3649476  332 186272    0    0  370   395   380  1335  1  1 98  0
-> emerge finishes, and now the system goes 'mad'

The Dirty: line from /proc/meminfo stays at 8 or 12 kB, but the system is
writing like 'mad':
 1  0      0 3650616  332 186276    0    0    0   424   296  1126  0  1 99  0
 1  0      0 3650708  332 186276    0    0    0   418   249  1190  0  0 99  0
 1  0      0 3650716  332 186276    0    0    0   418   256  1151  0  1 99  0
 1  0      0 3650816  332 186276    0    0    0   420   257  1120  0  0 99  0
 1  0      0 3651132  332 186276    0    0    0   418   269  1145  0  0 99  0
 1  0      0 3651332  332 186280    0    0    0   419   294  1099  0  1 99  0
 1  0      0 3651732  332 186280    0    0    0   423   311  1072  0  1 99  0
 1  0      0 3652048  332 186280    0    0    0   400   317  1127  0  0 99  0
 1  0      0 3652024  332 186280    0    0    0   426   346  1066  0  1 99  0
 2  0      0 3652304  332 186280    0    0    0   425   357  1132  0  1 99  0
 2  0      0 3652652  332 186280    0    0    0   416   364  1184  0  0 99  0
 1  0      0 3652836  332 186280    0    0    0   413   397  1110  0  1 99  0
 1  0      0 3652852  332 186284    0    0    0   426   427  1290  0  1 99  0
 1  0      0 3652060  332 186420    0    0   14   404   421  1768  1  1 97  0
 1  0      0 3652904  332 186420    0    0    0   418   437  1792  1  1 98  0
 1  0      0 3653572  332 186420    0    0    0   410   442  1481  1  1 99  0
 2  0      0 3653872  332 186420    0    0    0   410   451  1206  0  1 99  0
 3  0      0 3654572  332 186420    0    0    0   414   479  1341  0  1 99  0
 1  0      0 3651720  332 189832    0    0  341   420   540  1600  1  1 98  1
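Measuring those pauses by hand is error-prone. The following is a minimal
userspace sketch (not something posted in this thread) that pipes the
output of `vmstat 10` through a small C filter and reports how long each
low-writeout run lasted. The 10-second interval matches the `vmstat 10`
above; the `bo` threshold and the 16-field layout are assumptions:

/* Rough stall detector for `vmstat 10 | ./stallwatch`: flags runs of
 * samples whose block-out rate stays low, to measure the ~47 s pauses
 * without a stopwatch. INTERVAL and LOW_BO are assumed values, not
 * numbers taken from this thread. */
#include <stdio.h>
#include <stdlib.h>

#define INTERVAL 10   /* seconds per sample, matches `vmstat 10` */
#define LOW_BO   700  /* below this, treat the sample as "stalled" */

int main(void)
{
	char line[256];
	int run = 0;

	while (fgets(line, sizeof(line), stdin)) {
		long f[16];
		char *p = line, *end;
		int n = 0;

		while (n < 16) {	/* collect up to 16 numeric fields */
			f[n] = strtol(p, &end, 10);
			if (end == p)
				break;
			p = end;
			n++;
		}
		if (n < 16)		/* skip headers and annotations */
			continue;

		if (f[9] < LOW_BO) {	/* f[9] is the 'bo' column */
			run++;
		} else {
			if (run)
				printf("stall over, lasted ~%d s\n",
				       run * INTERVAL);
			run = 0;
		}
	}
	return 0;
}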
Re: writeout stalls in current -git
On Fri, Nov 02, 2007 at 11:15:32AM +0100, Peter Zijlstra wrote:
> On Fri, 2007-11-02 at 10:21 +0800, Fengguang Wu wrote:
>
> > Interestingly, no background_writeout() appears, but only
> > balance_dirty_pages() and wb_kupdate. Obviously wb_kupdate won't
> > block the process.
>
> Yeah, the background threshold is not (yet) scaled. So it can happen
> that the bdi_dirty limit is below the background limit.
>
> I'm curious, though, as to these stalls; I can't seem to think of
> what goes wrong.. esp since most writeback seems to happen from pdflush.

Me confused too. The new debug patch will confirm whether emerge is
waiting in balance_dirty_pages().

> (or I'm totally misreading it - quite possible as I'm still recovering
> from a serious cold and not all the green stuff has yet figured out its
> proper place wrt brain cells 'n stuff)

Do take care of yourself.

> I still have this patch floating around:

I think this patch is OK for 2.6.24 :-)

Reviewed-by: Fengguang Wu <[EMAIL PROTECTED]>

> ---
> Subject: mm: speed up writeback ramp-up on clean systems
>
> We allow violation of bdi limits if there is a lot of room on the
> system. Once we hit half the total limit we start enforcing bdi limits
> and bdi ramp-up should happen. Doing it this way avoids many small
> writeouts on an otherwise idle system and should also speed up the
> ramp-up.
>
> Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
>
> ---
>  mm/page-writeback.c |   19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/mm/page-writeback.c
> ===================================================================
> --- linux-2.6.orig/mm/page-writeback.c	2007-09-28 10:08:33.937415368 +0200
> +++ linux-2.6/mm/page-writeback.c	2007-09-28 10:54:26.018247516 +0200
> @@ -355,8 +355,8 @@ get_dirty_limits(long *pbackground, long
>   */
>  static void balance_dirty_pages(struct address_space *mapping)
>  {
> -	long bdi_nr_reclaimable;
> -	long bdi_nr_writeback;
> +	long nr_reclaimable, bdi_nr_reclaimable;
> +	long nr_writeback, bdi_nr_writeback;
>  	long background_thresh;
>  	long dirty_thresh;
>  	long bdi_thresh;
> @@ -376,11 +376,26 @@ static void balance_dirty_pages(struct a
>
>  		get_dirty_limits(&background_thresh, &dirty_thresh,
>  				&bdi_thresh, bdi);
> +
> +		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +				global_page_state(NR_UNSTABLE_NFS);
> +		nr_writeback = global_page_state(NR_WRITEBACK);
> +
>  		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
>  		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> +
>  		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
>  			break;
>
> +		/*
> +		 * Throttle it only when the background writeback cannot
> +		 * catch-up. This avoids (excessively) small writeouts
> +		 * when the bdi limits are ramping up.
> +		 */
> +		if (nr_reclaimable + nr_writeback <
> +				(background_thresh + dirty_thresh) / 2)
> +			break;
> +
>  		if (!bdi->dirty_exceeded)
>  			bdi->dirty_exceeded = 1;

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: writeout stalls in current -git
On Fri, 2007-11-02 at 10:21 +0800, Fengguang Wu wrote:

> Interestingly, no background_writeout() appears, but only
> balance_dirty_pages() and wb_kupdate. Obviously wb_kupdate won't
> block the process.

Yeah, the background threshold is not (yet) scaled. So it can happen
that the bdi_dirty limit is below the background limit.

I'm curious, though, as to these stalls; I can't seem to think of
what goes wrong.. esp since most writeback seems to happen from pdflush.

(or I'm totally misreading it - quite possible as I'm still recovering
from a serious cold and not all the green stuff has yet figured out its
proper place wrt brain cells 'n stuff)

I still have this patch floating around:

---
Subject: mm: speed up writeback ramp-up on clean systems

We allow violation of bdi limits if there is a lot of room on the
system. Once we hit half the total limit we start enforcing bdi limits
and bdi ramp-up should happen. Doing it this way avoids many small
writeouts on an otherwise idle system and should also speed up the
ramp-up.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>

---
 mm/page-writeback.c |   19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c	2007-09-28 10:08:33.937415368 +0200
+++ linux-2.6/mm/page-writeback.c	2007-09-28 10:54:26.018247516 +0200
@@ -355,8 +355,8 @@ get_dirty_limits(long *pbackground, long
  */
 static void balance_dirty_pages(struct address_space *mapping)
 {
-	long bdi_nr_reclaimable;
-	long bdi_nr_writeback;
+	long nr_reclaimable, bdi_nr_reclaimable;
+	long nr_writeback, bdi_nr_writeback;
 	long background_thresh;
 	long dirty_thresh;
 	long bdi_thresh;
@@ -376,11 +376,26 @@ static void balance_dirty_pages(struct a

 		get_dirty_limits(&background_thresh, &dirty_thresh,
 				&bdi_thresh, bdi);
+
+		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
+				global_page_state(NR_UNSTABLE_NFS);
+		nr_writeback = global_page_state(NR_WRITEBACK);
+
 		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
 		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+
 		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
 			break;

+		/*
+		 * Throttle it only when the background writeback cannot
+		 * catch-up. This avoids (excessively) small writeouts
+		 * when the bdi limits are ramping up.
+		 */
+		if (nr_reclaimable + nr_writeback <
+				(background_thresh + dirty_thresh) / 2)
+			break;
+
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
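To make the effect of the added check concrete, here is a small standalone
C illustration with made-up numbers (not taken from any machine in this
thread); it only restates the (background_thresh + dirty_thresh) / 2
midpoint logic from the patch above:

/* Back-of-the-envelope for the added check. The dirtyable-page count
 * and the 5%/10% background/dirty ratios are assumptions for the sake
 * of the arithmetic. */
#include <stdio.h>

int main(void)
{
	long dirtyable         = 1000000;              /* pages, assumed */
	long background_thresh = dirtyable * 5 / 100;  /*  50000 */
	long dirty_thresh      = dirtyable * 10 / 100; /* 100000 */

	/* The patch breaks out of balance_dirty_pages() while the global
	 * dirty + writeback total is under the midpoint of the two
	 * limits, even if this bdi is over its (still ramping) share. */
	long midpoint = (background_thresh + dirty_thresh) / 2; /* 75000 */

	long nr = 60000; /* example global reclaimable + writeback pages */
	printf("%ld < %ld: %s\n", nr, midpoint,
	       nr < midpoint ? "no throttling yet, background writeback catches up"
			     : "enforce the per-bdi limit");
	return 0;
}

With these assumed ratios, per-bdi throttling only starts once the system
as a whole is at 7.5% dirty pages; below that, a writer over its bdi share
is let through and background writeback is left to catch up.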
Re: writeout stalls in current -git
On Fri, Nov 02, 2007 at 08:42:05AM +0100, Torsten Kaiser wrote:
> The Subject is still misleading, I'm using 2.6.23-mm1.
>
> On 11/2/07, Fengguang Wu <[EMAIL PROTECTED]> wrote:
> > On Thu, Nov 01, 2007 at 07:20:51PM +0100, Torsten Kaiser wrote:
> > > On 11/1/07, Fengguang Wu <[EMAIL PROTECTED]> wrote:
> > > > On Wed, Oct 31, 2007 at 04:22:10PM +0100, Torsten Kaiser wrote:
> > > > > Since 2.6.23-mm1 I also experience strange hangs during heavy
> > > > > writeouts.
> > > > > Each time I noticed this I was using emerge (package util from the
> > > > > gentoo distribution) to install/upgrade a package. The last step,
> > > > > where this hang occurred, is moving the prepared files from a tmpfs
> > > > > partition to the main xfs filesystem.
> > > > > The hangs were not fatal; after a few seconds everything resumed
> > > > > normally, so I was not able to capture a good image of what was
> > > > > happening.
> > > >
> > > > Thank you for the detailed report.
> > > >
> > > > How severe were the hangs? Only writeouts stalled, all apps stalled,
> > > > or could you not type and run new commands?
> > >
> > > Only writeout stalled. The emerge that was moving the files hung, but
> > > everything else worked normally.
> > > I was able to run new commands, like copying /proc/meminfo.
> >
> > But you mentioned in the next mail that `watch cat /proc/meminfo`
> > could also be blocked for some time - I guess in the same time emerge
> > was stalled?
>
> The behavior was different on these stalls.
> On the first report the writeout stopped completely, the emerge stopped,
> but at that time a cat /proc/meminfo >~/stall/meminfo did succeed and
> not stall.
> About the watch cat /proc/meminfo, I will write in the answer to the
> other mail...

OK.

> > [snip]
> > > > > After this SysRq+W writeback resumed again. Possibly writing the
> > > > > above into the syslog triggered that.
> > > >
> > > > Maybe. Are the log files on another disk/partition?
> > >
> > > No, everything was going to /
> > >
> > > What might be interesting is that doing cat /proc/meminfo
> > > >~/stall/meminfo did not resume the writeback. So there might be some
> > > threshold that was only broken with the additional write from
> > > syslog-ng. Or syslog-ng does some flushing, I don't know. (I'm using the
> >
> > Have you tried an explicit `sync`? ;-)
>
> No. I wanted to see what was stalled. So I started by collecting info
> from /proc and then the SysRq+W. And after hitting SysRq the writeout
> started to resume without any further action.
>
> But I think I have seen a `sync` stall also. During another emerge I
> noticed the system slowing down and wanted to use `sync` to speed up
> the writeout. The result was that the writeout did not speed up
> immediately, only after around a minute. The `sync` only returned at
> that time.
> Can writers starve `sync`?

I guess the new debug printks will provide more hints on it.

> > > syslog-ng package from gentoo:
> > > http://www.balabit.com/products/syslog_ng/ , version 2.0.5)
> > >
> > > > > The source tmpfs isn't mounted with any special parameters, but the
> > > > > target xfs filesystem resides on a dm-crypt device that is on top
> > > > > of a 3-disk RAID5 md.
> > > > > During the hang all CPUs were idle.
> > > >
> > > > No iowaits? ;-)
> > >
> > > No, I have a KSysGuard in my taskbar that showed no activity at all.
> > >
> > > OK, the subject does not match for my case, but there was also a tmpfs
> > > involved. And I found no thread with stalls on xfs. :-)
> >
> > Do you mean it is actually related with tmpfs?
>
> I don't know. It's just that I have seen tmpfs also redirtying inodes
> in these logs and the stalling emerge is moving files from tmpfs to
> xfs.
> It could be, but I don't know enough about tmpfs internals to really be
> sure.
> I just wanted to mention that tmpfs is involved somehow.

The requeue messages for tmpfs are not pleasant, but known to be fine ;-)

Fengguang
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: writeout stalls in current -git
On 11/2/07, Fengguang Wu <[EMAIL PROTECTED]> wrote:
> On Thu, Nov 01, 2007 at 08:00:10PM +0100, Torsten Kaiser wrote:
> > On 11/1/07, Torsten Kaiser <[EMAIL PROTECTED]> wrote:
> > > On 11/1/07, Fengguang Wu <[EMAIL PROTECTED]> wrote:
> > > > Thank you. Maybe we can start with the attached debug patch :-)
> > >
> > > Will apply it and try to recreate this.
> >
> > Patch applied, used emerge to install a 2.6.24-rc1 kernel.
> >
> > I had no complete stalls, but three times during the move from tmpfs
> > to the main xfs the emerge got noticeably slower. There still was
> > writeout happening, but as emerge prints out every file it has written,
> > during the pause not one file was processed.
> >
> > vmstat 10:
> > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
> >  r  b   swpd    free  buff  cache   si   so   bi    bo    in    cs us sy id wa
> >  0  1      0 3146424  332 614768    0    0  134  1849   438  2515  3  4 91  2
> >  0  0      0 3146644  332 614784    0    0    2  1628   507   646  0  2 85 13
> >  0  0      0 3146868  332 614868    0    0    5  2359   527  1076  0  3 97  0
> >  1  0      0 3144372  332 616148    0    0   96  2829   607  2666  2  5 92  0
> > -> normal writeout
> >  0  0      0 3140560  332 618144    0    0  152  2764   633  3308  3  6 91  0
> >  0  0      0 3137332  332 619908    0    0  114  1801   588  2858  3  4 93  0
> >  0  0      0 3136912  332 620136    0    0   20   827   393  1605  1  2 98  0
> > -> first stall
>
> 'stall': vmstat's output stalls for some time, or emerge stalls for
> the next several vmstat lines?

emerge stalls. The vmstat did work normally.

> >  0  0      0 3137088  332 620136    0    0    0   557   339  1437  0  1 99  0
> >  0  0      0 3137160  332 620136    0    0    0   642   310  1400  0  1 99  0

Meaning that these last three lines indicated that for ~30 seconds the
writeout was much slower than normal.

> >  0  0      0 3136588  332 620172    0    0    6  2972   527  1195  0  3 80 16
> >  0  0      0 3136276  332 620348    0    0   10  2668   558  1195  0  3 96  0
> >  0  0      0 3135228  332 620424    0    0    8  2712   522  1311  0  4 96  0
> >  0  0      0 3131740  332 621524    0    0   75  2935   559  2457  2  5 93  0
> >  0  0      0 3128348  332 622972    0    0   85  1470   490  2607  3  4 93  0
> >  0  0      0 3129292  332 622972    0    0    0   527   353  1398  0  1 99  0
> > -> second longer stall
> >  0  0      0 3128520  332 623028    0    0    6   488   249  1390  0  1 99  0
> >  0  0      0 3128236  332 623028    0    0    0   482   222  1222  0  1 99  0
> >  0  0      0 3128408  332 623028    0    0    0   585   269  1301  0  0 99  0
> >  0  0      0 3128532  332 623028    0    0    0   610   262  1278  0  0 99  0
> >  0  0      0 3128568  332 623028    0    0    0   636   345  1639  0  1 99  0
> >  0  0      0 3129032  332 623040    0    0    1   664   337  1466  0  1 99  0
> >  0  0      0 3129484  332 623040    0    0    0   658   300  1508  0  0 100 0
> >  0  0      0 3129576  332 623040    0    0    0   562   271  1454  0  1 99  0
> >  0  0      0 3129736  332 623040    0    0    0   627   278  1406  0  1 99  0
> >  0  0      0 3129368  332 623040    0    0    0   507   274  1301  0  1 99  0
> >  0  0      0 3129004  332 623040    0    0    0   444   211  1213  0  0 99  0

The second time the slowdown was much longer.

> >  0  1      0 3127260  332 623040    0    0    0  1036   305  1242  0  1 95  4
> >  0  0      0 3126280  332 623128    0    0    7  4241   555  1575  1  5 84 10
> >  0  0      0 3124948  332 623232    0    0    6  4194   529  1505  1  4 95  0
> >  0  0      0 3125228  332 624168    0    0   58  1966   586  1964  2  4 94  0
> > -> emerge resumed to normal speed, without any intervention from my side
> >  0  0      0 3120932  332 625904    0    0  112  1546   546  2565  3  4 93  0
> >  0  0      0 3118012  332 627568    0    0  128  1542   612  2705  3  4 93  0
>
> Interesting, the 'bo' never falls to zero.

Yes, I was not able to recreate the complete stall from the first mail,
but even this slowdown does not look completely healthy. I "hope" this
is the same bug, as I seem to be able to trigger this slowdown much more
easily.

[snip logs]

> Interestingly, no background_writeout() appears, but only
> balance_dirty_pages() and wb_kupdate. Obviously wb_kupdate won't
> block the process.

Yes, I noticed that too. The only time I have seen background_writeout
was during bootup and shutdown.

As for the stalled watch cat /proc/meminfo: that happened on the third
slowdown/stall, when emerge was already finished.

> > If I see the complete stall again, I will post that too.
>
> Thank you, could you run it with the attached new debug patch?

I will, but it will have to wait until the evening.
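For lining up such slowdowns with dirty-page state, a userspace sampler
like the sketch below (again, not the debug patch used in this thread)
can log the Dirty and Writeback lines of /proc/meminfo once per second
with timestamps:

/* Minimal sampler: print the Dirty: and Writeback: lines from
 * /proc/meminfo once per second, prefixed with a timestamp, so they
 * can be correlated with vmstat output afterwards. */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	char line[128];

	for (;;) {
		FILE *f = fopen("/proc/meminfo", "r");
		if (!f)
			return 1;
		printf("%ld:", (long)time(NULL));
		while (fgets(line, sizeof(line), f)) {
			if (!strncmp(line, "Dirty:", 6) ||
			    !strncmp(line, "Writeback:", 10)) {
				line[strcspn(line, "\n")] = '\0';
				printf(" %s", line);
			}
		}
		fclose(f);
		printf("\n");
		fflush(stdout);
		sleep(1);
	}
}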
Re: writeout stalls in current -git
The Subject is still misleading, I'm using 2.6.23-mm1.

On 11/2/07, Fengguang Wu <[EMAIL PROTECTED]> wrote:
> On Thu, Nov 01, 2007 at 07:20:51PM +0100, Torsten Kaiser wrote:
> > On 11/1/07, Fengguang Wu <[EMAIL PROTECTED]> wrote:
> > > On Wed, Oct 31, 2007 at 04:22:10PM +0100, Torsten Kaiser wrote:
> > > > Since 2.6.23-mm1 I also experience strange hangs during heavy
> > > > writeouts.
> > > > Each time I noticed this I was using emerge (package util from the
> > > > gentoo distribution) to install/upgrade a package. The last step,
> > > > where this hang occurred, is moving the prepared files from a tmpfs
> > > > partition to the main xfs filesystem.
> > > > The hangs were not fatal; after a few seconds everything resumed
> > > > normally, so I was not able to capture a good image of what was
> > > > happening.
> > >
> > > Thank you for the detailed report.
> > >
> > > How severe were the hangs? Only writeouts stalled, all apps stalled,
> > > or could you not type and run new commands?
> >
> > Only writeout stalled. The emerge that was moving the files hung, but
> > everything else worked normally.
> > I was able to run new commands, like copying /proc/meminfo.
>
> But you mentioned in the next mail that `watch cat /proc/meminfo`
> could also be blocked for some time - I guess in the same time emerge
> was stalled?

The behavior was different on these stalls.
On the first report the writeout stopped completely, the emerge stopped,
but at that time a cat /proc/meminfo >~/stall/meminfo did succeed and
not stall.
About the watch cat /proc/meminfo, I will write in the answer to the
other mail...

> [snip]
> > > > After this SysRq+W writeback resumed again. Possibly writing the
> > > > above into the syslog triggered that.
> > >
> > > Maybe. Are the log files on another disk/partition?
> >
> > No, everything was going to /
> >
> > What might be interesting is that doing cat /proc/meminfo
> > >~/stall/meminfo did not resume the writeback. So there might be some
> > threshold that was only broken with the additional write from
> > syslog-ng. Or syslog-ng does some flushing, I don't know. (I'm using the
>
> Have you tried an explicit `sync`? ;-)

No. I wanted to see what was stalled. So I started by collecting info
from /proc and then the SysRq+W. And after hitting SysRq the writeout
started to resume without any further action.

But I think I have seen a `sync` stall also. During another emerge I
noticed the system slowing down and wanted to use `sync` to speed up
the writeout. The result was that the writeout did not speed up
immediately, only after around a minute. The `sync` only returned at
that time.
Can writers starve `sync`?

> > syslog-ng package from gentoo:
> > http://www.balabit.com/products/syslog_ng/ , version 2.0.5)
> >
> > > > The source tmpfs isn't mounted with any special parameters, but the
> > > > target xfs filesystem resides on a dm-crypt device that is on top
> > > > of a 3-disk RAID5 md.
> > > > During the hang all CPUs were idle.
> > >
> > > No iowaits? ;-)
> >
> > No, I have a KSysGuard in my taskbar that showed no activity at all.
> >
> > OK, the subject does not match for my case, but there was also a tmpfs
> > involved. And I found no thread with stalls on xfs. :-)
>
> Do you mean it is actually related with tmpfs?

I don't know. It's just that I have seen tmpfs also redirtying inodes
in these logs and the stalling emerge is moving files from tmpfs to
xfs.
It could be, but I don't know enough about tmpfs internals to really be
sure.
I just wanted to mention that tmpfs is involved somehow.

Torsten
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
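The "can writers starve `sync`" question can at least be timed from
userspace. A rough sketch, assuming only that sync(2) on kernels of this
era returns after the requested writeback completes, which is why a
starved sync shows up as a long blocking call:

/* Time how long a single sync() blocks; a minute-long return would
 * match the behavior described above. Purely illustrative. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	sync();
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("sync() blocked for %.1f s\n",
	       (t1.tv_sec - t0.tv_sec) +
	       (t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}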
Re: writeout stalls in current -git
On Thu, Nov 01, 2007 at 08:00:10PM +0100, Torsten Kaiser wrote:
> On 11/1/07, Torsten Kaiser <[EMAIL PROTECTED]> wrote:
> > On 11/1/07, Fengguang Wu <[EMAIL PROTECTED]> wrote:
> > > Thank you. Maybe we can start with the attached debug patch :-)
> >
> > Will apply it and try to recreate this.
>
> Patch applied, used emerge to install a 2.6.24-rc1 kernel.
>
> I had no complete stalls, but three times during the move from tmpfs
> to the main xfs the emerge got noticeably slower. There still was
> writeout happening, but as emerge prints out every file it has written,
> during the pause not one file was processed.
>
> vmstat 10:
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>  r  b   swpd    free  buff  cache   si   so   bi    bo    in    cs us sy id wa
>  0  1      0 3146424  332 614768    0    0  134  1849   438  2515  3  4 91  2
>  0  0      0 3146644  332 614784    0    0    2  1628   507   646  0  2 85 13
>  0  0      0 3146868  332 614868    0    0    5  2359   527  1076  0  3 97  0
>  1  0      0 3144372  332 616148    0    0   96  2829   607  2666  2  5 92  0
> -> normal writeout
>  0  0      0 3140560  332 618144    0    0  152  2764   633  3308  3  6 91  0
>  0  0      0 3137332  332 619908    0    0  114  1801   588  2858  3  4 93  0
>  0  0      0 3136912  332 620136    0    0   20   827   393  1605  1  2 98  0
> -> first stall

'stall': vmstat's output stalls for some time, or emerge stalls for
the next several vmstat lines?

>  0  0      0 3137088  332 620136    0    0    0   557   339  1437  0  1 99  0
>  0  0      0 3137160  332 620136    0    0    0   642   310  1400  0  1 99  0
>  0  0      0 3136588  332 620172    0    0    6  2972   527  1195  0  3 80 16
>  0  0      0 3136276  332 620348    0    0   10  2668   558  1195  0  3 96  0
>  0  0      0 3135228  332 620424    0    0    8  2712   522  1311  0  4 96  0
>  0  0      0 3131740  332 621524    0    0   75  2935   559  2457  2  5 93  0
>  0  0      0 3128348  332 622972    0    0   85  1470   490  2607  3  4 93  0
>  0  0      0 3129292  332 622972    0    0    0   527   353  1398  0  1 99  0
> -> second longer stall
>  0  0      0 3128520  332 623028    0    0    6   488   249  1390  0  1 99  0
>  0  0      0 3128236  332 623028    0    0    0   482   222  1222  0  1 99  0
>  0  0      0 3128408  332 623028    0    0    0   585   269  1301  0  0 99  0
>  0  0      0 3128532  332 623028    0    0    0   610   262  1278  0  0 99  0
>  0  0      0 3128568  332 623028    0    0    0   636   345  1639  0  1 99  0
>  0  0      0 3129032  332 623040    0    0    1   664   337  1466  0  1 99  0
>  0  0      0 3129484  332 623040    0    0    0   658   300  1508  0  0 100 0
>  0  0      0 3129576  332 623040    0    0    0   562   271  1454  0  1 99  0
>  0  0      0 3129736  332 623040    0    0    0   627   278  1406  0  1 99  0
>  0  0      0 3129368  332 623040    0    0    0   507   274  1301  0  1 99  0
>  0  0      0 3129004  332 623040    0    0    0   444   211  1213  0  0 99  0
>  0  1      0 3127260  332 623040    0    0    0  1036   305  1242  0  1 95  4
>  0  0      0 3126280  332 623128    0    0    7  4241   555  1575  1  5 84 10
>  0  0      0 3124948  332 623232    0    0    6  4194   529  1505  1  4 95  0
>  0  0      0 3125228  332 624168    0    0   58  1966   586  1964  2  4 94  0
> -> emerge resumed to normal speed, without any intervention from my side
>  0  0      0 3120932  332 625904    0    0  112  1546   546  2565  3  4 93  0
>  0  0      0 3118012  332 627568    0    0  128  1542   612  2705  3  4 93  0

Interesting, the 'bo' never falls to zero.

> From syslog:
> first stall:
> [ 575.05] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 47259 global 610 0 0 wc __ tw 1023 sk 0
> [ 586.35] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 50465 global 6117 0 0 wc _M tw 967 sk 0
> [ 586.36] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 50408 global 6117 0 0 wc __ tw 1022 sk 0
> [ 599.90] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 53523 global 11141 0 0 wc __ tw 1009 sk 0
> [ 635.78] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 59397 global 12757 124 0 wc __ tw 0 sk 0
> [ 638.47] mm/page-writeback.c 418 balance_dirty_pages: emerge(6113) 1536 global 11405 51 0 wc __ tw 0 sk 0
> [ 638.82] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 58373 global 11276 48 0 wc __ tw -1 sk 0
> [ 641.26] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 57348 global 10565 100 0 wc __ tw 0 sk 0
> [ 643.98] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 56324 global 9788 103 0 wc __ tw -1 sk 0
> [ 646.12] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 55299 global 8912 6 0 wc __ tw 0 sk 0
>
> second stall:
> [ 664.04] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 48117 global 2864 81 0 wc _M tw -13 sk 0
> [ 664.40] mm/page-writeback.c
Re: writeout stalls in current -git
On Thu, Nov 01, 2007 at 07:20:51PM +0100, Torsten Kaiser wrote:
> On 11/1/07, Fengguang Wu <[EMAIL PROTECTED]> wrote:
> > On Wed, Oct 31, 2007 at 04:22:10PM +0100, Torsten Kaiser wrote:
> > > Since 2.6.23-mm1 I also experience strange hangs during heavy
> > > writeouts.
> > > Each time I noticed this I was using emerge (package util from the
> > > gentoo distribution) to install/upgrade a package. The last step,
> > > where this hang occurred, is moving the prepared files from a tmpfs
> > > partition to the main xfs filesystem.
> > > The hangs were not fatal; after a few seconds everything resumed
> > > normally, so I was not able to capture a good image of what was
> > > happening.
> >
> > Thank you for the detailed report.
> >
> > How severe were the hangs? Only writeouts stalled, all apps stalled, or
> > could you not type and run new commands?
>
> Only writeout stalled. The emerge that was moving the files hung, but
> everything else worked normally.
> I was able to run new commands, like copying /proc/meminfo.

But you mentioned in the next mail that `watch cat /proc/meminfo`
could also be blocked for some time - I guess in the same time emerge
was stalled?

> [snip]
> > > After this SysRq+W writeback resumed again. Possibly writing the
> > > above into the syslog triggered that.
> >
> > Maybe. Are the log files on another disk/partition?
>
> No, everything was going to /
>
> What might be interesting is that doing cat /proc/meminfo
> >~/stall/meminfo did not resume the writeback. So there might be some
> threshold that was only broken with the additional write from
> syslog-ng. Or syslog-ng does some flushing, I don't know. (I'm using the

Have you tried an explicit `sync`? ;-)

> syslog-ng package from gentoo:
> http://www.balabit.com/products/syslog_ng/ , version 2.0.5)
>
> > > The source tmpfs isn't mounted with any special parameters, but the
> > > target xfs filesystem resides on a dm-crypt device that is on top of
> > > a 3-disk RAID5 md.
> > > During the hang all CPUs were idle.
> >
> > No iowaits? ;-)
>
> No, I have a KSysGuard in my taskbar that showed no activity at all.
>
> OK, the subject does not match for my case, but there was also a tmpfs
> involved. And I found no thread with stalls on xfs. :-)

Do you mean it is actually related with tmpfs?

> > > The system is x86_64 with CONFIG_NO_HZ=y, but was still receiving ~330
> > > interrupts per second because of the bttv driver. (But I was not using
> > > that device at this time.)
> > >
> > > I'm willing to test patches or provide more information, but lack
> > > a good testcase to trigger this on demand.
> >
> > Thank you. Maybe we can start with the attached debug patch :-)
>
> Will apply it and try to recreate this.
>
> Thanks for looking into it.

Thank you for the rich information, too :-)

Fengguang
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html