Re: Processes stuck on D state on Dual Opteron
Claudio Martins wrote:
> On Tuesday 12 April 2005 01:46, Andrew Morton wrote:
> > Claudio Martins <[EMAIL PROTECTED]> wrote:
> > > I think I'm going to give Neil's patch a try, but I'll have to apply
> > > some patches from -mm.
> >
> > Just this one if you're using 2.6.12-rc2:
> >
> > --- 25/drivers/md/md.c~avoid-deadlock-in-sync_page_io-by-using-gfp_noio	Mon Apr 11 16:55:07 2005
> > +++ 25-akpm/drivers/md/md.c	Mon Apr 11 16:55:07 2005
> > @@ -332,7 +332,7 @@ static int bi_complete(struct bio *bio,
> >  static int sync_page_io(struct block_device *bdev, sector_t sector, int size,
> >  		   struct page *page, int rw)
> >  {
> > -	struct bio *bio = bio_alloc(GFP_KERNEL, 1);
> > +	struct bio *bio = bio_alloc(GFP_NOIO, 1);
> >  	struct completion event;
> >  	int ret;
> > _
>
> Hi Andrew, all,
>
> Sorry for the delay in reporting. This patch does indeed fix the problem.
> The machine ran stress for almost 15 hours straight with no problems at all.
>
> As for Nick's patch, I too think it would be nice to have it included (once
> the performance problems are sorted out), since it seemed to make the block
> layer more robust and better behaved (at least under stress), although I
> didn't run performance tests to measure regressions.
>
> Thanks Nick, Neil, Andrew and all the others for your great help with this
> issue. I'll have to put the machine into production now with the patch
> applied, but let me know if I can be of any further help with these issues.

Thanks for reporting and testing - what we need is more people like you
contributing to Linux ;)

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
Chen, Kenneth W wrote:
> Nick Piggin wrote on Tuesday, April 12, 2005 4:09 AM
> > Chen, Kenneth W wrote:
> > > I like the patch a lot and already benchmarked it on our db setup.
> > > However, I'm seeing a regression compared to a very, very crappy patch
> > > (see attached; you can laugh at me for doing things like that :-).
> >
> > OK - if we go that way, perhaps the following patch may be the way to do it.
>
> OK, if you are going to do it that way, then the ioc_batching code in
> get_request has to be reworked. We never push the queue so hard that it
> kicks itself into batching mode. However, the calls to get_io_context and
> put_io_context are unconditional in that function. Execution profiles show
> that these two little functions actually consume a lot of CPU cycles.
>
> AFAICS, ioc_*batching() is trying to push more requests onto a queue that
> is full (or nearly full) and to give high priority to the process that hits
> the last request slot. Why do we need to go all the way to tsk->io_context
> to keep track of that state? As a clean-up bonus, I think the tracking
> could be moved into the queue structure.

OK - well, it is no different from what you had before these patches, so
that would probably be future work in separate patches.

get_io_context can probably be reworked. For example, it is only called
with the current thread, so it probably doesn't need to increment the
refcount, as most users are only using it in process context... all the
users in ll_rw_blk.c, anyway.

--
SUSE Labs, Novell Inc.
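Ken's suggestion above - tracking the batching state in the request queue rather than in tsk->io_context - can be sketched as a userspace toy model. This is purely illustrative: the struct fields, the quota of 4, and the function names are invented here, not the kernel's actual ioc_batching implementation.

```c
#include <assert.h>

/* Toy model of per-queue batching state (illustrative names only):
 * the process that hits the last request slot becomes the "batcher"
 * and is allowed to allocate a few extra requests. */
struct toy_queue {
	int count;        /* requests currently allocated */
	int nr_requests;  /* queue depth limit */
	int batching;     /* remaining quota of the current batcher */
};

static int toy_may_batch(struct toy_queue *q)
{
	/* hitting the last slot starts a batching window */
	if (q->count >= q->nr_requests && !q->batching) {
		q->batching = 4;  /* arbitrary quota for the sketch */
		return 1;
	}
	/* while the window is open, keep letting requests through */
	if (q->batching) {
		q->batching--;
		return 1;
	}
	return 0;  /* queue not full: normal allocation path */
}
```

Because the state lives in the queue, no io_context lookup (and no refcount traffic) is needed on this path, which is the cost the execution profile flagged.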
Re: Processes stuck on D state on Dual Opteron
On Tuesday 12 April 2005 01:46, Andrew Morton wrote:
> Claudio Martins <[EMAIL PROTECTED]> wrote:
> > I think I'm going to give Neil's patch a try, but I'll have to apply
> > some patches from -mm.
>
> Just this one if you're using 2.6.12-rc2:
>
> --- 25/drivers/md/md.c~avoid-deadlock-in-sync_page_io-by-using-gfp_noio	Mon Apr 11 16:55:07 2005
> +++ 25-akpm/drivers/md/md.c	Mon Apr 11 16:55:07 2005
> @@ -332,7 +332,7 @@ static int bi_complete(struct bio *bio,
>  static int sync_page_io(struct block_device *bdev, sector_t sector, int size,
>  		   struct page *page, int rw)
>  {
> -	struct bio *bio = bio_alloc(GFP_KERNEL, 1);
> +	struct bio *bio = bio_alloc(GFP_NOIO, 1);
>  	struct completion event;
>  	int ret;
> _

Hi Andrew, all,

Sorry for the delay in reporting. This patch does indeed fix the problem.
The machine ran stress for almost 15 hours straight with no problems at all.

As for Nick's patch, I too think it would be nice to have it included (once
the performance problems are sorted out), since it seemed to make the block
layer more robust and better behaved (at least under stress), although I
didn't run performance tests to measure regressions.

Thanks Nick, Neil, Andrew and all the others for your great help with this
issue. I'll have to put the machine into production now with the patch
applied, but let me know if I can be of any further help with these issues.

Thanks

Claudio
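The one-character change above fixes a recursion deadlock: allocating a bio with GFP_KERNEL on the I/O path lets the allocator enter reclaim, and reclaim may issue writeback that itself needs a bio. The shape of the problem can be modeled in plain userspace C; everything here (flag values, function names, the depth cap) is a made-up illustration, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of why sync_page_io() must not use GFP_KERNEL. */
#define TOY_GFP_KERNEL 1   /* reclaim is allowed to start new I/O */
#define TOY_GFP_NOIO   2   /* reclaim must not start new I/O */

static int io_recursions;  /* how often reclaim re-entered the I/O path */

static bool toy_alloc(int flags, int depth);

/* Reclaim under memory pressure: with GFP_KERNEL it writes pages back,
 * and the writeback itself needs an allocation (cf. bio_alloc()). */
static bool toy_reclaim(int flags, int depth)
{
	if (flags == TOY_GFP_KERNEL) {
		io_recursions++;
		return toy_alloc(flags, depth + 1);  /* the dangerous re-entry */
	}
	return true;  /* GFP_NOIO: free memory some other way, no new I/O */
}

static bool toy_alloc(int flags, int depth)
{
	if (depth > 8)
		return false;  /* stands in for the md deadlock: I/O stuck behind I/O */
	/* pretend the free lists are empty, so every allocation reclaims */
	return toy_reclaim(flags, depth);
}
```

In the model, a GFP_NOIO allocation never re-enters the I/O path, which is exactly the property the md patch relies on.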
RE: Processes stuck on D state on Dual Opteron
Nick Piggin wrote on Tuesday, April 12, 2005 4:09 AM
> Chen, Kenneth W wrote:
> > I like the patch a lot and already benchmarked it on our db setup.
> > However, I'm seeing a regression compared to a very, very crappy patch
> > (see attached; you can laugh at me for doing things like that :-).
>
> OK - if we go that way, perhaps the following patch may be the way to do it.

OK, if you are going to do it that way, then the ioc_batching code in
get_request has to be reworked. We never push the queue so hard that it
kicks itself into batching mode. However, the calls to get_io_context and
put_io_context are unconditional in that function. Execution profiles show
that these two little functions actually consume a lot of CPU cycles.

AFAICS, ioc_*batching() is trying to push more requests onto a queue that
is full (or nearly full) and to give high priority to the process that hits
the last request slot. Why do we need to go all the way to tsk->io_context
to keep track of that state? As a clean-up bonus, I think the tracking
could be moved into the queue structure.

> > My first reaction is that the overhead is in the wait queue setup and tear
> > down in the get_request_wait function. Throwing the following patch on top
> > does improve things a bit, but we are still in negative territory. I can't
> > explain why; everything is supposed to be faster. So I'm staring at the
> > execution profile at the moment.
>
> Hmm, that's a bit disappointing. Like you said though, I'm sure we
> should be able to get better performance out of this.

Absolutely. I'm disappointed too, and this is totally unexpected. There
must be some other factors.
Re: Processes stuck on D state on Dual Opteron
Nick Piggin wrote:
> It is a bit subtle: get_request may only drop the lock and return NULL
> (after retaking the lock) if we fail on a memory allocation. If we just
> fail due to unavailable queue slots, then the lock is never dropped.
> And the mem allocation can't fail, because it is a mempool alloc with
> GFP_NOIO.

I'm jumping in here, because we have seen this problem on an x86-64 system
with 4 GB of RAM and SLES9 (2.6.5-7.141).

You can drive the node into this state:

Mem-info:
Node 1 DMA per-cpu: empty
Node 1 Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
Node 1 HighMem per-cpu: empty
Node 0 DMA per-cpu:
cpu 0 hot: low 2, high 6, batch 1
cpu 0 cold: low 0, high 2, batch 1
cpu 1 hot: low 2, high 6, batch 1
cpu 1 cold: low 0, high 2, batch 1
Node 0 Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
Node 0 HighMem per-cpu: empty
Free pages: 10360kB (0kB HighMem)
Active:485853 inactive:421820 dirty:0 writeback:0 unstable:0 free:2590 slab:10816 mapped:903444 pagetables:2097
Node 1 DMA free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB
lowmem_reserve[]: 0 1664 1664
Node 1 Normal free:2464kB min:2468kB low:4936kB high:7404kB active:918440kB inactive:710360kB present:1703936kB
lowmem_reserve[]: 0 0 0
Node 1 HighMem free:0kB min:128kB low:256kB high:384kB active:0kB inactive:0kB present:0kB
lowmem_reserve[]: 0 0 0
Node 0 DMA free:4928kB min:20kB low:40kB high:60kB active:0kB inactive:0kB present:16384kB
lowmem_reserve[]: 0 2031 2031
Node 0 Normal free:2968kB min:3016kB low:6032kB high:9048kB active:1024968kB inactive:976924kB present:2080764kB
lowmem_reserve[]: 0 0 0
Node 0 HighMem free:0kB min:128kB low:256kB high:384kB active:0kB inactive:0kB present:0kB
lowmem_reserve[]: 0 0 0
Node 1 DMA: empty
Node 1 Normal: 46*4kB 19*8kB 9*16kB 4*32kB 1*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 2464kB
Node 1 HighMem: empty
Node 0 DMA: 4*4kB 4*8kB 1*16kB 2*32kB 3*64kB 4*128kB 2*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 4928kB
Node 0 Normal: 0*4kB 1*8kB 1*16kB 0*32kB 0*64kB 1*128kB 1*256kB 3*512kB 1*1024kB 0*2048kB 0*4096kB = 2968kB
Node 0 HighMem: empty
Swap cache: add 1009224, delete 106245, find 179674/181478, race 0+2
Free swap: 4739812kB
950271 pages of RAM
17513 reserved pages
2788 pages shared
902980 pages swap cached

with processes doing this:

SysRq : Show State

                         sibling
  task          PC       pid  father  child  younger  older
init     D 0100e810    0    1      0      2  (NOTLB)
Call Trace: {try_to_free_pages+283} {schedule_timeout+173}
  {process_timeout+0} {io_schedule_timeout+82} {blk_congestion_wait+141}
  {autoremove_wake_function+0} {autoremove_wake_function+0}
  {__alloc_pages+776} {read_swap_cache_async+63} {swapin_readahead+97}
  {do_swap_page+142} {handle_mm_fault+337} {do_page_fault+411}
  {sys_select+1097} {sys_select+1311} {error_exit+0}

mg.C.2   D 0100e810    0 1971   1955   1972  (NOTLB)
Call Trace: {try_to_free_pages+283} {schedule_timeout+173}
  {process_timeout+0} {io_schedule_timeout+82} {blk_congestion_wait+141}
  {autoremove_wake_function+0} {autoremove_wake_function+0}
  {__alloc_pages+776} {do_wp_page+285} {handle_mm_fault+373}
  {do_page_fault+411} {error_exit+0}

mg.C.2   S 01007b0a06a0  0 1972   1971   1974  (NOTLB)
Call Trace: {__alloc_pages+852} {__down_interruptible+216}
  {default_wake_function+0} {recalc_task_prio+940}
  {__down_failed_interruptible+53} {:mosal:.text.lock.mosal_sync+5}
  {:mod_vipkl:VIPKL_EQ_poll+607} {:mod_vipkl:VIPKL_EQ_poll_stat+529}
  {:mod_vipkl:VIPKL_ioctl+5144} {:mod_vipkl:vipkl_wrap_kernel_ioctl+417}
  {filp_close+126} {sys_ioctl+612} {system_call+124}

mg.C.2   S 01007b0a18c0  0 1974   1971   1972  (NOTLB)
Call Trace: {__alloc_pages+852} {__down_interruptible+216}
  {default_wake_function+0}
Re: Processes stuck on D state on Dual Opteron
Chen, Kenneth W wrote:
> On Tue, Apr 12 2005, Nick Piggin wrote:
> > Actually the patches I have sent you do fix real bugs, but they also
> > make the block layer less likely to recurse into page reclaim, so it
> > may be e.g. hiding the problem that Neil's patch fixes.
>
> Jens Axboe wrote on Tuesday, April 12, 2005 12:08 AM
> > Can you push those to Andrew? I'm quite happy with the way they turned
> > out. It would be nice if Ken would bench 2.6.12-rc2 with and without
> > those patches.
>
> I like the patch a lot and already benchmarked it on our db setup.
> However, I'm seeing a regression compared to a very, very crappy patch
> (see attached; you can laugh at me for doing things like that :-).

OK - if we go that way, perhaps the following patch may be the way to do it.

> My first reaction is that the overhead is in the wait queue setup and tear
> down in the get_request_wait function. Throwing the following patch on top
> does improve things a bit, but we are still in negative territory. I can't
> explain why; everything is supposed to be faster. So I'm staring at the
> execution profile at the moment.

Hmm, that's a bit disappointing. Like you said though, I'm sure we
should be able to get better performance out of this. I'll look at it
and see if we can rework it.

--
SUSE Labs, Novell Inc.
Re: Processes stuck on D state on Dual Opteron
Nick Piggin wrote:
> Nick Piggin wrote:
> > Chen, Kenneth W wrote:
> > > I like the patch a lot and already benchmarked it on our db setup.
> > > However, I'm seeing a regression compared to a very, very crappy patch
> > > (see attached; you can laugh at me for doing things like that :-).
> >
> > OK - if we go that way, perhaps the following patch may be the way to do it.
>
> Here.

Actually, yes, this is good I think. What I was worried about is that you
could lose some fairness due to not being put on the wait queue before the
allocation. This is probably a silly thing to worry about, because up until
that point things aren't really deterministic anyway (and before this
patchset it would try a GFP_ATOMIC allocation first anyway).

However, after the subsequent locking rework, both these get_request()
calls will be performed under the same lock - giving you the same fairness.
So it is nothing to worry about anyway!

It is a bit subtle: get_request may only drop the lock and return NULL
(after retaking the lock) if we fail on a memory allocation. If we just
fail due to unavailable queue slots, then the lock is never dropped. And
the mem allocation can't fail, because it is a mempool alloc with GFP_NOIO.

Nick

--
SUSE Labs, Novell Inc.
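The mempool property Nick leans on here - a mempool allocation with a blocking gfp mask waits for a freed element rather than returning NULL - can be sketched as a userspace model. The pool size, names, and refill logic below are illustrative, not the kernel's mempool implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the mempool guarantee: a caller that may sleep
 * (think GFP_NOIO) never sees NULL, because it waits for an
 * element to come back to the reserve pool. */
#define POOL_MIN 2

static int reserved = POOL_MIN;   /* elements held in the reserve pool */

static void *toy_mempool_alloc(int can_wait)
{
	static int element[POOL_MIN];
	for (;;) {
		if (reserved > 0)
			return &element[--reserved];
		if (!can_wait)
			return NULL;          /* a non-blocking caller may fail */
		reserved = POOL_MIN;      /* models sleeping until frees return */
	}
}
```

This is why, in get_request, failing to get a request can only mean "no queue slots": the mempool side of the allocation cannot be the thing that returns NULL.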
Re: Processes stuck on D state on Dual Opteron
Nick Piggin wrote:
> Chen, Kenneth W wrote:
> > I like the patch a lot and already benchmarked it on our db setup.
> > However, I'm seeing a regression compared to a very, very crappy patch
> > (see attached; you can laugh at me for doing things like that :-).
>
> OK - if we go that way, perhaps the following patch may be the way to do it.

Here.

--
SUSE Labs, Novell Inc.

Index: linux-2.6/drivers/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/drivers/block/ll_rw_blk.c	2005-04-12 21:03:01.000000000 +1000
+++ linux-2.6/drivers/block/ll_rw_blk.c	2005-04-12 21:03:45.000000000 +1000
@@ -1956,10 +1956,11 @@ out:
  */
 static struct request *get_request_wait(request_queue_t *q, int rw)
 {
-	DEFINE_WAIT(wait);
 	struct request *rq;
 
-	do {
+	rq = get_request(q, rw, GFP_NOIO);
+	while (!rq) {
+		DEFINE_WAIT(wait);
 		struct request_list *rl = &q->rq;
 
 		prepare_to_wait_exclusive(&rl->wait[rw], &wait,
@@ -1987,7 +1988,7 @@ static struct request *get_request_wait(
 			spin_lock_irq(q->queue_lock);
 		}
 		finish_wait(&rl->wait[rw], &wait);
-	} while (!rq);
+	}
 
 	return rq;
 }
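The shape of this change - try the allocation once before doing any wait-queue bookkeeping, and only fall into the prepare/wait/retry loop on failure - can be modeled in plain userspace C. The names below (try_get, get_wait, the counter) are invented for the sketch; try_get stands in for get_request, and the counter stands in for the DEFINE_WAIT/prepare_to_wait_exclusive work the patch moves off the fast path.

```c
#include <assert.h>
#include <stddef.h>

static int available = 1;     /* one free "request" slot */
static int waitq_setups;      /* counts prepare_to_wait-style setups */

/* stands in for get_request(): take a slot if one is free */
static int *try_get(void)
{
	static int slot;
	if (available) {
		available = 0;
		return &slot;
	}
	return NULL;
}

/* stands in for get_request_wait() after the patch */
static int *get_wait(void)
{
	int *rq = try_get();          /* fast path: no wait-queue work at all */
	while (!rq) {
		waitq_setups++;           /* models DEFINE_WAIT + prepare_to_wait */
		available = 1;            /* models a completing request waking us */
		rq = try_get();
	}
	return rq;
}
```

An uncontended caller pays nothing for the wait queue; only a caller that actually finds the queue full does the setup/teardown, which is exactly the overhead Ken's profile pointed at.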
RE: Processes stuck on D state on Dual Opteron
On Tue, Apr 12 2005, Nick Piggin wrote:
> Actually the patches I have sent you do fix real bugs, but they also
> make the block layer less likely to recurse into page reclaim, so it
> may be e.g. hiding the problem that Neil's patch fixes.

Jens Axboe wrote on Tuesday, April 12, 2005 12:08 AM
> Can you push those to Andrew? I'm quite happy with the way they turned
> out. It would be nice if Ken would bench 2.6.12-rc2 with and without
> those patches.

I like the patch a lot and already benchmarked it on our db setup.
However, I'm seeing a regression compared to a very, very crappy patch
(see attached; you can laugh at me for doing things like that :-).

My first reaction is that the overhead is in the wait queue setup and tear
down in the get_request_wait function. Throwing the following patch on top
does improve things a bit, but we are still in negative territory. I can't
explain why; everything is supposed to be faster. So I'm staring at the
execution profile at the moment.

diff -Nru a/drivers/block/ll_rw_blk.c b/drivers/block/ll_rw_blk.c
--- a/drivers/block/ll_rw_blk.c	2005-04-12 00:48:12 -07:00
+++ b/drivers/block/ll_rw_blk.c	2005-04-12 00:48:12 -07:00
@@ -1740,10 +1740,35 @@
  */
 static struct request *get_request_wait(request_queue_t *q, int rw)
 {
-	DEFINE_WAIT(wait);
 	struct request *rq;
+	struct request_list *rl = &q->rq;
+	int gfp_flag = GFP_ATOMIC;
+
+	if (rl->count[rw] < queue_congestion_off_threshold(q)) {
+		rq = kmem_cache_alloc(request_cachep, gfp_flag);
+		if (rq) {
+			if (!elv_set_request(q, rq, gfp_flag)) {
+				rl->count[rw]++;
+				INIT_LIST_HEAD(&rq->queuelist);
+				rq->flags = rw;
+				rq->rq_status = RQ_ACTIVE;
+				rq->ref_count = 1;
+				rq->q = q;
+				rq->rl = rl;
+				rq->special = NULL;
+				rq->data_len = 0;
+				rq->data = NULL;
+				rq->sense = NULL;
+
+				return rq;
+			}
+			kmem_cache_free(request_cachep, rq);
+		}
+	}
 
 	do {
+		DEFINE_WAIT(wait);
 		struct request_list *rl = &q->rq;
 
 		prepare_to_wait_exclusive(&rl->wait[rw], &wait,

[uuencoded attachment "old_freereq.patch" omitted]
Re: Processes stuck on D state on Dual Opteron
On Tue, Apr 12 2005, Nick Piggin wrote:
> Actually the patches I have sent you do fix real bugs, but they also
> make the block layer less likely to recurse into page reclaim, so it
> may be e.g. hiding the problem that Neil's patch fixes.

Can you push those to Andrew? I'm quite happy with the way they turned
out. It would be nice if Ken would bench 2.6.12-rc2 with and without
those patches.

--
Jens Axboe
Re: Processes stuck on D state on Dual Opteron
On Tue, Apr 12 2005, Nick Piggin wrote: Actually the patches I have sent you do fix real bugs, but they also make the block layer less likely to recurse into page reclaim, so it may be eg. hiding the problem that Neil's patch fixes. Can you push those to Andrew? I'm quite happy with the way they turned out. It would be nice if Ken would bench 2.6.12-rc2 with and without those patches. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: Processes stuck on D state on Dual Opteron
On Tue, Apr 12 2005, Nick Piggin wrote: Actually the patches I have sent you do fix real bugs, but they also make the block layer less likely to recurse into page reclaim, so it may be eg. hiding the problem that Neil's patch fixes. Jens Axboe wrote on Tuesday, April 12, 2005 12:08 AM Can you push those to Andrew? I'm quite happy with the way they turned out. It would be nice if Ken would bench 2.6.12-rc2 with and without those patches. I like the patch a lot and already did bench it on our db setup. However, I'm seeing a negative regression compare to a very very crappy patch (see attached, you can laugh at me for doing things like that :-). My first reaction is that the overhead is in wait queue setup and tear down in get_request_wait function. Throwing the following patch on top does improve things a bit, but we are still in the negative territory. I can't explain why. Everything suppose to be faster. So I'm staring at the execution profile at the moment. diff -Nru a/drivers/block/ll_rw_blk.c b/drivers/block/ll_rw_blk.c --- a/drivers/block/ll_rw_blk.c 2005-04-12 00:48:12 -07:00 +++ b/drivers/block/ll_rw_blk.c 2005-04-12 00:48:12 -07:00 @@ -1740,10 +1740,35 @@ */ static struct request *get_request_wait(request_queue_t *q, int rw) { - DEFINE_WAIT(wait); struct request *rq; + struct request_list *rl = q-rq; + int gfp_flag = GFP_ATOMIC; + + if (rl-count[rw] queue_congestion_off_threshold(q)) { + rq = kmem_cache_alloc(request_cachep, gfp_flag); + if (rq) { + if (!elv_set_request(q, rq, gfp_flag)) { + + rl-count[rw]++; + INIT_LIST_HEAD(rq-queuelist); + rq-flags = rw; + rq-rq_status = RQ_ACTIVE; + rq-ref_count = 1; + rq-q = q; + rq-rl = rl; + rq-special = NULL; + rq-data_len = 0; + rq-data = NULL; + rq-sense = NULL; + + return rq; + } + kmem_cache_free(request_cachep, rq); + } + } do { + DEFINE_WAIT(wait); struct request_list *rl = q-rq; prepare_to_wait_exclusive(rl-wait[rw], wait, begin 666 old_freereq.patch M9EF9B M3G)U($O9')I=F5RR]B;]C:R]L;%]R=U]B;LN8R!B+V1R:79E 
MG,O8FQO8VLO;Q?G=?8FQK+F,*+2TM($O9')I=F5RR]B;]C:R]L;%]R M=U]B;LN8PDR,# U+3 T+3 T(# P.C4X.C4U(TP-SHP, [EMAIL PROTECTED]FEV M97)S+V)L;V-K+VQL7W)W7V)L:RYC3(P,#4M,#0M,#0@,# [EMAIL PROTECTED]@+3 W M.C PD! (TQ.3DV+#$U(LQ.3DV+#$T($! B *( ER82 ]()I;RT^8FE? MG@)B H,2 \/!24]?4E=?04A%040I.PH@BL)[EMAIL PROTECTED])A8B!A(9R964@ MF5Q=65S=!FF]M('1H92!FF5E;ES= J+PHK69R965R97$@/2!G971? MF5Q=65S=AQ+[EMAIL PROTECTED])0RD[BL*([EMAIL PROTECTED]@7-P:6Y? M;]C:U]IG$H2T^75E=65?;]C:RD[B *+0EI9B H96QV7W%U975E7V5M M'1Y*'$I*2![BL):[EMAIL PROTECTED]5L=E]Q=65U95]E;7!T2AQ*2D*( D)8FQK7W!L M=6=?95V:6-E*'$I.PHM0EG;W1O(=E=%]R3L*+0E]BT):[EMAIL PROTECTED])AG)I M97(IBT)6=O=[EMAIL PROTECTED])Q.PH@B )96Q?F5T([EMAIL PROTECTED]F=E*'$L A(9R97$L()I;RD[B )W=I=-H(AE;%]R970I('L* ` end - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
Nick Piggin wrote: Chen, Kenneth W wrote: I like the patch a lot and already did bench it on our db setup. However, I'm seeing a negative regression compare to a very very crappy patch (see attached, you can laugh at me for doing things like that :-). OK - if we go that way, perhaps the following patch may be the way to do it. Here. -- SUSE Labs, Novell Inc. Index: linux-2.6/drivers/block/ll_rw_blk.c === --- linux-2.6.orig/drivers/block/ll_rw_blk.c2005-04-12 21:03:01.0 +1000 +++ linux-2.6/drivers/block/ll_rw_blk.c 2005-04-12 21:03:45.0 +1000 @@ -1956,10 +1956,11 @@ out: */ static struct request *get_request_wait(request_queue_t *q, int rw) { - DEFINE_WAIT(wait); struct request *rq; - do { + rq = get_request(q, rw, GFP_NOIO); + while (!rq) { + DEFINE_WAIT(wait); struct request_list *rl = q-rq; prepare_to_wait_exclusive(rl-wait[rw], wait, @@ -1987,7 +1988,7 @@ static struct request *get_request_wait( spin_lock_irq(q-queue_lock); } finish_wait(rl-wait[rw], wait); - } while (!rq); + } return rq; }
Re: Processes stuck on D state on Dual Opteron
Nick Piggin wrote: Nick Piggin wrote: Chen, Kenneth W wrote: I like the patch a lot and already did bench it on our db setup. However, I'm seeing a negative regression compare to a very very crappy patch (see attached, you can laugh at me for doing things like that :-). OK - if we go that way, perhaps the following patch may be the way to do it. Here. Actually yes this is good I think. What I was worried about is that you could lose some fairness due to not being put on the queue before allocation. This is probably a silly thing to worry about, because up until that point things aren't really deterministic anyway (and before this patchset it would try doing a GFP_ATOMIC allocation first anyway). However after the subsequent locking rework, both these get_request() calls will be performed under the same lock - giving you the same fairness. So it is nothing to worry about anyway! It is a bit subtle: get_request may only drop the lock and return NULL (after retaking the lock), if we fail on a memory allocation. If we just fail due to unavailable queue slots, then the lock is never dropped. And the mem allocation can't fail because it is a mempool alloc with GFP_NOIO. Nick -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
Chen, Kenneth W wrote: On Tue, Apr 12 2005, Nick Piggin wrote: Actually the patches I have sent you do fix real bugs, but they also make the block layer less likely to recurse into page reclaim, so it may be eg. hiding the problem that Neil's patch fixes. Jens Axboe wrote on Tuesday, April 12, 2005 12:08 AM Can you push those to Andrew? I'm quite happy with the way they turned out. It would be nice if Ken would bench 2.6.12-rc2 with and without those patches. I like the patch a lot and already did bench it on our db setup. However, I'm seeing a negative regression compare to a very very crappy patch (see attached, you can laugh at me for doing things like that :-). OK - if we go that way, perhaps the following patch may be the way to do it. My first reaction is that the overhead is in wait queue setup and tear down in get_request_wait function. Throwing the following patch on top does improve things a bit, but we are still in the negative territory. I can't explain why. Everything suppose to be faster. So I'm staring at the execution profile at the moment. Hmm, that's a bit disappointing. Like you said though, I'm sure we should be able to get better performance out of this. I'll look at it and see if we can rework it. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
Nick Piggin wrote: It is a bit subtle: get_request may only drop the lock and return NULL (after retaking the lock), if we fail on a memory allocation. If we just fail due to unavailable queue slots, then the lock is never dropped. And the mem allocation can't fail because it is a mempool alloc with GFP_NOIO. I'm jumping in here, because we have seen this problem on a X86-64 system, with 4gb of ram, and SLES9 (2.6.5-7.141) You can drive the node into this state: Mem-info: Node 1 DMA per-cpu: empty Node 1 Normal per-cpu: cpu 0 hot: low 32, high 96, batch 16 cpu 0 cold: low 0, high 32, batch 16 cpu 1 hot: low 32, high 96, batch 16 cpu 1 cold: low 0, high 32, batch 16 Node 1 HighMem per-cpu: empty Node 0 DMA per-cpu: cpu 0 hot: low 2, high 6, batch 1 cpu 0 cold: low 0, high 2, batch 1 cpu 1 hot: low 2, high 6, batch 1 cpu 1 cold: low 0, high 2, batch 1 Node 0 Normal per-cpu: cpu 0 hot: low 32, high 96, batch 16 cpu 0 cold: low 0, high 32, batch 16 cpu 1 hot: low 32, high 96, batch 16 cpu 1 cold: low 0, high 32, batch 16 Node 0 HighMem per-cpu: empty Free pages: 10360kB (0kB HighMem) Active:485853 inactive:421820 dirty:0 writeback:0 unstable:0 free:2590 slab:10816 mapped:903444 pagetables:2097 Node 1 DMA free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB lowmem_reserve[]: 0 1664 1664 Node 1 Normal free:2464kB min:2468kB low:4936kB high:7404kB active:918440kB inactive:710360kB present:1703936kB lowmem_reserve[]: 0 0 0 Node 1 HighMem free:0kB min:128kB low:256kB high:384kB active:0kB inactive:0kB present:0kB lowmem_reserve[]: 0 0 0 Node 0 DMA free:4928kB min:20kB low:40kB high:60kB active:0kB inactive:0kB present:16384kB lowmem_reserve[]: 0 2031 2031 Node 0 Normal free:2968kB min:3016kB low:6032kB high:9048kB active:1024968kB inactive:976924kB present:2080764kB lowmem_reserve[]: 0 0 0 Node 0 HighMem free:0kB min:128kB low:256kB high:384kB active:0kB inactive:0kB present:0kB lowmem_reserve[]: 0 0 0 Node 1 DMA: empty Node 1 Normal: 46*4kB 19*8kB 9*16kB 
4*32kB 1*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 2464kB Node 1 HighMem: empty Node 0 DMA: 4*4kB 4*8kB 1*16kB 2*32kB 3*64kB 4*128kB 2*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 4928kB Node 0 Normal: 0*4kB 1*8kB 1*16kB 0*32kB 0*64kB 1*128kB 1*256kB 3*512kB 1*1024kB 0*2048kB 0*4096kB = 2968kB Node 0 HighMem: empty Swap cache: add 1009224, delete 106245, find 179674/181478, race 0+2 Free swap: 4739812kB 950271 pages of RAM 17513 reserved pages 2788 pages shared 902980 pages swap cached with processes doing this: SysRq : Show State sibling task PC pid father child younger older init D 0100e810 0 1 0 2 (NOTLB) 01007ff81be8 0006 010002c1d6e0 Call Trace:8017338b{try_to_free_pages+283} 80147d0d{schedule_timeout+173} 80147c50{process_timeout+0} 8013a292{io_schedule_timeout+82} 80280efd{blk_congestion_wait+141} 8013c530{autoremove_wake_function+0} 8013c530{autoremove_wake_function+0} 8016ab68{__alloc_pages+776} 8018573f{read_swap_cache_async+63} 801781b1{swapin_readahead+97} 8017834e{do_swap_page+142} 801796a1{handle_mm_fault+337} 80123ebb{do_page_fault+411} 801a3259{sys_select+1097} 801a332f{sys_select+1311} 801122a9{error_exit+0} mg.C.2D 0100e810 0 1971 1955 1972 (NOTLB) 0100e236bc68 0006 0001 0100816ed360 Call Trace:8017338b{try_to_free_pages+283} 80147d0d{schedule_timeout+173} 80147c50{process_timeout+0} 8013a292{io_schedule_timeout+82} 80280efd{blk_congestion_wait+141} 8013c530{autoremove_wake_function+0} 8013c530{autoremove_wake_function+0} 8016ab68{__alloc_pages+776} 801778ad{do_wp_page+285} 801796c5{handle_mm_fault+373} 80123ebb{do_page_fault+411} 801122a9{error_exit+0} mg.C.2S 01007b0a06a0 0 1972 1971 1974 (NOTLB) 0100bc1c1ca0 0006 0010 00010246 0004c7c0 0100816ec280 00768780 010081f23390 00018780 0100816ed360 Call Trace:8016abb4{__alloc_pages+852} 80110ac8{__down_interruptible+216} 80139280{default_wake_function+0} 8013531c{recalc_task_prio+940} 80230d91{__down_failed_interruptible+53} a01cc47e{:mosal:.text.lock.mosal_sync+5}
RE: Processes stuck on D state on Dual Opteron
Nick Piggin wrote on Tuesday, April 12, 2005 4:09 AM Chen, Kenneth W wrote: I like the patch a lot and already did bench it on our db setup. However, I'm seeing a negative regression compare to a very very crappy patch (see attached, you can laugh at me for doing things like that :-). OK - if we go that way, perhaps the following patch may be the way to do it. OK, if you are going to do it that way, then the ioc_batching code in get_request has to be reworked. We never push the queue so hard that it kicks itself into the batching mode. However, calls to get_io_context and put_io_context are unconditional in that function. Execution profile shows that these two little functions actually consumed lots of cpu cycles. AFAICS, ioc_*batching() is trying to push more requests onto the queue that is full (or near full) and give high priority to the process that hits the last req slot. Why do we need to go all the way to tsk-io_context to keep track of that state? For a clean up bonus, I think the tracking can be moved into the queue structure. My first reaction is that the overhead is in wait queue setup and tear down in get_request_wait function. Throwing the following patch on top does improve things a bit, but we are still in the negative territory. I can't explain why. Everything suppose to be faster. So I'm staring at the execution profile at the moment. Hmm, that's a bit disappointing. Like you said though, I'm sure we should be able to get better performance out of this. Absolutely. I'm disappointed too and this is totally out of expectation. There must be some other factors. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
On Tuesday 12 April 2005 01:46, Andrew Morton wrote: Claudio Martins [EMAIL PROTECTED] wrote: I think I'm going to give a try to Neil's patch, but I'll have to apply some patches from -mm. Just this one if you're using 2.6.12-rc2: --- 25/drivers/md/md.c~avoid-deadlock-in-sync_page_io-by-using-gfp_noio Mon Apr 11 16:55:07 2005 +++ 25-akpm/drivers/md/md.c Mon Apr 11 16:55:07 2005 @@ -332,7 +332,7 @@ static int bi_complete(struct bio *bio, static int sync_page_io(struct block_device *bdev, sector_t sector, int size, struct page *page, int rw) { - struct bio *bio = bio_alloc(GFP_KERNEL, 1); + struct bio *bio = bio_alloc(GFP_NOIO, 1); struct completion event; int ret; _ Hi Andrew, all, Sorry for the delay in reporting. This patch does indeed fix the problem. The machine ran stress for almost 15h straight with no problems at all. As for Nick's patch I, too, think it would be nice to be included (once the performance problems are sorted out), since it seemed to make the block layer more robust and well behaved (at least with stress), although I didn't run performance tests to measure regressions. Thanks Nick, Neil, Andrew and all others for your great help with this issue. I'll have to put the machine on production now with the patch applied, but let me know if I can be of any further help with these issues. Thanks Claudio - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
On Tue, 2005-04-12 at 01:22 +0100, Claudio Martins wrote: > On Monday 11 April 2005 23:59, Nick Piggin wrote: > > > > > OK, I'll try them in a few minutes and report back. > > > > I'm not overly hopeful. If they fix the problem, then it's likely > > that the real bug is hidden. > > > > Well, the thing is, they do fix the problem. Or at least they hide it very > well ;-) > > It has been running for more than 5 hours now with stress with no problems > and no stuck processes. > Well, that is good... I guess ;) Actually the patches I have sent you do fix real bugs, but they also make the block layer less likely to recurse into page reclaim, so it may be eg. hiding the problem that Neil's patch fixes. It may be that your fundamental problem is solved by my patches, but we need to be sure. > I think I'm going to give a try to Neil's patch, but I'll have to apply > some > patches from -mm. > Yep that would be good. Please test -rc2 with Andrew's patch, and obviously my patches backed out. Thanks for sticking with it. Nick - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
Claudio Martins <[EMAIL PROTECTED]> wrote: > > I think I'm going to give a try to Neil's patch, but I'll have to apply > some > patches from -mm. Just this one if you're using 2.6.12-rc2: --- 25/drivers/md/md.c~avoid-deadlock-in-sync_page_io-by-using-gfp_noio Mon Apr 11 16:55:07 2005 +++ 25-akpm/drivers/md/md.c Mon Apr 11 16:55:07 2005 @@ -332,7 +332,7 @@ static int bi_complete(struct bio *bio, static int sync_page_io(struct block_device *bdev, sector_t sector, int size, struct page *page, int rw) { - struct bio *bio = bio_alloc(GFP_KERNEL, 1); + struct bio *bio = bio_alloc(GFP_NOIO, 1); struct completion event; int ret; _ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
On Tuesday 12 April 2005 00:46, Neil Brown wrote: > On Monday April 11, [EMAIL PROTECTED] wrote: > > Neil, have you had a look at the traces? Do they mean much to you? > > Just looked. > bio_alloc_bioset seems implicated, as does sync_page_io. > > sync_page_io used to use a 'struct bio' on the stack, but Jens Axboe > changed it to use bio_alloc (don't know why..) and I should have > checked the change better. > > sync_page_io can be called on the write out path, so it should use > GFP_NOIO rather than GFP_KERNEL. > > See if this helps. Actually this patch is against 2.6.12-rc2-mm1 > which uses md_super_write instead of sync_page_io (which is now only > used for read). So if you are using a non-mm kernel (which seems to > be the case) you'll need to apply the patch by hand. > Hi Neil, I'll test this patch, but I'm wondering whether I have to apply all the md-related patches from the broken-out directory of 2.6.12-rc2-mm1 or only some specific ones? Anyway, I'm happy to test all those md updates if you think they might help. Thanks Claudio - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
On Monday 11 April 2005 23:59, Nick Piggin wrote: > > > OK, I'll try them in a few minutes and report back. > > I'm not overly hopeful. If they fix the problem, then it's likely > that the real bug is hidden. > Well, the thing is, they do fix the problem. Or at least they hide it very well ;-) It has been running for more than 5 hours now with stress with no problems and no stuck processes. I think I'm going to give a try to Neil's patch, but I'll have to apply some patches from -mm. Thanks Claudio - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
On Monday April 11, [EMAIL PROTECTED] wrote: > > Neil, have you had a look at the traces? Do they mean much to you? > Just looked. bio_alloc_bioset seems implicated, as does sync_page_io. sync_page_io used to use a 'struct bio' on the stack, but Jens Axboe changed it to use bio_alloc (don't know why..) and I should have checked the change better. sync_page_io can be called on the write out path, so it should use GFP_NOIO rather than GFP_KERNEL. See if this helps. Actually this patch is against 2.6.12-rc2-mm1 which uses md_super_write instead of sync_page_io (which is now only used for read). So if you are using a non-mm kernel (which seems to be the case) you'll need to apply the patch by hand. Thanks, NeilBrown --- Avoid deadlock in sync_page_io by using GFP_NOIO ..as sync_page_io can be called on the write-out path. Ditto for md_super_write Signed-off-by: Neil Brown <[EMAIL PROTECTED]> ### Diffstat output ./drivers/md/md.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff ./drivers/md/md.c~current~ ./drivers/md/md.c --- ./drivers/md/md.c~current~ 2005-04-08 11:25:26.0 +1000 +++ ./drivers/md/md.c 2005-04-12 09:42:29.0 +1000 @@ -351,7 +351,7 @@ void md_super_write(mddev_t *mddev, mdk_ * if zero is reached. * If an error occurred, call md_error */ - struct bio *bio = bio_alloc(GFP_KERNEL, 1); + struct bio *bio = bio_alloc(GFP_NOIO, 1); bio->bi_bdev = rdev->bdev; bio->bi_sector = sector; @@ -374,7 +374,7 @@ static int bi_complete(struct bio *bio, int sync_page_io(struct block_device *bdev, sector_t sector, int size, struct page *page, int rw) { - struct bio *bio = bio_alloc(GFP_KERNEL, 1); + struct bio *bio = bio_alloc(GFP_NOIO, 1); struct completion event; int ret; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
Claudio Martins wrote: Right. I'm using two Seagate ATA133 disks (the IDE controller is an AMD-8111), each with 4 partitions, so I get 4 md Raid1 devices. The first one, md0, is for swap. The rest are ~$ df -h Filesystem Size Used Avail Use% Mounted on /dev/md1 4.6G 1.9G 2.6G 42% / tmpfs 1005M 0 1005M 0% /dev/shm /dev/md3 32G 107M 30G 1% /home /dev/md2 31G 149M 29G 1% /var In these tests, /home on md3 is the working area for stress. The io scheduler used is the anticipatory one. OK. OK, I'll try them in a few minutes and report back. I'm not overly hopeful. If they fix the problem, then it's likely that the real bug is hidden. I'm curious as to whether increasing the vm.min_free_kbytes sysctl value would help or not in this case. But I guess it wouldn't, since there is already some free memory and also the alloc failures are order 0, right? Yes. And the failures you were seeing with my first patch were coming from the mempool code anyway. We want those to fail early so they don't eat into the min_free_kbytes memory. You could try raising min_free_kbytes though. If that fixes it, then it indicates there might be some problem in a memory allocation failure path in software raid somewhere. Thanks -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
On Monday 11 April 2005 13:45, Nick Piggin wrote: > > No luck yet (on SMP i386). How many disks are you using in each > raid1 array? You are using one array for swap, and one mounted as > ext3 for the working area of the `stress` program, right? > Right. I'm using two Seagate ATA133 disks (the IDE controller is an AMD-8111), each with 4 partitions, so I get 4 md Raid1 devices. The first one, md0, is for swap. The rest are ~$ df -h Filesystem Size Used Avail Use% Mounted on /dev/md1 4.6G 1.9G 2.6G 42% / tmpfs 1005M 0 1005M 0% /dev/shm /dev/md3 32G 107M 30G 1% /home /dev/md2 31G 149M 29G 1% /var In these tests, /home on md3 is the working area for stress. The io scheduler used is the anticipatory one. > Neil, have you had a look at the traces? Do they mean much to you? > > Claudio - I have attached another patch you could try. It has a more > complete set of mempool and related memory allocation fixes, as well > as some other recent patches I had which reduce atomic memory usage > by the block layer. Could you try if you get time? Thanks. OK, I'll try them in a few minutes and report back. I'm curious as to whether increasing the vm.min_free_kbytes sysctl value would help or not in this case. But I guess it wouldn't, since there is already some free memory and also the alloc failures are order 0, right? Thanks Claudio - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
Nick Piggin wrote: The common theme seems to be: try_to_free_pages, swap_writepage, mempool_alloc, down/down_failed in .text.lock.md. Next I would suspect md/raid1 - maybe some deadlock in an uncommon memory allocation failure path? I'll see if I can reproduce it here. No luck yet (on SMP i386). How many disks are you using in each raid1 array? You are using one array for swap, and one mounted as ext3 for the working area of the `stress` program, right? Neil, have you had a look at the traces? Do they mean much to you? Claudio - I have attached another patch you could try. It has a more complete set of mempool and related memory allocation fixes, as well as some other recent patches I had which reduce atomic memory usage by the block layer. Could you try if you get time? Thanks. Nick -- SUSE Labs, Novell Inc. Index: linux-2.6/drivers/block/ll_rw_blk.c === --- linux-2.6.orig/drivers/block/ll_rw_blk.c 2005-04-11 22:18:49.0 +1000 +++ linux-2.6/drivers/block/ll_rw_blk.c 2005-04-11 22:38:10.0 +1000 @@ -1450,7 +1450,7 @@ EXPORT_SYMBOL(blk_remove_plug); */ void __generic_unplug_device(request_queue_t *q) { - if (test_bit(QUEUE_FLAG_STOPPED, &q->queue_flags)) + if (unlikely(test_bit(QUEUE_FLAG_STOPPED, &q->queue_flags))) return; if (!blk_remove_plug(q)) @@ -1828,7 +1828,6 @@ static void __freed_request(request_queu clear_queue_congested(q, rw); if (rl->count[rw] + 1 <= q->nr_requests) { - smp_mb(); if (waitqueue_active(&rl->wait[rw])) wake_up(&rl->wait[rw]); @@ -1860,18 +1859,20 @@ static void freed_request(request_queue_ #define blkdev_free_rq(list) list_entry((list)->next, struct request, queuelist) /* - * Get a free request, queue_lock must not be held + * Get a free request, queue_lock must be held. + * Returns NULL on failure, with queue_lock held. + * Returns !NULL on success, with queue_lock *not held*. 
 */ static struct request *get_request(request_queue_t *q, int rw, int gfp_mask) { + int batching; struct request *rq = NULL; struct request_list *rl = &q->rq; - struct io_context *ioc = get_io_context(gfp_mask); + struct io_context *ioc = get_io_context(GFP_ATOMIC); if (unlikely(test_bit(QUEUE_FLAG_DRAIN, &q->queue_flags))) goto out; - spin_lock_irq(q->queue_lock); if (rl->count[rw]+1 >= q->nr_requests) { /* * The queue will fill after this allocation, so set it as @@ -1884,6 +1885,8 @@ static struct request *get_request(reque blk_set_queue_full(q, rw); } } + + batching = ioc_batching(q, ioc); switch (elv_may_queue(q, rw)) { case ELV_MQUEUE_NO: @@ -1894,12 +1897,11 @@ static struct request *get_request(reque goto get_rq; } - if (blk_queue_full(q, rw) && !ioc_batching(q, ioc)) { + if (blk_queue_full(q, rw) && !batching) { /* * The queue is full and the allocating process is not a * "batcher", and not exempted by the IO scheduler */ - spin_unlock_irq(q->queue_lock); goto out; } @@ -1933,11 +1935,10 @@ rq_starved: if (unlikely(rl->count[rw] == 0)) rl->starved[rw] = 1; - spin_unlock_irq(q->queue_lock); goto out; } - if (ioc_batching(q, ioc)) + if (batching) ioc->nr_batch_requests--; rq_init(q, rq); @@ -1950,13 +1951,14 @@ out: /* * No available requests for this queue, unplug the device and wait for some * requests to become available. + * + * Called with q->queue_lock held, and returns with it unlocked. 
 */ static struct request *get_request_wait(request_queue_t *q, int rw) { DEFINE_WAIT(wait); struct request *rq; - generic_unplug_device(q); do { struct request_list *rl = &q->rq; @@ -1968,6 +1970,8 @@ static struct request *get_request_wait( if (!rq) { struct io_context *ioc; + __generic_unplug_device(q); + spin_unlock_irq(q->queue_lock); io_schedule(); /* @@ -1979,6 +1983,8 @@ static struct request *get_request_wait( ioc = get_io_context(GFP_NOIO); ioc_set_batching(q, ioc); put_io_context(ioc); + + spin_lock_irq(q->queue_lock); } finish_wait(&rl->wait[rw], &wait); } while (!rq); @@ -1992,10 +1998,15 @@ struct request *blk_get_request(request_ BUG_ON(rw != READ && rw != WRITE); + spin_lock_irq(q->queue_lock); if (gfp_mask
Re: Processes stuck on D state on Dual Opteron
Claudio Martins wrote: On Sunday 10 April 2005 03:47, Andrew Morton wrote: Suggest you boot with `nmi_watchdog=0' to prevent the nmi watchdog from cutting in during long sysrq traces. Also, capture the `sysrq-m' output so we can see if the thing is out of memory. Hi Andrew, Thanks for the tip. I booted with nmi_watchdog=0 and was able to get a full sysrq-t as well as a sysrq-m. Since it might be a little too big for the list, I've put it on a text file at: http://193.136.132.235/dl145/dump1-2.6.12-rc2.txt OK, you _may_ be out of memory here (depending on what the lower zone protection for DMA ends up as), however you are well above all the "emergency watermarks" in ZONE_NORMAL. Also: I also made a run with the mempool-can-fail patch from Nick Piggin. With this I got some nice memory allocation errors from the md threads when the trouble started. The dump (with sysrq-t and sysrq-m included) is at: http://193.136.132.235/dl145/dump2-2.6.12-rc2-nick1.txt This one shows plenty of memory. The allocation failure messages are actually a good thing, and show that my patch is sort of working. I have reworked it a bit so they won't show up though. So probably not your common or garden memory deadlock. The common theme seems to be: try_to_free_pages, swap_writepage, mempool_alloc, down/down_failed in .text.lock.md. Next I would suspect md/raid1 - maybe some deadlock in an uncommon memory allocation failure path? I'll see if I can reproduce it here. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
Claudio Martins wrote: On Sunday 10 April 2005 03:47, Andrew Morton wrote: Suggest you boot with `nmi_watchdog=0' to prevent the nmi watchdog from cutting in during long sysrq traces. Also, capture the `sysrq-m' output so we can see if the thing is out of memory. Hi Andrew, Thanks for the tip. I booted with nmi_watchdog=0 and was able to get a full sysrq-t as well as a sysrq-m. Since it might be a little too big for the list, I've put it on a text file at: http://193.136.132.235/dl145/dump1-2.6.12-rc2.txt I also made a run with the mempool-can-fail patch from Nick Piggin. With this I got some nice memory allocation errors from the md threads when the trouble started. The dump (with sysrq-t and sysrq-m included) is at: http://193.136.132.235/dl145/dump2-2.6.12-rc2-nick1.txt Let me know if you find it more convenient to send the dumps by mail or something. Hope this helps. I tried to get these just now, but couldn't. Would you gzip them and send them to me privately? Thanks, Nick -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
Claudio Martins wrote: On Sunday 10 April 2005 03:47, Andrew Morton wrote: Suggest you boot with `nmi_watchdog=0' to prevent the nmi watchdog from cutting in during long sysrq traces. Also, capture the `sysrq-m' output so we can see if the thing is out of memory. Hi Andrew, Thanks for the tip. I booted with nmi_watchdog=0 and was able to get a full sysrq-t as well as a sysrq-m. Since it might be a little too big for the list, I've put it on a text file at: http://193.136.132.235/dl145/dump1-2.6.12-rc2.txt I also made a run with the mempool-can-fail patch from Nick Piggin. With this I got some nice memory allocation errors from the md threads when the trouble started. The dump (with sysrq-t and sysrq-m included) is at: http://193.136.132.235/dl145/dump2-2.6.12-rc2-nick1.txt Let me know if you find it more convenient to send the dumps by mail or something. Hope this helps. Itried to get these just now, but couldn't. Would you gzip them and send them to me privately? Thanks, Nick -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
Claudio Martins wrote: On Sunday 10 April 2005 03:47, Andrew Morton wrote: Suggest you boot with `nmi_watchdog=0' to prevent the nmi watchdog from cutting in during long sysrq traces. Also, capture the `sysrq-m' output so we can see if the thing is out of memory. Hi Andrew, Thanks for the tip. I booted with nmi_watchdog=0 and was able to get a full sysrq-t as well as a sysrq-m. Since it might be a little too big for the list, I've put it on a text file at: http://193.136.132.235/dl145/dump1-2.6.12-rc2.txt OK, you _may_ be out of memory here (depending on what the lower zone protection for DMA ends up as), however you are well above all the emergency watermarks in ZONE_NORMAL. Also: I also made a run with the mempool-can-fail patch from Nick Piggin. With this I got some nice memory allocation errors from the md threads when the trouble started. The dump (with sysrq-t and sysrq-m included) is at: http://193.136.132.235/dl145/dump2-2.6.12-rc2-nick1.txt This one shows plenty of memory. The allocation failure messages are actually a good thing, and show that my patch is sort of working. I have reworked it a bit so they won't show up though. So probably not your common or garden memory deadlock. The common theme seems to be: try_to_free_pages, swap_writepage, mempool_alloc, down/down_failed in .text.lock.md. Next I would suspect md/raid1 - maybe some deadlock in an uncommon memory allocation failure path? I'll see if I can reproduce it here. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
Nick Piggin wrote: The common theme seems to be: try_to_free_pages, swap_writepage, mempool_alloc, down/down_failed in .text.lock.md. Next I would suspect md/raid1 - maybe some deadlock in an uncommon memory allocation failure path? I'll see if I can reproduce it here. No luck yet (on SMP i386). How many disks are you using in each raid1 array? You are using one array for swap, and one mounted as ext3 for the working area of the `stress` program, right? Neil, have you had a look at the traces? Do they mean much to you? Claudio - I have attached another patch you could try. It has a more complete set of mempool and related memory allocation fixes, as well as some other recent patches I had which reduces atomic memory usage by the block layer. Could you try if you get time? Thanks. Nick -- SUSE Labs, Novell Inc. Index: linux-2.6/drivers/block/ll_rw_blk.c === --- linux-2.6.orig/drivers/block/ll_rw_blk.c2005-04-11 22:18:49.0 +1000 +++ linux-2.6/drivers/block/ll_rw_blk.c 2005-04-11 22:38:10.0 +1000 @@ -1450,7 +1450,7 @@ EXPORT_SYMBOL(blk_remove_plug); */ void __generic_unplug_device(request_queue_t *q) { - if (test_bit(QUEUE_FLAG_STOPPED, q-queue_flags)) + if (unlikely(test_bit(QUEUE_FLAG_STOPPED, q-queue_flags))) return; if (!blk_remove_plug(q)) @@ -1828,7 +1828,6 @@ static void __freed_request(request_queu clear_queue_congested(q, rw); if (rl-count[rw] + 1 = q-nr_requests) { - smp_mb(); if (waitqueue_active(rl-wait[rw])) wake_up(rl-wait[rw]); @@ -1860,18 +1859,20 @@ static void freed_request(request_queue_ #define blkdev_free_rq(list) list_entry((list)-next, struct request, queuelist) /* - * Get a free request, queue_lock must not be held + * Get a free request, queue_lock must be held. + * Returns NULL on failure, with queue_lock held. + * Returns !NULL on success, with queue_lock *not held*. 
*/ static struct request *get_request(request_queue_t *q, int rw, int gfp_mask) { + int batching; struct request *rq = NULL; struct request_list *rl = q-rq; - struct io_context *ioc = get_io_context(gfp_mask); + struct io_context *ioc = get_io_context(GFP_ATOMIC); if (unlikely(test_bit(QUEUE_FLAG_DRAIN, q-queue_flags))) goto out; - spin_lock_irq(q-queue_lock); if (rl-count[rw]+1 = q-nr_requests) { /* * The queue will fill after this allocation, so set it as @@ -1884,6 +1885,8 @@ static struct request *get_request(reque blk_set_queue_full(q, rw); } } + + batching = ioc_batching(q, ioc); switch (elv_may_queue(q, rw)) { case ELV_MQUEUE_NO: @@ -1894,12 +1897,11 @@ static struct request *get_request(reque goto get_rq; } - if (blk_queue_full(q, rw) !ioc_batching(q, ioc)) { + if (blk_queue_full(q, rw) !batching) { /* * The queue is full and the allocating process is not a * batcher, and not exempted by the IO scheduler */ - spin_unlock_irq(q-queue_lock); goto out; } @@ -1933,11 +1935,10 @@ rq_starved: if (unlikely(rl-count[rw] == 0)) rl-starved[rw] = 1; - spin_unlock_irq(q-queue_lock); goto out; } - if (ioc_batching(q, ioc)) + if (batching) ioc-nr_batch_requests--; rq_init(q, rq); @@ -1950,13 +1951,14 @@ out: /* * No available requests for this queue, unplug the device and wait for some * requests to become available. + * + * Called with q-queue_lock held, and returns with it unlocked. 
*/ static struct request *get_request_wait(request_queue_t *q, int rw) { DEFINE_WAIT(wait); struct request *rq; - generic_unplug_device(q); do { struct request_list *rl = q-rq; @@ -1968,6 +1970,8 @@ static struct request *get_request_wait( if (!rq) { struct io_context *ioc; + __generic_unplug_device(q); + spin_unlock_irq(q-queue_lock); io_schedule(); /* @@ -1979,6 +1983,8 @@ static struct request *get_request_wait( ioc = get_io_context(GFP_NOIO); ioc_set_batching(q, ioc); put_io_context(ioc); + + spin_lock_irq(q-queue_lock); } finish_wait(rl-wait[rw], wait); } while (!rq); @@ -1992,10 +1998,15 @@ struct request *blk_get_request(request_ BUG_ON(rw != READ rw != WRITE); + spin_lock_irq(q-queue_lock); if (gfp_mask
Re: Processes stuck on D state on Dual Opteron
On Monday 11 April 2005 13:45, Nick Piggin wrote: No luck yet (on SMP i386). How many disks are you using in each raid1 array? You are using one array for swap, and one mounted as ext3 for the working area of the `stress` program, right? Right. I'm using two Seagate ATA133 disks (ide controler is AMD-8111) each with 4 partitions, so I get 4 md Raid1 devices. The first one, md0, is for swap. The rest are ~$ df -h FilesystemSize Used Avail Use% Mounted on /dev/md1 4.6G 1.9G 2.6G 42% / tmpfs1005M 0 1005M 0% /dev/shm /dev/md3 32G 107M 30G 1% /home /dev/md2 31G 149M 29G 1% /var In these tests, /home on md3 is the working area for stress. The io scheduler used is the anticipatory. Neil, have you had a look at the traces? Do they mean much to you? Claudio - I have attached another patch you could try. It has a more complete set of mempool and related memory allocation fixes, as well as some other recent patches I had which reduces atomic memory usage by the block layer. Could you try if you get time? Thanks. OK, I'll try them in a few minutes and report back. I'm curious as whether increasing the vm.min_free_kbytes sysctl value would help or not in this case. But I guess it wouldn't since there is already some free memory and also the alloc failures are order 0, right? Thanks Claudio - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
Claudio Martins wrote: Right. I'm using two Seagate ATA133 disks (ide controler is AMD-8111) each with 4 partitions, so I get 4 md Raid1 devices. The first one, md0, is for swap. The rest are ~$ df -h FilesystemSize Used Avail Use% Mounted on /dev/md1 4.6G 1.9G 2.6G 42% / tmpfs1005M 0 1005M 0% /dev/shm /dev/md3 32G 107M 30G 1% /home /dev/md2 31G 149M 29G 1% /var In these tests, /home on md3 is the working area for stress. The io scheduler used is the anticipatory. OK. OK, I'll try them in a few minutes and report back. I'm not overly hopeful. If they fix the problem, then it's likely that the real bug is hidden. I'm curious as whether increasing the vm.min_free_kbytes sysctl value would help or not in this case. But I guess it wouldn't since there is already some free memory and also the alloc failures are order 0, right? Yes. And the failures you were seeing with my first patch were coming from the mempool code anyway. We want those to fail early so they don't eat into the min_free_kbytes memory. You could try raising min_free_kbytes though. If that fixes it, then it indicates there might be some problem in a memory allocation failure path in software raid somewhere. Thanks -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
On Monday April 11, [EMAIL PROTECTED] wrote: Neil, have you had a look at the traces? Do they mean much to you? Just looked. bio_alloc_bioset seems implicated, as does sync_page_io. sync_page_io used to use a 'struct bio' on the stack, but Jens Axboe change it to use bio_alloc (don't know why..) and I should have checked the change better. sync_page_io can be called on the write out path, so it should use GFP_NOIO rather than GFP_KERNEL. See if this helps Actually this patch is against 2.6.12-rc2-mm1 which uses md_super_write instead of sync_page_io (which is now only used for read). So if you are using a non-mm kernel (which seems to be the case) you'll need to apply the patch by hand. Thanks, NeilBrown --- Avoid deadlock in sync_page_io by using GFP_NOIO ..as sync_page_io can be called on the write-out path. Ditto for md_super_write Signed-off-by: Neil Brown [EMAIL PROTECTED] ### Diffstat output ./drivers/md/md.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff ./drivers/md/md.c~current~ ./drivers/md/md.c --- ./drivers/md/md.c~current~ 2005-04-08 11:25:26.0 +1000 +++ ./drivers/md/md.c 2005-04-12 09:42:29.0 +1000 @@ -351,7 +351,7 @@ void md_super_write(mddev_t *mddev, mdk_ * if zero is reached. * If an error occurred, call md_error */ - struct bio *bio = bio_alloc(GFP_KERNEL, 1); + struct bio *bio = bio_alloc(GFP_NOIO, 1); bio-bi_bdev = rdev-bdev; bio-bi_sector = sector; @@ -374,7 +374,7 @@ static int bi_complete(struct bio *bio, int sync_page_io(struct block_device *bdev, sector_t sector, int size, struct page *page, int rw) { - struct bio *bio = bio_alloc(GFP_KERNEL, 1); + struct bio *bio = bio_alloc(GFP_NOIO, 1); struct completion event; int ret; - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
On Monday 11 April 2005 23:59, Nick Piggin wrote: OK, I'll try them in a few minutes and report back. I'm not overly hopeful. If they fix the problem, then it's likely that the real bug is hidden. Well, the thing is, they do fix the problem. Or at least they hide it very well ;-) It has been running for more than 5 hours now with stress with no problems and no stuck processes. I think I'm going to give a try to Neil's patch, but I'll have to apply some patches from -mm. Thanks Claudio - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
On Tuesday 12 April 2005 00:46, Neil Brown wrote: On Monday April 11, [EMAIL PROTECTED] wrote: Neil, have you had a look at the traces? Do they mean much to you? Just looked. bio_alloc_bioset seems implicated, as does sync_page_io. sync_page_io used to use a 'struct bio' on the stack, but Jens Axboe changed it to use bio_alloc (don't know why..) and I should have checked the change better. sync_page_io can be called on the write-out path, so it should use GFP_NOIO rather than GFP_KERNEL. See if this helps. Actually this patch is against 2.6.12-rc2-mm1, which uses md_super_write instead of sync_page_io (which is now only used for reads). So if you are using a non-mm kernel (which seems to be the case) you'll need to apply the patch by hand. Hi Neil, I'll test this patch, but I'm wondering if I have to apply all the md-related patches from the broken-out directory of 2.6.12-rc2-mm1 or only some specific ones? Anyway I'm happy to test all those md updates, if you think they might help. Thanks Claudio - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
Claudio Martins <[EMAIL PROTECTED]> wrote: I think I'm going to give a try to Neil's patch, but I'll have to apply some patches from -mm. Just this one if you're using 2.6.12-rc2:

--- 25/drivers/md/md.c~avoid-deadlock-in-sync_page_io-by-using-gfp_noio	Mon Apr 11 16:55:07 2005
+++ 25-akpm/drivers/md/md.c	Mon Apr 11 16:55:07 2005
@@ -332,7 +332,7 @@ static int bi_complete(struct bio *bio,
 static int sync_page_io(struct block_device *bdev, sector_t sector, int size,
 		   struct page *page, int rw)
 {
-	struct bio *bio = bio_alloc(GFP_KERNEL, 1);
+	struct bio *bio = bio_alloc(GFP_NOIO, 1);
 	struct completion event;
 	int ret;
_
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
On Tue, 2005-04-12 at 01:22 +0100, Claudio Martins wrote: On Monday 11 April 2005 23:59, Nick Piggin wrote: OK, I'll try them in a few minutes and report back. I'm not overly hopeful. If they fix the problem, then it's likely that the real bug is hidden. Well, the thing is, they do fix the problem. Or at least they hide it very well ;-) It has been running for more than 5 hours now with stress with no problems and no stuck processes. Well, that is good... I guess ;) Actually the patches I have sent you do fix real bugs, but they also make the block layer less likely to recurse into page reclaim, so it may be, e.g., hiding the problem that Neil's patch fixes. It may be that your fundamental problem is solved by my patches, but we need to be sure. I think I'm going to give a try to Neil's patch, but I'll have to apply some patches from -mm. Yep, that would be good. Please test -rc2 with Andrew's patch, and obviously my patches backed out. Thanks for sticking with it. Nick - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
On Sunday 10 April 2005 03:47, Andrew Morton wrote: > > Suggest you boot with `nmi_watchdog=0' to prevent the nmi watchdog from > cutting in during long sysrq traces. > > Also, capture the `sysrq-m' output so we can see if the thing is out of > memory. Hi Andrew, Thanks for the tip. I booted with nmi_watchdog=0 and was able to get a full sysrq-t as well as a sysrq-m. Since it might be a little too big for the list, I've put it on a text file at: http://193.136.132.235/dl145/dump1-2.6.12-rc2.txt I also made a run with the mempool-can-fail patch from Nick Piggin. With this I got some nice memory allocation errors from the md threads when the trouble started. The dump (with sysrq-t and sysrq-m included) is at: http://193.136.132.235/dl145/dump2-2.6.12-rc2-nick1.txt Let me know if you find it more convenient to send the dumps by mail or something. Hope this helps. Thanks, Claudio - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
On Sunday 10 April 2005 03:53, Nick Piggin wrote: > > Looks like you may possibly have a memory allocation deadlock > (although I can't explain the NMI oops). > > I would be interested to see if the following patch is of any > help to you. > Hi Nick, I'll build a kernel with your patch and report on the results as soon as possible. Thanks Claudio - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
On Sunday 10 April 2005 03:47, Andrew Morton wrote: > > Suggest you boot with `nmi_watchdog=0' to prevent the nmi watchdog from > cutting in during long sysrq traces. > > Also, capture the `sysrq-m' output so we can see if the thing is out of > memory. OK, will do it ASAP and report back. Thanks, Claudio - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
Claudio Martins wrote: On Tuesday 05 April 2005 03:12, Andrew Morton wrote: Claudio Martins <[EMAIL PROTECTED]> wrote: While stress testing 2.6.12-rc2 on an HP DL145 I get processes stuck in D state after some time. This machine is a dual Opteron 248 with 2GB (ECC) on one node (the other node has no RAM modules plugged in, since this board works only with pairs). I was using stress (http://weather.ou.edu/~apw/projects/stress/) with the following command line: stress -v -c 20 -i 12 -m 10 -d 20 [snip] Unfortunately the system Oopsed in the middle of dumping the tasks, but from what I can see I'm tempted to think that this might be related to the MD code. md2_raid1 is blocked on D state and, although not shown on the dump, I know from ps command that md0_raid1 (the swap partition) was also on D state (along with the stress processes which are responsible for hogging memory, and top and df). There were about 200MB swapped out, but the swap partition size is 1GB. Looks like you may possibly have a memory allocation deadlock (although I can't explain the NMI oops). I would be interested to see if the following patch is of any help to you. Thanks, Nick -- SUSE Labs, Novell Inc.

Index: linux-2.6/mm/mempool.c
===================================================================
--- linux-2.6.orig/mm/mempool.c	2005-03-30 10:39:51 +1000
+++ linux-2.6/mm/mempool.c	2005-03-30 10:41:29 +1000
@@ -198,7 +198,10 @@ void * mempool_alloc(mempool_t *pool, in
 	void *element;
 	unsigned long flags;
 	DEFINE_WAIT(wait);
-	int gfp_nowait = gfp_mask & ~(__GFP_WAIT | __GFP_IO);
+	int gfp_nowait;
+
+	gfp_mask |= __GFP_NORETRY;	/* don't loop in __alloc_pages */
+	gfp_nowait = gfp_mask & ~(__GFP_WAIT | __GFP_IO);

 	might_sleep_if(gfp_mask & __GFP_WAIT);
 repeat_alloc:
Re: Processes stuck on D state on Dual Opteron
Claudio Martins <[EMAIL PROTECTED]> wrote: > > I repeated the test to try to get more output from alt-sysrq-T, but it > oopsed again with even less output. > By the way, I have also tested 2.6.11.6 and I get stuck processes in the > same way. With 2.6.9 I get a hard lockup with no working alt-sysrq, after > about 30 to 60 mins of stress. It could be an md deadlock, or it could be an out-of-memory deadlock: md trying to allocate memory on the swapout path. > This is with preempt enabled (as well as BKL preempt). I want to test also > without preempt and also without using MD Raid1, but I'll have to reach the > machine and hit the power button, so not possible until tomorrow :-( > > The original message in this thread containing the details of the > setup and a .config is at: > > http://marc.theaimsgroup.com/?l=linux-kernel&m=111266784320156&w=2 > > I am happy to test any patches and also wonder if enabling any of the > options in the kernel debugging section could help in trying to find where > the deadlock is. Suggest you boot with `nmi_watchdog=0' to prevent the nmi watchdog from cutting in during long sysrq traces. Also, capture the `sysrq-m' output so we can see if the thing is out of memory. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
On Tuesday 05 April 2005 03:12, Andrew Morton wrote: > Claudio Martins <[EMAIL PROTECTED]> wrote: > > While stress testing 2.6.12-rc2 on an HP DL145 I get processes stuck > > in D state after some time. > > This machine is a dual Opteron 248 with 2GB (ECC) on one node (the > > other node has no RAM modules plugged in, since this board works only > > with pairs). > > > > I was using stress (http://weather.ou.edu/~apw/projects/stress/) with > > the following command line: > > > > stress -v -c 20 -i 12 -m 10 -d 20 > > > > This causes a constant load avg. of around 70, makes the machine go > > into swap a little, and writes up to about 20GB of random data to disk > > while eating up all CPU. After about half an hour random processes like > > top, df, etc get stuck in D state. Half of the 60 or so stress processes > > are also in D state. The machine keeps being responsive for maybe some 15 > > minutes but then the shells just hang and sshd stops responding to > > connections, though the machine replies to pings (I don't have console > > access till tomorrow). > > > > The system is using ext3 with md software Raid1. > > > > I'm interested in knowing if anyone out there with dual Opterons can > > reproduce this or not. I also have access to an HP DL360 Dual Xeon, so I > > will try to find out if this is AMD64 specific as soon as possible. > > Please let me know if you want me to run some other tests or give some > > more info to help solve this one. > > Can you capture the output from alt-sysrq-T? Hi Andrew, Due to other tasks, only now was I able to repeat the tests and capture the output from alt-sysrq-T.
I booted with serial console, put stress to work, and when the processes started to get hung in D state I managed to capture the following:

SysRq : Show State

 sibling task PC pid father child younger older
init D 81007fcfe0d8 0 1 0 2 (NOTLB)
 810003253768 0082 81007fd19170 007d 81007fd19170 810003251470 271b 810074468e70 810003251680 8027a79a
Call Trace: {__make_request+1274} {__down+152} {default_wake_function+0} {mempool_alloc+164} {__down_failed+53} {.text.lock.md+155} {make_request+868} {cache_alloc_refill+413} {generic_make_request+545} {autoremove_wake_function+0} {autoremove_wake_function+0} {submit_bio+223} {test_set_page_writeback+203} {swap_writepage+184} {shrink_zone+2678} {thread_return+0} {thread_return+88} {try_to_free_pages+311} {autoremove_wake_function+0} {__alloc_pages+533} {__get_free_pages+14} {__pollwait+74} {pipe_poll+66} {do_select+725} {__pollwait+0} {sys_select+735} {system_call+126}
migration/0 S 810002c12720 0 2 1 3 (L-TLB)
 81007ff0fea8 0046 810074806ef0 00750001 81007ff0fe58 8100032506f0 0129 810075281230 810003250900 810072ffde88
Call Trace: {migration_thread+532} {migration_thread+0} {kthread+217} {child_rip+8} {kthread+0} {child_rip+0}
ksoftirqd/0 S 0 3 1 4 2 (L-TLB)
 81007ff11f08 0046 810072e00430 007d 810002c194e0 810003250030 00d1 810072f3a030 810003250240
Call Trace: {__do_softirq+113} {ksoftirqd+0} {ksoftirqd+0} {ksoftirqd+99} {ksoftirqd+0} {kthread+217} {child_rip+8} {kthread+0} {child_rip+0}
migration/1 S 810002c1a720 0 4 1 5 3 (L-TLB)
 81007ff15ea8 0046 810072d1cff0 00730001 810079fe7e98 81007ff134b0 00a3 810075281230 81007ff136c0 81003381de88
Call Trace: {migration_thread+532} {migration_thread+0} {kthread+217} {child_rip+8} {kthread+0} {child_rip+0}
ksoftirqd/1 S 0001 0 5 1 6 4 (L-TLB)
 81007ff19f08 0046 810075376db0 0077802b8e7e 810002c114e0 81007ff12df0 01b4 810074125130 81007ff13000
Call Trace: {__do_softirq+113} {ksoftirqd+0} {ksoftirqd+0} {ksoftirqd+99} {ksoftirqd+0} {kthread+217} {child_rip+8} {kthread+0} {child_rip+0}
events/0 S 094f2f7a804e 0 6 1 7 5 (L-TLB)
 81007ff3be58 0046 0246 8013d00d 7ffe0c00 81007ff12730 0c80 803f40c0 81007ff12940
Call Trace: {__mod_timer+317} {cache_reap+0} {worker_thread+305} {default_wake_function+0}
Re: Processes stuck on D state on Dual Opteron
Claudio Martins <[EMAIL PROTECTED]> wrote: > > While stress testing 2.6.12-rc2 on an HP DL145 I get processes stuck in D > state after some time. > This machine is a dual Opteron 248 with 2GB (ECC) on one node (the other > node has no RAM modules plugged in, since this board works only with pairs). > > I was using stress (http://weather.ou.edu/~apw/projects/stress/) with the > following command line: > > stress -v -c 20 -i 12 -m 10 -d 20 > > This causes a constant load avg. of around 70, makes the machine go into > swap a little, and writes up to about 20GB of random data to disk while > eating up all CPU. After about half an hour random processes like top, df, > etc get stuck in D state. Half of the 60 or so stress processes are also in D > state. The machine keeps being responsive for maybe some 15 minutes but then > the shells just hang and sshd stops responding to connections, though the > machine replies to pings (I don't have console access till tomorrow). > > The system is using ext3 with md software Raid1. > > I'm interested in knowing if anyone out there with dual Opterons can > reproduce this or not. I also have access to an HP DL360 Dual Xeon, so I will > try to find out if this is AMD64 specific as soon as possible. Please let me > know if you want me to run some other tests or give some more info to help > solve this one. Can you capture the output from alt-sysrq-T? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/