Re: Processes stuck on D state on Dual Opteron

2005-04-12 Thread Nick Piggin
Claudio Martins wrote: On Tuesday 12 April 2005 01:46, Andrew Morton wrote: Claudio Martins <[EMAIL PROTECTED]> wrote: I think I'm going to give a try to Neil's patch, but I'll have to apply some patches from -mm. Just this one if you're using 2.6.12-rc2: ---

Re: Processes stuck on D state on Dual Opteron

2005-04-12 Thread Nick Piggin
Chen, Kenneth W wrote: Nick Piggin wrote on Tuesday, April 12, 2005 4:09 AM Chen, Kenneth W wrote: I like the patch a lot and already did bench it on our db setup. However, I'm seeing a negative regression compared to a very very crappy patch (see attached, you can laugh at me for doing things

Re: Processes stuck on D state on Dual Opteron

2005-04-12 Thread Claudio Martins
On Tuesday 12 April 2005 01:46, Andrew Morton wrote: > Claudio Martins <[EMAIL PROTECTED]> wrote: > > I think I'm going to give a try to Neil's patch, but I'll have to apply > > some patches from -mm. > > Just this one if you're using 2.6.12-rc2: > > ---

RE: Processes stuck on D state on Dual Opteron

2005-04-12 Thread Chen, Kenneth W
Nick Piggin wrote on Tuesday, April 12, 2005 4:09 AM > Chen, Kenneth W wrote: > > I like the patch a lot and already did bench it on our db setup. However, > > I'm seeing a negative regression compared to a very very crappy patch (see > > attached, you can laugh at me for doing things like that

Re: Processes stuck on D state on Dual Opteron

2005-04-12 Thread Thomas Davis
Nick Piggin wrote: It is a bit subtle: get_request may only drop the lock and return NULL (after retaking the lock), if we fail on a memory allocation. If we just fail due to unavailable queue slots, then the lock is never dropped. And the mem allocation can't fail because it is a mempool alloc

Re: Processes stuck on D state on Dual Opteron

2005-04-12 Thread Nick Piggin
Chen, Kenneth W wrote: On Tue, Apr 12 2005, Nick Piggin wrote: Actually the patches I have sent you do fix real bugs, but they also make the block layer less likely to recurse into page reclaim, so it may be eg. hiding the problem that Neil's patch fixes. Jens Axboe wrote on Tuesday, April 12,

Re: Processes stuck on D state on Dual Opteron

2005-04-12 Thread Nick Piggin
Nick Piggin wrote: Nick Piggin wrote: Chen, Kenneth W wrote: I like the patch a lot and already did bench it on our db setup. However, I'm seeing a negative regression compared to a very very crappy patch (see attached, you can laugh at me for doing things like that :-). OK - if we go that

Re: Processes stuck on D state on Dual Opteron

2005-04-12 Thread Nick Piggin
Nick Piggin wrote: Chen, Kenneth W wrote: I like the patch a lot and already did bench it on our db setup. However, I'm seeing a negative regression compared to a very very crappy patch (see attached, you can laugh at me for doing things like that :-). OK - if we go that way, perhaps the

RE: Processes stuck on D state on Dual Opteron

2005-04-12 Thread Chen, Kenneth W
On Tue, Apr 12 2005, Nick Piggin wrote: > Actually the patches I have sent you do fix real bugs, but they also > make the block layer less likely to recurse into page reclaim, so it > may be eg. hiding the problem that Neil's patch fixes. Jens Axboe wrote on Tuesday, April 12, 2005 12:08 AM > Can

Re: Processes stuck on D state on Dual Opteron

2005-04-12 Thread Jens Axboe
On Tue, Apr 12 2005, Nick Piggin wrote: > Actually the patches I have sent you do fix real bugs, but they also > make the block layer less likely to recurse into page reclaim, so it > may be eg. hiding the problem that Neil's patch fixes. Can you push those to Andrew? I'm quite happy with the way

Re: Processes stuck on D state on Dual Opteron

2005-04-12 Thread Jens Axboe
On Tue, Apr 12 2005, Nick Piggin wrote: Actually the patches I have sent you do fix real bugs, but they also make the block layer less likely to recurse into page reclaim, so it may be eg. hiding the problem that Neil's patch fixes. Can you push those to Andrew? I'm quite happy with the way

RE: Processes stuck on D state on Dual Opteron

2005-04-12 Thread Chen, Kenneth W
On Tue, Apr 12 2005, Nick Piggin wrote: Actually the patches I have sent you do fix real bugs, but they also make the block layer less likely to recurse into page reclaim, so it may be eg. hiding the problem that Neil's patch fixes. Jens Axboe wrote on Tuesday, April 12, 2005 12:08 AM Can you

RE: Processes stuck on D state on Dual Opteron

2005-04-12 Thread Chen, Kenneth W
Nick Piggin wrote on Tuesday, April 12, 2005 4:09 AM Chen, Kenneth W wrote: I like the patch a lot and already did bench it on our db setup. However, I'm seeing a negative regression compared to a very very crappy patch (see attached, you can laugh at me for doing things like that :-). OK

Re: Processes stuck on D state on Dual Opteron

2005-04-12 Thread Claudio Martins
On Tuesday 12 April 2005 01:46, Andrew Morton wrote: Claudio Martins [EMAIL PROTECTED] wrote: I think I'm going to give a try to Neil's patch, but I'll have to apply some patches from -mm. Just this one if you're using 2.6.12-rc2: ---

Re: Processes stuck on D state on Dual Opteron

2005-04-12 Thread Nick Piggin
Claudio Martins wrote: On Tuesday 12 April 2005 01:46, Andrew Morton wrote: Claudio Martins [EMAIL PROTECTED] wrote: I think I'm going to give a try to Neil's patch, but I'll have to apply some patches from -mm. Just this one if you're using 2.6.12-rc2: ---

Re: Processes stuck on D state on Dual Opteron

2005-04-11 Thread Nick Piggin
On Tue, 2005-04-12 at 01:22 +0100, Claudio Martins wrote: > On Monday 11 April 2005 23:59, Nick Piggin wrote: > > > > > OK, I'll try them in a few minutes and report back. > > > > I'm not overly hopeful. If they fix the problem, then it's likely > > that the real bug is hidden. > > > > Well,

Re: Processes stuck on D state on Dual Opteron

2005-04-11 Thread Andrew Morton
Claudio Martins <[EMAIL PROTECTED]> wrote: > > I think I'm going to give a try to Neil's patch, but I'll have to apply > some > patches from -mm. Just this one if you're using 2.6.12-rc2: --- 25/drivers/md/md.c~avoid-deadlock-in-sync_page_io-by-using-gfp_noio Mon Apr 11 16:55:07 2005 +++

Re: Processes stuck on D state on Dual Opteron

2005-04-11 Thread Claudio Martins
On Tuesday 12 April 2005 00:46, Neil Brown wrote: > On Monday April 11, [EMAIL PROTECTED] wrote: > > Neil, have you had a look at the traces? Do they mean much to you? > > Just looked. > bio_alloc_bioset seems implicated, as does sync_page_io. > > sync_page_io used to use a 'struct bio' on the

Re: Processes stuck on D state on Dual Opteron

2005-04-11 Thread Claudio Martins
On Monday 11 April 2005 23:59, Nick Piggin wrote: > > > OK, I'll try them in a few minutes and report back. > > I'm not overly hopeful. If they fix the problem, then it's likely > that the real bug is hidden. > Well, the thing is, they do fix the problem. Or at least they hide it very well

Re: Processes stuck on D state on Dual Opteron

2005-04-11 Thread Neil Brown
On Monday April 11, [EMAIL PROTECTED] wrote: > > Neil, have you had a look at the traces? Do they mean much to you? > Just looked. bio_alloc_bioset seems implicated, as does sync_page_io. sync_page_io used to use a 'struct bio' on the stack, but Jens Axboe changed it to use bio_alloc (don't

Re: Processes stuck on D state on Dual Opteron

2005-04-11 Thread Nick Piggin
Claudio Martins wrote: Right. I'm using two Seagate ATA133 disks (ide controller is AMD-8111) each with 4 partitions, so I get 4 md Raid1 devices. The first one, md0, is for swap. The rest are ~$ df -h Filesystem Size Used Avail Use% Mounted on /dev/md1 4.6G 1.9G

Re: Processes stuck on D state on Dual Opteron

2005-04-11 Thread Claudio Martins
On Monday 11 April 2005 13:45, Nick Piggin wrote: > > No luck yet (on SMP i386). How many disks are you using in each > raid1 array? You are using one array for swap, and one mounted as > ext3 for the working area of the `stress` program, right? > Right. I'm using two Seagate ATA133 disks

Re: Processes stuck on D state on Dual Opteron

2005-04-11 Thread Nick Piggin
Nick Piggin wrote: The common theme seems to be: try_to_free_pages, swap_writepage, mempool_alloc, down/down_failed in .text.lock.md. Next I would suspect md/raid1 - maybe some deadlock in an uncommon memory allocation failure path? I'll see if I can reproduce it here. No luck yet (on SMP i386).

Re: Processes stuck on D state on Dual Opteron

2005-04-11 Thread Nick Piggin
Claudio Martins wrote: On Sunday 10 April 2005 03:47, Andrew Morton wrote: Suggest you boot with `nmi_watchdog=0' to prevent the nmi watchdog from cutting in during long sysrq traces. Also, capture the `sysrq-m' output so we can see if the thing is out of memory. Hi Andrew, Thanks for the

Re: Processes stuck on D state on Dual Opteron

2005-04-11 Thread Nick Piggin
Claudio Martins wrote: On Sunday 10 April 2005 03:47, Andrew Morton wrote: Suggest you boot with `nmi_watchdog=0' to prevent the nmi watchdog from cutting in during long sysrq traces. Also, capture the `sysrq-m' output so we can see if the thing is out of memory. Hi Andrew, Thanks for the

Re: Processes stuck on D state on Dual Opteron

2005-04-11 Thread Claudio Martins
On Monday 11 April 2005 13:45, Nick Piggin wrote: No luck yet (on SMP i386). How many disks are you using in each raid1 array? You are using one array for swap, and one mounted as ext3 for the working area of the `stress` program, right? Right. I'm using two Seagate ATA133 disks (ide

Re: Processes stuck on D state on Dual Opteron

2005-04-11 Thread Neil Brown
On Monday April 11, [EMAIL PROTECTED] wrote: Neil, have you had a look at the traces? Do they mean much to you? Just looked. bio_alloc_bioset seems implicated, as does sync_page_io. sync_page_io used to use a 'struct bio' on the stack, but Jens Axboe changed it to use bio_alloc (don't know

Re: Processes stuck on D state on Dual Opteron

2005-04-11 Thread Claudio Martins
On Monday 11 April 2005 23:59, Nick Piggin wrote: OK, I'll try them in a few minutes and report back. I'm not overly hopeful. If they fix the problem, then it's likely that the real bug is hidden. Well, the thing is, they do fix the problem. Or at least they hide it very well ;-)

Re: Processes stuck on D state on Dual Opteron

2005-04-11 Thread Claudio Martins
On Tuesday 12 April 2005 00:46, Neil Brown wrote: On Monday April 11, [EMAIL PROTECTED] wrote: Neil, have you had a look at the traces? Do they mean much to you? Just looked. bio_alloc_bioset seems implicated, as does sync_page_io. sync_page_io used to use a 'struct bio' on the stack, but

Re: Processes stuck on D state on Dual Opteron

2005-04-11 Thread Andrew Morton
Claudio Martins [EMAIL PROTECTED] wrote: I think I'm going to give a try to Neil's patch, but I'll have to apply some patches from -mm. Just this one if you're using 2.6.12-rc2: --- 25/drivers/md/md.c~avoid-deadlock-in-sync_page_io-by-using-gfp_noio Mon Apr 11 16:55:07 2005 +++

Re: Processes stuck on D state on Dual Opteron

2005-04-11 Thread Nick Piggin
On Tue, 2005-04-12 at 01:22 +0100, Claudio Martins wrote: On Monday 11 April 2005 23:59, Nick Piggin wrote: OK, I'll try them in a few minutes and report back. I'm not overly hopeful. If they fix the problem, then it's likely that the real bug is hidden. Well, the thing is,

Re: Processes stuck on D state on Dual Opteron

2005-04-10 Thread Claudio Martins
On Sunday 10 April 2005 03:47, Andrew Morton wrote: > > Suggest you boot with `nmi_watchdog=0' to prevent the nmi watchdog from > cutting in during long sysrq traces. > > Also, capture the `sysrq-m' output so we can see if the thing is out of > memory. Hi Andrew, Thanks for the tip. I

Re: Processes stuck on D state on Dual Opteron

2005-04-10 Thread Claudio Martins
On Sunday 10 April 2005 03:47, Andrew Morton wrote: Suggest you boot with `nmi_watchdog=0' to prevent the nmi watchdog from cutting in during long sysrq traces. Also, capture the `sysrq-m' output so we can see if the thing is out of memory. Hi Andrew, Thanks for the tip. I booted with

Re: Processes stuck on D state on Dual Opteron

2005-04-09 Thread Claudio Martins
On Sunday 10 April 2005 03:53, Nick Piggin wrote: > > Looks like you may possibly have a memory allocation deadlock > (although I can't explain the NMI oops). > > I would be interested to see if the following patch is of any > help to you. > Hi Nick, I'll build a kernel with your patch and

Re: Processes stuck on D state on Dual Opteron

2005-04-09 Thread Claudio Martins
On Sunday 10 April 2005 03:47, Andrew Morton wrote: > > Suggest you boot with `nmi_watchdog=0' to prevent the nmi watchdog from > cutting in during long sysrq traces. > > Also, capture the `sysrq-m' output so we can see if the thing is out of > memory. OK, will do it ASAP and report back.

Re: Processes stuck on D state on Dual Opteron

2005-04-09 Thread Nick Piggin
Claudio Martins wrote: On Tuesday 05 April 2005 03:12, Andrew Morton wrote: Claudio Martins <[EMAIL PROTECTED]> wrote: While stress testing 2.6.12-rc2 on an HP DL145 I get processes stuck in D state after some time. This machine is a dual Opteron 248 with 2GB (ECC) on one node (the other node

Re: Processes stuck on D state on Dual Opteron

2005-04-09 Thread Andrew Morton
Claudio Martins <[EMAIL PROTECTED]> wrote: > > I repeated the test to try to get more output from alt-sysrq-T, but it > oopsed again with even less output. > By the way, I have also tested 2.6.11.6 and I get stuck processes in the > same way. With 2.6.9 I get a hard lockup with no

Re: Processes stuck on D state on Dual Opteron

2005-04-09 Thread Claudio Martins
On Tuesday 05 April 2005 03:12, Andrew Morton wrote: > Claudio Martins <[EMAIL PROTECTED]> wrote: > > While stress testing 2.6.12-rc2 on an HP DL145 I get processes stuck > > in D state after some time. > > This machine is a dual Opteron 248 with 2GB (ECC) on one node (the > > other node

Re: Processes stuck on D state on Dual Opteron

2005-04-09 Thread Claudio Martins
On Tuesday 05 April 2005 03:12, Andrew Morton wrote: Claudio Martins [EMAIL PROTECTED] wrote: While stress testing 2.6.12-rc2 on an HP DL145 I get processes stuck in D state after some time. This machine is a dual Opteron 248 with 2GB (ECC) on one node (the other node has no RAM

Re: Processes stuck on D state on Dual Opteron

2005-04-09 Thread Andrew Morton
Claudio Martins [EMAIL PROTECTED] wrote: I repeated the test to try to get more output from alt-sysrq-T, but it oopsed again with even less output. By the way, I have also tested 2.6.11.6 and I get stuck processes in the same way. With 2.6.9 I get a hard lockup with no working

Re: Processes stuck on D state on Dual Opteron

2005-04-09 Thread Nick Piggin
Claudio Martins wrote: On Tuesday 05 April 2005 03:12, Andrew Morton wrote: Claudio Martins [EMAIL PROTECTED] wrote: While stress testing 2.6.12-rc2 on an HP DL145 I get processes stuck in D state after some time. This machine is a dual Opteron 248 with 2GB (ECC) on one node (the other node

Re: Processes stuck on D state on Dual Opteron

2005-04-09 Thread Claudio Martins
On Sunday 10 April 2005 03:47, Andrew Morton wrote: Suggest you boot with `nmi_watchdog=0' to prevent the nmi watchdog from cutting in during long sysrq traces. Also, capture the `sysrq-m' output so we can see if the thing is out of memory. OK, will do it ASAP and report back. Thanks,

Re: Processes stuck on D state on Dual Opteron

2005-04-09 Thread Claudio Martins
On Sunday 10 April 2005 03:53, Nick Piggin wrote: Looks like you may possibly have a memory allocation deadlock (although I can't explain the NMI oops). I would be interested to see if the following patch is of any help to you. Hi Nick, I'll build a kernel with your patch and report

Re: Processes stuck on D state on Dual Opteron

2005-04-04 Thread Andrew Morton
Claudio Martins <[EMAIL PROTECTED]> wrote: > > While stress testing 2.6.12-rc2 on an HP DL145 I get processes stuck in D > state after some time. > This machine is a dual Opteron 248 with 2GB (ECC) on one node (the other > node has no RAM modules plugged in, since this board works only