Re: Processes stuck on D state on Dual Opteron
Claudio Martins wrote:
> On Tuesday 12 April 2005 01:46, Andrew Morton wrote:
> > Claudio Martins <[EMAIL PROTECTED]> wrote:
> > > I think I'm going to give Neil's patch a try, but I'll have to apply
> > > some patches from -mm.
> >
> > Just this one if you're using 2.6.12-rc2:
> >
> > --- 25/drivers/md/md.c~avoid-deadlock-in-sync_page_io-by-using-gfp_noio	Mon Apr 11 16:55:07 2005
> > +++ 25-akpm/drivers/md/md.c	Mon Apr 11 16:55:07 2005
> > @@ -332,7 +332,7 @@ static int bi_complete(struct bio *bio,
> >  static int sync_page_io(struct block_device *bdev, sector_t sector, int size,
> >  		   struct page *page, int rw)
> >  {
> > -	struct bio *bio = bio_alloc(GFP_KERNEL, 1);
> > +	struct bio *bio = bio_alloc(GFP_NOIO, 1);
> >  	struct completion event;
> >  	int ret;
> > _
>
> Hi Andrew, all,
>
> Sorry for the delay in reporting. This patch does indeed fix the problem.
> The machine ran stress for almost 15 hours straight with no problems at all.
>
> As for Nick's patch, I too think it would be nice to have it included (once
> the performance problems are sorted out), since it seemed to make the block
> layer more robust and better behaved (at least under stress), although I
> didn't run performance tests to measure regressions.
>
> Thanks Nick, Neil, Andrew and all the others for your great help with this
> issue. I'll have to put the machine into production now with the patch
> applied, but let me know if I can be of any further help with these issues.

Thanks for reporting and testing - what we need is more people like you
contributing to Linux ;)

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
Chen, Kenneth W wrote:
> Nick Piggin wrote on Tuesday, April 12, 2005 4:09 AM
> > Chen, Kenneth W wrote:
> > > I like the patch a lot and already benchmarked it on our db setup.
> > > However, I'm seeing a regression compared to a very, very crappy patch
> > > (see attached; you can laugh at me for doing things like that :-).
> >
> > OK - if we go that way, perhaps the following patch may be the way to do it.
>
> OK, if you are going to do it that way, then the ioc_batching code in
> get_request has to be reworked. We never push the queue so hard that it
> kicks itself into batching mode. However, the calls to get_io_context and
> put_io_context are unconditional in that function. Execution profiles show
> that these two little functions actually consume a lot of CPU cycles.
>
> AFAICS, ioc_*batching() is trying to push more requests onto a queue that
> is full (or nearly full) and to give high priority to the process that hits
> the last request slot. Why do we need to go all the way to tsk->io_context
> to keep track of that state? As a clean-up bonus, I think the tracking
> could be moved into the queue structure.

OK - well, it is no different from what you had before these patches, so
that would probably be future work in separate patches.

get_io_context can probably be reworked. For example, it is only called
with the current thread, so it probably doesn't need to increment the
refcount, as most users are only using it in process context... all the
users in ll_rw_blk.c, anyway.

--
SUSE Labs, Novell Inc.
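Ken's suggestion above - tracking the batching state in the request queue rather than in tsk->io_context - can be sketched as a userspace toy model. This is purely illustrative: the struct fields, the quota of 4, and the function names are invented here, not the kernel's actual ioc_batching implementation.

```c
#include <assert.h>

/* Toy model of per-queue batching state (illustrative names only):
 * the process that hits the last request slot becomes the "batcher"
 * and is allowed to allocate a few extra requests. */
struct toy_queue {
	int count;        /* requests currently allocated */
	int nr_requests;  /* queue depth limit */
	int batching;     /* remaining quota of the current batcher */
};

static int toy_may_batch(struct toy_queue *q)
{
	/* hitting the last slot starts a batching window */
	if (q->count >= q->nr_requests && !q->batching) {
		q->batching = 4;  /* arbitrary quota for the sketch */
		return 1;
	}
	/* while the window is open, keep letting requests through */
	if (q->batching) {
		q->batching--;
		return 1;
	}
	return 0;  /* queue not full: normal allocation path */
}
```

Because the state lives in the queue, no io_context lookup (and no refcount traffic) is needed on this path, which is the cost the execution profile flagged.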
Re: Processes stuck on D state on Dual Opteron
On Tuesday 12 April 2005 01:46, Andrew Morton wrote:
> Claudio Martins <[EMAIL PROTECTED]> wrote:
> > I think I'm going to give Neil's patch a try, but I'll have to apply
> > some patches from -mm.
>
> Just this one if you're using 2.6.12-rc2:
>
> --- 25/drivers/md/md.c~avoid-deadlock-in-sync_page_io-by-using-gfp_noio	Mon Apr 11 16:55:07 2005
> +++ 25-akpm/drivers/md/md.c	Mon Apr 11 16:55:07 2005
> @@ -332,7 +332,7 @@ static int bi_complete(struct bio *bio,
>  static int sync_page_io(struct block_device *bdev, sector_t sector, int size,
>  		   struct page *page, int rw)
>  {
> -	struct bio *bio = bio_alloc(GFP_KERNEL, 1);
> +	struct bio *bio = bio_alloc(GFP_NOIO, 1);
>  	struct completion event;
>  	int ret;
> _

Hi Andrew, all,

Sorry for the delay in reporting. This patch does indeed fix the problem.
The machine ran stress for almost 15 hours straight with no problems at all.

As for Nick's patch, I too think it would be nice to have it included (once
the performance problems are sorted out), since it seemed to make the block
layer more robust and better behaved (at least under stress), although I
didn't run performance tests to measure regressions.

Thanks Nick, Neil, Andrew and all the others for your great help with this
issue. I'll have to put the machine into production now with the patch
applied, but let me know if I can be of any further help with these issues.

Thanks

Claudio
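The one-character change above fixes a recursion deadlock: allocating a bio with GFP_KERNEL on the I/O path lets the allocator enter reclaim, and reclaim may issue writeback that itself needs a bio. The shape of the problem can be modeled in plain userspace C; everything here (flag values, function names, the depth cap) is a made-up illustration, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of why sync_page_io() must not use GFP_KERNEL. */
#define TOY_GFP_KERNEL 1   /* reclaim is allowed to start new I/O */
#define TOY_GFP_NOIO   2   /* reclaim must not start new I/O */

static int io_recursions;  /* how often reclaim re-entered the I/O path */

static bool toy_alloc(int flags, int depth);

/* Reclaim under memory pressure: with GFP_KERNEL it writes pages back,
 * and the writeback itself needs an allocation (cf. bio_alloc()). */
static bool toy_reclaim(int flags, int depth)
{
	if (flags == TOY_GFP_KERNEL) {
		io_recursions++;
		return toy_alloc(flags, depth + 1);  /* the dangerous re-entry */
	}
	return true;  /* GFP_NOIO: free memory some other way, no new I/O */
}

static bool toy_alloc(int flags, int depth)
{
	if (depth > 8)
		return false;  /* stands in for the md deadlock: I/O stuck behind I/O */
	/* pretend the free lists are empty, so every allocation reclaims */
	return toy_reclaim(flags, depth);
}
```

In the model, a GFP_NOIO allocation never re-enters the I/O path, which is exactly the property the md patch relies on.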
RE: Processes stuck on D state on Dual Opteron
Nick Piggin wrote on Tuesday, April 12, 2005 4:09 AM
> Chen, Kenneth W wrote:
> > I like the patch a lot and already benchmarked it on our db setup.
> > However, I'm seeing a regression compared to a very, very crappy patch
> > (see attached; you can laugh at me for doing things like that :-).
>
> OK - if we go that way, perhaps the following patch may be the way to do it.

OK, if you are going to do it that way, then the ioc_batching code in
get_request has to be reworked. We never push the queue so hard that it
kicks itself into batching mode. However, the calls to get_io_context and
put_io_context are unconditional in that function. Execution profiles show
that these two little functions actually consume a lot of CPU cycles.

AFAICS, ioc_*batching() is trying to push more requests onto a queue that
is full (or nearly full) and to give high priority to the process that hits
the last request slot. Why do we need to go all the way to tsk->io_context
to keep track of that state? As a clean-up bonus, I think the tracking
could be moved into the queue structure.

> > My first reaction is that the overhead is in the wait queue setup and tear
> > down in the get_request_wait function. Throwing the following patch on top
> > does improve things a bit, but we are still in negative territory. I can't
> > explain why; everything is supposed to be faster. So I'm staring at the
> > execution profile at the moment.
>
> Hmm, that's a bit disappointing. Like you said though, I'm sure we
> should be able to get better performance out of this.

Absolutely. I'm disappointed too, and this is totally unexpected. There
must be some other factors.
Re: Processes stuck on D state on Dual Opteron
Nick Piggin wrote:
> It is a bit subtle: get_request may only drop the lock and return NULL
> (after retaking the lock) if we fail on a memory allocation. If we just
> fail due to unavailable queue slots, then the lock is never dropped.
> And the mem allocation can't fail, because it is a mempool alloc with
> GFP_NOIO.

I'm jumping in here, because we have seen this problem on an x86-64 system
with 4 GB of RAM and SLES9 (2.6.5-7.141).

You can drive the node into this state:

Mem-info:
Node 1 DMA per-cpu: empty
Node 1 Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
Node 1 HighMem per-cpu: empty
Node 0 DMA per-cpu:
cpu 0 hot: low 2, high 6, batch 1
cpu 0 cold: low 0, high 2, batch 1
cpu 1 hot: low 2, high 6, batch 1
cpu 1 cold: low 0, high 2, batch 1
Node 0 Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
Node 0 HighMem per-cpu: empty
Free pages: 10360kB (0kB HighMem)
Active:485853 inactive:421820 dirty:0 writeback:0 unstable:0 free:2590 slab:10816 mapped:903444 pagetables:2097
Node 1 DMA free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB
lowmem_reserve[]: 0 1664 1664
Node 1 Normal free:2464kB min:2468kB low:4936kB high:7404kB active:918440kB inactive:710360kB present:1703936kB
lowmem_reserve[]: 0 0 0
Node 1 HighMem free:0kB min:128kB low:256kB high:384kB active:0kB inactive:0kB present:0kB
lowmem_reserve[]: 0 0 0
Node 0 DMA free:4928kB min:20kB low:40kB high:60kB active:0kB inactive:0kB present:16384kB
lowmem_reserve[]: 0 2031 2031
Node 0 Normal free:2968kB min:3016kB low:6032kB high:9048kB active:1024968kB inactive:976924kB present:2080764kB
lowmem_reserve[]: 0 0 0
Node 0 HighMem free:0kB min:128kB low:256kB high:384kB active:0kB inactive:0kB present:0kB
lowmem_reserve[]: 0 0 0
Node 1 DMA: empty
Node 1 Normal: 46*4kB 19*8kB 9*16kB 4*32kB 1*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 2464kB
Node 1 HighMem: empty
Node 0 DMA: 4*4kB 4*8kB 1*16kB 2*32kB 3*64kB 4*128kB 2*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 4928kB
Node 0 Normal: 0*4kB 1*8kB 1*16kB 0*32kB 0*64kB 1*128kB 1*256kB 3*512kB 1*1024kB 0*2048kB 0*4096kB = 2968kB
Node 0 HighMem: empty
Swap cache: add 1009224, delete 106245, find 179674/181478, race 0+2
Free swap: 4739812kB
950271 pages of RAM
17513 reserved pages
2788 pages shared
902980 pages swap cached

with processes doing this:

SysRq : Show State

                         sibling
  task          PC       pid  father  child  younger  older
init     D 0100e810    0    1      0      2  (NOTLB)
Call Trace: {try_to_free_pages+283} {schedule_timeout+173}
  {process_timeout+0} {io_schedule_timeout+82} {blk_congestion_wait+141}
  {autoremove_wake_function+0} {autoremove_wake_function+0}
  {__alloc_pages+776} {read_swap_cache_async+63} {swapin_readahead+97}
  {do_swap_page+142} {handle_mm_fault+337} {do_page_fault+411}
  {sys_select+1097} {sys_select+1311} {error_exit+0}

mg.C.2   D 0100e810    0 1971   1955   1972  (NOTLB)
Call Trace: {try_to_free_pages+283} {schedule_timeout+173}
  {process_timeout+0} {io_schedule_timeout+82} {blk_congestion_wait+141}
  {autoremove_wake_function+0} {autoremove_wake_function+0}
  {__alloc_pages+776} {do_wp_page+285} {handle_mm_fault+373}
  {do_page_fault+411} {error_exit+0}

mg.C.2   S 01007b0a06a0  0 1972   1971   1974  (NOTLB)
Call Trace: {__alloc_pages+852} {__down_interruptible+216}
  {default_wake_function+0} {recalc_task_prio+940}
  {__down_failed_interruptible+53} {:mosal:.text.lock.mosal_sync+5}
  {:mod_vipkl:VIPKL_EQ_poll+607} {:mod_vipkl:VIPKL_EQ_poll_stat+529}
  {:mod_vipkl:VIPKL_ioctl+5144} {:mod_vipkl:vipkl_wrap_kernel_ioctl+417}
  {filp_close+126} {sys_ioctl+612} {system_call+124}

mg.C.2   S 01007b0a18c0  0 1974   1971   1972  (NOTLB)
Call Trace: {__alloc_pages+852} {__down_interruptible+216}
  {default_wake_function+0}
Re: Processes stuck on D state on Dual Opteron
Chen, Kenneth W wrote:
> On Tue, Apr 12 2005, Nick Piggin wrote:
> > Actually the patches I have sent you do fix real bugs, but they also
> > make the block layer less likely to recurse into page reclaim, so it
> > may be e.g. hiding the problem that Neil's patch fixes.
>
> Jens Axboe wrote on Tuesday, April 12, 2005 12:08 AM
> > Can you push those to Andrew? I'm quite happy with the way they turned
> > out. It would be nice if Ken would bench 2.6.12-rc2 with and without
> > those patches.
>
> I like the patch a lot and already benchmarked it on our db setup.
> However, I'm seeing a regression compared to a very, very crappy patch
> (see attached; you can laugh at me for doing things like that :-).

OK - if we go that way, perhaps the following patch may be the way to do it.

> My first reaction is that the overhead is in the wait queue setup and tear
> down in the get_request_wait function. Throwing the following patch on top
> does improve things a bit, but we are still in negative territory. I can't
> explain why; everything is supposed to be faster. So I'm staring at the
> execution profile at the moment.

Hmm, that's a bit disappointing. Like you said though, I'm sure we
should be able to get better performance out of this. I'll look at it
and see if we can rework it.

--
SUSE Labs, Novell Inc.
Re: Processes stuck on D state on Dual Opteron
Nick Piggin wrote:
> Nick Piggin wrote:
> > Chen, Kenneth W wrote:
> > > I like the patch a lot and already benchmarked it on our db setup.
> > > However, I'm seeing a regression compared to a very, very crappy patch
> > > (see attached; you can laugh at me for doing things like that :-).
> >
> > OK - if we go that way, perhaps the following patch may be the way to do it.
>
> Here.

Actually, yes, this is good I think. What I was worried about is that you
could lose some fairness due to not being put on the wait queue before the
allocation. This is probably a silly thing to worry about, because up until
that point things aren't really deterministic anyway (and before this
patchset it would try a GFP_ATOMIC allocation first anyway).

However, after the subsequent locking rework, both these get_request()
calls will be performed under the same lock - giving you the same fairness.
So it is nothing to worry about anyway!

It is a bit subtle: get_request may only drop the lock and return NULL
(after retaking the lock) if we fail on a memory allocation. If we just
fail due to unavailable queue slots, then the lock is never dropped. And
the mem allocation can't fail, because it is a mempool alloc with GFP_NOIO.

Nick

--
SUSE Labs, Novell Inc.
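The mempool property Nick leans on here - a mempool allocation with a blocking gfp mask waits for a freed element rather than returning NULL - can be sketched as a userspace model. The pool size, names, and refill logic below are illustrative, not the kernel's mempool implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the mempool guarantee: a caller that may sleep
 * (think GFP_NOIO) never sees NULL, because it waits for an
 * element to come back to the reserve pool. */
#define POOL_MIN 2

static int reserved = POOL_MIN;   /* elements held in the reserve pool */

static void *toy_mempool_alloc(int can_wait)
{
	static int element[POOL_MIN];
	for (;;) {
		if (reserved > 0)
			return &element[--reserved];
		if (!can_wait)
			return NULL;          /* a non-blocking caller may fail */
		reserved = POOL_MIN;      /* models sleeping until frees return */
	}
}
```

This is why, in get_request, failing to get a request can only mean "no queue slots": the mempool side of the allocation cannot be the thing that returns NULL.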
Re: Processes stuck on D state on Dual Opteron
Nick Piggin wrote:
> Chen, Kenneth W wrote:
> > I like the patch a lot and already benchmarked it on our db setup.
> > However, I'm seeing a regression compared to a very, very crappy patch
> > (see attached; you can laugh at me for doing things like that :-).
>
> OK - if we go that way, perhaps the following patch may be the way to do it.

Here.

--
SUSE Labs, Novell Inc.

Index: linux-2.6/drivers/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/drivers/block/ll_rw_blk.c	2005-04-12 21:03:01.000000000 +1000
+++ linux-2.6/drivers/block/ll_rw_blk.c	2005-04-12 21:03:45.000000000 +1000
@@ -1956,10 +1956,11 @@ out:
  */
 static struct request *get_request_wait(request_queue_t *q, int rw)
 {
-	DEFINE_WAIT(wait);
 	struct request *rq;
 
-	do {
+	rq = get_request(q, rw, GFP_NOIO);
+	while (!rq) {
+		DEFINE_WAIT(wait);
 		struct request_list *rl = &q->rq;
 
 		prepare_to_wait_exclusive(&rl->wait[rw], &wait,
@@ -1987,7 +1988,7 @@ static struct request *get_request_wait(
 			spin_lock_irq(q->queue_lock);
 		}
 		finish_wait(&rl->wait[rw], &wait);
-	} while (!rq);
+	}
 
 	return rq;
 }
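The shape of this change - try the allocation once before doing any wait-queue bookkeeping, and only fall into the prepare/wait/retry loop on failure - can be modeled in plain userspace C. The names below (try_get, get_wait, the counter) are invented for the sketch; try_get stands in for get_request, and the counter stands in for the DEFINE_WAIT/prepare_to_wait_exclusive work the patch moves off the fast path.

```c
#include <assert.h>
#include <stddef.h>

static int available = 1;     /* one free "request" slot */
static int waitq_setups;      /* counts prepare_to_wait-style setups */

/* stands in for get_request(): take a slot if one is free */
static int *try_get(void)
{
	static int slot;
	if (available) {
		available = 0;
		return &slot;
	}
	return NULL;
}

/* stands in for get_request_wait() after the patch */
static int *get_wait(void)
{
	int *rq = try_get();          /* fast path: no wait-queue work at all */
	while (!rq) {
		waitq_setups++;           /* models DEFINE_WAIT + prepare_to_wait */
		available = 1;            /* models a completing request waking us */
		rq = try_get();
	}
	return rq;
}
```

An uncontended caller pays nothing for the wait queue; only a caller that actually finds the queue full does the setup/teardown, which is exactly the overhead Ken's profile pointed at.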
RE: Processes stuck on D state on Dual Opteron
On Tue, Apr 12 2005, Nick Piggin wrote:
> Actually the patches I have sent you do fix real bugs, but they also
> make the block layer less likely to recurse into page reclaim, so it
> may be e.g. hiding the problem that Neil's patch fixes.

Jens Axboe wrote on Tuesday, April 12, 2005 12:08 AM
> Can you push those to Andrew? I'm quite happy with the way they turned
> out. It would be nice if Ken would bench 2.6.12-rc2 with and without
> those patches.

I like the patch a lot and already benchmarked it on our db setup.
However, I'm seeing a regression compared to a very, very crappy patch
(see attached; you can laugh at me for doing things like that :-).

My first reaction is that the overhead is in the wait queue setup and tear
down in the get_request_wait function. Throwing the following patch on top
does improve things a bit, but we are still in negative territory. I can't
explain why; everything is supposed to be faster. So I'm staring at the
execution profile at the moment.

diff -Nru a/drivers/block/ll_rw_blk.c b/drivers/block/ll_rw_blk.c
--- a/drivers/block/ll_rw_blk.c	2005-04-12 00:48:12 -07:00
+++ b/drivers/block/ll_rw_blk.c	2005-04-12 00:48:12 -07:00
@@ -1740,10 +1740,35 @@
  */
 static struct request *get_request_wait(request_queue_t *q, int rw)
 {
-	DEFINE_WAIT(wait);
 	struct request *rq;
+	struct request_list *rl = &q->rq;
+	int gfp_flag = GFP_ATOMIC;
+
+	if (rl->count[rw] < queue_congestion_off_threshold(q)) {
+		rq = kmem_cache_alloc(request_cachep, gfp_flag);
+		if (rq) {
+			if (!elv_set_request(q, rq, gfp_flag)) {
+				rl->count[rw]++;
+				INIT_LIST_HEAD(&rq->queuelist);
+				rq->flags = rw;
+				rq->rq_status = RQ_ACTIVE;
+				rq->ref_count = 1;
+				rq->q = q;
+				rq->rl = rl;
+				rq->special = NULL;
+				rq->data_len = 0;
+				rq->data = NULL;
+				rq->sense = NULL;
+
+				return rq;
+			}
+			kmem_cache_free(request_cachep, rq);
+		}
+	}
 
 	do {
+		DEFINE_WAIT(wait);
 		struct request_list *rl = &q->rq;
 
 		prepare_to_wait_exclusive(&rl->wait[rw], &wait,

[uuencoded attachment "old_freereq.patch" omitted]
Re: Processes stuck on D state on Dual Opteron
On Tue, Apr 12 2005, Nick Piggin wrote:
> Actually the patches I have sent you do fix real bugs, but they also
> make the block layer less likely to recurse into page reclaim, so it
> may be e.g. hiding the problem that Neil's patch fixes.

Can you push those to Andrew? I'm quite happy with the way they turned
out. It would be nice if Ken would bench 2.6.12-rc2 with and without
those patches.

--
Jens Axboe
Re: Processes stuck on D state on Dual Opteron
On Tue, Apr 12 2005, Nick Piggin wrote: Actually the patches I have sent you do fix real bugs, but they also make the block layer less likely to recurse into page reclaim, so it may be eg. hiding the problem that Neil's patch fixes. Can you push those to Andrew? I'm quite happy with the way they turned out. It would be nice if Ken would bench 2.6.12-rc2 with and without those patches. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: Processes stuck on D state on Dual Opteron
On Tue, Apr 12 2005, Nick Piggin wrote: Actually the patches I have sent you do fix real bugs, but they also make the block layer less likely to recurse into page reclaim, so it may be eg. hiding the problem that Neil's patch fixes. Jens Axboe wrote on Tuesday, April 12, 2005 12:08 AM Can you push those to Andrew? I'm quite happy with the way they turned out. It would be nice if Ken would bench 2.6.12-rc2 with and without those patches. I like the patch a lot and already did bench it on our db setup. However, I'm seeing a negative regression compare to a very very crappy patch (see attached, you can laugh at me for doing things like that :-). My first reaction is that the overhead is in wait queue setup and tear down in get_request_wait function. Throwing the following patch on top does improve things a bit, but we are still in the negative territory. I can't explain why. Everything suppose to be faster. So I'm staring at the execution profile at the moment. diff -Nru a/drivers/block/ll_rw_blk.c b/drivers/block/ll_rw_blk.c --- a/drivers/block/ll_rw_blk.c 2005-04-12 00:48:12 -07:00 +++ b/drivers/block/ll_rw_blk.c 2005-04-12 00:48:12 -07:00 @@ -1740,10 +1740,35 @@ */ static struct request *get_request_wait(request_queue_t *q, int rw) { - DEFINE_WAIT(wait); struct request *rq; + struct request_list *rl = q-rq; + int gfp_flag = GFP_ATOMIC; + + if (rl-count[rw] queue_congestion_off_threshold(q)) { + rq = kmem_cache_alloc(request_cachep, gfp_flag); + if (rq) { + if (!elv_set_request(q, rq, gfp_flag)) { + + rl-count[rw]++; + INIT_LIST_HEAD(rq-queuelist); + rq-flags = rw; + rq-rq_status = RQ_ACTIVE; + rq-ref_count = 1; + rq-q = q; + rq-rl = rl; + rq-special = NULL; + rq-data_len = 0; + rq-data = NULL; + rq-sense = NULL; + + return rq; + } + kmem_cache_free(request_cachep, rq); + } + } do { + DEFINE_WAIT(wait); struct request_list *rl = q-rq; prepare_to_wait_exclusive(rl-wait[rw], wait, begin 666 old_freereq.patch M9EF9B M3G)U($O9')I=F5RR]B;]C:R]L;%]R=U]B;LN8R!B+V1R:79E 
MG,O8FQO8VLO;Q?G=?8FQK+F,*+2TM($O9')I=F5RR]B;]C:R]L;%]R M=U]B;LN8PDR,# U+3 T+3 T(# P.C4X.C4U(TP-SHP, [EMAIL PROTECTED]FEV M97)S+V)L;V-K+VQL7W)W7V)L:RYC3(P,#4M,#0M,#0@,# [EMAIL PROTECTED]@+3 W M.C PD! (TQ.3DV+#$U(LQ.3DV+#$T($! B *( ER82 ]()I;RT^8FE? MG@)B H,2 \/!24]?4E=?04A%040I.PH@BL)[EMAIL PROTECTED])A8B!A(9R964@ MF5Q=65S=!FF]M('1H92!FF5E;ES= J+PHK69R965R97$@/2!G971? MF5Q=65S=AQ+[EMAIL PROTECTED])0RD[BL*([EMAIL PROTECTED]@7-P:6Y? M;]C:U]IG$H2T^75E=65?;]C:RD[B *+0EI9B H96QV7W%U975E7V5M M'1Y*'$I*2![BL):[EMAIL PROTECTED]5L=E]Q=65U95]E;7!T2AQ*2D*( D)8FQK7W!L M=6=?95V:6-E*'$I.PHM0EG;W1O(=E=%]R3L*+0E]BT):[EMAIL PROTECTED])AG)I M97(IBT)6=O=[EMAIL PROTECTED])Q.PH@B )96Q?F5T([EMAIL PROTECTED]F=E*'$L A(9R97$L()I;RD[B )W=I=-H(AE;%]R970I('L* ` end - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
Nick Piggin wrote: Chen, Kenneth W wrote: I like the patch a lot and already did bench it on our db setup. However, I'm seeing a negative regression compare to a very very crappy patch (see attached, you can laugh at me for doing things like that :-). OK - if we go that way, perhaps the following patch may be the way to do it. Here. -- SUSE Labs, Novell Inc. Index: linux-2.6/drivers/block/ll_rw_blk.c === --- linux-2.6.orig/drivers/block/ll_rw_blk.c2005-04-12 21:03:01.0 +1000 +++ linux-2.6/drivers/block/ll_rw_blk.c 2005-04-12 21:03:45.0 +1000 @@ -1956,10 +1956,11 @@ out: */ static struct request *get_request_wait(request_queue_t *q, int rw) { - DEFINE_WAIT(wait); struct request *rq; - do { + rq = get_request(q, rw, GFP_NOIO); + while (!rq) { + DEFINE_WAIT(wait); struct request_list *rl = q-rq; prepare_to_wait_exclusive(rl-wait[rw], wait, @@ -1987,7 +1988,7 @@ static struct request *get_request_wait( spin_lock_irq(q-queue_lock); } finish_wait(rl-wait[rw], wait); - } while (!rq); + } return rq; }
Re: Processes stuck on D state on Dual Opteron
Nick Piggin wrote: Nick Piggin wrote: Chen, Kenneth W wrote: I like the patch a lot and already did bench it on our db setup. However, I'm seeing a negative regression compare to a very very crappy patch (see attached, you can laugh at me for doing things like that :-). OK - if we go that way, perhaps the following patch may be the way to do it. Here. Actually yes this is good I think. What I was worried about is that you could lose some fairness due to not being put on the queue before allocation. This is probably a silly thing to worry about, because up until that point things aren't really deterministic anyway (and before this patchset it would try doing a GFP_ATOMIC allocation first anyway). However after the subsequent locking rework, both these get_request() calls will be performed under the same lock - giving you the same fairness. So it is nothing to worry about anyway! It is a bit subtle: get_request may only drop the lock and return NULL (after retaking the lock), if we fail on a memory allocation. If we just fail due to unavailable queue slots, then the lock is never dropped. And the mem allocation can't fail because it is a mempool alloc with GFP_NOIO. Nick -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
Chen, Kenneth W wrote: On Tue, Apr 12 2005, Nick Piggin wrote: Actually the patches I have sent you do fix real bugs, but they also make the block layer less likely to recurse into page reclaim, so it may be eg. hiding the problem that Neil's patch fixes. Jens Axboe wrote on Tuesday, April 12, 2005 12:08 AM Can you push those to Andrew? I'm quite happy with the way they turned out. It would be nice if Ken would bench 2.6.12-rc2 with and without those patches. I like the patch a lot and already did bench it on our db setup. However, I'm seeing a negative regression compare to a very very crappy patch (see attached, you can laugh at me for doing things like that :-). OK - if we go that way, perhaps the following patch may be the way to do it. My first reaction is that the overhead is in wait queue setup and tear down in get_request_wait function. Throwing the following patch on top does improve things a bit, but we are still in the negative territory. I can't explain why. Everything suppose to be faster. So I'm staring at the execution profile at the moment. Hmm, that's a bit disappointing. Like you said though, I'm sure we should be able to get better performance out of this. I'll look at it and see if we can rework it. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
Nick Piggin wrote: It is a bit subtle: get_request may only drop the lock and return NULL (after retaking the lock), if we fail on a memory allocation. If we just fail due to unavailable queue slots, then the lock is never dropped. And the mem allocation can't fail because it is a mempool alloc with GFP_NOIO. I'm jumping in here, because we have seen this problem on a X86-64 system, with 4gb of ram, and SLES9 (2.6.5-7.141) You can drive the node into this state: Mem-info: Node 1 DMA per-cpu: empty Node 1 Normal per-cpu: cpu 0 hot: low 32, high 96, batch 16 cpu 0 cold: low 0, high 32, batch 16 cpu 1 hot: low 32, high 96, batch 16 cpu 1 cold: low 0, high 32, batch 16 Node 1 HighMem per-cpu: empty Node 0 DMA per-cpu: cpu 0 hot: low 2, high 6, batch 1 cpu 0 cold: low 0, high 2, batch 1 cpu 1 hot: low 2, high 6, batch 1 cpu 1 cold: low 0, high 2, batch 1 Node 0 Normal per-cpu: cpu 0 hot: low 32, high 96, batch 16 cpu 0 cold: low 0, high 32, batch 16 cpu 1 hot: low 32, high 96, batch 16 cpu 1 cold: low 0, high 32, batch 16 Node 0 HighMem per-cpu: empty Free pages: 10360kB (0kB HighMem) Active:485853 inactive:421820 dirty:0 writeback:0 unstable:0 free:2590 slab:10816 mapped:903444 pagetables:2097 Node 1 DMA free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB lowmem_reserve[]: 0 1664 1664 Node 1 Normal free:2464kB min:2468kB low:4936kB high:7404kB active:918440kB inactive:710360kB present:1703936kB lowmem_reserve[]: 0 0 0 Node 1 HighMem free:0kB min:128kB low:256kB high:384kB active:0kB inactive:0kB present:0kB lowmem_reserve[]: 0 0 0 Node 0 DMA free:4928kB min:20kB low:40kB high:60kB active:0kB inactive:0kB present:16384kB lowmem_reserve[]: 0 2031 2031 Node 0 Normal free:2968kB min:3016kB low:6032kB high:9048kB active:1024968kB inactive:976924kB present:2080764kB lowmem_reserve[]: 0 0 0 Node 0 HighMem free:0kB min:128kB low:256kB high:384kB active:0kB inactive:0kB present:0kB lowmem_reserve[]: 0 0 0 Node 1 DMA: empty Node 1 Normal: 46*4kB 19*8kB 9*16kB 
4*32kB 1*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 2464kB Node 1 HighMem: empty Node 0 DMA: 4*4kB 4*8kB 1*16kB 2*32kB 3*64kB 4*128kB 2*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 4928kB Node 0 Normal: 0*4kB 1*8kB 1*16kB 0*32kB 0*64kB 1*128kB 1*256kB 3*512kB 1*1024kB 0*2048kB 0*4096kB = 2968kB Node 0 HighMem: empty Swap cache: add 1009224, delete 106245, find 179674/181478, race 0+2 Free swap: 4739812kB 950271 pages of RAM 17513 reserved pages 2788 pages shared 902980 pages swap cached with processes doing this: SysRq : Show State sibling task PC pid father child younger older init D 0100e810 0 1 0 2 (NOTLB) 01007ff81be8 0006 010002c1d6e0 Call Trace:8017338b{try_to_free_pages+283} 80147d0d{schedule_timeout+173} 80147c50{process_timeout+0} 8013a292{io_schedule_timeout+82} 80280efd{blk_congestion_wait+141} 8013c530{autoremove_wake_function+0} 8013c530{autoremove_wake_function+0} 8016ab68{__alloc_pages+776} 8018573f{read_swap_cache_async+63} 801781b1{swapin_readahead+97} 8017834e{do_swap_page+142} 801796a1{handle_mm_fault+337} 80123ebb{do_page_fault+411} 801a3259{sys_select+1097} 801a332f{sys_select+1311} 801122a9{error_exit+0} mg.C.2D 0100e810 0 1971 1955 1972 (NOTLB) 0100e236bc68 0006 0001 0100816ed360 Call Trace:8017338b{try_to_free_pages+283} 80147d0d{schedule_timeout+173} 80147c50{process_timeout+0} 8013a292{io_schedule_timeout+82} 80280efd{blk_congestion_wait+141} 8013c530{autoremove_wake_function+0} 8013c530{autoremove_wake_function+0} 8016ab68{__alloc_pages+776} 801778ad{do_wp_page+285} 801796c5{handle_mm_fault+373} 80123ebb{do_page_fault+411} 801122a9{error_exit+0} mg.C.2S 01007b0a06a0 0 1972 1971 1974 (NOTLB) 0100bc1c1ca0 0006 0010 00010246 0004c7c0 0100816ec280 00768780 010081f23390 00018780 0100816ed360 Call Trace:8016abb4{__alloc_pages+852} 80110ac8{__down_interruptible+216} 80139280{default_wake_function+0} 8013531c{recalc_task_prio+940} 80230d91{__down_failed_interruptible+53} a01cc47e{:mosal:.text.lock.mosal_sync+5}
RE: Processes stuck on D state on Dual Opteron
Nick Piggin wrote on Tuesday, April 12, 2005 4:09 AM Chen, Kenneth W wrote: I like the patch a lot and already did bench it on our db setup. However, I'm seeing a negative regression compare to a very very crappy patch (see attached, you can laugh at me for doing things like that :-). OK - if we go that way, perhaps the following patch may be the way to do it. OK, if you are going to do it that way, then the ioc_batching code in get_request has to be reworked. We never push the queue so hard that it kicks itself into the batching mode. However, calls to get_io_context and put_io_context are unconditional in that function. Execution profile shows that these two little functions actually consumed lots of cpu cycles. AFAICS, ioc_*batching() is trying to push more requests onto the queue that is full (or near full) and give high priority to the process that hits the last req slot. Why do we need to go all the way to tsk-io_context to keep track of that state? For a clean up bonus, I think the tracking can be moved into the queue structure. My first reaction is that the overhead is in wait queue setup and tear down in get_request_wait function. Throwing the following patch on top does improve things a bit, but we are still in the negative territory. I can't explain why. Everything suppose to be faster. So I'm staring at the execution profile at the moment. Hmm, that's a bit disappointing. Like you said though, I'm sure we should be able to get better performance out of this. Absolutely. I'm disappointed too and this is totally out of expectation. There must be some other factors. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
On Tuesday 12 April 2005 01:46, Andrew Morton wrote: Claudio Martins [EMAIL PROTECTED] wrote: I think I'm going to give a try to Neil's patch, but I'll have to apply some patches from -mm. Just this one if you're using 2.6.12-rc2: --- 25/drivers/md/md.c~avoid-deadlock-in-sync_page_io-by-using-gfp_noio Mon Apr 11 16:55:07 2005 +++ 25-akpm/drivers/md/md.c Mon Apr 11 16:55:07 2005 @@ -332,7 +332,7 @@ static int bi_complete(struct bio *bio, static int sync_page_io(struct block_device *bdev, sector_t sector, int size, struct page *page, int rw) { - struct bio *bio = bio_alloc(GFP_KERNEL, 1); + struct bio *bio = bio_alloc(GFP_NOIO, 1); struct completion event; int ret; _ Hi Andrew, all, Sorry for the delay in reporting. This patch does indeed fix the problem. The machine ran stress for almost 15h straight with no problems at all. As for Nick's patch I, too, think it would be nice to be included (once the performance problems are sorted out), since it seemed to make the block layer more robust and well behaved (at least with stress), although I didn't run performance tests to measure regressions. Thanks Nick, Neil, Andrew and all others for your great help with this issue. I'll have to put the machine on production now with the patch applied, but let me know if I can be of any further help with these issues. Thanks Claudio - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
On Tue, 2005-04-12 at 01:22 +0100, Claudio Martins wrote: > On Monday 11 April 2005 23:59, Nick Piggin wrote: > > > > > OK, I'll try them in a few minutes and report back. > > > > I'm not overly hopeful. If they fix the problem, then it's likely > > that the real bug is hidden. > > > > Well, the thing is, they do fix the problem. Or at least they hide it very > well ;-) > > It has been running for more than 5 hours now with stress with no problems > and no stuck processes. > Well, that is good... I guess ;) Actually the patches I have sent you do fix real bugs, but they also make the block layer less likely to recurse into page reclaim, so it may be eg. hiding the problem that Neil's patch fixes. It may be that your fundamental problem is solved by my patches, but we need to be sure. > I think I'm going to give a try to Neil's patch, but I'll have to apply > some > patches from -mm. > Yep that would be good. Please test -rc2 with Andrew's patch, and obviously my patches backed out. Thanks for sticking with it. Nick - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
Claudio Martins <[EMAIL PROTECTED]> wrote: > > I think I'm going to give a try to Neil's patch, but I'll have to apply > some > patches from -mm. Just this one if you're using 2.6.12-rc2: --- 25/drivers/md/md.c~avoid-deadlock-in-sync_page_io-by-using-gfp_noio Mon Apr 11 16:55:07 2005 +++ 25-akpm/drivers/md/md.c Mon Apr 11 16:55:07 2005 @@ -332,7 +332,7 @@ static int bi_complete(struct bio *bio, static int sync_page_io(struct block_device *bdev, sector_t sector, int size, struct page *page, int rw) { - struct bio *bio = bio_alloc(GFP_KERNEL, 1); + struct bio *bio = bio_alloc(GFP_NOIO, 1); struct completion event; int ret; _ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
On Tuesday 12 April 2005 00:46, Neil Brown wrote: > On Monday April 11, [EMAIL PROTECTED] wrote: > > Neil, have you had a look at the traces? Do they mean much to you? > > Just looked. > bio_alloc_bioset seems implicated, as does sync_page_io. > > sync_page_io used to use a 'struct bio' on the stack, but Jens Axboe > changed it to use bio_alloc (don't know why..) and I should have > checked the change better. > > sync_page_io can be called on the write out path, so it should use > GFP_NOIO rather than GFP_KERNEL. > > See if this helps. Actually this patch is against 2.6.12-rc2-mm1 > which uses md_super_write instead of sync_page_io (which is now only > used for read). So if you are using a non-mm kernel (which seems to > be the case) you'll need to apply the patch by hand. > Hi Neil, I'll test this patch, but I'm wondering whether I have to apply all the md-related patches from the broken-out directory of 2.6.12-rc2-mm1 or only some specific ones? Anyway, I'm happy to test all those md updates if you think they might help. Thanks Claudio - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
On Monday 11 April 2005 23:59, Nick Piggin wrote: > > > OK, I'll try them in a few minutes and report back. > > I'm not overly hopeful. If they fix the problem, then it's likely > that the real bug is hidden. > Well, the thing is, they do fix the problem. Or at least they hide it very well ;-) It has been running for more than 5 hours now with stress with no problems and no stuck processes. I think I'm going to give a try to Neil's patch, but I'll have to apply some patches from -mm. Thanks Claudio - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
On Monday April 11, [EMAIL PROTECTED] wrote: > > Neil, have you had a look at the traces? Do they mean much to you? > Just looked. bio_alloc_bioset seems implicated, as does sync_page_io. sync_page_io used to use a 'struct bio' on the stack, but Jens Axboe changed it to use bio_alloc (don't know why..) and I should have checked the change better. sync_page_io can be called on the write out path, so it should use GFP_NOIO rather than GFP_KERNEL. See if this helps. Actually this patch is against 2.6.12-rc2-mm1 which uses md_super_write instead of sync_page_io (which is now only used for read). So if you are using a non-mm kernel (which seems to be the case) you'll need to apply the patch by hand. Thanks, NeilBrown --- Avoid deadlock in sync_page_io by using GFP_NOIO ..as sync_page_io can be called on the write-out path. Ditto for md_super_write Signed-off-by: Neil Brown <[EMAIL PROTECTED]> ### Diffstat output ./drivers/md/md.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff ./drivers/md/md.c~current~ ./drivers/md/md.c --- ./drivers/md/md.c~current~ 2005-04-08 11:25:26.0 +1000 +++ ./drivers/md/md.c 2005-04-12 09:42:29.0 +1000 @@ -351,7 +351,7 @@ void md_super_write(mddev_t *mddev, mdk_ * if zero is reached. * If an error occurred, call md_error */ - struct bio *bio = bio_alloc(GFP_KERNEL, 1); + struct bio *bio = bio_alloc(GFP_NOIO, 1); bio->bi_bdev = rdev->bdev; bio->bi_sector = sector; @@ -374,7 +374,7 @@ static int bi_complete(struct bio *bio, int sync_page_io(struct block_device *bdev, sector_t sector, int size, struct page *page, int rw) { - struct bio *bio = bio_alloc(GFP_KERNEL, 1); + struct bio *bio = bio_alloc(GFP_NOIO, 1); struct completion event; int ret; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
Claudio Martins wrote: Right. I'm using two Seagate ATA133 disks (the IDE controller is an AMD-8111), each with 4 partitions, so I get 4 md Raid1 devices. The first one, md0, is for swap. The rest are ~$ df -h Filesystem Size Used Avail Use% Mounted on /dev/md1 4.6G 1.9G 2.6G 42% / tmpfs 1005M 0 1005M 0% /dev/shm /dev/md3 32G 107M 30G 1% /home /dev/md2 31G 149M 29G 1% /var In these tests, /home on md3 is the working area for stress. The io scheduler used is the anticipatory one. OK. OK, I'll try them in a few minutes and report back. I'm not overly hopeful. If they fix the problem, then it's likely that the real bug is hidden. I'm curious as to whether increasing the vm.min_free_kbytes sysctl value would help or not in this case. But I guess it wouldn't, since there is already some free memory and also the alloc failures are order 0, right? Yes. And the failures you were seeing with my first patch were coming from the mempool code anyway. We want those to fail early so they don't eat into the min_free_kbytes memory. You could try raising min_free_kbytes though. If that fixes it, then it indicates there might be some problem in a memory allocation failure path in software raid somewhere. Thanks -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
On Monday 11 April 2005 13:45, Nick Piggin wrote: > > No luck yet (on SMP i386). How many disks are you using in each > raid1 array? You are using one array for swap, and one mounted as > ext3 for the working area of the `stress` program, right? > Right. I'm using two Seagate ATA133 disks (the IDE controller is an AMD-8111), each with 4 partitions, so I get 4 md Raid1 devices. The first one, md0, is for swap. The rest are ~$ df -h Filesystem Size Used Avail Use% Mounted on /dev/md1 4.6G 1.9G 2.6G 42% / tmpfs 1005M 0 1005M 0% /dev/shm /dev/md3 32G 107M 30G 1% /home /dev/md2 31G 149M 29G 1% /var In these tests, /home on md3 is the working area for stress. The io scheduler used is the anticipatory one. > Neil, have you had a look at the traces? Do they mean much to you? > > Claudio - I have attached another patch you could try. It has a more > complete set of mempool and related memory allocation fixes, as well > as some other recent patches I had which reduce atomic memory usage > by the block layer. Could you try if you get time? Thanks. OK, I'll try them in a few minutes and report back. I'm curious as to whether increasing the vm.min_free_kbytes sysctl value would help or not in this case. But I guess it wouldn't, since there is already some free memory and also the alloc failures are order 0, right? Thanks Claudio - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
Nick Piggin wrote: The common theme seems to be: try_to_free_pages, swap_writepage, mempool_alloc, down/down_failed in .text.lock.md. Next I would suspect md/raid1 - maybe some deadlock in an uncommon memory allocation failure path? I'll see if I can reproduce it here. No luck yet (on SMP i386). How many disks are you using in each raid1 array? You are using one array for swap, and one mounted as ext3 for the working area of the `stress` program, right? Neil, have you had a look at the traces? Do they mean much to you? Claudio - I have attached another patch you could try. It has a more complete set of mempool and related memory allocation fixes, as well as some other recent patches I had which reduce atomic memory usage by the block layer. Could you try if you get time? Thanks. Nick -- SUSE Labs, Novell Inc. Index: linux-2.6/drivers/block/ll_rw_blk.c === --- linux-2.6.orig/drivers/block/ll_rw_blk.c 2005-04-11 22:18:49.0 +1000 +++ linux-2.6/drivers/block/ll_rw_blk.c 2005-04-11 22:38:10.0 +1000 @@ -1450,7 +1450,7 @@ EXPORT_SYMBOL(blk_remove_plug); */ void __generic_unplug_device(request_queue_t *q) { - if (test_bit(QUEUE_FLAG_STOPPED, &q->queue_flags)) + if (unlikely(test_bit(QUEUE_FLAG_STOPPED, &q->queue_flags))) return; if (!blk_remove_plug(q)) @@ -1828,7 +1828,6 @@ static void __freed_request(request_queu clear_queue_congested(q, rw); if (rl->count[rw] + 1 <= q->nr_requests) { - smp_mb(); if (waitqueue_active(&rl->wait[rw])) wake_up(&rl->wait[rw]); @@ -1860,18 +1859,20 @@ static void freed_request(request_queue_ #define blkdev_free_rq(list) list_entry((list)->next, struct request, queuelist) /* - * Get a free request, queue_lock must not be held + * Get a free request, queue_lock must be held. + * Returns NULL on failure, with queue_lock held. + * Returns !NULL on success, with queue_lock *not held*. 
 */ static struct request *get_request(request_queue_t *q, int rw, int gfp_mask) { + int batching; struct request *rq = NULL; struct request_list *rl = &q->rq; - struct io_context *ioc = get_io_context(gfp_mask); + struct io_context *ioc = get_io_context(GFP_ATOMIC); if (unlikely(test_bit(QUEUE_FLAG_DRAIN, &q->queue_flags))) goto out; - spin_lock_irq(q->queue_lock); if (rl->count[rw]+1 >= q->nr_requests) { /* * The queue will fill after this allocation, so set it as @@ -1884,6 +1885,8 @@ static struct request *get_request(reque blk_set_queue_full(q, rw); } } + + batching = ioc_batching(q, ioc); switch (elv_may_queue(q, rw)) { case ELV_MQUEUE_NO: @@ -1894,12 +1897,11 @@ static struct request *get_request(reque goto get_rq; } - if (blk_queue_full(q, rw) && !ioc_batching(q, ioc)) { + if (blk_queue_full(q, rw) && !batching) { /* * The queue is full and the allocating process is not a * "batcher", and not exempted by the IO scheduler */ - spin_unlock_irq(q->queue_lock); goto out; } @@ -1933,11 +1935,10 @@ rq_starved: if (unlikely(rl->count[rw] == 0)) rl->starved[rw] = 1; - spin_unlock_irq(q->queue_lock); goto out; } - if (ioc_batching(q, ioc)) + if (batching) ioc->nr_batch_requests--; rq_init(q, rq); @@ -1950,13 +1951,14 @@ out: /* * No available requests for this queue, unplug the device and wait for some * requests to become available. + * + * Called with q->queue_lock held, and returns with it unlocked. 
 */ static struct request *get_request_wait(request_queue_t *q, int rw) { DEFINE_WAIT(wait); struct request *rq; - generic_unplug_device(q); do { struct request_list *rl = &q->rq; @@ -1968,6 +1970,8 @@ static struct request *get_request_wait( if (!rq) { struct io_context *ioc; + __generic_unplug_device(q); + spin_unlock_irq(q->queue_lock); io_schedule(); /* @@ -1979,6 +1983,8 @@ static struct request *get_request_wait( ioc = get_io_context(GFP_NOIO); ioc_set_batching(q, ioc); put_io_context(ioc); + + spin_lock_irq(q->queue_lock); } finish_wait(&rl->wait[rw], &wait); } while (!rq); @@ -1992,10 +1998,15 @@ struct request *blk_get_request(request_ BUG_ON(rw != READ && rw != WRITE); + spin_lock_irq(q->queue_lock); if (gfp_mask
Re: Processes stuck on D state on Dual Opteron
Claudio Martins wrote: On Sunday 10 April 2005 03:47, Andrew Morton wrote: Suggest you boot with `nmi_watchdog=0' to prevent the nmi watchdog from cutting in during long sysrq traces. Also, capture the `sysrq-m' output so we can see if the thing is out of memory. Hi Andrew, Thanks for the tip. I booted with nmi_watchdog=0 and was able to get a full sysrq-t as well as a sysrq-m. Since it might be a little too big for the list, I've put it on a text file at: http://193.136.132.235/dl145/dump1-2.6.12-rc2.txt OK, you _may_ be out of memory here (depending on what the lower zone protection for DMA ends up as), however you are well above all the "emergency watermarks" in ZONE_NORMAL. Also: I also made a run with the mempool-can-fail patch from Nick Piggin. With this I got some nice memory allocation errors from the md threads when the trouble started. The dump (with sysrq-t and sysrq-m included) is at: http://193.136.132.235/dl145/dump2-2.6.12-rc2-nick1.txt This one shows plenty of memory. The allocation failure messages are actually a good thing, and show that my patch is sort of working. I have reworked it a bit so they won't show up though. So probably not your common or garden memory deadlock. The common theme seems to be: try_to_free_pages, swap_writepage, mempool_alloc, down/down_failed in .text.lock.md. Next I would suspect md/raid1 - maybe some deadlock in an uncommon memory allocation failure path? I'll see if I can reproduce it here. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
Claudio Martins wrote: On Sunday 10 April 2005 03:47, Andrew Morton wrote: Suggest you boot with `nmi_watchdog=0' to prevent the nmi watchdog from cutting in during long sysrq traces. Also, capture the `sysrq-m' output so we can see if the thing is out of memory. Hi Andrew, Thanks for the tip. I booted with nmi_watchdog=0 and was able to get a full sysrq-t as well as a sysrq-m. Since it might be a little too big for the list, I've put it on a text file at: http://193.136.132.235/dl145/dump1-2.6.12-rc2.txt I also made a run with the mempool-can-fail patch from Nick Piggin. With this I got some nice memory allocation errors from the md threads when the trouble started. The dump (with sysrq-t and sysrq-m included) is at: http://193.136.132.235/dl145/dump2-2.6.12-rc2-nick1.txt Let me know if you find it more convenient to send the dumps by mail or something. Hope this helps. I tried to get these just now, but couldn't. Would you gzip them and send them to me privately? Thanks, Nick -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
Claudio Martins wrote: On Sunday 10 April 2005 03:47, Andrew Morton wrote: Suggest you boot with `nmi_watchdog=0' to prevent the nmi watchdog from cutting in during long sysrq traces. Also, capture the `sysrq-m' output so we can see if the thing is out of memory. Hi Andrew, Thanks for the tip. I booted with nmi_watchdog=0 and was able to get a full sysrq-t as well as a sysrq-m. Since it might be a little too big for the list, I've put it on a text file at: http://193.136.132.235/dl145/dump1-2.6.12-rc2.txt I also made a run with the mempool-can-fail patch from Nick Piggin. With this I got some nice memory allocation errors from the md threads when the trouble started. The dump (with sysrq-t and sysrq-m included) is at: http://193.136.132.235/dl145/dump2-2.6.12-rc2-nick1.txt Let me know if you find it more convenient to send the dumps by mail or something. Hope this helps. Itried to get these just now, but couldn't. Would you gzip them and send them to me privately? Thanks, Nick -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
Claudio Martins wrote: On Sunday 10 April 2005 03:47, Andrew Morton wrote: Suggest you boot with `nmi_watchdog=0' to prevent the nmi watchdog from cutting in during long sysrq traces. Also, capture the `sysrq-m' output so we can see if the thing is out of memory. Hi Andrew, Thanks for the tip. I booted with nmi_watchdog=0 and was able to get a full sysrq-t as well as a sysrq-m. Since it might be a little too big for the list, I've put it on a text file at: http://193.136.132.235/dl145/dump1-2.6.12-rc2.txt OK, you _may_ be out of memory here (depending on what the lower zone protection for DMA ends up as), however you are well above all the emergency watermarks in ZONE_NORMAL. Also: I also made a run with the mempool-can-fail patch from Nick Piggin. With this I got some nice memory allocation errors from the md threads when the trouble started. The dump (with sysrq-t and sysrq-m included) is at: http://193.136.132.235/dl145/dump2-2.6.12-rc2-nick1.txt This one shows plenty of memory. The allocation failure messages are actually a good thing, and show that my patch is sort of working. I have reworked it a bit so they won't show up though. So probably not your common or garden memory deadlock. The common theme seems to be: try_to_free_pages, swap_writepage, mempool_alloc, down/down_failed in .text.lock.md. Next I would suspect md/raid1 - maybe some deadlock in an uncommon memory allocation failure path? I'll see if I can reproduce it here. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
Nick Piggin wrote: The common theme seems to be: try_to_free_pages, swap_writepage, mempool_alloc, down/down_failed in .text.lock.md. Next I would suspect md/raid1 - maybe some deadlock in an uncommon memory allocation failure path? I'll see if I can reproduce it here. No luck yet (on SMP i386). How many disks are you using in each raid1 array? You are using one array for swap, and one mounted as ext3 for the working area of the `stress` program, right? Neil, have you had a look at the traces? Do they mean much to you? Claudio - I have attached another patch you could try. It has a more complete set of mempool and related memory allocation fixes, as well as some other recent patches I had which reduces atomic memory usage by the block layer. Could you try if you get time? Thanks. Nick -- SUSE Labs, Novell Inc. Index: linux-2.6/drivers/block/ll_rw_blk.c === --- linux-2.6.orig/drivers/block/ll_rw_blk.c2005-04-11 22:18:49.0 +1000 +++ linux-2.6/drivers/block/ll_rw_blk.c 2005-04-11 22:38:10.0 +1000 @@ -1450,7 +1450,7 @@ EXPORT_SYMBOL(blk_remove_plug); */ void __generic_unplug_device(request_queue_t *q) { - if (test_bit(QUEUE_FLAG_STOPPED, q-queue_flags)) + if (unlikely(test_bit(QUEUE_FLAG_STOPPED, q-queue_flags))) return; if (!blk_remove_plug(q)) @@ -1828,7 +1828,6 @@ static void __freed_request(request_queu clear_queue_congested(q, rw); if (rl-count[rw] + 1 = q-nr_requests) { - smp_mb(); if (waitqueue_active(rl-wait[rw])) wake_up(rl-wait[rw]); @@ -1860,18 +1859,20 @@ static void freed_request(request_queue_ #define blkdev_free_rq(list) list_entry((list)-next, struct request, queuelist) /* - * Get a free request, queue_lock must not be held + * Get a free request, queue_lock must be held. + * Returns NULL on failure, with queue_lock held. + * Returns !NULL on success, with queue_lock *not held*. 
*/ static struct request *get_request(request_queue_t *q, int rw, int gfp_mask) { + int batching; struct request *rq = NULL; struct request_list *rl = q-rq; - struct io_context *ioc = get_io_context(gfp_mask); + struct io_context *ioc = get_io_context(GFP_ATOMIC); if (unlikely(test_bit(QUEUE_FLAG_DRAIN, q-queue_flags))) goto out; - spin_lock_irq(q-queue_lock); if (rl-count[rw]+1 = q-nr_requests) { /* * The queue will fill after this allocation, so set it as @@ -1884,6 +1885,8 @@ static struct request *get_request(reque blk_set_queue_full(q, rw); } } + + batching = ioc_batching(q, ioc); switch (elv_may_queue(q, rw)) { case ELV_MQUEUE_NO: @@ -1894,12 +1897,11 @@ static struct request *get_request(reque goto get_rq; } - if (blk_queue_full(q, rw) !ioc_batching(q, ioc)) { + if (blk_queue_full(q, rw) !batching) { /* * The queue is full and the allocating process is not a * batcher, and not exempted by the IO scheduler */ - spin_unlock_irq(q-queue_lock); goto out; } @@ -1933,11 +1935,10 @@ rq_starved: if (unlikely(rl-count[rw] == 0)) rl-starved[rw] = 1; - spin_unlock_irq(q-queue_lock); goto out; } - if (ioc_batching(q, ioc)) + if (batching) ioc-nr_batch_requests--; rq_init(q, rq); @@ -1950,13 +1951,14 @@ out: /* * No available requests for this queue, unplug the device and wait for some * requests to become available. + * + * Called with q-queue_lock held, and returns with it unlocked. 
*/ static struct request *get_request_wait(request_queue_t *q, int rw) { DEFINE_WAIT(wait); struct request *rq; - generic_unplug_device(q); do { struct request_list *rl = q-rq; @@ -1968,6 +1970,8 @@ static struct request *get_request_wait( if (!rq) { struct io_context *ioc; + __generic_unplug_device(q); + spin_unlock_irq(q-queue_lock); io_schedule(); /* @@ -1979,6 +1983,8 @@ static struct request *get_request_wait( ioc = get_io_context(GFP_NOIO); ioc_set_batching(q, ioc); put_io_context(ioc); + + spin_lock_irq(q-queue_lock); } finish_wait(rl-wait[rw], wait); } while (!rq); @@ -1992,10 +1998,15 @@ struct request *blk_get_request(request_ BUG_ON(rw != READ rw != WRITE); + spin_lock_irq(q-queue_lock); if (gfp_mask
Re: Processes stuck on D state on Dual Opteron
On Monday 11 April 2005 13:45, Nick Piggin wrote: No luck yet (on SMP i386). How many disks are you using in each raid1 array? You are using one array for swap, and one mounted as ext3 for the working area of the `stress` program, right? Right. I'm using two Seagate ATA133 disks (ide controler is AMD-8111) each with 4 partitions, so I get 4 md Raid1 devices. The first one, md0, is for swap. The rest are ~$ df -h FilesystemSize Used Avail Use% Mounted on /dev/md1 4.6G 1.9G 2.6G 42% / tmpfs1005M 0 1005M 0% /dev/shm /dev/md3 32G 107M 30G 1% /home /dev/md2 31G 149M 29G 1% /var In these tests, /home on md3 is the working area for stress. The io scheduler used is the anticipatory. Neil, have you had a look at the traces? Do they mean much to you? Claudio - I have attached another patch you could try. It has a more complete set of mempool and related memory allocation fixes, as well as some other recent patches I had which reduces atomic memory usage by the block layer. Could you try if you get time? Thanks. OK, I'll try them in a few minutes and report back. I'm curious as whether increasing the vm.min_free_kbytes sysctl value would help or not in this case. But I guess it wouldn't since there is already some free memory and also the alloc failures are order 0, right? Thanks Claudio - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
Claudio Martins wrote: Right. I'm using two Seagate ATA133 disks (ide controler is AMD-8111) each with 4 partitions, so I get 4 md Raid1 devices. The first one, md0, is for swap. The rest are ~$ df -h FilesystemSize Used Avail Use% Mounted on /dev/md1 4.6G 1.9G 2.6G 42% / tmpfs1005M 0 1005M 0% /dev/shm /dev/md3 32G 107M 30G 1% /home /dev/md2 31G 149M 29G 1% /var In these tests, /home on md3 is the working area for stress. The io scheduler used is the anticipatory. OK. OK, I'll try them in a few minutes and report back. I'm not overly hopeful. If they fix the problem, then it's likely that the real bug is hidden. I'm curious as whether increasing the vm.min_free_kbytes sysctl value would help or not in this case. But I guess it wouldn't since there is already some free memory and also the alloc failures are order 0, right? Yes. And the failures you were seeing with my first patch were coming from the mempool code anyway. We want those to fail early so they don't eat into the min_free_kbytes memory. You could try raising min_free_kbytes though. If that fixes it, then it indicates there might be some problem in a memory allocation failure path in software raid somewhere. Thanks -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
On Monday April 11, [EMAIL PROTECTED] wrote: Neil, have you had a look at the traces? Do they mean much to you? Just looked. bio_alloc_bioset seems implicated, as does sync_page_io. sync_page_io used to use a 'struct bio' on the stack, but Jens Axboe change it to use bio_alloc (don't know why..) and I should have checked the change better. sync_page_io can be called on the write out path, so it should use GFP_NOIO rather than GFP_KERNEL. See if this helps Actually this patch is against 2.6.12-rc2-mm1 which uses md_super_write instead of sync_page_io (which is now only used for read). So if you are using a non-mm kernel (which seems to be the case) you'll need to apply the patch by hand. Thanks, NeilBrown --- Avoid deadlock in sync_page_io by using GFP_NOIO ..as sync_page_io can be called on the write-out path. Ditto for md_super_write Signed-off-by: Neil Brown [EMAIL PROTECTED] ### Diffstat output ./drivers/md/md.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff ./drivers/md/md.c~current~ ./drivers/md/md.c --- ./drivers/md/md.c~current~ 2005-04-08 11:25:26.0 +1000 +++ ./drivers/md/md.c 2005-04-12 09:42:29.0 +1000 @@ -351,7 +351,7 @@ void md_super_write(mddev_t *mddev, mdk_ * if zero is reached. * If an error occurred, call md_error */ - struct bio *bio = bio_alloc(GFP_KERNEL, 1); + struct bio *bio = bio_alloc(GFP_NOIO, 1); bio-bi_bdev = rdev-bdev; bio-bi_sector = sector; @@ -374,7 +374,7 @@ static int bi_complete(struct bio *bio, int sync_page_io(struct block_device *bdev, sector_t sector, int size, struct page *page, int rw) { - struct bio *bio = bio_alloc(GFP_KERNEL, 1); + struct bio *bio = bio_alloc(GFP_NOIO, 1); struct completion event; int ret; - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
On Monday 11 April 2005 23:59, Nick Piggin wrote: OK, I'll try them in a few minutes and report back. I'm not overly hopeful. If they fix the problem, then it's likely that the real bug is hidden. Well, the thing is, they do fix the problem. Or at least they hide it very well ;-) It has been running for more than 5 hours now with stress with no problems and no stuck processes. I think I'm going to give a try to Neil's patch, but I'll have to apply some patches from -mm. Thanks Claudio - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
On Tuesday 12 April 2005 00:46, Neil Brown wrote: On Monday April 11, [EMAIL PROTECTED] wrote: Neil, have you had a look at the traces? Do they mean much to you? Just looked. bio_alloc_bioset seems implicated, as does sync_page_io. sync_page_io used to use a 'struct bio' on the stack, but Jens Axboe changed it to use bio_alloc (don't know why..) and I should have checked the change better. sync_page_io can be called on the write-out path, so it should use GFP_NOIO rather than GFP_KERNEL. See if this helps. Actually this patch is against 2.6.12-rc2-mm1, which uses md_super_write instead of sync_page_io (which is now only used for reads). So if you are using a non-mm kernel (which seems to be the case) you'll need to apply the patch by hand. Hi Neil, I'll test this patch, but I'm wondering if I have to apply all the md-related patches from the broken-out directory of 2.6.12-rc2-mm1 or only some specific ones? Anyway I'm happy to test all those md updates, if you think they might help. Thanks Claudio - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
Claudio Martins <[EMAIL PROTECTED]> wrote: I think I'm going to give a try to Neil's patch, but I'll have to apply some patches from -mm. Just this one if you're using 2.6.12-rc2:

--- 25/drivers/md/md.c~avoid-deadlock-in-sync_page_io-by-using-gfp_noio	Mon Apr 11 16:55:07 2005
+++ 25-akpm/drivers/md/md.c	Mon Apr 11 16:55:07 2005
@@ -332,7 +332,7 @@ static int bi_complete(struct bio *bio,
 static int sync_page_io(struct block_device *bdev, sector_t sector, int size,
 		   struct page *page, int rw)
 {
-	struct bio *bio = bio_alloc(GFP_KERNEL, 1);
+	struct bio *bio = bio_alloc(GFP_NOIO, 1);
 	struct completion event;
 	int ret;
_
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
On Tue, 2005-04-12 at 01:22 +0100, Claudio Martins wrote: On Monday 11 April 2005 23:59, Nick Piggin wrote: OK, I'll try them in a few minutes and report back. I'm not overly hopeful. If they fix the problem, then it's likely that the real bug is hidden. Well, the thing is, they do fix the problem. Or at least they hide it very well ;-) It has been running for more than 5 hours now with stress with no problems and no stuck processes. Well, that is good... I guess ;) Actually the patches I have sent you do fix real bugs, but they also make the block layer less likely to recurse into page reclaim, so it may be, e.g., hiding the problem that Neil's patch fixes. It may be that your fundamental problem is solved by my patches, but we need to be sure. I think I'm going to give a try to Neil's patch, but I'll have to apply some patches from -mm. Yep, that would be good. Please test -rc2 with Andrew's patch, and obviously my patches backed out. Thanks for sticking with it. Nick - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
On Sunday 10 April 2005 03:47, Andrew Morton wrote: > > Suggest you boot with `nmi_watchdog=0' to prevent the nmi watchdog from > cutting in during long sysrq traces. > > Also, capture the `sysrq-m' output so we can see if the thing is out of > memory. Hi Andrew, Thanks for the tip. I booted with nmi_watchdog=0 and was able to get a full sysrq-t as well as a sysrq-m. Since it might be a little too big for the list, I've put it on a text file at: http://193.136.132.235/dl145/dump1-2.6.12-rc2.txt I also made a run with the mempool-can-fail patch from Nick Piggin. With this I got some nice memory allocation errors from the md threads when the trouble started. The dump (with sysrq-t and sysrq-m included) is at: http://193.136.132.235/dl145/dump2-2.6.12-rc2-nick1.txt Let me know if you find it more convenient to send the dumps by mail or something. Hope this helps. Thanks, Claudio - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
On Sunday 10 April 2005 03:53, Nick Piggin wrote: > > Looks like you may possibly have a memory allocation deadlock > (although I can't explain the NMI oops). > > I would be interested to see if the following patch is of any > help to you. > Hi Nick, I'll build a kernel with your patch and report on the results as soon as possible. Thanks Claudio - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
On Sunday 10 April 2005 03:47, Andrew Morton wrote: > > Suggest you boot with `nmi_watchdog=0' to prevent the nmi watchdog from > cutting in during long sysrq traces. > > Also, capture the `sysrq-m' output so we can see if the thing is out of > memory. OK, will do it ASAP and report back. Thanks, Claudio - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
Claudio Martins wrote: On Tuesday 05 April 2005 03:12, Andrew Morton wrote: Claudio Martins <[EMAIL PROTECTED]> wrote: While stress testing 2.6.12-rc2 on an HP DL145 I get processes stuck in D state after some time. This machine is a dual Opteron 248 with 2GB (ECC) on one node (the other node has no RAM modules plugged in, since this board works only with pairs). I was using stress (http://weather.ou.edu/~apw/projects/stress/) with the following command line: stress -v -c 20 -i 12 -m 10 -d 20 [snip] Unfortunately the system Oopsed in the middle of dumping the tasks, but from what I can see I'm tempted to think that this might be related to the MD code. md2_raid1 is blocked on D state and, although not shown on the dump, I know from ps command that md0_raid1 (the swap partition) was also on D state (along with the stress processes which are responsible for hogging memory, and top and df). There were about 200MB swapped out, but the swap partition size is 1GB. Looks like you may possibly have a memory allocation deadlock (although I can't explain the NMI oops). I would be interested to see if the following patch is of any help to you. Thanks, Nick -- SUSE Labs, Novell Inc.

Index: linux-2.6/mm/mempool.c
===================================================================
--- linux-2.6.orig/mm/mempool.c	2005-03-30 10:39:51 +1000
+++ linux-2.6/mm/mempool.c	2005-03-30 10:41:29 +1000
@@ -198,7 +198,10 @@ void * mempool_alloc(mempool_t *pool, in
 	void *element;
 	unsigned long flags;
 	DEFINE_WAIT(wait);
-	int gfp_nowait = gfp_mask & ~(__GFP_WAIT | __GFP_IO);
+	int gfp_nowait;
+
+	gfp_mask |= __GFP_NORETRY;	/* don't loop in __alloc_pages */
+	gfp_nowait = gfp_mask & ~(__GFP_WAIT | __GFP_IO);

 	might_sleep_if(gfp_mask & __GFP_WAIT);
 repeat_alloc:
Re: Processes stuck on D state on Dual Opteron
Claudio Martins <[EMAIL PROTECTED]> wrote: > > I repeated the test to try to get more output from alt-sysrq-T, but it > oopsed again with even less output. > By the way, I have also tested 2.6.11.6 and I get stuck processes in the > same way. With 2.6.9 I get a hard lockup with no working alt-sysrq, after > about 30 to 60 mins of stress. It could be an md deadlock, or it could be an out-of-memory deadlock: md trying to allocate memory on the swapout path. > This is with preempt enabled (as well as BKL preempt). I want to test also > without preempt and also without using MD Raid1, but I'll have to reach the > machine and hit the power button, so not possible until tomorrow :-( > > The original message in this thread containing the details of the > setup and a .config is at: > > http://marc.theaimsgroup.com/?l=linux-kernel&m=111266784320156&w=2 > > I am happy to test any patches and also wonder if enabling any of the > options in the kernel debugging section could help in trying to find where > the deadlock is. Suggest you boot with `nmi_watchdog=0' to prevent the nmi watchdog from cutting in during long sysrq traces. Also, capture the `sysrq-m' output so we can see if the thing is out of memory. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Processes stuck on D state on Dual Opteron
On Tuesday 05 April 2005 03:12, Andrew Morton wrote: > Claudio Martins <[EMAIL PROTECTED]> wrote: > > While stress testing 2.6.12-rc2 on an HP DL145 I get processes stuck > > in D state after some time. > > This machine is a dual Opteron 248 with 2GB (ECC) on one node (the > > other node has no RAM modules plugged in, since this board works only > > with pairs). > > > > I was using stress (http://weather.ou.edu/~apw/projects/stress/) with > > the following command line: > > > > stress -v -c 20 -i 12 -m 10 -d 20 > > > > This causes a constant load avg. of around 70, makes the machine go > > into swap a little, and writes up to about 20GB of random data to disk > > while eating up all CPU. After about half an hour random processes like > > top, df, etc get stuck in D state. Half of the 60 or so stress processes > > are also in D state. The machine keeps being responsive for maybe some 15 > > minutes but then the shells just hang and sshd stops responding to > > connections, though the machine replies to pings (I don't have console > > access till tomorrow). > > > > The system is using ext3 with md software Raid1. > > > > I'm interested in knowing if anyone out there with dual Opterons can > > reproduce this or not. I also have access to an HP DL360 Dual Xeon, so I > > will try to find out if this is AMD64 specific as soon as possible. > > Please let me know if you want me to run some other tests or give some > > more info to help solve this one. > > Can you capture the output from alt-sysrq-T? Hi Andrew, Due to other tasks, only now was I able to repeat the tests and capture the output from alt-sysrq-T.
I booted with serial console, put stress to work, and when the processes started to get hung in D state I managed to capture the following:

SysRq : Show State

 sibling task PC pid father child younger older
init D 81007fcfe0d8 0 1 0 2 (NOTLB)
 810003253768 0082 81007fd19170 007d 81007fd19170 810003251470 271b 810074468e70 810003251680 8027a79a
Call Trace: {__make_request+1274} {__down+152} {default_wake_function+0} {mempool_alloc+164} {__down_failed+53} {.text.lock.md+155} {make_request+868} {cache_alloc_refill+413} {generic_make_request+545} {autoremove_wake_function+0} {autoremove_wake_function+0} {submit_bio+223} {test_set_page_writeback+203} {swap_writepage+184} {shrink_zone+2678} {thread_return+0} {thread_return+88} {try_to_free_pages+311} {autoremove_wake_function+0} {__alloc_pages+533} {__get_free_pages+14} {__pollwait+74} {pipe_poll+66} {do_select+725} {__pollwait+0} {sys_select+735} {system_call+126}
migration/0 S 810002c12720 0 2 1 3 (L-TLB)
 81007ff0fea8 0046 810074806ef0 00750001 81007ff0fe58 8100032506f0 0129 810075281230 810003250900 810072ffde88
Call Trace: {migration_thread+532} {migration_thread+0} {kthread+217} {child_rip+8} {kthread+0} {child_rip+0}
ksoftirqd/0 S 0 3 1 4 2 (L-TLB)
 81007ff11f08 0046 810072e00430 007d 810002c194e0 810003250030 00d1 810072f3a030 810003250240
Call Trace: {__do_softirq+113} {ksoftirqd+0} {ksoftirqd+0} {ksoftirqd+99} {ksoftirqd+0} {kthread+217} {child_rip+8} {kthread+0} {child_rip+0}
migration/1 S 810002c1a720 0 4 1 5 3 (L-TLB)
 81007ff15ea8 0046 810072d1cff0 00730001 810079fe7e98 81007ff134b0 00a3 810075281230 81007ff136c0 81003381de88
Call Trace: {migration_thread+532} {migration_thread+0} {kthread+217} {child_rip+8} {kthread+0} {child_rip+0}
ksoftirqd/1 S 0001 0 5 1 6 4 (L-TLB)
 81007ff19f08 0046 810075376db0 0077802b8e7e 810002c114e0 81007ff12df0 01b4 810074125130 81007ff13000
Call Trace: {__do_softirq+113} {ksoftirqd+0} {ksoftirqd+0} {ksoftirqd+99} {ksoftirqd+0} {kthread+217} {child_rip+8} {kthread+0} {child_rip+0}
events/0 S 094f2f7a804e 0 6 1 7 5 (L-TLB)
 81007ff3be58 0046 0246 8013d00d 7ffe0c00 81007ff12730 0c80 803f40c0 81007ff12940
Call Trace: {__mod_timer+317} {cache_reap+0} {worker_thread+305} {default_wake_function+0}
Re: Processes stuck on D state on Dual Opteron
Claudio Martins <[EMAIL PROTECTED]> wrote: > > While stress testing 2.6.12-rc2 on an HP DL145 I get processes stuck in D > state after some time. > This machine is a dual Opteron 248 with 2GB (ECC) on one node (the other > node has no RAM modules plugged in, since this board works only with pairs). > > I was using stress (http://weather.ou.edu/~apw/projects/stress/) with the > following command line: > > stress -v -c 20 -i 12 -m 10 -d 20 > > This causes a constant load avg. of around 70, makes the machine go into > swap a little, and writes up to about 20GB of random data to disk while > eating up all CPU. After about half an hour random processes like top, df, > etc get stuck in D state. Half of the 60 or so stress processes are also in D > state. The machine keeps being responsive for maybe some 15 minutes but then > the shells just hang and sshd stops responding to connections, though the > machine replies to pings (I don't have console access till tomorrow). > > The system is using ext3 with md software Raid1. > > I'm interested in knowing if anyone out there with dual Opterons can > reproduce this or not. I also have access to an HP DL360 Dual Xeon, so I will > try to find out if this is AMD64 specific as soon as possible. Please let me > know if you want me to run some other tests or give some more info to help > solve this one. Can you capture the output from alt-sysrq-T? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/