Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-09 Thread Justin Piszcz



On Thu, 8 Nov 2007, Carlos Carvalho wrote:


Jeff Lessem ([EMAIL PROTECTED]) wrote on 6 November 2007 22:00:
Dan Williams wrote:
  The following patch, also attached, cleans up cases where the code looks
  at sh->ops.pending when it should be looking at the consistent
  stack-based snapshot of the operations flags.

I tried this patch (against a stock 2.6.23), and it did not work for
me.  Not only did I/O to the affected RAID5 & XFS partition stop, but
also I/O to all other disks.  I was not able to capture any debugging
information, but I should be able to do that tomorrow when I can hook
a serial console to the machine.

I'm not sure if my problem is identical to these others, as mine only
seems to manifest with RAID5+XFS.  The RAID rebuilds with no problem,
and I've not had any problems with RAID5+ext3.

Us too! We're stuck trying to build a disk server with several disks
in a raid5 array, and the rsync from the old machine stops writing to
the new filesystem. It only happens under heavy IO. We can make it
lock without rsync, using 8 simultaneous dd's to the array. All IO
stops, including the resync after a newly created raid or after an
unclean reboot.

We could not trigger the problem with ext3 or reiser3; it only happens
with xfs.
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html



I'm including the XFS mailing list as well; can you provide more information 
to them?



Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-09 Thread Jeff Lessem

Dan Williams wrote:
 On 11/8/07, Bill Davidsen [EMAIL PROTECTED] wrote:
 Jeff Lessem wrote:
 Dan Williams wrote:
 The following patch, also attached, cleans up cases where the code looks
 at sh->ops.pending when it should be looking at the consistent
 stack-based snapshot of the operations flags.
 I tried this patch (against a stock 2.6.23), and it did not work for
 me.  Not only did I/O to the affected RAID5 & XFS partition stop, but
 also I/O to all other disks.  I was not able to capture any debugging
 information, but I should be able to do that tomorrow when I can hook
 a serial console to the machine.
 That can't be good! This is worrisome because Joel is giddy with joy
 because it fixes his iSCSI problems. I was going to try it with nbd, but
 perhaps I'll wait a week or so and see if others have more information.
 Applying patches before a holiday weekend is a good way to avoid time
 off. :-(

 We need to see more information on the failure that Jeff is seeing,
 and whether it goes away with the two known patches applied.  He
 applied this most recent patch against stock 2.6.23 which means that
 the platform was still open to the first biofill flags issue.

I applied both of the patches.  The biofill one did not apply cleanly,
as it was adding biofill to one section, and removing it from another,
but it appears that biofill does not need to be removed from a stock
2.6.23 kernel.  The second patch applies with a slight offset, but no
errors.

I can report success so far with both patches applied.  I created a
1100GB RAID5, formatted it XFS, and successfully tar c | tar x 895GB
of data onto it.  I'm also in the process of rsync-ing the 895GB of
data from the (slightly changed) original.  In the past, I would
always get a hang within 0-50GB of data transfer.

For each drive in the RAID I also:

echo 128 > /sys/block/$i/queue/max_sectors_kb
echo 512 > /sys/block/$i/queue/nr_requests
echo 1 > /sys/block/$i/device/queue_depth
blockdev --setra 65536 /dev/md3
echo 16384 > /sys/block/md3/md/stripe_cache_size

These changes appear to improve performance, along with a RAID5 chunk
size of 1024k, but these changes alone (without the patches) do not
fix the problem.
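These tuning writes can be scripted per member drive. A minimal sketch that only prints the commands (the drive names here are placeholders), so it can be inspected before piping its output to a root shell:

```shell
# Print the tuning commands for each member disk (hypothetical list);
# pipe the output to "sh" as root to actually apply them.
print_tuning() {
    for i in "$@"; do
        echo "echo 128 > /sys/block/$i/queue/max_sectors_kb"
        echo "echo 512 > /sys/block/$i/queue/nr_requests"
        echo "echo 1 > /sys/block/$i/device/queue_depth"
    done
    echo "blockdev --setra 65536 /dev/md3"
    echo "echo 16384 > /sys/block/md3/md/stripe_cache_size"
}

print_tuning sda sdb sdc
```

Printing rather than executing keeps the sketch safe to run on any machine; the real writes require root and the array's actual device names.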


Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-08 Thread BERTRAND Joël

BERTRAND Joël wrote:

Chuck Ebbert wrote:

On 11/05/2007 03:36 AM, BERTRAND Joël wrote:

Neil Brown wrote:

On Sunday November 4, [EMAIL PROTECTED] wrote:

# ps auxww | grep D
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       273  0.0  0.0      0     0 ?        D    Oct21  14:40 [pdflush]
root       274  0.0  0.0      0     0 ?        D    Oct21  13:00 [pdflush]

After several days/weeks, this is the second time this has happened,
while doing regular file I/O (decompressing a file), everything on
the device went into D-state.

At a guess (I haven't looked closely) I'd say it is the bug that was
meant to be fixed by

commit 4ae3f847e49e3787eca91bced31f8fd328d50496

except that patch applied badly and needed to be fixed with
the following patch (not in git yet).
These have been sent to stable@ and should be in the queue for 2.6.23.2

My linux-2.6.23/drivers/md/raid5.c has contained your patch for a long
time:

...
spin_lock(&sh->lock);
clear_bit(STRIPE_HANDLE, &sh->state);
clear_bit(STRIPE_DELAYED, &sh->state);

s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
/* Now to look around and see what can be done */

/* clean-up completed biofill operations */
if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) {
	clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
	clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
	clear_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
}

rcu_read_lock();
for (i=disks; i--; ) {
	mdk_rdev_t *rdev;
	struct r5dev *dev = &sh->dev[i];
...

but it doesn't fix this bug.



Did that chunk starting with "clean-up completed biofill operations" end
up where it belongs? The patch with the big context moves it to a different
place from where the original one puts it when applied to 2.6.23...

Lately I've seen several problems where the context isn't enough to make
a patch apply properly when some offsets have changed. In some cases a
patch won't apply at all because two nearly-identical areas are being
changed and the first chunk gets applied where the second one should,
leaving nowhere for the second chunk to apply.


	I always apply this kind of patch by hand, not with the patch 
command. The last patch sent here seems to fix this bug:


gershwin:[/usr/scripts]  cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md7 : active raid1 sdi1[2] md_d0p1[0]
  1464725632 blocks [2/1] [U_]
  [=...]  recovery = 27.1% (396992504/1464725632) 
finish=1040.3min speed=17104K/sec


Resync done. The patch fixes this bug.

Regards,

JKB


Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-08 Thread Justin Piszcz



On Thu, 8 Nov 2007, BERTRAND Joël wrote:


BERTRAND Joël wrote:

Chuck Ebbert wrote:

On 11/05/2007 03:36 AM, BERTRAND Joël wrote:

Neil Brown wrote:

On Sunday November 4, [EMAIL PROTECTED] wrote:

# ps auxww | grep D
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       273  0.0  0.0      0     0 ?        D    Oct21  14:40 [pdflush]
root       274  0.0  0.0      0     0 ?        D    Oct21  13:00 [pdflush]

After several days/weeks, this is the second time this has happened,
while doing regular file I/O (decompressing a file), everything on
the device went into D-state.

At a guess (I haven't looked closely) I'd say it is the bug that was
meant to be fixed by

commit 4ae3f847e49e3787eca91bced31f8fd328d50496

except that patch applied badly and needed to be fixed with
the following patch (not in git yet).
These have been sent to stable@ and should be in the queue for 2.6.23.2

My linux-2.6.23/drivers/md/raid5.c has contained your patch for a long
time:

...
spin_lock(&sh->lock);
clear_bit(STRIPE_HANDLE, &sh->state);
clear_bit(STRIPE_DELAYED, &sh->state);

s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
/* Now to look around and see what can be done */

/* clean-up completed biofill operations */
if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) {
	clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
	clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
	clear_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
}

rcu_read_lock();
for (i=disks; i--; ) {
	mdk_rdev_t *rdev;
	struct r5dev *dev = &sh->dev[i];
...

but it doesn't fix this bug.



Did that chunk starting with "clean-up completed biofill operations" end
up where it belongs? The patch with the big context moves it to a different
place from where the original one puts it when applied to 2.6.23...

Lately I've seen several problems where the context isn't enough to make
a patch apply properly when some offsets have changed. In some cases a
patch won't apply at all because two nearly-identical areas are being
changed and the first chunk gets applied where the second one should,
leaving nowhere for the second chunk to apply.


I always apply this kind of patch by hand, not with the patch command. 
The last patch sent here seems to fix this bug:


gershwin:[/usr/scripts]  cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md7 : active raid1 sdi1[2] md_d0p1[0]
  1464725632 blocks [2/1] [U_]
  [=...]  recovery = 27.1% (396992504/1464725632) 
finish=1040.3min speed=17104K/sec


Resync done. The patch fixes this bug.

Regards,

JKB



Excellent!

I cannot easily reproduce the bug on my system, so I will wait for the 
next stable patch set to include it and let everyone know if it happens 
again, thanks.


Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-08 Thread Bill Davidsen

Jeff Lessem wrote:

Dan Williams wrote:
 The following patch, also attached, cleans up cases where the code looks
 at sh->ops.pending when it should be looking at the consistent
 stack-based snapshot of the operations flags.

I tried this patch (against a stock 2.6.23), and it did not work for
me.  Not only did I/O to the affected RAID5 & XFS partition stop, but
also I/O to all other disks.  I was not able to capture any debugging
information, but I should be able to do that tomorrow when I can hook
a serial console to the machine.


That can't be good! This is worrisome because Joel is giddy with joy 
because it fixes his iSCSI problems. I was going to try it with nbd, but 
perhaps I'll wait a week or so and see if others have more information. 
Applying patches before a holiday weekend is a good way to avoid time 
off. :-(


I'm not sure if my problem is identical to these others, as mine only
seems to manifest with RAID5+XFS.  The RAID rebuilds with no problem,
and I've not had any problems with RAID5+ext3.


Hopefully it's not the raid which is the issue.

--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-08 Thread Carlos Carvalho
Jeff Lessem ([EMAIL PROTECTED]) wrote on 6 November 2007 22:00:
 Dan Williams wrote:
  The following patch, also attached, cleans up cases where the code looks
  at sh->ops.pending when it should be looking at the consistent
  stack-based snapshot of the operations flags.
 
 I tried this patch (against a stock 2.6.23), and it did not work for
 me.  Not only did I/O to the affected RAID5 & XFS partition stop, but
 also I/O to all other disks.  I was not able to capture any debugging
 information, but I should be able to do that tomorrow when I can hook
 a serial console to the machine.
 
 I'm not sure if my problem is identical to these others, as mine only
 seems to manifest with RAID5+XFS.  The RAID rebuilds with no problem,
 and I've not had any problems with RAID5+ext3.

Us too! We're stuck trying to build a disk server with several disks
in a raid5 array, and the rsync from the old machine stops writing to
the new filesystem. It only happens under heavy IO. We can make it
lock without rsync, using 8 simultaneous dd's to the array. All IO
stops, including the resync after a newly created raid or after an
unclean reboot.

We could not trigger the problem with ext3 or reiser3; it only happens
with xfs.
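The 8-dd reproducer described above can be sketched as a small script. This scaled-down version writes tiny files to a temporary directory so it is safe to run anywhere; on the real array the target directory would be the raid5-backed XFS mount and the per-writer size far larger:

```shell
# Scaled-down sketch of the 8-writer dd load; on the real array, point
# the directory at the raid5-backed filesystem and raise the count.
dd_stress() {
    dir=$1; writers=$2; count=$3
    n=1
    while [ "$n" -le "$writers" ]; do
        # each writer streams zeroes into its own file, in the background
        dd if=/dev/zero of="$dir/stress.$n" bs=1M count="$count" 2>/dev/null &
        n=$((n + 1))
    done
    wait    # block until every writer finishes
}

d=$(mktemp -d)
dd_stress "$d" 8 1
ls "$d"
```

With the hang present, the writers never return from `wait` and show up in D state; with the fix applied they should all complete.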


Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-07 Thread BERTRAND Joël

Dan Williams wrote:

On Tue, 2007-11-06 at 03:19 -0700, BERTRAND Joël wrote:

Done. Here is the obtained output:


Much appreciated.

[ 1260.969314] handling stripe 7629696, state=0x14 cnt=1, pd_idx=2 ops=0:0:0
[ 1260.980606] check 5: state 0x6 toread  read  
write f800ffcffcc0 written 
[ 1260.994808] check 4: state 0x6 toread  read  
write f800fdd4e360 written 
[ 1261.009325] check 3: state 0x1 toread  read  
write  written 
[ 1261.244478] check 2: state 0x1 toread  read  
write  written 
[ 1261.270821] check 1: state 0x6 toread  read  
write f800ff517e40 written 
[ 1261.312320] check 0: state 0x6 toread  read  
write f800fd4cae60 written 
[ 1261.361030] locked=4 uptodate=2 to_read=0 to_write=4 failed=0 failed_num=0
[ 1261.443120] for sector 7629696, rmw=0 rcw=0

[..]

This looks as if the blocks were prepared to be written out, but were
never handled in ops_run_biodrain(), so they remain locked forever.  The
operations flags are all clear which means handle_stripe thinks nothing
else needs to be done.

The following patch, also attached, cleans up cases where the code looks
at sh->ops.pending when it should be looking at the consistent
stack-based snapshot of the operations flags.


	Thanks for this patch. I have been testing it for three hours, rebuilding a 
1.5 TB raid1 array over iSCSI without any trouble.


gershwin:[/usr/scripts]  cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md7 : active raid1 sdi1[2] md_d0p1[0]
  1464725632 blocks [2/1] [U_]
  [=...]  recovery =  6.7% (99484736/1464725632) 
finish=1450.9min speed=15679K/sec


Without your patch, I never reached 1%... I hope it fixes this bug; I 
shall come back when my raid1 volume is resynchronized.


Regards,

JKB


Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-07 Thread BERTRAND Joël

Chuck Ebbert wrote:

On 11/05/2007 03:36 AM, BERTRAND Joël wrote:

Neil Brown wrote:

On Sunday November 4, [EMAIL PROTECTED] wrote:

# ps auxww | grep D
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       273  0.0  0.0      0     0 ?        D    Oct21  14:40 [pdflush]
root       274  0.0  0.0      0     0 ?        D    Oct21  13:00 [pdflush]

After several days/weeks, this is the second time this has happened,
while doing regular file I/O (decompressing a file), everything on
the device went into D-state.

At a guess (I haven't looked closely) I'd say it is the bug that was
meant to be fixed by

commit 4ae3f847e49e3787eca91bced31f8fd328d50496

except that patch applied badly and needed to be fixed with
the following patch (not in git yet).
These have been sent to stable@ and should be in the queue for 2.6.23.2

My linux-2.6.23/drivers/md/raid5.c has contained your patch for a long
time:

...
spin_lock(&sh->lock);
clear_bit(STRIPE_HANDLE, &sh->state);
clear_bit(STRIPE_DELAYED, &sh->state);

s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
/* Now to look around and see what can be done */

/* clean-up completed biofill operations */
if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) {
	clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
	clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
	clear_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
}

rcu_read_lock();
for (i=disks; i--; ) {
	mdk_rdev_t *rdev;
	struct r5dev *dev = &sh->dev[i];
...

but it doesn't fix this bug.



Did that chunk starting with clean-up completed biofill operations end
up where it belongs? The patch with the big context moves it to a different
place from where the original one puts it when applied to 2.6.23...

Lately I've seen several problems where the context isn't enough to make
a patch apply properly when some offsets have changed. In some cases a
patch won't apply at all because two nearly-identical areas are being
changed and the first chunk gets applied where the second one should,
leaving nowhere for the second chunk to apply.


	I always apply this kind of patch by hand, not with the patch command. 
The last patch sent here seems to fix this bug:


gershwin:[/usr/scripts]  cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md7 : active raid1 sdi1[2] md_d0p1[0]
  1464725632 blocks [2/1] [U_]
  [=...]  recovery = 27.1% (396992504/1464725632) 
finish=1040.3min speed=17104K/sec


Regards,

JKB


Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-07 Thread Chuck Ebbert
On 11/05/2007 03:36 AM, BERTRAND Joël wrote:
 Neil Brown wrote:
 On Sunday November 4, [EMAIL PROTECTED] wrote:
 # ps auxww | grep D
 USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
 root       273  0.0  0.0      0     0 ?        D    Oct21  14:40 [pdflush]
 root       274  0.0  0.0      0     0 ?        D    Oct21  13:00 [pdflush]

 After several days/weeks, this is the second time this has happened,
 while doing regular file I/O (decompressing a file), everything on
 the device went into D-state.

 At a guess (I haven't looked closely) I'd say it is the bug that was
 meant to be fixed by

 commit 4ae3f847e49e3787eca91bced31f8fd328d50496

 except that patch applied badly and needed to be fixed with
 the following patch (not in git yet).
 These have been sent to stable@ and should be in the queue for 2.6.23.2
 
 My linux-2.6.23/drivers/md/raid5.c has contained your patch for a long
 time:
 
 ...
 spin_lock(&sh->lock);
 clear_bit(STRIPE_HANDLE, &sh->state);
 clear_bit(STRIPE_DELAYED, &sh->state);
 
 s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
 s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
 s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
 /* Now to look around and see what can be done */
 
 /* clean-up completed biofill operations */
 if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) {
 	clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
 	clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
 	clear_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
 }
 
 rcu_read_lock();
 for (i=disks; i--; ) {
 	mdk_rdev_t *rdev;
 	struct r5dev *dev = &sh->dev[i];
 ...
 
 but it doesn't fix this bug.
 

Did that chunk starting with clean-up completed biofill operations end
up where it belongs? The patch with the big context moves it to a different
place from where the original one puts it when applied to 2.6.23...

Lately I've seen several problems where the context isn't enough to make
a patch apply properly when some offsets have changed. In some cases a
patch won't apply at all because two nearly-identical areas are being
changed and the first chunk gets applied where the second one should,
leaving nowhere for the second chunk to apply.
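One way to catch a hunk landing in the wrong place is to dry-run the patch and inspect the reported line numbers, offsets, and fuzz before applying. A self-contained illustration on throw-away files (not the raid5 patch itself):

```shell
# Build a tiny patch with diff, then dry-run it before applying;
# patch --dry-run reports where each hunk would land without
# touching the target file.
workdir=$(mktemp -d)
cd "$workdir"
printf 'alpha\nbeta\ngamma\n' > orig.txt
sed 's/beta/BETA/' orig.txt > new.txt
diff -u orig.txt new.txt > demo.patch

patch --dry-run orig.txt < demo.patch   # inspect the reported hunk placement
patch orig.txt < demo.patch             # apply for real
grep BETA orig.txt
```

When two nearly-identical regions exist, the dry run is where a misplaced hunk (or a rejected second hunk) shows up before any file is modified.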



Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-06 Thread BERTRAND Joël

Done. Here is the obtained output:

[ 1260.967796] for sector 7629696, rmw=0 rcw=0
[ 1260.969314] handling stripe 7629696, state=0x14 cnt=1, pd_idx=2 ops=0:0:0
[ 1260.980606] check 5: state 0x6 toread  read 
 write f800ffcffcc0 written 
[ 1260.994808] check 4: state 0x6 toread  read 
 write f800fdd4e360 written 
[ 1261.009325] check 3: state 0x1 toread  read 
 write  written 
[ 1261.244478] check 2: state 0x1 toread  read 
 write  written 
[ 1261.270821] check 1: state 0x6 toread  read 
 write f800ff517e40 written 
[ 1261.312320] check 0: state 0x6 toread  read 
 write f800fd4cae60 written 
[ 1261.361030] locked=4 uptodate=2 to_read=0 to_write=4 failed=0 
failed_num=0

[ 1261.443120] for sector 7629696, rmw=0 rcw=0
[ 1261.453348] handling stripe 7629696, state=0x14 cnt=1, pd_idx=2 ops=0:0:0
[ 1261.491538] check 5: state 0x6 toread  read 
 write f800ffcffcc0 written 
[ 1261.529120] check 4: state 0x6 toread  read 
 write f800fdd4e360 written 
[ 1261.560151] check 3: state 0x1 toread  read 
 write  written 
[ 1261.599180] check 2: state 0x1 toread  read 
 write  written 
[ 1261.637138] check 1: state 0x6 toread  read 
 write f800ff517e40 written 
[ 1261.674502] check 0: state 0x6 toread  read 
 write f800fd4cae60 written 
[ 1261.712589] locked=4 uptodate=2 to_read=0 to_write=4 failed=0 
failed_num=0

[ 1261.864338] for sector 7629696, rmw=0 rcw=0
[ 1261.873475] handling stripe 7629696, state=0x14 cnt=1, pd_idx=2 ops=0:0:0
[ 1261.907840] check 5: state 0x6 toread  read 
 write f800ffcffcc0 written 
[ 1261.950770] check 4: state 0x6 toread  read 
 write f800fdd4e360 written 
[ 1261.989003] check 3: state 0x1 toread  read 
 write  written 
[ 1262.019621] check 2: state 0x1 toread  read 
 write  written 
[ 1262.068705] check 1: state 0x6 toread  read 
 write f800ff517e40 written 
[ 1262.113265] check 0: state 0x6 toread  read 
 write f800fd4cae60 written 
[ 1262.150511] locked=4 uptodate=2 to_read=0 to_write=4 failed=0 
failed_num=0

[ 1262.171143] for sector 7629696, rmw=0 rcw=0
[ 1262.179142] handling stripe 7629696, state=0x14 cnt=1, pd_idx=2 ops=0:0:0
[ 1262.201905] check 5: state 0x6 toread  read 
 write f800ffcffcc0 written 
[ 1262.252750] check 4: state 0x6 toread  read 
 write f800fdd4e360 written 
[ 1262.289631] check 3: state 0x1 toread  read 
 write  written 
[ 1262.344709] check 2: state 0x1 toread  read 
 write  written 
[ 1262.400411] check 1: state 0x6 toread  read 
 write f800ff517e40 written 
[ 1262.437353] check 0: state 0x6 toread  read 
 write f800fd4cae60 written 
[ 1262.492561] locked=4 uptodate=2 to_read=0 to_write=4 failed=0 
failed_num=0

[ 1262.524993] for sector 7629696, rmw=0 rcw=0
[ 1262.533314] handling stripe 7629696, state=0x14 cnt=1, pd_idx=2 ops=0:0:0
[ 1262.561900] check 5: state 0x6 toread  read 
 write f800ffcffcc0 written 
[ 1262.588986] check 4: state 0x6 toread  read 
 write f800fdd4e360 written 
[ 1262.619455] check 3: state 0x1 toread  read 
 write  written 
[ 1262.671006] check 2: state 0x1 toread  read 
 write  written 
[ 1262.709065] check 1: state 0x6 toread  read 
 write f800ff517e40 written 
[ 1262.746904] check 0: state 0x6 toread  read 
 write f800fd4cae60 written 
[ 1262.780203] locked=4 uptodate=2 to_read=0 to_write=4 failed=0 
failed_num=0

[ 1262.805941] for sector 7629696, rmw=0 rcw=0
[ 1262.815759] 

Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-06 Thread Justin Piszcz



On Tue, 6 Nov 2007, BERTRAND Joël wrote:


Done. Here is the obtained output:

[ 1265.899068] check 4: state 0x6 toread  read 
 write f800fdd4e360 written 
[ 1265.941328] check 3: state 0x1 toread  read 
 write  written 
[ 1265.972129] check 2: state 0x1 toread  read 
 write  written 



For information, after the crash, I have:

Root poulenc:[/sys/block]  cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md_d0 : active raid5 sdc1[0] sdh1[5] sdg1[4] sdf1[3] sde1[2] sdd1[1]
 1464725760 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]

Regards,

JKB


After the crash it is not 'resyncing' ?

Justin.


Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-06 Thread BERTRAND Joël

Justin Piszcz wrote:



On Tue, 6 Nov 2007, BERTRAND Joël wrote:


Done. Here is the obtained output:

[ 1265.899068] check 4: state 0x6 toread  read 
 write f800fdd4e360 written 
[ 1265.941328] check 3: state 0x1 toread  read 
 write  written 
[ 1265.972129] check 2: state 0x1 toread  read 
 write  written 



For information, after the crash, I have:

Root poulenc:[/sys/block]  cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md_d0 : active raid5 sdc1[0] sdh1[5] sdg1[4] sdf1[3] sde1[2] sdd1[1]
 1464725760 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]

Regards,

JKB


After the crash it is not 'resyncing' ?


No, it isn't...

JKB


Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-06 Thread Justin Piszcz



On Tue, 6 Nov 2007, BERTRAND Joël wrote:


Justin Piszcz wrote:



On Tue, 6 Nov 2007, BERTRAND Joël wrote:


Done. Here is the obtained output:

[ 1265.899068] check 4: state 0x6 toread  read 
 write f800fdd4e360 written 
[ 1265.941328] check 3: state 0x1 toread  read 
 write  written 
[ 1265.972129] check 2: state 0x1 toread  read 
 write  written 



For information, after the crash, I have:

Root poulenc:[/sys/block]  cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md_d0 : active raid5 sdc1[0] sdh1[5] sdg1[4] sdf1[3] sde1[2] sdd1[1]
 1464725760 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]

Regards,

JKB


After the crash it is not 'resyncing' ?


No, it isn't...

JKB



After any crash/unclean shutdown the RAID should resync; if it doesn't, 
that's not good. I'd suggest running a raid check.


The 'repair' is supposed to clean it, though in some cases (md0=swap) it gets 
dirty again.


Tue May  8 09:19:54 EDT 2007: Executing RAID health check for /dev/md0...
Tue May  8 09:19:55 EDT 2007: Executing RAID health check for /dev/md1...
Tue May  8 09:19:56 EDT 2007: Executing RAID health check for /dev/md2...
Tue May  8 09:19:57 EDT 2007: Executing RAID health check for /dev/md3...
Tue May  8 10:09:58 EDT 2007: cat /sys/block/md0/md/mismatch_cnt
Tue May  8 10:09:58 EDT 2007: 2176
Tue May  8 10:09:58 EDT 2007: cat /sys/block/md1/md/mismatch_cnt
Tue May  8 10:09:58 EDT 2007: 0
Tue May  8 10:09:58 EDT 2007: cat /sys/block/md2/md/mismatch_cnt
Tue May  8 10:09:58 EDT 2007: 0
Tue May  8 10:09:58 EDT 2007: cat /sys/block/md3/md/mismatch_cnt
Tue May  8 10:09:58 EDT 2007: 0
Tue May  8 10:09:58 EDT 2007: The meta-device /dev/md0 has 2176 mismatched 
sectors.

Tue May  8 10:09:58 EDT 2007: Executing repair on /dev/md0
Tue May  8 10:09:59 EDT 2007: The meta-device /dev/md1 has no mismatched 
sectors.
Tue May  8 10:10:00 EDT 2007: The meta-device /dev/md2 has no mismatched 
sectors.
Tue May  8 10:10:01 EDT 2007: The meta-device /dev/md3 has no mismatched 
sectors.

Tue May  8 10:20:02 EDT 2007: All devices are clean...
Tue May  8 10:20:02 EDT 2007: cat /sys/block/md0/md/mismatch_cnt
Tue May  8 10:20:02 EDT 2007: 2176
Tue May  8 10:20:02 EDT 2007: cat /sys/block/md1/md/mismatch_cnt
Tue May  8 10:20:02 EDT 2007: 0
Tue May  8 10:20:02 EDT 2007: cat /sys/block/md2/md/mismatch_cnt
Tue May  8 10:20:02 EDT 2007: 0
Tue May  8 10:20:02 EDT 2007: cat /sys/block/md3/md/mismatch_cnt
Tue May  8 10:20:02 EDT 2007: 0
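The check/repair cycle logged above drives md through sysfs. A sketch that only prints the sequence for one array, since actually running it needs root and a real /dev/mdN:

```shell
# Print the md scrub sequence for one array; pipe the output to "sh"
# as root on a machine with a real /dev/mdN to actually run it.
scrub_cmds() {
    md=$1
    echo "echo check > /sys/block/$md/md/sync_action"   # read-only scrub
    echo "cat /sys/block/$md/md/mismatch_cnt"           # count found by check
    echo "echo repair > /sys/block/$md/md/sync_action"  # rewrite parity
    echo "cat /sys/block/$md/md/mismatch_cnt"           # count after repair
}

scrub_cmds md0
```

Note that mismatch_cnt is only updated by a completed check or repair pass, which is why the log above re-reads it after the repair run.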


Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-06 Thread BERTRAND Joël

Justin Piszcz wrote:



On Tue, 6 Nov 2007, BERTRAND Joël wrote:


Justin Piszcz wrote:



On Tue, 6 Nov 2007, BERTRAND Joël wrote:


Done. Here is the obtained output:

[ 1265.899068] check 4: state 0x6 toread  read 
 write f800fdd4e360 written 
[ 1265.941328] check 3: state 0x1 toread  read 
 write  written 
[ 1265.972129] check 2: state 0x1 toread  read 
 write  written 



For information, after the crash, I have:

Root poulenc:[/sys/block]  cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md_d0 : active raid5 sdc1[0] sdh1[5] sdg1[4] sdf1[3] sde1[2] sdd1[1]
 1464725760 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]

Regards,

JKB


After the crash it is not 'resyncing' ?


No, it isn't...

JKB



After any crash/unclean shutdown the RAID should resync; if it doesn't, 
that's not good. I'd suggest running a raid check.


The 'repair' is supposed to clean it, though in some cases (md0=swap) it gets 
dirty again.


Tue May  8 09:19:54 EDT 2007: Executing RAID health check for /dev/md0...
Tue May  8 09:19:55 EDT 2007: Executing RAID health check for /dev/md1...
Tue May  8 09:19:56 EDT 2007: Executing RAID health check for /dev/md2...
Tue May  8 09:19:57 EDT 2007: Executing RAID health check for /dev/md3...
Tue May  8 10:09:58 EDT 2007: cat /sys/block/md0/md/mismatch_cnt
Tue May  8 10:09:58 EDT 2007: 2176
Tue May  8 10:09:58 EDT 2007: cat /sys/block/md1/md/mismatch_cnt
Tue May  8 10:09:58 EDT 2007: 0
Tue May  8 10:09:58 EDT 2007: cat /sys/block/md2/md/mismatch_cnt
Tue May  8 10:09:58 EDT 2007: 0
Tue May  8 10:09:58 EDT 2007: cat /sys/block/md3/md/mismatch_cnt
Tue May  8 10:09:58 EDT 2007: 0
Tue May  8 10:09:58 EDT 2007: The meta-device /dev/md0 has 2176 
mismatched sectors.

Tue May  8 10:09:58 EDT 2007: Executing repair on /dev/md0
Tue May  8 10:09:59 EDT 2007: The meta-device /dev/md1 has no mismatched 
sectors.
Tue May  8 10:10:00 EDT 2007: The meta-device /dev/md2 has no mismatched 
sectors.
Tue May  8 10:10:01 EDT 2007: The meta-device /dev/md3 has no mismatched 
sectors.

Tue May  8 10:20:02 EDT 2007: All devices are clean...
Tue May  8 10:20:02 EDT 2007: cat /sys/block/md0/md/mismatch_cnt
Tue May  8 10:20:02 EDT 2007: 2176
Tue May  8 10:20:02 EDT 2007: cat /sys/block/md1/md/mismatch_cnt
Tue May  8 10:20:02 EDT 2007: 0
Tue May  8 10:20:02 EDT 2007: cat /sys/block/md2/md/mismatch_cnt
Tue May  8 10:20:02 EDT 2007: 0
Tue May  8 10:20:02 EDT 2007: cat /sys/block/md3/md/mismatch_cnt
Tue May  8 10:20:02 EDT 2007: 0


	I cannot repair this raid volume. I cannot reboot the server without 
sending Stop+A. init 6 stops at INIT:. After reboot, md0 is 
resynchronized.


Regards,

JKB


Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-06 Thread Dan Williams
On Tue, 2007-11-06 at 03:19 -0700, BERTRAND Joël wrote:
 Done. Here is the obtained output:

Much appreciated.
 
 [ 1260.969314] handling stripe 7629696, state=0x14 cnt=1, pd_idx=2 ops=0:0:0
 [ 1260.980606] check 5: state 0x6 toread  read 
  write f800ffcffcc0 written 
 [ 1260.994808] check 4: state 0x6 toread  read 
  write f800fdd4e360 written 
 [ 1261.009325] check 3: state 0x1 toread  read 
  write  written 
 [ 1261.244478] check 2: state 0x1 toread  read 
  write  written 
 [ 1261.270821] check 1: state 0x6 toread  read 
  write f800ff517e40 written 
 [ 1261.312320] check 0: state 0x6 toread  read 
  write f800fd4cae60 written 
 [ 1261.361030] locked=4 uptodate=2 to_read=0 to_write=4 failed=0 failed_num=0
 [ 1261.443120] for sector 7629696, rmw=0 rcw=0
[..]

This looks as if the blocks were prepared to be written out, but were
never handled in ops_run_biodrain(), so they remain locked forever.  The
operations flags are all clear which means handle_stripe thinks nothing
else needs to be done.

The following patch, also attached, cleans up cases where the code looks
at sh->ops.pending when it should be looking at the consistent
stack-based snapshot of the operations flags.


---

 drivers/md/raid5.c |   16 +++++++++-------
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 496b9a3..e1a3942 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -693,7 +693,8 @@ ops_run_prexor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 }
 
 static struct dma_async_tx_descriptor *
-ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
+ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx,
+		unsigned long pending)
 {
 	int disks = sh->disks;
 	int pd_idx = sh->pd_idx, i;
@@ -701,7 +702,7 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 	/* check if prexor is active which means only process blocks
 	 * that are part of a read-modify-write (Wantprexor)
 	 */
-	int prexor = test_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
+	int prexor = test_bit(STRIPE_OP_PREXOR, &pending);
 
 	pr_debug("%s: stripe %llu\n", __FUNCTION__,
 		(unsigned long long)sh->sector);
@@ -778,7 +779,8 @@ static void ops_complete_write(void *stripe_head_ref)
 }
 
 static void
-ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
+ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx,
+	unsigned long pending)
 {
 	/* kernel stack size limits the total number of disks */
 	int disks = sh->disks;
@@ -786,7 +788,7 @@ ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 
 	int count = 0, pd_idx = sh->pd_idx, i;
 	struct page *xor_dest;
-	int prexor = test_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
+	int prexor = test_bit(STRIPE_OP_PREXOR, &pending);
 	unsigned long flags;
 	dma_async_tx_callback callback;
 
@@ -813,7 +815,7 @@ ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 	}
 
 	/* check whether this postxor is part of a write */
-	callback = test_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending) ?
+	callback = test_bit(STRIPE_OP_BIODRAIN, &pending) ?
 		ops_complete_write : ops_complete_postxor;
 
 	/* 1/ if we prexor'd then the dest is reused as a source
@@ -901,12 +903,12 @@ static void raid5_run_ops(struct stripe_head *sh, unsigned long pending)
 		tx = ops_run_prexor(sh, tx);
 
 	if (test_bit(STRIPE_OP_BIODRAIN, &pending)) {
-		tx = ops_run_biodrain(sh, tx);
+		tx = ops_run_biodrain(sh, tx, pending);
 		overlap_clear++;
 	}
 
 	if (test_bit(STRIPE_OP_POSTXOR, &pending))
-		ops_run_postxor(sh, tx);
+		ops_run_postxor(sh, tx, pending);
 
 	if (test_bit(STRIPE_OP_CHECK, &pending))
 		ops_run_check(sh);

raid5: fix unending write sequence

From: Dan Williams [EMAIL PROTECTED]



Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-06 Thread Jeff Lessem

Dan Williams wrote:
 The following patch, also attached, cleans up cases where the code looks
 at sh->ops.pending when it should be looking at the consistent
 stack-based snapshot of the operations flags.

I tried this patch (against a stock 2.6.23), and it did not work for
me.  Not only did I/O to the affected RAID5 & XFS partition stop, but
also I/O to all other disks.  I was not able to capture any debugging
information, but I should be able to do that tomorrow when I can hook
a serial console to the machine.

I'm not sure if my problem is identical to these others, as mine only
seems to manifest with RAID5+XFS.  The RAID rebuilds with no problem,
and I've not had any problems with RAID5+ext3.



 ---

  drivers/md/raid5.c |   16 +++++++++-------
  1 files changed, 9 insertions(+), 7 deletions(-)

 diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
 index 496b9a3..e1a3942 100644
 --- a/drivers/md/raid5.c
 +++ b/drivers/md/raid5.c
 @@ -693,7 +693,8 @@ ops_run_prexor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
  }
 
  static struct dma_async_tx_descriptor *
 -ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 +ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx,
 +		 unsigned long pending)
  {
 	int disks = sh->disks;
 	int pd_idx = sh->pd_idx, i;
 @@ -701,7 +702,7 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 	/* check if prexor is active which means only process blocks
 	 * that are part of a read-modify-write (Wantprexor)
 	 */
 -	int prexor = test_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
 +	int prexor = test_bit(STRIPE_OP_PREXOR, &pending);
 
 	pr_debug("%s: stripe %llu\n", __FUNCTION__,
 		(unsigned long long)sh->sector);
 @@ -778,7 +779,8 @@ static void ops_complete_write(void *stripe_head_ref)
  }
 
  static void
 -ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 +ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx,
 +	unsigned long pending)
  {
 	/* kernel stack size limits the total number of disks */
 	int disks = sh->disks;
 @@ -786,7 +788,7 @@ ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 
 	int count = 0, pd_idx = sh->pd_idx, i;
 	struct page *xor_dest;
 -	int prexor = test_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
 +	int prexor = test_bit(STRIPE_OP_PREXOR, &pending);
 	unsigned long flags;
 	dma_async_tx_callback callback;
 
 @@ -813,7 +815,7 @@ ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 	}
 
 	/* check whether this postxor is part of a write */
 -	callback = test_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending) ?
 +	callback = test_bit(STRIPE_OP_BIODRAIN, &pending) ?
 		ops_complete_write : ops_complete_postxor;
 
 	/* 1/ if we prexor'd then the dest is reused as a source
 @@ -901,12 +903,12 @@ static void raid5_run_ops(struct stripe_head *sh, unsigned long pending)
 	tx = ops_run_prexor(sh, tx);
 
 	if (test_bit(STRIPE_OP_BIODRAIN, &pending)) {
 -		tx = ops_run_biodrain(sh, tx);
 +		tx = ops_run_biodrain(sh, tx, pending);
 		overlap_clear++;
 	}
 
 	if (test_bit(STRIPE_OP_POSTXOR, &pending))
 -		ops_run_postxor(sh, tx);
 +		ops_run_postxor(sh, tx, pending);
 
 	if (test_bit(STRIPE_OP_CHECK, &pending))
 		ops_run_check(sh);





Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-05 Thread BERTRAND Joël

Neil Brown wrote:

On Sunday November 4, [EMAIL PROTECTED] wrote:

# ps auxww | grep D
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       273  0.0  0.0      0     0 ?        D    Oct21   14:40 [pdflush]
root       274  0.0  0.0      0     0 ?        D    Oct21   13:00 [pdflush]

After several days/weeks, this is the second time this has happened, while 
doing regular file I/O (decompressing a file), everything on the device 
went into D-state.


At a guess (I haven't looked closely) I'd say it is the bug that was
meant to be fixed by

commit 4ae3f847e49e3787eca91bced31f8fd328d50496

except that patch applied badly and needed to be fixed with
the following patch (not in git yet).
These have been sent to stable@ and should be in the queue for 2.6.23.2


	My linux-2.6.23/drivers/md/raid5.c has contained your patch for a long time:

...
	spin_lock(&sh->lock);
	clear_bit(STRIPE_HANDLE, &sh->state);
	clear_bit(STRIPE_DELAYED, &sh->state);

	s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
	s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
	s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
	/* Now to look around and see what can be done */

	/* clean-up completed biofill operations */
	if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) {
		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
	}

	rcu_read_lock();
	for (i=disks; i--; ) {
		mdk_rdev_t *rdev;
		struct r5dev *dev = &sh->dev[i];
...

but it doesn't fix this bug.

Regards,

JKB


Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-05 Thread Dan Williams
On 11/4/07, Justin Piszcz [EMAIL PROTECTED] wrote:


 On Mon, 5 Nov 2007, Neil Brown wrote:

  On Sunday November 4, [EMAIL PROTECTED] wrote:
  # ps auxww | grep D
  USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
  root       273  0.0  0.0      0     0 ?        D    Oct21   14:40 [pdflush]
  root       274  0.0  0.0      0     0 ?        D    Oct21   13:00 [pdflush]
 
  After several days/weeks, this is the second time this has happened, while
  doing regular file I/O (decompressing a file), everything on the device
  went into D-state.
 
  At a guess (I haven't looked closely) I'd say it is the bug that was
  meant to be fixed by
 
  commit 4ae3f847e49e3787eca91bced31f8fd328d50496
 
  except that patch applied badly and needed to be fixed with
  the following patch (not in git yet).
  These have been sent to stable@ and should be in the queue for 2.6.23.2
 

 Ah, thanks Neil, will be updating as soon as it is released, thanks.


Are you seeing the same "md thread takes 100% of the CPU" behavior that
Joël is reporting?


Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-05 Thread Justin Piszcz



On Mon, 5 Nov 2007, Dan Williams wrote:


On 11/4/07, Justin Piszcz [EMAIL PROTECTED] wrote:



On Mon, 5 Nov 2007, Neil Brown wrote:


On Sunday November 4, [EMAIL PROTECTED] wrote:

# ps auxww | grep D
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       273  0.0  0.0      0     0 ?        D    Oct21   14:40 [pdflush]
root       274  0.0  0.0      0     0 ?        D    Oct21   13:00 [pdflush]

After several days/weeks, this is the second time this has happened, while
doing regular file I/O (decompressing a file), everything on the device
went into D-state.


At a guess (I haven't looked closely) I'd say it is the bug that was
meant to be fixed by

commit 4ae3f847e49e3787eca91bced31f8fd328d50496

except that patch applied badly and needed to be fixed with
the following patch (not in git yet).
These have been sent to stable@ and should be in the queue for 2.6.23.2



Ah, thanks Neil, will be updating as soon as it is released, thanks.



Are you seeing the same "md thread takes 100% of the CPU" behavior that
Joël is reporting?



Yes, in another e-mail I posted the top output with md3_raid5 at 100%.

Justin.

Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-05 Thread Dan Williams
On 11/5/07, Justin Piszcz [EMAIL PROTECTED] wrote:
[..]
  Are you seeing the same "md thread takes 100% of the CPU" behavior that
  Joël is reporting?
 

 Yes, in another e-mail I posted the top output with md3_raid5 at 100%.


This seems too similar to Joël's situation for them not to be
correlated, and it shows that iscsi is not a necessary component of
the failure.

The attached patch allows the debug statements in MD to be enabled via
sysfs.  Joël, since it is easier for you to reproduce, can you capture
the kernel log output after the raid thread goes into the spin?  It
will help if you have CONFIG_PRINTK_TIME=y set in your kernel
configuration.

After the failure run:

echo 1 > /sys/block/md_d0/md/debug_print_enable; sleep 5; echo 0 > \
/sys/block/md_d0/md/debug_print_enable

...to enable the print messages for a few seconds.  Please send the
output in a private message if it proves too big for the mailing list.


raid5-debug-print-enable.patch
Description: Binary data


2.6.23.1: mdadm/raid5 hung/d-state

2007-11-04 Thread Justin Piszcz

# ps auxww | grep D
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       273  0.0  0.0      0     0 ?        D    Oct21   14:40 [pdflush]
root       274  0.0  0.0      0     0 ?        D    Oct21   13:00 [pdflush]

After several days/weeks, this is the second time this has happened, while 
doing regular file I/O (decompressing a file), everything on the device 
went into D-state.


# mdadm -D /dev/md3
/dev/md3:
Version : 00.90.03
  Creation Time : Wed Aug 22 10:38:53 2007
 Raid Level : raid5
 Array Size : 1318680576 (1257.59 GiB 1350.33 GB)
  Used Dev Size : 146520064 (139.73 GiB 150.04 GB)
   Raid Devices : 10
  Total Devices : 10
Preferred Minor : 3
Persistence : Superblock is persistent

Update Time : Sun Nov  4 06:38:29 2007
  State : active
 Active Devices : 10
Working Devices : 10
 Failed Devices : 0
  Spare Devices : 0

 Layout : left-symmetric
 Chunk Size : 1024K

   UUID : e37a12d1:1b0b989a:083fb634:68e9eb49
 Events : 0.4309

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1
       2       8       65        2      active sync   /dev/sde1
       3       8       81        3      active sync   /dev/sdf1
       4       8       97        4      active sync   /dev/sdg1
       5       8      113        5      active sync   /dev/sdh1
       6       8      129        6      active sync   /dev/sdi1
       7       8      145        7      active sync   /dev/sdj1
       8       8      161        8      active sync   /dev/sdk1
       9       8      177        9      active sync   /dev/sdl1

If I wanted to find out what is causing this, what type of debugging would 
I have to enable to track it down?  Any attempt to read/write files on the 
devices fails (also going into d-state).  Is there any useful information 
I can get currently before rebooting the machine?


# pwd
/sys/block/md3/md
# ls
array_state  dev-sdj1/ rd2@  stripe_cache_active
bitmap_set_bits  dev-sdk1/ rd3@  stripe_cache_size
chunk_size   dev-sdl1/ rd4@  suspend_hi
component_size   layoutrd5@  suspend_lo
dev-sdc1/level rd6@  sync_action
dev-sdd1/metadata_version  rd7@  sync_completed
dev-sde1/mismatch_cnt  rd8@  sync_speed
dev-sdf1/new_dev   rd9@  sync_speed_max
dev-sdg1/raid_disksreshape_position  sync_speed_min
dev-sdh1/rd0@  resync_start
dev-sdi1/rd1@  safe_mode_delay
# cat array_state
active-idle
# cat mismatch_cnt
0
# cat stripe_cache_active
1
# cat stripe_cache_size
16384
# cat sync_action
idle
# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md1 : active raid1 sdb2[1] sda2[0]
  136448 blocks [2/2] [UU]

md2 : active raid1 sdb3[1] sda3[0]
  129596288 blocks [2/2] [UU]

md3 : active raid5 sdl1[9] sdk1[8] sdj1[7] sdi1[6] sdh1[5] sdg1[4] sdf1[3]
      sde1[2] sdd1[1] sdc1[0]
      1318680576 blocks level 5, 1024k chunk, algorithm 2 [10/10] [UUUUUUUUUU]


md0 : active raid1 sdb1[1] sda1[0]
  16787776 blocks [2/2] [UU]

unused devices: <none>
#

Justin.


Re: 2.6.23.1: mdadm/raid5 hung/d-state (md3_raid5 stuck in endless loop?)

2007-11-04 Thread Justin Piszcz

Time to reboot, before reboot:

top - 07:30:23 up 13 days, 13:33, 10 users,  load average: 16.00, 15.99, 14.96
Tasks: 221 total,   7 running, 209 sleeping,   0 stopped,   5 zombie
Cpu(s):  0.0%us, 25.5%sy,  0.0%ni, 74.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8039432k total,  1744356k used,  6295076k free,  164k buffers
Swap: 16787768k total,  160k used, 16787608k free,   616960k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  688 root      15  -5     0    0    0 R  100  0.0 121:21.43 md3_raid5
  273 root      20   0     0    0    0 D    0  0.0  14:40.68 pdflush
  274 root      20   0     0    0    0 D    0  0.0  13:00.93 pdflush

# cat /proc/fs/xfs/stat
extent_alloc 301974 256068291 310513 240764389
abt 1900173 15346352 738568 731314
blk_map 276979807 235589732 864002 211245834 591619 513439614 0
bmbt 50717 367726 14177 11846
dir 3818065 361561 359723 975628
trans 48452 2648064 570998
ig 6034530 2074424 43153 3960106 0 3869384 460831
log 282781 10454333 3028 399803 173488
push_ail 3267594 0 1620 2611 730365 0 4476 0 10269 0
xstrat 291940 0
rw 61423078 103732605
attr 0 0 0 0
icluster 312958 97323 419837
vnodes 90721 4019823 0 1926744 3929102 3929102 3929102 0
buf 14678900 11027087 3651843 25743 760449 0 0 15775888 280425
xpc 966925905920 1047628533165 1162276949815
debug 0

# cat meminfo
MemTotal:  8039432 kB
MemFree:   6287000 kB
Buffers:   164 kB
Cached: 617072 kB
SwapCached:  0 kB
Active: 178404 kB
Inactive:   589880 kB
SwapTotal:16787768 kB
SwapFree: 16787608 kB
Dirty:  494280 kB
Writeback:   86004 kB
AnonPages:  151240 kB
Mapped:  17092 kB
Slab:   259696 kB
SReclaimable:   170876 kB
SUnreclaim:  88820 kB
PageTables:  11448 kB
NFS_Unstable:0 kB
Bounce:  0 kB
CommitLimit:  20807484 kB
Committed_AS:   353536 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 15468 kB
VmallocChunk: 34359722699 kB

# echo 3 > /proc/sys/vm/drop_caches

# cat /proc/meminfo
MemTotal:  8039432 kB
MemFree:   6418352 kB
Buffers:32 kB
Cached: 597908 kB
SwapCached:  0 kB
Active: 172028 kB
Inactive:   579808 kB
SwapTotal:16787768 kB
SwapFree: 16787608 kB
Dirty:  494312 kB
Writeback:   86004 kB
AnonPages:  154104 kB
Mapped:  17416 kB
Slab:   144072 kB
SReclaimable:53100 kB
SUnreclaim:  90972 kB
PageTables:  11832 kB
NFS_Unstable:0 kB
Bounce:  0 kB
CommitLimit:  20807484 kB
Committed_AS:   360748 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 15468 kB
VmallocChunk: 34359722699 kB

Nothing is actually happening on the device itself however.

Device:         rrqm/s   wrqm/s   r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdc               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdd               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sde               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdf               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdg               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdh               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdi               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdj               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdk               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdl               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md0               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md3               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md2               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md1               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00


# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 6  0    160 6420244     32 600092    0    0   221   227    5    1  1  1 98  0
 6  0    160 6420228     32 600120    0    0     0     0 1015  142  0 25 75  0
 6  0    160 6420228     32 600120    0    0     0     0 1005  127  0 25 75  0
 6  0    160 6420228     32 600120    0    0     0    41 1022  151  0 26 74  0
 6  0    160 6420228   

Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-04 Thread Michael Tokarev
Justin Piszcz wrote:
 # ps auxww | grep D
 USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
 root       273  0.0  0.0      0     0 ?        D    Oct21   14:40 [pdflush]
 root       274  0.0  0.0      0     0 ?        D    Oct21   13:00 [pdflush]
 
 After several days/weeks, this is the second time this has happened,
 while doing regular file I/O (decompressing a file), everything on the
 device went into D-state.

The next time you come across something like that, do a SysRq-T dump and
post that.  It shows a stack trace of all processes - and in particular,
where exactly each task is stuck.
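For readers without keyboard access to the console, the same dump can be requested through procfs. A minimal sketch, using the standard SysRq interface (the kernel must be built with CONFIG_MAGIC_SYSRQ):

```shell
# Sketch: requesting the SysRq-T dump without a console keyboard.
echo 1 > /proc/sys/kernel/sysrq    # make sure SysRq is enabled
echo t > /proc/sysrq-trigger       # dump a stack trace of every task
dmesg                              # the traces land in the kernel ring buffer
```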

/mjt


Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-04 Thread BERTRAND Joël

Justin Piszcz wrote:

# ps auxww | grep D
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       273  0.0  0.0      0     0 ?        D    Oct21   14:40 [pdflush]
root       274  0.0  0.0      0     0 ?        D    Oct21   13:00 [pdflush]

After several days/weeks, this is the second time this has happened, 
while doing regular file I/O (decompressing a file), everything on the 
device went into D-state.


	Same observation here (kernel 2.6.23). I can see this bug when I try to 
synchronize a raid1 volume over iSCSI (each element is a raid5 volume), 
or sometimes only with a 1,5 TB raid5 volume. When this bug occurs, md 
subsystem eats 100% of one CPU and pdflush remains in D state too. What
is your architecture? I use two 32-thread T1000s (sparc64), and I'm
trying to determine if this bug is arch specific.


Regards,

JKB


Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-04 Thread Justin Piszcz



On Sun, 4 Nov 2007, BERTRAND Joël wrote:


Justin Piszcz wrote:

# ps auxww | grep D
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       273  0.0  0.0      0     0 ?        D    Oct21   14:40 [pdflush]
root       274  0.0  0.0      0     0 ?        D    Oct21   13:00 [pdflush]

After several days/weeks, this is the second time this has happened, while 
doing regular file I/O (decompressing a file), everything on the device 
went into D-state.


	Same observation here (kernel 2.6.23). I can see this bug when I try 
to synchronize a raid1 volume over iSCSI (each element is a raid5 volume), or 
sometimes only with a 1,5 TB raid5 volume. When this bug occurs, md subsystem 
eats 100% of one CPU and pdflush remains in D state too. What is your 
architecture? I use two 32-thread T1000s (sparc64), and I'm trying to 
determine if this bug is arch specific.


Regards,

JKB



Using x86_64 here (Q6600/Intel DG965WH).

Justin.

Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-04 Thread Michael Tokarev
Justin Piszcz wrote:
 On Sun, 4 Nov 2007, Michael Tokarev wrote:
[]
 The next time you come across something like that, do a SysRq-T dump and
 post that.  It shows a stack trace of all processes - and in particular,
 where exactly each task is stuck.

 Yes I got it before I rebooted, ran that and then dmesg > file.
 
 Here it is:
 
 [1172609.665902]  80747dc0 80747dc0 80747dc0 
 80744d80
 [1172609.668768]  80747dc0 81015c3aa918 810091c899b4 
 810091c899a8

That's only partial list.  All the kernel threads - which are most important
in this context - aren't shown.  You ran out of dmesg buffer, and the most
interesting entries were at the beginning.  If your /var/log partition is
working, the stuff should be in /var/log/kern.log or equivalent.  If it's
not working, there is a way to capture the info still, by stopping syslogd,
cat'ing /proc/kmsg to some tmpfs file and scp'ing it elsewhere.
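The capture procedure above can be sketched as follows; the init-script name and mount point are illustrative and vary by distribution:

```shell
# Sketch: capturing kernel messages when /var/log is unwritable.
/etc/init.d/sysklogd stop              # stop syslogd so it releases /proc/kmsg
mkdir -p /mnt/log
mount -t tmpfs tmpfs /mnt/log          # tmpfs keeps the writes off the hung array
cat /proc/kmsg > /mnt/log/kmsg.txt &   # keeps reading until killed
# ...trigger SysRq-T, wait a few seconds, then copy the log off the machine:
scp /mnt/log/kmsg.txt user@otherhost:/tmp/
```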

/mjt


Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-04 Thread David Greaves
Michael Tokarev wrote:
 Justin Piszcz wrote:
 On Sun, 4 Nov 2007, Michael Tokarev wrote:
 []
 The next time you come across something like that, do a SysRq-T dump and
 post that.  It shows a stack trace of all processes - and in particular,
 where exactly each task is stuck.
 
  Yes I got it before I rebooted, ran that and then dmesg > file.

 Here it is:

 [1172609.665902]  80747dc0 80747dc0 80747dc0 
 80744d80
 [1172609.668768]  80747dc0 81015c3aa918 810091c899b4 
 810091c899a8
 
 That's only partial list.  All the kernel threads - which are most important
 in this context - aren't shown.  You ran out of dmesg buffer, and the most
  interesting entries were at the beginning.  If your /var/log partition is
 working, the stuff should be in /var/log/kern.log or equivalent.  If it's
 not working, there is a way to capture the info still, by stopping syslogd,
 cat'ing /proc/kmsg to some tmpfs file and scp'ing it elsewhere.

or netconsole is actually pretty easy and incredibly useful in this kind of
situation even if there's no disk at all :)
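A minimal netconsole sketch; every address, port, interface name, and MAC below is a placeholder to adjust for your network:

```shell
# Sketch: stream the kernel log over UDP to a second machine.
# Parameter format: src-port@src-ip/dev,dst-port@dst-ip/dst-mac
modprobe netconsole \
    netconsole=6665@192.168.0.2/eth0,6666@192.168.0.3/00:11:22:33:44:55

# On the receiving machine (192.168.0.3), capture the UDP stream:
nc -u -l -p 6666 | tee kernel.log
```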

David



Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-04 Thread Neil Brown
On Sunday November 4, [EMAIL PROTECTED] wrote:
 # ps auxww | grep D
 USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
 root       273  0.0  0.0      0     0 ?        D    Oct21   14:40 [pdflush]
 root       274  0.0  0.0      0     0 ?        D    Oct21   13:00 [pdflush]
 
 After several days/weeks, this is the second time this has happened, while 
 doing regular file I/O (decompressing a file), everything on the device 
 went into D-state.

At a guess (I haven't looked closely) I'd say it is the bug that was
meant to be fixed by

commit 4ae3f847e49e3787eca91bced31f8fd328d50496

except that patch applied badly and needed to be fixed with
the following patch (not in git yet).
These have been sent to stable@ and should be in the queue for 2.6.23.2


NeilBrown

Fix misapplied patch in raid5.c

commit 4ae3f847e49e3787eca91bced31f8fd328d50496 did not get applied
correctly, presumably due to substantial similarities between
handle_stripe5 and handle_stripe6.

This patch (with lots of context) moves the chunk of new code from
handle_stripe6 (where it isn't needed (yet)) to handle_stripe5.


Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/raid5.c |   14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c	2007-11-02 12:10:49.000000000 +1100
+++ ./drivers/md/raid5.c	2007-11-02 12:25:31.000000000 +1100
@@ -2607,40 +2607,47 @@ static void handle_stripe5(struct stripe
 	struct bio *return_bi = NULL;
 	struct stripe_head_state s;
 	struct r5dev *dev;
 	unsigned long pending = 0;
 
 	memset(&s, 0, sizeof(s));
 	pr_debug("handling stripe %llu, state=%#lx cnt=%d, pd_idx=%d "
 		"ops=%lx:%lx:%lx\n", (unsigned long long)sh->sector, sh->state,
 		atomic_read(&sh->count), sh->pd_idx,
 		sh->ops.pending, sh->ops.ack, sh->ops.complete);
 
 	spin_lock(&sh->lock);
 	clear_bit(STRIPE_HANDLE, &sh->state);
 	clear_bit(STRIPE_DELAYED, &sh->state);
 
 	s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
 	s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
 	s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
 	/* Now to look around and see what can be done */
 
+	/* clean-up completed biofill operations */
+	if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) {
+		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
+		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
+		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
+	}
+
 	rcu_read_lock();
 	for (i=disks; i--; ) {
 		mdk_rdev_t *rdev;
 		struct r5dev *dev = &sh->dev[i];
 		clear_bit(R5_Insync, &dev->flags);
 
 		pr_debug("check %d: state 0x%lx toread %p read %p write %p "
 			"written %p\n", i, dev->flags, dev->toread, dev->read,
 			dev->towrite, dev->written);
 
 		/* maybe we can request a biofill operation
 		 *
 		 * new wantfill requests are only permitted while
 		 * STRIPE_OP_BIOFILL is clear
 		 */
 		if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread &&
		    !test_bit(STRIPE_OP_BIOFILL, &sh->ops.pending))
 			set_bit(R5_Wantfill, &dev->flags);
 
 		/* now count some things */
@@ -2880,47 +2887,40 @@ static void handle_stripe6(struct stripe
 	struct stripe_head_state s;
 	struct r6_state r6s;
 	struct r5dev *dev, *pdev, *qdev;
 
 	r6s.qd_idx = raid6_next_disk(pd_idx, disks);
 	pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
 		"pd_idx=%d, qd_idx=%d\n",
 	       (unsigned long long)sh->sector, sh->state,
 	       atomic_read(&sh->count), pd_idx, r6s.qd_idx);
 	memset(&s, 0, sizeof(s));
 
 	spin_lock(&sh->lock);
 	clear_bit(STRIPE_HANDLE, &sh->state);
 	clear_bit(STRIPE_DELAYED, &sh->state);
 
 	s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
 	s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
 	s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
 	/* Now to look around and see what can be done */
 
-	/* clean-up completed biofill operations */
-	if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) {
-		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
-		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
-		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
-	}
-
 	rcu_read_lock();
 	for (i=disks; i--; ) {
 		mdk_rdev_t *rdev;
 		dev = &sh->dev[i];
 		clear_bit(R5_Insync, &dev->flags);
 
 		pr_debug("check %d: state 0x%lx read %p write %p written %p\n",
 			i, dev->flags, dev->toread, dev->towrite, dev->written);
 		/* maybe we can reply to a read */

Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-04 Thread Justin Piszcz



On Mon, 5 Nov 2007, Neil Brown wrote:


On Sunday November 4, [EMAIL PROTECTED] wrote:

# ps auxww | grep D
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       273  0.0  0.0      0     0 ?        D    Oct21   14:40 [pdflush]
root       274  0.0  0.0      0     0 ?        D    Oct21   13:00 [pdflush]

After several days/weeks, this is the second time this has happened, while
doing regular file I/O (decompressing a file), everything on the device
went into D-state.


At a guess (I haven't looked closely) I'd say it is the bug that was
meant to be fixed by

commit 4ae3f847e49e3787eca91bced31f8fd328d50496

except that patch applied badly and needed to be fixed with
the following patch (not in git yet).
These have been sent to stable@ and should be in the queue for 2.6.23.2



Ah, thanks Neil, will be updating as soon as it is released, thanks.

Justin.
