subject:"RAID extremely slow"

Re: RAID extremely slow

2012-08-17 Thread Jan Engelhardt


On Thursday 2012-07-26 03:00, Phil Turmel wrote:
>> I used atop to show the transfer speeds to each drive. Here's a
>> screenshot:
>> http://img402.imageshack.us/img402/6484/screenshotfrom201207251.png
>
>[ The output of "lsdrv" [1] might be useful here, along with
>"mdadm -D /dev/md0" and "mdadm -E /dev/[b-j]" ]
>[1] http://github.com/pturmel/lsdrv

lsdrv? Shows a bunch, but for this, using standard tools
like lsscsi & lsblk seems sufficient :)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: RAID extremely slow

2012-07-31 Thread Bill Davidsen


Kevin Ross wrote:
> On 07/27/2012 09:45 PM, Grant Coady wrote:
>> On Fri, 27 Jul 2012 14:45:18 -0700, you wrote:
>>> On 07/27/2012 12:08 PM, Bill Davidsen wrote:
>>>> Have you set the io scheduler to deadline on all members of the array?
>>>> That's kind of "job one" on older kernels.
>>>
>>> I have not, thanks for the tip, I'll look into that now.
>>
>> Plus I disable the on-drive queuing (NCQ) during startup, right now
>> I don't have benchmarks to show the difference.  This on a six by 1TB
>> drive RAID6 array I built over a year ago on Slackware64-13.37:
>>
>> # cat /etc/rc.d/rc.local
>> ...
>> # turn off NCQ on the RAID drives by adjusting queue depth to 1
>> n=1
>> echo "rc.local: Disable RAID drives' NCQ"
>> for d in a b c d e f
>> do
>>  echo "  set NCQ depth to $n on sd${d}"
>>  echo $n > /sys/block/sd${d}/device/queue_depth
>> done
>> ...
>>
>> Maybe you could try that?  See if it makes a difference.  My drives
>> are Seagate.
>>
>> Grant.
>
> Does disabling NCQ improve performance?

Does for me!

> The suggestion to use kernel 3.4.6 has been working quite well so far,
> hopefully that fixes the problem.  I'll know for sure in a few more days...
>
> Thanks!
> -- Kevin




--
Bill Davidsen 
  We are not out of the woods yet, but we know the direction and have
taken the first step. The steps are many, but finite in number, and if
we persevere we will reach our destination.  -me, 2010



Re: RAID extremely slow

2012-07-28 Thread Kevin Ross


On 07/27/2012 09:45 PM, Grant Coady wrote:
> On Fri, 27 Jul 2012 14:45:18 -0700, you wrote:
>> On 07/27/2012 12:08 PM, Bill Davidsen wrote:
>>> Have you set the io scheduler to deadline on all members of the array?
>>> That's kind of "job one" on older kernels.
>>
>> I have not, thanks for the tip, I'll look into that now.
>
> Plus I disable the on-drive queuing (NCQ) during startup, right now
> I don't have benchmarks to show the difference.  This on a six by 1TB
> drive RAID6 array I built over a year ago on Slackware64-13.37:
>
> # cat /etc/rc.d/rc.local
> ...
> # turn off NCQ on the RAID drives by adjusting queue depth to 1
> n=1
> echo "rc.local: Disable RAID drives' NCQ"
> for d in a b c d e f
> do
>  echo "  set NCQ depth to $n on sd${d}"
>  echo $n > /sys/block/sd${d}/device/queue_depth
> done
> ...
>
> Maybe you could try that?  See if it makes a difference.  My drives
> are Seagate.
>
> Grant.

Does disabling NCQ improve performance?

The suggestion to use kernel 3.4.6 has been working quite well so far,
hopefully that fixes the problem.  I'll know for sure in a few more days...

Thanks!
-- Kevin


Re: RAID extremely slow

2012-07-27 Thread Grant Coady

On Fri, 27 Jul 2012 14:45:18 -0700, you wrote:

>On 07/27/2012 12:08 PM, Bill Davidsen wrote:
>> Have you set the io scheduler to deadline on all members of the array? 
>> That's kind of "job one" on older kernels.
>>
>
>I have not, thanks for the tip, I'll look into that now.

Plus I disable the on-drive queuing (NCQ) during startup, right now 
I don't have benchmarks to show the difference.  This on a six by 1TB 
drive RAID6 array I built over a year ago on Slackware64-13.37:

# cat /etc/rc.d/rc.local
...
# turn off NCQ on the RAID drives by adjusting queue depth to 1
n=1
echo "rc.local: Disable RAID drives' NCQ"
for d in a b c d e f
do
echo "  set NCQ depth to $n on sd${d}"
echo $n > /sys/block/sd${d}/device/queue_depth
done
...

Maybe you could try that?  See if it makes a difference.  My drives 
are Seagate.

Grant.
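
A slightly more defensive variant of the rc.local loop above, as a sketch only: it skips drives whose queue_depth file is missing or not writable and reads the value back after writing. The function name set_queue_depth and the SYSFS_ROOT variable are hypothetical, added here so the loop can be tried against a scratch directory; on a real system SYSFS_ROOT stays at /sys.

```shell
#!/bin/sh
# Sketch: set the queue depth on a list of drives and read it back.
# SYSFS_ROOT is an assumption for dry runs; the default is the real sysfs.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

set_queue_depth() {
    depth="$1"; shift
    for d in "$@"; do
        f="$SYSFS_ROOT/block/$d/device/queue_depth"
        if [ ! -w "$f" ]; then
            # Drive absent or no permission: report and carry on.
            echo "skip $d: $f not writable" >&2
            continue
        fi
        echo "$depth" > "$f"
        echo "$d: queue_depth now $(cat "$f")"
    done
}

# Usage (as root), mirroring the six drives in the script above:
#   set_queue_depth 1 sda sdb sdc sdd sde sdf
```

Reading the value back is worthwhile because the kernel may refuse a depth the device does not accept.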

Re: RAID extremely slow

2012-07-27 Thread Kevin Ross


On 07/27/2012 12:08 PM, Bill Davidsen wrote:
> Have you set the io scheduler to deadline on all members of the array?
> That's kind of "job one" on older kernels.

I have not, thanks for the tip, I'll look into that now.

Thanks!
-- Kevin


Re: RAID extremely slow

2012-07-27 Thread Bill Davidsen


Kevin Ross wrote:





unused devices: <none>

# cat /proc/sys/dev/raid/speed_limit_min
1

MD is unable to reach its minimum rebuild rate while other system
activity is ongoing.  You might want to lower this number to see if that
gets you out of the stalls.

Or temporarily shut down mythtv.


I will try lowering those numbers next time this happens, which will probably
be within the next day or two.  That's about how often this happens.


Unfortunately, it has happened again, with speeds at near zero.

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdh1[0] sdd1[9] sde1[10] sdb1[6] sdi1[7] sdc1[4] sdf1[3] sdg1[8] sdj1[1]
      6837311488 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9] [UUUUUUUUU]
      [=>...]  resync =  8.3% (81251712/976758784) finish=1057826.4min speed=14K/sec

unused devices: <none>

atop doesn't show ANY activity on the raid device or the individual drives.
http://img687.imageshack.us/img687/2913/screenshotfrom201207252.png

Also, I tried writing to a test file with the following command, and it hangs.
I let it go for about 30 minutes, with no change.

# dd if=/dev/zero of=test bs=1M count=1

dmesg only reports hung tasks.  It doesn't report any other problems. Here's my
dmesg output:
http://pastebin.ca/2174778

I'm going to try rebooting into single user mode, and see if the rebuild
succeeds without stalling.

Have you set the io scheduler to deadline on all members of the array? That's 
kind of "job one" on older kernels.
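
For reference, one way to switch the scheduler on each member disk is through the per-device sysfs file. This is a sketch, not a tested recipe for this machine: the function name set_scheduler and the SYSFS_ROOT variable are hypothetical (the latter only so the loop can be exercised on a scratch directory), and the member list sdb through sdj is taken from the /proc/mdstat output quoted above. Newer multi-queue kernels name this scheduler mq-deadline instead.

```shell
#!/bin/sh
# Sketch: select an I/O scheduler on every member disk of the array.
# SYSFS_ROOT is an assumption for dry runs; the default is the real sysfs.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

set_scheduler() {
    sched="$1"; shift
    for d in "$@"; do
        f="$SYSFS_ROOT/block/$d/queue/scheduler"
        [ -w "$f" ] || { echo "skip $d: $f not writable" >&2; continue; }
        echo "$sched" > "$f"
        # On a real system the active scheduler is shown in [brackets].
        echo "$d: $(cat "$f")"
    done
}

# Usage (as root); the setting does not survive a reboot:
#   set_scheduler deadline sdb sdc sdd sde sdf sdg sdh sdi sdj
```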


--
Bill Davidsen 
  "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot



Re: RAID extremely slow

2012-07-26 Thread Kevin Ross


On 07/26/2012 07:53 PM, Kevin Ross wrote:
> On 07/26/2012 07:27 PM, David Dillow wrote:
>> On Thu, 2012-07-26 at 19:17 -0700, Kevin Ross wrote:
>>> On 07/26/2012 07:17 PM, David Dillow wrote:
>>>> On Thu, 2012-07-26 at 22:15 -0400, David Dillow wrote:
>>>>> If you can, upgrade to the latest 3.4 stable kernel (3.4.6 right now).
>>>>> As far as I can see, the latest 3.2 stable does not contain the delayed
>>>>> stripe fix.
>>>> And I was looking at the wrong version; 3.2.24 does indeed have the fix.
>>>
>>> I'm running 3.2.21, does that contain the fix?
>>
>> No, that was the one I looked at. It is commit
>> c0159c780e8d42309d04e83271986274d3880826.
>
> Okay, I grabbed 3.4.4 from Debian experimental, and I'm running with
> that now.  Hopefully this fixes the problem.
>
> Thanks for your help!
> -- Kevin

Just noticed I need 3.4.5 or later.  Doh!  I'll grab a vanilla kernel
from kernel.org and build it.

Thanks!
-- Kevin


Re: RAID extremely slow

2012-07-26 Thread David Dillow

On Wed, 2012-07-25 at 18:55 -0700, Kevin Ross wrote:
> On 07/25/2012 06:00 PM, Phil Turmel wrote:
> > Piles of small reads  scattered across multiple drives, and a
> > concentration of queued writes to /dev/sda.  What's on /dev/sda?
> > It's not a member of the raid, so it must be some other system task
> > involved.

> After rebooting, MythTV is currently recording two shows, and the resync 
> is running at full speed.
> 
> # cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sdh1[0] sdd1[9] sde1[10] sdb1[6] sdi1[7] sdc1[4] sdf1[3] sdg1[8] sdj1[1]
>       6837311488 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9] [UUUUUUUUU]
>       [=>...]  resync =  9.3% (91363840/976758784) finish=1434.3min speed=10287K/sec
>
> unused devices: <none>
> 
> atop shows the avio of all the drives to be less than 1ms, where before 
> they were much higher.  It will run for a couple days under load just 
> fine, and then it will come to a halt.
> 
> It's a 3.2.21 kernel.  I'm running Debian Testing, and the exact Debian 
> package version is:

I suspect you are being hit by the same bug I was -- delayed stripes never
got processed. If you get into the state where the rebuild isn't
progressing, and you find that increasing the size of the stripe cache
allows the rebuild to proceed (but the filesystem stays wedged), then
that cinches it.
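
The stripe-cache test described above could be run roughly as follows. This is a sketch under assumptions: the array is md0 as in the quoted mdstat output, the value 4096 is only an example, and the names bump_stripe_cache, stripe_cache_kib and SYSFS_ROOT are hypothetical. The memory estimate assumes 4 KiB pages and one page per cache entry per member disk.

```shell
#!/bin/sh
# Sketch: raise md's stripe cache and estimate the memory it will pin.
# SYSFS_ROOT is an assumption for dry runs; the default is the real sysfs.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Approximate cost in KiB: entries * 4 KiB page * number of member disks.
stripe_cache_kib() {
    echo $(( $1 * 4 * $2 ))
}

bump_stripe_cache() {
    md="$1"; new="$2"
    f="$SYSFS_ROOT/block/$md/md/stripe_cache_size"
    echo "$md: stripe_cache_size was $(cat "$f")"
    echo "$new" > "$f"
    echo "$md: stripe_cache_size now $(cat "$f")"
}

# Usage (as root) on the 9-disk array from this thread:
#   stripe_cache_kib 4096 9      # about 144 MiB
#   bump_stripe_cache md0 4096
#   cat /proc/mdstat             # does the resync speed recover?
```

If the resync resumes after the bump while the filesystem stays hung, that matches the symptom described above.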

If you can, upgrade to the latest 3.4 stable kernel (3.4.6 right now).
As far as I can see, the latest 3.2 stable does not contain the delayed
stripe fix.

After applying those fixes to my kernel, my MythTV setup over a 5 disk
RAID5 has been pretty solid, where before I was getting lockups every
few days. It still seems to be getting slower over time, but I've not
looked into it yet as it is not as catastrophic as the wedging.

HTH,
Dave


Re: RAID extremely slow

2012-07-26 Thread David Dillow

On Thu, 2012-07-26 at 22:15 -0400, David Dillow wrote:
> If you can, upgrade to the latest 3.4 stable kernel (3.4.6 right now).
> As far as I can see, the latest 3.2 stable does not contain the delayed
> stripe fix.

And I was looking at the wrong version; 3.2.24 does indeed have the fix.
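
A small sketch for checking an upstream version string against the thresholds mentioned in this thread (3.2.24 for the 3.2 series, 3.4.5 for the 3.4 series). The function names are hypothetical, it relies on GNU sort -V, and it says nothing about series other than those two. Distribution kernels do not always report the upstream stable version in uname -r, so pass the upstream version explicitly if in doubt.

```shell
#!/bin/sh
# version_ge A B: succeeds if version A is greater than or equal to B.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# has_delayed_stripe_fix VERSION
# Returns 0 if the version is at or past the threshold quoted in the thread,
# 1 if it is older, and 2 if the series was not discussed here.
has_delayed_stripe_fix() {
    v="${1%%-*}"                      # drop any suffix after the first "-"
    case "$v" in
        3.2.*) version_ge "$v" 3.2.24 ;;
        3.4.*) version_ge "$v" 3.4.5 ;;
        *)     return 2 ;;
    esac
}

# Usage:
#   has_delayed_stripe_fix 3.2.21 && echo "has fix" || echo "no fix / unknown"
```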


Re: RAID extremely slow

2012-07-26 Thread Kevin Ross


On 07/26/2012 07:27 PM, David Dillow wrote:
> On Thu, 2012-07-26 at 19:17 -0700, Kevin Ross wrote:
>> On 07/26/2012 07:17 PM, David Dillow wrote:
>>> On Thu, 2012-07-26 at 22:15 -0400, David Dillow wrote:
>>>> If you can, upgrade to the latest 3.4 stable kernel (3.4.6 right now).
>>>> As far as I can see, the latest 3.2 stable does not contain the delayed
>>>> stripe fix.
>>> And I was looking at the wrong version; 3.2.24 does indeed have the fix.
>>
>> I'm running 3.2.21, does that contain the fix?
>
> No, that was the one I looked at. It is commit
> c0159c780e8d42309d04e83271986274d3880826.

Okay, I grabbed 3.4.4 from Debian experimental, and I'm running with
that now.  Hopefully this fixes the problem.

Thanks for your help!
-- Kevin


Re: RAID extremely slow

2012-07-26 Thread David Dillow

On Thu, 2012-07-26 at 19:17 -0700, Kevin Ross wrote:
> On 07/26/2012 07:17 PM, David Dillow wrote:
> > On Thu, 2012-07-26 at 22:15 -0400, David Dillow wrote:
> >> If you can, upgrade to the latest 3.4 stable kernel (3.4.6 right now).
> >> As far as I can see, the latest 3.2 stable does not contain the delayed
> >> stripe fix.
> > And I was looking at the wrong version; 3.2.24 does indeed have the fix.
> >
> 
> I'm running 3.2.21, does that contain the fix?

No, that was the one I looked at. It is commit
c0159c780e8d42309d04e83271986274d3880826.


Re: RAID extremely slow

2012-07-26 Thread Kevin Ross


On 07/26/2012 07:17 PM, David Dillow wrote:
> On Thu, 2012-07-26 at 22:15 -0400, David Dillow wrote:
>> If you can, upgrade to the latest 3.4 stable kernel (3.4.6 right now).
>> As far as I can see, the latest 3.2 stable does not contain the delayed
>> stripe fix.
> And I was looking at the wrong version; 3.2.24 does indeed have the fix.

I'm running 3.2.21, does that contain the fix?


Re: RAID extremely slow

2012-07-26 Thread Kevin Ross


On 07/25/2012 10:00 PM, Kevin Ross wrote:





unused devices: <none>

# cat /proc/sys/dev/raid/speed_limit_min
1

MD is unable to reach its minimum rebuild rate while other system
activity is ongoing.  You might want to lower this number to see if that
gets you out of the stalls.

Or temporarily shut down mythtv.


I will try lowering those numbers next time this happens, which will 
probably be within the next day or two.  That's about how often this 
happens.


Unfortunately, it has happened again, with speeds at near zero.

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdh1[0] sdd1[9] sde1[10] sdb1[6] sdi1[7] sdc1[4] sdf1[3] sdg1[8] sdj1[1]
      6837311488 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9] [UUUUUUUUU]
      [=>...]  resync =  8.3% (81251712/976758784) finish=1057826.4min speed=14K/sec

unused devices: <none>

atop doesn't show ANY activity on the raid device or the individual 
drives.

http://img687.imageshack.us/img687/2913/screenshotfrom201207252.png

Also, I tried writing to a test file with the following command, and 
it hangs.  I let it go for about 30 minutes, with no change.


# dd if=/dev/zero of=test bs=1M count=1

dmesg only reports hung tasks.  It doesn't report any other problems.  
Here's my dmesg output:

http://pastebin.ca/2174778

I'm going to try rebooting into single user mode, and see if the 
rebuild succeeds without stalling.


-- Kevin


It rebuilt fine in single user mode, with speeds usually around 
50MB/sec.  But after exiting single user mode, and allowing MythTV and 
other programs to start, within 30 minutes I had the problem again.  
Basically a hung filesystem.  I couldn't even "cat /proc/mdstat", that 
just hung.  Lots of hung task warnings in dmesg.


Because Phil suggested that fsync calls might cause stalls, I commented
out the fsync in MythTV.  I'll run with that for a while, and see how
things work out.  So far it isn't adversely affecting MythTV.


Thanks!
-- Kevin


Re: RAID extremely slow

2012-07-25 Thread Kevin Ross






>>> unused devices: <none>
>>>
>>> # cat /proc/sys/dev/raid/speed_limit_min
>>> 1
>>
>> MD is unable to reach its minimum rebuild rate while other system
>> activity is ongoing.  You might want to lower this number to see if that
>> gets you out of the stalls.
>>
>> Or temporarily shut down mythtv.
>
> I will try lowering those numbers next time this happens, which will
> probably be within the next day or two.  That's about how often this
> happens.


Unfortunately, it has happened again, with speeds at near zero.

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdh1[0] sdd1[9] sde1[10] sdb1[6] sdi1[7] sdc1[4] sdf1[3] sdg1[8] sdj1[1]
      6837311488 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9] [UUUUUUUUU]
      [=>...]  resync =  8.3% (81251712/976758784) finish=1057826.4min speed=14K/sec

unused devices: <none>

atop doesn't show ANY activity on the raid device or the individual drives.
http://img687.imageshack.us/img687/2913/screenshotfrom201207252.png
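
To catch a stall like this without watching /proc/mdstat by hand, the `speed=` field can be pulled out of the output and compared against a floor. This is a sketch only: the function names and the 100 K/sec default are arbitrary choices, not something from the thread.

```shell
# Sketch: extract the resync speed (K/sec) from /proc/mdstat-style text on stdin.
resync_speed() {
    sed -n 's/.*speed=\([0-9][0-9]*\)K\/sec.*/\1/p' | head -n 1
}

# Report whether the resync is below a floor given in K/sec (default 100).
check_resync() {
    speed=$(resync_speed)
    if [ -z "$speed" ]; then
        echo "no resync in progress"
        return 0
    fi
    if [ "$speed" -lt "${1:-100}" ]; then
        echo "resync stalled at ${speed}K/sec"
        return 1
    fi
    echo "resync running at ${speed}K/sec"
}

# Example: check_resync 100 < /proc/mdstat
```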

Also, I tried writing to a test file with the following command, and it 
hangs.  I let it go for about 30 minutes, with no change.


# dd if=/dev/zero of=test bs=1M count=1

dmesg only reports hung tasks.  It doesn't report any other problems.  
Here's my dmesg output:

http://pastebin.ca/2174778
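
For a long dmesg, a quick summary of which tasks were reported as hung can help show whether everything is stuck on the array. A minimal sketch, assuming the usual "INFO: task NAME:PID blocked for more than" wording shown earlier in this thread; the function name is made up.

```shell
# Sketch: count hung-task reports per task name from dmesg-style text on stdin.
hung_tasks() {
    sed -n 's/.*INFO: task \(.*\):[0-9][0-9]* blocked for more than.*/\1/p' |
        sort | uniq -c | sort -rn | sed 's/^ *//'
}

# Example: dmesg | hung_tasks
```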

I'm going to try rebooting into single user mode, and see if the rebuild 
succeeds without stalling.


-- Kevin


Re: RAID extremely slow

2012-07-25 Thread Kevin Ross


On 07/25/2012 07:09 PM, CoolCold wrote:
> You might be interested in a write-intent bitmap then; it's going to help a
> lot. (resending in plain text)


Thanks, I'll look into that!
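
For reference, a write-intent bitmap records which regions of the array may be out of sync, so after an unclean shutdown only those regions need resyncing rather than the whole array. A sketch of adding an internal bitmap with mdadm, assuming the array is /dev/md0. The wrapper function is hypothetical and only prints the command unless told to run it; this is probably best done once the current resync has finished.

```shell
# Sketch: add an internal write-intent bitmap to an md array.
# Dry run by default; pass "run" as the second argument to execute (as root).
add_bitmap() {
    array=${1:-/dev/md0}
    if [ "${2:-}" = run ]; then
        mdadm --grow --bitmap=internal "$array"
    else
        echo "mdadm --grow --bitmap=internal $array"
    fi
}

# Example: add_bitmap /dev/md0        # show the command
#          add_bitmap /dev/md0 run    # actually apply it
```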

-- Kevin


Re: RAID extremely slow

2012-07-25 Thread CoolCold

On Thu, Jul 26, 2012 at 5:55 AM, Kevin Ross <ke...@familyross.net> wrote:
>
> Thank you very much for taking the time to look into this.
>
>
> On 07/25/2012 06:00 PM, Phil Turmel wrote:
>>
>> Piles of small reads  scattered across multiple drives, and a
>> concentration of queued writes to /dev/sda.  What's on /dev/sda?
>> It's not a member of the raid, so it must be some other system task
>> involved.
>
>
> /dev/sda1 is the root filesystem.  The writes were most likely by MySQL,
> but I would have to run iotop to be sure.
>
>
>> [ The output of "lsdrv" [1] might be useful here, along with
>> "mdadm -D /dev/md0" and "mdadm -E /dev/[b-j]" ]
>
>
> Here you go: http://pastebin.ca/2174740
>
>
>> MythTV is trying to flush recorded video to disk, I presume.  Sync is
>> known to cause stalls--a great deal of work is on-going to improve
>> this.  How old is this kernel?
>
>
> After rebooting, MythTV is currently recording two shows, and the resync
> is running at full speed.
>
>
> # cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sdh1[0] sdd1[9] sde1[10] sdb1[6] sdi1[7] sdc1[4]
> sdf1[3] sdg1[8] sdj1[1]
>   6837311488 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9]
> [UUUUUUUUU]
>   [=>...]  resync =  9.3% (91363840/976758784)
> finish=1434.3min speed=10287K/sec
>
> unused devices: <none>
>
> atop shows the avio of all the drives to be less than 1ms, where before
> they were much higher.  It will run for a couple days under load just fine,
> and then it will come to a halt.
>
> It's a 3.2.21 kernel.  I'm running Debian Testing, and the exact Debian
> package version is:
>
> ii  linux-image-3.2.0-3-686-pae  3.2.21-3  Linux 3.2 for modern PCs
>
>
>>
>>> [51000.672258]  [<c12c409f>] ? sysenter_do_call+0x12/0x28
>>> [51000.672261]  [<c12b>] ? quirk_usb_early_handoff+0x4a9/0x522
>>>
>>> Here is some other possibly relevant info:
>>>
>>> # cat /proc/mdstat
>>> Personalities : [raid6] [raid5] [raid4]
>>> md0 : active raid6 sdh1[0] sdd1[9] sde1[10] sdb1[6] sdi1[7] sdc1[4]
>>> sdf1[3] sdg1[8] sdj1[1]
>>>    6837311488 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9]
>>> [UUUUUUUUU]
>>>    [==>..]  resync = 51.3% (501954432/976758784)
>>> finish=28755.6min speed=275K/sec
>>
>> Is this resync a weekly check, or did something else trigger it?
>
>
> This is not a scheduled check.  It was triggered by, I believe, an unclean
> shutdown.  An unclean shutdown will trigger a resync.  I don't think it used
> to do this, but I could be remembering wrong.
>
>
>>
>>> unused devices: <none>
>>>
>>> # cat /proc/sys/dev/raid/speed_limit_min
>>> 1
>>
>> MD is unable to reach its minimum rebuild rate while other system
>> activity is ongoing.  You might want to lower this number to see if that
>> gets you out of the stalls.
>>
>> Or temporarily shut down mythtv.
>
>
> I will try lowering those numbers next time this happens, which will
> probably be within the next day or two.  That's about how often this
> happens.
You might be interested in a write-intent bitmap then; it's going to help a lot.
(resending in plain text)
>
>
>>> # cat /proc/sys/dev/raid/speed_limit_max
>>> 20
>>>
>>> Thanks in advance!
>>> -- Kevin
>>
>> HTH,
>>
>> Phil
>>
>> [1] http://github.com/pturmel/lsdrv
>>
>
> Thanks!
> -- Kevin
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html




--
Best regards,
[COOLCOLD-RIPN]

Re: RAID extremely slow

2012-07-25 Thread Kevin Ross


Thank you very much for taking the time to look into this.

On 07/25/2012 06:00 PM, Phil Turmel wrote:

> Piles of small reads scattered across multiple drives, and a
> concentration of queued writes to /dev/sda.  What's on /dev/sda?
> It's not a member of the raid, so it must be some other system task
> involved.


/dev/sda1 is the root filesystem.  The writes were most likely by MySQL, 
but I would have to run iotop to be sure.



> [ The output of "lsdrv" [1] might be useful here, along with
> "mdadm -D /dev/md0" and "mdadm -E /dev/[b-j]" ]


Here you go: http://pastebin.ca/2174740
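
A note on the commands requested above: the glob `/dev/[b-j]` as written would not match any disk nodes. Going by the member names in the /proc/mdstat output in this thread (sdb1 through sdj1), the intended pattern was presumably `/dev/sd[b-j]1`. A sketch that prints the resulting commands rather than running them; the function name is made up.

```shell
# Sketch: print the mdadm inspection commands for the array and its members.
# Member names sdb1..sdj1 come from the mdstat output in this thread.
# Nothing is executed here; the commands are only echoed.
print_mdadm_checks() {
    array=${1:-/dev/md0}
    echo "mdadm -D $array"
    for letter in b c d e f g h i j; do
        echo "mdadm -E /dev/sd${letter}1"
    done
}

# Example: print_mdadm_checks /dev/md0 | sh    # run them as root
```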


> MythTV is trying to flush recorded video to disk, I presume.  Sync is
> known to cause stalls--a great deal of work is on-going to improve
> this.  How old is this kernel?


After rebooting, MythTV is currently recording two shows, and the resync 
is running at full speed.


# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdh1[0] sdd1[9] sde1[10] sdb1[6] sdi1[7] sdc1[4] sdf1[3] sdg1[8] sdj1[1]
      6837311488 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9] [UUUUUUUUU]
      [=>...]  resync =  9.3% (91363840/976758784) finish=1434.3min speed=10287K/sec

unused devices: <none>

atop shows the avio of all the drives to be less than 1ms, where before 
they were much higher.  It will run for a couple days under load just 
fine, and then it will come to a halt.


It's a 3.2.21 kernel.  I'm running Debian Testing, and the exact Debian 
package version is:


ii  linux-image-3.2.0-3-686-pae  3.2.21-3  Linux 3.2 for modern PCs





>> [51000.672258]  [<c12c409f>] ? sysenter_do_call+0x12/0x28
>> [51000.672261]  [<c12b>] ? quirk_usb_early_handoff+0x4a9/0x522
>>
>> Here is some other possibly relevant info:
>>
>> # cat /proc/mdstat
>> Personalities : [raid6] [raid5] [raid4]
>> md0 : active raid6 sdh1[0] sdd1[9] sde1[10] sdb1[6] sdi1[7] sdc1[4]
>> sdf1[3] sdg1[8] sdj1[1]
>>    6837311488 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9]
>> [UUUUUUUUU]
>>    [==>..]  resync = 51.3% (501954432/976758784)
>> finish=28755.6min speed=275K/sec
>
> Is this resync a weekly check, or did something else trigger it?


This is not a scheduled check.  It was triggered by, I believe, an 
unclean shutdown.  An unclean shutdown will trigger a resync.  I don't 
think it used to do this, but I could be remembering wrong.
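
One way to tell a scheduled check from a post-crash resync is the array's `sync_action` sysfs attribute, which reads back values such as idle, resync, recover, check or repair. A sketch, assuming the array is md0; the helper name and the wording of the descriptions are my own.

```shell
# Sketch: describe what kind of sync md is currently running.
describe_sync() {
    # $1 = path to the array's sync_action attribute
    action=$(cat "$1") || return 1
    case $action in
        check|repair) echo "$action: requested scrub (e.g. a scheduled check)" ;;
        resync)       echo "resync: array was not clean (e.g. unclean shutdown)" ;;
        recover)      echo "recover: rebuilding onto a replaced or added device" ;;
        idle)         echo "idle: no sync in progress" ;;
        *)            echo "$action" ;;
    esac
}

# Example: describe_sync /sys/block/md0/md/sync_action
```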





>> unused devices: <none>
>>
>> # cat /proc/sys/dev/raid/speed_limit_min
>> 1
>
> MD is unable to reach its minimum rebuild rate while other system
> activity is ongoing.  You might want to lower this number to see if that
> gets you out of the stalls.
>
> Or temporarily shut down mythtv.


I will try lowering those numbers next time this happens, which will 
probably be within the next day or two.  That's about how often this 
happens.



>> # cat /proc/sys/dev/raid/speed_limit_max
>> 20
>>
>> Thanks in advance!
>> -- Kevin
>
> HTH,
>
> Phil
>
> [1] http://github.com/pturmel/lsdrv



Thanks!
-- Kevin


Re: RAID extremely slow

2012-07-25 Thread Phil Turmel

[Added linux-raid to the CC]

Hi Kevin,

Notes interleaved:

On 07/25/2012 06:52 PM, Kevin Ross wrote:
> Hello,
> 
> I'm having a problem.  After a while, my software RAID rebuild becomes
> extremely slow, and the filesystem on the RAID is essentially blocked. 
> I don't know what is causing this.  I guess it could be a bad drive, but
> how can I find out?

Probably not.  That pretty much always shows up in dmesg.

> I used atop to show the transfer speeds to each drive. Here's a
> screenshot:
> http://img402.imageshack.us/img402/6484/screenshotfrom201207251.png

Piles of small reads  scattered across multiple drives, and a
concentration of queued writes to /dev/sda.  What's on /dev/sda?
It's not a member of the raid, so it must be some other system task
involved.

[ The output of "lsdrv" [1] might be useful here, along with
"mdadm -D /dev/md0" and "mdadm -E /dev/[b-j]" ]

> "smartctl -a" for all the drives looks good to me, no pending failures,
> or errors logged.  dmesg doesn't report anything wrong with any of the
> drives.  It does, however, report lots of hung tasks, which are trying
> to access the RAID volume.  For example:
> 
> [51000.672064] INFO: task mythbackend:10677 blocked for more than 120
> seconds.
> [51000.672098] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [51000.672143] mythbackend D 000e 0 10677  1 0x
> [51000.672146]  f38bea00 0086 c1095415 000e 0002 
>  c147aac0
> [51000.672152]  f38bebac c147aac0 eb2cff04 003d2f4b  c109cacb
> 01872f02 eb2cfe50
> [51000.672157]  c100f28b c13df480 01872f02 eb2cfe68 c10532b1 0069a8d0
> f79d6ac0 
> [51000.672162] Call Trace:
> [51000.672169]  [<c1095415>] ? find_get_pages_tag+0x2f/0xa2
> [51000.672173]  [<c109cacb>] ? pagevec_lookup_tag+0x18/0x1e
> [51000.672176]  [<c100f28b>] ? read_tsc+0xa/0x28
> [51000.672179]  [<c10532b1>] ? timekeeping_get_ns+0x11/0x55
> [51000.672182]  [<c10536a4>] ? ktime_get_ts+0x7a/0x82
> [51000.672186]  [<c12bea8b>] ? io_schedule+0x4a/0x5f
> [51000.672188]  [<c1095659>] ? sleep_on_page+0x5/0x8
> [51000.672191]  [<c12bedeb>] ? __wait_on_bit+0x2f/0x54
> [51000.672193]  [<c1095654>] ? lock_page+0x1d/0x1d
> [51000.672196]  [<c1095754>] ? wait_on_page_bit+0x57/0x5e
> [51000.672199]  [<c104d171>] ? autoremove_wake_function+0x29/0x29
> [51000.672201]  [<c1095823>] ? filemap_fdatawait_range+0x71/0x11e
> [51000.672205]  [<c109630f>] ? filemap_write_and_wait_range+0x3e/0x4c
> [51000.672232]  [<f86bfb39>] ? xfs_file_fsync+0x68/0x214 [xfs]
> [51000.672246]  [<f86bfad1>] ? xfs_file_splice_write+0x144/0x144 [xfs]
> [51000.672249]  [<c10e7e3b>] ? vfs_fsync_range+0x27/0x2d
> [51000.672252]  [<c10e7e52>] ? vfs_fsync+0x11/0x15
> [51000.672254]  [<c10e80b8>] ? sys_fdatasync+0x20/0x2e

MythTV is trying to flush recorded video to disk, I presume.  Sync is
known to cause stalls--a great deal of work is on-going to improve
this.  How old is this kernel?

> [51000.672258]  [<c12c409f>] ? sysenter_do_call+0x12/0x28
> [51000.672261]  [<c12b>] ? quirk_usb_early_handoff+0x4a9/0x522
> 
> Here is some other possibly relevant info:
> 
> # cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sdh1[0] sdd1[9] sde1[10] sdb1[6] sdi1[7] sdc1[4]
> sdf1[3] sdg1[8] sdj1[1]
>   6837311488 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9]
> [UUUUUUUUU]
>   [==>..]  resync = 51.3% (501954432/976758784)
> finish=28755.6min speed=275K/sec

Is this resync a weekly check, or did something else trigger it?

> unused devices: <none>
> 
> # cat /proc/sys/dev/raid/speed_limit_min
> 1

MD is unable to reach its minimum rebuild rate while other system
activity is ongoing.  You might want to lower this number to see if that
gets you out of the stalls.

Or temporarily shut down mythtv.
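
A sketch of lowering the minimum resync rate as suggested above. The helper name is made up, and any value passed to it (the example uses 1000) is illustrative rather than a recommendation from this thread. The file is the standard md sysctl, in KiB/s per device, and a value written this way does not survive a reboot.

```shell
# Sketch: lower md's guaranteed minimum resync rate, never raise it.
lower_speed_min() {
    # $1 = new minimum, $2 = sysctl file (defaults to the real one)
    new=$1
    file=${2:-/proc/sys/dev/raid/speed_limit_min}
    cur=$(cat "$file") || return 1
    if [ "$new" -lt "$cur" ]; then
        printf '%s\n' "$new" > "$file"
    fi
    cat "$file"    # print the value now in effect
}

# Example (as root): lower_speed_min 1000
```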

> # cat /proc/sys/dev/raid/speed_limit_max
> 20
> 
> Thanks in advance!
> -- Kevin

HTH,

Phil

[1] http://github.com/pturmel/lsdrv


RAID extremely slow

2012-07-25 Thread Kevin Ross


Hello,

I'm having a problem.  After a while, my software RAID rebuild becomes 
extremely slow, and the filesystem on the RAID is essentially blocked.  
I don't know what is causing this.  I guess it could be a bad drive, but 
how can I find out?


I used atop to show the transfer speeds to each drive. Here's a 
screenshot: 
http://img402.imageshack.us/img402/6484/screenshotfrom201207251.png


"smartctl -a" for all the drives looks good to me, no pending failures, 
or errors logged.  dmesg doesn't report anything wrong with any of the 
drives.  It does, however, report lots of hung tasks, which are trying 
to access the RAID volume.  For example:


[51000.672064] INFO: task mythbackend:10677 blocked for more than 120 
seconds.
[51000.672098] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.

[51000.672143] mythbackend D 000e 0 10677  1 0x
[51000.672146]  f38bea00 0086 c1095415 000e 0002  
 c147aac0
[51000.672152]  f38bebac c147aac0 eb2cff04 003d2f4b  c109cacb 
01872f02 eb2cfe50
[51000.672157]  c100f28b c13df480 01872f02 eb2cfe68 c10532b1 0069a8d0 
f79d6ac0 

[51000.672162] Call Trace:
[51000.672169]  [<c1095415>] ? find_get_pages_tag+0x2f/0xa2
[51000.672173]  [<c109cacb>] ? pagevec_lookup_tag+0x18/0x1e
[51000.672176]  [<c100f28b>] ? read_tsc+0xa/0x28
[51000.672179]  [<c10532b1>] ? timekeeping_get_ns+0x11/0x55
[51000.672182]  [<c10536a4>] ? ktime_get_ts+0x7a/0x82
[51000.672186]  [<c12bea8b>] ? io_schedule+0x4a/0x5f
[51000.672188]  [<c1095659>] ? sleep_on_page+0x5/0x8
[51000.672191]  [<c12bedeb>] ? __wait_on_bit+0x2f/0x54
[51000.672193]  [<c1095654>] ? lock_page+0x1d/0x1d
[51000.672196]  [<c1095754>] ? wait_on_page_bit+0x57/0x5e
[51000.672199]  [<c104d171>] ? autoremove_wake_function+0x29/0x29
[51000.672201]  [<c1095823>] ? filemap_fdatawait_range+0x71/0x11e
[51000.672205]  [<c109630f>] ? filemap_write_and_wait_range+0x3e/0x4c
[51000.672232]  [<f86bfb39>] ? xfs_file_fsync+0x68/0x214 [xfs]
[51000.672246]  [<f86bfad1>] ? xfs_file_splice_write+0x144/0x144 [xfs]
[51000.672249]  [<c10e7e3b>] ? vfs_fsync_range+0x27/0x2d
[51000.672252]  [<c10e7e52>] ? vfs_fsync+0x11/0x15
[51000.672254]  [<c10e80b8>] ? sys_fdatasync+0x20/0x2e
[51000.672258]  [<c12c409f>] ? sysenter_do_call+0x12/0x28
[51000.672261]  [<c12b>] ? quirk_usb_early_handoff+0x4a9/0x522

Here is some other possibly relevant info:

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdh1[0] sdd1[9] sde1[10] sdb1[6] sdi1[7] sdc1[4] sdf1[3] sdg1[8] sdj1[1]
      6837311488 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9] [UUUUUUUUU]
      [==>..]  resync = 51.3% (501954432/976758784) finish=28755.6min speed=275K/sec

unused devices: <none>

# cat /proc/sys/dev/raid/speed_limit_min
1
# cat /proc/sys/dev/raid/speed_limit_max
20

Thanks in advance!
-- Kevin
