Re: RAID extremely slow
On Thursday 2012-07-26 03:00, Phil Turmel wrote:
>> I used atop to show the transfer speeds to each drive. Here's a
>> screenshot:
>> http://img402.imageshack.us/img402/6484/screenshotfrom201207251.png
>
>[ The output of "lsdrv" [1] might be useful here, along with
>"mdadm -D /dev/md0" and "mdadm -E /dev/[b-j]" ]
>[1] http://github.com/pturmel/lsdrv

lsdrv? Shows a bunch, but for this, using standard tools like lsscsi & lsblk seems sufficient :)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: RAID extremely slow
Kevin Ross wrote:
> On 07/27/2012 09:45 PM, Grant Coady wrote:
>> On Fri, 27 Jul 2012 14:45:18 -0700, you wrote:
>>> On 07/27/2012 12:08 PM, Bill Davidsen wrote:
>>>> Have you set the io scheduler to deadline on all members of the array?
>>>> That's kind of "job one" on older kernels.
>>>
>>> I have not, thanks for the tip, I'll look into that now.
>>
>> Plus I disable the on-drive queuing (NCQ) during startup, right now I
>> don't have benchmarks to show the difference. This on a six by 1TB
>> drive RAID6 array I built over a year ago on Slackware64-13.37:
>>
>> # cat /etc/rc.d/rc.local
>> ...
>> # turn off NCQ on the RAID drives by adjusting queue depth to 1
>> n=1
>> echo "rc.local: Disable RAID drives' NCQ"
>> for d in a b c d e f
>> do
>>     echo " set NCQ depth to $n on sd${d}"
>>     echo $n > /sys/block/sd${d}/device/queue_depth
>> done
>> ...
>>
>> Maybe you could try that? See if it makes a difference. My drives are
>> Seagate.
>>
>> Grant.
>
> Does disabling NCQ improve performance?

Does for me!

> The suggestion to use kernel 3.4.6 has been working quite well so far,
> hopefully that fixes the problem. I'll know for sure in a few more days...
>
> Thanks!
> -- Kevin

--
Bill Davidsen <david...@tmr.com>
  We are not out of the woods yet, but we know the direction and have
  taken the first step. The steps are many, but finite in number, and if
  we persevere we will reach our destination. -me, 2010
Re: RAID extremely slow
On 07/27/2012 09:45 PM, Grant Coady wrote:
> On Fri, 27 Jul 2012 14:45:18 -0700, you wrote:
>> On 07/27/2012 12:08 PM, Bill Davidsen wrote:
>>> Have you set the io scheduler to deadline on all members of the array?
>>> That's kind of "job one" on older kernels.
>>
>> I have not, thanks for the tip, I'll look into that now.
>
> Plus I disable the on-drive queuing (NCQ) during startup, right now I
> don't have benchmarks to show the difference. This on a six by 1TB
> drive RAID6 array I built over a year ago on Slackware64-13.37:
>
> # cat /etc/rc.d/rc.local
> ...
> # turn off NCQ on the RAID drives by adjusting queue depth to 1
> n=1
> echo "rc.local: Disable RAID drives' NCQ"
> for d in a b c d e f
> do
>     echo " set NCQ depth to $n on sd${d}"
>     echo $n > /sys/block/sd${d}/device/queue_depth
> done
> ...
>
> Maybe you could try that? See if it makes a difference. My drives are
> Seagate.
>
> Grant.

Does disabling NCQ improve performance?

The suggestion to use kernel 3.4.6 has been working quite well so far, hopefully that fixes the problem. I'll know for sure in a few more days...

Thanks!
-- Kevin
Re: RAID extremely slow
On Fri, 27 Jul 2012 14:45:18 -0700, you wrote:
>On 07/27/2012 12:08 PM, Bill Davidsen wrote:
>> Have you set the io scheduler to deadline on all members of the array?
>> That's kind of "job one" on older kernels.
>>
>
>I have not, thanks for the tip, I'll look into that now.

Plus I disable the on-drive queuing (NCQ) during startup, right now I don't have benchmarks to show the difference. This on a six by 1TB drive RAID6 array I built over a year ago on Slackware64-13.37:

# cat /etc/rc.d/rc.local
...
# turn off NCQ on the RAID drives by adjusting queue depth to 1
n=1
echo "rc.local: Disable RAID drives' NCQ"
for d in a b c d e f
do
    echo " set NCQ depth to $n on sd${d}"
    echo $n > /sys/block/sd${d}/device/queue_depth
done
...

Maybe you could try that? See if it makes a difference. My drives are Seagate.

Grant.
Re: RAID extremely slow
On 07/27/2012 12:08 PM, Bill Davidsen wrote:
> Have you set the io scheduler to deadline on all members of the array?
> That's kind of "job one" on older kernels.

I have not, thanks for the tip, I'll look into that now.

Thanks!
-- Kevin
Re: RAID extremely slow
Kevin Ross wrote:
> unused devices: <none>
>
> # cat /proc/sys/dev/raid/speed_limit_min
> 1
>
> MD is unable to reach its minimum rebuild rate while other system
> activity is ongoing. You might want to lower this number to see if that
> gets you out of the stalls. Or temporarily shut down mythtv.
>
> I will try lowering those numbers next time this happens, which will
> probably be within the next day or two. That's about how often this
> happens.
>
> Unfortunately, it has happened again, with speeds at near zero.
>
> # cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sdh1[0] sdd1[9] sde1[10] sdb1[6] sdi1[7] sdc1[4] sdf1[3] sdg1[8] sdj1[1]
>       6837311488 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9] [U]
>       [=>...] resync = 8.3% (81251712/976758784) finish=1057826.4min speed=14K/sec
>
> unused devices: <none>
>
> atop doesn't show ANY activity on the raid device or the individual
> drives.
> http://img687.imageshack.us/img687/2913/screenshotfrom201207252.png
>
> Also, I tried writing to a test file with the following command, and it
> hangs. I let it go for about 30 minutes, with no change.
>
> # dd if=/dev/zero of=test bs=1M count=1
>
> dmesg only reports hung tasks. It doesn't report any other problems.
> Here's my dmesg output: http://pastebin.ca/2174778
>
> I'm going to try rebooting into single user mode, and see if the
> rebuild succeeds without stalling.

Have you set the io scheduler to deadline on all members of the array? That's kind of "job one" on older kernels.

--
Bill Davidsen <david...@tmr.com>
  "We have more to fear from the bungling of the incompetent than from
  the machinations of the wicked." - from Slashdot
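For anyone following along, the scheduler suggestion can be sketched roughly as below. This is only an illustration, not something posted in the thread: `set_scheduler` and the `SYSFS_ROOT` override are hypothetical names, and the member list has to be taken from your own /proc/mdstat. On newer multi-queue kernels the equivalent elevator is usually named mq-deadline rather than deadline.

```shell
#!/bin/sh
# set_scheduler SCHED DISK...: write SCHED into each disk's elevator
# selector under sysfs and report what the kernel shows afterwards.
# SYSFS_ROOT is a hypothetical override (not a kernel feature) so the
# function can be tried against a scratch directory; default is /sys.
set_scheduler() {
    sched=$1
    shift
    root=${SYSFS_ROOT:-/sys}
    for disk in "$@"; do
        f="$root/block/$disk/queue/scheduler"
        if [ -w "$f" ]; then
            echo "$sched" > "$f"
            echo "$disk: $(cat "$f")"
        else
            echo "$disk: cannot write $f" >&2
        fi
    done
}

# Example (as root), with the whole-disk names behind the md0 members:
# set_scheduler deadline sdb sdc sdd sde sdf sdg sdh sdi sdj
```

The setting does not survive a reboot, so it would normally go in the same place as the NCQ loop (rc.local or similar).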
Re: RAID extremely slow
On 07/26/2012 07:53 PM, Kevin Ross wrote:
> On 07/26/2012 07:27 PM, David Dillow wrote:
>> On Thu, 2012-07-26 at 19:17 -0700, Kevin Ross wrote:
>>> On 07/26/2012 07:17 PM, David Dillow wrote:
>>>> On Thu, 2012-07-26 at 22:15 -0400, David Dillow wrote:
>>>>> If you can, upgrade to the latest 3.4 stable kernel (3.4.6 right now).
>>>>> As far as I can see, the latest 3.2 stable does not contain the delayed
>>>>> stripe fix.
>>>> And I was looking at the wrong version; 3.2.24 does indeed have the fix.
>>>
>>> I'm running 3.2.21, does that contain the fix?
>>
>> No, that was the one I looked at. It is commit
>> c0159c780e8d42309d04e83271986274d3880826.
>
> Okay, I grabbed 3.4.4 from Debian experimental, and I'm running with
> that now. Hopefully this fixes the problem.
>
> Thanks for your help!
> -- Kevin

Just noticed I need 3.4.5 or later. Doh! I'll grab a vanilla kernel from kernel.org and build it.

Thanks!
-- Kevin
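The version thresholds mentioned in this sub-thread (3.2.24 in the 3.2 series, 3.4.5 in the 3.4 series) can be checked mechanically. A rough sketch follows; `version_ge` and `has_stripe_fix` are hypothetical helper names, GNU `sort -V` is assumed, and any series not discussed here is reported as unknown rather than guessed. Note that on Debian `uname -r` prints the ABI name (3.2.0-3-686-pae), not the upstream version, so the upstream number from the package version (3.2.21-3) is what should be passed in.

```shell
#!/bin/sh
# version_ge A B: succeed if version A >= version B (GNU sort -V assumed).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# has_stripe_fix VERSION: apply the thresholds quoted in this thread.
# Returns 0 = has the fix, 1 = too old, 2 = series not covered here.
has_stripe_fix() {
    v=${1%%-*}    # drop a package revision suffix such as "-3"
    case $v in
        3.2.*) version_ge "$v" 3.2.24 ;;
        3.4.*) version_ge "$v" 3.4.5 ;;
        *)     return 2 ;;
    esac
}

# Example:
# has_stripe_fix 3.2.21-3 || echo "delayed stripe fix missing or unknown"
```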
Re: RAID extremely slow
On Wed, 2012-07-25 at 18:55 -0700, Kevin Ross wrote:
> On 07/25/2012 06:00 PM, Phil Turmel wrote:
> > Piles of small reads scattered across multiple drives, and a
> > concentration of queued writes to /dev/sda. What's on /dev/sda?
> > It's not a member of the raid, so it must be some other system task
> > involved.
> After rebooting, MythTV is currently recording two shows, and the resync
> is running at full speed.
>
> # cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sdh1[0] sdd1[9] sde1[10] sdb1[6] sdi1[7] sdc1[4] sdf1[3] sdg1[8] sdj1[1]
>       6837311488 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9] [U]
>       [=>...] resync = 9.3% (91363840/976758784) finish=1434.3min speed=10287K/sec
>
> unused devices: <none>
>
> atop shows the avio of all the drives to be less than 1ms, where before
> they were much higher. It will run for a couple days under load just
> fine, and then it will come to a halt.
>
> It's a 3.2.21 kernel. I'm running Debian Testing, and the exact Debian
> package version is:

I suspect you are being hit by the same bug I was -- delayed stripes never got processed. If you get into the state where the rebuild isn't progressing, and you find that increasing the size of the stripe cache allows the rebuild to proceed (but the filesystem stays wedged), then that cinches it.

If you can, upgrade to the latest 3.4 stable kernel (3.4.6 right now). As far as I can see, the latest 3.2 stable does not contain the delayed stripe fix.

After applying those fixes to my kernel, my MythTV setup over a 5 disk RAID5 has been pretty solid, where before I was getting lockups every few days. It still seems to be getting slower over time, but I've not looked into it yet as it is not as catastrophic as the wedging.

HTH,
Dave
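The stripe-cache test described above can be tried along these lines. This is a sketch only: `stripe_cache` and the `SYSFS_ROOT` override are hypothetical names, while `stripe_cache_size` is the per-array md attribute under /sys/block/mdX/md/ for raid4/5/6 arrays. The value 4096 is just an example. Per the kernel's md documentation the memory used is roughly page size x number of disks x stripe_cache_size, so 4096 entries on this 9-disk array with 4 KiB pages would be about 144 MiB.

```shell
#!/bin/sh
# stripe_cache MD [NEW]: print the array's current stripe_cache_size
# and, if NEW is given, write it and print the value read back.
# SYSFS_ROOT is a hypothetical override for trying this against a
# scratch directory; on a real system it defaults to /sys.
stripe_cache() {
    md=$1
    new=${2:-}
    dir="${SYSFS_ROOT:-/sys}/block/$md/md"
    echo "current: $(cat "$dir/stripe_cache_size")"
    if [ -n "$new" ]; then
        echo "$new" > "$dir/stripe_cache_size"
        echo "new: $(cat "$dir/stripe_cache_size")"
    fi
}

# Example (as root), while the resync is stalled:
# stripe_cache md0 4096
```

If the resync speed in /proc/mdstat picks up after raising the value while the filesystem stays hung, that matches the symptom Dave describes.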
Re: RAID extremely slow
On Thu, 2012-07-26 at 22:15 -0400, David Dillow wrote:
> If you can, upgrade to the latest 3.4 stable kernel (3.4.6 right now).
> As far as I can see, the latest 3.2 stable does not contain the delayed
> stripe fix.

And I was looking at the wrong version; 3.2.24 does indeed have the fix.
Re: RAID extremely slow
On 07/26/2012 07:27 PM, David Dillow wrote:
> On Thu, 2012-07-26 at 19:17 -0700, Kevin Ross wrote:
>> On 07/26/2012 07:17 PM, David Dillow wrote:
>>> On Thu, 2012-07-26 at 22:15 -0400, David Dillow wrote:
>>>> If you can, upgrade to the latest 3.4 stable kernel (3.4.6 right now).
>>>> As far as I can see, the latest 3.2 stable does not contain the delayed
>>>> stripe fix.
>>> And I was looking at the wrong version; 3.2.24 does indeed have the fix.
>>
>> I'm running 3.2.21, does that contain the fix?
>
> No, that was the one I looked at. It is commit
> c0159c780e8d42309d04e83271986274d3880826.

Okay, I grabbed 3.4.4 from Debian experimental, and I'm running with that now. Hopefully this fixes the problem.

Thanks for your help!
-- Kevin
Re: RAID extremely slow
On Thu, 2012-07-26 at 19:17 -0700, Kevin Ross wrote:
> On 07/26/2012 07:17 PM, David Dillow wrote:
> > On Thu, 2012-07-26 at 22:15 -0400, David Dillow wrote:
> >> If you can, upgrade to the latest 3.4 stable kernel (3.4.6 right now).
> >> As far as I can see, the latest 3.2 stable does not contain the delayed
> >> stripe fix.
> > And I was looking at the wrong version; 3.2.24 does indeed have the fix.
> >
>
> I'm running 3.2.21, does that contain the fix?

No, that was the one I looked at. It is commit c0159c780e8d42309d04e83271986274d3880826.
Re: RAID extremely slow
On 07/26/2012 07:17 PM, David Dillow wrote:
> On Thu, 2012-07-26 at 22:15 -0400, David Dillow wrote:
>> If you can, upgrade to the latest 3.4 stable kernel (3.4.6 right now).
>> As far as I can see, the latest 3.2 stable does not contain the delayed
>> stripe fix.
> And I was looking at the wrong version; 3.2.24 does indeed have the fix.

I'm running 3.2.21, does that contain the fix?
Re: RAID extremely slow
On 07/25/2012 10:00 PM, Kevin Ross wrote:
> unused devices: <none>
>
> # cat /proc/sys/dev/raid/speed_limit_min
> 1
>
> MD is unable to reach its minimum rebuild rate while other system
> activity is ongoing. You might want to lower this number to see if that
> gets you out of the stalls. Or temporarily shut down mythtv.
>
> I will try lowering those numbers next time this happens, which will
> probably be within the next day or two. That's about how often this
> happens.
>
> Unfortunately, it has happened again, with speeds at near zero.
>
> # cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sdh1[0] sdd1[9] sde1[10] sdb1[6] sdi1[7] sdc1[4] sdf1[3] sdg1[8] sdj1[1]
>       6837311488 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9] [U]
>       [=>...] resync = 8.3% (81251712/976758784) finish=1057826.4min speed=14K/sec
>
> unused devices: <none>
>
> atop doesn't show ANY activity on the raid device or the individual
> drives.
> http://img687.imageshack.us/img687/2913/screenshotfrom201207252.png
>
> Also, I tried writing to a test file with the following command, and it
> hangs. I let it go for about 30 minutes, with no change.
>
> # dd if=/dev/zero of=test bs=1M count=1
>
> dmesg only reports hung tasks. It doesn't report any other problems.
> Here's my dmesg output: http://pastebin.ca/2174778
>
> I'm going to try rebooting into single user mode, and see if the
> rebuild succeeds without stalling.
>
> -- Kevin

It rebuilt fine in single user mode, with speeds usually around 50MB/sec.

But after exiting single user mode, and allowing MythTV and other programs to start, within 30 minutes I had the problem again. Basically a hung filesystem. I couldn't even "cat /proc/mdstat", that just hung. Lots of hung task warnings in dmesg.

Because Phil suggested that fsync calls might cause stalls, I commented out the fsync in MythTV. I'll run with that for a while, and see how things work out. So far it isn't adversely affecting MythTV.

Thanks!
-- Kevin
Re: RAID extremely slow
unused devices: <none>

# cat /proc/sys/dev/raid/speed_limit_min
1

MD is unable to reach its minimum rebuild rate while other system activity is ongoing. You might want to lower this number to see if that gets you out of the stalls. Or temporarily shut down mythtv.

I will try lowering those numbers next time this happens, which will probably be within the next day or two. That's about how often this happens.

Unfortunately, it has happened again, with speeds at near zero.

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdh1[0] sdd1[9] sde1[10] sdb1[6] sdi1[7] sdc1[4] sdf1[3] sdg1[8] sdj1[1]
      6837311488 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9] [U]
      [=>...] resync = 8.3% (81251712/976758784) finish=1057826.4min speed=14K/sec

unused devices: <none>

atop doesn't show ANY activity on the raid device or the individual drives.
http://img687.imageshack.us/img687/2913/screenshotfrom201207252.png

Also, I tried writing to a test file with the following command, and it hangs. I let it go for about 30 minutes, with no change.

# dd if=/dev/zero of=test bs=1M count=1

dmesg only reports hung tasks. It doesn't report any other problems. Here's my dmesg output: http://pastebin.ca/2174778

I'm going to try rebooting into single user mode, and see if the rebuild succeeds without stalling.

-- Kevin
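A stall like the one above (speed=14K/sec) can be spotted by comparing the speed reported in /proc/mdstat with the configured floor. The sketch below is illustrative only: `resync_speed` and `check_resync` are hypothetical helper names, and it assumes the `speed=NNNK/sec` field format shown in the mdstat output in this thread. If several arrays are syncing at once, only the first is looked at.

```shell
#!/bin/sh
# resync_speed: read /proc/mdstat-style text on stdin and print the
# number from each "speed=NNNK/sec" field, one per line.
resync_speed() {
    sed -n 's/.*speed=\([0-9][0-9]*\)K\/sec.*/\1/p'
}

# check_resync FLOOR: compare the first reported speed (K/sec) on stdin
# against FLOOR, e.g. the value of speed_limit_min.
check_resync() {
    floor=$1
    speed=$(resync_speed | head -n 1)
    if [ -z "$speed" ]; then
        echo "no resync in progress"
    elif [ "$speed" -lt "$floor" ]; then
        echo "stalled: ${speed}K/sec is below the ${floor}K/sec floor"
    else
        echo "ok: ${speed}K/sec"
    fi
}

# Example:
# check_resync "$(cat /proc/sys/dev/raid/speed_limit_min)" < /proc/mdstat
```

Lowering the floor itself, as suggested earlier in the thread, is a plain write to /proc/sys/dev/raid/speed_limit_min as root.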
Re: RAID extremely slow
On 07/25/2012 07:09 PM, CoolCold wrote:
> You might be interested in write intent bitmap then, it gonna help a
> lot.
>
> (resending in plain text)

Thanks, I'll look into that!
-- Kevin
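For anyone following along, a write-intent bitmap lets md resync only the regions that were being written at the time of an unclean shutdown, rather than the whole array. A rough sketch of checking for one is below. md_has_bitmap is a hypothetical helper that looks for the "bitmap:" line in /proc/mdstat-style text; the mdadm invocation it prints is the usual way to add an internal bitmap, though mdadm may refuse it while a resync is still running.

```shell
#!/bin/sh
# md_has_bitmap is a hypothetical helper: it reads /proc/mdstat-style text
# on stdin and succeeds if the array named in $1 lists a "bitmap:" line.
md_has_bitmap() {
    awk -v dev="$1" '
        $1 == dev && $2 == ":"    { in_dev = 1; next }  # start of the stanza
        NF == 0                   { in_dev = 0 }        # blank line ends it
        in_dev && $1 == "bitmap:" { found = 1 }
        END { exit(found ? 0 : 1) }
    '
}

# md0 is the array name used in this thread.
if [ -r /proc/mdstat ] && md_has_bitmap md0 < /proc/mdstat; then
    echo "md0 already has a write-intent bitmap"
else
    echo "no bitmap seen on md0; as root, one could try:"
    echo "  mdadm --grow --bitmap=internal /dev/md0"
fi
```

The script only prints the mdadm command instead of running it, since changing the array is a decision best made by hand.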
Re: RAID extremely slow
On Thu, Jul 26, 2012 at 5:55 AM, Kevin Ross wrote:
> Thank you very much for taking the time to look into this.
>
> On 07/25/2012 06:00 PM, Phil Turmel wrote:
>> Piles of small reads scattered across multiple drives, and a
>> concentration of queued writes to /dev/sda. What's on /dev/sda?
>> It's not a member of the raid, so it must be some other system task
>> involved.
>
> /dev/sda1 is the root filesystem. The writes were most likely by MySQL,
> but I would have to run iotop to be sure.
>
>> [ The output of "lsdrv" [1] might be useful here, along with
>> "mdadm -D /dev/md0" and "mdadm -E /dev/[b-j]" ]
>
> Here you go: http://pastebin.ca/2174740
>
>> MythTV is trying to flush recorded video to disk, I presume. Sync is
>> known to cause stalls--a great deal of work is on-going to improve
>> this. How old is this kernel?
>
> After rebooting, MythTV is currently recording two shows, and the resync
> is running at full speed.
>
> # cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sdh1[0] sdd1[9] sde1[10] sdb1[6] sdi1[7] sdc1[4]
> sdf1[3] sdg1[8] sdj1[1]
>       6837311488 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9]
> [U]
>       [=>...]  resync =  9.3% (91363840/976758784)
> finish=1434.3min speed=10287K/sec
>
> unused devices: <none>
>
> atop shows the avio of all the drives to be less than 1ms, where before
> they were much higher. It will run for a couple days under load just fine,
> and then it will come to a halt.
>
> It's a 3.2.21 kernel. I'm running Debian Testing, and the exact Debian
> package version is:
>
> ii  linux-image-3.2.0-3-686-pae  3.2.21-3
> Linux 3.2 for modern PCs
>
>>> [51000.672258] [<c12c409f>] ? sysenter_do_call+0x12/0x28
>>> [51000.672261] [<c12b>] ? quirk_usb_early_handoff+0x4a9/0x522
>>>
>>> Here is some other possibly relevant info:
>>>
>>> # cat /proc/mdstat
>>> Personalities : [raid6] [raid5] [raid4]
>>> md0 : active raid6 sdh1[0] sdd1[9] sde1[10] sdb1[6] sdi1[7] sdc1[4]
>>> sdf1[3] sdg1[8] sdj1[1]
>>>       6837311488 blocks super 1.2 level 6, 512k chunk, algorithm 2
>>> [9/9]
>>> [U]
>>>       [==>..]  resync = 51.3% (501954432/976758784)
>>> finish=28755.6min speed=275K/sec
>>
>> Is this resync a weekly check, or did something else trigger it?
>
> This is not a scheduled check. It was triggered by, I believe, an unclean
> shutdown. An unclean shutdown will trigger a resync. I don't think it used
> to do this, but I could be remembering wrong.
>
>>> unused devices: <none>
>>>
>>> # cat /proc/sys/dev/raid/speed_limit_min
>>> 1
>>
>> MD is unable to reach its minimum rebuild rate while other system
>> activity is ongoing. You might want to lower this number to see if that
>> gets you out of the stalls.
>>
>> Or temporarily shut down mythtv.
>
> I will try lowering those numbers next time this happens, which will
> probably be within the next day or two. That's about how often this
> happens.

You might be interested in write intent bitmap then, it gonna help a lot.

(resending in plain text)

>>> # cat /proc/sys/dev/raid/speed_limit_max
>>> 20
>>>
>>> Thanks in advance!
>>> -- Kevin
>>
>> HTH,
>>
>> Phil
>>
>> [1] http://github.com/pturmel/lsdrv
>
> Thanks!
> -- Kevin

--
Best regards,
[COOLCOLD-RIPN]
Re: RAID extremely slow
Thank you very much for taking the time to look into this.

On 07/25/2012 06:00 PM, Phil Turmel wrote:
> Piles of small reads scattered across multiple drives, and a
> concentration of queued writes to /dev/sda. What's on /dev/sda?
> It's not a member of the raid, so it must be some other system task
> involved.

/dev/sda1 is the root filesystem. The writes were most likely by MySQL,
but I would have to run iotop to be sure.

> [ The output of "lsdrv" [1] might be useful here, along with
> "mdadm -D /dev/md0" and "mdadm -E /dev/[b-j]" ]

Here you go: http://pastebin.ca/2174740

> MythTV is trying to flush recorded video to disk, I presume. Sync is
> known to cause stalls--a great deal of work is on-going to improve
> this. How old is this kernel?

After rebooting, MythTV is currently recording two shows, and the resync
is running at full speed.

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdh1[0] sdd1[9] sde1[10] sdb1[6] sdi1[7] sdc1[4] sdf1[3] sdg1[8] sdj1[1]
      6837311488 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9] [U]
      [=>...]  resync =  9.3% (91363840/976758784) finish=1434.3min speed=10287K/sec

unused devices: <none>

atop shows the avio of all the drives to be less than 1ms, where before
they were much higher. It will run for a couple days under load just
fine, and then it will come to a halt.

It's a 3.2.21 kernel. I'm running Debian Testing, and the exact Debian
package version is:

ii  linux-image-3.2.0-3-686-pae  3.2.21-3  Linux 3.2 for modern PCs

>> [51000.672258] [<c12c409f>] ? sysenter_do_call+0x12/0x28
>> [51000.672261] [<c12b>] ? quirk_usb_early_handoff+0x4a9/0x522
>>
>> Here is some other possibly relevant info:
>>
>> # cat /proc/mdstat
>> Personalities : [raid6] [raid5] [raid4]
>> md0 : active raid6 sdh1[0] sdd1[9] sde1[10] sdb1[6] sdi1[7] sdc1[4]
>> sdf1[3] sdg1[8] sdj1[1]
>>       6837311488 blocks super 1.2 level 6, 512k chunk, algorithm 2
>> [9/9] [U]
>>       [==>..]  resync = 51.3% (501954432/976758784)
>> finish=28755.6min speed=275K/sec
>
> Is this resync a weekly check, or did something else trigger it?

This is not a scheduled check. It was triggered by, I believe, an
unclean shutdown. An unclean shutdown will trigger a resync. I don't
think it used to do this, but I could be remembering wrong.

>> unused devices: <none>
>>
>> # cat /proc/sys/dev/raid/speed_limit_min
>> 1
>
> MD is unable to reach its minimum rebuild rate while other system
> activity is ongoing. You might want to lower this number to see if that
> gets you out of the stalls.
>
> Or temporarily shut down mythtv.

I will try lowering those numbers next time this happens, which will
probably be within the next day or two. That's about how often this
happens.

>> # cat /proc/sys/dev/raid/speed_limit_max
>> 20
>>
>> Thanks in advance!
>> -- Kevin
>
> HTH,
>
> Phil
>
> [1] http://github.com/pturmel/lsdrv

Thanks!
-- Kevin
Re: RAID extremely slow
[Added linux-raid to the CC]

Hi Kevin,

Notes interleaved:

On 07/25/2012 06:52 PM, Kevin Ross wrote:
> Hello,
>
> I'm having a problem. After a while, my software RAID rebuild becomes
> extremely slow, and the filesystem on the RAID is essentially blocked.
> I don't know what is causing this. I guess it could be a bad drive, but
> how can I find out?

Probably not. That pretty much always shows up in dmesg.

> I used atop to show the transfer speeds to each drive. Here's a
> screenshot:
> http://img402.imageshack.us/img402/6484/screenshotfrom201207251.png

Piles of small reads scattered across multiple drives, and a
concentration of queued writes to /dev/sda. What's on /dev/sda?
It's not a member of the raid, so it must be some other system task
involved.

[ The output of "lsdrv" [1] might be useful here, along with
"mdadm -D /dev/md0" and "mdadm -E /dev/[b-j]" ]

> "smartctl -a" for all the drives looks good to me, no pending failures,
> or errors logged. dmesg doesn't report anything wrong with any of the
> drives. It does, however, report lots of hung tasks, which are trying
> to access the RAID volume. For example:
>
> [51000.672064] INFO: task mythbackend:10677 blocked for more than 120
> seconds.
> [51000.672098] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [51000.672143] mythbackend D 000e 0 10677 1 0x
> [51000.672146] f38bea00 0086 c1095415 000e 0002 c147aac0
> [51000.672152] f38bebac c147aac0 eb2cff04 003d2f4b c109cacb 01872f02 eb2cfe50
> [51000.672157] c100f28b c13df480 01872f02 eb2cfe68 c10532b1 0069a8d0 f79d6ac0
> [51000.672162] Call Trace:
> [51000.672169] [<c1095415>] ? find_get_pages_tag+0x2f/0xa2
> [51000.672173] [<c109cacb>] ? pagevec_lookup_tag+0x18/0x1e
> [51000.672176] [<c100f28b>] ? read_tsc+0xa/0x28
> [51000.672179] [<c10532b1>] ? timekeeping_get_ns+0x11/0x55
> [51000.672182] [<c10536a4>] ? ktime_get_ts+0x7a/0x82
> [51000.672186] [<c12bea8b>] ? io_schedule+0x4a/0x5f
> [51000.672188] [<c1095659>] ? sleep_on_page+0x5/0x8
> [51000.672191] [<c12bedeb>] ? __wait_on_bit+0x2f/0x54
> [51000.672193] [<c1095654>] ? lock_page+0x1d/0x1d
> [51000.672196] [<c1095754>] ? wait_on_page_bit+0x57/0x5e
> [51000.672199] [<c104d171>] ? autoremove_wake_function+0x29/0x29
> [51000.672201] [<c1095823>] ? filemap_fdatawait_range+0x71/0x11e
> [51000.672205] [<c109630f>] ? filemap_write_and_wait_range+0x3e/0x4c
> [51000.672232] [<f86bfb39>] ? xfs_file_fsync+0x68/0x214 [xfs]
> [51000.672246] [<f86bfad1>] ? xfs_file_splice_write+0x144/0x144 [xfs]
> [51000.672249] [<c10e7e3b>] ? vfs_fsync_range+0x27/0x2d
> [51000.672252] [<c10e7e52>] ? vfs_fsync+0x11/0x15
> [51000.672254] [<c10e80b8>] ? sys_fdatasync+0x20/0x2e

MythTV is trying to flush recorded video to disk, I presume. Sync is
known to cause stalls--a great deal of work is on-going to improve
this. How old is this kernel?

> [51000.672258] [<c12c409f>] ? sysenter_do_call+0x12/0x28
> [51000.672261] [<c12b>] ? quirk_usb_early_handoff+0x4a9/0x522
>
> Here is some other possibly relevant info:
>
> # cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid6 sdh1[0] sdd1[9] sde1[10] sdb1[6] sdi1[7] sdc1[4]
> sdf1[3] sdg1[8] sdj1[1]
>       6837311488 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9]
> [U]
>       [==>..]  resync = 51.3% (501954432/976758784)
> finish=28755.6min speed=275K/sec

Is this resync a weekly check, or did something else trigger it?

> unused devices: <none>
>
> # cat /proc/sys/dev/raid/speed_limit_min
> 1

MD is unable to reach its minimum rebuild rate while other system
activity is ongoing. You might want to lower this number to see if that
gets you out of the stalls.

Or temporarily shut down mythtv.

> # cat /proc/sys/dev/raid/speed_limit_max
> 20
>
> Thanks in advance!
> -- Kevin

HTH,

Phil

[1] http://github.com/pturmel/lsdrv
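The suggestion to lower the minimum rebuild rate can be sketched as follows. set_speed_limit_min is a hypothetical helper; the control file it defaults to is the real md sysctl (values are in K/sec per device, and writing it needs root). The demonstration runs against a scratch file so nothing on the system changes, and the values 1000 and 100 are illustrative only, not recommendations from this thread.

```shell
#!/bin/sh
# set_speed_limit_min is a hypothetical helper.
#   $1: new floor in K/sec (whole number)
#   $2: control file; defaults to the real md sysctl, which needs root
set_speed_limit_min() {
    file=${2:-/proc/sys/dev/raid/speed_limit_min}
    case "$1" in
        ''|*[!0-9]*) echo "not a whole number: $1" >&2; return 1 ;;
    esac
    old=$(cat "$file") || return 1
    echo "$1" > "$file" || return 1
    echo "speed_limit_min: $old -> $1"
}

# Demonstration against a scratch file. The starting value 1000 and the
# new value 100 are made up for illustration.
scratch=$(mktemp)
echo 1000 > "$scratch"
set_speed_limit_min 100 "$scratch"
rm -f "$scratch"
```

On a live system, running "set_speed_limit_min 100" as root would write the real control file; "sysctl -w dev.raid.speed_limit_min=100" should have the same effect. Neither survives a reboot.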
RAID extremely slow
Hello,

I'm having a problem. After a while, my software RAID rebuild becomes
extremely slow, and the filesystem on the RAID is essentially blocked.
I don't know what is causing this. I guess it could be a bad drive, but
how can I find out?

I used atop to show the transfer speeds to each drive. Here's a
screenshot:
http://img402.imageshack.us/img402/6484/screenshotfrom201207251.png

"smartctl -a" for all the drives looks good to me, no pending failures,
or errors logged. dmesg doesn't report anything wrong with any of the
drives. It does, however, report lots of hung tasks, which are trying
to access the RAID volume. For example:

[51000.672064] INFO: task mythbackend:10677 blocked for more than 120 seconds.
[51000.672098] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[51000.672143] mythbackend D 000e 0 10677 1 0x
[51000.672146] f38bea00 0086 c1095415 000e 0002 c147aac0
[51000.672152] f38bebac c147aac0 eb2cff04 003d2f4b c109cacb 01872f02 eb2cfe50
[51000.672157] c100f28b c13df480 01872f02 eb2cfe68 c10532b1 0069a8d0 f79d6ac0
[51000.672162] Call Trace:
[51000.672169] [<c1095415>] ? find_get_pages_tag+0x2f/0xa2
[51000.672173] [<c109cacb>] ? pagevec_lookup_tag+0x18/0x1e
[51000.672176] [<c100f28b>] ? read_tsc+0xa/0x28
[51000.672179] [<c10532b1>] ? timekeeping_get_ns+0x11/0x55
[51000.672182] [<c10536a4>] ? ktime_get_ts+0x7a/0x82
[51000.672186] [<c12bea8b>] ? io_schedule+0x4a/0x5f
[51000.672188] [<c1095659>] ? sleep_on_page+0x5/0x8
[51000.672191] [<c12bedeb>] ? __wait_on_bit+0x2f/0x54
[51000.672193] [<c1095654>] ? lock_page+0x1d/0x1d
[51000.672196] [<c1095754>] ? wait_on_page_bit+0x57/0x5e
[51000.672199] [<c104d171>] ? autoremove_wake_function+0x29/0x29
[51000.672201] [<c1095823>] ? filemap_fdatawait_range+0x71/0x11e
[51000.672205] [<c109630f>] ? filemap_write_and_wait_range+0x3e/0x4c
[51000.672232] [<f86bfb39>] ? xfs_file_fsync+0x68/0x214 [xfs]
[51000.672246] [<f86bfad1>] ? xfs_file_splice_write+0x144/0x144 [xfs]
[51000.672249] [<c10e7e3b>] ? vfs_fsync_range+0x27/0x2d
[51000.672252] [<c10e7e52>] ? vfs_fsync+0x11/0x15
[51000.672254] [<c10e80b8>] ? sys_fdatasync+0x20/0x2e
[51000.672258] [<c12c409f>] ? sysenter_do_call+0x12/0x28
[51000.672261] [<c12b>] ? quirk_usb_early_handoff+0x4a9/0x522

Here is some other possibly relevant info:

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdh1[0] sdd1[9] sde1[10] sdb1[6] sdi1[7] sdc1[4] sdf1[3] sdg1[8] sdj1[1]
      6837311488 blocks super 1.2 level 6, 512k chunk, algorithm 2 [9/9] [U]
      [==>..]  resync = 51.3% (501954432/976758784) finish=28755.6min speed=275K/sec

unused devices: <none>

# cat /proc/sys/dev/raid/speed_limit_min
1

# cat /proc/sys/dev/raid/speed_limit_max
20

Thanks in advance!
-- Kevin
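When dmesg fills up with hung-task reports like the one above, a quick tally per process can help show what is blocked. The following is a rough sketch: hung_task_summary is a hypothetical helper that matches the "INFO: task NAME:PID blocked for more than" wording shown in this message, and the sample input reuses that line (the second occurrence, with a later timestamp, is made up for illustration).

```shell
#!/bin/sh
# hung_task_summary is a hypothetical helper: it reads kernel log text on
# stdin and prints how many hung-task reports each process name collected,
# most frequent first.
hung_task_summary() {
    # Keep only the task name from each hung-task report line.
    sed -n 's/.*INFO: task \(.*\):[0-9][0-9]* blocked for more than.*/\1/p' |
        sort | uniq -c | sort -rn
}

# Sample input; the last line is an invented repeat of the first.
hung_task_summary <<'EOF'
[51000.672064] INFO: task mythbackend:10677 blocked for more than 120 seconds.
[51000.672162] Call Trace:
[51120.672064] INFO: task mythbackend:10677 blocked for more than 120 seconds.
EOF
```

On a live system the input would come from the kernel log, for example "dmesg | hung_task_summary".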