mdadm: unable to add a disk to degraded raid1 array
In case someone else happens upon this: I have found that mdadm >= v2.6.2
cannot add a disk to a degraded raid1 array created with mdadm < 2.6.2.  I
bisected the problem down to mdadm git commit
2fb749d1b7588985b1834e43de4ec5685d0b8d26, which appears to make an
incompatible change to the super block's 'data_size' field.

--- sdb1-sb-good.hex	2007-12-12 14:31:42.0 +
+++ sdb1-sb-bad.hex	2007-12-12 14:31:36.0 +
@@ -6,12 +6,12 @@
 050 60d8 0077 0004 060 *
-080 60d8 0077
+080 60d0 0077

Which trips up the

    if (rdev->size < le64_to_cpu(sb->data_size)/2)

check in super_1_load [1], resulting in:

    mdadm: add new device failed for /dev/sdb1 as 4: Invalid argument

--
Dan

[1] http://lxr.linux.no/linux/drivers/md/md.c#L1148
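For anyone trying to compare superblocks the same way: the exact commands
used to produce sdb1-sb-good.hex / sdb1-sb-bad.hex are not shown above, but
a rough equivalent sketch would be the following (this assumes 1.1/1.2
metadata near the start of the partition; for 1.0 metadata the superblock
sits at the end of the device, so the offset would differ):

    # dump the first 64KiB of the member and render it as 16-bit hex words
    dd if=/dev/sdb1 bs=4k count=16 2>/dev/null | od -Ax -tx2 > sdb1-sb.hex

    # mdadm can also decode the size fields directly
    mdadm --examine /dev/sdb1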
Re: 2.6.24-rc6 reproducible raid5 hang
hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on
the same 64k chunk array and had raised the stripe_cache_size to 1024...
and got a hang.

this time i grabbed stripe_cache_active before bumping the size again -- it
was only 905 active.  as i recall, in the bug we were debugging a year+ ago
the active count was at the size limit when it would hang.  so this is
probably something new.

anyhow raising it to 2048 got it unstuck, but i'm guessing i'll be able to
hit that limit too if i try harder :)

btw what units are stripe_cache_size/active in?  is the memory consumed
equal to (chunk_size * raid_disks * stripe_cache_size) or (chunk_size *
raid_disks * stripe_cache_active)?

-dean

On Thu, 27 Dec 2007, dean gaudet wrote:

> hmm this seems more serious... i just ran into it with chunksize 64KiB and
> while just untarring a bunch of linux kernels in parallel... increasing
> stripe_cache_size did the trick again.
>
> -dean
>
> On Thu, 27 Dec 2007, dean gaudet wrote:
>
> > hey neil -- remember that raid5 hang which me and only one or two others
> > ever experienced and which was hard to reproduce?  we were debugging it
> > well over a year ago (that box has 400+ day uptime now so at least that
> > long ago :)  the workaround was to increase stripe_cache_size...
> >
> > i seem to have a way to reproduce something which looks much the same.
> >
> > setup:
> > - 2.6.24-rc6
> > - system has 8GiB RAM but no swap
> > - 8x750GB in a raid5 with one spare, chunksize 1024KiB
> > - mkfs.xfs default options
> > - mount -o noatime
> > - dd if=/dev/zero of=/mnt/foo bs=4k count=2621440
> >
> > that sequence hangs for me within 10 seconds... and i can unhang / rehang
> > it by toggling between stripe_cache_size 256 and 1024.  i detect the hang
> > by watching iostat -kx /dev/sd? 5.
> >
> > i've attached the kernel log where i dumped task and timer state while it
> > was hung... note that you'll see at some point i did an xfs mount with
> > external journal, but it happens with internal journal as well.
> >
> > looks like it's using the raid456 module and async api.
> >
> > anyhow let me know if you need more info / have any suggestions.
> >
> > -dean
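For reference, the knobs being poked above live in sysfs.  A minimal sketch
of the inspect/unstick cycle described here -- the md device name (md2) is
taken from later in the thread and is an assumption:

    cat /sys/block/md2/md/stripe_cache_active   # stripes currently in use
    cat /sys/block/md2/md/stripe_cache_size     # cache size, in entries

    # the workaround: grow the stripe cache until the array unsticks
    echo 1024 > /sys/block/md2/md/stripe_cache_size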
Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
On Tue, 25 Dec 2007, Bill Davidsen wrote:

> The issue I'm thinking about is hardware sector size, which on modern
> drives may be larger than 512b and therefore entail a read-alter-rewrite
> (RAR) cycle when writing a 512b block.

i'm not sure any shipping SATA disks have larger than 512B sectors yet...
do you know of any?  (or is this thread about SCSI which i don't pay
attention to...)

on a brand new WDC WD7500AAKS-00RBA0 with this partition layout:

    255 heads, 63 sectors/track, 91201 cylinders
    Units = cylinders of 16065 * 512 = 8225280 bytes

so sda1 starts at a non-multiple of 4096 into the disk.

i ran some random seek+write experiments using
http://arctic.org/~dean/randomio/ -- here are the results using 512 byte
and 4096 byte writes (fsync after each write), 8 threads, on sda1:

# ./randomio /dev/sda1 8 1 1 512 10 6
  total |  read:         latency (ms)        |  write:        latency (ms)
   iops |   iops   min    avg    max    sdev |   iops   min    avg    max    sdev
--------+------------------------------------+------------------------------------
  148.5 |    0.0   inf    nan    0.0    nan  |  148.5   0.2   53.7   89.3   19.5
  129.2 |    0.0   inf    nan    0.0    nan  |  129.2  37.2   61.9   96.7    9.3
  131.2 |    0.0   inf    nan    0.0    nan  |  131.2  40.3   61.0   90.4    9.3
  132.0 |    0.0   inf    nan    0.0    nan  |  132.0  39.6   60.6   89.3    9.1
  130.7 |    0.0   inf    nan    0.0    nan  |  130.7  39.8   61.3   98.1    8.9
  131.4 |    0.0   inf    nan    0.0    nan  |  131.4  40.0   60.8  101.0    9.6

# ./randomio /dev/sda1 8 1 1 4096 10 6
  total |  read:         latency (ms)        |  write:        latency (ms)
   iops |   iops   min    avg    max    sdev |   iops   min    avg    max    sdev
--------+------------------------------------+------------------------------------
  141.7 |    0.0   inf    nan    0.0    nan  |  141.7   0.3   56.3   99.3   21.1
  132.4 |    0.0   inf    nan    0.0    nan  |  132.4  43.3   60.4   91.8    8.5
  131.6 |    0.0   inf    nan    0.0    nan  |  131.6  41.4   60.9  111.0    9.6
  131.8 |    0.0   inf    nan    0.0    nan  |  131.8  41.4   60.7   85.3    8.6
  130.6 |    0.0   inf    nan    0.0    nan  |  130.6  41.7   61.3   95.0    9.4
  131.4 |    0.0   inf    nan    0.0    nan  |  131.4  42.2   60.8   90.5    8.4

i think the anomalous results in the first 10s samples are perhaps the drive
coming out of a standby state.

and here are the results aligned using the sda raw device itself:

# ./randomio /dev/sda 8 1 1 512 10 6
  total |  read:         latency (ms)        |  write:        latency (ms)
   iops |   iops   min    avg    max    sdev |   iops   min    avg    max    sdev
--------+------------------------------------+------------------------------------
  147.3 |    0.0   inf    nan    0.0    nan  |  147.3   0.3   54.1   93.7   20.1
  132.4 |    0.0   inf    nan    0.0    nan  |  132.4  37.4   60.6   91.8    9.2
  132.5 |    0.0   inf    nan    0.0    nan  |  132.5  37.7   60.3   93.7    9.3
  131.8 |    0.0   inf    nan    0.0    nan  |  131.8  39.4   60.7   92.7    9.0
  133.9 |    0.0   inf    nan    0.0    nan  |  133.9  41.7   59.8   90.7    8.5
  130.2 |    0.0   inf    nan    0.0    nan  |  130.2  40.8   61.5   88.6    8.9

# ./randomio /dev/sda 8 1 1 4096 10 6
  total |  read:         latency (ms)        |  write:        latency (ms)
   iops |   iops   min    avg    max    sdev |   iops   min    avg    max    sdev
--------+------------------------------------+------------------------------------
  145.4 |    0.0   inf    nan    0.0    nan  |  145.4   0.3   54.9   94.0   20.1
  130.3 |    0.0   inf    nan    0.0    nan  |  130.3  36.0   61.4   92.7    9.6
  130.6 |    0.0   inf    nan    0.0    nan  |  130.6  38.2   61.2   96.7    9.2
  132.1 |    0.0   inf    nan    0.0    nan  |  132.1  39.0   60.5   93.5    9.2
  131.8 |    0.0   inf    nan    0.0    nan  |  131.8  43.1   60.8   93.8    9.1
  129.0 |    0.0   inf    nan    0.0    nan  |  129.0  40.2   62.0   96.4    8.8

it looks pretty much the same to me...

-dean
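The "non-multiple of 4096" claim follows directly from the classic DOS
partitioning default of starting the first partition at sector 63; a quick
way to check this on any disk (sda/sda1 here are simply the devices from the
test above):

    cat /sys/block/sda/sda1/start      # -> 63 (512-byte sectors)
    echo $(( 63 * 512 ))               # -> 32256 bytes into the disk
    echo $(( (63 * 512) % 4096 ))      # -> 3584, i.e. not 4KiB-aligned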
Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
On Sat, 29 Dec 2007, dean gaudet wrote:

> On Tue, 25 Dec 2007, Bill Davidsen wrote:
>
> > The issue I'm thinking about is hardware sector size, which on modern
> > drives may be larger than 512b and therefore entail a read-alter-rewrite
> > (RAR) cycle when writing a 512b block.
>
> i'm not sure any shipping SATA disks have larger than 512B sectors yet...
> do you know of any?  (or is this thread about SCSI which i don't pay
> attention to...)
>
> on a brand new WDC WD7500AAKS-00RBA0 with this partition layout:
>
>     255 heads, 63 sectors/track, 91201 cylinders
>     Units = cylinders of 16065 * 512 = 8225280 bytes
>
> so sda1 starts at a non-multiple of 4096 into the disk.
>
> [randomio results for the misaligned sda1 and the aligned raw sda snipped
> -- quoted in full in the previous message]
>
> it looks pretty much the same to me...
>
> -dean

Good to know / have it confirmed by someone else: the alignment does not
matter with Linux/SW RAID.

Justin.
Re: 2.6.24-rc6 reproducible raid5 hang
On Dec 29, 2007 9:48 AM, dean gaudet [EMAIL PROTECTED] wrote:

> hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on
> the same 64k chunk array and had raised the stripe_cache_size to 1024...
> and got a hang.  this time i grabbed stripe_cache_active before bumping the
> size again -- it was only 905 active.  as i recall the bug we were
> debugging a year+ ago the active was at the size when it would hang.  so
> this is probably something new.

I believe I am seeing the same issue and am trying to track down whether XFS
is doing something unexpected, i.e. I have not been able to reproduce the
problem with EXT3.

MD tries to increase throughput by letting some stripe work build up in
batches.  It looks like every time your system has hung it has been in the
'inactive_blocked' state, i.e. more than 3/4 of stripes active.  This state
should automatically clear...

> anyhow raising it to 2048 got it unstuck, but i'm guessing i'll be able to
> hit that limit too if i try harder :)

Once you hang, if 'stripe_cache_size' is increased such that
stripe_cache_active < 3/4 * stripe_cache_size, things will start flowing
again.

> btw what units are stripe_cache_size/active in?  is the memory consumed
> equal to (chunk_size * raid_disks * stripe_cache_size) or (chunk_size *
> raid_disks * stripe_cache_active)?

memory_consumed = PAGE_SIZE * raid_disks * stripe_cache_size

> -dean

--
Dan
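Plugging the 8-drive array discussed in this thread into that formula, for
the stripe_cache_size values that come up here (assuming 4KiB pages, the
x86-64 default):

    echo $(( 4096 * 8 * 256  / 1048576 ))   # 256 entries  ->  8 MiB
    echo $(( 4096 * 8 * 1024 / 1048576 ))   # 1024 entries -> 32 MiB
    echo $(( 4096 * 8 * 2048 / 1048576 ))   # 2048 entries -> 64 MiB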
Re: 2.6.24-rc6 reproducible raid5 hang
On Sat, 29 Dec 2007, Dan Williams wrote:

> On Dec 29, 2007 9:48 AM, dean gaudet [EMAIL PROTECTED] wrote:
> > hmm bummer, i'm doing another test (rsync 3.5M inodes from another box)
> > on the same 64k chunk array and had raised the stripe_cache_size to
> > 1024... and got a hang.  [...]
>
> I believe I am seeing the same issue and am trying to track down whether
> XFS is doing something unexpected, i.e. I have not been able to reproduce
> the problem with EXT3.
>
> MD tries to increase throughput by letting some stripe work build up in
> batches.  It looks like every time your system has hung it has been in the
> 'inactive_blocked' state, i.e. more than 3/4 of stripes active.  This state
> should automatically clear...

cool, glad you can reproduce it :)

i have a bit more data... i'm seeing the same problem on debian's
2.6.22-3-amd64 kernel, so it's not new in 2.6.24.

i'm doing some more isolation but just grabbing kernels i have precompiled
so far -- a 2.6.19.7 kernel doesn't show the problem, and early indications
are a 2.6.21.7 kernel also doesn't have the problem, but i'm giving it
longer to show its head.  i'll try a stock 2.6.22 next depending on how the
2.6.21 test goes, just so we get the debian patches out of the way.

i was tempted to blame the async api because it's newish :) but according to
the dmesg output it doesn't appear the 2.6.22-3-amd64 kernel used the async
API, and it still hung, so async is probably not to blame.

anyhow the test case i'm using is the dma_thrasher script i attached... it
takes about an hour to give me confidence there's no problems so this will
take a while.

-dean
[patch] improve stripe_cache_size documentation
Document the amount of memory used by the stripe cache and the fact that
it's tied down and unavailable for other purposes (right?).  thanks to Dan
Williams for the formula.

-dean

Signed-off-by: dean gaudet [EMAIL PROTECTED]

Index: linux/Documentation/md.txt
===================================================================
--- linux.orig/Documentation/md.txt	2007-12-29 13:01:25.0 -0800
+++ linux/Documentation/md.txt	2007-12-29 13:04:17.0 -0800
@@ -438,5 +438,11 @@
   stripe_cache_size  (currently raid5 only)
       number of entries in the stripe cache.  This is writable, but
       there are upper and lower limits (32768, 16).  Default is 128.
+
+      The stripe cache memory is locked down and not available for other uses.
+      The total size of the stripe cache is determined by this formula:
+
+            PAGE_SIZE * raid_disks * stripe_cache_size
+
   strip_cache_active (currently raid5 only)
       number of active entries in the stripe cache
Re: 2.6.24-rc6 reproducible raid5 hang
On Sat, 29 Dec 2007, dean gaudet wrote:

> On Sat, 29 Dec 2007, Dan Williams wrote:
> > I believe I am seeing the same issue and am trying to track down whether
> > XFS is doing something unexpected, i.e. I have not been able to reproduce
> > the problem with EXT3.  [...]
>
> cool, glad you can reproduce it :)
>
> i have a bit more data... i'm seeing the same problem on debian's
> 2.6.22-3-amd64 kernel, so it's not new in 2.6.24.
>
> [...]
>
> anyhow the test case i'm using is the dma_thrasher script i attached... it
> takes about an hour to give me confidence there's no problems so this will
> take a while.
>
> -dean

Dean,

Curious, btw: what kind of filesystem size / raid type (5, but defaults I
assume, nothing special, right? right-symmetric vs. left-symmetric, etc.) /
cache size / chunk size(s) are you using/testing with?

The script you sent out earlier -- you are able to reproduce it easily with
31 or so kernel tar decompressions?

Justin.
Re: 2.6.24-rc6 reproducible raid5 hang
On Dec 29, 2007 1:58 PM, dean gaudet [EMAIL PROTECTED] wrote:

> On Sat, 29 Dec 2007, Dan Williams wrote:
> > I believe I am seeing the same issue and am trying to track down whether
> > XFS is doing something unexpected, i.e. I have not been able to reproduce
> > the problem with EXT3.  [...]
>
> cool, glad you can reproduce it :)
>
> i have a bit more data... i'm seeing the same problem on debian's
> 2.6.22-3-amd64 kernel, so it's not new in 2.6.24.

This is just brainstorming at this point, but it looks like xfs can submit
more requests in the bi_end_io path such that it can lock itself out of the
RAID array.  The sequence that concerns me is:

    return_io -> xfs_buf_end_io -> xfs_buf_io_end -> xfs_buf_iodone_work ->
    xfs_buf_iorequest -> make_request -> hang

I need to verify whether this path is actually triggering, but if we are in
an inactive_blocked condition this new request will be put on a wait queue
and we'll never get to the release_stripe() call after return_io().

It would be interesting to see if this is new XFS behavior in recent
kernels.

--
Dan
Re: 2.6.24-rc6 reproducible raid5 hang
On Sat, 29 Dec 2007, Justin Piszcz wrote:

> Curious, btw: what kind of filesystem size / raid type (5, but defaults I
> assume, nothing special, right? right-symmetric vs. left-symmetric, etc.) /
> cache size / chunk size(s) are you using/testing with?

mdadm --create --level=5 --chunk=64 -n7 -x1 /dev/md2 /dev/sd[a-h]1
mkfs.xfs -f /dev/md2

otherwise defaults

> The script you sent out earlier -- you are able to reproduce it easily with
> 31 or so kernel tar decompressions?

not sure, the point of the script is to untar more than there is RAM.  it
happened with a single rsync running though -- 3.5M inodes from a remote
box.  it also happens with the single 10GB dd write... although i've been
using the tar method for testing different kernel revs.

-dean
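The dma_thrasher script itself was only sent as an attachment and is not
reproduced in this thread.  Purely as an illustration of the approach dean
describes (untar more data than there is RAM, in parallel, onto the array
under test), a hypothetical stand-in might look like this -- the argument
order follows the usage shown later in the thread, and the parallel count is
an assumption:

    #!/bin/sh
    # hypothetical sketch, NOT the original dma_thrasher script
    TARBALL=$1        # e.g. linux.tar.gz
    TARGET=$2         # e.g. /mnt/new
    COUNT=31          # enough kernel trees to exceed 8GiB of RAM

    for i in $(seq 1 "$COUNT"); do
        mkdir -p "$TARGET/tree-$i"
        tar -xzf "$TARBALL" -C "$TARGET/tree-$i" &
    done
    wait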
Re: 2.6.24-rc6 reproducible raid5 hang
On Sat, 29 Dec 2007, dean gaudet wrote:

> On Sat, 29 Dec 2007, Justin Piszcz wrote:
> > Curious, btw: what kind of filesystem size / raid type / cache size /
> > chunk size(s) are you using/testing with?
>
> mdadm --create --level=5 --chunk=64 -n7 -x1 /dev/md2 /dev/sd[a-h]1
> mkfs.xfs -f /dev/md2
>
> otherwise defaults

hmm i missed a few things, here's exactly how i created the array:

mdadm --create --level=5 --chunk=64 -n7 -x1 --assume-clean /dev/md2 /dev/sd[a-h]1

it's reassembled automagically each reboot, but i do this each reboot:

mkfs.xfs -f /dev/md2
mount -o noatime /dev/md2 /mnt/new
./dma_thrasher linux.tar.gz /mnt/new

the --assume-clean and noatime probably make no difference though...

on the bisection front it looks like it's new behaviour between 2.6.21.7 and
2.6.22.15 (stock kernels now, not debian).  i've got to step out for a
while, but i'll go at it again later, probably with git bisect unless
someone has some cherry picked changes to suggest.

-dean
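For completeness, the bisection dean mentions would look roughly like this,
assuming a kernel git tree that carries both the v2.6.21.7 and v2.6.22.15
tags; each step means building, booting and re-running the thrasher until it
either hangs or survives:

    git bisect start
    git bisect bad  v2.6.22.15    # known to hang
    git bisect good v2.6.21.7     # known good so far
    # build/boot/test the commit git checks out, then mark it:
    git bisect good               # or: git bisect bad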
Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
Justin Piszcz wrote:
[...]
> Good to know / have it confirmed by someone else: the alignment does not
> matter with Linux/SW RAID.

Alignment does matter when one partitions a Linux/SW RAID array.  If the
inner partitions are not aligned on a stripe boundary -- especially in the
worst case, when filesystem blocks cross the stripe boundary (I wonder if
that's even possible... and I think it is: if a partition starts at some odd
512-byte boundary and the filesystem block size is 4Kb) -- there's just no
chance for the inner filesystem to ever do full-stripe writes, so (modulo
the stripe cache) all writes will go the read-modify-write or similar way.

And that's what the original article is about, by the way.  It just happens
that a hardware raid array is more often split into partitions (using native
tools) than a linux software raid array is.  And that's what has been
pointed out in this thread as well... ;)

/mjt
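A worked example of that point, using the array geometry from this thread
(64KiB chunk, 7-drive raid5, hence 6 data disks per full stripe); the
partition-inside-md scenario itself is hypothetical:

    CHUNK=$(( 64 * 1024 ))          # 64 KiB chunk
    STRIPE=$(( CHUNK * 6 ))         # 393216 bytes of data per full stripe

    # a partition starting at the classic sector 63 is misaligned:
    echo $(( 63 * 512 % 4096 ))     # -> 3584: 4KiB fs blocks never sit on 4KiB boundaries
    echo $(( 63 * 512 % STRIPE ))   # -> 32256: writes can never be full-stripe aligned

    # starting it on a stripe boundary avoids the read-modify-write penalty:
    echo $(( 768 * 512 % STRIPE ))  # -> 0 (sector 768 is exactly one full stripe in)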