Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
Robin Hill wrote:
> On Wed Dec 19, 2007 at 09:50:16AM -0500, Justin Piszcz wrote:
>> The (up to) 30% figure is mentioned here:
>> http://insights.oetiker.ch/linux/raidoptimization.html
>
> That looks to be referring to partitioning a RAID device - this'll
> only apply to hardware RAID or partitionable software RAID, not to
> the normal use case. When you're creating an array out of standard
> partitions then you know the array stripe size will align with the
> disks (there's no way it cannot), and you can set the filesystem
> stripe size to align as well (XFS will do this automatically). I've
> actually done tests on this with hardware RAID to try to find the
> correct partition offset, but wasn't able to see any difference
> (using bonnie++ and moving the partition start by one sector at a
> time).
>
>> # fdisk -l /dev/sdc
>>
>> Disk /dev/sdc: 150.0 GB, 150039945216 bytes
>> 255 heads, 63 sectors/track, 18241 cylinders
>> Units = cylinders of 16065 * 512 = 8225280 bytes
>> Disk identifier: 0x5667c24a
>>
>>    Device Boot      Start         End      Blocks   Id  System
>> /dev/sdc1               1       18241   146520801   fd  Linux raid autodetect
>
> This looks to be a normal disk - the partition offsets shouldn't be
> relevant here (barring any knowledge of the actual physical disk
> layout anyway, and block remapping may well make that rather
> irrelevant).

The issue I'm thinking about is hardware sector size, which on modern drives may be larger than 512B and therefore entail a read-alter-rewrite (RAR) cycle when writing a 512B block. With larger writes, if the alignment is poor and the write size is some multiple of 512, it's possible to incur an RAR at each end of the write. The only way to have a hope of controlling the alignment is to write to a raw device, or to use a filesystem which can be configured to have blocks that are a multiple of the sector size, to do all i/o in whole blocks, and to start each file on a block boundary. That may be possible with ext[234] set up properly.
Why this is important: the physical layout of the drive is useful to know, but for a large write the drive will have to make some number of steps from one cylinder to another. By carefully choosing the starting point, the best improvement available is to eliminate two track-to-track seek times, one at the start and one at the end of the write. If the writes are small, only one track-to-track saving is possible.

Now consider a RAR process. The drive typically spins at 7200 rpm, or 8.333 ms/rev. A read takes 0.5 rev of rotational latency on average, while a RAR takes 1.5 rev, because a full revolution must pass after the original data is read before the altered data can be rewritten. Larger sectors give more capacity, but reduced performance for writes, and doing small writes can mean paying the RAR penalty on every write. So there may be a measurable benefit to getting that alignment right at the drive level.

-- 
Bill Davidsen [EMAIL PROTECTED]
  "Woe unto the statesman who makes war without a reason that will
  still be valid when the war is over..."  Otto von Bismarck

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
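Bill's rotational-latency arithmetic can be checked in a few lines (a sketch of the averages stated above; seek time and controller overhead are deliberately ignored):

```python
RPM = 7200
ms_per_rev = 60_000 / RPM          # 8.333 ms per revolution at 7200 rpm

avg_read_ms = 0.5 * ms_per_rev     # average rotational latency for a read
rar_write_ms = 1.5 * ms_per_rev    # read, wait a full revolution, rewrite

print(f"{ms_per_rev:.3f} ms/rev, read ~{avg_read_ms:.2f} ms, "
      f"RAR write ~{rar_write_ms:.2f} ms")
```

So a misaligned small write costs roughly three times the rotational latency of a plain read, which is where the large measured penalties come from.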
Re: raid10 performance question
On Sun, 23 Dec 2007 08:26:55 -0600, Jon Nelson [EMAIL PROTECTED] said:

> I've found in some tests that raid10,f2 gives me the best I/O of any
> raid5 or raid10 format.

Mostly, depending on the type of workload. Anyhow, in general most forms of RAID10 are cool, handle disk losses better, and so on.

> However, the performance of raid10,o2 and raid10,n2 in degraded mode
> is nearly identical to the non-degraded mode performance (for me,
> this hovers around 100MB/s).

You don't say how many drives you have, but this may suggest that your array transfers are limited by the PCI host bus speed.

> raid10,f2 has degraded mode performance, writing, that is
> indistinguishable from its non-degraded mode performance. It's the
> raid10,f2 *read* performance in degraded mode that is strange - I
> get almost exactly 50% of the non-degraded mode read performance.
> Why is that?

Well, the best description I have found of the odd Linux RAID10 modes is here:

  http://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10

The key here is:

  "The driver also supports a far layout where all the drives are
  divided into f sections. Now when there are two sections, as in
  'f2', each block will be written to a block in the first half of
  one disk and to the second half of the next disk. Consider this
  layout for the first 4 blocks on a 2x2 layout, compared to the
  standard layout:

       far (f2)             standard (n2)
     A   B   C   D          A   B   C   D
     1   2   3   4          1   1   2   2
     .   .   .   .          3   3   4   4
     .   .   .   .          .   .   .   .
     ---------------
     4   1   2   3
     .   .   .   .
     .   .   .   .

  This means that with the far layout one can read blocks 1,2,3,4 at
  the same speed as a RAID0 on the outer cylinders of each disk; but
  if one of the disks fails, the mirror blocks have to be read from
  the inner cylinders of the next disk, which are usually a lot
  slower than the outer ones."
Now, there is a very interesting detail here: one idea for getting a fast array is to make it out of large high-density drives and use just the outer cylinders of each drive, thus at the same time having a much smaller range of arm travel and higher transfer rates. The 'f2' layout means that (until a drive fails) for all reads and for short writes MD is effectively using just the outer half of each drive, *as well as* what is effectively a RAID0 layout.

Note that the sustained write speed of 'f2' is going to be the same *across the whole capacity* of the RAID, while the sustained write speed of an 'n2' layout will be higher at the beginning and lower at the end, just like for a single disk.

Interesting, I hadn't realized that, even if I am keenly aware of the non-uniform speeds of disks across cylinders.
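The far-layout geometry quoted above can be modelled in a few lines (a toy sketch: the helper name, 0-based block numbering, and "outer"/"inner" labels are mine, and md's chunking is ignored):

```python
def f2_copies(block, num_disks):
    """Return the two copies of a logical block under raid10,f2:
    one in the outer half of disk (block % num_disks), and one in
    the inner half of the *next* disk, as in the layout quoted above.
    """
    stripe, disk = divmod(block, num_disks)
    return [(disk, "outer", stripe),
            ((disk + 1) % num_disks, "inner", stripe)]

# Blocks 0..3 (the "1 2 3 4" of the diagram) on a 4-disk array:
for b in range(4):
    print(b, f2_copies(b, 4))
```

A sequential read touches only the outer halves, striding disk to disk like RAID0; once a disk fails, some blocks must instead come from the inner half of the next disk, which is the degraded-mode effect discussed in this thread.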
Re: Raid over 48 disks
On Wed, 19 Dec 2007 07:28:20 +1100, Neil Brown [EMAIL PROTECTED] said:

[ ... what to do with 48 drive Sun Thumpers ... ]

neilb> I wouldn't create a raid5 or raid6 on all 48 devices.  RAID5
neilb> only survives a single device failure and with that many
neilb> devices, the chance of a second failure before you recover
neilb> becomes appreciable.

That's just one of the many problems; others are:

* If a drive fails, rebuild traffic is going to hit hard, with 47
  blocks being read in parallel to compute a new 48th.

* With a parity strip length of 48 it will be that much harder to
  avoid read-modify-write, as it will be avoidable only for writes of
  at least 48 blocks aligned on 48-block boundaries. And reading 47
  blocks to write one is going to be quite painful.

[ ... ]

neilb> RAID10 would be a good option if you are happy with 24 drives
neilb> worth of space.

[ ... ]

That sounds like the only feasible option (except for the 3 drive case in most cases). Parity RAID does not scale much beyond 3-4 drives.

neilb> Alternately, 8 6-drive RAID5s or 6 8-drive RAID6s, and use
neilb> RAID0 to combine them together. This would give you adequate
neilb> reliability and performance and still a large amount of
neilb> storage space.

That sounds optimistic to me: the reason to do a RAID50 of 8x(5+1) can only be to have a single filesystem, else one could have 8 distinct filesystems, each with a subtree of the whole. With a single filesystem, the failure of any one of the 8 RAID5 components of the RAID0 will cause the loss of the whole lot.

So in the 47+1 case the loss of any two drives leads to complete loss; in the 8x(5+1) case only the loss of two drives in the same RAID5 does. It does not sound like a great improvement to me (especially considering the thoroughly inane practice of building arrays out of disks of the same make and model taken out of the same box). There are also modest improvements in the RMW strip size and in the cost of a rebuild after a single drive loss.
Probably the reduction in the RMW strip size is the best improvement. Anyhow, let's assume 0.5TB drives: with 47+1 we get a single 23.5TB filesystem, and with 8x(5+1) we get a 20TB filesystem. With current filesystem technology either size is worrying, for example as to the time needed for an 'fsck'.

In practice RAID5 beyond 3-4 drives seems only useful for almost-read-only filesystems where restoring from backups is quick and easy, never mind the 47+1 case or the 8x(5+1) one, and I think that giving some credit even to the latter arrangement is not quite right...
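The capacity and full-stripe arithmetic behind this comparison is easy to spell out (a sketch; the 64KiB chunk size is an assumed example, not a figure from the thread):

```python
DRIVE_TB = 0.5
CHUNK_KIB = 64                       # assumed chunk size, for illustration

# Flat 47+1 RAID5 across all 48 drives
cap_flat = 47 * DRIVE_TB             # usable capacity in TB
rmw_free_flat = 47 * CHUNK_KIB       # KiB of aligned data for an RMW-free write

# RAID0 over 8 x (5+1) RAID5s
cap_nested = 8 * 5 * DRIVE_TB
rmw_free_nested = 5 * CHUNK_KIB      # per component RAID5

print(cap_flat, cap_nested)            # 23.5 20.0
print(rmw_free_flat, rmw_free_nested)  # 3008 320
```

The nested arrangement gives up 3.5TB of capacity but shrinks the aligned write needed to avoid read-modify-write by roughly a factor of nine.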
Re: raid10: unfair disk load?
Richard Scobie wrote:
> Jon Nelson wrote:
>> My own tests on identical hardware (same mobo, disks, partitions,
>> everything) and same software, with the only difference being how
>> mdadm is invoked (the only changes here being level and possibly
>> layout) show that raid0 is about 15% faster on reads than the very
>> fast raid10,f2 layout. raid10,f2 is approx. 50% of the write speed
>> of raid0.
>
> This more or less matches my testing.
>
> Have you tested a stacked RAID 10 made up of 2 drive RAID1 arrays,
> striped together into a RAID0?

That is not raid10, that's raid1+0. See man md.

> I have found this configuration to offer very good performance, at
> the cost of slightly more complexity.

It does; raid0 can be striped over many configurations, raid[156] being the most common.

-- 
Bill Davidsen [EMAIL PROTECTED]
  "Woe unto the statesman who makes war without a reason that will
  still be valid when the war is over..."  Otto von Bismarck
Re: Raid over 48 disks
Peter Grandi wrote:
> On Wed, 19 Dec 2007 07:28:20 +1100, Neil Brown [EMAIL PROTECTED] said:
>
> [ ... what to do with 48 drive Sun Thumpers ... ]
>
> neilb> I wouldn't create a raid5 or raid6 on all 48 devices.  RAID5
> neilb> only survives a single device failure and with that many
> neilb> devices, the chance of a second failure before you recover
> neilb> becomes appreciable.
>
> That's just one of the many problems; others are:
>
> * If a drive fails, rebuild traffic is going to hit hard, with 47
>   blocks being read in parallel to compute a new 48th.
>
> * With a parity strip length of 48 it will be that much harder to
>   avoid read-modify-write, as it will be avoidable only for writes
>   of at least 48 blocks aligned on 48-block boundaries. And reading
>   47 blocks to write one is going to be quite painful.
>
> [ ... ]
>
> neilb> RAID10 would be a good option if you are happy with 24 drives
> neilb> worth of space.
>
> [ ... ]
>
> That sounds like the only feasible option (except for the 3 drive
> case in most cases). Parity RAID does not scale much beyond 3-4
> drives.
>
> neilb> Alternately, 8 6-drive RAID5s or 6 8-drive RAID6s, and use
> neilb> RAID0 to combine them together. This would give you adequate
> neilb> reliability and performance and still a large amount of
> neilb> storage space.
>
> That sounds optimistic to me: the reason to do a RAID50 of 8x(5+1)
> can only be to have a single filesystem, else one could have 8
> distinct filesystems, each with a subtree of the whole. With a
> single filesystem the failure of any one of the 8 RAID5 components
> of the RAID0 will cause the loss of the whole lot.
>
> So in the 47+1 case a loss of any two drives would lead to complete
> loss; in the 8x(5+1) case only a loss of two drives in the same
> RAID5 will. It does not sound like a great improvement to me
> (especially considering the thoroughly inane practice of building
> arrays out of disks of the same make and model taken out of the
> same box).
Quality control just isn't consistent enough for drives coming from the same box to make a big difference, assuming that you have an appropriate number of hot spares online. Note that I said *big* difference - is there some clustering of failures? Some, but damn little. A few years ago I was working with multiple 6TB machines and 20+ 1TB machines, all using small, fast drives in RAID5E. I can't remember a case where a drive failed before a rebuild was complete, and only one or two where there was a failure to degraded mode before the hot spare was replaced. That said, RAID5E can typically rebuild a lot faster than a typical hot spare as a unit drive, at least for any given impact on performance. This undoubtedly reduced our exposure time.

> There are also modest improvements in the RMW strip size and in the
> cost of a rebuild after a single drive loss.
>
> Probably the reduction in the RMW strip size is the best
> improvement. Anyhow, let's assume 0.5TB drives; with a 47+1 we get
> a single 23.5TB filesystem, and with 8*(5+1) we get a 20TB
> filesystem. With current filesystem technology either size is
> worrying, for example as to the time needed for an 'fsck'.

Given that someone is putting a typical filesystem full of small files on a big raid, I agree. But fsck with large files is pretty fast on a given filesystem (200GB files on a 6TB ext3, for instance), due to the small number of inodes in play. While the bitmap resolution is a factor, it's pretty linear; it's fsck with lots of files that gets really slow. And let's face it, the objective of raid is to avoid doing that fsck in the first place ;-)

-- 
Bill Davidsen [EMAIL PROTECTED]
  "Woe unto the statesman who makes war without a reason that will
  still be valid when the war is over..."  Otto von Bismarck
Re: raid10 performance question
On Tue, 25 Dec 2007 19:08:15 +, [EMAIL PROTECTED] (Peter Grandi) said:

[ ... ]

> It's the raid10,f2 *read* performance in degraded mode that is
> strange - I get almost exactly 50% of the non-degraded mode read
> performance. Why is that?

[ ... ]

> the mirror blocks have to be read from the inner cylinders of the
> next disk, which are usually a lot slower than the outer ones.

[ ... ]

Just to be complete, there is of course the other issue, which affects sustained writes too: extra seeks. If disk B fails the situation becomes:

        DISK
     A   X   C   D
     1   X   3   4
     .   .   .   .
     .   .   .   .
     ---------------
     4   X   2   3
     .   .   .   .
     .   .   .   .

Not only must block 2 be read from an inner cylinder, but to read block 3 there must be a seek back to an outer cylinder on the same disk. This is the same well known issue that arises when doing sustained writes with RAID10 'f2'.
Re: [PATCH 001 of 7] md: Support 'external' metadata for md arrays.
On Fri, 14 Dec 2007 17:26:08 +1100 NeilBrown [EMAIL PROTECTED] wrote:

> +	if (strncmp(buf, "external:", 9) == 0) {
> +		int namelen = len-9;
> +		if (namelen >= sizeof(mddev->metadata_type))
> +			namelen = sizeof(mddev->metadata_type)-1;
> +		strncpy(mddev->metadata_type, buf+9, namelen);
> +		mddev->metadata_type[namelen] = 0;
> +		if (namelen && mddev->metadata_type[namelen-1] == '\n')
> +			mddev->metadata_type[--namelen] = 0;
> +		mddev->persistent = 0;
> +		mddev->external = 1;

size_t would be a more appropriate type for `namelen'.
Re: [PATCH 004 of 7] md: Allow devices to be shared between md arrays.
On Fri, 14 Dec 2007 17:26:28 +1100 NeilBrown [EMAIL PROTECTED] wrote:

> +	mddev_unlock(rdev->mddev);
> +	ITERATE_MDDEV(mddev, tmp) {
> +		mdk_rdev_t *rdev2;
> +
> +		mddev_lock(mddev);
> +		ITERATE_RDEV(mddev, rdev2, tmp2)
> +			if (test_bit(AllReserved, &rdev2->flags) ||
> +			    (rdev->bdev == rdev2->bdev &&
> +			     rdev != rdev2 &&
> +			     overlaps(rdev->data_offset, rdev->size,
> +				      rdev2->data_offset, rdev2->size))) {
> +				overlap = 1;
> +				break;
> +			}
> +		mddev_unlock(mddev);
> +		if (overlap) {
> +			mddev_put(mddev);
> +			break;
> +		}
> +	}

eww, ITERATE_MDDEV() and ITERATE_RDEV() are an eyesore. for_each_mddev() and for_each_rdev() would at least mean the reader doesn't need to check the implementation when wondering what that `break' is breaking from.

> #define	In_sync		2	/* device is in_sync with rest of array */
> #define	WriteMostly	4	/* Avoid reading if at all possible */
> #define	BarriersNotsupp	5	/* BIO_RW_BARRIER is not supported */
> +#define	AllReserved	6	/* If whole device is reserved for

The naming style here is inconsistent. A task for the keen would be to convert these to an enum and add some namespacing prefix to them.
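For reference, the `overlaps()` call in the hunk above is presumably a plain interval-intersection test over (data_offset, size) pairs; a sketch of that logic (the semantics are assumed from the call site, not taken from the patch itself):

```python
def overlaps(start1, size1, start2, size2):
    """True if [start1, start1+size1) intersects [start2, start2+size2).
    Mirrors what the patch's overlaps(data_offset, size, ...) check
    presumably computes for two rdevs sharing one block device."""
    return start1 + size1 > start2 and start2 + size2 > start1

# Two md components carved out of the same block device:
print(overlaps(0, 100, 50, 100))   # share sectors 50..99 -> True
print(overlaps(0, 100, 100, 50))   # adjacent, no overlap -> False
```

Testing both orderings matters: the expression is symmetric in the two intervals, so neither argument order can hide an overlap.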