Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
Justin Piszcz wrote:
> For these benchmarks I timed how long it takes to extract a standard
> 4.4 GiB DVD:
>
> Settings: Software RAID 5 with the following settings (until I change
> those too):
>
> Base setup:
> blockdev --setra 65536 /dev/md3
> echo 16384 > /sys/block/md3/md/stripe_cache_size
> echo "Disabling NCQ on all disks..."
> for i in $DISKS
> do
>   echo "Disabling NCQ on $i"
>   echo 1 > /sys/block/$i/device/queue_depth
> done
>
> p34:~# grep : *chunk* | sort -n
> 4-chunk.txt:0:45.31
> 8-chunk.txt:0:44.32
> 16-chunk.txt:0:41.02
> 32-chunk.txt:0:40.50
> 64-chunk.txt:0:40.88
> 128-chunk.txt:0:40.21
> 256-chunk.txt:0:40.14***
> 512-chunk.txt:0:40.35
> 1024-chunk.txt:0:41.11
> 2048-chunk.txt:0:43.89
> 4096-chunk.txt:0:47.34
> 8192-chunk.txt:0:57.86
> 16384-chunk.txt:1:09.39
> 32768-chunk.txt:1:26.61
>
> It would appear a 256 KiB chunk-size is optimal.

Can you retest with different max_sectors_kb on both md and sd?

Also, can you retest using dd with different block-sizes?

Thanks!

--
Al
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
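For what it's worth, the timing files can be ranked mechanically. A small
sketch assuming the M:SS.hh format of the grep output above:

```shell
# Convert each "<chunk>-chunk.txt:M:SS.hh" line to seconds and sort,
# so the fastest chunk size comes out on top.
rank_chunks() {
    awk -F'[:.]' '{
        split($1, a, "-")                 # a[1] = chunk size in KiB
        secs = $3 * 60 + $4 + $5 / 100    # M:SS.hh -> seconds
        printf "%8.2f  %s KiB\n", secs, a[1]
    }' | sort -n
}

rank_chunks <<'EOF'
4-chunk.txt:0:45.31
8-chunk.txt:0:44.32
16-chunk.txt:0:41.02
32-chunk.txt:0:40.50
64-chunk.txt:0:40.88
128-chunk.txt:0:40.21
256-chunk.txt:0:40.14
512-chunk.txt:0:40.35
1024-chunk.txt:0:41.11
2048-chunk.txt:0:43.89
4096-chunk.txt:0:47.34
8192-chunk.txt:0:57.86
16384-chunk.txt:1:09.39
32768-chunk.txt:1:26.61
EOF
```

The first line of output is the 256 KiB chunk, matching the conclusion
above.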
Re: Linux Software RAID 5 + XFS Multi-Benchmarks / 10 Raptors Again
Justin Piszcz wrote:
> On Wed, 16 Jan 2008, Al Boldi wrote:
> > Also, can you retest using dd with different block-sizes?
>
> I can do this, moment..
>
> I know about oflag=direct but I chose to use dd with sync and measure
> the total time it takes:
>
> /usr/bin/time -f %E -o ~/$i=chunk.txt \
>   bash -c 'dd if=/dev/zero of=/r1/bigfile bs=1M count=10240; sync'
>
> So I was asked on the mailing list to test dd with various chunk sizes;
> here is the length of time it took to write 10 GiB and sync, per chunk
> size:
> 4=chunk.txt:0:25.46
> 8=chunk.txt:0:25.63
> 16=chunk.txt:0:25.26
> 32=chunk.txt:0:25.08
> 64=chunk.txt:0:25.55
> 128=chunk.txt:0:25.26
> 256=chunk.txt:0:24.72
> 512=chunk.txt:0:24.71
> 1024=chunk.txt:0:25.40
> 2048=chunk.txt:0:25.71
> 4096=chunk.txt:0:27.18
> 8192=chunk.txt:0:29.00
> 16384=chunk.txt:0:31.43
> 32768=chunk.txt:0:50.11
> 65536=chunk.txt:2:20.80

What do you get with bs=512,1k,2k,4k,8k,16k...?

Thanks!

--
Al
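The requested bs sweep could be scripted like this. TARGET and the 16 MiB
total are scratch values for illustration only; the real test wrote
10 GiB to /r1/bigfile and timed it with /usr/bin/time:

```shell
# Time a fixed-size dd+sync for each block size, keeping count * bs
# constant so every run writes the same number of bytes.
TARGET=$(mktemp)
TOTAL=$((16 * 1024 * 1024))    # bytes per run (tiny, for illustration)

for bs in 512 1k 2k 4k 8k 16k 32k 64k 128k 1M; do
    case $bs in                              # bs suffix -> bytes
        *k) bytes=$(( ${bs%k} * 1024 )) ;;
        *M) bytes=$(( ${bs%M} * 1024 * 1024 )) ;;
        *)  bytes=$bs ;;
    esac
    count=$(( TOTAL / bytes ))
    start=$(date +%s)
    dd if=/dev/zero of="$TARGET" bs=$bs count=$count conv=notrunc 2>/dev/null
    sync
    echo "bs=$bs: $(( $(date +%s) - start ))s"
done
rm -f "$TARGET"
```

With a 10 GiB run the whole-second resolution of date +%s is plenty; for
the tiny demo size most lines will just report 0s.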
[RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])
Lars Ellenberg wrote:
> meanwhile, please, anyone interested, the DRBD paper for LinuxConf Eu
> 2007 is finalized.
> http://www.drbd.org/fileadmin/drbd/publications/drbd8.linux-conf.eu.2007.pdf
>
> it does not give too much implementation detail (would be inappropriate
> for conference proceedings, imo; some paper commenting on the source
> code should follow). but it does give a good overview about what DRBD
> actually is, what exact problems it tries to solve, and what
> developments to expect in the near future.
>
> so you can make up your mind about "Do we need it?", and "Why DRBD?
> Why not NBD + MD-RAID?"

Ok, conceptually your driver sounds really interesting, but when I read
the pdf I got completely turned off. The problem is that the concepts are
not clearly implemented, when in fact the concepts are really simple:
allow shared access to remote block storage with fault tolerance.

The first thing to tackle here would be write serialization; only then
start thinking about fault tolerance. Now, shared remote block access
should theoretically be handled, as DRBD does, by a block-layer driver,
but realistically it may be more appropriate to let it be handled by the
combining end user, like OCFS or GFS. The idea here is to simplify
lower-layer implementations while removing any preconceived dependencies,
and to let upper layers reign free without incurring redundant overhead.

Look at ZFS; it illegally violates layering by combining md/dm/lvm with
the fs, but it does this based on a realistic understanding of the
problems involved, which enables it to improve performance, flexibility,
and functionality specific to its use case.

This implies that there are two distinct forces at work here:

  1. Layer components
  2. Use-Case composers

Layer components should technically not implement any use case (other
than providing a plumbing framework), as that would incur unnecessary
dependencies, which could reduce their generality and thus reusability.

Use-Case composers can now leverage layer components from across the
layering hierarchy to yield a specific use-case implementation. DRBD is
such a Use-Case composer, as are md / dm / lvm and any fs in general,
whereas aoe / nbd / loop and the VFS / FUSE are examples of layer
components.

It follows that Use-Case composers, like DRBD, need their common
functionality factored out into layer components, which can then be
recomposed to implement a specific use case.

Thanks!

--
Al
Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])
Evgeniy Polyakov wrote:
> Al Boldi ([EMAIL PROTECTED]) wrote:
> > Look at ZFS; it illegally violates layering by combining md/dm/lvm
> > with the fs, but it does this based on a realistic understanding of
> > the problems involved, which enables it to improve performance,
> > flexibility, and functionality specific to its use case.
> >
> > This implies that there are two distinct forces at work here:
> >   1. Layer components
> >   2. Use-Case composers
> >
> > Layer components should technically not implement any use case
> > (other than providing a plumbing framework), as that would incur
> > unnecessary dependencies, which could reduce their generality and
> > thus reusability.
> >
> > Use-Case composers can now leverage layer components from across the
> > layering hierarchy to yield a specific use-case implementation. DRBD
> > is such a Use-Case composer, as are md / dm / lvm and any fs in
> > general, whereas aoe / nbd / loop and the VFS / FUSE are examples of
> > layer components.
> >
> > It follows that Use-Case composers, like DRBD, need their common
> > functionality factored out into layer components, which can then be
> > recomposed to implement a specific use case.
>
> Out of curiosity, did you try nbd+dm+raid1 compared to drbd and/or zfs
> on top of distributed storage (which is a surprise to me, that holy
> zfs supports that)?

Actually, I may not have been very clear in my Use-Case composer
description: I meant internal, in-kernel Use-Case composers as opposed to
external, userland Use-Case composers.

So nbd+dm+raid1 would be an external userland Use-Case composition, which
obviously could have some drastic performance issues.

DRBD and ZFS are examples of internal in-kernel Use-Case composers, which
obviously could show some drastic performance improvements.

Although you could allow in-kernel Use-Case composers to run on top of
userland Use-Case composers, that wouldn't be the preferred mode of
operation. Instead, you would for example recompose ZFS to incorporate an
in-kernel distributed-storage layer component, like nbd.

All this boils down to refactoring Use-Case composers to produce layer
components with both in-kernel and userland interfaces. Once we have
that, it becomes a matter of plug-and-play to produce something awesome
like ZFS.

Thanks!

--
Al
Re: bonnie++ benchmarks for ext2,ext3,ext4,jfs,reiserfs,xfs,zfs on software raid 5
Justin Piszcz wrote:
> CONFIG:
>
> Software RAID 5 (400GB x 6): Default mkfs parameters for all
> filesystems. Kernel was 2.6.21 or 2.6.22, did these awhile ago.
> Hardware was SATA with PCI-e only, nothing on the PCI bus.
>
> ZFS was userspace+fuse of course.

Wow! Userspace and still that efficient.

> Reiser was V3.
> EXT4 was created using the recommended options on its project page.
>
> RAW:
> ext2,7760M,56728,96.,180505,51,85484,17.,50946.7,80.,235541,21.,373.667,0,16:10:16/64,2354,27,0,0,8455.67,14.6667,2211.67,26.,0,0,9724,22.
> ext3,7760M,52702.7,94.,165005,60,82294.7,20.6667,52664,83.6667,258788,33.,335.8,0,16:10:16/64,858.333,10.6667,10250.3,28.6667,4084,15,897,12.6667,4024.33,12.,2754,11.
> ext4,7760M,53129.7,95,164515,59.,101678,31.6667,62194.3,98.6667,266716,22.,405.767,0,16:10:16/64,1963.67,23.6667,0,0,20859,73.6667,1731,21.,9022,23.6667,16410,65.6667
> jfs,7760M,54606,92,191997,52,112764,33.6667,63585.3,99,274921,22.,383.8,0,16:10:16/64,344,1,0,0,539.667,0,297.667,1,0,0,340,0
> reiserfs,7760M,51056.7,96,180607,67,106907,38.,61231.3,97.6667,275339,29.,441.167,0,16:10:16/64,2516,60.6667,19174.3,60.6667,8194.33,54.3333,2011,42.6667,6963.67,19.6667,9168.33,68.6667
> xfs,7760M,52985.7,93,158342,45,79682,14,60547.3,98,239101,20.,359.667,0,16:10:16/64,415,4,0,0,1774.67,10.6667,454,4.7,14526.3,40,1572,12.6667
> zfs,7760M,

Dissecting some of these numbers (the zfs line):

   speed     %cpu
   25601,    43.
   32198.7,  4
   13266.3,  2
   44145.3,  68.6667
   129278,   9
   245.167,  0

   16:10:16/64

   speed     %cpu
   218.333,  2
   2698.33,  11.6667
   7434.67,  14.
   244,      2
   2191.33,  11.6667
   5613.33,  13.

Extrapolating these %cpu numbers makes ZFS the fastest.

Are you sure these numbers are correct?

Thanks!

--
Al
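The per-%CPU extrapolation can be checked mechanically. A rough sketch,
assuming the classic bonnie++ 1.03 CSV field order (field 5 = block-write
KB/s, field 6 = its %CPU; verify against your bonnie++ version); the two
sample lines are abbreviated from the RAW dump above:

```shell
# Print block-write throughput per %CPU for each CSV line on stdin.
cpu_eff() {
    awk -F, '$6 + 0 > 0 {
        printf "%-10s write %8.0f KB/s at %5.1f%% cpu -> %7.0f KB/s per %%cpu\n",
               $1, $5, $6, $5 / $6
    }'
}

cpu_eff <<'EOF'
ext3,7760M,52702.7,94.,165005,60,82294.7,20.6667
zfs,7760M,25601,43.,32198.7,4,13266.3,2
EOF
```

On these two lines, ZFS comes out at roughly 8050 KB/s per %CPU against
ext3's 2750, which is the sense in which the %cpu extrapolation favors
ZFS even though its absolute throughput is far lower.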
Re: [RFH] Partition table recovery
Theodore Tso wrote:
> On Sun, Jul 22, 2007 at 07:10:31AM +0300, Al Boldi wrote:
> > Sounds great, but it may be advisable to hook this into the partition
> > modification routines instead of mkfs/fsck. Which would mean that
> > the partition manager could ask the kernel to instruct its fs
> > subsystem to update the backup partition table for each known
> > fs-type that supports such a feature.
>
> Well, let's think about this a bit. What are the requirements?
>
> 1) The partition manager should be able to explicitly request that a
>    new backup of the partition tables be stashed in each filesystem
>    that has room for such a backup. That way, when the user
>    affirmatively makes a partition table change, it can get backed up
>    in all of the right places automatically.
>
> 2) The fsck program should *only* stash a backup of the partition
>    table if there currently isn't one in the filesystem. It may be
>    that the partition table has been corrupted, and so merely doing an
>    fsck should not transfer a current copy of the partition table to
>    the filesystem-specific backup area. It could be that the partition
>    table was only partially recovered, and we don't want to overwrite
>    the previously existing backups except on an explicit request from
>    the system administrator.
>
> 3) The mkfs program should automatically create a backup of the
>    current partition table layout. That way we get a backup in the
>    newly created filesystem as soon as it is created.
>
> 4) The exact location of the backup may vary from filesystem to
>    filesystem. For ext2/3/4, bytes 512-1023 are always unused, and
>    don't interfere with the boot sector at bytes 0-511, so that's the
>    obvious location. Other filesystems may have that location in use,
>    and some other location might be a better place to store it.
>    Ideally it will be a well-known location that isn't dependent on
>    finding an inode table, or some such, but that may not be possible
>    for all filesystems.
>
> OK, so how about this as a solution that meets the above requirements?
>
> /sbin/partbackup device [fspart]
>
>    Will scan device (i.e., /dev/hda, /dev/sdb, etc.) and create a
>    512-byte partition backup, using the format I've previously
>    described. If fspart is specified on the command line, it will use
>    the blkid library to determine the filesystem type of fspart, and
>    then attempt to execute /sbin/partbackupfs.fstype to write the
>    partition backup to fspart. If fspart is '-', then it will write
>    the 512-byte partition table to stdout. If fspart is not specified
>    on the command line, /sbin/partbackup will iterate over all
>    partitions in device, use the blkid library to attempt to determine
>    the correct filesystem type, and then execute
>    /sbin/partbackupfs.fstype if such a backup program exists.
>
> /sbin/partbackupfs.fstype fspart
>
>    ... is a filesystem-specific program for filesystem type fstype. It
>    will assure that fspart (i.e., /dev/hda1, /dev/sdb3) is of an
>    appropriate filesystem type, and then read 512 bytes from stdin and
>    write it out to fspart in an appropriate place for that filesystem.
>
> Partition managers will be encouraged to check whether /sbin/partbackup
> exists and, if so, to call it with just one argument (i.e.,
> /sbin/partbackup /dev/hdb) after the partition table is written. They
> SHOULD provide an option for the user to suppress the backup from
> happening, but the backup should be the default behavior.
>
> A mkfs.fstype program is encouraged to run /sbin/partbackup with two
> arguments (i.e., /sbin/partbackup /dev/hdb /dev/hdb3) when creating a
> filesystem. A fsck.fstype program is encouraged to check whether a
> partition backup exists (assuming the filesystem supports it), and if
> not, call /sbin/partbackup with two arguments.
>
> A filesystem utility package for a particular filesystem type is
> encouraged to make the above changes to its mkfs and fsck programs, as
> well as provide an /sbin/partbackupfs.fstype program.

Great!

> I would do this all in userspace, though. Is there any reason to get
> the kernel involved? I don't think so.

Yes, doing things in userspace, when possible, is much better. But a
change in the partition table has to be relayed to the kernel, and when
that change happens to be on a mounted disk, the partition manager
complains of not being able to update the kernel's view. So how can this
be addressed?

Thanks!

--
Al
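For illustration, here is a tiny, hypothetical userspace sketch of the
/sbin/partbackup interface described above. The helper name
/sbin/partbackupfs.fstype and the one-sector format are assumptions taken
from the proposal, not an existing tool; this sketch just grabs the
512-byte table sector:

```shell
#!/bin/sh
# Hypothetical partbackup sketch: emit the 512-byte partition sector of
# DEVICE on stdout ("-"), or pipe it into a per-fs backup helper.
device=${1:-}
fspart=${2:-}

grab_table() {
    # first 512 bytes = MBR partition sector
    dd if="$1" bs=512 count=1 2>/dev/null
}

case "$fspart" in
    -)  grab_table "$device" ;;                # raw 512 bytes to stdout
    "") echo "would iterate over all partitions of $device" >&2 ;;
    *)  fstype=$(blkid -o value -s TYPE "$fspart")
        grab_table "$device" | "/sbin/partbackupfs.$fstype" "$fspart" ;;
esac
```

The real tool would of course write the proposal's dedicated backup
format rather than the raw sector, and the per-fs helper would validate
the filesystem before touching bytes 512-1023 (for ext2/3/4).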
Re: [RFH] Partition table recovery
Theodore Tso wrote:
> On Sat, Jul 21, 2007 at 07:54:14PM +0200, Rene Herman wrote:
> > sfdisk -d already works most of the time. Not as a verbatim tool (I
> > actually semi-frequently use a "sfdisk -d /dev/hda | sfdisk"
> > invocation as a way to _rewrite_ the CHS fields to other values
> > after changing machines around on a disk) but something you'd backup
> > on the FS level should, in my opinion, need to be less fragile than
> > would be possible with just 512 bytes available.
>
> *IF* you remember to store the sfdisk -d somewhere useful. In my "How
> To Recover From Hard Drive Catastrophes" classes, I tell them to print
> out a copy of "sfdisk -l /dev/hda ; sfdisk -d /dev/hda" and tape it to
> the side of the computer. I also tell them to do regular backups.
> Want to make a guess how many of them actually follow this good
> advice? Far fewer than I would like, I suspect...
>
> What I'm suggesting is the equivalent of sfdisk -d, except we'd be
> doing it automatically without requiring the user to take any kind of
> explicit action. Is it perfect? No, although the edge conditions are
> quite rare these days and generally involve users using legacy systems
> and/or doing Weird Shit such that They Really Should Know To Do Their
> Own Explicit Backups. But for the novice users, it should work Just
> Fine.

Sounds great, but it may be advisable to hook this into the partition
modification routines instead of mkfs/fsck. That would mean the
partition manager could ask the kernel to instruct its fs subsystem to
update the backup partition table for each known fs-type that supports
such a feature.

Thanks!

--
Al
Re: [RFH] Partition table recovery
Jeffrey V. Merkey wrote:
> Al Boldi wrote:
> > As always, a good friend of mine managed to scratch my partition
> > table by cat'ing /dev/full into /dev/sda. I was able to push him out
> > of the way, but at least the first 100MB are gone. I can probably
> > live without the first partition, but there are many partitions
> > after that, which I hope should easily be recoverable.
> >
> > I tried parted, but it's not working out for me. Does anybody know
> > of a simple partition recovery tool that would just scan the disk
> > for lost partitions?
>
> One thing NetWare always did was to stamp a copy of the partition
> table, at the time a partition was created, as the second logical
> sector (offset 1) from the start of the newly created partition. This
> allowed the disk to be scanned for the original (or last) partition
> table copy.

This is really a good idea, as it would save you the trouble of
reconstructing the table due to older overlapping entries. Can Linux do
something like that?

Thanks!

--
Al
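The NetWare-style stamp (and the scan that exploits it) can be mocked up
on a scratch disk image. The sector numbers and the 'FAKE' table content
below are made up purely for the demo:

```shell
# Build a 256-sector image, stamp a copy of sector 0 (the "partition
# table") into the second sector of each "partition", then scan for it.
IMG=$(mktemp)
SEC0=$(mktemp)
dd if=/dev/zero of="$IMG" bs=512 count=256 2>/dev/null
printf 'FAKE PARTITION TABLE' | dd of="$IMG" conv=notrunc 2>/dev/null
dd if="$IMG" of="$SEC0" bs=512 count=1 2>/dev/null   # keep sector 0 aside

for start in 64 128 192; do           # example partition start sectors
    # NetWare-style: table copy goes at <start> + 1
    dd if="$SEC0" of="$IMG" bs=512 seek=$((start + 1)) conv=notrunc \
        2>/dev/null
done

# recovery scan: find every sector whose content matches sector 0
for s in $(seq 1 255); do
    dd if="$IMG" bs=512 skip=$s count=1 2>/dev/null |
        cmp -s - "$SEC0" && echo "table copy found at sector $s" || :
done
# -> table copy found at sector 65, 129, 193
```

On a real disk the stamp would be written at partition-creation time, and
the scan would recover the last table even after sector 0 is destroyed.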
Re: [RFH] Partition table recovery
Dave Young wrote:
> On 7/20/07, Al Boldi [EMAIL PROTECTED] wrote:
> > As always, a good friend of mine managed to scratch my partition
> > table by cat'ing /dev/full into /dev/sda. I was able to push him out
> > of the way, but
>
> /dev/null ?
>
> > at least the first 100MB are gone. I can probably live without the
> > first partition, but there are many partitions after that, which I
> > hope should easily be recoverable.
> >
> > I tried parted, but it's not working out for me. Does anybody know
> > of a simple partition recovery tool that would just scan the disk
> > for lost partitions?
>
> The best way is to back up your partition table before it gets
> destroyed.

Very true!

  # sfdisk -d

is a real saviour. But make sure you don't save it on the same disk you
are trying to recover.

Thanks!

--
Al
Re: [RFH] Partition table recovery
James Lamanna wrote:
> On 7/19/07, Al Boldi [EMAIL PROTECTED] wrote:
> > As always, a good friend of mine managed to scratch my partition
> > table by cat'ing /dev/full into /dev/sda. I was able to push him out
> > of the way, but at least the first 100MB are gone. I can probably
> > live without the first partition, but there are many partitions
> > after that, which I hope should easily be recoverable.
> >
> > I tried parted, but it's not working out for me. Does anybody know
> > of a simple partition recovery tool that would just scan the disk
> > for lost partitions?
>
> Tried gpart?
> http://www.stud.uni-hannover.de/user/76201/gpart/

This definitely looks like the ticket. And also rescuept from
util-linux.

There is only one small problem; I have been regularly adding / deleting
/ resizing partitions, which kind of confuses the scanner. But still,
it's better than nothing.

Anton Altaparmakov wrote:
> parted and its derivatives are a pile of crap... They cause corruption
> to totally healthy systems at the best of times. Don't go near them.
> Use TestDisk (http://www.cgsecurity.org/wiki/TestDisk) and be happy.
> (-:

This one really worked best, without getting confused about older
partitions.

Thanks everybody!

BTW, what's a partion table?

--
Al
Re: [RFH] Partition table recovery
Jan-Benedict Glaw wrote:
> On Fri, 2007-07-20 14:29:34 +0300, Al Boldi [EMAIL PROTECTED] wrote:
> > But, I want something much more automated. And the partition table
> > backup per partition entry isn't really a bad idea.
>
> That's called `gpart'.

Oh, gpart is great, but if we had a backup copy of the partition table
at every partition location on disk, then this backup copy could easily
be reused to reconstruct the original partition table without further
searching. Just like the NetWare approach, and in some respects like the
ext2/3 superblock backups.

Thanks!

--
Al
[RFH] Partion table recovery
As always, a good friend of mine managed to scratch my partition table
by cat'ing /dev/full into /dev/sda. I was able to push him out of the
way, but at least the first 100MB are gone. I can probably live without
the first partition, but there are many partitions after that, which I
hope should easily be recoverable.

I tried parted, but it's not working out for me. Does anybody know of a
simple partition recovery tool that would just scan the disk for lost
partitions?

Thanks!

--
Al
Re: Software RAID5 Horrible Write Speed On 3ware Controller!!
Justin Piszcz wrote:
> UltraDense-AS-3ware-R5-9-disks,16G,50676,89,96019,34,46379,9,60267,99,501098,56,248.5,0,16:10:16/64,240,3,21959,84,1109,10,286,4,22923,91,544,6
> UltraDense-AS-3ware-R5-9-disks,16G,49983,88,96902,37,47951,10,59002,99,529121,60,210.3,0,16:10:16/64,250,3,25506,98,1163,10,268,3,18003,71,772,8
> UltraDense-AS-3ware-R5-9-disks,16G,49811,87,95759,35,48214,10,60153,99,538559,61,276.8,0,16:10:16/64,233,3,25514,97,1100,9,279,3,21398,84,839,9

Is there any easy way to decipher these numbers?

Thanks!

--
Al
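One way to decipher them: the lines look like bonnie++ CSV output, so
they can be unpacked field by field. The labels below assume the
bonnie++ 1.03 CSV order (label, size, putc, putc%cpu, write, write%cpu,
rewrite, rewrite%cpu, getc, getc%cpu, read, read%cpu, seeks, seeks%cpu,
...); verify against your bonnie++ version before trusting them:

```shell
# Pretty-print the first 14 fields of a bonnie++ CSV line.
decode_bonnie() {
    awk -F, '{
        printf "%s (%s)\n", $1, $2
        printf "  seq write, per-char : %8s KB/s at %s%% cpu\n", $3,  $4
        printf "  seq write, block    : %8s KB/s at %s%% cpu\n", $5,  $6
        printf "  rewrite             : %8s KB/s at %s%% cpu\n", $7,  $8
        printf "  seq read, per-char  : %8s KB/s at %s%% cpu\n", $9,  $10
        printf "  seq read, block     : %8s KB/s at %s%% cpu\n", $11, $12
        printf "  random seeks        : %8s /s   at %s%% cpu\n", $13, $14
    }'
}

echo 'UltraDense-AS-3ware-R5-9-disks,16G,50676,89,96019,34,46379,9,60267,99,501098,56,248.5,0' |
    decode_bonnie
```

The remaining fields (after the 16:10:16/64 marker) are the file
create/stat/delete phases, in the same value-then-%cpu pairing.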
[RFC] VFS: data=ordered (was: [Advocacy] Re: 3ware 9650 tips)
Matthew Wilcox wrote:
> On Mon, Jul 16, 2007 at 08:40:00PM +0300, Al Boldi wrote:
> > XFS surely rocks, but it's missing one critical component:
> > data=ordered. And that's one component that's just too critical to
> > overlook for an enterprise environment that is built on
> > data-integrity over performance. So that's the secret why people
> > still use ext3, and XFS' reliance on external hardware to ensure
> > integrity is really misplaced. Now, maybe when we get data=ordered
> > onto the VFS level, then maybe XFS may become viable for the
> > enterprise, and ext3 may cease to be KING.
>
> Wow, thanks for bringing an advocacy thread onto linux-fsdevel. Just
> what we wanted.
>
> Do you have any insight into how to get data=ordered onto the VFS
> level? Because to me, that sounds like pure nonsense.

Well, conceptually it sounds like a piece of cake; technically, your
guess is as good as mine. IIRC, akpm once mentioned something like this.

But seriously, can you think of a technical reason why it shouldn't be
possible to abstract data=ordered mode out into the VFS?

Thanks!

--
Al
Re: [PATCH RFC 3/4] md: writeback caching policy for raid5 [experimental]
Dan Williams wrote:
> In write-through mode bi_end_io is called once writes to the data
> disk(s) and the parity disk have completed. In write-back mode
> bi_end_io is called immediately after data has been copied into the
> stripe cache, which also causes the stripe to be marked dirty.

This is not really meaningful, as it is exactly what the page cache
already does before being synced.

It may be more reasonable to sync the data stripe as usual and only
delay the parity. That way you shouldn't have to worry about unclean
shutdowns.

Thanks!

--
Al
Re: raid1 does not seem faster
Bill Davidsen wrote:
> Al Boldi wrote:
> > The problem is that raid1 doesn't do striped reads, but rather uses
> > read-balancing per proc. Try your test with parallel reads; it
> > should be faster.
> :
> :
> It would be nice if reads larger than some size were considered as
> candidates for multiple devices. By setting the readahead larger than
> that value, speed increases would be noted for sequential access.

Actually, that's what I thought for a long time too, but as Neil once
pointed out, for striped reads to be efficient, each chunk should be
located sequentially, so as to avoid any seeks. This is only possible by
introducing some offset layout, as in raid10, which infers a loss of
raid1's single-disk-image compatibility.

What could be feasible is some kind of initial-burst striped readahead,
which could possibly improve small reads (readahead * nr_of_disks).

Thanks!

--
Al
Re: raid1 does not seem faster
Jan Engelhardt wrote:
> Normally, I'd think that combining drives into a raid1 array would
> give me at least a little improvement in read speed. In my setup
> however, this does not seem to be the case.
>
> 14:16 opteron:/var/log # hdparm -t /dev/sda
>  Timing buffered disk reads: 170 MB in 3.01 seconds = 56.52 MB/sec
> 14:17 opteron:/var/log # hdparm -t /dev/md3
>  Timing buffered disk reads: 170 MB in 3.01 seconds = 56.45 MB/sec
>
> (and dd_rescue shows the same numbers)

The problem is that raid1 doesn't do striped reads, but rather uses
read-balancing per proc. Try your test with parallel reads; it should be
faster.

You could use raid10, but then you lose single-disk-image compatibility.

Thanks!

--
Al
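To see the suggested access pattern in miniature: raid1's per-process
read balancing can serve two concurrent readers from different mirrors,
which a single hdparm/dd stream never exercises. The scratch file below
just demonstrates the two-process split; on a real array you would point
both dd's at /dev/md3 with skip= offsets in different halves:

```shell
# Two concurrent sequential readers over disjoint halves of the device.
F=$(mktemp)
dd if=/dev/zero of="$F" bs=1M count=8 2>/dev/null

dd if="$F" of=/dev/null bs=1M count=4 2>/dev/null &          # first half
dd if="$F" of=/dev/null bs=1M skip=4 count=4 2>/dev/null &   # second half
wait
echo "both readers done"
rm -f "$F"
```

With raid1, each background dd is a separate process, so the md layer is
free to balance one onto each mirror; the combined throughput is what
should exceed the single-stream hdparm numbers above.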
Re: PATA/SATA Disk Reliability paper
Mario 'BitKoenig' Holbe wrote:
> Al Boldi [EMAIL PROTECTED] wrote:
> > Interesting link. They seem to point out that smart does not
> > necessarily warn of pending failure. This is probably worse than not
> > having smart at all, as it gives you the illusion of safety.
>
> If SMART gives you the illusion of safety, you didn't understand
> SMART. SMART hints *only* at the potential presence or occurrence of
> failures in the future; it does not prove the absence of such - and
> nobody ever said it does. It would even be impossible to do that
> (which is easy to prove by just utilizing an external damaging tool
> like a hammer). Concluding from that that not having any failure
> detector at all is better than having at least an imperfect one is
> IMHO completely wrong.

Agreed. But would you then call it SMART? Sounds rather DUMB.

Thanks!

--
Al
Re: PATA/SATA Disk Reliability paper
Mark Hahn wrote:
> > In contrast, ever since these holes appeared, drive failures became
> > the norm.
>
> wow, great conspiracy theory!

I think you misunderstand. I just meant plain old-fashioned
mis-engineering.

> maybe the hole is plugged at the factory with a substance which
> evaporates at 1/warranty-period ;)

Actually it's plugged with a thin paper-like filter, which does not seem
to evaporate easily. And it's got nothing to do with warranty, although
if you get lucky and the failure happens within the warranty period, you
can probably demand a replacement drive to make you feel better. But
remember, the Google report mentions a great number of drives failing
for no apparent reason, not even a smart warning, so failing within the
warranty period is just pure luck.

> seriously, isn't it easy to imagine a bladder-like arrangement that
> permits equilibration without net flow? disk spec-sheets do limit
> this - I checked the Seagate 7200.10: 10k feet operating, 40k max.
> amusingly, -200 feet is the min either way...

Well, it looks like filtered net flow on WDs. What does it look like on
Seagates?

> > > Does anyone remember that you had to let your drives acclimate to
> > > your machine room for a day or so before you used them?
> >
> > The problem is, that's not enough; the room temperature/humidity has
> > to be controlled too. In a desktop environment, that's not really
> > feasible.
>
> 5-90% humidity operating, 95% non-op, and 30%/hour. seems pretty easy
> to me. in fact, I frequently ask people to justify the assumption
> that a good machineroom needs tight control over humidity. (assuming,
> like most machinerooms, you aren't frequently handling the innards.)

I agree, but reality has a different opinion, and it may take down that
drive, specs or no specs.

A good way to deal with reality is to find the real reasons for failure.
Once these reasons are known, engineering quality drives becomes, thank
GOD, really rather easy.

Thanks!

--
Al
Re: PATA/SATA Disk Reliability paper
Mark Hahn wrote:
> - disks are very complicated, so their failure rates are a combination
>   of conditional failure rates of many components. to take a fully
>   reductionist approach would require knowing how each of ~1k parts
>   responds to age, wear, temp, handling, etc., and none of those can
>   be assumed to be independent. those are the real reasons, but most
>   can't be measured directly outside a lab, and the number of
>   combinatorial interactions is huge.

It seems to me that the biggest problem is the 7.2k+ rpm platters
themselves, especially with those heads flying closely on top of them.
So we can probably forget the rest of the ~1k non-moving parts, as they
have proven to be pretty reliable most of the time.

> - factorial analysis of the data. temperature is a good example,
>   because both low and high temperature affect AFR, and in ways that
>   interact with age and/or utilization. this is a common issue in
>   medical studies, which are strikingly similar in design (outcome is
>   subject or disk dies...). there is a well-established body of
>   practice for factorial analysis.

Agreed. We definitely need more sensors.

> - recognition that the relative results are actually quite good, even
>   if the absolute results are not amazing. for instance, assume we
>   have 1k drives, and a 10% overall failure rate. using all SMART but
>   temp detects 64 of the 100 failures and misses 36. essentially, the
>   failure rate is now .036. I'm guessing that if utilization and
>   temperature were included, the rate would be much lower. feedback
>   from active testing (especially scrubbing) and performance under
>   the normal workload would also help.

Are you saying you are content with premature disk failure, as long as
there is a smart warning sign? If so, then I don't think that is enough.

I think the sensors should trigger some kind of shutdown mechanism as a
protective measure when some threshold is reached. Just like the
protective measure you see in CPUs to prevent meltdown.

Thanks!

--
Al
Re: PATA/SATA Disk Reliability paper
Stephen C Woods wrote:
> So drives do need to be ventilated, not so much to worry about
> exploding, but rather subtle distortion of the case as the atmospheric
> pressure changes.

I have a '94 Caviar without any apparent holes; and as a bonus, the
drive still works. In contrast, ever since these holes appeared, drive
failures became the norm.

> Does anyone remember that you had to let your drives acclimate to your
> machine room for a day or so before you used them?

The problem is, that's not enough; the room temperature/humidity has to
be controlled too. In a desktop environment, that's not really feasible.

Thanks!

--
Al
Re: PATA/SATA Disk Reliability paper
Eyal Lebedinsky wrote:
> Disks are sealed, and a desiccant is present in each to keep humidity
> down. If you ever open a disk drive (e.g. for the magnets, or the
> mirror-quality platters, or for fun) then you can see the desiccant
> sachet.

Actually, they aren't sealed 100%. On WDs at least, there is a hole with
a warning printed on its side:

    DO NOT COVER HOLE BELOW
          V V V V
             o

In contrast, older models from the last century don't have that hole.

Al Boldi wrote:
> If there is one thing to watch out for, it is dew. I remember video
> machines sensing for dew, so do any drives sense for dew?

Thanks!

--
Al
Re: PATA/SATA Disk Reliability paper
Richard Scobie wrote: Thought this paper may be of interest. A study done by Google on over 100,000 drives they have/had in service. http://labs.google.com/papers/disk_failures.pdf

Interesting link. They seem to point out that SMART does not necessarily warn of pending failure. This is probably worse than not having SMART at all, as it gives you the illusion of safety.

If there is one thing to watch out for, it is dew. I remember video machines sensing for dew, so do any drives sense for dew?

Thanks! -- Al
Re: Linux Software RAID 5 Performance Optimizations: 2.6.19.1: (211MB/s read 195MB/s write)
Justin Piszcz wrote: RAID 5 TWEAKED: 1:06.41 elapsed @ 60% CPU. This should be 1:14, not 1:06 (that was with a similarly sized file, but not the same one); the 1:14 is the same file as used with the other benchmarks, and to get that I used 256mb read-ahead and 16384 stripe size ++ 128 max_sectors_kb (same size as my sw raid5 chunk size).

max_sectors_kb is probably your key. On my system I get twice the read performance by just reducing max_sectors_kb from the default 512 to 192. Can you do a fresh reboot to shell and then:

$ cat /sys/block/hda/queue/*
$ cat /proc/meminfo
$ echo 3 > /proc/sys/vm/drop_caches
$ dd if=/dev/hda of=/dev/null bs=1M count=10240
$ echo 192 > /sys/block/hda/queue/max_sectors_kb
$ echo 3 > /proc/sys/vm/drop_caches
$ dd if=/dev/hda of=/dev/null bs=1M count=10240

Thanks! -- Al
Re: Linux Software RAID 5 Performance Optimizations: 2.6.19.1: (211MB/s read 195MB/s write)
Justin Piszcz wrote: Btw, max sectors did improve my performance a little bit, but stripe_cache+read_ahead were the main optimizations that made everything go faster by about ~1.5x. I have individual bonnie++ benchmarks of [only] the max_sectors_kb tests as well; it improved the times from 8 min/bonnie run to 7 min 11 seconds or so. See below, and then after that is what you requested.

# echo 3 > /proc/sys/vm/drop_caches
# dd if=/dev/md3 of=/dev/null bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 399.352 seconds, 26.9 MB/s
# for i in sde sdg sdi sdk; do echo 192 > /sys/block/$i/queue/max_sectors_kb; echo "Set /sys/block/$i/queue/max_sectors_kb to 192kb"; done
Set /sys/block/sde/queue/max_sectors_kb to 192kb
Set /sys/block/sdg/queue/max_sectors_kb to 192kb
Set /sys/block/sdi/queue/max_sectors_kb to 192kb
Set /sys/block/sdk/queue/max_sectors_kb to 192kb
# echo 3 > /proc/sys/vm/drop_caches
# dd if=/dev/md3 of=/dev/null bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 398.069 seconds, 27.0 MB/s

Awful performance with your numbers/drop_caches settings! Can you repeat with /dev/sda only? With a fresh reboot to shell, then:

$ cat /sys/block/sda/queue/max_sectors_kb
$ echo 3 > /proc/sys/vm/drop_caches
$ dd if=/dev/sda of=/dev/null bs=1M count=10240
$ echo 192 > /sys/block/sda/queue/max_sectors_kb
$ echo 3 > /proc/sys/vm/drop_caches
$ dd if=/dev/sda of=/dev/null bs=1M count=10240
$ echo 128 > /sys/block/sda/queue/max_sectors_kb
$ echo 3 > /proc/sys/vm/drop_caches
$ dd if=/dev/sda of=/dev/null bs=1M count=10240

What were your tests designed to show? A problem with the block-io.

Thanks! -- Al
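The MB/s figures dd prints can be checked from the byte count and elapsed time alone; a small sketch (numbers copied from the runs above; note dd's "MB" is decimal, 10^6 bytes):

```python
def dd_rate_mb_s(nbytes: int, seconds: float) -> float:
    """Throughput in dd's decimal MB/s (1 MB = 1_000_000 bytes)."""
    return nbytes / seconds / 1_000_000

# 10 GiB (10737418240 bytes) read from /dev/md3 in 399.352 s and 398.069 s:
print(round(dd_rate_mb_s(10737418240, 399.352), 1))  # 26.9
print(round(dd_rate_mb_s(10737418240, 398.069), 1))  # 27.0
```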
Re: Linux Software RAID 5 Performance Optimizations: 2.6.19.1: (211MB/s read 195MB/s write)
Justin Piszcz wrote: On Sat, 13 Jan 2007, Al Boldi wrote: Justin Piszcz wrote: Btw, max sectors did improve my performance a little bit, but stripe_cache+read_ahead were the main optimizations that made everything go faster by about ~1.5x. I have individual bonnie++ benchmarks of [only] the max_sectors_kb tests as well; it improved the times from 8 min/bonnie run to 7 min 11 seconds or so; see below, and then after that is what you requested. Can you repeat with /dev/sda only? For sda (it is a 74GB Raptor only), but ok.

Do you get the same results for the 150GB Raptors on sd{e,g,i,k}?

# uptime
 16:25:38 up 1 min, 3 users, load average: 0.23, 0.14, 0.05
# cat /sys/block/sda/queue/max_sectors_kb
512
# echo 3 > /proc/sys/vm/drop_caches
# dd if=/dev/sda of=/dev/null bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 150.891 seconds, 71.2 MB/s
# echo 192 > /sys/block/sda/queue/max_sectors_kb
# echo 3 > /proc/sys/vm/drop_caches
# dd if=/dev/sda of=/dev/null bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 150.192 seconds, 71.5 MB/s
# echo 128 > /sys/block/sda/queue/max_sectors_kb
# echo 3 > /proc/sys/vm/drop_caches
# dd if=/dev/sda of=/dev/null bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 150.15 seconds, 71.5 MB/s

Does this show anything useful?

Probably a latency issue; md is highly latency sensitive. What CPU type/speed do you have? Bootlog/dmesg?

Thanks! -- Al
Re: Propose of enhancement of raid1 driver
Mario 'BitKoenig' Holbe wrote: Al Boldi [EMAIL PROTECTED] wrote: But what still isn't clear: why can't raid1 use something like the raid10 offset=2 mode? RAID1 has equal data on all mirrors, so sooner or later you have to seek somewhere, no matter how you lay out the data on each mirror.

Don't underestimate the effects mere layout can have on multi-disk array performance, despite it being highly hw dependent. The best approach would probably involve a user-configurable layout table, to tune it to the specific hw.

Thanks! -- Al
Re: Propose of enhancement of raid1 driver
Mario 'BitKoenig' Holbe wrote: Al Boldi [EMAIL PROTECTED] wrote: Don't underestimate the effects mere layout can have on multi-disk array performance, despite it being highly hw dependent. I can't see the difference between equal mirrors and a somehow interleaved layout on RAID1. Since you have to seek anyway, there should be no difference between both approaches once you read big enough chunks. The problem with reading big chunks is: you probably read far too much when you don't really need the data you read.

Think adaptive.

And vice versa: when you don't read big chunks, it doesn't matter how your data is laid out.

Think tracks and heads, physical that is.

Thanks! -- Al
Re: Propose of enhancement of raid1 driver
Mario 'BitKoenig' Holbe wrote: Neil Brown [EMAIL PROTECTED] wrote: Skipping over blocks within a track is no faster than reading blocks in the track, so you would need to make sure that your chunk size is... Not even "no faster", but probably even slower.

Surely slower, on conventional hds anyway. But what still isn't clear: why can't raid1 use something like the raid10 offset=2 mode?

Thanks! -- Al
Re: Large single raid and XFS or two small ones and EXT3?
Chris Allen wrote: Francois Barre wrote: 2006/6/23, PFC [EMAIL PROTECTED]: - XFS is faster and fragments less, but make sure you have a good UPS. Why a good UPS? XFS has a good strong journal; I never had an issue with it yet, and believe me, I did have some dirty things happening here... - ReiserFS 3.6 is mature and fast, too; you might consider it. - ext3 is slow if you have many files in one directory, but has more mature tools (resize, recovery etc). XFS tools are kind of mature also: online grow, dump, ... I'd go with XFS or Reiser. I'd go with XFS. But I may be kind of a fanatic...

Strange that, whatever the filesystem, you get equal numbers of people saying they have never lost a single byte and people who have had horrible corruption and would never touch it again. We stopped using XFS about a year ago because we were getting kernel stack space panics under heavy load over NFS. It looks like the time has come to give it another try.

If you are keen on data integrity, then don't touch any fs without data=ordered. ext3 is still king wrt data=ordered, albeit slow. Now XFS is fast, but doesn't support data=ordered. It seems that their solution to the problem is to pass the burden onto hw by using barriers. Maybe XFS can get away with this. Maybe.

Thanks! -- Al
Re: [PATCH 009 of 11] md: Support stripe/offset mode in raid10
Neil Brown wrote: On Tuesday May 2, [EMAIL PROTECTED] wrote: NeilBrown wrote: The industry standard DDF format allows for a stripe/offset layout where data is duplicated on different stripes, e.g.:

    A B C D
    D A B C
    E F G H
    H E F G

(columns are drives, rows are stripes, LETTERS are chunks of data). Presumably, this is the case for --layout=f2? Almost. mdadm doesn't support this layout yet. 'f2' is a similar layout, but the offset stripes are a lot further down the drives. It will possibly be called 'o2' or 'offset2'. If so, would --layout=f4 result in a 4-mirror/striped array? o4 on a 4-drive array would be:

    A B C D
    D A B C
    C D A B
    B C D A
    E F G H
    ...

Yes, so would this give us 4 physically duplicate mirrors? If not, would it be possible to add a far-offset mode to yield such a layout?

Also, would it be possible to have a staged write-back mechanism across multiple stripes? What exactly would that mean? Write the first stripe, then write subsequent duplicate stripes based on idle, with a max delay for each delayed stripe. And what would be the advantage? Faster burst writes, probably.

Thanks! -- Al
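The offset layouts Neil describes are just cyclic rotations of each data stripe; a sketch generating the repeated-stripe pattern for n copies (the function name is mine, not mdadm's):

```python
def offset_layout(chunks, drives, copies):
    """Rows of a raid10 offset-style layout: each data stripe is followed
    by copies-1 rows holding the same chunks rotated one drive further."""
    rows = []
    for base in range(0, len(chunks), drives):
        stripe = chunks[base:base + drives]
        for c in range(copies):
            # rotate the stripe right by c positions (c=0 is the data row)
            rows.append(stripe[-c:] + stripe[:-c] if c else stripe[:])
    return rows

for row in offset_layout(list("ABCDEFGH"), 4, 2):
    print(" ".join(row))
# A B C D
# D A B C
# E F G H
# H E F G
```

With `copies=4` the first stripe expands to the o4 rows quoted above (A B C D / D A B C / C D A B / B C D A).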
Re: Help needed - RAID5 recovery from Power-fail
Neil Brown wrote: 2 devices in a raid5?? Doesn't seem a lot of point in it being raid5 rather than raid1.

Wouldn't a 2-dev raid5 imply a striped block mirror (i.e. faster), rather than a raid1 duplicate block mirror (i.e. slower)?

Thanks! -- Al
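One way to see why a 2-device raid5 ends up equivalent to a mirror: with a single data chunk per stripe, the XOR parity equals the data chunk itself, so every chunk exists verbatim on both disks, just with the "parity" copy rotating between disks stripe by stripe. A toy sketch (pure Python, not md's actual code):

```python
from functools import reduce

def parity(*chunks: bytes) -> bytes:
    """XOR parity across equal-length data chunks, as in raid5."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

data = b"\x12\x34\x56\x78"
# With only one data chunk in the stripe (a 2-dev raid5), parity is the
# chunk itself, i.e. a verbatim second copy:
assert parity(data) == data
```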
Re: [PATCH 2.6.15-git9a] aoe [1/1]: do not stop retransmit timer when device goes down
Ed L. Cashin wrote: On Thu, Jan 26, 2006 at 01:04:37AM +0300, Al Boldi wrote: Ed L. Cashin wrote: This patch is a bugfix that follows and depends on the eight aoe driver patches sent January 19th. Will they also fix this? Or is this an md bug? No, this patch fixes a bug that would cause an AoE device to be totally unusable, so I think mdadm or mkraid would get an error that the device was not available before it tried to make a new md device.

It only happens with aoe. It looks like, in setting up the raid, sysfs_create_link probably has this going off:

BUG_ON(!kobj || !kobj->dentry || !name);

Also, why is aoe slower than nbd? It wasn't when I tried it. The userland vblade is slow. Maybe that's affecting your results?

Why is the userland vblade server slower than the userland nbd-server?

Thanks! -- Al
Re: io performance...
Jeff V. Merkey wrote: Jens Axboe wrote: On Mon, Jan 16 2006, Jeff V. Merkey wrote: Max Waterman wrote: I've noticed that I consistently get better (read) numbers from kernel 2.6.8 than from later kernels. To open the bottlenecks, the following works well. Jens will shoot me:

-#define BLKDEV_MIN_RQ   4
-#define BLKDEV_MAX_RQ   128   /* Default maximum */
+#define BLKDEV_MIN_RQ   4096
+#define BLKDEV_MAX_RQ   8192  /* Default maximum */

Yeah, I could shoot you. However, I'm more interested in why this is necessary, e.g. I'd like to see some numbers from you comparing:

# echo 8192 > /sys/block/<dev>/queue/nr_requests

for each drive you are accessing. The BLKDEV_MIN_RQ increase is just silly and wastes a huge amount of memory for no good reason. Yep. I build it into the kernel to save the trouble of sending it to proc. Jens' recommendation will work just fine; it has the same effect of increasing the max requests outstanding.

Your suggestion doesn't do anything here on 2.6.15, but

# echo 192 > /sys/block/<dev>/queue/max_sectors_kb
# echo 192 > /sys/block/<dev>/queue/read_ahead_kb

works wonders! I don't know why, but anything less than 64 and more than 256 makes the queue collapse miserably, causing some strange __copy_to_user calls?!?!?

Also, it seems that changing the kernel HZ has some drastic effects on the queues. A simple lilo run gets delayed 400% and 200% using 100HZ and 250HZ respectively.

-- Al
Re: RAID0 performance question
JaniD++ wrote: For me, the performance bottleneck is clearly in the RAID0 layer, used exactly as a concatenator to join the 4x2TB into 1x8TB.

Did you try running RAID0 over nbd directly and found it to be faster? IIRC, stacking raid modules does need a considerable amount of tuning, and even then it does not scale linearly. Maybe NeilBrown can help?

-- Al
Re: Where is the performance bottleneck?
Holger Kiehl wrote:

top - 08:39:11 up 2:03, 2 users, load average: 23.01, 21.48, 15.64
Tasks: 102 total, 2 running, 100 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0% us, 17.7% sy, 0.0% ni, 0.0% id, 78.9% wa, 0.2% hi, 3.1% si
Mem:   8124184k total,  8093068k used,    31116k free,  7831348k buffers
Swap: 15631160k total,    13352k used, 15617808k free,     5524k cached

  PID USER  PR NI  VIRT  RES SHR S %CPU %MEM   TIME+ COMMAND
 3423 root  18  0 55204  460 392 R 12.0  0.0 1:15.55 dd
 3421 root  18  0 55204  464 392 D 11.3  0.0 1:17.36 dd
 3418 root  18  0 55204  464 392 D 10.3  0.0 1:10.92 dd
 3416 root  18  0 55200  464 392 D 10.0  0.0 1:09.20 dd
 3420 root  18  0 55204  464 392 D 10.0  0.0 1:10.49 dd
 3422 root  18  0 55200  460 392 D  9.3  0.0 1:13.58 dd
 3417 root  18  0 55204  460 392 D  7.6  0.0 1:13.11 dd
  158 root  15  0     0    0   0 D  1.3  0.0 1:12.61 kswapd3
  159 root  15  0     0    0   0 D  1.3  0.0 1:08.75 kswapd2
  160 root  15  0     0    0   0 D  1.0  0.0 1:07.11 kswapd1
 3419 root  18  0 51096  552 476 D  1.0  0.0 1:17.15 dd
  161 root  15  0     0    0   0 D  0.7  0.0 0:54.46 kswapd0

A load average of 23 for 8 dd's seems a bit high. Also, why is kswapd working so hard? Is that correct?

Actually, kswapd is another problem (see the "Kswapd Flaw" thread), though one with little impact on your problem. Basically, kswapd tries very hard, maybe even too hard, to fulfill a request for memory: when the buffer/cache pages are full, kswapd tries to find some more unused memory, and when it finds none it starts recycling the buffer/cache pages. Which is OK, but it only does this after searching for swappable memory, which wastes CPU cycles. This can be tuned a little, but not much, by adjusting /sys(proc)/.../vm/..., or by renicing kswapd to the lowest priority, which may cause other problems.

Things get really bad when procs start asking for more memory than is available, causing kswapd to take the liberty of paging out running procs in the hope that these procs won't come back later. So when they do come back, something like a wild goose chase begins. This is also known as OverCommit.
This is closely related to the dreaded OOM-killer, which occurs when the system cannot satisfy a memory request from a returning proc, causing the VM to start killing in an unpredictable manner. Turning OverCommit off should solve this problem, but it doesn't. This is why it is recommended to always run the system with swap enabled even if you have tons of memory, which really only pushes the problem out of the way until you hit the dead end and the wild goose chase begins again. Sadly, 2.6.13 did not fix this either. Although this description only vaguely defines the problem from an end-user PoV, the actual semantics may be quite different.

-- Al
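For reference, the overcommit behaviour discussed above is controlled by a pair of sysctls on 2.6 kernels; a sketch of switching to strict accounting (the values are illustrative, not a recommendation for this workload):

```shell
# overcommit_memory: 0 = heuristic (default), 1 = always allow,
# 2 = strict accounting, where the commit limit is
#     swap + overcommit_ratio% of RAM,
# trading the unpredictable OOM-killer for earlier allocation failures.
echo 2  > /proc/sys/vm/overcommit_memory
echo 50 > /proc/sys/vm/overcommit_ratio

# Current commit limit and committed address space:
grep -E 'CommitLimit|Committed_AS' /proc/meminfo
```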
Re: [PATCH md 006 of 6] Add write-behind support for md/raid1
Paul Clements wrote: Al Boldi wrote: NeilBrown wrote: If a device is flagged 'WriteMostly' and the array has a bitmap, and the bitmap superblock indicates that write_behind is allowed, then write_behind is enabled for WriteMostly devices. Nice, but why is it dependent on WriteMostly? WriteMostly is just a flag that tells us which devices will get the write-behinds, and which will not. You'll be able to mix any combination of WriteMostly devices and normal devices in a raid1.

Yes, but doesn't WriteMostly imply ReadDelay? If so, doesn't that mean that WriteBehind is dependent on ReadDelay?

-- Al
RE: Multiplexed RAID-1 mode
Neil Brown wrote: { On Sunday July 31, [EMAIL PROTECTED] wrote: Multiplexing read/write requests would certainly improve performance ala RAID-0 (minus the offset overhead). During reads, the same RAID-0 code (plus a mirroring offset) could be used. During writes, though, this would imply delayed mirroring. But what exactly do you mean by 'delayed mirroring'? Are you suggesting that the write request completes after only writing to one mirror? If so, which one? Wouldn't this substantially reduce the value of mirroring? }

Think of it as a _smart_ resync running on idle. Should be an option, though!

-- Al
Multiplexed RAID-1 mode
Gordon Henderson wrote: { On Sat, 30 Jul 2005, Jeff Breidenbach wrote: I just ran a Linux software RAID-1 benchmark with some 500GB SATA drives in NCQ mode, along with a non-RAID control. Details are here for those interested: http://www.jab.org/raid-bench/ The results you get are about what I get on various systems: essentially, with RAID-1 you get about the same speed as a single drive will get.

ns1:/var/tmp# hdparm -tT /dev/md1 /dev/sda1 /dev/sdb1

/dev/md1:
 Timing cached reads:   4116 MB in 2.00 seconds = 2058.31 MB/sec
 Timing buffered disk reads:  174 MB in 3.00 seconds = 57.99 MB/sec

/dev/sda1:
 Timing cached reads:   4096 MB in 2.00 seconds = 2048.31 MB/sec
 Timing buffered disk reads:  176 MB in 3.03 seconds = 58.11 MB/sec

/dev/sdb1:
 Timing cached reads:   4116 MB in 2.00 seconds = 2057.28 MB/sec
 Timing buffered disk reads:  176 MB in 3.02 seconds = 58.27 MB/sec
}

Multiplexing read/write requests would certainly improve performance ala RAID-0 (minus the offset overhead). During reads, the same RAID-0 code (plus a mirroring offset) could be used. During writes, though, this would imply delayed mirroring.

-- Al