Re: split RAID1 during backups?
On Mon, 24 Oct 2005, Jeff Breidenbach wrote: Ok... thanks everyone!

David, you said you are worried about failure scenarios involved with RAID splitting. Could you please elaborate? My biggest concern is that I'm going to accidentally trigger a rebuild no matter what I try, but maybe you have something more serious in mind.

Brad, your suggestion about kernel 2.6.13 and intent logging and having mdadm pull a disk sounds like a winner. I'm going to try it if the software looks mature enough. Should I be scared?

Dean, the comment about write-mostly is confusing to me. Let's say I somehow marked one of the component drives write-mostly to quiet it down. How do I get at it? Linux will not let me mount the component partition if md0 is also mounted. Do you think write-mostly or write-behind are likely enough to be magic bullets that I should learn all about them?

Bill, thanks for the suggestion to use nbd instead of netcat. Netcat is solid software and very fast, but does feel a little like duct tape. You also suggested putting a third drive (local or nbd remote) temporarily in the RAID1. What does that buy versus the current practice of using dd_rescue to copy the data off md0? I'm not imagining any I/O savings over the current approach.

As a paranoid admin, you (a) reduce read-only time, (b) never have unmirrored data running, and (c) it does let you send from an unused drive (or you can get a real hot swap bay and put the mirrored drive in the safe).

I have one other thought, if you want to just stream this to another drive and can stand long r/o mounts (or play with intent stuff, carefully), that is to: open a socket to something on the other machine which is going to write a single BIG data file, or to a partition of the same size; open the partition as a file (open /dev/md0); use the sendfile() system call to blast the file to the socket without using user memory. Based on my vast experience with one test program, this should work ;-) It will be limited by how fast you can write it at the other end, I suspect.

I still think you should be able to do incrementals. If that's Reiser3 you're using, it may not be performing as well as ext3 with hashing would, but I lack the time to test that properly.

John, I'm using 4KB blocks in reiserfs with tail packing. All sorts of other details are in the dmesg output [1]. I agree seeks are a major bottleneck, and I like your suggestion about putting extra spindles in. Master-slave won't work because the data is continuously changing. I'm not going to argue about the optimality of millions of tiny files (go talk to Hans Reiser about that one!) but I definitely don't foresee major application redesign any time soon.

Most importantly, thanks for the encouragement. So far it sounds like there might be some ninja magic required, but I'm becoming increasingly optimistic that it will be - somehow - possible to manage disk contention in order to dramatically raise backup speeds.

Cheers, Jeff

[1] http://www.jab.org/dmesg

-- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with little computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
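A minimal sketch of the intent-logging variant Brad suggested, assuming a 2.6.13+ kernel, mdadm 2.x, and that md0 is a two-disk mirror of sda1/sdb1 (all names here are placeholders, not taken from Jeff's machine). Note the detached copy is only crash-consistent, so the filesystem on it may want journal replay or a tolerant read-only mount:

  # add a write-intent bitmap so a removed leg can come back with a quick resync
  mdadm --grow /dev/md0 --bitmap=internal
  # pull one leg out of the mirror for the backup window
  mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
  # back up the detached leg read-only while md0 keeps serving from sda1
  mount -o ro /dev/sdb1 /mnt/backup
  rsync -a /mnt/backup/ backuphost:/backups/current/
  umount /mnt/backup
  # re-add it; only blocks dirtied during the window are resynced
  mdadm /dev/md0 --re-add /dev/sdb1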
Re: split RAID1 during backups?
On Wed, 26 Oct 2005, Jeff Breidenbach wrote:

Norman> What you should be able to do with software raid1 is the
Norman> following: Stop the raid, mount both underlying devices
Norman> instead of the raid device, but of course READ ONLY. Both
Norman> contain the complete data and filesystem, and in addition to
Norman> that the md superblock at the end. Both should be identical
Norman> copies of that. Thus, you do not have to resync
Norman> afterwards. You then can backup the one disk while serving the
Norman> web server from the other. When you are done, unmount,
Norman> assemble the raid, mount it and go on.

I tried both variants of Norman's suggestion on a test machine and they worked great. Shutting down and restarting md0 did not trigger a rebuild. Perfect! And I could mount component partitions read-only at any time. However, on the production machine the component partitions refused to mount, claiming to be already mounted, despite the fact that the component drives do not show up anywhere in lsof or mtab. When I saw this, I got nervous and did not even try stopping md0 on the production machine.

As long as md0 is running I suspect the partition will be marked as in use, so you have to stop it. If the 2.4 kernel didn't detect that, I would call it a bug.

  # mount -o ro /dev/sdc1 backup
  mount: /dev/sdc1 already mounted or backup busy

The two machines hardly match. The test machine has a 2.4.27 kernel and JBOD drives hanging off a 3ware 7xxx controller. The production machine has a 2.6.12 kernel and Intel SATA controllers. Both machines have mdadm 1.9.0, and the discrepancy in behavior seems weird to me. Any insights?

[___snip___]

Bill> If you want to try something which used to work see nbd,
Bill> export 500GB from another machine, add the network block device
Bill> to the mirror, let it sync, break the mirror. Haven't tried
Bill> since 2.4.19 or so.

Wow, nbd (network block device) sounds really useful. I wonder if it is a good way to provide more spindles to a hungry webserver. Plus they had a major release yesterday. While I've been focusing on managing disk contention, if there's an easy way to reduce it, that's definitely fair game.

Some of the other suggestions I'm going to hold off on. For example, sendfile() doesn't really address the bottleneck of disk contention.

sendfile() bypasses the copy to user buffer, which in turn will bypass copy to system buffers, which eliminates contention for buffer space. Use vmstat to check; if you have a lot of system time and lots of space in buffers of various kinds, there's a good possibility that the problem is there.

I'm also not so anxious to switch filesystems. That's a two week endeavor that doesn't really address the contention issue. And it's also a little hard for me to imagine that someone is going to beat the pants off of reiserfs, especially since reiserfs was specifically designed to deal with lots of small files efficiently. Finally, I'm not going to focus on incremental backups if there's any prayer of getting a 500GB full backup in 3 hours. Full backups provide a LOT of warm fuzzies. Again, thank you all very much.

-- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with little computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
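A sketch of Bill's nbd third-mirror idea, with hypothetical host, volume, and device names, and assuming a kernel that lets md change the number of RAID1 members on the fly (otherwise the extra device just sits as a spare). The nbd invocations below use the old positional syntax of that era:

  # on the backup machine: export 500GB over nbd
  nbd-server 2000 /dev/vg0/backup500g

  # on the web server: attach it and widen the mirror to three legs
  modprobe nbd
  nbd-client backuphost 2000 /dev/nbd0
  mdadm --grow /dev/md0 --raid-devices=3
  mdadm /dev/md0 --add /dev/nbd0
  # watch /proc/mdstat until the sync finishes, then break the third leg off
  mdadm /dev/md0 --fail /dev/nbd0 --remove /dev/nbd0
  mdadm --grow /dev/md0 --raid-devices=2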
Re: udev and mdadm -- I'm lost
Konstantin Olchanski wrote: On Tue, Nov 08, 2005 at 04:40:23PM -0700, Dave Jiang wrote: I see there is an mdadm --auto option now... Just use --auto or --auto=yes and it should take care of the device node creation. I have an /etc/mdadm.conf so I just do: mdadm -As --auto=yes and it brings everything up over udev.

I face a similar problem each time I use KNOPPIX to revive a non-booting server. It always takes me 10-20 minutes to figure out the right mdadm incantation to start the md devices. It does not help that mdadm wants an /etc/mdadm.conf which is on an md device itself, and therefore inaccessible.

Hate to say it, but that's the reward of a poor config choice and should be fixed.

I wish there were a simple command to find and start all md devices, something along the lines of: mdadm --please-find-and-start-all-my-md-devices-please-please-please or mdadm --start-all, or whatever.

-- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
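For reference, the incantation that gets rediscovered every time amounts to roughly this sketch, run from the rescue or live system (the temporary config path is arbitrary):

  # build a throwaway config from the superblocks found on disk, then assemble
  echo 'DEVICE partitions' > /tmp/mdadm.conf
  mdadm --examine --scan --config=/tmp/mdadm.conf >> /tmp/mdadm.conf
  mdadm --assemble --scan --config=/tmp/mdadm.conf --auto=yes
  cat /proc/mdstat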
RAID-6
Based on some google searching on RAID-6, I find that it seems to be used to describe two different things. One is very similar to RAID-5, but with two redundancy blocks per stripe, one XOR and one CRC (or at any rate two methods are employed). The other sources define RAID-6 as RAID-5 with a distributed hot spare, AKA RAID-5E, which spreads head motion to all drives for performance. Any clarification on this? -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Poor Software RAID-0 performance with 2.6.14.2
Lars Roland wrote: I have created a stripe across two 500Gb disks located on separate IDE channels using: mdadm -Cv /dev/md0 -c32 -n2 -l0 /dev/hdb /dev/hdd the performance is awful on both kernel 2.6.12.5 and 2.6.14.2 (even with hdparm and blockdev tuning), both bonnie++ and hdparm (included below) shows a single disk operating faster than the stripe: In looking at this I found something interesting, even though you identified your problem before I was able to use the data for the intended purpose. So other than suggesting that the stripe size is too small, nothing on that, your hardware is the issue. I have two ATA drives connected, and each has two partitions. The first partition of each is mirrored for reliability with default 64k chunks, and the second is striped, with 512k chunks (I write a lot of 100MB files to this f/s). Reading the individual devices with dd, I saw a transfer rate of about 60MB/s, while the striped md1 device gave just under 120MB/s. (60.3573 and 119.6458) actually. However, the mirrored md0 also gave just 60MB/s read speed. One of the advantages of mirroring is that if there is heavy read load when one drive is busy there is another copy of the data on the other drive(s). But doing 1MB reads on the mirrored device did not show that the kernel took advantage of this in any way. In fact, it looks as if all the reads are going to the first device, even with multiple processes running. Does the md code now set write-mostly by default and only go to the redundant drives if the first fails? I won't be able to do a lot of testing until Thursday, or perhaps Wednesday night, but that is not as I expected and not what I want, I do mirroring on web and news servers to spread the head motion, now I will be looking at the stats to see if that's happening. I added the raid M/L to the addresses, since this is getting to be general RAID question. -- -bill davidsen ([EMAIL PROTECTED]) The secret to procrastination is to put things off until the last possible moment - but no longer -me - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
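For anyone wanting to reproduce the comparison, the read test described above amounts to something like this sketch (device names taken from the post; counts are arbitrary):

  # raw sequential read from each member, then from the stripe and the mirror
  for dev in /dev/hdb /dev/hdd /dev/md0 /dev/md1; do
    echo "$dev:"
    dd if=$dev of=/dev/null bs=1M count=1000
  done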
Re: Poor Software RAID-0 performance with 2.6.14.2
Paul Clements wrote: Bill Davidsen wrote: One of the advantages of mirroring is that if there is heavy read load when one drive is busy there is another copy of the data on the other drive(s). But doing 1MB reads on the mirrored device did not show that the kernel took advantage of this in any way. In fact, it looks as if all the reads are going to the first device, even with multiple processes running. Does the md code now set write-mostly by default and only go to the redundant drives if the first fails? No, it doesn't use write-mostly by default. The way raid1 read balancing works (in recent kernels) is this: - sequential reads continue to go to the first disk - for non-sequential reads, the code tries to pick the disk whose head is closest to the sector that needs to be read So even if the reads aren't exactly sequential, you probably still end up reading from the first disk most of the time. I imagine with a more random read pattern you'd see the second disk getting used. Thanks for the clarification. I think the current method is best for most cases, I have to think about how large a file you would need to have any saving in transfer time given that you have to consider the slowest seek, drives doing other things on a busy system, etc. -- -bill davidsen ([EMAIL PROTECTED]) The secret to procrastination is to put things off until the last possible moment - but no longer -me - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Booting from raid1 -- md: invalid raid superblock magic on sdb1
David M. Strang wrote: I've read, and read, and read -- and I'm still not having ANY luck booting completely from a raid1 device. This is my setup... sda1 is booting, working great. I'm attempting to transition to a bootable raid1. sdb1 is a 400GB partition -- it is type FD.

  Disk /dev/sdb: 400.0 GB, 400088457216 bytes
  2 heads, 4 sectors/track, 97677846 cylinders
  Units = cylinders of 8 * 512 = 4096 bytes

     Device Boot    Start         End      Blocks   Id  System
  /dev/sdb1             1    97677846   390711382   fd  Linux raid autodetect

I have created my raid1 mirror with the following command:

  mdadm --create /dev/md_d0 -e1 -ap --level=1 --raid-devices=2 missing,/dev/sdb1

The raid created correctly, I then partitioned md_d0 to match sda1.

I hope that's a typo... you need to partition sdb, not the md_d0 raid device.

-- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
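For the more conventional whole-partition layout Bill is suggesting, the usual sequence is roughly the sketch below (device names as in the post above; this is destructive to sdb, so double-check before running any of it):

  # copy sda's partition table onto sdb, then build a degraded mirror on sdb1
  sfdisk -d /dev/sda | sfdisk /dev/sdb
  mdadm --create /dev/md0 --level=1 --raid-devices=2 missing /dev/sdb1
  # copy the data across, point the bootloader and fstab at md0, then attach sda1
  mdadm /dev/md0 --add /dev/sda1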
Re: building a disk server
Brad Dameron wrote: Might look at a Areca SATA RAID controller. They support up to 24 ports and has hardware level RAID capacity expansion. Or if you want to go cheaper look at the 3ware controller and use it in JBOD. That way you can get the Smart monitoring and hotplug. Here is the server I built with 3TB usable for pretty cheap. All this was from www.8anet.com Supermicro SC933T-R760 3u or SC932T-R760 rackmount Chassis with 15 SATA Hot-Swap drive trays and triple redundant power supplies. Any motherboard and CPU will do. I would recommend a AMD64 CPU with a motherboard that has a PCI-X slot on it if possible. I used a Tyan S2468 with dual Athlon 2800's and 2GB. A 3ware 9500S-12. Not the 9500S-12MI with this case. Or the new 9550SX-12 which is much faster now. 12 - 300GB Maxtor MaXLine III drives 2 - Western Digital 36GB 10k drives I use the 2 36GB drives mirrored for the OS since I had the extra slots. Could of went with a Areca 16 port card instead. But I already had the 3ware laying around. I went with the 300GB Maxtor drives because at the time they were the ones that had SATAII NCQ (Native Command Queuing) and 16MB cache. This setup is very fast and I use it as a NFS server for backing up my main servers. I currently have about 20% left out of 3TB. Time to add another one. The only part of the hardware I would change is the CPU setup, a single dual core setup seems more cost and heat effective now. The controller is fine, but that just gets better with time as new stuff comes out. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: stripe_cache_size ?
Neil Brown wrote: On Friday December 9, [EMAIL PROTECTED] wrote: On Fri, 9 Dec 2005, Neil Brown wrote: On Friday December 9, [EMAIL PROTECTED] wrote: Hi, I found that there's a new sysfs stripe_cache_size variable. I want to know how does it affect RAID5 read / write performance (if any) ? Please cc to me if possible, thanks. Would you like to try it out and see? Any value from about 10 to a few thousand should be perfectly safe, though very large values may cause the system to run short of memory. The memory used is approximately stripe_cache_size * 4K * number-of-drives What??? I hope that's a typo... 1 - there's no use of the sysfs variable? 'stripe_cache_size' is the sysfs variable. Yes, it is used. 2 - that's going to be huge, 128k * 4k * 10 = 5.1GB !!! That is why I warned to limit it to a few thousand (128k is more than a few thousand!). Sorry, for some reason I read that as being in stripes instead of bytes, which would make it 128k for size only 2. My misread. I just ran bonnie over a 5drive raid5 with stripe_cache_size varying in from 256 to 4096 in a exponential sequence. (Numbers below 256 cause problems - I'll fix that). Results: 256 cage,8G,42594,93,151807,38,50660,18,38610,91,172056,38,912.8,2,16,4356,99,+,+++,+,+++,4389,99,+,+++,14091,100 512 cage,8G,42145,92,186535,44,60659,21,42249,96,172057,37,971.9,2,16,4407,99,+,+++,+,+++,4452,99,+,+++,13909,99 1024 cage,8G,42250,92,210407,50,61254,21,42106,96,172575,37,903.1,2,16,4370,99,+,+++,+,+++,4395,99,+,+++,13809,100 2048 cage,8G,42458,92,229577,55,61762,21,41965,96,168950,36,837.9,2,16,4373,99,+,+++,+,+++,4460,99,+,+++,14084,100 4096 cage,8G,42305,92,250318,62,62192,21,42156,96,170692,38,981.8,3,16,4380,99,+,+++,+,+++,4426,99,+,+++,13723,99 Seq Write speed ^ Increases substantially. Seq Read ^ Doesn't vary much. Seq rewrite ^ improves a bit So for that limited test, write speed is helped a lot, read speed isn't. Maybe I should try iozone... NeilBrown -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
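For anyone wanting to repeat the experiment, the knob lives in sysfs; a minimal sketch, assuming the array is md0 and five drives as in Neil's test:

  # current value, in cache entries; memory used is roughly value * 4K * nr-drives
  cat /sys/block/md0/md/stripe_cache_size
  # e.g. 4096 entries * 4K * 5 drives is roughly 80 MB
  echo 4096 > /sys/block/md0/md/stripe_cache_size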
Re: First RAID Setup
Brad Campbell wrote: Callahan, Tom wrote: It is always wise to build in a spare however, that being said about all raid levels. In your configuration, if a disk fails in your RAID5, your array will go down. RAID5 is usually 3+ disks, with a mirror. So you should have 3 disks at minimum, and then a 4th as a spare. /me wonders in the days of reliable RAID-6 why we use RAID-5 + spare? RAID-6 has saved me twice now from dual drive failures on a 15 disk array. It's schweett It's also a lot more overhead... RAID-5 needs to update just one parity block beyond the data written. As I understand the Q sum in RAID-6, and watching disk access rates, each write requires the entire stripe to be read, then P and Q calculated, then written. You can do the P with a read+write, but since you have to read the entire stripe for Q, you save a read by recalculating the P from data. Did I say that right, Neil? If you are seeing dual drive failures, I suspect your hardware has problems. We run multiple 3 and 6 TB databases, and over a dozen 1 TB data caching servers, all using a lot of small fast disk, and I haven't seen a real dual drive failure in about 8 years. We did see some cases which looked like dual failures, it turned out to be a firmware limitation, controller not waiting for the bus to settle after a real failure, and thinking the next i/o had failed (or similar, in any case a false fail on the transaction after the real fail). If you run two PATA drives on the same cable in master/slave, it's at least possible that this could happen with consumer grade hardware as well. Just a thought, dual failures are VERY unlikely unless one triggers the other in some way, like failing the bus or cabinet power supply. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: First RAID Setup
Andargor The Wise wrote: Yet another thing, someone has suggested that I should increase the chunk size for my RAID5 from 32 to either 64 or 128. Is it worth it, considering that the system doesn't normally run on a heavy load? Mail for a few users, some read-only database applications, website, etc. Mostly a development machine. Can't think of a case where it's not worth having better performance... I should write a WP on stripe size, and what happens as you change it with given loads. Would this alleviate the pauses during large file transfers/copies that I have indicated in my previous post? I'm asking because backing up ~176 GB, reconfiguring the RAID, and restoring it properly so the machine boots (the RAID5 is /) is quite a PITA. You may have some special case, but I would never put data that large in / just as a system admin issue. It makes backups and restores, as well as upgrades quite unpleasant. I guess you may have noticed that by now ;-) Good luck! -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Adding Reed-Solomon Personality to MD, need help/advice
Bailey, Scott wrote: Interestingly, I was just browsing this paper http://www.cs.utk.edu/%7Eplank/plank/papers/CS-05-569.html which appears to be quite on-topic for this discussion. I admit my eyes glaze over during intensive math discussions but it appears tuned RS might not be as horrible as you'd think since apparently state-of-the-art now provides tricks to avoid the Galois Field operations that used to be required. The thought that came to my mind was how does md's RAID-6 personality compare to EVENODD coding? Wondering if my home server will ever have enough storage for these discussions to become non-academic for me, :-) The problem is not having storage, it's having backup. The properties of backup are - able to be moved to off-site storage - cheap and fast enough to use regularly Making storage more reliable is a desirable end, but it doesn't guard against many common failures such as controllers going bad and writing unreadable sectors all over before total failure, fire, flood, and software errors in the kernel code. While none of these is common in the sense of everyday, they are all common in the sense of I never heard of that happening response. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Raid sync observations
Gordon Henderson wrote: On Wed, 21 Dec 2005, Sebastian Kuzminsky wrote: But how does the performance for read and write compare? Good question! I'll post some performance numbers of the RAID-6 configuration when I have it up and running. Post your hardware config too if you don't mind.

I have one server with 8 drives and for swap (Which it never does!) I created 2 x 4 disk RAID 6 arrays (same partition on all disks) and gave them to the kernel with equal priority:

  Filename     Type        Size     Used  Priority
  /dev/md10    partition   1991800  0     1
  /dev/md11    partition   1991800  0     1

  md10 : active raid6 sdd2[3] sdc2[2] sdb2[1] sda2[0]
        1991808 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
  md11 : active raid6 sdh2[3] sdg2[2] sdf2[1] sde2[0]
        1991808 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]

  /dev/md10:
   Timing buffered disk reads:  64 MB in 0.66 seconds = 97.28 MB/sec
  /dev/md11:
   Timing buffered disk reads:  64 MB in 0.95 seconds = 67.59 MB/sec

md10 is an on-board 4-port SII SATA controller, md11 is 2 x 2-port SII PCI cards. (Server is currently moderately loaded, so results are a bit lower than usual.) Cue the must/must not swap on RAID arguments ;-)

I wouldn't swap on RAID-6... performance is important, and swap is tiny compared to disk size. I would go to 2GB partitions and four-way RAID-1, since fast swap-in seems to make for better feel and writes are cached somewhat.

-- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
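For comparison, the four-way swap mirror suggested above might be set up like this sketch (partition names are placeholders for a small partition on each of four disks):

  mdadm --create /dev/md10 --level=1 --raid-devices=4 \
      /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2
  mkswap /dev/md10
  swapon -p 1 /dev/md10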
Re: more info on the hang with 2.6.15-rc5
Sebastian Kuzminsky wrote: Now it works, but I don't trust it one bit. I had been seeing almost immediate, perfectly repeatable hard lockups in 2.6.15-rc5 and 2.6.15-rc5-mm3, when using sata_mv, RAID, and LVM together. Nothing in the syslog or on the console, and the system is totally unresponsive to the keyboard and network.

My hardware setup is: four Seagate Barracuda 500 GB disks, on a Marvell MV88SX6081 8-port SATA-II PCI-X controller, on a PCI-X bus (64/66). The disks work great when accessed directly. They work great when used as four PVs for LVM, and when assembled into a 4-disk RAID-6. But when I make a RAID-6 array out of them, and use the array as a PV, the system would hang completely, within seconds. (This is with LVM 2.02.01, libdevicemapper 1.02.02, and dm-driver 4.5.0.)

I turned on all the debugging options in the kernel config hoping to get some insight, but this debug kernel doesn't crash. It's running fine, and I'm pounding on it. A timing problem in the interaction between LVM and RAID? Some kind of weird heisenbug? I'd be happy to do any debugging tests people suggest.

I've been waiting for more info on this, did it get fixed? 2.6.15?

-- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: mdadm-2.2 SEGFAULT: mdadm --assemble --scan
Andreas Haumer wrote: Hi!

Andre Noll schrieb: sorry if this is already known/fixed: Assemble() is called from mdadm.c with the update argument equal to NULL:

  Assemble(ss, array_list->devname, mdfd, array_list,
           configfile, NULL,
           readonly, runstop, NULL, verbose-quiet, force);

But in Assemble.c we have

  if (ident->uuid_set && (!update && strcmp(update, "uuid") != 0) ...

which yields a segfault in glibc's strcmp().

I just found the same problem after upgrading to mdadm-2.2. The logic to test for update not being NULL seems to be reversed. I created a small patch which seems to cure the problem (see attached file).

HTH

- andreas

-- Andreas Haumer | mailto:[EMAIL PROTECTED]
*x Software + Systeme | http://www.xss.co.at/
Karmarschgasse 51/2/20 | Tel: +43-1-6060114-0
A-1100 Vienna, Austria | Fax: +43-1-6060114-71

Index: mdadm/Assemble.c
===================================================================
RCS file: /home/cvs/repository/distribution/Utilities/mdadm/Assemble.c,v
retrieving revision 1.1.1.7
diff -u -r1.1.1.7 Assemble.c
--- mdadm/Assemble.c	5 Dec 2005 05:56:20 -0000	1.1.1.7
+++ mdadm/Assemble.c	31 Dec 2005 15:01:34 -0000
@@ -219,7 +219,7 @@
 		}
 		if (dfd >= 0) close(dfd);
-		if (ident->uuid_set && (!update && strcmp(update, "uuid") != 0) &&
+		if (ident->uuid_set && (update && strcmp(update, "uuid") != 0) &&
 		    (!super || same_uuid(info.uuid, ident->uuid, tst->ss->swapuuid)==0)) {
 			if ((inargv && verbose >= 0) || verbose > 0)
 				fprintf(stderr, Name ": %s has wrong uuid.\n",

Is that right now? Because && evaluates to zero or one, left to right, the parens and the !=0 are not needed, and I assume they're in for a reason (other than to make the code hard to understand). A comment before that if would make the intention clear; I originally thought the (!update was intended to be !(update, which would explain the parens, but that seems wrong. If it actually works as intended with the patch, perhaps a comment and cleanup in 2.3?

-- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID 16?
On Thu, 2 Feb 2006, J. Ryan Earl wrote: X-UID: 40928 Gordon Henderson wrote: I've actually had very good results hot swapping SCSI drives on a live linux system though. Anyone tried SATA drives yet? Yes, and it does NOT work yet. libata does not support hotplugging of harddrives yet: http://linux.yyz.us/sata/features.html It supports hotplugging of the PCI controller itself, but not harddrives. I can add controllers but no devices? I have to think the norm is exactly the opposite... One of the SATA controllers with the plugs on the back for your little external box. A work in progress, I realize. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with little computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: with the latest mdadm
On Tue, 7 Feb 2006, Neil Brown wrote: On Tuesday February 7, [EMAIL PROTECTED] wrote: hi, with the latest mdadm-2.3.1 we've got the following message: - md: md4: sync done. RAID1 conf printout: --- wd:2 rd:2 disk 0, wo:0, o:1, dev:sda2 disk 1, wo:0, o:1, dev:sdb2 md: mdadm(pid 8003) used obsolete MD ioctl, upgrade your software to use new ictls. - this is just a warning or some kind of problem with mdadm? This is with a 2.4 kernel, isn't it. The md driver is incorrectly interpreting an ioctl that it doesn't recognise as an obsolete ioctl. In fact it is a new ioctl that 2.4 doesn't know about (I suspect it is GET_BITMAP_FILE). The message should probably be removed from 2.4, but as 2.4 is in deep-maintenance mode, I suspect that is unlikely More to the point, why is mdadm trying to use an illegal ioctl instead of checking the kernel version and disabling features which won't work? Users are trusting their data to this software, and try it and see if it works ioctls don't give me a warm fuzzy feeling about correct operation. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with little computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RAID 106 array
I was doing some advance planning for storage, and it came to me that there might be benefit from thinking outside the box on RAID config. Therefore I offer this for comment. The object is reliability with performance. The means is to set up two arrays on physical devices, one RAID-0 for performance, one RAID-6 for reliability. Let's call them four and seven drives for the RAID-0 and RAID-6+spare. Then configure RAID-1 over the two arrays, marking the RAID-6 as write-mostly. The intention is that under heavy load the RAID-6 would have a lot of head motion going on, writing two parity blocks for every one data write, while the RAID-0 would be doing as little work as possible for writes and would therefore have more ability to handle reads quickly. Just a thought experiment at the moment. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with little computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
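A sketch of the layered setup described above, assuming eleven equal-sized drives (device names are placeholders). With four data drives' worth of space on each side, the stripe and the RAID-6 come out the same size, so they can be mirrored:

  # four-drive stripe for speed
  mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sd[abcd]1
  # six-drive RAID-6 plus one spare for safety
  mdadm --create /dev/md1 --level=6 --raid-devices=6 --spare-devices=1 \
      /dev/sd[efghij]1 /dev/sdk1
  # mirror the two, steering reads to the stripe
  mdadm --create /dev/md2 --level=1 --raid-devices=2 \
      /dev/md0 --write-mostly /dev/md1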
Re: NVRAM support
Erik Mouw wrote: On Fri, Feb 10, 2006 at 10:01:09AM +0100, Mirko Benz wrote: Does a high speed NVRAM device makes sense for Linux SW RAID? E.g. a PCI card that exports battery backed memory. Unless it's very large (i.e.: as large as one of your disks), it doesn't make sense. It will probably break less often, but it doesn't help you in case a disk really breaks. It also won't speed up an MD device much. Could that significantly improve write speed for RAID 5/6 (e.g. via an external journal, asynchronous operation and write caching)? You could use it for an external journal, or you could use it as a swap device. Let me concur, I used external journal on SSD a decade ago with jfs (AIX). If you do a lot of operations which generate journal entries, file create, delete, etc, then it will double your performance in some cases. Otherwise it really doesn't help much, use as a swap device might be more helpful depending on your config. What changes would be required? None, ext3 supports external journals. Look for the -O option in the mke2fs manual page. Using the NVRAM device as swap is not different from a using normal swap partition. Erik -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
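A minimal sketch of the external-journal setup mentioned above, assuming the card shows up as a block device; /dev/nvram0 here is a placeholder for whatever node it actually exposes:

  # format the NVRAM device as an ext3 journal device
  mke2fs -O journal_dev /dev/nvram0
  # build the data filesystem with its journal on that device
  mke2fs -j -J device=/dev/nvram0 /dev/md0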
Re: 2.6.15: mdrun, udev -- who creates nodes?
linas wrote: On Tue, Jan 31, 2006 at 04:40:46PM +0000, Jason Lunz was heard to remark: [EMAIL PROTECTED] said:

-- kernel scans /dev/hda1, looking for md superblock
-- kernel assembles devices according to info found in the superblocks
-- udev creates /dev/md0, etc.

The problem is that some users and distributions build the drivers as modules and/or disable in-kernel auto-assembly. Not only that, the raid developers themselves consider autoassembly deprecated. http://article.gmane.org/gmane.linux.kernel/373620

Hmm. My knee-jerk, didn't-stop-to-think-about-it reaction is that this is one of the finest features of linux raid, so why remove it? Speaking as a real-life sysadmin, with actual servers and actual failed disks, disk cables and disk controllers, this is a life-saving feature. Persistent naming of devices in Linux has long been a problem, and in this case, it seemed to work.

<story>
I once had an ide controller fail on an x86 board. I bought a new controller at the local store, recabled the disks, and booted. I was alarmed to find that the system was trying to mount /home as /usr, and /usr as /lib, etc. Turned out that /dev/hdc had gotten renamed as /dev/hde, etc. and had to go through a long, painful, rocket-science (yes, I *do* have a PhD) boot-floppy rescue to restore the system to working order. I shudder to think what would have happened if RAID reconstruction had started based on faulty device names. Worse, as part of my rescue ops, I had to make multiple copies of /etc/fstab, which resided on different disks (my root volume was raided), as well as the boot floppy, and each contained inconsistent info (needed to bootstrap my way back). Along the way, I made multiple errors in editing the /etc/fstab since I could not keep them straight; twiddling BIOS settings added to the confusion. If this had been /etc/raid.conf instead, with reconstruction triggered off of it, this could have been an absolute disaster.
</story>

Based on the above, real-life experience, my gut reaction is that raid assembly based on config files is a bad idea. I don't understand how innocent, minor errors made by the sysadmin won't result in catastrophic data loss.

I fear you don't understand how the auto detect and assemble works, or more to the point what it does, since how it works is somewhat more complex. If you use partitions and UUIDs, you can just plug in the drives any old place and they will be found and recognised in spite of that. As long as you have a boot drive where the BIOS will use it, mdadm will find your stuff and put it together correctly. Neil does more magic than Harry Potter! I know someone who gave this a real life test, although I'd not say who.

-- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
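As an illustration of assembling by UUID rather than by device name (the UUID value and device names below are placeholders):

  # each member records the array's UUID; see where a moved disk belongs
  mdadm --examine /dev/hde1
  # assemble by UUID, scanning whatever names the disks happen to have today
  mdadm --assemble /dev/md0 --uuid=<array-uuid> /dev/hd[a-h]1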
shared spare
One of the things I like about the IBM ServeRAID controller is spare drive shared between two RAID groups. First to fail gets it. For software RAID is this at all in the future? -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
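For what it's worth, mdadm's monitor mode documents a mechanism along these lines: arrays that share a spare-group in mdadm.conf can lend each other spares when a failure is noticed. A minimal sketch, with placeholder UUIDs and group name:

  # /etc/mdadm.conf
  DEVICE partitions
  ARRAY /dev/md0 UUID=<uuid-of-md0> spare-group=shelf1
  ARRAY /dev/md1 UUID=<uuid-of-md1> spare-group=shelf1

  # the monitor moves a spare from one array to the other on failure
  mdadm --monitor --scan --daemonise --mail=root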
Re: silent corruption with RAID1
Moses Leslie wrote: Hi, I have a machine that currently has 4 drives in it (currently running 2.6.15.4). The first two drives are on the onboard SATA controller (VIA) in a RAID-1. I haven't had any issues with these. The other two drives were added recently, along with an SiL PCI SATA card to put them on. lspci reports this card as: :00:0a.0 Unknown mass storage controller: Silicon Image, Inc. (formerly CMD Technology Inc) SiI 3112 [SATALink/SATARaid] Serial ATA Controller (rev 02) I initially used mdadm to create a new RAID1 of the two new drives, and added them into the LVM group that the other ones were in to expand the drive, but pretty quickly noticed (via rsync -c) that all new files were corrupted. I've since pulled the 2nd set of drives out of the LVM to test. It's only when using a RAID-1 that I get occasionaly corruption. I split the drives (each 300GB) into 4 75GB partitions each, and created 3 md devices. One 75GB raid1, one 150GB raid0, and 1 225GB raid5. I used a script that newfs'd each one, dd'd multiple copies of files (one run with a 1GB, one with 3GB, one with 6GB), md5'd those files, then umounted. At least once in each test run, there was a file with the wrong checksum when on the RAID-1 part of the test. After completing all the tests, I redid the md devices such that none of them used any of the same partitions that they had used in the first test (IE the RAID1 was sda1 and sdb1 in the first one, and was sda4 and sdb4 in the second one). I also did the same test using each of the regular partitions as well (sda1-4 and sdb1-4). I was never able to duplicate any corruption any other time than with the RAID1. There's never any error messages in dmesg or syslog. Is there anything I can do to help track down where the problem is? Based on my own experience, I would suspect hardware. I can't swear that you don't have buggy software of some kind, but I've been running for over a year on RAID-1 with critical data on the volume, and haven't seen any indication of problems. Because of the data, the files get checked against md5sums daily and sha1sums monthly. Some files are old, some are added almost every day, files seldom are updated, but it does happen, and they are moved to new directories on a fairly frequest (2-3 times/mo) basis. The checkfiles are run against an archival copy on another system about once a month, so I'm pretty sure there is no corruption happening. Cables are my favorite source of intermittent evil, memory problems are next, but that usually shows up everywhere if you look hard. Hope any of this is useful. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
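A sketch of the kind of write-checksum-verify loop described above, assuming /dev/md2 is the array under test and /tmp/testfile.img is any large source file (all names hypothetical):

  mke2fs -j /dev/md2
  mount /dev/md2 /mnt/test
  for i in `seq 1 20`; do cp /tmp/testfile.img /mnt/test/copy$i; done
  ( cd /mnt/test && md5sum copy* ) > /tmp/sums
  umount /mnt/test
  mount /dev/md2 /mnt/test
  ( cd /mnt/test && md5sum -c /tmp/sums )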
Re: block level vs. file level
Molle Bestefich wrote: it wrote: Ouch. How does hardware raid deal with this? Does it? Hardware RAID controllers deal with this by rounding the size of participant devices down to nearest GB, on the assumption that no drive manufacturers would have the guts to actually sell eg. a 250 GB drive with less than exactly 250.000.000.000 bytes of space on it. (It would be nice if the various flavors of Linux fdisk had an option to do this. It would be very nice if anaconda had an option to do this.) I guess if you care you specify the size of the partition instead of use it all. I use fdisk usually, cfdisk when installing, both let me set size, fdisk let's me set starting track and even play with the partition table's idea of geometry. What kind of an option did you have in mind? -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 4 disks: RAID-6 or RAID-10 ..
Gordon Henderson wrote: On Fri, 17 Feb 2006, Francois Barre wrote: 2006/2/17, Gordon Henderson [EMAIL PROTECTED]: On Fri, 17 Feb 2006, berk walker wrote: RAID-6 *will* give you your required 2-drive redundancy. Anyway, if you wish to resize your setup to 5 drives one day or another, I guess raid 6 would be preferable, because one day or another, a patch will popup and make raid6 resizing possible. Or won't it ? Resizing isn't something I really care for. This particular box will be sent away to a data centre where it'll stay for 3 years until I replace it. (And if I really do need more disk space in the meantime, I'll just build another :) Still scratching my head, trying to work out if raid-10 can withstand (any) 2 disks of failure though, although after reading md(4) a few times now, I'm begining to think it can't (unless you are lucky!) So maybe I'll just stick with Raid-6 as I know that! With only four drives you can just do the possible failure cases, there are only six... when any one drive fails you can only survive the failure of two of the three remaining drives, not what you wanted. How reliable do you NEED here is the real question. It isn't too hard to make the drives more reliable than the case they're in, how many fans and power supplies can you survive losing? -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Does grub support sw raid1?
Mike Hardy wrote: This works for me, there are several pages out there (I recall using the commands from a gentoo one most recently) that show the exact sequence of grub things you should do to get grub in the MBR of both disks. It sounds like your machine may not be set to boot off of anything other than that one disk though? Is that maybe a BIOS thing? I dunno, but I have definitely pulled a primary drive out of the system completely and booted off the second one, then had linux come up with (correctly) degraded arrays Frequently a BIOS will boot a 2nd drive if the first is missing/dead, but not if it returns bad data (CRC error). A soft fail is not handled correctly by all BIOS' in use, possibly a majority of them. -Mike Herta Van den Eynde wrote: (crossposted to linux-raid@vger.kernel.org and redhat-list@redhat.com) (apologies for this, but this should have be operational last week) I installed Red Hat EL AS 4 on a HP Proliant DL380, and configured all system devices in software RAID 1. I added an entry to grub.conf to fallback to the second disk in case the first entry fails. At boottime, booting from hd0 works fine. As does booting from hd1. Until I physically remove hd0 from the system. I tried manually installing grub on hd1, I added hd1 to the device.map and subsequently re-installed grub on it, I remapped hd0 to /dev/cciss/c0d1 and subsequently re-installed grub all to no avail. I previously installed this while the devices were in slots 2 and 3. The system wouldn't even boot then. It looks as though booting from sw RAID1 will only work when there's a valid device in slot 0. Still preferable over hw RAID1, but even better would be if this worked all the way. Is this working for anyone? Any idea what I may have overlooked? Any suggestions on how to debug this? Kind regards, Herta Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
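The usual grub-legacy trick is to map the second disk as (hd0) while installing to it, so it can boot on its own if the first disk disappears. A sketch assuming /boot lives on the first partition of each mirrored disk (device names are placeholders; on a cciss controller the path will differ):

  # at the grub shell (grub legacy):
  grub> device (hd0) /dev/sdb
  grub> root (hd0,0)
  grub> setup (hd0)
  grub> quit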
Re: software raid to Hardware raid
Ken wrote: Hello, We have a Red Hat 7.1 box running kernel 2.4.17-SMP. We have raidtools installed. We had a hardware RAID that got converted to a software raid by mistake. Is there any way to go back without losing the data? The device is showing up as /dev/md0. Thank you.

1 - back up your data
2 - convert to hw raid
3 - test carefully to be sure it works
4 - reload your data

And I left out step zero: consider why you want to do this if what you have is working reliably. Also, between 2 and 3, consider upgrading to a distribution and kernel written this millennium. Given the state of hardware raid when that kernel was new, and the state of the drivers, I would be very sure I had a GOOD reason to change anything.

-- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Partitioning md devices versus partitioining underlying devices
On Thu, 6 Apr 2006, andy liebman wrote: Hi, I have a fundamental question about WHERE it is best to do partititioning. Here's a concrete example. I have two 3ware RAID-5 arrays, each made up of 12 500 GB drives. When presented to Linux, these are /dev/sda and /dev/sdb -- each 5.5 TB in size. I want to stripe the two arrays together, so that 24 drives are all operating as one unit. However, I don't want an 11 TB filesystem. I want to keep my filesystems down below 6 TB. It seems I have two choices: 1) partition the 3ware devices to make /dev/sda1, /dev/sda2, /dev/sdb1 and /dev/sdb2. Then I can create TWO md RAID-0 devices -- /dev/sda1 + /dev/sdb1 = /dev/md1, /dev/sda2 + /dev/sdb2 = /dev/md2 OR 2) create /dev/md1 from the entire 3ware devices -- /dev/sda + /dev/sdb = /dev/md1 -- and then partition /dev/md1 into two devices. The question is, are these essentially equivalent alternatives? Is there any theoretical reason why one choice would be better than the other -- in terms of security, performance, memory usage, etc. A knowledgeable answer would be appreciated. Thanks in advance. There is one advantage to partitioning sda and sdb and then building devices using the partitions... you can use different stripe sizes on each md drive built on the partition. *IF* you have different things going on in the filesystems, you may be able to improve performance and spread head motion by using tuned stripe sizes. I did this for an application which had and index of 128 bytes index records to a bunch of 500-1000k data records. I used a small stripe size on the index and large on the data, and was able to reduce time from request to data delivery by more than 20%. I was doing RAID-0 over six SCSI drives. Assuming that you do the same thing on both filesystems, I see no benefit to one way over the other, I was just answering your question as to a possible benefit. Use of LVM or dm to do the same thing might allow you to change f/s sizes and such after the fact, I have only tried that as a learning exercise, so I can't say how well it works in practice, or which is better for you. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with little computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
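For illustration, the partition-first variant with per-array chunk sizes might look like this sketch (the two 3ware units appear as sda/sdb as in the post; the chunk values are just examples of a small/large split):

  # sda1+sdb1 striped with small chunks for small-record traffic
  mdadm --create /dev/md1 --level=0 --chunk=32 --raid-devices=2 /dev/sda1 /dev/sdb1
  # sda2+sdb2 striped with large chunks for big sequential files
  mdadm --create /dev/md2 --level=0 --chunk=512 --raid-devices=2 /dev/sda2 /dev/sdb2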
Re: Softraid controllers and Linux
Jim Klimov wrote: Hello linux-raid, I have tried several cheap RAID controllers recently (namely, VIA VT6421, Intel 6300ESB and Adaptec/Marvell 885X6081). VIA one is a PCI card, the second two are built in a Supermicro motherboard (E7520/X6DHT-G). The intent was to let the BIOS of the controllers make a RAID1 mirror of two disks independently of an OS to make redundant multi-OS booting transparent. While DOS and Windows saw their mirrors as a singular block device, Linux (FC5) accessed the two drives separately on all adapters. Is this a bug or a feature of the kernel driver support? (I did not try vendors' binary drivers, if there are any). If I understand how Linux uses the drives, you have to make them raid manually. However, the nice thing about BIOS RAID is that it will boot the system if the first boot drive fails. If the drive fails hard the BIOS will go to the first functional drive and boot. But if you get a CRC error, some BIOS will try another and some will just fail. Vendor dependent. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RHEL3 kernel panic with md
Colin McDonald wrote: I appear to have a corrupt file system and now it is mirrored. LOL. I am running Redhat Enterprise 3 and using mdtools. I booted from the install media iso and went into rescue mode. RH was unable to find the partitions automatically but after exiting into bash i can run fdisk -l and i see all of the partitions. I know this is sparse info but would any of the group be able to give the best approach to getting them mounted and fsck'd? I'm happy to say I haven't tried this, but I would think that you can start the raid array manually and then run fsck (also manually). I would worry about why the rescue mode didn't find your data, though. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
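If mdadm is present on the rescue image, a manual assemble-and-check might look like the sketch below (device names are guesses; with plain raidtools it would be raidstart plus an /etc/raidtab instead):

  mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1
  fsck -n /dev/md0     # look first, change nothing
  fsck -y /dev/md0     # then the actual repair pass
  mount /dev/md0 /mnt/sysimage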
Re: disks becoming slow but not explicitly failing anyone?
Nix wrote: On 23 Apr 2006, Mark Hahn stipulated: I've seen a lot of cheap disks say (generally deep in the data sheet that's only available online after much searching and that nobody ever reads) that they are only reliable if used for a maximum of twelve hours a day, or 90 hours a week, or something of that nature. Even server I haven't, and I read lots of specs. they _will_ sometimes say that non-enterprise drives are intended or designed for a 8x5 desktop-like usage pattern. That's the phrasing, yes: foolish me assumed that meant `if you leave it on for much longer than that, things will go wrong'. to the normal way of thinking about reliability, this would simply mean a factor of 4.2x lower reliability - say from 1M to 250K hours MTBF. that's still many times lower rate of failure than power supplies or fans. Ah, right, it's not a drastic change. It still stuns me that anyone would ever voluntarily buy drives that can't be left switched on (which is perhaps why the manufacturers hide I've definitely never seen any spec that stated that the drive had to be switched off. the issue is really just what is the designed duty-cycle? I see. So it's just `we didn't try to push the MTBF up as far as we would on other sorts of disks'. I run a number of servers which are used as compute clusters. load is definitely 24x7, since my users always keep the queues full. but the servers are not maxed out 24x7, and do work quite nicely with desktop drives for years at a time. it's certainly also significant that these are in a decent machineroom environment. Yeah; i.e., cooled. I don't have a cleanroom in my house so the RAID array I run there is necessarily uncooled, and the alleged aircon in the room housing work's array is permanently on the verge of total collapse (I think it lowers the temperature, but not by much). it's unfortunate that disk vendors aren't more forthcoming with their drive stats. for instance, it's obvious that wear in MTBF terms would depend nonlinearly on the duty cycle. it's important for a customer to know where that curve bends, and to try to stay in the low-wear zone. similarly, disk Agreed! I tend to assume that non-laptop disks hate being turned on and hate temperature changes, so just keep them running 24x7. This seems to be OK, with the only disks this has ever killed being Hitachi server-class disks in a very expensive Sun server which was itself meant for 24x7 operation; the cheaper disks in my home systems were quite happy. (Go figure...) specs often just give a max operating temperature (often 60C!), which is almost disingenuous, since temperature has a superlinear effect on reliability. I'll say. I'm somewhat twitchy about the uncooled 37C disks in one of my machines: but one of the other disks ran at well above 60C for *years* without incident: it was an old one with no onboard temperature sensing, and it was perhaps five years after startup that I opened that machine for the first time in years and noticed that the disk housing nearly burned me when I touched it. The guy who installed it said that yes, it had always run that hot, and was that important? *gah* I got a cooler for that disk in short order. a system designer needs to evaluate the expected duty cycle when choosing disks, as well as many other factors which are probably more important. for instance, an earlier thread concerned a vast amount of read traffic to disks resulting from atime updates. 
Oddly, I see a steady pulse of write traffic, ~100Kb/s, to one dm device (translating into read+write on the underlying disks) even when the system is quiescient, all daemons killed, and all fsen mounted with noatime. One of these days I must fish out blktrace and see what's causing it (but that machine is hard to quiesce like that: it's in heavy use). simply using more disks also decreases the load per disk, though this is clearly only a win if it's the difference in staying out of the disks duty-cycle danger zone (since more disks divide system MTBF). Well, yes, but if you have enough more you can make some of them spares and push up the MTBF again (and the cooling requirements, and the power consumption: I wish there was a way to spin down spares until they were needed, but non-laptop controllers don't often seem to provide a way to spin anything down at all that I know of). hdparam will let you set the spindown time. I have all mine set that way for power and heat reasons, they tend to be in burst use. Dropped the CR temp by enough to notice, but I need some more local cooling for that room still. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org
Re: Two-disk RAID5?
Erik Mouw wrote: On Wed, Apr 26, 2006 at 03:22:38PM -0400, Jon Lewis wrote: On Wed, 26 Apr 2006, Jansen, Frank wrote: It is not possible to flip a bit to change a set of disks from RAID 1 to RAID 5, as the physical layout is different. As Tuomas pointed out though, a 2 disk RAID5 is kind of a special case where all you have is data and parity which is actually also just data. No, the other way around: RAID1 is a special case of RAID5.

No it isn't. If you have N drives in RAID1 you have N independent copies of the data and no parity; there's just no corresponding thing in RAID5, which has one copy of the data, plus parity. There is no special case, it just doesn't work that way. Set N>2 and report back. Sorry, I couldn't find a diplomatic way to say you're completely wrong.

-- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Two-disk RAID5?
John Rowe wrote: I'm about to create a RAID1 file system and a strange thought occurs to me: if I create a two-disk RAID5 array then I can grow it later by the simple expedient of adding a third disk and hence doubling its size. Is there any real down-side to this, such as performance? Alternatively is it likely that mdadm will soon be able to convert a RAID1 pair to RAID5 any time soon? (Just how different are they anyway? Isn't the RAID4/5 checksum just an OR?) I think it works, I just set up a little test case with two 20MB files and loopback mount. The mdadm seems to work, the mke2fs seems to work, the f/s is there. Please verify, this system is a bit (okay a bunch) hacked. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
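A sketch of the loopback experiment described above (file paths and the md device number are arbitrary; the final grow step assumes a kernel and mdadm new enough to reshape RAID5):

  dd if=/dev/zero of=/tmp/d0 bs=1M count=20
  dd if=/dev/zero of=/tmp/d1 bs=1M count=20
  losetup /dev/loop0 /tmp/d0
  losetup /dev/loop1 /tmp/d1
  mdadm --create /dev/md9 --level=5 --raid-devices=2 /dev/loop0 /dev/loop1
  mke2fs /dev/md9

  # later, to test growing to a third "disk":
  dd if=/dev/zero of=/tmp/d2 bs=1M count=20
  losetup /dev/loop2 /tmp/d2
  mdadm /dev/md9 --add /dev/loop2
  mdadm --grow /dev/md9 --raid-devices=3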
Re: 4 disks in raid 5: 33MB/s read performance?
Mark Hahn wrote: I just dd'ed a 700MB iso to /dev/null, dd returned 33MB/s. Isn't that a little slow? what bs parameter did you give to dd? it should be at least 3*chunk (probably 3*64k if you used defaults.) I would expect readahead to make this unproductive. Mind you, I didn't say it is, but I can't see why not. There was a problem with data going through stripe cache when it didn't need to, but I thought that was fixed. Neil? Am I an optimist? -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
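To make the two knobs concrete, something like the following; /dev/md0, the block size and the readahead value are just examples, and a 64k chunk is assumed:

# read with a block size of several chunks rather than dd's 512-byte default
dd if=/dev/md0 of=/dev/null bs=256k count=4000
# check, and if you like raise, the array's readahead (value is in 512-byte sectors)
blockdev --getra /dev/md0
blockdev --setra 4096 /dev/md0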
Re: RAID5 kicks non-fresh drives
Mikael Abrahamsson wrote: On Thu, 25 May 2006, Craig Hollabaugh wrote: That did it! I set the partition FS Types from 'Linux' to 'Linux raid autodetect' after my last re-sync completed. Manually stopped and started the array. Things looked good, so I crossed my fingers and rebooted. The kernel found all the drives and all is happy here in Colorado. Would it make sense for the raid code to somehow warn in the log when a device in a raid set doesn't have Linux raid autodetect partition type? If this was in dmesg, would you have spotted the problem before? As long as it is written where logwatch will see it, not recognize it, and report it... People who don't read their logwatch reports get no sympathy from me. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
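For reference, the type change described above can be done without walking through fdisk menus; sdc and partition 1 here are made-up examples, so check fdisk -l first:

# show the current table and partition types
fdisk -l /dev/sdc
# set partition 1 to type fd (Linux raid autodetect)
sfdisk --change-id /dev/sdc 1 fd

Then stop and restart the array (or reboot) to confirm autodetect picks it up.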
Re: Can't get drives containing spare devices to spindown
Did I miss an answer to this? As the weather gets hotter I'm doing all I can to reduce heat. Marc L. de Bruin wrote: Lo, Situation: /dev/md0, type raid1, containing 2 active devices (/dev/hda1 and /dev/hdc1) and 2 spare devices (/dev/hde1 and /dev/hdg1). Those two spare 'partitions' are the only partitions on those disks and therefore I'd like to spin down those disks using hdparm for obvious reasons (noise, heat). Specifically, 'hdparm -S value device' sets the standby (spindown) timeout for a drive; the value is used by the drive to determine how long to wait (with no disk activity) before turning off the spindle motor to save power. However, it turns out that md actually sort-of prevents those spare disks to spindown. I can get them off for about 3 to 4 seconds, after which they immediately spin up again. Removing the spare devices from /dev/md0 (mdadm /dev/md0 --remove /dev/hd[eg]1) actually solves this, but I have no intention actually removing those devices. How can I make sure that I'm actually able to spin down those two spare drives? -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
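The hdparm settings being discussed look roughly like this; hde is only an example device, and -S values from 1 to 240 are multiples of 5 seconds, so 120 means 10 minutes of idle time:

# spin the drive down after 10 minutes of inactivity
hdparm -S 120 /dev/hde
# or put it into standby right now, to see whether something wakes it again
hdparm -y /dev/hde

If md really is touching the spares every few seconds, as described above, the -y test should show them spinning back up almost immediately.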
Re: problems with raid=noautodetect
Neil Brown wrote: On Friday May 26, [EMAIL PROTECTED] wrote: On Tue, May 23, 2006 at 08:39:26AM +1000, Neil Brown wrote: Presumably you have a 'DEVICE' line in mdadm.conf too? What is it? My first guess is that it isn't listing /dev/sdd? somehow. Neil, i am seeing a lot of people that fall into this same error, and i would propose a way of avoiding this problem 1) make DEVICE partitions the default if no device line is specified. As you note, we think alike on this :-) 2) deprecate the DEVICE keyword issuing a warning when it is found in the configuration file Not sure I'm so keen on that, at least not in the near term. Let's not start warning and deprecating powerful features because they can be misused... If I wanted someone to make decisions for me I would not be using this software at all. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: problems with raid=noautodetect
Luca Berra wrote: On Tue, May 30, 2006 at 01:10:24PM -0400, Bill Davidsen wrote: 2) deprecate the DEVICE keyword issuing a warning when it is found in the configuration file Not sure I'm so keen on that, at least not in the near term. Let's not start warning and deprecating powerful features because they can be misused... If I wanted someone to make decisions for me I would not be using this software at all. you cut the rest of the mail. Trimming the part about which I make no comment is usually a good thing. i did not propose to deprecate the feature, just the keyword. A rose by any other name would still smell as sweet. In other words, the capability is still able to be misused, and changing the name or generating error messages will only cause work and concern for people using the feature. but, ok, just go on writing DEVICE /dev/sda1 DEVICE /dev/sdb1 ARRAY /dev/md0 devices=/dev/sda1,/dev/sdb1 then come on the list and complain when it stops working. What I suggest is that the feature keep working, and no one will complain. If there is a missing partition the error messages are clear. The feature is mainly used when there are partitions or drives which should not be examined, and stops working only when a hardware config has changed. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RAID5E
Where I was working most recently some systems were using RAID5E (RAID5 with both the parity and hot spare distributed). This seems to be highly desirable for small arrays, where spreading head motion over one more drive will improve performance, and in all cases where a rebuild to the hot spare will avoid a bottleneck on a single drive. Is there any plan to add this capability? -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: which CPU for XOR?
Dexter Filmore wrote: What type of operation is XOR anyway? Should be ALU, right? So - what CPU is best for software raid? One with high integer processing power? Unless you're running really low on CPU, it probably doesn't matter... you run out of memory bandwidth on large data (larger than cache) anyway. That's my take on it. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
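One easy way to see how the parity code compares to memory bandwidth on a given box: the kernel benchmarks the available XOR routines when the raid code initializes and prints the results, so something like this shows the numbers it picked (the exact wording varies by kernel version, and the messages may have scrolled out of the ring buffer on a long-running system):

dmesg | grep -i xor
# typical output looks something like "raid5: using function: pIII_sse (2397.600 MB/sec)"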
Re: to understand the logic of raid0_make_request
Neil Brown wrote: On Tuesday June 13, [EMAIL PROTECTED] wrote: hello, everyone. I am studying the code of raid0. But I find that the logic of raid0_make_request is a little difficult to understand. Who can tell me what the function of raid0_make_request will do eventually? One of two possibilities. Most often it will update bio->bi_bdev and bio->bi_sector to refer to the correct location on the correct underlying devices, and then will return '1'. The fact that it returns '1' is noticed by generic_make_request in block/ll_rw_blk.c and generic_make_request will loop around and retry the request on the new device at the new offset. However in the unusual case that the request crosses a chunk boundary and so needs to be sent to two different devices, raid0_make_request will split the bio in two (using bio_split) and will submit each of the two bios directly down to the appropriate devices - and will then return '0', so that generic_make_request doesn't loop around. I hope that helps. Helps me, anyway, thanks! Wish the comments on stuff like that in general were clearer: you can see what the code *does*, but you have to hope that it's what the coder *intended*. And if you're looking for a bug it may not be, so this is not an idle complaint. Some of the kernel coders think if it was hard to write it should be hard to understand. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ANNOUNCE: mdadm 2.5.1 - A tool for managing Soft RAID under Linux
Paul Clements wrote: Neil Brown wrote: I am pleased to announce the availability of mdadm version 2.5.1 Hi Neil, Here's a small patch to allow compilation on gcc 2.x. It looks like gcc 3.x allows variable declarations that are not at the start of a block of code (I don't know if there's some standard that allows that in C code now, but it doesn't work with all C compilers). Even if valid, having the declaration at the top of the block in which it's used makes the program more readable. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Is shrinking raid5 possible?
Neil Brown wrote: On Monday June 19, [EMAIL PROTECTED] wrote: Hi, I'd like to shrink the size of a RAID5 array - is this possible? My first attempt shrinking 1.4Tb to 600Gb, mdadm --grow /dev/md5 --size=629145600 gives mdadm: Cannot set device size/shape for /dev/md5: No space left on device Yep. The '--size' option refers to: Amount (in Kibibytes) of space to use from each drive in RAID1/4/5/6. This must be a multiple of the chunk size, and must leave about 128Kb of space at the end of the drive for the RAID superblock. (from the man page). So you were telling md to use the first 600GB of each device in the array, and it told you there wasn't that much room. If your array has N drives, you need to divide the target array size by N-1 to find the target device size. So if you have a 5 drive array, then you want --size=157286400 May I say in all honesty that making people do that math instead of the computer is a really bad user interface? Good, consider it said. A means to just set the target size of the resulting raid device would be a LOT less likely to cause bad user input, and while I'm complaining it should understand the suffixes 'k', 'm', and 'g'. It would be far easier to use for the case where you need, for instance, 10G of storage for a database: tell mdadm what devices to use and what you need (and the level, of course) and let the computer figure out the details - rounding up, leaving 128k, and phase of the moon if you decide to use it. Sorry, I think the current approach is baaad human interface. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
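Until mdadm grows a friendlier option, the arithmetic can at least be pushed onto the shell; this sketch assumes the 5-drive RAID5 and 600GB target from the example above:

# per-device size = target array size / (number of drives - 1), in KiB
TARGET_KB=$((600 * 1024 * 1024))   # 600GB expressed in KiB
DRIVES=5
mdadm --grow /dev/md5 --size=$((TARGET_KB / (DRIVES - 1)))   # 157286400

The result still has to be a multiple of the chunk size; 157286400 happens to be for a 64k chunk, but in general round down to the nearest chunk.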
Re: New FAQ entry? (was IBM xSeries stop responding during RAID1 reconstruction)
Niccolo Rigacci wrote: personally, I don't see any point to worrying about the default, compile-time or boot time: for f in `find /sys/block/* -name scheduler`; do echo cfq > $f; done I tested this case: - reboot as per power failure (RAID goes dirty) - RAID starts resyncing as soon as the kernel assembles it - every disk activity is blocked, even DHCP failed! - host services are unavailable This is why I changed the kernel default. Changing on the command line assumes that you built all of the schedulers in... but making that assumption, perhaps the correct fail-safe is to have cfq as the default, and at the end of rc.local check for rebuild, and if everything is clean change to whatever works best at the end of the boot. If the raid is not clean stay with cfq. Has anyone tried deadline for this? I think I had this as default and didn't hang on a raid5 fail/rebuild. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
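A minimal sketch of that rc.local idea, assuming all the schedulers are built in and that deadline is the preferred scheduler for normal running (both are assumptions, not recommendations):

# near the end of rc.local: stay on cfq while any array is resyncing or rebuilding
if grep -qE 'resync|recover' /proc/mdstat; then
    SCHED=cfq
else
    SCHED=deadline
fi
# write the choice to every queue; errors from devices without an elevator are ignored
for f in /sys/block/*/queue/scheduler; do
    echo $SCHED > $f 2>/dev/null
done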
Re: Ok to go ahead with this setup?
Christian Pernegger wrote: Hi list! Having experienced firsthand the pain that hardware RAID controllers can be -- my 3ware 7500-8 died and it took me a week to find even a 7508-8 -- I would like to switch to kernel software RAID. Here's a tentative setup: Intel SE7230NH1-E mainboard Pentium D 930 2x1GB Crucial 533 DDR2 ECC Intel SC5295-E enclosure Promise Ultra133 TX2 (2ch PATA) - 2x Maxtor 6B300R0 (300GB, DiamondMax 10) in RAID1 Onboard Intel ICH7R (4ch SATA) - 4x Western Digital WD5000YS (500GB, Caviar RE2) in RAID5 * Does this hardware work flawlessly with Linux? * Is it advisable to boot from the mirror? Would the box still boot with only one of the disks? Let me say this about firmware mirror: while virtually every BIOS will boot the next disk if the first fails, some will not fail over if the first drive is returning a parity but still returning data. Take that data any way you want, drive failure at power cycle is somewhat more likely than failure while running. * Can I use EVMS as a frontend? Does it even use md or is EVMS's RAID something else entirely? * Should I use the 300s as a single mirror, or span multiple ones over the two disks? * Am I even correct in assuming that I could stick an array in another box and have it work? Comments welcome Thanks, Chris - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ok to go ahead with this setup?
Molle Bestefich wrote: Christian Pernegger wrote: Anything specific wrong with the Maxtors? No. I've used Maxtor for a long time and I'm generally happy with them. They break now and then, but their online warranty system is great. I've also been treated kindly by their help desk - talked to a cute gal from Maxtor in Ireland over the phone just yesterday ;-). Then again, they've just been acquired by Seagate, or so, so things may change for the worse, who knows. I'd watch out regarding the Western Digital disks, apparently they have a bad habit of turning themselves off when used in RAID mode, for some reason: http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/1980/ Based on three trials in five years, I'm happy with WD and Seagate. WD didn't ask when I bought it, just the serial for manufacturing date. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Large single raid and XFS or two small ones and EXT3?
Martin Schröder wrote: 2006/6/23, Francois Barre [EMAIL PROTECTED]: Losing data is worse than losing anything else. You can buy you That's why RAID is no excuse for backups. The problem is that there is no cost effective backup available. When a tape was the same size as a disk and 10% the cost, backups were practical. Today anything larger than a hobby-sized disk is just not easy to back up. Anything large enough to be useful is expensive; small media, or something you can't take off-site and lock in a vault, aren't backups so much as copies, which may protect against some problems, but which provide little to no protection against site disasters. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Large single raid and XFS or two small ones and EXT3?
Adam Talbot wrote: OK, this topic I really need to get in on. I have spent the last few weeks benchmarking my new 1.2TB, 6 disk, RAID6 array. I wanted real numbers, not "This FS is faster because..." I have moved over 100TB of data on my new array running the benchmark testing. I have yet to have any major problems with ReiserFS, EXT2/3, JFS, or XFS. I have done extensive testing on all, including just trying to break the file system with billions of 1k files, or a 1TB file. Was able to cause some problems with EXT3 and ReiserFS with the 1KB and 1TB tests, respectively, but both were fixed with a fsck. My basic test is to move all data from my old server to my new server (whitequeen2) and clock the transfer time. Whitequeen2 has very little storage. The NAS's 1.2TB of storage is attached via iSCSI and a cross over cable to the back of whitequeen2. The data is 100GB of user's files (1KB~2MB), 50GB of MP3's (1MB~5MB) and the rest is movies and system backups 600MB~2GB. Here is a copy of my current data sheet, including specs on the servers and copy times, my numbers are not perfect, but they should give you a clue about speeds... XFS wins. In many (most?) cases I'm a lot more concerned about filesystem stability than performance. That is, I want the fastest reliable filesystem. With ext2 and ext3 I've run multiple multi-TB machines spread over four time zones, and not had a f/s problem updating ~1TB/day. The computer: whitequeen2 AMD Athlon64 3200 (2.0GHz) 1GB Corsair DDR 400 (2X 512MB's running in dual DDR mode) Foxconn 6150K8MA-8EKRS motherboard Off brand case/power supply 2X os disks, software raid array, RAID 1, Maxtor 51369U3, FW DA620CQ0 Intel pro/1000 NIC CentOS 4.3 X86_64 2.6.9 Main app server, Apache, Samba, NFS, NIS The computer: nas AMD Athlon64 3000 (1.8GHz) 256MB Corsair DDR 400 (2X 128MB's running in dual DDR mode) Foxconn 6150K8MA-8EKRS motherboard Off brand case/power supply and drive cages 2X os disks, software raid array, RAID 1, Maxtor 51369U3, FW DA620CQ0 6X software raid array, RAID 6, Maxtor 7V300F0, FW VA111900 Gentoo linux. X86_64 2.6.16-gentoo-r9 System built very lite, only built as an iSCSI based NAS. EXT3 Config=APP+NFS--NAS+iSCSI RAID6 64K chunk [EMAIL PROTECTED] tmp]# time tar cf - . | (cd /data ; tar xf - ) real 371m29.802s user 1m28.492s sys 46m48.947s /dev/sdb1 1.1T 371G 674G 36% /data 6.192 hours @ 61,262M/hour or 1021M/min or 17.02M/sec EXT2 Config=APP+NFS--NAS+iSCSI RAID6 64K chunk [EMAIL PROTECTED] tmp]# time tar cf - . | ( cd /data/ ; tar xf - ) real 401m48.702s user 1m25.599s sys 30m22.620s /dev/sdb1 1.1T 371G 674G 36% /data 6.692 hours @ 56,684M/hour or 945M/min or 15.75M/sec Did you tune the extN filesystems to the stripe size of the raid? -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Large single raid and XFS or two small ones and EXT3?
Justin Piszcz wrote: On Sat, 24 Jun 2006, Neil Brown wrote: On Friday June 23, [EMAIL PROTECTED] wrote: The problem is that there is no cost effective backup available. One-liner questions : - How does Google make backups ? No, Google ARE the backups :-) - Aren't tapes dead yet ? LTO-3 does 300Gig, and LTO-4 is planned. They may not cope with tera-byte arrays in one hit, but they still have real value. - What about a NUMA principle applied to storage ? You mean an Hierarchical Storage Manager? Yep, they exist. I'm sure SGI, EMC and assorted other TLAs could sell you one. LTO3 is 400GB native and we've seen very good compression, so 800GB-1TB per tape. The problem is in small business use, LTO3 is costly in the 1-10TB range, and takes a lot of media changes as well. A TB of RAID-5 is ~$500, and at that small size the cost of drives and media is disproportionally high. Using more drives is cost effective, but they are not good for long term off site storage, because they're large and fragile. No obvious solutions in that price and application range that I see. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Large single raid and XFS or two small ones and EXT3?
Adam Talbot wrote: Not exactly sure how to tune for stripe size. What would you advise? -Adam See the -R option of mke2fs. I don't have a number for the performance impact of this, but I bet someone else on the list will. Depending on what posts you read, reports range from measurable to significant, without quantifying. Note, next month I will set up either a 2x750 RAID-1 or 4x250 RAID-5 array, and if I go RAID-5 I will have the chance to run some metrics before putting the hardware into production service. I'll report on the -R option if I have any data. Bill Davidsen wrote: [___snip___] Did you tune the extN filesystems to the stripe size of the raid? -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
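For the array described earlier in the thread (64k chunks, 4k filesystem blocks) the -R option works out to stride=16; this is only a sketch, the device name is an example, and newer mke2fs spells the same thing -E stride=:

# 64KiB chunk / 4KiB block = 16 filesystem blocks per chunk
mke2fs -j -b 4096 -R stride=16 /dev/md0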
Re: RAID5 degraded after mdadm -S, mdadm --assemble (everytime)
Ronald Lembcke wrote: Hi! I set up a RAID5 array of 4 disks. I initially created a degraded array and added the fourth disk (sda1) later. The array is clean, but when I do mdadm -S /dev/md0 mdadm --assemble /dev/md0 /dev/sd[abcd]1 it won't start. It always says sda1 is failed. When I remove sda1 and add it again everything seems to be fine until I stop the array. Below is the output of /proc/mdstat, mdadm -D -Q, mdadm -E and a piece of the kernel log. The output of mdadm -E looks strange for /dev/sd[bcd]1, saying 1 failed. What can I do about this? How could this happen? I mixed up the syntax when adding the fourth disk and tried these two commands (at least one didn't yield an error message): mdadm --manage -a /dev/md0 /dev/sda1 mdadm --manage -a /dev/sda1 /dev/md0 Thanks in advance ... Roni ganges:~# cat /proc/mdstat Personalities : [raid5] [raid4] md0 : active raid5 sda1[4] sdc1[0] sdb1[2] sdd1[1] 691404864 blocks super 1.0 level 5, 64k chunk, algorithm 2 [4/4] [UUUU] unused devices: <none> I will just comment that the 0 1 2 4 numbering on the devices is unusual. When you created this did you do something which made md think there was another device, failed or missing, which was device[3]? I just looked at a bunch of my arrays and found no similar examples. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: IBM xSeries stop responding during RAID1 reconstruction
Mr. James W. Laferriere wrote: Hello Gabor , On Tue, 20 Jun 2006, Gabor Gombas wrote: On Tue, Jun 20, 2006 at 03:08:59PM +0200, Niccolo Rigacci wrote: Do you know if it is possible to switch the scheduler at runtime? echo cfq > /sys/block/<disk>/queue/scheduler At least one can do a ls of the /sys/block area then do an automated echo cfq down the tree . Does anyone know of a method to set a default scheduler ? Scanning down a list or manually maintaining a list seems to be a bug in the waiting . Tia , JimL Thought I posted this... it can be set in kernel build or on the boot parameters from grub/lilo. 2nd thought: set it to cfq by default, then at the END of rc.local, if there are no arrays rebuilding, change to something else if you like. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
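For completeness, the two methods mentioned; the disk name, kernel version and paths are examples, and the chosen scheduler has to be built into the kernel:

# boot-time default: append elevator=cfq to the kernel line in grub's menu.lst or lilo.conf, e.g.
#   kernel /vmlinuz-2.6.17 ro root=/dev/md0 elevator=cfq
# runtime, per disk:
echo cfq > /sys/block/sda/queue/scheduler
cat /sys/block/sda/queue/scheduler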
Re: I need a PCI V2.1 4 port SATA card
Gordon Henderson wrote: On Wed, 28 Jun 2006, Christian Pernegger wrote: I also subscribe to the almost commodity hardware philosophy, however I've not been able to find a case that comfortably takes even 8 drives. (The Stacker is an absolute nightmare ...) Even most rackable cases stop at 6 3.5 drive bays -- either that or they are dedicated storage racks with integrated hw RAID and fiber SCSI interconnect -- definitely not commodity. I've used these: http://www.acme-technology.co.uk/acm338.htm (8 drives in a 3U case), and their variants eg: http://www.acme-technology.co.uk/acm312.htm (12 disks in a 3U case) Interesting ad, with a masonic emblem, and a picture of a white case with a note saying it's only available in black. Of course the hardware may be perfectly fine, but I wouldn't count on color. for several years with good results. Not the cheapest on the block though, but never had any real issues with them. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Cutting power without breaking RAID
Niccolo Rigacci wrote: On Thu, Jun 29, 2006 at 02:00:09PM +1000, Neil Brown wrote: With 2.6, killall -9 md0_raid1 should do the trick (assuming root is on /dev/md0. If it is elsewhere, choose a different process name). Thanks, this is what I was looking for! I will try remounting read-only and killing the md0_raid1. I will keep you informed. Why should this trickery be needed? When an array is mounted r/o it should be clean. How can it be dirty? I assume readonly implies noatime; I mount physically readonly devices without explicitly saying noatime and nothing whines. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
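The sequence being suggested in the thread, spelled out; md0_raid1 is the array's kernel thread name and will differ for other levels or device numbers, and this is Neil's trick quoted above rather than anything official:

# quiesce the filesystem, then stop the array thread so the superblock is marked clean
mount -o remount,ro /
killall -9 md0_raid1
# then cut power

Newer mdadm also has a read-only mode (mdadm --readonly /dev/md0) that may achieve the same thing more politely, though I haven't verified it in this situation.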
Re: Random Seek on Array as slow as on single disk
A. Liemen wrote: Hardware Raid. http://www.areca.com.tw/products/html/pcix-sata.htm You should ask the vendor, this isn't a software RAID issue, and the usual path to improving bad hardware is in upgrading. You may be able to get better firmware if you're lucky. Alex Jeff Breidenbach schrieb: Controller: Areca ARC 1160 PCI-X 1GB Cache Those numbers are for Arica hardware raid or linux software raid? --Jeff - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Hardware assisted parity computation - is it now worth it?
Burn Alting wrote: Last year, there were discussions on this list about the possible use of a 'co-processor' (Intel's IOP333) to compute raid 5/6's parity data. We are about to see low cost, multi core cpu chips with very high speed memory bandwidth. In light of this, is there any effective benefit to such devices as the IOP333? Was there ever? Unless you're running on a really slow CPU, like a 386, with a TB of RAID attached, and heavy CPU load, could anyone ever see a measurable performance gain? I haven't seen any such benchmarks, although I haven't looked beyond reading several related mailing lists. Or in other words, is a cheaper (power, heat, etc) cpu with higher memory access speeds, more cost effective than a bridge/bus device (ie hardware) solution (which typically has much lower memory access speeds)? An additional device is always more complex, and less tunable than a CPU based solution. Except in the case above where there is very little CPU available, I don't see much hope for a cost (money and complexity) effective non-CPU solution. Obviously my opinion only. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Array will not assemble
Richard Scobie wrote: Neil Brown wrote: Add DEVICE /dev/sd? or similar on a separate line. Remove devices=/dev/sdc,/dev/sdd Thanks. My mistake, I thought after having assembled the arrays initially, that the output of: mdadm --detail --scan >> mdadm.conf could be used directly. I'm using Centos 4.3, which I believe is the latest RHEL 4 and they are only on mdadm 1.6 :( Do understand that the whole purpose of RHEL is to have a stable system. Upgrades are not done; instead, bugs are fixed in the original version to correct security or stability issues. However, feature changes are not provided, because new versions mean new issues. Stability has its price. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
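What the corrected mdadm.conf boils down to is something like the following; the UUID here is made up, and the devices= clause from --detail --scan is exactly the part to drop:

# /etc/mdadm.conf
DEVICE partitions
ARRAY /dev/md0 UUID=c43f2d08:9d64e4e6:8f4c7f55:1e0f5a9b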
Re: second controller: what will my discs be called, and does it matter?
Dexter Filmore wrote: Currently I have 4 discs on a 4 channel sata controller which does its job quite well for 20 bucks. Now, if I wanted to grow the array I'd probably go for another one of these. How can I tell if the discs on the new controller will become sd[e-h] or if they'll be the new a-d and push the existing ones back? For software RAID you shouldn't care, for other things you might. Next question: assembling by UUID, does that matter at all? No. There's the beauty of it. (And while talking UUID - can I safely migrate to a udev-kernel? Someone on this list recently ran into trouble because of such an issue.) You shouldn't lose data unless you panic at the first learning experience and do something without thinking of the results. I would convert to UUID first, obviously. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
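Assembling by UUID, which is what makes the sde-versus-sda question moot, looks roughly like this; the UUID is made up, and the usual place to record it is an ARRAY line in mdadm.conf with a DEVICE line (e.g. DEVICE partitions) so mdadm knows what to scan:

# read the UUID off any member
mdadm --examine /dev/sde1 | grep -i uuid
# assemble by UUID, letting mdadm scan for members
mdadm --assemble /dev/md0 --scan --uuid=c43f2d08:9d64e4e6:8f4c7f55:1e0f5a9b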
Re: issue with internal bitmaps
Neil Brown wrote: On Thursday July 6, [EMAIL PROTECTED] wrote: hello, i just realized that internal bitmaps do not seem to work anymore. I cannot imagine why. Nothing you have listed show anything wrong with md... Maybe you were expecting mdadm -X /dev/md100 to do something useful. Like -E, -X must be applied to a component device. Try mdadm -X /dev/sda1 To take this from the other end, why should -X apply to a component? Since the components can and do change names, and you frequently mention assembly by UUID, why aren't the component names determined from the invariant array name when mdadm wants them, instead of having a user or script check the array to get the components? Between udev and dynamic reconfiguration when component names have become less and less relevant, perhaps they can be less used in the future. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] enable auto=yes by default when using udev
Michael Tokarev wrote: Neil Brown wrote: On Monday July 3, [EMAIL PROTECTED] wrote: Hello, the following patch aims at solving an issue that is confusing a lot of users. when using udev, device files are created only when devices are registered with the kernel, and md devices are registered only when started. mdadm needs the device file _before_ starting the array. so when using udev you must add --auto=yes to the mdadm commandline or to the ARRAY line in mdadm.conf following patch makes auto=yes the default when using udev. The principle I'm reasonably happy with, though you can now make this the default with a line like CREATE auto=yes in mdadm.conf. However + + /* if we are using udev and auto is not set, mdadm will almost + * certainly fail, so we force it here. + */ + if (autof == 0 && access("/dev/.udevdb", F_OK) == 0) + autof=2; + I'm worried that this test is not very robust. On my Debian/unstable system running udev, there is no /dev/.udevdb though there is a /dev/.udev/db I guess I could test for both, but then udev might change again. I'd really like a more robust check. Why test for udev at all? If the device does not exist, regardless of whether udev is running or not, it might be a good idea to try to create it. Because IT IS NEEDED, period. Whether the operation fails or not, and whether we fail if it fails or not - it's another question, and I think that w/o explicit auto=yes, we may ignore create error and try to continue, and with auto=yes, we fail on create error. I have to agree here, I can't think of a case where creation of the device name would not be desirable, udev or no. But to be cautious, perhaps the default should be to create the device if the path starts with /dev/ or /tmp/ unless auto creation is explicitly off. I don't think udev or mount points come into the default decision at all, there are just too many options on naming. Note that /dev might be managed by some other tool as well, like mdev from busybox, or just a tiny shell /sbin/hotplug script. Note also that the whole root filesystem might be on tmpfs (like in initramfs), so /dev will not be a mountpoint. Agree with both points. Also, I think mdadm should stop creating strange temporary nodes somewhere as it does now. If /dev/whatever exists, use it. If not, create it (unless, perhaps, auto=no is specified) directly with proper mknod(/dev/mdX), but don't try to use some temporary names in /dev or elsewhere. True, but I don't see a case where this would be useful. And if it is, then add an auto=obscure_names option for the case where you really want that behaviour. In case of nfs-mounted read-only root filesystem, if someone will ever need to assemble raid arrays in that case.. well, he can either prepare proper /dev on the nfs server, or use tmpfs-based /dev, or just specify /tmp/mdXX instead of /dev/mdXX - whatever suits their needs better. Because /dev and /tmp are well known special cases, I would default auto for them. In other cases explicit behavior could be specified. Feel free to point out something bad which occurs by using this default behavior. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Test feedback 2.6.17.4+libata-tj-stable (EH, hotplug)
Christian Pernegger wrote: I finally got around to testing 2.6.17.4 with libata-tj-stable-20060710. Hardware: ICH7R in ahci mode + WD5000YS's. EH: much, much better. Before the patch it seemed like errors were only printed to dmesg but never handed up to any layer above. Now md actually fails the disk when I pull the (power) plug. I'll try my bad cable once I can find it. Hotplug: Unplugging was fine, took about 15s until the driver gave up on the disk. After re-plugging the driver had to hard-reset the port once to get the disk back, though that might be by design. The fact that the disk had changed minor numbers after it was plugged back in bugs me a bit. (was sdc before, sde after). Additionally udev removed the sdc device file, so I had to manually recreate it to be able to remove the 'faulty' disk from its md array. Thanks for a great patch! I just hope it doesn't eat my data :) And thank you for testing! -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: which disk the the one that data is on?
Shai wrote: Hi, I rebooted my server today to find out that one of the arrays is being re-synced (see output below) . 1. What does the (S) to the right of hdh1[5](S) mean? 2. How do I know, from this output, which disk is the one holding the most current data and from which all the other drives are syncing from? Or are they all containing the data and this sync process is something else? Maybe I'm just not understanding what is being done exactly? In addition to what you have already been told, if you find out that the array is in rebuild I would be a lot more worried to find out why. If it was from unclean shutdown you really should look into a bitmap if you don't have one. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
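Adding a write-intent bitmap to an existing array is a one-liner on reasonably recent kernels (2.6.13 or later) with mdadm 2.x; whether an internal bitmap fits depends on how the array and its superblock were created, so treat this as a sketch:

# add an internal write-intent bitmap, then confirm it shows up
mdadm --grow /dev/md0 --bitmap=internal
mdadm --detail /dev/md0 | grep -i bitmap
cat /proc/mdstat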
Re: issue with internal bitmaps
Bill Davidsen wrote: Neil Brown wrote: On Thursday July 6, [EMAIL PROTECTED] wrote: hello, i just realized that internal bitmaps do not seem to work anymore. I cannot imagine why. Nothing you have listed show anything wrong with md... Maybe you were expecting mdadm -X /dev/md100 to do something useful. Like -E, -X must be applied to a component device. Try mdadm -X /dev/sda1 To take this from the other end, why should -X apply to a component? Since the components can and do change names, and you frequently mention assembly by UUID, why aren't the component names determined from the invariant array name when mdadm wants them, instead of having a user or script check the array to get the components? Boy, I didn't say that well... what I meant to suggest is that when -E or -X are applied to the array as a whole, would it not be useful to iterate them over all of the components rather than looking for non-existent data in the array itself? -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
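In the meantime the iteration is easy enough to fake from a script; this crude sketch leans on the current --detail output format, so treat it as illustration only:

# run -X against every component of md100
for d in $(mdadm --detail /dev/md100 | awk '/\/dev\// && !/\/dev\/md/ {print $NF}'); do
    echo "== $d =="
    mdadm -X $d
done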
Re: [PATCH 005 of 9] md: Replace magic numbers in sb_dirty with well defined bit flags
Ingo Oeser wrote: Hi Neil, I think the names in this patch don't match the description at all. May I suggest different ones? On Monday, 31. July 2006 09:32, NeilBrown wrote: Instead of magic numbers (0,1,2,3) in sb_dirty, we have some flags instead: MD_CHANGE_DEVS Some device state has changed requiring superblock update on all devices. MD_SB_STALE or MD_SB_NEED_UPDATE I think STALE is better; it is unambiguous. MD_CHANGE_CLEAN The array has transitioned from 'clean' to 'dirty' or back, requiring a superblock update on active devices, but possibly not on spares Maybe split this into MD_SB_DIRTY and MD_SB_CLEAN ? I don't think the split is beneficial, but I don't care for the name much. Some name like SB_UPDATE_NEEDED or the like might be better. MD_CHANGE_PENDING A superblock update is underway. MD_SB_PENDING_UPDATE I would have said UPDATE_PENDING, but either is more descriptive than the original. Neil - the logic in this code is pretty complex; all the help you can give the occasional reader by using very descriptive names for things reduces the load of questions you get due to misunderstanding. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Interesting RAID checking observations - I'm getting it too
Mark Smith wrote: Just a note, I've noticed this problem too. I run a RAID1 check once every 24 hours, and while developing the script to do it, noticed that the machine became virtually unusable - mouse was jumpy, typing lagged. I run this check every morning at 4.00am so I'm usually asleep and don't notice it, so it hasn't been a big bother to me. Interesting, but do you run other stuff at that time? Several distributions run various things in the middle of the night which really bog down the machine. The data may be a bit coarse; however, here is what sysstat/sar says my machine does during the check. Let me know if you want any more or other sar data. [ data dropped, not relevant to my suggestion ] -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Feature Request/Suggestion - Drive Linking
Michael Tokarev wrote: Tuomas Leikola wrote: [] Here's an alternate description. On first 'unrecoverable' error, the disk is marked as FAILING, which means that a spare is immediately taken into use to replace the failing one. The disk is not kicked, and readable blocks can still be used to rebuild other blocks (from other FAILING disks). The rebuild can be more like a ddrescue type operation, which is probably a lot faster in the case of raid6, and the disk can be automatically kicked after the sync is done. If there is no read access to the FAILING disk, the rebuild will be faster just because seeks are avoided in a busy system. It's not that simple. The issue is with writes. If there's a failing disk, md code will need to keep track of up-to-date, or good sectors of it vs obsolete ones. Ie, when write fails, the data in that block is either unreadable (but can become readable on the next try, say, after themperature change or whatnot), or readable but contains old data, or is readable but contains some random garbage. So at least that block(s) of the disk should not be copied to the spare during resync, and should not be read at all, to avoid returning wrong data to userspace. In short, if the array isn't stopped (or changed to read-only), we should watch for writes, and remember which ones are failed. Which is some non-trivial change. Yes, bitmaps somewhat helps here. It would seem that much of the code needed is already there. When doing the recovery the spare can be treated as a RAID1 copy of the failing drive, with all sectors out of date. Then the sectors from the failing drive can be copied, using reconstruction if needed, until there is a valid copy on the new drive. There are several decision points during this process: - do writes get tried to the failing drive, or just the spare? - do you mark the failing drive as failed after the good copy is created? But I think most of the logic exists, the hardest part would be deciding what to do. The existing code looks as if it could be hooked to do this far more easily than writing new. In fact, several suggested recovery schemes involve stopping the RAID5, replacing the failing drive with a created RAID1, etc. So the method is valid, it would just be nice to have it happen without human intervention. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PROBLEM: system crash on AMD64 with 2.6.17.11 while accessing 3TB Software-RAID5
Ralf Herrmann wrote: Dear Mr. Brown, Yes.. you are hitting some pretty serious BUGs. And this is in code that is not specific to RAID at all, so if there really were bugs there, we would expect to have seen them well before now. You are absolutely right, it doesn't seem to be in RAID at all, but as of now, it only happened when doing something with /dev/md0. It really looks to me like a hardware problem. Somehow various bits of memory sometimes have bad values and cause a problem. How long did you run memtest? I would suggest running it for at least 24 hours, because my best guess is that it is bad memory, even though your tests so far don't show that. I ran it for about 16h, with all tests enabled, no error occurred. I was always wondering why it worked before the change and not now. The only difference were the larger drives. And i've read so many reports of people running much larger RAID5 partitions than we do, so why should it fail in this case? So my best bet at the moment, would be a hardware problem, too. I continued looking at the kernel oops messages and sometimes disassembly of the code where it broke gave invalid opcodes. This also looks pretty much like a hardware issue. But tests of single components did not unveil any error. It seems to me, that it only happens, when many system components are involved, several HDDs, the whole RAM, the NIC and so on. That leads me to another idea i'm currently testing. It could very well be a bad power supply. Maybe this box was running at full load of the power supply before, and now with new drives consumes more power than the supply can deliver, if all system components are used at once. I switched to a better power supply, tests are running as i write this. I'm sorry if i wasted your time, i should have checked this before writing to the list. But power supply problems are pretty odd and hard to identify. Anyways, i'm not sure if that solves the problem. Ok, i'll write the results of current tests, when they are finished. Thanks for your consideration. It certainly is a legitimate question, and marginal power would have been at the end of my list as well... However, if all else fails, try formatting the new drives to use only the size of the old drive capacity (RAID on small partitions) and see if that works. If so you may have found some rare size-related bug. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Raid5 reads and cpu
Rob Bray wrote: This might be a dumb question, but what causes md to use a large amount of cpu resources when reading a large amount of data from a raid1 array? Examples are on a 2.4GHz AMD64, 2GB, 2.6.15.1 (I realize there are md enhancements to later versions; I had some other unrelated issues and rolled back to one I've run on for several months). A given 7-disk raid0 array can read 450MB/s (using cat to /dev/null) and use virtually no CPU resources. (Although cat and kswapd use quite a bit [60%] munching on the data) A raid5 array on the same drive set pulls in at 250MB/s, but md uses roughly 50% of the CPU (the other 50% is spent dealing with the data, saturating the processor). A consistency check on the raid5 array uses roughly 3% of the cpu. It is otherwise ~97% idle. md11 : active raid5 sdi2[5] sdh2[4] sdf2[3] sde2[2] sdd2[1] sdc2[6] sdb2[0] 248974848 blocks level 5, 256k chunk, algorithm 2 [7/7] [UUUUUUU] [==============>......] resync = 72.2% (29976960/41495808) finish=3.7min speed=51460K/sec (~350MB/s aggregate throughput, 50MB/s on each device) Just a friendly question as to why CPU utilization is significantly different between a check and a real-world read on raid5? I feel like if there was vm overhead getting the data into userland, the slowdown would be present in raid0 as well. I assume parity calculations aren't done on a read of the array, which leaves me at my question. What are your stripe and cache sizes? -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
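For the record, the settings I'm asking about can be read (and raised) like this on kernels new enough to expose them; md11 and the values below are only examples:

# raid5/6 stripe cache, in pages per device (needs a fairly recent 2.6 kernel)
cat /sys/block/md11/md/stripe_cache_size
echo 4096 > /sys/block/md11/md/stripe_cache_size
# readahead on the array, in 512-byte sectors
blockdev --getra /dev/md11
blockdev --setra 8192 /dev/md11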
Re: Interesting RAID checking observations
[EMAIL PROTECTED] wrote: I don't think the processor is saturating. I've seen reports of this sort of thing before and until recently had no idea what was happening, couldn't reproduce it, and couldn't think of any more useful data to collect. Well I can reproduce it easily enough. It's a production server, but I can do low-risk experiments after hours. I'd like to note that the symptoms include not even being able to *type* at the console, which I thought was all in-kernel code, not subject to being swapped out. But whatever. Really? Or is it just that you can type but the characters don't get echoed. The type part is in the kernel, but the display involves X unless you run a direct console. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Linux: Why software RAID?
Gordon Henderson wrote: On Thu, 24 Aug 2006, Adam Kropelin wrote: Generally speaking the channels on onboard ATA are independant with any vaguely modern card. Ahh, I did not know that. Does this apply to master/slave connections on the same PATA cable as well? I know zero about PATA, but I assumed from the terminology that master and slave needed to cooperate rather closely. I don't know much about co-operation between master slave, but I do know that a failing PATA IDE drive can take out the other one on the same bus - or in my case, render it unusable until I removed the dead drive, whereupon (to my relief) it sprang back into life. This was many many moons ago before I started to use s/w RAID, but it's one thing that would kill a multi-disk array, so I've never done it since. I guess the same could happen on SCSI, but I suspect the interface is a little better designed... Until recently I was working with 38 systems using SCSI RAID controllers (IBM ServeRAID Ultra320). With several types of SCSI drives I saw failures where one drive failed, hung the bus, and caused the next command to another drive to fail. At that point I have to force the controller to think the 2nd drive failed was okay, and then it would recover. I'm told this happens with other hardware, I just haven't personally seen it. From that standpoint, the SATA on the MB look pretty good! -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID over Firewire
Richard Scobie wrote: Has anyone had any experience or comment regarding linux RAID over ieee1394? As a budget backup solution, I am considering using a pair of 500GB drives, each connected to a firewire 400 port, configured as a linear array, to which the contents of an onboard array will be rsynced weekly. In theory, throughput performance should not be an issue, but it would be great to hear from somone who has done this. It should work, but I don't like it... it leaves you with a lot of exposure between backups. Unless your data change a lot, you might consider a good incremental dump program to DVD or similar. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID5 producing fake partition table on single drive
Doug Ledford wrote: On Mon, 2006-08-21 at 17:35 +1000, Neil Brown wrote: Buffer I/O error on device sde3, logical block 1793 This, on the other hand, might be a problem - though possibly only a small one. Who is trying to access sde3 I wonder. I'm fairly sure the kernel wouldn't do that directly. It's the mount program collecting possible LABEL= data on the partitions listed in /proc/partitions, of which sde3 is outside the valid range for the drive. May I belatedly say that this is sort-of a kernel issue, since /proc/partitions reflects invalid data? Perhaps a boot option like nopart=sda,sdb or similar would be in order? -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Can you IMAGE Mirrored OS Drives?
Alternatives would be to have two such backup devices, and configure them, as andy liebman wrote: I may not have been clear what I was asking. I wanted to know if you can make DISK IMAGES -- for example, with a program like Norton Ghost or Acronis True Image (better) -- of EACH of the two OS drives from a mirrored pair. Then restore Image A to one new disk, Image B to another disk. And then have a new working mirrored pair. May I say belatedly (I've been flat out since July 1) that if I were making a significant number of these clones, I'd write a script so that I could clone one drive, drop it in another machine, and let the script run on the other machine to finish the job. I have no idea how many of these you are doing, but automation is nice to avoid finger checks. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 3ware glitches cause softraid rebuilds
adam radford wrote: Jim, Can you try the attached (and below) patch for 2.6.17.11? Don't you want the sleep BEFORE setting the new value? i.e. giving it a wait for the status to change before checking it again? Also, please make sure you are running the latest firmware. Thanks, -Adam

diff -Naur linux-2.6.17.11/drivers/scsi/3w-9xxx.c linux-2.6.17.12/drivers/scsi/3w-9xxx.c
--- linux-2.6.17.11/drivers/scsi/3w-9xxx.c    2006-08-23 14:16:33.000000000 -0700
+++ linux-2.6.17.12/drivers/scsi/3w-9xxx.c    2006-08-28 17:48:29.000000000 -0700
@@ -943,6 +943,7 @@
     before = jiffies;
     while ((response_que_value & TW_9550SX_DRAIN_COMPLETED) != TW_9550SX_DRAIN_COMPLETED) {
         response_que_value = readl(TW_RESPONSE_QUEUE_REG_ADDR_LARGE(tw_dev));
+        msleep(1);
         if (time_after(jiffies, before + HZ * 30))
             goto out;
     }

-- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Linux: Why software RAID?
Alan Cox wrote: On Thu, 2006-08-24 at 07:31 -0700, Marc Perkel wrote: So - the bottom line answer to my question is that unless you are running raid 5 and you have a high-powered raid card with cache and battery backup, there is no significant speed increase in using hardware raid. For raid 0 there is no advantage. If your raid is entirely on PCI plug-in cards and you are doing RAID1 there is a speed up using hardware assisted raid because of the PCI bus contention. I would expect to see this with RAID5 as well, for the same reason... -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID over Firewire
Richard Scobie wrote: Bill Davidsen wrote: It should work, but I don't like it... it leaves you with a lot of exposure between backups. Unless your data changes a lot, you might consider a good incremental dump program to DVD or similar. Thanks. I have abandoned this option for various reasons, including people randomly unplugging the drives. Rsync to another machine is the current plan. At one time I was evaluating doing RAID1 to an NBD on another machine, using write-mostly to make it a one-way process. I had to redeploy the hardware before I reached a conclusion, and it was with an older kernel, so I simply throw it out for discussion. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
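In case anyone wants to experiment along the same lines, the rough shape of what I was testing looked like this. The host name, port, and device names are only placeholders, and I make no promises about behaviour on current kernels; write-mostly needs a reasonably recent 2.6 kernel and mdadm:

(on the backup machine)
# nbd-server 2000 /dev/sdb1

(on the server)
# nbd-client backuphost 2000 /dev/nbd0
# mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda1 --write-mostly /dev/nbd0

The --write-mostly flag keeps normal reads on the local disk, so the network half of the mirror only sees the writes.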
Re: PROBLEM: system crash on AMD64 with 2.6.17.11 while accessing 3TB Software-RAID5
Ralf Herrmann wrote: Dear Mr. Davidsen and Mr. Brown, It certainly is a legitimate question, and marginal power would have been at the end of my list as well... However, if all else fails, try formatting the new drives to use only the size of the old drive capacity (RAID on small partitions) and see if that works. If so you may have found some rare size-related bug. Seems as if the new power supply did the trick. The box has been running smoothly for about two days now. I'm currently not in the office, but since I didn't get emergency calls from my colleagues I assume it still works. Thanks again for your time. Thanks for letting us know what it was. Even if it was not the first thing we suggested. ;-) -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID5 producing fake partition table on single drive
Lem wrote: On Mon, 2006-09-04 at 13:55 -0400, Bill Davidsen wrote: May I belatedly say that this is sort-of a kernel issue, since /proc/partitions reflects invalid data? Perhaps a boot option like nopart=sda,sdb or similar would be in order? Is this an argument to be passed to the kernel at boot time? It didn't work for me. My suggestion was to Neil or other kernel maintainers. If they agree that this is worth fixing, the option could be added in the kernel. It isn't there now; I was soliciting responses on whether this was desirable. Unfortunately I see no way to prevent data that happens to sit in the partition table location, and looks like a partition table, from being used. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID5 fill up?
Mr. James W. Laferriere wrote: Kuca, thank you for posting this snippet. Neil, might changing "can be given as max which means to choose the largest size that ..." to "can be given as 'max' which means to choose the largest size that ..." help those reading this be aware that this is a 'string' to add to the end of --size=? Also, if there are other keywords not quoted ('') this might be a good opportunity. ;-) Tia, JimL Definitely a good idea! If I hadn't seen an example posted using that feature, I would have assumed that it just meant someone was too lazy to type 'maximum' at the time. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
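For the record, the keyword in question is used like this (the array name is just an example):

# mdadm --grow /dev/md0 --size=max

which resizes the per-component usage to the largest size the smallest member allows. Without the quoting in the man page it does read as though a literal number had been left out.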
Re: proactive-raid-disk-replacement
Tuomas Leikola wrote: On 9/10/06, Bodo Thiesen [EMAIL PROTECTED] wrote: So, we need a way, to feedback the redundancy from the raid5 to the raid1. snip long explanation Sounds awfully complicated to me. Perhaps this is how it internally works, but my 2 cents go to the option to gracefully remove a device (migrating to a spare without losing redundancy) in the kernel (or mdadm). I'm thinking mdadm /dev/raid-device -a /dev/new-disk mdadm /dev/raid-device --graceful-remove /dev/failing-disk also hopefully a path to do this instead of kicking (multiple) disks when bad blocks occur. Actually, an internal implementation is really needed if this is to be generally useful to a non-guru. And it has other possible uses, as well. if there were just a --migrate command: mdadm --migrate /dev/md0 /dev/sda /dev/sdf as an example for discussion, the whole process of not only moving the data, but getting recovered information from the RAID array could be done by software which does the right thing, creating superblocks, copy UUID, etc. And as a last step it could invalidate the superblock on the failing drive (so reboots would work right) and leave the array running on the new drive. But wait, there's more! Assume that I want to upgrade from a set of 250GB drives to 400GB drives. Using this feature I could replace a drive at a time, then --grow the array. The process for doing that is complex currently, and many manual steps invite errors. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
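For comparison, the way you have to do the replacement today is roughly the following - the device names are examples, and note the window where the array runs without full redundancy, which is exactly what a --graceful-remove or --migrate would avoid:

# mdadm /dev/md0 --add /dev/sdf1        (new disk goes in as a spare)
# mdadm /dev/md0 --fail /dev/sda1       (kick the suspect disk; rebuild starts onto the spare)
# mdadm /dev/md0 --remove /dev/sda1     (pull it out once the rebuild completes)

For the 250GB-to-400GB upgrade case, you repeat that once per drive and then finish with something like mdadm --grow /dev/md0 --size=max once every member is a large drive.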
Re: Correct way to create multiple RAID volumes with hot-spare?
Steve Cousins wrote: Ruth Ivimey-Cook wrote: Steve, The recent "Messed up creating new array..." thread has someone who started by using the whole drives but she now wants to use partitions because the array is not starting automatically on boot (I think that was the symptom). I'm guessing this is because there is no partition ID of fd since there isn't even a partition. Yes, that's right. Thanks Ruth. Neil (or others), what is the recommended way to have the array start up if you use whole drives instead of partitions? Do you put mdadm -A etc. in rc.local? I think you want it earlier than that, unless you want to do the whole mounting process by hand. It's distribution dependent, but doing it early allows the array to be handled like any other block device. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
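As a sketch of what "early" means in practice: list the whole-disk devices and the array in mdadm.conf, then assemble from an init script that runs before local filesystems are checked and mounted. The device range and UUID below are placeholders, and the init script location is distribution dependent:

/etc/mdadm.conf:
DEVICE /dev/sd[c-j]
ARRAY /dev/md2 UUID=<uuid reported by mdadm --detail>

early in the boot scripts, before fsck/mount of local filesystems:
mdadm --assemble --scan

The in-kernel autodetect that keys off partition type fd obviously cannot help when there are no partitions, so assembly has to come from mdadm.conf one way or another.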
Re: ATA cables and drives
Molle Bestefich wrote: I'm looking for new harddrives. This is my experience so far. SATA cables: = I have zero good experiences with any SATA cables. They've all been crap so far. 3.5 ATA harddrives buyable where I live: == (All drives are 7200rpm, for some reason.) Unless you live where delivery services don't go, you can get 10k SATA or 15k Ultra320 drives from many vendors. I checked newegg just as a reference, there are many others. The rest of your questions sound like you are running the drives very hot, and drive life is inversely proportional to temperature. There are mil-spec drives which will live at 100C, but they are not readily available and cost way more than keeping drives cool. I have no idea what kind of SATA cable failure you are seeing, I have some in machines I take to give presentations, and if 10k miles in the back of an SUV didn't cause problems, I doubt any normal operation would. I've had them bad when I got them, but if they work once they keep working, in my experience. I've tried Maxtor and IBM (now Hitachi) harddrives. Both makes have failed on me, but most of the time due to horrible packaging. I don't care a split-second whether one kind is marginally faster than the other, so all the reviews on AnandTech etc. are utterly useless to me. There's an infinite number of more effective ways to get better performance than to buy a slightly faster harddrive. I DO care about quality, namely: * How often the drives has catastrophic failure, * How they handle heat (dissipation acceptance - how hot before it fails?), * How big the spare area is, * How often they have single-sector failures, * How long the manufacturer warranty lasts, * How easy the manufacturer is to work with wrt. warranty. I haven't been able to figure the spare area size, heat properties, etc. for any drives. Thus my only criteria so far has been manufacturer warranty: How much bitching do I get when I tell them my drive doesn't work. My main experience is with Maxtor. Maxtor has been none less than superb wrt. warranty! Download an ISO with a diag tool, burn the CD, boot the CD, type in the fault code it prints on Maxtor's site, and a day or two later you've got a new drive in the mail and packaging to ship the old one back in. If something odd happens, call them up and they're extremely helpful. Unfortunately, I lack thorough experience with the other brands. Questions: === A.) Does anyone have experience with returning Hitachi, Seagate or WD drives to the manufacturer? Do they have manufacturer warranty at all? How much/little trouble did you have with Hitachi, Seagate or WD? B.) Can anyone *prove* (to a reasonable degree) that drives from manufacturer H, M, S or WD is of better quality? Has anyone seen a review that heat/shock/stress test drives? C.) Does good SATA cables exist? Eg. cables that lock on to the drives, or backplanes which lock the entire disk in place? Thanks for reading, and thanks in advance for answers (if any) :-). - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Slackware and RAID
Dexter Filmore wrote: Is there anyone here who runs a soft raid on Slackware? Out of the box there are no raid scripts; the ones I made myself seem a little rawish, barely more than mdadm --assemble/--stop. I'm pretty much off Slack now, but I have run it. The scripts you describe are about 2/3 of what you need; see the thread(s) here about monitoring. mdadm doesn't need a lot of direction... -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
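For the archives, the assemble/stop part really is only a couple of lines; a sketch of what I had in mind, with Slackware-style paths assumed and adjusted to taste:

# called from rc.S, before fsck and mounting of local filesystems
/sbin/mdadm --assemble --scan

# called from rc.6 / rc.0, after everything on the arrays is unmounted
/sbin/mdadm --stop --scan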
Re: Slackware and RAID
Dexter Filmore wrote: Am Samstag, 16. September 2006 19:26 schrieb Bill Davidsen: Dexter Filmore wrote: Is anyone here who runs a soft raid on Slackware? Out of the box there are no raid scripts, the ones I made myself seem a little rawish, barely more than mdadm --assemble/--stop. I'm pretty much off Slack now, but I have run, the scripts you describe are about 2/3 of what you need, see the thread(s) here about monitoring. mdadm doesn't need a lot of direction... What's the remaining third? Monitoring... where I pointed you. I fumbled it into rc.S and rc.6, reason why I ask is that array degraded about 6 times in the few months I run it and I can't figure why. Only thing I know is that it degrades somewhere in the reboot process, so I suspect it might not properly shutdown. Since I haven't had problems I'll pass on trying to guess what's happening. When I have any problem usually a whole drive hits the floor, and I know what to fix and how. I assume you look at mdstat after boot? If it's clean before you shut down and dirty on boot, something isn't shutting down clean, OR that autodetect stuff isn't working as you want it to. I do NOT run lvm, I avoid stuff like that unless I really need it. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
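The missing third is something along these lines, which is what the monitoring threads cover. The mail address and polling delay are just examples:

# mdadm --monitor --scan --daemonise --mail=root --delay=300

Run that from an init script and you get mail when a member drops out, rather than discovering a degraded array at the next reboot.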
Re: new features time-line
Neil Brown wrote: On Friday October 13, [EMAIL PROTECTED] wrote: I am curious if there are plans for either of the following; -RAID6 reshape -RAID5 to RAID6 migration No concrete plans with timelines and milestones and such, no. I would like to implement both of these but I really don't know when I will find/make time. Probably by the end of 2007, but that is not a promise. We talked about RAID5E a while ago, is there any thought that this would actually happen, or is it one of the would be nice features? With larger drives I suspect the number of drives in arrays is going down, and anything which offers performance benefits for smaller arrays would be useful. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: why partition arrays?
Henrik Holst wrote: Bodo Thiesen wrote: Ken Walker [EMAIL PROTECTED] wrote: Is LVM stable, or can it cause more problems than separate raids on a array. [description of street smart raid setup] (The same function could probably be achieved with logical partitions and ordinary software raid levels.) So, now decide for your own, if you consider LVM stable - I would ;) Regards, Bodo Have you lost any disc (i.e. physical volumes) since February? Or lost the meta-data? I would not recommend anyone to use LVM if they are less than experts on Linux systems. Setting up a LVM system is easy: administrating and salvaging the same, was much more work. (I used it ~3 years ago) My read on LVM is that (a) it's one more thing for the admin to learn, (b) because it's seldom used the admin will be working from documentation if it has a problem, and (c) there is no bug-free software, therefore the use of LVM on top of RAID will be less reliable than a RAID-only solution. I can't quantify that, the net effect may be too small to measure. However, the cost and chance of a finger check from (a) and (b) are significant. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Bug with RAID1 hot spares?
Chase Venters wrote: Greetings, I was just testing a server I was about to send into production on kernel 2.6.18.1. The server has three SCSI disks with md1 set to a RAID1 with 2 mirrors and 1 spare. I have to ask, why? If the array is mostly written to you might save a bit of bus time, but for reads having another copy of the data to read (usually) helps the performance by reducing wait-for-read occurrences. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
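In other words, with three disks I would rather see all three active; the difference is just the create line. Device and array names below are examples only:

# mdadm --create /dev/md1 --level=1 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1

instead of

# mdadm --create /dev/md1 --level=1 --raid-devices=2 --spare-devices=1 /dev/sda1 /dev/sdb1 /dev/sdc1

Reads then spread over three copies instead of two, and you still survive a drive failure without waiting for a rebuild.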
Re: future hardware
Dan wrote: I have been using an older 64-bit system, socket 754, for a while now. It has the old 33MHz PCI bus. I have two low cost (no HW RAID) PCI SATA I cards, each with 4 ports, to give me an eight disk RAID 6. I also have a Gig NIC on the PCI bus. I have Gig switches with clients connecting at Gig speed. As many know, you get a peak transfer rate of 133 MB/s or 1064Mb/s from that PCI bus http://en.wikipedia.org/wiki/Peripheral_Component_Interconnect The transfer rate is not bad across the network, but my bottleneck is the PCI bus. I have been shopping around for a new MB and PCI-express cards. I have been using mdadm for a long time and would like to stay with it. I am having trouble finding an eight port PCI-express card that does not have all the fancy HW RAID which jacks up the cost. I am now considering using a MB with eight SATA II ports onboard: GIGABYTE GA-M59SLI-S5 Socket AM2 NVIDIA nForce 590 SLI MCP ATX. What are other users of mdadm using with the PCI-express cards, and what is the most cost effective solution? There may still be m/b available with multiple PCI busses. Don't know if you are interested in a low budget solution, but that would address bandwidth and use existing hardware. Idle curiosity: what kind of case are you using for the drives? I will need to spec a machine with eight drives in the December-January timeframe. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Question of PCI bandwidth affect on SW RAID arrays.
Justin Piszcz wrote: Quick question, On older systems, regular PCI motherboards, no PCI-e-- which is faster? 1) One 4 port SATA150 card in 1 PCI slot which has 4 drives connected. 2) Four SATA150 cards in 4 PCI slots, which each have 1 drive connected? I'd assume 2 would stress the PCI bus more and thus would be slower, but I am curious what other people know/think/etc? You are sending the same amount of data over the bus in either case, assuming you are using s/w RAID. This is where (real) hardware RAID has an advantage in performance, the parity doesn't go over the bus. If you have a server board it may have multiple PCI busses, and therefore a higher max bandwidth with multiple cards. I would still expect PCI-e to be faster. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: New features?
John Rowe wrote: All this discussion has led me to wonder if we users of linux RAID have a clear consensus of what our priorities are, ie what are the things we really want to see soon as opposed to the many things that would be nice but not worth delaying the important things for. FWIW, here are mine, in order although the first two are roughly equal priority. 1 Warm swap - replacing drives without taking down the array but maybe having to type in a few commands. Presumably a sata or sata/raid interface issue. (True hot swap is nice but not worth delaying warm- swap.) That seems to work now. It does assume that you have hardware hot swap capability. 2 Adding new disks to arrays. Allows incremental upgrades and to take advantage of the hard disk equivalent of Moore's law. Also seems to work. 3. RAID level conversion (1 to 5, 5 to 6, with single-disk to RAID 1 a lower priority). Single to RAID-N is possible, but involves a good bit of magic with leaving room for superblocks, etc. 4. Uneven disk sizes, eg adding a 400GB disk to a 2x200GB mirror to create a 400GB mirror. Together with 2 and 3, allows me to continuously expand a disk array. ??? -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
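For item 2, the working sequence today is short. A sketch with example names, assuming a recent kernel and an mdadm with reshape support (and possibly a --backup-file for the critical section, as discussed elsewhere in these threads):

# mdadm /dev/md0 --add /dev/sde1
# mdadm --grow /dev/md0 --raid-devices=5
# resize2fs /dev/md0        (grow the filesystem once the reshape finishes, if it's ext2/3)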
Re: new array not starting
Robin Bowes wrote: Robin Bowes wrote: Robin Bowes wrote: This worked: # mdadm --assemble --auto=yes /dev/md2 /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj mdadm: /dev/md2 has been started with 8 drives. However, I'm not sure why it didn't start automatically at boot. Do I need to put it in /etc/mdadm.conf for it to start automatically? I thought md started all arrays it found at startup? OK, I put /dev/md2 in /etc/mdadm.conf and it didn't make any difference. This is mdadm.conf (uuids are on same line as ARRAY): DEVICE partitions ARRAY /dev/md1 level=raid1 num-devices=2 uuid=300c1309:53d26470:64ac883f:2e3de671 ARRAY /dev/md0 level=raid1 num-devices=2 uuid=89649359:d89365a6:0192407d:e0e399a3 ARRAY /dev/md2 level=raid6 num-devices=8 UUID=68c2ea69:a30c3cb0:9af9f0b8:1300276b I saw an error fly by as the server was booting saying /dev/md2 not found. Do I need to create this device manually? Well, at the risk of having a complete conversation with myself, I've created partitions of type fd on each disk and re-created the array out of the partitions instead of the whole disk. mdadm --create /dev/md2 --auto=yes --raid-devices=8 --level=6 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1 I'm hoping this will enable the array to be auto-detected and started at boot. I'm guessing that whole devices don't get scanned when "DEVICE partitions" is used. There was a fix for incorrect partition tables being used on whole drives, and perhaps that makes the whole device get ignored, or perhaps it never worked. Perhaps there's an interaction with LVM; the more complex you make your setup, the greater the chance for learning experiences. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Is my RAID broken?
Neil Brown wrote: On Sunday November 5, [EMAIL PROTECTED] wrote: If it's resyncing that means it detected an error, right? Not a disk error. 'resyncing' means that at startup it looked like the array hadn't been shut down properly, so it is making sure that all the redundancy in the array is consistent. So it looks like your machine recently crashed (power failure?) and it is restarting. There is always the possibility that shutdown scripts don't do the right thing, as well. I believe one of the major distros showed this problem within the last few months, depending on the RAID options used. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID5 array showing as degraded after motherboard replacement
James Lee wrote: Hi there, I'm running a 5-drive software RAID5 array across two controllers. The motherboard in that PC recently died - I sent the board back for RMA. When I refitted the motherboard, connected up all the drives, and booted up I found that the array was being reported as degraded (though all the data on it is intact). I have 4 drives on the on board controller and 1 drive on an XFX Revo 64 SATA controller card. The drive which is being reported as not being in the array is the one connected to the XFX controller. The OS can see that drive fine, and mdadm --examine on that drive shows that it is part of the array and that there are 5 active devices in the array. Doing mdadm --examine on one of the other four drives shows that the array has 4 active drives and one failed. mdadm --detail for the array also shows 4 active and one failed. Now I haven't lost any data here and I know I can just force a resync of the array which is fine. However I'm concerned about how this has happened. One worry is that the XFX SATA controller is doing something funny to the drive. I've noticed that it's BIOS has defaulted to RAID0 mode (even though there's only one drive on it) - I can't see how this would cause any particular problems here though. I guess it's possible that some data on the drive got corrupted when the motherboard failed... I notice in your later post that the driver thinks this is a JBOD setup, can you either tell the controller to JBOD or force the driver to consider this a RAID0 single disk setup? I don't know what RAID0 on one drive means, but I suspect that having the controller in the mode you want is desirable. That might have been changed in the hardware failure. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Recovering from default FC6 install
I tried something new on a test system, using the install partitioning tools to partition the disk. I had three drives and went with RAID-1 for boot, and RAID-5+LVM for the rest. After the install was complete I noted that it was solid busy on the drives, and found that the base RAID appears to have been created (a) with no superblock and (b) with no bitmap. That last is an issue; as a test system it WILL be getting hung and rebooted, and recovering the 1.5TB took hours. Is there an easy way to recover this? The LVM dropped on it has a lot of partitions, and there is a lot of data in them after several hours of feeding with GigE, so I can't readily back up and recreate by hand. Suggestions? -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
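What I was hoping for was something as simple as the following, which works on an array that does have a persistent superblock (md0 here is only an example):

# mdadm --grow /dev/md0 --bitmap=internal

That adds a write-intent bitmap in place on a recent 2.6 kernel, and would have turned those multi-hour resyncs into something measured in minutes.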
Re: Recovering from default FC6 install
Doug Ledford wrote: On Sun, 2006-11-12 at 01:00 -0500, Bill Davidsen wrote: I tried something new on a test system, using the install partitioning tools to partition the disk. I had three drives and went with RAID-1 for boot, and RAID-5+LVM for the rest. After the install was complete I noted that it was solid busy on the drives, and found that the base RAID appears to have been created (a) with no superblock and (b) with no bitmap. That last is an issue; as a test system it WILL be getting hung and rebooted, and recovering the 1.5TB took hours. Is there an easy way to recover this? The LVM dropped on it has a lot of partitions, and there is a lot of data in them after several hours of feeding with GigE, so I can't readily back up and recreate by hand. Suggestions? First, the Fedora installer *always* creates persistent arrays, so I'm not sure what is making you say it didn't, but they should be persistent. I got the detail on the md device, then -E on the components, and got a no super block found message, which made me think it wasn't there. Given that, I didn't have much hope for the part which starts assuming that they are persistent, but I do thank you for the information, I'm sure it will be useful. I did try recreating, from the running FC6 rather than the rescue, since the large data was on its own RAID and I could umount the f/s and stop the array. Alas, I think a grow was needed somewhere; after configuration, start, and mount of the f/s on RAID-5, e2fsck told me my data was toast. Shortest time to solution was to recreate the f/s and reload the data. The RAID-1 stuff is small; a total rebuild is acceptable in the case of a failure. FC install suggestion: more optional control over the RAID features during creation. Maybe there's an advanced features button in the install and I just missed it, but there should be, since the non-average user might be able to do useful things with the chunk size, and specify a bitmap. I would think that a bitmap would be the default on large arrays, assuming that 1TB is still large for the moment. Instructions and attachments saved for future use, trimmed here. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: md manpage of mdadm 2.5.6
Joachim Wagner wrote: Hi Neil, In man -l mdadm-2.5.6/md.4 I read: "Firstly, after an unclear shutdown, the resync process will consult the bitmap and only resync those blocks that correspond to bits in the bitmap that are set. This can dramatically increase resync time." IMHO, "increase" should be changed to "decrease", or "time" to "speed". Probably better to say "unclean" than "unclear" shutdown as well. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: mdadm RAID5 Grow
mickg wrote: Neil Brown wrote: On Thursday October 5, [EMAIL PROTECTED] wrote: Neil Brown wrote: On Wednesday October 4, [EMAIL PROTECTED] wrote: I have been trying to run: mdadm --grow /dev/md0 --raid-devices=6 --backup-file /backup_raid_grow I get: mdadm: Need to backup 1280K of critical section.. mdadm: /dev/md0: Cannot get array details from sysfs It shouldn't do that. Can you strace -o /tmp/trace -s 300 mdadm --grow ...? ... open(/sys/block/md0/md/component_size, O_RDONLY) = -1 ENOENT (No such file or directory) So it couldn't open .../component_size. That was added prior to the release of 2.6.16, and you are running 2.6.17.13 so the kernel certainly supports it. Most likely explanation is that /sys isn't mounted. Do you have a /sys? Is it mounted? Can you ls -l /sys/block/md0/md ?? Maybe you need to 'mkdir /sys; mount -t sysfs sysfs /sys' and try again. Worked like a charm! Thank you! There is a 'sysfs /sys sysfs noauto 0 0' line in /etc/fstab. I am assuming noauto is the culprit? Should it be made to automount? mickg I will belatedly add that while experience shows that /proc and /sys are optional (and can in theory be mounted other places), in practice a lot of software depends on them being present and in the usual place. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
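For completeness, the fstab line that avoids the problem would look roughly like this, with noauto dropped so /sys is mounted at boot just like /proc:

sysfs   /sys   sysfs   defaults   0 0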
Odd (slow) RAID performance
Pardon if you see this twice, I sent it last night and it never showed up... I was seeing some bad disk performance on a new install of Fedora Core 6, so I did some measurements of write speed, and it would appear that write performance is so slow it can't write my data as fast as it is generated :-( The method: I wrote 2GB of data to various configurations with sync; time bash -c "dd if=/dev/zero bs=1024k count=2048 of=X"; sync where X was a raw partition, raw RAID device, or ext2 filesystem over a RAID device. I recorded the time reported by dd, which doesn't include a final sync, and total time from start of write to end of sync, which I believe represents the true effective performance. All tests were run on a dedicated system, with the RAID devices or filesystem freshly created. For a baseline, I wrote to a single drive, single raw partition, which gave about 50MB/s transfer. Then I created a RAID-0 device, striped over three test drives. As expected this gave a speed of about 147 MB/s. Then I created an ext2 filesystem over that device, and the test showed 139 MB/s speed. This was as expected. Then I stopped and deleted the RAID-0 and built a RAID-5 on the same partitions. A write to this raw RAID device showed only 37.5 MB/s!! Putting an ext2 f/s over that device dropped the speed to 35 MB/s. Since I am trying to write bursts at 60MB/s, this is a serious problem for me. Then I created a new RAID-10 array on the same partitions. This showed a write speed of 75.8 MB/s, double the speed even though I was (presumably) writing twice the data. An ext2 f/s on that array showed 74 MB/s write speed. I didn't use /proc/diskstats to gather actual counts, nor do I know if they show actual transfer data below all the levels of o/s magic, but that sounds as if RAID-5 is not working right. I don't have enough space to use RAID-10 for incoming data, so that's not an option. Then I thought that perhaps my chunk size, defaulted to 64k, was too small. So I created an array with 256k chunk size. That showed about 36 MB/s to the raw array, and 32.4 MB/s to an ext2 f/s using the array. Finally I decided to create a new f/s using the stride= option, and see if that would work better. I had 256k chunks, two data and a parity per stripe, so I used the data size, 512k, for calculation. The man page says to use the f/s block size, 4k in this case, for calculation, so 512/4 gave a stride of 128, and I used that. The increase was below the noise, about 50KB/s faster. Any thoughts on this gratefully accepted. I may try the motherboard RAID if I can't make this work, and it certainly explains why my swapping is so slow. That I can switch to RAID-1; it's used mainly for test, big data sets and suspend. If I can't make this fast I'd like to understand why it's slow. I did make the raw results available at http://www.tmr.com/~davidsen/RAID_speed.html if people want to see more info. -- Bill Davidsen [EMAIL PROTECTED] We have more to fear from the bungling of the incompetent than from the machinations of the wicked. - from Slashdot - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
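A footnote on the stride= experiment: my reading of the mke2fs man page is that stride is meant to be the per-disk chunk size in filesystem blocks, not the data portion of the full stripe, so for 256k chunks and 4k blocks it would be 64 rather than 128. Something like the following, treating the numbers as examples rather than a recommendation (I believe older e2fsprogs spell the option -R stride= instead of -E stride=):

# mke2fs -b 4096 -E stride=64 /dev/md0

I would not expect it to change the basic RAID-5 write picture, but it removes one variable from the test.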
Re: Odd (slow) RAID performance
Roger Lucas wrote: -Original Message- From: [EMAIL PROTECTED] [mailto:linux-raid- [EMAIL PROTECTED] On Behalf Of Bill Davidsen Sent: 30 November 2006 14:13 To: linux-raid@vger.kernel.org Subject: Odd (slow) RAID performance Pardon if you see this twice, I sent it last night and it never showed up... I was seeing some bad disk performance on a new install of Fedora Core 6, so I did some measurements of write speed, and it would appear that write performance is so slow it can't write my data as fast as it is generated :-( What drive configuration are you using (SCSI / ATA / SATA), what chipset is providing the disk interface and what cpu are you running with? 3xSATA, Seagate 320 ST3320620AS, Intel 6600, ICH7 controller using the ata-piix driver, with drive cache set to write-back. It's not obvious to me why that matters, but if it helps you see the problem I'm glad to provide the info. I'm seeing ~50MB/s on the raw drive, and 3x that on plain stripes, so I'm assuming that either the RAID-5 code is not working well or I haven't set it up optimally. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Odd (slow) RAID performance
Roger Lucas wrote: What drive configuration are you using (SCSI / ATA / SATA), what chipset is providing the disk interface and what cpu are you running with? 3xSATA, Seagate 320 ST3320620AS, Intel 6600, ICH7 controller using the ata-piix driver, with drive cache set to write-back. It's not obvious to me why that matters, but if it helps you see the problem I'm glad to provide the info. I'm seeing ~50MB/s on the raw drive, and 3x that on plain stripes, so I'm assuming that either the RAID-5 code is not working well or I haven't set it up optimally. If it had been ATA, and you had two drives as master+slave on the same cable, then they would be fast individually but slow as a pair. RAID-5 is higher overhead than RAID-0/RAID-1, so if your CPU was slow then you would see some degradation from that too. We have similar hardware here so I'll run some tests here and see what I get... Much appreciated. Since my last note I tried adding --bitmap=internal to the array. Boy, is that a write performance killer. I will have the chart updated in a minute, but write dropped to ~15MB/s with the bitmap. Since Fedora can't seem to shut the last array down cleanly, I get a rebuild on every boot :-( So the array for the LVM has the bitmap on, as I hate to rebuild 1.5TB regularly. Have to make some compromises on that! Thanks for looking! -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
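One knob worth noting if the bitmap overhead turns out to be the problem: the bitmap chunk size. A larger chunk means far fewer bitmap updates at the cost of a coarser resync after a crash. Something like the following, with the array name and chunk value as examples only (the argument is in KB as I understand it, so 65536 gives 64MB chunks), and assuming mdadm accepts --bitmap-chunk on a --grow that adds an internal bitmap:

# mdadm --grow /dev/md0 --bitmap=none
# mdadm --grow /dev/md0 --bitmap=internal --bitmap-chunk=65536

I haven't measured how much of the write penalty that buys back, so treat it as an experiment rather than a fix.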