Re: ATA cables and drives
Hi everyone, thanks for the information so far! Greatly appreciated.

I've just found this:
http://home-tj.org/wiki/index.php/Sil_m15w#Message:_Re:_SiI_3112_.26_Seagate_drivers

which mentions, in particular, that Silicon Image controllers and Seagate drives don't work too well together, and that neither Silicon Image nor Seagate wants to know about or do anything about the problem. Hmm.
Re: ATA cables and drives
Jeff Garzik wrote:
> Molle Bestefich wrote:
> > I've just found this:
> > http://home-tj.org/wiki/index.php/Sil_m15w#Message:_Re:_SiI_3112_.26_Seagate_drivers
> > which mentions, in particular, that Silicon Image controllers and Seagate drives don't work too well together, and that neither Silicon Image nor Seagate wants to know about or do anything about the problem.
> Not really true...

It's not? That's how I read it. Which part of it is wrong?
Re: USB disks for RAID storage (was Re: Please help me save my data)
Martin Kihlgren writes:
> And no, nothing hangs except the disk access to the device in question when a disk fails.

Sounds good! +1 for USB...

> My Seagate disks DO generate too much heat if I stack them on top of each other, which their form factor suggests they would accept.

Starts to take up a lot of space if you need to lay them out like that. (Just for reference, I've had external USB thingys fail with just two drives stacked. They both failed.)

> My RAID5 + LVM + dm_crypt + XFS setup allows for a very extendable system.

That does give you a cool feature set :-). With LVM and dm_crypt in there, it does sound like you're running with beta-quality software to my ears, however. If it works, great.

> And as long as I treat the entire disk set as one device, the bandwidth will not be an issue, since I will never demand more bandwidth from the entire array than from a single USB drive anyway.

Fair enough. It would be cool to get the extra bandwidth, though. One solution is to use external SATA (eSATA) enclosures instead of USB enclosures. That would both raise bandwidth and fix the transaction latency issue mentioned by Daniel Pittman. There are a couple of single-disk enclosures out there that allow you to connect disks via either eSATA or USB2. None of them seems to come with cooling, though :-/.
Re: Please help me save my data
Patrick Hoover wrote:
> Is anyone else having issues with USB interfaced disks to implement RAID? Any thoughts on Pros / Cons for doing this?

Sounds like a very good stress test for MD. I often find servers completely hung when a disk fails; this usually happens in the IDE layer. If using USB disks circumvents the IDE layer enough, it might get rid of these hangs. Would be nice at least. Maybe I'm just dreaming.

For end users, USB might remove the need to take special care of cooling in your cabinet. OTOH, most USB disk enclosures have horrible thermal properties.

USB would make it a lot easier to add new disks (beyond your cabinet's capacity) and to remove old disks when/if they're no longer needed. Users might run into a bandwidth issue at some point..
ATA cables and drives
I'm looking for new harddrives. This is my experience so far.

SATA cables:
============
I have zero good experiences with any SATA cables. They've all been crap so far.

3.5" ATA harddrives buyable where I live:
=========================================
(All drives are 7200 rpm, for some reason.)

Hitachi DeskStar               500 GB / 16 MB / 8.5 ms / SATA or PATA
Maxtor DiamondMax 11           500 GB / 16 MB / 8.5 ms / SATA or PATA
Maxtor MaXLine Pro             500 GB / 16 MB / 8.5 ms / SATA or PATA
Seagate Barracuda 7200.10      500 GB / 16 MB / ?      / SATA or PATA
Seagate Barracuda 7200.10      750 GB / 16 MB / ?      / SATA or PATA
Seagate Barracuda 7200.9       500 GB / 16 MB / 11 ms  / SATA or PATA
Seagate Barracuda 7200.9       500 GB /  8 MB / 11 ms  / SATA or PATA
Seagate Barracuda ES           500 GB / 16 MB / 8.5 ms / SATA
Seagate Barracuda ES           750 GB / 16 MB / 8.5 ms / SATA
Seagate ESATA                  500 GB / 16 MB / ?      / SATA (external)
Seagate NL35.2 ST3500641NS     500 GB / 16 MB / 8 ms / ? / SATA
Seagate NL35.2 ST3500841NS     500 GB /  8 MB / 8 ms / ? / SATA
Western Digital SE16 WD5000KS  500 GB / 16 MB / 8.9 ms / SATA
Western Digital RE2 WD5000YS   500 GB / 16 MB / 8.7 ms / SATA

I've tried Maxtor and IBM (now Hitachi) harddrives. Both makes have failed on me, but most of the time due to horrible packaging.

I don't care a split-second whether one kind is marginally faster than the other, so all the reviews on AnandTech etc. are utterly useless to me. There are infinitely more effective ways to get better performance than buying a slightly faster harddrive.

I DO care about quality, namely:
* How often the drives have catastrophic failures,
* How they handle heat (dissipation acceptance - how hot before it fails?),
* How big the spare area is,
* How often they have single-sector failures,
* How long the manufacturer warranty lasts,
* How easy the manufacturer is to work with wrt. warranty.

I haven't been able to figure out the spare area size, heat properties, etc. for any drives. Thus my only criterion so far has been manufacturer warranty: how much bitching do I get when I tell them my drive doesn't work.

My main experience is with Maxtor. Maxtor has been nothing less than superb wrt. warranty! Download an ISO with a diag tool, burn the CD, boot the CD, type the fault code it prints into Maxtor's site, and a day or two later you've got a new drive in the mail, plus packaging to ship the old one back in. If something odd happens, call them up and they're extremely helpful. Unfortunately, I lack thorough experience with the other brands.

Questions:
==========
A.) Does anyone have experience with returning Hitachi, Seagate or WD drives to the manufacturer? Do they have a manufacturer warranty at all? How much/little trouble did you have with Hitachi, Seagate or WD?

B.) Can anyone *prove* (to a reasonable degree) that drives from manufacturer H, M, S or WD are of better quality? Has anyone seen a review that heat/shock/stress tests drives?

C.) Do good SATA cables exist? E.g. cables that lock on to the drives, or backplanes which lock the entire disk in place?

Thanks for reading, and thanks in advance for answers (if any) :-).
Re: remark and RFC
Peter T. Breuer wrote:
> 1) I would like raid request retries to be done with exponential delays, so that we get a chance to overcome network brownouts. I presume the former will either not be objectionable

You want to hurt performance for every single MD user out there, just because things don't work optimally under enbd, which is after all a rather rare use case compared to using MD on top of real disks. Uuuuh.. yeah, no objections there.

Besides, it seems a rather pointless exercise to try and hide the fact from MD that the device is gone, since it *is* in fact missing. Seems wrong at the least.

> 2) I would like some channel of communication to be available with raid that devices can use to say that they are OK and would they please be reinserted in the array. The latter is the RFC thing

It would be reasonable for MD to know the difference between
- device has (temporarily, perhaps) gone missing, and
- device has physical errors when reading/writing blocks,
because if MD knew that, then it would be trivial to automatically hot-add the missing device once it's available again, whereas the faulty one would need the administrator to get off his couch. This would help in other areas too, like when a disk controller dies, or a cable comes (completely) loose. Even if the IDE drivers are not mature enough to tell us which kind of error it is, MD could still implement such a feature just to help enbd.

I don't think a comm-channel is the right answer, though. I think the type=(missing/faulty) information should be embedded in the I/O error message from the block layer (enbd in your case) instead, to avoid race conditions and allow MD to make good decisions as early as possible. The comm channel and "hey, I'm OK" message you propose doesn't seem that different from just hot-adding the disks from a shell script using 'mdadm' (rough sketch below).

> When the device felt good (or ill) it notified the raid arrays it knew it was in via another ioctl (really just hot-add or hot-remove), and the raid layer would do the appropriate catchup (or start bitmapping for it).

No point in bitmapping. Since with the network down and all the devices underlying the RAID missing, there's nowhere to store data. Right? Some more factual data about your setup would maybe be good..

> all I can do is make the enbd device block on network timeouts. But that's totally unsatisfactory, since real network outages then cause permanent blocks on anything touching a file system mounted remotely. People don't like that.

If it's just this that you want to fix, you could write a DM module which returns an I/O error if the request to the underlying device takes more than 10 seconds. Layer that module on top of the RAID, and make your enbd device block on network timeouts. Now the RAID array doesn't see missing disks on network outages, and users get near-instant errors when the array isn't responsive due to a network outage.
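For illustration, a minimal (untested) sketch of the hot-add-from-a-script idea; /dev/md0 and /dev/nda0 are made-up names, so substitute your own array and enbd device:

    while true; do
        # Has the device been marked faulty (F) in /proc/mdstat?
        if grep -q 'nda0\[[0-9]*\](F)' /proc/mdstat; then
            # Probe it; if it answers again, drop the stale entry and re-add it.
            if dd if=/dev/nda0 of=/dev/null bs=512 count=1 2>/dev/null; then
                mdadm /dev/md0 --remove /dev/nda0
                mdadm /dev/md0 --add /dev/nda0
            fi
        fi
        sleep 10
    done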
Re: remark and RFC
Peter T. Breuer wrote:
> > You want to hurt performance for every single MD user out there, just
> There's no performance drop! Exponentially staged retries on failure are standard in all network protocols ... it is the appropriate reaction in general, since stuffing the pipe full of immediate retries doesn't allow the would-be successful transactions to even get a look in against that competition.

That's assuming that there even is a pipe, which is something specific to ENBD / networked block devices, not something that the MD driver should in general care about.

> > because things don't work optimally under enbd, which is after all a rather rare use case compared to using MD on top of real disks.
> Strawman.

Quah?

> > Besides, it seems a rather pointless exercise to try and hide the fact from MD that the device is gone, since it *is* in fact missing.
> Well, we don't really know that for sure. As you know, it is impossible to tell in general if the net has gone awol or is simply heavily overloaded (with retry requests).

From MD's point of view, if we're unable to complete a request to the device, then it's either missing or faulty. If a call to the device blocks, then it's just very slow. I don't think it's wise to pollute these simple mechanics with a "maybe it's sort-of failing due to a network outage, which might just be a brownout" scenario. Better to solve the problem in a more appropriate place, somewhere that knows about the fact that we're simulating a block device over a network connection. Not introducing network-block-device aware code in MD is a good way to avoid wrong code paths and weird behaviour for real block device users. Missing vs. faulty is OTOH a pretty simple interface, which maps fine to both real disks and NBDs.

> The retry on error is a good thing. I am simply suggesting that if the first retry also fails, we do some back-off before trying again, since it is now likely (lacking more knowledge) that the device is having trouble and may well take some time to recover. I would suspect that an interval of 0 1 5 10 30 60s would be appropriate for retries.

Only for networked block devices. Not for real disks - there you are just causing unbearable delays for users for no good reason, in the event that this code path is taken.

> One can cycle that twice for luck before giving up for good, if you like. The general idea in such backoff protocols is that it avoids filling a fixed bandwidth channel with retries (the sum of a constant times 1 + 1/2 + 1/4 + ... is a finite proportion of the channel bandwidth, but the sum of 1+1+1+1+1+... is unbounded), but here also there is an _additional_ assumption that the net is likely to have brownouts, and so we _ought_ to retry at intervals, since retrying immediately will definitely almost always do no good.

Since the knowledge that the block device is on a network resides in ENBD, I think the most reasonable thing to do would be to implement the backoff in ENBD. It should be relatively simple to catch MD retries in ENBD and block for 0 1 5 10 30 60 seconds. That would keep the network backoff algorithm in a more appropriate place, namely the place that knows the device is on a network.

> In normal failures there is zero delay anyway.

Since the first retry would succeed, or? I'm not sure what this "normal failure" is, btw.

> And further, the bitmap takes care of delayed responses in the normal course of events.

Mebbe. Does it?
> > It would be reasonable for MD to know the difference between
> > - device has (temporarily, perhaps) gone missing, and
> > - device has physical errors when reading/writing blocks,
> I agree. The problem is that we can't really tell what's happening (even in the lower level device) across a net that is not responding.

In the case where requests can't be delivered over the network (or a SATA cable, whatever), it's a clear case of a missing device.

> > because if MD knew that, then it would be trivial to automatically hot-add the missing device once available again. Whereas the faulty one would need the administrator to get off his couch.
> Yes. The idea is that across the net approximately ALL failures are temporary ones, to a value of something like 99.99%. The cleaning lady is usually dusting the on-off switch on the router.

> > This would help in other areas too, like when a disk controller dies, or a cable comes (completely) loose. Even if the IDE drivers are not mature enough to tell us which kind of error it is, MD could still implement such a feature just to help enbd. I don't think a comm-channel is the right answer, though. I think the type=(missing/faulty) information should be embedded in the I/O error message from the block layer (enbd in your case) instead, to avoid race conditions and allow MD to make good decisions as early as possible.
> That's a possibility. I certainly get two types of error back in the enbd driver .. remote error or network error. Remote error is when we
Re: remark and RFC
Peter T. Breuer wrote:
> > We can't do a HOT_REMOVE while requests are outstanding, as far as I know. Actually, I'm not quite sure which kind of requests you are talking about.
> Only one kind. Kernel requests :). They come in read and write flavours (let's forget about the third race for the moment).

I was wondering whether you were talking about requests from eg. userspace to MD, or from MD to the raw device. I guess it's not that important really; that's why I asked you off-list. Just getting in too deep, and being curious.

> Pipe refers to a channel of fixed bandwidth. Every communication channel is one. The pipe for a local disk is composed of the bus, disk architecture, controller, and also the kernel architecture layers. [snip] See above. The problem is generic to fixed bandwidth transmission channels, which, in the abstract, is everything. As soon as one does retransmits, one has a kind of obligation to keep retransmissions down to a fixed maximum percentage of the potential traffic, which is generally accomplished via exponential backoff (a time-wise solution, in other words, deliberately smearing retransmits out along the time axis in order to prevent spikes).

Right, so with the bandwidth to local disks being, say, 150 MB/s, an appropriate backoff would be 0 0 0 0 0 0.1 0.1 0.1 0.1 secs. We can agree on that pretty fast.. right? ;-).

> The md layers now can generate retries by at least one mechanism that I know of .. a failed disk _read_ (maybe of existing data or parity data as part of an exterior write attempt) will generate a disk _write_ of the missed data (as reconstituted via redundancy info). I believe a failed disk _write_ may also generate a retry,

Can't see any reason why MD would try to fix a failed write, since the retry is not likely to succeed anyway.

> Such delays may in themselves cause timeouts in md - I don't know. My RFC (maybe RFD) is aimed at raising a flag saying that something is going on here that needs better control.

I'm still not convinced MD does retries at all..

> What the upper layer, md, ought to do is back off.

I think it should just kick the disk.

> > I don't think it's wise to pollute these simple mechanics with a "maybe it's sort-of failing due to a network outage, which might just be a brownout" scenario. Better to solve the problem in a more appropriate place, somewhere that knows about the fact that we're simulating a block device over a network connection.
> I've already suggested a simple mechanism above .. back off on the retries, already. It does no harm to local disk devices.

Except if the code path gets taken, and the user has to wait 10+20+30+60s for each failed I/O request.

> If you like, the constant of backoff can be based on how long it took the underlying device to signal the io request as failed. So a local disk that replies "failed" immediately can get its range of retries run through in a couple of hop, skip and millijiffies. A network device that took 10s to report a timeout can get its next retry back again in 10s. That should give it time to recover.

That sounds saner to me.

> > Not introducing network-block-device aware code in MD is a good way to avoid wrong code paths and weird behaviour for real block device users.
> Uh, the net is everywhere. When you have 10PB of storage in your intelligent house's video image file system, the parts of that array are connected by networking room to room. Supercomputers used to have simple networking between each computing node.

Heck, clusters still do :). Please keep your special-case code out of the kernel :-). Uhm.

> > Missing vs. Faulty is OTOH a pretty simple interface, which maps fine to both real disks and NBDs.
> It may well be a solution. I think we're still at the stage of precisely trying to identify the problem, too! At the moment, most of what I can say is: definitely, there is something wrong with the way the md layer reacts, or can be controlled, with respect to networking brown-outs and NBDs.

> > Not for real disks, there you are just causing unbearable delays for users for no good reason, in the event that this code path is taken.
> We are discussing _error_ semantics. There is no bad effect at all on normal working!

In the past, I've had MD run a box to a grinding halt more times than I like. It always results in one thing: the user pushing the big red switch. That's not acceptable for a RAID solution. It should keep working, without blocking all I/O from userspace for 5 minutes just because it thinks it's a good idea to hold up all I/O requests to underlying disks for 60s each, waiting to retry them.

> The effect on normal working should even be _good_ when errors occur, because now max bandwidth devoted to error retries is limited, leaving more max bandwidth for normal requests.

Assuming you use your RAID component device as a regular device also, and that the underlying device is not able to satisfy the requests as fast as you
Re: trying to brute-force my RAID 5...
Sevrin Robstad wrote:
> I created the RAID when I installed Fedora Core 3 some time ago, didn't do anything special, so the chunks should be 64 kbyte and parity should be left-symmetric?

I have no idea what's default on FC3, sorry.

> Any idea?

I missed that you were trying to fdisk -l /dev/md0.. As others have suggested, search for filesystems using fsck, or mount, or what not ;-).
Re: trying to brute-force my RAID 5...
Sevrin Robstad wrote:
> I got a friend of mine to make a list of all the 6^6 combinations of dev 1 2 3 4 5 missing, shouldn't this work???

Only if you get the layout and chunk size right. And make sure that you know whether you were using partitions (eg. sda1) or whole drives (eg. sda - bad idea). A loop like the sketch below could automate the testing.
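To illustrate, an untested sketch of such a brute-force loop. Assumptions: 64k chunks and left-symmetric parity (the FC3 guesses from earlier), a permutations.txt holding one device ordering per line (e.g. "/dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 missing"), and the usual warning that every --create rewrites the superblocks, so only do this when there's nothing left to lose:

    while read -r order; do
        mdadm --stop /dev/md0 2>/dev/null
        # --assume-clean avoids a resync; --run skips the "really create?" prompt
        mdadm --create /dev/md0 --run --assume-clean -l 5 -n 6 -c 64 \
              --parity=left-symmetric $order
        # read-only probe: does a recognisable filesystem appear?
        if fsck -n /dev/md0 >/dev/null 2>&1; then
            echo "candidate ordering: $order"
        fi
    done < permutations.txt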
Re: only 4 spares and no access to my data
Karl Voit wrote:
> if (super == NULL) {
>     fprintf(stderr, Name ": No suitable drives found for %s\n", mddev);
> [...]
> Well I guess the message will be shown if the superblock is not found.

Yes. No clue why; my best guess is that you've already zeroed the superblock. What does mdadm --query / --examine say about /dev/sd[abcd] - are there superblocks?

> st = guess_super(fd);
> if (st == NULL) {
>     if (!quiet)
>         fprintf(stderr, Name ": Unrecognised md component device - %s\n", dev);
> Again: this seems to be the case when the superblock is empty.

Yes, it looks like it can't find any usable superblocks. Maybe you've accidentally zeroed the superblocks on sd[abcd]1 also? If you fdisk -l /dev/sd[abcd], do the partition tables look like they should / like they used to? What does mdadm --query / --examine /dev/sd[abcd]1 tell you - any superblocks?

> Since my miserable failure I am probably too careful *g* The problem is also that, without deeper background knowledge, I cannot predict whether this or that permanently affects the real data on the disks.

My best guess is that it's OK and you won't lose data if you run --zero-superblock on /dev/sd[abcd] and then create an array on /dev/sd[abcd]1, but I do find it odd that it suddenly can't find superblocks on /dev/sd[abcd]1.

> Maybe a person like me starts to think that sw-raid tools like mdadm should warn users before permanent changes are executed. If mdadm is to be used by ordinary users (in addition to raid-geeks like you), it might be a good idea to prevent data loss. (Meant as a suggestion.)

Perhaps. Or perhaps mdadm should just tell you that you're doing something stupid if you try to manipulate arrays on a block device which seems to contain a partition table. It's not like it's even remotely useful to create an MD array spanning the whole disk rather than spanning a partition which spans the whole disk, anyway.
Re: only 4 spares and no access to my data
Karl Voit wrote:
> 443: root at ned ~ # mdadm --examine /dev/sd[abcd]
> shows that all 4 devices are ACTIVE SYNC. Please note that there is no 1 behind sda up to sdd!

Yes, you're right. It seems you've created an array/superblocks both on sd[abcd] (line 443 onwards) and on sd[abcd]1 (line 66 and onwards). I'm unsure why 'pvscan' says there is an LVM PV on sda1 (lines 118/119). Probably it's a misfeature in LVM, causing it to find the PV inside the MD volume if the array has not been started (since it says that the PV is ~700 GB).

> > Running zero-superblock on sd[abcd] and then assembling the array from sd[abcd]_1_ sounds odd to me.
> Well, this is because of the false(?) superblocks of sda-sdd in comparison to sda1 to sdd1.

OK, I missed that part of the story. In that case it sounds sane to zero the superblocks on sd[abcd], seeing that 'pvscan' and 'lvscan' find live data - which you could back up - on the array consisting of sd[abcd]1.

> [EMAIL PROTECTED] ~ # mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
> mdadm: cannot open device /dev/sda1: Device or resource busy
> mdadm: /dev/sda1 has no superblock - assembly aborted

Odd message. Does lsof | grep sda show anything using /dev/sda(1)?

> I should have mentioned that I did not use the whole hard drive space for sd[abcd]1. I thought that if I have to replace one of my Samsungs with another drive that doesn't have the very same capacity, I'd better use exactly 250 GB partitions and forget the last approx. 49 MB of the drives.

Good idea.

> The problem seems to be the superblocks.

Which ones - those on sd[abcd]1? You've probably destroyed them by syncing the array consisting of sd[abcd].

> Can I repair them?

No, but you can recreate them without touching your data. I think the suggestion from Andreas Gredler sounds sane. I'm unsure whether hot-adding a device will recreate a superblock on it; therefore I'd probably run --create on all four devices and use sysfs to force a repair, instead of (as Andreas suggests) creating the array with one 'missing' device. Do remember to zero the superblocks on sd[abcd] first, to prevent mishaps...
Re: only 4 spares and no access to my data
Henrik Holst wrote:
> Is sda1 occupying the entire disk? Since the superblock is the /last/ 128Kb (I'm assuming 128*1024 bytes), the superblocks should be one and the same.

Ack, never considered that. Ugly!!!
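For the curious, the 0.90 superblock offset is easy to compute by hand. A sketch, assuming the standard 64K reservation at the end of the device:

    SECT=$(blockdev --getsz /dev/sda)    # device size in 512-byte sectors
    SB=$(( (SECT / 128) * 128 - 128 ))   # round down to a 64K boundary, back off one 64K block
    echo "0.90 superblock starts at sector $SB"

If sda1 runs to the end of sda, the two calculations can land on the same physical sectors - hence "one and the same" superblock.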
Re: only 4 spares and no access to my data
Karl Voit wrote:
> OK, I upgraded my kernel and mdadm:
> uname -a: Linux ned 2.6.13-grml #1 Tue Oct 4 18:24:46 CEST 2005 i686 GNU/Linux

That release is 10 months old. The newest release is 2.6.17. You can see the changes to MD since 2.6.13 here:
http://www.kernel.org/git/?p=linux%2Fkernel%2Fgit%2Fstable%2Flinux-2.6.17.y.git;a=search;s=md%3A
Anything from 2005-09-09 and further up the list is something that's in 2.6.17 but not in 2.6.13. For example, your MD does not have sysfs support, it seems...

> dpkg --list mdadm -- 2.4.1-6

The newest release is 2.5.2. 2.4.1 is 3 months old.

> Is it true that I should try the following lines?
> mdadm --stop /dev/md0
> mdadm --zero-superblock /dev/sda
> mdadm --zero-superblock /dev/sdb
> mdadm --zero-superblock /dev/sdc
> mdadm --zero-superblock /dev/sdd
> mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 --force
> (check if it worked - probably not - and if not, try the following line)
> mdadm --create -n 4 -l 5 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

I don't have a unix box right here, but yes, that looks correct to me. You can make certain that the ordering of the devices is correct by looking in your paste bin, lines 12-15. The other RAID parameters (raid level, # of devices, persistence, layout, chunk size) can be seen on lines 212-231.

> Did you mean something like
> echo repair > /sys/block/md0/md/sync_action

Exactly. (Gee, I hope someone stops me if I'm giving out bad advice. Heh ;-).) You can also assemble the array read-only after recreating the superblocks, and you can use "check" as a sync_action... But only if your kernel has MD with sysfs support ;-).
Re: only 4 spares and no access to my data
Karl Voit wrote:
> I published the whole story (as much as I could log during my reboots and so on) on the web: http://paste.debian.net/8779

From the paste bin:

443: [EMAIL PROTECTED] ~ # mdadm --examine /dev/sd[abcd]

shows that all 4 devices are ACTIVE SYNC. The next command,

563: [EMAIL PROTECTED] ~ # mdadm --assemble --update=summaries /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm: /dev/md0 assembled from 0 drives and 4 spares - not enough to start the array.

and then

568: [EMAIL PROTECTED] ~ # mdadm --examine /dev/sd[abcd]1

suddenly shows all 4 devices as SPARE? What the heck happened in between? Did you do anything evil, or is it an MD bug, or what?

> mdadm-version: 1.12.0-1
> uname: Linux ned 2.6.13-grml

You should probably upgrade at some point; there's always a better chance that devels will look at your problem if you're running the version that they're sitting with..

> Andreas Gredler suggested the following lines as a last attempt, but with a risk of losing data, which I want to avoid:
> mdadm --stop /dev/md0
> mdadm --zero-superblock /dev/sda
> mdadm --zero-superblock /dev/sdb
> mdadm --zero-superblock /dev/sdc
> mdadm --zero-superblock /dev/sdd
> mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 --force
> mdadm --create -n 4 -l 5 /dev/md0 missing /dev/sdb1 /dev/sdc1 /dev/sdd1

Running zero-superblock on sd[abcd] and then assembling the array from sd[abcd]_1_ sounds odd to me.
Re: Cutting power without breaking RAID
Tim wrote:
> That would probably be ideal: issue the power-off command with something like a 30 second timeout, which would give the system time to power off cleanly first.

I don't think that's ideal. Many systems restore power to the last known state, so powering off cleanly would result in the machine not coming back up after the power cycle. Some machines can be changed in the BIOS to always power on. If the server administrator remembers to do so, that is.
Re: Ok to go ahead with this setup?
Christian Pernegger wrote:
> Intel SE7230NH1-E mainboard
> Pentium D 930

HPA recently said that x86_64 CPUs have better RAID5 performance.

> Promise Ultra133 TX2 (2ch PATA) - 2x Maxtor 6B300R0 (300 GB, DiamondMax 10) in RAID1
> Onboard Intel ICH7R (4ch SATA) - 4x Western Digital WD5000YS (500 GB, Caviar RE2) in RAID5

Is it a NAS kind of device? In that case, drop the 2x 300 GB disks and get 6x 500 GB instead. You can partition those so that you have a RAID1 spanning the first 10 GB of all 6 drives for use as the system partition, and use the rest in a RAID5 (sketch below).

> * Does this hardware work flawlessly with Linux?

No clue.

> * Is it advisable to boot from the mirror?

Should work. Would the box still boot with only one of the disks? If you configure things correctly, it should - but better test it.

> * Can I use EVMS as a frontend? Does it even use md, or is EVMS's RAID something else entirely?

Yes. EVMS uses a lot of underlying software, MD being one component.

> * Should I use the 300s as a single mirror, or span multiple ones over the two disks?

What would the purpose be?

> * Am I even correct in assuming that I could stick an array in another box and have it work?

Work for what?

> Comments welcome

Get gigabit NICs, in case you want to fiddle with iSCSI? :-).
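Roughly like this, assuming the six drives show up as sda..sdf, each with a small first partition (sdX1) and the rest in sdX2, both tagged 0xfd (untested sketch):

    mdadm --create /dev/md0 -l 1 -n 6 /dev/sd[a-f]1   # ~10 GB system mirror
    mdadm --create /dev/md1 -l 5 -n 6 /dev/sd[a-f]2   # the remainder as RAID5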
Re: RAID tuning?
Nix wrote:
> Adam Talbot wrote:
> > Can any one give me more info on this error? Pulled from /var/log/messages.
> > raid6: read error corrected!!
> The message is pretty easy to figure out, and the code (in drivers/md/raid6main.c) is clear enough.

But the message could be clearer; for instance, it would be a very good service to the user if the message showed which block had been corrected. Also, using two !! in place of one ! makes the message seem a little dumb :-).
Re: Horrific Raid 5 crash... help!
David M. Strang wrote:
> Well today, during this illustrious rebuild... it appears I actually DID have a disk fail. So, I have 26 disks... 1 partially rebuilt, and 1 failed.

A common scenario, it seems.

> Hoping and praying that a rebuild didn't actually wipe the disk and maybe just synced things up -- I did a create with the 26 disks + 1 partially rebuilt and 1 'missing' disk

I've always loathed that approach. It seems *so* wrong to force MD to nuke all the information that is in the superblocks. I know this is the recommended approach, and I wish it would be changed. Better sooner than later, too :-).

> well, the array came up but I get access denied on a zillion things, and the filesystem is freaking out.

Do you have something like mdadm's printout of the superblocks from before and after you did the mdadm --create? Without that information, it's going to be hard to tell what's going on. My best guess is that you've assembled the array using the halfway-rebuilt disk, that it was kicked by MD long ago on the basis of a single bad block somewhere, and that the disk happens to contain a bunch of old data which confuses the filesystem. But it's a *very* wild guess.

> Before I proceed any further... what are my options? Do I have any options?

Hard to say unless you can tell us which disks failed, and when, and which disks you used to assemble now, etc. etc. The more information, the better.

> I could run a fsck... but I held off, fearing it could just make things worse.

Good thinking.. I've often seen ext3 fsck do a lot more harm than good. (I believe reiser's fsck is pretty good, but I haven't got any experience with it.) Until you're sure you've got everything right, I wouldn't run fsck. In fact, assemble the array read-only and mount the filesystem read-only (something like the sketch below). If you need to make modifications and you're not sure you're doing it right, you can make a duplicate of the individual disks in the RAID and perform experiments on those. Of course, the price tag depends on how many GB your array is..
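Something along these lines, perhaps (untested; device names invented, and check that your mdadm version supports read-only assembly):

    # keep MD from writing anything while you look around
    mdadm --assemble --readonly /dev/md0 /dev/sd[b-z]1   # your member list here
    mount -o ro /dev/md0 /mnt/recovery

    # or experiment on a copy of a member instead of the real thing
    dd if=/dev/sdb of=/dev/sdz bs=1M conv=noerror,sync   # sdz = scratch disk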
Re: md: Change ENOTSUPP to EOPNOTSUPP
Ric Wheeler wrote:
> You are absolutely right - if you do not have a validated, working barrier for your low level devices (or a high end, battery backed array or JBOD), you should disable the write cache on your RAIDed partitions and on your normal file systems ;-) There is working support for SCSI (or libata S-ATA) barrier operations in mainline, but they conflict with queue-enabled targets, which ends up leaving queueing on and disabling the barriers.

Thank you very much for the information! How can I check that I have a validated, working barrier with my particular kernel version etc.? (Do I just assume that since it's not SCSI, it doesn't work?)

I find it, hmm... stupefying? horrendous? completely brain dead? I don't know.. that no one warns users about this. I bet there's a million people out there happily using MD (probably installed and initialized with Fedora Core / anaconda) and thinking their data is safe, while in fact it is anything but. Damn, this is not a good situation.. (Any suggestions for a good place to fix this? Better really, really, really late than never...)
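For what it's worth, actually turning the write cache off would presumably look something like this (hdparm for (S)ATA drives, sdparm for SCSI; double-check against your distro's tools):

    hdparm -W0 /dev/sda           # switch the drive's write cache off
    hdparm -W /dev/sda            # query the current write-caching setting
    sdparm --clear=WCE /dev/sda   # SCSI equivalent: clear the Write Cache Enable bit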
Re: [PATCH 003 of 5] md: Change ENOTSUPP to EOPNOTSUPP
NeilBrown wrote:
> Change ENOTSUPP to EOPNOTSUPP
> Because that is what you get if a BIO_RW_BARRIER isn't supported!

Dumb question, hope someone can answer it :). Does this mean that any version of MD up till now won't know that SATA disks do not support barriers, and therefore won't flush SATA disks, and therefore I need to disable the disks' write cache if I want to be 100% sure that raid arrays are not corrupted? Or am I way off :-).
Re: to be or not to be...
gelma wrote:
> first run: lots of strange errors reported about impossible i_size values, duplicated blocks, and so on

You mention only filesystem errors, no block-device related errors. In this case, I'd say that it's more likely that dm-crypt is to blame rather than MD. I think you should try the dm-devel mailing list. Posting a complete log of everything that has happened would probably be a good thing.

I have no experience with dm-crypt, but I do have experience with another dm target (dm-snapshot), which is very good at destroying my data.

If you want a stable solution for encrypting your files, I can recommend loop-aes. loop-aes has very well-thought-through security, the docs are concise but have wide coverage, and it has good backwards compatibility - probably not your biggest concern right now, but it is *very* nice to know that your data will be accessible in the future as well as now - etc.. I've been using it for a couple of years now, since the 2.2 or 2.4 days (can't remember), and I've had nothing short of an absolutely *brilliant* experience with it. Enough propaganda for now; hope you get your problem solved :-).
Re: data recovery on raid5
Sam Hopkins wrote:
> mdadm -C /dev/md0 -n 4 -l 5 missing /dev/etherd/e0.[023]

While it should work, it's a bit drastic, perhaps? I'd start with mdadm --assemble --force. With --force, mdadm will pull the event counter of the most recently failed drive up to current status, which should give you a readable array.

After that, you could try running a check by echo'ing "check" into sync_action (rough command sequence below). If the check succeeds, fine: hot-add the last drive to your array and MD will start resyncing. If the check fails because of a bad block, you'll have to make a decision: live with the lost blocks, or try to reconstruct from the first-kicked disk. I posted a patch this week that will allow you to forcefully get the array started with all of the disks - but beware: MD wasn't made with this in mind, and it will probably be confused and sometimes pick data from the first-kicked drive over data from the other drives. Only forcefully start the array with all drives if you absolutely have to...

Oh, and I'm not an expert by any means, so take everything I say with a grain of salt :-).
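Spelled out, roughly (untested, and the member names are just my guess at your setup):

    mdadm --assemble --force /dev/md0 /dev/etherd/e0.0 /dev/etherd/e0.2 /dev/etherd/e0.3
    echo check > /sys/block/md0/md/sync_action
    cat /proc/mdstat                        # wait here until the check finishes
    cat /sys/block/md0/md/mismatch_cnt      # 0 = the check was clean
    mdadm /dev/md0 --add /dev/etherd/e0.1   # then hot-add the kicked drive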
Re: Problem with 5disk RAID5 array - two drives lost
Tim Bostrom wrote:
> It appears that /dev/hdf1 failed this past week and /dev/hdh1 failed back in February.

An obvious question would be: how much have you been altering the contents of the array since February?

> I tried a mdadm --assemble --force and was able to get the following:
> ==
> mdadm: forcing event count in /dev/hdf1(1) from 777532 upto 777535
> mdadm: clearing FAULTY flag for device 2 in /dev/md0 for /dev/hdf1
> raid5: raid level 5 set md0 active with 4 out of 5 devices, algorithm 2
> mdadm: /dev/md0 has been started with 4 drives (out of 5).
> ==

Looks good.

> I then tried to mount /dev/md0

A bit premature, I'd say.

> raid5: Disk failure on hdf1, disabling device.

MD doesn't like to find errors when it's rebuilding. It will kick that disk off the array, which will cause MD to return crap (instead of stopping the array and removing the device - I wonder why), in turn causing 'mount' etc. to fail. Quite unfortunate for you, since you have absolutely no redundancy with 4/5 drives, and you really can't afford to have the 4th disk kicked just because there's a bad block on it. This is something that MD could probably handle much better than it does now.

In your case, you probably want to try and reconstruct from all 5 disks, but without losing the information in their event counters - you want MD to use as much data as it can from the 4 fresh disks (assuming that they're at least 99% readable), and only when there's a rare bad block on one of them should it use data from the 5th. Seeing as

1) MD doesn't automatically check your array unless you ask it to, and
2) modern disks have a habit of developing lots of bad blocks,

it would be very nice if MD could help out in these kinds of situations. Unfortunately, implementation is tricky as I see it, and currently MD can do no such thing.

> spurious 8259A interrupt: IRQ7.

Oops. I'd look into that; I think it's a known bug. (Then again, maybe it's just the IDE drivers - I've experienced really bad IRQ handling both with old-style IDE and with libata.)

> hdf: dma_intr: error=0x40 { UncorrectableError }, LBAsect=6720

Hey, it's telling you where your data used to be. Cute.

> raid5: Disk failure on hdf1, disabling device. Operation continuing on 3 devices

Haha! Real bright there, MD, continuing raid5 operation with 3/5 devices. Still not a bug, eh? :-) *poke, poke*

> I'm guessing /dev/hdf is shot.

Actually, there are a lot of sequential sector numbers in the output you posted, and I think it's unusual for a drive to develop that many bad blocks in a row. I could be wrong, and it could be a head crash or something (have you been moving the system around much?). But if I had to guess, I'd say there's a real likelihood that it's a loose cable, a controller problem or a driver issue. Could you try and run:

# dd if=/dev/hdf of=/dev/null bs=1M count=100 skip=1234567

You can play around with different random numbers instead of 1234567. If it craps out *immediately*, then I'd say it's a cable problem or so, and not a problem with what's on the platters.

> I haven't tried an fsck though. Would this be advisable?

No - get the array running first, then fix the filesystem. You can initiate array checks and repairs like this:

# cd /sys/block/md0/md/
# echo check > sync_action

or

# echo repair > sync_action

Or something like that.

> Is there a way that I can try and build the array again with /dev/hdh instead of /dev/hdf, with some possible data corruption on files that were added since Feb?

Let's first see if we can't get hdf online.
Re: data recovery on raid5
Jonathan wrote:
> # mdadm -C /dev/md0 -n 4 -l 5 missing /dev/etherd/e0.[023]

I think you should have tried mdadm --assemble --force first, as I proposed earlier. By doing the above, you have effectively replaced your version 0.9.0 superblocks with version 0.9.2 ones. I don't know if version 0.9.2 superblocks are larger than 0.9.0; Neil hasn't responded on that yet. Potentially hazardous, who knows.

Anyway. This is from your old superblock, as described by Sam Hopkins:

/dev/etherd/blah:
  Chunk Size : 32K

This is from what you've just posted:

/dev/etherd/blah:
  Chunk Size : 64K

If I were you, I'd recreate your superblocks now, but with the correct chunk size (use -c).

> We'll be happy to pay you for your services.

I'll be modest and charge you a penny per byte of data recovered, ho hum.
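(By the way, you can double-check the current superblock values straight off any member at any time, e.g.:

    mdadm --examine /dev/etherd/e0.2 | egrep 'Raid Level|Chunk Size|Layout'

- the device name here is just an example.)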
Re: data recovery on raid5
Jonathan wrote:
> I was already terrified of screwing things up, now I'm afraid of making things worse

Adrenalin... makes life worth living there for a sec, doesn't it ;o)

> based on what was posted before, is this a sensible thing to try?
> mdadm -C /dev/md0 -c 32 -n 4 -l 5 missing /dev/etherd/e0.[023]

Yes, looks exactly right.

> Is what I've done to the superblock size recoverable?

I don't think you've done anything at all. I just *don't know* if you have, that's all. I was just trying to say that it wasn't super-cautious of you to begin with :-).

> I don't understand how mdadm --assemble would know what to do, which is why I didn't try it initially.

By giving it --force, you tell it to forcefully assemble the array even though it might be damaged. That means including some disks (the freshest ones) that are out of sync. Does that help?
Re: data recovery on raid5
Jonathan wrote:
> Well, the block sizes are back to 32k now, but I still had no luck mounting /dev/md0 once I created the array.

Ahem, I missed something. Sorry, the 'a' was hard to spot. Your array used "Layout : left-asymmetric", while the superblock you've just created has "Layout : left-symmetric". Try again, but add the option --parity=left-asymmetric.
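Putting the pieces from this thread together, that would presumably be:

    mdadm -C /dev/md0 -c 32 --parity=left-asymmetric -n 4 -l 5 missing /dev/etherd/e0.[023]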
Re: data recovery on raid5
Jonathan wrote:
> how safe should the following be?
> mdadm --assemble /dev/md0 --uuid=8fe1fe85:eeb90460:c525faab:cdaab792 /dev/etherd/e0.[01234]

You can hardly do --assemble anymore. After you have recreated superblocks on some of the devices, those are conceptually part of a different raid array - at least as seen by MD.

> I am *really* not interested in making my situation worse.

We'll keep going till you've got your data back.. Recreating superblocks again on e0.{0,2,3} can't hurt, since you've already done this and thereby nuked the old superblocks.

You can shake your own hand and thank yourself now (oh, and Sam too) for posting all the debug output you have. Otherwise we would probably never have spotted nor known about the parity/chunk size differences :o).
libata retry - disable?
Does anyone know of a way to disable libata's 5-time retry when a read fails? It has the effect of causing every failed sector read to take 6 seconds before it fails, making raid5 rebuilds go awfully slow. It's generally undesirable too, when you've got RAID on top that can write replacement data onto the failed sectors..

Log showing a failed sector:

Apr 18 09:49:53 linux kernel: end_request: I/O error, dev sda, sector 131124407
Apr 18 09:49:55 linux kernel: ata1: no sense translation for status: 0x51
Apr 18 09:49:55 linux kernel: ata1: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
Apr 18 09:49:55 linux kernel: ata1: status=0x51 { DriveReady SeekComplete Error }
Apr 18 09:49:56 linux kernel: ata1: no sense translation for status: 0x51
Apr 18 09:49:56 linux kernel: ata1: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
Apr 18 09:49:56 linux kernel: ata1: status=0x51 { DriveReady SeekComplete Error }
Apr 18 09:49:57 linux kernel: ata1: no sense translation for status: 0x51
Apr 18 09:49:57 linux kernel: ata1: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
Apr 18 09:49:57 linux kernel: ata1: status=0x51 { DriveReady SeekComplete Error }
Apr 18 09:49:59 linux kernel: ata1: no sense translation for status: 0x51
Apr 18 09:49:59 linux kernel: ata1: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
Apr 18 09:49:59 linux kernel: ata1: status=0x51 { DriveReady SeekComplete Error }
Apr 18 09:50:00 linux kernel: ata1: no sense translation for status: 0x51
Apr 18 09:50:00 linux kernel: ata1: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
Apr 18 09:50:00 linux kernel: ata1: status=0x51 { DriveReady SeekComplete Error }
Apr 18 09:50:01 linux kernel: ata1: no sense translation for status: 0x51
Apr 18 09:50:01 linux kernel: ata1: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
Apr 18 09:50:01 linux kernel: ata1: status=0x51 { DriveReady SeekComplete Error }
Apr 18 09:50:01 linux kernel: sd 0:0:0:0: SCSI error: return code = 0x802
Apr 18 09:50:01 linux kernel: sda: Current: sense key: Medium Error
Apr 18 09:50:01 linux kernel: Additional sense: Unrecovered read error - auto reallocate failed
Apr 18 09:50:01 linux kernel: end_request: I/O error, dev sda, sector 131124415
Apr 18 09:50:02 linux kernel: raid5: read error corrected!!
Apr 18 09:50:02 linux last message repeated 112 times
mdadm -C / 0.90?
Hi Neil, list,

You wrote:
> mdadm -C /dev/md1 --assume-clean /dev/sd{a,b,c,d,e,f}1

Will the above destroy data by overwriting the on-disk v0.9 superblock with a larger v1 superblock?

--assume-clean is not documented in 'mdadm --create --help', by the way - what does it do?
Re: help wanted - 6-disk raid5 borked: _ _ U U U U
Molle Bestefich wrote:
> Neil Brown wrote:
> > > How do I force MD to raise the event counter on sdb1 and accept it into the array as-is, so I can avoid bad-block induced data corruption?
> > For that, you have to recreate the array.
> Scary. And hairy. How much do I have to bribe you to make this work:
> # mdadm --assemble /dev/md1 --force /dev/sd{a,b,c,d,e,f}1
> Not now, that is, but for the sake of future generations? :)

I crossed my fingers and dived into the code. The mdadm code was super well written - thanks, Neil :-) - so conjuring up a patch took less than 5 minutes. Patch attached - does it look acceptable for inclusion in mainline?

With the patch, I can do this:

# mdadm --assemble /dev/md1 --force /dev/sd{a,b,c,d,e,f}1
mdadm: forcing event count in /dev/sdb1(1) from 160264 upto 163368
mdadm: clearing FAULTY flag for device 1 in /dev/md1 for /dev/sdb1
mdadm: /dev/md1 has been started with 6 drives.

and have the array started with all 6 devices (as I've asked it to).

--- Assemble.c.ok	2006-04-18 02:47:11.0 +0200
+++ Assemble.c	2006-04-18 01:18:38.0 +0200
@@ -119,6 +119,7 @@
 	struct mdinfo info;
 	char *avail;
 	int nextspare = 0;
+	int devs_on_cmdline = devlist!=NULL;
 
 	vers = md_get_version(mdfd);
 	if (vers <= 0) {
@@ -407,9 +408,10 @@
 			sparecnt++;
 		}
 	}
-	while (force && !enough(info.array.level, info.array.raid_disks,
+	while (force && ((!enough(info.array.level, info.array.raid_disks,
 				info.array.layout,
-				avail, okcnt)) {
+				avail, okcnt)) ||
+			 (devs_on_cmdline && (num_devs > okcnt)))) {
 		/* Choose the newest best drive which is
 		 * not up-to-date, update the superblock
 		 * and add it.
help wanted - 6-disk raid5 borked: _ _ U U U U
A system with 6 disks; it was UUUUUU a moment ago, but after read errors on a file it now looks like this:

/proc/mdstat:
md1 : active raid5 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[6](F) sda1[7](F)
      level 5, 64k chunk, algorithm 2 [6/4] [__UUUU]

uname: linux 2.6.11-gentoo-r4

What's the recommended approach? Compile 2.6.16.5 and mdadm 2.4.1, install both, reboot, then use mdadm assemble --force (after the automatic mount probably fails)?
Re: help wanted - 6-disk raid5 borked: _ _ U U U U
Neil Brown wrote:
> You shouldn't need to upgrade kernel

Ok. I had a crazy idea that 2 devices down in a RAID5 was an MD bug. I didn't expect MD to kick that last disk - I would have thought that it would just pass on the read error in that situation. If you've got the time to explain, I'd like to be wiser - why not?

> But yes, use --assemble --force and be aware that there could be data corruption (without knowing the history it is hard to say how likely). An 'fsck' at least would be recommended.

Thanks a tera! (Probably no corruption, since I was strictly reading from the array when the disks were kicked.)
Re: help wanted - 6-disk raid5 borked: _ _ U U U U
Neil Brown wrote:
> It is arguable that for a read error on a degraded raid5, that may not be the best thing to do, but I'm not completely convinced. A read error will mean that a write to the same stripe will have to fail, so at the very least we would want to switch the array read-only.

That would be much nicer for me as a user, because:
* I would know which disks are the freshest (the ones marked U).
* My data wouldn't be abruptly pulled offline - right now I'm getting *weird* errors from the systems on top of the array.
* I wouldn't have to try and guess the correct 'mdadm' command to stop/start the array (including pointing at the right disks).

2 cents ;). Thanks for the explanation.
Re: help wanted - 6-disk raid5 borked: _ _ U U U U
Neil Brown wrote:
> use --assemble --force

# mdadm --assemble --force /dev/md1
mdadm: forcing event count in /dev/sda1(0) from 163362 upto 163368
mdadm: /dev/md1 has been started with 5 drives (out of 6).

Oops, only 5 drives - but I know the data is OK on all 6 drives. I also know that there are bad blocks on more than one drive, so I want MD to recover from the other drives in those cases, which I won't be able to do with only 5 drives. In other words, checking/repairing with only 5 drives will lead to data corruption.

I'll stop and try again, listing all 6 drives by hand:

# mdadm --stop /dev/md1
# mdadm --assemble /dev/md1 --force /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
mdadm: /dev/md1 has been started with 5 drives (out of 6).

Ugh. Didn't work. Bug?

How do I force MD to raise the event counter on sdb1 and accept it into the array as-is, so I can avoid bad-block induced data corruption?
Re: help wanted - 6-disk raid5 borked: _ _ U U U U
Neil Brown wrote:
> > How do I force MD to raise the event counter on sdb1 and accept it into the array as-is, so I can avoid bad-block induced data corruption?
> For that, you have to recreate the array.

Scary. And hairy. How much do I have to bribe you to make this work:

# mdadm --assemble /dev/md1 --force /dev/sd{a,b,c,d,e,f}1

Not now, that is, but for the sake of future generations? :)

> Make sure you get the chunksize, parity algorithm, and order correct, but something like
> mdadm -C /dev/md1 --assume-clean /dev/sda1 /dev/sdb1 /dev/sdc1 \
>       /dev/sdd1 /dev/sde1 /dev/sdf1

How do I back up the MD superblocks first, so I can 'dd'-restore them in case the above command totally wrecks the array because I got something wrong? It's an old array, so it has version 00.90.00 superblocks - does that make a difference?

> and then
> echo check > /sys/block/md1/md/sync_action
> and see what
> cat /sys/block/md1/md/mismatch_cnt
> reports at the end. Then maybe 'echo repair >'

Super, thanks as always for your kind help.

(After examining the syslog, I think that there might be a bug, or not - I'll start a new thread about that.)
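For the backup question, a sketch (untested; assumes 0.90 superblocks, which live in the last 64K-aligned 64K block of each member):

    for d in /dev/sd{a,b,c,d,e,f}1; do
        SECT=$(blockdev --getsz $d)              # member size in 512-byte sectors
        OFF=$(( (SECT / 128) * 128 - 128 ))      # 0.90 superblock offset, in sectors
        dd if=$d of=sb-$(basename $d).bak bs=512 skip=$OFF count=128
    done
    # restore one later (recompute the same OFF) with something like:
    #   dd if=sb-sda1.bak of=/dev/sda1 bs=512 seek=$OFF count=128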
[bug?] MD doesn't stop failed array
May I offer the point of view that this is a bug: MD apparently tries to keep a raid5 array up by using 4 out of 6 disks. Here's the event chain, from start to now: == 1.) Array assembled automatically with 6/6 devices. 2.) Read error, MD kicks sdb1. 3.) Read error, MD kicks sda1, doesn't seem to stop array. 4.) ext3 and loop0 devices run amok, probably writes crazy things to disk? Here's the syslog contents corresponding to the above: == Apr 13 16:50:38 linux kernel: md: adding sdf1 ... Apr 13 16:50:38 linux kernel: md: adding sde1 ... Apr 13 16:50:38 linux kernel: md: adding sdd1 ... Apr 13 16:50:38 linux kernel: md: adding sdc1 ... Apr 13 16:50:38 linux kernel: md: adding sdb1 ... Apr 13 16:50:38 linux kernel: md: adding sda1 ... Apr 13 16:50:38 linux kernel: md: created md1 Apr 13 16:50:38 linux kernel: md: bindsda1 Apr 13 16:50:38 linux kernel: md: bindsdb1 Apr 13 16:50:38 linux kernel: md: bindsdc1 Apr 13 16:50:38 linux kernel: md: bindsdd1 Apr 13 16:50:38 linux kernel: md: bindsde1 Apr 13 16:50:38 linux kernel: md: bindsdf1 Apr 13 16:50:38 linux kernel: md: running: sdf1sde1sdd1sdc1sdb1sda1 Apr 13 16:50:38 linux kernel: raid5: device sdf1 operational as raid disk 5 Apr 13 16:50:38 linux kernel: raid5: device sde1 operational as raid disk 4 Apr 13 16:50:38 linux kernel: raid5: device sdd1 operational as raid disk 3 Apr 13 16:50:38 linux kernel: raid5: device sdc1 operational as raid disk 2 Apr 13 16:50:38 linux kernel: raid5: device sdb1 operational as raid disk 1 Apr 13 16:50:38 linux kernel: raid5: device sda1 operational as raid disk 0 Apr 13 16:50:38 linux kernel: raid5: allocated 6290kB for md1 Apr 13 16:50:38 linux kernel: raid5: raid level 5 set md1 active with 6 out of 6 devices, algorithm 2 Apr 13 16:50:38 linux kernel: RAID5 conf printout: Apr 13 16:50:39 linux kernel: --- rd:6 wd:6 fd:0 Apr 13 16:50:39 linux kernel: disk 0, o:1, dev:sda1 Apr 13 16:50:39 linux kernel: disk 1, o:1, dev:sdb1 Apr 13 16:50:39 linux kernel: disk 2, o:1, dev:sdc1 Apr 13 16:50:39 linux kernel: disk 3, o:1, dev:sdd1 Apr 13 16:50:39 linux kernel: disk 4, o:1, dev:sde1 Apr 13 16:50:39 linux kernel: disk 5, o:1, dev:sdf1 [snip irrelevant] Apr 13 16:54:06 linux kernel: ata2: status=0x51 { DriveReady SeekComplete Error } Apr 13 16:54:06 linux kernel: ata2: error=0x04 { DriveStatusError } [11 repetitions of above 2 lines snipped] Apr 13 16:54:06 linux kernel: SCSI error : 1 0 0 0 return code = 0x802 Apr 13 16:54:06 linux kernel: sdb: Current: sense key: Aborted Command Apr 13 16:54:06 linux kernel: Additional sense: No additional sense information Apr 13 16:54:06 linux kernel: end_request: I/O error, dev sdb, sector 119 Apr 13 16:54:06 linux kernel: raid5: Disk failure on sdb1, disabling device. 
Operation continuing on 5 devices
Apr 13 16:54:06 linux kernel: RAID5 conf printout:
Apr 13 16:54:06 linux kernel: --- rd:6 wd:5 fd:1
Apr 13 16:54:06 linux kernel: disk 0, o:1, dev:sda1
Apr 13 16:54:06 linux kernel: disk 1, o:0, dev:sdb1
Apr 13 16:54:06 linux kernel: disk 2, o:1, dev:sdc1
Apr 13 16:54:06 linux kernel: disk 3, o:1, dev:sdd1
Apr 13 16:54:06 linux kernel: disk 4, o:1, dev:sde1
Apr 13 16:54:06 linux kernel: disk 5, o:1, dev:sdf1
Apr 13 16:54:06 linux kernel: RAID5 conf printout:
Apr 13 16:54:06 linux kernel: --- rd:6 wd:5 fd:1
Apr 13 16:54:06 linux kernel: disk 0, o:1, dev:sda1
Apr 13 16:54:06 linux kernel: disk 2, o:1, dev:sdc1
Apr 13 16:54:06 linux kernel: disk 3, o:1, dev:sdd1
Apr 13 16:54:07 linux kernel: disk 4, o:1, dev:sde1
Apr 13 16:54:07 linux kernel: disk 5, o:1, dev:sdf1
Apr 13 16:54:06 linux kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
Apr 13 16:54:06 linux kernel: ata2: error=0x04 { DriveStatusError }
[11 repetitions of above 2 lines snipped]
Apr 13 16:54:06 linux kernel: SCSI error : <1 0 0 0> return code = 0x802
Apr 13 16:54:06 linux kernel: sdb: Current: sense key: Aborted Command
Apr 13 16:54:06 linux kernel: Additional sense: No additional sense information
Apr 13 16:54:06 linux kernel: end_request: I/O error, dev sdb, sector 119
Apr 13 16:54:06 linux kernel: raid5: Disk failure on sdb1, disabling device. Operation continuing on 5 devices
Apr 13 16:54:06 linux kernel: RAID5 conf printout:
Apr 13 16:54:06 linux kernel: --- rd:6 wd:5 fd:1
Apr 13 16:54:06 linux kernel: disk 0, o:1, dev:sda1
Apr 13 16:54:06 linux kernel: disk 1, o:0, dev:sdb1
Apr 13 16:54:06 linux kernel: disk 2, o:1, dev:sdc1
Apr 13 16:54:06 linux kernel: disk 3, o:1, dev:sdd1
Apr 13 16:54:06 linux kernel: disk 4, o:1, dev:sde1
Apr 13 16:54:06 linux kernel: disk 5, o:1, dev:sdf1
Apr 13 16:54:06 linux kernel: RAID5 conf printout:
Apr 13 16:54:06 linux kernel: --- rd:6 wd:5 fd:1
Apr 13 16:54:06 linux kernel: disk 0, o:1, dev:sda1
Apr 13 16:54:06 linux kernel: disk 2, o:1, dev:sdc1
Apr 13 16:54:06 linux kernel: disk 3, o:1, dev:sdd1
Apr 13 16:54:07 linux kernel: disk 4, o:1,
Re: raid 5 corruption
Todd [EMAIL PROTECTED] wrote: The strangest thing happened the other day. I booted my machine and the permissions were all messed up. I couldn't access many files as root which were owned by root. I couldn't run common programs as root or a standard user. Odd, have you found out why? What was the first error you saw? So I restarted and it wouldn't mount my raid drive (raid 5, 5 disks). I tried doing it manually from the livecd, and it's telling me it can't mount with only 2 disks. Is that because the kernel found only 2/5 physical disks, or because MD thinks that they're out-of-date? I tried to force with four drives and it claims there's no superblock for sda3. Try mdadm --assemble --force again, but exclude sda3 and assemble the array using the 4 other drives instead? You might want to run mdadm to query the superblock on each device. You can post the output to this list so others will be able to see which of your drives are considered 'freshest' by MD etc. (A concrete example follows at the end of this message.) There's nothing wrong with my disks. I can mount the boot partition. One doesn't imply the other. And since you don't tell where the boot partition resides, it hardly seems relevant to your RAID devices.. It's fine as far as I can tell. Does anyone know what's going on? Has anyone else experienced this? I have had problems in the past with other machines. One time a Red Hat machine locked up in X. Yeah, I've had X lock up on me quite a lot. I don't know if it was just X or the kernel. Probably the graphics driver. I restarted and it couldn't find the root i-node. It may have been correctable, but I just reinstalled. It seems strange that Windows can crash on me every day and it still starts right back up. (I still have 98.) But Linux seems to have more fragile file systems. Windows' flushing policy is a LOT more sane than Linux's. That's probably why you'll rarely get corrupted filesystems with Windows, and often with Linux. Like you, I've had filesystem corruption after system crashes happen to me with Linux quite a lot, and never (even though it crashes much more often) with Windows. My guess is that the Linux kernel folks are more concerned with a .01% improvement in performance than with your data and that's why the policy is as it is.. But I could easily be wrong, so take it with a grain of salt. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
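Something along these lines would show the superblocks; the member names here are my guess, substitute your real ones:
  for d in /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3 /dev/sde3; do
      echo "== $d"
      mdadm --examine $d | egrep 'UUID|Events|Update Time|State'
  done
The drive(s) with the highest Events count are the ones MD considers freshest.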
Re: block level vs. file level
Bill Davidsen wrote: Molle Bestefich wrote: it wrote: Ouch. How does hardware raid deal with this? Does it? Hardware RAID controllers deal with this by rounding the size of participant devices down to the nearest GB, on the assumption that no drive manufacturers would have the guts to actually sell eg. a 250 GB drive with less than exactly 250.000.000.000 bytes of space on it. (It would be nice if the various flavors of Linux fdisk had an option to do this. It would be very nice if anaconda had an option to do this.) I guess if you care you specify the size of the partition instead of using it all. I use fdisk usually, cfdisk when installing, both let me set size, fdisk lets me set starting track and even play with the partition table's idea of geometry. What kind of an option did you have in mind? I don't know. Examples good enough? a.) Do not use space beyond highest GB b.) Do not use last cylinder Help texts could be: a.) Helps ensure that you can replace e.g. a 300GB drive from one manufacturer with a 300GB drive from another (sketch of the arithmetic below). b.) Leave an area that Windows might use for disk metadata alone. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
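For what it's worth, the arithmetic behind option (a) is easy to do by hand today - a sketch, device name made up:
  bytes=$(blockdev --getsize64 /dev/sdb)
  gb=$(( bytes / 1000000000 ))           # round down to whole decimal gigabytes
  sectors=$(( gb * 1000000000 / 512 ))
  echo "cap the RAID partition at $sectors sectors"
fdisk/cfdisk would then be given that size instead of 'use it all'.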
Re: block level vs. file level
it wrote: Ouch. How does hardware raid deal with this? Does it? Hardware RAID controllers deal with this by rounding the size of participant devices down to the nearest GB, on the assumption that no drive manufacturers would have the guts to actually sell eg. a 250 GB drive with less than exactly 250.000.000.000 bytes of space on it. (It would be nice if the various flavors of Linux fdisk had an option to do this. It would be very nice if anaconda had an option to do this.) - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.15: mdrun, udev -- who creates nodes?
[EMAIL PROTECTED] wrote: Not only that, the raid developers themselves consider autoassembly deprecated. http://article.gmane.org/gmane.linux.kernel/373620 Hmm. My knee-jerk, didn't-stop-to-think-about-it reaction is that this is one of the finest features of linux raid, so why remove it? I *think* that the raid developers may be, for once, choosing words not-so-wisely when talking about deprecating autoassembly. Last time I heard that I choked as well, only to find out later that Neil's notion of what auto-assembly is differed substantially from my own. Isn't there a faq/wiki somewhere where the official opinion on autoassembly deprecation and exactly what that means can go? - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Silent Corruption on RAID5
Michael Barnwell wrote: I'm experiencing silent data corruption on my RAID 5 set of four 400GB SATA disks. I have roughly the same hardware:
* AMD Opteron 250
* Silicon Image 3114
* 300 GB Maxtor SATA
Just to add a data point, I've run your test on my RAID 1 (not RAID 5 !) without problems.
localhost ~ # dd bs=1024 count=1k if=/dev/zero of=./10GB.tst
1024+0 records in
1024+0 records out
localhost ~ # od -t x1 ./10GB.tst
0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
1161
localhost ~ # uname -a
Linux localhost 2.6.12.6-xen #6 SMP Fri Jan 6 06:49:53 CET 2006 x86_64 AMD Opteron(tm) Processor 250 AuthenticAMD GNU/Linux
- To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 006 of 7] md: Checkpoint and allow restart of raid5 reshape
NeilBrown wrote: We allow the superblock to record an 'old' and a 'new' geometry, and a position where any conversion is up to. When starting an array we check for an incomplete reshape and restart the reshape process if needed. *Super* cool! - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: blog entry on RAID limitation
Rik Herrin wrote: Wouldn't connecting a UPS + using a stable kernel version remove 90% or so of the RAID-5 write hole problem? There are some RAID systems that you'd rather not have redundant power on. Think encryption. As long as a system is online, it's normal for it to have encryption keys in memory and its disk systems mounted through the decryption system. You wouldn't want someone to be able to steal your server along with the UPS and stuff it in a van with a power inverter :-). - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Linux RAID Enterprise-Level Capabilities and If It Supports Raid Level Migration and Online Capacity Expansion
Rik Herrin wrote: I was interested in Linux's RAID capabilities and read that mdadm was the tool of choice. We are currently comparing software RAID with hardware RAID. MD is far superior to most of the hardware RAID solutions I've touched. In short, it seems MD is developed with the goal of keeping your data safe, not selling hardware. I've had problems both with MD and with hardware RAID. With hardware RAID, once things go bad, they really go bad. With MD, there's usually a straightforward way to rescue things. And when there's not, Neil's a real nice guy who always stands up to help and fix bugs. I would trust my data with MD over any hardware RAID solution, including professional server RAID solutions from eg. Compaq or IBM. MD is a little more difficult to set up and also lacks in that it doesn't integrate with BIOS level stuff and boot loaders (maybe there's minimal MD RAID 1 support in Lilo, not sure). Depending on your choice of hardware, you might also get more features than MD can currently offer. 1) OCE: Online Capacity Expansion: From the latest version of mdadm (v2.2), it seems that there is support for it with the -G option. How well tested is this? New feature, so obviously not tested very well. Neil said at one point that he was going to release this to the general public when it's stable and when it can recover an interrupted resize process. Sounds like a very reasonable and sane goal to me, I hope that this is still the case. Otherwise, it's easy to work around - you can just create a new RAID array on your new disks / extra disk space and then join it to the end of the old array using MD's linear personality or DM. Never tried it, but it should work just fine (rough sketch at the end of this message). Also, in the Readme / Man page, it mentions: This usage causes mdadm to attempt to reconfigure a running array. This is only possible if the kernel being used supports a particular reconfiguration. How can I know if the kernel I am using supports this reconfiguration? What if I'm compiling the kernel by hand? What options would I have to enable? Just the usual MD stuff I think. You'll probably need a quite new kernel where Neil's bitmap patches have been applied. Hopefully MD will detect whether the kernel is new enough or not, but I haven't tried myself ;-). 2) RAID Level Migration: Does mdadm currently support this feature? I don't think so, but it sounds like RAID5 -> RAID6 is planned. Check back in a year or so ;-). Or choose the RAID level you *really* want to begin with (duh). Since you say 'we', I assume you're part of a very large corporation and thus intend to RAID a whole bunch of disks. Go with RAID6 + a couple of spares for that. If you intend to use really many disks, make multiple arrays. (Not sure whether you can share spares across arrays, but I think you can.) 3) Performance issues: I'm currently thinking of using either RAID 10 or LVM2 with RAID 5 to serve as a RAID server. The machine will be running either an AMD 64 processor or a dual-core AMD 64 processor, so I don't think the CPU will be a bottleneck. In fact, it should easily pass the speed of most hardware based RAID systems. I think there are two issues to cover: * Throughput * Seek times And of course they're not entirely separate issues - throughput will be lower when you're doing random access (seeking) and seek times will be higher when you're pulling lots of data out. I've seen lots of MD tests, but none that covered profiling MD's random access performance.
So I suppose that most hardware solutions will do a lot better than MD here since they have been profiled with this in mind. Throughput-wise, I think MD is probably very good. But I can't back that up with factual data, sorry. 4) Would anyone recommend a certain hotswap enclosure? I would, but can't remember their name, sorry :-) - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
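Since I mentioned the join-at-the-end workaround above, here's roughly what I had in mind - untested, device names made up (the linear personality doesn't need a superblock, so --build fits):
  mdadm --create /dev/md2 --level=5 --raid-devices=3 /dev/sdg1 /dev/sdh1 /dev/sdi1
  mdadm --build /dev/md3 --level=linear --raid-devices=2 /dev/md1 /dev/md2
  # then grow the filesystem that lived on /dev/md1 onto /dev/md3
Treat it as a starting point, not a recipe.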
Re: Linux RAID Enterprise-Level Capabilities and If It Supports Raid Level Migration and Online Capacity Expansion
Lajber Zoltan wrote: I have done some simple tests with bonnie++; the sw raid is superior to hw raid, except for big-name storage systems. http://zeus.gau.hu/~lajbi/diskbenchmarks.txt Cool. But what do gep, tip, diskvez, iras, olvasas and atlag mean? - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: mdadm 2.1: command line option parsing bug?
I found myself typing IMHO after writing up just about each comment. I've dropped that and you'll just have to know that all this is IMHO and not an attack on your ways if they happen to be different ^_^. Neil Brown wrote: I like the suggestion of adding one-line descriptions to this. How about: I'll first comment each command/description (general feedback later): Usage: mdadm --create device options... Create a new array from unused devices. mdadm --assemble device options... reassemble a previously created array I like 'em! mdadm --build device options... create or assemble an array without metadata Oh, that's what it's for, operating without metadata. Good to know. It would help a lot (for me) if the above also had a brief note about why I would ever want to use --build. Otherwise I'll just be confused about when I should use assemble and when I should use build and why it even exists. What's the logic behind having split create/assemble commands but a joined command for creating/assembling when there's no metadata? (I'm sure there is one, I'm just confused as always.) mdadm --manage device options... Make changes to an active array Nice. mdadm --misc options... devices report information or perform miscellaneous tasks. Get rid of the misc section or rename it to something meaningful.. mdadm --monitor options... monitor one or more arrays and report any changes in status Nice. mdadm device options... same as --manage Oh, so that's what it does. A bit confusing for a newbie like me that we're not sticking to _1_ syntax. Since the above is a side note about a convenient syntax hack, I think that (in the event that you find that it is not confusing [which I do :-] and decide to keep it), there should be at least a blank line between the very important --cmd descriptions and this rarely relevant note. General stuff: The note: * use --help in combination with --cmd for further help is missing. I think it would be good to retain it. I find it very unhelpful to have a --misc section. Every time I'm looking for some command, besides from having to guess in which section it's located, I have to check --misc also, since misc can cover anything. Yes, I can see how that is confusing. The difference between 'manage' and 'misc' is that manage will only apply to a single array, while misc can apply to multiple arrays. This is really a distinction that is mostly relevant in the implementation. I should try to hide it in the documentation. That would be good :-). Renaming --misc to something which relates to the fact that this is about multi md device commands and giving it an appropriate description line (I don't think your --misc description was any good, sorry, hehe) would also be *a lot* better.. And btw, why mention device and options for each and every section/command set/command when it's obvious that you need more parameters to do something fruitful? In my opinion it would be better to rid the general --help screen of them and instead specify in brief what functionality the sections are meant to cover. well. it is device options for all but misc, which has options ... devices... Yes, okay, I think that would be explained better by a description line for misc saying that it's about multiple devices (by the way, are we talking multiple MD or multiple component devices?) but I agree that it is probably unnecessary repetition. Also completely useless since there's no mention of what any of the options /are/, no? Just adds to confusion, so better to snip it. 
I keep typing mdadm --help --assemble which unfortunately gets me nowhere. A lot of the times it's because the general help text has scrolled out of view because I've just done an '--xxx --help' for some other command. Maybe it's just me, but a minor improvement could be to allow '--help --xxx'. That's a fair comment. I currently print the help message as soon as I see the --help option. But that could change to: if I see '--help', set a flag, then for every option, print the appropriate help. That would be really swell! Then mdadm --help --size could even give something useful... Yup :-). Hmm. Not bad at all. Would be good for the extreme newbie and confused ppl like me ;). ... snip ... For general help on options use mdadm --help-options ... snip ... Like misc, I think this is an odd section. Let's explore its contents: $ mdadm --help-options Any parameter that does not start with '-' is treated as a device name ... snip ... To me, the above is very interesting information about mdadm's syntax and thus should be presented at the start of the general --help screen. ... snip ... The first such name is often the name of an md device. Subsequent names are
Re: mdadm 2.1: command line option parsing bug?
mdadm's command line arguments seem arcane and cryptic and unintuitive. It's difficult to grasp what combinations will actually do something worthwhile and what combinations will just yield a 'you cannot do that' output. I find myself spending 20 minutes with mdadm --help and experimenting with different commands (which shouldn't be the case when doing RAID stuff) just to do simple things like create an array or make MD assemble the devices that compose an array. That's strange. They seem very regular and intuitive to me (but I find it very hard to be objective). Maybe you have a mental model of what the task involves that differs from how md actually does things. Can you say anything more about the sort of mistakes you find yourself making? That might help either improve the help pages or the error messages (revamping all the diagnostic messages to make them more helpful is slowly climbing to the top of my todo list). I can, but I suck at putting these things down in writing, which is why my initial description was intentionally vague and shallow :-). Now, I'll try anyway. It'll be imprecise and I'll miss some things and show up late with major points etc., but at least I will have tried =). Ok.. Starting with the usage output: $ mdadm Usage: mdadm --help for help Nothing wrong with that. Good, concise stuff. Great. The --help output, however...: $ mdadm --help Usage: mdadm --create device options... mdadm --assemble device options... mdadm --build device options... ... snip ... ... I find confusing. Try and read the words create, assemble and build as an outsider or a child would read them, as regular English words, not with MD development in mind. All three words mean just about the same thing in plain English (something like taking smaller parts and constructing something bigger with them or some such). That confuses me. To make things worse, I now have to type in 3 commands (--help --cmd) and compare the output of each just to get a grasp on what each of the 3 individual commands do. The process is laborious. Well, I'll live, but it's annoying to have to scroll up and down perhaps 5-6 screens of text whilst comparing options just to figure out which command I would like to use. I would much prefer if the 'areas of utility' that mdadm commands are divided in were self explanatory. I'm not saying that these 'areas of utility' (--create etc.) are not grouped in a logical fashion. Just that it's not always easy to comprehend what they cover. Perhaps it's enough just to add a small description per line of 'mdadm --help' output. After 'mdadm --create' it would e.g. say 'creates a new RAID array based on multiple devices.' or so. Just a one-liner. Moving right along: ... snip ... mdadm --manage device options... mdadm --misc options... devices mdadm --monitor options... ... snip ... I find it very unhelpful to have a --misc section. Every time I'm looking for some command, besides from having to guess in which section it's located, I have to check --misc also, since misc can cover anything. ... snip ... mdadm device options... ... snip ... Where does that syntax suddenly come from? Is it a special no-command mode? What does it do? Hmm. Confusing. And btw, why mention device and options for each and every section/command set/command when it's obvious that you need more parameters to do something fruitful? In my opinion it would be better to rid the general --help screen of them and instead specify in brief what functionality the sections are meant to cover. ... snip ...
mdadm is used for building, managing, and monitoring Linux md devices (aka RAID arrays) ... snip ... Probably belongs in the top of the output, but that's a small nit... ... snip ... For detailed help on the above major modes use --help after the mode e.g. mdadm --assemble --help ... snip ... I keep typing mdadm --help --assemble which unfortunately gets me nowhere. A lot of the times it's because the general help text has scrolled out of view because I've just done an '--xxx --help' for some other command. Maybe it's just me, but a minor improvement could be to allow '--help --xxx'. ... snip ... For general help on options use mdadm --help-options ... snip ... Like misc, I think this is an odd section. Let's explore its contents: $ mdadm --help-options Any parameter that does not start with '-' is treated as a device name ... snip ... To me, the above is very interesting information about mdadm's syntax and thus should be presented at the start of the general --help screen. ... snip ... The first such name is often the name of an md device. Subsequent names are often names of component devices. ... snip ... This too. What's with the 'often', btw? It is somewhat more helpful to me if
Re: mdadm 2.1: command line option parsing bug?
Neil Brown wrote: I would like it to take an argument in contexts where --bitmap was meaningful (Create, Assemble, Grow) and not where --brief is meaningful (Examine, Detail). but I don't know if getopt_long will allow the 'short_opt' string to be changed half way through processing... Here's an honest opinion from a regular user. mdadm's command line arguments seem arcane and cryptic and unintuitive. It's difficult to grasp what combinations will actually do something worthwhile and what combinations will just yield a 'you cannot do that' output. I find myself spending 20 minutes with mdadm --help and experimenting with different commands (which shouldn't be the case when doing RAID stuff) just to do simple things like create an array or make MD assemble the devices that compose an array. I know. Not very constructive, but a POV anyway. Maybe I just do not use MD enough, and so I shouldn't complain, because the interface is really not designed for the absolute newbie. If so, then I apologize. I don't have any constructive suggestions, except to say that the way the classic Cisco interface does things works very nicely. A lot of other manufacturers have also started doing things the Cisco way. If you don't have a Cisco router available, you can e.g. use a Windows XP box. Type 'netsh' in a command prompt, then 'help'. Or alternatively 'netsh help'. You get the idea :-). - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: /boot on RAID5 with GRUB
Spencer Tuttle wrote: Is it possible to have /boot on /dev/md_d0p1 in a RAID5 configuration and boot with GRUB? Only if you get yourself a PCI card with a RAID BIOS on it and attach the disks to that. The RAID BIOS hooks interrupt 13 and allows GRUB (or DOS or LILO for that matter) to see the RAID5 array instead of the individual disks. Obviously any RAID card will do, it doesn't matter whether it has a CPU to do the RAID calculations or not (known as fake-RAID and ataraid if it doesn't). - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: questions about ext3, raid-5, small files and wasted disk space
On Saturday November 12, Neil Brown wrote: On Saturday November 12, Kyle Wong wrote: I understand that if I store a 224KB file into the RAID5, the file will be divided into 7 parts x 32KB, plus 32KB parity. (Am I correct in this?) Sort of ... if the filesystem happens to lay it out like that. But this isn't a useful way to think about it. The filesystem writes the data in 4K blocks. The raid5 layer worries about how to create the parity block. Well, there IS some optimization to be done here that we're all missing out on, if the filesystem does not take this into account, isn't there? Is it reasonable to assume that Linux filesystems always start the 'data block area' (whatever) at exactly x * (fs block size) kB into the device they're laid on? Doesn't seem *entirely* unreasonable that they'd do that, if not for optimization then just because their authors happened to think that it would be neat code-wise. If the filesystem does that, then an optimization would be to just make sure that the filesystem block size exactly equals the RAID chunk size. Things become slightly harder if you start partitioning your RAID device. FDISK needs to make sure that partitions are on cylinder boundaries, but luckily FDISK is rarely used to partition MD RAID devices - LVM or EVMS is. Both of those systems are technically free to move the partition data area a few kB back or forth within the RAID device so that the partition is aligned on a RAID chunk. Wouldn't that be great? It would of course give you nothing, unless the filesystem also aligns its blocks (does it do that?). (Then there's the fakeraid / ataraid people. They're screwed, as far as optimizations in this area go. Maybe they can go and get someone to make a raid bios that understands MD metadata :-).) - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
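On the 'does the filesystem also align its blocks' front: ext2/3 at least can be told about the chunk size at mkfs time. A hedged example for 32 KiB chunks and 4 KiB blocks, so stride = 32/4 = 8 (older e2fsprogs spell the option -R stride=8):
  mke2fs -j -b 4096 -E stride=8 /dev/md1
That spreads the block and inode bitmaps across the member disks rather than piling them on one; it doesn't by itself fix the partition alignment issue above.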
Flappy hotswap disks?
If - a disk is part of a MD RAID 1 array, and - the disk is 'flapping', eg. going online and offline repeatedly in a hotswap system, and - a *write* occurs to the MD array at a time when the disk happens to be offline, will MD handle this correctly? Eg. will it increase the event counters on the other disks /even/ when no reboot or stop-start has been performed, so that when the flappy disk flaps back online, it will be (perhaps partially) resynced? Apologies if it's a dumb question. Hope someone's got a minute to answer it :-). - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
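In the meantime the scenario is easy to rehearse by hand on a throwaway array - a sketch with made-up loop devices standing in for the flappy disk:
  mdadm /dev/md0 --fail /dev/loop1           # disk 'flaps' offline
  mdadm --examine /dev/loop0 | grep Events   # event counters on the survivors move on
  mdadm /dev/md0 --remove /dev/loop1
  mdadm /dev/md0 --add /dev/loop1            # flaps back online, gets resynced
If the Events counts differ, MD knows the returning disk is stale.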
Re: Flappy hotswap disks?
Mario 'BitKoenig' Holbe wrote: Molle Bestefich wrote: Eg. will it increase the event counters on the other disks /even/ when no reboot or stop-start has been performed, so that when the flappy Event counters are increased immediately when an event occurs. A device failure is an event as well as start and stop of a RAID are. Ok. Thanks! Hope it's race-free and all :-). - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Drive fails raid6 array is not self rebuild .
Mr. James W. Laferriere wrote: Is there a documented procedure to follow during creation or after that will get a raid6 array to self rebuild ? MD will rebuild your array automatically, given that it has a spare disk to use. raid5: Disk failure on sde, disabling device. Operation continuing on 35 devices Seems like a raid5, not raid6.. [UU_U] No need to do any rebuilding on the remaining devices, since the data on them are fine. You've lost redundancy however, so you should add a new disk to the array ASAP. With 35 disks, I'd recommend that you at least use raid6 in place of raid5.. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
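For completeness, handing MD that spare is a one-liner (device name made up):
  mdadm /dev/md0 --add /dev/sdX
or reserve one at creation time with --spare-devices=1; either way the rebuild then starts by itself.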
Re: Accelerating Linux software raid
Dan Williams wrote: The first question is whether a solution along these lines would be valued by the community? The effort is non-trivial. I don't represent the community, but I think the idea is great. When will it be finished and where can I buy the hardware? :-) And if you don't mind terribly, could you also add hardware acceleration support to loop-aes now that you're at it? :-) - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] proactive raid5 disk replacement for 2.6.11
Pallai Roland wrote: Molle Bestefich wrote: Claas Hilbrecht wrote: Pallai Roland schrieb: this is a feature patch that implements 'proactive raid5 disk replacement' (http://www.arctic.org/~dean/raid-wishlist.html), After my experience with a broken raid5 (read the list) I think the partially failed disks feature you describe is really useful. I agree with you that this kind of error is rather common. Horrible idea. Once you have a bad block on one disk, you have definitively lost your data redundancy. That's bad. Hm, I think you don't understand the point, yes, that drive should be replaced as soon as you can, but the good sectors of that drive can be useful if some bad sectors are discovered on another drive during the rebuilding. We must keep that drive in sync to keep those sectors useful; this is what the badblock tolerance is for. Ok, I misunderstood you. Sorry, and thanks for the explanation. It is the common error if you've got a lot of disks and can't do daily media checks because of the IO load. Agreed. What should be done about bad blocks instead of your suggestion is to try and write the data back to the bad block before kicking the disk. If this succeeds, and the data can then be read from the failed block, the disk has automatically reassigned the sector to the spare sector area. You have redundancy again and the bad sector is fixed. If you're having a lot of problems with disks getting kicked because of bad blocks, then you need to diagnose some more to find out what the actual problem is. My best guess would be that either you're using an old version of MD that won't try to write to bad blocks, or the spare area on your disk is full, in which case it should be replaced. You can check the status of spare areas on disks with 'smartctl' or similar. Which version of md tries to rewrite bad blocks in raid5? Haven't followed the discussions closely, but I sure hope that the newest version does. (After all, spare areas are a somewhat old feature in hard drives..) I've a problem with hidden bad blocks (never mind if that's repairable or not); the rewrite can't help, because you don't know they're there until you try to rebuild the array from degraded state to a replaced disk. I want to avoid the rebuilding from degraded state; this is what the 'proactive replacement' feature is for. Got it now. Super. Sounds good ;-). (I hope that you're simply rebuilding to a spare before kicking the drive, not doing something funky like remapping sectors or some such..) - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
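Checking the spare-area status mentioned above is quick. A sketch - attribute names vary somewhat between vendors, and SATA disks behind libata may need -d ata on older smartctl versions:
  smartctl -A /dev/sdb | egrep -i 'Reallocated_Sector|Current_Pending'
A growing Reallocated_Sector_Ct means remapping is still working; a non-zero Current_Pending_Sector is the 'hidden bad blocks' case Pallai describes.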
Re: [PATCH] proactive raid5 disk replacement for 2.6.11
Claas Hilbrecht wrote: Pallai Roland schrieb: this is a feature patch that implements 'proactive raid5 disk replacement' (http://www.arctic.org/~dean/raid-wishlist.html), After my experience with a broken raid5 (read the list) I think the partially failed disks feature you describe is really useful. I agree with you that this kind of error is rather common. Horrible idea. Once you have a bad block on one disk, you have definitively lost your data redundancy. That's bad. What should be done about bad blocks instead of your suggestion is to try and write the data back to the bad block before kicking the disk. If this succeeds, and the data can then be read from the failed block, the disk has automatically reassigned the sector to the spare sector area. You have redundancy again and the bad sector is fixed. If you're having a lot of problems with disks getting kicked because of bad blocks, then you need to diagnose some more to find out what the actual problem is. My best guess would be that either you're using an old version of MD that won't try to write to bad blocks, or the spare area on your disk is full, in which case it should be replaced. You can check the status of spare areas on disks with 'smartctl' or similar. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Software RAID on Windows using Embedded Linux?
Ewan Grantham wrote: I know, this is borderline, but figure this is the group of folks who will know. I do a lot of audio and video stuff for myself and my family. I also have a rather unusual networking setup. Long story short, when I try to run Linux as my primary OS, I usually end up reinstalling Windows after a couple weeks because there are still holes in what I can do. That isn't the fault of Linux as much as folks who write device drivers or have video codecs that require DirectShow. However, the big thing I really miss from Linux that keeps me trying to find a way to convert is the support for Software RAID 5. It occurred to me yesterday that perhaps the trick would be to use QEMU to run Knoppix or Damn Small Linux under Windows, and then set up a RAID 5 array under one of those. Not to mention then having access to Linux for some other fun stuff. I'm not sure if that's even possible, and if it is, how much trouble I would have moving files around to and from the RAID array if it's set up that way. So I'm wondering if anyone on the list has ever tried this? Not too much trouble, you should be able to just set up Samba in the Knoppix/DSL and access your files through QEMU's network emulation. It will be damn slow, however :-). Perhaps buy a NAS device like Linksys NSLU2? It's cheap and has low power consumption - but you'll need to hack it to add disks (via USB). http://peter.korsgaard.com/articles/debian-nslu2.php or http://www.tomsnetworking.com/Sections-article85-page3.php explains how to get into the Linux guts of the box. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Software RAID on Windows using Embedded Linux?
On 7/24/05, Ewan Grantham [EMAIL PROTECTED] wrote: On 7/24/05, Molle Bestefich [EMAIL PROTECTED] wrote: Ewan Grantham wrote: I know, this is borderline, but figure this is the group of folks who will know. I do a lot of audio and video stuff for myself and my family. I also have a rather unusual networking setup. Long story short, when I try to run Linux as my primary OS, I usually end up reinstalling Windows after a couple weeks because there are still holes in what I can do. That isn't the fault of Linux as much as folks who write device drivers or have video codecs that require DirectShow. However, the big thing I really miss from Linux that keeps me trying to find a way to convert is the support for Software RAID 5. It occurred to me yesterday that perhaps the trick would be to use QEMU to run Knoppix or Damn Small Linux under Windows, and then set up a RAID 5 array under one of those. Not to mention then having access to Linux for some other fun stuff. I'm not sure if that's even possible, and if it is, how much trouble I would have moving files around to and from the RAID array if it's set up that way. So I'm wondering if anyone on the list has ever tried this? Not too much trouble, you should be able to just set up Samba in the Knoppix/DSL and access your files through QEMU's network emulation. It will be damn slow, however :-). Perhaps buy a NAS device like Linksys NSLU2? It's cheap and has low power consumption - but you'll need to hack it to add disks (via USB). http://peter.korsgaard.com/articles/debian-nslu2.php or http://www.tomsnetworking.com/Sections-article85-page3.php explains how to get into the Linux guts of the box. Interesting idea. However, I note that it is rather slow as well - since you're then accessing disks over 100 Mbps rather than at full USB2. And it appears you can only attach 2 USB disks to it. You should be able to hack it to accept more than 2 disks.. You're right about the 100 mbps network interface.. Which makes me wonder just how slow you mean when you say that the QEMU version would be slow? Slow compared to native access? Slow compared to the NSLU2 solution discussed above? Or slow compared to a snail being attacked by one of those Hawaiian Caterpillars? :-) Slow as in a passing tortoise would take that system by surprise =).. Based on QEMU being a CPU emulator, not a virtualization system, thus penalizing performance by a factor of 10 or so. Actually, just checked their web site, and they now have a non-open-source alternative that uses virtualization, so if you're going to use that it just might work ok. If you're willing to live with the fact that MD will at times suck up some CPU. Looking forward to hearing whether you can live with the solution or not (if you go that way) :-). - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID1 assembly requires manual mdadm --run
Neil Brown wrote: On Friday July 8, [EMAIL PROTECTED] wrote: So a clean RAID1 with a disk missing should start without --run, just like a clean RAID5 with a disk missing? Note that with /dev/loop3 not functioning, mdadm --assemble --scan will still work. Super! That was exactly the point of the test. You're correct, when I try again with a DEVICE line and --assemble --scan, everything works perfectly. (Not sure why Mitchell Laks saw something else happen.) Thanks a lot for taking the time to answer questions, the information is much appreciated! - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Degraded raid5 returns mdadm: /dev/hdc5 has no superblock - assembly aborted
On Friday July 8, [EMAIL PROTECTED] wrote: On 8 Jul 2005, Molle Bestefich wrote: On 8 Jul 2005, Melinda Taylor wrote: We have a computer based at the South Pole which has a degraded raid 5 array across 4 disks. One of the 4 HDD's mechanically failed but we have brought the majority of the system back online except for the raid5 array. I am pretty sure that data on the remaining 3 partitions that made up the raid5 array is intact - just confused. The reason I know this is that just before we took the system down, the raid5 array (mounted as /home) was still readable and writable even though /proc/mdstat said: On 7/8/05, Daniel Pittman wrote: What you want to do is start the array as degraded, using *only* the devices that were part of the disk set. Substitute 'missing' for the last device if needed but, IIRC, you should be able to say just: ] mdadm --assemble --force /dev/md2 /dev/hd[abd]5 Don't forget to fsck the filesystem thoroughly at this point. :) At this point, before adding the new disk, I'd suggest making *very* sure that the event counters match on the three existing disks. Because if they don't, MD will add the new disk with an event counter matching the freshest disk in the array. That will cause it to start synchronizing onto one of the good disks instead of onto the newly added disk. Happened to me once, gah. Ack! I didn't know that. If the event counters don't match up, what can you do to correct the problem? Daniel Pittman wrote: Ack! I didn't know that. If the event counters don't match up, what can you do to correct the problem? In the 2.4 days, I think I used to plug cables in and out of the disks, rebooting the system again and again until the counters were aligned. Neil Brown wrote: The --assemble --force should result in all the event counters of the named drives being the same. Then it should be perfectly safe to add the new drive. Sounds like a better option! I cannot quite imagine a situation as described by Molle. Fair enough, the situation just struck me as something I had seen before, and it doesn't hurt to be sure.. If it was at all reproducible I'd love to hear more details. I'd rather not reproduce it :-). It's happened a couple of times on a production system.. Once back when it was running 2.4 and an old version of MD, and once while I was in the process of upgrading the box to 2.6 (so it might have been while it was booted into 2.4.. not sure). The box used to have two disks failing from time to time, one due to a semi-bad disk and one due to a flaky SATA cable. That's about all I can remember off the top of my head. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
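For the record, eyeballing the counters before the --add is just (0.90 metadata assumed):
  mdadm --examine /dev/hd[abd]5 | grep Events
Three identical Events values is what you want to see.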
RAID1 assembly requires manual mdadm --run
Mitchell Laks wrote: However I think that raids should boot as long as they are intact, as a matter of policy. Otherwise we lose our ability to rely upon them for remote servers... It does seem wrong that a RAID 5 starts OK with a disk missing, but a RAID 1 fails. Perhaps MD is unable to tell which disk in the RAID 1 is the freshest and therefore refuses to assemble any RAID 1's with disks missing? - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: MD bug or me being stupid?
Hmm, I think the information in /var/log/messages is actually interesting for MD debugging. Seems there was a bad sector somewhere in the middle of all this, which might have triggered something? Attached (gzipped - sorry for the inconvenience, but it's 5 kB vs. 250 kB!) I've cut out a lot of irrelevant cruft (kernel messages) and added comments about what I did when. linux-22apr-messages.gz Description: GNU Zip compressed data
Re: Bug in MDADM or just crappy computer?
Phantazm wrote: And the kernel log is filled up with this.
Feb 20 08:43:13 [kernel] md: md0: sync done.
Feb 20 08:43:13 [kernel] md: syncing RAID array md0
Feb 20 08:43:13 [kernel] md: minimum _guaranteed_ reconstruction speed: 5000 KB/sec/disc.
Feb 20 08:43:13 [kernel] md: md0: sync done.
Feb 20 08:43:13 [kernel] .6md: syncing RAID array md0
Feb 20 08:43:13 [kernel] md: minimum _guaranteed_ reconstruction speed: 5000 KB/sec/disc.
No sync done here? What is it doing, multiple syncs in parallel?
Feb 20 08:43:13 [kernel] .6md: syncing RAID array md0
Feb 20 08:43:13 [kernel] md: minimum _guaranteed_ reconstruction speed: 5000 KB/sec/disc.
Again (below line)?
Feb 20 08:43:13 [kernel] md: syncing RAID array md0
Feb 20 08:43:13 [kernel] md: minimum _guaranteed_ reconstruction speed: 5000 KB/sec/disc.
Feb 20 08:43:13 [kernel] md: using maximum available idle IO bandwith (but not more than 15000 KB/sec) for reconstruction.
Feb 20 08:43:13 [kernel] md: using 128k window, over a total of 199141632 blocks.
Feb 20 08:43:13 [kernel] md: md0: sync done.
There we go, a sync done (above)
Feb 20 08:43:13 [kernel] .6md: syncing RAID array md0
Feb 20 08:43:13 [kernel] md: minimum _guaranteed_ reconstruction speed: 5000 KB/sec/disc.
But then again (below)..
Feb 20 08:43:13 [kernel] .6md: syncing RAID array md0
Feb 20 08:43:13 [kernel] md: minimum _guaranteed_ reconstruction speed: 5000 KB/sec/disc.
And again...
Feb 20 08:43:13 [kernel] .6md: syncing RAID array md0
Feb 20 08:43:13 [kernel] md: minimum _guaranteed_ reconstruction speed: 5000 KB/sec/disc.
And again...
Feb 20 08:43:13 [kernel] .6md: syncing RAID array md0
Feb 20 08:43:13 [kernel] md: minimum _guaranteed_ reconstruction speed: 5000 KB/sec/disc.
And again?!
Feb 20 08:43:13 [kernel] .6md: syncing RAID array md0
Feb 20 08:43:13 [kernel] md: minimum _guaranteed_ reconstruction speed: 5000 KB/sec/disc.
And again...
Feb 20 08:43:13 [kernel] .6md: syncing RAID array md0
Feb 20 08:43:13 [kernel] md: minimum _guaranteed_ reconstruction speed: 5000 KB/sec/disc.
And again...
Feb 20 08:43:13 [kernel] md: syncing RAID array md0
[snip]
Odd! Ain't got a single clue what it could be. I'm seeing something that looks like the same, so let me know if you found out what happened. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Questions about software RAID
David Greaves wrote: Guy wrote: Well, I agree with KISS, but from the operator's point of view! I want... [snip] Fair enough. [snip] should the LED control code be built into mdadm? Obviously not. But currently, a LED control app would have to pull information from /proc/mdstat, right? mdstat is a crappy place to derive any state from. It currently seems to have a dual purpose: - being a simple textual representation of RAID state for the user. - providing MD state information for userspace apps. That's not good. There seems to be an obvious lack of a properly thought out interface to notify userspace applications of MD events (disk failed -- go light a LED, etc). Please correct me if I'm on the wrong track, in which case the rest of this posting will be bogus. Maybe there are IOCTLs or such that I'm not aware of. I'm not sure how a proper interface could be done (so I'm basically just blabbering). ACPI has some sort of event system, but the MD one would need to be more flexible. For instance userspace apps have to pick up on MD events such as disk failures, even if the userspace app happens to not be running at the exact moment that the event occurs (due to system restart, daemon restart or what not). So the system that ACPI uses is probably unsuited. Perhaps a simple logfile would do. Its focus should be machine-readability (vs. human readability for mdstat). A userspace app could follow MD's state from the beginning (bootup, no devices discovered, logfile cleared), through device discovery and RAID assembly and to failing devices. By adding up the information in all the log lines, a userspace app could derive the current state of MD (which disks are dead..). Just a thought. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Questions about software RAID
Hervé Eychenne wrote: Molle Bestefich wrote: There seems to be an obvious lack of a properly thought out interface to notify userspace applications of MD events (disk failed -- go light a LED, etc). I'm not sure how a proper interface could be done (so I'm basically just blabbering). ACPI has some sort of event system, but the MD one would need to be more flexible. For instance userspace apps have to pick up on MD events such as disk failures, even if the userspace app happens to not be running at the exact moment that the event occurs (due to system restart, daemon restart or what not). So the system that ACPI uses is probably unsuited. Perhaps a simple logfile would do. Its focus should be machine-readability (vs. human readability for mdstat). A userspace app could follow MD's state from the beginning (bootup, no devices discovered, logfile cleared), through device discovery and RAID assembly and to failing devices. By adding up the information in all the log lines, a userspace app could derive the current state of MD (which disks are dead..). No, as it requires active polling. No it doesn't. Just tail -f the logfile (or /proc/ or /sys/ file), and your app will receive due notice exactly when something happens. Or use inotify. I think something like a netlink device would be more accurate, but I'm not a kernel guru. No idea how that works :-). If by accurate you mean you'll get a faster reaction, that's wrong as per above explanation. And I'll try to explain why a logfile in other respects is actually _more_ accurate. I can see why a logfile _seems_ wrong at first sight. But the idea that it allows you to (*also*!) see historic MD events instead of just the current status this instant seems compelling. - You can be sure that you haven't missed or lost any MD events. If your monitoring app crashes or restarts, just look in the log. (If you're unsure whether you've notified the admin on some event or not; I'm sure MD could log the disk's event counters. The monitoring app could keep its own 'how far have I gotten' event counter [on disk], so the app knows its own status.) - If the log resides in eg. /proc/whatever, you can pipe it to an actual file. It could be pretty useful for debugging MD (attach your MD log, send a mail asking what happened, and it'll be clear to the super-md-dude at first sight). - Seems more convincing to enterprise customers that you can actually see MD's every move in the log. Makes it seem much more robust and reliable. - Really useful for debugging the monitoring app - Probably other advantages. Haven't really thought it through that well :-). The problem, as I see it, is if it's worth the implementation trouble (is it any harder than to implement a netlink / what not interface? No idea!) - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
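To illustrate what I mean by a consumer of such a log - pure sketch, the path is invented since no such file exists today:
  tail -F /var/log/md-events | while read line; do
      case "$line" in
          *fail*) logger -t md-watch "$line" ;;   # light a LED, page the admin, ...
      esac
  done
If the watcher dies and restarts, the history is still in the file - which is the whole point.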
Re: waiting for recovery to complete
David Greaves wrote: Does everyone really type cat /proc/mdstat from time to time?? How clumsy... And yes, I do :) You're not alone.. *gah...* - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
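For the equally lazy, this at least saves the retyping:
  watch -n 10 cat /proc/mdstat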
Re: interesting failure scenario
Michael Tokarev wrote: I just came across an interesting situation, here's the scenario. [snip] Now we have an interesting situation. Both superblocks in d1 and d2 are identical, event counts are the same, both are clean. Things which are different: utime - on d1 it is more recent (provided we haven't touched the system clock of course); on d1, d2 is marked as faulty; on d2, d1 is marked as faulty. Neither of the conditions is checked by mdadm. So, mdadm just starts a clean RAID1 array composed of two drives with different data on them. And no one notices this fact (fsck which is reading from one disk goes ok), until some time later when some app reports data corruption (reading from another disk); you go check what's going on, notice there's no data corruption (reading from 1st disk), suspect memory and.. it's quite a long list of possible bad stuff which can go on here... ;) The above scenario is just a theory, but a theory with some quite non-null probability. Instead of hotplugging the disks, one can do a reboot having flaky ide/scsi cables or whatnot, so that disks will be detected on/off randomly... Probably it is a good idea to test utime too, in addition to event counters, in mdadm's Assemble.c (as the comments say but the code disagrees). Humn, please don't. I rely on MD assembling arrays if their event counters match but the utimes don't all the time. Happens quite often that a controller fails or something like that and you accidentally lose 2 disks in a raid5. I still want to be able to force the array to be assembled in these cases. I'm still on 2.4 btw, don't know if there's a better way to do it in 2.6 than manipulating the event counters. (Thinking about it, it would be perfect if the array would instantly go into read-only mode whenever it is degraded to a non-redundant state. That way there's a higher chance of assembling a working array afterwards?) - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
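The instant-read-only idea can almost be lashed together from userspace today. A sketch - the handler path is made up, and note that mdadm -o will be refused while a filesystem on the array is mounted read-write:
  mdadm --monitor --scan --program=/usr/local/sbin/md-degraded &
with /usr/local/sbin/md-degraded being roughly:
  #!/bin/sh
  # args from mdadm --monitor: event, md device, [component device]
  [ "$1" = "Fail" ] && mdadm --readonly "$2"
In-kernel support would obviously be cleaner (and race-free).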
Re: AW: RAID1 and data safety?
Does this sound reasonable? Does to me. Great example! Thanks for painting the pretty picture :-). Seeing as you're clearly the superior thinker, I'll address your brain instead of wasting wattage on my own. Let's say that MD had the feature to read from both disks in a mirror and perform a comparison on read. Let's say that I had that feature turned on for 2 mirror arrays (4 disks). I want to get a bit of performance back though, so I stripe the two mirrored arrays. Do you see any problem in this scenario? Are we back to corruption could happen then or are we still OK? - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
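For concreteness, the layout I mean, with made-up device names:
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1
  mdadm --create /dev/md2 --level=0 --raid-devices=2 /dev/md0 /dev/md1
The hypothetical compare-on-read would run inside md0 and md1, with md2 just striping across them.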
Re: [PATCH md ] md: allow degraded raid1 array to resync after an unclean shutdown.
The following is (I think) appropriate for 2.4.30. The bug it fixes can result in data corruption in a fairly unusual circumstance (having a 3 drive raid1 array running in degraded mode, and suffering a system crash). What's unusual? Having a 3 drive raid1 array? It's not unusual for a system to crash after a RAID array gets sent to degraded mode. Happens a lot on a system I administer. Probably caused by a linux-si3112-ide bug which first results in read errors, then (after md has been told to resync) results in a complete system crash... Another topic: Just noticed MD usage in a screenshot: http://linuxdevices.com/files/misc/ravehd_screenshot.png From this article: http://linuxdevices.com/news/NS8217660071.html Just in case anybody cares :-). - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID1 and data safety?
Neil Brown wrote: Is there any way to tell MD to do verify-on-write and read-from-all-disks on a RAID1 array? No. I would have thought that modern disk drives did some sort of verify-on-write, else how would they detect write errors, and they are certainly in the best place to do verify-on-write. Really? My guess was that they wouldn't, because it would lead to lower performance. And that's why read errors crop up at read time. Doing it at the md level would be problematic as you would have to ensure that you really were reading from the media and not from some cache somewhere in the data path. I doubt it would be a mechanism that would actually increase confidence in the safety of the data. Hmm. Could hack it by reading / writing blocks larger than the cache. Ugly. Imagine a filesystem that could access multiple devices, and where, when it kept index information, it didn't just keep one block address, but rather kept two block addresses, each on different devices, and a strong checksum of the data block. This would allow much the same robustness as read-from-all-drives and much lower overhead. As in, if the checksum fails, try loading the data blocks [again] from the other device? Not sure why a checksum of X data blocks should be cheaper performance-wise than a comparison between X data blocks, but I can see the point in that you only have to load the data once and check the checksum. Not quite the same security, but almost. In summary: - you cannot do it now. - I don't think md is at the right level to solve these sorts of problems. I think a filesystem could do it much better. (I'm working on a filesystem slowly...) - read-from-all-disks might get implemented one day. verify-on-write is much less likely. Apologies if the answer is in the docs. It isn't. But it is in the list archives now Thanks! :-) (Guess I'll drop the idea for the time being...) - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RAID1 and data safety?
Just wondering: is there any way to tell MD to do verify-on-write and read-from-all-disks on a RAID1 array?

I was thinking of setting up a couple of RAID1s with maximum data safety in mind. I'd like a verify after each write to a disk, plus I'd like reads to go to all disks with a data comparison whenever something is read. I'd then run a RAID0 over the RAID1 arrays to regain some of the speed lost to all the extra checking.

Just wondering if it can be done :-). Apologies if the answer is in the docs.
Re: Spare disk could not sleep / standby
Tobias wrote:
> [...] I just found your mail on this list, where I have been lurking for some weeks now to get acquainted with RAID, but I fear my mail would be almost OT there:

Think so? It's about RAID on Linux, isn't it? I'm gonna CC the list anyway, hope that's okay :-).

I was just curious about the workings of MD in 2.6, since it sounded a bit like it wasn't possible to put a RAID array to sleep. I'm about to upgrade a server to 2.6 which needs to spin down when idle.

> Which is exactly what I am planning to do at my home - currently, I have [...] Thus my question: Would you have a link to info on the net concerning safely powering down an unused/idle RAID?

No, but I can tell you what I did. I stuffed a bunch of cheap SATA disks and crappy controllers into an old system (and replaced the power supply with one that has enough power on the 12V rail). It's running 2.4, and since the disks show up as IDE disks, I just call 'hdparm -Swhatever' in rc.local, which tells the disks to go into standby whenever they've been idle for 10 minutes. Works like a charm so far; it's been running for a couple of years.

There do not seem to be any issues with MD and timing caused by the disks taking 5 seconds or so to spin up. MD happily waits for them, and no corruption or wrong behaviour has stemmed from putting the disks in standby.

There have been a couple of annoyances, though. One is that MD reads from the disks sequentially, thus spinning up the disks one by one; the more disks you have, the longer you wait for the entire array to come up :-/. It would have been beautiful if MD issued those requests in parallel.

Another is that you need to keep your root partition outside the array. The reason is that some fancy feature in your favorite distro is guaranteed to periodically write something to disk, which will make the array spin up constantly. (Incidentally, this also makes using Linux as a desktop system a PITA, since the disks are noisy as hell if you leave them spinning.) I'm currently using two old disks in RAID1 for the root filesystem, but I'm thinking there's probably a better solution. Perhaps the root filesystem can be shifted to a ramdisk during startup. Or you could boot from a custom-made CD, which would also be extremely handy as a rescue disk.
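For reference, the rc.local bit is just a couple of lines along these lines, with '-S120' being one concrete choice for the 'whatever' above (standby timeout values 1-240 mean that many 5-second units, so 120 gives the 10 minutes I mentioned; device names are examples):

    # Spin each array member down after 10 minutes of idle time:
    hdparm -S120 /dev/hda
    hdparm -S120 /dev/hdb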
Re: strange drive behaviour.
Max Waterman wrote:
> Can I just make it a slave device? How will that affect performance?

AFAIR (CMIIW):
- The standard does not allow a slave without a master.
- The master has a role to play in that it coordinates something (commands, perhaps?) between the slave drive and the controller.

But on the other hand, I've seen ATAPI cdrom drives working in slave-only configurations for years. Hm. It shouldn't cause performance degradation, but it's a quirky setup which you should probably trust a bit less than a master-only setup.

If it's not the CABLE SELECT thing, it could be that the firmware on the misbehaving drive is different from the firmware on the other drives. Check versions?
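Checking the firmware revisions is easy enough with hdparm, something like this (device names are examples):

    # The FwRev field in the identify info is the drive's firmware revision:
    hdparm -i /dev/hda | grep -i fwrev
    hdparm -i /dev/hdb | grep -i fwrev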
Re: kernel panic??
John McMonagle wrote:
> All panics seem to be associated with accessing a bad spot on sdb. It seems really strange that one can get a panic from a drive problem.

<sarcasm>Wow, yeah, never seen that happen with Linux before!</sarcasm>

Just for the fun of it, try digging up a disk which has a bad spot somewhere, preferably on track 0; otherwise point an extended partition table at the bad spot. Then try to get your Linux box to even boot with the disk in the system :-D.

God, Linux is so stable.. Bwahaha...
Re: kernel panic??
Molle Bestefich wrote:
> <sarcasm>Wow, yeah, never seen that happen with Linux before!</sarcasm>

Wait a minute, that wasn't a very productive comment. Never mind; I'm probably just ridden with faulty hardware.
Re: Spare disk could not sleep / standby
Neil Brown wrote:
> It is writes, but don't be scared. It is just superblock updates. In 2.6, the superblock is marked 'clean' whenever there is a period of about 20ms of no write activity. This increases the chance that a resync won't be needed after a crash. Unfortunately, the superblocks on the spares need to be updated too.

Ack. One of the cool things a Linux md array can do that others can't is, imho, that the disks can spin down when inactive. Granted, it's mostly for home users who want their desktop RAID to be quiet when it's not in use, and their basement multi-terabyte facility to use a minimum of power when idling, but anyway:

Is there any particular reason to keep updating the superblocks every 20 msecs when they're already marked clean?
Re: Spare disk could not sleep / standby
Neil Brown wrote:
> > Is my perception of the situation correct?
>
> No. Writing the superblock does not cause the array to be marked active. If the array is idle, the individual drives will be idle.

Ok, thank you for the clarification. It seems like a design flaw to me, but then again, I'm biased towards hating this behaviour since I really like being able to put inactive RAIDs to sleep..

> Hmmm... maybe I misunderstood your problem. I thought you were just talking about a spare not being idle when you thought it should be. Are you saying that your whole array is idle, but you are still seeing writes? That would have to be something non-md-specific, I think.

No, the confusion is my bad. That was the original problem posted by Peter Evertz, which you provided a workaround for. _I_ was just curious about the workings of MD in 2.6, since it sounded a bit like it wasn't possible to put a RAID array to sleep. I'm about to upgrade a server to 2.6 which needs to spin down when idle. Got a bit worried for a moment there =).

Thanks again.
Re: Joys of spare disks!
Guy [EMAIL PROTECTED] wrote:

I generally agree with you, so I'm just gonna cite / reply to the points where we don't :-).

> This sounded like Neil's current plan. But if I understand the plan, the drive would be kicked out of the array.

Yeah, sounds bad. Although the array should at least be marked degraded in mdstat, since there's basically no redundancy until the failed blocks have been reassigned somehow.

> And 1000 bad blocks! I have never had 2 on the same disk at the same time. AFAIK. I would agree that 1000 would put a strain on the system!

Well, it happened to me on a Windows system, so I don't think it's far-fetched. This was a desktop system with the case open, so it got bounced about a lot. Every time the disk reached one of the faulty areas, it recalibrated the head and then moved it back out to try the read again, retrying the operation 5 times before giving up. While this was going on, Windows was frozen. It took at least 3 seconds each time I hit a bad area, and I think even more.

If MD could keep reading from a disk while a similar scenario unfolded, and just mark the bad blocks for rewriting in some bad-block-rewrite bitmap or whatever, a system hang could be avoided. Trying to rewrite every failed sector sequentially in the same code path that reads the data would incur a system hang. That's what I tried to say originally, though I probably didn't do a good job (I know little of Linux md; guess it shows =)). Of course, the disks would, in the case of IDE, probably have to _not_ be in master/slave configurations, since the disk with failing blocks could perhaps hog the bus. Then again, I know as little about ATA/IDE as I do about Linux MD, so I'm basically just guessing here ;-).

> Sometime in the past I have said there should be a threshold on the number of bad blocks allowed. Once the threshold is reached, the disk should be assumed bad, or at least failing, and should be replaced.

Hm. Why? If a rewrite of the block succeeds and a subsequent read returns the correct data, the block has been fixed. I can see your point for old disks, where it might be a magnetic problem causing the sector to fail, but on a modern disk the sector has probably been relocated to the spare area. I think the disk should simply be failed when a rewrite-and-verify cycle still fails. The threshold suggestion adds complexity and user configurability (error-prone) in an area where it's not really needed, doesn't it?

Another note: I'd like to see MD support a user-specifiable bad block relocation area, just like modern disks have. It could use this when a disk's spare area fills up. I even thought up a use case at one time that wasn't insane, something like "my disk is really beginning to show a lot of failures now, but I think I'll keep it running a bit longer", but I can't quite remember what it was.

Does anyone know how many spare blocks are on a disk? It probably varies? I.e. crappy disks probably have a much too small area ;-). In that case it would be very cute if MD had an option to specify its own relocation area (and perhaps even a recommendation for the user on how to size it for specific hard disks). But OTOH, it sucks to implement features in MD that would be much easier to solve in the disks by just expanding the spare area (when present).

> My worst disk has 28 relocated bad blocks.

Doesn't sound bad. Isn't there a SMART value that shows how much of the spare area is used (0-255)?
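There is, more or less: smartmontools can dump the attributes, e.g. like this (device name is an example, and how trustworthy the numbers are depends on the drive):

    # Attribute 5 (Reallocated_Sector_Ct) is the number of remapped sectors;
    # the normalized value counts down towards the failure threshold as the
    # spare area gets consumed:
    smartctl -A /dev/hda | grep -i reallocated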
Re: Joys of spare disks!
Robin Bowes wrote:
> I envisage something like:
> - md attempts a read
> - one disk/partition fails with a bad block
> - md re-calculates the correct data from the other disks
> - md writes the correct data to the bad disk - the disk will re-locate the bad block

Probably not quite that simple, since sometimes multiple blocks will go bad at once, and you wouldn't want the entire system to come to a screeching halt whenever that happens. A more consistent and risk-free way of doing it would probably be to run the above partial resync in a background thread or so?..
Re: 2TB ?
No email [EMAIL PROTECTED] wrote:
> Forgive me, as this is probably a silly question and one that has been answered many times. I have tried to search for the answers but have ended up more confused than when I started, so I thought maybe I could ask the community to put me out of my misery. Is there a version of MD that can create larger than 2TB raid sets?

I have a couple of terabyte software RAID 1+0 arrays under Linux. No size problems with MD as of yet.

But the filesystem is a different affair, and I think this is where you should watch out. Linux filesystems seem to stink real bad when they span multiple terabytes, at least in my personal experience. I've tried both ext3 and reiserfs; even simple operations such as deleting files suddenly take on the order of 10-20 minutes. I haven't got a ready explanation for why ext3 and reiser can't handle TB sizes, but I'd definitely advise you against a multi-TB setup using Linux (at least until you find someone who has a working setup..)
XFS or JFS? (Was: 2TB ?)
Carlos Knowlton wrote:
> > Linux filesystems seem to stink real bad when they span multiple terabytes, at least in my personal experience. I've tried both ext3 and reiserfs; even simple operations such as deleting files suddenly take on the order of 10-20 minutes.
>
> I'm running some 3TB software arrays (12 * 250GB RAID5) with no trouble. I've opted for XFS over ext3 or reiserfs, and I see no trouble in accessing or deleting files.

Is there anybody out there with a qualified opinion on what is best suited for TB arrays, XFS or JFS?

> not a problem.

Well, as far as software RAID goes, anyway. I wish it handled trivial media errors more gracefully (i.e., without dropping disks).

> You should always back-up your data.

Second that. MD is not the brightest thing around. I particularly dislike the game where you have a failed disk, accidentally yank the cable of another disk, and MD increases the usage counter on the remaining (unusable) disks in the array. Plugging disks in and out while rebooting Linux, to see if you can get the counters to match and MD to assemble the array again, does give that nice adrenaline surge, but I still prefer the more relaxing desktop games that come with Linux.
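PS: in case it saves the original poster a search, putting XFS on an array is about a one-liner, roughly like this (device name and mount point are made up for the example):

    # Create and mount an XFS filesystem on the array device:
    mkfs.xfs /dev/md2
    mount -t xfs /dev/md2 /mnt/array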