Re: Kernel 2.6.23.9 + mdadm 2.6.2-2 + Auto rebuild RAID1?
On 6 Dec 2007, Jan Engelhardt verbalised:
> On Dec 5 2007 19:29, Nix wrote:
>>> On Dec 1 2007 06:19, Justin Piszcz wrote:
>>>> RAID1, 0.90.03 superblocks (in order to be compatible with LILO; if
>>>> you use 1.x superblocks with LILO you can't boot)
>>> Says who? (Don't use LILO ;-)
>> Well, your kernels must be on a 0.90-superblocked RAID-0 or RAID-1
>> device. It can't handle booting off 1.x superblocks nor RAID-[56]
>> (not that I could really hope for the latter).
> If the superblock is at the end (which is the case for 0.90 and 1.0),
> then the offsets for a specific block on /dev/mdX match the ones for
> /dev/sda, so it should be easy to use lilo on 1.0 too, no?

Sure, but you may have to hack /sbin/lilo to convince it to create the superblock there at all. It's likely to recognise that this is an md device without a v0.90 superblock and refuse to continue. (But I haven't tested it.)

-- 
`The rest is a tale of post and counter-post.' --- Ian Rawlings describes USENET

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel 2.6.23.9 + mdadm 2.6.2-2 + Auto rebuild RAID1?
On 1 Dec 2007, Jan Engelhardt uttered the following:
>> On Dec 1 2007 06:19, Justin Piszcz wrote:
>>> RAID1, 0.90.03 superblocks (in order to be compatible with LILO; if
>>> you use 1.x superblocks with LILO you can't boot)
>> Says who? (Don't use LILO ;-)

Well, your kernels must be on a 0.90-superblocked RAID-0 or RAID-1 device. It can't handle booting off 1.x superblocks nor RAID-[56] (not that I could really hope for the latter). But that's just /boot, not everything else.

>> Not using ANY initramfs/initrd images, everything is compiled into 1
>> kernel image (makes things MUCH simpler and the expected device layout
>> etc is always the same, unlike initrd/etc).
> My expected device layout is also always the same, _with_ initrd. Why?
> Simply because mdadm.conf is copied to the initrd, and mdadm will use
> your defined order.

Of course the same is true of initramfs, which can give you the 1 kernel image back, too. (It's also nicer in that you can autoassemble e.g. LVM-on-RAID, or even LVM-on-RAID-over-nbd if you so desire.)

-- 
`The rest is a tale of post and counter-post.' --- Ian Rawlings describes USENET
Re: md device naming question
On 19 Sep 2007, maximilian attems said:
> hello, working on initramfs i'd be curious to know what the /sys/block
> entry of a /dev/md/NN device is. have a user request to support it and
> no handy box using it. i presume it may also be /sys/block/mdNN ?

That's it, e.g. /sys/block/md0. Notable subdirectories include holders/ (block devices built on top of the array, more than one if e.g. LVM is in use), slaves/ (block devices making up the array) and md/ (RAID-specific state).

-- 
`Some people don't think performance issues are real bugs, and I think such people shouldn't be allowed to program.' --- Linus Torvalds
Re: Software based SATA RAID-5 expandable arrays?
On 11 Jul 2007, Michael stated:
> I am running Suse, and the check program is not available

`check' isn't a program. The line suggested has a typo: it should be something like this:

30 2 * * Mon echo check > /sys/block/md0/md/sync_action

The only program that line needs is `echo', and I'm sure you've got that. (You also need to have sysfs mounted at /sys, but virtually everyone has their systems set up like that nowadays.)

(Obviously you can check more than one array: just stick in other lines that echo `check' into some other mdN at some other time of day.)

-- 
`... in the sense that dragons logically follow evolution so they would be able to wield metal.' --- Kenneth Eng's colourless green ideas sleep furiously
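[Editorial aside: the "other lines at some other time of day" suggestion is easy to generate mechanically. A sketch, with the array names and times being invented examples, that emits one crontab line per array, staggered an hour apart:]

```shell
# emit one weekly `check' crontab line per md array, an hour apart
# (the array list here is an example; adjust to taste)
nl='
'
crontab_lines=''
hour=2
for md in md0 md1 md2; do
  crontab_lines="${crontab_lines}30 ${hour} * * Mon echo check > /sys/block/${md}/md/sync_action${nl}"
  hour=$((hour + 1))
done
printf '%s' "$crontab_lines"
```

Staggering the start times just avoids having every array thrash the same disks with concurrent checks.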
Re: limits on raid
On 21 Jun 2007, Neil Brown stated:
> I have that - apparently naive - idea that drives use strong checksums,
> and will never return bad data, only good data or an error. If this
> isn't right, then it would really help to understand what the cause of
> other failures are before working out how to handle them.

Look at the section `Disks and errors' in Val Henson's excellent report on last year's filesystems workshop: http://lwn.net/Articles/190223/. Most of the error modes given there lead to valid checksums and wrong data... (while you're there, read the first part too :) )

-- 
`... in the sense that dragons logically follow evolution so they would be able to wield metal.' --- Kenneth Eng's colourless green ideas sleep furiously
Re: Software based SATA RAID-5 expandable arrays?
On 19 Jun 2007, Michael outgrape:
> [regarding `welcome to my killfile']
> Grow up man, and I thanks for the threat. I will take that into account
> if anything bad happens to my computer system.

Read http://en.wikipedia.org/wiki/Killfile and learn. All he's saying is `I am automatically ignoring you'.
Re: below 10MB/s write on raid5
On 12 Jun 2007, Jon Nelson told this:
> On Mon, 11 Jun 2007, Nix wrote:
>> loki:~# time dd if=/dev/md1 bs=1000 count=502400 of=/dev/null
>> 502400+0 records in
>> 502400+0 records out
>> 502400000 bytes (502 MB) copied, 16.2995 s, 30.8 MB/s
>> loki:~# time dd if=/dev/raid/usr bs=1000 count=502400 of=/dev/null
>> 502400+0 records in
>> 502400+0 records out
>> 502400000 bytes (502 MB) copied, 18.6172 s, 27.0 MB/s
> And what is it like with 'iflag=direct', which I really feel you have
> to use, otherwise you get caching?

I have little enough memory on this box that caching is really not significant :) With iflag=direct I get, um,

loki:/var/log# time dd if=/dev/md1 bs=1000 count=502400 of=/dev/null iflag=direct
dd: reading `/dev/md1': Invalid argument
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.0324791 s, 0.0 kB/s

real    0m0.085s
user    0m0.000s
sys     0m0.000s

so not exactly ideal.

-- 
`... in the sense that dragons logically follow evolution so they would be able to wield metal.' --- Kenneth Eng's colourless green ideas sleep furiously
Re: below 10MB/s write on raid5
On 11 Jun 2007, Justin Piszcz told this:
> You can do a read test. 10gb read test:
> dd if=/dev/md0 bs=1M count=10240 of=/dev/null
> What is the result? I've read that LVM can incur a 30-50% slowdown.

FWIW I see a much smaller penalty than that.

loki:~# lvs -o +devices
  LV   VG   Attr   LSize  Origin Snap%  Move Log Copy%  Devices
  [...]
  usr  raid -wi-ao  6.00G                               /dev/md1(50)

loki:~# time dd if=/dev/md1 bs=1000 count=502400 of=/dev/null
502400+0 records in
502400+0 records out
502400000 bytes (502 MB) copied, 16.2995 s, 30.8 MB/s

real    0m16.360s
user    0m0.310s
sys     0m11.780s

loki:~# time dd if=/dev/raid/usr bs=1000 count=502400 of=/dev/null
502400+0 records in
502400+0 records out
502400000 bytes (502 MB) copied, 18.6172 s, 27.0 MB/s

real    0m18.790s
user    0m0.380s
sys     0m14.750s

So there's a penalty, sure, accounted for mostly in sys time, but it's only about 10%: small enough that I at least can ignore it in exchange for the administrative convenience of LVM.

-- 
`... in the sense that dragons logically follow evolution so they would be able to wield metal.' --- Kenneth Eng's colourless green ideas sleep furiously
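[Editorial aside: making the arithmetic explicit, the throughput figures in that message put the LVM penalty in the same ballpark as the "about 10%" quoted; a trivial one-liner to reproduce the number:]

```shell
# slowdown implied by the two dd throughput figures above
awk 'BEGIN {
  md  = 30.8   # MB/s reading /dev/md1 directly
  lvm = 27.0   # MB/s reading the same extents via /dev/raid/usr
  printf "%.1f%% slower via LVM\n", (md - lvm) / md * 100
}'
```

(The wall-clock times give a slightly larger figure, since some of the extra cost shows up as sys time rather than reduced streaming rate.)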
Re: RAID SB 1.x autodetection
On 29 May 2007, Jan Engelhardt uttered the following:
> from your post at
> http://www.mail-archive.com/linux-raid@vger.kernel.org/msg07384.html
> I read that autodetecting arrays with a 1.x superblock is currently
> impossible. Does it at least work to force the kernel to always assume
> a 1.x sb? There are some 'broken' distros out there that still don't
> use mdadm in initramfs, and recreating the initramfs each time is a bit
> cumbersome...

The kernel build system should be able to do that for you, shouldn't it?

-- 
`On a scale of one to ten of usefulness, BBC BASIC was several points ahead of the competition, scoring a relatively respectable zero.' --- Peter Corlett
Re: RAID SB 1.x autodetection
On 30 May 2007, Bill Davidsen stated:
> Nix wrote:
>> On 29 May 2007, Jan Engelhardt uttered the following:
>>> There are some 'broken' distros out there that still don't use mdadm
>>> in initramfs, and recreating the initramfs each time is a bit
>>> cumbersome...
>> The kernel build system should be able to do that for you, shouldn't it?
> That would be an improvement, yes.

Allow me to rephrase: the kernel build system *can* do that for you ;) That is, it can build a gzipped cpio archive from components located anywhere on the filesystem, or from arbitrary source located under usr/.

-- 
`On a scale of one to ten of usefulness, BBC BASIC was several points ahead of the competition, scoring a relatively respectable zero.' --- Peter Corlett
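[Editorial aside: the mechanism meant here is CONFIG_INITRAMFS_SOURCE, which can point either at a directory or at a gen_init_cpio description file. To the best of my recollection the description-file syntax looks like the following; every path below is an example, not anything from the thread:]

```
dir   /dev                           0755 0 0
nod   /dev/console                   0600 0 0 c 5 1
dir   /bin                           0755 0 0
file  /bin/busybox /path/to/busybox  0755 0 0
slink /bin/sh      busybox           0777 0 0
file  /init        /path/to/init.sh  0755 0 0
```

The kernel build then turns this list into the gzipped cpio archive for you, with no need for root privileges or loopback mounts.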
Re: Recovery of software RAID5 using FC6 rescue?
On 8 May 2007, Michael Tokarev told this:
> BTW, for such recovery purposes, I use initrd (initramfs really, but
> does not matter) with a normal (but tiny) set of commands inside,
> thanks to busybox. So everything can be done without any help from an
> external recovery CD. Very handy at times, especially since all the
> network drivers are here on the initramfs too, so I can even start a
> netcat server while in initramfs, and perform recovery from a remote
> system... ;)

What you should probably do is drop into the shell that's being used to run init if mount fails (or, more generally, if after mount runs it hasn't ended up mounting anything: there's no need to rely on mount's success/failure status). e.g. from my initramfs's init script (obviously this is not runnable as is due to all the variables, but it should get the idea across):

if [ -n "$root" ]; then
  /bin/mount -o $OPTS -t $TYPE $ROOT /new-root
fi
if /bin/mountpoint /new-root > /dev/null; then :; else
  echo "No root filesystem given to the kernel or found on the root RAID array."
  echo "Append the correct 'root=', 'root-type=', and/or 'root-options='"
  echo "boot options."
  echo
  echo "Dropping to a minimal shell. Reboot with Ctrl-Alt-Delete."
  exec /bin/sh
fi
Re: Recovery of software RAID5 using FC6 rescue?
On 9 May 2007, Michael Tokarev spake thusly:
> Nix wrote:
>> What you should probably do is drop into the shell that's being used
>> to run init if mount fails (or, more generally, if after mount runs it
> That's exactly what my initscript does ;)

I thought so. I was really talking to Mark, I suppose.

>> chk() {
>>   while ! $@; do
>>     warn the following command failed:
>>     warn $*
>>     p=** Continue(Ignore)/Shell/Retry (C/s/r)? [...]
> Wow. Feature-rich :)) I may reused this rather nifty stuff.

>> hasn't ended up mounting anything: there's no need to rely on mount's
>> success/failure status). [...]
> Well, so far exitcode has been reliable.

I guess I was being paranoid because I'm using busybox and at various times the exitcodes of its internal commands have been... unimplemented or unreliable.

-- 
`In the future, company names will be a 32-character hex string.' --- Bruce Schneier on the shortage of company names
Re: raid6 array , part id 'fd' not assembling at boot .
On 19 Mar 2007, James W. Laferriere outgrabe:
> What I don't see is the reasoning behind the use of initrd. It's a
> kernel ran to put the dev tree in order, start up devices, ... Just to
> start the kernel again?

That's not what initrds do. No second kernel is started, and constructing /dev is not one of the jobs of initrds in any case. (There *is* something that runs a second kernel if the first one dies --- google for `kexec crash dump' --- but it's entirely different in design and intent from initrds, and isn't an early-boot thing but a kernel-crash-reporting thing.)

There are three different ways to enter userspace for the first time.

- You can boot with an initramfs. This is the recommended way and may eventually deprecate all the others. initramfses consist of gzipped cpio archives, either constructed by hand or built automatically by the kernel build system during the build process; and either linked into the kernel image or pointed at by the bootloader as if they were initrds (or both: the two images are automatically merged). These are extracted into the `rootfs', the (nonswappable) ramfs filesystem which is the root of the mount tree. A minimal rootfs (with nothing on it) is linked into the kernel if nothing else is. The executable /init in the initramfs, if it exists, is run to switch to userspace. You switch from the rootfs to the real root filesystem once you mount it by erasing everything on the rootfs and `exec chroot'ing and/or `mount --move'ing it into place. (busybox contains a `switch_root' built-in command to do this.)

  (I prefer directly linking an initramfs into the kernel, because the kernel image is still stand-alone then and you don't have to engage in messes involving tracking which initramfs archive is used by which kernel if you run multiple kernels.)

- You can boot with an initrd, which is a compressed *filesystem image* loaded from an external file (which the kernel is pointed at by the bootloader). The kernel runs /linuxrc to switch to userspace, and userspace should use the `pivot_root' command to flip over to the real root filesystem. (There is an older way of switching roots involving echoing device numbers into a file under /proc. Ignore it, it's disgusting.) In both these cases it is the initramfs/initrd's responsibility to parse things like the root= and init= kernel command-line parameters (and any new ones that you choose to define). (This is a far older method than initramfs, which explains the apparent duplication of effort. initramfs arose largely out of dissatisfaction with the limitations of initrds.)

- You can boot with neither. In this case the kernel mounts / for you, either from a local block device, from auto-assembled md arrays with v0.90 superblocks, or remotely off NFS. Because it doesn't fsck the root filesystem before mounting it, this is slightly risky compared to the other options (where your initramfs/initrd image can fsck before mounting as usual). (initramfs archives are safest of all here, because the filesystem is newly constructed by the kernel at boot time, so it is *impossible* for it to be damaged.) This option is the one where RAID auto-assembly kicks in, and the only one so inflexible that such is needed. H. Peter Anvin has an ongoing project to move everything this option does into a default initramfs, and remove this crud from the kernel entirely. When that happens, there'll be little excuse for assembling RAID arrays using the auto-assembler :)

> In otherwords I beleive that initrd's are essentially pointless . But
> that's just my opinion .

It's wrong, sorry. Try mounting / on a RAID array atop LVM partially scattered across the network via the network block device, for instance (I was running like this for some time after some unfortunate disk failures left me with too little storage on one critical machine to store all the stuff it needed to run). Hell, try mounting / on LVM at all. You need userspace to get LVM up and running, so you *need* an initrd or initramfs to do that.

-- 
`In the future, company names will be a 32-character hex string.' --- Bruce Schneier on the shortage of company names
Re: mdadm file system type check
On 17 Mar 2007, Chris Lindley told this:
> What I think the OP is getting at is that mdadm will create an array
> with partitions whose type is not set to FD (Linux Raid Auto), but are
> perhaps 83. The issue with that is that upon a reboot mdadm will not be
> able to start the array.

I think you mean that the Linux kernel's auto-assembly code won't be able to start the array. mdadm doesn't care.

> If you use mdadm to manually reassemble the array then it will work
> fine. But until you reset the partition type to be FD, you will have to
> run this step every time you reboot the machine.

That's what initramfs/initrd is good at :)

-- 
`In the future, company names will be a 32-character hex string.' --- Bruce Schneier on the shortage of company names
Re: PATA/SATA Disk Reliability paper
On 20 Feb 2007, Al Boldi outgrape:
>> Eyal Lebedinsky wrote:
>>> Disks are sealed, and a dessicant is present in each to keep humidity
>>> down. If you ever open a disk drive (e.g. for the magnets, or the
>>> mirror quality platters, or for fun) then you can see the dessicant
>>> sachet.
>> Actually, they aren't sealed 100%.

I'd certainly hope not, unless you like the sound of imploding drives when you carry one up a mountain.

> On wd's at least, there is a hole with a warning printed on its side:
>
>     DO NOT COVER HOLE BELOW
>          V V V V
>             o

I suspect that's for air-pressure equalization.

> In contrast, older models from the last century don't have that hole.

It was my understanding that disks have had some way of equalizing pressure with their surroundings for many years; but I haven't verified this, so you may well be right that this is a recent thing. (Anyone know for sure?)

-- 
`In the future, company names will be a 32-character hex string.' --- Bruce Schneier on the shortage of company names
Re: PATA/SATA Disk Reliability paper
On 22 Feb 2007, [EMAIL PROTECTED] uttered the following:
> On 20 Feb 2007, Al Boldi outgrape:
>>> Eyal Lebedinsky wrote:
>>>> Disks are sealed, and a dessicant is present in each to keep
>>>> humidity down. If you ever open a disk drive (e.g. for the magnets,
>>>> or the mirror quality platters, or for fun) then you can see the
>>>> dessicant sachet.
>>> Actually, they aren't sealed 100%.
> I'd certainly hope not, unless you like the sound of imploding drives
> when you carry one up a mountain.

Or even exploding drives. (Oops.)

-- 
`In the future, company names will be a 32-character hex string.' --- Bruce Schneier on the shortage of company names
Re: Ooops on read-only raid5 while unmounting as xfs
On 23 Jan 2007, Neil Brown said:
> On Tuesday January 23, [EMAIL PROTECTED] wrote:
>> My question is then: what prevents the upper layer to open the array
>> read-write, submit a write and make the md code BUG_ON()?
> The theory is that when you tell an md array to become read-only, it
> tells the block layer that it is read-only, and then if some process
> tries to open it read/write, it gets an array.

Um. Do you mean it gets an *error*? ;}

-- 
`The serial comma, however, is correct and proper, and abandoning it will surely lead to chaos, anarchy, rioting in the streets, the Terrorists taking over, and possibly the complete collapse of Human Civilization.'
Re: bad performance on RAID 5
On 18 Jan 2007, Bill Davidsen spake thusly:
> Steve Cousins wrote:
>> time dd if=/dev/zero of=/mount-point/test.dat bs=1024k count=1024
> That doesn't give valid (repeatable) results due to caching issues. Go
> back to the thread I started on RAID-5 write, and see my results. More
> important, the way I got rid of the cache effects (besides an unloaded
> system) was:
>
>   sync; time bash -c "dd if=/dev/zero bs=1024k count=2048 of=/mnt/point/file; sync"
>
> I empty the cache, then time the dd including the sync at the end.
> Results are far more repeatable.

Recent versions of dd have `oflag=direct' as well, to open the output with O_DIRECT. (I'm not sure what the state of O_DIRECT on regular files is though.)

-- 
`The serial comma, however, is correct and proper, and abandoning it will surely lead to chaos, anarchy, rioting in the streets, the Terrorists taking over, and possibly the complete collapse of Human Civilization.'
Re: FailSpare event?
On 15 Jan 2007, Bill Davidsen told this:
> Nix wrote:
>>    Number   Major   Minor   RaidDevice State
>>       0       8        6        0      active sync   /dev/sda6
>>       1       8       22        1      active sync   /dev/sdb6
>>       3      22        5        2      active sync   /dev/hdc5
>>
>>    Number   Major   Minor   RaidDevice State
>>       0       8       23        0      active sync   /dev/sdb7
>>       1       8        7        1      active sync   /dev/sda7
>>       3       3        5        2      active sync   /dev/hda5
>>
>> 0, 1, and *3*. Where has number 2 gone? (And how does `Number' differ
>> from `RaidDevice'? Why have both?)
> Did you ever move the data to these drives from another? I think this
> is what you see when you migrate by adding a drive as a spare, then
> mark an existing drive as failed, so the data is rebuilt on the new
> drive. Was there ever a device 2?

Nope. These arrays were created in one lump and never had a spare. Plenty of pvmoves have happened on them, but that's *inside* the arrays, of course...

-- 
`He accused the FSF of being something of a hypocrit, which shows that he neither understands hypocrisy nor can spell.' --- jimmybgood
Re: FailSpare event?
On 14 Jan 2007, Neil Brown told this:
> A quick look suggests that the following patch might make a difference,
> but there is more to it than that. I think there are subtle differences
> due to the use of version-1 superblocks. That might be just another
> one-line change, but I want to make sure first.

Well, that certainly made that warning go away. I don't have any actually-failed disks, so I can't tell if it would *ever* warn anymore ;)

... actually, it just picked up some monthly array check activity:

Jan 15 20:03:17 loki daemon warning: mdadm: Rebuild20 event detected on md device /dev/md2

So it looks like it works perfectly well now. (Looking at the code, yeah, without that change it'll never remember state changes at all!)

One bit of residue from the state before this patch remains on line 352, where you initialize disc.state and then never use it for anything...

-- 
`He accused the FSF of being something of a hypocrit, which shows that he neither understands hypocrisy nor can spell.' --- jimmybgood
Re: FailSpare event?
On 13 Jan 2007, [EMAIL PROTECTED] uttered the following:
> mdadm-2.6 bug, I fear. I haven't tracked it down yet but will look
> shortly: I can't afford to not run mdadm --monitor... odd, that code
> hasn't changed during 2.6 development.

Whoo! Compile Monitor.c without optimization and the problem goes away. Hunting: maybe it's a compiler bug (anyone not using GCC 4.1.1 seeing this?), maybe mdadm is tripping undefined behaviour somewhere...

-- 
`He accused the FSF of being something of a hypocrit, which shows that he neither understands hypocrisy nor can spell.' --- jimmybgood
Re: FailSpare event?
On 12 Jan 2007, Ernst Herzberg told this:
> Then every about 60 sec 4 times event=SpareActive mddev=/dev/md3

I see exactly this on both my RAID-5 arrays, neither of which has any spare device --- nor have any active devices transitioned to spare (which is what that event is actually supposed to mean). mdadm-2.6 bug, I fear. I haven't tracked it down yet but will look shortly: I can't afford to not run mdadm --monitor... odd, that code hasn't changed during 2.6 development.

-- 
`He accused the FSF of being something of a hypocrit, which shows that he neither understands hypocrisy nor can spell.' --- jimmybgood
Re: FailSpare event?
On 13 Jan 2007, [EMAIL PROTECTED] spake thusly:
> On 12 Jan 2007, Ernst Herzberg told this:
>> Then every about 60 sec 4 times event=SpareActive mddev=/dev/md3
> I see exactly this on both my RAID-5 arrays, neither of which has any
> spare device --- nor have any active devices transitioned to spare
> (which is what that event is actually supposed to mean).

Hm, the manual says that it means that a spare has transitioned to active (which seems more likely). Perhaps the comment at line 82 of Monitor.c is wrong, or I just don't understand what a `reverse transition' is supposed to be.

-- 
`He accused the FSF of being something of a hypocrit, which shows that he neither understands hypocrisy nor can spell.' --- jimmybgood
Re: FailSpare event?
On 13 Jan 2007, [EMAIL PROTECTED] uttered the following:
> On 12 Jan 2007, Ernst Herzberg told this:
>> Then every about 60 sec 4 times event=SpareActive mddev=/dev/md3
> I see exactly this on both my RAID-5 arrays, neither of which has any
> spare device --- nor have any active devices transitioned to spare
> (which is what that event is actually supposed to mean).

One oddity has already come to light. My /proc/mdstat says

md2 : active raid5 sdb7[0] hda5[3] sda7[1]
      19631104 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU]

md1 : active raid5 sda6[0] hdc5[3] sdb6[1]
      76807296 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU]

hda5 and hdc5 look odd. Indeed, --examine says

   Number   Major   Minor   RaidDevice State
      0       8        6        0      active sync   /dev/sda6
      1       8       22        1      active sync   /dev/sdb6
      3      22        5        2      active sync   /dev/hdc5

   Number   Major   Minor   RaidDevice State
      0       8       23        0      active sync   /dev/sdb7
      1       8        7        1      active sync   /dev/sda7
      3       3        5        2      active sync   /dev/hda5

0, 1, and *3*. Where has number 2 gone? (And how does `Number' differ from `RaidDevice'? Why have both?)

-- 
`He accused the FSF of being something of a hypocrit, which shows that he neither understands hypocrisy nor can spell.' --- jimmybgood
Re: Linux RAID version question
On 27 Nov 2006, Dragan Marinkovic stated:
> On 11/26/06, Nix [EMAIL PROTECTED] wrote:
>> Well, I assemble my arrays with the command
>> /sbin/mdadm --assemble --scan --auto=md
>> [...] No metadata versions needed anywhere.
> [...] But you do have to specify the version (other than 0.90) when you
> want to build the array for the first time, correct?

Unless you've specified it in mdadm.conf's CREATE stanza, sure.

-- 
`The main high-level difference between Emacs and (say) UNIX, Windows, or BeOS... is that Emacs boots quicker.' --- PdS
Re: Linux RAID version question
On 25 Nov 2006, Dragan Marinkovic stated:
> Hm, I was playing with RAID 5 with one spare (3 + 1) and metadata
> version 1.2. If I let it build to some 10% and cleanly reboot, it does
> not start where it left off -- basically it starts from scratch. I was
> under the impression that RAID with metadata version 1.x exhibits
> different behaviour with kernel 2.6.18. I'm starting the array as:
>
> /sbin/mdadm --assemble --scan -e 1.2 --no-degraded --config=/etc/mdadm.conf
>
> On another topic, I looked through the code trying to find if the
> metadata version can be stored as a default in the conf file.
> Unfortunately, there is no such option. It would be nice to have it so
> you don't have to specify it when assembling the array.

Well, I assemble my arrays with the command

/sbin/mdadm --assemble --scan --auto=md

and mdadm.conf looks like

DEVICE partitions
ARRAY /dev/md0 UUID=3a51b74f:8a759fe7:8520304c:3adbceb1
ARRAY /dev/md1 UUID=a5a6cad4:2c7fdc07:88a409b9:192ed3bf
ARRAY /dev/md2 UUID=fe44916d:a1098576:8007fb81:2ee33b5a
MAILADDR [EMAIL PROTECTED]

No metadata versions needed anywhere.

-- 
`The main high-level difference between Emacs and (say) UNIX, Windows, or BeOS... is that Emacs boots quicker.' --- PdS
Re: invalid (zero) superblock magic upon creation of a new RAID-1 array
On 6 Nov 2006, Thomas Andrews uttered the following:
> Thanks Neil, I fixed my problem by creating the raid set using the -e
> option:
>
> mdadm -C /dev/md0 -e 0.90 --level=raid1 --raid-devices=2 /dev/sda1 /dev/sdb1
>
> Your suggestion to use mdadm to assemble the array is not an option for
> me because it is the root partition that is raided, but thanks for
> putting me in the right direction.

You can still use mdadm to assemble root filesystems: you just need an initramfs or initrd to do the work before / is mounted. (As a bonus you can fsck it before it's mounted, as well.) Most distros have tools that can do this for you, or you can do it by hand (see e.g. http://linux-raid.osdl.org/index.php/RAID_Boot).

-- 
`When we are born we have plenty of Hydrogen but as we age our Hydrogen pool becomes depleted.'
Re: why partition arrays?
On 21 Oct 2006, Bodo Thiesen yowled: was hdb and what was hdd? And hde? Hmmm ...), so we decided the following structure: hda - vg called raida - creating LVs called raida1..raida4 hdb - vg called raidb - creating LVs called raidb1..raidb4 I'm interested: why two VGs? Why not have one VG covering all RAID arrays, and then another one for any unRAIDed space (if any)? -- `When we are born we have plenty of Hydrogen but as we age our Hydrogen pool becomes depleted.' - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Starting point of the actual RAID data area
On 8 Oct 2006, Daniel Pittman said: Jyri Hovila [EMAIL PROTECTED] writes: I would appreciate it a lot if somebody could give me a hand here. All I need to understand right now is how I can find out the first sector of the actual RAID data. I'm starting with a simple configuration, where there are three identical drives, all of them used fully for one RAID 5 set. And no LVM at this point. It would start on the first sector of the first disk. That depends on the superblock version. Versions 0.9x and 1.0 will do as you say, but versions 1.1 and 1.2 will have the start of the disk holding either the RAID superblock (for 1.1) or 4Kb of nothing (for 1.2). -- `In typical emacs fashion, it is both absurdly ornate and still not really what one wanted.' --- jdev - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
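The layout rules Daniel describes can be tabulated in a few lines. A sketch only: the `md_data_start` helper and its "-1 means the data follows the superblock at a recorded offset" convention are mine, not mdadm's API.

```c
#include <assert.h>
#include <string.h>

/* Where array data starts for each md metadata version, per the
 * description above: 0.90 and 1.0 keep the superblock at the END of
 * the component device, so data starts at byte 0; 1.1 puts the
 * superblock first and 1.2 puts it 4KiB in, so data starts at a
 * variable offset after it (recorded in the superblock itself),
 * returned here as -1. */
static long md_data_start(const char *version)
{
    if (strcmp(version, "0.90") == 0 || strcmp(version, "1.0") == 0)
        return 0;    /* superblock at end; data from the first sector */
    if (strcmp(version, "1.1") == 0 || strcmp(version, "1.2") == 0)
        return -1;   /* superblock (plus a gap, for 1.2) comes first */
    return -1;       /* unknown version: no fixed data start */
}
```

So for the three-whole-disk RAID-5 in the question, data starts at the first sector only if the array uses 0.90 or 1.0 superblocks.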
Re: Recipe for Mirrored OS Drives
On 2 Oct 2006, David Greaves spake: I suggest you link from http://linux-raid.osdl.org/index.php/RAID_Boot The pages don't really have the same purpose. RAID_Boot is `how to boot your RAID system using initramfs'; this is `how to set up a RAID system in the first place', i.e., setup. I'll give it a bit of a tweak-and-rename in a bit. -- `In typical emacs fashion, it is both absurdly ornate and still not really what one wanted.' --- jdev - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Care and feeding of RAID?
On 6 Sep 2006, Mario Holbe spake: You don't necessarily need one. However, since Neil considers in-kernel RAID-autodetection a bad thing and since mdadm typically relies on mdadm.conf for RAID-assembly You can specify the UUID on the command-line too (although I don't). The advantage of the config file from my POV is that it lets me activate *all* my RAID arrays with one command, and the command doesn't change, no matter how complex the array configuration. (I'll admit that the sheer number of options to mdadm has always overwhelmed me to some degree, despite the excellent documentation, so I prefer approaches that keep a working command-line unchanged, especially for something as critical as boot-time assembly.) -- `In typical emacs fashion, it is both absurdly ornate and still not really what one wanted.' --- jdev - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Care and feeding of RAID?
On 5 Sep 2006, Paul Waldo uttered the following: What about bitmaps? Nobody has mentioned them. It is my understanding that you just turn them on with mdadm /dev/mdX -b internal. Any caveats for this? Notably, how many additional writes does it incur? I have some RAID arrays using drives which are quiet *until* you access them, and which then make a bloody racket. The superblock updates are bad enough, but bitmap updates, well, I don't really like seeing one write turned into twelve-odd disk hits that much (just a back-of-the-envelope guess for a three-disk RAID-5 array). -- `In typical emacs fashion, it is both absurdly ornate and still not really what one wanted.' --- jdev - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: remark and RFC
On 16 Aug 2006, Molle Bestefich murmured woefully: Peter T. Breuer wrote: The comm channel and "hey, I'm OK" message you propose doesn't seem that different from just hot-adding the disks from a shell script using 'mdadm'. [snip speculations on possible blocking calls] You could always try and see. Should be easy to simulate a network outage. Blocking calls are not the problem. Deadlocks are. The problem is that forking a userspace process necessarily involves kernel memory allocations (for the task struct, the userspace memory map, and possibly text pages if the necessary pieces of mdadm are not in the page cache), and if your swap is on the remote RAID array, you can't necessarily carry out those allocations. Note that the same deadlock situation is currently triggered by sending/receiving network packets, which is why swapping over NBD is a bad idea at present: however, this is being fixed at this moment, because until it's fixed you can't reliably have a machine with all storage on iSCSI, for instance. However, the deadlock is only fixable for kernel allocations, because the amount of storage it'll need is bounded in several ways: you can't fix it for userspace allocations. So you can never rely on userspace working in this situation. -- `We're sysadmins. We deal with the inconceivable so often I can clearly see the need to define levels of inconceivability.' --- Rik Steenwinkel - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: raid5/lvm setup questions
On 5 Aug 2006, David Greaves prattled cheerily: As an example of the cons: I've just set up lvm2 over my raid5 and whilst testing snapshots, the first thing that happened was a kernel BUG and an oops... I've been backing up using writable snapshots on LVM2 over RAID-5 for some time. No BUGs. I think the blame here is likely to be layable at the snapshots' door, anyway: they're still a little wobbly and the implementation is pretty complex: bugs surface on a regular basis. -- `We're sysadmins. We deal with the inconceivable so often I can clearly see the need to define levels of inconceivability.' --- Rik Steenwinkel - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: md reports: unknown partition table - fixed.
On 20 Jul 2006, Neil Brown uttered the following: On Tuesday July 18, [EMAIL PROTECTED] wrote: I think there's a bug here somewhere. I wonder/suspect that the superblock should contain the fact that it's a partitioned/able md device? I've thought about that and am not in favour. I would rather just assume everything is partitionable - put CREATE auto=part As long as `partitionable' doesn't imply `partitioned': I'd quite like LVM-on-raw-md to keep working... -- `We're sysadmins. We deal with the inconceivable so often I can clearly see the need to define levels of inconceivability.' --- Rik Steenwinkel - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: only 4 spares and no access to my data
On 18 Jul 2006, Neil Brown moaned: The superblock locations for sda and sda1 can only be 'one and the same' if sda1 is at an offset in sda which is a multiple of 64K, and if sda1 ends near the end of sda. This certainly can happen, but it is by no means certain. For this reason, version-1 superblocks record the offset of the superblock in the device so that if a superblock is written to sda1 and then read from sda, it will look wrong (wrong offset) and so will be ignored (no valid superblock here). One case where this can happen is Sun slices (and I think BSD disklabels too), where /dev/sda and /dev/sda1 start at the *same place*. (This causes amusing problems with LVM vgscan unless the raw devices are excluded, too.) -- `We're sysadmins. We deal with the inconceivable so often I can clearly see the need to define levels of inconceivability.' --- Rik Steenwinkel - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
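Neil's "multiple of 64K" condition comes from where 0.90 puts its superblock: in the last 64KiB-aligned 64KiB chunk of the device. A sketch of that placement rule, mirroring the kernel's MD_NEW_SIZE_SECTORS arithmetic (the helper name is mine):

```c
#include <assert.h>
#include <stdint.h>

#define MD_RESERVED_SECTORS 128  /* 64KiB, in 512-byte sectors */

/* 0.90 superblock placement: round the device size down to a 64KiB
 * boundary, then step back one 64KiB block.  The superblock occupies
 * that final aligned block. */
static uint64_t sb_0_90_sector(uint64_t dev_sectors)
{
    return (dev_sectors & ~(uint64_t)(MD_RESERVED_SECTORS - 1))
           - MD_RESERVED_SECTORS;
}
```

So the superblock written for /dev/sda1 lands at the same absolute sector as one written for /dev/sda only when sda1 starts at a 64KiB multiple and ends near the end of the disk -- exactly the coincidence described above, and the reason version-1 superblocks also record their own offset.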
Re: Still can't get md arrays that were started from an initrd to shutdown
On 17 Jul 2006, Christian Pernegger suggested tentatively: I'm still having problems with some md arrays not shutting down cleanly on halt / reboot. The problem seems to affect only arrays that are started via an initrd, even if they do not have the root filesystem on them. That's all arrays if they're either managed by EVMS or the ramdisk-creator is initramfs-tools. For yaird-generated initrds only the array with root on it is affected. Hm. FWIW mine (started via initramfs) all shut down happily, so something more than this is involved. My initramfs has a /dev populated via busybox's `mdev -s', yielding a /dev containing all devices the kernel reports, under their kernel names. mdadm is run like this: /sbin/mdadm --assemble --scan --auto=md --run and, well, it works, with udev populating the appropriate devices for me once we switch to the real root filesystem. If only there were an exitrd image to complement the initrds ... This shouldn't be necessary. One possible difference here is that switch_root (from recent busyboxes) deletes everything on the initramfs before chrooting to the new root. Are you sure that there's nothing left running from your initrd that could be holding open, say, some block device that comprises part of your RAID array? (I'm just guessing here, but this seems like the largest difference between your setup and mine.) -- `We're sysadmins. We deal with the inconceivable so often I can clearly see the need to define levels of inconceivability.' --- Rik Steenwinkel - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH*2] mdadm works with uClibc from SVN
On 26 Jun 2006, Neil Brown said: On Tuesday June 20, [EMAIL PROTECTED] wrote: For some time, mdadm's been dumping core on me in my uClibc-built initramfs. As you might imagine this is somewhat frustrating, not least since my root filesystem's in LVM on RAID. Half an hour ago I got around to debugging this. Imagine my surprise when I found that it was effectively guaranteed to crash: map_dev() in util.c is stubbed out for uClibc builds, and returns -1 at all times. That means that code in, among other places, config.c:load_partitions() is guaranteed to segfault, which is a bit tough if you're using the (sane) default `DEVICE partitions' in your mdadm.conf. I'm confused. map_dev doesn't return -1, it returns NULL. Yes, that was a typo on my part. And the code in load_partitions() handles a NULL return. Er, what? The code in load_partitions() in mdadm 2.5.1 and below says

,----[ config.c:load_partitions() ]
| name = map_dev(major, minor, 1);
|
| d = malloc(sizeof(*d));
| d->devname = strdup(name);
`----

So if map_dev() returns NULL, we do a strdup(NULL) - crash inside uClibc. So I cannot see why it would be core dumping. If you have more details, I'd be really interested. Backtrace (against uClibc compiled with debugging symbols, tested against a temporary RAID array on a loopback device):

(gdb) run
Starting program: /usr/packages/mdadm/i686-loki/mdadm.uclibc --assemble --uuid=572aeae1:532641bf:49c9aec5:6036454d /dev/md127

Program received signal SIGSEGV, Segmentation fault.
0x080622f8 in strlen (s=0x0) at libc/string/i386/strlen.c:40
40 libc/string/i386/strlen.c: No such file or directory.
in libc/string/i386/strlen.c
(gdb) bt
#0 0x080622f8 in strlen (s=0x0) at libc/string/i386/strlen.c:40
#1 0x08062d1d in strdup (s1=0x0) at libc/string/strdup.c:30
#2 0x0804b0df in load_partitions () at config.c:247
#3 0x0804b77c in conf_get_devs (conffile=0x0) at config.c:715
#4 0x0804d685 in Assemble (st=0x0, mddev=0xbf9c753e "/dev/md127", mdfd=6, ident=0xbf9c630c, conffile=0x0, devlist=0x0, backup_file=0x0, readonly=0, runstop=0, update=0x0, homehost=0x0, verbose=0, force=0) at Assemble.c:184
#5 0x08049d62 in main (argc=4, argv=0xbf9c6514) at mdadm.c:958
(gdb) frame 2
#2 0x0804b0df in load_partitions () at config.c:247
247 d->devname = strdup(name);
(gdb) info locals
name = 0x0
mp = 0xbf9c5a34 "09873360 hda\n"
f = (FILE *) 0x807b018
buf = "09873360 hda\n", '\0' <repeats 546 times>, ...
rv = (mddev_dev_t) 0x0

I hope the bug's more obvious now :) That said: util.c doesn't handle the case for 'ftw not available' as well as it could. I will fix that. The __UCLIBC_HAS_FTW__ macro in features.h is the right way to do that for uClibc, as Luca noted. -- `NB: Anyone suggesting that we should say Tibibytes instead of Terabytes there will be hunted down and brutally slain. That is all.' --- Matthew Wilcox - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Multiple raids on one machine?
On 25 Jun 2006, Chris Allen uttered the following: Back to my 12 terabyte fileserver, I have decided to split the storage into four partitions each of 3TB. This way I can choose between XFS and EXT3 later on. So now, my options are between the following:

1. Single 12TB /dev/md0, partitioned into four 3TB partitions. But how do I do this? fdisk won't handle it. Can GNU Parted handle partitions this big?

2. Partition the raw disks into four partitions and make /dev/md0,md1,md2,md3. But am I heading for problems here? Is there going to be a big performance hit with four raid5 arrays on the same machine? Am I likely to have dataloss problems if my machine crashes?

There is a third alternative which can be useful if you have a mess of drives of widely-differing capacities: make several RAID arrays so as to tessellate space across all the drives, and then pile an LVM on the top of all of them to fuse them back into one again. The result should give you the reliability of RAID-5 and the resizeability of LVM :) e.g. the config on my home server, which for reasons of disks-bought-at-different-times has disks varying in size from 10Gb through 40Gb to 72Gb. Discounting the tiny RAID-1 array used for booting off (LILO won't boot from RAID-5), it looks like this: Two RAID arrays, positioned so as to fill up as much space as possible on the various physical disks:

Raid Level : raid5
Array Size : 76807296 (73.25 GiB 78.65 GB)
Device Size : 38403648 (36.62 GiB 39.33 GB)
Raid Devices : 3
[...]
Number Major Minor RaidDevice State
   0      8     6      0      active sync /dev/sda6
   1      8    22      1      active sync /dev/sdb6
   3     22     5      2      active sync /dev/hdc5

Raid Level : raid5
Array Size : 19631104 (18.72 GiB 20.10 GB)
Device Size : 9815552 (9.36 GiB 10.05 GB)
Raid Devices : 3

Number Major Minor RaidDevice State
   0      8    23      0      active sync /dev/sdb7
   1      8     7      1      active sync /dev/sda7
   3      3     5      2      active sync /dev/hda5

(Note that the arrays share some disks, the largest ones: each lays claim to almost the whole of one of the smaller disks.)
Then atop that we have two LVM volume groups, one filling up any remaining non-RAIDed space and used for non-critical stuff which can be regenerated on demand (if a disk dies the whole VG will vanish; if we wanted to avoid that we could make that space into a RAID-1 array, but I have a lot of easily-regeneratable data and so didn't bother with that), and one filling *both* RAID arrays:

VG    #PV #LV #SN Attr   VSize  VFree  Devices
disks   3   7   0 wz--n- 43.95G 21.80G /dev/sda8(0)
disks   3   7   0 wz--n- 43.95G 21.80G /dev/sdb8(0)
disks   3   7   0 wz--n- 43.95G 21.80G /dev/hdc6(0)
raid    2   9   0 wz--n- 91.96G 49.77G /dev/md1(0)
raid    2   9   0 wz--n- 91.96G 49.77G /dev/md2(0)

The result can survive any single disk failure, just like a single RAID-5 array: the worst case is that one of the /dev/sd*'s dies and both arrays go degraded at once, but nothing else bad would happen to the RAIDed storage. Try doing *that* with hardware RAID. :))) -- `NB: Anyone suggesting that we should say Tibibytes instead of Terabytes there will be hunted down and brutally slain. That is all.' --- Matthew Wilcox - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH*2] mdadm works with uClibc from SVN
On Tue, 27 Jun 2006, Neil Brown prattled cheerily: On Tuesday June 27, [EMAIL PROTECTED] wrote:

,----[ config.c:load_partitions() ]
| name = map_dev(major, minor, 1);
|
| d = malloc(sizeof(*d));
| d->devname = strdup(name);
`----

Ahh.. uhmmm... Oh yes. I've fixed that since, but completely forgot about it and the change log didn't make it obvious. Yahoo. That's what I like about free software: the pre-emptive bugfixes! It seems that as soon as I find a bug, someone else will have already fixed it :) So: that's fixed for 2.5.2. Thanks for following it up. No, thank *you* for fixing my boot process :) -- `NB: Anyone suggesting that we should say Tibibytes instead of Terabytes there will be hunted down and brutally slain. That is all.' --- Matthew Wilcox - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Multiple raids on one machine?
On Tue, 27 Jun 2006, Chris Allen wondered: Nix wrote: There is a third alternative which can be useful if you have a mess of drives of widely-differing capacities: make several RAID arrays so as to tessellate space across all the drives, and then pile an LVM on the top of all of them to fuse them back into one again. But won't I be stuck with the same problem? ie I'll have a single 12TB lvm, and won't be able to use EXT3 on it? Not without ext3 patches (until the very-large-ext3 patches now pending on l-k go in), sure. But because it's LVMed you could cut it into a couple of ext3 filesystems easily. (I find it hard to imagine a single *directory* whose children contain 12Tb of files in a form that you can't cut into pieces with suitable use of bind mounts, but still, perhaps such exists.) -- `NB: Anyone suggesting that we should say Tibibytes instead of Terabytes there will be hunted down and brutally slain. That is all.' --- Matthew Wilcox - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH*2] mdadm works with uClibc from SVN
On Fri, 23 Jun 2006, Neil Brown mused: On Friday June 23, [EMAIL PROTECTED] wrote: On 20 Jun 2006, [EMAIL PROTECTED] prattled cheerily: For some time, mdadm's been dumping core on me in my uClibc-built initramfs. As you might imagine this is somewhat frustrating, not least since my root filesystem's in LVM on RAID. Half an hour ago I got around to debugging this. Ping? No, but I do know someone by that name. Sorry, I'm pinging all the patches I've sent out in the last few weeks and this one was on my hit list :) Yeh, it's on my todo list. Agreed. I suspect I would rather make mdadm not dump core if ftw isn't available. It was... surprising. I'm still not sure how I ever made it work without this fix; the first uClibc-based initramfs I ever rolled had mdadm 2.3.1 in it and a `DEVICE partitions', and there was no core dump. A mystery. Is there some #define in an include file which will allow me to tell if the current uclibc supports ftw or not? I misspoke: ftw was split into multiple files in late 2005, but it was originally added in September 2003, in time for version 0.9.21. Obviously the #defines in ftw.h don't exist before that date, but that's a bit late to check, really. features.h provides the macros __UCLIBC_MAJOR__, __UCLIBC_MINOR__, and __UCLIBC_SUBLEVEL__: versions above 0.9.20 appear to support ftw() (at least, they have the function, in 32-bit form at least, which is certainly enough for this application!) (and I'm more likely to reply to mail if the To or Cc line mentions me specifically - it's less likely to get lost that way). OK. -- `NB: Anyone suggesting that we should say Tibibytes instead of Terabytes there will be hunted down and brutally slain. That is all.' --- Matthew Wilcox - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Large single raid and XFS or two small ones and EXT3?
On 23 Jun 2006, Francois Barre uttered the following: The problem is that there is no cost effective backup available. One-liner questions : - How does Google make backups ? Replication across huge numbers of cheap machines on a massively distributed filesystem. -- `NB: Anyone suggesting that we should say Tibibytes instead of Terabytes there will be hunted down and brutally slain. That is all.' --- Matthew Wilcox - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Large single raid and XFS or two small ones and EXT3?
On 23 Jun 2006, PFC suggested tentatively: - ext3 is slow if you have many files in one directory, but has more mature tools (resize, recovery etc) This is much less true if you turn on the dir_index feature. -- `NB: Anyone suggesting that we should say Tibibytes instead of Terabytes there will be hunted down and brutally slain. That is all.' --- Matthew Wilcox - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Large single raid and XFS or two small ones and EXT3?
On 23 Jun 2006, Christian Pedaschus said: and my main points for using ext3 is still: it's a very mature fs, nobody will tell you such horrible stories about data loss with ext3 as with any other filesystem. Actually I can, but it required bad RAM *and* a broken disk controller *and* an electrical storm *and* heavy disk loads (only read loads, but I didn't have noatime active so read implied write). In my personal experience it's since weathered machines with `only' RAM so bad that md5sums of 512Kb files wouldn't come out the same way twice with no problems at all (some file data got corrupted, unsurprisingly, but the metadata was fine). Definitely an FS to be relied upon. -- `NB: Anyone suggesting that we should say Tibibytes instead of Terabytes there will be hunted down and brutally slain. That is all.' --- Matthew Wilcox - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH*2] mdadm works with uClibc from SVN
For some time, mdadm's been dumping core on me in my uClibc-built initramfs. As you might imagine this is somewhat frustrating, not least since my root filesystem's in LVM on RAID. Half an hour ago I got around to debugging this. Imagine my surprise when I found that it was effectively guaranteed to crash: map_dev() in util.c is stubbed out for uClibc builds, and returns -1 at all times. That means that code in, among other places, config.c:load_partitions() is guaranteed to segfault, which is a bit tough if you're using the (sane) default `DEVICE partitions' in your mdadm.conf. As far as I can tell this is entirely because ftw() support is not implemented in uClibc. But as of November 2005, it *is* implemented in uClibc 0.9.29-to-be in SVN, which rumour has it is soon to be officially released. (The ftw() implementation is currently a copy of that from glibc, but knowing the uClibc folks it will shrink dramatically as time goes by. :) ) There are two options here. Either the trivial but rather unhelpful approach of, well, *telling* people that what they're building isn't going to work:

diff -durN 2.5.1-orig/Makefile 2.5.1-patched/Makefile
--- 2.5.1-orig/Makefile	2006-06-20 22:49:56.0 +0100
+++ 2.5.1-patched/Makefile	2006-06-20 22:50:34.0 +0100
@@ -97,6 +97,7 @@
 mdadm.tcc : $(SRCS) mdadm.h
 	$(TCC) -o mdadm.tcc $(SRCS)
 
+# This doesn't work
 mdadm.uclibc : $(SRCS) mdadm.h
 	$(UCLIBC_GCC) -DUCLIBC -DHAVE_STDINT_H -o mdadm.uclibc $(SRCS) $(STATICSRC)
@@ -115,6 +116,7 @@
 	rm -f $(OBJS)
 	$(CC) $(LDFLAGS) $(ASSEMBLE_FLAGS) -static -DHAVE_STDINT_H -o mdassemble.static $(ASSEMBLE_SRCS) $(STATICSRC)
 
+# This doesn't work
 mdassemble.uclibc : $(ASSEMBLE_SRCS) mdadm.h
 	rm -f $(OJS)
 	$(UCLIBC_GCC) $(ASSEMBLE_FLAGS) -DUCLIBC -DHAVE_STDINT_H -static -o mdassemble.uclibc $(ASSEMBLE_SRCS) $(STATICSRC)

diff -durN 2.5.1-orig/mdadm.h 2.5.1-patched/mdadm.h
--- 2.5.1-orig/mdadm.h	2006-06-02 06:35:22.0 +0100
+++ 2.5.1-patched/mdadm.h	2006-06-20 22:50:55.0 +0100
@@ -344,6 +344,7 @@
 #endif
 
 #ifdef UCLIBC
+#error This is known not to work.
  struct FTW {};
 # define FTW_PHYS 1
 #else

Or the even-more-trivial but arguably far more useful approach of making it work when possible (failing to compile with old uClibc because of the absence of ftw.h, and working perfectly well with new uClibc):

diff -durN 2.5.1-orig/Makefile 2.5.1-patched/Makefile
--- 2.5.1-orig/Makefile	2006-06-20 22:49:56.0 +0100
+++ 2.5.1-patched/Makefile	2006-06-20 22:52:34.0 +0100
@@ -98,7 +98,7 @@
 	$(TCC) -o mdadm.tcc $(SRCS)
 
 mdadm.uclibc : $(SRCS) mdadm.h
-	$(UCLIBC_GCC) -DUCLIBC -DHAVE_STDINT_H -o mdadm.uclibc $(SRCS) $(STATICSRC)
+	$(UCLIBC_GCC) -DHAVE_STDINT_H -o mdadm.uclibc $(SRCS) $(STATICSRC)
 
 mdadm.klibc : $(SRCS) mdadm.h
 	rm -f $(OBJS)
@@ -117,7 +117,7 @@
 mdassemble.uclibc : $(ASSEMBLE_SRCS) mdadm.h
 	rm -f $(OJS)
-	$(UCLIBC_GCC) $(ASSEMBLE_FLAGS) -DUCLIBC -DHAVE_STDINT_H -static -o mdassemble.uclibc $(ASSEMBLE_SRCS) $(STATICSRC)
+	$(UCLIBC_GCC) $(ASSEMBLE_FLAGS) -DHAVE_STDINT_H -static -o mdassemble.uclibc $(ASSEMBLE_SRCS) $(STATICSRC)
 
 # This doesn't work
 mdassemble.klibc : $(ASSEMBLE_SRCS) mdadm.h

diff -durN 2.5.1-orig/mdadm.h 2.5.1-patched/mdadm.h
--- 2.5.1-orig/mdadm.h	2006-06-02 06:35:22.0 +0100
+++ 2.5.1-patched/mdadm.h	2006-06-20 22:53:02.0 +0100
@@ -343,14 +343,9 @@
 struct stat64;
 #endif
 
-#ifdef UCLIBC
-  struct FTW {};
+#include <ftw.h>
+#ifdef __dietlibc__
 # define FTW_PHYS 1
-#else
-# include <ftw.h>
-# ifdef __dietlibc__
-# define FTW_PHYS 1
-# endif
 #endif
 
 extern int add_dev(const char *name, const struct stat *stb, int flag, struct FTW *s);

diff -durN 2.5.1-orig/util.c 2.5.1-patched/util.c
--- 2.5.1-orig/util.c	2006-06-16 01:25:44.0 +0100
+++ 2.5.1-patched/util.c	2006-06-20 22:54:13.0 +0100
@@ -354,21 +354,6 @@
 } *devlist = NULL;
 int devlist_ready = 0;
 
-#ifdef UCLIBC
-int add_dev(const char *name, const struct stat *stb, int flag, struct FTW *s)
-{
-	return 0;
-}
-char *map_dev(int major, int minor, int create)
-{
-#if 0
-	fprintf(stderr, "Warning - fail to map %d,%d to a device name\n",
-		major, minor);
-#endif
-	return NULL;
-}
-#else
-
 #ifdef __dietlibc__
 int add_dev_1(const char *name, const struct stat *stb, int flag)
 {
@@ -467,8 +452,6 @@
 	return nonstd ? nonstd : std;
 }
 
-#endif
-
 unsigned long calc_csum(void *super, int bytes)
 {
 	unsigned long long newcsum = 0;

With this latter patch, mdadm works flawlessly with my uClibc, svnversion r15342 from 2006-06-08, and probably works just as well with all SVN releases after r13017, 2005-12-30. (One final request: please turn on world-readability in your generated tarballs ;) right now util.c and some others are readable only by user. Thanks.) -- `NB: Anyone suggesting that we should say
Re: RAID tuning?
On 13 Jun 2006, Gordon Henderson said: On Tue, 13 Jun 2006, Adam Talbot wrote: Can any one give me more info on this error? Pulled from /var/log/messages. raid6: read error corrected!! Not seen that one!!! The message is pretty easy to figure out and the code (in drivers/md/raid6main.c) is clear enough. The block device driver has reported a read error. In the old days (pre-2.6.15) the drive would have been kicked from the array for that, and the array would have dropped to degraded state; but nowadays the system tries to rewrite the stripe that should have been there (computed from the corresponding stripes on the other disks in the array), and only fails if that doesn't work. Generally hard disks activate sector sparing and stop reporting read errors for bad blocks only when the block is *written* to (it has to do that, annoying though the read errors are; since it can't read the data off the bad block, it can't tell what data should go onto the spare sector that replaces it until you write it). So it's disk damage, but unless it happens over and over again you probably don't need to be too concerned anymore. -- `Voting for any American political party is fundamentally incomprehensible.' --- Vadik - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
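The rewrite path described above can be sketched with a toy XOR-parity model. This is purely illustrative: three members, RAID-5-style single parity rather than raid6's dual syndromes, and none of the real code's stripe handling or retry logic.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { NDISKS = 3, BLK = 8 };

/* Toy model of the recovery path: member `bad` returned a read error,
 * so reconstruct its block by XORing the corresponding blocks of the
 * other members (XOR parity makes any one block the XOR of the rest)
 * and write the result back -- the rewrite that lets the drive spare
 * out the bad sector. */
static void rewrite_bad_block(uint8_t disks[NDISKS][BLK], int bad)
{
    uint8_t fix[BLK] = { 0 };
    for (int d = 0; d < NDISKS; d++)
        if (d != bad)
            for (int i = 0; i < BLK; i++)
                fix[i] ^= disks[d][i];
    memcpy(disks[bad], fix, BLK);   /* the "read error corrected" write */
}
```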
Re: [PATCH] mdadm 2.5
On 29 May 2006, Neil Brown suggested tentatively: On Sunday May 28, [EMAIL PROTECTED] wrote: - mdadm-2.4-strict-aliasing.patch fix for another strict-aliasing problem, you can typecast a reference to a void pointer to anything, you cannot typecast a reference to a struct. Why can't I typecast a reference to a struct??? It seems very unfair... ISO C forbids it. Pretty much the only time you're allowed to cast a pointer to A to a pointer to B is if one or the other of them is a pointer to void or char. The compiler can (and does) assume that all other pointers-to-T are in different `aliasing sets', i.e. that a modification through a pointer-to-T1 cannot affect variables of any other type (so they can be cached in registers, moved around without reference to each other, and so on). It's *very* advantageous, and with the interprocedural optimizations in GCC 4.1+ is getting more so all the time (the compiler can now move code between *functions* if it needs to, or propagate constants consistently used as function arguments down into the functions themselves; pointer aliasing torpedoes both of these optimizations, and many more.) (One particularly evil thing is the glibc printf handler extension. As a result of that, we must assume that any call to printf() or scanf() can potentially modify any global variable and any local to which a pointer may have escaped. This is decidedly suboptimal given that nobody sane ever uses those extensions!) I think I'll dare to upgrade mdadm soon. I tried upgrading from 2.3.1 to 2.4 (with uClibc from SVN dated 2006-03-28) and it promptly segfaulted at array assembly time. If it happens with 2.5 I'll get a backtrace (hard though that is to get when you don't have any accessible storage to put the core dump on... I upgraded uClibc at the same time so it may well be an uClibc bug.) -- `Voting for any American political party is fundamentally incomprehensible.'
--- Vadik - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
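The aliasing rule described above can be seen in miniature. A hedged sketch (`float_bits` is my illustrative name, not mdadm code): reading an object's bytes through an unrelated pointer type is exactly the cross-type access the compiler's aliasing sets forbid, while going through memcpy (i.e. character-sized accesses) is sanctioned.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* *(uint32_t *)&f would read a float through a uint32_t lvalue -- the
 * cross-type access ISO C forbids and -fstrict-aliasing exploits.
 * memcpy() is the permitted alternative, and modern compilers emit
 * the same single load for it anyway. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```

The point is not the extra call: this version stays correct under GCC's aliasing-based optimizations, where the pointer-cast version may silently not.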
Re: Problems with device-mapper on top of RAID-5 and RAID-6
On 2 Jun 2006, Uwe Meyer-Gruhl uttered the following: Neil's suggestion indicates that there may be a race condition stacking md and dm over each other, but I have not yet tested that patch. I once had problems stacking cryptoloop over RAID-6, so it might really be a stacking problem. We don't know yet if LVM over RAID is affected as well. I've been running LVM on RAID (spanning two RAID-5 arrays) for some time now and have had no trouble (that I know of: a check just completed OK, so the RAID layer at least is consistent). The arrays have both IDE and SCSI (sym53c875) components. Of course this is a subtle and intermittent problem so this is hardly a green light :/ -- `Voting for any American political party is fundamentally incomprehensible.' --- Vadik - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: problems with raid=noautodetect - solved
On 24 May 2006, Florian Dazinger uttered the following: Neil Brown wrote: Presumably you have a 'DEVICE' line in mdadm.conf too? What is it? My first guess is that it isn't listing /dev/sdd? somehow. Otherwise, can you add a '-v' to the mdadm command that assembles the array, and capture the output. That might be helpful. NeilBrown stupid me! I had a DEVICE section, but somehow forgot about my /dev/sdd drive. `DEVICE partitions' is generally preferable for that reason, unless you have entries in /proc/partitions which you explicitly want to exclude from scanning for RAID superblocks. -- `On a scale of 1-10, X's brokenness rating is 1.1, but that's only because bringing Windows into the picture rescaled brokenness by a factor of 10.' --- Peter da Silva
Re: Does software RAID take advantage of SMP, or 64 bit CPU(s)?
On 23 May 2006, Neil Brown noted: On Monday May 22, [EMAIL PROTECTED] wrote: A few simple questions about the 2.6.16+ kernel and software RAID. Does software RAID in the 2.6.16 kernel take advantage of SMP? Not exactly. RAID5/6 tends to use just one cpu for parity calculations, but that frees up other cpus for doing other important work. To expand on this, that depends on how many RAID arrays you've got, since there's one parity-computation daemon per array. If you have several arrays and are writing to them at the same time, or several arrays and some are degraded, then several md*_raid* daemons might be working at once. But that's not very likely, I'd guess. (I have multiple RAID-5 arrays, but that's only because I'm trying to get useful RAIDing on multiple disks of drastically different size.) -- `On a scale of 1-10, X's brokenness rating is 1.1, but that's only because bringing Windows into the picture rescaled brokenness by a factor of 10.' --- Peter da Silva
Re: xfs or ext3?
On 10 May 2006, Dexter Filmore wrote: Do I have to provide stride parameter like for ext2? Yes, definitely. -- `On a scale of 1-10, X's brokenness rating is 1.1, but that's only because bringing Windows into the picture rescaled brokenness by a factor of 10.' --- Peter da Silva
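For the archives, the arithmetic behind that stride parameter, as a worked example of my own (the chunk and block sizes below are assumptions, not from the thread): stride is the RAID chunk size divided by the filesystem block size.

```shell
# Assumed sizes: 64KiB md chunk (the mdadm default), 4KiB fs block.
CHUNK_KB=64
BLOCK_KB=4
STRIDE=$((CHUNK_KB / BLOCK_KB))
echo "stride=$STRIDE"
# The corresponding (hypothetical) mkfs invocation would then be:
# mke2fs -j -b 4096 -E stride=$STRIDE /dev/md1
```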
Re: disks becoming slow but not explicitly failing anyone?
On 23 Apr 2006, Mark Hahn stipulated: I've seen a lot of cheap disks say (generally deep in the data sheet that's only available online after much searching and that nobody ever reads) that they are only reliable if used for a maximum of twelve hours a day, or 90 hours a week, or something of that nature. Even server I haven't, and I read lots of specs. they _will_ sometimes say that non-enterprise drives are intended or designed for a 8x5 desktop-like usage pattern. That's the phrasing, yes: foolish me assumed that meant `if you leave it on for much longer than that, things will go wrong'. to the normal way of thinking about reliability, this would simply mean a factor of 4.2x lower reliability - say from 1M to 250K hours MTBF. that's still many times lower rate of failure than power supplies or fans. Ah, right, it's not a drastic change. It still stuns me that anyone would ever voluntarily buy drives that can't be left switched on (which is perhaps why the manufacturers hide I've definitely never seen any spec that stated that the drive had to be switched off. the issue is really just what is the designed duty-cycle? I see. So it's just `we didn't try to push the MTBF up as far as we would on other sorts of disks'. I run a number of servers which are used as compute clusters. load is definitely 24x7, since my users always keep the queues full. but the servers are not maxed out 24x7, and do work quite nicely with desktop drives for years at a time. it's certainly also significant that these are in a decent machineroom environment. Yeah; i.e., cooled. I don't have a cleanroom in my house so the RAID array I run there is necessarily uncooled, and the alleged aircon in the room housing work's array is permanently on the verge of total collapse (I think it lowers the temperature, but not by much). it's unfortunate that disk vendors aren't more forthcoming with their drive stats. 
for instance, it's obvious that wear in MTBF terms would depend nonlinearly on the duty cycle. it's important for a customer to know where that curve bends, and to try to stay in the low-wear zone. similarly, disk Agreed! I tend to assume that non-laptop disks hate being turned on and hate temperature changes, so just keep them running 24x7. This seems to be OK, with the only disks this has ever killed being Hitachi server-class disks in a very expensive Sun server which was itself meant for 24x7 operation; the cheaper disks in my home systems were quite happy. (Go figure...) specs often just give a max operating temperature (often 60C!), which is almost disingenuous, since temperature has a superlinear effect on reliability. I'll say. I'm somewhat twitchy about the uncooled 37C disks in one of my machines: but one of the other disks ran at well above 60C for *years* without incident: it was an old one with no onboard temperature sensing, and it was perhaps five years after startup that I opened that machine for the first time in years and noticed that the disk housing nearly burned me when I touched it. The guy who installed it said that yes, it had always run that hot, and was that important? *gah* I got a cooler for that disk in short order. a system designer needs to evaluate the expected duty cycle when choosing disks, as well as many other factors which are probably more important. for instance, an earlier thread concerned a vast amount of read traffic to disks resulting from atime updates. Oddly, I see a steady pulse of write traffic, ~100Kb/s, to one dm device (translating into read+write on the underlying disks) even when the system is quiescent, all daemons killed, and all fsen mounted with noatime. One of these days I must fish out blktrace and see what's causing it (but that machine is hard to quiesce like that: it's in heavy use). 
simply using more disks also decreases the load per disk, though this is clearly only a win if it's the difference in staying out of the disks duty-cycle danger zone (since more disks divide system MTBF). Well, yes, but if you have enough more you can make some of them spares and push up the MTBF again (and the cooling requirements, and the power consumption: I wish there was a way to spin down spares until they were needed, but non-laptop controllers don't often seem to provide a way to spin anything down at all that I know of). -- `On a scale of 1-10, X's brokenness rating is 1.1, but that's only because bringing Windows into the picture rescaled brokenness by a factor of 10.' --- Peter da Silva
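The 4.2x figure in the exchange above is just the ratio of the two duty cycles, made explicit here as a small worked calculation of my own:

```shell
# 8 hours x 5 days desktop duty cycle vs. 24x7 server operation.
DESKTOP_HOURS=$((8 * 5))    # 40 hours/week
SERVER_HOURS=$((24 * 7))    # 168 hours/week
awk -v s="$SERVER_HOURS" -v d="$DESKTOP_HOURS" \
    'BEGIN { printf "duty-cycle ratio: %.1fx\n", s / d }'
# Scaling a 1M-hour MTBF down by that ratio gives roughly the
# quoted 250K hours (1000000 / 4.2 is about 238000).
```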
Re: disks becoming slow but not explicitly failing anyone?
On 23 Apr 2006, Mark Hahn said: some people claim that if you put a normal (desktop) drive into a 24x7 server (with real round-the-clock load), you should expect failures quite promptly. I'm inclined to believe that with MTBF's upwards of 1M hour, vendors would not claim a 3-5yr warranty unless the actual failure rate was low, even if only running 8/24. I've seen a lot of cheap disks say (generally deep in the data sheet that's only available online after much searching and that nobody ever reads) that they are only reliable if used for a maximum of twelve hours a day, or 90 hours a week, or something of that nature. Even server disks generally seem to say something like that, but the figure given is more like `168 hours a week', i.e., constant use. It still stuns me that anyone would ever voluntarily buy drives that can't be left switched on (which is perhaps why the manufacturers hide the info in such an obscure place), and I don't know what might go wrong if you use the disk `too much': overheating? But still it seems that there are crappy disks out there with very silly limits on the time they can safely be used for. (But this *is* the RAID list: we know that disks suck, right?) -- `On a scale of 1-10, X's brokenness rating is 1.1, but that's only because bringing Windows into the picture rescaled brokenness by a factor of 10.' --- Peter da Silva
Re: naming of md devices
On 23 Mar 2006, Dan Christensen moaned: To answer myself, the boot parameter raid=noautodetect is supposed to turn off autodetection. However, it doesn't seem to have an effect with Debian's 2.6.16 kernel. It does disable autodetection for my self-compiled kernel, but since that kernel has no initrd or initramfs, it gets stuck at that point. [If I understand correctly, you can't use mdadm for building the array without an initrd/ramfs.] That's true if your root filesystem is on RAID. I also tried putting root=LABEL=/ on my boot command line. Debian's kernel seemed to understand this but gave: Begin: Waiting for root filesystem... Done. Done. Begin: Mounting root filesystem ...kernel autodetection of raid seemed to happen here... ALERT /dev/disk/by_label// does not exist Ah, welcome to the udev problems. Look at the Debian kernel-maint list at lists.debian.org and marvel at the trouble they're having because they're using udev on their initramfs. I'm glad I used mdev instead :) Will the Debian kernel/initramfs fall back to using mdadm to build the arrays? `Fall back to'? If autodetection is turned off, it's not a fallback, it's the common case. the above is on unstable... i don't use stable (and stable definitely does the wrong thing -- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=338200). That bug is against initrd-tools, which is a different package I believe. Yes, it is unmaintained. BUT, my self-compiled kernel is now failing to bring up the arrays! I didn't change anything on the arrays or on this kernel's boot line, and I have not turned off kernel auto-detection, so I have no idea why there is a problem. Unfortunately, I don't have a serial console, and the kernel panics so I can't scroll back to see the relevant part of the screen. My self-compiled kernel has everything needed for my root filesystem compiled in, so I avoided needing an initramfs. Without boot messages it's very hard to say what's going on. 
If you have another machine, you could try booting with the messages going over a serial console... -- `Come now, you should know that whenever you plan the duration of your unplanned downtime, you should add in padding for random management freakouts.'
Re: naming of md devices
On 23 Mar 2006, Daniel Pittman uttered the following: The initramfs tool, which is mostly shared with Ubuntu, is less stupid. It uses mdadm and a loop to scan through the devices found on the machine and find what RAID levels are required, then builds the RAID arrays with mdrun. That's much nicer. Unfortunately, it still doesn't transfer /etc/mdadm.conf to the initramfs, resulting in arrays changing position when constructed, to my annoyance. So, stupid, but not as stupid as the oldest tools. That surely can't be hard to fix, can it? -- `Come now, you should know that whenever you plan the duration of your unplanned downtime, you should add in padding for random management freakouts.'
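It shouldn't be: the whole fix amounts to staging the host's mdadm.conf into the initramfs tree before it is packed up. A rough sketch under assumed paths (Debian's actual tooling may differ; the tree location is mine):

```shell
# Stage a minimal /etc into an initramfs build tree, falling back
# to a generic config if the host has no mdadm.conf to copy.
TREE=/tmp/initramfs-tree
mkdir -p "$TREE/etc"
if [ -r /etc/mdadm.conf ]; then
    cp /etc/mdadm.conf "$TREE/etc/mdadm.conf"
else
    echo 'DEVICE partitions' > "$TREE/etc/mdadm.conf"
fi
# The tree would then be packed in the usual way, e.g.:
# (cd "$TREE" && find . | cpio -o -H newc | gzip) > initramfs.img
```

With the config present in the image, mdadm assembles arrays in the order the file defines, so device positions stop moving between boots.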
Re: A random initramfs script
On Fri, 17 Mar 2006, Andre Noll murmured woefully: On 00:41, Nix wrote: So I downloaded iproute2-2.4.7-now-ss020116-try.tar.gz, but there seems to be a problem with errno.h: Holy meatballs that's ancient. It is the most recent version on the ftp server mentioned in the HOWTO. OK, so I guess the howto is a bit out of date :/ [uClibc] Alternatively, just suck down GCC from, say, svn://gcc.gnu.org/svn/gcc/tags/gcc_3_4_5_release, or ftp.gnu.org, or somewhere, and point buildroot at that. Yep, there's a 'dl' directory which contains all downloads. One can download the tarballs from anywhere else to that directory. Seems to work now. Ah, I just work from svn checkouts :) but they *are* twice the size of a normal tarball, so I guess I understand using the tarballs instead :) -- `Come now, you should know that whenever you plan the duration of your unplanned downtime, you should add in padding for random management freakouts.'
Re: A random initramfs script
On Thu, 16 Mar 2006, Neil Brown wrote: On Wednesday March 15, [EMAIL PROTECTED] wrote: On 08:29, Nix wrote: Yeah, that would work. Neil's very *emphatic* about hardwiring the UUIDs of your arrays, though I'll admit that given the existence of --examine --scan, I don't really see why. :) He likes to compare the situation with /etc/fstab. Nobody complains about having to edit /etc/fstab, so why do people keep complaining about having to edit /etc/mdadm.conf? Indeed! And if you plug in some devices off another machine for disaster recovery, you don't want another disaster because you assembled the wrong arrays. Well, I can't have that go wrong because I assemble all of them :) One thing I would like to know, though: I screwed up construction of one of my arrays and forgot to give it a name, so every array with a V1 superblock has a name *except one*. Is there a way to change the name after array creation? (Another overloading of --grow, perhaps?) (I'm still quite new to md so rather queasy about `just trying' things like this with active arrays containing critical data. I suppose I should build a test array on a sparse loopback mount or something...) I would like an md superblock to be able to contain some indication of the 'name' of the machine which is meant to host the array, so that once a machine knows its own name, it can automatically find and mount its own arrays, ... and of course you could request the mounting of some other machine's arrays if you cart disks around between machines :) Seems like a good idea. -- `Come now, you should know that whenever you plan the duration of your unplanned downtime, you should add in padding for random management freakouts.'
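That sparse-loopback test array is cheap to set up. A sketch (file names and sizes are my assumptions; the losetup/mdadm steps need root, so they're left commented rather than run):

```shell
# Two files with a 1GiB apparent size that occupy almost no disk:
# seek to the last byte and write a single zero.
dd if=/dev/zero of=/tmp/testdisk0 bs=1 count=1 \
   seek=$((1024*1024*1024 - 1)) 2>/dev/null
dd if=/dev/zero of=/tmp/testdisk1 bs=1 count=1 \
   seek=$((1024*1024*1024 - 1)) 2>/dev/null
ls -ls /tmp/testdisk0 /tmp/testdisk1   # big apparent size, ~0 blocks

# Privileged steps, for illustration only:
# losetup /dev/loop0 /tmp/testdisk0
# losetup /dev/loop1 /tmp/testdisk1
# mdadm --create /dev/md9 --level=1 --raid-devices=2 /dev/loop0 /dev/loop1
```

A throwaway array like this is a safe place to try things like superblock name changes before touching arrays with real data on them.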
A random initramfs script
In the interests of pushing people away from in-kernel autodetection, I thought I'd provide the initramfs script I just knocked up to boot my RAID+LVM system. It's had a whole four days of testing so it must work. :) It's being used to boot a system that boots from RAID-1 and has almost everything else on a pair of RAID-5 arrays (not quite everything, as the disks are wildly different sizes, so I was left with a 20Gb slice at the end of the largest disk that I'm swapping onto and using for data that I don't care about). It has a number of improvements over the initramfs embedded in the script that comes with mdadm:

- It handles LVM2 as well as md (obviously if you boot off RAID you still have to boot off RAID1, but /boot can be a RAID1 filesystem of its own now, with / in LVM, on RAID, or both at once).
- It fscks / before mounting it.
- If anything goes wrong, it drops you into an emergency shell in the rootfs, from where you have all the power of ash with hardly any builtin commands, plus lvm and mdadm, to diagnose your problem :) you can't do *that* with in-kernel array autodetection!
- It supports the arguments `rescue', to drop into /bin/ash instead of init after mounting the real root filesystem, and `emergency', to drop into a shell on the initramfs before doing *anything*.
- It supports root= and init= arguments, although for arcane reasons to do with LILO suckage you need to pass the root argument as `root=LABEL=/dev/some/device', or LILO will helpfully transform it into a device number, which is rarely useful if the device name is, say, /dev/emergency-volume-group/root ;) right now, if you don't pass root=, it tries to mount /dev/raid/root after initializing all the RAID arrays and LVM VGs it can.
- It doesn't waste memory. initramfs isn't like initrd: if you just chroot into the new root filesystem, the data in the initramfs *stays around*, in *nonswappable* kernel memory. And it's not gzipped by that point, either! 
The downsides:

- It needs a very new busybox, from Subversion after the start of this year: I'm using svn://busybox.net/trunk/busybox revision 14406, and a 2.6.12+ kernel with sysfs and hotplug support; this is because it populates /dev with the `mdev' mini-udev tool inside busybox, and switches root filesystems with the `switch_root' tool, which chroots only after erasing the entire contents of the initramfs (taking *great* care not to recurse off that filesystem!).
- If you link against uClibc (recommended), you need a CVS uClibc too (i.e., one newer than 0.9.27).
- It doesn't try to e.g. set up the network, so it can't do really whizzy things like mount a root filesystem situated on a network block device on some other host: if you want to do something like that you've probably already written a script to do it long ago.
- The init script's got a few too many things hardwired still, like the type of the root filesystem. I expect it's short enough to easily hack up if you need to :)
- You need an /etc/mdadm.conf and an /etc/lvm/lvm.conf, both taken by default from the system you built the kernel on: personally I'd recommend a really simple one with no device= lines, like

DEVICE partitions
ARRAY /dev/md0 UUID=some:long:uuid:here
ARRAY /dev/md1 UUID=another:long:uuid:here
ARRAY /dev/md2 UUID=yetanother:long:uuid:here
...

One oddity, also: after booting with this, I see some strange results from --examine --scan with mdadm-2.3.1:

loki:/root# mdadm --examine --scan
ARRAY /dev/md0 level=raid1 num-devices=4 UUID=3a51b74f:8a759fe7:8520304c:3adbceb1
ARRAY /dev/?? level=raid5 metadata=1 num-devices=3 UUID=a5a6cad42c:7fdc0788:a409b919:2ed3bf name=large
ARRAY /dev/?? level=raid5 metadata=1 num-devices=3 UUID=fe44916da1:09857680:07fb812e:e33b5a
loki:/root# ls -l /dev/md*
brw-rw---- 1 root disk 9, 0 Mar 14 20:10 /dev/md0
brw-rw---- 1 root disk 9, 1 Mar 14 20:10 /dev/md1
brw-rw---- 1 root disk 9, 2 Mar 14 20:10 /dev/md2

This is decidedly peculiar because the kernel said it was using md1 and md2 on the initramfs, and the device numbers are surely right:

raid5: device sda6 operational as raid disk 0
raid5: device hdc5 operational as raid disk 2
raid5: device sdb6 operational as raid disk 1
raid5: allocated 3155kB for md1
raid5: raid level 5 set md1 active with 3 out of 3 devices, algorithm 2
[...]
raid5: device sdb7 operational as raid disk 0
raid5: device hda5 operational as raid disk 2
raid5: device sda7 operational as raid disk 1
raid5: allocated 3155kB for md2
raid5: raid level 5 set md2 active with 3 out of 3 devices, algorithm 2

Anyway, without further ado, here's usr/init:

#!/bin/sh
#
# init --- locate and mount root filesystem
# By Nix [EMAIL PROTECTED].
#
# Placed in the public domain.
#

export PATH=/sbin:/bin

/bin/mount -t proc proc /proc
/bin/mount -t sysfs sysfs /sys

CMDLINE=`cat /proc/cmdline`

# Populate /dev from /sys
/bin/mount -t tmpfs