Re: Got raid10 assembled wrong - how to fix?
George Spelvin wrote:
> I just discovered (the hard way, sigh, but not too much data loss) that a
> 4-drive RAID 10 array had the mirroring set up incorrectly. Given 4 drives
> A, B, C and D, I had intended to mirror A-C and B-D, so that I could split
> the mirror and run on either (A,B) or (C,D). However, it turns out that the
> mirror pairs are A-B and C-D. So pulling both A and B off-line results in a
> non-functional array. So basically what I need to do is to decommission B
> and C, and rebuild the array with them swapped: A, C, B, D. Can someone
> tell me if the following incantation is correct?
>
>   mdadm /dev/mdX -f /dev/B -r /dev/B
>   mdadm /dev/mdX -f /dev/C -r /dev/C
>   mdadm --zero-superblock /dev/B
>   mdadm --zero-superblock /dev/C
>   mdadm /dev/mdX -a /dev/C
>   mdadm /dev/mdX -a /dev/B

That should work. But I think you'd better just physically swap the drives instead - this way, no rebuilding of the array will be necessary, and your data will be safe the whole time.

/mjt
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
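How the A-B / C-D pairing arises in the first place can be sketched in a few lines of plain shell (this is not mdadm itself, just an illustration of the default near-2 ordering): in a 4-drive raid10 with the near=2 layout, mirror halves follow the order the devices were listed at --create time, so slots 0 and 1 form one mirror and slots 2 and 3 the other.

```shell
# Sketch (not mdadm code): creating a near-2 raid10 as "A B C D"
# pairs slot 0 with slot 1, and slot 2 with slot 3.
devices="A B C D"
set -- $devices
pair1="$1-$2"
pair2="$3-$4"
echo "mirror pairs: $pair1 $pair2"
```

Which is why rebuilding in the order A, C, B, D (as in the incantation above) gives the intended A-C and B-D mirrors.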
Re: transferring RAID-1 drives via sneakernet
Jeff Breidenbach wrote:
>> It's not a RAID issue, but make sure you don't have any duplicate volume
>> names. According to Murphy's Law, if there are two / volumes, the wrong
>> one will be chosen upon your next reboot.
>
> Thanks for the tip. Since I'm not using volumes or LVM at all, I should be
> safe from this particular problem.

If you don't use names, you use numbers - like md0, md10 etc. The numbers, as they now ARE names, should be different too.

There's more to this topic, much more. There are different ways to start (assemble) the arrays. I know at least 4: kernel autodetection, mdadm with some devices listed in mdadm.conf, mdadm with an empty mdadm.conf using the 'homehost' parameter (assemble all our arrays), and the mdrun utility. Also, some arrays may be assembled during the initrd/initramfs stage, and some after...

The best is either mdadm with something in mdadm.conf, or mdadm with homehost. Note that with neither of these ways will your foreign array(s) be assembled - you will have to do it manually, which is much better than to screw things up trying to mix-n-match pieces of the two systems. You'll just have to figure out the device numbers of your foreign disks and issue an appropriate command, like this:

  mdadm --assemble /dev/md10 /dev/sdc1 /dev/sdd1 ...

using a not-yet-taken mdN number and the right device nodes for your disks/partitions. If you want to keep the disks here, you can add the array info into mdadm.conf or refresh the superblock to carry the new homehost.

But if you're using kernel autodetection or mdrun... well, I for one can't help here -- your arrays will be numbered/renumbered by chance...

/mjt
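The "not-yet-taken mdN number" step above can be sketched mechanically. A minimal sketch, where the mdstat variable is a made-up stand-in for the contents of /proc/mdstat:

```shell
# Find the first free /dev/mdN by scanning (a stand-in for) /proc/mdstat.
mdstat="md0 : active raid1 sda1[0] sdb1[1]
md1 : active raid5 sdc1[0] sdd1[1] sde1[2]"
n=0
while printf '%s\n' "$mdstat" | grep -q "^md$n "; do
  n=$((n+1))
done
echo "next free array: /dev/md$n"
```

On a real system one would read /proc/mdstat directly instead of the stand-in variable.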
Re: Deleting mdadm RAID arrays
Janek Kozicki wrote:
> Marcin Krol said: (by the date of Tue, 5 Feb 2008 11:42:19 +0100)
>> 2. How can I delete that damn array so it doesn't hang my server up in a loop?
>
> dd if=/dev/zero of=/dev/sdb1 bs=1M count=10

This works provided the superblocks are at the beginning of the component devices. Which is not the case by default (0.90 superblocks, at the end of components), or with 1.0 superblocks.

  mdadm --zero-superblock /dev/sdb1

is the way to go here.

> I'm not using mdadm.conf at all. Everything is stored in the superblock of
> the device. So if you don't erase it - info about the raid array will
> still be found automatically.

That's wrong, as you need at least something to identify the array components. UUID is the most reliable and commonly used. You assemble the arrays as

  mdadm --assemble /dev/md1 --uuid=123456789

or something like that anyway. If not, your arrays may not start properly in case you shuffled disks (e.g. replaced a bad one), or your disks were renumbered after a kernel or other hardware change, and so on. The most convenient place to store that info is mdadm.conf. Here, it looks just like:

  DEVICE partitions
  ARRAY /dev/md1 UUID=4ee58096:e5bc04ac:b02137be:3792981a
  ARRAY /dev/md2 UUID=b4dec03f:24ec8947:1742227c:761aa4cb

By default mdadm offers additional information which helps to diagnose possible problems, namely:

  ARRAY /dev/md5 level=raid5 num-devices=4 UUID=6dc4e503:85540e55:d935dea5:d63df51b

This extra info isn't necessary for mdadm to work (but UUID is), yet it comes in handy sometimes.

/mjt
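Why the dd-over-the-start trick misses a 0.90 superblock: 0.90 metadata sits near the end of the device, at the last 64 KiB boundary minus 64 KiB. A small sketch of that arithmetic, with a made-up device size:

```shell
# 0.90 superblock offset: device size rounded down to 64 KiB, minus 64 KiB.
# (The size here is just an example.)
size=1048576000                        # example device size in bytes
sb_offset=$(( (size & ~65535) - 65536 ))
echo "0.90 superblock at byte $sb_offset of $size"
```

Which is why mdadm --zero-superblock, which knows where to look, is the reliable tool.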
Re: Deleting mdadm RAID arrays
Moshe Yudkowsky wrote:
> Michael Tokarev wrote:
>> Janek Kozicki wrote:
>>> Marcin Krol said: (by the date of Tue, 5 Feb 2008 11:42:19 +0100)
>>>> 2. How can I delete that damn array so it doesn't hang my server up in a loop?
>>> dd if=/dev/zero of=/dev/sdb1 bs=1M count=10
>> This works provided the superblocks are at the beginning of the component
>> devices. Which is not the case by default (0.90 superblocks, at the end
>> of components), or with 1.0 superblocks.
>>   mdadm --zero-superblock /dev/sdb1
>
> Would that work even if he doesn't update his mdadm.conf inside the /boot
> image? Or would mdadm attempt to build the array according to the
> instructions in mdadm.conf? I expect that it might depend on whether the
> instructions are given in terms of UUID or in terms of devices.

After zeroing superblocks, mdadm will NOT assemble the array, regardless of whether it's identified by UUIDs or devices or whatever. In order to assemble the array, all component devices MUST have valid superblocks, and the superblocks must match each other. mdadm --assemble in initramfs will simply fail to do its work.

/mjt
Re: Auto generation of mdadm.conf
Janek Kozicki wrote:
> Michael Tokarev said: (by the date of Tue, 05 Feb 2008 16:52:18 +0300)
>> Janek Kozicki wrote:
>>> I'm not using mdadm.conf at all.
>> That's wrong, as you need at least something to identify the array
>> components.
>
> I was afraid of that ;-) So, is this a correct way to automatically
> generate a correct mdadm.conf? I did it after some digging in man pages:
>
>   echo 'DEVICE partitions' > mdadm.conf
>   mdadm --examine --scan --config=mdadm.conf >> ./mdadm.conf
>
> Now, when I do 'cat mdadm.conf' I get:
>
>   DEVICE partitions
>   ARRAY /dev/md/0 level=raid1 metadata=1 num-devices=3 UUID=75b0f87879:539d6cee:f22092f4:7a6e6f name='backup':0
>   ARRAY /dev/md/2 level=raid1 metadata=1 num-devices=3 UUID=4fd340a6c4:db01d6f7:1e03da2d:bdd574 name=backup:2
>   ARRAY /dev/md/1 level=raid5 metadata=1 num-devices=3 UUID=22f22c3599:613d5231:d407a655:bdeb84 name=backup:1
>
> Hmm. I wonder why the name for md/0 is in quotes, while the others are not.

Looks quite reasonable.

> Should I append it to /etc/mdadm/mdadm.conf ?

Probably... see below.

> This file currently contains: (commented lines are left out)
>
>   DEVICE partitions
>   CREATE owner=root group=disk mode=0660 auto=yes
>   HOMEHOST <system>
>   MAILADDR root
>
> This is the default content of /etc/mdadm/mdadm.conf on a fresh debian
> etch install.

But now I wonder HOW your arrays get assembled in the first place. Let me guess... mdrun? Or maybe in-kernel auto-detection? The thing is that mdadm will NOT assemble your arrays given this config.

If you have your disk/controller and md drivers built into the kernel, AND marked the partitions as linux raid autodetect, the kernel may assemble them right at boot. But I don't remember if the kernel will even consider v.1 superblocks for its auto-assembly. In any case, don't rely on the kernel to do this work: the in-kernel assembly code is very simplistic and works only up to the moment when anything changes/breaks. It's almost the same code as was in the old raidtools...

Another possibility is the mdrun utility (shell script) shipped with Debian's mdadm package.
It's deprecated now, but still provided for compatibility. mdrun is even worse: it will try to assemble ALL arrays found, giving them random names and numbers, not handling failures correctly, and failing badly in case, e.g., a foreign disk is found which happens to contain a valid raid superblock somewhere...

Well. There's another, 3rd possibility: mdadm can assemble all arrays automatically (even if not listed explicitly in mdadm.conf) using homehost (only available with v.1 superblocks). I haven't tried this option yet, so don't remember how it works. From the mdadm(8) manpage:

  Auto Assembly
    When --assemble is used with --scan and no devices are listed, mdadm
    will first attempt to assemble all the arrays listed in the config
    file.

    If a homehost has been specified (either in the config file or on the
    command line), mdadm will look further for possible arrays and will
    try to assemble anything that it finds which is tagged as belonging
    to the given homehost. This is the only situation where mdadm will
    assemble arrays without being given specific device name or identity
    information for the array.

    If mdadm finds a consistent set of devices that look like they should
    comprise an array, and if the superblock is tagged as belonging to
    the given home host, it will automatically choose a device name and
    try to assemble the array. If the array uses version-0.90 metadata,
    then the minor number as recorded in the superblock is used to create
    a name in /dev/md/ so for example /dev/md/3. If the array uses
    version-1 metadata, then the name from the superblock is used to
    similarly create a name in /dev/md (the name will have any 'host'
    prefix stripped first).

So... probably this is the way your arrays are being assembled, since you do have HOMEHOST in your mdadm.conf... Looks like it should work, after all... ;) And in this case there's no need to specify additional array information in the config file.
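The "host prefix stripped" rule from the manpage quote can be sketched in plain shell (this is not mdadm code; the name and hostname are taken from the conf shown earlier in the thread):

```shell
# A v1 superblock name "backup:1" on a host named "backup" becomes /dev/md/1.
name="backup:1"
homehost="backup"
node="${name#${homehost}:}"   # strip the "host:" prefix, if it matches
echo "/dev/md/$node"
```

On a foreign host the prefix would not match, the name would stay as "backup:1", and mdadm would not auto-assemble the array.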
/mjt
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
Linda Walsh wrote:
> Michael Tokarev wrote:
>> Unfortunately a UPS does not *really* help here. Because unless it has a
>> control program which properly shuts the system down on loss of input
>> power, and the battery really has the capacity to power the system while
>> it's shutting down (anyone tested this? With a new UPS? and after a year
>> of use, when the battery is not new?), -- unless the UPS actually has
>> the capacity to shut the system down, it will cut the power at an
>> unexpected time, while the disk(s) still have dirty caches...
>
> If you have a SmartUPS by APC, there is a freeware daemon that monitors
> [...] (I must say, I am not connected to or paid by APC.)

Good stuff. I knew at least SOME UPSes are good... ;) Too bad I rarely see such stuff in use by regular home users...

> []
>> Note also that with linux software raid barriers are NOT supported.
>
> Are you sure about this? When my system boots, I used to have 3 new IDEs,
> and one older one. XFS checked each drive for barriers and turned off
> barriers for a disk that didn't support them. ... or are you referring
> specifically to linux-raid setups?

I'm referring specifically to linux-raid setups (software raid). md devices don't support barriers, for a very simple reason: once more than one disk drive is involved, the md layer can't guarantee ordering ACROSS drives either.

The problem is that in case of power loss during writes, when an array needs recovery/resync (at least for the parts which were being written, if bitmaps are in use), the md layer will choose an arbitrary drive as the master and copy its data to the other drive (speaking of the simplest case of a 2-drive raid1 array). But the thing is that one drive may have the last two barriers written (I mean the data that was associated with the barriers), and the other neither of the two - in two different places. And hence we may see quite some inconsistency here. This is regardless of whether the underlying component devices support barriers or not.
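The resync hazard just described can be shown with a toy model (nothing to do with the real md code; the write names are made up): each mirror half persisted a different subset of writes before the power cut, and resync blindly copies an arbitrarily chosen master over the other half.

```shell
# Two raid1 halves after a power cut - each persisted a different subset:
driveA="w1 w2"      # drive A got writes 1 and 2
driveB="w1 w3"      # drive B got writes 1 and 3
master=$driveA      # md picks an arbitrary master for resync...
driveB=$master      # ...and copies it over the other half
echo "after resync: A='$driveA' B='$driveB'"
```

Whichever half is chosen as master, the final state can honor neither drive's barrier ordering, since each half is missing writes the other completed.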
> Would it be possible on boot to have xfs probe the Raid array,
> physically, to see if barriers are really supported (or not), and disable
> them if they are not (and optionally disabling write caching, but that's
> a major performance hit in my experience)?

Xfs already probes the devices as you describe, exactly the same way as you've seen with your ide disks, and disables barriers. The question and confusion were about what happens when the barriers are disabled (provided, again, that we don't rely on a UPS and other external things).

As far as I understand, when barriers are working properly, xfs should be safe wrt power losses (still a bit unsure about this). Now, when barriers are turned off (for whatever reason), is it still as safe? I don't know. Does it use regular cache flushes in place of barriers in that case (which ARE supported by the md layer)?

Generally, it has been said numerous times that XFS is not powercut-friendly, and that it should be used where everything is stable, including power. Hence I'm afraid to deploy it where I know the power is not stable (we've about 70 such places here, with servers in each, where they don't always replace UPS batteries in time - ext3fs never crashed so far, while ext2 did).

Thanks.

/mjt
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
Moshe Yudkowsky wrote:
> [] But that's *exactly* what I have -- well, 5GB -- and which failed.
> I've modified /etc/fstab to use data=journal (even on root, which I
> thought wasn't supposed to work without a grub option!) and I can
> power-cycle the system and bring it up reliably afterwards.

Note also that data=journal effectively doubles the write time. It's a bit faster for small writes (because all writes first go into the journal, i.e. into the same place, so no seeking is needed), but for larger writes, the journal will become full and the data in it needs to be written to its proper place, to free space for new data. Here, if you keep writing, you will see more than 2x speed degradation, because of a) double writes, and b) more seeking.

/mjt
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
Moshe Yudkowsky wrote:
> [] If I'm reading the man pages, Wikis, READMEs and mailing lists
> correctly -- not necessarily the case -- the ext3 file system uses the
> equivalent of data=journal as a default.

ext3 defaults to data=ordered, not data=journal. ext2 doesn't have a journal at all.

> The question then becomes what data scheme to use with reiserfs on the

I'd say don't use reiserfs in the first place ;)

> Another way to phrase this: unless you're running data-center grade
> hardware and have absolute confidence in your UPS, you should use
> data=journal for reiserfs and perhaps avoid XFS entirely.

By the way, even if you do have a good UPS, there should be some control program for it, to properly shut down your system when the UPS loses AC power. So far, I've seen no such programs...

/mjt
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
Eric Sandeen wrote:
> Moshe Yudkowsky wrote:
>> So if I understand you correctly, you're stating that currently the most
>> reliable fs in its default configuration, in terms of protection against
>> power-loss scenarios, is XFS?
>
> I wouldn't go that far without some real-world poweroff testing, because
> various fs's are probably more or less tolerant of a write-cache
> evaporation. I suppose it'd depend on the size of the write cache as well.

I know of no filesystem which is, as you say, tolerant of write-cache evaporation. If a drive says the data is written but in fact it's not, it's a Bad Drive (tm) and should be thrown away immediately. Fortunately, almost all modern disk drives don't lie this way. The only thing the filesystem needs to do is tell the drive to flush its cache at the appropriate time, and actually wait for the flush to complete. Barriers (mentioned in this thread) are just another, somewhat more efficient, way to do so, but a normal cache flush will do as well. IFF the write caching is enabled in the first place - note that with some workloads, write caching in the drive actually makes write speed worse, not better - namely, in case of massive writes.

Speaking of XFS (and of ext3fs with write barriers enabled) - I'm confused here as well, and the answers to my questions didn't help either. As far as I understand, XFS only uses barriers, not regular cache flushes, hence without write barrier support (which is absent for linux software raid, as explained elsewhere) it's unsafe -- and probably the same applies to ext3 with barrier support enabled. But I'm not sure I got it all correctly.

/mjt
Re: In this partition scheme, grub does not find md information?
John Stoffel wrote:
> [] C'mon, how many of you are programmed to believe that 1.2 is better
> than 1.0? But when they're not different, just different placements, then
> it's confusing.

Speaking of the "more is better" thing... There were quite a few bugs fixed in recent months wrt version 1 superblocks - both in the kernel and in mdadm. The 0.90 format has been stable for a very long time, and unless you're hitting its limits (namely, max 26 drives in an array, no homehost field), there's nothing which makes v1 superblocks better than 0.90 ones. In my view, better = stable first, faster/easier/whatever second.

/mjt
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
Eric Sandeen wrote:
> []
> http://oss.sgi.com/projects/xfs/faq.html#nulls
>
> and note that recent fixes have been made in this area (also noted in the
> faq)
>
> Also - the above all assumes that when a drive says it's written/flushed
> data, that it truly has. Modern write-caching drives can wreak havoc with
> any journaling filesystem, so that's one good reason for a UPS.

Unfortunately a UPS does not *really* help here. Because unless it has a control program which properly shuts the system down on loss of input power, and the battery really has the capacity to power the system while it's shutting down (anyone tested this? With a new UPS? and after a year of use, when the battery is not new?), -- unless the UPS actually has the capacity to shut the system down, it will cut the power at an unexpected time, while the disk(s) still have dirty caches...

> If the drive claims to have metadata safe on disk but actually does not,
> and you lose power, the data claimed safe will evaporate; there's not
> much the fs can do. IO write barriers address this by forcing the drive
> to flush order-critical data before continuing; xfs has them on by
> default, although they are tested at mount time, and if you have
> something in between xfs and the disks which does not support barriers
> (i.e. lvm...) then they are disabled again, with a notice in the logs.

Note also that with linux software raid barriers are NOT supported.

/mjt
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
Moshe Yudkowsky wrote:
> I've been reading the draft and checking it against my experience.
> Because of local power fluctuations, I've just accidentally checked my
> system: My system does *not* survive a power hit. This has happened twice
> already today.
>
> I've got /boot and a few other pieces in a 4-disk RAID 1 (three running,
> one spare). This partition is on /dev/sd[abcd]1. I've used grub to
> install grub on all three running disks:
>
>   grub --no-floppy <<EOF
>   root (hd0,1)
>   setup (hd0)
>   root (hd1,1)
>   setup (hd1)
>   root (hd2,1)
>   setup (hd2)
>   EOF
>
> (To those reading this thread to find out how to recover: According to
> grub's map option, /dev/sda1 maps to hd0,1.)

I usually install all the drives identically in this regard - each to be treated as the first bios disk (disk 0x80). As already pointed out in this thread, not all BIOSes are able to boot off a second or third disk, so if your first disk (sda) fails, your only option is to put your sdb into the place of sda and boot from it - and this way, grub needs to think it's the first boot drive too.

By the way, lilo works here more easily and more reliably. You just install a standard mbr (lilo has one too) which simply boots from the active partition, install lilo onto the raid array, and tell it NOT to do anything fancy with raid at all (raid-extra-boot none). But for this to work, you have to have identical partitions with identical offsets - at least for the boot partitions.

> After the power hit, I get:
>
>   Error 16
>   Inconsistent filesystem mounted

But did it actually mount it?

> I then tried to boot up on hd1,1 and hd2,1 -- none of them worked.

Which is in fact expected after the above. You have 3 identical copies (thanks to raid) of your boot filesystem, all 3 equally broken. When it boots, it assembles your /boot raid array - the same regardless of whether you boot off the first, second or third disk.

> The culprit, in my opinion, is the reiserfs file system.
> During the power hit, the reiserfs file system of /boot was left in an
> inconsistent state; this meant I had up to three bad copies of /boot.

I've never seen any problem with ext[23] wrt unexpected power loss, so far - running several hundreds of different systems, some since 1998, some since 2000. Sure, there were several inconsistencies, and sometimes (maybe once or twice) some minor data loss (only a few newly created files were lost), but the most serious was finding a few items in lost+found after an fsck - and that's ext2; I've never seen that with ext3. More, I tried hard to force a power failure at an unexpected time, by doing massive write operations and cutting power while at it - I was never able to trigger any problem this way, at all. In any case, even if ext[23] is somewhat damaged, it can still be mounted - access to some files may return I/O errors (in the parts where it's really damaged), but the rest will work.

On the other hand, I had several immediate issues with reiserfs. It was a long time ago, when the filesystem was first included into the mainline kernel, so that doesn't reflect the current situation. Yet even at that stage, reiserfs was declared stable by the authors. Issues were trivially triggerable by cutting the power at an unexpected time, and fsck didn't help several times. So I tend to avoid reiserfs - due to my own experience, and due to numerous problems reported elsewhere.

> Recommendations:
>
> 1. I'm going to try adding a data=journal option to the reiserfs file
> systems, including the /boot. If this does not work, then /boot must be
> ext3 in order to survive a power hit.

By the way, if your /boot is a separate filesystem (ie, there's nothing more there), I see absolutely no reason for it to crash. /boot is modified VERY rarely (only when installing a kernel), and only while it's being modified is there a chance for it to be damaged somehow. During the rest of the time it's constant, and any power cut should not hurt it at all.
If even for a non-modified filesystem reiserfs shows such behaviour (

> 2. We discussed what should be on the RAID1 bootable portion of the
> filesystem. True, it's nice to have the ability to boot from just the
> RAID1 portion. But if that RAID1 portion can't survive a power hit,
> there's little sense. It might make a lot more sense to put /boot on its
> own tiny partition.

Hehe. /boot doesn't matter really. Separate /boot partitions were used for 3 purposes:

1) to work around bios 1024th-cylinder issues (long gone with LBA)

2) to be able to put the rest of the system onto an unsupported-by-bootloader filesystem/raid/lvm/etc. Like, lilo didn't support reiserfs (and still doesn't with tail packing enabled), so if you want to use reiserfs for your root fs, put /boot into a separate ext2fs. The same is true for raid - you can put the rest of the system into a raid5 array (unsupported by grub/lilo), and in order to boot, create a small raid1 (or other supported level) /boot.

3) to keep it as non-volatile as possible. Like, an area of the disk which never changes (except in a few very rare cases). For
Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)
Moshe Yudkowsky wrote:
> Michael Tokarev wrote:
>> Speaking of repairs. As I already mentioned, I always use a small
>> (256M..1G) raid1 array for my root partition, including /boot, /bin,
>> /etc, /sbin, /lib and so on (/usr, /home, /var are on their own
>> filesystems). And I had the following scenarios happen already:
>
> But that's *exactly* what I have -- well, 5GB -- and which failed. I've
> modified /etc/fstab to use data=journal (even on root, which I thought
> wasn't supposed to work without a grub option!) and I can power-cycle the
> system and bring it up reliably afterwards.
>
> So I'm a little suspicious of this theory that /etc and others can be on
> the same partition as /boot in a non-ext3 file system.

If even your separate /boot failed (which should NEVER fail), what to say about the rest? I mean, if you save your /boot, what help will it be if your root fs is damaged? That's why I said /boot is mostly irrelevant.

Well. You can have some recovery stuff in your initrd/initramfs - that's for sure (and for that to work, you can make your /boot more reliable by creating a separate filesystem for it). But if you go this route, it's better to boot off some recovery CD instead of attempting recovery with the very limited toolset available in your initramfs.

/mjt
Re: In this partition scheme, grub does not find md information?
Moshe Yudkowsky wrote:
> [] Mr. Tokarev wrote:
>> By the way, on all our systems I use a small (256Mb for small-software
>> systems, sometimes 512M, but 1G should be sufficient) partition for a
>> root filesystem (/etc, /bin, /sbin, /lib, and /boot), and put it on a
>> raid1 on all... ... doing [it] this way, you always have all the tools
>> necessary to repair a damaged system, even in case your raid didn't
>> start, or you forgot where your root disk is etc etc.
>
> An excellent idea. I was going to put just /boot on the RAID 1, but
> there's no reason why I can't add a bit more room and put them all there.
> (Because I was having so much fun on the install, I'm using 4GB that I
> was going to use for swap space to mount a base install, and I'm working
> from there to build the RAID. Same idea.)
>
> Hmmm... I wonder if this more expansive /bin, /sbin, and /lib causes hits
> on the RAID1 drive which ultimately degrade overall performance? /lib is
> hit only at boot time to load the kernel, I'll guess, but /bin includes
> such common tools as bash and grep.

You don't care about the speed of your root filesystem. Note there are two speeds - write and read. You only write to root (including /bin and /lib and so on) during software (re)installs and during some configuration work (writing /etc/passwd and the like). The first is very infrequent, and both need only a few writes, so write speed isn't important. Read speed is also not that important, because the most commonly used stuff from there will be cached anyway (like libc.so, bash and grep), and again, for reading such tiny stuff it doesn't matter whether it's a fast raid or a slow one.

What you do care about is the speed of the devices where your large, commonly accessed/modified files - such as video files, esp. when you want streaming video - reside. And even here, unless you've got special speed requirements, you will not notice any difference between slow and fast raid levels. For typical filesystem usage, raid5 works well for both reads and (cached, delayed) writes.
It's workloads like databases where raid5 performs badly.

What you do care about is your data integrity. It's not much fun to reinstall a system or lose your data when something goes wrong, so it's best to have recovery tools as easily available as possible. Plus, the amount of space you need.

Also, placing /dev on a tmpfs helps a lot to minimize the number of writes to the root fs.

> Another interesting idea. I'm not familiar with using tmpfs (no need,
> until now); but I wonder how you create the devices you need when you're
> doing a rescue.

When you start udev, your /dev will be on tmpfs.

/mjt
Re: WRONG INFO (was Re: In this partition scheme, grub does not find md information?)
Peter Rabbitson wrote:
> Moshe Yudkowsky wrote:
>> over the other. For example, I've now learned that if I want to set up a
>> RAID1 /boot, it must actually be 1.2 or grub won't be able to read it.
>> (I would therefore argue that if the new version ever becomes default,
>> then the default sub-version ought to be 1.2.)
>
> In the discussion yesterday I myself made a serious typo, that should not
> spread. The only superblock version that will work with current GRUB is
> 1.0 _not_ 1.2.

Ghrrm. 1.0, or 0.9. 0.9 is still the default with mdadm.

/mjt
Re: In this partition scheme, grub does not find md information?
Keld Jørn Simonsen wrote:
>> [] Ugh. A 2-drive raid10 is effectively just a raid1. I.e, mirroring
>> without any striping. (Or, backwards, striping without mirroring).
>
> uhm, well, I did not understand: (Or, backwards, striping without
> mirroring). I don't think a 2 drive vanilla raid10 will do striping.
> Please explain.

I was referring to raid0+1 here - a mirror of stripes. Which makes no sense on its own, but when we create such a thing on only 2 drives, it becomes just raid0... Backwards as in raid1+0 vs raid0+1. This is just to show that various raid levels, in corner cases, tend to transform from one into another.

>> Pretty much like with raid5 of 2 disks - it's the same as raid1.
>
> I think in raid5 of 2 disks, half of the chunks are parity chunks which
> are evenly distributed over the two disks, and the parity chunk is the
> XOR of the data chunk. But maybe I am wrong. Also the behaviour of such a
> raid5 is different from a raid1, as the parity chunk is not used as data.

With N-disk raid5, parity in a row is calculated by XORing together the data from all the rest of the disks (N-1), ie, P = D1 ^ ... ^ D(N-1). In the case of 2-disk raid5 (also a corner case), the above formula becomes just P = D1. So, the parity block in each row contains exactly the same data as the data block, effectively turning the whole thing into a raid1 of two disks. Sure, in raid5 the parity blocks are called just that - parity - but in reality that parity is THE SAME as the data (again, in case of only 2-disk raid5).

> I am not sure what properties a vanilla linux raid10 (near=2, far=1) has.
> I think it can run with only 1 disk, but I think the number of copies
> should be = the number of disks, so no. I have a clear understanding that
> in a vanilla linux raid10 (near=2, far=1) you can run with one failing
> disk, that is with only one working disk. Am I wrong?

In fact, with all sorts of raid10, it's not only the number of drives that can fail that matters, but also WHICH drives can fail.
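The XOR arithmetic above is easy to check with a few lines of shell (the byte values here are made up for illustration): parity is the XOR of the data blocks, and a lost block is recovered by XORing parity with the survivors.

```shell
# raid5 parity over three data blocks (single bytes here for brevity):
d1=165; d2=60; d3=95
p=$(( d1 ^ d2 ^ d3 ))        # parity: P = D1 ^ D2 ^ D3
# the drive holding d2 fails; recompute it from parity and the survivors:
r=$(( p ^ d1 ^ d3 ))
echo "reconstructed d2 = $r (original was $d2)"
```

In the 2-disk corner case there is only one data block per row, so p would equal d1 exactly, which is the raid1-equivalence argued above.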
In classic raid10:

  DiskA DiskB DiskC DiskD
    0     0     1     1
    2     2     3     3
    4     4     5     5

(where numbers are the data blocks), you can have only 2 working disks (ie, 2 failed), but only from different pairs. You can't have A and B failed and C and D working, for example - you'll lose half the data and thus the filesystem. You can have A and C failed however, or A and D, or BC, or BD. You see - in the above example, all numbers (data blocks) should be present at least once (after you pull a drive or two or more). If some numbers don't appear at all, your raid array is dead. Now write out the layout you want to use like the above, try removing some drives, and see if you still have all numbers. For example, with 3-disk linux raid10:

  A B C
  0 0 1
  1 2 2
  3 3 4
  4 5 5

We can't pull 2 drives anymore here. Eg, pulling AB removes 0 and 3. Pulling BC removes 2 and 5. AC = 1 and 4. With 5-drive linux raid10:

  A  B  C  D  E
  0  0  1  1  2
  2  3  3  4  4
  5  5  6  6  7
  7  8  8  9  9
  10 10 11 11 12
  ...

AB can't be removed - losing 0 and 5. AC CAN be removed, as can AD. But not AE - losing 2 and 7. And so on. 6-disk raid10 with 3 copies of each block (near=3 with linux):

  A B C D E F
  0 0 0 1 1 1
  2 2 2 3 3 3

It can run as long as at least one disk from each triple (ABC and DEF) is present. Ie, you can lose up to 4 drives, as long as that condition holds. But if you lose just 3 - all of ABC or all of DEF - it can't work anymore. The same goes for raid5 and raid6, but they're symmetric -- any single (raid5) or double (raid6) disk failure is Ok. The principle is this: raid5: P = D1^D2^D3^...^D(N-1), so you either have all Di (nothing to reconstruct), or you have all but one Di AND P - in which case the missing Dm can be recalculated as Dm = P^D1^...^D(m-1)^D(m+1)^...^D(N-1) (ie, a XOR of all the remaining blocks including parity). (Exactly the same applies to raid4, because each row in raid4 is identical to that of raid5; the difference is that the parity disk differs from row to row in raid5, while in raid4 it stays the same.)
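The which-drives-can-fail check above is easy to mechanize. Here is a small sketch of mine (not from the original mail) that places blocks round-robin in the linux raid10 "near" layout and tests whether a given set of failed disks still leaves every block readable:

```python
def raid10_layout(ndisks, copies=2, nblocks=12):
    """Place each block 'copies' times, round-robin across disks
    (the linux raid10 'near' layout)."""
    disks = [[] for _ in range(ndisks)]
    slot = 0
    for blk in range(nblocks):
        for _ in range(copies):
            disks[slot % ndisks].append(blk)
            slot += 1
    return disks

def survives(disks, failed):
    """True if every block is still present on at least one surviving disk."""
    alive = set()
    for i, d in enumerate(disks):
        if i not in failed:
            alive.update(d)
    return alive == set().union(*disks)

layout = raid10_layout(4)            # classic 4-disk raid10: pairs A-B, C-D
print(survives(layout, {0, 2}))      # A and C failed -> True
print(survives(layout, {0, 1}))      # A and B failed -> False
```

Running the 3- and 5-disk cases from the text through the same function reproduces the conclusions above (AB fatal on 3 disks, AE fatal on 5).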
I won't write out the formula for raid6 as it's somewhat more complicated, but the effect is the same - any data block can be reconstructed from any N-2 drives. /mjt - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
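The XOR reconstruction rule is easy to demonstrate. A minimal sketch (mine, not from the mail), using random 16-byte "blocks" in place of real sectors:

```python
import functools
import os

def xor(blocks):
    """Byte-wise XOR of equal-sized blocks."""
    return bytes(functools.reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# N-1 data blocks plus one parity block per raid5 row
data = [os.urandom(16) for _ in range(3)]
parity = xor(data)

# lose any one data block: it is the XOR of everything that's left
missing = data[1]
rebuilt = xor([data[0], data[2], parity])
print(rebuilt == missing)            # True
```

XORing parity together with all the data blocks yields zeros, which is just another way of stating the same identity.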
Re: In this partition scheme, grub does not find md information?
Moshe Yudkowsky wrote: Peter Rabbitson wrote: It is exactly what the name implies - a new kind of RAID :) The setup you describe is not RAID10, it is RAID1+0. As far as how linux RAID10 works - here is an excellent article: http://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10 Thanks. Let's just say that the md(4) man page was finally penetrating my brain, but the Wikipedia article helped a great deal. I had thought md's RAID10 was more standard. It is exactly standard - when you create it with default settings and with an even number of drives (2, 4, 6, 8, ...), it will be exactly standard raid10 (or raid1+0, whatever) as described in various places on the net. But if you use an odd number of drives, or if you pass some fancy --layout option, it will be laid out differently. Still not suitable for lilo or grub, at least their current versions. /mjt
Re: In this partition scheme, grub does not find md information?
Keld Jørn Simonsen wrote: On Tue, Jan 29, 2008 at 09:57:48AM -0600, Moshe Yudkowsky wrote: In my 4 drive system, I'm clearly not getting 1+0's ability to use grub out of the RAID10. I expect it's because I used 1.2 superblocks (why not use the latest, I said, foolishly...) and therefore the RAID10 -- with an even number of drives -- can't be read by grub. If you'd patch that information into the man pages that'd be very useful indeed. If you have 4 drives, I think the right thing is to use a raid1 with 4 drives for your /boot partition. Then you can survive a crash of 3 disks! By the way, on all our systems I use a small partition (256MB for small-software systems, sometimes 512MB, but 1GB should be sufficient) for the root filesystem (/etc, /bin, /sbin, /lib, and /boot), and put it on a raid1 across all (usually identical) drives - be it 4 or 6 or more of them. The root filesystem does not change often, or at least its write speed isn't that important. But done this way, you always have all the tools necessary to repair a damaged system, even in case your raid didn't start, or you forgot where your root disk is, etc. But in this setup, /usr, /home, /var and so on should be separate partitions. Also, placing /dev on a tmpfs helps a lot to minimize the number of writes to the root fs. /mjt
Re: In this partition scheme, grub does not find md information?
Peter Rabbitson wrote: [] However if you want to be so anal about names and specifications: md raid 10 is not a _full_ 1+0 implementation. Consider the textbook scenario with 4 drives: (A mirroring B) striped with (C mirroring D) When only drives A and C are present, md raid 10 with near offset will not start, whereas standard RAID 1+0 is expected to keep clunking away. Ugh. Yes. The offset layout is a linux extension. But md raid 10 with the default n2 configuration (without offset) will behave exactly as in the classic docs. Again: the linux md raid10 module implements standard raid10 as known from all widely used docs. And IN ADDITION, it can do OTHER FORMS, which differ from the classic variant. Pretty much like a hardware raid card from a brand vendor probably implements its own variations of the standard raid levels. /mjt
Re: In this partition scheme, grub does not find md information?
Keld Jørn Simonsen wrote: On Tue, Jan 29, 2008 at 06:13:41PM +0300, Michael Tokarev wrote: The linux raid10 MODULE (which implements that standard raid10 LEVEL in full) adds some quite.. unusual extensions to that standard raid10 LEVEL. The resulting layout is also called raid10 in linux (ie, no new names), but it's not that raid10 (which is again the same as raid1+0) as commonly known in the literature and on the internet. Yet the raid10 module fully implements the STANDARD raid10 LEVEL. My understanding is that you can have a linux raid10 of only 2 drives, while standard RAID 1+0 requires 4 drives, so this is a huge difference. Ugh. 2-drive raid10 is effectively just a raid1. I.e., mirroring without any striping. (Or, backwards, striping without mirroring). So to say, raid1 is just one particular configuration of raid10 - with only one mirror. Pretty much like with raid5 of 2 disks - it's the same as raid1. I am not sure what properties vanilla linux raid10 (near=2, far=1) has. I think it can run with only 1 disk, but I think the number of copies should be = number of disks, so no. It does not have striping capabilities. It would be nice to have more info on this, eg in the man page. It's all in there really. See md(4). Maybe it's not that verbose, but it's not a user's guide (as in: a large book), after all. /mjt
Re: In this partition scheme, grub does not find md information?
Moshe Yudkowsky wrote: Michael Tokarev wrote: There are more-or-less standard raid LEVELS, including raid10 (which is the same as raid1+0, or a stripe on top of mirrors - note it does not mean 4 drives; you can use 6 - a stripe over 3 mirrors of 2 components each, or the reverse - a stripe over 2 mirrors of 3 components each, etc). Here's a baseline question: if I create a RAID10 array using default settings, what do I get? I thought I was getting RAID1+0; am I really? ..default settings AND an even (4, 6, 8, 10, ...) number of drives. It will be standard raid10, or raid1+0, which is the same: as many stripes of mirrored (2 copies) data as fit the number of disks. With an odd number of disks it obviously will be something else, not a standard raid10. My superblocks, by the way, are marked version 01; my metadata in mdadm.conf asked for 1.2. I wonder what I really got. The real question Ugh. Another source of confusion. In --metadata=1.2, 1 stands for the format, and 2 stands for the placement. So it's really format version 1. From mdadm(8): 1, 1.0, 1.1, 1.2 Use the new version-1 format superblock. This has few restrictions. The different sub-versions store the superblock at different locations on the device, either at the end (for 1.0), at the start (for 1.1) or 4K from the start (for 1.2). in my mind now is why grub can't find the info, and either it's because of 1.2 superblocks or because of sub-partitioning of components. As has been said numerous times in this thread, grub can't be used with anything but raid1 to start with (the same is true for lilo). Raid10 (or raid1+0, which is the same) - be it the standard or the linux extension format - is NOT raid1. /mjt
Re: In this partition scheme, grub does not find md information?
Peter Rabbitson wrote: Michael Tokarev wrote: Raid10 IS RAID1+0 ;) It's just that the linux raid10 driver can utilize more.. interesting ways to lay out the data. This is misleading, and adds to the confusion that existed even before linux raid10. When you say raid10 in the hardware raid world, what do you mean? Stripes of mirrors? Mirrors of stripes? Some proprietary extension? Mirrors of stripes makes no sense. What Neil did was generalize the concept to N drives - M copies, and called it 10 because it could exactly mimic the layout of conventional 1+0 [*]. However thinking about md level 10 in terms of RAID 1+0 is wrong. Two examples (there are many more): * mdadm -C -l 10 -n 3 -p f2 /dev/md10 /dev/sda1 /dev/sdb1 /dev/sdc1 ^ Those are interesting ways Odd number of drives, no parity calculation overhead, yet the setup can still suffer the loss of a single drive * mdadm -C -l 10 -n 2 -p f2 /dev/md10 /dev/sda1 /dev/sdb1 ^ And this one too. There are more-or-less standard raid LEVELS, including raid10 (which is the same as raid1+0, or a stripe on top of mirrors - note it does not mean 4 drives; you can use 6 - a stripe over 3 mirrors of 2 components each, or the reverse - a stripe over 2 mirrors of 3 components each, etc). Vendors often add their own extensions, sometimes keeping the original level's name, and sometimes giving them new names, especially in marketing speak. The linux raid10 MODULE (which implements that standard raid10 LEVEL in full) adds some quite.. unusual extensions to that standard raid10 LEVEL. The resulting layout is also called raid10 in linux (ie, no new names), but it's not that raid10 (which is again the same as raid1+0) as commonly known in the literature and on the internet. Yet the raid10 module fully implements the STANDARD raid10 LEVEL. /mjt
Re: In this partition scheme, grub does not find md information?
Peter Rabbitson wrote: Moshe Yudkowsky wrote: One of the puzzling things about this is that I conceive of RAID10 as two RAID1 pairs, with RAID0 on top to join them into a large drive. However, when I use --level=10 to create my md drive, I cannot find out which two pairs are the RAID1's: --detail doesn't give that information. Re-reading the md(4) man page, I think I'm badly mistaken about RAID10. Furthermore, since grub cannot find the /boot on the md drive, I deduce that RAID10 isn't what the 'net descriptions say it is. In fact, everything matches. For lilo to work, it basically needs the whole filesystem on the same physical drive. That is exactly the case with raid1 (and only raid1). With raid10, half of the filesystem is on one mirror, and the other half is on another mirror. Like this:

  filesystem        raid0 blocks
  blocks            DiskA  DiskB
     0                0
     1                       1
     2                2
     3                       3
     4                4
     5                       5
    ..
  (this is what     (this is the
   LILO expects)     actual layout)

(The difference between raid10 and raid0 is that each of DiskA and DiskB is in fact composed of two identical devices.) If your kernel is located in filesystem blocks number 2 and 3 for example, lilo has to read BOTH halves, but it is not smart enough to figure that out - it can only read everything from a single drive. It is exactly what the name implies - a new kind of RAID :) The setup you describe is not RAID10, it is RAID1+0. Raid10 IS RAID1+0 ;) It's just that the linux raid10 driver can utilize more.. interesting ways to lay out the data. /mjt
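To restate the diagram in code: on the striped top of raid1+0 (or plain raid0), consecutive filesystem blocks alternate between member devices, so a kernel image spanning two consecutive blocks necessarily spans both drives. A toy illustration of mine (the kernel block numbers are hypothetical, taken from the example above):

```python
def raid0_member(block_no, ndisks=2):
    """Member device holding a given block in a simple 2-way raid0 stripe."""
    return block_no % ndisks

kernel_blocks = [2, 3]                     # say the kernel sits in blocks 2 and 3
members = {raid0_member(b) for b in kernel_blocks}
print(members)                             # {0, 1}: LILO would have to read both drives
```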
Re: Fwd: Error on /dev/sda, but takes down RAID-1
Martin Seebach wrote: Hi, I'm not sure this is completely linux-raid related, but I can't figure out where to start: A few days ago, my server died. I was able to log in and salvage this content of dmesg: http://pastebin.com/m4af616df I talked to my hosting-people and they said it was an io-error on /dev/sda, and replaced that drive. After this, I was able to boot into a PXE-image and re-build the two RAID-1 devices with no problems - indicating that sdb was fine. I expected RAID-1 to be able to stomach exactly this kind of error - one drive dying. What did I do wrong? From that pastebin page. First, sdb failed for whatever reason:

  ata2.00: qc timeout (cmd 0xec)
  ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
  ata2.00: revalidation failed (errno=-5)
  ata2.00: disabled
  ata2: EH complete
  sd 1:0:0:0: SCSI error: return code = 0x0004
  end_request: I/O error, dev sdb, sector 80324865
  raid1: Disk failure on sdb1, disabling device.
        Operation continuing on 1 devices
  RAID1 conf printout:
   --- wd:1 rd:2
   disk 0, wo:0, o:1, dev:sda1
   disk 1, wo:1, o:0, dev:sdb1
  RAID1 conf printout:
   --- wd:1 rd:2
   disk 0, wo:0, o:1, dev:sda1

At this time, it started to (re)sync other(?) arrays for some reason:

  md: syncing RAID array md0
  md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
  md: using maximum available idle IO bandwidth (but not more than 20 KB/sec) for reconstruction.
  md: using 128k window, over a total of 40162432 blocks.
  md: md0: sync done.
  RAID1 conf printout:
   --- wd:1 rd:2
   disk 0, wo:0, o:1, dev:sda1
  md: syncing RAID array md1
  md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
  md: using maximum available idle IO bandwidth (but not more than 20 KB/sec) for reconstruction.
  md: using 128k window, over a total of 100060736 blocks.
Note again, errors on sdb:

  sd 1:0:0:0: SCSI error: return code = 0x0004
  end_request: I/O error, dev sdb, sector 112455000
  sd 1:0:0:0: SCSI error: return code = 0x0004
  end_request: I/O error, dev sdb, sector 112455256
  sd 1:0:0:0: SCSI error: return code = 0x0004
  end_request: I/O error, dev sdb, sector 112455512
  ...
  raid1: Disk failure on sdb3, disabling device.
        Operation continuing on 1 devices

so another md array detected the sdb failure. So we're left with sda only. And voilà, sda fails too, some time later:

  ata1: EH complete
  sd 0:0:0:0: SCSI error: return code = 0x0004
  end_request: I/O error, dev sda, sector 80324865
  sd 0:0:0:0: SCSI error: return code = 0x0004
  end_request: I/O error, dev sda, sector 115481
  ...

At this point, the arrays are hosed - all disks of each array have failed, and there's no data any more to read/write from/to. Since sda was later replaced, and sdb recovered from the errors (it still contains valid superblocks, but with somewhat stale information), everything went ok. But the original problem is that you had BOTH disks fail, not only one. What caused THIS problem is another question. Maybe overheating, or a power supply problem, or somesuch -- I don't know... But the md code worked the best it could here. /mjt
Re: Last ditch plea on remote double raid5 disk failure
Neil Brown wrote: On Monday December 31, [EMAIL PROTECTED] wrote: I'm hoping that if I can get raid5 to continue despite the errors, I can bring back up enough of the server to continue, a bit like the remount-ro option in ext2/ext3. If not, oh well... Sorry, but it is oh well. Speaking of all this bad block handling and dropping a device in case of errors. Sure, the situation here improved a LOT when rewriting a block in case of a read error was introduced. That was a very big step in the right direction. But it is still not sufficient, I think. What can be done currently is to extend the bitmap thing to keep more information. Namely, if a block on one drive fails, and we failed to rewrite it as well (or there was no way to rewrite it because the array was already running in degraded mode), don't drop the drive yet, but fail the original request AND mark THIS PARTICULAR BLOCK of THIS PARTICULAR DRIVE as bad in the bitmap. In other words, the bitmap can be extended to cover individual drives instead of the whole raid device. What's more - if there's no bitmap for the array, I mean no persistent bitmap, such a thing can still be done anyway, by keeping such a bitmap in memory only, up until the raid array is shut down (in which case mark the whole drives with errors as bad). This way, it's possible to recover a lot more data without risking losing the whole array at any time. Even more - up until some real write is performed over a bad block, there's no need to record its badness - we can return the same error again, as it's expected the drive will return it on the next read attempt. It's only a write - a real write - which makes this particular block become bad, as we weren't able to write new data to it... Hm. Even in case of a write failure, we can still keep the whole drive without marking anything as bad, again in the hope that the next read of those blocks will error out again. This is an..
interesting question really - whether one can rely on a drive to not return bad (read: random) data in case it errored a write operation. I definitely know a case where it's not true: we have a batch of seagate drives which seem to have a firmware bug, which errors out on write with a Defect list manipulation error sense code, yet reads of this very sector still return something, especially after a fresh boot (after a power-off). In any case, keeping this info in a bitmap should be sufficient to stop kicking whole drives out of an array, which currently is the weakest point in linux software raid (IMHO). As has been pointed out numerous times before, due to Murphy's laws or other factors such as the phase of the Moon (and partly this behaviour can be explained by the fact that after a drive failure, the other drives receive more I/O requests, esp. when reconstruction starts, and hence have a much greater chance to error out on sectors which had not been read in a long time), drives tend to fail several at once, and often it's trivial to read the missing information from a drive which has just been kicked off the array, at the place where another drive developed a bad sector. And another thought around all this. Linux sw raid definitely needs a way to proactively replace a (probably failing) drive, without removing it from the array first. Something like: mdadm --add /dev/md0 /dev/sdNEW --inplace /dev/sdFAILING so that sdNEW will be a mirror of sdFAILING, and once the recovery procedure finishes (which may use data from other drives in case of an I/O error reading sdFAILING - unlike the described scenario of making a superblock-less mirror of sdNEW and sdFAILING), mdadm --remove /dev/md0 /dev/sdFAILING, which does not involve any further reconstruction. /mjt
Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?
Justin Piszcz wrote: [] Good to know/have it confirmed by someone else, the alignment does not matter with Linux/SW RAID. Alignment matters when one partitions a Linux/SW raid array. If the inside partitions are not aligned on a stripe boundary - especially in the worst case when filesystem blocks cross the stripe boundary (wonder if that's ever possible... and I think it is: if a partition starts at some odd 512-byte boundary and the filesystem block size is 4Kb) - there's just no chance for the inside filesystem to ever do full-stripe writes, so (modulo stripe cache size) all writes will go the read-modify-write or similar way. And that's what the original article is about, by the way. It just happens that a hardware raid array is more often split into partitions (using native tools) than a linux software raid array. And that's what has been pointed out in this thread, as well... ;) /mjt
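The misalignment argument can be sanity-checked with simple arithmetic. A sketch of mine with assumed sizes (4 KiB filesystem blocks, a 64 KiB stripe, and the classic 63-sector DOS partition offset):

```python
SECTOR = 512
FS_BLOCK = 4096
STRIPE = 64 * 1024
PART_OFFSET = 63 * SECTOR            # classic DOS-style partition start

def crosses_stripe(fs_block_no, offset=PART_OFFSET):
    """Does this filesystem block straddle a stripe boundary on the array?"""
    start = offset + fs_block_no * FS_BLOCK
    end = start + FS_BLOCK - 1
    return start // STRIPE != end // STRIPE

# With the 63-sector offset some blocks inevitably straddle stripes;
# with a stripe-aligned partition (offset 0) none do.
print(any(crosses_stripe(b) for b in range(64)))             # True
print(any(crosses_stripe(b, offset=0) for b in range(64)))   # False
```

Every straddling block turns what could be a single-stripe write into a read-modify-write on two stripes, which is the performance hit discussed in this thread.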
Re: raid10: unfair disk load?
maobo wrote: Hi, all. Yes, raid10 read balancing is shortest-position-time-first, also considering the sequential access condition. But its performance is really poor in my tests compared to raid0. Single-stream write performance of raid0, raid1 and raid10 should be of a similar level (with raid5 and raid6 things are different) -- in all 3 cases, it should be near the write speed of a single drive. The only possible problematic case is when you've got some unlucky hardware which does not permit writing to two drives in parallel - in which case raid1 and raid10 write speed will be less than that of raid0 and a single drive. But even ol'good IDE drives/controllers, even with two disks on the same channel, permit parallel writes. Modern SATA and SCSI/SAS should be no problem - hopefully, modulo (theoretically) some very cheap lame controllers. I think this is the process flow raid10 influence. But RAID0 is so simple and performs very well! From this point striping is better than mirroring! RAID10 is stripe+mirror, but for writes it performs really worse than RAID0, doesn't it? No it's not - when the hardware (and drivers) are sane, anyway. Also, speed is a very subjective thing, so to say - it very much depends on the workload. /mjt
Re: raid10: unfair disk load?
Michael Tokarev wrote: I just noticed that with Linux software RAID10, disk usage isn't equal at all, that is, most reads are done from the first part of mirror(s) only. Attached (disk-hour.png) is a little graph demonstrating this (please don't blame me for the poor choice of colors and the like - this stuff is in the works right now; it's the first rrd graph I produced :). There's a 14-drive RAID10 array and 2 more drives. In the graph it's clearly visible that there are 3 kinds of load for drives, because the graphs for individual drives are stacked on each other forming 3 sets. One set (the 2 remaining drives) isn't interesting, but the 2 main ones (with many individual lines) are. Ok, looks like vger.kernel.org dislikes png attachments. I won't represent the graphs as ascii-art, and it's really not necessary -- see below. The 7 drives with higher utilization receive almost all reads - the second half of the array only gets reads sometimes. And all 14 drives - obviously - receive all writes. So the picture (modulo that sometimes above, which is too small to take into account) is: writes go to all drives, while reads are done from the first half of each pair only. Also attached are two graphs for individual drives, one from the first half of the array (diskrq-sdb-hour.png), which receives almost all reads (the other disks look pretty much the same), and one from the second half (diskrq-sdl-hour.png), which receives very few reads. The graphs show the number of disk transactions per second, separately for reads and writes. Here's a typical line from iostat -x:

  Device:  rrqm/s  wrqm/s    r/s   w/s   rsec/s  wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
  sdb        0,32    0,03  22,16  5,84  2054,79  163,74     79,21      0,20   7,29   4,33  12,12
  sdk        0,38    0,03   6,28  5,84   716,61  163,74     72,66      0,15  12,29   5,55   6,72

where sdb and sdk are the two halves of the same raid1 part of a raid10 array - i.e., the contents of the two are identical.
As shown, write requests are the same for the two, but read requests mostly go to sdb (the first half), and very few to sdk (the second half). Should raid10 balance reads too, maybe in a way similar to what raid1 does? The kernel is 2.6.23, but very similar behavior is shown by earlier kernels as well. Raid10 stripe size is 256Mb, but again it doesn't really matter - other sizes behave the same here. The amount of data is quite large and it is laid out and accessed pretty much randomly (it's a database server), so in theory, even with some optimizations like raid1 does (route a request to the drive with the nearest head position), the read request distribution should be basically the same. Thanks! /mjt
Re: raid10: unfair disk load?
Janek Kozicki wrote: Michael Tokarev said: (by the date of Fri, 21 Dec 2007 14:53:38 +0300) I just noticed that with Linux software RAID10, disk usage isn't equal at all, that is, most reads are done from the first part of mirror(s) only. what's your kernel version? I recall that recently there has been some work regarding load balancing. It was in my original email: The kernel is 2.6.23, but very similar behavior is shown by earlier kernels as well. Raid10 stripe size is 256Mb, but again it doesn't really matter - other sizes behave the same here. Strange that I missed the new raid10 development you mentioned (I follow linux-raid quite closely). Lemme see... no, nothing relevant in 2.6.24-rc5 (compared with 2.6.23); at least git doesn't show anything interesting. What change(s) are you referring to? Thanks. /mjt
Re: ERROR] scsi.c: In function 'scsi_get_serial_number_page'
Thierry Iceta wrote: Hi, I would like to use raidtools-1.00.3 on the Rhel5 distribution but I got this error Use mdadm instead. Raidtools is dangerous/unsafe, and has not been maintained for a long time already. /mjt
external bitmaps.. and more
I've come across a situation where external MD bitmaps aren't usable on any standard linux distribution unless special (non-trivial) actions are taken. First, a small buglet in mdadm, or two. It's not possible to specify --bitmap= on the assemble command line - the option seems to be ignored. But it is honored when specified in the config file. Also, mdadm should probably warn, or even refuse to do things (unless --force is given), when an array being assembled uses an external bitmap but the bitmap file isn't specified. Now for something more.. interesting. The thing is that when an external bitmap is being used for an array, and that bitmap resides on another filesystem, all common distributions fail to start/mount and to shutdown/umount arrays/filesystems properly, because all starts/stops are done in one script, and all mounts/umounts in another, but for bitmaps to work the two must be intermixed with each other. Here's why. Suppose I have an array mdX which uses the bitmap /stuff/bitmap, where /stuff is another, separate filesystem. In this case, during startup, /stuff should be mounted before bringing up mdX, and during shutdown, mdX should be stopped before trying to umount /stuff. Or else during startup mdX will not find /stuff/bitmap, and during shutdown the /stuff filesystem is busy since mdX is holding a reference to it. Doing things the simple way doesn't work: if I specify mounting mdX as /data in /etc/fstab, then - since mdX hasn't been assembled by mdadm (due to the missing bitmap) - the system will not start, asking for the emergency root password... Oh well. So the only solution for this so far is to convert md array assemble/stop operations into... MOUNTS/UMOUNTS! And specify all necessary information in /etc/fstab - for both arrays and filesystems, with proper ordering in the order column. Ghrm. Technically speaking it's not difficult - mount.md and fsck.md wrappers for mdadm are trivial to write (I even tried that myself - a quick-n-dirty 5-minute hack works). But it's... ugly.
But I don't see any other reasonable solutions. The alternative is additional scripts to start/stop/mount/umount filesystems residing on, or related to, advanced arrays (with external bitmaps in this case) - but looking at how much code is in current startup scripts around mounting/fscking, and keeping in mind that mount/umount do not support an alternative /etc/fstab, this is umm.. even more ugly... Comments, anyone? Thanks. /mjt P.S. Why external bitmaps in the first place? Well, that's a good question, and here's a (hopefully good too) answer: When there are sufficient disk drives available to dedicate some of them to bitmap(s), and there's a large array(s) with dynamic content (many writes), and the content is important enough to care about data safety wrt possible power losses and kernel OOPSes and whatnot, placing the bitmap onto another disk(s) helps a lot with resyncs (it's not about resync speed, it's about general resync UNRELIABILITY, which is another topic - hopefully long-term linux raid gurus will understand me here), but does not slow down writes hugely due to constant disk seeks when updating bitmaps. Those seeks tend to have a huge impact on random write performance.
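For what it's worth, the mount.md hack described above boils down to fstab entries along these lines. This is an entirely hypothetical sketch of mine - the `md` filesystem type and its bitmap option only exist if you write the mount.md/fsck.md wrappers yourself; the pass column (last field) provides the required ordering:

```
# /stuff (holding the bitmap) first, then the array, then its filesystem
/dev/sdb1   /stuff   ext3   defaults               0  1
/dev/mdX    none     md     bitmap=/stuff/bitmap   0  2
/dev/mdX    /data    ext3   defaults               0  3
```

On shutdown the same ordering is walked in reverse, which is exactly the stop-mdX-before-umounting-/stuff sequence the text calls for.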
Re: assemble vs create an array.......
[Cc'd to the xfs list as it contains something related] Dragos wrote: Thank you. I want to make sure I understand. [Some background for the XFS list. The talk is about a broken linux software raid (the reason for the breakage isn't relevant anymore). The OP seems to have lost the order of drives in his array, and now tries to create a new array on top, trying different combinations of drives. The filesystem there WAS XFS. One point is that linux refuses to mount it, saying structure needs cleaning. This all is mostly md-related, but there are several XFS-related questions and concerns too.] 1- Does it matter which permutation of drives I use for xfs_repair (as long as it tells me that the Structure needs cleaning)? When it comes to linux I consider myself at intermediate level, but I am a beginner when it comes to raid and filesystem issues. The permutation DOES MATTER - for all the devices. Linux, when mounting an fs, only looks at the superblock of the filesystem, which is usually located at the beginning of the device. So in each case where linux actually recognizes the filesystem (instead of seeing complete garbage), the same device is the first one - i.e., this is how you found your first device. The rest may still be out of order. Raid5 data is laid out like this (with 3 drives for simplicity; it's similar with more drives):

        DiskA  DiskB  DiskC
  Blk0  Data0  Data1  P0
  Blk1  P1     Data2  Data3
  Blk2  Data4  P2     Data5
  Blk3  Data6  Data7  P3
  ... and so on ...

where your actual data blocks are Data0, Data1, ... DataN, and the PX are parity blocks. As long as DiskA remains in this position, the beginning of the array is the Data0 block -- hence linux sees the beginning of the filesystem and recognizes it. But you can still switch DiskB and DiskC, and the rest of the data will be complete garbage; only the data blocks on DiskA will be in place. So you still need to find the order of the other drives (you found your first drive, DiskA, already).
Note also that if the Data1 block is all zeros (a situation which is unlikely for a non-empty filesystem), P0 (the first parity block) will be exactly the same as Data0, because XORing anything with zeros gives the same anything again (XOR is the operation used to calculate parity blocks in RAID5). So there's still a remote chance you've found TWO first disks... What to do is to give xfs_repair a try for each permutation, but again without letting it actually fix anything. Just run it in read-only mode and see which combination of drives gives fewer errors, or no fatal errors (there may be several similar combinations, with the same order of drives but with a different drive missing). It's sad that xfs refuses to mount when the structure needs cleaning - the best way here is to actually mount it and see what it looks like, instead of trying repair tools. Is there some option to force-mount it still (in read-only mode, knowing it may OOPS the kernel etc)? I'm not very familiar with xfs yet - it seems to be much faster than ext3 for our workload (mostly databases), and I'm experimenting with it slowly. But this very thread prompted me to think. If I can't force-mount it (or browse it using other ways) as I can almost always do with a (somewhat?) broken ext[23] just to examine things, maybe I'm trying it before it's mature enough? ;) Note the smile, but note there's a bit of joke in every joke... :) 2- After I do it, assuming that it worked, how do I reintegrate the 'missing' drive while keeping my data? Just add it back -- mdadm --add /dev/mdX /dev/sdYZ. But don't do that till you actually see your data. /mjt
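A sketch of one trial in the permutation loop (all device names and the drive count are made-up examples; --assume-clean keeps md from starting a resync and overwriting anything, and xfs_repair -n only reports, never writes):

```shell
# Try one candidate drive order; repeat with each permutation
# of the unknown drives, keeping the known first drive first.
mdadm --stop /dev/md0
mdadm --create /dev/md0 --assume-clean --level=5 --raid-devices=3 \
      /dev/sda1 /dev/sdb1 /dev/sdc1

# Read-only check: -n ("no modify") reports problems but
# changes nothing on disk.
xfs_repair -n /dev/md0
```

The permutation that produces the fewest (or no fatal) complaints from the read-only check is the candidate to pursue.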
Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
Justin Piszcz said: (by the date of Sun, 2 Dec 2007 04:11:59 -0500 (EST)) The badblocks did not do anything; however, when I built a software raid 5 and then performed a dd: /usr/bin/time dd if=/dev/zero of=fill_disk bs=1M I saw this somewhere along the way: [42332.936706] ata5.00: spurious completions during NCQ issue=0x0 SAct=0x7000 FIS=004040a1:0800 [42333.240054] ata5: soft resetting port There's some (probably timing-related) bug with spurious completions during NCQ. A lot of people are seeing this same effect with different drives and controllers. Tejun is working on it. It's difficult to reproduce. Search for spurious completion - there are many hits... /mjt
Re: man mdadm - suggested correction.
Janek Kozicki wrote: [] Can you please add to the manual under 'SEE ALSO' a reference to /usr/share/doc/mdadm ? /usr/share/doc/mdadm is Debian-specific (well.. not sure it's really Debian (or something derived from it) -- some other distros may use the same naming scheme, too). Other distributions may place the files into a different directory, or not ship them at all, or ship them in an alternative package. In any case, say, on Debian a user always knows that other misc. docs are in /usr/share/doc/$package - even if no other links are provided in the manpage. Users familiar with other distributions know where/how to find other docs there. /mjt
Re: 2.6.23.1: mdadm/raid5 hung/d-state
Justin Piszcz wrote:

# ps auxww | grep D
USER  PID %CPU %MEM  VSZ  RSS TTY  STAT START  TIME COMMAND
root  273  0.0  0.0    0    0 ?    D    Oct21 14:40 [pdflush]
root  274  0.0  0.0    0    0 ?    D    Oct21 13:00 [pdflush]

After several days/weeks, this is the second time this has happened: while doing regular file I/O (decompressing a file), everything on the device went into D-state. The next time you come across something like that, do a SysRq-T dump and post that. It shows a stack trace of all processes - and in particular, where exactly each task is stuck. /mjt
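For reference, the SysRq-T dump can be requested from the console keyboard (Alt-SysRq-T) or, if the box is still reachable, through procfs (requires root; magic-sysrq must be enabled in the kernel):

```shell
# Make sure the magic-sysrq interface is enabled:
echo 1 > /proc/sys/kernel/sysrq

# Dump stack traces of ALL tasks into the kernel log:
echo t > /proc/sysrq-trigger

# ...then read the result:
dmesg | less
```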
Re: 2.6.23.1: mdadm/raid5 hung/d-state
Justin Piszcz wrote: On Sun, 4 Nov 2007, Michael Tokarev wrote: [] The next time you come across something like that, do a SysRq-T dump and post that. It shows a stack trace of all processes - and in particular, where exactly each task is stuck. Yes I got it before I rebooted, ran that and then saved the dmesg to a file. Here it is: [1172609.665902] 80747dc0 80747dc0 80747dc0 80744d80 [1172609.668768] 80747dc0 81015c3aa918 810091c899b4 810091c899a8 That's only a partial list. All the kernel threads - which are the most important in this context - aren't shown. You ran out of dmesg buffer, and the most interesting entries were at the beginning. If your /var/log partition is working, the stuff should be in /var/log/kern.log or equivalent. If it's not working, there is still a way to capture the info, by stopping syslogd, cat'ing /proc/kmsg to some tmpfs file and scp'ing it elsewhere. /mjt
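The capture-to-tmpfs trick looks roughly like this (a sketch; the init-script name and paths are assumptions that vary by distro, and reading /proc/kmsg blocks, so the cat keeps collecting until interrupted):

```shell
# Stop syslogd so it doesn't consume /proc/kmsg itself:
/etc/init.d/sysklogd stop        # name varies by distro

# Capture the kernel message stream to a tmpfs file:
cat /proc/kmsg > /dev/shm/kmsg.log &

# Trigger the task dump, then ship the log off-box:
echo t > /proc/sysrq-trigger
scp /dev/shm/kmsg.log user@otherhost:
```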
Re: Time to deprecate old RAID formats?
John Stoffel wrote: Michael == Michael Tokarev [EMAIL PROTECTED] writes: If you are going to mirror an existing filesystem, then by definition you have a second disk or partition available for the purpose. So you would merely setup the new RAID1, in degraded mode, using the new partition as the base. Then you copy the data over to the new RAID1 device, change your boot setup, and reboot. Michael And you have to copy the data twice as a result, instead of Michael copying it only once to the second disk. So? Why is this such a big deal? As I see it, there are two separate ways to setup a RAID1 setup, on an OS. [..] That was just a tiny nitpick, so to say, about a particular way to convert an existing system into raid1 - not something which is done every day anyway. Still, doubling the time for copying your terabyte-sized drive is something to consider. [] Michael automatically activate it, thus making it busy. What I'm Michael talking about here is that any automatic activation of Michael anything should be done with extreme care, using smart logic Michael in the startup scripts if at all. Ah... but you can also de-activate LVM partitions as well if you like. Yes, esp. being a newbie user who first installed linux on his PC just to see that he can't use his disk.. ;) That was a real situation - I helped someone who had never heard of LVM and did little of anything with filesystems/disks before. Michael Doug's example - in my opinion anyway - shows wrong tools Michael or bad logic in the startup sequence, not a general flaw in Michael superblock location. I don't agree completely. I think the superblock location is a key issue, because if you have a superblock location which moves depending on the filesystem or LVM you use to look at the partition (or full disk) then you need to be even more careful about how to poke at things. Superblock location does not depend on the filesystem.
Raid exports only the inside space, excluding superblocks, to the next level (filesystem or whatever else). This is really true when you use the full disk for the mirror, because then you don't have the partition table to base some initial guesstimates on. Since there is an explicit Linux RAID partition type, as well as an explicit linux filesystem type (the filesystem is then detected from the first Nk of the partition), you have a modicum of safety. Speaking of whole disks - first, don't do that (for reasons suitable for another topic), and second, using the whole disk or partitions makes no real difference whatsoever to the topic being discussed. There's just no need for the guesswork, except for the first install (to automatically recognize existing devices, and to use them after confirmation), and maybe for rescue systems, which again is a different topic. In any case, for a tool that does guesswork (like libvolume-id, to create /dev/ symlinks), it's as easy to look at the end of the device as at the beginning or at any other fixed place - since the tool has to know the superblock format, it knows the superblock location as well. Maybe manual guesswork, based on a hexdump of the first several kilobytes of data, is a bit more difficult when the superblock is located at the end. But if one has to analyze a hexdump, he doesn't care about raid anymore. If ext3 has the superblock in the first 4k of the disk, but you've setup the disk to use RAID1 with the LVM superblock at the end of the disk, you now need to be careful about how the disk is detected and then mounted. See above. For tools, it's trivial to distinguish a component of a raid volume from the volume itself, by looking for a superblock at whatever location. Including stuff like mkfs, which - like mdadm does - may warn one about previous filesystem/volume information on the device in question.
Michael Speaking of cases where it was really helpful to have an Michael ability to mount individual raid components directly without Michael the raid level - most of them were due to one or another Michael operator error, usually together with bugs and/or omissions Michael in software. I don't remember the exact scenarios anymore (last Michael time it was more than 2 years ago). Most of the time it was Michael one or another sort of system recovery. In this case, you're only talking about RAID1 mirrors, no other RAID configuration fits this scenario. And while this might look to be Definitely. However, linear - to some extent - can be used partially. But sure with much less usefulness. However, raid1 is a much more common setup than anything else - IMHO anyway. It's the cheapest and the most reliable thing for an average user anyway - it's cheaper to get 2 large drives than, say, 3 somewhat smaller drives. Yes, raid1 has 1/2 the space wasted, compared with, say, raid5 on top of 3 drives (only 1/3 wasted), but still 3 smallish drives cost more than 2 larger drives. helpful, I would strongly argue that it's not, because it's a special
Re: Software RAID when it works and when it doesn't
Justin Piszcz wrote: [] - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Justin, forgive me please, but can you learn to trim the original messages when replying, at least cutting off the most irrelevant parts? You're always quoting the whole message, even including the part after a line consisting of a single minus sign - - a part that most MUAs will remove when replying... I have a question with re-mapping sectors, can software raid be as efficient or good at remapping bad sectors as an external raid controller for, e.g., raid 10 or raid5? Hard disks ARE remapping bad sectors on their own. In most cases that's sufficient - there's nothing for raid to do (be it hardware raid or software) except perform a write to the bad place, just to trigger the in-disk remapping procedure. Even the cheapest drives nowadays have some remapping capability. There was an idea some years ago about having an additional layer in between a block device and whatever else is above it (filesystem or something else), that would just do bad block remapping. Maybe it was even implemented in LVM or the IBM-proposed EVMS (the version that included in-kernel stuff too, not only the userspace management), but I don't remember the details anymore. In any case - again, if memory serves me right - there was low interest in that because of exactly this: drives are now more intelligent, there's hardly a notion of a bad block anymore, at least a persistent bad block, at least visible to the upper layers. /mjt
Re: Time to deprecate old RAID formats?
Doug Ledford wrote: [] 1.0, 1.1, and 1.2 are the same format, just in different positions on the disk. Of the three, the 1.1 format is the safest to use since it won't allow you to accidentally have some sort of metadata between the beginning of the disk and the raid superblock (such as an lvm2 superblock), and hence whenever the raid array isn't up, you won't be able to accidentally mount the lvm2 volumes, filesystem, etc. (In worst-case situations, I've seen lvm2 find a superblock on one RAID1 array member when the RAID1 array was down, the system came up, you used the system, the two copies of the raid array were made drastically inconsistent, then at the next reboot, the situation that prevented the RAID1 from starting was resolved, and it never knew it failed to start last time, and the two inconsistent members were put back into a clean array). So, deprecating any of these is not really helpful. And you need to keep the old 0.90 format around for backward compatibility with thousands of existing raid arrays. Well, I strongly, completely disagree. You described a real-world situation, and that's unfortunate, BUT: for at least raid1, there ARE cases, pretty valid ones, when one NEEDS to mount the filesystem without bringing up raid. Raid1 allows that. /mjt
Re: Time to deprecate old RAID formats?
John Stoffel wrote: Michael == Michael Tokarev [EMAIL PROTECTED] writes: [] Michael Well, I strongly, completely disagree. You described a Michael real-world situation, and that's unfortunate, BUT: for at Michael least raid1, there ARE cases, pretty valid ones, when one Michael NEEDS to mount the filesystem without bringing up raid. Michael Raid1 allows that. Please describe one such case please. There have certainly been hacks of various RAID systems on other OSes such as Solaris where the VxVM and/or Solstice DiskSuite allowed you to encapsulate an existing partition into a RAID array. But in my experience (and I'm a professional sysadm... :-) it's not really all that useful, and can lead to problems like those described by Doug. I've been doing sysadmin work for about 15 or 20 years. If you are going to mirror an existing filesystem, then by definition you have a second disk or partition available for the purpose. So you would merely setup the new RAID1, in degraded mode, using the new partition as the base. Then you copy the data over to the new RAID1 device, change your boot setup, and reboot. [...] And you have to copy the data twice as a result, instead of copying it only once to the second disk. As Doug says, and I agree strongly, you DO NOT want to have the possibility of confusion and data loss, especially on bootup. And There are different points of view, and different settings etc. For example, I once dealt with a linux user who was unable to use his disk partition, because his system (it was RedHat if I remember correctly) recognized some LVM volume on his disk (it was previously used with Windows) and tried to automatically activate it, thus making it busy. What I'm talking about here is that any automatic activation of anything should be done with extreme care, using smart logic in the startup scripts if at all. Doug's example - in my opinion anyway - shows wrong tools or bad logic in the startup sequence, not a general flaw in superblock location.
Another example is ext[234]fs - it does not touch the first 512 bytes of the device, so if there was an msdos filesystem there before, it will be recognized as such by many tools, and an attempt to mount it automatically will lead to at least scary output and nothing mounted, or to fsck doing fatal things to it in the worst scenario. Sure thing, the first 512 bytes should be just cleared.. but that's another topic. Speaking of cases where it was really helpful to have an ability to mount individual raid components directly without the raid level - most of them were due to one or another operator error, usually together with bugs and/or omissions in software. I don't remember the exact scenarios anymore (last time it was more than 2 years ago). Most of the time it was one or another sort of system recovery. In almost all machines I maintain, there's a raid1 for the root filesystem built of all the drives (be it 2 or 4 or even 6 of them) - the key point is to be able to boot off any of them in case some cable/drive/controller rearrangement has to be done. The root filesystem is quite small (256 or 512 Mb here), and it's not too dynamic either - so it's not a big deal to waste space for it. Problems occur - obviously - when something goes wrong. And most of the time the issues we had happened on a remote site, where there was no experienced operator/sysadmin handy. For example, when one drive was almost dead, and mdadm tried to bring the array up, the machine just hung for an unknown amount of time. An inexperienced operator was there. Instead of trying to teach him how to pass a parameter to the initramfs to stop it trying to assemble the root array and then assembling it manually, I told him to pass root=/dev/sda1 to the kernel. Root mounts read-only, so it should be a safe thing to do - I only needed the root fs and a minimal set of services (which are even in initramfs) just for it to boot up to SOME state where I can log in remotely and fix things later.
(no I didn't want to remove the drive yet, I wanted to examine it first, and it turned out to be a good idea because the hang was happening only at the beginning of it, and while we tried to install a replacement and fill it up with data, an unreadable sector was found on another drive, so this old but not-removed drive was really handy). Another situation - after some weird crash I had to examine the filesystems found on both components - I wanted to look at the filesystems and compare them, WITHOUT messing up the raid superblocks (later on I wrote a tiny program to save/restore 0.90 superblocks), and without attempting any reconstruction. In fact, this very case - examining the contents - is something I've done many times for one or another reason. There's just no need to involve the raid layer here at all, but it doesn't disturb things either (in some cases anyway). Yet another - many times we had to copy an old system to a new one - new machine boots with 3 drives
Re: Time to deprecate old RAID formats?
Justin Piszcz wrote: On Fri, 19 Oct 2007, Doug Ledford wrote: On Fri, 2007-10-19 at 13:05 -0400, Justin Piszcz wrote: [] Got it, so for RAID1 it would make sense if LILO supported it (the later versions of the md superblock) Lilo doesn't know anything about the superblock format, however, lilo expects the raid1 device to start at the beginning of the physical partition. In other words, format 1.0 would work with lilo. Did not work when I tried 1.x with LILO, switched back to 00.90.03 and it worked fine. There are different 1.x formats - and the difference is exactly this: the location of the superblock. In 1.0, the superblock is located at the end, just like with 0.90, and lilo works just fine with it. It gets confused somehow (however I don't see how really, because it uses bmap() to get a list of physical blocks for the files it wants to access - those should be in absolute numbers, regardless of the superblock location) when the superblock is at the beginning (v1.1 or 1.2). /mjt
Re: very degraded RAID5, or increasing capacity by adding discs
Neil Brown wrote: On Tuesday October 9, [EMAIL PROTECTED] wrote: [] o During this reshape time, errors may be fatal to the whole array - mdadm does have a sense of a critical section, but the whole procedure isn't as well tested as the rest of the raid code, and I for one will not rely on it, at least for now. For example, a power failure at an unexpected moment, or some plain-stupid error in the reshape code so that the whole array goes boom etc... While it is true that the resize code is less tested than other code, it is designed to handle a single failure at any time (so a power failure is OK as long as the array is not running degraded), and I have said that if anyone does suffer problems while performing a reshape, I will do my absolute best to get the array functioning and the data safe again. Well... Neil, it's your code, so you trust it - that's ok, I also (try to) trust my code until someone finds a bug in it.. ;) And I'm a sysadmin (among other things), whose professional trait must be a bit of paranoia.. You got the idea ;) o A filesystem on the array has to be resized separately after re{siz,shap}ing the array. And filesystems differ at this point, too - there are various limitations. For example, it's problematic to grow ext[23]fs by large amounts, because when it gets initially created, mke2fs calculates the sizes of certain internal data structures based on the device size, and those structures can't be grown significantly - only recreating the filesystem will do the trick. This isn't entirely true. For online resizing (while the filesystem is mounted) there are some limitations as you suggest. For offline resizing (while the filesystem is not mounted) there are no such limitations. There still are - at least for ext[23]. Even offline resizers can't do resizes from any size to any size; extfs developers recommend recreating the filesystem anyway if the size changes significantly.
I'm too lazy to find a reference now, but it has been mentioned here on linux-raid at least once this year. It's sorta like fat (yea, that ms-dog filesystem) - when you resize it from, say, 501Mb to 999Mb, everything is ok, but if you want to go from 501Mb to 1Gb+1, you have to recreate almost all data structures because the sizes of all internal fields change - and here it's much safer to just re-create it from scratch than to try to modify it in place. Sure it's much better with extfs, but the point is still the same. /mjt
Re: very degraded RAID5, or increasing capacity by adding discs
Janek Kozicki wrote: Hello, Recently I started to use mdadm and I'm very impressed by its capabilities. I have raid0 (250+250 GB) on my workstation. And I want to have raid5 (4*500 = 1500 GB) on my backup machine. Hmm. Are you sure you need that much space on the backup, to start with? Maybe a better backup strategy will help to avoid the hardware costs? Such as using rsync for backups as discussed on this mailing list about a month back (rsync is able to keep many ready-to-use copies of your filesystems but only store files that actually changed since the last backup, thus requiring much less space than many full backups). The backup machine currently doesn't have raid, just a single 500 GB drive. I plan to buy more HDDs to have a bigger space for my backups but since I cannot afford all the HDDs at once I face the problem of expanding an array. I'm able to add one 500 GB drive every few months until I have all 4 drives. But I cannot make a backup of a backup... so reformatting/copying all the data each time I add a new disc to the array is not possible for me. Is it possible anyhow to create a very degraded raid array - one that consists of 4 drives, but has only TWO? This would involve some very tricky *hole* management on the block device... One that places holes in stripes on the block device, until more discs are added to fill the holes. When the holes are filled, the block device grows bigger, and with lvm I just increase the filesystem size. This is perhaps coupled with some un-striping that moves/reorganizes blocks around to fill/defragment the holes. It's definitely not possible with raid5. The only option is to create a raid5 array consisting of fewer drives than it will contain at the end, and reshape it when you get more drives, as others noted in this thread. But do note the following points: o a degraded raid5 isn't really RAID - i.e., it's no better than a raid0 array; that is, any disk fails = the whole array fails.
So instead of creating a degraded raid5 array initially, create a smaller one instead, but not degraded, and reshape it when necessary. o reshaping takes time, and for this volume, a reshape will take many hours, maybe days, to complete. o During this reshape time, errors may be fatal to the whole array - mdadm does have a sense of a critical section, but the whole procedure isn't as well tested as the rest of the raid code, and I for one will not rely on it, at least for now. For example, a power failure at an unexpected moment, or some plain-stupid error in the reshape code so that the whole array goes boom etc... o A filesystem on the array has to be resized separately after re{siz,shap}ing the array. And filesystems differ at this point, too - there are various limitations. For example, it's problematic to grow ext[23]fs by large amounts, because when it gets initially created, mke2fs calculates the sizes of certain internal data structures based on the device size, and those structures can't be grown significantly - only recreating the filesystem will do the trick. is it just a pipe dream? I'd say it is... ;) /mjt
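The grow-as-you-buy approach suggested above looks roughly like this (a sketch; device names are made-up examples, and resize2fs stands in for whatever grow tool your filesystem uses):

```shell
# Create a real (non-degraded) 3-drive raid5 with the drives
# available today:
mdadm --create /dev/md0 --level=5 --raid-devices=3 \
      /dev/sdb1 /dev/sdc1 /dev/sdd1

# Months later, when the 4th drive arrives, add it and reshape:
mdadm --add /dev/md0 /dev/sde1
mdadm --grow /dev/md0 --raid-devices=4

# Once the reshape completes, grow the filesystem separately:
resize2fs /dev/md0
```

Progress of the reshape can be watched in /proc/mdstat.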
Re: Journalling filesystem corruption fixed in between?
Rustedt, Florian wrote: Hello list, some folks reported severe filesystem crashes with ext3 and reiserfs on mdraid level 1 and 5. I guess much stronger evidence and details are needed. Without any additional information I for one can only make a (not-so-pleasant) guess about those some folks, nothing more. We've been running several dozen systems on raid1s and raid5s since the 2.4 kernel (and some since 2.2 if memory serves, with an additional patch for raid functionality) - nothing except the usual, mostly hardware, problems since then. And many other people use linux raid and especially the ext3 filesystem in production on large boxes with good load - such a corruption, were it not specific to a particular system (due to, for example, bad ram or a faulty controller or whatever), should cause a lot of messages here @linux-raid and elsewhere. /mjt
Re: problem killing raid 5
Daniel Santos wrote: I retried rebuilding the array once again from scratch, and this time checked the syslog messages. The reconstruction process is getting stuck at a disk block that it can't read. I double checked the block number by repeating the array creation, and did a bad block scan. No bad blocks were found. How could the md driver be stuck if the block is fine? Supposing that the disk has bad blocks, can I have a raid device on disks that have bad blocks? Each one of the disks is 400 GB. Probably not a good idea because if a drive has bad blocks it probably will have more in the future. But anyway, can I? The bad blocks would have to be known to the md driver. Well, almost all modern drives can remap bad blocks (at least I know of no drive that can't). Most of the time it happens on write - because if such a bad block is found during a read operation and the drive really can't read the content of that block, it can't remap it either without losing data. From my experience (about 20 years, many 100s of drives, mostly (old) SCSI but (old) IDE too), it's pretty normal for a drive to develop several bad blocks, especially during its first year of usage. Sometimes, however, the number of bad blocks grows quite rapidly, and such a drive definitely should be replaced - at least Seagate drives are covered by warranty in this case. SCSI drives have 2 so-called defect lists, stored somewhere inside the drive - the factory-preset list (bad blocks found during internal testing when producing the drive), and the grown list (bad blocks found by the drive during normal usage). The factory-preset list can contain from 0 to about 1000 entries or even more (depending on the size too); the grown list can be as large as 500 blocks or more, and whether it's fatal or not depends on whether new bad blocks continue to be found or not. We have several drives which developed that many bad blocks in their first few months of usage, the list stopped growing, and they're still working just fine 5 years later.
Both defect lists can be shown by the scsitools programs. I don't know how one can see defect lists on an IDE or SATA drive. Note that the md layer (raid1, 4, 5, 6, 10 - but obviously not raid0 and linear) is now able to repair bad blocks automatically, by forcing a write to the same place of the drive where a read error occurred - this usually forces the drive to automatically reallocate that block and continue. But in any case, md should not stall - be it during reconstruction or not. For this, I can't comment - to me it smells like a bug somewhere (md layer? error handling in the driver? something else?) which should be found and fixed. And for this, some more details are needed I guess - kernel version is a start. /mjt
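For IDE/SATA drives, the closest equivalent I know of is the drive's SMART counters (this assumes smartmontools is installed; the attribute names below are the conventional ones and vendors may vary):

```shell
# Show SMART attributes; Reallocated_Sector_Ct roughly plays the
# role of the SCSI grown defect list, and Current_Pending_Sector
# counts unreadable sectors waiting for a write to be remapped:
smartctl -A /dev/sda
```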
Re: problem killing raid 5
Patrik Jonsson wrote: Michael Tokarev wrote: [] But in any case, md should not stall - be it during reconstruction or not. For this, I can't comment - to me it smells like a bug somewhere (md layer? error handling in the driver? something else?) which should be found and fixed. And for this, some more details are needed I guess - kernel version is a start. Really? It's my understanding that if md finds an unreadable block during raid5 reconstruction, it has no option but to fail since the information can't be reconstructed. When this happened to me, I had to Yes indeed, it should fail, but not get stuck as Daniel reported. I.e., it should either complete the work or fail, but not sleep somewhere in between. [] This is why it's important to run a weekly check so md can repair blocks *before* a drive fails. *nod*. /mjt
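The weekly check mentioned here is driven through sysfs (the md device name is an example; many distros ship a cron job that does exactly this):

```shell
# Read every sector of all members; mismatched or unreadable
# blocks get rewritten from parity/mirror data:
echo check > /sys/block/md0/md/sync_action

# Watch progress:
cat /proc/mdstat
```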
Re: Backups w/ rsync
Dean S. Messing wrote: Michael Tokarev writes: [] : the procedure is something like this: : : cd /backups : rm -rf tmp/ : cp -al $yesterday tmp/ : rsync -r --delete -t ... /filesystem tmp : mv tmp $today : : That is, link the previous backup to temp (which takes no space : except directories), rsync current files to there (rsync will : break links for changed files), and rename temp to $today. Very nice. The breaking of the hardlink is the key. I wondered about this while using rsync yesterday. I just tested the idea. It does indeed work. Well, others in this thread already presented other, simpler ways, namely using the --link-dest rsync option. I was just too lazy to read the man page, but I already knew other tools can do the work ;) One question: why do you not use -a instead of -r -t? It would seem that one would want to preserve permissions, and group and user ownerships. Also, is there a reason to _not_ preserve sym-links in the backup? Your script appears to copy the referent. Note the above -- SOMETHING like this. I was typing from memory; it's not an actual script, just to show the idea. Sure the real script does more than that, including error checking too. /mjt
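The --link-dest variant mentioned above collapses the cp -al step into rsync itself; a rough sketch (the directory names are made-up examples):

```shell
# Each day's snapshot hard-links unchanged files against the
# previous day's tree instead of copying them; rsync creates
# new copies only for files that changed.
yesterday=2007-12-01
today=2007-12-02
rsync -a --delete --link-dest=/backups/$yesterday \
      /filesystem/ /backups/$today/
```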
Re: Help: very slow software RAID 5.
Dean S. Messing wrote:
[] That's what attracted me to RAID 0 --- which seems to have no downside EXCEPT safety :-). So I'm not sure I'll ever figure out the right tuning. I'm at the point of abandoning RAID entirely and just putting the three disks together as a big LV and being done with it. (I don't have quite the moxy to define a RAID 0 array underneath it. :-)

Putting three disks together as a big LV - that's exactly what the linear md module does. It's almost as unsafe as raid0, but with linear read/write speed equal to the speed of a single drive...

Note also that the more drives you add to a raid0-like config, the more chance of failure you'll have - because raid0 fails when ANY drive fails. Ditto - to a certain extent - for the linear md module and for one big LV, which is basically the same thing.

By the way, before abandoning the R in RAID, I'd check whether the resulting speed with raid5 (after at least read-ahead tuning) is acceptable, and use that if yes. If not, maybe raid10 over the same 3 drives will give better results.

/mjt
Re: SWAP file on a RAID-10 array possible?
Tomas France wrote:
Thanks for the answer, David! I kind of think RAID-10 is a very good choice for a swap file. For now I will need to set up the swap file on a simple RAID-1 array anyway; I just need to be prepared when it's time to add more disks and transform the whole thing into RAID-10... which will be big fun anyway, for sure ;)

By the way, you don't really need raid10 for swap. The built-in Linux swap code can utilize multiple swap areas just fine - mkswap + swapon on multiple devices/files. This is essentially a raid0. For raid10, the one thing still needed is the mirroring, which is provided by raid1. So when you've got two drives, use a single partition on each to form a raid1 array for swap space. If you've got 4 drives, create 2 raid1 arrays and specify them both as swap space, giving them appropriate priority (pri=xxx in the swap line in fstab). With 6 drives, have 3 raid1 arrays, and so on... This way, the whole thing is much simpler and more manageable.

/mjt
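A sketch of the 4-drive variant described above (the device names and md numbers are illustrative). The key point is the equal pri= values: the kernel stripes swap pages across areas of the same priority, giving raid0 behaviour over the two raid1 mirrors:

```shell
# Two raid1 pairs already assembled (hypothetical devices):
#   /dev/md1 = sda2 + sdb2,  /dev/md2 = sdc2 + sdd2
mkswap /dev/md1
mkswap /dev/md2

# /etc/fstab -- equal priorities make the kernel round-robin
# pages across both areas, i.e. raid0 over raid1:
#   /dev/md1  none  swap  sw,pri=1  0 0
#   /dev/md2  none  swap  sw,pri=1  0 0
swapon -a
```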
Re: A raid in a raid.
mullaly wrote:
[] All works well until a system reboot. md2 appears to be brought up before md0 and md1, which causes the raid to start without two of its drives. Is there any way to fix this?

How about listing the arrays in the proper order in mdadm.conf?

/mjt
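A sketch of what that might look like (the UUIDs are placeholders); mdadm assembles arrays in the order listed, so the component arrays go before the array stacked on top of them:

```shell
# /etc/mdadm.conf -- list the building blocks before the array
# that is built from them:
DEVICE partitions
ARRAY /dev/md0 UUID=aaaaaaaa:aaaaaaaa:aaaaaaaa:aaaaaaaa
ARRAY /dev/md1 UUID=bbbbbbbb:bbbbbbbb:bbbbbbbb:bbbbbbbb
ARRAY /dev/md2 devices=/dev/md0,/dev/md1
```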
Re: 3ware 9650 tips
Joshua Baker-LePain wrote:
[] Yep, hardware RAID -- I need the hot swappability (which, AFAIK, is still an issue with md).

Just out of curiosity - what do you mean by swappability? We've been using Linux software raid for many years, and we had no problems with swappability of the component drives (in case of drive failures and whatnot). With non-hotswappable drives (old scsi and ide ones), rebooting is needed for the system to recognize the new drives. With modern sas/sata drives, I can replace a faulty drive without anyone noticing... Maybe you're referring to something else? Thanks.

/mjt
RFC: dealing with bad blocks: another view
Now the MD subsystem does a very good job of trying to recover a bad block on a disk, by re-writing its content (to force the drive to reallocate the block in question) and verifying it's written ok. But I wonder if it's worth the effort to go further than that.

Now, md can use bitmaps. And a bitmap can be used not only to mark clean vs dirty chunks, but also, for example, good vs bad chunks.

That is to say: suppose we discover a read error on one component of an array, we try to re-write it, but the rewrite (or re-read) fails. The current implementation will kick the bad drive from the array. But here it is possible not to kick it, but to turn on the corresponding bit(s) in the bitmap, saying the data at this location on this drive is wrong - don't try to read from it. And continue using the drive as before (modulo the bits/parts just turned on).

The rationale is: each time we kick a whole drive from an array, for whatever reason, we greatly reduce the chances of the whole array staying in working condition. For some reason, drives from the same batch tend to discover bad blocks close to each other - I mean, we see a bad block on one drive, and pretty soon we see another bad block on another drive (at least in our experience). So by kicking one drive, we increase the failure probability even more.

We had a large batch of seagate 36g scsi drives which all had some issue with firmware: each time a drive detected a bad sector and we tried to mitigate it (by rewriting it), the drive reported a defect list manipulation error (I don't remember the exact sense code), and only on the second attempt did it rewrite the sector in question successfully. Seagate refused to acknowledge this problem, no matter how we argued -- they said it was mishandling (i.e., that we handled the drives improperly). That's to show just one example of the numerous cases where kicking the whole drive is not a good idea. Even more:
If we see a *read* error, there's no need to mark the chunk as bad in the bitmap -- only if we see a *write* error while writing some *new* data. I.e., that "bad" bit in the bitmap may mean the data at this place is out of sync - like an extended "dirty". When interpreted like that, there's no need to allocate a new bit; the existing dirty bit can be used. On resync, we try the write again, and just keep the dirty bit if the write failed. Obviously, we should not try to read from those dirty places. And if there's no component left to read from, just return a read error - for this single read - but continue running the array (maybe in read-only mode, whatever).

It seems pretty simple to implement with the existing code. The only requirement is to have a bitmap - obviously, without the bitmap the whole idea does not work.

This also fits the "policy does not belong in the kernel" model perfectly. Never, ever try to do something large (like kicking off a whole disk) in the kernel; let userspace decide what to do... Mdadm event handlers (scripts called when something goes wrong) can kick the disk off just fine.

Comments, anyone?

/mjt
Re: Recovery of software RAID5 using FC6 rescue?
Nix wrote:
On 8 May 2007, Michael Tokarev told this:

BTW, for such recovery purposes, I use an initrd (initramfs really, but it does not matter) with a normal (but tiny) set of commands inside, thanks to busybox. So everything can be done without any help from an external recovery CD. Very handy at times, especially since all the network drivers are there on the initramfs too, so I can even start a netcat server while in initramfs, and perform recovery from a remote system... ;)

What you should probably do is drop into the shell that's being used to run init if mount fails (or, more generally, if after mount runs it

That's exactly what my initscript does ;)

  chk() {
    while ! "$@"; do
      warn "the following command failed:"
      warn "$*"
      p="** Continue(Ignore)/Shell/Retry (C/s/r)? "
      while :; do
        if ! read -t 10 -p "$p" x 2>&1; then
          echo "(timeout, continuing)"
          return 1
        fi
        case "$x" in
          [Ss!]*) /bin/sh 2>&1 ;;
          [Rr]*) break ;;
          [CcIi]*|"") return 1 ;;
          *) echo "(unrecognized response)" ;;
        esac
      done
    done
  }

  chk mount -n -t proc proc /proc
  chk mount -n -t sysfs sysfs /sys
  ...
  info "mounting $rootfstype fs on $root (options: $rootflags)"
  chk mount -n -t $rootfstype -o "$rootflags" "$root" /root
  if [ $? != 0 ] && ! grep -q '^[^ ]\+ /root ' /proc/mounts; then
    warn "root filesystem ($rootfstype on $root) is NOT mounted!"
  fi
  ...

hasn't ended up mounting anything: there's no need to rely on mount's success/failure status). [...]

Well, so far the exit code has been reliable.

/mjt
Re: No such device on --remove
Bernd Schubert wrote:
Benjamin Schieder wrote:
[EMAIL PROTECTED]:~# mdadm /dev/md/2 -r /dev/hdh5
mdadm: hot remove failed for /dev/hdh5: No such device
md1 and md2 are supposed to be raid5 arrays.

You are probably using udev, aren't you? Somehow there's presently no /dev/hdh5, but to remove /dev/hdh5 from the raid, mdadm needs this device. There's a workaround: you need to create the device node in /dev using mknod, and then you can remove it with mdadm.

If the /dev/hdh5 device node were missing, mdadm would complain "No such file or directory" (ENOENT), not "No such device" (ENODEV). In this case, as I explained in my previous email, the arrays aren't running, and the error refers to manipulations (md ioctls) on the existing /dev/md/2. It has nothing to do with udev.

/mjt
Re: removed disk md-device
Bernd Schubert wrote:
Hi, we are presently running into a hotplug/linux-raid problem. Let's assume a hard disk entirely fails or a stupid human being pulls it out of the system. Several partitions of the very same hard disk are also part of linux software raid. Also, /dev is managed by udev.
Problem 1) When the disk fails, udev will remove it from /dev. Unfortunately this will make it impossible to remove the disk or its partitions from the /dev/mdX device, since mdadm tries to read the device file and will abort if this file is not there.

What do you mean by "fails" here? All the device information is still there - look at /sys/block/mdX/md/rdY/block. Even if, say, sda (which was a part of md0) disappeared, there will still be a /sys/block/sda directory, because the md subsystem keeps it open. Yes, the device node may be removed by udev (oh, how I dislike udev!), but all the info is still there. Also, all the info is in the array information available via ioctl. mdadm can work it out from there, but it's a bit ugly.

Problem 2) Even though the kernel detected that the device no longer exists, it didn't inform its md-layer about this event. The md-layer will first detect the non-existent disk when a read or write attempt to one of its raid-partitions fails. Unfortunately, if you are unlucky, it might never detect that, e.g. for raid1 devices.

This is backwards. "If you're unlucky" should be the opposite -- "you're lucky". Well ok, it really depends on other things. Because if the md-layer does not detect a failed disk, it means that disk hasn't been needed so far (because any attempt to do I/O on it would fail, and the disk would be kicked off the array). And since there was no need for that disk, that means no changes have been made to the array (because in case of any change, all disks will be written to).
Which, in turn, means either of:
a) the disk will reappear (there are several failure modes; sometimes just a bus rescan or a powercycle will do the trick), and no one will even notice, and everything will be ok.
b) the disk is dead. And I think this is where you say "unlucky" - because for quite some (unknown amount of) time, the array will be running in degraded mode, instead of enabling/resyncing a hot spare etc.

Again: it depends on the failure scenario. What to do here is questionable, because a) contradicts b). So far, I haven't seen disks dying (well, maybe 2 or 3 times), but I've seen disks disappearing randomly for no apparent reason, where a bus reset or powercycle brings them back just fine. So for me, this is "lucky" behaviour.. ;) Also, with all the modern hotpluggable drives (usb, sata, hotpluggable scsi, and esp. networked storage, where the network may add its own failure modes), it's much easier to make a device disappear - by touching cables, for example - and this is case a).

I think there should be several solutions to these problems.
1) Before udev removes a device file, it should run a pre-remove script, which should check if the device is listed in /proc/mdstat, and if it is listed there, it should run mdadm to remove this device from the array. Does udev presently support running pre-remove scripts?

2) As soon as the kernel detects a failed device, it should also inform the md layer.

See above: it depends.

3) Does mdadm really need the device?

No, it doesn't. In order to fail or remove a component device from an array, only the major:minor number is needed. Device nodes aren't really needed even to assemble an array - but only if doing it the dumb way. During assembly, mdadm examines the devices and tries to add some intelligence to the process, and for that, device nodes are really necessary. But not for hot-removals.
/mjt
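On question 1 in the post above: udev has no "pre-remove" hook, but it can run a program on the remove event itself, and combined with the point from question 3 (only the major:minor is ultimately needed), something along these lines might work. This is an untested sketch - the rule file name and the md-drop helper are made up for illustration; %k/%M/%m are standard udev substitutions:

```shell
# /etc/udev/rules.d/90-md-remove.rules (hypothetical):
#   on block-device removal, hand the kernel name and major:minor
#   to a helper script.
# ACTION=="remove", SUBSYSTEM=="block", RUN+="/usr/local/sbin/md-drop %k %M:%m"

# /usr/local/sbin/md-drop (sketch of the helper):
#   dev=$1 majmin=$2
#   mknod "/tmp/$dev" b "${majmin%:*}" "${majmin#*:}"   # temporary node
#   grep the arrays containing $dev out of /proc/mdstat, then for each:
#   mdadm "/dev/md$N" --fail "/tmp/$dev" --remove "/tmp/$dev"
#   rm -f "/tmp/$dev"
```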
Re: Swapping out for larger disks
Brad Campbell wrote:
[] It occurs though that the superblocks would be in the wrong place for the new drives and I'm wondering if the kernel or mdadm might not find them.

I once had a similar issue, and wrote a tiny program (a hack, sort of) to read or write an md superblock from/to a component device. The only thing it really does is calculate the superblock location - exactly as it is done in mdadm. Here it is: http://www.corpit.ru/mjt/mdsuper.c

Usage is like:
  mdsuper read /dev/old-device | mdsuper write /dev/new-device
(or using an intermediate file). So you'd do something like this:

  shutdown array
  for i in all-devices-in-array
    dd if=old-device[i] of=new-device[i] iflag=direct oflag=direct
    mdsuper read old-device[i] | mdsuper write new-device[i]
  done
  assemble-array-on-new-devices
  mdadm -G --size=max /dev/mdX

or something like that. Note that the program does not work for anything but 0.90 superblocks (I haven't used 1.0 superblocks yet - 0.90 works for me just fine). However, it should be trivial to extend it to handle v1 superblocks too.

Note also that it's trivial to do something like that in shell too, with blockdev --getsz to get the device size, some shell-style $((math)), and dd magic.

And a 3rd note: using direct I/O as above speeds up the copying *a lot*, while keeping system load at zero. Without direct, start copying one pair of disks and the system is doing nothing but the copying...

/mjt
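The shell variant mentioned above can be sketched as follows. For a 0.90 superblock, the location in sectors is MD_NEW_SIZE_SECTORS(size) = (size & ~127) - 128 from md_p.h, i.e. 64KB before the end of the device, rounded down to a 64KB boundary. The device names are placeholders:

```shell
#!/bin/sh
# Compute the 0.90 md superblock offset (in 512-byte sectors)
# for a device of a given size in sectors, as mdadm/md_p.h do it.
sb_offset() {
  # $1 = device size in sectors; 128 sectors = the 64KB reserved area
  echo $(( ($1 & ~127) - 128 ))
}

# Example: a 1000000-sector device
sb_offset 1000000    # -> 999808

# For real devices (names hypothetical):
#   off=$(sb_offset "$(blockdev --getsz /dev/sdb1)")
#   dd if=/dev/sdb1 of=sb.bin bs=512 skip=$off count=128
#   dd if=sb.bin of=/dev/sdc1 bs=512 conv=notrunc \
#      seek=$(sb_offset "$(blockdev --getsz /dev/sdc1)")
```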
Re: No such device on --remove
Benjamin Schieder wrote:
Hi list.
md2 : inactive hdh5[4](S) hdg5[1] hde5[3] hdf5[2] 11983872 blocks
[EMAIL PROTECTED]:~# mdadm -R /dev/md/2
mdadm: failed to run array /dev/md/2: Input/output error
[EMAIL PROTECTED]:~# mdadm /dev/md/
0 1 2 3 4 5
[EMAIL PROTECTED]:~# mdadm /dev/md/2 -r /dev/hdh5
mdadm: hot remove failed for /dev/hdh5: No such device
md1 and md2 are supposed to be raid5 arrays.

The arrays are inactive. In this condition, an array can be shut down, or brought up by adding another disk with a proper superblock. So running it isn't possible because the kernel thinks the array is inconsistent, and removing isn't possible because the array isn't running.

It's inactive because when mdadm tried to assemble it, it didn't find enough devices with a recent-enough event counter. In other words, the raid superblocks on the individual drives are inconsistent (some are older than others). If the problem is due to a power failure, fixing the situation is usually just a matter of adding the -f (force) option to the mdadm assemble line, forcing mdadm to increment the almost-recent drives' event counters before bringing the array up.

/mjt
Re: Recovery of software RAID5 using FC6 rescue?
Mark A. O'Neil wrote:
Hello, I hope this is the appropriate forum for this request; if not, please direct me to the correct one. I have a system running FC6, 2.6.20-1.2925, software RAID5, and a power outage seems to have borked the file structure on the RAID. Boot shows the following disks:
sda #first disk in raid5: 250GB
sdb #the boot disk: 80GB
sdc #second disk in raid5: 250GB
sdd #third disk in raid5: 250GB
sde #fourth disk in raid5: 250GB
When I boot, the system kernel panics with the following info displayed:
...
ata1.00: cd c8/00:08:e6:3e:13/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 in
exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata1.00: (BMDMA stat 0x25)
ata1.00: cd c8/00:08:e6:3e:13/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 in
EXT3-fs error (device sda3) ext_get_inode_loc: unable to read inode block - inode=8, block=1027
EXT3-fs: invalid journal inode
mount: error mounting /dev/root on /sysroot as ext3: invalid argument
setuproot: moving /dev failed: no such file or directory
setuproot: error mounting /proc: no such file or directory
setuproot: error mounting /sys: no such file or directory
switchroot: mount failed: no such file or directory
Kernel panic - not syncing: attempted to kill init!
Wug. At which point the system locks, as expected.
Another perhaps unrelated tidbit: when viewing sda1 using (I think - I did not write down the command) mdadm --misc --examine device, I see (in part) data describing the device in the array:
sda1: raid 4, total 4, active 4, working 4
and then a listing of disks sdc1, sdd1, sde1.
Viewing the remaining disks in the list shows:
sdX1: raid 4, total 3, active 3, working 3

You sure it's raid4, not raid5? Because if it really is raid4 now, but before you had a raid5 array, you're screwed, and the only way to recover is to re-create the array (without losing data), re-writing the superblocks (see below).
BTW, --misc can be omitted - you only need: mdadm -E /dev/sda1

and then a listing of the disks, with the first disk being shown as removed. It seems that the other disks do not have a reference to sda1? That in itself is perplexing to me, but I vaguely recall seeing that before - it has been a while since I set the system up.

Check the UUID values on all drives (also from the mdadm -E output) - they should be the same. And compare the Events fields there too. Maybe you had a 4-disk array before, but later re-created it with 3 disks? Another possible cause is disk failures resulting in bad superblock reads, but that's highly unlikely.

Anyway, I think the ext3-fs error is less an issue with the software raid and more an issue that fsck could fix. My problem is how to non-destructively mount the raid from the rescue disk so that I can run fsck on the raid. I do not think mounting and running fsck on the individual disks is the correct solution. Some straightforward instructions (or a pointer to some) on doing this from the rescue prompt would be most useful. I have been searching the last couple of evenings and have yet to find something I completely understand. I have little experience with software raid and mdadm, and while this is an excellent opportunity to learn a bit (and I am), I would like to recover my data in a timely fashion rather than mess it up beyond recovery as the result of a dolt interpretation of a man page. The applications and data itself are replaceable - just time-consuming, as in days, rather than what I hope, with proper instruction, will amount to an evening or two worth of work to mount the RAID and run fsck.

Not sure about pointers. But here are some points. Figure out which arrays/disks you really had. The raid level and number of drives are really important. Now, two mantras:

  mdadm --assemble /dev/md0 /dev/sda1 /dev/sdc1 /dev/sdd1 /dev/sde1

This will try to bring the array up.
It will either come up ok, or will fail due to event count mismatches (more than 1 difference). In case you have more than 1 mismatch, you can try adding the --force option, to tell mdadm to ignore the mismatches and try the best it can. The array won't resync; it will be started from the best (n-1) drives. If there's a drive error, you can omit the bad drive from the command and assemble a degraded array, but before doing so, see which drives are most fresh (by examining the Event counts in mdadm -E output). If one of the remaining drives has a (much) lower event count than the rest, while the bad one is (more-or-less) good, there's a good chance you have a bad (unrecoverable) filesystem. This happens if the lower-events drive was kicked off the array (for whatever reason) long before your last disaster happened; hence it contains very old data, and you've very few chances to recover without the bad drive.

And another mantra, which can be helpful if assemble doesn't work for some reason:

  mdadm --create /dev/md0 --level=5 --num-devices=4 --layout=x
Re: s2disk and raid
Neil Brown wrote:
On Tuesday April 3, [EMAIL PROTECTED] wrote:
[] After the power cycle the kernel boots, devices are discovered, among which the ones holding raid. Then we try to find the device that holds swap (in case of resume) or / (in case of a normal boot). Now comes a crucial point. The script that finds the raid array finds the array in an unclean state and starts syncing. []

So you can start arrays 'readonly', and resume off a raid1 without any risk of the resync starting when it shouldn't.

But I wonder why this raid is necessary in the first place. For raid1, assuming the superblock is at the end, the only thing needed for resume is one component of the mirror. I.e., if your raid array is (was) composed of hda1 and hdb1, either of the two will do as the source of the resume image. The trick is to find which one, in case the array was degraded -- and mdadm does the job here, but actually assembling it isn't really necessary. Maybe mdadm could be told to examine the component devices and write a short line to stdout *instead* of really assembling (like a hypothetical mdadm -A --dummy), to show the most recent component, and the offset if the superblock is at the beginning... Having that, it would be possible to resume from that component directly...

By the way, my home-grown initramfs stuff accepts several devices on the resume= command line, and tries each in turn. If the main disks have more-or-less stable names, this may be an alternative way. I mean, just give the component devices in the resume= line... Yes, this way it may do some weird things in the case when the original swap array was degraded (with the first component, which contained a valid resume image, removed from the array)... But it's not really a big issue, since - usually anyway - if one uses resume=, it means the machine in question isn't some remote 100-miles-away box; it's right here, and it's ok to bypass the resume for recovery purposes.

Just some random thoughts.
/mjt
Re: Swap initialised as an md?
Bill Davidsen wrote:
[] If you use RAID0 on an array it will be faster (usually) than just partitions, but any process with swapped pages will crash if you lose either drive. With RAID1, operation will be more reliable but no faster. If you use RAID10, the array will be faster and more reliable, but most recovery CDs don't know about RAID10 swap. Any reliable swap will also have the array size smaller than the sum of the partitions (you knew that).

You seem to have forgotten to mention 2 more things:
o swap isn't usually needed for recovery CDs
o the kernel vm subsystem can already do the equivalent of raid0 for swap internally, by allocating several block devices for swap space with the same priority.
If reliability (of swapped processes) is important, one can create several RAID1 arrays and "raid0" them using that regular vm technique. The result will be RAID10 for swap.

/mjt
Re: Raid 10 Problems?
Jan Engelhardt wrote:
[] The other thing is, the bitmap is supposed to be written out at intervals, not at every write, so the extra head movement for bitmap updates should be really low, and not making the tar -xjf process slower by half a minute. Is there a way to tweak the write-bitmap-to-disk interval? Perhaps something in /sys or ye olde /proc. Maybe linux-raid@ knows 8)

Hmm. The bitmap is supposed to be written *before* the actual data write, to mark the to-be-written areas of the array as being written, so that those areas can be detected and recovered in case of a power failure during the actual write. So when writing to a clean array, head movement always takes place - first to the bitmap area, and second to the actual data area. The "written at intervals" part is about *clearing* the bitmap after some idle time. In other words, dirtying bitmap bits occurs right before the actual write, and clearing bits occurs at intervals.

Sure, if you write to (or near) the same place again and again, without giving the md subsystem a chance to actually clean the bitmap, there will be no additional head movement. And that covers, for example, tar -xjf some of the time, since the filesystem will place the files being extracted close to each other, thus hitting the same bit in the bitmap; hence md will skip the repeated bitmap updates in this case.

/mjt
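On the tuning question itself: when a bitmap is (re)created, mdadm does expose two relevant knobs - the bitmap chunk size (larger chunks mean fewer distinct bits to dirty) and the delay before dirty bits are cleaned. A hedged sketch, assuming a reasonably recent mdadm and an array /dev/md0 (names illustrative); changing these means removing and re-adding the bitmap:

```shell
# Drop the existing internal bitmap, then re-add it with a larger
# chunk (in KiB) and a longer clean delay (in seconds), so bitmap
# writes are fewer and lazier:
mdadm --grow /dev/md0 --bitmap=none
mdadm --grow /dev/md0 --bitmap=internal --bitmap-chunk=65536 --delay=30
```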
Re: nonzero mismatch_cnt with no earlier error
Jason Rainforest wrote:
I tried doing a check, found a mismatch_cnt of 8 (7*250Gb SW RAID5, multiple controllers, on Linux 2.6.19.2, SMP x86-64 on an Athlon64 X2 4200+). I then ordered a resync.

(As pointed out later, it was a repair, not a resync.)

The mismatch_cnt returned to 0 at the start of the resync, but around the same time that it went up to 8 with the check, it went up to 8 in the resync. After the resync, it is still 8. I haven't ordered a check since the resync completed.

As far as I understand, repair will do the same as check does, but will ALSO try to fix the problems found. So the number in mismatch_cnt after a repair indicates the amount of mismatches found _and fixed_.

/mjt
Re: Move superblock on partition resize?
Rob Bray wrote:
I am trying to grow a raid5 volume in-place. I would like to expand the partition boundaries, then grow raid5 into the newly-expanded partitions. I was wondering if there is a way to move the superblock from the end of the old partition to the end of the new partition. I've tried dd if=/dev/sdX1 of=/dev/sdX1 bs=512 count=256 skip=(sizeOfOldPartitionInBlocks - 256) seek=(sizeOfNewPartitionInBlocks - 256) unsuccessfully. Also, copying the last 128KB (256 blocks) of the old partition before the table modification to a file, and placing that data at the tail of the new partition, also yields no beans. I can drop one drive at a time from the group, change the partition table, then hot-add it, but a resync times 7 drives is a lot of juggling. Any ideas?

The superblock location is somewhat tricky to calculate correctly. I've used a tiny program (attached) for exactly this purpose.

/mjt

/* mdsuper: read or write a linux software raid superblock (version 0.90)
 * from or to a given device.
 *
 * GPL.
 * Written by Michael Tokarev ([EMAIL PROTECTED])
 */
#define _GNU_SOURCE
#include <sys/types.h>
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <stdlib.h>
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <linux/types.h>
#include <linux/raid/md_p.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
  unsigned long long dsize;
  unsigned long long offset;
  int mdfd;
  int n;
  mdp_super_t super;
  const char *dev;

  if (argc != 3) {
    fprintf(stderr, "mdsuper: usage: mdsuper {read|write} mddev\n");
    return 1;
  }
  if (strcmp(argv[1], "read") == 0)
    n = O_RDONLY;
  else if (strcmp(argv[1], "write") == 0)
    n = O_WRONLY;
  else {
    fprintf(stderr, "mdsuper: read or write arg required, not \"%s\"\n", argv[1]);
    return 1;
  }
  dev = argv[2];
  mdfd = open(dev, n, 0);
  if (mdfd < 0) {
    perror(dev);
    return 1;
  }
  if (ioctl(mdfd, BLKGETSIZE64, &dsize) < 0) {
    perror(dev);
    return 1;
  }
  if (dsize < MD_RESERVED_SECTORS*2) {
    fprintf(stderr, "mdsuper: %s is too small\n", dev);
    return 1;
  }
  offset = MD_NEW_SIZE_SECTORS(dsize>>9);
  fprintf(stderr, "size=%Lu (%Lu sect), offset=%Lu (%Lu sect)\n",
          dsize, dsize>>9, offset * 512, offset);
  offset *= 512;
  if (n == O_RDONLY) {
    if (pread64(mdfd, &super, sizeof(super), offset) != sizeof(super)) {
      perror(dev);
      return 1;
    }
    if (super.md_magic != MD_SB_MAGIC) {
      fprintf(stderr, "%s: bad magic (0x%08x, should be 0x%08x)\n",
              dev, super.md_magic, MD_SB_MAGIC);
      return 1;
    }
    if (write(1, &super, sizeof(super)) != sizeof(super)) {
      perror("write");
      return 1;
    }
  }
  else {
    if (read(0, &super, sizeof(super)) != sizeof(super)) {
      perror("read");
      return 1;
    }
    if (pwrite64(mdfd, &super, sizeof(super), offset) != sizeof(super)) {
      perror(dev);
      return 1;
    }
  }
  return 0;
}
Re: Kernel 2.6.19.2 New RAID 5 Bug (oops when writing Samba - RAID5)
Justin Piszcz wrote:
[] Is this a bug that can or will be fixed, or should I disable pre-emption on critical and/or server machines?

Disabling pre-emption on critical and/or server machines seems to be a good idea in the first place. IMHO anyway.. ;)

/mjt
Re: Kernel 2.6.19.2 New RAID 5 Bug (oops when writing Samba - RAID5)
Justin Piszcz wrote:
On Tue, 23 Jan 2007, Michael Tokarev wrote:
Disabling pre-emption on critical and/or server machines seems to be a good idea in the first place. IMHO anyway.. ;)

So the bottom line is: make sure not to use preemption on servers, or else you will get weird spinlock/deadlocks on RAID devices -- GOOD to know!

This is not the reason. The reason is that preemption usually works worse on servers, esp. high-loaded servers - the more often you interrupt (kernel) work, the more needless context switches you'll have, and the slower the whole thing runs. Another point is that with preemption enabled, we have more chances to hit one or another bug somewhere. Those bugs should be found and fixed, for sure, but important servers/data aren't usually the place for bughunting.

/mjt
Re: raid5 software vs hardware: parity calculations?
dean gaudet wrote:
[] if this is for a database or fs requiring lots of small writes then raid5/6 are generally a mistake... raid10 is the only way to get performance. (hw raid5/6 with nvram support can help a bit in this area, but you just can't beat raid10 if you need lots of writes/s.)

A small nitpick. At least some databases never do small-sized I/O, at least not against the datafiles. For example, Oracle uses a fixed I/O block size, specified at database (or tablespace) creation time -- by default it's 4Kb or 8Kb, but it may be 16Kb or 32Kb as well. Now, if you make your raid array's stripe size match the block size of the database, *and* ensure the files are aligned on disk properly, it will just work, without needless reads to calculate parity blocks during writes.

But the problem is that this is near impossible to do. First, even if the db writes in 32Kb blocks, it means the stripe size should be 32Kb, which is only suitable for raid5 with 3 disks and a chunk size of 16Kb, or with 5 disks and a chunk size of 8Kb (this last variant is quite bad, because a chunk size of 8Kb is too small). In other words, only a very limited set of configurations will be more-or-less good. And second, most filesystems used for databases don't care about correct file placement. For example, ext[23]fs, with a maximum blocksize of 4Kb, will align files by 4Kb, not by stripe size - which means that a whole 32Kb block can be laid out as: first 4Kb on one stripe, the remaining 28Kb on the next stripe, which means that for both parts a full read-write cycle will be needed again to update the parity blocks - the very thing we tried to avoid by choosing the sizes in the previous step. Only xfs so far (from the list of filesystems I've checked) pays attention to the stripe size and tries to ensure files are aligned to it. (Yes, I know about mke2fs's stride=xxx parameter, but it only affects metadata, not data.)
That's why all the above is a small nitpick - i.e., in theory it IS possible to use raid5 for a database workload in certain cases, but due to all the gory details, it's nearly impossible to do right. /mjt
Re: Linux Software RAID 5 Performance Optimizations: 2.6.19.1: (211MB/s read 195MB/s write)
Justin Piszcz wrote: Using 4 raptor 150s: Without the tweaks, I get 111MB/s write and 87MB/s read. With the tweaks, 195MB/s write and 211MB/s read. Using kernel 2.6.19.1. Without the tweaks and with the tweaks: # Stripe tests: echo 8192 > /sys/block/md3/md/stripe_cache_size # DD TESTS [WRITE] DEFAULT: (512K) $ dd if=/dev/zero of=10gb.no.optimizations.out bs=1M count=10240 10240+0 records in 10240+0 records out 10737418240 bytes (11 GB) copied, 96.6988 seconds, 111 MB/s [] 8192K READ AHEAD $ dd if=10gb.16384k.stripe.out of=/dev/null bs=1M 10240+0 records in 10240+0 records out 10737418240 bytes (11 GB) copied, 64.9454 seconds, 165 MB/s What exactly are you measuring? Linear read/write, like copying one device to another (or to /dev/null), in large chunks? I don't think that's an interesting test. Hint: how many times a day do you plan to perform such a copy? (By the way, for a copy of one block device to another, try using O_DIRECT, with two dd processes doing the copy - one reading and another writing - this way you'll get the best results without a huge effect on other things running on the system. Like this: dd if=/dev/onedev bs=1M iflag=direct | dd of=/dev/twodev bs=1M oflag=direct ) /mjt
Re: RAID1 root and swap and initrd
[A late follow-up] Bill Davidsen wrote: Michael Tokarev wrote: Andre Majorel wrote: [] Thanks Jurriaan and Gordon. I think I may still be f*cked, however. The Lilo doc says you can't use raid-extra-boot=mbr-only if boot= does not point to a raid device. Which it doesn't because in my setup, boot=/dev/sda. Using boot=/dev/md5 would solve the raid-extra-boot issue but the components of /dev/md5 are not primary partitions (/dev/sda5, /dev/sdb5) so I don't think that would work. So just move it to sda1 (or sda2, sda3) from sda5, ensure you've got two identical drives (or at least that your boot partitions are laid out identically), and use boot=/dev/md1 (or md2, md3). Do NOT use raid-extra-boot (set it to none), but install standard mbr code into the boot sector of both drives (in debian, it's the 'mbr' package; lilo can be used for that too - once for each drive), and mark your boot partition on both drives as active. This is the cleanest setup to boot off raid. You'll have two drives, both will be bootable, and both will be updated when you run lilo. Another bonus - if you ever install a foreign OS on this system, which tends to update boot code, all your stuff will still be intact - the only thing you'll need to do to restore linux boot is to reset the 'active' flags for your partitions (and no, the winnt disk manager does not allow you to do so - no ability to set a non-dos (non-windows) partition active). I *could* run lilo once for each disk after tweaking boot= in lilo.conf, or just supply a different -M option but I'm not sure. The Lilo doc is not terribly enlightening. Not for me, anyway. :-) No, don't do that. Even if you can automate it. It's error-prone to say the least, and it will bite you at an unexpected moment. The desirable solution is to use the DOS MBR (boot active partition) and put the boot stuff in the RAID device. However, you can just write the MBR to the hda and then to hdb. 
Note that you don't play with the partition names, the 2nd MBR will only be used if the 1st drive fails, and therefore at the BIOS level the 2nd drive will now be hda (or C:) if LILO still uses the BIOS to load the next sector. Just a small note. The DOS MBR can't boot from a non-primary partition. In this case, the boot code is at sda5, which is on an extended partition. But lilo now has the ability to write its own MBR, which, in turn, IS able to boot off a logical partition: lilo -M /dev/sda ext /mjt
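For reference, the setup recommended above might look something like this in lilo.conf - a hypothetical sketch, assuming the boot partition is a raid1 /dev/md1 (sda1+sdb1) and root is on /dev/md2 (these device names and paths are assumptions, not from the thread):

```
# boot sector goes into the raid1 boot partition, not any drive's MBR;
# the standard MBR code on each drive chain-loads the active partition
boot=/dev/md1
raid-extra-boot=none
image=/boot/vmlinuz
        root=/dev/md2
        label=linux
        read-only
```

The standard MBR code itself is installed separately on each drive (e.g. with Debian's 'mbr' package, as the text says), and the boot partition is marked active on both drives.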
Re: RAID1 root and swap and initrd
Andre Majorel wrote: [] Thanks Jurriaan and Gordon. I think I may still be f*cked, however. The Lilo doc says you can't use raid-extra-boot=mbr-only if boot= does not point to a raid device. Which it doesn't because in my setup, boot=/dev/sda. Using boot=/dev/md5 would solve the raid-extra-boot issue but the components of /dev/md5 are not primary partitions (/dev/sda5, /dev/sdb5) so I don't think that would work. So just move it to sda1 (or sda2, sda3) from sda5, ensure you've got two identical drives (or at least that your boot partitions are laid out identically), and use boot=/dev/md1 (or md2, md3). Do NOT use raid-extra-boot (set it to none), but install standard mbr code into the boot sector of both drives (in debian, it's the 'mbr' package; lilo can be used for that too - once for each drive), and mark your boot partition on both drives as active. This is the cleanest setup to boot off raid. You'll have two drives, both will be bootable, and both will be updated when you run lilo. Another bonus - if you ever install a foreign OS on this system, which tends to update boot code, all your stuff will still be intact - the only thing you'll need to do to restore linux boot is to reset the 'active' flags for your partitions (and no, the winnt disk manager does not allow you to do so - no ability to set a non-dos (non-windows) partition active). I *could* run lilo once for each disk after tweaking boot= in lilo.conf, or just supply a different -M option but I'm not sure. The Lilo doc is not terribly enlightening. Not for me, anyway. :-) No, don't do that. Even if you can automate it. It's error-prone to say the least, and it will bite you at an unexpected moment. A nice little tip I learned from someone else on the list is to have your md devices named after the partition numbers. So: That'd be me... ;) /mjt
Re: RAID1 root and swap and initrd
Andre Majorel wrote: [] So just move it to sda1 (or sda2, sda3) from sda5 Problem is, the disks are entirely used by an extended partition. There's nowhere to move sd?5 to. You're using raid, so you've got at least two disk drives. Remove one component (the second disk) from all your raid devices, repartition that disk, and re-add the components back - this will copy the data over to the second disk. Then repeat the same procedure with the first disk. Or something like that -- probably only a single partition (on both disks) needs to be recreated this way. I think it's possible to turn the first logical partition into a primary partition by modifying the partition table in the MBR but I'm not sure I'm up for that. It's possible to move sda5 to sda1, but not easy - because at the start of the extended partition there's a (relatively large) space reserved for the logical partition tables, you can't just relabel your partitions, you have to actually move the data. /mjt
Re: why not make everything partitionable?
martin f krafft wrote: Hi folks, you cannot create partitions within partitions, but you can well use whole disks for a filesystem without any partitions. It's usually better to have a partition table in place, at least on x86. Just to stop possible confusion - be it from the kernel, or from the inability to identify disks properly (think [c]fdisk displaying labels), or from anything else. But ok. Along the same lines, I wonder why md/mdadm distinguish between partitionable and non-partitionable in the first place. Why isn't everything partitionable? It's both historic (before, there were no partitionable md arrays), and due to the fact that the number of partitions is limited by the single major number (ie, 256 (sub)partitions max). Maybe there are other reasons - I don't have a definite answer. /mjt
Re: [PATCH 001 of 6] md: Send online/offline uevents when an md array starts/stops.
Neil Brown wrote: [/dev/mdx...] (much like how /dev/ptmx is used to create /dev/pts/N entries.) [] I have the following patch sitting in my patch queue (since about March). It does what you suggest via /sys/module/md-mod/parameters/MAGIC_FILE which is the only md-specific part of the /sys namespace that I could find. However I'm not at all convinced that it is a good idea. I would much rather have mdadm control device naming than leave it up to udev. This is again the same device naming question that pops up every time someone mentions udev. And as usual, I'm suggesting the following, which should - hopefully - make everyone happy: create kernel names *always*, be it /dev/mdN or /dev/sdF or whatever, so that things like /proc/partitions, /proc/mdstat etc will be useful. For this, the ideal solution - IMHO - is to have a mini-devfs-like filesystem mounted as /dev, so that it is possible to have bare names without any help from external programs like udev, but I don't want to start another flamewar here, esp. since it's off-topic to *this* discussion. Note /dev/mdN is as good as /dev/md/N - because there are only a few active devices which appear in /dev, there's no risk of having too many files in /dev, hence no need to put them into subdirs like /dev/md/, /dev/sd/ etc. If so desired, create *symlinks* in /dev with appropriate user-controlled names to those official kernel device nodes. Be it like /dev/disk/by-label/ or /dev/cdrom0 or whatever. The links can be created by mdadm, OR by udev - in this case, it's really irrelevant. Udev rules do a good job of creating the /dev/disk/ hierarchy already, and that seems to be sufficient - I see no reason to make other device nodes (symlinks) from mdadm. By the way, unlike /dev/sdE and /dev/hdF entries, /dev/mdN nodes are pretty stable. 
Even if scsi disks get reordered, mdadm finds the component devices by UUID (if DEVICE partitions is given in the config file), and you have /dev/md1 pointing to the same logical partition (having the same filesystem and data) regardless of how you shuffle your disks (IF mdadm was able to find all the components and assemble the array, anyway). So sometimes I use md/mdadm on systems WITHOUT any raided drives, but where I suspect disk devices may change for whatever reason - I just create raid0 arrays composed of a single partition and let mdadm find them in /dev/sd* and assemble stable-numbered /dev/mdN devices - without any help from udev or anything else (I for one dislike udev for several reasons). And in any case, we have the semantic that opening an md device-file creates the device, and we cannot get rid of that semantic without a lot of warning and a lot of pain. And adding a new semantic isn't really going to help. I don't think so. With the new semantics in place, we've got two options (provided the current semantics stays, and I don't see a strong reason why it should be removed except for the bloat): a) with a new mdadm utilizing the new semantics, there's nothing to change in udev -- it will all Just Work, by mdadm opening /dev/md-control-node (however it's called) and assembling devices using that; during assembly, udev will receive proper events about new disks appearing and will handle them as usual. b) without a new mdadm, it will work as before (now). And in this case, let's not send any udev events, as mdadm already created the nodes etc. So if a user wants neat and nice md/udev integration, the way to go is case a. If it's not required, either case will do. Sure, eventually, long term, support for case b can be removed. 
Or not - depending on how things get implemented, because when done properly, both cases will call the same routine(s), but case b will just skip sending uevents, so the ioctl handlers become two- or one-liners (two in case a and one in case b), which isn't bloat really ;) /mjt
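The DEVICE-partitions-plus-UUID trick described above amounts to an mdadm.conf along these lines (a hypothetical fragment; the UUID shown is made up for illustration):

```
# scan every partition the kernel knows about, identify arrays by UUID,
# so /dev/md1 stays /dev/md1 no matter how the disks get renumbered
DEVICE partitions
ARRAY /dev/md1 UUID=c43f2d2c:16bd2c9c:e3a1c6a4:1f9f6f01
```

With this in place, `mdadm --assemble --scan` finds the components wherever they currently live in /dev/sd* and always assembles them as /dev/md1.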
Re: [PATCH 001 of 6] md: Send online/offline uevents when an md array starts/stops.
Michael Tokarev wrote: Neil Brown wrote: [/dev/mdx...] [] And in any case, we have the semantic that opening an md device-file creates the device, and we cannot get rid of that semantic without a lot of warning and a lot of pain. And adding a new semantic isn't really going to help. I don't think so. With the new semantics in place, we've got two options (provided the current semantics stays, and I don't see a strong reason why it should be removed except for the bloat): a) with a new mdadm utilizing the new semantics, there's nothing to change in udev -- it will all Just Work, by mdadm opening /dev/md-control-node (however it's called) and assembling devices using that; during assembly, udev will receive proper events about new disks appearing and will handle them as usual. b) without a new mdadm, it will work as before (now). And in this case, let's not send any udev events, as mdadm already created the nodes etc. Forgot to add. This is an important point: do NOT change the current behaviour wrt uevents, ie, don't add uevents for the current semantics at all. Only send uevents (and in this case they will be normal add and remove events) when assembling arrays the new way, using the (stable!) /dev/mdcontrol misc device, after the RUN_ARRAY and STOP_ARRAY actions have been performed. /mjt So if a user wants neat and nice md/udev integration, the way to go is case a. If it's not required, either case will do.
mdadm: bitmaps not supported by this kernel?
Another 32/64 bits issue, it seems. Running a 2.6.18.1 x86-64 kernel and mdadm 2.5.3 (32 bit). # mdadm -G /dev/md1 --bitmap=internal mdadm: bitmaps not supported by this kernel. # mdadm -G /dev/md1 --bitmap=none mdadm: bitmaps not supported by this kernel. etc. Recompiling mdadm in 64bit mode eliminates the problem. So far, only bitmap manipulation is broken this way. I dunno if other things are broken too - at least --assemble, --create, --stop, --detail work. Thanks. /mjt
Re: [PATCH] md: Fix bug where new drives added to an md array sometimes don't sync properly.
Neil Brown wrote: [] Fix count of degraded drives in raid10. Signed-off-by: Neil Brown [EMAIL PROTECTED]

--- .prev/drivers/md/raid10.c	2006-10-09 14:18:00.0 +1000
+++ ./drivers/md/raid10.c	2006-10-05 20:10:07.0 +1000
@@ -2079,7 +2079,7 @@ static int run(mddev_t *mddev)
 		disk = conf->mirrors + i;
 		if (!disk->rdev ||
-		    !test_bit(In_sync, &rdev->flags)) {
+		    !test_bit(In_sync, &disk->rdev->flags)) {
 			disk->head_position = 0;
 			mddev->degraded++;
 		}

Neil, this makes me nervous. Seriously. How many bugs like this have been fixed so far? 10? 50? I stopped counting a long time ago. And it's the same thing in every case - misuse of rdev vs disk->rdev. The same pattern. I wonder if it can be avoided in the first place somehow - maybe don't declare and use a local variable `rdev' (not by name, but by the semantics of it), and always use disk->rdev or mddev->whatever in every place, explicitly, and let the compiler optimize the deref if possible? And btw, this is another 2.6.18.1 candidate (if it's not too late already). Thanks. /mjt
Re: avoiding the initial resync on --create
Doug Ledford wrote: On Mon, 2006-10-09 at 15:10 -0400, Rob Bray wrote: [] Probably the best thing to do would be on create of the array, setup a large all 0 block of mem and repeatedly write that to all blocks in the array devices except parity blocks and use a large all 1 block for that. Then you could just write the entire array at blinding speed. You could call that the quick-init option or something. You wouldn't be able to use the array until it was done, but it would be quick. If you wanted to be *really* fast, at least for SCSI drives you could write one large chunk of 0's and one large chunk of 1's at the first parity block, then use the SCSI COPY command to copy the 0 chunk everywhere it needs to go, and likewise for the parity chunk, and avoid transferring the data over the SCSI bus more than once. Some notes. First, sometimes a raid array gets created in order to repair a broken array. Ie, you had an array, you lost it for whatever reason, and you re-create it, avoiding the initial resync (--assume-clean option), in the hope your data is still there. For that, you don't want to zero-fill your drives, for sure! :) And second, at least SCSI drives have the FORMAT UNIT command, which takes a range argument (from-sector and to-sector), and, if memory serves me right, a filler argument as well (the data, a 512-byte block, to write to all the sectors in the range). (Well, it was long ago when I looked at that stuff, so it might be some other command, but it's there anyway). I'm not sure it's used/available in the block device layer (most probably it isn't). But this is the fastest way to fill (parts of) your drives with whatever repeated pattern of bytes you want. Including this initial zero-filling. But either way, you don't really need to do that in kernel space -- a userspace solution will work too. Ok ok, if the kernel is doing it after array creation, the array is available immediately for other use, which is a plus. And yes, I'm not sure implementing it is worth the effort. 
Unless you're re-creating your multi-terabyte array several times a day ;) /mjt
Re: Simulating Drive Failure on Mirrored OS drive
andy liebman wrote: Read up on the md-faulty device. Got any links to this? As I said, we know how to set the device as faulty, but I'm not convinced this is a good simulation of a drive that fails (times out, becomes unresponsive, etc.) Note that 'set device as faulty' is NOT the same as the `md-faulty' device. Read the mdadm(8) manpage, and see the options `-l' (level) and `-p' (parity) for create mode. /mjt
Re: [BUG/PATCH] md bitmap broken on big endian machines
Paul Clements wrote: Michael Tokarev wrote: Neil Brown wrote: ffs is closer, but takes an 'int' and we have a 'unsigned long'. So use ffz(~X) to convert a chunksize into a chunkshift. So we don't use ffs(int) for an unsigned value because of int vs unsigned int, but we use ffz() with a negated UNSIGNED. Looks even more broken to me, even if it happens to work correctly... ;) No, it doesn't matter about the signedness, these are just bit operations. The problem is the size (int vs. long), even though in practice it's very unlikely you'd ever have a bitmap chunk size that exceeded 32 bits. But it's better to be correct and not have to worry about it. I understand the point, in the first place (I didn't mention long vs int above, however). The thing is: when reading the code, it looks just plain wrong. Esp. since the function prototypes aren't here - for ffs(), ffz() etc they're hidden somewhere in include/asm/* (as they're architecture-dependent), and it's not at all obvious which is signed and which is unsigned, which is long or int etc. At the very least, return -ENOCOMMENT :) /mjt
Re: [BUG/PATCH] md bitmap broken on big endian machines
Neil Brown wrote: [] Use ffz instead of find_first_set to convert multiplier to shift. From: Paul Clements [EMAIL PROTECTED] find_first_set doesn't find the least-significant bit on bigendian machines, so it is really wrong to use it. ffs is closer, but takes an 'int' and we have a 'unsigned long'. So use ffz(~X) to convert a chunksize into a chunkshift. So we don't use ffs(int) for an unsigned value because of int vs unsigned int, but we use ffz() with negated UNSIGNED. Looks even more broken to me, even if it happens to work correctly... ;) /mjt
proactive-raid-disk-replacement
Recently Dean Gaudet, in the thread titled 'Feature Request/Suggestion - Drive Linking', mentioned his document, http://arctic.org/~dean/proactive-raid5-disk-replacement.txt I've read it, and have some umm.. concerns. Here's why: mdadm -Gb internal --bitmap-chunk=1024 /dev/md4 mdadm /dev/md4 -r /dev/sdh1 mdadm /dev/md4 -f /dev/sde1 -r /dev/sde1 mdadm --build /dev/md5 -ayes --level=1 --raid-devices=2 /dev/sde1 missing mdadm /dev/md4 --re-add /dev/md5 mdadm /dev/md5 -a /dev/sdh1 ... wait a few hours for md5 resync... And here's the problem. While the new disk, sdh1, is resynced from the old, probably failing disk sde1, chances are high that there will be an unreadable block on sde1. And this means the whole thing will not work -- md5 initially contained one working drive (sde1) and one spare (sdh1) which is being converted (resynced) to a working disk. But after a read error on sde1, md5 will contain one failed drive and one spare -- for raid1 that's a fatal combination. Meanwhile, it would be perfectly easy to reconstruct this failing block from the other component devices of md4. That is to say: this way of replacing a disk in a software raid array isn't much better than just removing the old drive and adding a new one. And if the drive you're replacing is failing (according to SMART for example), this method is more likely to fail. /mjt
Re: RAID5 fill up?
Lars Schimmer wrote: Hi! I've got a software RAID5 with 6 250GB HDs. Now I changed one disk after another to a 400GB HD and resynced the raid5 after each change. Now the RAID5 has got 6 400GB HDs and still uses only 6*250GB space. How can I grow the md0 device to use 6*400GB? mdadm --grow is your friend. /mjt
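Spelled out, the invocation would be something like the following sketch (device name and filesystem are assumptions from the question, not stated in the reply):

```
# tell md to use all available space on the now-larger component devices
mdadm --grow /dev/md0 --size=max
# then grow the filesystem on top of the array, e.g. for ext2/ext3:
resize2fs /dev/md0
```

The --size=max form only works once every component has been replaced with a larger one, as was done here.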
Re: proactive-raid-disk-replacement
dean gaudet wrote: On Fri, 8 Sep 2006, Michael Tokarev wrote: Recently Dean Gaudet, in the thread titled 'Feature Request/Suggestion - Drive Linking', mentioned his document, http://arctic.org/~dean/proactive-raid5-disk-replacement.txt I've read it, and have some umm.. concerns. Here's why: mdadm -Gb internal --bitmap-chunk=1024 /dev/md4 By the way, don't specify bitmap-chunk for an internal bitmap. It's needed for a file-based (external) bitmap. With an internal bitmap, we have a fixed size in the superblock for it, so the bitmap chunk is determined by dividing that size by the size of the array. mdadm /dev/md4 -r /dev/sdh1 mdadm /dev/md4 -f /dev/sde1 -r /dev/sde1 mdadm --build /dev/md5 -ayes --level=1 --raid-devices=2 /dev/sde1 missing mdadm /dev/md4 --re-add /dev/md5 mdadm /dev/md5 -a /dev/sdh1 ... wait a few hours for md5 resync... And here's the problem. While the new disk, sdh1, is resynced from the old, probably failing disk sde1, chances are high that there will be an unreadable block on sde1. And this means the whole thing will not work -- md5 initially contained one working drive (sde1) and one spare (sdh1) which is being converted (resynced) to a working disk. But after a read error on sde1, md5 will contain one failed drive and one spare -- for raid1 that's a fatal combination. Meanwhile, it would be perfectly easy to reconstruct this failing block from the other component devices of md4. this statement is an argument for native support for this type of activity in md itself. Yes, definitely. That is to say: this way of replacing a disk in a software raid array isn't much better than just removing the old drive and adding a new one. hmm... i'm not sure i agree. in your proposal you're guaranteed to have no redundancy while you wait for the new disk to sync in the raid5. It's not a proposal per se, it's just another possible way (used by the majority of users I think, because it's way simpler ;) in my proposal the probability that you'll retain redundancy through the entire process is non-zero. 
we can debate how non-zero it is, but non-zero is greater than zero. Yes, there will be no redundancy in my variant, guaranteed. And yes, there is a chance of completing your whole process without a glitch. i'll admit it depends a heck of a lot on how long you wait to replace your disks, but i prefer to replace mine well before they get to the point where just reading the entire disk is guaranteed to result in problems. And if the drive you're replacing is failing (according to SMART for example), this method is more likely to fail. my practice is to run regular SMART long self tests, which tend to find Current_Pending_Sectors (which are generally read errors waiting to happen) and then launch a repair sync action... that generally drops the Current_Pending_Sector back to zero. either through a realloc or just simply rewriting the block. if it's a realloc then i consider if there's enough of them to warrant replacing the disk... so for me the chances of a read error while doing the raid1 thing aren't as high as they could be... So the whole thing goes this way: 0) do a SMART selftest ;) 1) do a repair pass of the whole array 2) copy data from the failing drive to the new one (using a temporary superblock-less array) 2a) if step 2 still failed, probably due to new bad sectors, go the old way, removing the failing drive and adding the new one. That's 2x or 3x (or 4x counting the selftest, but that should be done regardless) more work than just going the old way from the beginning, but with some chance of completing it flawlessly in 2 steps, without losing redundancy. Too complicated and too long for most people, I'd say ;) I can come up with yet another way, which is only somewhat possible with current md code. In 3 variants. 1) Offline the array, stop it. 
Make a copy of the drive using dd with conv=noerror (or however it's spelled), noting the bad blocks. Mark those bad blocks in the bitmap as dirty. Assemble the array with the new drive, letting it resync to the new drive the blocks we were unable to copy previously. This variant does not lose redundancy at all, but requires the array to be off-line during the whole copy procedure. What's missing (which has been discussed on linux-raid@ recently too) is the ability to mark those bad blocks in the bitmap. 2) The same, but without offlining the array. Hot-remove a drive, make a copy of it to the new drive, flip the necessary bitmap bits, re-add the new drive, and let the raid code resync the changed (during the copy, while the array was still active, something might have changed) and missing blocks. This variant still loses redundancy, but not much of it, provided the bitmap code works correctly. 3) The same as your way, with the difference that we tell md to *skip* and ignore possible errors during resync (which is also not possible currently). but yeah you've convinced me this solution isn't good
Re: Feature Request/Suggestion - Drive Linking
Tuomas Leikola wrote: [] Here's an alternate description. On the first 'unrecoverable' error, the disk is marked as FAILING, which means that a spare is immediately taken into use to replace the failing one. The disk is not kicked, and readable blocks can still be used to rebuild other blocks (from other FAILING disks). The rebuild can be more like a ddrescue type operation, which is probably a lot faster in the case of raid6, and the disk can be automatically kicked after the sync is done. If there is no read access to the FAILING disk, the rebuild will be faster just because seeks are avoided in a busy system. It's not that simple. The issue is with writes. If there's a failing disk, the md code will need to keep track of its up-to-date, or good, sectors vs the obsolete ones. Ie, when a write fails, the data in that block is either unreadable (but may become readable on the next try, say, after a temperature change or whatnot), or readable but containing old data, or readable but containing some random garbage. So at least those blocks of the disk should not be copied to the spare during resync, and should not be read at all, to avoid returning wrong data to userspace. In short, if the array isn't stopped (or changed to read-only), we have to watch for writes, and remember which ones failed. Which is a non-trivial change. Yes, bitmaps somewhat help here. /mjt
spurious dots in dmesg when reconstructing arrays
A long time ago I noticed pretty bad formatting of the dmesg text in md array reconstruction output, but never bothered to ask. So here it goes. Example dmesg (RAID conf printout sections omitted):

md: bind<sdb1>
RAID1 conf printout:
..<6>md: syncing RAID array md1
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 20 KB/sec) for reconstruction.
md: using 128k window, over a total of 248896 blocks.
md: bind<sdb2>
RAID5 conf printout:
..<6>md: delaying resync of md2 until md1 has finished resync (they share one or more physical units)
md: bind<sdb3>
RAID1 conf printout:
..<6>md: delaying resync of md31 until md1 has finished resync (they share one or more physical units)
<6>md: delaying resync of md2 until md31 has finished resync (they share one or more physical units)
md: bind<sdb5>
RAID5 conf printout:
<6>md: delaying resync of md5 until md31 has finished resync (they share one or more physical units)
..<6>md: delaying resync of md31 until md1 has finished resync (they share one or more physical units)
..<6>md: delaying resync of md2 until md5 has finished resync (they share one or more physical units)
md: bind<sdb6>
RAID1 conf printout:
..<6>md: delaying resync of md61 until md5 has finished resync (they share one or more physical units)
..<6>md: delaying resync of md31 until md1 has finished resync (they share one or more physical units)
..<6>md: delaying resync of md2 until md5 has finished resync (they share one or more physical units)
<6>md: delaying resync of md5 until md31 has finished resync (they share one or more physical units)
md: bind<sdb7>
RAID5 conf printout:
..<6>md: delaying resync of md7 until md5 has finished resync (they share one or more physical units)
..<6>md: delaying resync of md31 until md1 has finished resync (they share one or more physical units)
<6>md: delaying resync of md5 until md31 has finished resync (they share one or more physical units)
..<6>md: delaying resync of md2 until md5 has finished resync (they share one or more physical units)
..<6>md: delaying resync of md61 until md7 has finished resync (they share one or more physical units)
md: md1: sync done.
..<6>md: delaying resync of md61 until md7 has finished resync (they share one or more physical units)
..<6>md: delaying resync of md2 until md5 has finished resync (they share one or more physical units)
<6>md: delaying resync of md5 until md31 has finished resync (they share one or more physical units)
...<6>md: syncing RAID array md31
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 20 KB/sec) for reconstruction.
md: using 128k window, over a total of 995904 blocks.
..<6>md: delaying resync of md7 until md5 has finished resync (they share one or more physical units)
RAID1 conf printout:
md: md31: sync done.
RAID1 conf printout:          === here, the actual conf printout is below:
...<6>md: delaying resync of md7 until md5 has finished resync (they share one or more physical units)
...<6>md: syncing RAID array md5
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 20 KB/sec) for reconstruction.
md: using 128k window, over a total of 666560 blocks.
 --- wd:2 rd:2                = here:
 disk 0, wo:0, o:1, dev:sdb3
 disk 1, wo:0, o:1, dev:sdd3
..<6>md: delaying resync of md61 until md7 has finished resync (they share one or more physical units)
..<6>md: delaying resync of md2 until md5 has finished resync (they share one or more physical units)
md: md5: sync done.
<6>md: delaying resync of md7 until md2 has finished resync (they share one or more physical units)
..<6>md: delaying resync of md61 until md7 has finished resync (they share one or more physical units)
..<6>md: syncing RAID array md2
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 20 KB/sec) for reconstruction.
md: using 128k window, over a total of 1252992 blocks.
RAID5 conf printout:
md: md2: sync done.
...<6>md: delaying resync of md61 until md7 has finished resync (they share one or more physical units)
..<6>md: syncing RAID array md7
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 20 KB/sec) for reconstruction.
md: using 128k window, over a total of 136 blocks.
RAID5 conf printout:
md: md7: sync done.
...<6>md: syncing RAID array md61
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 20 KB/sec) for reconstruction.
md: using 128k window, over a total of 30394880 blocks.
RAID5 conf printout:
md: md61: sync
Re: [bug?] raid1 integrity checking is broken on 2.6.18-rc4
Justin Piszcz wrote: Is there a doc for all of the options you can echo into sync_action? I'm assuming mdadm does these as well, and echo is just another way to work with the array? How about the obvious, Documentation/md.txt? And no, mdadm does not perform or trigger data integrity checking. /mjt - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
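For reference, the sync_action interface that Documentation/md.txt describes can be driven straight from the shell. A small sketch; the helper name and the directory parameter are mine (so the same code can be pointed at a real array or dry-run), not anything mdadm provides:

```shell
# Start a read-only consistency check by writing "check" into the
# array's sync_action file (md.txt also lists idle, resync, recover
# and repair as accepted values).  The helper takes the sysfs md
# directory as a parameter, e.g. /sys/block/md0/md on a real system.
start_check() {
    echo check > "$1/sync_action"
}
# On a real system:
#   start_check /sys/block/md0/md
#   cat /proc/mdstat                      # watch the check progress
#   cat /sys/block/md0/md/mismatch_cnt   # inspect the result afterwards
```
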
Re: modifying degraded raid 1 then re-adding other members is bad
Neil Brown wrote: On Tuesday August 8, [EMAIL PROTECTED] wrote: Assume I have a fully-functional raid 1 between two disks, one hot-pluggable and the other fixed. If I unplug the hot-pluggable disk and reboot, the array will come up degraded, as intended. If I then modify a lot of the data in the raid device (say it's my root fs and I'm running daily Fedora development updates :-), which modifies only the fixed disk, and then plug the hot-pluggable disk in and re-add its members, it appears that it comes up without resyncing and, well, major filesystem corruption ensues. Is this a known issue, or should I try to gather more info about it? Looks a lot like http://bugzilla.kernel.org/show_bug.cgi?id=6965 Attached are two patches. One against -mm and one against -linus. They are below. Please confirm if the appropriate one helps. NeilBrown (-mm) Avoid backward event updates in md superblock when degraded. If we - shut down a clean array, - restart with one (or more) drive(s) missing, - make some changes, - pause, so that the array gets marked 'clean', then the event count on the superblock of the included drives will be the same as that of the removed drives. So adding a removed drive back in will cause it to be included with no resync. To avoid this, we only update the event count backwards when the array is not degraded. In this case there can (should) be no non-connected drives that we can get confused with, and this is the particular case where updating backwards is valuable. Why are we updating it BACKWARD in the first place? Also, why is the event counter checked when we add something to the array -- shouldn't it resync regardless? Thanks. /mjt
Re: Converting Ext3 to Ext3 under RAID 1
Paul Clements wrote: Is 16 blocks a large enough area? Maybe. The superblock will be between 64KB and 128KB from the end of the partition. This depends on the size of the partition: SB_LOC = PART_SIZE - 64K - (PART_SIZE & (64K-1)) So, by 16 blocks, I assume you mean 16 filesystem blocks (which are generally 4KB for ext3). So as long as your partition ends exactly on a 64KB boundary, you should be OK. Personally, I would err on the safe side and just shorten the filesystem by 128KB. It's not like you're going to miss the extra 64KB. Or, better yet, shrink it by 1Mb or even 10Mb, whatever, convert to raid, and - the point - resize it to the max size of the raid device (i.e., don't give a size argument to resize2fs). This way, you will both be safe and use 100% of the available size. /mjt
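Spelled out as shell arithmetic (the helper name is mine; it just encodes the 0.90-superblock rule quoted above, with sizes in bytes):

```shell
# md v0.90 superblock location: the last 64K-aligned 64K block of the
# partition, i.e. PART_SIZE - 64K - (PART_SIZE & (64K-1)).  The result
# is therefore always between 64KB and 128KB from the end.
sb_loc() {
    k64=$((64 * 1024))
    echo $(( $1 - k64 - ($1 & (k64 - 1)) ))
}
sb_loc $((1000 * 1024 * 1024))   # a 1000 MiB partition
```

Note that adding a few stray bytes to the partition size does not move the superblock: it stays on the same 64K boundary until a whole extra 64K fits.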
Re: let md auto-detect 128+ raid members, fix potential race condition
Alexandre Oliva wrote: [] If mdadm can indeed scan all partitions to bring up all raid devices in them, like nash's raidautorun does, great. I'll give that a try. Never, ever, try to do that (again). Mdadm (or vgscan, or whatever) should NOT assemble ALL arrays found, but only those which it has been told to assemble. Here it is again: you bring another disk into a system (a disk which comes from another machine), and mdadm finds FOREIGN arrays and brings them up as /dev/md0, where YOUR root filesystem should be. That's what the 'homehost' option is for, for example. If the initrd has to be reconfigured after some changes (be it raid arrays, LVM volumes, hostname, whatever) -- I for one am fine with that. Hopefully no one will argue that if you forgot to install an MBR onto your replacement drive, it was entirely your own fault that your system became unbootable, after all ;) /mjt
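A minimal mdadm.conf along these lines might look like the following sketch (the hostname and UUID are placeholders, not a real setup):

```
# /etc/mdadm/mdadm.conf -- assemble only arrays listed here, or, with
# HOMEHOST, only arrays whose superblock carries this host's name;
# foreign arrays are left alone for manual assembly.
DEVICE partitions
HOMEHOST myhost
ARRAY /dev/md0 UUID=01234567:89abcdef:01234567:89abcdef
```

With this, a disk carried over from another machine stays inactive until you assemble it by hand under a free mdN number, as described above.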
Re: [PATCH 006 of 9] md: Remove the working_disks and failed_disks from raid5 state data.
NeilBrown wrote: They are not needed. conf->failed_disks is the same as mddev->degraded. By the way, `failed_disks' is more understandable than `degraded' in this context. Degraded usually refers to the state of the array in question, when failed_disks > 0. That is to say: I'd rename degraded back to failed_disks, here and in the rest of the raid drivers... ;) /mjt
Re: Grub vs Lilo
Jason Lunz wrote: [EMAIL PROTECTED] said: Wondering if anyone can comment on an easy way to get grub to update all components in a raid1 array. I have a raid1 /boot with a raid10 /root and have previously used lilo with the raid-extra-boot option to install to the boot sectors of all component devices. With grub it appears that you can only update non-default devices via the command line. I like the ability to be able to type lilo and have all updated in one hit. Is there a way to do this with grub? assuming your /boot is made of hda1 and hdc1: grub-install /dev/hda1 grub-install /dev/hdc1 Don't do that. Because if your hda dies, and you try to boot off hdc instead (which will be hda in this case), grub will try to read hdc, which is gone, and will fail. Most of the time (unless the bootloader is really smart and understands mirroring in full - lilo and grub do not) you want to have THE SAME boot code on both (or more, in the case of 3- or 4-disk mirrors) of your disks, including BIOS disk codes. After the above two commands, grub will write code to boot from disk 0x80 to hda, and from disk 0x81 (or 0x82) to hdc. So when your hdc becomes hda, it will not boot. In order to solve all this, you have to write a diskmap file and run grub-install twice. Both times, diskmap should list 0x80 for the device to which you're installing grub. I don't remember the syntax of the diskmap file (or even if it's really called 'diskmap'), but assuming hda and hdc notation, I mean the following: echo /dev/hda 0x80 > /boot/grub/diskmap grub-install /dev/hda1 echo /dev/hdc 0x80 > /boot/grub/diskmap # overwrite it! grub-install /dev/hdc1 The thing with all this "my RAID devices work, it is really simple!" stuff is: for too many people it indeed works, so they think it's the good and correct way. But it only works up to the actual failure, which, in most setups, isn't tested. But once something fails, umm...
Jason, try to remove your hda (pretend it has failed) and boot off hdc to see what I mean ;) (Well yes, a rescue disk will help in that case... hopefully. But not RAID, which, when installed properly, will really make a disk failure transparent). /mjt
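For what it's worth, the map file grub actually consults is device.map in the grub directory, with "(hd0) /dev/..." lines ((hd0) being BIOS disk 0x80). A sketch of the two-pass install described above; the devices are illustrative, and $GRUB_DIR defaults to a scratch directory here purely so the map-writing half can be dry-run (it would be /boot/grub on a real system):

```shell
# Write grub's device.map so that, for each pass, the disk being
# installed to is mapped to BIOS disk 0x80 (grub's hd0), then run
# grub-install against that disk's /boot partition.
GRUB_DIR=${GRUB_DIR:-$(mktemp -d)}   # /boot/grub on a real system
write_map() {
    printf '(hd0) %s\n' "$1" > "$GRUB_DIR/device.map"
}
write_map /dev/hda   # then: grub-install /dev/hda1
write_map /dev/hdc   # then: grub-install /dev/hdc1  (map overwritten)
```

Either disk then carries boot code that reads "the first BIOS disk", so the survivor still boots when it becomes hda.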
Re: Grub vs Lilo
Bernd Rieke wrote: Michael Tokarev wrote on 26.07.2006 20:00: . . The thing with all this "my RAID devices work, it is really simple!" stuff is: for too many people it indeed works, so they think it's the good and correct way. But it only works up to the actual failure, which, in most setups, isn't tested. But once something fails, umm... Jason, try to remove your hda (pretend it has failed) and boot off hdc to see what I mean ;) (Well yes, a rescue disk will help in that case... hopefully. But not RAID, which, when installed properly, will really make a disk failure transparent). /mjt Yes Michael, you're right. We use a simple RAID1 config with swap and / on three SCSI disks (2 working, one hot-spare) on SuSE 9.3 systems. We had to use lilo to handle booting off any of the two (three) disks. But we had problems upon problems until lilo 22.7 came up. With this version of lilo we can pull off any disk in any scenario. The box boots in any case. Well, a lot of systems here work on root-on-raid1 with lilo-2.2.4 (Debian package), and grub. By "works" I mean they really work, i.e., any disk failure doesn't prevent the system from working and (re)booting flawlessly (provided the disk is really dead, as opposed to being present but failing to read (some) data - in which case the only way out is either to remove it physically or to choose another boot device in the BIOS. But that's an entirely different story, about the (non-existent) really smart boot loader I mentioned in my previous email). The trick is to set the system up properly. The simple/obvious way (installing grub to hda1 and hdc1) doesn't work when you remove hda, but the complex way works. Moreover, I'd not let LILO do more guesswork for me (like the raid-extra-boot stuff, or whatever comes with 22.7 - to be honest, I didn't look at it at all, as the debian package of 2.2.4 (or 22.4?) works for me just fine).
Just write the damn thing into the start of mdN (and let the raid code replicate it to all drives, regardless of how many of them there are), after realizing it's really partition number X (with offset Y) on a real disk, and use BIOS code 0x80 for all disk access. That's all. The rest - like ensuring all the (boot) partitions are at the same place on every disk, that the disk geometry is the same, etc - is my duty, and that duty I perform accurately - because I want the disks to be interchangeable. We were wondering, when we asked the groups while in trouble with lilo before 22.7, at not having any response. Ok, the RAID driver and the kernel worked fine while resyncing the spare in case of a disk failure (thanks to Neil Brown for that). But if a box had to be rebooted with a failed disk, the situation became worse. And you have to reboot, because hotplug still doesn't work. But nobody seems to care about it, or nobody apart from us has these problems ... Just curious - when/where did you ask? [] So we came to the conclusion that everybody is working on RAID but nobody cares about the things around it, just as you mentioned, thanks for that. I tend to disagree. My statement above refers to the simple advice sometimes given here and elsewhere: "do this and that, it worked for me". By users who didn't do their homework, who never tested the stuff, who, sometimes, just have no idea as to HOW to test (that's not an insulting statement hopefully - I don't blame them for their lack of knowledge; it's something which isn't really cheap, after all). The majority of users are of this sort, and they follow each other's advice, again, without testing. HOWTOs are written by such users as well (as someone mentioned to me in a private email as a response to my reply). I mean, the existing software works. It really works. The only thing left is to set it up correctly. And please PLEASE don't treat all this as blaming bad users. It's not. I learned this stuff the hard way too.
After having unbootable remote machines after a disk failure, when everything had seemed to be ok. After screwing up systems using the famous linux raid autodetect stuff everyone loves, when, after a failed disk was replaced with another one which -- bad me -- was part of another raid array on another system, the box chose to assemble THAT raid array instead of this box's own, and overwrote a good disk with data from the new disk which had been in a testing machine. And so on. That's all to say: it's easy to make a mistake and treat the resulting setup as a good one, until shit starts happening. But shit happens very rarely, compared to average system usage, so you may never know at all that your setup is wrong, and of course you will tell others how to do things... :) /mjt
Re: [PATCH] enable auto=yes by default when using udev
Neil Brown wrote: On Monday July 3, [EMAIL PROTECTED] wrote: Hello, the following patch aims at solving an issue that is confusing a lot of users. When using udev, device files are created only when devices are registered with the kernel, and md devices are registered only when started. mdadm needs the device file _before_ starting the array. So when using udev you must add --auto=yes to the mdadm command line or to the ARRAY line in mdadm.conf. The following patch makes auto=yes the default when using udev. The principle I'm reasonably happy with, though you can now make this the default with a line like CREATE auto=yes in mdadm.conf. However

+
+	/* if we are using udev and auto is not set, mdadm will almost
+	 * certainly fail, so we force it here.
+	 */
+	if (autof == 0 && access("/dev/.udevdb", F_OK) == 0)
+		autof = 2;
+

I'm worried that this test is not very robust. On my Debian/unstable system running udev, there is no /dev/.udevdb though there is a /dev/.udev/db. I guess I could test for both, but then udev might change again. I'd really like a more robust check. Why test for udev at all? If the device does not exist, regardless of whether udev is running or not, it might be a good idea to try to create it. Because IT IS NEEDED, period. Whether the operation fails or not, and whether we fail if it fails - that's another question, and I think that w/o explicit auto=yes, we may ignore a create error and try to continue, and with auto=yes, we fail on a create error. Note that /dev might be managed by some other tool as well, like mdev from busybox, or just a tiny shell /sbin/hotplug script. Note also that the whole root filesystem might be on tmpfs (like in initramfs), so /dev will not be a mountpoint. Also, I think mdadm should stop creating strange temporary nodes somewhere as it does now. If /dev/whatever exists, use it. If not, create it (unless, perhaps, auto=no is specified) directly with a proper mknod("/dev/mdX"), but don't try to use some temporary names in /dev or elsewhere.
In the case of an nfs-mounted read-only root filesystem, if someone ever needs to assemble raid arrays there... well, he can either prepare a proper /dev on the nfs server, or use a tmpfs-based /dev, or just specify /tmp/mdXX instead of /dev/mdXX - whatever suits their needs better. /mjt
Re: New FAQ entry? (was IBM xSeries stop responding during RAID1 reconstruction)
Niccolo Rigacci wrote: [] From the command line you can see which schedulers are supported and change it on the fly (remember to do it for each RAID disk): # cat /sys/block/hda/queue/scheduler noop [anticipatory] deadline cfq # echo cfq > /sys/block/hda/queue/scheduler Otherwise you can recompile your kernel and set CFQ as the default I/O scheduler (CONFIG_DEFAULT_CFQ=y in Block layer, IO Schedulers, Default I/O scheduler). There's a much easier/simpler way to set the default scheduler. As someone suggested, RTFM Documentation/kernel-parameters.txt. Passing elevator=cfq (or whatever) will do the trick much more simply than a kernel recompile. /mjt
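Concretely, that parameter goes on the kernel command line in whatever boot loader config is in use; a sketch (kernel image name and root device are illustrative):

```
# lilo.conf:
append="elevator=cfq"

# or a grub menu.lst entry:
kernel /vmlinuz-2.6.15 root=/dev/md0 elevator=cfq
```

After the next boot, /sys/block/*/queue/scheduler should show cfq bracketed as the active scheduler for every disk.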
Re: problems with raid=noautodetect
Neil Brown wrote: On Friday May 26, [EMAIL PROTECTED] wrote: [] If we assume there is a list of devices provided by a (possibly default) 'DEVICE' line, then DEVICEFILTER !pattern1 !pattern2 pattern3 pattern4 could mean that any device in that list which matches pattern 1 or 2 is immediately discarded, any remaining device that matches pattern 3 or 4 is included, and the remainder are discarded. The rule could be that the default is to include any devices that don't match a !pattern, unless there is a pattern without a '!', in which case the default is to reject non-accepted devices. Is that straightforward enough, or do I need an order allow,deny like apache has? I'd suggest the following. All the other devices are included in or excluded from the list of devices to consider based on the last component of the DEVICE line. I.e., if it ends with !dev, all the rest of the devices are included. If it ends with dev (w/o !), all the rest are excluded. If memory serves me right, that's how squid ACLs work. There's no need to introduce a new keyword. Given this rule, a line DEVICE a b c will do exactly what it does now. The line DEVICE a b c !d is somewhat redundant - it's the same as DEVICE !d Ie, if the list ends with !stuff, it is as if `partitions' (or *) were appended to it. Of course mixing ! and non-! patterns is useful, e.g. to say use all sda* but not sda1: DEVICE !sda1 sda* (and nothing else). And the default is to have `DEVICE partitions'. The only possible issue I see here is that with udev, it's possible to use, say, /dev/disk/by-id/*-like stuff (I don't remember the exact directory layout) -- symlinked to /dev/sd* according to the disk serial number or something like that -- for this to work, mdadm needs to use glob() internally. /mjt
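The squid-style rule sketched above - first matching pattern wins, and the last pattern's polarity decides the default for everything unmatched - can be prototyped in a few lines of shell. match_device here is purely illustrative, not mdadm syntax:

```shell
# Walk the pattern list in order: "!pat" rejects a matching device,
# "pat" accepts it.  A device matching nothing gets the opposite of
# the last pattern's polarity (list ends with !pat -> accept the
# rest; ends with pat -> reject the rest).
match_device() {
    dev=$1; shift
    default=accept
    for pat in "$@"; do
        case $pat in
            !*) p=${pat#!}; action=reject; default=accept ;;
            *)  p=$pat;     action=accept; default=reject ;;
        esac
        case $dev in ($p) echo $action; return ;; esac
    done
    echo $default
}
match_device /dev/sda1 '!/dev/sda1' '/dev/sda*'   # -> reject
match_device /dev/sda2 '!/dev/sda1' '/dev/sda*'   # -> accept
```

With the 'DEVICE !sda1 sda*' example from the text, sda1 is rejected, other sda partitions are accepted, and everything else is rejected because the list ends with a plain pattern.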
Re: linear writes to raid5
Neil Brown wrote: On Tuesday April 18, [EMAIL PROTECTED] wrote: [] I mean, merging bios into larger requests makes a lot of sense between the filesystem and md levels, but it makes a lot less sense to do that between md and the physical (fsvo physical, anyway) disks. This seems completely backwards to me, so we must be thinking of different things. Creating large requests above the md level doesn't make a lot of sense to me because there is a reasonable chance that md will just need to break the requests up again to submit to different devices lower down. Building large requests for the physical disk makes lots of sense because you get much better throughput on e.g. a SCSI bus by having few large requests rather than many small requests. But this building should be done close to the device so that as much information as possible is available about particular device characteristics. What is the rationale for your position? My rationale was that if the md layer receives *write* requests not smaller than a full stripe size, it is able to omit reading the data to be updated, and can just calculate the new parity from the new data. Hence, combining a dozen small write requests coming from a filesystem to form a single request of full stripe size should dramatically increase performance. Eg, when I use dd in O_DIRECT mode (oflag=direct) and experiment with different block sizes, write performance increases a lot when bs becomes the full stripe size. Of course it decreases again when bs is increased a bit further (as md starts reading again, to construct parity blocks). For read requests, it makes much less difference where they are combined. /mjt
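The arithmetic behind that sweet spot, as a sketch (the helper is illustrative; raid5 spends one disk per stripe on parity, so there are n-1 data chunks per full stripe):

```shell
# Full-stripe write size for raid5 = chunk size * (total disks - 1).
# Writes of exactly this size and alignment let md compute the new
# parity from the new data alone, with no read-modify-write cycle.
stripe_kb() {
    echo $(( $1 * ($2 - 1) ))   # $1 = chunk size in KB, $2 = total disks
}
stripe_kb 64 5   # 5-disk raid5 with a 64k chunk -> 256
```

That result is what you would hand to dd as bs (e.g. bs=256k oflag=direct) to reproduce the peak in the experiment described above.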
Re: linear writes to raid5
Neil Brown wrote: [] raid5 shouldn't need to merge small requests into large requests. That is what the 'elevator' or io_scheduler algorithms are for. They already merge multiple bios into larger 'requests'. If they aren't doing that, then something needs to be fixed. It is certainly possible that raid5 is doing something wrong that makes merging harder - maybe sending bios in the wrong order, or sending them with unfortunate timing. And if that is the case it certainly makes sense to fix it. But I really don't see that raid5 should be merging requests together - that is for a lower level to do. Hmm. So where's the elevator level - before the raid level (between e.g. the filesystem and md), or after it (between md and the physical devices)? I mean, merging bios into larger requests makes a lot of sense between the filesystem and md levels, but it makes a lot less sense to do that between md and the physical (fsvo physical, anyway) disks. Thanks. /mjt
Re: accessing mirrired lvm on shared storage
Neil Brown wrote: [] Very cool... that would be extremely nice to have. Any estimate on when you might get to this? I'm working on it, but there are lots of distractions. Neil, is there anything you're NOT working on? ;) Sorry, just can't resist... ;) /mjt
Terrible slow write speed to MegaRAID SCSI array
We've installed an LSI Logic MegaRAID SCSI 320-1 card on our server (used only temporarily to move data to larger disks, but that's not the point), and measured linear write performance, just to know how much time it would take to copy our (somewhat large) data to the new array. And to my surprise, this card, with current firmware, a current (2.6.15) kernel and modern disks, is terribly slow at writing - the writing speed varies between 1.5 megabytes/sec and 10 megabytes/sec, depending on the logical drive settings in the megaraid adapter. I tried different raid configurations (raid10 - 3 spans of two-disk raid1s; raid1 out of 2 drives; raid0 on one drive) - it makes no difference whatsoever (but on raid1, I can only get 8 megabytes/sec write speed - 10 mb/sec is on raid10). The disks are 140Gb FUJITSU MAT3147NC ones, pretty modern. One disk, when plugged into a non-megaraid card and accessed as a single disk, delivers about 60 megabytes/sec write speed, and about 80 mb/sec read (reading speed on the megaraid array is quite good - about 240 mb/sec for the 6-disk raid10). I've upgraded the firmware on the megaraid card to the latest one available on the LSI Logic website - nothing changed wrt write speed (but read speed decreased from 240 to about 190 mb/sec - still acceptable for me). I'm using the megaraid_mbox driver. The only improvement I was able to get is when I enable write caching on the card, -- in this case, writing speed is very good up to the first 64Mb (the amount of memory on the card), and decreases back to 1.5..10 mb/sec after 64Mb. The question is: where's the problem? linux? driver? hardware? Has anyone else experienced this problem? Thanks. /mjt
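For anyone reproducing this, one way to run the kind of linear-write measurement discussed above is dd with a final data sync, so the card's 64Mb write cache cannot flatter the number. The target path is an assumption (a file on the array); the temp-file default below just makes the command safe to dry-run:

```shell
# Sequential-write measurement: write well past the controller's 64Mb
# cache and force a data sync before dd reports its throughput figure.
# On the real system, TARGET would be a file on the megaraid array,
# e.g. TARGET=/mnt/array/ddtest.
TARGET=${TARGET:-$(mktemp)}
dd if=/dev/zero of="$TARGET" bs=1M count=128 conv=fdatasync
```

dd prints the elapsed time and MB/s on completion; without conv=fdatasync (or oflag=direct), the first 64Mb would land in the card's cache and inflate the result.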
Re: [PATCH 000 of 5] md: Introduction
NeilBrown wrote: Greetings. In line with the principle of release early, following are 5 patches against md in 2.6.latest which implement reshaping of a raid5 array. By this I mean adding 1 or more drives to the array and then re-laying out all of the data. Neil, is this online resizing/reshaping really needed? I understand all those words mean a lot to marketing people - zero downtime, online resizing etc - but it is much safer and easier to do that stuff 'offline', on an inactive array, like raidreconf does - safer, easier, faster, and one has more possibilities for more complex changes. It isn't like you want to add/remove drives to/from your arrays every day... A lot of good hw raid cards are unable to perform such reshaping too. /mjt
Re: [PATCH 000 of 5] md: Introduction
Sander wrote: Michael Tokarev wrote (ao): [] Neil, is this online resizing/reshaping really needed? I understand all those words mean a lot to marketing people - zero downtime, online resizing etc - but it is much safer and easier to do that stuff 'offline', on an inactive array, like raidreconf does - safer, easier, faster, and one has more possibilities for more complex changes. It isn't like you want to add/remove drives to/from your arrays every day... A lot of good hw raid cards are unable to perform such reshaping too. [] Actually, I don't understand why you bother at all. One writes the feature. Another uses it. How would this feature harm you? This is about code complexity/bloat. It's already complex enough. I rely on the stability of the linux softraid subsystem, and want it to be reliable. Adding more features, especially non-trivial ones, does not buy you a bugfree raid subsystem, just the opposite: it will have more chances to crash, to eat your data etc, and it will be harder to find and fix bugs. The raid code is already too fragile; I'm afraid simple I/O errors (which is what we need raid for) may crash the system already, and I am waiting for the next whole-system crash due to e.g. a superblock update error or whatnot. I saw all sorts of failures due to linux softraid already (we use it here a lot), including ones which required a complete array rebuild with heavy data loss. Any unnecessary "bloat" (note the quotes: I understand some people like this and other features) makes the whole system even more fragile than it already is. Compare this with my statement about an offline reshaper above: a separate userspace program (easier to write and debug than kernel space) which operates on an inactive array (no locking needed, no need to worry about other I/O operations going to the array at the time of reshaping etc), with an ability to plan its I/O strategy in a lot more efficient and safer way... Yes, this approach has one downside: the array has to be inactive.
But in my opinion it's worth it, compared with more possibilities to lose your data, even if you do NOT use that feature at all... /mjt