Re: Array will not assemble
On Friday July 7, [EMAIL PROTECTED] wrote: Perhaps I am misunderstanding how assemble works, but I have created a new RAID 1 array on a pair of SCSI drives and am having difficulty re-assembling it after a reboot. The relevant mdadm.conf entry looks like this: ARRAY /dev/md3 level=raid1 num-devices=2 UUID=72189255:acddbac3:316abdb0:9152808d devices=/dev/sdc,/dev/sdd Add DEVICE /dev/sd? or similar on a separate line. Remove devices=/dev/sdc,/dev/sdd NeilBrown - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
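Applying that advice, the corrected mdadm.conf might look like this (a sketch; the UUID is copied from the question, and the DEVICE glob is my assumption and should cover wherever the drives may appear):

```shell
# DEVICE tells mdadm which devices to scan; the array is then matched
# by UUID, so no stale devices= list is needed.
DEVICE /dev/sd?
ARRAY /dev/md3 level=raid1 num-devices=2 UUID=72189255:acddbac3:316abdb0:9152808d
```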
Re: How does md determine which partitions to use in RAID1 when DEVICE partitions is specified
On Monday July 3, [EMAIL PROTECTED] wrote: I have Fedora Core 5 installed with mirroring on the Boot partition and root partition. I created a Logical Volume Group on the mirrored root partition. How does md figure out which partitions are actually specified? It says it stores the uuid in the superblock, but I can't seem to figure out where this is, or how to get it. Is it in the partition, or volume? The superblock is near the end of whatever you told md to use for the array. From your 'mdadm --detail' output, I see that means /dev/sda1 /dev/sdb1 /dev/sda2 etc. The superblock for each partition is near the end of each partition. The reason I'm asking this, I'd like to add two USB 2.0 drives that are mirrored, and I would specify the device name (/dev/sdd, /dev/sde) for the ARRAY, but I found that the allocation of these device names is dependent upon when the drive is inserted into the USB. You don't need to care about the device name. Just add the uuid information to mdadm.conf. Provided the devices are plugged in, mdadm will find them. I'm going to ask if there is a way to lock the volume names for devices (I'm thinking by UUID) for USB devices in partitions. 'udev' might have functionality to do this. But you don't really need it with mdadm. NeilBrown
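To make "near the end" concrete: for version-0.90 superblocks the location follows the kernel's MD_NEW_SIZE_SECTORS rule. The sketch below is my paraphrase of that macro (the 128-sector reserve is an assumption from the 0.90 format; version-1 superblocks use different offsets):

```python
MD_RESERVED_SECTORS = 128  # 64 KiB in 512-byte sectors (v0.90 format)

def sb_offset_sectors(dev_size_sectors):
    """Where a version-0.90 md superblock lives: the device size is
    rounded down to a 64 KiB boundary, then one reserved block is
    stepped back, putting the superblock in the last 64-128 KiB."""
    return (dev_size_sectors & ~(MD_RESERVED_SECTORS - 1)) - MD_RESERVED_SECTORS
```

So for a 1,000,000-sector partition the superblock would start at sector 999808, i.e. 96 KiB before the end.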
Re: raid5 write performance
On Sunday July 2, [EMAIL PROTECTED] wrote: Neil hello. I have been looking at the raid5 code trying to understand why write performance is so poor. raid5 write performance is expected to be poor, as you often need to pre-read data or parity before the write can be issued. If I am not mistaken here, it seems that you issue a write in the size of one page and no more, no matter what buffer size I am using. I doubt the small write size would contribute more than a couple of percent to the speed issue. Scheduling (when to write, when to pre-read, when to wait a moment) is probably much more important. 1. Is this page directed only to the parity disk? No. All drives are written with one-page units. Each request is divided into one-page chunks, these one-page chunks are gathered - where possible - into strips, and the strips are handled as units (where a strip is like a stripe, only 1 page wide rather than one chunk wide - if that makes sense). 2. How can I increase the write throughput? Look at scheduling patterns - what order are the blocks getting written, do we pre-read when we don't need to, things like that. The current code tries to do the right thing, and it certainly has been worse in the past, but I wouldn't be surprised if it could still be improved. NeilBrown
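The "divided into one-page chunks" behaviour can be modelled with a toy function (an illustration I wrote, not the raid5 code itself; 4 KiB pages assumed):

```python
PAGE_SIZE = 4096  # the stripe cache handles data in one-page units

def split_request(offset, length, page_size=PAGE_SIZE):
    """Split a byte range into pieces that each stay within one page,
    the way a large write is broken up before pages are gathered into
    strips.  Returns (offset, length) pairs."""
    chunks = []
    pos = offset
    end = offset + length
    while pos < end:
        page_start = (pos // page_size) * page_size
        chunk_end = min(page_start + page_size, end)
        chunks.append((pos, chunk_end - pos))
        pos = chunk_end
    return chunks
```

A 16 KiB aligned write becomes four page-sized pieces; an unaligned one gets partial pieces at each end.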
Re: [PATCH] enable auto=yes by default when using udev
On Monday July 3, [EMAIL PROTECTED] wrote: Hello, the following patch aims at solving an issue that is confusing a lot of users. When using udev, device files are created only when devices are registered with the kernel, and md devices are registered only when started. mdadm needs the device file _before_ starting the array. So when using udev you must add --auto=yes to the mdadm command line or to the ARRAY line in mdadm.conf. The following patch makes auto=yes the default when using udev. The principle I'm reasonably happy with, though you can now make this the default with a line like CREATE auto=yes in mdadm.conf. However

+
+	/* if we are using udev and auto is not set, mdadm will almost
+	 * certainly fail, so we force it here.
+	 */
+	if (autof == 0 && access("/dev/.udevdb", F_OK) == 0)
+		autof = 2;
+

I'm worried that this test is not very robust. On my Debian/unstable system running udev, there is no /dev/.udevdb though there is a /dev/.udev/db I guess I could test for both, but then udev might change again. I'd really like a more robust check. Maybe I could test if /dev was a mount point? Any other ideas? NeilBrown
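A sketch of a more robust check, written in Python for clarity (checking both database paths plus Neil's own mount-point idea; treating that combination as sufficient is my assumption, not something the thread settles):

```python
import os

def udev_active(dev_dir="/dev"):
    """Heuristic: does udev appear to manage dev_dir?  Check both known
    database locations (the path moved between udev versions), then fall
    back to asking whether dev_dir is a mount point, since udev setups
    typically mount a tmpfs on /dev."""
    candidates = (os.path.join(dev_dir, ".udevdb"),
                  os.path.join(dev_dir, ".udev", "db"))
    if any(os.access(p, os.F_OK) for p in candidates):
        return True
    return os.path.ismount(dev_dir)
```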
Re: changing MD device names
On Saturday July 1, [EMAIL PROTECTED] wrote: I have a system which was running several raid1 devices (md0 - md2) using 2 physical drives (hde, and hdg). I wanted to swap out these drives for two different ones, so I did the following: 1) swap out hdg for a new drive 2) create degraded raid1's (md3 and md4) using partitions on new hdg 3) format md3 and md4 and copy data from md0-2 to md3-4 4) install grub on new hdg 5) pull hde Now, after a bit of fixing in the grub menu and fstab, I have a system that boots up using just 1 of the new drives, but the md devices are md3 and md4. What's the easiest way to change the preferred minor # and get these to be md0 and md1? Will just booting from a rescue or live CD and assembling the new drives as md0 and md1 automatically update the preferred minor in their superblocks? The system is running Centos 4 (2.6.9-34.0.1.EL kernel). You need to do a tiny bit more than assemble the new drives as md0 and md1. You also need to cause some write activity so that md bothers to update the superblock. Mounting and unmounting the filesystem should do it. Or you could assemble with --update=super-minor NeilBrown
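From a rescue or live CD, the --update route might look like this (a sketch; the partition names are hypothetical, not from the original message):

```shell
# Assemble under the desired minor and rewrite the preferred-minor
# field in the superblock in one step (illustrative device names):
mdadm --assemble /dev/md0 --update=super-minor /dev/hdg1
mdadm --assemble /dev/md1 --update=super-minor /dev/hdg2
```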
Re: raid issues after power failure
On Friday June 30, [EMAIL PROTECTED] wrote: On Fri, 30 Jun 2006, Francois Barre wrote: Did you try upgrading mdadm yet ? yes, I have version 2.5 now, and it produces the same results. Try adding '--force' to the -A line. That tells mdadm to try really hard to assemble the array. You should be aware that when a degraded array has an unclean shutdown it is possible that data corruption could result, possibly in files that have not been changed for a long time. It is also quite possible that there is no data corruption, or that it is only on parts of the array that are not actually in use. I recommend at least a full 'fsck' in this situation. NeilBrown
Re: Strange intermittant errors + RAID doesn't fail the disk.
On Friday June 30, [EMAIL PROTECTED] wrote: More problems ... As reported I have 4x WD5000YS (Caviar RE2 500 GB) in a md RAID5 array. I've been benchmarking and otherwise testing the new array these last few days, and apart from the fact that the md doesn't shut down properly I've had no problems. Today I wanted to finally copy some data over, but after 5 sec I got: [...] ata2: port reset, p_is 800 is 2 pis 0 cmd 44017 tf d0 ss 123 se 0 ata2: status=0x50 { DriveReady SeekComplete } sdc: Current: sense key: No Sense Additional sense: No additional sense information ata2: handling error/timeout ata2: port reset, p_is 0 is 0 pis 0 cmd 44017 tf 150 ss 123 se 0 ata2: status=0x50 { DriveReady SeekComplete } ata2: error=0x01 { AddrMarkNotFound } sdc: Current: sense key: No Sense Additional sense: No additional sense information [repeat] All processes accessing the array hang and can't even be killed by kill -9, but md does not mark the disk as failed. Looks very much like a problem with the SATA controller. If the repeating loop you have shown is infinite, then presumably some failure is not being handled properly. I suggest you find a SATA-related mailing list to post this to (look in the MAINTAINERS file maybe) or post it to linux-kernel. I doubt this is directly related to the raid code at all. Good luck :-) NeilBrown I then tested all four disks individually in another box -- according to WD's drive diagnostic they're fine. Re-created the array on the disks, which worked for a few hours, now I get the same error again. :( Kernel is 2.6.17-1-686 (Debian testing). I could go back to 16, but 15 is missing a CIFS change I need. Any help is appreciated.
Christian
Re: Cutting power without breaking RAID
On Thursday June 29, [EMAIL PROTECTED] wrote: Why should this trickery be needed? When an array is mounted r/o it should be clean. How can it be dirty? I assume readonly implies noatime, I mount physically readonly devices without explicitly saying noatime and nothing whines. The 'filesystem' is mounted r/o. The 'array' is not read-only, and you cannot set an array to read-only while a filesystem is mounted (because the array cannot tell that the mount is read-only). A little while after the last write, an array will mark itself as clean. The effect of the 'kill -9' is to reduce this 'little while' to 0. So 'remount readonly; wait a little while; kill machine' would work too. NeilBrown
Re: Drive issues in RAID vs. not-RAID ..
On Wednesday June 28, [EMAIL PROTECTED] wrote: I've seen a few comments to the effect that some disks have problems when used in a RAID setup and I'm a bit perplexed as to why this might be. What's the difference between a drive in a RAID set (either s/w or h/w) and a drive on its own, assuming the load, etc. is roughly the same in each setup? Is it just bad feeling or is there any scientific reason for it? I don't think that 'disks' have problems being in a raid, but I believe some controllers do (though I don't know whether it is the controller or the driver that is at fault). RAID makes concurrent requests much more likely and so is likely to push hard at any locking issues. NeilBrown
Re: Cutting power without breaking RAID
On Wednesday June 28, [EMAIL PROTECTED] wrote: Hello, I'm facing this problem: when my Linux box detects a POWER FAIL event from the UPS, it starts a normal shutdown. Just before the normal kernel poweroff, it sends to the UPS a signal on the serial line which says cut off the power to the server and switch off the UPS. This is required to reboot the server as soon as the power is restored. The problem is that the root partition is on top of a RAID-1 filesystem which is still mounted when the program that kills the power is run, so the system goes down with a non-clean RAID volume. What can be the proper action to do before killing the power to ensure that RAID will remain clean? It seems that remounting the partition read-only is not sufficient. Are you running a 2.4 kernel or a 2.6 kernel? With 2.4, you cannot do what you want to do. With 2.6, killall -9 md0_raid1 should do the trick (assuming root is on /dev/md0; if it is elsewhere, choose a different process name). After you kill -9 the raid thread, the array will be marked clean immediately after all writes complete, and marked dirty again before allowing another write. NeilBrown
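The resulting late-shutdown sequence might be sketched like this (only the killall step comes from the message; the rest is illustrative and assumes root on /dev/md0):

```shell
# Late in shutdown, after all other services have stopped:
mount -o remount,ro /      # filesystem goes read-only
killall -9 md0_raid1       # md marks the array clean as soon as writes drain
# ...now send the power-cut command to the UPS over the serial line.
```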
Re: mdadm 2.5.2 - Static built , Interesting warnings when
On Tuesday June 27, [EMAIL PROTECTED] wrote: Hello All, What change in glibc makes this necessary? Is there a method available to include the getpwnam/getgrnam structures so that a full static build will work? Tia, JimL

gcc -Wall -Werror -Wstrict-prototypes -ggdb -DSendmail=\"/usr/sbin/sendmail -t\" -DCONFFILE=\"/etc/mdadm.conf\" -DCONFFILE2=\"/etc/mdadm/mdadm.conf\" -DHAVE_STDINT_H -o sha1.o -c sha1.c
gcc -static -o mdadm mdadm.o config.o mdstat.o ReadMe.o util.o Manage.o Assemble.o Build.o Create.o Detail.o Examine.o Grow.o Monitor.o dlink.o Kill.o Query.o mdopen.o super0.o super1.o bitmap.o restripe.o sysfs.o sha1.o
config.o(.text+0x8c4): In function `createline': /home/archive/mdadm-2.5.2/config.c:341: warning: Using 'getgrnam' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
config.o(.text+0x80b):/home/archive/mdadm-2.5.2/config.c:326: warning: Using 'getpwnam' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
nroff -man mdadm.8 > mdadm.man

Are you running make LDFLAGS=-static mdadm or something like that? No, that won't work any more. Use make mdadm.static - that will get you a good static binary by including 'pwgr.o'. NeilBrown
ANNOUNCE: mdadm 2.5.2 - A tool for managing Soft RAID under Linux
I am pleased to announce the availability of mdadm version 2.5.2 It is available at the usual places: http://www.cse.unsw.edu.au/~neilb/source/mdadm/ and (with countrycode=xx) http://www.${countrycode}kernel.org/pub/linux/utils/raid/mdadm/ and via git at git://neil.brown.name/mdadm (browsable at http://neil.brown.name/git?p=mdadm). mdadm is a tool for creating, managing and monitoring device arrays using the md driver in Linux, also known as Software RAID arrays. Release 2.5.2 is primarily a bugfix release over 2.5.1. It also contains a work-around for a kernel bug which affects hot-adding to arrays with a version-1 superblock. Changelog Entries:
- Fix problem with compiling with gcc-2 compilers
- Fix compile problem of post-incrementing a variable in a macro arg.
- Stop map_dev from returning [0:0], as that breaks things.
- Add 'Array Slot' line to --examine for version-1 superblocks to make it a bit easier to see what is happening.
- Work around bug in --add handling for version-1 superblocks in 2.6.17 (and prior).
- Make --assemble a bit more resilient to finding strange information in superblocks.
- Don't claim newly added spares are InSync!! (don't know why that code was ever in there)
- Work better when no 'ftw' is available, and check to see if current uclibc provides ftw.
- Never use /etc/mdadm.conf if --config file is given (previously some code used one, some used the other).
Development of mdadm is sponsored by SUSE Labs, Novell Inc. NeilBrown 27th June 2006
Re: Is shrinking raid5 possible?
On Friday June 23, [EMAIL PROTECTED] wrote: Why would you ever want to reduce the size of a raid5 in this way? A feature that would have been useful to me a few times is the ability to shrink an array by whole disks. Example: 8x 300 GB disks - 2100 GB raw capacity shrink file system, remove 2 disks = 6x 300 GB disks -- 1500 GB raw capacity This is shrinking an array by removing drives. We were talking about shrinking an array by reducing the size of drives - a very different thing. Yes, it might be sometimes useful to reduce the number of drives in a raid5. This would be similar to adding a drive to a raid5 (now possible), but the data copy would have to go in a different direction, so there would need to be substantial changes to the code. I'm not sure it is really worth the effort I'm afraid, but it might get done, one day, especially if someone volunteers some code ... ;-) NeilBrown Why? If you're not backed up by a company budget, moving data to a new array (extra / larger disks) is extremely difficult. A lot of cases will hold 8 disks but not 16, never mind the extra RAID controller. Building another temporary server and moving the data via Gigabit is slow and expensive as well. Shrinking the array step-by-step and unloading data onto a regular filesystem on the freed disks would be a cheap (if time consuming) way to migrate, because the data could be copied back to the new array a disk at a time. Thanks, C.
Re: Bug in 2.6.17 / mdadm 2.5.1
On Monday June 26, [EMAIL PROTECTED] wrote: Neil Brown wrote: snip Alternately you can apply the following patch to the kernel and version-1 superblocks should work better. -stable material? Maybe. I'm not sure it exactly qualifies, but I might try sending it to them and see what they think. NeilBrown
Re: recover data from linear raid
On Monday June 26, [EMAIL PROTECTED] wrote: This is what I get now, after creating with fdisk /dev/hdb1 and /dev/hdc1 as linux raid autodetect partitions So I'm totally confused now. You said it was 'linear', but the boot log showed 'raid0'. The drives didn't have a partition table on them, yet it is clear from the old boot log that they did. Are you sure they are the same drives, 'cause it doesn't seem like it. You could try hunting for ext3 superblocks on the device. There might be an easier way but od -x /dev/hdb | grep '^.60 ef53 ' should find them. Once you have this information we might be able to make something work. But I feel the chances are dwindling. NeilBrown
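The od | grep hunt can also be expressed as a small script. This is my own sketch, not from the thread: it looks for the ext2/ext3 magic number 0xEF53, which sits 56 bytes into each 1 KiB-aligned superblock copy (expect false positives, since any two bytes can match by chance):

```python
import struct

EXT2_MAGIC = 0xEF53   # shared by ext2/ext3; stored little-endian on disk
MAGIC_OFFSET = 56     # offset of s_magic within the superblock

def find_superblocks(path, block=1024):
    """Scan a device or image file in 1 KiB steps and report offsets
    whose contents look like an ext2/ext3 superblock."""
    hits = []
    with open(path, "rb") as f:
        pos = 0
        while True:
            buf = f.read(block)
            if len(buf) < MAGIC_OFFSET + 2:
                break
            if struct.unpack_from("<H", buf, MAGIC_OFFSET)[0] == EXT2_MAGIC:
                hits.append(pos)
            pos += block
    return hits
```

Running it over /dev/hdb (as root) would list candidate superblock offsets; a real filesystem start would be 1024 bytes before one of them.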
Re: Bug in 2.6.17 / mdadm 2.5.1
On Sunday June 25, [EMAIL PROTECTED] wrote: Hi! There's a bug in Kernel 2.6.17 and / or mdadm which prevents (re)adding a disk to a degraded RAID5-Array. Thank you for the detailed report. The bug is in the md driver in the kernel (not in mdadm), and only affects version-1 superblocks. Debian recently changed the default (in /etc/mdadm.conf) to use version-1 superblocks, which I thought would be OK (I've done some testing) but obviously I missed something. :-( If you remove the metadata=1 (or whatever it is) from /etc/mdadm/mdadm.conf and then create the array, it will be created with a version-0.90 superblock, which has had more testing. Alternately you can apply the following patch to the kernel and version-1 superblocks should work better. NeilBrown

Set desc_nr correctly for version-1 superblocks. This has to be done in ->load_super, not ->validate_super.

### Diffstat output
 ./drivers/md/md.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2006-06-26 11:02:43 +1000
+++ ./drivers/md/md.c	2006-06-26 11:02:46 +1000
@@ -1057,6 +1057,11 @@ static int super_1_load(mdk_rdev_t *rdev
 	if (rdev->sb_size & bmask)
 		rdev->sb_size = (rdev->sb_size | bmask)+1;
 
+	if (sb->level == cpu_to_le32(LEVEL_MULTIPATH))
+		rdev->desc_nr = -1;
+	else
+		rdev->desc_nr = le32_to_cpu(sb->dev_number);
+
 	if (refdev == 0)
 		ret = 1;
 	else {
@@ -1165,7 +1170,6 @@ static int super_1_validate(mddev_t *mdd
 	if (mddev->level != LEVEL_MULTIPATH) {
 		int role;
-		rdev->desc_nr = le32_to_cpu(sb->dev_number);
 		role = le16_to_cpu(sb->dev_roles[rdev->desc_nr]);
 		switch(role) {
 		case 0xffff: /* spare */
Re: Large single raid and XFS or two small ones and EXT3?
On Friday June 23, [EMAIL PROTECTED] wrote: The problem is that there is no cost effective backup available. One-liner questions : - How does Google make backups ? No, Google ARE the backups :-) - Aren't tapes dead yet ? LTO-3 does 300Gig, and LTO-4 is planned. They may not cope with tera-byte arrays in one hit, but they still have real value. - What about a NUMA principle applied to storage ? You mean an Hierarchical Storage Manager? Yep, they exist. I'm sure SGI, EMC and assorted other TLAs could sell you one. NeilBrown
Re: read perfomance patchset
On Monday June 19, [EMAIL PROTECTED] wrote: Neil hello if I am not mistaken here: in the first instance of: if(bi) ... ... you return without setting to NULL Yes, you are right. Thanks. And fixing that bug removes the crash. However I've been doing a few tests and it is hard to measure much improvement, which is strange. I can maybe see a 1% improvement but that could just be noise. I'll do some more and see if I can find out what is happening. Interestingly, with a simple dd if=/dev/md1 of=/dev/null bs=1024k test, 2.6.16 is substantially faster (10%) than 2.6.17-rc6-mm2 before the patches are added. There is something weird there. Have you done any testing? NeilBrown
Re: Is shrinking raid5 possible?
On Thursday June 22, [EMAIL PROTECTED] wrote: Neil Brown wrote: On Monday June 19, [EMAIL PROTECTED] wrote: Hi, I'd like to shrink the size of a RAID5 array - is this possible? My first attempt shrinking 1.4Tb to 600Gb, mdadm --grow /dev/md5 --size=629145600 gives mdadm: Cannot set device size/shape for /dev/md5: No space left on device Yep. The '--size' option refers to: Amount (in Kibibytes) of space to use from each drive in RAID1/4/5/6. This must be a multiple of the chunk size, and must leave about 128Kb of space at the end of the drive for the RAID superblock. (from the man page). So you were telling md to use the first 600GB of each device in the array, and it told you there wasn't that much room. If your array has N drives, you need to divide the target array size by N-1 to find the target device size. So if you have a 5 drive array, then you want --size=157286400 May I say in all honesty that making people do that math instead of the computer is a really bad user interface? Good, consider it said. A means to just set the target size of the resulting raid device would be a LOT less likely to cause bad user input, and while I'm complaining it should understand the suffixes 'k', 'm', and 'g'. Let me put another perspective on this. Why would you ever want to reduce the size of a raid5 in this way? The only reason that I can think of is that you want to repartition each device to use a smaller partition for the raid5, and free up some space for something else. If that is what you are doing, you will have already done the math and you will know what size you want your final partitions to be, so setting the device size is just as easy as setting the array size. If you really are interested in array size and have no interest in recouping the wasted space on the drives, then there would be no point in shrinking the array (that I can think of). Just 'mkfs' a filesystem to the desired size and ignore the rest of the array.
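The arithmetic Neil describes is easy to script. A sketch (the 64 KiB chunk size and the rounding-down are my assumptions; mdadm itself requires --size to be a multiple of the array's actual chunk size):

```python
def per_device_size(target_array_kib, n_devices, chunk_kib=64):
    """Convert a desired raid5 array size into the per-device --size
    value: capacity is (n_devices - 1) * size, rounded down to a
    whole chunk."""
    size = target_array_kib // (n_devices - 1)
    return size - (size % chunk_kib)

# The 600 GB target from the thread, on a 5-drive array:
print(per_device_size(629145600, 5))  # → 157286400, matching Neil's figure
```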
In short, reducing a raid5 to a particular size isn't something that really makes sense to me. Reducing the amount of each device that is used does - though I would much more expect people to want to increase that size. If Paul really has a reason to reduce the array to a particular size then fine. I'm mildly curious, but it's his business and I'm happy for mdadm to support it, though indirectly. But I strongly suspect that most people who want to resize their array will be thinking in terms of the amount of each device that is used, so that is how mdadm works. Far easier to use for the case where you need, for instance, 10G of storage for a database, tell mdadm what devices to use and what you need (and the level of course) and let the computer figure out the details, rounding up, leaving 128k, and phase of the moon if you decide to use it. mdadm is not intended to be a tool that manages your storage for you. If you want that, then I suspect EVMS is what you want (though I am only guessing - I've never used it). mdadm is a tool that enables YOU to manage your storage. NeilBrown Sorry, I think the current approach is baaad human interface. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979
Re: Raid5 reshape
On Tuesday June 20, [EMAIL PROTECTED] wrote: Nigel J. Terry wrote: Well good news and bad news I'm afraid... Well I would like to be able to tell you that the time calculation now works, but I can't. Here's why: When I rebooted with the newly built kernel, it decided to hit the magic 21 reboots and hence decided to check the array for clean. This normally takes about 5-10 mins, but this time took several hours, so I went to bed! I suspect that it was doing the full reshape or something similar at boot time. What magic 21 reboots?? md has no mechanism to automatically check the array after N reboots or anything like that. Or are you thinking of the 'fsck' that does a full check every so-often? Now I am not sure that this makes good sense in a normal environment. This could keep a server down for hours or days. I might suggest that if such work was required, the clean check is postponed till next boot and the reshape allowed to continue in the background. An fsck cannot tell if there is a reshape happening, but the reshape should notice the fsck and slow down to a crawl so the fsck can complete... Anyway the good news is that this morning, all is well, the array is clean and grown as can be seen below. However, if you look further below you will see the section from dmesg which still shows RIP errors, so I guess there is still something wrong, even though it looks like it is working. Let me know if I can provide any more information. Once again, many thanks. All I need to do now is grow the ext3 filesystem...

...ok start reshape thread
md: syncing RAID array md0
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwidth (but not more than 20 KB/sec) for reconstruction.
md: using 128k window, over a total of 245111552 blocks.
Unable to handle kernel NULL pointer dereference at RIP: {stext+2145382632}
PGD 7c3f9067 PUD 7cb9e067 PMD 0
Process md0_reshape (pid: 1432, threadinfo 81007aa42000, task 810037f497b0)
Stack: 803dce42 1d383600
Call Trace: 803dce42{md_do_sync+1307} 802640c0{thread_return+0} 8026411e{thread_return+94} 8029925d{keventd_create_kthread+0} 803dd3d9{md_thread+248}
That looks very much like the bug that I already sent you a patch for! Are you sure that the new kernel still had this patch? I'm a bit confused by this. NeilBrown
Re: Can't get drives containing spare devices to spindown
On Thursday June 22, [EMAIL PROTECTED] wrote: Marc L. de Bruin wrote: Situation: /dev/md0, type raid1, containing 2 active devices (/dev/hda1 and /dev/hdc1) and 2 spare devices (/dev/hde1 and /dev/hdg1). Those two spare 'partitions' are the only partitions on those disks and therefore I'd like to spin down those disks using hdparm for obvious reasons (noise, heat). Specifically, 'hdparm -S value device' sets the standby (spindown) timeout for a drive; the value is used by the drive to determine how long to wait (with no disk activity) before turning off the spindle motor to save power. However, it turns out that md actually sort-of prevents those spare disks from spinning down. I can get them off for about 3 to 4 seconds, after which they immediately spin up again. Removing the spare devices from /dev/md0 (mdadm /dev/md0 --remove /dev/hd[eg]1) actually solves this, but I have no intention of actually removing those devices. How can I make sure that I'm actually able to spin down those two spare drives? This is fixed in current -mm kernels and the fix should be in 2.6.18. NeilBrown
Re: Can't get drives containing spare devices to spindown
On Thursday June 22, [EMAIL PROTECTED] wrote: Thanks Neil for your quick reply. Would it be possible to elaborate a bit on the problem and the solution? I guess I won't be on 2.6.18 for some time... When an array has been idle (no writes) for a short time (20 or 200 ms, depending on which kernel you are running) the array is flagged as 'clean', so that a crash/power failure at that point will not require a full resync. The 'clean' flag is stored on all superblocks, including the spares. So this causes writes to all devices when there are changes to activity status. Even fairly quiet filesystems see occasional updates (updating atime on files, or such as syncing the journal), and that causes all devices to be touched. Fix: 1/ Don't set the 'dirty' flag on spares - there really is no need. However whenever the dirty bit is changed, the 'events' count is updated, so just doing the above will cause the spares to get way behind the main devices in their 'events' count, so they will no longer be treated as part of the array. So 2/ When clearing the dirty flag (and nothing else has happened), decrement the events count rather than increment it. Together, these mean that simple dirty/clean transitions do not touch the spares. NeilBrown
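The two-part fix can be modelled in a few lines (an illustrative model I wrote, not the kernel code):

```python
def next_events(events, clean_transition_only):
    """Superblock updates normally bump the event count; the fix makes a
    bare dirty->clean transition decrement it instead, so spares (whose
    superblocks are skipped entirely) do not fall behind."""
    return events - 1 if clean_transition_only else events + 1

active = spare = 10
active = next_events(active, False)  # array goes dirty: active drive -> 11
active = next_events(active, True)   # array goes clean again: back to 10
# The spare was never written, yet its event count still matches.
```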
Re: the question about raid0_make_request
On Monday June 19, [EMAIL PROTECTED] wrote: We can imagine that there is a raid0 array whose layout is drawn in the attachment. Take this for example. There are 3 zones totally, and their zone->nb_dev is 5, 4, 3 respectively. In the raid0_make_request function, the var block is the offset of bio in kilobytes. x = block >> chunksize_bits; tmp_dev = zone->dev[sector_div(x, zone->nb_dev)]; If block is in chunk 5, then x = block >> chunksize_bits = 5. And the nb_dev of zone2 is 4. So tmp_dev = zone->dev[sector_div(5,4)] = zone->dev[1]. But we can see that the right result should be zone->dev[0]. Then how does the 'bug' get the right underlying device? When you say 'right' result, you really mean 'expected' result. You expect the layout to be 0 1 2 3 4 5 6 7 8 9 10 11 The actual layout for Linux-md-raid0 is 0 1 2 3 4 8 5 6 7 9 10 11 Not what you would expect, but still a valid layout. NeilBrown
Re: Raid5 reshape
On Monday June 19, [EMAIL PROTECTED] wrote: That seems to have fixed it. The reshape is now progressing and there are no apparent errors in dmesg. Details below. Great! I'll send another confirmation tomorrow when hopefully it has finished :-) Many thanks for a great product and great support. And thank you for being a patient beta-tester! NeilBrown
Re: [PATCH] ANNOUNCE: mdadm 2.5.1 - A tool for managing Soft RAID under Linux
On Monday June 19, [EMAIL PROTECTED] wrote: Neil Brown wrote: I am pleased to announce the availability of mdadm version 2.5.1 What the heck, here's another one. :) This one is slightly more serious. We're getting a device of 0:0 in Fail events from the mdadm monitor sometimes now (due to the change in map_dev, which allows it to sometimes return 0:0 instead of just NULL for an unknown device). Thanks for this and the other two. They are now in .git The patch fixes my issue. I don't know if there are more. I chose to do this differently - map_dev will now return NULL for 0,0 and all users can cope with a NULL. NeilBrown Thanks, Paul --- mdadm-2.5.1/Monitor.c Thu Jun 1 21:33:41 2006 +++ mdadm-2.5.1-new/Monitor.c Mon Jun 19 14:51:31 2006 @@ -328,7 +328,7 @@ int Monitor(mddev_dev_t devlist, } disc.major = disc.minor = 0; } - if (dv == NULL && st->devid[i]) + if ((dv == NULL || strcmp(dv, "0:0") == 0) && st->devid[i]) dv = map_dev(major(st->devid[i]), minor(st->devid[i]), 1); change = newstate ^ st->devstate[i];
Re: the question about raid0_make_request
On Monday June 19, [EMAIL PROTECTED] wrote: When I read the code of raid0_make_request, I meet some questions. 1\ block = bio->bi_sector >> 1; it's the device offset in kilobytes. So why do we subtract zone->zone_offset from block? The zone->zone_offset is the zone offset relative to the mddev in sectors. zone_offset is set to 'curr_zone_offset' in create_strip_zones; curr_zone_offset is a sum of 'zone->size' values. zone->size is (typically) calculated by (smallest->size - current_offset) * c; 'smallest' is an rdev. So the units of 'zone_offset' are ultimately the same units as those of rdev->size. rdev->size is set in md.c e.g. from calc_dev_size(rdev, sb->chunk_size); which uses the value from calc_dev_sboffset, which shifts the size in bytes by BLOCK_SIZE_BITS, which is defined in fs.h to be 10. So the units of zone_offset are kilobytes, not sectors. 2\ the code below: x = block >> chunksize_bits; tmp_dev = zone->dev[sector_div(x, zone->nb_dev)]; actually, we get the underlying device by 'sector_div(x, zone->nb_dev)'. The var x is the chunk nr relative to the start of the mddev in my opinion. But not all of the zone->nb_dev are the same, so we can't get the right rdev by 'sector_div(x, zone->nb_dev)', I think. x is the chunk number relative to the start of the current zone, not the start of the mddev: sector_t x = (block - zone->zone_offset) >> chunksize_bits; taking the remainder after dividing this by the number of devices in the current zone gives the number of the device to use. Hope that helps. NeilBrown
Re: Raid5 reshape
On Saturday June 17, [EMAIL PROTECTED] wrote: Any ideas what I should do next? Thanks Looks like you've probably hit a bug. I'll need a bit more info though. First: [EMAIL PROTECTED] ~]# cat /proc/mdstat Personalities : [raid5] [raid4] md0 : active raid5 sdb1[1] sda1[0] hdc1[4](S) hdb1[2] 490223104 blocks super 0.91 level 5, 128k chunk, algorithm 2 [4/3] [UUU_] [=...] reshape = 6.9% (17073280/245111552) finish=86.3min speed=44003K/sec unused devices: none This really makes it look like the reshape is progressing. How long after the reboot was this taken? How long after hdc1 was hot-added (roughly)? What does it show now? What happens if you remove hdc1 again? Does the reshape keep going? What I would expect to happen in this case is that the array reshapes into a degraded array, then the missing disk is recovered onto hdc1. NeilBrown
Re: Raid5 reshape
OK, thanks for the extra details. I'll have a look and see what I can find, but it'll probably be a couple of days before I have anything useful for you. NeilBrown
Re: Raid5 reshape
On Friday June 16, [EMAIL PROTECTED] wrote: You have to grow the ext3 fs separately. ext2resize /dev/mdX. Keep in mind this can only be done off-line. ext3 can be resized online. I think ext2resize in the latest release will do the right thing whether it is online or not. There is a limit to the amount of expansion that can be achieved on-line. This limit is set when making the filesystem. Depending on which version of ext2-utils you used to make the filesystem, it may or may not already be prepared for substantial expansion. So if you want to do it on-line, give it a try or ask on the ext3-users list for particular details on what versions you need and how to see if your fs can be expanded. NeilBrown
Re: IBM xSeries stop responding during RAID1 reconstruction
On Thursday June 15, [EMAIL PROTECTED] wrote: On Wed, Jun 14, 2006 at 10:46:09AM -0500, Bill Cizek wrote: Niccolo Rigacci wrote: When the sync is complete, the machine starts to respond again perfectly. I was able to work around this by lowering /proc/sys/dev/raid/speed_limit_max to a value below my disk throughput value (~ 50 MB/s) as follows: $ echo 45000 > /proc/sys/dev/raid/speed_limit_max Thanks! This hack seems to solve my problem too. So it seems that the RAID subsystem does not detect a proper speed to throttle the sync. The RAID subsystem doesn't try to detect a 'proper' speed. When there is nothing else happening, it just drives the disks as fast as they will go. If this is causing a lockup, then there is something else wrong, just as any single process should not - by writing constantly to disks - be able to clog up the whole system. Maybe you could get the result of alt-sysrq-P or even alt-sysrq-T while the system seems to hang. NeilBrown
Re: Raid5 software problems after loosing 4 disks for 48 hours
On Friday June 16, [EMAIL PROTECTED] wrote: And is there a way if more than 1 disk goes offline, for the whole array to be taken offline? My understanding of raid5 is lose 1+ disks and nothing on the raid would be readable. this is not the case here. Nothing will be writable, but some blocks might be readable. All the disks are online now, what do I need to do to rebuild the array? Have you tried mdadm --assemble --force /dev/md0 /dev/sd[bcdefghijklmnop]1 ?? Actually, it occurs to me that that might not do the best thing if 4 drives disappeared at exactly the same time (though it is unlikely that you would notice) You should probably use mdadm --create /dev/md0 -f -l5 -n15 -c32 /dev/sd[bcdefghijklmnop]1 This is assuming that e,f,g,h were in that order in the array before they died. The '-f' is quite important - it tells mdadm not to recover a spare, but to resync the parity blocks. NeilBrown
Re: to understand the logic of raid0_make_request
On Friday June 16, [EMAIL PROTECTED] wrote: Thanks a lot. I went through the code again following your guide. But I still can't understand how the bio->bi_sector and bio->bi_dev are computed. I don't know what the var 'block' stands for. Could you explain them to me? 'block' is simply bi_sector/2 - the device offset in kilobytes rather than in sectors. raid0 supports having devices of different sizes. The array is divided into 'zones'. The first zone has all devices, and extends as far as the smallest device. The last zone extends to the end of the largest device, and may have only one, or several devices in it. There may be other zones depending on how many different sizes of device there are. The first thing that happens is the correct zone is found by looking in the hash_table. Then we subtract the zone offset, divide by the chunk size, and then divide by the number of devices in that zone. The remainder of this last division tells us which device to use. Then we multiply back out to find the offset in that device. I know that is rather brief, but I hope it helps. NeilBrown
Re: raid6
On Thursday June 15, [EMAIL PROTECTED] wrote: I am confronted with a big problem of the raid6 algorithm, when recently I learn the raid6 code of linux 2.6 you have contributed. Unfortunately I can not understand the algorithm of P+Q parity in this program. Is there some formula for this raid6 algorithm? I really respect your help, could you show me some details about this algorithm? See: http://en.wikipedia.org/wiki/Redundant_array_of_independent_disks#RAID_6 and http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf NeilBrown
Re: to understand the logic of raid0_make_request
On Tuesday June 13, [EMAIL PROTECTED] wrote: hello, everyone. I am studying the code of raid0. But I find that the logic of raid0_make_request is a little difficult to understand. Who can tell me what the function of raid0_make_request will do eventually? One of two possibilities. Most often it will update bio->bi_dev and bio->bi_sector to refer to the correct location on the correct underlying device, and then will return '1'. The fact that it returns '1' is noticed by generic_make_request in block/ll_rw_blk.c and generic_make_request will loop around and retry the request on the new device at the new offset. However in the unusual case that the request crosses a chunk boundary and so needs to be sent to two different devices, raid0_make_request will split the bio in two (using bio_split), will submit each of the two bios directly down to the appropriate devices - and will then return '0', so that generic_make_request doesn't loop around. I hope that helps. NeilBrown
Re: raid 5 read performance
On Friday June 9, [EMAIL PROTECTED] wrote: Neil hello Sorry for the delay. too many things to do. You aren't alone there! I have implemented all said in : http://www.spinics.net/lists/raid/msg11838.html As always I have some questions: 1. mergeable_bvec I did not understand first i must admit. now i do not see how it differs from the one of raid0. so i actually copied it and renamed it. Sounds fine. For raid5 there is no need to force write requests to be split up, but that's a minor difference. 2. statistics. i have added md statistics since the code does not reach the statics in make_request. it returns from make_request before that. Why not put the code *after* that? Not that it matters a great deal. I'll comment more when I see the code I expect. 3. i have added the new retry list called toread_aligned to raid5_conf_t . hope this is correct. Sounds good. 4. your instructions are to add a failed bio to sh, but it does not say to handle it directly. i have tried it and something is missing here. raid5d handle stripes only if conf-handle_list is not empty. i added handle_stripe and and release_stripe of my own. this way i managed to get from the completion routine: R5: read error corrected!! message . ( i have tested by failing a ram disk ). Sounds right, but I'd need to see the code to be sure. 5. I am going to test the non common path heavily before submitting you the patch ( on real disks and use several file systems and several chunk sizes). I'd rather see the patch earlier, even if it isn't fully tested. It is quite a big patch so I need to know which kernel do you want me to use ? i am using poor 2.6.15. A patch against the latest -mm would be best, but I'm happy to take it against anything even vaguely recent. However, it needs to be multiple patches, not just one. This is a *very* important point. As that original email said: This should be developed and eventually presented as a sequence of patches. 
There are several distinct steps in this change and they need to be reviewed separately or it is just too hard. So I would really like it if you could separate out the changes into logically distinct patches. If you can't or won't, then still send the patch, but I'll have to break it up so it'll probably take longer to process. Thanks for your efforts, NeilBrown - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Raid5 read error correction log
On Saturday June 3, [EMAIL PROTECTED] wrote: Hey Neil, It would sure be nice if the log contained any info about the error correction that's been done rather than simply saying read error corrected, like which array chunk, device and sector was corrected. I'm having a persistent pending sector on a drive, and when I do check or repair, it says read error corrected many times, but I don't know whether it's doing the same sector over and over or if there are just so many of them... I seem to remember reading something about this on the list some time ago, is it already in the kernel? (I'm running 2.6.17-rc4 now). Yes, added to todo list: include sector/dev info in read-error-corrected messages Btw, when it does correct a read error, I assume it also tries to read it again to verify that the correction worked? Yes. It doesn't check that the read returns the correct data, but it does check that a read succeeds. However I'm not certain that the read request will punch through any cache on the drive. It could be that the reads return data out of the cache without accessing data on the surface of the disk. NeilBrown
Re: problems with raid6, mdadm: RUN_ARRAY failed
On Friday June 2, [EMAIL PROTECTED] wrote: I have an old controller Mylex Acceleraid 170LP with 6 SCSI 36GB disks on it. Running hardware raid5 resulted in very poor performance (7MB/sec in sequential writing, with horrid iowait). So I configured it to export 6 logical disks and tried creating raid6 to see if I can get better results. Trying to create an array with a missing component results in: ~/mdadm-2.5/mdadm -C /dev/md3 -l6 -n6 /dev/rd/c0d0p3 /dev/rd/c0d2p3 /dev/rd/c0d3p3 /dev/rd/c0d4p3 /dev/rd/c0d5p3 missing mdadm: RUN_ARRAY failed: Input/output error There should have been some messages in the kernel log when this happened. Can you report them too? Thanks, NeilBrown
Re: raid5 hang on get_active_stripe
On Friday June 2, [EMAIL PROTECTED] wrote: On Thu, 1 Jun 2006, Neil Brown wrote: I've got one more long-shot I would like to try first. If you could back out that change to ll_rw_block, and apply this patch instead. Then when it hangs, just cat the stripe_cache_active file and see if that unplugs things or not (cat it a few times). nope that didn't unstick it... i had to raise stripe_cache_size (from 256 to 768... 512 wasn't enough)... -dean Ok, thanks. I still don't know what is really going on, but I'm 99.9863% sure this will fix it, and is a reasonable thing to do. (Yes, I lose a ';'. That is deliberate). Please let me know what this proves, and thanks again for your patience. NeilBrown Signed-off-by: Neil Brown [EMAIL PROTECTED] ### Diffstat output ./drivers/md/raid5.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c --- ./drivers/md/raid5.c~current~ 2006-05-28 21:56:56.0 +1000 +++ ./drivers/md/raid5.c 2006-06-02 17:24:07.0 +1000 @@ -285,7 +285,7 @@ static struct stripe_head *get_active_st (conf->max_nr_stripes *3/4) || !conf->inactive_blocked), conf->device_lock, - unplug_slaves(conf->mddev); + raid5_unplug_device(conf->mddev->queue) ); conf->inactive_blocked = 0; } else
Re: Clarifications about check/repair, i.e. RAID SCRUBBING
On Friday June 2, [EMAIL PROTECTED] wrote: In any regard: I'm talking about triggering the following functionality: echo check > /sys/block/mdX/md/sync_action echo repair > /sys/block/mdX/md/sync_action On a RAID5, and soon a RAID6, I'm looking to set up a cron job, and am trying to figure out what exactly to schedule. The answers to the following questions might shed some light on this: 1. GENERALLY SPEAKING, WHAT IS THE DIFFERENCE BETWEEN THE CHECK AND REPAIR COMMANDS? The md.txt doc mentions for check that a repair may also happen for some raid levels. Which RAID levels, and in what cases? If I perform a check is there a cache of bad blocks that need to be fixed that can quickly be repaired by executing the repair command? Or would it go through the entire array again? I'm working with new drives, and haven't come across any bad blocks to test this with. 'check' just reads everything and doesn't trigger any writes unless a read error is detected, in which case the normal read-error handling kicks in. So it can be useful on a read-only array. 'repair' does the same, but when it finds an inconsistency it corrects it by writing something. If any raid personality had not been taught to specifically understand 'check', then a 'check' run would effect a 'repair'. I think 2.6.17 will have all personalities doing the right thing. check doesn't keep a record of problems, just a count. 'repair' will reprocess the whole array. 2. CAN CHECK BE RUN ON A DEGRADED ARRAY (say with N out of N+1 disks on a RAID level 5)? I can test this out, but was it designed to do this, versus REPAIR only working on a full set of active drives? Perhaps repair is assuming that I have N+1 disks so that parity can be WRITTEN? No, check on a degraded raid5, or a raid6 with 2 missing devices, or a raid1 with only one device will not do anything. It will terminate immediately. After all, there is nothing useful that it can do. 3.
RE: FEEDBACK/LOGGING: it seems that I might see some messages in dmesg logging output such as raid5: read error corrected!, is that right? I realize that mismatch_count can also be used to see if there was any action during a check or repair. I'm assuming this stuff doesn't make its way into an email. You are correct on all counts. mdadm --monitor doesn't know about this yet. ((writes notes in mdadm todo list)). 4. DOES REPAIR PERFORM READS TO CHECK THE ARRAY, AND THEN WRITE TO THE ARRAY *ONLY WHEN NECESSARY* TO PERFORM FIXES FOR CERTAIN BLOCKS? (I know, it's sorta a repeat of question number 1+2). repair only writes when necessary. In the normal case, it will only read every block. 5. IS THERE ILL-EFFECT TO STOP EITHER CHECK OR REPAIR BY ISSUING IDLE? No. 6. IS IT AT ALL POSSIBLE TO CHECK A CERTAIN RANGE OF BLOCKS? And to keep track of which blocks were checked? The motivation is to start checking some blocks overnight, and to pick up where I left off the next night... Not yet. It might be possible one day. 7. ANY OTHER CONSIDERATIONS WHEN SCRUBBING THE RAID? Not that I am aware of. NeilBrown Sorry for some of these questions being so similar in nature. I just want to make sure I understand it correctly. Neil, again, a BIG thanks for this new functionality. I'm looking forward to putting a system in place to exercise my drives! Cheers, -- roy
Re: RAID5E
On Wednesday May 31, [EMAIL PROTECTED] wrote: Where I was working most recently some systems were using RAID5E (RAID5 with both the parity and hot spare distributed). This seems to be highly desirable for small arrays, where spreading head motion over one more drive will improve performance, and in all cases where a rebuild to the hot spare will avoid a bottleneck on a single drive. Is there any plan to add this capability? I thought about it briefly As I understand it, the layout of raid5e when non-degraded is very similar to raid6 - however the 'Q' block is simply not used. This would be trivial to implement. The interesting bit comes when a device fails and you want to rebuild that distributed spare. There are two possible ways that you could do this: 1/ Leave the spare where it is and write the correct data into each spare. This would be fairly easy but would leave an array with a very ... interesting layout of data. When you add a replacement you just move everything back. 2/ Reshape the array to be a regular raid5 layout. This would be hard to do well without NVRAM as you are moving live data, but would result in a neat and tidy array. Of course adding a drive back in would be interesting again... I had previously only thought of option '2', and so discarded the idea as not worth the effort. The more I think about it, the more possible option 1 sounds. I've put it back on my todo list, but I don't expect to get to it this year. Of course if someone else wants to give it a try, I'm happy to make suggestions and review code. NeilBrown
Re: [PATCH 006 of 10] md: Set/get state of array via sysfs
On Wednesday May 31, [EMAIL PROTECTED] wrote: * NeilBrown ([EMAIL PROTECTED]) wrote: This allows the state of an md/array to be directly controlled via sysfs and adds the ability to stop an array without tearing it down. Array states/settings: clear No devices, no size, no level Equivalent to STOP_ARRAY ioctl It looks like this demoted CAP_SYS_ADMIN to CAP_DAC_OVERRIDE for the equiv ioctl. Intentional? Uhm.. no. Thanks. I'll fix that, see if I've done similar things elsewhere, and keep it in mind for the future. NeilBrown
Re: [PATCH 008 of 10] md: Allow raid 'layout' to be read and set via sysfs.
On Wednesday May 31, [EMAIL PROTECTED] wrote: * NeilBrown ([EMAIL PROTECTED]) wrote: +static struct md_sysfs_entry md_layout = +__ATTR(layout, 0655, layout_show, layout_store); 0644? I think the correct response is Doh! :-) Yes, thanks, NeilBrown
Re: RAID 5 Whole Devices - Partition
On Tuesday May 30, [EMAIL PROTECTED] wrote: Hello, I am trying to create a RAID5 array out of 3 160GB SATA drives. After I create the array I want to partition the device into 2 partitions. The system lies on a SCSI disk and the 2 partitions will be used for data storage. The SATA host is an HPT374 device with drivers compiled in the kernel. These are the steps I followed mdadm -Cv --auto=part /dev/md_d0 --chunk=64 -l 5 --raid-devices=3 /dev/hde /dev/hdi /dev/hdk Running this command notifies me that there is an ext2 fs on one of the drives even though I fdisked them before and removed all partitions. Why is this happening? The ext2 superblock is in the second 1K of the device. The only place that fdisk writes is in the first 512 bytes. So fdisk is never going to remove the signature of an ext2 filesystem. In any case I continue with the array creation This is the right thing to do. After initialization 5 new devices are created in /dev /dev/md_d0 /dev/md_d0p1 /dev/md_d0_p1 /dev/md_d0_p2 /dev/md_d0_p3 /dev/md_d0_p4 The problems arise when I reboot. A device /dev/md0 seems to keep the 3 disks busy and as a result when
Examining that shows it's clean and ok /dev/md_d0: Version : 00.90.01 Creation Time : Tue May 30 17:03:31 2006 Raid Level : raid5 Array Size : 312581632 (298.10 GiB 320.08 GB) Device Size : 156290816 (149.05 GiB 160.04 GB) Raid Devices : 3 Total Devices : 3 Preferred Minor : 0 Persistence : Superblock is persistent Update Time : Tue May 30 19:48:03 2006 State : clean Active Devices : 3 Working Devices : 3 Failed Devices : 0 Spare Devices : 0 Layout : left-symmetric Chunk Size : 64K Number Major Minor RaidDevice State 0 33 0 0 active sync /dev/hde 1 56 0 1 active sync /dev/hdi 2 57 0 2 active sync /dev/hdk UUID : 9f520781:7f3c2052:1cb5078e:c3f3b95c Events : 0.2 Is this the expected behavior? Why doesn't the kernel ignore /dev/md0 instead of trying to use it? I tried using raid=noautodetect but it didn't help I am using 2.6.9 Must be something else trying to start the array. Maybe a stray 'raidstart'. Maybe something in an initrd. This is my mdadm.conf DEVICE /dev/hde /dev/hdi /dev/hdk ARRAY /dev/md_d0 level=raid5 num-devices=3 UUID=9f520781:7f3c2052:1cb5078e:c3f3b95c devices=/dev/hde,/dev/hdi,/dev/hdk auto=partition MAILADDR [EMAIL PROTECTED] This should work providing the device names of the ide drives never change -- which is fairly safe. It isn't safe for SCSI drives. Furthermore when I fdisk the drives after all of this I can see the 2 partitions on /dev/hde and /dev/hdi but /dev/hdk shows that no partition exists. Is this a sign of data corruption or drive failure? Shouldn't all 3 drives show the same partition information? No. The drives shouldn't really have partition information at all. The raid array has the partition information. However the first block of /dev/hde is also the first block of /dev/md_d0, so it will appear to have the same partition table. And the first block of /dev/hdk is an 'xor' of the first blocks of hdi and hde. So if the first block of hdi is all zeros, then the first block of /dev/hdk will have the same partition table.
fdisk /dev/hde /dev/hde1 1 19457 156288352 fd Linux raid autodetect fdisk /dev/hdi /dev/hdi1 1 19457 156288321 fd Linux raid autodetect When you created the partitions in /dev/md_d0, you must have set the partition type to 'Linux raid autodetect'. You don't want to do that. Change it to 'Linux' or whatever. NeilBrown
Re: raid5 hang on get_active_stripe
On Tuesday May 30, [EMAIL PROTECTED] wrote: On Tue, 30 May 2006, Neil Brown wrote: Could you try this patch please? On top of the rest. And if it doesn't fail in a couple of days, tell me how regularly the message kblockd_schedule_work failed gets printed. i'm running this patch now ... and just after reboot, no freeze yet, i've already seen a handful of these: May 30 17:05:09 localhost kernel: kblockd_schedule_work failed May 30 17:05:59 localhost kernel: kblockd_schedule_work failed May 30 17:08:16 localhost kernel: kblockd_schedule_work failed May 30 17:10:51 localhost kernel: kblockd_schedule_work failed May 30 17:11:51 localhost kernel: kblockd_schedule_work failed May 30 17:12:46 localhost kernel: kblockd_schedule_work failed May 30 17:14:14 localhost kernel: kblockd_schedule_work failed 1 every minute or so. That's probably more than I would have expected, but strongly lends evidence to the theory that this is the problem. I certainly wouldn't expect a failure every time kblockd_schedule_work failed (in the original code), but the fact that it does fail sometimes means there is a possible race which can cause the failure that you experienced. So I am optimistic that the patch will have fixed the problem. Please let me know when you reach an uptime of 3 days. Thanks, NeilBrown
Re: raid5 hang on get_active_stripe
On Tuesday May 30, [EMAIL PROTECTED] wrote: actually i think the rate is higher... i'm not sure why, but klogd doesn't seem to keep up with it: [EMAIL PROTECTED]:~# grep -c kblockd_schedule_work /var/log/messages 31 [EMAIL PROTECTED]:~# dmesg | grep -c kblockd_schedule_work 8192 # grep 'last message repeated' /var/log/messages ?? Obviously even faster than I thought. I guess workqueue threads must take a while to get scheduled... I'm beginning to wonder if I really have found the bug after all :-( I'll look forward to the results either way. Thanks, NeilBrown
Re: [PATCH] mdadm 2.5 (Was: ANNOUNCE: mdadm 2.5 - A tool for managing Soft RAID under Linux)
On Monday May 29, [EMAIL PROTECTED] wrote: On Mon, May 29, 2006 at 12:08:25PM +1000, Neil Brown wrote: On Sunday May 28, [EMAIL PROTECTED] wrote: Thanks for the patches. They are greatly appreciated. You're welcome - mdadm-2.3.1-kernel-byteswap-include-fix.patch reverts a change introduced with mdadm 2.3.1 for redhat compatibility. asm/byteorder.h is an architecture-dependent file and does more than include linux/byteorder/XXX_endian.h; the fact that not including asm/byteorder.h leaves __BYTEORDER_HAS_U64__ undefined is just one example of the issues that might arise. if redhat is broken it should be worked around differently than by breaking mdadm. I don't understand the problem here. What exactly breaks with the code currently in 2.5? mdadm doesn't need __BYTEORDER_HAS_U64__, so why does not having it defined break anything? The comment from the patch says: not including asm/byteorder.h will not define __BYTEORDER_HAS_U64__, causing __fswab64 to be undefined and mdadm to fail to compile on big-endian architectures like PPC. But again, mdadm doesn't use __fswab64. More details please. you use __cpu_to_le64 (e.g. in super0.c line 987): bms->sync_size = __cpu_to_le64(size); which in byteorder/big_endian.h is defined as #define __cpu_to_le64(x) ((__force __le64)__swab64((x))) but __swab64 is defined in byteorder/swab.h (included by byteorder/big_endian.h) as

#if defined(__GNUC__) && (__GNUC__ >= 2) && defined(__OPTIMIZE__)
#  define __swab64(x) \
	(__builtin_constant_p((__u64)(x)) ? \
	 ___swab64((x)) : \
	 __fswab64((x)))
#else
#  define __swab64(x) __fswab64(x)
#endif /* OPTIMIZE */

Grrr.. Thanks for the details. I think I'll just give up and do it myself. e.g.

short swap16(short in)
{
	int i;
	short out = 0;
	for (i = 0; i < 2; i++) {	/* two bytes in a 16-bit value */
		out = (out << 8) | (in & 255);
		in = in >> 8;
	}
	return out;
}

I don't need top performance and at least this should be portable... 
NeilBrown
Re: raid5 hang on get_active_stripe
On Saturday May 27, [EMAIL PROTECTED] wrote: On Sat, 27 May 2006, Neil Brown wrote: Thanks. This narrows it down quite a bit... too much in fact: I can now say for sure that this cannot possibly happen :-) 2/ The message.gz you sent earlier with the echo t > /proc/sysrq-trigger trace in it didn't contain information about md4_raid5 - the controlling thread. got another hang again this morning... full dmesg output attached. Thanks. Nothing surprising there, which maybe is a surprise itself... I'm still somewhat stumped by this. But given that it is nicely repeatable, I'm sure we can get there... The following patch adds some more tracing to raid5, and might fix a subtle bug in ll_rw_blk, though it is an incredibly long shot that this could be affecting raid5 (if it is, I'll have to assume there is another bug somewhere). It certainly doesn't break ll_rw_blk. Whether it actually fixes something I'm not sure. If you could try with these on top of the previous patches I'd really appreciate it. When you read from /stripe_cache_active, it should trigger a (cryptic) kernel message within the next 15 seconds. If I could get the contents of that file and the kernel messages, that should help. 
Thanks heaps, NeilBrown Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./block/ll_rw_blk.c  |    4 ++--
 ./drivers/md/raid5.c |   18 ++++++++++++++++++
 2 files changed, 20 insertions(+), 2 deletions(-)

diff ./block/ll_rw_blk.c~current~ ./block/ll_rw_blk.c
--- ./block/ll_rw_blk.c~current~	2006-05-28 21:54:23.000000000 +1000
+++ ./block/ll_rw_blk.c	2006-05-28 21:55:17.000000000 +1000
@@ -874,7 +874,7 @@ static void __blk_queue_free_tags(reques
 	}
 
 	q->queue_tags = NULL;
-	q->queue_flags &= ~(1 << QUEUE_FLAG_QUEUED);
+	clear_bit(QUEUE_FLAG_QUEUED, &q->queue_flags);
 }
 
 /**
@@ -963,7 +963,7 @@ int blk_queue_init_tags(request_queue_t
 	 * assign it, all done
 	 */
 	q->queue_tags = tags;
-	q->queue_flags |= (1 << QUEUE_FLAG_QUEUED);
+	set_bit(QUEUE_FLAG_QUEUED, &q->queue_flags);
 	return 0;
 fail:
 	kfree(tags);

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~	2006-05-27 09:17:10.000000000 +1000
+++ ./drivers/md/raid5.c	2006-05-28 21:56:56.000000000 +1000
@@ -1701,13 +1701,20 @@ static sector_t sync_request(mddev_t *md
  * During the scan, completed stripes are saved for us by the interrupt
  * handler, so that they will not have to wait for our next wakeup.
  */
+static unsigned long trigger;
+
 static void raid5d (mddev_t *mddev)
 {
 	struct stripe_head *sh;
 	raid5_conf_t *conf = mddev_to_conf(mddev);
 	int handled;
+	int trace = 0;
 
 	PRINTK("+++ raid5d active\n");
+	if (test_and_clear_bit(0, &trigger))
+		trace = 1;
+	if (trace)
+		printk("raid5d runs\n");
 
 	md_check_recovery(mddev);
@@ -1725,6 +1732,13 @@ static void raid5d (mddev_t *mddev)
 			activate_bit_delay(conf);
 		}
 
+		if (trace)
+			printk(" le=%d, pas=%d, bqp=%d le=%d\n",
+			       list_empty(&conf->handle_list),
+			       atomic_read(&conf->preread_active_stripes),
+			       blk_queue_plugged(mddev->queue),
+			       list_empty(&conf->delayed_list));
+
 		if (list_empty(&conf->handle_list) &&
 		    atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD &&
 		    !blk_queue_plugged(mddev->queue) &&
@@ -1756,6 +1770,8 @@ static void raid5d (mddev_t *mddev)
 	unplug_slaves(mddev);
 
 	PRINTK("--- raid5d inactive\n");
+	if (trace)
+		printk("raid5d done\n");
 }
 
 static ssize_t
@@ -1813,6 +1829,7 @@ stripe_cache_active_show(mddev_t *mddev,
 		struct list_head *l;
 		n = sprintf(page, "%d\n", atomic_read(&conf->active_stripes));
 		n += sprintf(page+n, "%d preread\n", atomic_read(&conf->preread_active_stripes));
+		n += sprintf(page+n, "%splugged\n", blk_queue_plugged(mddev->queue)?"":"not ");
 		spin_lock_irq(&conf->device_lock);
 		c1=0;
 		list_for_each(l, &conf->bitmap_list)
@@ -1822,6 +1839,7 @@ stripe_cache_active_show(mddev_t *mddev,
 			c2++;
 		spin_unlock_irq(&conf->device_lock);
 		n += sprintf(page+n, "bitlist=%d delaylist=%d\n", c1, c2);
+		trigger = 0xffffffff;
 		return n;
 	} else
 		return 0;
Re: [PATCH] mdadm 2.5 (Was: ANNOUNCE: mdadm 2.5 - A tool for managing Soft RAID under Linux)
On Sunday May 28, [EMAIL PROTECTED] wrote: On Fri, May 26, 2006 at 04:33:08PM +1000, Neil Brown wrote: I am pleased to announce the availability of mdadm version 2.5 hello, i tried rebuilding mdadm 2.5 on current mandriva cooker, which uses gcc-4.1.1, glibc-2.4 and dietlibc 0.29, and found the following issues addressed by the patches attached to this message. I would be glad if you could review these patches and include them in upcoming mdadm releases. Thanks for the patches. They are greatly appreciated. - mdadm-2.3.1-kernel-byteswap-include-fix.patch reverts a change introduced with mdadm 2.3.1 for redhat compatibility. asm/byteorder.h is an architecture-dependent file and does more than include linux/byteorder/XXX_endian.h; the fact that not including asm/byteorder.h does not define __BYTEORDER_HAS_U64__ is just one example of the issues that might arise. if redhat is broken it should be worked around differently than by breaking mdadm. I don't understand the problem here. What exactly breaks with the code currently in 2.5? mdadm doesn't need __BYTEORDER_HAS_U64__, so why does not having it defined break anything? The comment from the patch says: not including asm/byteorder.h will not define __BYTEORDER_HAS_U64__, causing __fswab64 to be undefined and mdadm to fail to compile on big-endian architectures like PPC. But again, mdadm doesn't use __fswab64. More details please. - mdadm-2.4-snprintf.patch this is self-commenting, just an error in the snprintf call I wonder how that snuck in... There was an odd extra tab in the patch, but no matter. I changed it to use 'sizeof(buf)' to be consistent with other uses of snprintf. Thanks. - mdadm-2.4-strict-aliasing.patch fix for another strict-aliasing problem; you can typecast a reference to a void pointer to anything, but you cannot typecast a reference to a struct. Why can't I typecast a reference to a struct??? It seems very unfair... However I have no problem with the patch. Applied. Thanks. 
I should really change it to use 'list.h' type lists from the linux kernel. - mdadm-2.5-mdassemble.patch pass CFLAGS to the mdassemble build; enabling -Wall -Werror showed some issues also fixed by the patch. yep, thanks. - mdadm-2.5-rand.patch Posix dictates rand() versus the bsd random() function, and dietlibc deprecated random(), so switch to srand()/rand() and make everybody happy. Everybody? 'man 3 rand' tells me: Do not use this function in applications intended to be portable when good randomness is needed. Admittedly mdadm doesn't need to be portable - it only needs to run on Linux. But this line in the man page bothers me. I guess -Drandom=rand -Dsrandom=srand might work no. stdlib.h doesn't like that. 'random' returns 'long int' while rand returns 'int'. Interestingly 'random_r' returns 'int' as does 'rand_r'.

#ifdef __dietlibc__
#include <strings.h>
/* dietlibc has deprecated random and srandom!! */
#define random rand
#define srandom srand
#endif

in mdadm.h. Will that do you? - mdadm-2.5-unused.patch glibc 2.4 is pedantic about ignoring return values from fprintf, fwrite and write, so now we check the return value and actually do something with it. in the Grow.c case i only print a warning, since i don't think we can do anything in case we fail invalidating those superblocks (it should never happen, but then...) Ok, thanks. You can see these patches at http://neil.brown.name/cgi-bin/gitweb.cgi?p=mdadm more welcome :-) Thanks again, NeilBrown
Re: [patch] install a static build
On Sunday May 28, [EMAIL PROTECTED] wrote: Hello Luca, maybe you better add an install-static target. you're right, that would be a cleaner approach. I've done so, and while doing so added install-tcc, install-uclibc, install-klibc too. And while I'm busy in the Makefile anyway I've made a third patch which adds the uninstall: target too. -- --- Dirk Jagdmann http://cubic.org/~doj - http://llg.cubic.org thanks for these. They are now in my git tree: http://neil.brown.name/cgi-bin/gitweb.cgi?p=mdadm git://neil.brown.name/mdadm They claim me as their author, I'm afraid... I'll have to fix my scripts to get it right next time :-( NeilBrown
Re: problems with raid=noautodetect
On Friday May 26, [EMAIL PROTECTED] wrote: On Tue, May 23, 2006 at 08:39:26AM +1000, Neil Brown wrote: Presumably you have a 'DEVICE' line in mdadm.conf too? What is it? My first guess is that it isn't listing /dev/sdd? somehow. Neil, i am seeing a lot of people that fall into this same error, and i would propose a way of avoiding this problem 1) make DEVICE partitions the default if no DEVICE line is specified. As you note, we think alike on this :-) 2) deprecate the DEVICE keyword, issuing a warning when it is found in the configuration file Not sure I'm so keen on that, at least not in the near term. 3) introduce DEVICEFILTER or a similar keyword with the same meaning as the actual DEVICE keyword If it has the same meaning, why not leave it called 'DEVICE'??? However, there is at least the beginnings of a good idea here. If we assume there is a list of devices provided by a (possibly default) 'DEVICE' line, then DEVICEFILTER !pattern1 !pattern2 pattern3 pattern4 could mean that any device in that list which matches pattern 1 or 2 is immediately discarded, any remaining device that matches pattern 3 or 4 is included, and the remainder are discarded. The rule could be that the default is to include any devices that don't match a !pattern, unless there is a pattern without a '!', in which case the default is to reject. Is that straightforward enough, or do I need an 'order allow,deny' like apache has? Thanks for the suggestion. NeilBrown
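For reference, a minimal mdadm.conf in the style Neil recommends elsewhere in this thread - scan everything, identify arrays by UUID rather than by device name (the UUID below is a placeholder, taken from an example earlier in this digest):

```
# let mdadm scan every partition listed in /proc/partitions
DEVICE partitions
# identify the array by UUID so device-name reshuffles don't matter
ARRAY /dev/md0 level=raid5 UUID=86ed1434:43380717:4abf124e:970d843a
```

With this shape of config, a devices= clause is unnecessary and usually counterproductive, since device names can change between boots.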
Re: RAID5 kicks non-fresh drives
On Friday May 26, [EMAIL PROTECTED] wrote: I had no idea about this particular configuration requirement. None of just to be clear: it's not a requirement. if you want the very nice auto-assembling behaviour, you need to designate the auto-assemblable partitions. but you can assemble manually without 0xfd partitions (even if that's in an initrd, for instance.) I think the current situation is good, since there is some danger of going too far. for instance, testing each partition to see whether it contains a valid superblock would be pretty crazy, right? I'm curious: why exactly do you say that? Doing the reads themselves cannot be a problem as the kernel already reads the partition table from each device. Reading superblocks is no big deal. requiring either the auto-assemble-me partition type, or explicit partitions given in a config file is a happy medium... If you don't like the idea of assembling everything that was found, how is that different from assembling everything that was found which had an 'auto-assemble-me' flag? That flag, in common usage, contains almost zero information more than the existence of the raid superblock. Am I missing something? My opinion: the auto-assemble-me partition type is not a happy medium. The superblock containing the hostname (as supported by mdadm-2.5) is (I hope). NeilBrown
Re: RAID5 kicks non-fresh drives
On Friday May 26, [EMAIL PROTECTED] wrote: On Thu, 25 May 2006, Craig Hollabaugh wrote: That did it! I set the partition FS Types from 'Linux' to 'Linux raid autodetect' after my last re-sync completed. Manually stopped and started the array. Things looked good, so I crossed my fingers and rebooted. The kernel found all the drives and all is happy here in Colorado. Would it make sense for the raid code to somehow warn in the log when a device in a raid set doesn't have the Linux raid autodetect partition type? If this was in dmesg, would you have spotted the problem before? Maybe. Unfortunately md doesn't really have direct access to information on partition types. The way it gets access for auto-detect is an ugly hack which I would rather not make any further use of. Maybe mdadm could be more helpful here. e.g. when you create, assemble, or 'detail' an array it could report any inconsistencies in the partition types, and when you --add a device which isn't a Raid-autodetect partition to an array that currently comprises such partitions it could give a warning. I had thought that 'libblkid' would help with that, but having looked at the doco, it appears not. Maybe I'll use libparted... or maybe borrow code out of kpartx. There don't seem to be any easy options ;-( Thanks for the suggestion (and if anyone has some good partition hacking code...) NeilBrown
Re: mdadm and 2.4 kernel?
On Thursday May 25, [EMAIL PROTECTED] wrote: Hi, for various reasons i'll need to run mdadm on a 2.4 kernel. Now I have a 2.4.32 kernel. Take a look: [EMAIL PROTECTED]:~# mdadm --create --verbose /dev/md0 --level=1 --bitmap=/root/md0bitmap -n 2 /dev/nda /dev/ndb --force --assume-clean mdadm: /dev/nda appears to be part of a raid array: level=raid1 devices=2 ctime=Thu May 25 20:10:47 2006 mdadm: /dev/ndb appears to be part of a raid array: level=raid1 devices=2 ctime=Thu May 25 20:10:47 2006 mdadm: size set to 39118144K Continue creating array? y mdadm: Warning - bitmaps created on this kernel are not portable between different architectures. Consider upgrading the Linux kernel. mdadm: Cannot set bitmap file for /dev/md0: No such device 2.4 does not support bitmaps (nor do early 2.6 kernels). [EMAIL PROTECTED]:~# mdadm --create --verbose /dev/md0 --level=1 -n 2 /dev/nda /dev/ndb --force --assume-clean mdadm: /dev/nda appears to be part of a raid array: level=raid1 devices=2 ctime=Thu May 25 20:10:47 2006 mdadm: /dev/ndb appears to be part of a raid array: level=raid1 devices=2 ctime=Thu May 25 20:10:47 2006 mdadm: size set to 39118144K Continue creating array? y mdadm: SET_ARRAY_INFO failed for /dev/md0: File exists [EMAIL PROTECTED]:~# It seems /dev/md0 is already active somehow. Try mdadm -S /dev/md0 first. What does cat /proc/mdstat say? NeilBrown Obviously the devices /dev/nda and /dev/ndb exist (i can run fdisk on them). Can someone help me? Thanks. Stefano.
Re: raid5 hang on get_active_stripe
On Friday May 26, [EMAIL PROTECTED] wrote: On Tue, 23 May 2006, Neil Brown wrote: i applied them against 2.6.16.18 and two days later i got my first hang... below is the stripe_cache foo. thanks -dean neemlark:~# cd /sys/block/md4/md/ neemlark:/sys/block/md4/md# cat stripe_cache_active 255 0 preread bitlist=0 delaylist=255 neemlark:/sys/block/md4/md# cat stripe_cache_active 255 0 preread bitlist=0 delaylist=255 neemlark:/sys/block/md4/md# cat stripe_cache_active 255 0 preread bitlist=0 delaylist=255 Thanks. This narrows it down quite a bit... too much in fact: I can now say for sure that this cannot possibly happen :-) Two things that might be helpful: 1/ Do you have any other patches on 2.6.16.18 other than the 3 I sent you? If you do I'd like to see them, just in case. 2/ The message.gz you sent earlier with the echo t > /proc/sysrq-trigger trace in it didn't contain information about md4_raid5 - the controlling thread for that array. It must have missed out due to a buffer overflowing. Next time it happens, could you get this trace again and see if you can find out what md4_raid5 is doing. Maybe do the 'echo t' several times. I think that you need a kernel recompile to make the dmesg buffer larger. Thanks for your patience - this must be very frustrating for you. NeilBrown
Re: RAID5 kicks non-fresh drives
On Thursday May 25, [EMAIL PROTECTED] wrote: From dmesg md: Autodetecting RAID arrays. md: autorun ... md: considering sdl1 ... md: adding sdl1 ... md: adding sdi1 ... md: adding sdh1 ... md: adding sdg1 ... md: adding sdf1 ... md: adding sde1 ... md: adding sdd1 ... md: adding sdc1 ... md: adding sdb1 ... md: adding sda1 ... md: adding hdc1 ... md: created md0 The kernel didn't add sdj or sdk. And the partition types of sdj1 and sdk1 are ??? NeilBrown - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Max. md array size under 32-bit i368 ...
On Wednesday May 24, [EMAIL PROTECTED] wrote: I know this has come up before, but a few quick googles haven't answered my questions - I'm after the max. array size that can be created under bog-standard 32-bit intel Linux, and any issues re. partitioning. I'm aiming to create a raid-6 over 12 x 500GB drives - am I going to have any problems? No, this should work providing your kernel is compiled with CONFIG_LBD. NeilBrown (I'm not partitioning the resulting md device, just the underlying sd devices and building a single md out of sd[a-l]4 ...) Cheers, Gordon
Re: 4 disks in raid 5: 33MB/s read performance?
On Wednesday May 24, [EMAIL PROTECTED] wrote: Mark Hahn wrote: I just dd'ed a 700MB iso to /dev/null, dd returned 33MB/s. Isn't that a little slow? what bs parameter did you give to dd? it should be at least 3*chunk (probably 3*64k if you used defaults.) I would expect readahead to make this unproductive. Mind you, I didn't say it is, but I can't see why not. There was a problem with data going through the stripe cache when it didn't need to, but I thought that was fixed. Neil? Am I an optimist? Probably. You are right about readahead - it should make the difference in block size irrelevant. You are wrong about the problem of reading through the cache being fixed. It hasn't been yet. We still read through the cache. However that shouldn't cause more than a 10% speed reduction, and 33MB/s sounds like more than 10% down. NeilBrown
Re: iostat messed up with md on 2.6.16.x
On Wednesday May 24, [EMAIL PROTECTED] wrote: Hi, I upgraded my kernel from 2.6.15.6 to 2.6.16.16 and now 'iostat -x 1' permanently shows 100% utilisation on each disk that is a member of an md array. I asked my friend, who is using 3 boxes with 2.6.16.2, 2.6.16.9 and 2.6.16.11 and raid1, and he reported the same too. does it work for anyone? I don't think that it's exactly an md problem, but it only appears with md, so I wrote here I did some basic debugging in the evening and I think the problem is the double calling of disk->in_flight-- in block/ll_rw_blk.c - I don't know why, but here's a sample line from /proc/diskstats after a raid array is assembled: 80 sda 52 1134 8256 568 3 7 24 16 4294967295 433820 4294534144 ^^ in_flight = -1 I wrote an ugly workaround and now iostat is working well [see attach#1], but if it's a real bug, someone should find the root cause of it, please http://lkml.org/lkml/2006/5/23/42 might help... NeilBrown
Re: Does software RAID take advantage of SMP, or 64 bit CPU(s)?
On Monday May 22, [EMAIL PROTECTED] wrote: A few simple questions about the 2.6.16+ kernel and software RAID. Does software RAID in the 2.6.16 kernel take advantage of SMP? Not exactly. RAID5/6 tends to use just one cpu for parity calculations, but that frees up other cpus for doing other important work. Does software RAID take advantage of 64-bit CPU(s)? No more or less than other code in the kernel. Sometimes using a 64-bit CPU is a cost because more data is shuffled around... Was there some particular sort of 'advantage' that you were thinking of? NeilBrown If there are any good web sites that cover this information, a link would be GREAT! -Adam Talbot
Re: improving raid 5 performance
On Tuesday May 23, [EMAIL PROTECTED] wrote: Neil hello. 1. i have applied the common path according to http://www.spinics.net/lists/raid/msg11838.html as much as i can. Great. I look forward to seeing the results. it looks ok in terms of throughput. before i continue to a non-common path ( step 3 ) i do not understand raid0_mergeable_bvec entirely. Not too surprising - it is rather subtle unfortunately. as i understand, the code checks alignment. i made a version for this purpose which looks like this: Yes, it checks alignment with the chunks and devices. However we always have to allow one page to be added to a bio, so sometimes we have to accept a bio that crosses a chunk/device boundary. The main (possibly only) user is __bio_add_page in fs/bio.c, so we basically code the merge_bvec_fn to meet the needs of that code.

static int raid5_mergeable_bvec(request_queue_t *q, struct bio *bio,
				struct bio_vec *biovec)
{
	mddev_t *mddev = q->queuedata;
	sector_t sector = bio->bi_sector + get_start_sect(bio->bi_bdev);
	int max;
	unsigned int chunk_sectors = mddev->chunk_size >> 9;
	unsigned int bio_sectors = bio->bi_size >> 9;

	max = (chunk_sectors - ((sector & (chunk_sectors-1)) + bio_sectors)) << 9;
	if (max < 0) {
		printk("handle_aligned_read not aligned %d %d %d %lld\n",
		       max, chunk_sectors, bio_sectors, sector);
		return -1; /* is bigger than one chunk size */
	}
	/* printk("handle_aligned_read aligned %d %d %d %lld\n",
	   max, chunk_sectors, bio_sectors, sector); */
	return max;
}

you cannot return a negative number, because the result is compared with an 'unsigned int', and the comparison will be unsigned. So return -1 is a problem. I think you need to make this code look a lot more like raid0_mergeable_bvec. Questions: 1.1 why did you drop the max=0 case ? I'm not sure what you mean by 'drop'. If bio->bi_size == 0, then we are not allowed to return a number smaller than biovec->bv_len, otherwise bio_add_page won't be able to put any pages on the bio, and so won't be able to start any IO. 1.2 what do these lines mean ? do i need them ?

	if (max <= biovec->bv_len && bio_sectors == 0)
		return biovec->bv_len;
	else
		return max;
}

Yes, you need this. It basically implements the above restriction. NeilBrown
Re: raid5 resize in 2.6.17 - how will it be different from raidreconf?
On Monday May 22, [EMAIL PROTECTED] wrote: Will it be less risky to grow an array that way? It should be. In particular it will survive an unexpected reboot (as long as you don't lose any drives at the same time), which I don't think raidreconf would. Testing results so far are quite positive. Write cache comes to mind - did you test power-fail scenarios? I haven't done any tests involving power-cycling the machine, but I doubt they would show anything. When a reshape restarts after a crash, at least the last few stripes are re-written, which should catch anything that was pending at the moment of power failure. (And while talking of that: can I add for example two disks and grow *and* migrate to raid6 in one sweep or will I have to go raid6 and then add more disks?) Adding two disks would be the preferred way to do it. Adding only one disk and going to raid6 is problematic because the reshape process will be over-writing live data the whole time, making crash protection quite expensive. By contrast, when you are expanding the size of the array, after the first few stripes you are writing to an area of the drives where there is no live data. Let me see if I got this right: if I add *two* disks and go from raid 5 to 6 with raidreconf, no live data needs to be overwritten and in case something fails I will still be able to assemble the old array..? I cannot speak for raidreconf, though my understanding is that it doesn't support raid6. If you mean md/reshape, then what will happen (raid5->raid6 isn't implemented yet) is this: The raid5 is converted to raid6 with more space incrementally. Once the process has been underway for a little while, there will be: - a region of the drives that is laid out as raid6 - the new layout - a region of the drives that is not in use at all - finally, a region of the drives that is still laid out as raid5. 
Data from the start of the last region is constantly copied into the start of the middle region, and the two region boundaries are moved forward regularly. While this happens the middle region grows. If there is a crash, on restart this layout (part raid5, part raid6) will be picked up and the reshaping process continued. There is a 'critical' section at the very beginning where the middle region is non-existent. To handle this we copy the first few blocks to somewhere safe (a file or somewhere on the new drives) and use that space as the middle region to copy data to. If the system reboots during this critical section, mdadm will restore the data from the backup that it made before assembling the array. If you want to convert a raid5 to a raid6 and only add one drive, it shouldn't be hard to see that the middle region never exists. To cope with this safely, mdadm would need to be constantly backing up sections of the array before allowing the kernel to reshape that section. This is certainly quite possible and may well be implemented one day, but can be expected to be quite slow. I hope that clarifies the situation. NeilBrown - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: problems with raid=noautodetect
On Monday May 22, [EMAIL PROTECTED] wrote: hi list, I read somewhere that it would be better not to rely on the autodetect mechanism in the kernel at boot time, but rather to set up /etc/mdadm.conf accordingly and boot with raid=noautodetect. Well, I tried that :) I set up /etc/mdadm.conf for my 2 raid5 arrays: snip # mountpoint: /home/media ARRAY /dev/md0 level=raid5 UUID=86ed1434:43380717:4abf124e:970d843a devices=/dev/sda1,/dev/sdb1,/dev/sdd3 # mountpoint: /mnt/raid ARRAY /dev/md1 level=raid5 UUID=baf59fb5:f4805e7a:91a77644:af3dde17 # devices=/dev/sda2,/dev/sdb2,/dev/sdd2 snap Presumably you have a 'DEVICE' line in mdadm.conf too? What is it? My first guess is that it isn't listing /dev/sdd? somehow. Otherwise, can you add a '-v' to the mdadm command that assembles the array, and capture the output. That might be helpful. NeilBrown and rebooted with raid=noautodetect. It booted fine, but the 3rd disks from each array (/dev/sdd2 and /dev/sdd3) were removed, so I had 2 degraded raid5 arrays. It was possible to re-add them with something like: mdadm /dev/md0 -a /dev/sdd3 (it synced and /proc/mdstat showed [UUU]) but after the next reboot, the two partitions were again removed ([UU_])?! This was a reproducible error; I tried it several times with different /etc/mdadm.conf settings (ARRAY statement with UUID=, devices=, UUID+devices, etc.). I'm now running autodetect again, all raid arrays are working fine, but can anyone explain this strange behaviour? (kernel-2.6.16.14, amd64) thanks, florian PS: please cc me, as I'm not subscribed to the list
Re: spin_lock_irq() in handle_stripe()
On Monday May 22, [EMAIL PROTECTED] wrote: Good day Neil, all. if I understand right, we disable irqs in handle_stripe() just because of using device_lock, which can be grabbed from interrupt context (_end_io functions). can we replace it by a new separate spinlock and not block interrupts in handle_stripe() + add_stripe_bio() ? Yes, irqs are disabled in handle_stripe, but only for relatively short periods of time. Do you have reason to think this is a problem? device_lock does currently protect a number of data structures. Not all of them are accessed in interrupt context, and so they could be changed to be protected by a different lock, possibly sh->lock. You would need to carefully work out exactly what it is protecting, determine which of those aren't accessed from interrupts, and see about moving them (one by one preferably) to a different lock. NeilBrown
Re: 4 disks in raid 5: 33MB/s read performance?
On Monday May 22, [EMAIL PROTECTED] wrote: I just dd'ed a 700MB iso to /dev/null, dd returned 33MB/s. Isn't that a little slow? System is a sil3114 4 port sata 1 controller with 4 samsung spinpoint 250GB, 8MB cache in raid 5 on an Athlon XP 2000+/512MB. Yes, read on raid5 isn't as fast as we might like at the moment. It looks like you are getting about 11MB/s off each disk, which is probably quite a bit slower than they can manage (what is the single-drive read speed you get dding from /dev/sda or whatever?). You could try playing with the readahead number (blockdev --setra/--getra). I'm beginning to think that the default setting is a little low. You could also try increasing the stripe-cache size by writing numbers to /sys/block/mdX/md/stripe_cache_size On my test system with a 4 drive raid5 over fast SCSI drives, I get 230MB/sec on drives that give 90MB/sec. If I increase the stripe_cache_size from 256 to 1024, I get 260MB/sec. I wonder if your SATA controller is causing you grief. Could you try dd if=/dev/SOMEDISK of=/dev/null bs=1024k count=1024 and then do the same again on all devices in parallel e.g. dd if=/dev/SOMEDISK of=/dev/null bs=1024k count=1024 dd if=/dev/SOMEOTHERDISK of=/dev/null bs=1024k count=1024 ... and see how the speeds compare. (I get about 55MB/sec on each of 5 drives, or 270MB/sec, which is probably hitting the SCSI bus limit, which has a theoretical max of 320MB/sec I think) NeilBrown
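Neil's comparison can be wrapped into a small script; a sketch only (the DISKS list is a placeholder for your actual array members, it needs root, and it performs reads only):

```
#!/bin/sh
# Read each disk alone, then all of them in parallel, and compare totals.
DISKS="/dev/sdb /dev/sdc /dev/sdd /dev/sde"   # placeholder names
for d in $DISKS; do
    echo "== $d alone =="
    dd if="$d" of=/dev/null bs=1024k count=1024
done
echo "== all disks in parallel =="
for d in $DISKS; do
    dd if="$d" of=/dev/null bs=1024k count=1024 &
done
wait
# Tunables mentioned above:
#   blockdev --getra /dev/md0                        # current readahead
#   blockdev --setra 4096 /dev/md0                   # try a larger value
#   echo 1024 > /sys/block/md0/md/stripe_cache_size
```

If the parallel total falls well short of the sum of the single-disk numbers, the controller or its bus is the bottleneck rather than raid5 itself.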
Re: raid5 hang on get_active_stripe
On Wednesday May 17, [EMAIL PROTECTED] wrote: On Thu, 11 May 2006, dean gaudet wrote: On Tue, 14 Mar 2006, Neil Brown wrote: On Monday March 13, [EMAIL PROTECTED] wrote: I just experienced some kind of lockup accessing my 8-drive raid5 (2.6.16-rc4-mm2). The system has been up for 16 days running fine, but now processes that try to read the md device hang. ps tells me they are all sleeping in get_active_stripe. There is nothing in the syslog, and I can read from the individual drives fine with dd. mdadm says the state is active. ... i seem to be running into this as well... it has happened several times in the past three weeks. i attached the kernel log output... it happened again... same system as before... I've spent all morning looking at this and while I cannot see what is happening I did find a couple of small bugs, so that is good... I've attached three patches. The first two fix small bugs (I think). The last adds some extra information to /sys/block/mdX/md/stripe_cache_active They are against 2.6.16.11. If you could apply them and if the problem recurs, report the content of stripe_cache_active several times before and after changing it, just like you did last time, that might help throw some light on the situation. Thanks, NeilBrown Status: ok Fix a plug/unplug race in raid5 When a device is unplugged, requests are moved from one or two (depending on whether a bitmap is in use) queues to the main request queue. So whenever requests are put on either of those queues, we should make sure the raid5 array is 'plugged'. However we don't. We currently plug the raid5 queue just before putting requests on queues, so there is room for a race. If something unplugs the queue at just the wrong time, requests will be left on the queue and nothing will want to unplug them. Normally something else will plug and unplug the queue fairly soon, but there is a risk that nothing will.
Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/raid5.c | 18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~	2006-05-23 12:27:58.0 +1000
+++ ./drivers/md/raid5.c	2006-05-23 12:28:26.0 +1000
@@ -77,12 +77,14 @@ static void __release_stripe(raid5_conf_
 		if (atomic_read(&conf->active_stripes)==0)
 			BUG();
 		if (test_bit(STRIPE_HANDLE, &sh->state)) {
-			if (test_bit(STRIPE_DELAYED, &sh->state))
+			if (test_bit(STRIPE_DELAYED, &sh->state)) {
 				list_add_tail(&sh->lru, &conf->delayed_list);
-			else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
-				 conf->seq_write == sh->bm_seq)
+				blk_plug_device(conf->mddev->queue);
+			} else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
+				   conf->seq_write == sh->bm_seq) {
 				list_add_tail(&sh->lru, &conf->bitmap_list);
-			else {
+				blk_plug_device(conf->mddev->queue);
+			} else {
 				clear_bit(STRIPE_BIT_DELAY, &sh->state);
 				list_add_tail(&sh->lru, &conf->handle_list);
 			}
@@ -1519,13 +1521,6 @@ static int raid5_issue_flush(request_que
 	return ret;
 }
 
-static inline void raid5_plug_device(raid5_conf_t *conf)
-{
-	spin_lock_irq(&conf->device_lock);
-	blk_plug_device(conf->mddev->queue);
-	spin_unlock_irq(&conf->device_lock);
-}
-
 static int make_request (request_queue_t *q, struct bio * bi)
 {
 	mddev_t *mddev = q->queuedata;
@@ -1577,7 +1572,6 @@ static int make_request (request_queue_t
 			goto retry;
 		}
 		finish_wait(&conf->wait_for_overlap, &w);
-		raid5_plug_device(conf);
 		handle_stripe(sh);
 		release_stripe(sh);

Status: ok

Fix some small races in bitmap plugging in raid5. The comment gives more details, but I didn't quite have the sequencing right, so there was room for races to leave bits unset in the on-disk bitmap for short periods of time.
Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/raid5.c | 30 +++++++++++++++++++++++++++---
 1 file changed, 27 insertions(+), 3 deletions(-)

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~	2006-05-23 12:28:26.0 +1000
+++ ./drivers/md/raid5.c	2006-05-23 12:28:53.0 +1000
@@ -15,6 +15,30 @@
  * Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
  */
 
+/*
+ * BITMAP UNPLUGGING:
+ *
+ * The sequencing for updating the bitmap reliably is a little
+ * subtle
Re: raid 5 read performance
On Sunday May 21, [EMAIL PROTECTED] wrote: Question: What is the cost of not walking through the raid5 code in the case of READ? If I add an error-handling code path, will it suffice? Please read http://www.spinics.net/lists/raid/msg11838.html and ask if you have further questions. NeilBrown
Re: raid5 resize in 2.6.17 - how will it be different from raidreconf?
On Monday May 22, [EMAIL PROTECTED] wrote: How will the raid5 resize in 2.6.17 be different from raidreconf? It is done (mostly) in the kernel while the array is active, rather than completely in user-space while the array is off-line. Will it be less risky to grow an array that way? It should be. In particular it will survive an unexpected reboot (as long as you don't lose any drives at the same time), which I don't think raidreconf would. Testing results so far are quite positive. Will it be possible to migrate raid5 to raid6? Eventually, but no time frame yet. (And while talking of that: can I add for example two disks and grow *and* migrate to raid6 in one sweep or will I have to go raid6 and then add more disks?) Adding two disks would be the preferred way to do it. Adding only one disk and going to raid6 is problematic because the reshape process will be over-writing live data the whole time, making crash protection quite expensive. By contrast, when you are expanding the size of the array, after the first few stripes you are writing to an area of the drives where there is no live data. NeilBrown
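For reference, the grow-by-two operation under discussion looks roughly like this with mdadm (device names and device counts are hypothetical; it assumes a 2.6.17 kernel and a correspondingly recent mdadm):

```
mdadm /dev/md0 --add /dev/sde1          # add both new disks as spares
mdadm /dev/md0 --add /dev/sdf1
mdadm --grow /dev/md0 --raid-devices=6  # reshape, e.g. from 4 to 6 devices
cat /proc/mdstat                        # reshape progress appears here
# once the reshape finishes, enlarge the filesystem,
# e.g. resize2fs /dev/md0 for ext2/ext3
```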
Re: mdadm: bitmap size
(Please don't reply off-list. If the conversation starts on the list, please leave it there unless there is a VERY GOOD reason). On Monday May 22, [EMAIL PROTECTED] wrote: On 5/19/06, Neil Brown [EMAIL PROTECTED] wrote: On Friday May 19, [EMAIL PROTECTED] wrote: As I can see the bitmap does exactly this, but the default bitmap is too small! Why do you say that? Are you using an internal bitmap, or a bitmap in a separate file? I was using a bitmap in a separate file. Why did I say that the bitmap is too small? I try to explain: the raid device is a raid1, created on /dev/md0 through mdadm, and the bitmap uses a 4 kb chunk-size on an external file (in the root directory) setfaulty /dev/md0 /dev/nda raidhotremove /dev/md0 /dev/nda cd /mnt/md0 wget http://... (240 kb file...) raidhotadd /dev/md0 /dev/nda And now dmesg said that the bitmap was obsolete (01 or something like that) and that the md driver will force a total recovery. raidhotadd doesn't know anything about bitmaps. If you use 'mdadm /dev/md0 --add /dev/nda' you should find that it works better. I recommend getting rid of setfaulty / raidhotadd / raidhotremove etc and just using mdadm. NeilBrown A recovery of 40 gb for a 240 kb file is a little bit expensive.. :-) Unfortunately I cannot give you the exact output because the server is down now. :-| The only way to control the size of the bitmap is to change the bitmap chunk size. Okay thanks. Warning: if you have more than 1 million bits in the bitmap, the kernel may fail in memory allocation and may not be able to assemble your array. Thank you for your help. Stefano.
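The bitmap size Neil refers to is simple arithmetic: one bit per bitmap chunk. A quick sketch using the numbers from this thread (a 40 GB array with 4 KB chunks; md's exact rounding may differ slightly):

```shell
array_kb=$((40 * 1024 * 1024))    # 40 GB array, in KB
chunk_kb=4                        # 4 KB bitmap chunk size
echo $((array_kb / chunk_kb))     # bits in the bitmap
```

This prints 10485760, well over the ~1 million bit mark warned about above; a 64 KB chunk would need only 655360 bits.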
Re: recovery speed on many-disk RAID 1
On Saturday May 20, jeff@jab.org wrote: interrupted by seeks from read requests on the RAID. But that's not really necessary; imagine if it instead went something like: sdb1 -> sdg1 # High bandwidth copy operation limited by drive speed sd[cde]1 # These guys handle read requests Yeh... that could be done. There is even a comment in the 2.4 code: /* If reconstructing, and 1 working disc, * could dedicate one to rebuild and others to * service read requests .. */ though that seems to have disappeared from 2.6. Given that - high rebuild speed could swamp some bus and so interfere with regular IO and - it is fairly easy to ask for the speed to be higher I'm not sure that it is necessary. Might be interesting though. I've added it to my todo list, but if someone else would like to have a try... NeilBrown
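There is no dedicated-copy-disk mode, but the "fairly easy to ask for the speed to be higher" knobs Neil alludes to are the md sysctls; a sketch (the 50000 figure is an arbitrary example):

```
cat /proc/sys/dev/raid/speed_limit_min   # guaranteed rebuild rate, KB/s
cat /proc/sys/dev/raid/speed_limit_max   # rebuild ceiling, KB/s
# favour the rebuild over competing reads:
echo 50000 > /proc/sys/dev/raid/speed_limit_min
```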
Re: Raid5 resize testing opportunity
On Thursday May 18, [EMAIL PROTECTED] wrote: Hi Neil, The raid5 reshape seems to have gone smoothly (nice job!), though it took 11 hours! Are there any pieces of info you would like about the array? Excellent! No, no other information would be useful. This is the first real-life example I know of where 2 devices were added at once. That should be no more difficult, but it is good to know that it works in fact as well as in theory. Thanks, NeilBrown
Re: [RFC][PATCH] MD RAID Acceleration: Move stripe operations outside the spin lock
On Tuesday May 16, [EMAIL PROTECTED] wrote: This is the second revision of the effort to enable offload of MD's xor and copy operations to dedicated hardware resources. Please comment on the approach of this patch and whether it will be suitable to expand this to the other areas in handle_stripe where such calculations are performed. Implementation of the xor offload API is a work in progress; the intent is to reuse I/OAT. Overview: Neil, as you recommended, this implementation flags the necessary operations on a stripe and then queues the execution to a separate thread (similar to how disk cycles are handled). See the comments added to raid5.h for more details. Hi. This certainly looks like it is heading in the right direction - thanks. I have a couple of questions, which will probably lead to more. You obviously need some state-machine functionality to oversee the progression like xor -> drain -> xor (for RMW) or clear -> copy -> xor (for RCW). You have encoded the different states on a per-block basis (storing it in sh->dev[x].flags) rather than on a per-strip basis (and so encoding it in sh->state). What was the reason for this choice? The only reason I can think of is to allow more parallelism: different blocks within a strip can be in different states. I cannot see any value in this as the 'xor' operation will work across all (active) blocks in the strip and so you will have to synchronise all the blocks on that operation. I feel the code would be simpler if the state was in the strip rather than the block. The wait_for_block_op queue and its usage seems odd to me. handle_stripe should never have to wait for anything. When a block_op is started, the sh->count should be incremented, and then decremented when the block-ops have finished. Only then will handle_stripe get to look at the stripe_head again. So waiting shouldn't be needed.
Your GFP_KERNEL kmalloc is a definite no-no which can lead to deadlocks (it might block while trying to write data out through the same raid array). At least it should be GFP_NOIO. However a better solution would be to create and use a mempool - they are designed for exactly this sort of usage. However I'm not sure if even that is needed. Can a stripe_head have more than one active block_ops task happening? If not, the 'stripe_work' should be embedded in the 'stripe_head'. There will probably be more questions once these are answered, but the code is definitely a good start. Thanks, NeilBrown (*) Since reading the DDF documentation again, I've realised that using the word 'stripe' both for a chunk-wide stripe and a block-wide stripe is very confusing. So I've started using 'strip' for a block-wide stripe. Thus a 'stripe_head' should really be a 'strip_head'. I hope this doesn't end up being even more confusing :-)
Re: raid5 hang on get_active_stripe
On Wednesday May 17, [EMAIL PROTECTED] wrote: let me know if you want the task dump output from this one too. No thanks - I doubt it will contain anything helpful. I'll try to put some serious time into this next week - as soon as I get mdadm 2.5 out. NeilBrown
Re: Raid5 resize testing opportunity
On Wednesday May 17, [EMAIL PROTECTED] wrote: Hi all, For Neil's benefit (:-) I'm about to test the raid5 resize code by trying to grow our 2TB raid5 from 8 to 10 devices. Currently, I'm running a 2.6.16-rc4-mm2 kernel. Is this current enough to support the resize? (I suspect not.) If I upgrade to 2.6.17-rc4-mm1, would that do it, or is it even in stable 2.6.16.16? Thanks! You need at least 2.6.17-rc1. I would suggest the latest -rc: 2.6.17-rc4 Don't use -mm. It could have new bugs, and you don't want them to trouble you when you are growing your array. I look forward to the results! NeilBrown
Re: softraid and multiple distros
On Monday May 15, [EMAIL PROTECTED] wrote: I always use entire disks if I want the entire disks raided (sounds obvious, doesn't it...) I only use partitions when I want to vary the raid layout for different parts of the disk (e.g. mirrored root, mirrored swap, raid6 for the rest). But that certainly doesn't mean it is wrong to use partitions for the whole disk. The idea behind this is: let's say a disk fails, and you get a replacement, but it has a different geometry or a few blocks less - won't work. Even the same disk model might vary after a while. So I made 0xfd partitions of the size (whole disk minus a few megs). An alternative is to use the --size option of mdadm to make the array slightly smaller than the smallest drive. So you don't need partitions for this (though it is perfectly alright to use them if you like). You can tell mdadm where to look. If you want to be sure that it won't look at entire drives, only partitions, then a line like DEVICE /dev/[hs]d*[0-1] in /etc/mdadm.conf might be what you want. However as you should be listing the uuids in /etc/mdadm.conf, any Umm... yeah, should I? What else would you use to uniquely identify the arrays? Not device names I hope. superblock with an unknown uuid will easily be ignored. If you are relying on 0xfd autodetect to assemble your arrays, then obviously the entire-disk superblock will be ignored (because they won't be in the right place in any partition). So mdadm --assemble --scan is fine for my scenario even with those orphaned superblocks. I cannot say for sure without seeing your mdadm.conf, but probably. Should get me some sedatives for the day when this all explodes :P Just make sure it happens on your day off, then someone else will need the sedatives :-) NeilBrown
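The --size alternative, as a hypothetical whole-disk example (--size is per-device and in kilobytes; the figure below is illustrative, chosen a little under a nominal 250 GB disk):

```
# Leave some slack so a marginally smaller replacement disk still fits.
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
      --size=243000000 /dev/sda /dev/sdb
```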
Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.
On Monday May 15, [EMAIL PROTECTED] wrote: Ho hum, I give up. Thank you :-) I found our debate very valuable - it helped me clarify my understanding of some areas of linux filesystem semantics (and as I am trying to write a filesystem in my 'spare time', that will turn out to be very useful). It also revealed some problems in the code! I don't think, in practice, this code fixes any demonstrable bug though. I thought it was our job to kill the bugs *before* they were demonstrated :-) I'm still convinced that the previous code could lead to deadlocks or worse under sufficiently sustained high memory pressure and fs activity. I'll send a patch shortly that fixes the known problems and awkwardnesses in the new code. Thanks again, NeilBrown
Re: raid0 over 2 h/w raid5's OOPSing at mkfs
On Monday May 15, [EMAIL PROTECTED] wrote: I've got a x86_64 system with 2 3ware 9550SX-12s, each set up as a raid5 w/ a hot spare. Over that, I do a software raid0 stripe via: mdadm -C /dev/md0 -c 512 -l 0 -n 2 /dev/sd[bc]1 Whenever I try to format md0 (I've tried both mke2fs and mkfs.xfs), the system OOPSes. I'm running centos-4 with the default kernel, but I've upgraded the 3ware driver/firmware to the most recent versions. Based on the OOPS I'll paste below, who should I be blaming for this crash? Any ideas on how to fix it? Thanks. Try this. http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=1eb29128c644581fa51f822545921394ad4f719f Raid0 had troubles on 64bit machines back in the 2.6.9 days. NeilBrown. Unable to handle kernel NULL pointer dereference at 0027 RIP: a0194ab6{:raid0:raid0_make_request+448} PML4 3e913067 PGD 7b652067 PMD 0 Oops: [1] SMP CPU 1 Modules linked in: raid0 md5 ipv6 parport_pc lp parport i2c_dev i2c_core sunrpc ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables dm_mirror dm_mod button battery ac ohci_hcd hw_random tg3 floppy ext3 jbd 3w_9xxx(U) 3w_ sd_mod scsi_mod Pid: 2955, comm: mke2fs Not tainted 2.6.9-34.ELsmp
Re: [PATCH 001 of 3] md: Change md/bitmap file handling to use bmap to file blocks-fix
On Monday May 15, [EMAIL PROTECTED] wrote: NeilBrown [EMAIL PROTECTED] wrote: + do_sync_file_range(file, 0, LLONG_MAX, + SYNC_FILE_RANGE_WRITE | + SYNC_FILE_RANGE_WAIT_AFTER); That needs a SYNC_FILE_RANGE_WAIT_BEFORE too. Otherwise any dirty, under-writeback pages will remain dirty. I'll make that change. Ahhh.. yes... that makes sense! Thanks :-) NeilBrown
Re: recovery from mkswap on mounted raid1 ext3 filesystem?
On Monday May 15, [EMAIL PROTECTED] wrote: I accidentally ran mkswap on an md raid1 device which had a mounted ext3 filesystem on it. I also did a swapon, but I don't think anything was written to swap before I noticed the mistake. How much of the partition is toast, and is it something e2fsck might fix? oh dear. I think (and an strace seems to confirm) that mkswap only writes in the first 4k of the device. This will have held the superblock, but there is always at least one backup - I think it is at block 8193. But 'fsck -n' should help you out, though you might need 'fsck.ext2 -n' as 'fsck' might think it is a swap device... Of course, if the filesystem is mounted, then unmounting the filesystem should write the superblock, which might fix any corruption you caused.. I'm very surprised that swapon worked if the fs was mounted - there should be mutual exclusion there. Moreover, shouldn't the mkswap command check whether a device is in use before overwriting it? Yes, but before 2.6 this was very hard to do (in 2.6 it is easy, just open with O_EXCL). I doubt mkswap has seen much maintenance lately - not since Dec 2004 in fact. And the only checks it does is to make sure you aren't running mkswap on /dev/hda or /dev/hdb !!! Adrian: You seem to be the MAINTAINER of mkswap.. any chance of opening with O_EXCL as well as O_RDWR? That would make it a lot safer. NeilBrown
Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.
On Saturday May 13, [EMAIL PROTECTED] wrote: Paul Clements [EMAIL PROTECTED] wrote: Andrew Morton wrote: The loss of pagecache coherency seems sad. I assume there's never a requirement for userspace to read this file. Actually, there is. mdadm reads the bitmap file, so that would be broken. Also, it's just useful for a user to be able to read the bitmap (od -x, or similar) to figure out approximately how much more he's got to resync to get an array in-sync. Other than reading the bitmap file, I don't know of any way to determine that. Read it with O_DIRECT :( Which is exactly what the next release of mdadm does. As the patch comment said: : With this approach the pagecache may contain data which is inconsistent with : what is on disk. To alleviate the problems this can cause, md invalidates : the pagecache when releasing the file. If the file is to be examined : while the array is active (a non-critical but occasionally useful function), : O_DIRECT io must be used. And new version of mdadm will have support for this.
Re: softraid and multiple distros
On Sunday May 14, [EMAIL PROTECTED] wrote: On Sunday, 14 May 2006 at 16:50, you wrote: What do I need to do when I want to install a different distro on the machine with a raid5 array? Which files do I need? /etc/mdadm.conf? /etc/raidtab? both? MD doesn't need any files to function, since it can auto-assemble arrays based on their superblocks (for partition-type 0xfd). I see. Now an issue arises that someone else here mentioned: My first attempt was to use the entire disks, then I was hinted that this approach wasn't too hot so I made partitions. I always use entire disks if I want the entire disks raided (sounds obvious, doesn't it...) I only use partitions when I want to vary the raid layout for different parts of the disk (e.g. mirrored root, mirrored swap, raid6 for the rest). But that certainly doesn't mean it is wrong to use partitions for the whole disk. Now the devices all have two superblocks, the ones left from the first try, which are now kinda orphaned, and those now active. Can I trust mdadm to handle this properly on its own? You can tell mdadm where to look. If you want to be sure that it won't look at entire drives, only partitions, then a line like DEVICE /dev/[hs]d*[0-1] in /etc/mdadm.conf might be what you want. However as you should be listing the uuids in /etc/mdadm.conf, any superblock with an unknown uuid will easily be ignored. If you are relying on 0xfd autodetect to assemble your arrays, then obviously the entire-disk superblock will be ignored (because they won't be in the right place in any partition). NeilBrown
Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.
(replying to bits of several emails) On Friday May 12, [EMAIL PROTECTED] wrote: Neil Brown [EMAIL PROTECTED] wrote: However some IO requests cannot complete until the filesystem I/O completes, so we need to be sure that the filesystem I/O won't block waiting for memory, or fail with -ENOMEM. That sounds like a complex deadlock. Suppose the bitmap writeout requires some writeback to happen before it can get enough memory to proceed. Exactly. Bitmap writeout must not block on fs-writeback. It can block on device writeout (e.g. queue congestion or mempool exhaustion) but it must complete without waiting in the fs layer or above, and without the possibility of any error other than -EIO. Otherwise we can get deadlocked writing to the raid array. submit_bh (or submit_bio) is certain to be safe in this respect. I'm not so confident about anything at the fs level. Read it with O_DIRECT :( Which is exactly what the next release of mdadm does. As the patch comment said: : With this approach the pagecache may contain data which is inconsistent with : what is on disk. To alleviate the problems this can cause, md invalidates : the pagecache when releasing the file. If the file is to be examined : while the array is active (a non-critical but occasionally useful function), : O_DIRECT io must be used. And new version of mdadm will have support for this. Which doesn't help `od -x' and is going to cause older mdadm userspace to mysteriously and subtly fail. Or does the user-kernel interface have versioning which will prevent this? As I said: 'non-critical'. Nothing important breaks if reading the file gets old data. Reading the file while the array is active is purely a curiosity thing. There is information in /proc/mdstat which gives a fairly coarse view of the same data. It could lead to some confusion, but if a compliant mdadm comes out before this gets into a mainline kernel, I doubt there will be any significant issue.
Reading/writing the bitmap needs to work reliably when the array is not active, but suitable sync/invalidate calls in the kernel should make that work perfectly. I know this is technically a regression in user-space interface, and you don't like such regression with good reason Maybe I could call invalidate_inode_pages every few seconds or whenever the atime changes, just to be on the safe side :-) I have a patch which did that, but decided that the possibility of kmalloc failure at awkward times would make that not suitable. submit_bh() can and will allocate memory, although most decent device drivers should be OK. submit_bh (like all decent device drivers) uses a mempool for memory allocation so we can be sure that the delay in getting memory is bounded by the delay for a few IO requests to complete, and we can be sure the allocation won't fail. This is perfectly fine. I don't think a_ops really provides an interface that I can use, partly because, as I said in a previous email, it isn't really a public interface to a filesystem. It's publicer than bmap+submit_bh! I don't know how you can say that. bmap is so public that it is exported to userspace through an IOCTL and is used by lilo (admittedly only for reading, not writing). More significantly it is used by swapfile which is a completely independent subsystem from the filesystem. Contrast this with a_ops. The primary users of a_ops are routines like generic_file_{read,write} and friends. These are tools - library routines - that are used by filesystems to implement their 'file_operations' which are the real public interface. As far as these uses go, it is not a public interface. Where a filesystem doesn't use some library routines, it does not need to implement the matching functionality in the a_op interface. The other main user is the 'VM' which might try to flush out or invalidate pages.
However the VM isn't using this interface to interact with files, but only to interact with pages, and it doesn't much care what is done with the pages providing they get clean, or get released, or whatever. The way I perceive Linux design/development, active usage is far more significant than documented design. If some feature of an interface isn't being actively used - by in-kernel code - then you cannot be sure that feature will be uniformly implemented, or that it won't change subtly next week. So when I went looking for the best way to get md/bitmap to write to a file, I didn't just look at the interface specs (which are pretty poorly documented anyway), I looked at existing code. I can find 3 different parts of the kernel that write to a file. They are swap-file, loop, and nfsd. nfsd uses vfs_read/vfs_write which have too many failure/delay modes for me to safely use. loop uses prepare_write/commit_write (if available) or f_op->write (not vfs_write - I wonder why) which is not much better than what nfsd uses. And as far as I can tell
Re: [PATCH 002 of 8] md/bitmap: Remove bitmap writeback daemon.
On Friday May 12, [EMAIL PROTECTED] wrote: NeilBrown [EMAIL PROTECTED] wrote: ./drivers/md/bitmap.c | 115 ++ hmm. I hope we're not doing any of that filesystem I/O within the context of submit_bio() or kblockd or anything like that. Looks OK from a quick scan. No. We do all the I/O from the context of the per-array thread. However some IO requests cannot complete until the filesystem I/O completes, so we need to be sure that the filesystem I/O won't block waiting for memory, or fail with -ENOMEM. a_ops->commit_write() already ran set_page_dirty(), so you don't need that in there. Is that documented somewhere? But yes, that seems to be right. Thanks. I assume this always works in units of a complete page? It's strange to do prepare_write() followed immediately by commit_write(). Normally prepare_write() will do some prereading, but it's smart enough to not do that if the caller is preparing to write the whole page. Yes, it is strange. That was one of the things that made me want to review this code and figure out how to do it properly. As far as I can see, much of 'address_space' is really an internal interface to support routines used by the filesystem. A filesystem may choose to use address spaces, and has a fair degree of freedom when it comes to which bits to make use of and exactly what they mean. About the only thing that *has* to be supported is ->writepages -- which has a fair degree of latitude in exactly what it does -- and ->writepage -- which can only be called after locking a page and rechecking the ->mapping. bitmap.c is currently trying to do something very different. It uses ->readpage to get pages into the page cache (even though some address spaces don't define ->readpage) and then holds onto those pages without holding the page lock, and then calls ->writepage to flush them out from time to time. Before calling writepage it gets the page lock, but doesn't re-check that ->mapping is correct (there is nothing much it can do if it isn't correct..).
I noticed this is particularly a problem with tmpfs. When you call
writepage on a tmpfs page, the page is swizzled into the swap cache,
and ->mapping becomes NULL - not the behaviour that bitmap is
expecting. Now I agree that tmpfs is an unusual case, and that
storing a bitmap in tmpfs doesn't make a lot of sense (though it can
make some...) but the point is that if a filesystem is allowed to
move pages around like that, then bitmap cannot hold on to pages in
the page cache like it wants to. It simply isn't a well defined
thing to do.

> We normally use PAGE_CACHE_SIZE for these things, not PAGE_SIZE.
> Same diff.

Yeah, why is that? Why have two names for exactly the same value?
How does a poor developer know when to use one and when the other?
More particularly, how does one remember? I would argue that the
"same diff" should be no difference - not even in spelling.

> If you have a page and you want to write the whole thing out then
> there's really no need to run prepare_write or commit_write at all.
> Just initialise the whole page, run set_page_dirty() then
> write_one_page().

I see that now. But only after locking the page, and rechecking that
->mapping is correct, and if it isn't... well, more work is involved
than bitmap is in a position to do.

> Perhaps it should check that the backing filesystem actually
> implements commit_write(), prepare_write(), readpage(), etc. Some
> might not, and the user will get taught not to do that via an oops.

Might help, but as I think you've gathered, I really want a whole
different approach to writing to the file. One that I can justify as
being a correct use of interfaces, and also that I can be certain
will not block or fail on a kmalloc or similar. Hence the bmap thing
later.

Thanks for your feedback.

NeilBrown
Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.
On Friday May 12, [EMAIL PROTECTED] wrote:
> NeilBrown [EMAIL PROTECTED] wrote:
> > If md is asked to store a bitmap in a file, it tries to hold onto
> > the page cache pages for that file, manipulate them directly, and
> > call a cocktail of operations to write the file out. I don't
> > believe this is a supportable approach.
>
> erk. I think it's better than...
>
> > This patch changes the approach to use the same approach as swap
> > files. i.e. bmap is used to enumerate all the block addresses of
> > parts of the file and we write directly to those blocks of the
> > device.
>
> That's going in at a much lower level. Even swapfiles don't assume
> buffer_heads.

I'm not assuming buffer_heads. I'm creating buffer heads and using
them for my own purposes. These are my pages and my buffer heads.
None of them belong to the filesystem. The buffer_heads are simply a
convenient data-structure to record the several block addresses for
each page. I could have equally created an array storing all the
addresses, and built the required bios by hand at write time. But
buffer_heads did most of the work for me, so I used them.

Yes, it is a lower level, but
 1/ I am certain that there will be no kmalloc problems, and
 2/ because it is exactly the level used by swapfile, I know that it
    is sufficiently 'well defined' that no-one is going to break it.

> When playing with bmap() one needs to be careful that nobody has
> truncated the file on you, else you end up writing to disk blocks
> which aren't part of the file any more.

Well, we currently play games with i_write_count to ensure that
no-one else has the file open for write. And if no-one else can get
write access, then it cannot be truncated. I did think about adding
the S_SWAPFILE flag, but decided to leave that for a separate patch
and review different approaches to preventing write access first
(e.g. can I use a lease?).

> All this (and a set_fs(KERNEL_DS), ug) looks like a step backwards
> to me. Operating at the pagecache a_ops level looked better, and
> more filesystem-independent.
If you really want filesystem independence, you need to use vfs_read
and vfs_write to read/write the file. I have a patch which did that,
but decided that the possibility of kmalloc failure at awkward times
would make that not suitable. So I now use vfs_read to read in the
file (just like knfsd) and bmap/submit_bh to write out the file (just
like swapfile). I don't think a_ops really provides an interface
that I can use, partly because, as I said in a previous email, it
isn't really a public interface to a filesystem.

> I haven't looked at this patch at all closely yet. Do I really need
> to?

I assume you are asking that because you hope I will retract the
patch. While I'm always open to being educated, I am not yet
convinced that there is any better way, or even any other usable way,
to do what needs to be done, so I am not inclined to retract the
patch.

I'd like to say that you don't need to read it because it is perfect,
but unfortunately history suggests that is unlikely to be true.
Whether you look more closely is of course up to you, but I'm
convinced that patch is in the right direction, and your review and
comments are always very valuable.

NeilBrown
Re: RAID5 - 4 disk reboot trouble.
On Thursday May 11, [EMAIL PROTECTED] wrote:
> Hi, I'm running a raid5 system, and when I reboot my raid seems to
> be failing. (One disk is set to spare and the other disks seem to be
> okay in the details page, but we get an INPUT/OUTPUT error when
> trying to mount it.) We cannot seem to find the problem in this
> setup.
> ...
>     State : clean, degraded, recovering
                                ^^^^^^^^^^

Do you ever let the recovery actually finish? Until you do, you
don't have real redundancy.

NeilBrown
Re: hardware raid 5 and software raid 0 stripe broke.
On Thursday May 11, [EMAIL PROTECTED] wrote:
> We have a Linux box running redhat 7.2. We have two hardware
> controllers in it with about 500 gigs each. They're raid 5. We were
> using a software raid to combine them all together. 1 hard drive
> went down so we replaced it and now the system won't boot. We have
> used a Knoppix boot cd to get into a linux system. We can see
> /dev/sda1 and /dev/sdb1. However, /dev/md0 cannot be accessed. Is
> there a safe way to create the software raid from the cd to see if
> maybe mdadm on the original system got corrupt? any help would be
> greatly appreciated.

When booted off knoppix, what does

  mdadm -E /dev/sda1
  mdadm -E /dev/sdb1

produce? How about

  mdadm -A /dev/md0 /dev/sda1 /dev/sdb1

NeilBrown
Re: [PATCH 009 of 11] md: Support stripe/offset mode in raid10
On Wednesday May 3, [EMAIL PROTECTED] wrote:
> Neil Brown wrote:
> > On Tuesday May 2, [EMAIL PROTECTED] wrote:
> > > NeilBrown wrote:
> > > > The industry standard DDF format allows for a stripe/offset
> > > > layout where data is duplicated on different stripes. e.g.
> > > >
> > > >   A  B  C  D
> > > >   D  A  B  C
> > > >   E  F  G  H
> > > >   H  E  F  G
> > > >
> > > > (columns are drives, rows are stripes, LETTERS are chunks of
> > > > data).
> > >
> > > Presumably, this is the case for --layout=f2 ?
> >
> > Almost. mdadm doesn't support this layout yet. 'f2' is a similar
> > layout, but the offset stripes are a lot further down the drives.
> > It will possibly be called 'o2' or 'offset2'.
> >
> > > If so, would --layout=f4 result in a 4-mirror/striped array?
> >
> > o4 on a 4 drive array would be
> >
> >   A  B  C  D
> >   D  A  B  C
> >   C  D  A  B
> >   B  C  D  A
> >   E  F  G  H
>
> Yes, so would this give us 4 physically duplicate mirrors?

It would give 4 devices each containing the same data, but in a
different layout - much as the picture shows.

> If not, would it be possible to add a far-offset mode to yield such
> a layout?

Exactly what sort of layout do you want?

> > > Also, would it be possible to have a staged write-back mechanism
> > > across multiple stripes?
> >
> > What exactly would that mean?
>
> Write the first stripe, then write subsequent duplicate stripes
> based on idle with a max delay for each delayed stripe.
>
> > And what would be the advantage?
>
> Faster burst writes, probably.

I still don't get what you are after. You always need to wait for
writes of all copies to complete before acknowledging the write to
the filesystem, otherwise you risk corruption if there is a crash and
a device failure. So inserting any delays (other than the per-device
plugging which helps to group adjacent requests) isn't going to make
things go faster.

NeilBrown
Re: strange RAID5 problem
On Monday May 8, [EMAIL PROTECTED] wrote:
> Good evening. I am having a bit of a problem with a largish RAID5
> set. Now it is looking more and more like I am about to lose all the
> data on it, so I am asking (begging?) to see if anyone can help me
> sort this out.

Very thorough description, but you omitted the 'dmesg' output
corresponding to:

> [EMAIL PROTECTED] ~]# mdadm --assemble /dev/md3 /dev/sdq1 /dev/sdr1
>   /dev/sds1 /dev/sdt1 /dev/sdu1 /dev/sdv1 /dev/sdx1 /dev/sdy1
>   /dev/sdz1 /dev/sdaa1 /dev/sdab1 /dev/sdac1 /dev/sdad1 /dev/sdae1
>   /dev/sdaf1
> mdadm: failed to RUN_ARRAY /dev/md3: Invalid argument

Also, you don't seem to have tried '--force' with '--assemble'. It
might help.

NeilBrown
Re: Two-disk RAID5?
On Friday May 5, [EMAIL PROTECTED] wrote:
> Sorry, I couldn't find a diplomatic way to say "you're completely
> wrong".

We don't necessarily expect a diplomatic way, but a clear and
intelligent one would be helpful.

> In two-disk RAID5 which is it?
> 1) The 'parity bit' is the same as the datum.

Yes.

> 2) The parity bit is the complement of the datum.

No.

> 3) It doesn't work at a bit-wise level.

No.

> Many of us feel that RAID5 looks like:
>
>   parity = data[0];
>   for (i = 1; i < ndisks; ++i)
>       parity ^= data[i];

Actually in linux/md/raid5 it is more like

  parity = 0;
  for (i = 0; i < ndisks; ++i)
      parity ^= data[i];

which has exactly the same result. (Well, it should really be
ndatadisks, but I think we both knew that was what you meant.)

> which implies (1). It could easily be (2) but merely saying "it's
> not data, it's parity" doesn't clarify matters a great deal.
>
> But I'm pleased my question has stirred up such controversy!

A bit of controversy is always a nice way to pass those long winter
nights... only it isn't winter anywhere at the moment :-)

NeilBrown
Re: [PATCH 009 of 11] md: Support stripe/offset mode in raid10
On Tuesday May 2, [EMAIL PROTECTED] wrote:
> NeilBrown wrote:
> > The industry standard DDF format allows for a stripe/offset layout
> > where data is duplicated on different stripes. e.g.
> >
> >   A  B  C  D
> >   D  A  B  C
> >   E  F  G  H
> >   H  E  F  G
> >
> > (columns are drives, rows are stripes, LETTERS are chunks of data).
>
> Presumably, this is the case for --layout=f2 ?

Almost. mdadm doesn't support this layout yet. 'f2' is a similar
layout, but the offset stripes are a lot further down the drives. It
will possibly be called 'o2' or 'offset2'.

> If so, would --layout=f4 result in a 4-mirror/striped array?

o4 on a 4 drive array would be

  A  B  C  D
  D  A  B  C
  C  D  A  B
  B  C  D  A
  E  F  G  H

> Also, would it be possible to have a staged write-back mechanism
> across multiple stripes?

What exactly would that mean? And what would be the advantage?

NeilBrown
Re: [PATCH 004 of 11] md: Increase the delay before marking metadata clean, and make it configurable.
On Sunday April 30, [EMAIL PROTECTED] wrote:
> NeilBrown [EMAIL PROTECTED] wrote:
> > When an md array has been idle (no writes) for 20msecs it is
> > marked as 'clean'. This delay turns out to be too short for some
> > real workloads. So increase it to 200msec (the time to update the
> > metadata should be a tiny fraction of that) and make it
> > sysfs-configurable.
> ...
> > +  safe_mode_delay
> > +    When an md array has seen no write requests for a certain
> > +    period of time, it will be marked as 'clean'. When another
> > +    write request arrives, the array is marked as 'dirty' before
> > +    the write commences. This is known as 'safe_mode'.
> > +    The 'certain period' is controlled by this file which stores
> > +    the period as a number of seconds. The default is 200msec
> > +    (0.200). Writing a value of 0 disables safemode.
>
> Why not make the units milliseconds? Rename this to
> safe_mode_delay_msecs to remove any doubt.

Because umpteen years ago when I was adding thread-usage statistics
to /proc/net/rpc/nfsd I used milliseconds, and Linus asked me to make
it seconds - a much more obvious unit. See the email below. It seems
very sensible to me.

> ...
> > +	msec = simple_strtoul(buf, &e, 10);
> > +	if (e == buf || (*e && *e != '\n'))
> > +		return -EINVAL;
> > +	msec = (msec * 1000) / scale;
> > +	if (msec == 0)
> > +		mddev->safemode_delay = 0;
> > +	else {
> > +		mddev->safemode_delay = (msec*HZ)/1000;
> > +		if (mddev->safemode_delay == 0)
> > +			mddev->safemode_delay = 1;
> > +	}
> > +	return len;
>
> And most of that goes away.

Maybe it could go in a library :-?

NeilBrown

From: Linus Torvalds [EMAIL PROTECTED]
To: Neil Brown [EMAIL PROTECTED]
cc: [EMAIL PROTECTED]
Subject: Re: PATCH knfsd - stats tidy up.
Date: Tue, 18 Jul 2000 12:21:12 -0700 (PDT)
Content-Type: TEXT/PLAIN; charset=US-ASCII

On Tue, 18 Jul 2000, Neil Brown wrote:
> The following patch converts jiffies to milliseconds for output, and
> also makes the number wrap predictably at 1,000,000 seconds
> (approximately one fortnight).
If no programs depend on the format, I actually prefer format changes
like this to be of the obvious kind. One such obvious kind is the
format "0.001" which obviously means 0.001 seconds.

And yes, I'm _really_ sorry that a lot of the old /proc files contain
jiffies. Lazy. Ugly. Bad. Much of it my bad.

Doing "0.001" doesn't mean that you have to use floating point, in
fact you've done most of the work already in your ms patch, just
splitting things out a bit works well:

	/* gcc knows to combine / and % - generate one divl */
	unsigned int sec = time / HZ, msec = time % HZ;
	msec = (msec * 1000) / HZ;
	sprintf(buf, "%d.%03d", sec, msec);

(It's basically the same thing you already do, except it doesn't
re-combine the seconds and milliseconds but just prints them out
separately. And it has the advantage that if you want to change it
to microseconds some day, you can do so very trivially without
breaking the format. Plus it's readable as hell.)

		Linus
Re: try to write back redundant data before failing disk in raid5 setup
On Monday May 1, [EMAIL PROTECTED] wrote:
> Hello,
> Suppose a read action on a disk which is a member of a raid5 (or
> raid1 or any other raid where there's data redundancy) fails. What
> happens next is that the entire disk is marked as failed and a raid5
> rebuild is initiated. However, that seems like overkill to me. If
> only one sector on one disk failed, that sector could be
> re-calculated (using parity calculations) AND written back to the
> original disk (i.e. the disk with the bad sector). Any modern disk
> will do sector remapping, so the bad sector will simply be replaced
> by a good one and there's no need to fail the entire disk.

... and any modern linux kernel (since about 2.6.15) will do exactly
what you suggest.

> The reason I bring this up is that I think raid5 rebuilds are
> 'scary' things. Suppose a raid5 rebuild is initiated while other
> members of the raid5 set have bad - but yet undetected - sectors
> scattered around the disc (Current_Pending_Sector in smartd speak).
> Now this raid5 rebuild would fail, losing the entire raid5 set.
> While each and every bit in the raid5 set might still be
> salvageable! (I've seen this happen on 5x250Gb raid5 sets.)

For this reason it is good to regularly do a background read check of
the entire array:

  echo check > /sys/block/mdX/md/sync_action

Any read errors will trigger an attempt to overwrite the bad block
with good data. Do this regularly, *before* any drive really fails.

NeilBrown
Re: raid5 resizing
On Monday May 1, [EMAIL PROTECTED] wrote:
> Hey folks. There's no point in using LVM on a raid5 setup if all you
> intend to do in the future is resize the filesystem on it, is there?
> The new raid5 resizing code takes care of providing the extra space,
> and then as long as the (say) ext3 filesystem was created with
> resize_inode, all should be sweet. Right? Or have I missed something
> crucial here? :)

You are correct. md/raid5 makes the extra space available all by
itself.

NeilBrown
Re: [PATCH 003 of 5] md: Change ENOTSUPP to EOPNOTSUPP
On Friday April 28, [EMAIL PROTECTED] wrote:
> NeilBrown wrote:
> > Change ENOTSUPP to EOPNOTSUPP
> > Because that is what you get if a BIO_RW_BARRIER isn't supported!
>
> Dumb question, hope someone can answer it :). Does this mean that
> any version of MD up till now won't know that SATA disks do not
> support barriers, and therefore won't flush SATA disks, and
> therefore I need to disable the disks' write cache if I want to be
> 100% sure that raid arrays are not corrupted? Or am I way off :-).

The effect of this bug is almost unnoticeable.

In almost all cases, md will detect that a drive doesn't support
barriers when writing out the superblock - this is completely
separate code and is correct. Thus md/raid1 will reject any barrier
requests coming from the filesystem, will never pass them down, and
will not make a wrong decision because of this bug.

The only cases where this bug could cause a problem are:
 1/ when the first write is a barrier write. It is possible that
    reiserfs does this in some cases. However only this write will
    be at risk.
 2/ if a device changes its behaviour from accepting barriers to not
    accepting barriers (which is very uncommon).

As md will be rejecting barrier requests, the filesystem will know
not to trust them and should use other techniques such as waiting for
dependent requests to complete, and calling blkdev_issue_flush where
appropriate. Whether filesystems actually do this, I am less certain.

NeilBrown
RE: Two-disk RAID5?
On Wednesday April 26, [EMAIL PROTECTED] wrote:
> I suspect I should have just kept out of this, and waited for
> someone like Neil to answer authoritatively. So... Neil, what's the
> right answer to Tuomas's 2 disk RAID5 question? :)

... and a deep resounding voice from on-high spoke, and in its
infinite wisdom it said "yeh, whatever".

The data layout on a 2-disk raid5 and a 2-disk raid1 is identical (if
you ignore chunksize issues (raid1 doesn't need one) and the
superblock (which isn't part of the data)). Each drive contains
identical data(*).

Write throughput to the r5 would be a bit slower because data is
always copied in memory first, then written. Read throughput would
be largely the same if the r5 chunk size was fairly large, but much
poorer for r5 if the chunksize was small.

Converting a raid1 to a raid5 while offline would be quite
straightforward except for the chunksize issue. If the r1 wasn't a
multiple of the chunksize you chose for r5, then you would lose the
last fraction of a chunk. So if you are planning to do this, set the
size of your r1 to something that is nice and round (e.g. a multiple
of 128k).

Converting a raid1 to a raid5 while online is something I have been
thinking about, but it is not likely to happen any time soon.

I think that answers all the issues.

NeilBrown

(*) The term 'mirror' for raid1 has always bothered me, because a
mirror presents a reflected image, while raid1 copies the data
without any transformation. With a 2-drive raid5, one drive gets the
original data, and the other drive gets the data after it has been
'reflected' through an XOR operation, so maybe a 2-drive raid5 is
really a 'mirrored' pair... except that the data is still the same,
as XOR with 0 produces no change. So, if we made a tiny change to
raid5 and got the xor operation to start with 0xff in every byte,
then the XOR would reflect each byte in a reasonably meaningful way,
and we might actually get a mirrored pair!!!
But I don't think that would provide any real value :-)
Re: Trying to start dirty, degraded RAID6 array
On Thursday April 27, [EMAIL PROTECTED] wrote:
> The short version: I have a 12-disk RAID6 array that has lost a
> device and now whenever I try to start it with:
>   mdadm -Af /dev/md0 /dev/sd[abcdefgijkl]1
> I get:
>   mdadm: failed to RUN_ARRAY /dev/md0: Input/output error
> ...
>   raid6: cannot start dirty degraded array for md0

The '-f' is meant to make this work. However it seems there is a
bug. Could you please test this patch? It isn't exactly the right
fix, but it definitely won't hurt.

Thanks,
NeilBrown

Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./super0.c |    1 +
 1 file changed, 1 insertion(+)

diff ./super0.c~current~ ./super0.c
--- ./super0.c~current~	2006-03-28 17:10:51.0 +1100
+++ ./super0.c	2006-04-27 10:03:40.0 +1000
@@ -372,6 +372,7 @@ static int update_super0(struct mdinfo *
 		if (sb->level == 5 || sb->level == 4 || sb->level == 6)
 			/* need to force clean */
 			sb->state |= (1 << MD_SB_CLEAN);
+		rv = 1;
 	}
 	if (strcmp(update, "assemble")==0) {
 		int d = info->disk.number;
Re: linear writes to raid5
On Thursday April 20, [EMAIL PROTECTED] wrote:
> Neil Brown wrote:
> > What is the rationale for your position?
>
> My rationale was that if the md layer receives *write* requests not
> smaller than a full stripe size, it is able to omit reading the data
> to update, and can just calculate new parity from the new data.
> Hence, combining a dozen small write requests coming from a
> filesystem to form a single request >= full stripe size should
> dramatically increase performance.

That makes sense. However in both cases (above and below raid5), the
device receiving the requests is in a better position to know what
size is a good size than the client sending the requests.

That is exactly what the 'plugging' concept is for. When a request
arrives, the device is 'plugged' so that it won't process new
requests, and the request plus any following requests are queued. At
some point the queue is unplugged and the device should be able to
collect related requests to make large requests of an appropriate
size and alignment for the device.

The current suggestion is that plugging isn't quite working right for
raid5. That is certainly possible.

> E.g., when I use dd in O_DIRECT mode (oflag=direct) and experiment
> with different block sizes, write performance increases a lot when
> bs becomes the full stripe size. Of course it decreases again when
> bs is increased a bit further (as md starts reading again, to
> construct parity blocks).

Yes. O_DIRECT is essentially saying "I know what I am doing and I
want to bypass all the smarts and go straight to the device".
O_DIRECT requests should certainly be sized and aligned to match the
device. For non-O_DIRECT it shouldn't matter so much.

NeilBrown
Re: Trying to start dirty, degraded RAID6 array
On Thursday April 27, [EMAIL PROTECTED] wrote:
> Neil Brown wrote:
> > The '-f' is meant to make this work. However it seems there is a
> > bug. Could you please test this patch? It isn't exactly the right
> > fix, but it definitely won't hurt.
>
> Thanks, Neil, I'll give this a go when I get home tonight. Is there
> any way to start an array without kicking off a rebuild?

  echo 1 > /sys/module/md_mod/parameters/start_ro

If you do this, then arrays will be read-only when they are started,
and so will not do a rebuild. The first write request to the array
(e.g. if you mount a filesystem) will cause a switch to read/write,
and any required rebuild will start.

  echo 0 > /sys/module/md_mod/parameters/start_ro

will revert the effect. This requires a reasonably recent kernel.

NeilBrown
Re: [patch 1/2] raid6_end_write_request() spinlock fix
On Tuesday April 25, [EMAIL PROTECTED] wrote:
> Hello,
> Reduce the raid6_end_write_request() spinlock window.

Andrew: please don't include these in -mm. This one and the
corresponding raid5 one are wrong, and I'm not sure yet about the
unplug_device changes.

In this case, the call to md_error, which in turn calls error in
raid6main.c, requires the lock to be held as it contains:

	if (!test_bit(Faulty, &rdev->flags)) {
		mddev->sb_dirty = 1;
		if (test_bit(In_sync, &rdev->flags)) {
			conf->working_disks--;
			mddev->degraded++;
			conf->failed_disks++;
			clear_bit(In_sync, &rdev->flags);
			/*
			 * if recovery was running, make sure it aborts.
			 */
			set_bit(MD_RECOVERY_ERR, &mddev->recovery);
		}
		set_bit(Faulty, &rdev->flags);

which is fairly clearly not safe without some locking.

Coywolf: As I think I have already said, I appreciate your review of
the md/raid code and your attempts to improve it - I'm sure there is
plenty of room to make improvements. However, posting patches with
minimal commentary on code that you don't fully understand is not the
best way to work with the community. If you see something that you
think is wrong, it is much better to ask why it is the way it is,
explain why you think it isn't right, and quite possibly include an
example patch. Then we can discuss the issue and find the best
solution.

So please feel free to post further patches, but please include more
commentary, and don't assume you understand something that you don't
really.
Thanks,
NeilBrown

> Signed-off-by: Coywolf Qi Hunt [EMAIL PROTECTED]
> ---
>
> diff --git a/drivers/md/raid6main.c b/drivers/md/raid6main.c
> index bc69355..820536e 100644
> --- a/drivers/md/raid6main.c
> +++ b/drivers/md/raid6main.c
> @@ -468,7 +468,6 @@ static int raid6_end_write_request (stru
>  	struct stripe_head *sh = bi->bi_private;
>  	raid6_conf_t *conf = sh->raid_conf;
>  	int disks = conf->raid_disks, i;
> -	unsigned long flags;
>  	int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
>
>  	if (bi->bi_size)
> @@ -486,16 +485,14 @@ static int raid6_end_write_request (stru
>  		return 0;
>  	}
>
> -	spin_lock_irqsave(&conf->device_lock, flags);
>  	if (!uptodate)
>  		md_error(conf->mddev, conf->disks[i].rdev);
>
>  	rdev_dec_pending(conf->disks[i].rdev, conf->mddev);
> -
>  	clear_bit(R5_LOCKED, &sh->dev[i].flags);
>  	set_bit(STRIPE_HANDLE, &sh->state);
> -	__release_stripe(conf, sh);
> -	spin_unlock_irqrestore(&conf->device_lock, flags);
> +	release_stripe(sh);
> +
>  	return 0;
>  }
>
> --
> Coywolf Qi Hunt
Re: to be or not to be...
On Sunday April 23, [EMAIL PROTECTED] wrote:
> Hi all, to make a long story very very short:
> a) I create /dev/md1, kernel latest rc-2-git4 and mdadm-2.4.1.tgz,
>    with this command:
>    /root/mdadm -Cv /dev/.static/dev/.static/dev/.static/dev/md1 \
>      --bitmap-chunk=1024 --chunk=256 --assume-clean --bitmap=internal \
                                       ^^^^^^^^^^^^^^
>      -l5 -n5 /dev/hda2 /dev/hdb2 /dev/hde2 /dev/hdf2 missing

From the man page:

   --assume-clean
       Tell mdadm that the array pre-existed and is known to be
       clean. It can be useful when trying to recover from a major
       failure as you can be sure that no data will be affected
       unless you actually write to the array. It can also be used
       when creating a RAID1 or RAID10 if you want to avoid the
       initial resync, however this practice - while normally safe -
       is not recommended. Use this only if you really know what you
       are doing.

So presumably you know what you are doing, and I wonder why you
bother to ask us :-)

Of course, if you don't know what you are doing, then I suggest
dropping the --assume-clean. Incorrect use of this flag can lead to
data corruption. This is particularly true if your array goes
degraded, but is also true while your array isn't degraded. In this
case it is (I think) very unusual and may not be the cause of your
corruption, but you should avoid using the flag anyway.

> b) dm-encrypt /dev/md1
> c) create fs with: mkfs.ext3 -O dir_index -L 'tritone' -i 256000
>    /dev/mapper/raidone
> d) export it via nfs (mounting /dev/mapper/raidone as ext2)

Why not ext3?

> e) start to cp-ing files
> f) after 1 TB of written data, with no problem/warning, one of the
>    not-in-raid-array HDs froze

This could signal a bad controller. If it does, then you cannot
trust any drives.

NeilBrown