Re: raid1 error handling and faulty drives
Neil Brown writes:
> On Wednesday September 5, [EMAIL PROTECTED] wrote:
> > ...
> > 2) It adds a threshold on the level of recent error activity which is
> >    acceptable in a given interval, all configured through /sys.  If a
> >    mirror has generated more errors in this interval than the threshold,
> >    it is kicked out of the array.
>
> This is probably a good idea.  It bothers me a little to require 2
> separate numbers in sysfs...
>
> When we get a read error, we quiesce the device, then try to sort out
> the read errors, so we effectively handle them in batches.  Maybe we
> should just set a number of seconds, and if there are 3 or more batches
> in that number of seconds, we kick the drive... just a thought.

I think I was just trying to be as flexible as possible.  If we were to
use one number, I'd do the opposite and fix the interval but allow the
threshold to be configured, just because I tend to think of a disk being
bad in terms of it having more than an expected number of errors in some
fixed interval, rather than in terms of it having a fixed number of
errors in less than some expected interval.  Mathematically the two
approaches ought to be equivalent.

> > One would think that #2 should not be necessary as the raid1 retry
> > logic already attempts to rewrite and then reread bad sectors and fails
> > the drive if it cannot do both.  However, what we observe is that the
> > re-write step succeeds, as does the re-read, but the drive is really no
> > more healthy.  Maybe the re-read is not actually going out to the media
> > in this case due to some caching effect?
>
> I have occasionally wondered if a cache would defeat this test.
> I wonder if we can push a FORCE MEDIA ACCESS flag down with that read.
> I'll ask.

I looked around for something like this but it doesn't appear to be
implemented that I could see.  I couldn't even find an explicit mention
of read caching in any drive specs to begin with.  Read-ahead seems to
be the closest concept.  Thanks.

> I agree that we do need something along these lines.  It might be a
> while before I can give the patch the brainspace it deserves as I am
> travelling this fortnight.

Looking forward to further discussion.  Thank you!
--
Mike Accetta
ECI Telecom Ltd.
Transport Networking Division, US (previously Laurel Networks)
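For what it's worth, both parameterizations reduce to the same windowed
rate test.  A minimal sketch of that test in plain C (illustrative only;
the struct and names below are not taken from the actual patch):

#include <time.h>

/*
 * Sketch only: "threshold errors within a fixed interval" and "fixed
 * batch count within a configurable number of seconds" are both a count
 * of error events inside a sliding window compared against a limit.
 */
struct error_window {
	time_t start;            /* when the current window opened */
	unsigned int count;      /* error events seen in this window */
};

static int exceeds_error_rate(struct error_window *w, time_t now,
			      unsigned int interval_secs,
			      unsigned int threshold)
{
	if (now - w->start > (time_t)interval_secs) {
		/* Window expired; start counting afresh. */
		w->start = now;
		w->count = 0;
	}
	w->count++;                       /* record this error event */
	return w->count > threshold;      /* caller would fail the mirror */
}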
raid1 error handling and faulty drives
I've been looking at ways to minimize the impact of a faulty drive in a
raid1 array.  Our major problem is that a faulty drive can absorb lots of
wall clock time in error recovery within the device driver (the SATA
libata error handler in this case), during which any further raid
activity is blocked and the system effectively hangs.  This tends to
negate the high availability advantage of placing the file system on a
RAID array in the first place.

We've had one particularly bad drive, for example, which could sync
without indicating any write errors but, as soon as it became active in
the array, would start yielding read errors.  In this particular case it
would take 30 minutes or more for the process to progress to a point
where some fatal error would occur to kick the drive out of the array and
return the system to normal operation.  For SATA, this effect can be
partially mitigated by reducing the default 30 second timeout at the SCSI
layer (/sys/block/sda/device/timeout).  However, the system still spends
45 seconds or so per retry in the driver issuing various reset operations
in an attempt to recover from the error before returning control to the
SCSI layer.

I've been experimenting with a patch which makes two basic changes.

1) It issues the first read request against a mirror with more than 1
   drive active using the BIO_RW_FAILFAST flag, to short-circuit the SCSI
   layer from re-trying the failed operation in the low level device
   driver the default 5 times.

2) It adds a threshold on the level of recent error activity which is
   acceptable in a given interval, all configured through /sys.  If a
   mirror has generated more errors in this interval than the threshold,
   it is kicked out of the array.

One would think that #2 should not be necessary as the raid1 retry logic
already attempts to rewrite and then reread bad sectors and fails the
drive if it cannot do both.  However, what we observe is that the
re-write step succeeds, as does the re-read, but the drive is really no
more healthy.  Maybe the re-read is not actually going out to the media
in this case due to some caching effect?

This patch (against 2.6.20) still contains some debugging printk's but
should be otherwise functional.  I'd be interested in any feedback on
this specific approach and would also be happy if it served to foster an
error recovery discussion that arrives at an even better approach.

Mike Accetta
ECI Telecom Ltd.
Transport Networking Division, US (previously Laurel Networks)
---
diff -Naurp 2.6.20/drivers/md/md.c kernel/drivers/md/md.c
--- 2.6.20/drivers/md/md.c	2007-06-04 13:52:42.0 -0400
+++ kernel/drivers/md/md.c	2007-08-30 16:28:58.633816000 -0400
@@ -1842,6 +1842,46 @@ static struct rdev_sysfs_entry rdev_erro
 __ATTR(errors, S_IRUGO|S_IWUSR, errors_show, errors_store);
 
 static ssize_t
+errors_threshold_show(mdk_rdev_t *rdev, char *page)
+{
+	return sprintf(page, "%d\n", rdev->errors_threshold);
+}
+
+static ssize_t
+errors_threshold_store(mdk_rdev_t *rdev, const char *buf, size_t len)
+{
+	char *e;
+	unsigned long n = simple_strtoul(buf, &e, 10);
+	if (*buf && (*e == 0 || *e == '\n')) {
+		rdev->errors_threshold = n;
+		return len;
+	}
+	return -EINVAL;
+}
+static struct rdev_sysfs_entry rdev_errors_threshold =
+__ATTR(errors_threshold, S_IRUGO|S_IWUSR, errors_threshold_show, errors_threshold_store);
+
+static ssize_t
+errors_interval_show(mdk_rdev_t *rdev, char *page)
+{
+	return sprintf(page, "%d\n", rdev->errors_interval);
+}
+
+static ssize_t
+errors_interval_store(mdk_rdev_t *rdev, const char *buf, size_t len)
+{
+	char *e;
+	unsigned long n = simple_strtoul(buf, &e, 10);
+	if (*buf && (*e == 0 || *e == '\n')) {
+		rdev->errors_interval = n;
+		return len;
+	}
+	return -EINVAL;
+}
+static struct rdev_sysfs_entry rdev_errors_interval =
+__ATTR(errors_interval, S_IRUGO|S_IWUSR, errors_interval_show, errors_interval_store);
+
+static ssize_t
 slot_show(mdk_rdev_t *rdev, char *page)
 {
 	if (rdev->raid_disk < 0)
@@ -1925,6 +1965,8 @@ static struct attribute *rdev_default_at
 	&rdev_state.attr,
 	&rdev_super.attr,
 	&rdev_errors.attr,
+	&rdev_errors_interval.attr,
+	&rdev_errors_threshold.attr,
 	&rdev_slot.attr,
 	&rdev_offset.attr,
 	&rdev_size.attr,
@@ -2010,6 +2052,9 @@ static mdk_rdev_t *md_import_device(dev_
 	rdev->flags = 0;
 	rdev->data_offset = 0;
 	rdev->sb_events = 0;
+	rdev->errors_interval = 15;	/* minutes */
+	rdev->errors_threshold = 16;	/* sectors */
+	rdev->errors_asof = INITIAL_JIFFIES;
 	atomic_set(&rdev->nr_pending, 0);
 	atomic_set(&rdev->read_errors, 0);
 	atomic_set(&rdev->corrected_errors, 0);
diff -Naurp 2.6.20/drivers/md/raid1.c kernel/drivers/md/raid1.c
--- 2.6.20/drivers/md/raid1.c	2007-06-04 13:52:42.0 -0400
+++ kernel/drivers/md/raid1.c	2007
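With the md.c changes above, each component device gains errors_interval
and errors_threshold attributes next to the existing errors file.  A
small userspace sketch of tuning them (the md_d0/dev-sdb path is only an
example; the values written match the patch defaults of 15 minutes and
16 sectors):

#include <stdio.h>

/*
 * Sketch: write a value into one of the per-rdev sysfs attributes added
 * by the patch.  The /sys/block/md_d0/md/dev-<name> path below is only
 * an example; adjust it to the array and component device in question.
 */
static int set_rdev_attr(const char *dev, const char *attr, unsigned long val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/block/md_d0/md/dev-%s/%s", dev, attr);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%lu\n", val);
	return fclose(f);
}

int main(void)
{
	/* Allow at most 16 error sectors in any 15 minute window on sdb. */
	set_rdev_attr("sdb", "errors_interval", 15);
	set_rdev_attr("sdb", "errors_threshold", 16);
	return 0;
}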
Re: detecting read errors after RAID1 check operation
Neil Brown writes:
> On Wednesday August 15, [EMAIL PROTECTED] wrote:
> > Neil Brown writes:
> > > On Wednesday August 15, [EMAIL PROTECTED] wrote:
> > > ...
> > This happens in our old friend sync_request_write()?  I'm dealing with
>
> Yes, that would be the place.
>
> > ...
> > This fragment
> >
> > 	if (j < 0 || test_bit(MD_RECOVERY_CHECK, &mddev->recovery)) {
> > 		sbio->bi_end_io = NULL;
> > 		rdev_dec_pending(conf->mirrors[i].rdev, mddev);
> > 	} else {
> > 		/* fixup the bio for reuse */
> > 		...
> > 	}
> >
> > looks suspiciously like any correction attempt for 'check' is being
> > short-circuited to me, regardless of whether or not there was a read
> > error.
> >
> > Actually, even if the rewrite was not being short-circuited, I still
> > don't see the path that would update 'corrected_errors' in this case.
> > There are only two raid1.c sites that touch 'corrected_errors': one is
> > in fix_read_errors() and the other is later in sync_request_write().
> > With my limited understanding of how this all works, neither of these
> > paths would seem to apply here.
>
> hmmm yes.  I guess I was thinking of the RAID5 code rather than the
> RAID1 code.  It doesn't do the right thing does it?
>
> Maybe this patch is what we need.  I think it is right.
>
> Thanks,
> NeilBrown
>
> Signed-off-by: Neil Brown [EMAIL PROTECTED]
>
> ### Diffstat output
>  ./drivers/md/raid1.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
> --- .prev/drivers/md/raid1.c	2007-08-16 10:29:58.0 +1000
> +++ ./drivers/md/raid1.c	2007-08-17 12:07:35.0 +1000
> @@ -1260,7 +1260,8 @@ static void sync_request_write(mddev_t *
>  				j = 0;
>  			if (j >= 0)
>  				mddev->resync_mismatches += r1_bio->sectors;
> -			if (j < 0 || test_bit(MD_RECOVERY_CHECK, &mddev->recovery)) {
> +			if (j < 0 || (test_bit(MD_RECOVERY_CHECK, &mddev->recovery)
> +				      && text_bit(BIO_UPTODATE, &sbio->bi_flags))) {
>  				sbio->bi_end_io = NULL;
>  				rdev_dec_pending(conf->mirrors[i].rdev, mddev);
>  			} else {

I tried this (with the typo fixed) and it indeed issues a re-write.
However, it doesn't seem to do anything with the corrected errors count
if the rewrite succeeds.  Since end_sync_write() is only used in one
other place, when !In_sync, I tried the following and it seems to work to
get the error count updated.  I don't know whether this belongs in
end_sync_write(), but I'd think it needs to come after the write actually
succeeds, so that seems like the earliest it could be done.

--- BUILD/kernel/drivers/md/raid1.c	2007-06-04 13:52:42.0 -0400
+++ drivers/md/raid1.c	2007-08-17 16:52:14.219364000 -0400
@@ -1166,6 +1166,7 @@ static int end_sync_write(struct bio *bi
 	conf_t *conf = mddev_to_conf(mddev);
 	int i;
 	int mirror=0;
+	mdk_rdev_t *rdev;
 
 	if (bio->bi_size)
 		return 1;
@@ -1175,6 +1176,8 @@ static int end_sync_write(struct bio *bi
 			mirror = i;
 			break;
 		}
+
+	rdev = conf->mirrors[mirror].rdev;
 	if (!uptodate) {
 		int sync_blocks = 0;
 		sector_t s = r1_bio->sector;
@@ -1186,7 +1189,13 @@ static int end_sync_write(struct bio *bi
 			s += sync_blocks;
 			sectors_to_go -= sync_blocks;
 		} while (sectors_to_go > 0);
-		md_error(mddev, conf->mirrors[mirror].rdev);
+		md_error(mddev, rdev);
+	} else if (test_bit(In_sync, &rdev->flags)) {
+		/*
+		 * If we are currently in sync, this was a re-write to
+		 * correct a read error and we should account for it.
+		 */
+		atomic_add(r1_bio->sectors, &rdev->corrected_errors);
 	}
 	update_head_pos(mirror, r1_bio);
@@ -1251,7 +1260,8 @@ static void sync_request_write(mddev_t *
 		}
 		if (j >= 0)
 			mddev->resync_mismatches += r1_bio->sectors;
-		if (j < 0 || test_bit(MD_RECOVERY_CHECK, &mddev->recovery)) {
+		if (j < 0 || (test_bit(MD_RECOVERY_CHECK, &mddev->recovery)
+			      && test_bit(BIO_UPTODATE, &sbio->bi_flags))) {
 			sbio->bi_end_io = NULL;
 			rdev_dec_pending(conf->mirrors[i].rdev, mddev);
 		} else {
--
Mike Accetta
ECI Telecom Ltd.
Transport Networking Division, US (previously Laurel Networks)
detecting read errors after RAID1 check operation
We run a check operation periodically to try and turn up problems with
drives about to go bad before they become too severe.  In particular, if
there were any drive read errors during the check operation I would like
to be able to notice and raise an alarm for human attention, so that the
failing drive can be replaced sooner rather than later.  I'm looking for
a programmatic way to detect this reliably, without having to grovel
through the log files looking for kernel hard drive error messages that
may have occurred during the check operation.

There are already files like /sys/block/md_d0/md/dev-sdb/errors in /sys
which would be very convenient to consult, but according to the kernel
driver implementation the error counts reported there are apparently for
corrected errors and not relevant for read errors during a check
operation.  I am contemplating adding a parallel /sys file that would
report all errors, not just the corrected ones.  Does this seem
reasonable?  Are there other alternatives that might make sense here?
--
Mike Accetta
ECI Telecom Ltd.
Transport Networking Division, US (previously Laurel Networks)
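A minimal sketch of the kind of programmatic check intended here, reading
a per-device count out of sysfs and comparing it across a check run.  The
path is just an example and, as noted above, the existing errors
attribute may not actually reflect read errors seen during check; the
same code would apply to whatever parallel attribute ends up carrying
them:

#include <stdio.h>

/* Sketch only: read one numeric per-device md attribute from sysfs. */
static long read_errors_count(const char *path)
{
	FILE *f = fopen(path, "r");
	long n = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &n) != 1)
		n = -1;
	fclose(f);
	return n;
}

int main(void)
{
	const char *path = "/sys/block/md_d0/md/dev-sdb/errors";  /* example */
	long before = read_errors_count(path);

	/* ... trigger or wait for the periodic 'check' run here ... */

	long after = read_errors_count(path);
	if (after > before)
		printf("dev-sdb reported new errors during check\n");
	return 0;
}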
Re: detecting read errors after RAID1 check operation
Neil Brown writes:
> On Wednesday August 15, [EMAIL PROTECTED] wrote:
> > There are already files like /sys/block/md_d0/md/dev-sdb/errors in
> > /sys which would be very convenient to consult but according to the
> > kernel driver implementation the error counts reported there are
> > apparently for corrected errors and not relevant for read errors
> > during a check operation.
>
> When 'check' hits a read error, an attempt is made to 'correct' it by
> over-writing with correct data.  This will either increase the 'errors'
> count or fail the drive completely.
>
> What 'check' doesn't do (and 'repair' does) is react when it finds that
> successful reads of all drives (in a raid1) do not match.
>
> So just use the 'errors' number - it is exactly what you want.

This happens in our old friend sync_request_write()?  I'm dealing with
simulated errors and will dig further to make sure that is not perturbing
the results, but I don't see any 'errors' effect.  This is with our
patched 2.6.20 raid1.c.  The logic doesn't seem to be any different in
2.6.22 from what I can tell, though.

This fragment

	if (j < 0 || test_bit(MD_RECOVERY_CHECK, &mddev->recovery)) {
		sbio->bi_end_io = NULL;
		rdev_dec_pending(conf->mirrors[i].rdev, mddev);
	} else {
		/* fixup the bio for reuse */
		...
	}

looks suspiciously like any correction attempt for 'check' is being
short-circuited to me, regardless of whether or not there was a read
error.

Actually, even if the rewrite was not being short-circuited, I still
don't see the path that would update 'corrected_errors' in this case.
There are only two raid1.c sites that touch 'corrected_errors': one is in
fix_read_errors() and the other is later in sync_request_write().  With
my limited understanding of how this all works, neither of these paths
would seem to apply here.
--
Mike Accetta
ECI Telecom Ltd.
Transport Networking Division, US (previously Laurel Networks)
Re: deliberately degrading RAID1 to a single disk, then back again
Justin Piszcz wrote:
> mdadm /dev/md0 --fail /dev/sda1
>
> On Tue, 26 Jun 2007, Maurice Hilarius wrote:
> > ...
> > From time to time I want to degrade back to only single disk, and turn
> > off RAID as the overhead has some cost.
> > From time to time I want to restore to RAID1 function, and re-synch the
> > pair to current.
> > Yes, this is a backup scenario..
> > Are there any Recommendations (with mdadm syntax) please?

Plus

	mdadm /dev/md0 --remove /dev/sda1

to take the drive out of the array, and

	mdadm /dev/md0 --add /dev/sda1

to put it back and start the re-sync.

When you mention "turn off RAID" do you mean to not even bring up the
disk as a degraded array?  If all you do is fail and remove the
partition, RAID is actually still running to manage the single remaining
partition, but I would expect that overhead to be minimal.
--
Mike Accetta
ECI Telecom Ltd.
Transport Networking Division, US (previously Laurel Networks)
Re: When does a disk get flagged as bad?
Alberto Alonso writes:
> OK, let's see if I can understand how a disk gets flagged as bad and
> removed from an array.  I was under the impression that any read or
> write operation failure flags the drive as bad and it gets removed
> automatically from the array.  However, as I indicated in a prior post,
> I am having problems where the array is never degraded.
>
> Does an error of type:
>
>   end_request: I/O error, dev sdb, sector
>
> not count as a read/write error?

I was also under the impression that any read or write error would fail
the drive out of the array, but some recent experiments with error
injection seem to indicate otherwise, at least for raid1.  My working
hypothesis is that only write errors fail the drive.  Read errors appear
to just redirect the sector to a different mirror.

I actually ran across what looks like a bug in the raid1
recovery/check/repair read error logic that I posted about last week but
which hasn't generated any response yet (cf.
http://article.gmane.org/gmane.linux.raid/15354).  This bug results in
sending a zero length write request down to the underlying device driver.
A consequence of issuing a zero length write is that it fails at the
device level, which raid1 sees as a write failure, which then fails the
array.  The fix I proposed actually has the effect of *not* failing the
array in this case, since the spurious failing write is never generated.
I'm not sure what is actually supposed to happen in this case.
Hopefully, someone more knowledgeable will comment soon.
--
Mike Accetta
ECI Telecom Ltd.
Data Networking Division (previously Laurel Networks)
raid1 check/repair read error recovery in 2.6.20
I believe I've come across a bug in the disk read error recovery logic
for raid1 check/repair operations in 2.6.20.  The raid1.c file looks
identical in 2.6.21, so the problem should still exist there as well.

This all surfaced when using a variant of CONFIG_FAIL_MAKE_REQUEST to
inject read errors on one of the mirrors of a raid1 array.  I noticed
that while this would ultimately fail the array, it would always seem to
generate

ata1.00: WARNING: zero len r/w req
ata1.00: WARNING: zero len r/w req
ata1.00: WARNING: zero len r/w req
ata1.00: WARNING: zero len r/w req
ata1.00: WARNING: zero len r/w req
ata1.00: WARNING: zero len r/w req

diagnostics at the same time (no clue why there are six of them).
Delving into this further, I eventually settled on sync_request_write()
in raid1.c as a likely culprit and added the WARN_ON (below)

@@ -1386,6 +1393,7 @@
 		atomic_inc(&r1_bio->remaining);
 		md_sync_acct(conf->mirrors[i].rdev->bdev, wbio->bi_size >> 9);
+		WARN_ON(wbio->bi_size == 0);
 		generic_make_request(wbio);
 	}

to confirm that this code was indeed sending a zero size bio down to the
device layer in this circumstance.

Looking at the preceding code in sync_request_write(), it appears that
the loop comparing the results of all reads just skips a mirror where the
read failed (BIO_UPTODATE is clear) without doing any of the sbio prep or
the memcpy() from the pbio.  There is other read/re-write logic in the
following if-clause, but this seems to apply only if none of the mirrors
were readable.  Regardless, the fact that a zero length bio is being
issued in the "schedule writes" section is compelling evidence that
something is wrong somewhere.

I tried the following patch to raid1.c which short-circuits the data
comparison in the read error case but otherwise does the rest of the sbio
prep for the mirror with the error.  It seems to have eliminated the ATA
warning at least.  Is it a correct thing to do?

@@ -1235,17 +1242,20 @@
 		}
 	r1_bio->read_disk = primary;
 	for (i=0; i<mddev->raid_disks; i++)
-		if (r1_bio->bios[i]->bi_end_io == end_sync_read &&
-		    test_bit(BIO_UPTODATE, &r1_bio->bios[i]->bi_flags)) {
+		if (r1_bio->bios[i]->bi_end_io == end_sync_read) {
 			int j;
 			int vcnt = r1_bio->sectors >> (PAGE_SHIFT-9);
 			struct bio *pbio = r1_bio->bios[primary];
 			struct bio *sbio = r1_bio->bios[i];
-			for (j = vcnt; j-- ; )
-				if (memcmp(page_address(pbio->bi_io_vec[j].bv_page),
-					   page_address(sbio->bi_io_vec[j].bv_page),
-					   PAGE_SIZE))
-					break;
+			if (test_bit(BIO_UPTODATE, &sbio->bi_flags)) {
+				for (j = vcnt; j-- ; )
+					if (memcmp(page_address(pbio->bi_io_vec[j].bv_page),
+						   page_address(sbio->bi_io_vec[j].bv_page),
+						   PAGE_SIZE))
+						break;
+			} else {
+				j = 0;
+			}
 			if (j >= 0)
 				mddev->resync_mismatches += r1_bio->sectors;
 			if (j < 0 || test_bit(MD_RECOVERY_CHECK, &mddev->recovery)) {
--
Mike Accetta
ECI Telecom Ltd.
Data Networking Division (previously Laurel Networks)
Re: Partitioned arrays initially missing from /proc/partitions
David Greaves writes:
> ...
> It looks like the same (?) problem as Mike (see below - Mike do you have
> a patch?) but I'm on 2.6.20.7 with mdadm v2.5.6
> ...

We have since started assembling the array from the initrd using
--homehost and --auto-update-homehost, which takes a different path
through the code, and in this path the kernel figures out there are
partitions on the array before mdadm exits.

For the previous code path, we had been running with the patch I
described in my original post, which I've included below.  I'd guess that
the bug is actually in the kernel code, and I looked at it briefly but
couldn't figure out how things all fit together well enough to come up
with a patch there.  The user level patch is a bit of a hack and there
may be other code paths that also need a similar patch.  I only made this
patch in the assembly code path we were executing at the time.

BUILD/mdadm/mdadm.c#2 (text) - BUILD/mdadm/mdadm.c#3 (text) content
@@ -983,6 +983,10 @@
 						NULL, readonly, runstop, NULL,
 						verbose-quiet, force);
 				close(mdfd);
+				mdfd = open(array_list->devname, O_RDONLY);
+				if (mdfd >= 0) {
+					close(mdfd);
+				}
 			}
 		}
 		break;
--
Mike Accetta
ECI Telecom Ltd.
Data Networking Division (previously Laurel Networks)
Re: RAID1, hot-swap and boot integrity
Gabor Gombas wrote:
> On Fri, Mar 02, 2007 at 09:04:40AM -0500, Mike Accetta wrote:
> > Thoughts or other suggestions anyone?
>
> This is a case where a very small /boot partition is still a very good
> idea... 50-100MB is a good choice (some initramfs generators require
> quite a bit of space under /boot while generating the initramfs image
> esp. if you use distro-provided contains-everything-and-the-kitchen-sink
> kernels, so it is not wise to make /boot _too_ small).  But if you do
> not want /boot to be separate a moderately sized root partition is
> equally good.  What you want to avoid is the "whole disk is a single
> partition/file system" kind of setup.

Yes, we actually have a separate (smallish) boot partition at the front
of the array.  This does reduce the at-risk window substantially.  I'll
have to ponder whether it reduces it close enough to negligible to then
ignore, but that is indeed a good point to consider.
--
Mike Accetta
ECI Telecom Ltd.
Data Networking Division (previously Laurel Networks)
Re: RAID1, hot-swap and boot integrity
Bill Davidsen wrote:
> Gabor Gombas wrote:
> > On Fri, Mar 02, 2007 at 09:04:40AM -0500, Mike Accetta wrote:
> > > Thoughts or other suggestions anyone?
> >
> > This is a case where a very small /boot partition is still a very good
> > idea... 50-100MB is a good choice (some initramfs generators require
> > quite a bit of space under /boot while generating the initramfs image
> > esp. if you use distro-provided contains-everything-and-the-kitchen-sink
> > kernels, so it is not wise to make /boot _too_ small).
>
> You are exactly right on that!  Some (many) BIOS implementations will
> read the boot sector off the drive, and if there is no error will run
> the boot sector.
>
> > But if you do not want /boot to be separate a moderately sized root
> > partition is equally good.  What you want to avoid is the "whole disk
> > is a single partition/file system" kind of setup.
>
> Actually, the solution is moderately simple: install the replacement
> drive, create the partitions, and **don't mark the boot partition
> active** until the copy is complete.  The BIOS will boot from the 1st
> active partition it finds (again, in sane cases).  I never have anything
> changing in /boot in normal operation, so I admit to using dd to do a
> copy with the array stopped.  No particular reason to think it works
> better than just a rebuild.  After the partition is valid I set the
> active flag in the partition.

I gathered the impression somewhere, perhaps incorrectly, that the active
flag was a function of the boot block, not the BIOS.  We use Grub in the
MBR and don't even have an active flag set in the partition table.  The
system still boots.
--
Mike Accetta
ECI Telecom Ltd.
Data Networking Division (previously Laurel Networks)
Re: RAID1, hot-swap and boot integrity
H. Peter Anvin wrote:
> Mike Accetta wrote:
> > I've been considering trying something like having the re-sync
> > algorithm on a whole disk array defer the copy for sector 0 to the
> > very end of the re-sync operation.  Assuming the BIOS makes at least a
> > minimal consistency check on sector 0 before electing to boot from the
> > drive, this would keep it from selecting a partially re-sync'd drive
> > that was not previously bootable.
>
> The only check that it will make is to look for 55 AA at the end of the
> MBR.  Note that typically the MBR is not part of any of your MD volumes.

Yes, that is also what I've observed in the case of our BIOS.  I'm still
trying to get our BIOS vendor to confirm that it will fail over to the
next drive in the boot list on a read error of sector 0.  We're
contemplating some GRUB hacking to fail over to the other drive once it
is in control and sees problems.

I wonder if having the MBR typically outside of the array and the
relative newness of partitioned arrays are related?  When I was
considering how to architect the RAID1 layout, it seemed like a
partitioned array on the entire disk worked most naturally.
--
Mike Accetta
ECI Telecom Ltd.
Data Networking Division (previously Laurel Networks)
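As an aside, the 55 AA check mentioned above is easy to reproduce from
userspace.  A hedged sketch (the device path is only an example) that
performs the same minimal validation a typical BIOS does:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Sketch: read sector 0 and verify the 0x55 0xAA boot signature in the
 * last two bytes -- the only consistency check the BIOS applies before
 * electing to boot from the drive, per the discussion above.
 */
int main(int argc, char **argv)
{
	unsigned char mbr[512];
	const char *dev = argc > 1 ? argv[1] : "/dev/sda";  /* example device */
	int fd = open(dev, O_RDONLY);

	if (fd < 0 || read(fd, mbr, sizeof(mbr)) != (ssize_t)sizeof(mbr)) {
		perror(dev);
		return 1;
	}
	close(fd);

	if (mbr[510] == 0x55 && mbr[511] == 0xAA)
		printf("%s: boot signature present\n", dev);
	else
		printf("%s: no boot signature, BIOS would not boot this drive\n", dev);
	return 0;
}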
Re: libata hotplug and md raid?
I am currently looking at using md RAID1 and libata hotplug under 2.6.19.
This relevant thread from Oct 2006

http://thread.gmane.org/gmane.linux.raid/13321/focus=13321

tailed off after this proposal from Neil Brown:

> On Monday October 16, [EMAIL PROTECTED] wrote:
> > So the question remains: How will hotplug and md work together?
>
> How does md and hotplug work together for current hotplug devices?
> I have the same questions.
>
> > How does this work in a pure SCSI environment? (has it been tested?)
> > If something should change, should those changes be in the MD layer?
> > Or can this *really* all be done nicely from userspace? How?
>
> I would imagine that device removal would work like this:
>  1/ you unplug the device
>  2/ kernel notices and generates an unplug event to udev.
>  3/ Udev does all the work to try to disconnect the device:
>       force unmount (though that doesn't work for most filesystems)
>       remove from dm
>       remove from md
>         (mdadm /dev/mdwhatever --fail /dev/dead --remove /dev/dead)
>  4/ Udev removes the node from /dev.
>
> udev can find out what needs to be done by looking at
> /sys/block/whatever/holders.
>
> I don't know exactly how to get udev to do this, or whether there would
> be 'issues' in getting it to work reliably.  However if anyone wants to
> try I'm happy to help out where I can.
>
> NeilBrown

Not seeing any subsequent reports on the list, I decided to try
implementing the proposed approach.  The immediate problem I ran into was
that /sys appears to have been cleaned up before udev sees the remove
event, and the /sys/block/whatever/holders file is no longer even around
to consult at that point.  As a secondary problem, the /dev/dead file is
also apparently removed by udev before any programs mentioned in removal
rules get a chance to run, so there is no longer any device to provide to
mdadm to remove at the time the program does run, even if it had been
possible to find out what md devices were holders of the removed block
device to begin with.

Do I have the details right?  Any new thoughts in the last few months
about how it would be best to solve this problem?
--
Mike Accetta
ECI Telecom Ltd.
Data Networking Division (previously Laurel Networks)
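For reference, the holders lookup Neil describes is straightforward while
the device is still present.  A hedged sketch (the device name is only an
example) of what such a helper would enumerate, which also illustrates
why it comes up empty once the sysfs entries have been torn down:

#include <dirent.h>
#include <stdio.h>

/*
 * Sketch: list the md (or dm) devices holding a block device by reading
 * /sys/block/<dev>/holders.  This only works while the sysfs directory
 * still exists -- which, as noted above, is no longer the case by the
 * time udev delivers the remove event.
 */
int main(void)
{
	const char *path = "/sys/block/sdb/holders";   /* example device */
	DIR *d = opendir(path);
	struct dirent *e;

	if (!d) {
		perror(path);    /* already gone after hot-unplug */
		return 1;
	}
	while ((e = readdir(d)) != NULL) {
		if (e->d_name[0] != '.')
			printf("sdb is held by %s\n", e->d_name);
	}
	closedir(d);
	return 0;
}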
Partitioned arrays initially missing from /proc/partitions
In setting up a partitioned array as the boot disk and using a nash
initrd to find the root file system by volume label, I see a delay in the
appearance of the /dev/md_d0p partitions in /proc/partitions.  When the
mdadm --assemble command completes, only /dev/md_d0 is visible.  Since
the raid partitions are not visible after the assemble, the volume label
search will not consult them in looking for the root volume and the boot
gets aborted.

When I run a similar assemble command while up multi-user in a friendlier
debug environment, I see the same effect, and observe that pretty much
any access of /dev/md_d0 has the side effect of then making the
/dev/md_d0p partitions visible in /proc/partitions.

I tried a few experiments changing the --assemble code in mdadm.  If I
open() and close() /dev/md_d0 after assembly *before* closing the file
descriptor which the assemble step used to assemble the array, there is
no effect.  Even doing a BLKRRPART ioctl call on the assembly fd or the
newly opened fd has no effect; the kernel prints "unknown partition"
diagnostics on the console.  However, if the assembly fd is first
close()'d, a simple open() of /dev/md_d0 and immediate close() of that fd
has the side effect of making the /dev/md_d0p partitions visible, and one
sees the console disk partitioning confirmation from the kernel as well.

Adding the open()/close() after assembly within mdadm solves my problem,
but I thought I'd raise the issue on the list as it seems there is a bug
somewhere.  I see in the kernel md driver that the RUN_ARRAY ioctl()
calls do_md_run(), which calls md_probe(), which calls add_disk(), and I
gather that this would normally have the side effect of making the
partitions visible.  However, my experiments at user level seem to imply
that the array isn't completely usable until the assembly file descriptor
is closed, even on return from the ioctl(), and hence the kernel
add_disk() isn't having the desired partitioning side effect at the point
it is being invoked.

This is all with kernel 2.6.18 and mdadm 2.3.1.
--
Mike Accetta
ECI Telecom Ltd.
Data Networking Division (previously Laurel Networks)
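For completeness, a hedged sketch of the user-level workaround described
above (the device path is only an example): once the descriptor used for
assembly has been closed, a plain open()/close() of the array device is
enough to make the kernel publish the partitions, and a BLKRRPART ioctl
can be issued as well to force a rescan:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>    /* BLKRRPART */

/*
 * Sketch: nudge the kernel into rescanning a partitioned md array.  As
 * described above, this only has the desired effect after the file
 * descriptor used for assembly has been closed.
 */
int main(void)
{
	const char *dev = "/dev/md_d0";          /* example array device */
	int fd = open(dev, O_RDONLY);

	if (fd < 0) {
		perror(dev);
		return 1;
	}
	/* The open itself is usually enough; BLKRRPART forces a rescan. */
	if (ioctl(fd, BLKRRPART) < 0)
		perror("BLKRRPART");
	close(fd);
	return 0;
}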