Re: raid1 error handling and faulty drives

2007-09-07 Thread Mike Accetta
Neil Brown writes:

 On Wednesday September 5, [EMAIL PROTECTED] wrote:
  ...
  
  2) It adds a threshold on the level of recent error activity which is
 acceptable in a given interval, all configured through /sys.  If a
 mirror has generated more errors in this interval than the threshold,
 it is kicked out of the array.
 
 This is probably a good idea.  It bothers me a little to require 2
 separate numbers in sysfs...
 
 When we get a read error, we quiesce the device, then try to sort out
 the read errors, so we effectively handle them in batches.  Maybe we
 should just set a number of seconds, and if there are 3 or more
 batches in that number of seconds, we kick the drive... just a thought.
 

I think I was just trying to be as flexible as possible.  If we were to
use one number, I'd do the opposite and fix the interval but allow the
threshold to be configured, just because I tend to think of a disk
being bad in terms of it having more than an expected number of errors
in some fixed interval, rather than in terms of it having a fixed number
of errors in less than some expected interval.  Mathematically the two
approaches ought to be equivalent.
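
To make that concrete, the test I have in mind amounts to something like
the sketch below (not the code in the patch; "recent_errors" here is just a
stand-in for whatever counter the real bookkeeping ends up using, and
errors_interval/errors_threshold/errors_asof are the fields the patch adds):

	/*
	 * Sketch only: has this mirror used up its error budget for the
	 * current interval?  errors_interval is in minutes, errors_threshold
	 * in sectors, errors_asof marks the start of the current interval.
	 */
	static int error_budget_exceeded(mdk_rdev_t *rdev, int nsectors)
	{
		unsigned long interval = rdev->errors_interval * 60 * HZ;

		if (time_after(jiffies, rdev->errors_asof + interval)) {
			/* interval has expired; start counting afresh */
			rdev->errors_asof = jiffies;
			rdev->recent_errors = 0;
		}
		rdev->recent_errors += nsectors;
		return rdev->recent_errors > rdev->errors_threshold;
	}

Freezing either number in that sketch gives the one-knob variant; the
information carried is the same either way.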

  One would think that #2 should not be necessary as the raid1 retry
  logic already attempts to rewrite and then reread bad sectors and fails
  the drive if it cannot do both.  However, what we observe is that the
  re-write step succeeds as does the re-read but the drive is really no
  more healthy.  Maybe the re-read is not actually going out to the media
  in this case due to some caching effect?
 
 I have occasionally wondered if a cache would defeat this test.  I
 wonder if we can push a FORCE MEDIA ACCESS flag down with that
 read.  I'll ask.

I looked around for something like this but it doesn't appear to be
implemented, as far as I could see.  I couldn't even find an explicit
mention of read caching in any drive specs to begin with.  Read-ahead
seems to be the closest concept.
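
The closest I can get from user space is a probe along these lines
(hypothetical, not part of any patch): it re-reads a just-repaired sector
with O_DIRECT, which at least takes the kernel page cache out of the
picture, though it says nothing about the drive's own cache, which is the
one I actually suspect:

	/* read one sector through O_DIRECT: device and sector number on the
	 * command line; only rules out kernel-side caching, not the drive's */
	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		void *buf;
		int fd;

		if (argc != 3) {
			fprintf(stderr, "usage: %s <device> <sector>\n", argv[0]);
			return 1;
		}
		fd = open(argv[1], O_RDONLY | O_DIRECT);
		if (fd < 0 || posix_memalign(&buf, 4096, 4096))
			return 1;
		if (pread(fd, buf, 4096, (off_t)strtoll(argv[2], NULL, 0) * 512) != 4096)
			perror("pread");
		close(fd);
		return 0;
	}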

 Thanks.  I agree that we do need something along these lines.  It
 might be a while before I can give the patch the brainspace it
 deserves as I am travelling this fortnight.

Looking forward to further discussion.  Thank you!
--
Mike Accetta

ECI Telecom Ltd.
Transport Networking Division, US (previously Laurel Networks)


raid1 error handling and faulty drives

2007-09-05 Thread Mike Accetta

I've been looking at ways to minimize the impact of a faulty drive in
a raid1 array.  Our major problem is that a faulty drive can absorb
lots of wall clock time in error recovery within the device driver
(SATA libata error handler in this case), during which any further raid
activity is blocked and the system effectively hangs.  This tends to
negate the high availability advantage of placing the file system on a
RAID array in the first place.

We've had one particularly bad drive, for example, which could sync
without indicating any write errors but as soon as it became active in
the array would start yielding read errors.  In this particular case it
would take 30 minutes or more for the process to progress to a point
where some fatal error would occur to kick the drive out of the array
and return the system to normal operation.

For SATA, this effect can be partially mitigated by reducing the default
30 second timeout at the SCSI layer (/sys/block/sda/device/timeout).
However, the system still spends 45 seconds or so per retry in the
driver issuing various reset operations in an attempt to recover from
the error before returning control to the SCSI layer.
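
For reference, the adjustment is just (value in seconds, 10 here purely as
an example; the stock default is 30):

	echo 10 > /sys/block/sda/device/timeout
	echo 10 > /sys/block/sdb/device/timeout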

I've been experimenting with a patch which makes two basic changes.

1) It issues the first read request against a mirror with more than 1 drive
   active using the BIO_RW_FAILFAST flag, to keep the SCSI layer from
   re-trying the failed operation in the low level device driver the default
   5 times.  (A sketch of this change follows the list.)

2) It adds a threshold on the level of recent error activity which is
   acceptable in a given interval, all configured through /sys.  If a
   mirror has generated more errors in this interval than the threshold,
   it is kicked out of the array.
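
The heart of change #1 is no more than the following (a sketch rather than
the literal hunk from the patch):

	/*
	 * raid1 read path: if at least one other in-sync mirror is available
	 * to retry on, let this first attempt fail fast instead of having
	 * the SCSI layer retry it in the low level driver.
	 */
	read_bio->bi_rw = READ;
	if (conf->raid_disks - mddev->degraded > 1)
		read_bio->bi_rw |= (1 << BIO_RW_FAILFAST);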

One would think that #2 should not be necessary as the raid1 retry
logic already attempts to rewrite and then reread bad sectors and fails
the drive if it cannot do both.  However, what we observe is that the
re-write step succeeds as does the re-read but the drive is really no
more healthy.  Maybe the re-read is not actually going out to the media
in this case due to some caching effect?

This patch (against 2.6.20) still contains some debugging printk's but
should be otherwise functional.  I'd be interested in any feedback on
this specific approach, and would also be happy if it served to foster
an error recovery discussion that arrives at an even better approach.

Mike Accetta

ECI Telecom Ltd.
Transport Networking Division, US (previously Laurel Networks)
---
diff -Naurp 2.6.20/drivers/md/md.c kernel/drivers/md/md.c
--- 2.6.20/drivers/md/md.c  2007-06-04 13:52:42.0 -0400
+++ kernel/drivers/md/md.c  2007-08-30 16:28:58.633816000 -0400
@@ -1842,6 +1842,46 @@ static struct rdev_sysfs_entry rdev_erro
 __ATTR(errors, S_IRUGO|S_IWUSR, errors_show, errors_store);
 
 static ssize_t
+errors_threshold_show(mdk_rdev_t *rdev, char *page)
+{
+	return sprintf(page, "%d\n", rdev->errors_threshold);
+}
+
+static ssize_t
+errors_threshold_store(mdk_rdev_t *rdev, const char *buf, size_t len)
+{
+	char *e;
+	unsigned long n = simple_strtoul(buf, &e, 10);
+	if (*buf && (*e == 0 || *e == '\n')) {
+		rdev->errors_threshold = n;
+		return len;
+	}
+	return -EINVAL;
+}
+static struct rdev_sysfs_entry rdev_errors_threshold =
+__ATTR(errors_threshold, S_IRUGO|S_IWUSR, errors_threshold_show, errors_threshold_store);
+
+static ssize_t
+errors_interval_show(mdk_rdev_t *rdev, char *page)
+{
+	return sprintf(page, "%d\n", rdev->errors_interval);
+}
+
+static ssize_t
+errors_interval_store(mdk_rdev_t *rdev, const char *buf, size_t len)
+{
+	char *e;
+	unsigned long n = simple_strtoul(buf, &e, 10);
+	if (*buf && (*e == 0 || *e == '\n')) {
+		rdev->errors_interval = n;
+		return len;
+	}
+	return -EINVAL;
+}
+static struct rdev_sysfs_entry rdev_errors_interval =
+__ATTR(errors_interval, S_IRUGO|S_IWUSR, errors_interval_show, errors_interval_store);
+
+static ssize_t
 slot_show(mdk_rdev_t *rdev, char *page)
 {
 	if (rdev->raid_disk < 0)
@@ -1925,6 +1965,8 @@ static struct attribute *rdev_default_at
 	&rdev_state.attr,
 	&rdev_super.attr,
 	&rdev_errors.attr,
+	&rdev_errors_interval.attr,
+	&rdev_errors_threshold.attr,
 	&rdev_slot.attr,
 	&rdev_offset.attr,
 	&rdev_size.attr,
@@ -2010,6 +2052,9 @@ static mdk_rdev_t *md_import_device(dev_
 	rdev->flags = 0;
 	rdev->data_offset = 0;
 	rdev->sb_events = 0;
+	rdev->errors_interval = 15;	/* minutes */
+	rdev->errors_threshold = 16;	/* sectors */
+	rdev->errors_asof = INITIAL_JIFFIES;
 	atomic_set(&rdev->nr_pending, 0);
 	atomic_set(&rdev->read_errors, 0);
 	atomic_set(&rdev->corrected_errors, 0);
diff -Naurp 2.6.20/drivers/md/raid1.c kernel/drivers/md/raid1.c
--- 2.6.20/drivers/md/raid1.c   2007-06-04 13:52:42.0 -0400
+++ kernel/drivers/md/raid1.c   2007

Re: detecting read errors after RAID1 check operation

2007-08-17 Thread Mike Accetta

Neil Brown writes:
 On Wednesday August 15, [EMAIL PROTECTED] wrote:
  Neil Brown writes:
   On Wednesday August 15, [EMAIL PROTECTED] wrote:

  ... 
  This happens in our old friend sync_request_write()?  I'm dealing with
 
 Yes, that would be the place.
 
  ...
  This fragment
  
  if (j < 0 || test_bit(MD_RECOVERY_CHECK, &mddev->recovery)) {
  	sbio->bi_end_io = NULL;
  	rdev_dec_pending(conf->mirrors[i].rdev, mddev);
  } else {
  	/* fixup the bio for reuse */
  	...
  }
  
  looks suspiciously like any correction attempt for 'check' is being
  short-circuited to me, regardless of whether or not there was a read
  error.  Actually, even if the rewrite was not being short-circuited,
  I still don't see the path that would update 'corrected_errors' in this
  case.  There are only two raid1.c sites that touch 'corrected_errors': one
  is in fix_read_error() and the other is later in sync_request_write().
  With my limited understanding of how this all works, neither of these
  paths would seem to apply here.
 
 hmmm yes
 I guess I was thinking of the RAID5 code rather than the RAID1 code.
 It doesn't do the right thing does it?
 Maybe this patch is what we need.  I think it is right.
 
 Thanks,
 NeilBrown
 
 
 Signed-off-by: Neil Brown [EMAIL PROTECTED]
 
 ### Diffstat output
  ./drivers/md/raid1.c |3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)
 
 diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
 --- .prev/drivers/md/raid1.c  2007-08-16 10:29:58.0 +1000
 +++ ./drivers/md/raid1.c  2007-08-17 12:07:35.0 +1000
 @@ -1260,7 +1260,8 @@ static void sync_request_write(mddev_t *
  				j = 0;
  			if (j >= 0)
  				mddev->resync_mismatches += r1_bio->sectors;
 -			if (j < 0 || test_bit(MD_RECOVERY_CHECK, &mddev->recovery)) {
 +			if (j < 0 || (test_bit(MD_RECOVERY_CHECK, &mddev->recovery)
 +				      && text_bit(BIO_UPTODATE, &sbio->bi_flags))) {
  				sbio->bi_end_io = NULL;
  				rdev_dec_pending(conf->mirrors[i].rdev, mddev);
  			} else {

I tried this (with the typo fixed) and it indeed issues a re-write.
However, it doesn't seem to do anything with the corrected errors
count if the rewrite succeeds.  Since end_sync_write() is only used in one
other place, when !In_sync, I tried the following and it seems to work
to get the error count updated.  I don't know whether this belongs in
end_sync_write(), but I'd think it needs to come after the write actually
succeeds, so that seems like the earliest it could be done.

--- BUILD/kernel/drivers/md/raid1.c	2007-06-04 13:52:42.0 -0400
+++ drivers/md/raid1.c	2007-08-17 16:52:14.219364000 -0400
@@ -1166,6 +1166,7 @@ static int end_sync_write(struct bio *bi
 	conf_t *conf = mddev_to_conf(mddev);
 	int i;
 	int mirror=0;
+	mdk_rdev_t *rdev;
 
 	if (bio->bi_size)
 		return 1;
@@ -1175,6 +1176,8 @@ static int end_sync_write(struct bio *bi
 			mirror = i;
 			break;
 		}
+
+	rdev = conf->mirrors[mirror].rdev;
 	if (!uptodate) {
 		int sync_blocks = 0;
 		sector_t s = r1_bio->sector;
@@ -1186,7 +1189,13 @@ static int end_sync_write(struct bio *bi
 			s += sync_blocks;
 			sectors_to_go -= sync_blocks;
 		} while (sectors_to_go > 0);
-		md_error(mddev, conf->mirrors[mirror].rdev);
+		md_error(mddev, rdev);
+	} else if (test_bit(In_sync, &rdev->flags)) {
+		/*
+		 * If we are currently in sync, this was a re-write to
+		 * correct a read error and we should account for it.
+		 */
+		atomic_add(r1_bio->sectors, &rdev->corrected_errors);
 	}
 
 	update_head_pos(mirror, r1_bio);
@@ -1251,7 +1260,8 @@ static void sync_request_write(mddev_t *
 			}
 			if (j >= 0)
 				mddev->resync_mismatches += r1_bio->sectors;
-			if (j < 0 || test_bit(MD_RECOVERY_CHECK, &mddev->recovery)) {
+			if (j < 0 || (test_bit(MD_RECOVERY_CHECK, &mddev->recovery)
+				      && test_bit(BIO_UPTODATE, &sbio->bi_flags))) {
 				sbio->bi_end_io = NULL;
 				rdev_dec_pending(conf->mirrors[i].rdev, mddev);
 			} else {
--
Mike Accetta

ECI Telecom Ltd.
Transport Networking Division, US (previously Laurel Networks)

detecting read errors after RAID1 check operation

2007-08-15 Thread Mike Accetta

We run a check operation periodically to try to turn up problems with
drives about to go bad before they become too severe.  In particular,
if there were any drive read errors during the check operation I would
like to be able to notice and raise an alarm for human attention, so that
the failing drive can be replaced sooner rather than later.  I'm looking
for a programmatic way to detect this reliably, without having to grovel
through the log files looking for kernel hard drive error messages that
may have occurred during the check operation.

There are already files like /sys/block/md_d0/md/dev-sdb/errors in /sys
which would be very convenient to consult but according to the kernel
driver implementation the error counts reported there are apparently
for corrected errors and not relevant for read errors during a check
operation.
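
For what it's worth, the scripted flow I'm after looks roughly like this
(array and member names as in the example above):

	echo check > /sys/block/md_d0/md/sync_action
	while [ "$(cat /sys/block/md_d0/md/sync_action)" != "idle" ]; do
		sleep 60
	done
	cat /sys/block/md_d0/md/mismatch_cnt
	cat /sys/block/md_d0/md/dev-sdb/errors

and it is that last number which, today, only reflects corrected errors.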

I am contemplating adding a parallel /sys file that would report
all errors, not just the corrected ones.  Does this seem reasonable?
Are there other alternatives that might make sense here?
--
Mike Accetta

ECI Telecom Ltd.
Transport Networking Division, US (previously Laurel Networks)


Re: detecting read errors after RAID1 check operation

2007-08-15 Thread Mike Accetta
Neil Brown writes:
 On Wednesday August 15, [EMAIL PROTECTED] wrote:
  
  There are already files like /sys/block/md_d0/md/dev-sdb/errors in /sys
  which would be very convenient to consult but according to the kernel
  driver implementation the error counts reported there are apparently
  for corrected errors and not relevant for read errors during a check
  operation.
  
 
 When 'check' hits a read error, an attempt is made to 'correct' it by
 over-writing with correct data.  This will either increase the
 'errors' count or fail the drive completely.
 
 What 'check' doesn't do (and 'repair' does) is react when it finds that
 successful reads of all drives (in a raid1) do not match.
 
 So just use the 'errors' number - it is exactly what you want.

This happens in our old friend sync_request_write()?  I'm dealing with
simulated errors and will dig further to make sure that is not perturbing
the results but I don't see any 'errors' effect.  This is with our
patched 2.6.20 raid1.c.  The logic doesn't seem to be any different in
2.6.22 from what I can tell, though.

This fragment

if (j < 0 || test_bit(MD_RECOVERY_CHECK, &mddev->recovery)) {
	sbio->bi_end_io = NULL;
	rdev_dec_pending(conf->mirrors[i].rdev, mddev);
} else {
	/* fixup the bio for reuse */
	...
}

looks suspiciously like any correction attempt for 'check' is being
short-circuited to me, regardless of whether or not there was a read
error.  Actually, even if the rewrite was not being short-circuited,
I still don't see the path that would update 'corrected_errors' in this
case.  There are only two raid1.c sites that touch 'corrected_errors': one
is in fix_read_error() and the other is later in sync_request_write().
With my limited understanding of how this all works, neither of these
paths would seem to apply here.
--
Mike Accetta

ECI Telecom Ltd.
Transport Networking Division, US (previously Laurel Networks)


Re: deliberately degrading RAID1 to a single disk, then back again

2007-06-27 Thread Mike Accetta

Justin Piszcz wrote:

mdadm /dev/md0 --fail /dev/sda1

On Tue, 26 Jun 2007, Maurice Hilarius wrote:


...

From time to time I want to degrade back to only a single disk, and turn
off RAID as the overhead has some cost.

From time to time I want to restore to RAID1 function, and re-synch the
pair to current.

Yes, this is a backup scenario.

Are there any recommendations (with mdadm syntax), please?


Plus

  mdadm /dev/md0 --remove /dev/sda1

to take the drive out of the array and

  mdadm /dev/md0 --add /dev/sda1

to put it back and start the re-sync.

When you mention turning off RAID, do you mean not even bringing up the disk
as a degraded array?  If all you do is fail and remove the partition, RAID is
actually still running to manage the single remaining partition, but I would
expect that overhead to be minimal.
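
So the whole cycle, using sda1 as the example member as above, would be
roughly:

	mdadm /dev/md0 --fail /dev/sda1
	mdadm /dev/md0 --remove /dev/sda1
	# ... run on the single remaining disk ...
	mdadm /dev/md0 --add /dev/sda1
	cat /proc/mdstat	# watch the resync progress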
--
Mike Accetta

ECI Telecom Ltd.
Transport Networking Division, US (previously Laurel Networks)


Re: When does a disk get flagged as bad?

2007-05-30 Thread Mike Accetta
Alberto Alonso writes:
 OK, let's see if I can understand how a disk gets flagged
 as bad and removed from an array. I was under the impression
 that any read or write operation failure flags the drive as
 bad and it gets removed automatically from the array.
 
 However, as I indicated in a prior post I am having problems
 where the array is never degraded. Does an error of type:
 end_request: I/O error, dev sdb, sector 
 not count as a read/write error?

I was also under the impression that any read or write error would
fail the drive out of the array but some recent experiments with error
injecting seem to indicate otherwise at least for raid1.  My working
hypothesis is that only write errors fail the drive.  Read errors appear
to just redirect the sector to a different mirror.

I actually ran across what looks like a bug in the raid1
recovery/check/repair read error logic that I posted about
last week but which hasn't generated any response yet (cf.
http://article.gmane.org/gmane.linux.raid/15354).  This bug results in
sending a zero length write request down to the underlying device driver.
A consequence of issuing a zero length write is that it fails at the
device level, which raid1 sees as a write failure, which then fails the
drive out of the array.  The fix I proposed actually has the effect of *not*
failing the drive in this case, since the spurious failing write is never
generated.  I'm not sure what is actually supposed to happen in this case.
Hopefully, someone more knowledgeable will comment soon.
--
Mike Accetta

ECI Telecom Ltd.
Data Networking Division (previously Laurel Networks)


raid1 check/repair read error recovery in 2.6.20

2007-05-24 Thread Mike Accetta
I believe I've come across a bug in the disk read error recovery logic
for raid1 check/repair operations in 2.6.20.  The raid1.c file looks
identical in 2.6.21 so the problem should still exist there as well.

This all surfaced when using a variant of CONFIG_FAIL_MAKE_REQUEST to
inject read errors on one of the mirrors of a raid1 array.  I noticed
that while this would ultimately fail the array, it would always seem
to generate

ata1.00: WARNING: zero len r/w req 
ata1.00: WARNING: zero len r/w req 
ata1.00: WARNING: zero len r/w req 
ata1.00: WARNING: zero len r/w req 
ata1.00: WARNING: zero len r/w req 
ata1.00: WARNING: zero len r/w req 

diagnostics at the same time (no clue why there are six of them).
Delving into this further I eventually settled on sync_request_write()
in raid1.c as a likely culprit and added the WARN_ON (below)

@@ -1386,6 +1393,7 @@
 		atomic_inc(&r1_bio->remaining);
 		md_sync_acct(conf->mirrors[i].rdev->bdev, wbio->bi_size >> 9);
 
+		WARN_ON(wbio->bi_size == 0);
 		generic_make_request(wbio);
 	}


to confirm that this code was indeed sending a zero size bio down to
the device layer in this circumstance.

Looking at the preceding code in sync_request_write() it appears that
the loop comparing the results of all reads just skips a mirror where
the read failed (BIO_UPTODATE is clear) without doing any of the sbio
prep or the memcpy() from the pbio.  There is other read/re-write logic
in the following if-clause but this seems to apply only if none of the
mirrors were readable.  Regardless, the fact that a zero length bio is
being issued in the "schedule writes" section is compelling evidence
that something is wrong somewhere.

I tried the following patch to raid1.c which short-circuits the data
comparison in the read error case but otherwise does the rest of the
sbio prep for the mirror with the error.  It seems to have eliminated
the ATA warning at least.  Is it a correct thing to do?

@@ -1235,17 +1242,20 @@
 		}
 	r1_bio->read_disk = primary;
 	for (i=0; i<mddev->raid_disks; i++)
-		if (r1_bio->bios[i]->bi_end_io == end_sync_read &&
-		    test_bit(BIO_UPTODATE, &r1_bio->bios[i]->bi_flags)) {
+		if (r1_bio->bios[i]->bi_end_io == end_sync_read) {
 			int j;
 			int vcnt = r1_bio->sectors >> (PAGE_SHIFT- 9);
 			struct bio *pbio = r1_bio->bios[primary];
 			struct bio *sbio = r1_bio->bios[i];
-			for (j = vcnt; j-- ; )
-				if (memcmp(page_address(pbio->bi_io_vec[j].bv_page),
-					   page_address(sbio->bi_io_vec[j].bv_page),
-					   PAGE_SIZE))
-					break;
+			if (test_bit(BIO_UPTODATE, &sbio->bi_flags)) {
+				for (j = vcnt; j-- ; )
+					if (memcmp(page_address(pbio->bi_io_vec[j].bv_page),
+						   page_address(sbio->bi_io_vec[j].bv_page),
+						   PAGE_SIZE))
+						break;
+			} else {
+				j = 0;
+			}
 			if (j >= 0)
 				mddev->resync_mismatches += r1_bio->sectors;
 			if (j < 0 || test_bit(MD_RECOVERY_CHECK, &mddev->recovery)) {
--
Mike Accetta

ECI Telecom Ltd.
Data Networking Division (previously Laurel Networks)


Re: Partitioned arrays initially missing from /proc/partitions

2007-04-23 Thread Mike Accetta
David Greaves writes:

...
 It looks like the same (?) problem as Mike (see below - Mike do you have a
 patch?) but I'm on 2.6.20.7 with mdadm v2.5.6
...

We have since started assembling the array from the initrd using
--homehost and --auto-update-homehost, which takes a different path through
the code, and in this path the kernel figures out there are partitions
on the array before mdadm exits.
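
Roughly what our initrd runs now is (the homehost name here is illustrative,
and I've trimmed the exact invocation):

	mdadm --assemble --scan --homehost=ourhost --auto-update-homehost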

For the previous code path, we had been running with the patch I described
in my original post, which I've included below.  I'd guess that the bug
is actually in the kernel code; I looked at it briefly but couldn't
figure out how things all fit together well enough to come up with a
patch there.  The user level patch is a bit of a hack and there may be
other code paths that also need a similar patch.  I only made this patch
in the assembly code path we were executing at the time.

==== BUILD/mdadm/mdadm.c#2 (text) - BUILD/mdadm/mdadm.c#3 (text) ==== content
@@ -983,6 +983,10 @@
 					   NULL,
 					   readonly, runstop, NULL, verbose-quiet, force);
 				close(mdfd);
+				mdfd = open(array_list->devname, O_RDONLY);
+				if (mdfd >= 0) {
+					close(mdfd);
+				}
 			}
 		}
 		break;
--
Mike Accetta

ECI Telecom Ltd.
Data Networking Division (previously Laurel Networks)


Re: RAID1, hot-swap and boot integrity

2007-03-05 Thread Mike Accetta

Gabor Gombas wrote:

On Fri, Mar 02, 2007 at 09:04:40AM -0500, Mike Accetta wrote:


Thoughts or other suggestions anyone?


This is a case where a very small /boot partition is still a very good
idea... 50-100MB is a good choice (some initramfs generators require
quite a bit of space under /boot while generating the initramfs image
esp. if you use distro-provided contains-everything-and-the-kitchen-sink
kernels, so it is not wise to make /boot _too_ small).

But if you do not want /boot to be separate a moderately sized root
partition is equally good. What you want to avoid is the whole disk is
a single partition/file system kind of setup.


Yes, we actually have a separate (smallish) boot partition at the front of
the array.  This does reduce the at-risk window substantially.  I'll have to
ponder whether it reduces it close enough to negligible to ignore, but
that is indeed a good point to consider.
--
Mike Accetta

ECI Telecom Ltd.
Data Networking Division (previously Laurel Networks)


Re: RAID1, hot-swap and boot integrity

2007-03-05 Thread Mike Accetta

Bill Davidsen wrote:

Gabor Gombas wrote:

On Fri, Mar 02, 2007 at 09:04:40AM -0500, Mike Accetta wrote:

 

Thoughts or other suggestions anyone?



This is a case where a very small /boot partition is still a very good
idea... 50-100MB is a good choice (some initramfs generators require
quite a bit of space under /boot while generating the initramfs image
esp. if you use distro-provided contains-everything-and-the-kitchen-sink
kernels, so it is not wise to make /boot _too_ small).
  
You are exactly right on that! Some (many) BIOS implementations will 
read the boot sector off the drive, and if there is no error will run 
the boot sector.

But if you do not want /boot to be separate a moderately sized root
partition is equally good. What you want to avoid is the whole disk is
a single partition/file system kind of setup.

  
Actually, the solution is moderately simple, install the replacement 
drive, create the partitions, and **don't mark the boot partition 
active** until the copy is complete. The BIOS will boot from the 1st 
active partition it finds (again, in sane cases).


I never have anything changing in /boot in normal operation, so I admit 
to using dd to do a copy with the array stopped. No particular reason to 
think it works better than just a rebuild. After the partition is valid 
I set the active flag in the partition.




I gathered the impression somewhere, perhaps incorrectly, that the active
flag was a function of the boot block, not the BIOS.  We use Grub in the MBR
and don't even have an active flag set in the partition table.  The system
still boots.
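
For what it's worth, the flag is easy enough to inspect:

	fdisk -l /dev/sda

marks the active partition with an asterisk in the Boot column, and on our
disks that column is empty.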
--
Mike Accetta

ECI Telecom Ltd.
Data Networking Division (previously Laurel Networks)


Re: RAID1, hot-swap and boot integrity

2007-03-05 Thread Mike Accetta

H. Peter Anvin wrote:

Mike Accetta wrote:


I've been considering trying something like having the re-sync algorithm
on a whole disk array defer the copy for sector 0 to the very end of the
re-sync operation.  Assuming the BIOS makes at least a minimal consistency
check on sector 0 before electing to boot from the drive, this would keep
it from selecting a partially re-sync'd drive that was not previously
bootable.


The only check that it will make is to look for 55 AA at the end of the 
MBR.


Note that typically the MBR is not part of any of your MD volumes.


Yes, that is also what I've observed in the case of our BIOS.  I'm still
trying to get our BIOS vendor to confirm that it will fail over to the next
drive in the boot list on a read error of sector 0.  We're contemplating
some GRUB hacking to fail-over to the other drive once it is in control
and sees problems.

I wonder if having the MBR typically outside of the array and the relative
newness of partitioned arrays are related?  When I was considering how to
architect the RAID1 layout it seemed like a partitioned array on the
entire disk worked most naturally.
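
Since the only test is the 55 AA signature, it is at least easy to see what
the BIOS will see on each drive:

	dd if=/dev/sda bs=1 skip=510 count=2 2>/dev/null | od -An -tx1
	dd if=/dev/sdb bs=1 skip=510 count=2 2>/dev/null | od -An -tx1

Anything other than "55 aa" and a sane BIOS should skip that drive.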
--
Mike Accetta

ECI Telecom Ltd.
Data Networking Division (previously Laurel Networks)


Re: libata hotplug and md raid?

2007-01-10 Thread Mike Accetta

I am currently looking at using md RAID1 and libata hotplug under 2.6.19.
This relevant thread from Oct 2006

http://thread.gmane.org/gmane.linux.raid/13321/focus=13321

tailed off after this proposal from Neil Brown:

 On Monday October 16, [EMAIL PROTECTED] wrote:
   So the question remains: How will hotplug and md work together?
   
   How does md and hotplug work together for current hotplug devices?
  
  I have the same questions.
  
  How does this work in a pure SCSI environment? (has it been tested?)
  If something should change, should those changes be in the MD layer?
  Or can this *really* all be done nicely from userspace?  How?
 
 I would imagine that device removal would work like this:
  1/  you unplug the device
  2/ kernel notices and generates an unplug event to udev.
  3/ Udev does all the work to try to disconnect the device:
  force unmount (though that doesn't work for most filesystems)
  remove from dm
  remove from md (mdadm /dev/mdwhatever --fail /dev/dead --remove 
 /dev/dead)
  4/ Udev removes the node from /dev.
 
 udev can find out what needs to be done by looking at
 /sys/block/whatever/holders. 
 
 I don't know exactly how to get udev to do this, or whether there
 would be 'issues' in getting it to work reliably.  However if anyone
 wants to try I'm happy to help out where I can.
 
 NeilBrown

Not seeing any subsequent reports on the list, I decided to try
implementing the proposed approach.  The immediate problem I ran into
was that /sys appears to have been cleaned up before udev sees the
remove event, and the /sys/block/whatever/holders file is no longer
even around to consult at that point.  As a secondary problem, the
/dev/dead node is also apparently removed by udev before any programs
mentioned in removal rules get a chance to run, so there is no longer any
device to provide to mdadm to remove at the time the program does run,
even if it had been possible to find out which md arrays were holders of
the removed block device to begin with.  Do I have the details right?
Any new thoughts in the last few months about how it would be best to
solve this problem?
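
For reference, the sort of rule the proposal boils down to is something
like this (illustrative only; the md device has to be hard-coded here
precisely because the holders information needed to discover it is gone
by the time the rule fires):

	ACTION=="remove", SUBSYSTEM=="block", KERNEL=="sd*", \
		RUN+="/sbin/mdadm /dev/md_d0 --fail /dev/%k --remove /dev/%k"

and it runs into exactly the two problems described above.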
--
Mike Accetta

ECI Telecom Ltd.
Data Networking Division (previously Laurel Networks)


Partitioned arrays initially missing from /proc/partitions

2006-12-01 Thread Mike Accetta
In setting up a partitioned array as the boot disk and using a nash 
initrd to find the root file system by volume label, I see a delay in 
the appearance of the /dev/md_d0p partitions in /proc/partitions.  When 
the mdadm --assemble command completes, only /dev/md_d0 is visible. 
Since the raid partitions are not visible after the assemble, the volume 
label search will not consult them in looking for the root volume and 
the boot gets aborted. When I run a similar assemble command while up 
multi-user in a friendlier debug environment I see the same effect and 
observe that pretty much any access of /dev/md_d0 has the side effect of 
then making the /dev/md_d0p partitions visible in /proc/partitions.


I tried a few experiments changing the --assemble code in mdadm.  If I 
open() and close() /dev/md_d0 after assembly *before* closing the file 
descriptor which the assemble step used to assemble the array, there is 
no effect.  Even doing a BLKRRPART ioctl call on the assembly fd or the 
newly opened fd has no effect.  The kernel prints "unknown partition" 
diagnostics on the console.  However, if the assembly fd is first 
close()'d, a simple open() of /dev/md_d0 and immediate close() of that 
fd has the side effect of making the /dev/md_d0p partitions visible, and 
one sees the console disk partitioning confirmation from the kernel as well.


Adding the open()/close() after assembly within mdadm solves my problem, 
but I thought I'd raise the issue on the list as it seems there is a bug 
somewhere.  I see in the kernel md driver that the RUN_ARRAY ioctl() 
calls do_md_run() which calls md_probe() which calls add_disk() and I 
gather that this would normally have the side effect of making the 
partitions visible.  However, my experiments at user level seem to imply 
that the array isn't completely usable until the assembly file 
descriptor is closed, even on return from the ioctl(), and hence the 
kernel add_disk() isn't having the desired partitioning side effect at 
the point it is being invoked.
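
(In case it is useful to anyone else hitting this: the same nudge can be
had from the initrd without patching mdadm by touching the device after
mdadm exits, e.g.

	blockdev --rereadpt /dev/md_d0

although, as noted above, even a bare open()/close() of /dev/md_d0 appears
to be enough.)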


This is all with kernel 2.6.18 and mdadm 2.3.1
--
Mike Accetta

ECI Telecom Ltd.
Data Networking Division (previously Laurel Networks)