Re: [PATCH] proactive raid5 disk replacement for 2.6.11

2005-08-15 Thread Claas Hilbrecht
--On Sunday, 14 August 2005 22:10 +0200, Pallai Roland [EMAIL PROTECTED]
wrote:



 this is a feature patch that implements 'proactive raid5 disk
replacement' (http://www.arctic.org/~dean/raid-wishlist.html),


After my experience with a broken raid5 (see the list archives) I think the
'partially failed disks' feature you describe is really useful. I agree
with you that this kind of error is rather common.


--
Claas Hilbrecht
http://www.jucs-kramkiste.de



Re: Found a new bug!

2005-08-15 Thread djani22
Thanks, I will test it when I can...

At the moment my system is a working, online system, and only one 8TB
space is available for me to use...
That's right, maybe I could build a linear array from just one source device, but:
my first problem is that an XFS filesystem with valuable data that I can't
back up already exists on my 8TB device.
It is still OK, but I can't insert a raid layer underneath it, because of the
raid superblock, and XFS isn't shrinkable. :-(

The only way (I think) is to plug in another raw device and build an array
from the 8TB device + a new small device, to give the FS more space.

But it is too risky for me!

Do you think it is safe?

Currently I use 2.6.13-rc3.
Is this patch good for this version, or only for the latest version?

Which is the latest? 2.6.13-rc6, rc6-git7, or the 2.6.14 -git tree? :)

Thanks,

Janos


- Original Message -
From: Neil Brown [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: linux-raid@vger.kernel.org
Sent: Monday, August 15, 2005 3:21 AM
Subject: Re: Found a new bug!


 On Monday August 15, [EMAIL PROTECTED] wrote:
  Hello list, Neil!
 
  Is there any news on the 2TB raid-input problem?
  Sooner or later, I will need to join two 8TB arrays into one big 16TB. :-)

 Thanks for the reminder.

 The following patch should work, but my test machine won't boot the
 current -mm kernels :-( so it is hard to test properly.

 Let me know the results if you are able to test it.

 Thanks,
 NeilBrown

 -
 Support md/linear array with components greater than 2 terabytes.

 linear currently uses division by the size of the smallest component
 device to find which device a request goes to.
 If that smallest device is larger than 2 terabytes, then the division
 will not work on some systems.

 So we introduce a pre-shift, and take care not to make the hash table
 too large, much like the code in raid0.

 Also get rid of conf->nr_zones, which is not needed.

 Signed-off-by: Neil Brown [EMAIL PROTECTED]

 ### Diffstat output
  ./drivers/md/linear.c         |   99 ++++++++++++++++++++++++++++++----------
  ./include/linux/raid/linear.h |    4 -
  2 files changed, 70 insertions(+), 33 deletions(-)

 diff ./drivers/md/linear.c~current~ ./drivers/md/linear.c
 --- ./drivers/md/linear.c~current~ 2005-08-15 11:18:21.0 +1000
 +++ ./drivers/md/linear.c 2005-08-15 11:18:27.0 +1000
 @@ -38,7 +38,8 @@ static inline dev_info_t *which_dev(mdde
   /*
   * sector_div(a,b) returns the remainer and sets a to a/b
   */
  - (void)sector_div(block, conf->smallest->size);
  + block >>= conf->preshift;
  + (void)sector_div(block, conf->hash_spacing);
    hash = conf->hash_table[block];

    while ((sector>>1) >= (hash->size + hash->offset))
 @@ -47,7 +48,7 @@ static inline dev_info_t *which_dev(mdde
  }

  /**
  - * linear_mergeable_bvec -- tell bio layer if a two requests can be merged
  + * linear_mergeable_bvec -- tell bio layer if two requests can be merged
   * @q: request queue
   * @bio: the buffer head that's been built up so far
   * @biovec: the request that could be merged to it.
 @@ -116,7 +117,7 @@ static int linear_run (mddev_t *mddev)
   dev_info_t **table;
   mdk_rdev_t *rdev;
   int i, nb_zone, cnt;
 - sector_t start;
 + sector_t min_spacing;
   sector_t curr_offset;
   struct list_head *tmp;

 @@ -127,11 +128,6 @@ static int linear_run (mddev_t *mddev)
   memset(conf, 0, sizeof(*conf) + mddev->raid_disks*sizeof(dev_info_t));
   mddev->private = conf;

 - /*
 - * Find the smallest device.
 - */
 -
  - conf->smallest = NULL;
   cnt = 0;
   mddev->array_size = 0;

 @@ -159,8 +155,6 @@ static int linear_run (mddev_t *mddev)
   disk->size = rdev->size;
   mddev->array_size += rdev->size;

  - if (!conf->smallest || (disk->size < conf->smallest->size))
  - conf->smallest = disk;
   cnt++;
   }
   if (cnt != mddev->raid_disks) {
 @@ -168,6 +162,36 @@ static int linear_run (mddev_t *mddev)
   goto out;
   }

  + min_spacing = mddev->array_size;
 + sector_div(min_spacing, PAGE_SIZE/sizeof(struct dev_info *));
 +
 + /* min_spacing is the minimum spacing that will fit the hash
 + * table in one PAGE.  This may be much smaller than needed.
 + * We find the smallest non-terminal set of consecutive devices
 + * that is larger than min_spacing as use the size of that as
 + * the actual spacing
 + */
  + conf->hash_spacing = mddev->array_size;
  + for (i=0; i < cnt-1 ; i++) {
  + sector_t sz = 0;
  + int j;
  + for (j=i; i<cnt-1 && sz < min_spacing ; j++)
  + sz += conf->disks[j].size;
  + if (sz >= min_spacing && sz < conf->hash_spacing)
  + conf->hash_spacing = sz;
 + }
 +
 + /* hash_spacing may be too large for sector_div to work with,
 + * so we might need to pre-shift
 + */
  + conf->preshift = 0;
  + if (sizeof(sector_t) > sizeof(u32)) {
  + sector_t space = conf->hash_spacing;
  + while (space > (sector_t)(~(u32)0)) {
  + space >>= 1;
  + conf->preshift++;
 + }
 + }
   /*
   * This code was restructured to work around a gcc-2.95.3 internal
   * compiler error.  Alter it with care.
 @@ -177,39 +201,52 @@ static int linear_run (mddev_t 
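
For illustration, here is a minimal standalone sketch of the pre-shift idea
described in the changelog above (plain userspace C; which_bucket() and the
example numbers are hypothetical, not the kernel code): shifting both the
sector and the hash spacing right by the same amount keeps the divisor within
32 bits, so the quotient can still be computed with a sector_div()-style
32-bit division even when the spacing itself exceeds 2^32 sectors.

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Sketch of the pre-shift trick: sector_div() can only divide a 64-bit
     * value by a 32-bit divisor, so when hash_spacing is larger than 2^32
     * sectors, both the sector and the spacing are shifted right until the
     * divisor fits in a u32.  The quotient is then the hash-table index.
     */
    static unsigned int which_bucket(uint64_t sector, uint64_t hash_spacing)
    {
            unsigned int preshift = 0;

            /* shrink the divisor until it fits in 32 bits */
            while (hash_spacing > (uint64_t)UINT32_MAX) {
                    hash_spacing >>= 1;
                    preshift++;
            }

            /* shift the dividend by the same amount, then divide by a u32 */
            return (unsigned int)((sector >> preshift) / (uint32_t)hash_spacing);
    }

    int main(void)
    {
            /* e.g. ~3 TiB hash spacing and an offset near 5 TiB, both in
             * 512-byte sectors (1 TiB = 2^31 sectors) */
            uint64_t spacing = 3ULL << 31;
            uint64_t sector  = 5ULL << 31;

            printf("hash bucket = %u\n", which_bucket(sector, spacing));
            return 0;
    }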

Re: [PATCH] proactive raid5 disk replacement for 2.6.11

2005-08-15 Thread Mario 'BitKoenig' Holbe
Hi,

Pallai Roland [EMAIL PROTECTED] wrote:
  this is a feature patch that implements 'proactive raid5 disk
 replacement' (http://www.arctic.org/~dean/raid-wishlist.html),
 that could help a lot on large raid5 arrays built from cheap sata
...
  linux software raid is very fragile by default, the typical (nervous)

I just had a quick look over your patch, so please forgive me if I could
have found the answer in the code.
What I'm wondering is how your patch makes the whole system
behave in the case of more harmful errors.
The read errors you are talking about are quite harmless regarding
subsequent access to the device. Unfortunately there *are* errors (even
read errors, too), especially with cheap IDE (ATA,
SATA) equipment, where subsequent access to the device results in
infinite (bus) lockups. I think this is the reason why software RAID
never ever touches a failing drive again. If you change this
behaviour in general, you risk lock-ups of the raid device just because
one of the drives got locked up.
What I did not find in your patch is some differentiation between the
harmless and the harmful error conditions. I'm not even sure if this is
possible at all.

regards
   Mario
-- 
To be happy with a man, you have to understand him very well and love
him a little.
To be happy with a woman, you have to love her very much and must not
even try to understand her.



Re: [PATCH] proactive raid5 disk replacement for 2.6.11

2005-08-15 Thread Pallai Roland

On Mon, 2005-08-15 at 13:29 +0200, Mario 'BitKoenig' Holbe wrote:
 Pallai Roland [EMAIL PROTECTED] wrote:
   this is a feature patch that implements 'proactive raid5 disk
  replacement' (http://www.arctic.org/~dean/raid-wishlist.html),
  that could help a lot on large raid5 arrays built from cheap sata
 ...
   linux software raid is very fragile by default, the typical (nervous)
 
 What I'm wondering is how your patch makes the whole system
 behave in the case of more harmful errors.
 The read errors you are talking about are quite harmless regarding
 subsequent access to the device. Unfortunately there *are* errors (even
 read errors, too), especially with cheap IDE (ATA,
 SATA) equipment, where subsequent access to the device results in
 infinite (bus) lockups. I think this is the reason why software RAID
 never ever touches a failing drive again. If you change this
 behaviour in general, you risk lock-ups of the raid device just because
 one of the drives got locked up.

 Yes, I understand your point, but I think the low-level ATA driver must
be fixed if it lets a drive lock up. As far as I know, the SCSI layer sends
an abort/reset to the device driver if a request is not served within a
timeout value (hey, give me some kind of result, now!); that is the right
behaviour, and only a really braindead driver ignores that alarm.
 As I have seen in practice, modern SATA drivers don't let a drive lock
up; the others should be taught about timeouts.

 Unfortunately, bad blocks are often served slowly by damaged disks, and
since the array keeps accessing them periodically, this can slow the
array down. I have thought about it, and a good starting point would be to
build a table called 'this disk is bad for this stripe': an entry is
inserted after a read error and deleted once the stripe has been rewritten
to the disk. It would also reduce the error lines about bad sectors in dmesg.

 What I did not find in your patch is some differentiation between the
 harmless and the harmful error conditions. I'm not even sure if this is
 possible at all.
 Currently it doesn't tolerate write errors: if a write fails, the drive
gets kicked immediately, so a fully failed disk will not be accessed
forever. Anyway, it's really hard to determine what counts as a harmful
error (at this layer we've only got a single bit for that :), maybe we
should compute a success/failure ratio (%) over time, or scan the 'this
disk is bad for this stripe' table and disable the disk if the count of
bad blocks is over a user-defined threshold.
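
A minimal userspace sketch of the proposed 'this disk is bad for this stripe'
table with a per-disk kick threshold could look like the following (all
names, table sizes and thresholds are hypothetical illustrations, not part of
the patch; hash collisions are simply ignored for brevity):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /*
     * Sketch of the proposed "this disk is bad for this stripe" table:
     * remember which stripes returned read errors on which disk, forget an
     * entry once the stripe has been rewritten to that disk, and kick the
     * disk when too many of its stripes are marked bad.  Collisions in the
     * small open-addressed table are simply ignored here.
     */
    #define BAD_TABLE_SLOTS  1024   /* stripes remembered per array            */
    #define MAX_DISKS        16     /* illustrative upper bound on raid disks  */
    #define KICK_THRESHOLD   64     /* bad stripes before the disk is disabled */

    struct bad_entry {
            bool     used;
            int      disk;          /* raid disk index        */
            uint64_t stripe;        /* stripe (sector) number */
    };

    struct bad_table {
            struct bad_entry slot[BAD_TABLE_SLOTS];
            unsigned int bad_count[MAX_DISKS];
    };

    static unsigned int slot_of(int disk, uint64_t stripe)
    {
            return (unsigned int)((stripe * 2654435761u) ^ (unsigned int)disk)
                    % BAD_TABLE_SLOTS;
    }

    /* called after a read error; returns true if the disk should be kicked */
    static bool note_read_error(struct bad_table *t, int disk, uint64_t stripe)
    {
            struct bad_entry *e = &t->slot[slot_of(disk, stripe)];

            if (!e->used) {
                    e->used = true;
                    e->disk = disk;
                    e->stripe = stripe;
                    t->bad_count[disk]++;
            }
            return t->bad_count[disk] >= KICK_THRESHOLD;
    }

    /* called after the stripe has been successfully rewritten to the disk */
    static void note_rewrite_ok(struct bad_table *t, int disk, uint64_t stripe)
    {
            struct bad_entry *e = &t->slot[slot_of(disk, stripe)];

            if (e->used && e->disk == disk && e->stripe == stripe) {
                    e->used = false;
                    t->bad_count[disk]--;
            }
    }

    int main(void)
    {
            static struct bad_table t;      /* zero-initialised */

            /* disk 2 returns a read error on stripe 123456 */
            if (note_read_error(&t, 2, 123456))
                    printf("disk 2 would be kicked\n");

            /* the stripe is later reconstructed and rewritten successfully */
            note_rewrite_ok(&t, 2, 123456);
            printf("bad stripes remembered for disk 2: %u\n", t.bad_count[2]);
            return 0;
    }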

summary (todo..!):
 - I think we shouldn't care about drive lockups
 - a 'this disk is bad for this stripe' table would be good to speed up
arrays with partially failed drives, and it is easy to implement
 - add a switch to enable the 'partially failed' feature on a per-array
basis once the patch is applied (e.g. to remain compatible with buggy,
forever-locking device drivers)

 well?


-- 
 dap



confused raid1

2005-08-15 Thread Jon Lewis
I've inherited responsibility for a server with a root raid1 that 
degrades every time the system is rebooted.  It's a 2.4.x kernel.  I've 
got both raidutils and mdadm available.


The raid1 device is supposed to be /dev/hde1 & /dev/hdg1 with /dev/hdc1 as 
a spare.  I believe it was created with raidutils and the following 
portion of /etc/raidtab:


raiddev /dev/md1
raid-level  1
nr-raid-disks   2
chunk-size  64k
persistent-superblock   1
nr-spare-disks  1
device  /dev/hde1
raid-disk 0
device  /dev/hdg1
raid-disk 1
device  /dev/hdc1
spare-disk 0

The output of mdadm -E concerns me though.

# mdadm -E /dev/hdc1
/dev/hdc1:
  Magic : a92b4efc
Version : 00.90.00
   UUID : 8b65fa52:21176cc9:cbb74149:c418b5a4
  Creation Time : Tue Jan 13 13:21:41 2004
 Raid Level : raid1
Device Size : 30716160 (29.29 GiB 31.45 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 1

Update Time : Thu Aug 11 08:38:59 2005
  State : dirty, no-errors
 Active Devices : 2
Working Devices : 2
 Failed Devices : -1
  Spare Devices : 0
   Checksum : 6a4dddb8 - correct
 Events : 0.195

      Number   Major   Minor   RaidDevice State
this     1      22        1        1      active sync   /dev/hdc1
   0     0      33        1        0      active sync   /dev/hde1
   1     1      22        1        1      active sync   /dev/hdc1

# mdadm -E /dev/hde1
/dev/hde1:
  Magic : a92b4efc
Version : 00.90.00
   UUID : 8b65fa52:21176cc9:cbb74149:c418b5a4
  Creation Time : Tue Jan 13 13:21:41 2004
 Raid Level : raid1
Device Size : 30716160 (29.29 GiB 31.45 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 1

Update Time : Mon Aug 15 11:16:43 2005
  State : dirty, no-errors
 Active Devices : 2
Working Devices : 2
 Failed Devices : -1
  Spare Devices : 0
   Checksum : 6a5348c9 - correct
 Events : 0.199


      Number   Major   Minor   RaidDevice State
this     0      33        1        0      active sync   /dev/hde1
   0     0      33        1        0      active sync   /dev/hde1
   1     1      34        1        1      active sync   /dev/hdg1

# mdadm -E /dev/hdg1
/dev/hdg1:
  Magic : a92b4efc
Version : 00.90.00
   UUID : 8b65fa52:21176cc9:cbb74149:c418b5a4
  Creation Time : Tue Jan 13 13:21:41 2004
 Raid Level : raid1
Device Size : 30716160 (29.29 GiB 31.45 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 1

Update Time : Mon Aug 15 11:16:43 2005
  State : dirty, no-errors
 Active Devices : 2
Working Devices : 2
 Failed Devices : -1
  Spare Devices : 0
   Checksum : 6a5348cc - correct
 Events : 0.199


      Number   Major   Minor   RaidDevice State
this     1      34        1        1      active sync   /dev/hdg1
   0     0      33        1        0      active sync   /dev/hde1
   1     1      34        1        1      active sync   /dev/hdg1

Shouldn't total devices be at least 2?  How can failed devices be -1?

When the system reboots, md1 becomes just /dev/hdc1.  I've used mdadm to 
add hde1, fail and then remove hdc1, and add hdg1.  How can I repair the 
array such that it will survive the next reboot and keep hde1 and hdg1 as 
the working devices?


md1 : active raid1 hdg1[1] hde1[0]
  30716160 blocks [2/2] [UU]

--
 Jon Lewis   |  I route
 Senior Network Engineer |  therefore you are
 Atlantic Net| 
_ http://www.lewis.org/~jlewis/pgp for PGP public key_



Re: confused raid1

2005-08-15 Thread Tyler

A few questions:

a) what kernel version are you using?
b) what mdadm version are you using?
c) what messages concerning the raid are in the log when it's failing
one of the drives and making hdc1 an active drive?
d) what linux distribution (and version) are you using?

Tyler.

Jon Lewis wrote:

I've inherited responsibility for a server with a root raid1 that 
degrades every time the system is rebooted.  It's a 2.4.x kernel.  
I've got both raidutils and mdadm available.


The raid1 device is supposed to be /dev/hde1 & /dev/hdg1 with 
/dev/hdc1 as a spare.  I believe it was created with raidutils and the 
following portion of /etc/raidtab:


raiddev /dev/md1
raid-level  1
nr-raid-disks   2
chunk-size  64k
persistent-superblock   1
nr-spare-disks  1
device  /dev/hde1
raid-disk 0
device  /dev/hdg1
raid-disk 1
device  /dev/hdc1
spare-disk 0

The output of mdadm -E concerns me though.

# mdadm -E /dev/hdc1
/dev/hdc1:
  Magic : a92b4efc
Version : 00.90.00
   UUID : 8b65fa52:21176cc9:cbb74149:c418b5a4
  Creation Time : Tue Jan 13 13:21:41 2004
 Raid Level : raid1
Device Size : 30716160 (29.29 GiB 31.45 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 1

Update Time : Thu Aug 11 08:38:59 2005
  State : dirty, no-errors
 Active Devices : 2
Working Devices : 2
 Failed Devices : -1
  Spare Devices : 0
   Checksum : 6a4dddb8 - correct
 Events : 0.195

      Number   Major   Minor   RaidDevice State
this     1      22        1        1      active sync   /dev/hdc1
   0     0      33        1        0      active sync   /dev/hde1
   1     1      22        1        1      active sync   /dev/hdc1

# mdadm -E /dev/hde1
/dev/hde1:
  Magic : a92b4efc
Version : 00.90.00
   UUID : 8b65fa52:21176cc9:cbb74149:c418b5a4
  Creation Time : Tue Jan 13 13:21:41 2004
 Raid Level : raid1
Device Size : 30716160 (29.29 GiB 31.45 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 1

Update Time : Mon Aug 15 11:16:43 2005
  State : dirty, no-errors
 Active Devices : 2
Working Devices : 2
 Failed Devices : -1
  Spare Devices : 0
   Checksum : 6a5348c9 - correct
 Events : 0.199


      Number   Major   Minor   RaidDevice State
this     0      33        1        0      active sync   /dev/hde1
   0     0      33        1        0      active sync   /dev/hde1
   1     1      34        1        1      active sync   /dev/hdg1

# mdadm -E /dev/hdg1
/dev/hdg1:
  Magic : a92b4efc
Version : 00.90.00
   UUID : 8b65fa52:21176cc9:cbb74149:c418b5a4
  Creation Time : Tue Jan 13 13:21:41 2004
 Raid Level : raid1
Device Size : 30716160 (29.29 GiB 31.45 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 1

Update Time : Mon Aug 15 11:16:43 2005
  State : dirty, no-errors
 Active Devices : 2
Working Devices : 2
 Failed Devices : -1
  Spare Devices : 0
   Checksum : 6a5348cc - correct
 Events : 0.199


      Number   Major   Minor   RaidDevice State
this     1      34        1        1      active sync   /dev/hdg1
   0     0      33        1        0      active sync   /dev/hde1
   1     1      34        1        1      active sync   /dev/hdg1

Shouldn't total devices be at least 2?  How can failed devices be -1?

When the system reboots, md1 becomes just /dev/hdc1.  I've used mdadm 
to add hde1, fail and then remove hdc1, and add hdg1.  How can I 
repair the array such that it will survive the next reboot and keep 
hde1 and hdg1 as the working devices?


md1 : active raid1 hdg1[1] hde1[0]
  30716160 blocks [2/2] [UU]

--
 Jon Lewis   |  I route
 Senior Network Engineer |  therefore you are
 Atlantic Net| _ 
http://www.lewis.org/~jlewis/pgp for PGP public key_









Re: confused raid1

2005-08-15 Thread Jon Lewis

On Mon, 15 Aug 2005, Mario 'BitKoenig' Holbe wrote:


Well, reading the kernel boot messages could help.
Perhaps the hdc1 partition is type fd (raid autodetect) and the driver
for hd[eg] is not in place when the RAID autodetection is running.


I should have included that.  All 3 of them are type fd.

--
 Jon Lewis   |  I route
 Senior Network Engineer |  therefore you are
 Atlantic Net| 
_ http://www.lewis.org/~jlewis/pgp for PGP public key_



Re: confused raid1

2005-08-15 Thread Tyler

Well, I guess you won't know if you don't try.

Do your other servers report the same error regarding the module in 
their logs upon bootup?


Tyler.

Jon Lewis wrote:


On Mon, 15 Aug 2005, Tyler wrote:


Try this suggestion (regarding modules.conf).

https://www.redhat.com/archives/fedora-list/2003-December/msg05205.html



I don't see why that modules.conf addition would be necessary / make a 
difference.  I have other servers with root-raid1 that haven't needed 
that, and mkinitrd is smart enough (reads /etc/raidtab) to know that 
raid1 is needed and loads the raid1 module in the initrd linuxrc script.


--
 Jon Lewis   |  I route
 Senior Network Engineer |  therefore you are
 Atlantic Net| _ 
http://www.lewis.org/~jlewis/pgp for PGP public key_








Re: confused raid1

2005-08-15 Thread Neil Brown
On Monday August 15, [EMAIL PROTECTED] wrote:
 
 It's quite probable, that before the following reboot, md1 was hdc1 
 and hde1.
 
 Aug  9 02:02:39  kernel: md: created md1
 Aug  9 02:02:39  kernel: md: bind<hdc1,1>
 Aug  9 02:02:39  kernel: md: bind<hde1,2>
 Aug  9 02:02:39  kernel: md: bind<hdg1,3>
 Aug  9 02:02:39  kernel: md: running: <hdg1><hde1><hdc1>
 Aug  9 02:02:39  kernel: md: hdg1's event counter: 00b0
 Aug  9 02:02:39  kernel: md: hde1's event counter: 00b4
 Aug  9 02:02:39  kernel: md: hdc1's event counter: 00b4
 Aug  9 02:02:39  kernel: md: superblock update time inconsistency -- using 
 the most recent one
 Aug  9 02:02:39  kernel: md: freshest: hde1
 Aug  9 02:02:39  kernel: md: kicking non-fresh hdg1 from array!
 Aug  9 02:02:39  kernel: md: unbind<hdg1,2>
 Aug  9 02:02:39  kernel: md: export_rdev(hdg1)
 Aug  9 02:02:39  kernel: md: RAID level 1 does not need chunksize! Continuing 
 anyway.
 Aug  9 02:02:39  kernel: kmod: failed to exec /sbin/modprobe -s -k 
 md-personality-3, errno = 2
 Aug  9 02:02:39  kernel: md: personality 3 is not loaded!
 Aug  9 02:02:39  kernel: md :do_md_run() returned -22
 Aug  9 02:02:39  kernel: md: md1 stopped.
 Aug  9 02:02:39  kernel: md: unbind<hde1,1>
 Aug  9 02:02:39  kernel: md: export_rdev(hde1)
 Aug  9 02:02:39  kernel: md: unbind<hdc1,0>
 Aug  9 02:02:39  kernel: md: export_rdev(hdc1)
 Aug  9 02:02:39  kernel: md: ... autorun DONE.

So md-personality-3 doesn't get loaded, and the array doesn't get
started at all.  i.e. the 'partition type FD' is not having any
useful effect.

So how does the array get started?  
Are there other messages about md later in the kernel logs that talk
about md1?

My guess is that 'raidstart' is being used to start the array
somewhere along the line.  'raidstart' does not start raid arrays
reliably.  Don't use it.  Remove it from your system.  It is unsafe.

If you cannot get the raid1 module to be loaded properly, make sure
that 'mdadm' is being used to assemble the array.  It has a much
better chance of getting it right.

NeilBrown
