Re: Reading takes 100% precedence over writes for mdadm+raid5?

2007-12-06 Thread David Rees
On Dec 6, 2007 1:06 AM, Justin Piszcz [EMAIL PROTECTED] wrote:
 On Wed, 5 Dec 2007, Jon Nelson wrote:

  I saw something really similar while moving some very large (300MB to
  4GB) files.
  I was really surprised to see actual disk I/O (as measured by dstat)
  be really horrible.

 Any work-arounds, or just don't perform heavy reads the same time as
 writes?

What kernel are you using? (Did I miss it in your OP?)

The per-device write throttling in 2.6.24 should help significantly,
have you tried the latest -rc and compared to your current kernel?

-Dave


Re: Reading takes 100% precedence over writes for mdadm+raid5?

2007-12-06 Thread Justin Piszcz



On Thu, 6 Dec 2007, David Rees wrote:


On Dec 6, 2007 1:06 AM, Justin Piszcz [EMAIL PROTECTED] wrote:

On Wed, 5 Dec 2007, Jon Nelson wrote:


I saw something really similar while moving some very large (300MB to
4GB) files.
I was really surprised to see actual disk I/O (as measured by dstat)
be really horrible.


Any work-arounds, or just don't perform heavy reads the same time as
writes?


What kernel are you using? (Did I miss it in your OP?)

The per-device write throttling in 2.6.24 should help significantly,
have you tried the latest -rc and compared to your current kernel?

-Dave



2.6.23.9-- thanks will try out the latest -rc or wait for 2.6.24!

Justin.


Re: Reading takes 100% precedence over writes for mdadm+raid5?

2007-12-06 Thread Jon Nelson
On 12/6/07, David Rees [EMAIL PROTECTED] wrote:
 On Dec 6, 2007 1:06 AM, Justin Piszcz [EMAIL PROTECTED] wrote:
  On Wed, 5 Dec 2007, Jon Nelson wrote:
 
   I saw something really similar while moving some very large (300MB to
   4GB) files.
   I was really surprised to see actual disk I/O (as measured by dstat)
   be really horrible.
 
  Any work-arounds, or just don't perform heavy reads the same time as
  writes?

 What kernel are you using? (Did I miss it in your OP?)

 The per-device write throttling in 2.6.24 should help significantly,
 have you tried the latest -rc and compared to your current kernel?

I was using 2.6.22.12 I think (openSUSE kernel).
I can try using pretty much any kernel - I'm preparing to do an
unrelated test using 2.6.24rc4 this weekend. If I remember I'll try to
see what disk I/O looks like there.
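
For reference, the sort of test I have in mind is roughly this (array,
mount point and file names are just examples, not my actual setup):

  # watch per-disk throughput while the test runs
  dstat -d 5 &
  # a few sustained writers into the filesystem on the array
  for i in 1 2 3 4; do
      dd if=/dev/zero of=/mnt/raid/w$i.out bs=1M count=4096 &
  done
  # and one concurrent reader
  dd if=/mnt/raid/bigfile of=/dev/null bs=1M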


-- 
Jon


Re: Kernel 2.6.23.9 + mdadm 2.6.2-2 + Auto rebuild RAID1?

2007-12-06 Thread Jan Engelhardt

On Dec 5 2007 19:29, Nix wrote:

  On Dec 1 2007 06:19, Justin Piszcz wrote:

   RAID1, 0.90.03 superblocks (in order to be compatible with LILO, if
   you use 1.x superblocks with LILO you can't boot)

  Says who? (Don't use LILO ;-)

 Well, your kernels must be on a 0.90-superblocked RAID-0 or RAID-1
 device. It can't handle booting off 1.x superblocks nor RAID-[56]
 (not that I could really hope for the latter).

If the superblock is at the end (which is the case for 0.90 and 1.0),
then the offsets for a specific block on /dev/mdX match the ones for /dev/sda,
so it should be easy to use lilo on 1.0 too, no?
(Yes, it will not work with 1.1 or 1.2.)


Re: assemble vs create an array.......

2007-12-06 Thread Dragos

Thank you.
I want to make sure I understand.

1- Does it matter which permutation of drives I use for xfs_repair (as 
long as it tells me that the Structure needs cleaning)? When it comes to 
linux I consider myself at intermediate level, but I am a beginner when 
it comes to raid and filesystem issues.


2- After I do it, assuming that it worked, how do I reintegrate the 
'missing' drive while keeping my data?


Thank you for your time.
Dragos


David Greaves wrote:

Dragos wrote:
  

Thank you for your very fast answers.

First I tried 'fsck -n' on the existing array. The answer was that if I
wanted to check an XFS partition I should use 'xfs_check'. That seems to
say that my array was formatted with xfs, not reiserfs. Am I correct?

Then I tried the different permutations:
mdadm --create /dev/md0 --raid-devices=3 --level=5 missing /dev/sda1
/dev/sdb1
mount /dev/md0 temp
mdadm --stop --scan

mdadm --create /dev/md0 --raid-devices=3 --level=5 /dev/sda1 missing
/dev/sdb1
mount /dev/md0 temp
mdadm --stop --scan



[etc]

  

With some arrays mount reported:
  mount: you must specify the filesystem type
and with others:
  mount: Structure needs cleaning

No choice seems to have been successful.



OK, not as good as you could have hoped for.

Make sure you have the latest xfs tools.

You may want to try xfs_repair; you can use the -n option (I think - check
the man page).

You may need to force it to ignore the log.

David



  



Re: Reading takes 100% precedence over writes for mdadm+raid5?

2007-12-06 Thread Bill Davidsen

Justin Piszcz wrote:
root      2206     1  4 Dec02 ?        00:10:37 dd if=/dev/zero of=1.out bs=1M
root      2207     1  4 Dec02 ?        00:10:38 dd if=/dev/zero of=2.out bs=1M
root      2208     1  4 Dec02 ?        00:10:35 dd if=/dev/zero of=3.out bs=1M
root      2209     1  4 Dec02 ?        00:10:45 dd if=/dev/zero of=4.out bs=1M
root      2210     1  4 Dec02 ?        00:10:35 dd if=/dev/zero of=5.out bs=1M
root      2211     1  4 Dec02 ?        00:10:35 dd if=/dev/zero of=6.out bs=1M
root      2212     1  4 Dec02 ?        00:10:30 dd if=/dev/zero of=7.out bs=1M
root      2213     1  4 Dec02 ?        00:10:42 dd if=/dev/zero of=8.out bs=1M
root      2214     1  4 Dec02 ?        00:10:35 dd if=/dev/zero of=9.out bs=1M
root      2215     1  4 Dec02 ?        00:10:37 dd if=/dev/zero of=10.out bs=1M
root      3080 24.6  0.0  10356  1672 ?        D    01:22   5:51 dd if=/dev/md3 of=/dev/null bs=1M


I was curious: when running 10 dd's (which are writing to the RAID 5) 
everything is fine, no issues, then suddenly all of them go into D-state 
and the read gets 100% priority?


Is this normal?


I'm jumping back to the start of this thread, because after reading all 
the discussion I noticed that you are mixing apples and oranges here. 
Your write programs are going to files in the filesystem, and your read 
is going against the raw device. That may explain why you see something 
I haven't noticed doing all filesystem i/o.


I am going to do a large rsync to another filesystem in the next two 
days, I will turn on some measurements when I do. But if you are just 
investigating this behavior, perhaps you could retry with a single read 
from a file rather than the device.
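
Something along these lines is what I mean - all filesystem I/O, with the
reader going through a file instead of the raw md device (the mount point
and file names here are only examples):

  # writers, as in your test
  dd if=/dev/zero of=/mnt/md3/1.out bs=1M &
  dd if=/dev/zero of=/mnt/md3/2.out bs=1M &
  # ...
  # reader from a large file on the same filesystem, not from /dev/md3
  dd if=/mnt/md3/bigfile of=/dev/null bs=1M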


   [...snip...]

--
Bill Davidsen [EMAIL PROTECTED]
 Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over... Otto von Bismark 





external bitmaps.. and more

2007-12-06 Thread Michael Tokarev
I've come across a situation where external MD bitmaps
aren't usable on any standard Linux distribution
unless special (non-trivial) actions are taken.

First, a small buglet in mdadm, or two.

It's not possible to specify --bitmap= on the assemble
command line - the option seems to be ignored.  But it's
honored when specified in the config file.

Also, mdadm should probably warn or even refuse to
do things (unless --force is given) when an array being
assembled uses an external bitmap but the bitmap file
isn't specified.
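
To be concrete, the two forms I'm comparing are roughly these (device
names, UUID and bitmap path are placeholders):

  # seems to be ignored:
  mdadm --assemble /dev/md3 --bitmap=/stuff/bitmap /dev/sd[bcde]1

  # honored, when put into mdadm.conf instead:
  ARRAY /dev/md3 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx bitmap=/stuff/bitmap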

Now for something more.. interesting.

The thing is that when an external bitmap is being used
for an array, and that bitmap resides on another filesystem,
all common distributions fail to start/mount and to
shutdown/umount arrays/filesystems properly, because
all array starts/stops are done in one script and all
mounts/umounts in another, but for external bitmaps to work
the two have to be interleaved with each other.

Here's why.

Suppose I have an array mdX which uses the bitmap /stuff/bitmap,
where /stuff is a separate filesystem.  In this case,
during startup, /stuff must be mounted before bringing up
mdX, and during shutdown, mdX must be stopped before
trying to umount /stuff.  Otherwise, during startup mdX will
not find /stuff/bitmap, and during shutdown the /stuff filesystem
is busy because mdX is holding a reference to it.

Doing things the simple way doesn't work: if I list mdX to be
mounted as /data in /etc/fstab, then -- since mdX hasn't been
assembled by mdadm (due to the missing bitmap) -- the system will
not start, and asks for the emergency root password...

Oh well.

So the only solution for this so far is to convert the md array
assemble/stop operations into... MOUNTS/UMOUNTS!  That is, specify
all the necessary information in /etc/fstab - for both arrays and
filesystems, with proper ordering in the order column.
Ghrm.
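
Roughly like this, to make it concrete (device names and mount points are
invented for illustration only):

  # /stuff (which holds the bitmap) has to come up first
  /dev/sdf1   /stuff     ext3   defaults               0  2
  # pseudo-entry which a mount.md/fsck.md wrapper turns into mdadm calls
  /dev/md3    /dev/md3   md     bitmap=/stuff/bitmap   0  0
  # the real filesystem living on the array
  /dev/md3    /data      xfs    defaults               0  0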

Technically speaking it's not difficult - mount.md and fsck.md
wrappers for mdadm are trivial to write (I even tried that
myself - a quick-n-dirty 5-minute hack works).  But it's...
ugly.
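
For what it's worth, the kind of wrapper I mean is more or less this (a
simplified sketch from memory, not the exact script, with no option
parsing or error handling at all):

  #!/bin/sh
  # /sbin/mount.md -- invoked by mount(8) for fstab entries of type "md".
  # $1 is the fstab "device" field, e.g. /dev/md3; the mount point and
  # options are ignored here; the bitmap path comes from mdadm.conf,
  # where it is honored.
  exec mdadm --assemble --scan "$1"

  # the matching umount.md would essentially just do:  mdadm --stop "$1"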

But I don't see any other reasonable solutions.  The alternative
is additional scripts to start/stop/mount/umount filesystems
residing on or related to advanced arrays (with external
bitmaps in this case) - but looking at how much code is in
current startup scripts around mounting/fscking, and keeping
in mind that mount/umount do not support an alternative
/etc/fstab, this is umm.. even more ugly...

Comments anyone?

Thanks.

/mjt

P.S.  Why external bitmaps in the first place?  Well, that's
a good question, and here's a (hopefully also good) answer:
when there are enough disk drives available to dedicate
some of them to the bitmap(s), and there's a large array(s)
with dynamic content (many writes), and the content is
important enough to care about data safety wrt possible
power losses and kernel OOPSes and whatnot, placing the bitmap
on another disk helps a lot with resyncs (it's not
about resync speed, it's about general resync UNRELIABILITY,
which is another topic - hopefully long-term Linux
raid gurus will understand me here), but does not slow
down writes hugely due to constant disk seeks when updating
bitmaps.  Those seeks tend to have a huge impact on random
write performance.
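
A made-up example of what I mean by dedicating a disk to the bitmap
(all names and sizes below are only illustrative):

  # /bitmaps is a small filesystem on its own spindle, away from the array
  mdadm --create /dev/md3 --level=5 --raid-devices=6 \
        --bitmap=/bitmaps/md3-bitmap --bitmap-chunk=4096 \
        /dev/sd[b-g]1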


Re: assemble vs create an array.......

2007-12-06 Thread Michael Tokarev
[Cc'd to xfs list as it contains something related]

Dragos wrote:
 Thank you.
 I want to make sure I understand.

[Some background for the XFS list.  The talk is about a broken Linux software
raid (the reason for the breakage isn't relevant anymore).  The OP seems to
have lost the order of drives in his array, and is now trying to create a new
array on top, trying different combinations of drives.  The filesystem there
WAS XFS.  One point is that Linux refuses to mount it, saying
"structure needs cleaning".  This all is mostly md-related, but there
are several XFS-related questions and concerns too.]

 
 1- Does it matter which permutation of drives I use for xfs_repair (as
 long as it tells me that the Structure needs cleaning)? When it comes to
 linux I consider myself at intermediate level, but I am a beginner when
 it comes to raid and filesystem issues.

The permutation DOES MATTER - for all the devices.
Linux, when mounting an fs, only looks at the superblock of the filesystem,
which is usually located at the beginning of the device.

So in each case where Linux actually recognizes the filesystem (instead of
seeing complete garbage), the same device is the first one - i.e., this
is how you found your first device.  The rest may still be out of order.

Raid5 data is laid like this (with 3 drives for simplicity, it's similar
with more drives):

   DiskA   DiskB   DiskC
Blk0   Data0   Data1   P0
Blk1   P1  Data2   Data3
Blk2   Data4   P2  Data5
Blk3   Data6   Data7   P3
... and so on ...

where your actual data blocks are Data0, Data1, ... DataN,
and PX are parity blocks.

As long as DiskA remains in this position, the beginning of
the array is Data0 block, -- hence linux sees the beginning
of the filesystem and recognizes it.  But you can switch
DiskB and DiskC still, and the rest of the data will be
complete garbage, only data blocks on DiskA will be in
place.

So you still need to find order of the other drives
(you found your first drive, DriveA, already).

Note also that if the Data1 block is all-zeros (a situation
which is unlikely for a non-empty filesystem), P0 (the first
parity block) will be exactly the same as Data0, because
XORing anything with zeros gives the same anything again
(XOR is the operation used to calculate parity blocks in
RAID5).  So there's still a remote chance you'll find TWO
candidate "first" disks...

What to do is to give xfs_repair a try for each permutation,
but again without letting it actually fix anything.
Just run it in read-only mode and see which combination
of drives gives fewer errors, or no fatal errors (there
may be several similar combinations, with the same order
of drives but with a different drive missing).
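
In other words, something like this for each candidate ordering (device
names are examples, and note that "missing" stands in for the dropped
drive, so nothing gets resynced):

  mdadm --create /dev/md0 --level=5 --raid-devices=3 \
        missing /dev/sda1 /dev/sdb1
  xfs_repair -n /dev/md0     # -n: no-modify mode, only report problems
  mdadm --stop /dev/md0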

It's sad that xfs refuses mount when structure needs
cleaning - the best way here is to actually mount it
and see how it looks like, instead of trying repair
tools.  Is there some option to force-mount it still
(in readonly mode, knowing it may OOPs kernel etc)?

I'm not very familiar with xfs yet - it seems to be
much faster than ext3 for our workload (mostly databases),
and I'm experimenting with it slowly.  But this very
thread prompted me to think.  If I can't force-mount it
(or browse it using other ways) as I can almost always
do with (somewhat?) broken ext[23] just to examine things,
maybe I'm trying it before it's mature enough? ;)  Note
the smile, but note there's a bit of joke in every joke... :)

 2- After I do it, assuming that it worked, how do I reintegrate the
 'missing' drive while keeping my data?

Just add it back -- mdadm --add /dev/mdX /dev/sdYZ.
But don't do that till you actually see your data.

/mjt


Re: assemble vs create an array.......

2007-12-06 Thread Eric Sandeen
Michael Tokarev wrote:

 It's sad that xfs refuses mount when structure needs
 cleaning - the best way here is to actually mount it
 and see how it looks like, instead of trying repair
 tools.  Is there some option to force-mount it still
 (in readonly mode, knowing it may OOPs kernel etc)?

depends what went wrong, but in general that error means that metadata
corruption was encountered which was sufficient for xfs to abort
whatever it was doing.  It's not done lightly; it's likely bailing out
because it had no other choice.

You can't force mount something which is sufficiently corrupted that
xfs can't understand it anymore...  IOW you can't traverse and read
corrupted/scrambled metadata, no mount option can help you.  :)

If the shutdown were encountered during use, you could maybe avoid the
bad metadata.  If it's during mount that's probably a more fundamental
problem.

Kernel messages from when you get the "structure needs cleaning" error would
be a clue as to what it actually hit.

-Eric


Re: raid6 check/repair

2007-12-06 Thread Andre Noll
On 15:31, Bill Davidsen wrote:

 Thiemo posted metacode which I find appears correct,

It assumes that _exactly_ one disk has bad data which is hard to verify
in practice. But yes, it's probably the best one can do if both P and
Q happen to be incorrect. IMHO mdadm shouldn't do this automatically
though and should always keep backup copies of the data it overwrites
with good data.
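
(For reference, in the usual RAID-6 syndrome notation, as in hpa's raid6
paper: with stored P, Q and recomputed P', Q', form

   P* = P xor P',   Q* = Q xor Q'

If exactly one data disk z is bad, then Q* = g^z . P* in GF(2^8), so
z = log_g(Q*/P*) and the block can be rewritten.  If more than one device
is actually bad, the same arithmetic can still produce a plausible-looking
z, and "repairing" that block would silently overwrite good data - hence
my preference for keeping backup copies.)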

Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe




[PATCH] (2nd try) force parallel resync

2007-12-06 Thread Bernd Schubert
Hello,

here is the second version of the patch. With this version the sync_thread
is also woken up when /sys/block/*/md/sync_force_parallel is set.
Though I still don't understand why md_wakeup_thread() is not working.
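
With the patch applied, usage is just (md0 is an example device):

  # let this array resync in parallel with others sharing the same disks
  echo 1 > /sys/block/md0/md/sync_force_parallel
  # 0 restores the default serialized behaviour
  echo 0 > /sys/block/md0/md/sync_force_parallel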


Signed-off-by: Bernd Schubert [EMAIL PROTECTED]


Index: linux-2.6.22/drivers/md/md.c
===
--- linux-2.6.22.orig/drivers/md/md.c   2007-12-06 19:51:55.0 +0100
+++ linux-2.6.22/drivers/md/md.c2007-12-06 19:52:33.0 +0100
@@ -2843,6 +2843,41 @@ __ATTR(sync_speed_max, S_IRUGO|S_IWUSR, 
 
 
 static ssize_t
+sync_force_parallel_show(mddev_t *mddev, char *page)
+{
+	return sprintf(page, "%d\n", mddev->parallel_resync);
+}
+
+static ssize_t
+sync_force_parallel_store(mddev_t *mddev, const char *buf, size_t len)
+{
+	char *e;
+	unsigned long n = simple_strtoul(buf, &e, 10);
+
+	if (!*buf || (*e && *e != '\n') || (n != 0 && n != 1))
+		return -EINVAL;
+
+	mddev->parallel_resync = n;
+
+	if (mddev->sync_thread) {
+		dprintk("md: waking up MD thread %s.\n",
+			mddev->sync_thread->tsk->comm);
+		set_bit(THREAD_WAKEUP, &mddev->sync_thread->flags);
+		wake_up_process(mddev->sync_thread->tsk);
+
+		/* FIXME: why does md_wakeup_thread() not work?,
+		   somehow related to:  wake_up(&thread->wqueue);
+		   md_wakeup_thread(mddev->sync_thread); */
+	}
+	return len;
+}
+
+/* force parallel resync, even with shared block devices */
+static struct md_sysfs_entry md_sync_force_parallel =
+__ATTR(sync_force_parallel, S_IRUGO|S_IWUSR,
+   sync_force_parallel_show, sync_force_parallel_store);
+
+static ssize_t
 sync_speed_show(mddev_t *mddev, char *page)
 {
unsigned long resync, dt, db;
@@ -2980,6 +3015,7 @@ static struct attribute *md_redundancy_a
md_sync_min.attr,
md_sync_max.attr,
md_sync_speed.attr,
+   md_sync_force_parallel.attr,
md_sync_completed.attr,
md_suspend_lo.attr,
md_suspend_hi.attr,
@@ -5264,8 +5300,9 @@ void md_do_sync(mddev_t *mddev)
ITERATE_MDDEV(mddev2,tmp) {
if (mddev2 == mddev)
continue;
-		if (mddev2->curr_resync &&
-		    match_mddev_units(mddev,mddev2)) {
+		if (!mddev->parallel_resync
+		    && mddev2->curr_resync
+		    && match_mddev_units(mddev,mddev2)) {
			DEFINE_WAIT(wq);
			if (mddev < mddev2 && mddev->curr_resync == 2) {
/* arbitrarily yield */
Index: linux-2.6.22/include/linux/raid/md_k.h
===
--- linux-2.6.22.orig/include/linux/raid/md_k.h 2007-12-06 19:51:55.0 
+0100
+++ linux-2.6.22/include/linux/raid/md_k.h  2007-12-06 19:52:33.0 
+0100
@@ -170,6 +170,9 @@ struct mddev_s
int sync_speed_min;
int sync_speed_max;
 
+   /* resync even though the same disks are shared among md-devices */
+   int parallel_resync;
+
int ok_start_degraded;
/* recovery/resync flags 
 * NEEDED:   we might need to start a resync/recover


-- 
Bernd Schubert
Q-Leap Networks GmbH


RAID mapper device size wrong after replacing drives

2007-12-06 Thread Ian P

Hi,

I have a problem with my RAID array under Linux after upgrading to larger
drives. I have a machine with Windows and Linux dual-boot which had a pair
of 160GB drives in a RAID-1 mirror with 3 partitions: partition 1 = Windows
boot partition (FAT32), partition 2 = Linux /boot (ext3), partition 3 =
Windows system (NTFS). The Linux root is on a separate physical drive. The
dual boot is via Grub installed on the /boot partition, and this was all
working fine.

But I just upgraded the drives in the RAID pair, replacing them with 500GB
drives. I did this by replacing one of the 160s with a new 500 and letting
the RAID copy the drive, splitting the drives out of the RAID array and
increasing the size of the last partition of the 500 (which I did under
Windows since it's the Windows partition), then replacing the last 160 with
the other 500 and having the RAID controller create a new array with the two
500s, copying the drive that I'd copied from the 160. This worked great for
Windows, and that now boots and sees a 500GB RAID drive with all the data
intact.

However, Linux has a problem and will not now boot all the way. It reports
that the RAID /dev/mapper volume failed - the partition is beyond the
boundaries of the disk. Running fdisk shows that it is seeing the larger
partition, but still sees the size of the RAID /dev/mapper drive as 160GB.
Here is the fdisk output for one of the physical drives and for the RAID
mapper drive:

Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1         625     5018624    b  W95 FAT32
Partition 1 does not end on cylinder boundary.
/dev/sda2             626         637       96390   83  Linux
/dev/sda3   *         638       60802   483264512    7  HPFS/NTFS


Disk /dev/mapper/isw_bcifcijdi_Raid-0: 163.9 GB, 163925983232 bytes
255 heads, 63 sectors/track, 19929 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

                            Device Boot      Start       End     Blocks   Id  System
/dev/mapper/isw_bcifcijdi_Raid-0p1               1        625    5018624    b  W95 FAT32
Partition 1 does not end on cylinder boundary.
/dev/mapper/isw_bcifcijdi_Raid-0p2             626        637      96390   83  Linux
/dev/mapper/isw_bcifcijdi_Raid-0p3   *         638      60802  483264512    7  HPFS/NTFS


They differ only in the drive capacity and number of cylinders.

I started to try to run a Linux reinstall, but it reports that the partition
table on the mapper drive is invalid, giving an option to re-initialize it
but saying that doing so will lose all the data on the drive.

So questions:

1. Where is the drive size information for the RAID mapper drive kept, and
is there some way to patch it?

2. Is there some way to re-initialize the RAID mapper drive without
destroying the data on the drive?

Thanks,
Ian
-- 
View this message in context: 
http://www.nabble.com/RAID-mapper-device-size-wrong-after-replacing-drives-tf4958354.html#a14200241
Sent from the linux-raid mailing list archive at Nabble.com.



Re: assemble vs create an array.......

2007-12-06 Thread David Chinner
On Thu, Dec 06, 2007 at 07:39:28PM +0300, Michael Tokarev wrote:
 What to do is to give xfs_repair a try for each permutation,
 but again without letting it actually fix anything.
 Just run it in read-only mode and see which combination
 of drives gives fewer errors, or no fatal errors (there
 may be several similar combinations, with the same order
 of drives but with a different drive missing).

Ugggh. 

 It's sad that xfs refuses mount when structure needs
 cleaning - the best way here is to actually mount it
 and see how it looks like, instead of trying repair
 tools. 

It's self protection - if you try to write to a corrupted filesystem,
you'll only make the corruption worse. Mounting involves log
recovery, which writes to the filesystem.

 Is there some option to force-mount it still
 (in readonly mode, knowing it may OOPs kernel etc)?

Sure you can: mount -o ro,norecovery <dev> <mtpt>

But if you hit corruption it will still shut down on you. If
the machine oopses then that is a bug.

 thread prompted me to think.  If I can't force-mount it
 (or browse it using other ways) as I can almost always
 do with (somewhat?) broken ext[23] just to examine things,
 maybe I'm trying it before it's mature enough? ;)

Hehe ;)

For maximum uber-XFS-guru points, learn to browse your filesystem
with xfs_db. :P
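
e.g. something along these lines, read-only (the device name is just an
example):

  xfs_db -r /dev/md0
  xfs_db> sb 0        # select superblock 0
  xfs_db> print       # dump its fields
  xfs_db> quit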

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)

2007-12-06 Thread Andrew Morton
On Sat, 1 Dec 2007 06:26:08 -0500 (EST)
Justin Piszcz [EMAIL PROTECTED] wrote:

 I am putting a new machine together and I have dual raptor raid 1 for the 
 root, which works just fine under all stress tests.
 
 Then I have the WD 750 GiB drive (not RE2, desktop ones for ~150-160 on 
 sale now adays):
 
 I ran the following:
 
 dd if=/dev/zero of=/dev/sdc
 dd if=/dev/zero of=/dev/sdd
 dd if=/dev/zero of=/dev/sde
 
 (as it is always a very good idea to do this with any new disk)
 
 And sometime along the way(?) (i had gone to sleep and let it run), this 
 occurred:
 
 [42880.680144] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x401 
 action 0x2 frozen

Gee we're seeing a lot of these lately.

 [42880.680231] ata3.00: irq_stat 0x00400040, connection status changed
 [42880.680290] ata3.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb 
 0x0 data 512 in
 [42880.680292]  res 40/00:ac:d8:64:54/00:00:57:00:00/40 Emask 0x10 
 (ATA bus error)
 [42881.841899] ata3: soft resetting port
 [42885.966320] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
 [42915.919042] ata3.00: qc timeout (cmd 0xec)
 [42915.919094] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x5)
 [42915.919149] ata3.00: revalidation failed (errno=-5)
 [42915.919206] ata3: failed to recover some devices, retrying in 5 secs
 [42920.912458] ata3: hard resetting port
 [42926.411363] ata3: port is slow to respond, please be patient (Status 
 0x80)
 [42930.943080] ata3: COMRESET failed (errno=-16)
 [42930.943130] ata3: hard resetting port
 [42931.399628] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
 [42931.413523] ata3.00: configured for UDMA/133
 [42931.413586] ata3: EH pending after completion, repeating EH (cnt=4)
 [42931.413655] ata3: EH complete
 [42931.413719] sd 2:0:0:0: [sdc] 1465149168 512-byte hardware sectors 
 (750156 MB)
 [42931.413809] sd 2:0:0:0: [sdc] Write Protect is off
 [42931.413856] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
 [42931.413867] sd 2:0:0:0: [sdc] Write cache: enabled, read cache: 
 enabled, doesn't support DPO or FUA
 
 Usually when I see this sort of thing with another box I have full of 
 raptors, it was due to a bad raptor and I never saw it again after I 
 replaced the disk that it happened on, but that was using the Intel P965 
 chipset.
 
 For this board, it is a Gigabyte GSP-P35-DS4 (Rev 2.0) and I have all of 
 the drives (2 raptors, 3 750s connected to the Intel ICH9 Southbridge).
 
 I am going to do some further testing but does this indicate a bad drive? 
 Bad cable?  Bad connector?
 
 As you can see above, /dev/sdc stopped responding for a little bit and 
 then the kernel reset the port.
 
 Why is this though?  What is the likely root cause?  Should I replace the 
 drive?  Obviously this is not normal and cannot be good at all, the idea 
 is to put these drives in a RAID5 and if one is going to timeout that is 
 going to cause the array to go degraded and thus be worthless in a raid5 
 configuration.
 
 Can anyone offer any insight here?

It would be interesting to try 2.6.21 or 2.6.22.



Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)

2007-12-06 Thread Justin Piszcz



On Thu, 6 Dec 2007, Andrew Morton wrote:


On Sat, 1 Dec 2007 06:26:08 -0500 (EST)
Justin Piszcz [EMAIL PROTECTED] wrote:


I am putting a new machine together and I have dual raptor raid 1 for the
root, which works just fine under all stress tests.

Then I have the WD 750 GiB drive (not RE2, desktop ones for ~150-160 on
sale now adays):

I ran the following:

dd if=/dev/zero of=/dev/sdc
dd if=/dev/zero of=/dev/sdd
dd if=/dev/zero of=/dev/sde

(as it is always a very good idea to do this with any new disk)

And sometime along the way(?) (i had gone to sleep and let it run), this
occurred:

[42880.680144] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x401
action 0x2 frozen


Gee we're seeing a lot of these lately.


[42880.680231] ata3.00: irq_stat 0x00400040, connection status changed
[42880.680290] ata3.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb
0x0 data 512 in
[42880.680292]  res 40/00:ac:d8:64:54/00:00:57:00:00/40 Emask 0x10
(ATA bus error)
[42881.841899] ata3: soft resetting port
[42885.966320] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[42915.919042] ata3.00: qc timeout (cmd 0xec)
[42915.919094] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x5)
[42915.919149] ata3.00: revalidation failed (errno=-5)
[42915.919206] ata3: failed to recover some devices, retrying in 5 secs
[42920.912458] ata3: hard resetting port
[42926.411363] ata3: port is slow to respond, please be patient (Status
0x80)
[42930.943080] ata3: COMRESET failed (errno=-16)
[42930.943130] ata3: hard resetting port
[42931.399628] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[42931.413523] ata3.00: configured for UDMA/133
[42931.413586] ata3: EH pending after completion, repeating EH (cnt=4)
[42931.413655] ata3: EH complete
[42931.413719] sd 2:0:0:0: [sdc] 1465149168 512-byte hardware sectors
(750156 MB)
[42931.413809] sd 2:0:0:0: [sdc] Write Protect is off
[42931.413856] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[42931.413867] sd 2:0:0:0: [sdc] Write cache: enabled, read cache:
enabled, doesn't support DPO or FUA

Usually when I see this sort of thing with another box I have full of
raptors, it was due to a bad raptor and I never saw it again after I
replaced the disk that it happened on, but that was using the Intel P965
chipset.

For this board, it is a Gigabyte GSP-P35-DS4 (Rev 2.0) and I have all of
the drives (2 raptors, 3 750s connected to the Intel ICH9 Southbridge).

I am going to do some further testing but does this indicate a bad drive?
Bad cable?  Bad connector?

As you can see above, /dev/sdc stopped responding for a little bit and
then the kernel reset the port.

Why is this though?  What is the likely root cause?  Should I replace the
drive?  Obviously this is not normal and cannot be good at all, the idea
is to put these drives in a RAID5 and if one is going to timeout that is
going to cause the array to go degraded and thus be worthless in a raid5
configuration.

Can anyone offer any insight here?


It would be interesting to try 2.6.21 or 2.6.22.



This was due to NCQ issues (disabling it fixed the problem).
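
(For anyone searching the archives: the usual runtime knob for this - not
necessarily exactly how I did it - is the per-device queue depth, e.g. for
the disk in the log above:

  echo 1 > /sys/block/sdc/device/queue_depth

A depth of 1 effectively turns NCQ off; a larger value such as 31 turns it
back on.)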

Justin.


Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)

2007-12-06 Thread Andrew Morton
On Thu, 6 Dec 2007 17:38:08 -0500 (EST)
Justin Piszcz [EMAIL PROTECTED] wrote:

 
 
 On Thu, 6 Dec 2007, Andrew Morton wrote:
 
  On Sat, 1 Dec 2007 06:26:08 -0500 (EST)
  Justin Piszcz [EMAIL PROTECTED] wrote:
 
  I am putting a new machine together and I have dual raptor raid 1 for the
  root, which works just fine under all stress tests.
 
  Then I have the WD 750 GiB drive (not RE2, desktop ones for ~150-160 on
  sale now adays):
 
  I ran the following:
 
  dd if=/dev/zero of=/dev/sdc
  dd if=/dev/zero of=/dev/sdd
  dd if=/dev/zero of=/dev/sde
 
  (as it is always a very good idea to do this with any new disk)
 
  And sometime along the way(?) (i had gone to sleep and let it run), this
  occurred:
 
  [42880.680144] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x401
  action 0x2 frozen
 
  Gee we're seeing a lot of these lately.
 
  [42880.680231] ata3.00: irq_stat 0x00400040, connection status changed
  [42880.680290] ata3.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 cdb
  0x0 data 512 in
  [42880.680292]  res 40/00:ac:d8:64:54/00:00:57:00:00/40 Emask 0x10
  (ATA bus error)
  [42881.841899] ata3: soft resetting port
  [42885.966320] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
  [42915.919042] ata3.00: qc timeout (cmd 0xec)
  [42915.919094] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x5)
  [42915.919149] ata3.00: revalidation failed (errno=-5)
  [42915.919206] ata3: failed to recover some devices, retrying in 5 secs
  [42920.912458] ata3: hard resetting port
  [42926.411363] ata3: port is slow to respond, please be patient (Status
  0x80)
  [42930.943080] ata3: COMRESET failed (errno=-16)
  [42930.943130] ata3: hard resetting port
  [42931.399628] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
  [42931.413523] ata3.00: configured for UDMA/133
  [42931.413586] ata3: EH pending after completion, repeating EH (cnt=4)
  [42931.413655] ata3: EH complete
  [42931.413719] sd 2:0:0:0: [sdc] 1465149168 512-byte hardware sectors
  (750156 MB)
  [42931.413809] sd 2:0:0:0: [sdc] Write Protect is off
  [42931.413856] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00
  [42931.413867] sd 2:0:0:0: [sdc] Write cache: enabled, read cache:
  enabled, doesn't support DPO or FUA
 
  Usually when I see this sort of thing with another box I have full of
  raptors, it was due to a bad raptor and I never saw it again after I
  replaced the disk that it happened on, but that was using the Intel P965
  chipset.
 
  For this board, it is a Gigabyte GSP-P35-DS4 (Rev 2.0) and I have all of
  the drives (2 raptors, 3 750s connected to the Intel ICH9 Southbridge).
 
  I am going to do some further testing but does this indicate a bad drive?
  Bad cable?  Bad connector?
 
  As you can see above, /dev/sdc stopped responding for a little bit and
  then the kernel reset the port.
 
  Why is this though?  What is the likely root cause?  Should I replace the
  drive?  Obviously this is not normal and cannot be good at all, the idea
  is to put these drives in a RAID5 and if one is going to timeout that is
  going to cause the array to go degraded and thus be worthless in a raid5
  configuration.
 
  Can anyone offer any insight here?
 
  It would be interesting to try 2.6.21 or 2.6.22.
 
 
 This was due to NCQ issues (disabling it fixed the problem).
 

I cannot locate any further email discussion on this topic.

Disabling NCQ at either compile time or runtime is not a fix and further
work should be done here to make the kernel run acceptably on that
hardware.


Re: [PATCH] (2nd try) force parallel resync

2007-12-06 Thread Neil Brown
On Thursday December 6, [EMAIL PROTECTED] wrote:
 Hello,
 
 here is the second version of the patch. With this version the sync_thread
 is also woken up when /sys/block/*/md/sync_force_parallel is set.
 Though I still don't understand why md_wakeup_thread() is not working.

Could you give a little more detail on why you want this?  When do you
want multiple arrays on the same device to sync at the same time?
What exactly is the hardware like?

md threads generally run for a little while to perform some task, then
stop and wait to be needed again.  md_wakeup_thread says you are
needed again.

The resync/recovery thread is a bit different.  It just run md_do_sync
once.  md_wakeup_thread is not really meaningful in that context.

What you want is:
wake_up(&resync_wait);

that will get any thread that is waiting for some other array to
resync to wake up and see if something needs to be done.

NeilBrown


[PATCH 001 of 3] md: raid6: Fix mktable.c

2007-12-06 Thread NeilBrown

From: H. Peter Anvin [EMAIL PROTECTED]

Make both mktables.c and its output CodingStyle compliant.  Update the
copyright notice.

Signed-off-by: H. Peter Anvin [EMAIL PROTECTED]
Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/mktables.c |   43 +--
 1 file changed, 17 insertions(+), 26 deletions(-)

diff .prev/drivers/md/mktables.c ./drivers/md/mktables.c
--- .prev/drivers/md/mktables.c 2007-12-03 14:47:09.0 +1100
+++ ./drivers/md/mktables.c 2007-12-03 14:56:06.0 +1100
@@ -1,13 +1,10 @@
-#ident "$Id: mktables.c,v 1.2 2002/12/12 22:41:27 hpa Exp $"
-/* --- *
+/* -*- linux-c -*- --- *
  *
- *   Copyright 2002 H. Peter Anvin - All Rights Reserved
+ *   Copyright 2002-2007 H. Peter Anvin - All Rights Reserved
  *
- *   This program is free software; you can redistribute it and/or modify
- *   it under the terms of the GNU General Public License as published by
- *   the Free Software Foundation, Inc., 53 Temple Place Ste 330,
- *   Bostom MA 02111-1307, USA; either version 2 of the License, or
- *   (at your option) any later version; incorporated herein by reference.
+ *   This file is part of the Linux kernel, and is made available under
+ *   the terms of the GNU General Public License version 2 or (at your
+ *   option) any later version; incorporated herein by reference.
  *
  * --- */
 
@@ -73,8 +70,8 @@ int main(int argc, char *argv[])
 		for (j = 0; j < 256; j += 8) {
 			printf("\t\t");
 			for (k = 0; k < 8; k++)
-				printf("0x%02x, ", gfmul(i, j+k));
-			printf("\n");
+				printf("0x%02x,%c", gfmul(i, j + k),
+				       (k == 7) ? '\n' : ' ');
 		}
 		printf("\t},\n");
 	}
@@ -83,47 +80,41 @@ int main(int argc, char *argv[])
 	/* Compute power-of-2 table (exponent) */
 	v = 1;
 	printf("\nconst u8 __attribute__((aligned(256)))\n"
-	       "raid6_gfexp[256] =\n"
-	       "{\n");
+	       "raid6_gfexp[256] =\n" "{\n");
 	for (i = 0; i < 256; i += 8) {
 		printf("\t");
 		for (j = 0; j < 8; j++) {
-			exptbl[i+j] = v;
-			printf("0x%02x, ", v);
+			exptbl[i + j] = v;
+			printf("0x%02x,%c", v, (j == 7) ? '\n' : ' ');
 			v = gfmul(v, 2);
 			if (v == 1)
 				v = 0;	/* For entry 255, not a real entry */
 		}
-		printf("\n");
 	}
 	printf("};\n");
 
 	/* Compute inverse table x^-1 == x^254 */
 	printf("\nconst u8 __attribute__((aligned(256)))\n"
-	       "raid6_gfinv[256] =\n"
-	       "{\n");
+	       "raid6_gfinv[256] =\n" "{\n");
 	for (i = 0; i < 256; i += 8) {
 		printf("\t");
 		for (j = 0; j < 8; j++) {
-			v = gfpow(i+j, 254);
-			invtbl[i+j] = v;
-			printf("0x%02x, ", v);
+			invtbl[i + j] = v = gfpow(i + j, 254);
+			printf("0x%02x,%c", v, (j == 7) ? '\n' : ' ');
 		}
-		printf("\n");
 	}
 	printf("};\n");
 
 	/* Compute inv(2^x + 1) (exponent-xor-inverse) table */
 	printf("\nconst u8 __attribute__((aligned(256)))\n"
-	       "raid6_gfexi[256] =\n"
-	       "{\n");
+	       "raid6_gfexi[256] =\n" "{\n");
 	for (i = 0; i < 256; i += 8) {
 		printf("\t");
 		for (j = 0; j < 8; j++)
-			printf("0x%02x, ", invtbl[exptbl[i+j]^1]);
-		printf("\n");
+			printf("0x%02x,%c", invtbl[exptbl[i + j] ^ 1],
+			       (j == 7) ? '\n' : ' ');
 	}
-	printf("};\n\n");
+	printf("};\n");
 
 	return 0;
 }


[PATCH 000 of 3] md: a few little patches

2007-12-06 Thread NeilBrown
Following 3 patches for md provide some code tidyup and a small
functionality improvement.
They do not need to go into 2.6.24 but are definitely appropriate for 2.6.25-rc1.

(Patches made against 2.6.24-rc3-mm2)

Thanks,
NeilBrown


 [PATCH 001 of 3] md: raid6: Fix mktable.c
 [PATCH 002 of 3] md: raid6: clean up the style of raid6test/test.c
 [PATCH 003 of 3] md: Update md bitmap during resync.


Re: RAID mapper device size wrong after replacing drives

2007-12-06 Thread Neil Brown

I think you would have more luck posting this to
[EMAIL PROTECTED] - I think that is where support for device mapper
happens.

NeilBrown


On Thursday December 6, [EMAIL PROTECTED] wrote:
 
 Hi,
 
 I have a problem with my RAID array under Linux after upgrading to larger
 drives. I have a machine with Windows and Linux dual-boot which had a pair
 of 160GB drives in a RAID-1 mirror with 3 partitions: partition 1 = Windows
 boot partition (FAT32), partition 2 = Linux /boot (ext3), partition 3 =
 Windows system (NTFS). The Linux root is on a separate physical drive. The
 dual boot is via Grub installed on the /boot partition, and this was all
 working fine.
 
 But I just upgraded the drives in the RAID pair, replacing them with 500GB
 drives. I did this by replacing one of the 160s with a new 500 and letting
 the RAID copy the drive, splitting the drives out of the RAID array and
 increasing the size of the last partition of the 500 (which I did under
 Windows since it's the Windows partition), then replacing the last 160 with
 the other 500 and having the RAID controller create a new array with the two
 500s, copying the drive that I'd copied from the 160. This worked great for
 Windows, and that now boots and sees a 500GB RAID drive with all the data
 intact.
 
 However, Linux has a problem and will not now boot all the way. It reports
 that the RAID /dev/mapper volume failed - the partition is beyond the
 boundaries of the disk. Running fdisk shows that it is seeing the larger
 partition, but still sees the size of the RAID /dev/mapper drive as 160GB.
 Here is the fdisk output for one of the physical drives and for the RAID
 mapper drive:
 
 Disk /dev/sda: 500.1 GB, 500107862016 bytes
 255 heads, 63 sectors/track, 60801 cylinders
 Units = cylinders of 16065 * 512 = 8225280 bytes
 
    Device Boot      Start         End      Blocks   Id  System
 /dev/sda1               1         625     5018624    b  W95 FAT32
 Partition 1 does not end on cylinder boundary.
 /dev/sda2             626         637       96390   83  Linux
 /dev/sda3   *         638       60802   483264512    7  HPFS/NTFS
 
 
 Disk /dev/mapper/isw_bcifcijdi_Raid-0: 163.9 GB, 163925983232 bytes
 255 heads, 63 sectors/track, 19929 cylinders
 Units = cylinders of 16065 * 512 = 8225280 bytes
 
                             Device Boot      Start       End     Blocks   Id  System
 /dev/mapper/isw_bcifcijdi_Raid-0p1               1        625    5018624    b  W95 FAT32
 Partition 1 does not end on cylinder boundary.
 /dev/mapper/isw_bcifcijdi_Raid-0p2             626        637      96390   83  Linux
 /dev/mapper/isw_bcifcijdi_Raid-0p3   *         638      60802  483264512    7  HPFS/NTFS
 
 
 They differ only in the drive capacity and number of cylinders.
 
 I started to try to run a Linux reinstall, but it reports that the partition
 table on the mapper drive is invalid, giving an option to re-initialize it
 but saying that doing so will lose all the data on the drive.
 
 So questions:
 
 1. Where is the drive size information for the RAID mapper drive kept, and
 is there some way to patch it?
 
 2. Is there some way to re-initialize the RAID mapper drive without
 destroying the data on the drive?
 
 Thanks,
 Ian
 -- 
 View this message in context: 
 http://www.nabble.com/RAID-mapper-device-size-wrong-after-replacing-drives-tf4958354.html#a14200241
 Sent from the linux-raid mailing list archive at Nabble.com.
 


[PATCH 002 of 3] md: raid6: clean up the style of raid6test/test.c

2007-12-06 Thread NeilBrown

From: H. Peter Anvin [EMAIL PROTECTED]
Date: Fri, 26 Oct 2007 11:22:42 -0700

Clean up the coding style in raid6test/test.c.  Break it apart into
subfunctions to make the code more readable.

Signed-off-by: H. Peter Anvin [EMAIL PROTECTED]
Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/raid6test/test.c |  115 --
 1 file changed, 68 insertions(+), 47 deletions(-)

diff .prev/drivers/md/raid6test/test.c ./drivers/md/raid6test/test.c
--- .prev/drivers/md/raid6test/test.c   2007-12-03 14:57:55.0 +1100
+++ ./drivers/md/raid6test/test.c   2007-12-03 14:57:55.0 +1100
@@ -1,12 +1,10 @@
 /* -*- linux-c -*- --- *
  *
- *   Copyright 2002 H. Peter Anvin - All Rights Reserved
+ *   Copyright 2002-2007 H. Peter Anvin - All Rights Reserved
  *
- *   This program is free software; you can redistribute it and/or modify
- *   it under the terms of the GNU General Public License as published by
- *   the Free Software Foundation, Inc., 53 Temple Place Ste 330,
- *   Bostom MA 02111-1307, USA; either version 2 of the License, or
- *   (at your option) any later version; incorporated herein by reference.
+ *   This file is part of the Linux kernel, and is made available under
+ *   the terms of the GNU General Public License version 2 or (at your
+ *   option) any later version; incorporated herein by reference.
  *
  * --- */
 
@@ -30,67 +28,87 @@ char *dataptrs[NDISKS];
 char data[NDISKS][PAGE_SIZE];
 char recovi[PAGE_SIZE], recovj[PAGE_SIZE];
 
-void makedata(void)
+static void makedata(void)
 {
int i, j;
 
-	for (  i = 0 ; i < NDISKS ; i++ ) {
-		for ( j = 0 ; j < PAGE_SIZE ; j++ ) {
+	for (i = 0; i < NDISKS; i++) {
+		for (j = 0; j < PAGE_SIZE; j++)
data[i][j] = rand();
-   }
+
dataptrs[i] = data[i];
}
 }
 
+static char disk_type(int d)
+{
+   switch (d) {
+   case NDISKS-2:
+   return 'P';
+   case NDISKS-1:
+   return 'Q';
+   default:
+   return 'D';
+   }
+}
+
+static int test_disks(int i, int j)
+{
+   int erra, errb;
+
+   memset(recovi, 0xf0, PAGE_SIZE);
+   memset(recovj, 0xba, PAGE_SIZE);
+
+   dataptrs[i] = recovi;
+   dataptrs[j] = recovj;
+
+   raid6_dual_recov(NDISKS, PAGE_SIZE, i, j, (void **)dataptrs);
+
+   erra = memcmp(data[i], recovi, PAGE_SIZE);
+   errb = memcmp(data[j], recovj, PAGE_SIZE);
+
+	if (i < NDISKS-2 && j == NDISKS-1) {
+		/* We don't implement the DQ failure scenario, since it's
+		   equivalent to a RAID-5 failure (XOR, then recompute Q) */
+		erra = errb = 0;
+	} else {
+		printf("algo=%-8s  faila=%3d(%c)  failb=%3d(%c)  %s\n",
+		       raid6_call.name,
+		       i, disk_type(i),
+		       j, disk_type(j),
+		       (!erra && !errb) ? "OK" :
+		       !erra ? "ERRB" :
+		       !errb ? "ERRA" : "ERRAB");
+	}
+
+   dataptrs[i] = data[i];
+   dataptrs[j] = data[j];
+
+   return erra || errb;
+}
+
 int main(int argc, char *argv[])
 {
-   const struct raid6_calls * const * algo;
+   const struct raid6_calls *const *algo;
int i, j;
-   int erra, errb;
+   int err = 0;
 
makedata();
 
-   for ( algo = raid6_algos ; *algo ; algo++ ) {
-		if ( !(*algo)->valid || (*algo)->valid() ) {
+	for (algo = raid6_algos; *algo; algo++) {
+		if (!(*algo)->valid || (*algo)->valid()) {
raid6_call = **algo;
 
/* Nuke syndromes */
memset(data[NDISKS-2], 0xee, 2*PAGE_SIZE);
 
/* Generate assumed good syndrome */
-   raid6_call.gen_syndrome(NDISKS, PAGE_SIZE, (void 
**)dataptrs);
+   raid6_call.gen_syndrome(NDISKS, PAGE_SIZE,
+   (void **)dataptrs);
 
-			for ( i = 0 ; i < NDISKS-1 ; i++ ) {
-				for ( j = i+1 ; j < NDISKS ; j++ ) {
-   memset(recovi, 0xf0, PAGE_SIZE);
-   memset(recovj, 0xba, PAGE_SIZE);
-
-   dataptrs[i] = recovi;
-   dataptrs[j] = recovj;
-
-   raid6_dual_recov(NDISKS, PAGE_SIZE, i, 
j, (void **)dataptrs);
-
-   erra = memcmp(data[i], recovi, 
PAGE_SIZE);
-   errb = memcmp(data[j], recovj, 
PAGE_SIZE);
-
-					if ( i < NDISKS-2 && j == NDISKS-1 ) {
- 

[PATCH 003 of 3] md: Update md bitmap during resync.

2007-12-06 Thread NeilBrown

Currently an md array with a write-intent bitmap does not update
that bitmap to reflect successful partial resync.  Rather the entire
bitmap is updated when the resync completes.

This is because there is no guarantee that resync requests will
complete in order, and tracking each request individually is
unnecessarily burdensome.

However there is value in regularly updating the bitmap, so add code
to periodically pause while all pending sync requests complete, then
update the bitmap.  Doing this only every few seconds (the same as the
bitmap update time) does not noticeably affect resync performance.

Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/bitmap.c |   34 +-
 ./drivers/md/raid1.c  |1 +
 ./drivers/md/raid10.c |2 ++
 ./drivers/md/raid5.c  |3 +++
 ./include/linux/raid/bitmap.h |3 +++
 5 files changed, 38 insertions(+), 5 deletions(-)

diff .prev/drivers/md/bitmap.c ./drivers/md/bitmap.c
--- .prev/drivers/md/bitmap.c   2007-12-03 14:58:48.0 +1100
+++ ./drivers/md/bitmap.c   2007-12-03 14:59:00.0 +1100
@@ -1342,14 +1342,38 @@ void bitmap_close_sync(struct bitmap *bi
 */
sector_t sector = 0;
int blocks;
-	if (!bitmap) return;
+	if (!bitmap)
+		return;
 	while (sector < bitmap->mddev->resync_max_sectors) {
 		bitmap_end_sync(bitmap, sector, &blocks, 0);
-/*
-		if (sector < 500) printk("bitmap_close_sync: sec %llu blks %d\n",
-					 (unsigned long long)sector, blocks);
-*/		sector += blocks;
+		sector += blocks;
+	}
+}
+
+void bitmap_cond_end_sync(struct bitmap *bitmap, sector_t sector)
+{
+	sector_t s = 0;
+	int blocks;
+
+	if (!bitmap)
+		return;
+	if (sector == 0) {
+		bitmap->last_end_sync = jiffies;
+		return;
+	}
+	if (time_before(jiffies, (bitmap->last_end_sync
+				  + bitmap->daemon_sleep * HZ)))
+		return;
+	wait_event(bitmap->mddev->recovery_wait,
+		   atomic_read(&bitmap->mddev->recovery_active) == 0);
+
+	sector &= ~((1ULL << CHUNK_BLOCK_SHIFT(bitmap)) - 1);
+	s = 0;
+	while (s < sector && s < bitmap->mddev->resync_max_sectors) {
+		bitmap_end_sync(bitmap, s, &blocks, 0);
+		s += blocks;
 	}
+	bitmap->last_end_sync = jiffies;
 }
 
 static void bitmap_set_memory_bits(struct bitmap *bitmap, sector_t offset, int 
needed)

diff .prev/drivers/md/raid10.c ./drivers/md/raid10.c
--- .prev/drivers/md/raid10.c   2007-12-03 14:58:48.0 +1100
+++ ./drivers/md/raid10.c   2007-12-03 14:58:10.0 +1100
@@ -1670,6 +1670,8 @@ static sector_t sync_request(mddev_t *md
 	if (!go_faster && conf->nr_waiting)
 		msleep_interruptible(1000);
 
+	bitmap_cond_end_sync(mddev->bitmap, sector_nr);
+
/* Again, very different code for resync and recovery.
 * Both must result in an r10bio with a list of bios that
 * have bi_end_io, bi_sector, bi_bdev set,

diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
--- .prev/drivers/md/raid1.c2007-12-03 14:58:48.0 +1100
+++ ./drivers/md/raid1.c2007-12-03 14:58:10.0 +1100
@@ -1684,6 +1684,7 @@ static sector_t sync_request(mddev_t *md
 	if (!go_faster && conf->nr_waiting)
 		msleep_interruptible(1000);
 
+	bitmap_cond_end_sync(mddev->bitmap, sector_nr);
 	raise_barrier(conf);
 
 	conf->next_resync = sector_nr;

diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c2007-12-03 14:58:48.0 +1100
+++ ./drivers/md/raid5.c2007-12-03 14:58:10.0 +1100
@@ -4333,6 +4333,9 @@ static inline sector_t sync_request(mdde
return sync_blocks * STRIPE_SECTORS; /* keep things rounded to 
whole stripes */
}
 
+
+	bitmap_cond_end_sync(mddev->bitmap, sector_nr);
+
pd_idx = stripe_to_pdidx(sector_nr, conf, raid_disks);
 
sh = wait_for_inactive_cache(conf, sector_nr, raid_disks, pd_idx);

diff .prev/include/linux/raid/bitmap.h ./include/linux/raid/bitmap.h
--- .prev/include/linux/raid/bitmap.h   2007-12-03 14:58:48.0 +1100
+++ ./include/linux/raid/bitmap.h   2007-12-03 14:58:10.0 +1100
@@ -244,6 +244,8 @@ struct bitmap {
 */
unsigned long daemon_lastrun; /* jiffies of last run */
unsigned long daemon_sleep; /* how many seconds between updates? */
+   unsigned long last_end_sync; /* when we lasted called end_sync to
+ * update bitmap with resync progress */
 
atomic_t pending_writes; /* pending writes to the bitmap file */
wait_queue_head_t write_wait;
@@ -275,6 +277,7 @@ void bitmap_endwrite(struct bitmap *bitm
 int bitmap_start_sync(struct bitmap *bitmap, 

Re: Kernel 2.6.23.9 + mdadm 2.6.2-2 + Auto rebuild RAID1?

2007-12-06 Thread Nix
On 6 Dec 2007, Jan Engelhardt verbalised:
 On Dec 5 2007 19:29, Nix wrote:

   On Dec 1 2007 06:19, Justin Piszcz wrote:

    RAID1, 0.90.03 superblocks (in order to be compatible with LILO, if
    you use 1.x superblocks with LILO you can't boot)

   Says who? (Don't use LILO ;-)

  Well, your kernels must be on a 0.90-superblocked RAID-0 or RAID-1
  device. It can't handle booting off 1.x superblocks nor RAID-[56]
  (not that I could really hope for the latter).

 If the superblock is at the end (which is the case for 0.90 and 1.0),
 then the offsets for a specific block on /dev/mdX match the ones for /dev/sda,
 so it should be easy to use lilo on 1.0 too, no?

Sure, but you may have to hack /sbin/lilo to convince it to create the
superblock there at all. It's likely to recognise that this is an md
device without a v0.90 superblock and refuse to continue. (But I haven't
tested it.)

-- 
`The rest is a tale of post and counter-post.' --- Ian Rawlings
   describes USENET