Re: Array will not assemble

2006-07-06 Thread Neil Brown
On Friday July 7, [EMAIL PROTECTED] wrote:
 Perhaps I am misunderstanding how assemble works, but I have created a 
 new RAID 1 array on a pair of SCSI drives and am having difficulty 
 re-assembling it after a reboot.
 
 The relevant mdadm.conf entry looks like this:
 
 
 ARRAY /dev/md3 level=raid1 num-devices=2 
 UUID=72189255:acddbac3:316abdb0:9152808d devices=/dev/sdc,/dev/sdd

Add
  DEVICE /dev/sd?
or similar on a separate line.
Remove
  devices=/dev/sdc,/dev/sdd
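
The corrected mdadm.conf might then look something like this (a sketch
only, reusing the UUID quoted above):

  DEVICE /dev/sd?
  ARRAY /dev/md3 level=raid1 num-devices=2 UUID=72189255:acddbac3:316abdb0:9152808d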

NeilBrown


Re: How does md determine which partitions to use in RAID1 when DEVICE partitions is specified

2006-07-03 Thread Neil Brown
On Monday July 3, [EMAIL PROTECTED] wrote:
 I have Fedora Core 5 installed with mirroring on the Boot partition
 and root partition.  I created a Logical Volume Group on the mirrored
 root partition.
 
 How does md figure out which partitions are actually specified?  It
 says it stores the uuid in the superblock, but I can't seem to figure
 out where this is, or how to get it.  Is it in the partition, or the
 volume?

The superblock is near the end of whatever you told md to use for the
array.
 From your 'mdadm --detail' output, I see that means /dev/sda1
/dev/sdb1 /dev/sda2 etc.  The superblock for each partition is near
the end of each partition.

 
 The reason I'm asking this, I'd like to add two USB 2.0 drives that
 are mirrored, and I would specify the device name (/dev/sdd, /dev/sde)
 for the ARRAY, but I found that the allocation of these device names
 is dependent upon when the drive is inserted into the USB.

You don't need to care about the device name.  Just add the uuid
information to mdadm.conf.  Providing the devices are plugged in,
mdadm will find them.
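
For example, the relevant mdadm.conf lines might look roughly like this
(/dev/md4 and the UUID are placeholders; 'mdadm --detail /dev/mdX' or
'mdadm --examine /dev/sdd' will print the real UUID from the superblock):

  DEVICE partitions
  ARRAY /dev/md4 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx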

 
 I'm going to ask if there is a way to lock the volume names for
 devices (I'm thinking by UUID) for USB devices in partitions.
 

'udev' might have functionality to do this.  But you don't really need
it with mdadm.

NeilBrown


Re: raid5 write performance

2006-07-02 Thread Neil Brown
On Sunday July 2, [EMAIL PROTECTED] wrote:
 Neil hello.
 
 I have been looking at the raid5 code trying to understand why writes
 performance is so poor.

raid5 write performance is expected to be poor, as you often need to
pre-read data or parity before the write can be issued.

 If I am not mistaken here, it seems that you issue a write the size of
 one page and no more, no matter what buffer size I am using.

I doubt the small write size would contribute more than a couple of
percent to the speed issue.  Scheduling (when to write, when to
pre-read, when to wait a moment) is probably much more important.

 
 1. Is this page is directed only to parity disk ?

No.  All drives are written in one-page units.  Each request is
divided into one-page chunks; these chunks are gathered -
where possible - into strips, and the strips are handled as units.
(A strip is like a stripe, only one page wide rather than one chunk
wide - if that makes sense.)

 2. How can I increase the write throughput?

Look at scheduling patterns - what order are the blocks getting
written, do we pre-read when we don't need to, things like that.

The current code tries to do the right thing, and it certainly has
been worse in the past, but I wouldn't be surprised if it could still
be improved.

NeilBrown


Re: [PATCH] enable auto=yes by default when using udev

2006-07-02 Thread Neil Brown
On Monday July 3, [EMAIL PROTECTED] wrote:
 Hello,
 the following patch aims at solving an issue that is confusing a lot of
 users.
 when using udev, device files are created only when devices are
 registered with the kernel, and md devices are registered only when
 started.
 mdadm needs the device file _before_ starting the array.
 so when using udev you must add --auto=yes to the mdadm commandline or
 to the ARRAY line in mdadm.conf
 
 following patch makes auto=yes the default when using udev
 

The principle I'm reasonably happy with, though you can now make this
the default with a line like

  CREATE auto=yes
in mdadm.conf.

However

 +
 + /* if we are using udev and auto is not set, mdadm will almost
 +  * certainly fail, so we force it here.
 +  */
 + if (autof == 0 && access("/dev/.udevdb", F_OK) == 0)
 + autof = 2;
 +

I'm worried that this test is not very robust.
On my Debian/unstable system running udev, there is no
 /dev/.udevdb
though there is a
 /dev/.udev/db

I guess I could test for both, but then udev might change
again.  I'd really like a more robust check.

Maybe I could test if /dev was a mount point?

Any other ideas?
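
One possible form of that mount-point test, sketched here only for
illustration (it simply compares the device numbers of /dev and its
parent directory):

  #include <sys/stat.h>

  /* Returns 1 if /dev appears to be a separate mount (e.g. udev's tmpfs). */
  static int dev_is_mountpoint(void)
  {
          struct stat dev, parent;

          if (stat("/dev", &dev) != 0 || stat("/", &parent) != 0)
                  return 0;
          return dev.st_dev != parent.st_dev;
  }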

NeilBrown


Re: changing MD device names

2006-07-01 Thread Neil Brown
On Saturday July 1, [EMAIL PROTECTED] wrote:
 I have a system which was running several raid1 devices (md0 - md2) using 
 2 physical drives (hde, and hdg).  I wanted to swap out these drives for 
 two different ones, so I did the following:
 
 1) swap out hdg for a new drive
 2) create degraded raid1's (md3 and md4) using partitions on new hdg
 3) format md3 and md4 and copy data from md0-2 to md3-4
 4) install grub on new hdg
 5) pull hde
 
 Now, after a bit of fixing in the grub menu and fstab, I have a system 
 that boots up using just 1 of the new drives, but the md devices are md3 
 and md4.  What's the easiest way to change the preferred minor # and get 
 these to be md0 and md1?  Will just booting from a rescue or live CD and 
 assembling the new drives as md0 & md1 automatically update the preferred 
 minor in their superblocks?
 
 The system is running Centos 4 (2.6.9-34.0.1.EL kernel).

You need to do a tiny bit more than assemble the new drives as md0 and
md1.  You also need to cause some write activity so that md bothers to
update the superblock.  Mounting and unmounting the filesystem should
do it.
Or you could assemble with --update=super-minor
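
For example (the partition names here are placeholders for whatever
currently makes up md3 and md4):

  mdadm --assemble /dev/md0 --update=super-minor /dev/hdgN
  mdadm --assemble /dev/md1 --update=super-minor /dev/hdgM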

NeilBrown


Re: raid issues after power failure

2006-06-30 Thread Neil Brown
On Friday June 30, [EMAIL PROTECTED] wrote:
 On Fri, 30 Jun 2006, Francois Barre wrote:
  Did you try upgrading mdadm yet ?
 
 yes, I have version 2.5 now, and it produces the same results.
 

Try adding '--force' to the -A line.
That tells mdadm to try really hard to assemble the array.

You should be aware that when a degraded array has an unclean
shutdown it is possible that data corruption could result, possibly in
files that have not been changed for a long time.  It is also quite
possible that there is no data corruption, or that it is only on parts of
the array that are not actually in use.

I recommend at least a full 'fsck' in this situation.
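
For example, with the member names being placeholders for your actual
devices:

  mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1
  fsck -n /dev/md0     # report-only pass first
  fsck -f /dev/md0     # then a full check that writes fixes if needed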

NeilBrown


Re: Strange intermittant errors + RAID doesn't fail the disk.

2006-06-30 Thread Neil Brown
On Friday June 30, [EMAIL PROTECTED] wrote:
 More problems ...
 
 As reported I have 4x WD5000YS (Caviar RE2 500 GB) in a md RAID5
 array. I've been benchmarking and otherwise testing the new array
 these last few days, and apart from the fact that the md doesn't shut
 down properly I've had no problems.
 
 Today I wanted to finally copy some data over, but I after 5sec I got:
 
 [...]
 ata2: port reset, p_is 800 is 2 pis 0 cmd 44017 tf d0 ss 123 se 0
 ata2: status=0x50 { DriveReady SeekComplete }
 sdc: Current: sense key: No Sense
 Additional sense: No additional sense information
 ata2: handling error/timeout
 ata2: port reset, p_is 0 is 0 pis 0 cmd 44017 tf 150 ss 123 se 0
 ata2: status=0x50 { DriveReady SeekComplete }
 ata2: error=0x01 { AddrMarkNotFound }
 sdc: Current: sense key: No Sense
 Additional sense: No additional sense information
 [repeat]
 
 All processes accessing the array hang and can't even be killed by
 kill -9, but md does not mark the disk as failed.

Looks very much like a problem with the SATA controller.
If the repeating messages you have shown form an infinite loop, then
presumably some failure is not being handled properly.

I suggest you find a SATA related mailing list to post this to (Look
in the MAINTAINERS file maybe) or post it to linux-kernel.

I doubt this is directly related to the raid code at all.

Good luck :-)

NeilBrown


 
 I then tested all four disks individually in another box -- according
 to WD's drive diagnostic they're fine. Re-created the array on the
 disks, which worked for a few hours, now I get the same error again.
 :(
 
 Kernel is 2.6.17-1-686 (Debian testing). I could go back to 16, but 15
 is missing a CIFS change I need.
 
 Any help is appreciated.
 
 Christian


Re: Cutting power without breaking RAID

2006-06-29 Thread Neil Brown
On Thursday June 29, [EMAIL PROTECTED] wrote:
 Why should this trickery be needed? When an array is mounted r/o it 
 should be clean. How can it be dirty. I assume readonly implies noatime, 
 I mount physically readonly devices without explicitly saying noatime 
 and nothing whines.

The 'filesystem' is mounted r/o.  The 'array' is not read-only, and
you cannot set an array to read-only while a filesystem is mounted
(because the array cannot tell that the mount is read-only).

A little while after the last write, an array will mark itself as
clean.  The effect of the 'kill -9' is to reduce this 'little while'
to 0.

So 
  remount readonly
  wait a little while
  kill machine 

would work too.

NeilBrown


Re: Drive issues in RAID vs. not-RAID ..

2006-06-28 Thread Neil Brown
On Wednesday June 28, [EMAIL PROTECTED] wrote:
 
 I've seen a few comments to the effect that some disks have problems when
 used in a RAID setup and I'm a bit perplexed as to why this might be..
 
 What's the difference between a drive in a RAID set (either s/w or h/w)
 and a drive on its own, assuming the load, etc. is roughly the same in
 each setup?
 
 Is it just bad feeling or are there any scientific reasons for it?

I don't think that 'disks' have problems being in a raid, but I
believe some controllers do (though I don't know whether it is the
controller or the driver that is at fault).  RAID makes concurrent
requests much more likely and so is likely to push hard at any locking
issues.

NeilBrown


Re: Cutting power without breaking RAID

2006-06-28 Thread Neil Brown
On Wednesday June 28, [EMAIL PROTECTED] wrote:
 Hello,
 
 I'm facing this problem:
 
 when my Linux box detects a POWER FAIL event from the UPS, it 
 starts a normal shutdown. Just before the normal kernel poweroff, 
 it sends to the UPS a signal on the serial line which says 
 cut-off the power to the server and switch-off the UPS.
 
 This is required to reboot the server as soon as the power is 
 restored.
 
 The problem is that the root partition is on top of a RAID-1 
 filesystem which is still mounted when the program that kills the 
 power is run, so the system goes down with a non clean RAID 
 volume.
 
 What can be the proper action to do before killing the power to 
 ensure that RAID will remain clean? It seems that remounting 
 the partition read-only is not sufficient.

Are you running a 2.4 kernel or a 2.6 kernel?

With 2.4, you cannot do what you want to do.

With 2.6, 
   killall -9 md0_raid1

should do the trick (assuming root is on /dev/md0.  If it is elsewhere,
choose a different process name).
After you kill -9 the raid thread, the array will be marked clean
as soon as all writes complete, and marked dirty again before
allowing another write.
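
A minimal sketch of the resulting shutdown sequence (the final command
to the UPS is site-specific and only indicated as a comment):

  mount -o remount,ro /      # stop new writes to the root filesystem
  killall -9 md0_raid1       # md marks the array clean once writes drain
  # ... now send the serial-line "cut power" command to the UPS ...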

NeilBrown


Re: mdadm 2.5.2 - Static built , Interesting warnings when

2006-06-28 Thread Neil Brown
On Tuesday June 27, [EMAIL PROTECTED] wrote:
   Hello All,  What change in Glibc makes this necessary?  Is there a
   method available to include the getpwnam & getgrnam structures so that
   a full static build will work?  Tia, JimL
 
 gcc -Wall -Werror -Wstrict-prototypes -ggdb -DSendmail=\/usr/sbin/sendmail 
 -t\ -DCONFFILE=\/etc/mdadm.conf\ -DCONFFILE2=\/etc/mdadm/mdadm.conf\ 
 -DHAVE_STDINT_H -o sha1.o -c sha1.c
 gcc -static -o mdadm mdadm.o config.o mdstat.o  ReadMe.o util.o Manage.o 
 Assemble.o Build.o Create.o Detail.o Examine.o Grow.o Monitor.o dlink.o 
 Kill.o Query.o mdopen.o super0.o super1.o bitmap.o restripe.o sysfs.o sha1.o
 config.o(.text+0x8c4): In function `createline': 
 /home/archive/mdadm-2.5.2/config.c:341: warning: Using 'getgrnam' in 
 statically linked applications requires at runtime the shared libraries from 
 the glibc version used for linking
 config.o(.text+0x80b):/home/archive/mdadm-2.5.2/config.c:326: warning: Using 
 'getpwnam' in statically linked applications requires at runtime the shared 
 libraries from the glibc version used for linking
 nroff -man mdadm.8 > mdadm.man

Are you running
  make LDFLAGS=-static mdadm
or something like that?  No, that won't work any more.
Use
   make mdadm.static

That will get you a good static binary by including 'pwgr.o'.

NeilBrown


ANNOUNCE: mdadm 2.5.2 - A tool for managing Soft RAID under Linux

2006-06-27 Thread Neil Brown


I am pleased to announce the availability of
   mdadm version 2.5.2

It is available at the usual places:
   http://www.cse.unsw.edu.au/~neilb/source/mdadm/
and
   countrycode=xx.
   http://www.${countrycode}kernel.org/pub/linux/utils/raid/mdadm/
and via git at
   git://neil.brown.name/mdadm
   http://neil.brown.name/git?p=mdadm

mdadm is a tool for creating, managing and monitoring
device arrays using the md driver in Linux, also
known as Software RAID arrays.

Release 2.5.2 is primarily a bugfix release over 2.5.1.
It also contains a work-around for a kernel bug which affects
hot-adding to arrays with a version-1 superblock.

Changelog Entries:
-   Fix problem with compiling with gcc-2 compilers
-   Fix compile problem of post-incrementing a variable in a macro arg.
-   Stop map_dev from returning [0:0], as that breaks things.
-   Add 'Array Slot' line to --examine for version-1 superblocks
to make it a bit easier to see what is happening.
-   Work around bug in --add handling for  version-1 superblocks
in 2.6.17 (and prior).
-   Make --assemble a bit more resilient to finding strange
information in superblocks.
-   Don't claim newly added spares are InSync!! (don't know why that
code was ever in there)
-   Work better when no 'ftw' is available, and check to see
if current uclibc provides ftw.
-   Never use /etc/mdadm.conf if --config file is given (previously
some code used one, some used the other).

Development of mdadm is sponsored by
 SUSE Labs, Novell Inc.

NeilBrown  27th June 2006




Re: Is shrinking raid5 possible?

2006-06-26 Thread Neil Brown
On Friday June 23, [EMAIL PROTECTED] wrote:
  Why would you ever want to reduce the size of a raid5 in this way?
 
 A feature that would have been useful to me a few times is the ability
 to shrink an array by whole disks.
 
 Example:
 
 8x 300 GB disks - 2100 GB raw capacity
 
 shrink file system, remove 2 disks =
 
 6x 300 GB disks -- 1500 GB raw capacity
 

This is shrinking an array by removing drives.  We were talking about
shrinking an array by reducing the size of drives - a very different
thing.

Yes, it might be sometimes useful to reduce the number of drives in a
raid5.  This would be similar to adding a drive to a raid5 (now
possible), but the data copy would have to go in a different
direction, so there would need to be substantial changes to the code.
I'm not sure it is really worth the effort I'm afraid, but it might
get done, one day, especially if someone volunteers some code ... ;-)

NeilBrown


 Why?
 
 If you're not backed up by a company budget, moving data to an new
 array (extra / larger disks) is extremely difficult. A lot of cases
 will hold 8 disks but not 16, never mind the extra RAID controller.
 Building another temporary server and moving the data via Gigabit is
 slow and expensive as well.
 
 Shrinking the array step-by-step and unloading data onto a regular
 filesystem on the freed disks would be a cheap (if time consuming) way
 to migrate, because the data could be copied back to the new array a
 disk at a time.
 
 Thanks,
 
 C.


Re: Bug in 2.6.17 / mdadm 2.5.1

2006-06-26 Thread Neil Brown
On Monday June 26, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
 snip
  Alternately you can apply the following patch to the kernel and
  version-1 superblocks should work better.
 
 -stable material?

Maybe.  I'm not sure it exactly qualifies, but I might try sending it
to them and see what they think.

NeilBrown


Re: recover data from linear raid

2006-06-26 Thread Neil Brown
On Monday June 26, [EMAIL PROTECTED] wrote:
 
   This is what I get now, after creating with fdisk /dev/hdb1 and 
 /dev/hdc1 as linux raid autodetect partitions

So I'm totally confused now.

You said it was 'linear', but the boot log showed 'raid0'.

The drives didn't have a partition table on them, yet it is clear from
the old boot log that they did.  Are you sure they are the same drives,
'cause it doesn't seem like it.

You could try hunting for ext3 superblocks on the device.  There might
be an easier way but

  od -x /dev/hdb | grep '^.60     ef53 '

should find them.  Once you have this information we might be able to
make something work.  But I feel the chances are dwindling.


NeilBrown


Re: Bug in 2.6.17 / mdadm 2.5.1

2006-06-25 Thread Neil Brown
On Sunday June 25, [EMAIL PROTECTED] wrote:
 Hi!
 
 There's a bug in Kernel 2.6.17 and / or mdadm which prevents (re)adding
 a disk to a degraded RAID5-Array.

Thank you for the detailed report.
The bug is in the md driver in the kernel (not in mdadm), and only
affects version-1 superblocks.  Debian recently changed the default
(in /etc/mdadm.conf) to use version-1 superblocks, which I thought
would be OK (I've done some testing) but obviously I missed something. :-(

If you remove the metadata=1 (or whatever it is) from
/etc/mdadm/mdadm.conf and then create the array, it will be created
with a version-0.90 superblock, which has had more testing.

Alternately you can apply the following patch to the kernel and
version-1 superblocks should work better.

NeilBrown

-
Set desc_nr correctly for version-1 superblocks.

This has to be done in ->load_super, not ->validate_super

### Diffstat output
 ./drivers/md/md.c |6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c   2006-06-26 11:02:43.0 +1000
+++ ./drivers/md/md.c   2006-06-26 11:02:46.0 +1000
@@ -1057,6 +1057,11 @@ static int super_1_load(mdk_rdev_t *rdev
 	if (rdev->sb_size & bmask)
 		rdev->sb_size = (rdev->sb_size | bmask)+1;
 
+	if (sb->level == cpu_to_le32(LEVEL_MULTIPATH))
+		rdev->desc_nr = -1;
+	else
+		rdev->desc_nr = le32_to_cpu(sb->dev_number);
+
 	if (refdev == 0)
 		ret = 1;
 	else {
@@ -1165,7 +1170,6 @@ static int super_1_validate(mddev_t *mdd
 
 	if (mddev->level != LEVEL_MULTIPATH) {
 		int role;
-		rdev->desc_nr = le32_to_cpu(sb->dev_number);
 		role = le16_to_cpu(sb->dev_roles[rdev->desc_nr]);
 		switch(role) {
 		case 0xffff: /* spare */


Re: Large single raid and XFS or two small ones and EXT3?

2006-06-23 Thread Neil Brown
On Friday June 23, [EMAIL PROTECTED] wrote:
  The problem is that there is no cost effective backup available.
 
 One-liner questions :
 - How does Google make backups ?

No, Google ARE the backups :-)

 - Aren't tapes dead yet ?

LTO-3 does 300Gig, and LTO-4 is planned.
They may not cope with tera-byte arrays in one hit, but they still
have real value.

 - What about a NUMA principle applied to storage ?

You mean an Hierarchical Storage Manager?  Yep, they exist.  I'm sure
SGI, EMC and assorted other TLAs could sell you one.

NeilBrown


Re: read perfomance patchset

2006-06-22 Thread Neil Brown
On Monday June 19, [EMAIL PROTECTED] wrote:
 Neil hello
 
 if i am not mistaken here:
 
 in first instance of :   if(bi) ...
...
 
 you return without setting to NULL
 

Yes, you are right. Thanks.
And fixing that bug removes the crash.
However

I've been doing a few tests and it is hard to measure much
improvement, which is strange.

I can maybe see a 1% improvement but that could just be noise.
I'll do some more and see if I can find out what is happening.

Interestingly, with a simple
  dd if=/dev/md1 of=/dev/null bs=1024k
test, 2.6.16 is substantially faster (10%) than 2.6.17-rc6-mm2 before
the patches are added.  There is something weird there.

Have you done any testing?

NeilBrown


Re: Is shrinking raid5 possible?

2006-06-22 Thread Neil Brown
On Thursday June 22, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
 
 On Monday June 19, [EMAIL PROTECTED] wrote:
   
 
 Hi,
 
 I'd like to shrink the size of a RAID5 array - is this
 possible? My first attempt shrinking 1.4Tb to 600Gb,
 
 mdadm --grow /dev/md5 --size=629145600
 
 gives
 
 mdadm: Cannot set device size/shape for /dev/md5: No space left on device
 
 
 
 Yep.
 The '--size' option refers to:
   Amount  (in  Kibibytes)  of  space  to  use  from  each drive 
  in
   RAID1/4/5/6.  This must be a multiple of  the  chunk  size,  
  and
   must  leave about 128Kb of space at the end of the drive for 
  the
   RAID superblock.  
 (from the man page).
 
 So you were telling md to use the first 600GB of each device in the
 array, and it told you there wasn't that much room.
 If your array has N drives, you need to divide the target array size
 by N-1 to find the target device size.
 So if you have a 5 drive array, then you want
   --size=157286400
 
 
 May I say in all honesty that making people do that math instead of the 
 computer is a really bad user interface? Good, consider it said. A 
 means to just set the target size of the resulting raid device would be 
 a LOT less likely to cause bad user input, and while I'm complaining it 
 should understand the suffixes 'k', 'm', and 'g'.

Let me put another perspective on this.

Why would you ever want to reduce the size of a raid5 in this way?
The only reason that I can think of is that you want to repartition
each device to use a smaller partition for the raid5, and free up some
space for something else.  If that is what you are doing, you will
have already done the math and you will know what size you want your
final partitions to be, so setting the device size is just as easy as
setting the array size.

If you really are interested in array size and have no interest in
recouping the wasted space on the drives, then there would be no point
in shrinking the array (that I can think of).
Just 'mkfs' a filesystem to the desired size and ignore the rest of
the array.


In short, reducing a raid5 to a particular size isn't something that
really makes sense to me.  Reducing the amount of each device that is
used does - though I would much more expect people to want to increase
that size.

If Paul really has a reason to reduce the array to a particular size
then fine.  I'm mildly curious, but it's his business and I'm happy
for mdadm to support it, though indirectly.  But I strongly suspect
that most people who want to resize their array will be thinking in
terms of the amount of each device that is used, so that is how mdadm
works.

 
 Far easier to use for the case where you need, for instance, 10G of 
 storage for a database, tell mdadm what devices to use and what you need 
 (and the level of course) and let the computer figure out the details, 
 rounding up, leaving 128k, and phase of the moon if you decide to use it.
 

mdadm is not intended to be a tool that manages your storage for you.  If
you want that, then I suspect EVMS is what you want (though I am only
guessing - I've never used it).  mdadm is a tool that enables YOU to
manage your storage.

NeilBrown



 Sorry, I think the current approach is baaad human interface.
 
 -- 
 bill davidsen [EMAIL PROTECTED]
   CTO TMR Associates, Inc
   Doing interesting things with small computers since 1979


Re: Raid5 reshape

2006-06-21 Thread Neil Brown
On Tuesday June 20, [EMAIL PROTECTED] wrote:
 Nigel J. Terry wrote:
 
 Well good news and bad news I'm afraid...
 
 Well I would like to be able to tell you that the time calculation now 
 works, but I can't. Here's why: When I rebooted with the newly built 
 kernel, it decided to hit the magic 21 reboots and hence decided to 
 check the array for clean. That normally takes about 5-10 mins, but this 
 time took several hours, so I went to bed! I suspect that it was doing 
 the full reshape or something similar at boot time.
 

What magic 21 reboots??  md has no mechanism to automatically check
the array after N reboots or anything like that.  Or are you thinking
of the 'fsck' that does a full check every so-often?
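
For what it's worth, that periodic boot-time check is an ext2/ext3
feature rather than an md one; it is controlled per filesystem with
tune2fs, for example (the device name is only an illustration):

  tune2fs -l /dev/md0 | grep -i 'mount count'   # show the current counters
  tune2fs -c 50 /dev/md0                        # only force a check every 50 mounts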


 Now I am not sure that this makes good sense in a normal environment. 
 This could keep a server down for hours or days. I might suggest that if 
 such work was required, the clean check is postponed till next boot and 
 the reshape allowed to continue in the background.

An fsck cannot tell if there is a reshape happening, but the reshape
should notice the fsck and slow down to a crawl so the fsck can complete...

 
 Anyway the good news is that this morning, all is well, the array is 
 clean and grown as can be seen below. However, if you look further below 
 you will see the section from dmesg which still shows RIP errors, so I 
 guess there is still something wrong, even though it looks like it is 
 working. Let me know if i can provide any more information.
 
 Once again, many thanks. All I need to do now is grow the ext3 filesystem...
.

 ...ok start reshape thread
 md: syncing RAID array md0
 md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
 md: using maximum available idle IO bandwidth (but not more than 20 
 KB/sec) for reconstruction.
 md: using 128k window, over a total of 245111552 blocks.
 Unable to handle kernel NULL pointer dereference at  RIP:
 {stext+2145382632}
 PGD 7c3f9067 PUD 7cb9e067 PMD 0

 Process md0_reshape (pid: 1432, threadinfo 81007aa42000, task 
 810037f497b0)
 Stack: 803dce42  1d383600 
   
 
 Call Trace: 803dce42{md_do_sync+1307} 
 802640c0{thread_return+0}
8026411e{thread_return+94} 
 8029925d{keventd_create_kthread+0}
803dd3d9{md_thread+248} 

That looks very much like the bug that I already sent you a patch for!
Are you sure that the new kernel still had this patch?

I'm a bit confused by this

NeilBrown


Re: Can't get drives containing spare devices to spindown

2006-06-21 Thread Neil Brown
On Thursday June 22, [EMAIL PROTECTED] wrote:
 Marc L. de Bruin wrote:
 
  Situation: /dev/md0, type raid1, containing 2 active devices 
  (/dev/hda1 and /dev/hdc1) and 2 spare devices (/dev/hde1 and /dev/hdg1).
 
  Those two spare 'partitions' are the only partitions on those disks 
  and therefore I'd like to spin down those disks using hdparm for 
  obvious reasons (noise, heat). Specifically, 'hdparm -S value 
  device' sets the standby (spindown) timeout for a drive; the value 
  is used by the drive to determine how long to wait (with no disk 
  activity) before turning off the spindle motor to save power.
 
  However, it turns out that md actually sort-of prevents those spare 
  disks to spindown. I can get them off for about 3 to 4 seconds, after 
  which they immediately spin up again. Removing the spare devices from 
  /dev/md0 (mdadm /dev/md0 --remove /dev/hd[eg]1) actually solves this, 
  but I have no intention actually removing those devices.
 
  How can I make sure that I'm actually able to spin down those two 
  spare drives?

This is fixed in current -mm kernels and the fix should be in 2.6.18.

NeilBrown


Re: Can't get drives containing spare devices to spindown

2006-06-21 Thread Neil Brown
On Thursday June 22, [EMAIL PROTECTED] wrote:
 
 Thanks Neil for your quick reply. Would it be possible to elaborate a 
 bit on the problem and the solution? I guess I won't be on 2.6.18 for 
 some time...
 

When an array has been idle (no writes) for a short time (20 or 200
ms, depending on which kernel you are running) the array is flagged as
'clean', so that a crash/power failure at that point will not require
a full resync.  The 'clean' flag is stored on all superblocks,
including the spares.  So this causes writes to all devices whenever
the activity status changes.

Even fairly quiet filesystems see occasional updates (updating atime
on files, syncing the journal, and such), and that causes all devices to
be touched.

Fix
 1/ don't set the 'dirty' flag on spares - there really is no need.

However whenever the dirty bit is changed, the 'events' count is
updated, so just doing the above will cause the spares to get way
behind the main devices in their 'events' count so they will no longer
be treated as part of the array.  So

 2/ When clearing the dirty flag (and nothing else has happened),
   decrement the events count rather than increment it.

Together, these mean that simple dirty/clean transitions do not touch
the spares.

NeilBrown


Re: the question about raid0_make_request

2006-06-19 Thread Neil Brown
On Monday June 19, [EMAIL PROTECTED] wrote:
 We can imagine that there is a raid0 array whose layerout is drawn in the
 attachment.
 Take this for example.
 There are 3 zones totally, and their zone-nb_dev is 5,4,3 respectively.
 In the raid0_make_request function, the var block is the offset of bio in
 kilobytes.
 
 
  x = block >> chunksize_bits;
   tmp_dev = zone->dev[sector_div(x, zone->nb_dev)];
 
 If block is in chunk 5, then x = block >> chunksize_bits = 5.  And the
 nb_dev of zone2 is 4.
 So tmp_dev = zone->dev[sector_div(5,4)] = zone->dev[1].
 But we can see that the right result should be zone->dev[0].
 Then how does the 'bug' get the right underlying device?

When you say 'right' result, you really mean 'expected' result.

You expect the layout to be

   0   1   2   3   4
   5   6   7   8
   9   10  11

The actual layout for Linux-md-raid0 is

   0   1   2   3   4
   8   5   6   7
   9   10  11

Not what you would expect, but still a valid layout.

NeilBrown


Re: Raid5 reshape

2006-06-19 Thread Neil Brown
On Monday June 19, [EMAIL PROTECTED] wrote:
 
 That seems to have fixed it. The reshape is now progressing and
 there are no apparent errors in dmesg. Details below. 

Great!

 
 I'll send another confirmation tomorrow when hopefully it has finished :-)
 
 Many thanks for a great product and great support.

And thank you for being a patient beta-tester!

NeilBrown


Re: [PATCH] ANNOUNCE: mdadm 2.5.1 - A tool for managing Soft RAID under Linux

2006-06-19 Thread Neil Brown
On Monday June 19, [EMAIL PROTECTED] wrote:
  Neil Brown wrote:
  I am pleased to announce the availability of
 mdadm version 2.5.1
 
 What the heck, here's another one. :) This one is slightly more serious. 
 We're getting a device of 0:0 in Fail events from the mdadm monitor 
 sometimes now (due to the change in map_dev, which allows it to 
 sometimes return 0:0 instead of just NULL for an unknown device).
 

Thanks for this and the other two.  They are now in .git

 The patch fixes my issue. I don't know if there are more.
 

I chose to do this differently - map_dev will now return NULL for 0,0
and all users can cope with a NULL.

NeilBrown



 Thanks,
 Paul
 --- mdadm-2.5.1/Monitor.c Thu Jun  1 21:33:41 2006
 +++ mdadm-2.5.1-new/Monitor.c Mon Jun 19 14:51:31 2006
 @@ -328,7 +328,7 @@ int Monitor(mddev_dev_t devlist,
   }
   disc.major = disc.minor = 0;
   }
 - if (dv == NULL && st->devid[i])
 + if ((dv == NULL || strcmp(dv, "0:0") == 0) && st->devid[i])
   dv = map_dev(major(st->devid[i]),
        minor(st->devid[i]), 1);
   change = newstate ^ st->devstate[i];


Re: the question about raid0_make_request

2006-06-18 Thread Neil Brown
On Monday June 19, [EMAIL PROTECTED] wrote:
 When I read  the code of raid0_make_request,I meet some questions.
 
 1\ block = bio->bi_sector >> 1, it's the device offset in kilobytes.
 So why do we subtract zone->zone_offset from block?  The
 zone->zone_offset is the zone offset relative to the mddev, in sectors.

zone_offset is set to 'curr_zone_offset' in create_strip_zones;
curr_zone_offset is a sum of 'zone->size' values.
zone->size is (typically) calculated by
 (smallest->size - current_offset) * c
'smallest' is an rdev.
So the units of 'zone_offset' are ultimately the same units as those of
 rdev->size.
rdev->size is set in md.c, e.g. from
 calc_dev_size(rdev, sb->chunk_size);
which uses the value from calc_dev_sboffset, which shifts the size in
bytes by BLOCK_SIZE_BITS, which is defined in fs.h to be 10.
So the units of zone_offset are kilobytes, not sectors.

 
 2\ the code below:
   x = block >> chunksize_bits;
   tmp_dev = zone->dev[sector_div(x, zone->nb_dev)];
 Actually, we get the underlying device by 'sector_div(x,
 zone->nb_dev)'.  The var x is the chunk nr relative to the start of the
 mddev, in my opinion.  But not all of the zone->nb_dev are the same, so we
 can't get the right rdev by 'sector_div(x, zone->nb_dev)', I think.

x is the chunk number relative to the start of the current zone, not
the start of the mddev:
sector_t x = (block - zone->zone_offset) >> chunksize_bits;

taking the remainder after dividing this by the number of devices in
the current zone gives the number of the device to use.

Hope that helps.

NeilBrown


Re: Raid5 reshape

2006-06-17 Thread Neil Brown
On Saturday June 17, [EMAIL PROTECTED] wrote:
 
 Any ideas what I should do next? Thanks
 

Looks like you've probably hit a bug.  I'll need a bit more info
though.

First:

 [EMAIL PROTECTED] ~]# cat /proc/mdstat
 Personalities : [raid5] [raid4]
 md0 : active raid5 sdb1[1] sda1[0] hdc1[4](S) hdb1[2]
   490223104 blocks super 0.91 level 5, 128k chunk, algorithm 2 [4/3] 
 [UUU_]
   [=...]  reshape =  6.9% (17073280/245111552) 
 finish=86.3min speed=44003K/sec
 
 unused devices: none

This really makes it look like the reshape is progressing.  How
long after the reboot was this taken?  How long after hdc1 was
hot-added (roughly)?  What does it show now?

What happens if you remove hdc1 again?  Does the reshape keep going?

What I would expect to happen in this case is that the array reshapes
into a degraded array, then the missing disk is recovered onto hdc1.

NeilBrown


Re: Raid5 reshape

2006-06-17 Thread Neil Brown

OK, thanks for the extra details.  I'll have a look and see what I can
find, but it'll probably be a couple of days before I have anything
useful for you.

NeilBrown


Re: Raid5 reshape

2006-06-16 Thread Neil Brown
On Friday June 16, [EMAIL PROTECTED] wrote:
 You have to grow the ext3 fs separately. ext2resize /dev/mdX. Keep in 
 mind this can only be done off-line.
 

ext3 can be resized online. I think ext2resize in the latest release
will do the right thing whether it is online or not.

There is a limit to the amount of expansion that can be achieved
on-line.  This limit is set when making the filesystem.  Depending on
which version of ext2-utils you used to make the filesystem, it may or
may not already be prepared for substantial expansion.

So if you want to do it on-line, give it a try or ask on the
ext3-users list for particular details on what versions you need and
how to see if your fs can be expanded.
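
As a rough illustration (assuming a reasonably recent e2fsprogs; check
the man page shipped with your version):

  resize2fs /dev/mdX    # with no size given it grows the fs to fill the device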

NeilBrown


Re: IBM xSeries stop responding during RAID1 reconstruction

2006-06-15 Thread Neil Brown
On Thursday June 15, [EMAIL PROTECTED] wrote:
 On Wed, Jun 14, 2006 at 10:46:09AM -0500, Bill Cizek wrote:
  Niccolo Rigacci wrote:
  
  When the sync is complete, the machine start to respond again 
  perfectly.
  
  I was able to work around this by lowering 
  /proc/sys/dev/raid/speed_limit_max to a value
  below my disk thruput value (~ 50 MB/s) as follows:
  
  $ echo 45000 > /proc/sys/dev/raid/speed_limit_max
 
 Thanks!
 
 This hack seems to solve my problem too. So it seems that the 
 RAID subsystem does not detect a proper speed to throttle the 
 sync.

The RAID subsystem doesn't try to detect a 'proper' speed.
When there is nothing else happening, it just drives the disks as fast
as they will go.
If this is causing a lockup, then there is something else wrong, just
as any single process should not - by writing constantly to disks - be
able to clog up the whole system.

Maybe if you could get the result of 
  alt-sysrq-P
or even
  alt-sysrq-T
while the system seems to hang.
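
If the console keys are awkward to reach, the same traces can usually
be requested through procfs, assuming the magic SysRq interface is
enabled in your kernel:

  echo 1 > /proc/sys/kernel/sysrq     # enable SysRq
  echo p > /proc/sysrq-trigger        # CPU/register dump, like alt-sysrq-P
  echo t > /proc/sysrq-trigger        # task list, like alt-sysrq-T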

NeilBrown


Re: Raid5 software problems after loosing 4 disks for 48 hours

2006-06-15 Thread Neil Brown
On Friday June 16, [EMAIL PROTECTED] wrote:
 
 And is there a way, if more than 1 disk goes offline, for the whole
 array to be taken offline?  My understanding of raid5 is: lose 1+ disks
 and nothing on the raid would be readable.  This is not the case here.
 

Nothing will be writable, but some blocks might be readable.


 All the disks are online now, what do I need to do to rebuild the array?

Have you tried
  mdadm --assemble --force /dev/md0 /dev/sd[bcdefghijklmnop]1
??
Actually, it occurs to me that that might not do the best thing if 4
drives disappeared at exactly the same time (though it is unlikely
that you would notice).
You should probably use
 mdadm --create /dev/md0 -f -l5 -n15 -c32 /dev/sd[bcdefghijklmnop]1
This is assuming that e,f,g,h were in that order in the array before
they died.
The '-f' is quite important - it tells mdadm not to recover a spare, but
to resync the parity blocks.
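
Afterwards it is worth checking the result read-only before trusting
it, for example (names are placeholders):

  mdadm --detail /dev/md0
  fsck -n /dev/md0      # report-only, writes nothing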

NeilBrown


Re: to understand the logic of raid0_make_request

2006-06-15 Thread Neil Brown
On Friday June 16, [EMAIL PROTECTED] wrote:
 
 
 Thanks a lot.  I went through the code again following your guide.  But I
 still can't understand how the bio->bi_sector and bio->bi_dev are
 computed.  I don't know what the var 'block' stands for.
 Could you explain them to me?

'block' is simply bi_sector/2 - the device offset in kilobytes
rather than in sectors.

raid0 supports having devices of different sizes.
The array is divided into 'zones'.
The first zone has all devices, and extends as far as the smallest
device.
The last zone extends to the end of the largest device, and may have
only one, or several, devices in it.
There may be other zones depending on how many different sizes of device
there are.

The first thing that happens is the correct zone is found by looking in
the hash_table.  Then we subtract the zone offset, divide by the chunk
size, and then divide by the number of devices in that zone.  The
remainder of this last division tells us which device to use.
Then we multiply back out to find the offset in that device.
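
In code form, that mapping looks roughly like the sketch below.  This
is illustrative only, with simplified stand-in types rather than the
kernel's strip_zone structures; units are kilobytes, matching 'block'
above:

  struct zone_map {
          unsigned long zone_offset;  /* start of this zone within the array */
          unsigned long dev_offset;   /* start of this zone within each member */
          int nb_dev;                 /* number of member devices in this zone */
  };

  static void raid0_map(struct zone_map *z, unsigned long block, int chunk_kb,
                        int *dev, unsigned long *dev_block)
  {
          unsigned long zoff  = block - z->zone_offset;  /* offset inside the zone */
          unsigned long chunk = zoff / chunk_kb;         /* chunk number inside the zone */
          unsigned long rest  = zoff % chunk_kb;         /* offset inside the chunk */

          *dev       = chunk % z->nb_dev;                /* remainder picks the device */
          *dev_block = z->dev_offset
                     + (chunk / z->nb_dev) * chunk_kb    /* whole chunks before this one */
                     + rest;                             /* multiply back out */
  }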

I know that is rather brief, but I hope it helps.

NeilBrown


Re: raid6

2006-06-14 Thread Neil Brown
On Thursday June 15, [EMAIL PROTECTED] wrote:
 I am confronted  with a big problem of the raid6 algorithm,
 when recently I learn the raid6 code of linux 2.6 you have contributed
 .
  Unfortunately I can not understand the algorithm of  P +Q parity in
 this program . Is this some formula for this raid6 algorithm? I realy
 respect your help,could you show me some details about this algorithm?

See:
 http://en.wikipedia.org/wiki/Redundant_array_of_independent_disks#RAID_6
and
 http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf
 
NeilBrown


Re: to understand the logic of raid0_make_request

2006-06-12 Thread Neil Brown
On Tuesday June 13, [EMAIL PROTECTED] wrote:
 hello, everyone.
 I am studying the code of raid0.  But I find that the logic of
 raid0_make_request is a little difficult to understand.
 Who can tell me what the function of raid0_make_request will do eventually?

One of two possibilities.

Most often it will update bio->bi_dev and bio->bi_sector to refer to
the correct location on the correct underlying device, and then
will return '1'.
The fact that it returns '1' is noticed by generic_make_request in
block/ll_rw_blk.c and generic_make_request will loop around and
retry the request on the new device at the new offset.

However in the unusual case that the request crosses a chunk boundary
and so needs to be sent to two different devices, raid0_make_request
will split the bio in two (using bio_split), will submit each of the
two bios directly down to the appropriate devices - and will then
return '0', so that generic_make_request doesn't loop around.

I hope that helps.

NeilBrown


Re: raid 5 read performance

2006-06-09 Thread Neil Brown
On Friday June 9, [EMAIL PROTECTED] wrote:
 Neil hello
 
 Sorry for the delay. too many things to do.

You aren't alone there!

 
 I have implemented all said in :
 http://www.spinics.net/lists/raid/msg11838.html
 
 As always I have some questions:
 
 1.  mergeable_bvec
  I did not understand it at first, I must admit.  Now I do not see how it
  differs from the one in raid0, so I actually copied it and renamed it.

Sounds fine.  For raid5 there is no need to force write requests to be
split up, but that's a minor difference.

 
 2. statistics.
 i have added md statistics since the code does not reach the
 statics in make_request.
 it returns from make_request before that.

Why not put the code *after* that?  Not that it matters a great deal.
I'll comment more when I see the code I expect.

 
 3. i have added the new retry list called toread_aligned to raid5_conf_t .
 hope this is correct.
 

Sounds good.


 4.  your instructions are to add a failed bio to sh, but it does not
 say to handle it directly.
 I have tried it and something is missing here.  raid5d handles
 stripes only if conf->handle_list is not empty.  I added handle_stripe
 and release_stripe of my own.
 This way I managed to get, from the completion routine, the
 R5: read error corrected!! message (I have tested by failing
 a ram disk).
 

Sounds right, but I'd need to see the code to be sure.

 
 5. I am going to test the non common path heavily before submitting
 you the patch ( on real disks  and use  several file systems and
 several chunk sizes).

I'd rather see the patch earlier, even if it isn't fully tested.

  It is quite a big patch so I need to know which kernel do you want me
 to use ? i am using poor 2.6.15.

A patch against the latest -mm would be best, but I'm happy to take it
against anything even vaguely recent.

However, it needs to be multiple patches, not just one.
This is a *very* important point.  As that original email said:

  This should be developed and eventually presented as a sequence of
  patches.

There are several distinct steps in this change and they need to be
reviewed separately or it is just too hard.
So I would really like it if you could separate out the changes into
logically distinct patches.
If you can't or won't, then still send the patch, but I'll have to
break it up so it'll probably take longer to process.


Thanks for your efforts,

NeilBrown



Re: Raid5 read error correction log

2006-06-04 Thread Neil Brown
On Saturday June 3, [EMAIL PROTECTED] wrote:
 Hey Neil,
 
 It would sure be nice if the log contained any info about the error
 correction that's been done rather than simply saying read error
 corrected, like which array chunk, device and sector was corrected. I'm
 having a persistent pending sector on a drive, and when I do check or
 repair, it says read error corrected many times, but I don't know
 whether it's doing the same sector over and over or if there are just so
 many of them...  I seem to remember reading something about this on the
 list some time ago, is it already in the kernel? (I'm running 2.6.17-rc4
 now).

Yes added to todo list:
   include sector/dev info in read-error-corrected messages

 
 Btw, when it does correct a read error, I assume it also tries to read
 it again to verify that the correction worked?

Yes.  It doesn't check that the read returns the correct data, but it
does check that a read succeeds.  However I'm not certain that the
read request will punch through any cache on the drive.  It could be
that the reads return data out of the cache without accessing data on
the surface of the disk

NeilBrown


Re: problems with raid6, mdadm: RUN_ARRAY failed

2006-06-04 Thread Neil Brown
On Friday June 2, [EMAIL PROTECTED] wrote:
 I have some old controler Mylex Acceleraid 170LP with 6 SCSI 36GB disks on
 it. Running hardware raid5 resulted with very poor performance (7Mb/sec in
 sequential writing, with horrid iowait).
 
 So I configured it to export 6 logical disks and tried creating raid6 and see
 if I can get better results. Trying to create an array with a missing 
 component
 results in:
 
 ~/mdadm-2.5/mdadm -C /dev/md3 -l6 -n6 /dev/rd/c0d0p3  /dev/rd/c0d2p3 
 /dev/rd/c0d3p3 /dev/rd/c0d4p3 /dev/rd/c0d5p3 missing
 mdadm: RUN_ARRAY failed: Input/output error

There should have been some messages in the kernel log when this
happened.  Can you report them too?

Thanks,
NeilBrown


Re: raid5 hang on get_active_stripe

2006-06-02 Thread Neil Brown
On Friday June 2, [EMAIL PROTECTED] wrote:
 On Thu, 1 Jun 2006, Neil Brown wrote:
 
  I've got one more long-shot I would like to try first.  If you could
  backout that change to ll_rw_block, and apply this patch instead.
  Then when it hangs, just cat the stripe_cache_active file and see if
  that unplugs things or not (cat it a few times).
 
 nope that didn't unstick it... i had to raise stripe_cache_size (from 256 
 to 768... 512 wasn't enough)...
 
 -dean

Ok, thanks.
I still don't know what is really going on, but I'm 99.9863% sure this
will fix it, and is a reasonable thing to do.
(Yes, I lose a ';'.  That is deliberate).

Please let me know what this proves, and thanks again for your
patience.

NeilBrown


Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/raid5.c |5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~   2006-05-28 21:56:56.0 +1000
+++ ./drivers/md/raid5.c	2006-06-02 17:24:07.0 +1000
@@ -285,7 +285,7 @@ static struct stripe_head *get_active_st
 				     < (conf->max_nr_stripes *3/4)
 				     || !conf->inactive_blocked),
 				    conf->device_lock,
-				    unplug_slaves(conf->mddev);
+				    raid5_unplug_device(conf->mddev->queue)
 			);
 			conf->inactive_blocked = 0;
 		} else


Re: Clarifications about check/repair, i.e. RAID SCRUBBING

2006-06-02 Thread Neil Brown
On Friday June 2, [EMAIL PROTECTED] wrote:
 
 In any regard:
 
 I'm talking about triggering the following functionality:
 
 echo check > /sys/block/mdX/md/sync_action
 echo repair > /sys/block/mdX/md/sync_action
 
 On a RAID5, and soon a RAID6, I'm looking to set up a cron job, and am 
 trying to figure out what exactly to schedule.  The answers to the 
 following questions might shed some light on this:
 
 1. GENERALLY SPEAKING, WHAT IS THE DIFFERENCE BETWEEN THE CHECK AND 
 REPAIR COMMANDS?
 The md.txt doc mentions for check that a repair may also happen for 
 some raid levels.
 Which RAID levels, and in what cases?  If I perform a check is there a 
 cache of bad blocks that need to be fixed that can quickly be repaired 
 by executing the repair command?  Or would it go through the entire 
 array again?  I'm working with new drives, and haven't come across any 
 bad blocks to test this with.

'check' just reads everything and doesn't trigger any writes unless a
read error is detected, in which case the normal read-error handling
kicks in.  So it can be useful on a read-only array.

'repair' does the same, but when it finds an inconsistency it corrects
it by writing something.
If any raid personality had not been taught to specifically understand
'check', then a 'check' run would effect a 'repair'.  I think 2.6.17
will have all personalities doing the right thing.

check doesn't keep a record of problems, just a count.  'repair' will
reprocess the whole array.
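
In practice a scrub cron job boils down to something like this (md0 is
just a placeholder; the counter is exposed as mismatch_cnt in recent
kernels):

  echo check > /sys/block/md0/md/sync_action
  # ... once it finishes ...
  cat /sys/block/md0/md/mismatch_cnt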


 
 2. CAN CHECK BE RUN ON A DEGRADED ARRAY (say with N out of N+1 disks 
 on a RAID level 5)?  I can test this out, but was it designed to do 
 this, versus REPAIR only working on a full set of active drives? 
 Perhaps repair is assuming that I have N+1 disks so that parity can be 
 WRITTEN?

No, check on a degraded raid5, or a raid6 with 2 missing devices, or a
raid1 with only one device will not do anything.  It will terminate
immediately.   After all, there is nothing useful that it can do.

 
 3. RE: FEEDBACK/LOGGING: it seems that I might see some messages in 
 dmesg logging output such as raid5:read error corrected!, is that 
 right?  I realize that mismatch_count can also be used to see if there 
 was any action during a check or repair.  I'm assuming this stuff 
 doesn't make its way into an email.

You are correct on all counts.  mdadm --monitor doesn't know about
this yet. ((writes notes in mdadm todo list)).

 
 4. DOES REPAIR PERFORM READS TO CHECK THE ARRAY, AND THEN WRITE TO THE 
 ARRAY *ONLY WHEN NECESSARY* TO PERFORM FIXES FOR CERTAIN BLOCKS?  (I 
 know, it's sorta a repeat of question number 1+2).
 

repair only writes when necessary.  In the normal case, it will only
read every block.


 5. IS THERE ILL-EFFECT TO STOP EITHER CHECK OR REPAIR BY ISSUING IDLE?

No.

 
 6. IS IT AT ALL POSSIBLE TO CHECK A CERTAIN RANGE OF BLOCKS?  And to 
 keep track of which blocks were checked?  The motivation is to start 
 checking some blocks overnight, and to pick-up where I left off the next 
 night...

Not yet.  It might be possible one day.

 
 7. ANY OTHER CONSIDERATIONS WHEN SCRUBBING THE RAID?
 

Not that I am aware of.

NeilBrown


 Sorry for some of these questions being so similar in nature.  I just 
 want to make sure I understand it correctly.
 
 Neil, again, a BIG thanks for this new functionality.  I'm looking 
 forward to putting a system in place to exercise my drives!
 
 Cheers,
 
 -- roy
 -
 To unsubscribe from this list: send the line unsubscribe linux-raid in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID5E

2006-05-31 Thread Neil Brown
On Wednesday May 31, [EMAIL PROTECTED] wrote:
 Where I was working most recently some systems were using RAID5E (RAID5 
 with both the parity and hot spare distributed). This seems to be highly 
 desirable for small arrays, where spreading head motion over one more 
 drive will improve performance, and in all cases where a rebuild to the 
 hot spare will avoid a bottleneck on a single drive.
 
 Is there any plan to add this capability?

I thought about it briefly

As I understand it, the layout of raid5e when non-degraded is very
similar to raid6 - however the 'Q' block is simply not used.
This would be trivial to implement.

The interesting bit comes when a device fails and you want to rebuild
that distributed spare.
There are two possible ways that you could do this:

1/ Leave the spare where it is and write the correct data into each
 spare.  This would be fairly easy but would leave an array with a
 very ... interesting layout of data.
 When you add a replacement you just move everything back.

2/ reshape the array to be a regular raid5 layout.  This would be hard
 to do well without NVRAM as you are moving live data, but would result
 in a neat and tidy array.  Of course adding a drive back in would be
 interesting again...

I had previously only thought of option '2', and so discarded the idea
as not worth the effort.  The more I think about it, the more possible
option 1 sounds.
I've put it back on my todo list, but I don't expect to get to it this
year.  Of course if someone else wants to give it a try, I'm happy to
make suggestions and review code.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 006 of 10] md: Set/get state of array via sysfs

2006-05-31 Thread Neil Brown
On Wednesday May 31, [EMAIL PROTECTED] wrote:
 * NeilBrown ([EMAIL PROTECTED]) wrote:
  
  This allows the state of an md/array to be directly controlled
  via sysfs and adds the ability to stop and array without
  tearing it down.
  
  Array states/settings:
  
   clear
   No devices, no size, no level
   Equivalent to STOP_ARRAY ioctl
 
 It looks like this demoted CAP_SYS_ADMIN to CAP_DAC_OVERRIDE for the
 equiv ioctl.  Intentional?

Uhm.. no.  Thanks.  I'll fix that, see if I've done similar things
elsewhere, and keep it in mind for the future.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 008 of 10] md: Allow raid 'layout' to be read and set via sysfs.

2006-05-31 Thread Neil Brown
On Wednesday May 31, [EMAIL PROTECTED] wrote:
 * NeilBrown ([EMAIL PROTECTED]) wrote:
  +static struct md_sysfs_entry md_layout =
  +__ATTR(layout, 0655, layout_show, layout_store);
 
 0644?

I think the correct response is Doh! :-)

Yes, thanks,
NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID 5 Whole Devices - Partition

2006-05-30 Thread Neil Brown
On Tuesday May 30, [EMAIL PROTECTED] wrote:
 Hello,
 
 I am trying to create a RAID5 array out of 3 160GB SATA drives. After
 i create the array i want to partition the device into 2 partitions.
 
 The system lies on a SCSI disk and the 2 partitions will be used for
 data storage.
 The SATA host is an HPT374 device with drivers compiled in the kernel.
 
 These are the steps i followed
 
 mdadm -Cv --auto=part /dev/md_d0 --chunk=64 -l 5 --raid-devices=3
 /dev/hde /dev/hdi /dev/hdk
 
 Running this command notifies me that there is an ext2 fs on one of
 the drives even if i fdisked them before and removed all partititions.
 Why is this happening?

The ext2 superblock is in the second 1K of the device.
The only place that fdisk writes is in the first 512 bytes.  So fdisk
is never going to remove the signature of an ext2 filesystem.
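
If you want to get rid of the stale signature first, one possible (destructive)
way is to zero the start of the member device before creating the array, e.g.:

    # clears the old ext2 superblock in the second 1K; destroys any data there
    dd if=/dev/zero of=/dev/hde bs=1k count=8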


 
 In anycase i continue with the array creation

This is the right thing to do.

 
 After initialization 5 new devices are created in /dev
 
 /dev/md_d0
 /dev/md_d0p1
 /dev/md_d0_p1
 /dev/md_d0_p2
 /dev/md_d0_p3
 /dev/md_d0_p4
 
 The problems arise when i reboot.
 A device /dev/md0 seems to keep the 3 disks busy and as a result when

You need to find out where that is coming from.  Complete kernel logs
might help.  Maybe you have an initrd which is trying to be helpful?


 the time comes
 to assemble the array i get the error that the disks are busy.
 When the system boots i cat /proc/mdstat and see that /dev/md0 is a
 raid5 array made of the two disks and it comes up as degraded
 
 I can then stop the array using mdadm -S /dev/md0 and restart it using
 mdadm -As which uses the correct /dev/md_d0. Examining that shows its
 clean and ok
 
 /dev/md_d0:
 Version : 00.90.01
   Creation Time : Tue May 30 17:03:31 2006
  Raid Level : raid5
  Array Size : 312581632 (298.10 GiB 320.08 GB)
 Device Size : 156290816 (149.05 GiB 160.04 GB)
Raid Devices : 3
   Total Devices : 3
 Preferred Minor : 0
 Persistence : Superblock is persistent
 
 Update Time : Tue May 30 19:48:03 2006
   State : clean
  Active Devices : 3
 Working Devices : 3
  Failed Devices : 0
   Spare Devices : 0
 
  Layout : left-symmetric
  Chunk Size : 64K
 
 Number   Major   Minor   RaidDevice State
0  3300  active sync   /dev/hde
1  5601  active sync   /dev/hdi
2  5702  active sync   /dev/hdk
UUID : 9f520781:7f3c2052:1cb5078e:c3f3b95c
  Events : 0.2
 
 Is this the expected behavior? Why doesnt the kernel ignore /dev/md0
 and tries to use it? I tried using raid=noautodetect but it didnt help
 I am using 2.6.9

Must be something else trying to start the array.  Maybe a stray
'raidstart'.  Maybe something in an initrd.

 
 This is my mdadm.conf
 DEVICE /dev/hde /dev/hdi /dev/hdk
 ARRAY /dev/md_d0 level=raid5 num-devices=3
 UUID=9f520781:7f3c2052:1cb5078e:c3f3b95c
devices=/dev/hde,/dev/hdi,/dev/hdk auto=partition
 MAILADDR [EMAIL PROTECTED]

This should work providing the device names of the ide drives never
change  -- which is fairly safe.  It isn't safe for SCSI drives.


 
 Furthermore when i fdisk the drives after all of this i can see the 2
 partitions on /dev/hde and /dev/hdi but /dev/hdk shows that no
 partition exists. Is this a sign of data corruption or drive failure?
 Shouldnt all 3 drives show the same partition information?

No.  The drives shouldn't really have partition information at all.
The raid array has the partition information.
However the first block of /dev/hde is also the first block of
/dev/md_d0, so it will appear to have the same partition table.
And the first block of /dev/hdk is an 'xor' of the first blocks of hdi
and hde.  So if the first block of hdi is all zeros, then the first
block of /dev/hdk will have the same partition table.


 fdisk /dev/hde
 /dev/hde1   1   19457   156288352   fd  Linux raid autodetect
 
 fdisk /dev/hdi
 /dev/hdi1   1   19457   156288321   fd  Linux raid
 autodetect

When you created the partitions in /dev/md_d0, you must have set the
partition type to 'Linux raid autodetect'.  You don't want to do that.
Change it to 'Linux' or whatever.
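
For example, inside fdisk on the array device (not on the member disks):

    fdisk /dev/md_d0
    #  t   - change a partition's type
    #  1   - partition number
    #  83  - 'Linux'
    #  w   - write the table and exit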

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 hang on get_active_stripe

2006-05-30 Thread Neil Brown
On Tuesday May 30, [EMAIL PROTECTED] wrote:
 On Tue, 30 May 2006, Neil Brown wrote:
 
  Could you try this patch please?  On top of the rest.
  And if it doesn't fail in a couple of days, tell me how regularly the
  message 
 kblockd_schedule_work failed
  gets printed.
 
 i'm running this patch now ... and just after reboot, no freeze yet, i've 
 already seen a handful of these:
 
 May 30 17:05:09 localhost kernel: kblockd_schedule_work failed
 May 30 17:05:59 localhost kernel: kblockd_schedule_work failed
 May 30 17:08:16 localhost kernel: kblockd_schedule_work failed
 May 30 17:10:51 localhost kernel: kblockd_schedule_work failed
 May 30 17:11:51 localhost kernel: kblockd_schedule_work failed
 May 30 17:12:46 localhost kernel: kblockd_schedule_work failed
 May 30 17:14:14 localhost kernel: kblockd_schedule_work failed

1 every minute or so.  That's probably more than I would have
expected, but strongly lends evidence to the theory that this is the
problem.

I certainly wouldn't expect a failure every time kblockd_schedule_work
failed (in the original code), but the fact that it does fail
sometimes means there is a possible race which can cause the failure
that you experienced.

So I am optimistic that the patch will have fixed the problem.  Please
let me know when you reach an uptime of 3 days.

Thanks,
NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 hang on get_active_stripe

2006-05-30 Thread Neil Brown
On Tuesday May 30, [EMAIL PROTECTED] wrote:
 
 actually i think the rate is higher... i'm not sure why, but klogd doesn't 
 seem to keep up with it:
 
 [EMAIL PROTECTED]:~# grep -c kblockd_schedule_work /var/log/messages
 31
 [EMAIL PROTECTED]:~# dmesg | grep -c kblockd_schedule_work
 8192

# grep 'last message repeated' /var/log/messages
??

Obviously even faster than I thought.  I guess workqueue threads must
take a while to get scheduled...
I'm beginning to wonder if I really have found the bug after all :-(

I'll look forward to the results either way.

Thanks,
NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] mdadm 2.5 (Was: ANNOUNCE: mdadm 2.5 - A tool for managing Soft RAID under Linux)

2006-05-29 Thread Neil Brown
On Monday May 29, [EMAIL PROTECTED] wrote:
 On Mon, May 29, 2006 at 12:08:25PM +1000, Neil Brown wrote:
 On Sunday May 28, [EMAIL PROTECTED] wrote:
 Thanks for the patches.  They are greatly appreciated.
 You're welcome
  
  - mdadm-2.3.1-kernel-byteswap-include-fix.patch
  reverts a change introduced with mdadm 2.3.1 for redhat compatibility
  asm/byteorder.h is an architecture dependent file and does more
  stuff than a call to the linux/byteorder/XXX_endian.h
  the fact that not calling asm/byteorder.h does not define
  __BYTEORDER_HAS_U64__ is just an example of issues that might arise.
  if redhat is broken it should be worked around differently than breaking
  mdadm.
 
 I don't understand the problem here.  What exactly breaks with the
 code currently in 2.5?  mdadm doesn't need __BYTEORDER_HAS_U64__, so
 why does not having it defined break anything?
 The comment from the patch says:
   not including asm/byteorder.h will not define __BYTEORDER_HAS_U64__
   causing __fswab64 to be undefined and failure compiling mdadm on
   big_endian architectures like PPC
 
 But again, mdadm doesn't use __fswab64 
 More details please.
 you use __cpu_to_le64 (ie in super0.c line 987)
 
 bms->sync_size = __cpu_to_le64(size);
 
 which in byteorder/big_endian.h is defined as
 
 #define __cpu_to_le64(x) ((__force __le64)__swab64((x)))
 
 but __swab64 is defined in byteorder/swab.h (included by
 byteorder/big_endian.h) as
 
 #if defined(__GNUC__) && (__GNUC__ >= 2) && defined(__OPTIMIZE__)
 #  define __swab64(x) \
 (__builtin_constant_p((__u64)(x)) ? \
 ___swab64((x)) : \
 __fswab64((x)))
 #else
 #  define __swab64(x) __fswab64(x)
 #endif /* OPTIMIZE */


Grrr..

Thanks for the details.  I think I'll just give up and do it myself.
e.g.
short swap16(short in)
{
	int i;
	short out=0;
	for (i=0; i<4; i++) {
		out = out<<8 | (in&255);
		in = in >> 8;
	}
	return out;
}

I don't need top performance and at least this should be portable...

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 hang on get_active_stripe

2006-05-28 Thread Neil Brown
On Saturday May 27, [EMAIL PROTECTED] wrote:
 On Sat, 27 May 2006, Neil Brown wrote:
 
  Thanks.  This narrows it down quite a bit... too much infact:  I can
  now say for sure that this cannot possible happen :-)
  
2/ The message.gz you sent earlier with the
     echo t > /proc/sysrq-trigger
   trace in it didn't contain information about md4_raid5 - the 
 
 got another hang again this morning... full dmesg output attached.
 

Thanks.  Nothing surprising there, which maybe is a surprise itself...

I'm still somewhat stumped by this.  But given that it is nicely
repeatable, I'm sure we can get there...

The following patch adds some more tracing to raid5, and might fix a
subtle bug in ll_rw_blk, though it is an incredible long shot that
this could be affecting raid5 (if it is, I'll have to assume there is
another bug somewhere).   It certainly doesn't break ll_rw_blk.
Whether it actually fixes something I'm not sure.

If you could try with these on top of the previous patches I'd really
appreciate it.

When you read from /stripe_cache_active, it should trigger a
(cryptic) kernel message within the next 15 seconds.  If I could get
the contents of that file and the kernel messages, that should help.

Thanks heaps,

NeilBrown


Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./block/ll_rw_blk.c  |4 ++--
 ./drivers/md/raid5.c |   18 ++
 2 files changed, 20 insertions(+), 2 deletions(-)

diff ./block/ll_rw_blk.c~current~ ./block/ll_rw_blk.c
--- ./block/ll_rw_blk.c~current~	2006-05-28 21:54:23.0 +1000
+++ ./block/ll_rw_blk.c	2006-05-28 21:55:17.0 +1000
@@ -874,7 +874,7 @@ static void __blk_queue_free_tags(reques
 	}
 
 	q->queue_tags = NULL;
-	q->queue_flags &= ~(1 << QUEUE_FLAG_QUEUED);
+	clear_bit(QUEUE_FLAG_QUEUED, &q->queue_flags);
 }
 
 /**
@@ -963,7 +963,7 @@ int blk_queue_init_tags(request_queue_t 
 	 * assign it, all done
 	 */
 	q->queue_tags = tags;
-	q->queue_flags |= (1 << QUEUE_FLAG_QUEUED);
+	set_bit(QUEUE_FLAG_QUEUED, &q->queue_flags);
 	return 0;
 fail:
 	kfree(tags);

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~   2006-05-27 09:17:10.0 +1000
+++ ./drivers/md/raid5.c2006-05-28 21:56:56.0 +1000
@@ -1701,13 +1701,20 @@ static sector_t sync_request(mddev_t *md
  * During the scan, completed stripes are saved for us by the interrupt
  * handler, so that they will not have to wait for our next wakeup.
  */
+static unsigned long trigger;
+
 static void raid5d (mddev_t *mddev)
 {
struct stripe_head *sh;
raid5_conf_t *conf = mddev_to_conf(mddev);
int handled;
+   int trace = 0;
 
PRINTK(+++ raid5d active\n);
+   if (test_and_clear_bit(0, trigger))
+   trace = 1;
+   if (trace)
+   printk(raid5d runs\n);
 
md_check_recovery(mddev);
 
@@ -1725,6 +1732,13 @@ static void raid5d (mddev_t *mddev)
activate_bit_delay(conf);
}
 
+   if (trace)
+   printk( le=%d, pas=%d, bqp=%d le=%d\n,
+  list_empty(conf-handle_list),
+  atomic_read(conf-preread_active_stripes),
+  blk_queue_plugged(mddev-queue),
+  list_empty(conf-delayed_list));
+
if (list_empty(conf-handle_list) 
atomic_read(conf-preread_active_stripes)  IO_THRESHOLD 
!blk_queue_plugged(mddev-queue) 
@@ -1756,6 +1770,8 @@ static void raid5d (mddev_t *mddev)
unplug_slaves(mddev);
 
PRINTK(--- raid5d inactive\n);
+   if (trace)
+   printk(raid5d done\n);
 }
 
 static ssize_t
@@ -1813,6 +1829,7 @@ stripe_cache_active_show(mddev_t *mddev,
struct list_head *l;
n = sprintf(page, %d\n, atomic_read(conf-active_stripes));
n += sprintf(page+n, %d preread\n, 
atomic_read(conf-preread_active_stripes));
+   n += sprintf(page+n, %splugged\n, 
blk_queue_plugged(mddev-queue)?:not );
spin_lock_irq(conf-device_lock);
c1=0;
list_for_each(l, conf-bitmap_list)
@@ -1822,6 +1839,7 @@ stripe_cache_active_show(mddev_t *mddev,
c2++;
spin_unlock_irq(conf-device_lock);
n += sprintf(page+n, bitlist=%d delaylist=%d\n, c1, c2);
+   trigger = 0x;
return n;
} else
return 0;
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] mdadm 2.5 (Was: ANNOUNCE: mdadm 2.5 - A tool for managing Soft RAID under Linux)

2006-05-28 Thread Neil Brown
On Sunday May 28, [EMAIL PROTECTED] wrote:
 On Fri, May 26, 2006 at 04:33:08PM +1000, Neil Brown wrote:
 
 
 I am pleased to announce the availability of
mdadm version 2.5
 
 
 hello,
 i tried rebuilding mdadm 2.5 on current mandriva cooker, which uses
 gcc-4.1.1, glibc-2.4 and dietlibc 0.29 and found the following issues
 addressed by patches attacched to this message
 I would be glad if you could review these patches and include them in
 upcoming mdadm releases.

Thanks for the patches.  They are greatly appreciated.

 
 - mdadm-2.3.1-kernel-byteswap-include-fix.patch
 reverts a change introduced with mdadm 2.3.1 for redhat compatibility
 asm/byteorder.h is an architecture dependent file and does more
 stuff than a call to the linux/byteorder/XXX_endian.h
 the fact that not calling asm/byteorder.h does not define
 __BYTEORDER_HAS_U64__ is just an example of issues that might arise.
 if redhat is broken it should be worked around differently than breaking
 mdadm.

I don't understand the problem here.  What exactly breaks with the
code currently in 2.5?  mdadm doesn't need __BYTEORDER_HAS_U64__, so
why does not having it defined break anything?
The comment from the patch says:
  not including asm/byteorder.h will not define __BYTEORDER_HAS_U64__
  causing __fswab64 to be undefined and failure compiling mdadm on
  big_endian architectures like PPC

But again, mdadm doesn't use __fswab64 
More details please.

 
 - mdadm-2.4-snprintf.patch
 this is self commenting, just an error in the snprintf call

I wonder how that snuck in...
There was an odd extra tab in the patch, but no matter.
I changed it to use 'sizeof(buf)' to be consistent with other uses
of snprintf.  Thanks.

 
 - mdadm-2.4-strict-aliasing.patch
 fix for another srict-aliasing problem, you can typecast a reference to a
 void pointer to anything, you cannot typecast a reference to a
 struct.

Why can't I typecast a reference to a struct??? It seems very
unfair...
However I have no problem with the patch.  Applied.  Thanks.
I should really change it to use 'list.h' type lists from the linux
kernel.

 
 - mdadm-2.5-mdassemble.patch
 pass CFLAGS to mdassemble build, enabling -Wall -Werror showed some
 issues also fixed by the patch.

yep, thanks.

 
 - mdadm-2.5-rand.patch
 Posix dictates rand() versus bsd random() function, and dietlibc
 deprecated random(), so switch to srand()/rand() and make everybody
 happy.

Everybody?
'man 3 rand' tells me:

   Do not use this function in applications  intended  to  be
   portable when good randomness is needed.

Admittedly mdadm doesn't need to be portable - it only needs to run on
Linux.  But this line in the man page bothers me.

I guess
-Drandom=rand -Dsrandom=srand
might work... no.  stdlib.h doesn't like that.
'random' returns 'long int' while rand returns 'int'.
Interestingly 'random_r' returns 'int' as does 'rand_r'.

#ifdef __dietlibc__
#include <strings.h>
/* dietlibc has deprecated random and srandom!! */
#define random rand
#define srandom srand
#endif

in mdadm.h.  Will that do you?


 
 - mdadm-2.5-unused.patch
 glibc 2.4 is pedantic on ignoring return values from fprintf, fwrite and
 write, so now we check the rval and actually do something with it.
 in the Grow.c case i only print a warning, since i don't think we can do
 anithing in case we fail invalidating those superblocks (is should never
 happen, but then...)

Ok, thanks.


You can see these patches at
   http://neil.brown.name/cgi-bin/gitweb.cgi?p=mdadm

more welcome :-)

Thanks again,
NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] install a static build

2006-05-28 Thread Neil Brown
On Sunday May 28, [EMAIL PROTECTED] wrote:
 Hello Luca,
 
  maybe you better add an install-static target.
 
 you're right, that would be a cleaner approach. I've done so, and while
 doing so added install-tcc, install-ulibc, install-klibc too.
 
 And while I'm busy in the Makefile anyway I've made a third patch
 which adds the uninstall: target too.
 
 -- 
 --- Dirk Jagdmann
  http://cubic.org/~doj
 - http://llg.cubic.org

thanks for these.  
They are now in my git tree:
  http://neil.brown.name/cgi-bin/gitweb.cgi?p=mdadm
  git://neil.brown.name/mdadm
  
They claim me as their author, I'm afraid... I'll have to fix my scripts
to get it right next time :-(

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: problems with raid=noautodetect

2006-05-28 Thread Neil Brown
On Friday May 26, [EMAIL PROTECTED] wrote:
 On Tue, May 23, 2006 at 08:39:26AM +1000, Neil Brown wrote:
 Presumably you have a 'DEVICE' line in mdadm.conf too?  What is it.
 My first guess is that it isn't listing /dev/sdd? somehow.
 
 Neil,
 i am seeing a lot of people that fall in this same error, and i would
 propose a way of avoiding this problem
 
 1) make DEVICE partitions the default if no device line is specified.

As you note, we think alike on this :-)

 2) deprecate the DEVICE keyword issuing a warning when it is found in
 the configuration file

Not sure I'm so keen on that, at least not in the near term.


 3) introduce DEVICEFILTER or similar keyword with the same meaning as
 the actual DEVICE keyword

If it has the same meaning, why not leave it called 'DEVICE'???

However, there is at least the beginnings of a good idea here.

If we assume there is a list of devices provided by a (possibly
default) 'DEVICE' line, then 

DEVICEFILTER   !pattern1 !pattern2 pattern3 pattern4

could mean that any device in that list which matches pattern 1 or 2
is immediately discarded, any remaining device that matches pattern 3
or 4 is included, and the remainder are discarded.

The rule could be that the default is to include any devices that
don't match a !pattern, unless there is a pattern without a '!', in
which case the default is to reject non-accepted patterns.
Is that straightforward enough, or do I need an
  order allow,deny
like apache has?


Thanks for the suggestion.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID5 kicks non-fresh drives

2006-05-28 Thread Neil Brown
On Friday May 26, [EMAIL PROTECTED] wrote:
  I had no idea about this particular configuration requirement. None of
 
 just to be clear: it's not a requirement.  if you want the very nice 
 auto-assembling behavior, you need to designate the auto-assemblable 
 partitions.  but you can assemble manually without 0xfd partitions
 (even if that's in an initrd, for instance.)
 
 I think the current situation is good, since there is some danger of 
 going too far.  for instance, testing each partition to see whether 
 it contains a valid superblock would be pretty crazy, right?

I'm curious: why exactly do you say that?
Doing the reads themselves cannot be a problem as the kernel already
reads the partition table from each device.  Reading superblocks is
no big deal.

If you don't like the idea of assembling everything that was found,
how is that different from:

  requiring
 either the auto-assemble-me partition type, or explicit partitions 
 given in a config file is a happy medium...

assembling everything that was found which had an 'auto-assemble-me'
flag?  That flag, in common usage, contains almost zero information
more than the existence of the raid superblock.

Am I missing something?

My opinion:  the auto-assemble-me partition type is not a happy
medium.  The superblock containing the hostname (as supported by
mdadm-2.5) is (I hope).
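
A hedged sketch of what that looks like; the option and config spellings
(--homehost, HOMEHOST) are taken from later mdadm documentation and may
differ slightly in 2.5:

    # record the host name in the superblock at create time
    mdadm --create /dev/md0 --homehost=$(hostname -s) --level=1 \
          --raid-devices=2 /dev/sda1 /dev/sdb1
    # and in /etc/mdadm.conf, let assembly prefer arrays owned by this host:
    #   HOMEHOST <system>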

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID5 kicks non-fresh drives

2006-05-28 Thread Neil Brown
On Friday May 26, [EMAIL PROTECTED] wrote:
 On Thu, 25 May 2006, Craig Hollabaugh wrote:
 
  That did it! I set the partition FS Types from 'Linux' to 'Linux raid 
  autodetect' after my last re-sync completed. Manually stopped and 
  started the array. Things looked good, so I crossed my fingers and 
  rebooted. The kernel found all the drives and all is happy here in 
  Colorado.
 
 Would it make sense for the raid code to somehow warn in the log when a 
 device in a raid set doesn't have Linux raid autodetect partition type? 
 If this was in dmesg, would you have spotted the problem before?

Maybe.  Unfortunately md doesn't really have direct access to
information on partition types.  The way it gets access for
auto-detect is an ugly hack which I would rather not make any further
use of.

Maybe mdadm could be more helpful here.
e.g. when you create, assemble, or 'detail' an array it could report
any inconsistencies in the partition types, and when you --add
a device which isn't a Raid-autodetect partition to an
array that currently comprises such partitions, it could give a warning.
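
Until then, a manual version of that check is easy enough; assuming the
candidate device is /dev/sdj1:

    fdisk -l /dev/sdj | grep sdj1      # look for 'fd  Linux raid autodetect'
    sfdisk --print-id /dev/sdj 1       # prints just the type id, e.g. 'fd'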

I had thought that 'libblkid' would help with that, but having looked
at the doco, it appears not...
Maybe I could use libparted... or maybe borrow code out of kpartx.
There don't seem to be any easy options ;-(

Thanks for the suggestion (and if anyone has some good partition
hacking code...)

NeilBrown

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mdadm and 2.4 kernel?

2006-05-26 Thread Neil Brown
On Thursday May 25, [EMAIL PROTECTED] wrote:
 Hi, for various reasons i'll need to run mdadm on a 2.4 kernel.
 Now I have 2.4.32 kernel.
 
 Take a look:
 
 [EMAIL PROTECTED]:~# mdadm --create --verbose /dev/md0 --level=1 
 --bitmap=/root/md0bitmap -n 2 /dev/nda /dev/ndb --force --assume-clean
 mdadm: /dev/nda appears to be part of a raid array:
 level=raid1 devices=2 ctime=Thu May 25 20:10:47 2006
 mdadm: /dev/ndb appears to be part of a raid array:
 level=raid1 devices=2 ctime=Thu May 25 20:10:47 2006
 mdadm: size set to 39118144K
 Continue creating array? y
 mdadm: Warning - bitmaps created on this kernel are not portable
   between different architectured.  Consider upgrading the Linux kernel.
 mdadm: Cannot set bitmap file for /dev/md0: No such device
 

2.4 does not support bitmaps (nor do early 2.6 kernels).

 
 [EMAIL PROTECTED]:~# mdadm --create --verbose /dev/md0 --level=1  -n 2 
 /dev/nda 
 /dev/ndb --force --assume-clean
 mdadm: /dev/nda appears to be part of a raid array:
 level=raid1 devices=2 ctime=Thu May 25 20:10:47 2006
 mdadm: /dev/ndb appears to be part of a raid array:
 level=raid1 devices=2 ctime=Thu May 25 20:10:47 2006
 mdadm: size set to 39118144K
 Continue creating array? y
 mdadm: SET_ARRAY_INFO failed for /dev/md0: File exists
 [EMAIL PROTECTED]:~# 

It seems /dev/md0 is already active somehow.
Try
  mdadm -S /dev/md0
first.  What does cat /proc/mdstat say?

NeilBrown


 
 Obviously the devices /dev/nda and /dev/ndb exists (i can make fdisk 
 on them).
 
 Can someone help me?
 Thanks.
 Stefano.
 
 
 
 
 
 -
 To unsubscribe from this list: send the line unsubscribe linux-raid in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 hang on get_active_stripe

2006-05-26 Thread Neil Brown
On Friday May 26, [EMAIL PROTECTED] wrote:
 On Tue, 23 May 2006, Neil Brown wrote:
 
 i applied them against 2.6.16.18 and two days later i got my first hang... 
 below is the stripe_cache foo.
 
 thanks
 -dean
 
 neemlark:~# cd /sys/block/md4/md/
 neemlark:/sys/block/md4/md# cat stripe_cache_active 
 255
 0 preread
 bitlist=0 delaylist=255
 neemlark:/sys/block/md4/md# cat stripe_cache_active 
 255
 0 preread
 bitlist=0 delaylist=255
 neemlark:/sys/block/md4/md# cat stripe_cache_active 
 255
 0 preread
 bitlist=0 delaylist=255

Thanks.  This narrows it down quite a bit... too much in fact:  I can
now say for sure that this cannot possibly happen :-)

Two things that might be helpful:
  1/ Do you have any other patches on 2.6.16.18 other than the 3 I
sent you?  If you do I'd like to see them, just in case.
  2/ The message.gz you sent earlier with the
  echo t > /proc/sysrq-trigger
 trace in it didn't contain information about md4_raid5 - the 
 controlling thread for that array.  It must have missed out
     due to a buffer overflowing.  Next time it happens, could you
     try to get this trace again and see if you can find out
     what md4_raid5 is doing.  Maybe do the 'echo t' several times.
 I think that you need a kernel recompile to make the dmesg
 buffer larger.
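
A couple of ways to capture the full trace, assuming a fairly standard setup:

    dmesg -s 1000000 > /tmp/sysrq-trace.txt   # ask dmesg for a bigger read buffer
    # or rebuild with a larger kernel log buffer, e.g. in the kernel config:
    #   CONFIG_LOG_BUF_SHIFT=17                # 128KB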

Thanks for your patience - this must be very frustrating for you.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID5 kicks non-fresh drives

2006-05-25 Thread Neil Brown
On Thursday May 25, [EMAIL PROTECTED] wrote:
 
 From dmesg
 md: Autodetecting RAID arrays.
 md: autorun ...
 md: considering sdl1 ...
 md:  adding sdl1 ...
 md:  adding sdi1 ...
 md:  adding sdh1 ...
 md:  adding sdg1 ...
 md:  adding sdf1 ...
 md:  adding sde1 ...
 md:  adding sdd1 ...
 md:  adding sdc1 ...
 md:  adding sdb1 ...
 md:  adding sda1 ...
 md:  adding hdc1 ...
 md: created md0
 
 The kernel didn't add sdj or sdk.
 

And the partition types of sdj1 and sdk1 are ???

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Max. md array size under 32-bit i368 ...

2006-05-24 Thread Neil Brown
On Wednesday May 24, [EMAIL PROTECTED] wrote:
 
 I know this has come up before, but a few quick googles hasn't answered my
 questions - I'm after the max. array size that can be created under
 bog-standard 32-bit intel Linux, and any issues re. partitioning.
 
 I'm aiming to create a raid-6 over 12 x 500GB drives - am I going to
 have any problems?

No, this should work providing your kernel is compiled with CONFIG_LBD.
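
Quick ways to confirm that on the running kernel (assuming your distro
installs the config file, or has CONFIG_IKCONFIG_PROC enabled):

    grep CONFIG_LBD /boot/config-$(uname -r)
    zgrep CONFIG_LBD /proc/config.gz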

NeilBrown

 
 (I'm not parittioning the resulting md device, just the underlying sd
 devices and building a single md out of  sd[a-l]4 ...)
 
 Cheers,
 
 Gordon
 -
 To unsubscribe from this list: send the line unsubscribe linux-raid in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 4 disks in raid 5: 33MB/s read performance?

2006-05-24 Thread Neil Brown
On Wednesday May 24, [EMAIL PROTECTED] wrote:
 Mark Hahn wrote:
 
 I just dd'ed a 700MB iso to /dev/null, dd returned 33MB/s.
 Isn't that a little slow?
 
 
 
 what bs parameter did you give to dd?  it should be at least 3*chunk
 (probably 3*64k if you used defaults.)
 
 
 I would expect readahead to make this unproductive. Mind you, I didn't 
 say it is, but I can't see why not. There was a problem with data going 
 through stripe cache when it didn't need to, but I thought that was fixed.
 
 Neil? Am I an optimist?

Probably

You are right about readahead - it should make the difference in block
size irrelevant.

You are wrong about the problem of reading through the cache being
fixed.  It hasn't yet.  We still read through the cache.
However that shouldn't cause more than a 10% speed reduction, and
33MB/s sounds like more than 10% down.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: iostat messed up with md on 2.6.16.x

2006-05-24 Thread Neil Brown
On Wednesday May 24, [EMAIL PROTECTED] wrote:
 
 Hi,
 
  I upgraded my kernel from 2.6.15.6 to 2.6.16.16 and now the 'iostat -x
 1' permanently shows 100% utilisation on each disk that member of an md
 array. I asked my friend who using 3 boxes with 2.6.16.2 2.6.16.9
 2.6.16.11 and raid1, he's reported the same too. it works for anyone?
 I don't think that it's exactly a md problem, but only appears with the
 md, so I wrote here
 
  I did a basic debugging on evening and I think the problem is the
 double calling of disk-in_flight-- in block/ll_rw_blk.c  -  I dont know
 why, but here's a sample line from /proc/diskstats after a raid array
 assembled:
 
 80 sda 52 1134 8256 568 3 7 24 16 4294967295 433820 4294534144
   ^^ in_flight = -1
 
  I wrote an ugly workaround and now the iostat working well [see
 attach#1], but if it's a real bug, someone should find the root cause of
 it, please

http://lkml.org/lkml/2006/5/23/42

might help...

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Does software RAID take advantage of SMP, or 64 bit CPU(s)?

2006-05-23 Thread Neil Brown
On Monday May 22, [EMAIL PROTECTED] wrote:
 A few simple questions about the 2.6.16+ kernel and software RAID.
 Does software RAID in the 2.6.16 kernel take advantage of SMP?

Not exactly.  RAID5/6 tends to use just one cpu for parity
calculations, but that frees up other cpus for doing other important
work.

 Does software RAID take advantage of 64-bit CPU(s)?

No more or less than other code in the kernel.  Sometimes using a 64-bit
CPU is a cost because more data is shuffled around...

Was there some particular sort of 'advantage' that you were thinking
of?

NeilBrown


 If there are any good web sites that cover this information, a link
 would be GREAT!
 
 -Adam Talbot
 
 -
 To unsubscribe from this list: send the line unsubscribe linux-raid in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: improving raid 5 performance

2006-05-23 Thread Neil Brown
On Tuesday May 23, [EMAIL PROTECTED] wrote:
 Neil hello.
 
 1.
 i have applied the common path according to
 http://www.spinics.net/lists/raid/msg11838.html as much as i can.

Great.  I look forward to seeing the results.

 
 it looks ok in terms of throughput.
 before i continue to a non common path ( step 3 ) i do not understand
 raid0_mergeable_bvec entirely.

Not too surprising - it is rather subtle unfortunately.

 
 as i understand the code checks alignment . i made a version for this
 purpose which looks like that:

Yes, it checks alignment with the chunks and devices.
However we always have to allow one page to be added to a bio, so
sometimes we have to accept a bio that crosses a chunk/device boundary.

The main (possibly only) user is in __bio_add_page in fs/bio.c
so we basically code the merge_bvec_fn to meet the needs of that code.

 
 static int raid5_mergeable_bvec(request_queue_t *q, struct bio *bio,
 				struct bio_vec *biovec)
 {
 	mddev_t *mddev = q->queuedata;
 	sector_t sector = bio->bi_sector + get_start_sect(bio->bi_bdev);
 	int max;
 	unsigned int chunk_sectors = mddev->chunk_size >> 9;
 	unsigned int bio_sectors = bio->bi_size >> 9;
 
 	max = (chunk_sectors - ((sector & (chunk_sectors-1)) + bio_sectors)) << 9;
 	if (max < 0) {
 		printk("handle_aligned_read not aligned %d %d %d %lld\n",
 		       max, chunk_sectors, bio_sectors, sector);
 		return -1; // Is bigger than one chunk size
 	}
 
 	//printk("handle_aligned_read aligned %d %d %d %lld\n",
 	//       max, chunk_sectors, bio_sectors, sector);
 	return max;
 }

you cannot return a negative number, because the result is
compared with an 'unsigned int', and the comparison will be unsigned. 
So return -1 is a problem.  I think you need to make this code 
look a lot more like raid0_mergeable_bvec.


 
 Questions:
1.1 why did you drop the max=0 case ?

I'm not sure what you mean by 'drop'.
If bio->bi_size == 0, then we are not allowed to return a number
smaller than biovec->bv_len, otherwise bio_add_page won't be able
to put any pages on the bio, and so won't be able to start any IO.

1.2  what these lines mean ? do i need it ?
 	if (max <= biovec->bv_len && bio_sectors == 0)
 		return biovec->bv_len;
 	else
 		return max;
 }

Yes, you need this.  It basically implements the above restriction.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 resize in 2.6.17 - how will it be different from raidreconf?

2006-05-22 Thread Neil Brown
On Monday May 22, [EMAIL PROTECTED] wrote:
   Will it be less risky to grow an array that way?
 
  It should be.  In particular it will survive an unexpected reboot (as
 long as you don't lose any drives at the same time) which I don't
  think raidreconf would.
  Testing results so far are quite positive.
 
 Write cache comes to mind - did you test power fail scenarios?
 

I haven't done any tests involving power-cycling the machine, but I
doubt they would show anything.

When a reshape restarts after a crash, at least the last few stripes
are re-written, which should catch anything that was pending at the
moment of power failure.

   (And while talking of that: can I add for example two disks and grow
   *and* migrate to raid6 in one sweep or will I have to go raid6 and then
   add more disks?)
 
  Adding two disks would be the preferred way to do it.
  Add only one disk and going to raid6 is problematic because the
  reshape process will be over-writing live data the whole time, making
  crash protection quite expensive.
  By contrast, when you are expanding the size of the array, after the
  first few stripes you are writing to an area of the drives where there
  is no live data.
 
 Let me see if I got this right: if I add *two* disks and go from raid 5 to 6 
 with raidreconf, no live data needs to be overwritten and in case something 
 fails I will still be able to assemble the old array..?

I cannot speak for raidreconf, though my understanding is that it
doesn't support raid6.

If you mean md/reshape, then what will happen (raid5-raid6 isn't
implemented yet) is this

The raid5 is converted to raid6 with more space incrementally.
Once the process has been underway for a little while, there will
be:
   - a region of the drives that is laid out out as raid6 - the new
 layout
   - a region of the drives that is not in use at all
   - finally a region of the drives that is still laid out as raid5.

Data from the start of the last region is constantly copied into the
start of the middle region, and the two region boundaries are moved
forward regularly.  While this happens the middle region grows.

If there is a crash, on restart this layout (part raid5, part raid6)
will be picked up and the reshaping process continued.

There is a 'critical' section at the very beginning where the middle
region is non-existent. To handle this we copy the first few blocks to
somewhere safe (a file or somewhere on the new drives) and use that
space as the middle region to copy data to.  If the system reboots
during this critical section, mdadm will restore the data from the
backup that it made before assembling the array.
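
A hedged sketch of kicking off such a reshape with mdadm, keeping the
critical-section backup in a file outside the array (device names are
just examples):

    mdadm /dev/md0 --add /dev/sde1
    mdadm --grow /dev/md0 --raid-devices=5 --backup-file=/root/md0-grow-backup
    # if the machine reboots during the critical section:
    #   mdadm --assemble /dev/md0 --backup-file=/root/md0-grow-backup /dev/sd[abcde]1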

If you want to convert a raid5 to a raid6 and only add one drive, it
shouldn't be hard to see that the middle region never exists.
To cope with this safely, mdadm would need to be constantly backing up
sections of the array before allowing the kernel to reshape that
section.  This is certainly quite possible and may well be implemented
one day, but can be expected to be quite slow.

I hope that clarifies the situation.

NeilBrown

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: problems with raid=noautodetect

2006-05-22 Thread Neil Brown
On Monday May 22, [EMAIL PROTECTED] wrote:
 hi list,
 I read somewhere that it would be better not to rely on the 
 autodetect-mechanism in the kernel at boot time, but rather to set up 
 /etc/mdadm.conf accordingly and boot with raid=noautodetect. Well, I 
 tried that :)
 
 I set up /etc/mdadm.conf for my 2 raid5 arrays:
 
  snip 
 # mountpoint: /home/media
 ARRAY  /dev/md0
 level=raid5
 UUID=86ed1434:43380717:4abf124e:970d843a
 devices=/dev/sda1,/dev/sdb1,/dev/sdd3
 
 # mountpoint: /mnt/raid
 ARRAY  /dev/md1
 level=raid5
 UUID=baf59fb5:f4805e7a:91a77644:af3dde17
 #   devices=/dev/sda2,/dev/sdb2,/dev/sdd2
  snap 

Presumably you have a 'DEVICE' line in mdadm.conf too?  What is it?
My first guess is that it isn't listing /dev/sdd? somehow.

Otherwise, can you add a '-v' to the mdadm command that assembles the
array, and capture the output.  That might be helpful.
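
For example, assembling from the config file with verbose output and
keeping a copy of it:

    mdadm --assemble --scan -v 2>&1 | tee /tmp/mdadm-assemble.log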

NeilBrown


 
 and rebooted with raid=noautodetect. It booted fine, but the 3rd disks 
 from each array (/dev/sdd2 and /dev/sdd3) were removed, so I had 2 
 degraded raid5 arrays. It was possible to readd them with sth. like:
 
 mdadm /dev/md0 -a /dev/sdd3
 (synced and /proc/mdstat showed [UUU])
 
 but after the next reboot, the two partitions were again removed 
 ([UU_])?! This was a reproducible error, I tried it several times with 
 different /etc/mdadm.conf settings (ARRAY-statement with UUID=, 
 devices=, UUID+devices, etc.).
 
 I´m now running autodetect again, all raid arrays are working fine, but 
 can anyone explain this strange behaviour?
 (kernel-2.6.16.14, amd64)
 
 thanks,
 florian
 
 PS: please cc me, as I´m not subscribed to the list
 -
 To unsubscribe from this list: send the line unsubscribe linux-raid in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: spin_lock_irq() in handle_stripe()

2006-05-22 Thread Neil Brown
On Monday May 22, [EMAIL PROTECTED] wrote:
 
 
 Good day Neil, all
 
 if I understand right, we disable irqs in handle_stripe()
 just because of using device_lock which can be grabbed
 from interrupt context (_end_io functions). can we replace
 it by a new separate spinlock and don't block interrupts
 in handle_stripe() + add_stripe_bio() ?

Yes, irqs are disabled in handle_stripe, but only for relatively short
periods of time.  Do you have reason to think this is a problem?

device_lock does currently protect a number of data structures.
Not all of them are accessed in interrupt context, and so they could
be changed to be protected by a different lock, possibly sh-lock.
You would need to carefully work out exactly what it is protecting,
determine which of those aren't accessed from interrupts, and see
about moving them (one by one preferably) to a different lock.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 4 disks in raid 5: 33MB/s read performance?

2006-05-22 Thread Neil Brown
On Monday May 22, [EMAIL PROTECTED] wrote:
 I just dd'ed a 700MB iso to /dev/null, dd returned 33MB/s.
 Isn't that a little slow?
 System is a sil3114 4 port sata 1 controller with 4 samsung spinpoint 250GB, 
 8MB cache in raid 5 on a Athlon XP 2000+/512MB.
 

Yes, read on raid5 isn't as fast as we might like at the moment.

It looks like you are getting about 11MB/s off each disk, which is
probably quite a bit slower than they can manage (what is the
single-drive read speed you get dding from /dev/sda or whatever).

You could try playing with the readahead number (blockdev --setra/--getra).
I'm beginning to think that the default setting is a little low.

You could also try increasing the stripe-cache size by writing numbers
to 
   /sys/block/mdX/md/stripe_cache_size
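
A quick sketch of both knobs, assuming the array is /dev/md0:

    blockdev --getra /dev/md0                         # readahead, in 512-byte sectors
    blockdev --setra 4096 /dev/md0                    # try something larger, e.g. 2MB
    echo 1024 > /sys/block/md0/md/stripe_cache_size   # default is 256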

On my test system with a 4 drive raid5 over fast SCSI drives, I
get 230MB/sec on drives that give 90MB/sec.
If I increase the stripe_cache_size from 256 to 1024, I get 260MB/sec.

I wonder if your SATA  controller is causing you grief.
Could you try
   dd if=/dev/SOMEDISK of=/dev/null bs=1024k count=1024
and then do the same again on all devices in parallel
e.g.
   dd if=/dev/SOMEDISK of=/dev/null bs=1024k count=1024 &
   dd if=/dev/SOMEOTHERDISK of=/dev/null bs=1024k count=1024 &
   ...

and see how the speeds compare.
(I get about 55MB/sec on each of 5 drives, or 270MB/sec, which
is probably hitting the SCSI bus limit, which has a theoretical
max of 320MB/sec I think)

NeilBrown

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 hang on get_active_stripe

2006-05-22 Thread Neil Brown
On Wednesday May 17, [EMAIL PROTECTED] wrote:
 On Thu, 11 May 2006, dean gaudet wrote:
 
  On Tue, 14 Mar 2006, Neil Brown wrote:
  
   On Monday March 13, [EMAIL PROTECTED] wrote:
I just experienced some kind of lockup accessing my 8-drive raid5
(2.6.16-rc4-mm2). The system has been up for 16 days running fine, but
now processes that try to read the md device hang. ps tells me they are
all sleeping in get_active_stripe. There is nothing in the syslog, and I
can read from the individual drives fine with dd. mdadm says the state
is active.
 ...
  
  i seem to be running into this as well... it has happenned several times 
  in the past three weeks.  i attached the kernel log output...
 
 it happenned again...  same system as before...
 

I've spent all morning looking at this and while I cannot see what is
happening I did find a couple of small bugs, so that is good...

I've attached three patches.  The first fix two small bugs (I think).
The last adds some extra information to
  /sys/block/mdX/md/stripe_cache_active

They are against 2.6.16.11.

If you could apply them and if the problem recurs, report the content
of stripe_cache_active several times before and after changing it,
just like you did last time, that might help throw some light on the
situation.

Thanks,
NeilBrown

Status: ok

Fix a plug/unplug race in raid5

When a device is unplugged, requests are moved from one or two
(depending on whether a bitmap is in use) queues to the main
request queue.

So whenever requests are put on either of those queues, we should make
sure the raid5 array is 'plugged'.
However we don't.  We currently plug the raid5 queue just before
putting requests on queues, so there is room for a race.  If something
unplugs the queue at just the wrong time, requests will be left on
the queue and nothing will want to unplug them.
Normally something else will plug and unplug the queue fairly
soon, but there is a risk that nothing will.

Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/raid5.c |   18 ++
 1 file changed, 6 insertions(+), 12 deletions(-)

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~   2006-05-23 12:27:58.0 +1000
+++ ./drivers/md/raid5.c2006-05-23 12:28:26.0 +1000
@@ -77,12 +77,14 @@ static void __release_stripe(raid5_conf_
 	if (atomic_read(&conf->active_stripes)==0)
 		BUG();
 	if (test_bit(STRIPE_HANDLE, &sh->state)) {
-		if (test_bit(STRIPE_DELAYED, &sh->state))
+		if (test_bit(STRIPE_DELAYED, &sh->state)) {
 			list_add_tail(&sh->lru, &conf->delayed_list);
-		else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
-			 conf->seq_write == sh->bm_seq)
+			blk_plug_device(conf->mddev->queue);
+		} else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
+			   conf->seq_write == sh->bm_seq) {
 			list_add_tail(&sh->lru, &conf->bitmap_list);
-		else {
+			blk_plug_device(conf->mddev->queue);
+		} else {
 			clear_bit(STRIPE_BIT_DELAY, &sh->state);
 			list_add_tail(&sh->lru, &conf->handle_list);
 		}
@@ -1519,13 +1521,6 @@ static int raid5_issue_flush(request_que
 	return ret;
 }
 
-static inline void raid5_plug_device(raid5_conf_t *conf)
-{
-	spin_lock_irq(&conf->device_lock);
-	blk_plug_device(conf->mddev->queue);
-	spin_unlock_irq(&conf->device_lock);
-}
-
 static int make_request (request_queue_t *q, struct bio * bi)
 {
 	mddev_t *mddev = q->queuedata;
@@ -1577,7 +1572,6 @@ static int make_request (request_queue_t
 		goto retry;
 	}
 	finish_wait(&conf->wait_for_overlap, &w);
-	raid5_plug_device(conf);
 	handle_stripe(sh);
 	release_stripe(sh);
 
Status: ok

Fix some small races in bitmap plugging in raid5.

The comment gives more details, but I didn't quite have the
sequencing right, so there was room for races to leave bits
unset in the on-disk bitmap for short periods of time.

Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/raid5.c |   30 +++---
 1 file changed, 27 insertions(+), 3 deletions(-)

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~   2006-05-23 12:28:26.0 +1000
+++ ./drivers/md/raid5.c2006-05-23 12:28:53.0 +1000
@@ -15,6 +15,30 @@
  * Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
  */
 
+/*
+ * BITMAP UNPLUGGING:
+ *
+ * The sequencing for updating the bitmap reliably is a little
+ * subtle

Re: raid 5 read performance

2006-05-21 Thread Neil Brown
On Sunday May 21, [EMAIL PROTECTED] wrote:
 
 Question :
What is the cost of not walking trough the raid5 code in the
 case of READ ?
if i add and error handling code will it be suffice ?
 

Please read

http://www.spinics.net/lists/raid/msg11838.html

and ask if you have further questions.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 resize in 2.6.17 - how will it be different from raidreconf?

2006-05-21 Thread Neil Brown
On Monday May 22, [EMAIL PROTECTED] wrote:
 How will the raid5 resize in 2.6.17 be different from raidreconf? 

It is done (mostly) in the kernel while the array is active, rather
than completely in user-space while the array is off-line.

 Will it be less risky to grow an array that way?

It should be.  In particular it will survive an unexpected reboot (as
long as you don't lose any drives at the same time) which I don't
think raidreconf would.
Testing results so far are quite positive.


 Will it be possible to migrate raid5 to raid6?
 

Eventually, but no time frame yet.

 (And while talking of that: can I add for example two disks and grow *and* 
 migrate to raid6 in one sweep or will I have to go raid6 and then add more 
 disks?)

Adding two disks would be the preferred way to do it.
Add only one disk and going to raid6 is problematic because the
reshape process will be over-writing live data the whole time, making
crash protection quite expensive.
By contrast, when you are expanding the size of the array, after the
first few stripes you are writing to an area of the drives where there
is no live data.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mdadm: bitmap size

2006-05-21 Thread Neil Brown

(Please don't reply off-list.  If the conversation starts on the list,
please leave it there unless there is a VERY GOOD reason).

On Monday May 22, [EMAIL PROTECTED] wrote:
 On 5/19/06, Neil Brown [EMAIL PROTECTED] wrote:
 
  On Friday May 19, [EMAIL PROTECTED] wrote:
   As i can see the bitmap do exactly this, but the default bitmap is too
   small!
 
  Why do you say that?
  Are you using an internal bitmap, or a bitmap in a separate file?
 
 
 I was using bitmap in a separate file.
 Why i said that tha bitmap is too small? I try to explain:
 
 the raid device is a raid1, created on /dev/md0 trought mdadm, and the
 bitmap use a 4 kb chunk-size on external file (in root directory)
 
 
 setfaulty /dev/md0 /dev/nda
 raidhotremove /dev/md0 /dev/nda
 cd /mnt/md0
 wget http://...   (240 kb file...)
 raidhotadd /dev/md0 /dev/nda
 
 And now dmesg said that the bitmap was obsolete (01 or something like that)
 and that the md driver will force a total recovery.

raidhotadd doesn't know anything about bitmaps.
If you use 'mdadm /dev/md0 --add /dev/nda' you should find that it
works better.

I recommend getting rid of setfaulty / raidhotadd / raidhotremove etc
and just using mdadm.
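
The same exercise done purely with mdadm, so the write-intent bitmap is
honoured and only the blocks written while the disk was out get resynced:

    mdadm /dev/md0 --fail /dev/nda
    mdadm /dev/md0 --remove /dev/nda
    # ... write a small file to the filesystem ...
    mdadm /dev/md0 --add /dev/nda
    cat /proc/mdstat      # should show a short, bitmap-based recovery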

NeilBrown


 
 A recovery of 40 gb for a 240 kb file is a little bit expensive.. :-)
 
 Unfortunately i cannot give you the exact output because the server is down
 now.:-|
 
 
 The only way to control the size of the bitmap is the change the
  bitmap chunk size.
 
 
 Okay thanks.
 
 Warning: if you have more than 1 million bits in the bitmap, the
  kernel may fail in memory allocation and may not be able to assemble
  your array.
 
 
 Thankyou for your help.
 
 Stefano.
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: recovery speed on many-disk RAID 1

2006-05-20 Thread Neil Brown
On Saturday May 20, jeff@jab.org wrote:
 interrupted by seeks from read requests on the RAID. But that's not
 really necessary; imagine if it instead went something like:
 
   sbb1 - sbg1# High bandwidth copy operation limited by drive speed
   sb[cde]1# These guys handle read requests
 

Yeh... that could be done.  There is even a comment in the 2.4 code:

/* If reconstructing, and 1 working disc,
 * could dedicate one to rebuild and others to
 * service read requests ..
 */

though that seems to have disappeared from 2.6.

Given that
 - high rebuild speed could swamp some bus and so interfere with
   regular IO, and
 - it is fairly easy to ask for the speed to be higher (see the knobs below)

I'm not sure that it is necessary... it might be interesting though.
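
The existing knobs, for reference (values are KB/sec per device):

    cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max
    echo 50000 > /proc/sys/dev/raid/speed_limit_min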

I've added it to my todo list, but if someone else would like to have
a try

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Raid5 resize testing opportunity

2006-05-18 Thread Neil Brown
On Thursday May 18, [EMAIL PROTECTED] wrote:
 Hi Neil,
 
 The raid5 reshape seems to have gone smoothly (nice job!), though it
 took 11 hours! Are there any pieces of info you would like about the array?

Excellent!

No, no other information would be useful.  
This is the first real-life example that I know of, of adding 2 devices
at once.  That should be no more difficult, but it is good to know
that it works in fact as well as in theory.

Thanks,
NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] MD RAID Acceleration: Move stripe operations outside the spin lock

2006-05-18 Thread Neil Brown
On Tuesday May 16, [EMAIL PROTECTED] wrote:
 This is the second revision of the effort to enable offload of MD's xor
 and copy operations to dedicated hardware resources.  Please comment on
 the approach of this patch and whether it will be suitable to expand
 this to the other areas in handle_stripe where such calculations are
 performed.  Implementation of the xor offload API is a work in progress,
 the intent is to reuse I/OAT.
 
 Overview:
 Neil, as you recommended, this implementation flags the necessary
 operations on a stripe and then queues the execution to a separate
 thread (similar to how disk cycles are handled).  See the comments added
 to raid5.h for more details.

Hi.

This certainly looks like it is heading in the right direction - thanks.

I have a couple of questions, which will probably lead to more.

You obviously need some state-machine functionality to oversee the
progression, like xor -> drain -> xor (for RMW) or clear -> copy -> xor
(for RCW).
You have encoded the different states on a per-block basis (storing it
in sh->dev[x].flags) rather than on a per-strip basis (and so encoding
it in sh->state).
What was the reason for this choice?

The only reason I can think of is to allow more parallelism :
different blocks within a strip can be in different states.  I cannot
see any value in this as the 'xor' operation will work across all
(active) blocks in the strip and so you will have to synchronise all
the blocks on that operation.

I feel the code would be simpler if the state was in the strip rather
than the block.

The wait_for_block_op queue and its usage seem odd to me.
handle_stripe should never have to wait for anything.
When a block_op is started, the sh->count should be incremented, and
then decremented when the block-ops have finished.  Only then will
handle_stripe get to look at the stripe_head again.  So waiting
shouldn't be needed.

Your GFP_KERNEL kmalloc is a definite no-no which can lead to
deadlocks (it might block while trying to write data out through the
same raid array).  At least it should be GFP_NOIO.
However a better solution would be to create and use a mempool - they
are designed for exactly this sort of usage.
However I'm not sure if even that is needed.  Can a stripe_head have
more than one active block_ops task happening?  If not, the
'stripe_work' should be embedded in the 'stripe_head'.

There will probably be more questions once these are answered, but
the code is definitely a good start.

Thanks,
NeilBrown


(*) Since reading the DDF documentation again, I've realised that
using the word 'stripe' both for a chunk-wide stripe and a block-wide
stripe is very confusing.  So I've started using 'strip' for a
block-wide stripe.  Thus a 'stripe_head' should really be a
'strip_head'.

I hope this doesn't end up being even more confusing :-)
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 hang on get_active_stripe

2006-05-18 Thread Neil Brown
On Wednesday May 17, [EMAIL PROTECTED] wrote:
 
 let me know if you want the task dump output from this one too.
 

No thanks - I doubt it will contain anything helpful.

I'll try to put some serious time into this next week - as soon as I
get mdadm 2.5 out.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Raid5 resize testing opportunity

2006-05-17 Thread Neil Brown
On Wednesday May 17, [EMAIL PROTECTED] wrote:
 Hi all,
 
 For Neil's benefit (:-) I'm about to test the raid5 resize code by
 trying to grow our 2TB raid5 from 8 to 10 devices. Currently, I'm
 running a 2.6.16-rc4-mm2 kernel.  Is this current enough to support the
 resize? (I suspect not.) If I upgrade to 2.6.17-rc4-mm1, would that do
 it, or is it even in stable 2.6.16.16?

Thanks!

You need at least 2.6.17-rc1.  I would suggest the latest -rc:
 2.6.17-rc4

Don't use -mm.  It could have new bugs, and you don't want them to
trouble you when you are growing your array.

I look forward to the results!

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: softraid and multiple distros

2006-05-15 Thread Neil Brown
On Monday May 15, [EMAIL PROTECTED] wrote:
  I always use entire disks if I want the entire disks raided (sounds
  obvious, doesn't it...)  I only use partitions when I want to vary the
  raid layout for different parts of the disk (e.g. mirrored root, mirrored
  swap, raid6 for the rest).   But that certainly doesn't mean it is
  wrong to use partitions for the whole disk.
 
 The idea behind this is: let's say a disk fails, and you get a replacement, 
 but it has a different geometry or a few blocks less - won't work. 
 Even the same disk model might vary after a while.
 So I made 0xfd partitions of the size (whole disk minus few megs).
 

An alternative is to use the --size option of mdadm to make the array
slightly smaller than the smallest drive.  So you don't need
partitions for this (though it is perfectly alright to use them if you
like). 
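
For example (a sketch with made-up device names and sizes; --size is
given in kibibytes and applies per device):

  mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        --size=488000000 /dev/sda /dev/sdb

which leaves a little headroom below the end of a nominal 500GB drive,
so a slightly smaller replacement will still fit.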

  You can tell mdadm where to look.  If you want to be sure that it
  won't look at entire drives, only partitions, then a line like
  DEVICE /dev/[hs]d*[0-1]
  in /etc/mdadm.conf might be what you want.
  However as you should be listing the uuids in /etc/mdadm.conf, any
 
 Umm... yeah, should I?

What else would you use to uniquely identify the arrays?  Not device
names I hope.

 
  superblock with an unknown uuid will easily be ignored.
 
  If you are relying on 0xfd autodetect to assemble your arrays, then
  obviously the entire-disk superblock will be ignored (because they
  won't be in the right place in any partition).
 
 So mdadm --assemble --scan is fine for my scenario even with those orphaned 
 superblocks.

I cannot say for sure without seeing your mdadm.conf, but probably.

 
 Should get me some sedatives for the day when this all explodes :P

Just make sure it happens on your day off, then someone else will
need the sedatives :-)

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.

2006-05-15 Thread Neil Brown
On Monday May 15, [EMAIL PROTECTED] wrote:
 
 Ho hum, I give up.

Thankyou :-)  I found our debate very valuable - it helped me clarify
my understanding of some areas of linux filesystem semantics (and as I
am trying to write a filesystem in my 'spare time', that will turn out
to be very useful).  It also revealed some problems in the code!

 I don't think, in practice, this code fixes any
 demonstrable bug though.

I thought it was our job to kill the bugs *before* they were
demonstrated :-)

I'm still convinced that the previous code could lead to deadlocks or
worse under sufficiently sustained high memory pressure and fs
activity.

I'll send a patch shortly that fixes the known problems and
awkwardnesses in the new code.

Thanks again,
NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid0 over 2 h/w raid5's OOPSing at mkfs

2006-05-15 Thread Neil Brown
On Monday May 15, [EMAIL PROTECTED] wrote:
 I've got a x86_64 system with 2 3ware 9550SX-12s, each set up as a raid5 
 w/ a hot spare.  Over that, I do a software raid0 stripe via:
 
 mdadm -C /dev/md0 -c 512 -l 0 -n 2 /dev/sd[bc]1
 
 Whenever I try to format md0 (I've tried both mke2fs and mkfs.xfs), the 
 system OOPSes.  I'm running centos-4 with the default kernel, but I've 
 upgraded the 3ware driver/firmware to the most recent versions.  Based on 
 the OOPS I'll paste below, who should I be blaming for this crash?  Any 
 ideas on how to fix it?  Thanks.

Try this.

http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=1eb29128c644581fa51f822545921394ad4f719f

Raid0 had troubles on 64bit machines back in the 2.6.9 days.

NeilBrown.


 
 Unable to handle kernel NULL pointer dereference at 0027 RIP:
 a0194ab6{:raid0:raid0_make_request+448}
 PML4 3e913067 PGD 7b652067 PMD 0
 Oops:  [1] SMP
 CPU 1
 Modules linked in: raid0 md5 ipv6 parport_pc lp parport i2c_dev i2c_core 
 sunrpc ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables dm_mirror 
 dm_mod button battery ac ohci_hcd hw_random tg3 floppy ext3 jbd 3w_9xxx(U) 
 3w_ sd_mod scsi_mod
 Pid: 2955, comm: mke2fs Not tainted 2.6.9-34.ELsmp
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 001 of 3] md: Change md/bitmap file handling to use bmap to file blocks-fix

2006-05-15 Thread Neil Brown
On Monday May 15, [EMAIL PROTECTED] wrote:
 NeilBrown [EMAIL PROTECTED] wrote:
 
   +  do_sync_file_range(file, 0, LLONG_MAX,
   + SYNC_FILE_RANGE_WRITE |
   + SYNC_FILE_RANGE_WAIT_AFTER);
 
 That needs a SYNC_FILE_RANGE_WAIT_BEFORE too.  Otherwise any dirty,
 under-writeback pages will remain dirty.  I'll make that change.

Ahhh.. yes... that makes sense!  Thanks :-)

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: recovery from mkswap on mounted raid1 ext3 filesystem?

2006-05-15 Thread Neil Brown
On Monday May 15, [EMAIL PROTECTED] wrote:
 I accidentally ran mkswap on an md raid1 device which had a mounted
 ext3 filesystem on it.  I also did a swapon, but I don't think
 anything was written to swap before I noticed the mistake.  How much
 of the partition is toast, and is it something e2fsck might fix?

oh dear

I think (and an strace seems to confirm) that mkswap only writes in
the first 4k of the device.  This will have held the superblock, but
there is always at least one backup - I think it is at block 8193.
But 'fsck -n' should help you out, though you might need
'fsck.ext2 -n' as 'fsck' might think it is a swap device...
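
Something like this, perhaps (a sketch - the backup superblock
location depends on the filesystem block size: 8193 is typical for 1k
blocks, 32768 for 4k blocks):

  fsck.ext2 -n /dev/md0            # read-only check first
  e2fsck -n -b 8193 /dev/md0       # or point it at a backup superblock
  e2fsck -b 8193 /dev/md0          # only then let it make repairs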

Of course, if the filesystem is mounted, then unmounting the filesystem
should write the superblock, which might fix any corruption you
caused..

I'm very surprised that swapon worked if the fs was mounted - there
should be mutual exclusion there.

 
 Moreover, shouldn't the mkswap command check whether a device is in
 use before overwriting it?

Yes, but before 2.6 this was very hard to do (in 2.6 it is easy, just
open with O_EXCL).  I doubt mkswap has seen much maintenance
lately... not since Dec 2004 in fact.  And the only check it does is
to make sure you aren't running mkswap on /dev/hda or /dev/hdb !!!

Adrian:  You seem to be the MAINTAINER of mkswap.. any chance of 
opening with O_EXCL as well as O_RDWR?  That would make it a lot safer.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.

2006-05-14 Thread Neil Brown
On Saturday May 13, [EMAIL PROTECTED] wrote:
 Paul Clements [EMAIL PROTECTED] wrote:
 
  Andrew Morton wrote:
  
   The loss of pagecache coherency seems sad.  I assume there's never a
   requirement for userspace to read this file.
  
  Actually, there is. mdadm reads the bitmap file, so that would be 
  broken. Also, it's just useful for a user to be able to read the bitmap 
  (od -x, or similar) to figure out approximately how much more he's got 
  to resync to get an array in-sync. Other than reading the bitmap file, I 
  don't know of any way to determine that.
 
 Read it with O_DIRECT :(

Which is exactly what the next release of mdadm does.
As the patch comment said:

: With this approach the pagecache may contain data which is inconsistent with 
: what is on disk.  To alleviate the problems this can cause, md invalidates
: the pagecache when releasing the file.  If the file is to be examined
: while the array is active (a non-critical but occasionally useful function),
: O_DIRECT io must be used.  And a new version of mdadm will have support for
: this.
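
From userspace, before that mdadm is available, a quick way to peek at
the file with O_DIRECT is GNU dd's iflag=direct - a sketch, with a
hypothetical path for the external bitmap file:

  dd if=/var/md0-bitmap bs=4096 iflag=direct 2>/dev/null | od -Ax -x | head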

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: softraid and multiple distros

2006-05-14 Thread Neil Brown
On Sunday May 14, [EMAIL PROTECTED] wrote:
 Am Sonntag, 14. Mai 2006 16:50 schrieben Sie:
   What do I need to do when I want to install a different distro on the
   machine with a raid5 array?
   Which files do I need? /etc/mdadm.conf? /etc/raittab? both?
 
  MD doesn't need any files to function, since it can auto-assemble
  arrays based on their superblocks (for partition-type 0xfd).
 
 I see. Now an issue arises someone else here mentioned: 
 My first attempt was to use the entire disks, then I was hinted that this 
 approach wasn't too hot so I made partitions. 

I always use entire disks if I want the entire disks raided (sounds
obvious, doesn't it...)  I only use partitions when I want to vary the
raid layout for different parts of the disk (e.g. mirrored root, mirrored
swap, raid6 for the rest).   But that certainly doesn't mean it is
wrong to use partitions for the whole disk.


 Now the devices all have two superblocks: the ones left from the first try, 
 which are now kinda orphaned, and those now active. 
 Can I trust mdadm to handle this properly on its own?

You can tell mdadm where to look.  If you want to be sure that it
won't look at entire drives, only partitions, then a line like
   DEVICE /dev/[hs]d*[0-1]
in /etc/mdadm.conf might be what you want. 
However as you should be listing the uuids in /etc/mdadm.conf, any
superblock with an unknown uuid will easily be ignored.
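
A minimal mdadm.conf along those lines might look like this (a sketch -
the UUID below is made up; 'mdadm --detail --scan' will print the real
ARRAY lines for your arrays):

  DEVICE /dev/[hs]d*[0-9]
  ARRAY /dev/md0 UUID=01234567:89abcdef:01234567:89abcdef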

If you are relying on 0xfd autodetect to assemble your arrays, then
obviously the entire-disk superblock will be ignored (because they
won't be in the right place in any partition).

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.

2006-05-14 Thread Neil Brown

(replying to bits of several emails)

On Friday May 12, [EMAIL PROTECTED] wrote:
 Neil Brown [EMAIL PROTECTED] wrote:

  However some IO requests cannot complete until the filesystem I/O
  completes, so we need to be sure that the filesystem I/O won't block
  waiting for memory, or fail with -ENOMEM.
 
 That sounds like a complex deadlock.  Suppose the bitmap writeout requres
 some writeback to happen before it can get enough memory to proceed.
 

Exactly. Bitmap writeout must not block on fs-writeback.  It can block
on device writeout (e.g. queue congestion or mempool exhaustion) but
it must complete without waiting in the fs layer or above, and without
the possibility of any error other than -EIO.  Otherwise we can get
deadlocked writing to the raid array. submit_bh (or submit_bio) is
certain to be safe in this respect.  I'm not so confident about
anything at the fs level.

   Read it with O_DIRECT :(
  
  Which is exactly what the next release of mdadm does.
  As the patch comment said:
  
  : With this approach the pagecache may contain data which is inconsistent 
  with 
  : what is on disk.  To alleviate the problems this can cause, md invalidates
  : the pagecache when releasing the file.  If the file is to be examined
  : while the array is active (a non-critical but occasionally useful 
  function),
  : O_DIRECT io must be used.  And a new version of mdadm will have support for 
  this.
 
 Which doesn't help `od -x' and is going to cause older mdadm userspace to
 mysteriously and subtly fail.  Or does the user-kernel interface have
 versioning which will prevent this?
 

As I said: 'non-critical'.  Nothing important breaks if reading the
file gets old data.  Reading the file while the array is active is
purely a curiosity thing.  There is information in /proc/mdstat which
gives a fairly coarse view of the same data.  It could lead to some
confusion, but if a compliant mdadm comes out before this gets into a
mainline kernel, I doubt there will be any significant issue.

Read/writing the bitmap needs to work reliably when the array is not
active, but suitable sync/invalidate calls in the kernel should make
that work perfectly.

I know this is technically a regression in the user-space interface, and
you don't like such regressions, with good reason...  Maybe I could call
invalidate_inode_pages every few seconds or whenever the atime
changes, just to be on the safe side :-)

   I have a patch which did that,
  but decided that the possibility of kmalloc failure at awkward times
  would make that not suitable.
 
 submit_bh() can and will allocate memory, although most decent device
 drivers should be OK.
 

submit_bh (like all decent device drivers) uses a mempool for memory
allocation so we can be sure that the delay in getting memory is
bounded by the delay for a few IO requests to complete, and we can be
sure the allocation won't fail.  This is perfectly fine.

  
  I don't think a_ops really provides an interface that I can use, partly
  because, as I said in a previous email, it isn't really a public
  interface to a filesystem.
 
 It's publicer than bmap+submit_bh!
 

I don't know how you can say that.

bmap is so public that it is exported to userspace through an IOCTL
and is used by lilo (admittedly only for reading, not writing).  More
significantly it is used by swapfile which is a completely independent
subsystem from the filesystem.  Contrast this with a_ops.  The primary
users of a_ops are routines like generic_file_{read,write} and
friends.  These are tools - library routines - that are used by
filesystems to implement their 'file_operations' which are the real
public interface.  As far as these uses go, it is not a public
interface.  Where a filesystem doesn't use some library routines, it
does not need to implement the matching functionality in the a_op
interface.

The other main user is the 'VM' which might try to flush out or
invalidate pages.  However the VM isn't using this interface to
interact with files, but only to interact with pages, and it doesn't
much care what is done with the pages providing they get clean, or get
released, or whatever.

The way I perceive Linux design/development, active usage is far more
significant than documented design.  If some feature of an interface
isn't being actively used - by in-kernel code - then you cannot be
sure that feature will be uniformly implemented, or that it won't
change subtly next week.

So when I went looking for the best way to get md/bitmap to write to a
file, I didn't just look at the interface specs (which are pretty
poorly documented anyway), I looked at existing code.
I can find 3 different parts of the kernel that write to a file.
They are
   swap-file
   loop
   nfsd

nfsd uses vfs_read/vfs_write  which have too many failure/delay modes
  for me to safely use.
loop uses prepare_write/commit_write (if available) or f_op->write
  (not vfs_write - I wonder why) which is not much better than what
  nfsd uses.  And as far as I can tell

Re: [PATCH 002 of 8] md/bitmap: Remove bitmap writeback daemon.

2006-05-12 Thread Neil Brown
On Friday May 12, [EMAIL PROTECTED] wrote:
 NeilBrown [EMAIL PROTECTED] wrote:
 
   ./drivers/md/bitmap.c |  115 
  ++
 
 hmm.  I hope we're not doing any of that filesystem I/O within the context
 of submit_bio() or kblockd or anything like that.  Looks OK from a quick
 scan.

No.  We do all the I/O from the context of the per-array thread.
However some IO requests cannot complete until the filesystem I/O
completes, so we need to be sure that the filesystem I/O won't block
waiting for memory, or fail with -ENOMEM.

 
 a_ops->commit_write() already ran set_page_dirty(), so you don't need that
 in there.

Is that documented somewhere?  But yes, that seems to be right. Thanks.

 
 I assume this always works in units of a complete page?  It's strange to do
 prepare_write() followed immediately by commit_write().  Normally
 prepare_write() will do some prereading, but it's smart enough to not do
 that if the caller is preparing to write the whole page.
 

Yes, it is strange.  That was one of the things that made me want to
review this code and figure out how to do it properly.

As far as I can see, much of 'address_space' is really an internal
interface to support routines used by the filesystem.  A filesystem
may choose to use address spaces, and has a fair degree of freedom
when it comes to which bits to make use of and exactly what they
mean.
About the only thing that *has* to be supported is ->writepages --
which has a fair degree of latitude in exactly what it does -- and
->writepage -- which can only be called after locking a page and
rechecking the ->mapping.

bitmap.c is currently trying to do something very different.
It uses ->readpage to get pages in the page cache (even though some
address spaces don't define ->readpage) and then holds onto those
pages without holding the page lock, and then calls ->writepage to
flush them out from time to time.
Before calling writepage it gets the page lock, but doesn't re-check
that ->mapping is correct (there is nothing much it can do if it isn't
correct..).

I noticed this is particularly a problem with tmpfs.  When you call
writepage on a tmpfs page, the page is swizzled into the swap cache,
and ->mapping becomes NULL - not the behaviour that bitmap is
expecting.

Now I agree that tmpfs is an unusual case, and that storing a bitmap
in tmpfs doesn't make a lot of sense (though it can make some...) but
the point is that if a filesystem is allowed to move pages around like
that, then bitmap cannot hold on to pages in the page cache like it
wants to.  It simply isn't a well defined thing to do.


 We normally use PAGE_CACHE_SIZE for these things, not PAGE_SIZE.  Same diff.
 

Yeah, why is that?  Why have two names for exactly the same value?
How does a poor developer know when to use one and when the other?  More
particularly, how does one remember?
I would argue that the 'same diff' should be 'no difference' - not even
in spelling.

 If you have a page and you want to write the whole thing out then there's
 really no need to run prepare_write or commit_write at all.  Just
 initialise the whole page, run set_page_dirty() then write_one_page().
 

I see that now.  But only after locking the page, and rechecking that
->mapping is correct, and if it isn't... well, more work is involved
than bitmap is in a position to do.

 Perhaps it should check that the backing filesystem actually implements
 commit_write(), prepare_write(), readpage(), etc.  Some might not, and the
 user will get taught not to do that via an oops.
 

Might help, but as I think you've gathered, I really want a whole
different approach to writing to the file.  One that I can justify as
being a correct use of interfaces, and also that I can be certain
will not block or fail on a kmalloc or similar.  Hence the bmap thing
later.


Thanks for your feedback.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.

2006-05-12 Thread Neil Brown
On Friday May 12, [EMAIL PROTECTED] wrote:
 NeilBrown [EMAIL PROTECTED] wrote:
 
  If md is asked to store a bitmap in a file, it tries to hold onto the
  page cache pages for that file, manipulate them directly, and call a
  cocktail of operations to write the file out.  I don't believe this is
  a supportable approach.
 
 erk.  I think it's better than...
 
  This patch changes the approach to use the same approach as swap files.
  i.e. bmap is used to enumerate all the block address of parts of the file
  and we write directly to those blocks of the device.
 
 That's going in at a much lower level.  Even swapfiles don't assume
 buffer_heads.

I'm not assuming buffer_heads.  I'm creating buffer heads and using
them for my own purposes.  These are my pages and my buffer heads.
None of them belong to the filesystem.
The buffer_heads are simply a convenient data-structure to record the
several block addresses for each page.  I could have equally created
an array storing all the addresses, and built the required bios by
hand at write time.  But buffer_heads did most of the work for me, so
I used them.

Yes, it is a lower level, but
 1/ I am certain that there will be no kmalloc problems and
 2/ Because it is exactly the level used by swapfile, I know that it
is sufficiently 'well defined' that no-one is going to break it.
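
(As an aside, the block mapping that bmap produces can also be seen
from userspace - a sketch using filefrag from e2fsprogs, with a
hypothetical path for the bitmap file:

  filefrag -v /var/md0-bitmap

which lists the physical blocks backing the file - essentially the
same addresses the buffer_heads end up recording.)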

 
 When playing with bmap() one needs to be careful that nobody has truncated
 the file on you, else you end up writing to disk blocks which aren't part
 of the file any more.

Well we currently play games with i_write_count to ensure that no-one
else has the file open for write.  And if no-one else can get write
access, then it cannot be truncated.
I did think about adding the S_SWAPFILE flag, but decided to leave
that for a separate patch and review different approaches to
preventing write access first (e.g. can I use a lease?).

 
 All this (and a set_fs(KERNEL_DS), ug) looks like a step backwards to me. 
 Operating at the pagecache a_ops level looked better, and more
 filesystem-independent.

If you really want filesystem independence, you need to use vfs_read
and vfs_write to read/write the file.  I have a patch which did that,
but decided that the possibility of kmalloc failure at awkward times
would make that not suitable.
So I now use vfs_read to read in the file (just like knfsd) and
bmap/submit_bh to write out the file (just like swapfile).

I don't think a_ops really provides an interface that I can use, partly
because, as I said in a previous email, it isn't really a public
interface to a filesystem.

 
 I haven't looked at this patch at all closely yet.  Do I really need to?

I assume you are asking that because you hope I will retract the
patch.  While I'm always open to being educated, I am not yet
convinced that there is any better way, or even any other usable way,
to do what needs to be done, so I am not inclined to retract the
patch.

I'd like to say that you don't need to read it because it is perfect,
but unfortunately history suggests that is unlikely to be true.

Whether you look more closely is of course up to you, but I'm convinced
that the patch is in the right direction, and your review and comments are
always very valuable.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID5 - 4 disk reboot trouble.

2006-05-11 Thread Neil Brown
On Thursday May 11, [EMAIL PROTECTED] wrote:
 Hi,
 
 I'm running a raid5 system, and when I reboot my raid seems to be 
 failing. (One disk is set to spare and the other disk seems to be okay in the 
 details page, but we get an INPUT/OUTPUT error when trying to mount it)
 
 We cannot seem to find the problem in this setup.
...
   State : clean, degraded, recovering
 ^^

Do you ever let the recovery actually finish?  Until you do you don't
have real redundancy.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: hardware raid 5 and software raid 0 stripe broke.

2006-05-11 Thread Neil Brown
On Thursday May 11, [EMAIL PROTECTED] wrote:
 We have a Linux box running redhat 7.2   We have two hardware 
 controllers in it with about 500gig's each.   They're raid 5.   We were 
 using a software raid to combine them all together.   1 hard drive went 
 down so we replaced it and now the system won't boot.   We have used a 
 Knoppix boot cd to get into a linux system.   We can see /dev/sda1 and 
 /dev/sdb1.   However, /dev/md0 cannot be accessed.   Is there a safe way 
 to create the software raid from the cd to see if maybe mdadm on the 
 original system got corrupt?   any help would be greatly appreciated.

When booted off knoppix, what does

 mdadm -E /dev/sda1
 mdadm -E /dev/sdb1

produce?

How about
  mdadm -A /dev/md0 /dev/sda1 /dev/sdb1

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 009 of 11] md: Support stripe/offset mode in raid10

2006-05-08 Thread Neil Brown
On Wednesday May 3, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
  On Tuesday May 2, [EMAIL PROTECTED] wrote:
   NeilBrown wrote:
The industry standard DDF format allows for a stripe/offset layout
where data is duplicated on different stripes. e.g.
   
  A  B  C  D
  D  A  B  C
  E  F  G  H
  H  E  F  G
   
(columns are drives, rows are stripes, LETTERS are chunks of data).
  
   Presumably, this is the case for --layout=f2 ?
 
  Almost.  mdadm doesn't support this layout yet.
  'f2' is a similar layout, but the offset stripes are a lot further
  down the drives.
  It will possibly be called 'o2' or 'offset2'.
 
   If so, would --layout=f4 result in a 4-mirror/striped array?
 
  o4 on a 4 drive array would be
 
 A  B  C  D
 D  A  B  C
 C  D  A  B
 B  C  D  A
 E  F  G  H
 
 
 Yes, so would this give us 4 physically duplicate mirrors?

It would give 4 devices each containing the same data, but in a
different layout - much as the picture shows


 If not, would it be possible to add a far-offset mode to yield such
 a layout?

Exactly what sort of layout do you want?


 
   Also, would it be possible to have a staged write-back mechanism across
   multiple stripes?
 
  What exactly would that mean?
 
 Write the first stripe, then write subsequent duplicate stripes based on idle 
 with a max delay for each delayed stripe.
 
  And what would be the advantage?
 
 Faster burst writes, probably.

I still don't get what you are after.
You always need to wait for writes of all copies to complete before
acknowledging the write to the filesystem, otherwise you risk
corruption if there is a crash and a device failure.
So inserting any delays (other than the per-device plugging which
helps to group adjacent requests) isn't going to make things go
faster.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: strange RAID5 problem

2006-05-08 Thread Neil Brown
On Monday May 8, [EMAIL PROTECTED] wrote:
 Good evening.
 
 I am having a bit of a problem with a largish RAID5 set.
 Now it is looking more and more like I am about to lose all the data on
 it, so I am asking (begging?) to see if anyone can help me sort this out.
 

Very thorough description, but you omitted the 'dmesg' output
corresponding to :

 
 [EMAIL PROTECTED] ~]# mdadm
 --assemble /dev/md3 /dev/sdq1 /dev/sdr1 /dev/sds1 /dev/sdt1 /dev/sdu1
 /dev/sdv1 /dev/sdx1 /dev/sdy1 /dev/sdz1 /dev/sdaa1 /dev/sdab1 /dev/sdac1
 /dev/sdad1 /dev/sdae1 /dev/sdaf1
 mdadm: failed to RUN_ARRAY /dev/md3: Invalid argument


Also, you don't seem to have tried '--force' with '--assemble'.  It
might help.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Two-disk RAID5?

2006-05-05 Thread Neil Brown
On Friday May 5, [EMAIL PROTECTED] wrote:
 
  Sorry, I couldn't find a diplomatic way to say you're completely wrong.
 
 We don't necessarily expect a diplomatic way, but a clear and
 intelligent one would be helpful. 
 
 In two-disk RAID5 which is it?
 
   1) The 'parity bit' is the same as the datum.

Yes.

 
   2) The parity bit is the complement of the datum.

No.

 
   3) It doesn't work at a bit-wise level.

No.

 
 Many of us feel that RAID5 looks like:
 
   parity = data[0];
   for (i=1; i < ndisks; ++i)
   parity ^= data[i];

Actually in linux/md/raid5 it is more like

parity = 0;
for (i=0; i < ndisks; ++i)
parity ^= data[i];

which has exactly the same result.
(well, it should really be ndatadisks, but I think we both knew that
was what you meant).

 
 which implies (1). It could easily be (2) but merely saying it's not
 data, it's parity doesn't clarify matters a great deal. 
 
 But I'm pleased my question has stirred up such controversy!

A bit of controversy is always a nice way to pass those long winter
nights... only it isn't winter anywhere at the moment :-)

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 009 of 11] md: Support stripe/offset mode in raid10

2006-05-02 Thread Neil Brown
On Tuesday May 2, [EMAIL PROTECTED] wrote:
 NeilBrown wrote:
  The industry standard DDF format allows for a stripe/offset layout
  where data is duplicated on different stripes. e.g.
 
A  B  C  D
D  A  B  C
E  F  G  H
H  E  F  G
 
  (columns are drives, rows are stripes, LETTERS are chunks of data).
 
 Presumably, this is the case for --layout=f2 ?

Almost.  mdadm doesn't support this layout yet.  
'f2' is a similar layout, but the offset stripes are a lot further
down the drives.
It will possibly be called 'o2' or 'offset2'.

 If so, would --layout=f4 result in a 4-mirror/striped array?

o4 on a 4 drive array would be 

   A  B  C  D
   D  A  B  C
   C  D  A  B
   B  C  D  A
   E  F  G  H
   

 
 Also, would it be possible to have a staged write-back mechanism across 
 multiple stripes?

What exactly would that mean?  And what would be the advantage?

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 004 of 11] md: Increase the delay before marking metadata clean, and make it configurable.

2006-05-01 Thread Neil Brown
On Sunday April 30, [EMAIL PROTECTED] wrote:
 NeilBrown [EMAIL PROTECTED] wrote:
 
  
  When a md array has been idle (no writes) for 20msecs it is marked as
  'clean'.  This delay turns out to be too short for some real
  workloads.  So increase it to 200msec (the time to update the metadata
  should be a tiny fraction of that) and make it sysfs-configurable.
  
  
  ...
  
  +   safe_mode_delay
  + When an md array has seen no write requests for a certain period
  + of time, it will be marked as 'clean'.  When another write
  + request arrive, the array is marked as 'dirty' before the write
  + commenses.  This is known as 'safe_mode'.
  + The 'certain period' is controlled by this file which stores the
  + period as a number of seconds.  The default is 200msec (0.200).
  + Writing a value of 0 disables safemode.
  +
 
 Why not make the units milliseconds?  Rename this to safe_mode_delay_msecs
 to remove any doubt.

Because umpteen years ago when I was adding thread-usage statistics to
/proc/net/rpc/nfsd I used milliseconds and Linus asked me to make it
seconds - a much more obvious unit.  See Email below.
It seems very sensible to me.
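
With this patch the knob reads and writes plain seconds - a usage
sketch, assuming an array at /dev/md0:

  cat /sys/block/md0/md/safe_mode_delay        # shows e.g. 0.200
  echo 0.5 > /sys/block/md0/md/safe_mode_delay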

...
  +   msec = simple_strtoul(buf, &e, 10);
  +   if (e == buf || (*e && *e != '\n'))
  +   return -EINVAL;
  +   msec = (msec * 1000) / scale;
  +   if (msec == 0)
  +   mddev->safemode_delay = 0;
  +   else {
  +   mddev->safemode_delay = (msec*HZ)/1000;
  +   if (mddev->safemode_delay == 0)
  +   mddev->safemode_delay = 1;
  +   }
  +   return len;
 
 And most of that goes away.

Maybe it could go in a library :-?

NeilBrown



From: Linus Torvalds [EMAIL PROTECTED]
To: Neil Brown [EMAIL PROTECTED]
cc: [EMAIL PROTECTED]
Subject: Re: PATCH knfsd - stats tidy up.
Date: Tue, 18 Jul 2000 12:21:12 -0700 (PDT)
Content-Type: TEXT/PLAIN; charset=US-ASCII



On Tue, 18 Jul 2000, Neil Brown wrote:
 
 The following patch converts jiffies to milliseconds for output, and
 also makes the number wrap predictably at 1,000,000 seconds
 (approximately one fortnight).

If no programs depend on the format, I actually prefer format changes like
this to be of the obvious kind. One such obvious kind is the format

0.001

which obviously means 0.001 seconds. 

And yes, I'm _really_ sorry that a lot of the old /proc files contain
jiffies. Lazy. Ugly. Bad. Much of it my bad.

Doing 0.001 doesn't mean that you have to use floating point, in fact
you've done most of the work already in your ms patch, just splitting
things out a bit works well:

/* gcc knows to combine / and % - generate one divl */
unsigned int sec = time / HZ, msec = time % HZ;
msec = (msec * 1000) / HZ;

sprintf( "%d.%03d", sec, msec)

(It's basically the same thing you already do, except it doesn't
re-combine the seconds and milliseconds but just prints them out
separately.. And it has the advantage that if you want to change it to
microseconds some day, you can do so very trivially without breaking the
format. Plus it's readable as hell.)

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: try to write back redundant data before failing disk in raid5 setup

2006-04-30 Thread Neil Brown
On Monday May 1, [EMAIL PROTECTED] wrote:
 Hello,
 
 Suppose a read action on a disk which is member of a raid5 (or raid1 or any
 other raid where there's data redundancy) fails.
 What happens next is that the entire disk is marked as failed and a raid5
 rebuild is initiated.
 
 However, that seems like overkill to me. If only one sector on one disk
 failed, that sector could be re-calculated  (using parity calculations)
 AND written back to the original disk (i.e. the disk with the bad sector).
 Any modern disk will do sector remapping, so the bad sector will simply be
 replaced by a good one and there's no need to fail the entire disk.
 

... and any modern linux kernel (since about 2.6.15) will do exactly
what you suggest.

 The reason I bring this up is that I think raid5 rebuilds are 'scary'
 things. Suppose a raid5 rebuild is initiated while other members of the
 raid5 set have bad -but yet undetected- sectors scattered around the disc
 (Current_Pending_Sector in smartd speak). Now this raid5 rebuild would fail,
 losing the entire raid5 set. While each and every bit in the raid5 set might
 still be salvageable!  (I've seen this happen on 5x250Gb raid5 sets.)
 

For this reason it is good to regularly do a background read check of
the entire array.
  echo check > /sys/block/mdX/md/sync_action

Any read errors will trigger an attempt to overwrite the bad block
with good data.  Do this regularly, *before* any drive really fails.
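
For example, from a monthly cron job (a sketch assuming a single array
md0; adjust, or loop over /sys/block/md*, to suit):

  # /etc/cron.monthly/md-check  (hypothetical script)
  echo check > /sys/block/md0/md/sync_action

Progress shows up in /proc/mdstat, and the number of mismatches found
is reported in /sys/block/md0/md/mismatch_cnt afterwards.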

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 resizing

2006-04-30 Thread Neil Brown
On Monday May 1, [EMAIL PROTECTED] wrote:
 Hey folks.
 
 There's no point in using LVM on a raid5 setup if all you intend to do
 in the future is resize the filesystem on it, is there? The new raid5
 resizing code takes care of providing the extra space and then as long
 as the say ext3 filesystem is created with resize_inode all should be
 sweet. Right? Or have I missed something crucial here? :)

You are correct.  md/raid5 makes the extra space available all by
itself. 
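
So the whole operation is roughly (a sketch, assuming /dev/md0 grows
from 4 to 5 devices and carries the ext3 filesystem directly):

  mdadm /dev/md0 --add /dev/sde1
  mdadm --grow /dev/md0 --raid-devices=5
  # wait for the reshape to finish (watch /proc/mdstat), then:
  resize2fs /dev/md0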

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 003 of 5] md: Change ENOTSUPP to EOPNOTSUPP

2006-04-29 Thread Neil Brown
On Friday April 28, [EMAIL PROTECTED] wrote:
 NeilBrown wrote:
  Change ENOTSUPP to EOPNOTSUPP
  Because that is what you get if a BIO_RW_BARRIER isn't supported !
 
 Dumb question, hope someone can answer it :).
 
 Does this mean that any version of MD up till now won't know that SATA
 disks do not support barriers, and therefore won't flush SATA disks
 and therefore I need to disable the disks' write cache if I want to
 be 100% sure that raid arrays are not corrupted?
 
 Or am I way off :-).

The effect of this bug is almost unnoticeable.

In almost all cases, md will detect that a drive doesn't support
barriers when writing out the superblock - this is completely separate
code and is correct.  Thus md/raid1 will reject any barrier requests
coming from the filesystem and will never pass them down, and will not
make a wrong decision because of this bug.

The only cases where this bug could cause a problem are:
 1/ when the first write is a barrier write.  It is possible that
reiserfs does this in some cases.  However only this write will be
at risk.
 2/ if a device changes its behaviour from accepting barriers to  
    not accepting barriers (which is very uncommon).

As md will be rejecting barrier requests, the filesystem will know not
to trust them and should use other techniques such as waiting for
dependant requests to complete, and calling blkdev_issue_flush were
appropriate.

Whether filesystems actually do this, I am less certain.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Two-disk RAID5?

2006-04-26 Thread Neil Brown
On Wednesday April 26, [EMAIL PROTECTED] wrote:
 
 I suspect I should have just kept out of this, and waited for someone like 
 Neil to answer authoratatively.
 
 So...Neil, what's the right answer to Tuomas's 2 disk RAID5 question? :)
 

.. and a deep resounding voice from on-high spoke and in its infinite
wisdom it said
 
   yeh, whatever


The data layout on a 2disk raid5 and a 2 disk raid1 is identical (if
you ignore chunksize issues (raid1 doesn't need one) and the
superblock (which isn't part of the data)).  Each drive contains
identical data(*).

Write throughput to the r5 would be a bit slower because data is
always copied in memory first, then written.
Read throughput would be largely the same if the r5 chunk size was
fairly large, but much poorer for r5 if the chunksize was small.

Converting a raid1 to a raid5 while offline would be quite
straightforward except for the chunksize issue.  If the r1 wasn't a multiple
of the chunksize you chose for r5, then you would lose the last
fraction of a chunk.  So if you are planning to do this, set the size
of your r1 to something that is nice and round (e.g. a multiple of
128k).

Converting a raid1 to a raid5 while online is something I have been
thinking about, but it is not likely to happen any time soon.

I think that answers all the issues.

NeilBrown

(*) The term 'mirror' for raid1 has always bothered me because a
mirror presents a reflected image, while raid1 copies the data without
any transformation.

With a 2drive raid5, one drive gets the original data, and the other
drive gets the data after it has been 'reflected' through an XOR
operation, so maybe a 2drive raid5 is really a 'mirrored' pair...
Except that the data is still the same, as XOR with 0 produces no
change.
So, if we made a tiny change to raid5 and got the xor operation to
start with 0xff in every byte, then the XOR would reflect each byte
in a reasonably meaningful way, and we might actually get a mirrored
pair!!!  

But I don't think that would provide any real value :-)
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Trying to start dirty, degraded RAID6 array

2006-04-26 Thread Neil Brown
On Thursday April 27, [EMAIL PROTECTED] wrote:
 The short version:
 
 I have a 12-disk RAID6 array that has lost a device and now whenever I 
 try to start it with:
 
 mdadm -Af /dev/md0 /dev/sd[abcdefgijkl]1
 
 I get:
 
 mdadm: failed to RUN_ARRAY /dev/md0: Input/output error
 
...
 raid6: cannot start dirty degraded array for md0

The '-f' is meant to make this work.  However it seems there is a bug.

Could you please test this patch?  It isn't exactly the right fix, but
it definitely won't hurt.

Thanks,
NeilBrown

Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./super0.c |1 +
 1 file changed, 1 insertion(+)

diff ./super0.c~current~ ./super0.c
--- ./super0.c~current~ 2006-03-28 17:10:51.0 +1100
+++ ./super0.c  2006-04-27 10:03:40.0 +1000
@@ -372,6 +372,7 @@ static int update_super0(struct mdinfo *
if (sb->level == 5 || sb->level == 4 || sb->level == 6)
/* need to force clean */
sb->state |= (1 << MD_SB_CLEAN);
+   rv = 1;
}
if (strcmp(update, "assemble")==0) {
int d = info->disk.number;
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: linear writes to raid5

2006-04-26 Thread Neil Brown
On Thursday April 20, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
  
  What is the rationale for your position?
 
 My rationale was that if md layer receives *write* requests not smaller
 than a full stripe size, it is able to omit reading data to update, and
 can just calculate new parity from the new data.  Hence, combining a
 dozen small write requests coming from a filesystem to form a single
 request = full stripe size should dramatically increase
 performance.

That makes sense.

However in both cases (above and below raid5), the device receiving
the requests is in a better position to know what size is a good
size than the client sending the requests.
That is exactly what the 'plugging' concept is for.  When a request
arrives, the device is 'plugged' so that it won't process new
requests, and the request plus any following requests are queued.  At
some point the queue is unplugged and the device should be able to
collect related requests to make large requests of an appropriate size
and alignment for the device.

The current suggestion is that plugging isn't quite working right for
raid5.  That is certainly possible.


 
 Eg, when I use dd with O_DIRECT mode (oflag=direct) and experiment with
 different block sizes, write performance increases a lot when bs becomes
 full stripe size. Of course it decreases again when bs is increased a
 bit further (as md starts reading again, to construct parity blocks).
 

Yes, O_DIRECT is essentially saying "I know what I am doing and I
want to bypass all the smarts and go straight to the device".
O_DIRECT requests should certainly be sized and aligned to match the
device.  For non-O_DIRECT it shouldn't matter so much.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Trying to start dirty, degraded RAID6 array

2006-04-26 Thread Neil Brown
On Thursday April 27, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
  The '-f' is meant to make this work.  However it seems there is a bug.
  
  Could you please test this patch?  It isn't exactly the right fix, but
  it definitely won't hurt.
 
 Thanks, Neil, I'll give this a go when I get home tonight.
 
 Is there any way to start an array without kicking off a rebuild ?

echo 1 > /sys/module/md_mod/parameters/start_ro 

If you do this, then arrays will be read-only when they are started,
and so will not do a rebuild.  The first write request to the array
(e.g. if you mount a filesystem) will cause a switch to read/write and
any required rebuild will start. 

echo 0 > /sys/module/md_mod/parameters/start_ro
will revert the effect.

This requires a reasonably recent kernel.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 1/2] raid6_end_write_request() spinlock fix

2006-04-24 Thread Neil Brown
On Tuesday April 25, [EMAIL PROTECTED] wrote:
 Hello,
 
 Reduce the raid6_end_write_request() spinlock window.

Andrew: please don't include these in -mm.  This one and the
corresponding raid5 are wrong, and I'm not sure yet about the unplug_device
changes.

In this case, the call to md_error, which in turn calls error in
raid6main.c, requires the lock to be held as it contains:
if (!test_bit(Faulty, &rdev->flags)) {
    mddev->sb_dirty = 1;
    if (test_bit(In_sync, &rdev->flags)) {
        conf->working_disks--;
        mddev->degraded++;
        conf->failed_disks++;
        clear_bit(In_sync, &rdev->flags);
        /*
         * if recovery was running, make sure it aborts.
         */
        set_bit(MD_RECOVERY_ERR, &mddev->recovery);
    }
    set_bit(Faulty, &rdev->flags);

which is fairly clearly not safe without some locking.

Coywolf:  As I think I have already said, I appreciate your review of
the md/raid code and your attempts to improve it - I'm sure there is
plenty of room to make improvements.  
However posting patches with minimal commentary on code that you don't
fully understand is not the best way to work with the community.
If you see something that you think is wrong, it is much better to ask
why it is the way it is, explain why you think it isn't right, and
quite possibly include an example patch.  Then we can discuss the
issue and find the best solution.

So please feel free to post further patches, but please include more
commentary, and don't assume you understand something that you don't
really.

Thanks,
NeilBrown



 
 Signed-off-by: Coywolf Qi Hunt [EMAIL PROTECTED]
 ---
 
 diff --git a/drivers/md/raid6main.c b/drivers/md/raid6main.c
 index bc69355..820536e 100644
 --- a/drivers/md/raid6main.c
 +++ b/drivers/md/raid6main.c
 @@ -468,7 +468,6 @@ static int raid6_end_write_request (stru
    struct stripe_head *sh = bi->bi_private;
    raid6_conf_t *conf = sh->raid_conf;
    int disks = conf->raid_disks, i;
  - unsigned long flags;
    int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
   
    if (bi->bi_size)
  @@ -486,16 +485,14 @@ static int raid6_end_write_request (stru
    return 0;
    }
   
  - spin_lock_irqsave(&conf->device_lock, flags);
    if (!uptodate)
    md_error(conf->mddev, conf->disks[i].rdev);
   
    rdev_dec_pending(conf->disks[i].rdev, conf->mddev);
  -
    clear_bit(R5_LOCKED, &sh->dev[i].flags);
    set_bit(STRIPE_HANDLE, &sh->state);
  - __release_stripe(conf, sh);
  - spin_unlock_irqrestore(&conf->device_lock, flags);
 + release_stripe(sh);
 +
   return 0;
  }
  
 
 -- 
 Coywolf Qi Hunt
 -
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: to be or not to be...

2006-04-23 Thread Neil Brown
On Sunday April 23, [EMAIL PROTECTED] wrote:
 Hi all,
   to make a long story very very shorty:
   a) I create /dev/md1, kernel latest rc-2-git4 and mdadm-2.4.1.tgz,
  with this command:
  /root/mdadm -Cv /dev/.static/dev/.static/dev/.static/dev/md1 \
  --bitmap-chunk=1024 --chunk=256 --assume-clean --bitmap=internal 
 \
   ^^
  -l5 -n5 /dev/hda2 /dev/hdb2 /dev/hde2 /dev/hdf2 missing
   

 From the man page:
   --assume-clean
  Tell mdadm that the array pre-existed and is known to be  clean.
  It  can be useful when trying to recover from a major failure as
  you can be sure that no data will be affected unless  you  actu-
  ally  write  to  the array.  It can also be used when creating a
  RAID1 or RAID10 if you want to avoid the initial resync, however
  this  practice - while normally safe - is not recommended.   Use
   ^^^
  this ony if you really know what you are doing.
  ^^

So presumably you know what you are doing, and I wonder why you bother
to ask us :-)
Of course, if you don't know what you are doing, then I suggest
dropping the --assume-clean.  
Incorrect use of this flag can lead to data corruption.  This is
particularly true if your array goes degraded, but is also true while
your array isn't degraded.  In this case it is (I think) very unusual
and may not be the cause of your corruption, but you should avoid
using the flag anyway.


   b) dm-encrypt /dev/md1
   
   c) create fs with:
  mkfs.ext3 -O dir_index -L 'tritone' -i 256000 /dev/mapper/raidone
   
   d) export it via nfs (mounting /dev/mapper/raidone as ext2)
  

Why not ext3?

 
   e) start to cp-ing files
 
   f) after 1 TB of written data, with no problem/warning, one of the
   not-in-raid-array HDs froze

This could signal a bad controller.  If it does, then you cannot trust
any drives.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

