Re: Linux 2.4.0-test8 and swap/journaling fs on raid

2000-09-27 Thread Neil Brown

On Wednesday September 27, [EMAIL PROTECTED] wrote:
 I was just wondering if the issues with swap on a raid device and with using a
 journaling fs on a raid device had been fixed in the latest 2.4.0-test
 kernels?

Yes.  md in 2.4 doesn't do interesting things with the buffer cache,
so swap and journaling filesystems should have no issues with it.


 
 I've gone too soft over the last few years to read the raid code myself :-)
 
 Thanks in advance,
 
 Craig
 
 PS I might be able to make sense of the code, but my wife would kill me if I
 spent any more time on the computer.

So print it out and read it that way :-)  Following cross references
is a bit slow though...  maybe a source browser on a palm pilot so you
can still do it in the family room...

(At this point 5 people chime in and tell me about 3 different
whizz-bang packages which print C source with line numbers and cross
references and )

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



PATCH: raid1 - assorted bug fixes

2000-11-06 Thread Neil Brown



Linus,

 The following patch addresses a small number of bugs in raid1.c in
 2.4.0-test10.

 1/ A number of routines that are called from interrupt context used
 spin_lock_irq / spin_unlock_irq
   instead of the more appropriate
 spin_lock_irqsave( ,flags)   /  spin_unlock_irqrestore( ,flags)

   This can, and did, lead to deadlocks on an SMP system.

 2/ b_rsector and b_rdev are used in a couple of cases *after*
generic_make_request has been called.  If the underlying device
was, for example, RAID0, these fields would no longer have the
assumed values.  I have changed these cases to use b_blocknr
(suitably scaled) and b_dev.

This bug could affect correctness if raid1 is used over raid0 or
raid-linear or LVM.

 3/ In two cases, b_blocknr is calculated by *multiplying* b_rsector
by the sectors-per-block count instead of *dividing* by it.

This bug could affect correctness when restarting a read request
after a drive failure.
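
For illustration, here is a minimal stand-alone sketch of the corrected
sector-to-block conversion (the sector number and buffer size are made-up
example values; the helper is not part of raid1.c):

#include <stdio.h>

/* Convert a 512-byte sector number to a block number for a buffer of
 * 'b_size' bytes, the way the corrected code does it: divide by the
 * sectors-per-block count rather than multiply.
 */
static unsigned long sector_to_block(unsigned long b_rsector, int b_size)
{
	int sectors = b_size >> 9;		/* 512-byte sectors per block */
	return b_rsector / sectors;
}

int main(void)
{
	unsigned long b_rsector = 904;		/* arbitrary example sector */
	int b_size = 4096;			/* 4K buffer == 8 sectors */

	printf("block %lu (multiplying would wrongly give %lu)\n",
	       sector_to_block(b_rsector, b_size),
	       b_rsector * (b_size >> 9));
	return 0;
}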

NeilBrown

--- ./drivers/md/raid1.c	2000/11/07 02:14:25	1.1
+++ ./drivers/md/raid1.c	2000/11/07 02:15:21	1.2
@@ -91,7 +91,8 @@
 
 static inline void raid1_free_bh(raid1_conf_t *conf, struct buffer_head *bh)
 {
-	md_spin_lock_irq(&conf->device_lock);
+	unsigned long flags;
+	spin_lock_irqsave(&conf->device_lock, flags);
 	while (bh) {
 		struct buffer_head *t = bh;
 		bh=bh->b_next;
@@ -103,7 +104,7 @@
 			conf->freebh_cnt++;
 		}
 	}
-	md_spin_unlock_irq(&conf->device_lock);
+	spin_unlock_irqrestore(&conf->device_lock, flags);
 	wake_up(&conf->wait_buffer);
 }
 
@@ -182,10 +183,11 @@
 	r1_bh->mirror_bh_list = NULL;
 
 	if (test_bit(R1BH_PreAlloc, &r1_bh->state)) {
-		md_spin_lock_irq(&conf->device_lock);
+		unsigned long flags;
+		spin_lock_irqsave(&conf->device_lock, flags);
 		r1_bh->next_r1 = conf->freer1;
 		conf->freer1 = r1_bh;
-		md_spin_unlock_irq(&conf->device_lock);
+		spin_unlock_irqrestore(&conf->device_lock, flags);
 	} else {
 		kfree(r1_bh);
 	}
@@ -229,14 +231,15 @@
 
 static inline void raid1_free_buf(struct raid1_bh *r1_bh)
 {
+	unsigned long flags;
 	struct buffer_head *bh = r1_bh->mirror_bh_list;
 	raid1_conf_t *conf = mddev_to_conf(r1_bh->mddev);
 	r1_bh->mirror_bh_list = NULL;
 
-	md_spin_lock_irq(&conf->device_lock);
+	spin_lock_irqsave(&conf->device_lock, flags);
 	r1_bh->next_r1 = conf->freebuf;
 	conf->freebuf = r1_bh;
-	md_spin_unlock_irq(&conf->device_lock);
+	spin_unlock_irqrestore(&conf->device_lock, flags);
 	raid1_free_bh(conf, bh);
 }
 
@@ -371,7 +374,7 @@
 {
 	struct buffer_head *bh = r1_bh->master_bh;
 
-	io_request_done(bh->b_rsector, mddev_to_conf(r1_bh->mddev),
+	io_request_done(bh->b_blocknr*(bh->b_size>>9), mddev_to_conf(r1_bh->mddev),
 			test_bit(R1BH_SyncPhase, &r1_bh->state));
 
 	bh->b_end_io(bh, uptodate);
@@ -599,7 +602,7 @@
 
 	bh_req = &r1_bh->bh_req;
 	memcpy(bh_req, bh, sizeof(*bh));
-	bh_req->b_blocknr = bh->b_rsector * sectors;
+	bh_req->b_blocknr = bh->b_rsector / sectors;
 	bh_req->b_dev = mirror->dev;
 	bh_req->b_rdev = mirror->dev;
 /*	bh_req->b_rsector = bh->n_rsector; */
@@ -643,7 +646,7 @@
 		/*
 		 * prepare mirrored mbh (fields ordered for max mem throughput):
 		 */
-		mbh->b_blocknr    = bh->b_rsector * sectors;
+		mbh->b_blocknr    = bh->b_rsector / sectors;
 		mbh->b_dev        = conf->mirrors[i].dev;
 		mbh->b_rdev       = conf->mirrors[i].dev;
 		mbh->b_rsector    = bh->b_rsector;
@@ -1181,7 +1184,7 @@
 			struct buffer_head *bh1 = mbh;
 			mbh = mbh->b_next;
 			generic_make_request(WRITE, bh1);
-			md_sync_acct(bh1->b_rdev, bh1->b_size/512);
+			md_sync_acct(bh1->b_dev, bh1->b_size/512);
 		}
 	} else {
 		dev = bh->b_dev;
@@ -1406,7 +1409,7 @@
 	init_waitqueue_head(&bh->b_wait);
 
 	generic_make_request(READ, bh);
-	md_sync_acct(bh->b_rdev, bh->b_size/512);
+	md_sync_acct(bh->b_dev, bh->b_size/512);
 
 	return (bsize >> 10);
 
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: compatability between patched 2.2 and 2.4?

2000-11-07 Thread Neil Brown

On Tuesday November 7, [EMAIL PROTECTED] wrote:
 
 I have a question regarding the differences between the 2.2+RAID-patch
 kernels and the 2.4-test kernels - I was wondering if there are any
 differences between them.
 
 For example, if I build systems with 2.2.17+RAID and later install 2.4
 kernels on them, will the transition be seamless as far as RAID goes?
 
 Thx, marc

Transition should be fine - unless you are using a sparc.

But then for sparc, the 2.2 patch didn't really work properly unless
you hacked the superblock layout, so you probably know what you are
doing anyway.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Any experience with LSI SYM 53c1010 scsi controller??

2000-11-27 Thread Neil Brown



Hi,
 I am considering using an ASUS CUR-DLS motherboard in a new
 NFS/RAID server, and wonder if anyone has any experience to report
 either with it, or with the Ultra-160 dual-bus SCSI controller that
 it has - the LSI SYM 53c1010.

 From what I can find in the kernel source, and from LSI Logic's home
 page, it is supported, but I would love to hear from someone who has
 used it.

 Thanks,
NeilBrown

http://www.asus.com.tw/products/Motherboard/pentiumpro/cur-dls/index.html
ftp://ftp.lsil.com/HostAdapterDrivers/linux/c8xx-driver/Readme

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: [BUG] reconstruction doesn't start

2000-11-27 Thread Neil Brown

On Monday November 27, [EMAIL PROTECTED] wrote:
 
 When md2 is finished then md1 is resynced. Shouldn't they do
 resync at the same time?
 
 I never saw "md: serializing resync,..." which I suspected to get because
 md0 and md1 share the same physical disks.
 
 My findings:
The md driver in 2.4.0-test11-ac4 does ALL raid-1 resyncs serialized!!! 

Close.  All *reconstructions* are serialised.  *Resyncs* are not.
Here *reconstruction* means that a failed disk has been replaced and
data and parity are reconstructed onto it.
*Resync* means that after an unclean shutdown the parity is checked
and corrected if necessary.

This is an artifact of how the code was written.  It is not something
that "should be".  It is merely something that "is".

It is on my todo list to fix this, but it is not very high priority.

NeilBrown

 
 Can someone replicate this with their system?
 
 MfG / Regards
 Friedrich Lobenstock
 
 
 
 -
 To unsubscribe from this list: send the line "unsubscribe linux-raid" in
 the body of a message to [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: raid1 resync problem ? (fwd)

2000-11-28 Thread Neil Brown

On Tuesday November 28, [EMAIL PROTECTED] wrote:
 Hi, 
 
 I'm forwarding the message to you guys because I got no answer from Ingo
 
 Thanks

I would suggest always CCing [EMAIL PROTECTED].  I have
taken the liberty of CCing this reply there.

 
 -- Forwarded message --
 Date: Sat, 25 Nov 2000 14:21:28 -0200 (BRST)
 From: Marcelo Tosatti [EMAIL PROTECTED]
 To: Ingo Molnar [EMAIL PROTECTED]
 Subject: raid1 resync problem ? 
 
 
 Hi Ingo, 
 
 While reading raid1 code from 2.4 kernel, I've found the following part on
 raid1_make_request function:
 
 ...
 spin_lock_irq(&conf->segment_lock);
 wait_event_lock_irq(conf->wait_done,
 		bh->b_rsector < conf->start_active ||
 		bh->b_rsector >= conf->start_future,
 		conf->segment_lock);
 if (bh->b_rsector < conf->start_active)
 	conf->cnt_done++;
 else {
 	conf->cnt_future++;
 	if (conf->phase)
 		set_bit(R1BH_SyncPhase, &r1_bh->state);
 }
 spin_unlock_irq(&conf->segment_lock);
 ...
 
 
 If I understood correctly, bh->b_rsector is used to know if the sector
 number of the request being processed is not inside the resync range. 
 
 In case it is, it sleeps waiting for the resync daemon. Otherwise, it can
 send the operation to the lower level block device(s). 
 
 The problem is that the code does not check for the request length to know
 if the last sector of the request is smaller than conf->start_active. 
 
 For example, if we have conf->start_active = 1000, a write request with 8
 sectors and bh->b_rsector = 905 is allowed to be done. 3 blocks (1001,
 1002 and 1003) of this request are inside the resync range. 

The reason is subtle, but this cannot happen.
resync is always done in full pages. So (on intel) start_active will
always be a multiple of 8.  Also, b_size can be at most one page
(i.e. 4096 == 8 sectors) and b_rsector will be aligned to a multiple
of b_size.  Given this, if rsector < start_active, you can be certain
that rsector+(b_size>>9) <= start_active, so there isn't a problem and
your change is not necessary.   Adding a comment to the code to
explain this subtlety might be sensible though...
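
A tiny stand-alone program can exhaustively check that argument for the
page-sized limits mentioned above (the ranges are arbitrary; this only
illustrates the reasoning and is not raid1 code):

#include <assert.h>
#include <stdio.h>

/* Check the alignment argument: if start_active is a multiple of 8
 * sectors, b_size is at most one 4096-byte page, and b_rsector is
 * aligned to its own length in sectors, then
 * rsector < start_active implies the request ends on or before
 * start_active.
 */
int main(void)
{
	for (unsigned long start_active = 0; start_active < 4096; start_active += 8) {
		for (int b_size = 512; b_size <= 4096; b_size *= 2) {
			int sectors = b_size >> 9;
			for (unsigned long rsector = 0; rsector < 4096; rsector += sectors) {
				if (rsector < start_active)
					assert(rsector + sectors <= start_active);
			}
		}
	}
	printf("no request starting below start_active can cross it\n");
	return 0;
}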

NeilBrown


 
 If haven't missed anything, we can easily fix it using the last sector
 (bh-b_rsector + (bh-b_size  9)) instead the first sector when
 comparing with conf-start_active.
 
 Waiting for your comments. 
 
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: raid1 resync problem ? (fwd)

2000-11-28 Thread Neil Brown

On Tuesday November 28, [EMAIL PROTECTED] wrote:
 
 snip
 
   If I understood correctly, bh->b_rsector is used to know if the sector
   number of the request being processed is not inside the resync range. 
   
   In case it is, it sleeps waiting for the resync daemon. Otherwise, it can
   send the operation to the lower level block device(s). 
   
   The problem is that the code does not check for the request length to know
   if the last sector of the request is smaller than conf->start_active. 
   
   For example, if we have conf->start_active = 1000, a write request with 8
   sectors and bh->b_rsector = 905 is allowed to be done. 3 blocks (1001,
   1002 and 1003) of this request are inside the resync range. 
  
  The reason is subtle, but this cannot happen.
  resync is always done in full pages. So (on intel) start_active will
  always be a multiple of 8.  Also, b_size can be at most one page
  (i.e. 4096 == 8 sectors) and b_rsector will be aligned to a multiple
  of b_size.  Given this, if rsector < start_active, you can be certain
  that rsector+(b_size>>9) <= start_active, so there isn't a problem and
  your change is not necessary.   Adding a comment to the code to
  explain this subtlety might be sensible though...
 
 This becomes a problem with kiobuf requests (I have a patch to make raid1
 code kiobuf-aware).
 
 With kiobufs, it's possible (right now) to have requests up to 64kb, so
 the current code is problematic. 
 

In that case, your change sounds quite reasonable and should be
included in your patch to make raid1 kiobuf-aware.

Is there a URL to this patch? Can I look at it?

NeilBrown

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: we are finding that parity writes are half of all writes when writing 50mb files

2000-11-28 Thread Neil Brown

On Tuesday November 28, [EMAIL PROTECTED] wrote:
 On Tue, Nov 28, 2000 at 10:50:06AM +1100, Neil Brown wrote:
 However, there is only one "unplug-all-devices"(*) call in the API
 that a reader or write can make.  It is not possible to unplug a
 particular device, or better still, to unplug a particular request.
 
 This isn't totally true. When we run out of requests on a certain
 device and we must startup the I/O to release some of them, we unplug
 only the _single_ device and we don't unplug-all-devices anymore.
 
 So during writes to disk (without using O_SYNC that isn't supported
 by 2.4.0-test11 anyways :) you never unplug-all-devices, but you only
 unplug finegrined at the harddisk level.

Thanks for these comments.  They helped me think more clearly about what
was going on, and as a result I have raid5 working even faster still,
though not quite as fast as I hoped...

The raid5 device has a "stripe cache" where the stripes play a similar
role to the requests used in the elevator code (__make_request).
i.e. they gather together buffer_heads for requests that are most
efficiently processed at the same time.

When I run out of stripes, I need to wait for one to become free, but
first I need to unplug any underlying devices to make sure that
something *will* become free soon.  When I unplug those devices, I
have to call "run_task_queue(tq_disk)" (because that is the only
interface), and this unplugs the raid5 device too.  This substantially
reduces the effectiveness of the plugging that I had implemented.

To get around this artifact that I unplug whenever I need a stripe, I
changed the "get-free-stripe" code so that if it has to wait for a
stripe, it waits for 16 stripes.  This means that we should be able to
get up to 16 stripes all plugged together.

This has helped a lot and I am now getting dbench throughputs on 4K and
8K chunk sizes (still waiting for the rest of the results) that are
better than 2.2 ever gave me.
It still isn't as good as I hoped:
With 4K chunks the 3-drive throughput is significantly better than the
2-drive throughput.  With 8K it is now slightly less (instead of much
less).  But in 2.2, 3 drives with 8K chunks is better than 2 drives with
8K chunks.

What I really want to be able to do when I need a stripe but don't
have a free one is:

   1/ unplug any underlying devices
   2/ if >50% of my stripes are plugged, unplug this device.

(or some mild variation of that).  However with the current interface,
I cannot.

Still, I suspect I can squeeze a bit more out of the current
interface, and it will be enough for 2.4.  It will be fun working to
make the interfaces just right for 2.5.

NeilBrown

 
 That isn't true for reads of course, for reads it's the highlevel FS/VM layer
 that unplugs the queue and it only knows about the run_task_queue(tq_disk) but
 Jens has patches to fix it too :).

I'll have to have a look... but not today.

 
 Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



PATCH - md/Makefile - link order

2000-11-28 Thread Neil Brown


Linus,
 A couple of versions of this patch went into Alan's tree, but weren't
 quite right.  This one is minimal, but works.

 The problem is that, with the tidy-up of xor.o, it now auto-initialises
 itself instead of being called by raid.o, and so needs to be linked
 *before* md.o, as the initialiser for md.o may start up a raid5
 device that needs xor.

 This patch simply puts xor before md.  I would like to tidy this up
 further and have all the raid flavours auto-initialise, but there are
 issues that I have to clarify with the kbuild people before I do
 that.

 After compiling with this patch, 
   % objdump -t vmlinux | grep initcall.init
 contains:
c03345dc l O .initcall.init 0004 __initcall_calibrate_xor_block
c03345e0 l O .initcall.init 0004 __initcall_md_init
c03345e4 l O .initcall.init 0004 __initcall_md_run_setup

 in that order which convinces me that it gets the order right.

NeilBrown

--- ./drivers/md/Makefile   2000/11/29 03:46:13 1.1
+++ ./drivers/md/Makefile   2000/11/29 04:05:27 1.2
@@ -16,12 +16,16 @@
 obj-n  :=
 obj-   :=
 
-obj-$(CONFIG_BLK_DEV_MD)   += md.o
+# NOTE: xor.o must link *before* md.o so that auto-detect
+# of raid5 arrays works (and doesn't Oops).  Fortunately
+# they are both export-objs, so setting the order here
+# works.
 obj-$(CONFIG_MD_LINEAR)+= linear.o
 obj-$(CONFIG_MD_RAID0) += raid0.o
 obj-$(CONFIG_MD_RAID1) += raid1.o
 obj-$(CONFIG_MD_RAID5) += raid5.o xor.o
+obj-$(CONFIG_BLK_DEV_MD)   += md.o
 obj-$(CONFIG_BLK_DEV_LVM)  += lvm-mod.o
 
 # Translate to Rules.make lists.
 O_OBJS := $(filter-out $(export-objs), $(obj-y))
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



PATCH - raid5.c - bad calculation

2000-11-28 Thread Neil Brown


Linus, 
 I sent this patch to Alan a little while ago, but after ac4, so I
 don't know if it went into his tree.

 There is a bit of code at the front of raid5_sync_request which
 calculates which block is the parity block for a given stripe.
 However, to convert from a block number (1K units) to a sector number
 it does <<2 instead of <<1 (i.e. *2), which leads to the wrong results.
 This can lead to data corruption, hanging, or an Oops.

 This patch fixes it (and allows my raid5 testing to run happily to
 completion). 
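
For illustration, a small stand-alone sketch of the conversion (the chunk
size and block number are arbitrary example values, not taken from the
patch):

#include <stdio.h>

int main(void)
{
	int chunk_size = 32 * 1024;		 /* example 32K chunks */
	int sectors_per_chunk = chunk_size >> 9; /* 512-byte sectors per chunk */
	unsigned long block_nr = 3000;		 /* example 1K block number */

	/* A 1K block is two 512-byte sectors, so the conversion is <<1 (*2). */
	unsigned long sector = block_nr << 1;
	unsigned long stripe = sector / sectors_per_chunk;
	int chunk_offset = sector % sectors_per_chunk;

	printf("stripe %lu, offset %d (the old <<2 would give stripe %lu)\n",
	       stripe, chunk_offset, (block_nr << 2) / sectors_per_chunk);
	return 0;
}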

NeilBrown


--- ./drivers/md/raid5.c	2000/11/29 04:15:54	1.1
+++ ./drivers/md/raid5.c	2000/11/29 04:16:29	1.2
@@ -1516,8 +1516,8 @@
 	raid5_conf_t *conf = (raid5_conf_t *) mddev->private;
 	struct stripe_head *sh;
 	int sectors_per_chunk = conf->chunk_size >> 9;
-	unsigned long stripe = (block_nr<<2)/sectors_per_chunk;
-	int chunk_offset = (block_nr<<2) % sectors_per_chunk;
+	unsigned long stripe = (block_nr<<1)/sectors_per_chunk;
+	int chunk_offset = (block_nr<<1) % sectors_per_chunk;
 	int dd_idx, pd_idx;
 	unsigned long first_sector;
 	int raid_disks = conf->raid_disks;
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



PATCH - md_boot - ifdef fix

2000-11-28 Thread Neil Brown



Linus,
 There are currently two ways to get md/raid devices configured at boot
 time.
 AUTODETECT_RAID finds bits of raid arrays from partition types and
automagically connects them together.
 MD_BOOT allows bits of raid arrays to be explicitly described on the
boot line.

 Currently, MD_BOOT is not effective unless AUTODETECT_RAID is also
 enabled, as both are implemented by md_run_setup, and md_run_setup is
 only called ifdef CONFIG_AUTODETECT_RAID.
 This patch fixes this irregularity.

NeilBrown

(patch against test12-pre2, as were the previous few, but I forget to mention).


--- ./drivers/md/md.c   2000/11/29 04:22:13 1.2
+++ ./drivers/md/md.c   2000/11/29 04:49:29 1.3
@@ -3853,7 +3853,7 @@
 #endif
 
 __initcall(md_init);
-#ifdef CONFIG_AUTODETECT_RAID
+#if defined(CONFIG_AUTODETECT_RAID) || defined(CONFIG_MD_BOOT)
 __initcall(md_run_setup);
 #endif
 
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



PATCH - md - MAX_REAL yields to MD_SB_DISKS

2000-11-28 Thread Neil Brown


Linus,
 md currently has two #defines which give a limit to the number of
 devices that can be in a given raid array:

  MAX_REAL (==12) dates back to the time before we had persistent
 superblocks, and mostly affects raid0
 
  MD_SB_DISKS (==27) is a characteristic of the newer persistent
 superblocks and says how many devices can be described in a
 superblock.

 Having the two is inconsistent and needlessly limits raid0 arrays.
 This patch replaces MAX_REAL, in the few places that it occurs, with
 MD_SB_DISKS.

 Thanks to Gary Murakami [EMAIL PROTECTED] for prodding me to
 make this patch.

NeilBrown



--- ./include/linux/raid/md_k.h 2000/11/29 04:54:32 1.1
+++ ./include/linux/raid/md_k.h 2000/11/29 04:55:47 1.2
@@ -59,7 +59,6 @@
 #error MD doesnt handle bigger kdev yet
 #endif
 
-#define MAX_REAL     12		/* Max number of disks per md dev */
 #define MAX_MD_DEVS  (1<<MINORBITS)	/* Max number of md dev */
 
 /*
--- ./include/linux/raid/raid0.h2000/11/29 04:54:32 1.1
+++ ./include/linux/raid/raid0.h2000/11/29 04:55:47 1.2
@@ -9,7 +9,7 @@
unsigned long dev_offset;   /* Zone offset in real dev */
unsigned long size; /* Zone size */
int nb_dev; /* # of devices attached to the zone */
-   mdk_rdev_t *dev[MAX_REAL]; /* Devices attached to the zone */
+   mdk_rdev_t *dev[MD_SB_DISKS]; /* Devices attached to the zone */
 };
 
 struct raid0_hash
--- ./drivers/md/md.c   2000/11/29 04:49:29 1.3
+++ ./drivers/md/md.c   2000/11/29 04:55:47 1.4
@@ -3587,9 +3587,9 @@
 {
static char * name = "mdrecoveryd";

-   printk (KERN_INFO "md driver %d.%d.%d MAX_MD_DEVS=%d, MAX_REAL=%d\n",
+   printk (KERN_INFO "md driver %d.%d.%d MAX_MD_DEVS=%d, MD_SB_DISKS=%d\n",
MD_MAJOR_VERSION, MD_MINOR_VERSION,
-   MD_PATCHLEVEL_VERSION, MAX_MD_DEVS, MAX_REAL);
+   MD_PATCHLEVEL_VERSION, MAX_MD_DEVS, MD_SB_DISKS);
 
if (devfs_register_blkdev (MAJOR_NR, "md", md_fops))
{
@@ -3639,7 +3639,7 @@
unsigned long set;
int pers[MAX_MD_BOOT_DEVS];
int chunk[MAX_MD_BOOT_DEVS];
-   kdev_t devices[MAX_MD_BOOT_DEVS][MAX_REAL];
+   kdev_t devices[MAX_MD_BOOT_DEVS][MD_SB_DISKS];
 } md_setup_args md__initdata = { 0, };
 
 /*
@@ -3713,7 +3713,7 @@
pername="super-block";
}
devnames = str;
-	for (; i<MAX_REAL && str; i++) {
+	for (; i<MD_SB_DISKS && str; i++) {
if ((device = name_to_kdev_t(str))) {
md_setup_args.devices[minor][i] = device;
} else {
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



PATCH - md.c - confusing message corrected

2000-11-28 Thread Neil Brown


Linus, 
  This is a resend of a patch that probably got lost a week or so ago.
  (It is also more grammatically correct).

  If md.c has two raid arrays that need to be resynced, and they share
  a physical device, then the two resyncs are serialised.  However the
  message printed says something about "overlapping" which confuses
  and worries people needlessly.  
  This patch improves the message.

NeilBrown


--- ./drivers/md/md.c   2000/11/29 04:21:37 1.1
+++ ./drivers/md/md.c   2000/11/29 04:22:13 1.2
@@ -3279,7 +3279,7 @@
if (mddev2 == mddev)
continue;
 		if (mddev2->curr_resync && match_mddev_units(mddev,mddev2)) {
-			printk(KERN_INFO "md: serializing resync, md%d has overlapping physical units with md%d!\n", mdidx(mddev), mdidx(mddev2));
+			printk(KERN_INFO "md: serializing resync, md%d shares one or more physical units with md%d!\n", mdidx(mddev), mdidx(mddev2));
serialize = 1;
break;
}
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



PATCH - md - initialisation cleanup

2000-11-29 Thread Neil Brown


Linus,
 here is a patch for test12 which cleans up the initialisation of raid
 personalities.  I didn't include it in the previous raid init cleanup
 because I hadn't figured out the inner mysteries of link order
 completely.  The linux-kbuild list helped there.

 This patch arranges that each personality auto-initialises, and
 makes sure that all personalities are initialised before md.o gets
 initialised.

 An earlier form of this (which didn't get the Makefile quite right)
 went into test11ac??.

NeilBrown


--- ./drivers/md/Makefile   2000/11/29 05:46:23 1.3
+++ ./drivers/md/Makefile   2000/11/29 06:04:25 1.4
@@ -16,10 +16,13 @@
 obj-n  :=
 obj-   :=
 
-# NOTE: xor.o must link *before* md.o so that auto-detect
-# of raid5 arrays works (and doesn't Oops).  Fortunately
-# they are both export-objs, so setting the order here
-# works.
+# Note: link order is important.  All raid personalities
+# and xor.o must come before md.o, as they each initialise 
+# themselves, and md.o may use the personalities when it 
+# auto-initialised.
+# The use of MIX_OBJS allows link order to be maintained even
+# though some are export-objs and some aren't.
+
 obj-$(CONFIG_MD_LINEAR)+= linear.o
 obj-$(CONFIG_MD_RAID0) += raid0.o
 obj-$(CONFIG_MD_RAID1) += raid1.o
@@ -28,10 +31,11 @@
 obj-$(CONFIG_BLK_DEV_LVM)  += lvm-mod.o
 
 # Translate to Rules.make lists.
-O_OBJS := $(filter-out $(export-objs), $(obj-y))
-OX_OBJS:= $(filter $(export-objs), $(obj-y))
-M_OBJS := $(sort $(filter-out $(export-objs), $(obj-m)))
-MX_OBJS:= $(sort $(filter  $(export-objs), $(obj-m)))
+active-objs:= $(sort $(obj-y) $(obj-m))
+
+O_OBJS := $(obj-y)
+M_OBJS := $(obj-m)
+MIX_OBJS   := $(filter $(export-objs), $(active-objs))
 
 include $(TOPDIR)/Rules.make
 
--- ./drivers/md/md.c   2000/11/29 04:55:47 1.4
+++ ./drivers/md/md.c   2000/11/29 06:04:25 1.5
@@ -3576,12 +3576,6 @@
create_proc_read_entry("mdstat", 0, NULL, md_status_read_proc, NULL);
 #endif
 }
-void hsm_init (void);
-void translucent_init (void);
-void linear_init (void);
-void raid0_init (void);
-void raid1_init (void);
-void raid5_init (void);
 
 int md__init md_init (void)
 {
@@ -3617,18 +3611,6 @@
 	md_register_reboot_notifier(&md_notifier);
raid_table_header = register_sysctl_table(raid_root_table, 1);
 
-#ifdef CONFIG_MD_LINEAR
-   linear_init ();
-#endif
-#ifdef CONFIG_MD_RAID0
-   raid0_init ();
-#endif
-#ifdef CONFIG_MD_RAID1
-   raid1_init ();
-#endif
-#ifdef CONFIG_MD_RAID5
-   raid5_init ();
-#endif
md_geninit();
return (0);
 }
--- ./drivers/md/raid5.c2000/11/29 04:16:29 1.2
+++ ./drivers/md/raid5.c2000/11/29 06:04:25 1.3
@@ -2352,19 +2352,16 @@
sync_request:   raid5_sync_request
 };
 
-int raid5_init (void)
+static int md__init raid5_init (void)
 {
 	return register_md_personality (RAID5, &raid5_personality);
 }
 
-#ifdef MODULE
-int init_module (void)
-{
-   return raid5_init();
-}
-
-void cleanup_module (void)
+static void raid5_exit (void)
 {
unregister_md_personality (RAID5);
 }
-#endif
+
+module_init(raid5_init);
+module_exit(raid5_exit);
+
--- ./drivers/md/linear.c   2000/11/29 05:45:04 1.1
+++ ./drivers/md/linear.c   2000/11/29 06:04:25 1.2
@@ -190,24 +190,16 @@
status: linear_status,
 };
 
-#ifndef MODULE
-
-void md__init linear_init (void)
-{
-	register_md_personality (LINEAR, &linear_personality);
-}
-
-#else
-
-int init_module (void)
+static int md__init linear_init (void)
 {
-	return (register_md_personality (LINEAR, &linear_personality));
+	return register_md_personality (LINEAR, &linear_personality);
 }
 
-void cleanup_module (void)
+static void linear_exit (void)
 {
unregister_md_personality (LINEAR);
 }
 
-#endif
 
+module_init(linear_init);
+module_exit(linear_exit);
--- ./drivers/md/raid0.c2000/11/29 05:45:04 1.1
+++ ./drivers/md/raid0.c2000/11/29 06:04:25 1.2
@@ -333,24 +333,17 @@
status: raid0_status,
 };
 
-#ifndef MODULE
-
-void raid0_init (void)
-{
-	register_md_personality (RAID0, &raid0_personality);
-}
-
-#else
-
-int init_module (void)
+static int md__init raid0_init (void)
 {
-	return (register_md_personality (RAID0, &raid0_personality));
+	return register_md_personality (RAID0, &raid0_personality);
 }
 
-void cleanup_module (void)
+static void raid0_exit (void)
 {
unregister_md_personality (RAID0);
 }
 
-#endif
+module_init(raid0_init);
+module_exit(raid0_exit);
+
 
--- ./drivers/md/raid1.c2000/11/29 05:45:04 1.1
+++ ./drivers/md/raid1.c2000/11/29 06:04:25 1.2
@@ -1882,19 +1882,16 @@
sync_request:   raid1_sync_request
 };
 
-int raid1_init (void)
+static int md__init raid1_init (void)
 {
return register_md_personality (RAID1, 

Re: autodetect question

2000-12-01 Thread Neil Brown

On Friday December 1, [EMAIL PROTECTED] wrote:
 If I have all of MD as a module and autodetect raid enabled, do the MD
 drives that the machine has get detected and setup
 1) at boot
 2) at module load
 or
 3) it doesn't

3.  It doesn't.
  Rationale: by the time you are loading a module, you have enough
  user space running to do the auto-detect stuff in user space.
  The simple fact that no-one has written autodetect code for user
  space yet is not a kernel issue.  I will when I need it, unless
  someone beats me to it.

 
 
 This is another question.  Is it possible to change the code so that
 autodetect works when the whole disk is part of the raid instead of a
 partition under it?  (ie: check drives that the kernel couldn't find a
 partition table on)
 
http://bible.gospelcom.net/cgi-bin/bible?passage=1COR+10:2

1 Corinthians 10

23 "Everything is permissible"--but not everything is
   beneficial. "Everything is permissible"--but not everything is
   constructive.  

Yes you could, but I don't think that you should.
If you want to boot off a raid array of whole-drives, then enable
CONFIG_MD_BOOT and boot with 
 md=0,/dev/hda,/dev/hdb
or similar.
If you want this for a non-boot drive, configure it from user-space.

NeilBrown

coming soon: partitioning for md devices:
  md=0,/dev/hda,/dev/hdb root=/dev/md0p1 swap=/dev/md0p2

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: FW: how to upgrade raid disks from redhat 4.1 to redhat 6.2

2000-12-01 Thread Neil Brown

On Friday December 1, [EMAIL PROTECTED] wrote:
  -Original Message-
  From: Carly Yang [mailto:[EMAIL PROTECTED]]
  Sent: Friday, December 01, 2000 2:42 PM
  To: [EMAIL PROTECTED]
  Subject: how to upgrade raid disks from redhat 4.1 to redhat 6.2
  
  
  Dear Gregory
  I have a system which run on redhat 4.1 with tow scsi hard 
  disks making a RAID0 partiton. I add command in 
  /etc/rc.d/rc.local as the following:
  /sbin/mdadd /dev/md0 /dev/sda1 /dev/sdb1
  /sbin/mdrun -p0 /dev/md0
  e2fsck -y /dev/md0
  mount -t ext2 /dev/md0 /home
   
  So I can access the raid disk.
  Recently I upgrade the system to redhat 6.2, I made the 
  raidtab in /etc/ as following:
  raiddev /dev/md0
  raid-level0
  nr-raid-disks2
  persistent-superblock0
  chunk-size8
   
  device/dev/sda1
  raid-disk0
  device/dev/sdb1
  raid-disk1
   
  I run "mkraid --upgrade /dev/md0" to upgrade raid partion to 
  new system. But it report error as :
  Cannot upgrade magic-less superblock on /dev/sda1 ...

I think you want raid0run.  Check the man page and see if it works for
you.

NeilBrown


  mkraid: aborted, see syslog and /proc/mdstat for potential clues.
  run "cat /proc/mdstat" get "personalities:
  read-aheas net set
  unused device: none
   
  I run "mkraid" in mandrake 7.1 and get the same result, I 
  don't know how to make a raid partition upgrade now. Could 
  tell how to do that? 
  I read your  Linux-RAID-FAQ, I think you can give me some 
  good advice.
   
  Yours Sincerely
   
  Carl
   
  
 -
 To unsubscribe from this list: send the line "unsubscribe linux-raid" in
 the body of a message to [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: Ex2FS unable to read superblock

2000-12-03 Thread Neil Brown

On Sunday December 3, [EMAIL PROTECTED] wrote:
 I'm new to the raid under linux world, and had a question.  Sorry if several
 posts have been made by me previously, I had some trouble subscribing to the
 list...
 
  I successfully installed redhat 6.2 with raid 0 for two drives on a sun
 ultra 1.  However i'm trying to rebuild the kernel, and thought i'd play
 with 2.4test11 since it has the raid code built in, but to no avail.  while
 it will auto detect the raid drives fine from what i can tell, the kernel
 always panics with "EX2FS Unable Ro Read Superblock"  any thoughts on what i
 might be doing wrong that is causing this error.  sorry if this has been
 brought up before
 
 dave
 
 -
 To unsubscribe from this list: send the line "unsubscribe linux-raid" in
 the body of a message to [EMAIL PROTECTED]

Details might help.
e.g. a copy of the boot logs when booting under 2.2.whatever and it
working, and similar logs with it booting under 2.4.0test11 and it not
working, though they may not be as easy to get if your root
filesystem doesn't come up.

There were issues in the 2.2 raid code on Sparc hardware which may
have been resolved by redhat, and may have been resolved differently
in 2.4.
Look for a line like:

  md.c: sizeof(mdp_super_t) = 4096

is the number 4096 in both cases?  If not, then that is probably the
problem.  It is quite possible that raid in 2.4 on sparc is not
upwards compatible with raid in redhat6.2 on sparc.

NeilBrown

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: [OOPS] raidsetfaulty - raidhotremove - raidhotadd

2000-12-06 Thread Neil Brown

On Wednesday December 6, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
  
  Could you try this patch and see how it goes?
 
 Same result!
 

Ok... must be something else... I tried again to reproduce it, and
this time I succeeded.
The problem happens when you try to access the last 128k of a raid1
array that has been reconstructed since the last reboot.

The reconstruction creates a sliding 3-part window which is 3*128k
wide.

The leading pane ("pending") may have some outstanding I/O requests,
but no new requests will be added.
The middle pane ("ready") has no outstanding I/O requests, and gets no
new I/O requests, but does get new rebuild requests.
The trailing pane ("active") has outstanding rebuild requests, but no
new I/O requests will be added.

This window slides forward through the address space keeping IO and
reconstruction quite separate.

However, the reconstruction process finishes with "ready" still
covering the tail end of the address space.  "active" has fallen off
the end, and "pending" has collapsed down to an empty pane, but "ready"
is still there.

When rebuilding after an unclean shutdown, this gets cleaned up
properly, but when rebuilding onto a spare, it doesn't.
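
To make the window concrete, here is a small stand-alone sketch that
classifies a sector against the four boundaries described above (the field
names follow raid1's conf structure, but the window values and the helper
itself are only illustrative):

#include <stdio.h>

/* The resync window is bounded by four increasing marks:
 *   [0, start_active)             - done
 *   [start_active, start_ready)   - "active": rebuild I/O in flight
 *   [start_ready, start_pending)  - "ready": rebuild requests being added
 *   [start_pending, start_future) - "pending": old I/O still draining
 *   [start_future, ...)           - normal I/O allowed
 */
struct window {
	unsigned long start_active, start_ready, start_pending, start_future;
};

static const char *classify(const struct window *w, unsigned long sector)
{
	if (sector < w->start_active)  return "done (I/O allowed)";
	if (sector < w->start_ready)   return "active (rebuild in flight)";
	if (sector < w->start_pending) return "ready (rebuild queued)";
	if (sector < w->start_future)  return "pending (old I/O draining)";
	return "future (I/O allowed)";
}

int main(void)
{
	/* example: a 3*128k window expressed in 512-byte sectors (128k == 256) */
	struct window w = { 1024, 1280, 1536, 1792 };
	unsigned long samples[] = { 0, 1100, 1400, 1600, 4000 };

	for (int i = 0; i < 5; i++)
		printf("sector %5lu: %s\n", samples[i], classify(&w, samples[i]));
	return 0;
}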

The attached patch, which can also be found at:

 http://www.cse.unsw.edu.au/~neilb/patches/linux/2.4.0-test12-pre6/patch-G-raid1rebuild

fixed the problem.  It should apply to any recent 2.4.0-test kernel.

Please try it and confirm that it works.

Thanks,

NeilBrown


--- ./drivers/md/raid1.c	2000/12/06 22:34:27	1.3
+++ ./drivers/md/raid1.c	2000/12/06 22:37:04	1.4
@@ -798,6 +798,32 @@
 	}
 }
 
+static void close_sync(raid1_conf_t *conf)
+{
+	mddev_t *mddev = conf->mddev;
+	/* If reconstruction was interrupted, we need to close the "active" and "pending"
+	 * holes.
+	 * we know that there are no active rebuild requests, so cnt_active == cnt_ready == 0
+	 */
+	/* this is really needed when recovery stops too... */
+	spin_lock_irq(&conf->segment_lock);
+	conf->start_active = conf->start_pending;
+	conf->start_ready = conf->start_pending;
+	wait_event_lock_irq(conf->wait_ready, !conf->cnt_pending, conf->segment_lock);
+	conf->start_active = conf->start_ready = conf->start_pending = conf->start_future;
+	conf->start_future = mddev->sb->size+1;
+	conf->cnt_pending = conf->cnt_future;
+	conf->cnt_future = 0;
+	conf->phase = conf->phase ^1;
+	wait_event_lock_irq(conf->wait_ready, !conf->cnt_pending, conf->segment_lock);
+	conf->start_active = conf->start_ready = conf->start_pending = conf->start_future = 0;
+	conf->phase = 0;
+	conf->cnt_future = conf->cnt_done;
+	conf->cnt_done = 0;
+	spin_unlock_irq(&conf->segment_lock);
+	wake_up(&conf->wait_done);
+}
+
 static int raid1_diskop(mddev_t *mddev, mdp_disk_t **d, int state)
 {
 	int err = 0;
@@ -910,6 +936,7 @@
 	 * Deactivate a spare disk:
 	 */
 	case DISKOP_SPARE_INACTIVE:
+		close_sync(conf);
 		sdisk = conf->mirrors + spare_disk;
 		sdisk->operational = 0;
 		sdisk->write_only = 0;
@@ -922,7 +949,7 @@
 	 * property)
 	 */
 	case DISKOP_SPARE_ACTIVE:
-
+		close_sync(conf);
 		sdisk = conf->mirrors + spare_disk;
 		fdisk = conf->mirrors + failed_disk;
 
@@ -1213,27 +1240,7 @@
 		conf->resync_mirrors = 0;
 	}
 
-	/* If reconstruction was interrupted, we need to close the "active" and "pending"
-	 * holes.
-	 * we know that there are no active rebuild requests, so cnt_active == cnt_ready == 0
-	 */
-	/* this is really needed when recovery stops too... */
-	spin_lock_irq(&conf->segment_lock);
-	conf->start_active = conf->start_pending;
-	conf->start_ready = conf->start_pending;
-	wait_event_lock_irq(conf->wait_ready, !conf->cnt_pending, conf->segment_lock);
-	conf->start_active = conf->start_ready = conf->start_pending = conf->start_future;
-	conf->start_future = mddev->sb->size+1;
-	conf->cnt_pending = conf->cnt_future;
-	conf->cnt_future = 0;
-	conf->phase = conf->phase ^1;
-	wait_event_lock_irq(conf->wait_ready, !conf->cnt_pending, conf->segment_lock);
-	conf->start_active = conf->start_ready = conf->start_pending = conf->start_future = 0;
-	conf->phase = 0;
-	conf->cnt_future = conf->cnt_done;
-	conf->cnt_done = 0;
-	spin_unlock_irq(&conf->segment_lock);
-	wake_up(&conf->wait_done);
+	close_sync(conf);
 
 	up(&mddev->recovery_sem);
 	raid1_shrink_buffers(conf);
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



PATCH - raid1 next drive selection.

2000-12-10 Thread Neil Brown


Linus (et al)

 The raid1 code has a concept of finding a "next available drive".  It
 uses this for read balancing.
 Currently, this is implemented via a simple linked list that links
 the working drives together.

 However, there is no locking to make sure that the list does not get
 modified by two processors at the same time, and hence corrupted
 (though it is changed so  rarely that that is unlikely).
 Also, when choosing a drive to read from for rebuilding a new spare,
 the "last_used" drive is used, even though that might be a drive which
 failed recently.  i.e. there is no check that the "last_used" drive is
 actually still valid.  I managed to exploit this to get the kernel
 into a tight spin.

 This patch discards the linked list and just walks the array ignoring
 failed drives.  It also makes sure that "last_used" is always
 validated before being used.
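
 As an illustration, a stand-alone sketch of the new walk (not the
 raid1.c code itself; the mirror array contents are invented):

#include <stdio.h>

struct mirror { int operational; int write_only; };

/* Walk backwards through the mirrors (with wrap-around) from 'start',
 * skipping drives that are failed or write-only, and give up after one
 * full lap so a fully-failed array cannot spin forever.
 */
static int prev_operational(struct mirror *m, int raid_disks, int start)
{
	int disk = start;

	do {
		if (disk <= 0)
			disk = raid_disks;
		disk--;
		if (disk == start)
			return start;	/* went all the way around */
	} while (m[disk].write_only || !m[disk].operational);

	return disk;
}

int main(void)
{
	struct mirror mirrors[4] = {
		{ 1, 0 }, { 0, 0 }, { 1, 1 }, { 1, 0 }	/* disk 1 failed, disk 2 write-only */
	};

	printf("next readable disk before 3 is %d\n",
	       prev_operational(mirrors, 4, 3));	/* prints 0 */
	return 0;
}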

 patch against 2.4.0-test12-pre8

NeilBrown

--- ./include/linux/raid/raid1.h	2000/12/10 22:38:20	1.1
+++ ./include/linux/raid/raid1.h	2000/12/10 22:38:25	1.2
@@ -7,7 +7,6 @@
 	int		number;
 	int		raid_disk;
 	kdev_t		dev;
-	int		next;
 	int		sect_limit;
 	int		head_position;
 
--- ./drivers/md/raid1.c	2000/12/10 22:36:34	1.2
+++ ./drivers/md/raid1.c	2000/12/10 22:38:25	1.3
@@ -463,16 +463,12 @@
 	if (conf->resync_mirrors)
 		goto rb_out;
 
-	if (conf->working_disks < 2) {
-		int i = 0;
-
-		while( !conf->mirrors[new_disk].operational &&
-				(i < MD_SB_DISKS) ) {
-			new_disk = conf->mirrors[new_disk].next;
-			i++;
-		}
-
-		if (i >= MD_SB_DISKS) {
+
+	/* make sure that disk is operational */
+	while( !conf->mirrors[new_disk].operational) {
+		if (new_disk <= 0) new_disk = conf->raid_disks;
+		new_disk--;
+		if (new_disk == disk) {
 			/*
 			 * This means no working disk was found
 			 * Nothing much to do, lets not change anything
@@ -480,11 +476,13 @@
 			 */
 
 			new_disk = conf->last_used;
+
+			goto rb_out;
 		}
-
-		goto rb_out;
 	}
-
+	disk = new_disk;
+	/* now disk == new_disk == starting point for search */
+
 	/*
 	 * Don't touch anything for sequential reads.
 	 */
@@ -501,16 +499,16 @@
 
 	if (conf->sect_count >= conf->mirrors[new_disk].sect_limit) {
 		conf->sect_count = 0;
-
-		while( new_disk != conf->mirrors[new_disk].next ) {
-			if ((conf->mirrors[new_disk].write_only) ||
-			(!conf->mirrors[new_disk].operational) )
-				continue;
-
-			new_disk = conf->mirrors[new_disk].next;
-			break;
-		}
-
+
+		do {
+			if (new_disk<=0)
+				new_disk = conf->raid_disks;
+			new_disk--;
+			if (new_disk == disk)
+				break;
+		} while ((conf->mirrors[new_disk].write_only) ||
+			 (!conf->mirrors[new_disk].operational));
+
 		goto rb_out;
 	}
 
@@ -519,8 +517,10 @@
 
 	/* Find the disk which is closest */
 
-	while( conf->mirrors[disk].next != conf->last_used ) {
-		disk = conf->mirrors[disk].next;
+	do {
+		if (disk <= 0)
+			disk = conf->raid_disks;
+		disk--;
 
 		if ((conf->mirrors[disk].write_only) ||
 		(!conf->mirrors[disk].operational))
@@ -534,7 +534,7 @@
 			current_distance = new_distance;
 			new_disk = disk;
 		}
-	}
+	} while (disk != conf->last_used);
 
 rb_out:
 	conf->mirrors[new_disk].head_position = this_sector + sectors;
@@ -702,16 +702,6 @@
 	return sz;
 }
 
-static void unlink_disk (raid1_conf_t *conf, int target)
-{
-	int disks = MD_SB_DISKS;
-	int i;
-
-	for (i = 0; i < disks; i++)
-		if (conf->mirrors[i].next == target)
-			conf->mirrors[i].next = conf->mirrors[target].next;
-}
-
 #define LAST_DISK KERN_ALERT \
 "raid1: only one disk left and IO error.\n"
 
@@ -735,7 +725,6 @@
 	mdp_super_t *sb = mddev->sb;
 
 	mirror->operational = 0;
-	unlink_disk(conf, failed);
 	mark_disk_faulty(sb->disks+mirror->number);
 	mark_disk_nonsync(sb->disks+mirror->number);

PATCH - md device reference counting

2000-12-10 Thread Neil Brown


Linus (et al),

  An md device needs to know if it is in use so that it doesn't allow
  raidstop while still mounted.  Previously it did this by looking for
  a superblock on the device.  This is a bit inelegant and doesn't
  generalise.

  With this patch, it tracks opens and closes (get and release) and
  does not allow raidstop while there is any active access.

  This leaves open the possibility of syncing out the superblocks on
  the last close, which might happen in a later patch.

  One interesting gotcha in this patch is that the START_ARRAY ioctl
  (used by raidstart) can potentially start a completely different
  array, as it decides which array to start based on a value in the
  raid superblock.
 
  To get the reference counts right, I needed to tell the code which
  array I think I am starting.  If it actually starts that one, it sets
  the initial reference count to 1, otherwise it sets it to 0.
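
  The usage-count rule this implements can be summarised in a few lines
  of stand-alone C (plain ints stand in for the kernel's atomic_t, and
  the function names are made up for the sketch):

#include <stdio.h>

/* Stand-in for the md reference count: bump it on every open (get),
 * drop it on every release (put), and refuse to stop the array while
 * anyone other than the caller of the stop ioctl still holds it.
 */
struct md_dev { int active; };

static void md_get(struct md_dev *md)  { md->active++; }
static void md_put(struct md_dev *md)  { md->active--; }

static int md_stop(struct md_dev *md)
{
	/* the ioctl caller itself accounts for one reference */
	if (md->active > 1) {
		printf("md: still in use (%d other users), refusing to stop\n",
		       md->active - 1);
		return -1;	/* -EBUSY in the kernel */
	}
	printf("md: stopping array\n");
	return 0;
}

int main(void)
{
	struct md_dev md = { 0 };

	md_get(&md);		/* a mount or other opener */
	md_get(&md);		/* raidstop's own open of the device */
	md_stop(&md);		/* refused: someone else still has it open */

	md_put(&md);		/* the mount goes away */
	md_stop(&md);		/* now allowed */
	return 0;
}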


 patch against 2.4.0-test12-pre8

NeilBrown



--- ./include/linux/raid/md_k.h	2000/12/10 22:54:16	1.1
+++ ./include/linux/raid/md_k.h	2000/12/10 23:21:26	1.2
@@ -206,6 +206,7 @@
 	struct semaphore		reconfig_sem;
 	struct semaphore		recovery_sem;
 	struct semaphore		resync_sem;
+	atomic_t			active;
 
 	atomic_t			recovery_active;	/* blocks scheduled, but not written */
 	md_wait_queue_head_t		recovery_wait;
--- ./drivers/md/md.c	2000/12/10 22:37:18	1.3
+++ ./drivers/md/md.c	2000/12/10 23:21:26	1.4
@@ -203,6 +203,7 @@
 	init_MUTEX(&mddev->resync_sem);
 	MD_INIT_LIST_HEAD(&mddev->disks);
 	MD_INIT_LIST_HEAD(&mddev->all_mddevs);
+	atomic_set(&mddev->active, 0);
 
/*
 * The 'base' mddev is the one with data NULL.
@@ -1718,12 +1719,20 @@
 
 #define STILL_MOUNTED KERN_WARNING \
 "md: md%d still mounted.\n"
+#define	STILL_IN_USE \
+"md: md%d still in use.\n"
 
 static int do_md_stop (mddev_t * mddev, int ro)
 {
 	int err = 0, resync_interrupted = 0;
 	kdev_t dev = mddev_to_kdev(mddev);
 
+	if (atomic_read(&mddev->active)>1) {
+		printk(STILL_IN_USE, mdidx(mddev));
+		OUT(-EBUSY);
+	}
+
+	/* this shouldn't be needed as above would have fired */
 	if (!ro && get_super(dev)) {
 		printk (STILL_MOUNTED, mdidx(mddev));
 		OUT(-EBUSY);
@@ -1859,8 +1868,10 @@
  * the 'same_array' list. Then order this list based on superblock
  * update time (freshest comes first), kick out 'old' disks and
  * compare superblocks. If everything's fine then run it.
+ *
+ * If "unit" is allocated, then bump its reference count
  */
-static void autorun_devices (void)
+static void autorun_devices (kdev_t countdev)
 {
struct md_list_head candidates;
struct md_list_head *tmp;
@@ -1902,6 +1913,12 @@
continue;
}
mddev = alloc_mddev(md_kdev);
+   if (mddev == NULL) {
+   printk("md: cannot allocate memory for md drive.\n");
+   break;
+   }
+   if (md_kdev == countdev)
+			atomic_inc(&mddev->active);
printk("created md%d\n", mdidx(mddev));
ITERATE_RDEV_GENERIC(candidates,pending,rdev,tmp) {
bind_rdev_to_array(rdev, mddev);
@@ -1945,7 +1962,7 @@
 #define AUTORUNNING KERN_INFO \
 "md: auto-running md%d.\n"
 
-static int autostart_array (kdev_t startdev)
+static int autostart_array (kdev_t startdev, kdev_t countdev)
 {
int err = -EINVAL, i;
mdp_super_t *sb = NULL;
@@ -2002,7 +2019,7 @@
/*
 * possibly return codes
 */
-   autorun_devices();
+   autorun_devices(countdev);
return 0;
 
 abort:
@@ -2077,7 +2094,7 @@
 		md_list_add(&rdev->pending, &pending_raid_disks);
}
 
-   autorun_devices();
+   autorun_devices(-1);
}
 
dev_cnt = -1; /* make sure further calls to md_autodetect_dev are ignored */
@@ -2607,6 +2624,8 @@
err = -ENOMEM;
goto abort;
}
+	atomic_inc(&mddev->active);
+
/*
 * alloc_mddev() should possibly self-lock.
 */
@@ -2640,7 +2659,7 @@
/*
 * possibly make it lock the array ...
 */
-   err = autostart_array((kdev_t)arg);
+   err = autostart_array((kdev_t)arg, dev);
if (err) {
printk("autostart %s failed!\n",
partition_name((kdev_t)arg));
@@ -2820,14 +2839,26 @@
 static int md_open (struct inode *inode, struct file *file)
 {
/*

linus

2000-12-10 Thread Neil Brown


Linus (et al),

 The raid code wants to be the sole accessor of any devices that are
 combined into the array.  i.e. it wants to lock those devices against
 other use.

 It currently tries to do this by creating an inode that appears to
 be associated with that device.
 This no longer has any useful effect (and I don't think it has for a
 while, though I haven't dug into the history).

 I have changed the lock_rdev code to simply do a blkdev_get when
 attaching the device, and a blkdev_put when releasing it.  This
 at least makes sure that if the device is in a module it won't be
 unloaded.
 Any further exclusive access control will need to go into the
 blkdev_{get,put} routines at some stage I think.

 patch against 2.4.0-test12-pre8

NeilBrown


--- ./include/linux/raid/md_k.h	2000/12/10 23:21:26	1.2
+++ ./include/linux/raid/md_k.h	2000/12/10 23:33:07	1.3
@@ -165,8 +165,7 @@
 	mddev_t *mddev;			/* RAID array if running */
 	unsigned long last_events;	/* IO event timestamp */
 
-	struct inode *inode;		/* Lock inode */
-	struct file filp;		/* Lock file */
+	struct block_device *bdev;	/* block device handle */
 
 	mdp_super_t *sb;
 	unsigned long sb_offset;
--- ./drivers/md/md.c	2000/12/10 23:21:26	1.4
+++ ./drivers/md/md.c	2000/12/10 23:33:08	1.5
@@ -657,32 +657,25 @@
 static int lock_rdev (mdk_rdev_t *rdev)
 {
 	int err = 0;
+	struct block_device *bdev;
 
-	/*
-	 * First insert a dummy inode.
-	 */
-	if (rdev->inode)
-		MD_BUG();
-	rdev->inode = get_empty_inode();
-	if (!rdev->inode)
+	bdev = bdget(rdev->dev);
+	if (bdev == NULL)
 		return -ENOMEM;
-	/*
-	 * we dont care about any other fields
-	 */
-	rdev->inode->i_dev = rdev->inode->i_rdev = rdev->dev;
-	insert_inode_hash(rdev->inode);
-
-	memset(&rdev->filp, 0, sizeof(rdev->filp));
-	rdev->filp.f_mode = 3; /* read write */
+	err = blkdev_get(bdev, FMODE_READ|FMODE_WRITE, 0, BDEV_FILE);
+	if (!err) {
+		rdev->bdev = bdev;
+	}
 	return err;
 }
 
 static void unlock_rdev (mdk_rdev_t *rdev)
 {
-	if (!rdev->inode)
+	if (!rdev->bdev)
 		MD_BUG();
-	iput(rdev->inode);
-	rdev->inode = NULL;
+	blkdev_put(rdev->bdev, BDEV_FILE);
+	bdput(rdev->bdev);
+	rdev->bdev = NULL;
 }
 
 static void export_rdev (mdk_rdev_t * rdev)
@@ -1150,7 +1143,7 @@
 
 abort_free:
 	if (rdev->sb) {
-		if (rdev->inode)
+		if (rdev->bdev)
 			unlock_rdev(rdev);
 		free_disk_sb(rdev);
 	}
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



PATCH - raid5 in 2.4.0-test13 - substantial rewrite with substantial performance increase

2000-12-20 Thread Neil Brown


Linus
  here is a rather large patch for raid5 in 2.4.0-test13-pre3.
  It is a substantial re-write of the stripe-cache handling code,
  which is the heart of the raid5 module.

  I have been sitting on this for a while so that others can test it
  (a few have) and so that I would have had quite a lot of experience
  with it myself.
  I am now happy that it is ready for inclusion in 2.4.0-testX.  I
  hope you will be too.

  What it does:

- processing of raid5 requests can require several stages
  (pre-read, calc parity, write).  To accommodate this, request
  handling is based on a simple state machine.
  Prior to this patch, state was explicitly recorded - there were
  different phases "READY", "READOLD", "WRITE" etc.
  With this patch, the state is made implicit in the b_state of
  the buffers in the cache.  This makes the processing code
  (handle_stripe) more flexible, and it is easier to add requests
  to a stripe at any stage of processing.
- With the above change, we no longer need to wait for a stripe to
  "complete" before adding a new request.  We at most need to wait
  for a spinlock to be released.  This allows more parallelism and
  provides throughput speeds many times the current speed.

- Without this patch, two buffers are allocated on each stripe in
  the cache for each underlying device in the array.  This is
  wasteful.  With the patch, only one buffer is needed per stripe,
  per device.

- This patch creates a couple of linked lists of stripes, one for
  stripes that are inactive, and one for stripes that need to be
  processed by raid5d.  This obviates the need to search the hash
  table for the stripes of interest in raid5d or when looking
  for a free stripe.

  There is more work to be done to bring raid5 performance up to the
  level of 2.2+mingos-patches, but this is a first, large, step on the
  way. 

NeilBrown


(2000 line patch included in mail to Linus, but removed from mail to
linux-raid.
If you want it, try
   http://www.cse.unsw.edu.au/~neilb/patch/linux/2.4.0-test13-pre3/patch-A-raid5
)
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: mkraid problems

2001-01-04 Thread Neil Brown

On Thursday January 4, [EMAIL PROTECTED] wrote:
 OK, I followed the instructions from linuxraid.org and
 it seems my kernel is now installed because now it
 doesn't boot.  Error message states that the root fs
 cannot be mounted on 08:03 then it halts.  What did I
 miss now?
 
 Chris

08:03 is /dev/sda3.  Is that what you would expect to be booting off -
the first scsi disc?   Is the scsi controller driver compiled into the
kernel properly?  Do the boot messages show that the scsi controller
was found?

NeilBrown

 
 
 --- Alvin Oga [EMAIL PROTECTED] wrote:
  
  hi ya chris...
  
  did you follow the instructions on www.linuxraid.org
  ???
  
  the generic raid patch to generic 2.2.18 fails to
  patch
  properly.. try to follow the steps at
  linuxraid.org
  
  that was a very good site...
  
  have fun
  alvin
  http://www.linux-1U.net ...  1U Raid5 
  
  On Wed, 3 Jan 2001, Chris Winczewski wrote:
  
   Here is /etc/raidtab and /proc/mdstat
   
   chris
   
   raidtab:
   # Sample raid-5 configuration
   raiddev   /dev/md0
   raid-level5
   nr-raid-disks 3
   chunk-size32
   
   # Parity placement algorithm
   
   #parity-algorithm left-asymmetric
   parity-algorithm  left-symmetric
   #parity-algorithm right-asymmetric
   #parity-algorithm right-symmetric
   
   # Spare disks for hot reconstruction
   nr-spare-disks0
   persistent-superblock 1
   
   device/dev/sdb1
   raid-disk 0
   
   device/dev/sdc1
   raid-disk 1
   
   device/dev/sdd1
   raid-disk 2
   
   
   mdstat:
   Personalities : [4 raid5]
   read_ahead not set
   md0 : inactive
   md1 : inactive
   md2 : inactive
   md3 : inactive
   
   
   --- Neil Brown [EMAIL PROTECTED] wrote:
On Wednesday January 3,
  [EMAIL PROTECTED]
wrote:
 mkraid aborts with no usefull error mssg on
  screen
or
 in the syslog.  My /etc/raidtab is set up
correctly
 and I am using raidtools2 with kernel 2.2.18
  with
raid
 patch installed.  Any ideas?
 
 Chris
 
Please include a copy of 
  /etc/raidtab
  /proc/mdstat

NeilBrown
   
   
   __
   Do You Yahoo!?
   Yahoo! Photos - Share your holiday photos online!
   http://photos.yahoo.com/
   -
   To unsubscribe from this list: send the line
  "unsubscribe linux-raid" in
   the body of a message to [EMAIL PROTECTED]
   
  
 
 
 __
 Do You Yahoo!?
 Yahoo! Photos - Share your holiday photos online!
 http://photos.yahoo.com/
 -
 To unsubscribe from this list: send the line "unsubscribe linux-raid" in
 the body of a message to [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: Status of raid.

2001-02-09 Thread Neil Brown

On Friday February 9, [EMAIL PROTECTED] wrote:
 Greetings,
 
 I'm getting ready to put kernel 2.4.1 on my server at home.  I have some
 questions about the status of RAID in 2.4.1.  Sorry to be dense but I
 couldn't glean the answers to these questions from my search of the
 mailing list.
 
 1. It appears that as of 2.4.1 RAID is finally part of the standard
 kernel.  Is this correct?

Yes, though you will need to wait for 2.4.2 if you want to compile md
as a module.

 2. Which raidtools package do I use and where can I get it?  Or is it,
 too, enclosed with the kernel?

The same ones you would use with patched 2.2, i.e. 0.90.

 3. Does the RAID in 2.4.1 have the read-balancing patch?
 

Yes, that patch is in.

NeilBrown


 --
   / C. R. (Charles) Oldham | NCA Commission on Accreditation and \
  / Director of Technology  |   School Improvement \
 / [EMAIL PROTECTED]  | V:480-965-8703  F:480-965-9423\
 
 
 -
 To unsubscribe from this list: send the line "unsubscribe linux-raid" in
 the body of a message to [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: Newbie questions

2001-02-21 Thread Neil Brown

On Wednesday February 21, [EMAIL PROTECTED] wrote:
 Hello,
 
 This is my first time playing with software raid so sorry if I sound dumb.
 What I have is a remote device that only has one hard drive.  There is no
 ability for a second.  Can I use the raidtools package to setup a raid-1
 mirror on two partitions on the same drive?  For example /dev/hda1 and
 /dev/hda2 consist of the raid 1 set, /dev/hda3 swap, and /dev/hda4 for
 the rest of the OS.  I know raid is normally used with multiple drives,
 but is this possible?

Yes, it is possible, but would it help?
If your drive fails, then you lose the data anyway.
I guess this would protect against bad sectors in one part of the
drive, but my experience is that once you get a bad sector or two your
drive isn't long for this world.

Also, write speed would be appalling as the head would be zipping back
and forth between the two partitions.

However, the best answer is "try it", and see if it does what you
want.


 
 P.S.  Could you please respond to [EMAIL PROTECTED]  I am not on the
 list and could not find any info on how to join.
 
echo help | mail [EMAIL PROTECTED]

should get you started.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: Urgent Problem: moving a raid

2001-02-25 Thread Neil Brown

On Sunday February 25, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
   OK, this time I really want to know how this should be handled.
  
  Well. it "should" be handled by re-writing various bits of raid code
  to make it all work more easily, but without doing that it "could" be
  handled by marking the partitions as holding raid components (0XFE I
  think) and then booting a kernel with AUTODETECT_RAID enabled. This
  approach ignores the device info in the superblock and finds
  everything properly.
 
 I do not use partitions, the whole /dev/hdb and /dev/hdd are the RAID
 drives (mainly because fdisk was unhappy handling the 60GB drives). Is
 there a way to do the above marking in this situation? How?

No, without partitions, that idea falls through.

With 2.4, you could boot with

   md=1,/dev/whatever,/dev/otherone

and it should build the array from the named drives.
There are ioctls available to do the same thing from user space, but
no user-level code has been released to use it yet.
The following patch, when applied to raidtools-0.90, should make
raidstart do the right thing, but it is a while since I wrote this
code and I only did minimal testing.

NeilBrown

--- ./raidlib.c 2000/05/19 03:42:47 1.1
+++ ./raidlib.c 2000/05/19 06:53:04
@@ -149,6 +149,24 @@
return 0;
 }
 
+static int do_newstart (int fd, md_cfg_entry_t *cfg)
+{
+	int i;
+	if (ioctl (fd, SET_ARRAY_INFO, 0UL) == -1)
+		return -1;
+	/* Ok, we have a new enough kernel (2.3.99pre9?) */
+	for (i=0; i<cfg->array.param.nr_disks ; i++) {
+		struct stat s;
+		md_disk_info_t info;
+		stat(cfg->device_name[i], &s);
+		memset(&info, 0, sizeof(info));
+		info.major = major(s.st_rdev);
+		info.minor = minor(s.st_rdev);
+		ioctl(fd, ADD_NEW_DISK, (unsigned long)&info);
+	}
+	return (ioctl(fd, RUN_ARRAY, 0UL) != 0);
+}
+
 int do_raidstart_rw (int fd, char *dev)
 {
 	int func = RESTART_ARRAY_RW;
@@ -380,10 +398,12 @@
 	{
 		struct stat s;
 
-		stat (cfg->device_name[0], &s);
-
 		fd = open_or_die(cfg->md_name);
-		if (do_mdstart (fd, cfg->md_name, s.st_rdev)) rc++;
+		if (do_newstart (fd, cfg)) {
+			stat (cfg->device_name[0], &s);
+
+			if (do_mdstart (fd, cfg->md_name, s.st_rdev)) rc++;
+		}
 		break;
 	}
 

   
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: MD reverting to old Raid type

2001-02-25 Thread Neil Brown

On Sunday February 25, [EMAIL PROTECTED] wrote:
 Linux 2.4.1/RAIDtools2 0.90
 
 I have 4 ide disks which have identical partition layouts.
 RAID is working successfully, its even booting RAID1.
 
 I created a RAID5 set on a set of 4 partitions, which works OK. 
 I then destroyed that set and updated it so it was 2x RAID0
 partitions (so I can mirror them into a RAID10).
 
 The problem is when I raidstop, then raidstart either of the new RAID0
 mds it reverts back to the RAID5 (I originally noticed it when I rebooted).
snip
 
 
 Any idea?

In the raidtab file where you describe the raid0 arrays, make sure to
say:

  persistent-superblock = 1

(or whatever the correct syntax is).  The default is 0 (== no) for
back compat I assume, and so your raid5 superblock doesn't get
overwritten.

NeilBrown

 
 Cheers, Suad
 --
 
 
 
 -
 To unsubscribe from this list: send the line "unsubscribe linux-raid" in
 the body of a message to [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: Going from 'old' (kernel v2.2.x) to 'new' (kernel v2.4.x) raidsystem

2001-02-26 Thread Neil Brown

On  February 26, [EMAIL PROTECTED] wrote:
 I'm currently running a standard v2.2.17 kernel w/ the 'accompanying'
 raid system (linear).
 
 Having the following /etc/mdtab file:
 /dev/md0  linear,4k,0,75f3bcd8/dev/sdb1 /dev/sdc1 /dev/sdd10 /dev/sde1 
/dev/sdf1 /dev/sdg1 
 
 And converted this to a newer /etc/raidtab:
 raiddev /dev/md0
   raid-level  linear
   nr-raid-disks   6
   persistent-superblock   1

"old" style linear arrays don't have a super block, so this should
read:
persistent-superblock   0

Given this, you should be able to run mkraid with complete safety, as
it doesn't actually write anything to any drive.
You might be more comfortable running "raid0run" instead of "mkraid".
It is the same program, but when called as raid0run, it checks that
your configuration matches an old style raid0/linear setup.



   device  /dev/sdb1
   raid-disk   0
   device  /dev/sdc1
   raid-disk   1
   device  /dev/sdd10
   raid-disk   2
   device  /dev/sde1
   raid-disk   3
   device  /dev/sdf1
   raid-disk   4
   device  /dev/sdg1
   raid-disk   5
 
 And this is what cfdisk tells me about the partitions:
 sdb1Primary  Linux raid autodetect 6310.74
 sdc1Primary  Linux raid autodetect 1059.07
 sdd10   Logical  Linux raid autodetect 2549.84
 sde1Primary  Linux raid autodetect 9138.29
 sdf1Primary  Linux raid autodetect18350.60
 sdg1Primary  Linux raid autodetect16697.32
 

Autodetect cannot work with old-style arrays that don't have
superblocks. If you want autodetect, you will need to copy the data
somewhere, create a new array, and copy it all back.

NeilBrown

 
 But when I start the new kernel, it won't start the raidsystem...
 I tried the 'mkraid --upgrade' command, but that says something about
 no superblock stuff... Can't remember exactly what it says, but...
 
 And I'm to afraid to just execute 'mkraid' to create the array. I have
 over 50Gb of data that I can't backup somewhere...
 
 
 What can I do to keep the old data, but convert the array to the new
 raid system? 
 
 -- 
  Turbo __ _ Debian GNU Unix _IS_ user friendly - it's just 
  ^/ /(_)_ __  _   ___  __  selective about who its friends are 
  / / | | '_ \| | | \ \/ /   Debian Certified Linux Developer  
   _ /// / /__| | | | | |_| |Turbo Fredriksson   [EMAIL PROTECTED]
   \\\/  \/_|_| |_|\__,_/_/\_\ Stockholm/Sweden
 -- 
 Iran domestic disruption Treasury Panama assassination cracking
 genetic Albanian jihad president Noriega AK-47 Khaddafi ammonium DES
 [See http://www.aclu.org/echelonwatch/index.html for more about this]
 -
 To unsubscribe from this list: send the line "unsubscribe linux-raid" in
 the body of a message to [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: [lvm-devel] Re: partitions for RAID volumes?

2001-02-27 Thread Neil Brown

On Monday February 26, [EMAIL PROTECTED] wrote:
 
 Actually, the LVM metadata is somewhat poorly layed out in this respect.
 The metadata is at the start of the device, and on occasion is not even
 sector aligned, AFAICS.  Also, the PE structures, while large powers of
 2 in size, are not guaranteed to be other than sector aligned.  They
 aligned with the END of the partition/device, and not the start.  I think
 under Linux, partitions are at least 1k multiples in size so the PEs will
 at least be 1k aligned...

MD/RAID volumes are always a multiple of 64k.  The amount of each
component device that is used is rounded down using MD_NEW_SIZE_BLOCKS, defined as:

#define MD_RESERVED_BYTES   (64 * 1024)
#define MD_RESERVED_SECTORS (MD_RESERVED_BYTES / 512)
#define MD_RESERVED_BLOCKS  (MD_RESERVED_BYTES / BLOCK_SIZE)

#define MD_NEW_SIZE_SECTORS(x)  ((x & ~(MD_RESERVED_SECTORS - 1)) - MD_RESERVED_SECTORS)
#define MD_NEW_SIZE_BLOCKS(x)   ((x & ~(MD_RESERVED_BLOCKS - 1)) - MD_RESERVED_BLOCKS)

And the whole array will be the sum of 1 or more of these sizes.
So if each PE is indeed  sized "from 8KB to 512MB in powers of 2 and
unit KB", then all accesses should be properly aligned, so you
shouldn't have any problems (and if you apply the patch and get no
errors, then you will be doubly sure).
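
To see what that rounding does in practice, here is a tiny user-space
sketch (not kernel code; the 300005-block figure is a made-up example,
not any real device):

#include <stdio.h>

/* same constants as above, for a 1k BLOCK_SIZE */
#define MD_RESERVED_BLOCKS      64
#define MD_NEW_SIZE_BLOCKS(x)   (((x) & ~(MD_RESERVED_BLOCKS - 1)) - MD_RESERVED_BLOCKS)

int main(void)
{
        unsigned long blocks = 300005;  /* hypothetical partition size in 1k blocks */

        /* rounds down to 299968, then loses 64k for the superblock: 299904 */
        printf("%lu -> %lu usable blocks\n", blocks, MD_NEW_SIZE_BLOCKS(blocks));
        return 0;
}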

I thought a bit more about the consequences of unaligned accesses and
I think it is most significant when rebuilding parity.

RAID5 assumes that two different stripes with different sector
addresses will not overlap.
If all accesses are properly aligned, then this will be true.  Also if
all accesses are misaligned by the same amount (e.g. 4K accesses at
(4n+1)K offsets) then everything should work well too.
However, raid5 resync always aligns accesses so if the current
stripe-cache size were 4K, all sync accesses would be at (4n)K
offsets.
If there were (4n+1)K accesses happening at the same time, they would
not synchronise with the resync accesses and you could get corruption.

But it sounds like LVM is safe from this problem.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: RaidHotAdd and reconstruction

2001-03-04 Thread Neil Brown

On Sunday March 4, [EMAIL PROTECTED] wrote:
 Hi folks,
 
 I have a two-disk RAID 1 test array that I was playing with. I then
 decided to hot add a third disk using ``raidhotadd''. The disk was added
 to the array, but as far as I could see, the RAID software did not start
 a reconstruction of that newly added disk. A skimmed through the driver
 code a bit, and could not really locate the point where the
 reconstruction was initiated. Am I missing something?
 
The third disk that you added became a hot spare.
You cannot add an extra active drive to a RAID array without using
mkraid.

In your case, you could edit /etc/raidtab to list the third drive as a
"failed-disk" instead of a "raid-disk", and set "nr-raid-disks" to 3.

Then run mkraid.  It shouldn't destroy any data, but the raid system
should automatically start building data onto the new drive.
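
For example, the relevant part of raidtab might end up looking something
like this (the device names are only placeholders -- use whatever your
three discs really are):

raiddev /dev/md0
  raid-level              1
  nr-raid-disks           3
  persistent-superblock   1
  device                  /dev/sda1
  raid-disk               0
  device                  /dev/sdb1
  raid-disk               1
  device                  /dev/sdc1
  failed-disk             2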

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: Proposed RAID5 design changes.

2001-03-16 Thread Neil Brown


(I've taken Alan and Linus off the Cc list. I'm sure they have plenty
to read, and may even be on linux-raid anyway).

On Thursday March 15, [EMAIL PROTECTED] wrote:
 I'm not too happy with the linux RAID5 implementation. In my
 opinion, a number of changes need to be made, but I'm not sure how to
 make them or get them accepted into the official distribution if I did
 make the changes.

I've been doing a fair bit of development with RAID5 lately and Linus
seems happy to accept patches from me, and I am happy to work with you
(or anyone else) to make improvements and them submit them to Linus.

There was a paper in 
   2000 USENIX Annual Technical Conference
titled
   Towards Availability Benchmarks: A Case Study of Software RAID
   Systems
by
   Aaron Brown and David A. Patterson of UCB.

They built a neat rig for testing fault tolerance and fault handling
in raid systems and compared Linux, Solaris, and WinNT.

Their particular comment about Linux was that it seemed to evict
drives on any excuse, just as you observe.  Apparently the other
systems tried much harder to keep drives in the working set if
possible.
It is certainly worth a read if you are interested in this.

My feeling about retrying after failed IO is that it should be done at
a lower level.  Once the SCSI or IDE level tells us that there is a
READ error, or a WRITE error, we should believe them.
Now it appears that this isn't true: at least not for all drivers.
So while I would not be strongly against putting that sort of re-try
logic at the RAID level, I think it would be worth the effort to find
out why it isn't being done at a lower level.

As for re-writing after a failed read, that certainly makes sense, and
probably wouldn't be too hard.
You would introduce into the "struct stripe_head" a way to mark a
drive as "read-failed".
Then on a read error, you mark that drive as read-failed in that
stripe only and schedule a retry.
If the retry succeeds, you then schedule a write, and if that
works, you just continue on happily.

You would need to make sure that you aren't too generous: once you
have had some number of read errors on a given drive you really should
fail that drive anyway.
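
As a toy illustration of that bookkeeping (everything here -- the
structures, the names, the error threshold -- is invented for the
sketch; none of it is in drivers/md/raid5.c):

#include <stdio.h>

#define NDISKS 4
#define MAX_READ_ERRORS 20      /* arbitrary "too generous" threshold for this sketch */

struct stripe {
        unsigned int read_failed;       /* bit per device: read failed in *this* stripe only */
};

struct disk {
        int read_errors;                /* running count across all stripes */
        int operational;
};

/* called when a read from device 'd' fails while handling stripe 'sh' */
static void note_read_error(struct stripe *sh, struct disk *disks, int d)
{
        sh->read_failed |= 1u << d;     /* mark it in this stripe only; retry, then re-write */
        if (++disks[d].read_errors > MAX_READ_ERRORS) {
                disks[d].operational = 0;       /* after repeated errors, evict the drive */
                printf("disk %d evicted after %d read errors\n", d, disks[d].read_errors);
        }
}

int main(void)
{
        struct stripe sh = { 0 };
        struct disk disks[NDISKS] = { { 0, 1 }, { 0, 1 }, { 0, 1 }, { 0, 1 } };

        note_read_error(&sh, disks, 2);
        printf("stripe read_failed bitmap: 0x%x\n", sh.read_failed);
        return 0;
}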

 3) Drives should not be kicked out of the array unless they are having
really persistent problems. I've an idea on how to define 'really
persistent' but it requires a bit of math to explain, so I'll only
go into it if someone is interested.

I'd certainly be interested in reading your math.

 
 Then there are two changes that might improve recovery performance:
 
 4) If the drive being kicked out is not totally inoperable and there is
a spare drive to replace it, try to copy the data from the failing
drive to the spare rather than reconstructing the data from all the
other disks. Fall back to full reconstruction if the error rate gets
too high.

That would actually be fairly easy to do.  Once you get the data
structures right so that the concept of a "partially failed" drive can
be clearly represented, it should be a cinch.

 
 5) When doing (4) use the SCSI 'copy' command if the drives are on the
same bus, and the host adapter and driver supports 'copy'. However,
this should be done with caution. 'copy' is not generally used and
any number of undetected firmware bugs might make it unreliable.
An additional category may need to be added to the device black list
to flag devices that can not do 'copy' reliably.

I'm very much against this sort of idea.  Currently the raid code is
blissfully unaware of the underlying technology: it could be scsi,
ide, ramdisc, netdisk or anything else and RAID just doesn't care.
This I believe is one of the strengths of software RAID - it is
cleanly layered.
Firmware (==hardware) raid controllers often try to "know" a lot about
the underlying drive - even to the extent of getting the drives  to do
the XOR themselves I believe.  This undoubtedly makes the code more
complex, and can lead to real problems if you have firmware-mismatches
(and we have had a few of those).

Stick with "read" and "write" I think.  Everybody understands what
they mean so it is much more likely to work.
And really, our rebuild performance isn't that bad.  The other interesting result
for Linux in that paper is that rebuild made almost no impact on
performance with Linux, while it did for solaris and NT (but Linux did
rebuild much more slowly).

So if you want to do this, dive right in and have a go.
I am certainly happy to answer any questions, review any code, and
forward anything that looks good to Linus.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: disk fails in raid5 but not in raid0

2001-03-19 Thread Neil Brown

On Monday March 19, [EMAIL PROTECTED] wrote:
 Hi,
 
 I have a RAID setup, 3 Compaq 4Gb drives running of an Adaptec2940UW. 
 Kernel 2.2.18 with RAID-patches etc.
 
 I have been trying out various options, doing some stress-testing etc.,
 and I have now arrived at the following situation that I cannot explain:
 
 when running the 3 drives in a RAID5 config, one of the drives (always the
 same one) will always fail in during heavy IO or during a resync phase. It
 appears to produce one IO error (judging from messages in the log), upon
 which it is promptly removed from the array.
 I can then hotremove the failing drive, then hotadd it - and resync starts, 
 and quite often completes. This scenario is consistently repeatable.

During the initial resync phase, the data blocks are read and the
parity blocks are written -- across all drives.
During a rebuild-after-failure, data and parity are read from the
"good" drives and data-or-parity is written to the spare drive.

This could lead to different patterns of concurrent access.  In
particular, during the resync that you say often completes, the
questionable drive is only being written to.  During the resync that
usually fails, the questionable drive is often being read concurrently
with other drives.

 
 So, it would seem that this one drive has a hardware problem. So I ran badblocks
 with write-check on it, couple of times - came out 100% clean.
 I then built a RAID0 array instead - and started driving lots of IO on it - 
 it's still running - not a problem. Filled up the array, still no probs.
 
 So, except when the drive is in a RAID5 config, it seems ok. 

Well, raid5 would do about 30% more IO when writing.  It certainly
sounds odd, but it could be some combinatorial thing..

 
 Any suggestions ? I would like to confirm whether or whether not the
 drive has a problem. 

Try re-arranging the drives on the scsi chain.  If the questionable
one is currently furthest from the host-adapter, make it closest.  See
if that has any effect.
It could well be cabling, or terminators or something.  Or it could be
the drive.

NeilBrown

 
 thanks,
 Per Jessen
 
 
 
 
 
 regards,
 Per Jessen
 
 
 -
 To unsubscribe from this list: send the line "unsubscribe linux-raid" in
 the body of a message to [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: Problem migrating RAID1 from 2.2.x to 2.4.2

2001-03-19 Thread Neil Brown

On Monday March 19, [EMAIL PROTECTED] wrote:
 I'm having trouble running a RAID1 root/boot mirror under 2.4.2.  Works
 fine on 2.2.14 though.
 
 I'm running RH 6.2 with stock 2.2.14 kernel.  Running RAID1 on a pair of
 9.1 UW SCSI Barracudas as root/boot/lilo.  md0 is / and md1 is 256M swap,
 also a 2 drive mirror. I built the RAID1 at install time using the Redhat GUI.
 
 This configuration works flawlessly.  However, I've recently compiled the
 2.4.2 kernel, with no module support; RAID1 static.  When 2.4.2 boots, I
 get an "Kernel panic: VFS: Unable to mount root fs on 09:00".
 
 Here's the RAID driver output when booting 2.4.2:
 
 autodetecting RAID arrays
 (read) sda5's sb offset: 8618752 [events: 0022]
 (read) sdb5's sb offset: 8618752 [events: 0022]
 autorun ...
 considering sdb5 ...
   adding sdb5 ...
   adding sda5 ...
 created md0
 bind<sda5,1>
 bind<sdb5,2>
 running: <sdb5><sda5>
 now!
 sdb5's event counter: 0022
 sda5's event counter: 0022
 do_md_run() returned -22
 md0 stopped.
 
 Note: This RAID1 mirror works great under 2.2.14.  Under 2.4.2 I get the
 "returned -22" - What does this mean?

 -22 == EINVAL

It looks very much like raid1 *isn't* compiled into your kernel.
Can you show us your .config file?

Also /proc/mdstat when booted under 2.2 might help.
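
For reference, the .config lines to grep for are something like this
(a guess at what a working 2.4 config contains, not your actual file):

  CONFIG_BLK_DEV_MD=y
  CONFIG_MD_RAID1=y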

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: Error Injector?

2001-03-21 Thread Neil Brown

On Wednesday March 21, [EMAIL PROTECTED] wrote:
 
 My question is based upon prior experience working for Stratus Computer.  At
 Stratus it was impractical to go beat the disk drives with a hammer to cause
 them to fail - rather we would simply use a utility to cause the disk driver
 to begin to get "errors" from the drives.  This would then exercise the 
 recovery mechanism - taking a drive off line and bringing another up to
 take its place.  This facility is also present in Veritas Volume Manager test
 suites to exercise the code.

raidsetfaulty 
should do what you want.  It is part of the latest raidtools-0.90.
If you don't have it, get the source from
www.kernel.org/pub/linux/daemons/...
(it might be devel rather than daemons, I'm not sure).

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: Strange performance results in RAID5

2001-03-28 Thread Neil Brown

On Thursday March 29, [EMAIL PROTECTED] wrote:
 Hi,
 
 I have been doing some performance checks on my RAID 5 system.

Good.

 
 The system is
 
 5 Seagate Cheetahs X15
 Linux 2.4.2
 
 I am using IOtest 3.0 on /dev/md0
 My chunk size is 1M...
 
 When I do random reads of 64K blobs using one process, I get 100 
 reads/sec, which is the same as doing random reads on one disk. So I was 
 quite happy with that.
 
 My next test was to do random reads using ten processes, I expected 500 
 reads/sec, however, I only got 250 reads/sec.
 
 This to me doesn't seem right??? Does anyone know why this is the
 case?

A few possibilities:

   1/ you didn't say how fast your SCSI bus is.  I guess if it is
   reasonably new it would be at least 80Mb/sec which should allow 
   500 * 64K/s but it wouldn't have to be too old to not allow that,
   and I don't like to assume things that aren't stated.

   2/ You could be being slowed down by the stripe cache - it only
   allows 256 concurrent 4k accesses.   Try increasing NR_STRIPES at the
   top of drivers/md/raid5.c - say to 2048 (see the one-line sketch
   after this list).  See if that makes a difference.

   3/ Also, try applying

 http://www.cse.unsw.edu.au/~neilb/patches/linux/2.4.3-pre6/patch-F-raid5readbypass

   This patch speeds up large sequential reads, at a possible small cost
   to random read-modify-writes (I haven't measured any problems, but
   haven't had the time to explore the performance thoroughly).
   What it does is read directly into the filesystems buffer instead
   of into the stripe cache and then memcpy into the filesys buffer.

   4/ I'm assuming you are doing direct IO to /dev/md0.
   Try making and mounting a filesystem on /dev/md0 first. This will
   switch the device blocksize to 4K (if you have a 4k block size
   filesystem).  The larger block size improves performance
   substantially.   I always do I/O tests to a filesystem, not to the
   block device, because it makes a difference and it is a filesystem
   that I want to use (though I realise that you may not).

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]



Re: spare-disk in a RAID1 set? Conflicting answers...

2001-04-29 Thread Neil Brown

On Saturday April 28, [EMAIL PROTECTED] wrote:
   Question: can you have one or more spare-disk entries in /etc/raidtab when 
 running a RAID1 set?
 
   First answer: the Linux Software-RAID HOWTO says yes, and gives an example 
 of this in the section on RAID1 config in raidtab.
 
   Second answer: the manpage for raidtab says no, stating that spare-disk is 
 only valid for RAID4 and RAID5.
 
   Hm.. Which is it?

You can certainly have spare discs in raid1 arrays.

NeilBrown

 
   I'm running Mandrake 8.0, which is a 2.4.3 kernel.  I haven't tried to 
 actually use a spare-disk entry yet, because I'm still waiting for the 
 third disk for my RAID1 set to get here, but I thought I'd ask to see if 
 anybody knows for sure.  If not, I'll experiment with it once my third disk 
 gets here and report back.
 
   Thanks!
 
 - Al
 
 
 ---
 | voice: 503.247.9256
   Lots of folks confuse bad management  | email: [EMAIL PROTECTED]
  with destiny.  | cell: 503.709.0028
 | email to my cell:
  - Kin Hubbard  |  [EMAIL PROTECTED]
 ---
 
 
 -
 To unsubscribe from this list: send the line unsubscribe linux-raid in
 the body of a message to [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]



[PATCH] RAID5 NULL Checking Bug Fix

2001-05-15 Thread Neil Brown

On Wednesday May 16, [EMAIL PROTECTED] wrote:
 
 (more patches to come.  They will go to Linus, Alan, and linux-raid only).

This is the next one, which actually addresses the NULL Checking
Bug.

There are two places in the raid code which allocate memory
without (properly) checking for failure and which are fixed in ac,
both in raid5.c.  One in grow_buffers and one in __check_consistency.

The one in grow_buffers is definitely right and included below in a
slightly different form -  fewer characters.
The one in __check_consistency is best fixed by simply removing
__check_consistency.

__check_consistency reads stripes at some offset into the array and
checks that parity is correct.  This is called once, but the result is
ignored.
Presumably this is a hangover from times gone by when the superblock
didn't have proper versioning and there was no auto-rebuild process.
It is now irrelevant and can go.
There is similar code in raid1.c which should also go.
This patch removes said code.

NeilBrown



--- ./drivers/md/raid5.c2001/05/16 05:14:39 1.2
+++ ./drivers/md/raid5.c2001/05/16 05:27:20 1.3
@@ -156,9 +156,9 @@
return 1;
memset(bh, 0, sizeof (struct buffer_head));
init_waitqueue_head(&bh->b_wait);
-   page = alloc_page(priority);
-   bh->b_data = page_address(page);
-   if (!bh->b_data) {
+   if ((page = alloc_page(priority)))
+   bh->b_data = page_address(page);
+   else {
kfree(bh);
return 1;
}
@@ -1256,76 +1256,6 @@
printk(raid5: resync finished.\n);
 }
 
-static int __check_consistency (mddev_t *mddev, int row)
-{
-   raid5_conf_t *conf = mddev-private;
-   kdev_t dev;
-   struct buffer_head *bh[MD_SB_DISKS], *tmp = NULL;
-   int i, ret = 0, nr = 0, count;
-   struct buffer_head *bh_ptr[MAX_XOR_BLOCKS];
-
-   if (conf-working_disks != conf-raid_disks)
-   goto out;
-   tmp = kmalloc(sizeof(*tmp), GFP_KERNEL);
-   tmp-b_size = 4096;
-   tmp-b_page = alloc_page(GFP_KERNEL);
-   tmp-b_data = page_address(tmp-b_page);
-   if (!tmp-b_data)
-   goto out;
-   md_clear_page(tmp-b_data);
-   memset(bh, 0, MD_SB_DISKS * sizeof(struct buffer_head *));
-   for (i = 0; i  conf-raid_disks; i++) {
-   dev = conf-disks[i].dev;
-   set_blocksize(dev, 4096);
-   bh[i] = bread(dev, row / 4, 4096);
-   if (!bh[i])
-   break;
-   nr++;
-   }
-   if (nr == conf-raid_disks) {
-   bh_ptr[0] = tmp;
-   count = 1;
-   for (i = 1; i  nr; i++) {
-   bh_ptr[count++] = bh[i];
-   if (count == MAX_XOR_BLOCKS) {
-   xor_block(count, bh_ptr[0]);
-   count = 1;
-   }
-   }
-   if (count != 1) {
-   xor_block(count, bh_ptr[0]);
-   }
-   if (memcmp(tmp-b_data, bh[0]-b_data, 4096))
-   ret = 1;
-   }
-   for (i = 0; i  conf-raid_disks; i++) {
-   dev = conf-disks[i].dev;
-   if (bh[i]) {
-   bforget(bh[i]);
-   bh[i] = NULL;
-   }
-   fsync_dev(dev);
-   invalidate_buffers(dev);
-   }
-   free_page((unsigned long) tmp-b_data);
-out:
-   if (tmp)
-   kfree(tmp);
-   return ret;
-}
-
-static int check_consistency (mddev_t *mddev)
-{
-   if (__check_consistency(mddev, 0))
-/*
- * We are not checking this currently, as it's legitimate to have
- * an inconsistent array, at creation time.
- */
-   return 0;
-
-   return 0;
-}
-
 static int raid5_run (mddev_t *mddev)
 {
raid5_conf_t *conf;
@@ -1483,12 +1413,6 @@
if (conf-working_disks != sb-raid_disks) {
printk(KERN_ALERT raid5: md%d, not all disks are operational -- 
trying to recover array\n, mdidx(mddev));
start_recovery = 1;
-   }
-
-   if (!start_recovery  (sb-state  (1  MD_SB_CLEAN)) 
-   check_consistency(mddev)) {
-   printk(KERN_ERR raid5: detected raid-5 superblock xor inconsistency 
-- running resync\n);
-   sb-state = ~(1  MD_SB_CLEAN);
}
 
{
--- ./drivers/md/raid1.c2001/05/16 05:14:39 1.2
+++ ./drivers/md/raid1.c2001/05/16 05:27:20 1.3
@@ -1448,69 +1448,6 @@
}
 }
 
-/*
- * This will catch the scenario in which one of the mirrors was
- * mounted as a normal device rather than as a part of a raid set.
- *
- * check_consistency is very personality-dependent, eg. RAID5 cannot
- * do this check, it uses another method.
- */
-static int __check_consistency 

[PATCH] - md_error gets simpler

2001-05-15 Thread Neil Brown


Linus,
 This isn't a bug fix, just a tidy up.

 Currently, md_error - which is called when an underlying device detects
 an error - takes a kdev_t to identify which md array is affected.
 It converts this into a mddev_t structure pointer, and in every case,
 the caller already has the desired structure pointer.

 This patch changes md_error and the callers to pass an mddev_t
 instead of a kdev_t

NeilBrown

--- ./include/linux/raid/md.h   2001/05/16 06:08:41 1.1
+++ ./include/linux/raid/md.h   2001/05/16 06:10:02 1.2
@@ -80,7 +80,7 @@
 extern struct gendisk * find_gendisk (kdev_t dev);
 extern int md_notify_reboot(struct notifier_block *this,
unsigned long code, void *x);
-extern int md_error (kdev_t mddev, kdev_t rdev);
+extern int md_error (mddev_t *mddev, kdev_t rdev);
 extern int md_run_setup(void);
 
 extern void md_print_devices (void);
--- ./drivers/md/raid5.c2001/05/16 05:27:20 1.3
+++ ./drivers/md/raid5.c2001/05/16 06:10:02 1.4
@@ -412,7 +412,7 @@
spin_lock_irqsave(conf-device_lock, flags);
}
} else {
-   md_error(mddev_to_kdev(conf-mddev), bh-b_dev);
+   md_error(conf-mddev, bh-b_dev);
clear_bit(BH_Uptodate, bh-b_state);
}
clear_bit(BH_Lock, bh-b_state);
@@ -440,7 +440,7 @@
 
md_spin_lock_irqsave(conf-device_lock, flags);
if (!uptodate)
-   md_error(mddev_to_kdev(conf-mddev), bh-b_dev);
+   md_error(conf-mddev, bh-b_dev);
clear_bit(BH_Lock, bh-b_state);
set_bit(STRIPE_HANDLE, sh-state);
__release_stripe(conf, sh);
--- ./drivers/md/md.c   2001/05/16 06:08:41 1.1
+++ ./drivers/md/md.c   2001/05/16 06:10:03 1.2
@@ -2464,7 +2464,7 @@
int ret;
 
fsync_dev(mddev_to_kdev(mddev));
-   ret = md_error(mddev_to_kdev(mddev), dev);
+   ret = md_error(mddev, dev);
return ret;
 }
 
@@ -2938,13 +2938,11 @@
 }
 
 
-int md_error (kdev_t dev, kdev_t rdev)
+int md_error (mddev_t *mddev, kdev_t rdev)
 {
-   mddev_t *mddev;
mdk_rdev_t * rrdev;
int rc;
 
-   mddev = kdev_to_mddev(dev);
 /* printk(md_error dev:(%d:%d), rdev:(%d:%d), (caller: 
%p,%p,%p,%p).\n,MAJOR(dev),MINOR(dev),MAJOR(rdev),MINOR(rdev), 
__builtin_return_address(0),__builtin_return_address(1),__builtin_return_address(2),__builtin_return_address(3));
  */
if (!mddev) {
--- ./drivers/md/raid1.c2001/05/16 05:27:20 1.3
+++ ./drivers/md/raid1.c2001/05/16 06:10:03 1.4
@@ -388,7 +388,7 @@
 * this branch is our 'one mirror IO has finished' event handler:
 */
if (!uptodate)
-   md_error (mddev_to_kdev(r1_bh-mddev), bh-b_dev);
+   md_error (r1_bh-mddev, bh-b_dev);
else
/*
 * Set R1BH_Uptodate in our master buffer_head, so that
@@ -1426,7 +1426,7 @@
 * We don't do much here, just schedule handling by raid1d
 */
if (!uptodate)
-   md_error (mddev_to_kdev(r1_bh-mddev), bh-b_dev);
+   md_error (r1_bh-mddev, bh-b_dev);
else
set_bit(R1BH_Uptodate, r1_bh-state);
raid1_reschedule_retry(r1_bh);
@@ -1437,7 +1437,7 @@
struct raid1_bh * r1_bh = (struct raid1_bh *)(bh-b_private);

if (!uptodate)
-   md_error (mddev_to_kdev(r1_bh-mddev), bh-b_dev);
+   md_error (r1_bh-mddev, bh-b_dev);
if (atomic_dec_and_test(r1_bh-remaining)) {
mddev_t *mddev = r1_bh-mddev;
unsigned long sect = bh-b_blocknr * (bh-b_size9);
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]



[PATCH] raid resync by sectors to allow for 512byte block filesystems

2001-05-17 Thread Neil Brown


Linus,
 The current raid1/raid5 resync code requests resync in units of 1k
 (though the raid personality can round up requests if it likes).
 This interacts badly with filesystems that do IO in 512 byte blocks,
 such as XFS (because raid5 needs to use the same blocksize for IO and
 resync).

 The attached patch changes the resync code to work in units of
 sectors which makes more sense and plays nicely with XFS.

NeilBrown



--- ./drivers/md/md.c   2001/05/17 05:50:51 1.1
+++ ./drivers/md/md.c   2001/05/17 06:11:50 1.2
@@ -2997,7 +2997,7 @@
int sz = 0;
unsigned long max_blocks, resync, res, dt, db, rt;
 
-   resync = mddev-curr_resync - atomic_read(mddev-recovery_active);
+   resync = (mddev-curr_resync - atomic_read(mddev-recovery_active))/2;
max_blocks = mddev-sb-size;
 
/*
@@ -3042,7 +3042,7 @@
 */
dt = ((jiffies - mddev-resync_mark) / HZ);
if (!dt) dt++;
-   db = resync - mddev-resync_mark_cnt;
+   db = resync - (mddev-resync_mark_cnt/2);
rt = (dt * ((max_blocks-resync) / (db/100+1)))/100;

sz += sprintf(page + sz,  finish=%lu.%lumin, rt / 60, (rt % 60)/6);
@@ -3217,7 +3217,7 @@
 
 void md_done_sync(mddev_t *mddev, int blocks, int ok)
 {
-   /* another blocks (1K) blocks have been synced */
+   /* another blocks (512byte) blocks have been synced */
atomic_sub(blocks, mddev-recovery_active);
wake_up(mddev-recovery_wait);
if (!ok) {
@@ -3230,7 +3230,7 @@
 int md_do_sync(mddev_t *mddev, mdp_disk_t *spare)
 {
mddev_t *mddev2;
-   unsigned int max_blocks, currspeed,
+   unsigned int max_sectors, currspeed,
j, window, err, serialize;
kdev_t read_disk = mddev_to_kdev(mddev);
unsigned long mark[SYNC_MARKS];
@@ -3267,7 +3267,7 @@
 
mddev-curr_resync = 1;
 
-   max_blocks = mddev-sb-size;
+   max_sectors = mddev-sb-size1;
 
printk(KERN_INFO md: syncing RAID array md%d\n, mdidx(mddev));
printk(KERN_INFO md: minimum _guaranteed_ reconstruction speed: %d 
KB/sec/disc.\n,
@@ -3291,23 +3291,23 @@
/*
 * Tune reconstruction:
 */
-   window = MAX_READAHEAD*(PAGE_SIZE/1024);
-   printk(KERN_INFO md: using %dk window, over a total of %d 
blocks.\n,window,max_blocks);
+   window = MAX_READAHEAD*(PAGE_SIZE/512);
+   printk(KERN_INFO md: using %dk window, over a total of %d 
+blocks.\n,window/2,max_sectors/2);
 
atomic_set(mddev-recovery_active, 0);
init_waitqueue_head(mddev-recovery_wait);
last_check = 0;
-   for (j = 0; j  max_blocks;) {
-   int blocks;
+   for (j = 0; j  max_sectors;) {
+   int sectors;
 
-   blocks = mddev-pers-sync_request(mddev, j);
+   sectors = mddev-pers-sync_request(mddev, j);
 
-   if (blocks  0) {
-   err = blocks;
+   if (sectors  0) {
+   err = sectors;
goto out;
}
-   atomic_add(blocks, mddev-recovery_active);
-   j += blocks;
+   atomic_add(sectors, mddev-recovery_active);
+   j += sectors;
mddev-curr_resync = j;
 
if (last_check + window  j)
@@ -3325,7 +3325,7 @@
mark_cnt[next] = j - atomic_read(mddev-recovery_active);
last_mark = next;
}
-   
+
 
if (md_signal_pending(current)) {
/*
@@ -3350,7 +3350,7 @@
if (md_need_resched(current))
schedule();
 
-   currspeed = 
(j-mddev-resync_mark_cnt)/((jiffies-mddev-resync_mark)/HZ +1) +1;
+   currspeed = 
+(j-mddev-resync_mark_cnt)/2/((jiffies-mddev-resync_mark)/HZ +1) +1;
 
if (currspeed  sysctl_speed_limit_min) {
current-nice = 19;
--- ./drivers/md/raid5.c2001/05/17 05:50:51 1.1
+++ ./drivers/md/raid5.c2001/05/17 06:11:51 1.2
@@ -886,7 +886,7 @@
}
}
if (syncing) {
-   md_done_sync(conf-mddev, (sh-size10) - sh-sync_redone,0);
+   md_done_sync(conf-mddev, (sh-size9) - sh-sync_redone,0);
clear_bit(STRIPE_SYNCING, sh-state);
syncing = 0;
}   
@@ -1059,7 +1059,7 @@
}
}
if (syncing  locked == 0  test_bit(STRIPE_INSYNC, sh-state)) {
-   md_done_sync(conf-mddev, (sh-size10) - sh-sync_redone,1);
+   md_done_sync(conf-mddev, (sh-size9) - sh-sync_redone,1);
clear_bit(STRIPE_SYNCING, sh-state);
}

@@ -1153,13 +1153,13 @@
return correct_size;
 }
 
-static int raid5_sync_request (mddev_t *mddev, unsigned long block_nr)
+static int 

[PATCH] raid1 to use sector numbers in b_blocknr

2001-05-23 Thread Neil Brown


Linus,
  raid1 allocates a new buffer_head when passing a request done
  to an underlying device.
  It currently sets b_blocknr to b_rsector/(b_size>>9) from the
  original buffer_head to parallel other uses of b_blocknr (i.e. it
  being the number of the block).

  However, if raid1 gets a non-aligned request, then the calculation of
  b_blocknr would lose information resulting in potential data
  corruption if the request were resubmitted to a different drive on
  failure. 

  Non-aligned requests aren't currently possible (I believe) but newer
  filesystems are likely to want them soon, and if a raid1 array were
  to be partitioned into partitions that were not page aligned, it
  could happen.

  This patch changes the usage of b_blocknr in raid1.c to store the
  value of b_rsector of the incoming request.

  Also, I remove the third argument to raid1_map which is never used.

NeilBrown

 
--- ./drivers/md/raid1.c2001/05/23 01:18:15 1.1
+++ ./drivers/md/raid1.c2001/05/23 01:18:19 1.2
@@ -298,7 +298,7 @@
md_spin_unlock_irq(conf-device_lock);
 }
 
-static int raid1_map (mddev_t *mddev, kdev_t *rdev, unsigned long size)
+static int raid1_map (mddev_t *mddev, kdev_t *rdev)
 {
raid1_conf_t *conf = mddev_to_conf(mddev);
int i, disks = MD_SB_DISKS;
@@ -602,7 +602,7 @@
 
bh_req = r1_bh-bh_req;
memcpy(bh_req, bh, sizeof(*bh));
-   bh_req-b_blocknr = bh-b_rsector / sectors;
+   bh_req-b_blocknr = bh-b_rsector;
bh_req-b_dev = mirror-dev;
bh_req-b_rdev = mirror-dev;
/*  bh_req-b_rsector = bh-n_rsector; */
@@ -646,7 +646,7 @@
/*
 * prepare mirrored mbh (fields ordered for max mem throughput):
 */
-   mbh-b_blocknr= bh-b_rsector / sectors;
+   mbh-b_blocknr= bh-b_rsector;
mbh-b_dev= conf-mirrors[i].dev;
mbh-b_rdev   = conf-mirrors[i].dev;
mbh-b_rsector= bh-b_rsector;
@@ -1138,7 +1138,6 @@
int disks = MD_SB_DISKS;
struct buffer_head *bhl, *mbh;
raid1_conf_t *conf;
-   int sectors = bh-b_size  9;

conf = mddev_to_conf(mddev);
bhl = raid1_alloc_bh(conf, conf-raid_disks); /* don't 
really need this many */
@@ -1168,7 +1167,7 @@
mbh-b_blocknr= bh-b_blocknr;
mbh-b_dev= conf-mirrors[i].dev;
mbh-b_rdev   = conf-mirrors[i].dev;
-   mbh-b_rsector= bh-b_blocknr * sectors;
+   mbh-b_rsector= bh-b_blocknr;
mbh-b_state  = (1BH_Req) | 
(1BH_Dirty) |
(1BH_Mapped) | (1BH_Lock);
atomic_set(mbh-b_count, 1);
@@ -1195,7 +1194,7 @@
}
} else {
dev = bh-b_dev;
-   raid1_map (mddev, bh-b_dev, bh-b_size  9);
+   raid1_map (mddev, bh-b_dev);
if (bh-b_dev == dev) {
printk (IO_ERROR, partition_name(bh-b_dev), 
bh-b_blocknr);
md_done_sync(mddev, bh-b_size9, 0);
@@ -1203,6 +1202,7 @@
printk (REDIRECT_SECTOR,
partition_name(bh-b_dev), 
bh-b_blocknr);
bh-b_rdev = bh-b_dev;
+   bh-b_rsector = bh-b_blocknr; 
generic_make_request(READ, bh);
}
}
@@ -1211,8 +1211,7 @@
case READ:
case READA:
dev = bh-b_dev;
-   
-   raid1_map (mddev, bh-b_dev, bh-b_size  9);
+   raid1_map (mddev, bh-b_dev);
if (bh-b_dev == dev) {
printk (IO_ERROR, partition_name(bh-b_dev), 
bh-b_blocknr);
raid1_end_bh_io(r1_bh, 0);
@@ -1220,6 +1219,7 @@
printk (REDIRECT_SECTOR,
partition_name(bh-b_dev), bh-b_blocknr);
bh-b_rdev = bh-b_dev;
+   bh-b_rsector = bh-b_blocknr;
generic_make_request (r1_bh-cmd, bh);
}
break;
@@ -1313,6 +1313,7 @@
struct 

Re: md= problems with devfs names

2001-06-02 Thread Neil Brown

On Saturday June 2, [EMAIL PROTECTED] wrote:
 
 I've moved from:
   md=4,/dev/sdf5,/dev/sdg5
 to:
   md=4,/dev/scsi/host0/bus0/target30/lun0/part5,\
   /dev/scsi/host0/bus0/target32/lun0/part5
 
 And now get:
   md: Unknown device name,\
   /dev/scsi/host0/bus0/target30/lun0/part5,\
   /dev/scsi/host0/bus0/target32/lun0/part5.
 
: (
 
 md_setup() is displaying the error due to failing on name_to_kdev_t().
 root_dev_setup() calls name_to_kdev_t() with a long devfs name without a
 problem, so that's not the issue directly.

Yes... this is all very ugly.
root_dev_setup also stores the device name in root_device_name.
And then when actually mounting root in fs/super.c::mount_root,
devfs_find_handle is called to map that name into a devfs object.

So maybe md_setup should store names as well, and md_setup_drive
should call devfs_find_handle like mount_root does.

But probably sticking with non-devfs names is easier.
Was there a particular need to change to devfs naming?

NeilBrown

 
 I think md_setup() is being run before the devfs names are fully registered,
 but i have no clue how the execution order
 of __setup() items is determined.
 
 Help?
 
 Dave
 
 
 md_setup() is run VERY early, much earlier then raid_setup().
 dmesg excerpt:
 
 mapped APIC to e000 (fee0)
 mapped IOAPIC to d000 (fec0)
 Kernel command line: devfs=mount raid=noautodetect 
 root=/dev/scsi/host0/bus0/target2/lun0/part7
 
md=4,/dev/scsi/host0/bus0/target30/lun0/part5,/dev/scsi/host0/bus0/target32/lun0/part5
 mem=393216K
 md: Unknown device name,
 /dev/scsi/host0/bus0/target30/lun0/part5,/dev/scsi/host0/bus0/target32/lun0/part5.
 Initializing CPU#0
 Detected 875.429 MHz processor.
 Console: colour VGA+ 80x25
 -
 To unsubscribe from this list: send the line unsubscribe linux-raid in
 the body of a message to [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]



Re: mdrecoveryd invalid operand error

2001-06-06 Thread Neil Brown

On Wednesday June 6, [EMAIL PROTECTED] wrote:
 In the XFS kernel tree v2.4.3 w/ several patches,
 we were unable to raidhotremove and subsequently
 raidhotadd a spare without a reboot.  It did not
 matter if you had a new or the same hard disk.  We then
 tried the patch Ingo Molnar sent regarding the issue.
 
 http://www.mail-archive.com/linux-raid@vger.kernel.org/msg00551.html
 
 This solved the problem of not doing a reboot and trying
 to switch a hotspare and faulty drive.
 
 In addition, however, we are seeing a kernel panic using
 raid 5 when mdrecoveryd starts when after doing the hotspare
 and faulty drive swap a second time without a reboot.
...
 
 Do you have any suggestions Neil?

Yep.  Upgrade to the latest kernel! (Don't you just hate it when that
turns out to be the answer).

Ingo's patch is half right, but not quite there.

 http://www.cse.unsw.edu.au/~neilb/patches/linux/2.4.5-pre2/patch-A-rdevfix

contains a correct version of the patch.  It is in 2.4.5.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]



Re: failure of raid 5 when first disk is unavailable

2001-06-07 Thread Neil Brown

On Thursday June 7, [EMAIL PROTECTED] wrote:
 Hi Neil;
 
 I am hoping you are going to tell me this is already solved,
 but here goes...

Almost :-)

 
 scenario:
   hda4, hdb4, and hdc4 in a raid 5 with no hotspare.
 
 
 With 2.4.3 XFS kernels, it seems that a raid 5 does not come
 up correctly if the first disk is unavailable.  The error message
 that arises in the syslog from the md driver is:
 
 md: could not lock hda4, zero size?  marking faulty
 md: could not import hda4.
 md: autostart hda4 failed!
 

Yep. This happens when you use raidstart.
It doesn't happen if you set the partition type to LINUX_RAID and use
the autodetect functionality.

raidstart just takes one drive, gives it to the kernel, and says "look
in the superblock for the major/minor of the other devices".

This has several failure modes.

It is partly for this reason that I am writing a replacement md
management tool - mdctl.

I wasn't going to announce it just yet because it is very incomplete,
but you have pushed me into it :-)

 http://www.cse.unsw.edu.au/~neilb/source/mdctl/mdctl-v0.2.tgz

is a current snapshot.  It compiles (for me) and 
   mdctl --help

works. mdctl --Assemble is nearly there.

Comments welcome.

In 2.2, there is no other way to start an array than give one device
to the kernel and tell it to look for others.  So mdctl will
find the device numbers of the devices in the array and re-write the
super block if necessary to make the array start.

In 2.4, mdctl can use a 
  SET_ARRAY_INFO / ADD_NEW_DISK* / RUN_ARRAY
sequence to start a new array.
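
In user-space that sequence looks roughly like the sketch below (same
idea as the do_newstart() patch for raidstart earlier in this archive;
the header paths and the minimal error handling are illustrative, and
it assumes the 2.4 kernel headers are installed for the md ioctl
definitions):

#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/ioctl.h>
#include <linux/raid/md_u.h>    /* SET_ARRAY_INFO, ADD_NEW_DISK, RUN_ARRAY */

/* assemble an md array from an explicit list of component devices */
int assemble_sketch(const char *md_dev, char **parts, int nparts)
{
        int fd = open(md_dev, O_RDWR);
        int i;

        if (fd < 0 || ioctl(fd, SET_ARRAY_INFO, 0UL) < 0)
                return -1;      /* kernel too old, or array already running */

        for (i = 0; i < nparts; i++) {
                struct stat st;
                mdu_disk_info_t info;

                if (stat(parts[i], &st) < 0)
                        return -1;
                memset(&info, 0, sizeof(info));
                info.major = major(st.st_rdev); /* kernel finds the device by number */
                info.minor = minor(st.st_rdev);
                if (ioctl(fd, ADD_NEW_DISK, &info) < 0)
                        return -1;
        }
        return ioctl(fd, RUN_ARRAY, 0UL);       /* read the superblocks and go */
}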

If you don't like the name mdctl (I don't), please suggest another.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]



Re: mdctl

2001-06-11 Thread Neil Brown

On Friday June 8, [EMAIL PROTECTED] wrote:
 On Fri, 8 Jun 2001, Neil Brown wrote:
 
  If you don't like the name mdctl (I don't), please suggest another.
 
 How about raidctrl?

Possible... though I don't think it is much better.  Longer to type too:-)

I kind of like having the md in there as it is the md driver.
raid is a generic term, and mdctl doesn't work with all raid
(i.e. not firmware raid), only software raid, and in particular, only
the md driver.

But thanks for the suggestion, I will keep it in mind and see if it
grows on me.

NeilBrown


 
 -- 
 MfG / Regards
 Friedrich Lobenstock
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]



Re: mdctl - names and code

2001-06-13 Thread Neil Brown


Thankyou for all the suggestions for names for mdctl.
We have

 raidctl raidctrl
 swraidctl
 mdtools mdutils 
 mdmanage mdmgr mdmd:-) mdcfg mdconfig mdadmin

Mike Black suggested that it is valuable for tools that are related to
start with a common prefix so that command completion can be used
to find them.
I think that is very true but, in this case, irrelevant.
mdctl (to whatever it gets called) will be one tool that does
everything.

I might arrange - for back compatability - that if you call it with a
name like 
   raidhotadd
it will default to the hot-add functionality, but I don't expect that
to be normal usage.

I have previously said that I am not very fond of raidctl as raid is
a bit too generic.  swraidctl is better but harder to pronounce.

I actually rather like md.  It has a pleasing similarity to mt.
Also
   man 1 md   -- would document the user interface
   man 5 md   -- would document the device driver.
This is elegant.  But maybe a bit too terse.

I'm currently leaning towards mdadm or mdadmin as it is easy to
pronounce (for my palate anyway) and has the right sort of meaning.

I think I will keep the name as mdctl until I achieve all the
functionality I want, and then when I release v1.0 I will change the
name to what seems best at the time.

Thanks again for the suggestions and interest.

For the eager, 
  http://www.cse.unsw.edu.au/~neilb/source/mdctl/mdctl-v0.3.tgz
contains my latest source which has most of the functionality in
place, though it doesn't all work quite right yet.
You can create a raid5 array with:

  mdctl --create /dev/md0 --level raid5 --chunk 32 --raid-disks 3 \
/dev/sda /dev/sdb /dev/sdc

and stop it with
  mdctl --stop /dev/md0

and then assemble it with

  mdctl --assemble /dev/md0 /dev/sda /dev/sdb /dev/sdc

I wouldn't actually trust a raid array that you build this way
though.  Some fields in the super block aren't right yet.

I am very interested in comments on the interface and usability.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]



Re: du discrepancies?

2001-06-14 Thread Neil Brown

On Friday June 15, [EMAIL PROTECTED] wrote:
 There appears to be a discrepancy between the true state of affairs on my 
 RAID partitions and what df reports;
 
 [root /]# sfdisk -l /dev/hda
  
 Disk /dev/hda: 38792 cylinders, 16 heads, 63 sectors/track
 Units = cylinders of 516096 bytes, blocks of 1024 bytes, counting from 0
  
Device Boot Start End   #cyls   #blocks   Id  System
 /dev/hda1  0+   15231524-   768095+  fd  Linux raid autodetect
 /dev/hda2   15241845 3221622885  Extended
 /dev/hda3   18462252 407205128   fd  Linux raid autodetect
 /dev/hda4   2253   38791   36539  18415656   fd  Linux raid autodetect
 /dev/hda5   1524+   1584  61-30743+  83  Linux
 /dev/hda6   1585+   1845 261-   131543+  82  Linux swap
 
 [root /]# df
 Filesystem   1k-blocks  Used Available Use% Mounted on
 /dev/md1755920666748 50772  93% /
 WRONG
 /dev/md3198313 13405174656   7% /var 
 WRONG
 /dev/md4  18126088118288  17087024   1% /home WRONG
 
 These figures are clearly wrong. Can anyone suggest where I should start 
 looking for an explanation?

How can figures be wrong?  They are just figures.
What do you think is wrong about them??

Anyway, for a more useful response...
I assume that md[134] are RAID1 arrays, with one mirror on hda.
Let's take md1, made in part from hda1.

hda1 has 768095 1K blocks.
md/raid rounds down to a multiple of 64K, and then removes the last
64k for the raid super block, leaving
  768000 1K blocks.
ext2fs uses some of this for metadata, and reports the rest as the
available space.
The overhead space comprises the superblocks, the block group
descriptors, the inode bitmaps, the block bitmaps, and the inode
tables.
This seems to add up to 12080K on this filesystem (768000 - 755920),
which is about 1.6%.


NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]



PATCH - md initialisation to accept devfs names

2001-06-20 Thread Neil Brown


Linus, 
  it is possible to start an md array from the boot command line with,
  e.g.
 md=0,/dev/something,/dev/somethingelse

  However only names recognised by name_to_kdev_t work here.  devfs
  based names do not work.
  To fix this, the following patch moves the name lookup from __setup
  time to __init time so that the devfs routines can be called.

  This patch is largely due to Dave Cinege, though I have made a few
  improvements (particularly removing the devices array from
  md_setup_args). 

  The #ifdef MODULE that this patch removes is wholly within another
  #ifdef MODULE and so is totally pointless.

NeilBrown



--- ./drivers/md/md.c   2001/06/21 00:51:42 1.2
+++ ./drivers/md/md.c   2001/06/21 00:53:09 1.3
@@ -3638,7 +3638,7 @@
char device_set [MAX_MD_DEVS];
int pers[MAX_MD_DEVS];
int chunk[MAX_MD_DEVS];
-   kdev_t devices[MAX_MD_DEVS][MD_SB_DISKS];
+   char *device_names[MAX_MD_DEVS];
 } md_setup_args md__initdata;
 
 /*
@@ -3657,14 +3657,15 @@
  * md=n,device-list  reads a RAID superblock from the devices
  * elements in device-list are read by name_to_kdev_t so can be
  * a hex number or something like /dev/hda1 /dev/sdb
+ * 2001-06-03: Dave Cinege [EMAIL PROTECTED]
+ * Shifted name_to_kdev_t() and related operations to md_set_drive()
+ * for later execution. Rewrote section to make devfs compatible.
  */
-#ifndef MODULE
-extern kdev_t name_to_kdev_t(char *line) md__init;
 static int md__init md_setup(char *str)
 {
-   int minor, level, factor, fault, i=0;
-   kdev_t device;
-   char *devnames, *pername = ;
+   int minor, level, factor, fault;
+   char *pername = ;
+   char *str1 = str;
 
if (get_option(str, minor) != 2) {/* MD Number */
printk(md: Too few arguments supplied to md=.\n);
@@ -3673,9 +3674,8 @@
if (minor = MAX_MD_DEVS) {
printk (md: Minor device number too high.\n);
return 0;
-   } else if (md_setup_args.device_set[minor]) {
-   printk (md: Warning - md=%d,... has been specified twice;\n
-   will discard the first definition.\n, minor);
+   } else if (md_setup_args.device_names[minor]) {
+   printk (md: md=%d, Specified more then once. Replacing previous 
+definition.\n, minor);
}
switch (get_option(str, level)) { /* RAID Personality */
case 2: /* could be 0 or -1.. */
@@ -3706,53 +3706,72 @@
}
/* FALL THROUGH */
case 1: /* the first device is numeric */
-   md_setup_args.devices[minor][i++] = level;
+   str = str1;
/* FALL THROUGH */
case 0:
md_setup_args.pers[minor] = 0;
pername=super-block;
}
-   devnames = str;
-   for (; iMD_SB_DISKS  str; i++) {
-   if ((device = name_to_kdev_t(str))) {
-   md_setup_args.devices[minor][i] = device;
-   } else {
-   printk (md: Unknown device name, %s.\n, str);
-   return 0;
-   }
-   if ((str = strchr(str, ',')) != NULL)
-   str++;
-   }
-   if (!i) {
-   printk (md: No devices specified for md%d?\n, minor);
-   return 0;
-   }
-
+   
printk (md: Will configure md%d (%s) from %s, below.\n,
-   minor, pername, devnames);
-   md_setup_args.devices[minor][i] = (kdev_t) 0;
-   md_setup_args.device_set[minor] = 1;
+   minor, pername, str);
+   md_setup_args.device_names[minor] = str;
+   
return 1;
 }
-#endif /* !MODULE */
 
+extern kdev_t name_to_kdev_t(char *line) md__init;
 void md__init md_setup_drive(void)
 {
int minor, i;
kdev_t dev;
mddev_t*mddev;
+   kdev_t devices[MD_SB_DISKS+1];
 
for (minor = 0; minor  MAX_MD_DEVS; minor++) {
+   int err = 0;
+   char *devname;
mdu_disk_info_t dinfo;
 
-   int err = 0;
-   if (!md_setup_args.device_set[minor])
+   if ((devname = md_setup_args.device_names[minor]) == 0) continue;
+   
+   for (i = 0; i  MD_SB_DISKS  devname != 0; i++) {
+
+   char *p;
+   void *handle;
+   
+   if ((p = strchr(devname, ',')) != NULL)
+   *p++ = 0;
+
+   dev = name_to_kdev_t(devname);
+   handle = devfs_find_handle(NULL, devname, MAJOR (dev), MINOR 
+(dev),
+   DEVFS_SPECIAL_BLK, 1);
+   if (handle != 0) {
+   unsigned major, minor;
+   devfs_get_maj_min(handle, major, minor);
+   

PATCH - tag all printk's in md.c

2001-06-20 Thread Neil Brown


Linus, 
 This patch makes sure that all the printks in md.c print a message
 starting with "md:" or "md%d:".
 The next step (not today) will be to reduce a lot of them to
 KERN_INFO or similar as md is really quite noisy.

 Also, two printks in raid1.c get prefixed with raid1:  
 This patch is partly due to  Dave Cinege.

 While preparing this I noticed that write_disk_sb sometimes returns
 1 for error, sometimes -1, and the return val is added into a
 cumulative error variable.  So now it always returns 1.

 Also md_update_sb reports on each superblock (one per disk) on
 separate lines, but worries about inserting commas and ending with a
 full stop.   I have removed the commas and full stop - vestiges of
 shorter device names I suspect.

NeilBrown


--- ./drivers/md/md.c   2001/06/21 00:53:09 1.3
+++ ./drivers/md/md.c   2001/06/21 00:53:39 1.4
@@ -634,7 +634,7 @@
md_list_add(rdev-same_set, mddev-disks);
rdev-mddev = mddev;
mddev-nb_dev++;
-   printk(bind%s,%d\n, partition_name(rdev-dev), mddev-nb_dev);
+   printk(md: bind%s,%d\n, partition_name(rdev-dev), mddev-nb_dev);
 }
 
 static void unbind_rdev_from_array (mdk_rdev_t * rdev)
@@ -646,7 +646,7 @@
md_list_del(rdev-same_set);
MD_INIT_LIST_HEAD(rdev-same_set);
rdev-mddev-nb_dev--;
-   printk(unbind%s,%d\n, partition_name(rdev-dev),
+   printk(md: unbind%s,%d\n, partition_name(rdev-dev),
 rdev-mddev-nb_dev);
rdev-mddev = NULL;
 }
@@ -686,7 +686,7 @@
 
 static void export_rdev (mdk_rdev_t * rdev)
 {
-   printk(export_rdev(%s)\n,partition_name(rdev-dev));
+   printk(md: export_rdev(%s)\n,partition_name(rdev-dev));
if (rdev-mddev)
MD_BUG();
unlock_rdev(rdev);
@@ -694,7 +694,7 @@
md_list_del(rdev-all);
MD_INIT_LIST_HEAD(rdev-all);
if (rdev-pending.next != rdev-pending) {
-   printk((%s was pending)\n,partition_name(rdev-dev));
+   printk(md: (%s was pending)\n,partition_name(rdev-dev));
md_list_del(rdev-pending);
MD_INIT_LIST_HEAD(rdev-pending);
}
@@ -777,14 +777,14 @@
 {
int i;
 
-   printk(  SB: (V:%d.%d.%d) ID:%08x.%08x.%08x.%08x CT:%08x\n,
+   printk(md:  SB: (V:%d.%d.%d) ID:%08x.%08x.%08x.%08x CT:%08x\n,
sb-major_version, sb-minor_version, sb-patch_version,
sb-set_uuid0, sb-set_uuid1, sb-set_uuid2, sb-set_uuid3,
sb-ctime);
-   printk( L%d S%08d ND:%d RD:%d md%d LO:%d CS:%d\n, sb-level,
+   printk(md: L%d S%08d ND:%d RD:%d md%d LO:%d CS:%d\n, sb-level,
sb-size, sb-nr_disks, sb-raid_disks, sb-md_minor,
sb-layout, sb-chunk_size);
-   printk( UT:%08x ST:%d AD:%d WD:%d FD:%d SD:%d CSUM:%08x E:%08lx\n,
+   printk(md: UT:%08x ST:%d AD:%d WD:%d FD:%d SD:%d CSUM:%08x E:%08lx\n,
sb-utime, sb-state, sb-active_disks, sb-working_disks,
sb-failed_disks, sb-spare_disks,
sb-sb_csum, (unsigned long)sb-events_lo);
@@ -793,24 +793,24 @@
mdp_disk_t *desc;
 
desc = sb-disks + i;
-   printk( D %2d: , i);
+   printk(md: D %2d: , i);
print_desc(desc);
}
-   printk( THIS: );
+   printk(md: THIS: );
print_desc(sb-this_disk);
 
 }
 
 static void print_rdev(mdk_rdev_t *rdev)
 {
-   printk( rdev %s: O:%s, SZ:%08ld F:%d DN:%d ,
+   printk(md: rdev %s: O:%s, SZ:%08ld F:%d DN:%d ,
partition_name(rdev-dev), partition_name(rdev-old_dev),
rdev-size, rdev-faulty, rdev-desc_nr);
if (rdev-sb) {
-   printk(rdev superblock:\n);
+   printk(md: rdev superblock:\n);
print_sb(rdev-sb);
} else
-   printk(no rdev superblock!\n);
+   printk(md: no rdev superblock!\n);
 }
 
 void md_print_devices (void)
@@ -820,9 +820,9 @@
mddev_t *mddev;
 
printk(\n);
-   printk(**\n);
-   printk(* COMPLETE RAID STATE PRINTOUT *\n);
-   printk(**\n);
+   printk(md: **\n);
+   printk(md: * COMPLETE RAID STATE PRINTOUT *\n);
+   printk(md: **\n);
ITERATE_MDDEV(mddev,tmp) {
printk(md%d: , mdidx(mddev));
 
@@ -838,7 +838,7 @@
ITERATE_RDEV(mddev,rdev,tmp2)
print_rdev(rdev);
}
-   printk(**\n);
+   printk(md: **\n);
printk(\n);
 }
 
@@ -917,15 +917,15 @@
 
if (!rdev-sb) {
MD_BUG();
-   return -1;
+   return 1;
}
if (rdev-faulty) {
  

PATCH - raid5 performance improvement - 3 of 3

2001-06-20 Thread Neil Brown



Linus, and fellow RAIDers,

 This is the third in my three patch series for improving RAID5
 throughput.
 This one substantially lifts write thoughput by leveraging the
 opportunities for write gathering provided by the first patch.

 With RAID5, it is much more efficient to write a whole stripe full of
 data at a time as this avoids the need to pre-read any old data or
 parity from the discs.

 Without this patch, when a write request arrives, raid5 will
 immediately start a couple of pre-reads so that it will be able to
 write that block and update the parity.
 By the time that the old data and parity arrive it is quite possible
 that write requests for all the other blocks in the stripe will have
 been submitted, and the old data and parity will not be needed.

 This patch uses concepts similar to queue plugging to delay write
 requests slightly to improve the chance that many or even all of the
 data blocks in a stripe will have outstanding write requests before
 processing is started.

 To do this it maintains a queue of stripes that seem to require
 pre-reading.
 Stripes are only released  from this queue when there are no other
 pre-read requests active, and then only if the raid5 device is not
 currently plugged.

 As I mentioned earlier, my testing shows substantial improvements
 from these three patches for both sequential (bonnie) and random
 (dbench) access patterns.  I would be particularly interested if
 anyone else does any different testing, preferably comparing
 2.2.19+patches with 2.4.5 and then with 2.4.5 plus these patches. 

 I know of one problem area being sequential writes to a 3 disc
 array.  If anyone can find any other access patterns that still
 perform below 2.2.19 levels, I would really like to know about them.

NeilBrown



--- ./include/linux/raid/raid5.h2001/06/21 01:01:46 1.3
+++ ./include/linux/raid/raid5.h2001/06/21 01:04:05 1.4
@@ -158,6 +158,32 @@
 #define STRIPE_HANDLE  2
 #defineSTRIPE_SYNCING  3
 #defineSTRIPE_INSYNC   4
+#defineSTRIPE_PREREAD_ACTIVE   5
+#defineSTRIPE_DELAYED  6
+
+/*
+ * Plugging:
+ *
+ * To improve write throughput, we need to delay the handling of some
+ * stripes until there has been a chance that several write requests
+ * for the one stripe have all been collected.
+ * In particular, any write request that would require pre-reading
+ * is put on a delayed queue until there are no stripes currently
+ * in a pre-read phase.  Further, if the delayed queue is empty when
+ * a stripe is put on it then we plug the queue and do not process it
+ * until an unplug call is made (the tq_disk list is run).
+ *
+ * When preread is initiated on a stripe, we set PREREAD_ACTIVE and add
+ * it to the count of prereading stripes.
+ * When write is initiated, or the stripe refcnt == 0 (just in case) we
+ * clear the PREREAD_ACTIVE flag and decrement the count
+ * Whenever the delayed queue is empty and the device is not plugged, we
+ * move any stripes from delayed to handle and clear the DELAYED flag and set PREREAD_ACTIVE.
+ * In stripe_handle, if we find pre-reading is necessary, we do it if
+ * PREREAD_ACTIVE is set, else we set DELAYED which will send it to the delayed queue.
+ * HANDLE gets cleared if stripe_handle leaves nothing locked.
+ */
+ 
 
 struct disk_info {
kdev_t  dev;
@@ -182,6 +208,8 @@
int max_nr_stripes;
 
struct list_headhandle_list; /* stripes needing handling */
+   struct list_headdelayed_list; /* stripes that have plugged requests */
+   atomic_tpreread_active_stripes; /* stripes with scheduled io */
/*
 * Free stripes pool
 */
@@ -192,6 +220,9 @@
 * waiting for 25% to be free
 */
md_spinlock_t   device_lock;
+
+   int plugged;
+   struct tq_structplug_tq;
 };
 
 typedef struct raid5_private_data raid5_conf_t;
--- ./drivers/md/raid5.c2001/06/21 01:01:46 1.3
+++ ./drivers/md/raid5.c2001/06/21 01:04:05 1.4
@@ -31,6 +31,7 @@
  */
 
 #define NR_STRIPES 256
+#defineIO_THRESHOLD1
 #define HASH_PAGES 1
 #define HASH_PAGES_ORDER   0
 #define NR_HASH(HASH_PAGES * PAGE_SIZE / sizeof(struct 
stripe_head *))
@@ -65,11 +66,17 @@
BUG();
if (atomic_read(conf-active_stripes)==0)
BUG();
-   if (test_bit(STRIPE_HANDLE, sh-state)) {
+   if (test_bit(STRIPE_DELAYED, sh-state))
+   list_add_tail(sh-lru, conf-delayed_list);
+   else if (test_bit(STRIPE_HANDLE, sh-state)) {
list_add_tail(sh-lru, conf-handle_list);

PATCH

2001-06-20 Thread Neil Brown


Linus,
  There is a buggy BUG in the raid5 code.
 If a request on an underlying device reports an error, raid5 finds out
 which device that was and marks it as failed.  This is fine.
 If another request on the same device reports an error, raid5 fails
 to find that device in its table (because though  it is there, it is
 not operational), and so it thinks something is wrong and calls
 MD_BUG() - which is very noisy, though not actually harmful (except
 to the confidence of the sysadmin)
 

 This patch changes the test so that a failure on a drive that is
 known but not-operational will be expected and not a BUG.

NeilBrown

--- ./drivers/md/raid5.c2001/06/21 01:04:05 1.4
+++ ./drivers/md/raid5.c2001/06/21 01:04:41 1.5
@@ -486,22 +486,24 @@
PRINTK(raid5_error called\n);
conf-resync_parity = 0;
for (i = 0, disk = conf-disks; i  conf-raid_disks; i++, disk++) {
-   if (disk-dev == dev  disk-operational) {
-   disk-operational = 0;
-   mark_disk_faulty(sb-disks+disk-number);
-   mark_disk_nonsync(sb-disks+disk-number);
-   mark_disk_inactive(sb-disks+disk-number);
-   sb-active_disks--;
-   sb-working_disks--;
-   sb-failed_disks++;
-   mddev-sb_dirty = 1;
-   conf-working_disks--;
-   conf-failed_disks++;
-   md_wakeup_thread(conf-thread);
-   printk (KERN_ALERT
-   raid5: Disk failure on %s, disabling device.
-Operation continuing on %d devices\n,
-   partition_name (dev), conf-working_disks);
+   if (disk-dev == dev) {
+   if (disk-operational) {
+   disk-operational = 0;
+   mark_disk_faulty(sb-disks+disk-number);
+   mark_disk_nonsync(sb-disks+disk-number);
+   mark_disk_inactive(sb-disks+disk-number);
+   sb-active_disks--;
+   sb-working_disks--;
+   sb-failed_disks++;
+   mddev-sb_dirty = 1;
+   conf-working_disks--;
+   conf-failed_disks++;
+   md_wakeup_thread(conf-thread);
+   printk (KERN_ALERT
+   raid5: Disk failure on %s, disabling device.
+Operation continuing on %d devices\n,
+   partition_name (dev), conf-working_disks);
+   }
return 0;
}
}
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]



Re: PATCH - raid5 performance improvement - 3 of 3

2001-06-24 Thread Neil Brown

On Sunday June 24, [EMAIL PROTECTED] wrote:
 Hi,
 
 We used to (long ago, 2.2.x), whenever we got a write request for some
 buffer,
 search the buffer cache to see if additional buffers which belong to that
 particular stripe are dirty, and then schedule them for writing as well, in
 an
 attempt to write full stripes. That resulted in a huge sequential write
 performance
 improvement.
 
 If such an approach is still possible today, it is preferrable to delaying
 the writes

 It is not still possible, at least not usefully so.  In fact, it is
 also true that it is probably not preferable.

 Since about 2.3.7, filesystem data has not, by-and-large, been stored
 in the buffer cache.  It is only stored in the page cache.  So were
 raid5 to go looking in the buffer cache it would be unlikely to find
 anything. 

 But there are other problems.  The cache snooping only works if the
 direct client of raid5 is a filesystem that stores data in the buffer
 cache.  If the filesystem is an indirect client, via LVM for example,
 or even via a RAID0 array, then raid5 would not be able to look in
 the right buffer cache, and so would find nothing.  This was the
 case in 2.2.  If you tried an LVM over RAID5 in 2.2, you wouldn't get
 good write speed.  You also would probably get data corruption while
 the array was re-syncing, but that is a separate issue.

 The current solution is much more robust.  It cares nothing about the
 way the raid5 array is used. 

 Also, while the handling of stripes is delayed, I don't believe that
 this would actually show as measurable increase in latency.  The
 effect is really to have requests spend more time on a higher level
 queue, and less time on a lower level queue.  The total time on
 queues should normally be the same or less (due to improved
 throughput) or only very slightly more in pathological cases.

NeilBrown



 for the partial buffer while hoping that the rest of the bufferes in the
 stripe would
 come as well, since it both eliminates the additional delay, and doesn't
 depend on the order in which the bufferes are flushed from the much bigger
 memory buffers to the smaller stripe cache.
 

I think the ideal solution would be to have the filesystem write data
in two stages, much like Unix apps can.
As soon as a buffer is dirtied (or more accurately, as soon as the
filesystem is happy for the data to be written), it is passed on with a
WRITE_AHEAD request.  The driver is free to do what it likes,
including ignore this.
Later, at a time corresponding to fsync or close maybe, or when
memory is tight, the filesystem can send the buffer down with a
WRITE request which says please write this *now*.

RAID5 could then gather all the write_ahead requests into a hash table
(not unlike the old buffer cache), and easily find full stripes for
writing.

But that is not going to happen in 2.4.

NeilBrown


 Cheers,
 
 Gadi
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]



Re: Failed disk triggers raid5.c bug?

2001-06-24 Thread Neil Brown

On Sunday June 24, [EMAIL PROTECTED] wrote:
 Hi!
 
 Neil Brown wrote:
  On Thursday June 14, [EMAIL PROTECTED] wrote:
   
   Dear All
   
   I've just had a disk (sdc) fail in my raid5 array (sdb sdc sdd),
 
  Great!  A real live hardware failure!  It is always more satisfying to
  watch one of those than to have to simulate them all the time!!
  Unless of course they are fatal... not the case here it seems.
 
 Well, here comes a _real_ fatal one...

And a very detailed report it was, thanks.

I'm not sure that  you want to know this, but it looks like you might
have been able to recover your data though it is only a might.

 Jun 19 09:18:23 wien kernel: (read) scsi/host0/bus0/target0/lun0/part4's sb offset: 
16860096 [events: 0024]
 Jun 19 09:18:23 wien kernel: (read) scsi/host0/bus0/target1/lun0/part4's sb offset: 
16860096 [events: 0024]
 Jun 19 09:18:23 wien kernel: (read) scsi/host0/bus0/target2/lun0/part4's sb offset: 
16860096 [events: 0023]
 Jun 19 09:18:23 wien kernel: (read) scsi/host0/bus0/target3/lun0/part4's sb offset: 
16860096 [events: 0028]

The reason that this array couldn't restart was that the 4th drive had
the highest event count and it was alone in this.  It didn't even have
any valid data!!
Had you unplugged this drive and booted, it would have tried to
assemble an array out of the first two (event count 24).  This might
have worked (though it might not, read on).

Alternately, you could have created a raidtab which said that the
third drive was failed, and then run mkraid...

mdctl, when it is finished, should be able to make this all much
easier.

But what went wrong?  I don't know the whole story but:

- On the first error, the drive was disabled and reconstruction was
  started. 
- On the second error, the reconstruction was inappropriately
  interrupted.  This is an error that I will have to fix in 2.4.
  However it isn't really a fatal error.
- Things were then going fairly OK, though noisy, until:

 Jun 19 09:10:07 wien kernel: attempt to access beyond end of device
 Jun 19 09:10:07 wien kernel: 08:04: rw=0, want=1788379804, limit=16860217
 Jun 19 09:10:07 wien kernel: dev 09:00 blksize=1024 blocknr=447094950 
sector=-718207696 size=4096 count=1

 For some reason, it tried to access well beyond the end of one of the
 underlying drives.  This caused that drive to fail.  This relates to
 the subsequent message:

 Jun 19 09:10:07 wien kernel: raid5: restarting stripe 3576759600

 which strongly suggests that the filesystem actually asked the raid5
 array for a block that was well out of range.
 In 2.4, this will be caught before the request gets to raid5.  In 2.2
 it isn't.  The request goes on to raid5, raid5 blindly passes a bad
 request down to the disc.  The disc reports an error, and raid5
 thinks the disc has failed, rather than realise that it never should
 have made such a silly request.

 But why did the filesystem ask for a block that was out of range?
 This is the part that I cannot fathom.  It would seem as though the
 filesystem got corrupt somehow.  Maybe an indirect block got replaced
 with garbage, and ext2fs believed the indirect block and went seeking
 way off the end of the array.  But I don't know how the corruption
 happened.

 Had you known enough to restart the array from the two apparently
 working drives, and then run fsck, it might have fixed things enough
 to keep going.  Or it might not, depending on how much corruption
 there was.

 So, Summary of problems:
  1/ md responds to a failure on a known-failed drive inappropriately.
This shouldn't be fatal but needs fixing.
  2/ md isn't thoughtful enough about updating the event counter on
 superblocks and can easily leave an array in an unbuildable
 state.  This needs to be fixed.  It's on my list...
  3/ raid5 responds to a request for an out-of-bounds device address
by passing on out-of-bounds device addresses to the drives, and then
thinking that those drives are failed.
This is fixed in 2.4
  4/ Something caused some sort of filesystem corruption.  I don't
 know what.


NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]



Re: Failed disk triggers raid5.c bug?

2001-06-26 Thread Neil Brown

On Monday June 25, [EMAIL PROTECTED] wrote:
 Is there any way for the RAID code to be smarter when deciding 
 about those event counters? Does it have any chance (theoretically)
 to _know_ that it shouldn't use the drive with event count 28?

My current thinking is that once a raid array becomes unusable - in the
case of raid5, this means two failures - the array should immediately
be marked read-only, including the superblocks.   Then if you ever
manage to get enough drives together to form a working array, it will
start working again, and if not, it won't really matter whether the
superblock was updated or not.



 And even if that can't be done automatically, what about a small
 utility for the admin where he can give some advise to support 
 the RAID code on those decisions?
 Will mdctl have this functionality? That would be great!

mdctl --assemble will have a --force option to tell it to ignore
event numbers and assemble the array anyway.  This could result in
data corruption if you include an old disc, but would be able to get
you out of a tight spot.  Of course, once the above change goes into
the kernel it shouldn't be necessary.

  
 Hm, does the RAID code disable a drive on _every_ error condition?
 Isn't there a distinction between, let's say, soft errors and hard
 errors?
 (I have to admit I don't know the internals of Linux device drivers
 enough to answer that question)
 Shouldn't the RAID code leave a drive which reports soft errors
 in the array and disable drives with hard errors only?

A Linux block device doesn't report soft errors. There is either
success or failure.  The driver for the disc drive should retry any
soft errors and only report an error up through the block-device layer
when it is definitely hard.

Arguably the RAID layer should catch read errors and try to get the
data from elsewhere and then re-write over the failed read, just
in case it was a single block error.
But a write error should always be fatal and fail the drive. I cannot
think of any other reasonable approach.

 
 In that case, the filesystem might have been corrupt, but the array
 would have been re-synced automatically, wouldn't it?

Yes, and it would have been, if it hadn't collapsed in a heap while trying :-(

 
   But why did the filesystem ask for a block that was out of range?
   This is the part that I cannot fathom.  It would seem as though the
   filesystem got corrupt somehow.  Maybe an indirect block got replaced
   with garbage, and ext2fs believed the indirect block and went seeking
   way off the end of the array.  But I don't know how the corruption
   happened.
  
 Perhaps the read errors from the drive triggered that problem?

They shouldn't, but seeing as I don't know where the corruption came
from, and I'm not even 100% confident that there was corruption, maybe
they could.

The closest I can come to a workable scenario is that maybe some
parity block had the wrong data.  Normally this wouldn't be noticed,
but when you have a failed drive you have to use the parity to
calculate the value of a missing block, and bad parity would make this
block bad.  But I cannot imagine how you would have a bad parity
block.  After any unclean shutdown the parity should be recalculated.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]



Re: Mounting very old style raid on a recent machine?

2001-06-26 Thread Neil Brown

On Tuesday June 26, [EMAIL PROTECTED] wrote:
 Hi,
 
 I currently have to salvage data from an ancient box that looks like
 to have run kernel 2.0.35. However, the system on that disk is
 corrupted and won't boot any more (at least not on today's hardware).
 It looks like main data is on a RAID.
 
 /etc/mdtab:
 |/dev/md0 linear  /dev/hda3   /dev/hdb2
 
 Can I access that RAID from a current system running kernel 2.2 or
 2.4? Do I have to build a new 2.0 kernel? What type of raidtools do I
 need to activate that RAID?

You should be able to access this just fine from a 2.2 kernel using
raidtools-0.41 from
http://www.kernel.org/pub/linux/daemons/raid/

If you need to use 2.4, then you should still be able to access it
using raidtools 0.90 from
   http://www.kernel.org/pub/linux/daemons/raid/alpha/

in this case you would need an /etc/raidtab like
-
raiddev /dev/md0
raid-level linear
nr-raid-disks 2
persistent-superblock 0

device /dev/hda3
raid-disk 0
device /dev/hdb2
raid-disk 1
---

Note that this is pure theory.  I have never actually done it myself.

It should be quite safe to experiment.  You are unlikely to corrupt
anything if you don't do anything outrageously silly like telling it
that it is a raid1 or raid5 array.

Note: the persistent-superblock 0 is fairly important. These older
arrays did not have any raid-superblock on the device.  You want to
make sure you don't accidentally write one and so corrupt data.

I would go for a 2.2 kernel, raidtools  0.41 and the command:

 mdadd -r -pl /dev/md0 /dev/hda3 /dev/hdb2

NeilBrown


 
 Any hints will be appreciated.
 
 Greetings
 Marc
 
 -- 
 -
 Marc Haber | I don't trust Computers. They | Mailadresse im Header
 Karlsruhe, Germany |  lose things.Winona Ryder | Fon: *49 721 966 32 15
 Nordisch by Nature |  How to make an American Quilt | Fax: *49 721 966 31 29
 -
 To unsubscribe from this list: send the line unsubscribe linux-raid in
 the body of a message to [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]



PATCH/RFC - partitioning of md devices

2001-07-01 Thread Neil Brown


Linus,
 I wonder if you would consider applying, or commenting on this patch.

 It adds support for partitioning md devices.  In particular, a new
 major device is created (name==mdp, number assigned dynamically)
 which provides for 15 partitions on each of the first 16 md devices.

 I understand that a more uniform approach to partitioning might get
 introduced in 2.5, but this seems the best approach for 2.4.

 This is particularly useful if you want to have a mirrored boot
 drive, rather than two drives with lots of mirrored partitions.

 It is also useful for supporting what I call winware raid, which is
 the raid-controller equivalent of winmodems - minimal hardware and
 most of the support done in software.

 Among the things that this patch does are:
 
  1/ tidy up some terminology.  Currently there is a one-to-one
  mapping between minor numbers and raid arrays or units, so the
  term minor is used when referring to either a real minor number or
  to a unit.
  This patch introduces the term unit to be used to identify which
  particular array is being referred to, and keeps minor just for
  when a minor device number is really implied.

  2/ When reporting the geometry of a partitioned raid1 array, the
  geometry of the underlying device is reported.  For all other arrays
  the 2x4xLARGE geometry is maintained.

  3/ The hardsectsize of partitions in a RAID5 array is set to the
PAGESIZE because raid5 doesn't cope well with receiving requests
with different blocksizes.

  4/ The new device reports a name of md (via hd_struct-major_name)
so partitions look like mda3 or md/disc0/part3, but registers the
name mdp so that /proc/devices shows the major number next to
mdp.

  5/ supports ioctls for re-reading the partition table and setting
 partition table information.



--- ./include/linux/raid/md.h   2001/07/01 22:59:38 1.1
+++ ./include/linux/raid/md.h   2001/07/01 22:59:47 1.2
@@ -61,8 +61,11 @@
 extern int md_size[MAX_MD_DEVS];
 extern struct hd_struct md_hd_struct[MAX_MD_DEVS];
 
-extern void add_mddev_mapping (mddev_t *mddev, kdev_t dev, void *data);
-extern void del_mddev_mapping (mddev_t *mddev, kdev_t dev);
+extern int mdp_size[MAX_MDP_DEVSMDP_MINOR_SHIFT];
+extern struct hd_struct mdp_hd_struct[MAX_MDP_DEVSMDP_MINOR_SHIFT];
+
+extern void add_mddev_mapping (mddev_t *mddev, int unit, void *data);
+extern void del_mddev_mapping (mddev_t *mddev, int unit);
 extern char * partition_name (kdev_t dev);
 extern int register_md_personality (int p_num, mdk_personality_t *p);
 extern int unregister_md_personality (int p_num);
--- ./include/linux/raid/md_k.h 2001/07/01 22:59:38 1.1
+++ ./include/linux/raid/md_k.h 2001/07/01 22:59:47 1.2
@@ -15,6 +15,7 @@
 #ifndef _MD_K_H
 #define _MD_K_H
 
+
 #define MD_RESERVED   0UL
 #define LINEAR1UL
 #define STRIPED   2UL
@@ -60,7 +61,10 @@
 #error MD doesnt handle bigger kdev yet
 #endif
 
+#defineMDP_MINOR_SHIFT 4
+
 #define MAX_MD_DEVS  (1MINORBITS)/* Max number of md dev */
+#define MAX_MDP_DEVS  (1(MINORBITS-MDP_MINOR_SHIFT)) /* Max number of md dev */
 
 /*
  * Maps a kdev to an mddev/subdev. How 'data' is handled is up to
@@ -73,11 +77,17 @@
 
 extern dev_mapping_t mddev_map [MAX_MD_DEVS];
 
+extern int mdp_major;
 static inline mddev_t * kdev_to_mddev (kdev_t dev)
 {
-   if (MAJOR(dev) != MD_MAJOR)
+   int unit=0;
+   if (MAJOR(dev) == MD_MAJOR)
+   unit = MINOR(dev);
+   else if (MAJOR(dev) == mdp_major)
+   unit = MINOR(dev)  MDP_MINOR_SHIFT;
+   else
BUG();
-return mddev_map[MINOR(dev)].mddev;
+   return mddev_map[unit].mddev;
 }
 
 /*
@@ -191,7 +201,7 @@
 {
void*private;
mdk_personality_t   *pers;
-   int __minor;
+   int __unit;
mdp_super_t *sb;
int nb_dev;
struct md_list_head disks;
@@ -248,13 +258,34 @@
  */
 static inline int mdidx (mddev_t * mddev)
 {
-   return mddev-__minor;
+   return mddev-__unit;
+}
+
+static inline int mdminor (mddev_t *mddev)
+{
+   return mdidx(mddev);
+}
+
+static inline int mdpminor (mddev_t *mddev)
+{
+   return mdidx(mddev) MDP_MINOR_SHIFT;
+}
+
+static inline kdev_t md_kdev (mddev_t *mddev)
+{
+   return MKDEV(MD_MAJOR, mdminor(mddev));
 }
 
-static inline kdev_t mddev_to_kdev(mddev_t * mddev)
+static inline kdev_t mdp_kdev (mddev_t *mddev, int part)
 {
-   return MKDEV(MD_MAJOR, mdidx(mddev));
+   return MKDEV(mdp_major, mdpminor(mddev)+part);
 }
+
+#define foreach_part(tmp,mddev)\
+   if (mdidx(mddev)MAX_MDP_DEVS)  \
+   for(tmp=mdpminor(mddev);\
+   tmpmdpminor(mddev)+(1MDP_MINOR_SHIFT);   \
+   tmp++)
 
 extern 

Re: raid 01 vs 10

2001-07-09 Thread Neil Brown

On Monday July 9, [EMAIL PROTECTED] wrote:
 
 I was wondering what people thought of using raid 0+1 (a mirrored array
 of raid0 stripes) vs. raid 1+0 (a raid0 array of mirrored disks). It
 seems that either can sustain at least one drive failure and the
 performance should be similar. Are there strong reasons for using one
 over the other?

All other things being equal, raid 1+0 is usually better.
It can withstand a greater variety of 2 disc failures and the separate
arrays can rebuild in parallel after an unclean shutdown, thus
returning you to full redundancy more quickly.

But sometimes other things are not equal.
If you don't have uniform drive sizes, you might want to raid0
assorted drives together to create two similar sized sets to raid1.

I recall once someone suggesting that with certain cabling geometries
it was better to use 0+1 in cases of cable failure, but I cannot
remember, or work out, how that might have been.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]



Re: migrating raid-1 to different drive geometry ?

2005-01-24 Thread Neil Brown

On Monday January 24, [EMAIL PROTECTED] wrote:
 how can the existing raid setup be moved to the new disks 
 without data loss ?
 
 I guess it must be something like this:
 
 1) physically remove first old drive
 2) physically add first new drive
 3) re-create partitions on new drive
 4) run raidhotadd for each partition
 5) wait until all partitions synced
 6) repeat with second drive

Sounds good.
 
 the big question is: since the drive geometry will definitely different
 between old 60GB and new 80GB drive(s), how do the new partitions 
 have to be created on the new drive ?
 - do they have to have exactly the same amount of blocks ?
No.
 - may they be bigger ?
Yes (they cannot be smaller).

However making the partitions bigger will not make the arrays bigger.

If you are using a recent 2.6 kernel and mdadm 1.8.0, you can grow the
array with
   mdadm --grow /dev/mdX --size=max

You will then need to convince the filesystem in the array to make use
of the extra space.  Many filesystems do support such growth.  Some
even support on-line growth.
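
For example (just a sketch, assuming an ext2/ext3 filesystem - use
whatever grow tool your own filesystem provides), after the mdadm
--grow step you could do something like:

   umount /dev/mdX
   e2fsck -f /dev/mdX
   resize2fs /dev/mdX
   mount /dev/mdX

resize2fs with no explicit size will grow the filesystem to fill the
(now larger) device.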

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: migrating raid-1 to different drive geometry ?

2005-01-24 Thread Neil Brown
On Tuesday January 25, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
  If you are using a recent 2.6 kernel and mdadm 1.8.0, you can grow the
  array with
 mdadm --grow /dev/mdX --size=max
 
 Neil,
 
 Is this just for RAID1? OR will it work for RAID5 too?

 --grow --size=max

should work for raid 1,5,6.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: /dev/md* Device Files

2005-01-26 Thread Neil Brown
On Wednesday January 26, [EMAIL PROTECTED] wrote:
  A useful trick I discovered yesterday: Add --auto to your mdadm commandline
  and it will create the device for you if it is missing :)
 
 
 Well, it seems that this machine is using the udev scheme for managing 
 device files. I didn't realize this as udev is new to me, but I probably 
 should have mentioned the kernel version (2.6.8) I was using. So I need to 
 research udev and how one causes devices to be created, etc.

Beware udev has an understanding of how device files are meant to
work which is quite different from how md actually works.

udev thinks that devices should appear in /dev after the device is
actually known to exist in the kernel.  md needs a device to exist in
/dev before the kernel can be told that it exists.

This is one of the reasons that --auto was added to mdadm - to bypass
udev.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Software RAID 0+1 with mdadm.

2005-01-26 Thread Neil Brown
On Wednesday January 26, [EMAIL PROTECTED] wrote:
 This bug that's fixed in 1.9.0, is in a bug when you create the array?  ie
 do we need to use 1.9.0 to create the array.  I'm looking to do the same but
 my bootdisk currently only has 1.7.soemthing on it.  Do I need to make a
 custom bootcd with 1.9.0 on it?

This issue that will be fixed in 1.9.0 has nothing to do with creating
the array.

It is only relevant for stacked arrays (e.g. a raid0 made out of 2 or
more raid1 arrays), and only if you are using
   mdadm --assemble --scan
(or similar) to assemble your arrays, and you specify the devices to
scan in mdadm.conf as
   DEVICES partitions
(i.e. don't list actual devices, just say to get them from the list of
known partitions).

So, no: no need for a custom bootcd.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Software RAID 0+1 with mdadm.

2005-01-26 Thread Neil Brown
On Tuesday January 25, [EMAIL PROTECTED] wrote:
 Been trying for days to get a software RAID 0+1 setup. This is on SuSe
 9.2 with kernel 2.6.8-24.11-smp x86_64.
 
 I am trying to setup a RAID 0+1 with 4 250gb SATA drives. I do the
 following:
 
 mdadm --create /dev/md1 --level=0 --chunk=4 --raid-devices=2 /dev/sdb1
 /dev/sdc1
 mdadm --create /dev/md2 --level=0 --chunk=4 --raid-devices=2 /dev/sdd1
 /dev/sde1
 mdadm --create /dev/md0 --level=1 --chunk=4 --raid-devices=2 /dev/md1
 /dev/md2
 
 This all works fine and I can mkreiserfs /dev/md0 and mount it. If I am
 then to reboot /dev/md1 and /dev/md2 will show up in the /proc/mdstat
 but not /dev/md0. So I create a /etc/mdadm.conf like so to see if this
 will work:
 
 DEVICE partitions
 DEVICE /dev/md*
 ARRAY /dev/md2 level=raid0 num-devices=2
 UUID=5e6efe7d:6f5de80b:82ef7843:148cd518
devices=/dev/sdd1,/dev/sde1
 ARRAY /dev/md1 level=raid0 num-devices=2
 UUID=e81e74f9:1cf84f87:7747c1c9:b3f08a81
devices=/dev/sdb1,/dev/sdc1
 ARRAY /dev/md0 level=raid1 num-devices=2  devices=/dev/md2,/dev/md1
 
 
 Everything seems ok after boot. But again no /dev/md0 in /proc/mdstat.
 But then if I do a mdadm --assemble --scan it will then load
 /dev/md0. 

My guess is that you are (or SuSE is) relying on autodetect to
assemble the arrays.  Autodetect cannot assemble an array made of
other arrays.  Just an array made of partitions.

If you disable the autodetect stuff and make sure 
  mdadm --assemble --scan
is in a boot-script somewhere, it should just work.

Also, you don't really want the device=/dev/sdd1... entries in
mdadm.conf.
They tell mdadm to require the devices to have those names.  If you
add or remove scsi drives at all, the names can change.  Just rely on
the UUID.
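
Something along these lines should do (a sketch only - fill in md0's
UUID from mdadm --detail /dev/md0, and drop the devices= entries):

DEVICE partitions
DEVICE /dev/md*
ARRAY /dev/md1 level=raid0 num-devices=2 UUID=e81e74f9:1cf84f87:7747c1c9:b3f08a81
ARRAY /dev/md2 level=raid0 num-devices=2 UUID=5e6efe7d:6f5de80b:82ef7843:148cd518
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=...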

 
 Also do I need to create partitions? Or can I setup the whole drives as
 the array?

You don't need partitions.

 
 I have since upgraded to mdadm 1.8 and setup a RAID10. However I need
 something that is production worthy. Is a RAID10 something I could rely
 on as well? Also under a RAID10 how do you tell it which drives you want
 mirrored?

raid10 is 2.6 only, but should be quite stable.
You cannot tell it which drives to mirror because you shouldn't care.
You just give it a bunch of identical drives and let it put the data
where it wants.

If you really want to care (and I cannot imagine why you would - all
drives in a raid10 are likely to get similar load) then you have to
build it by hand - a raid0 of multiple raid1s.
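
For example (a sketch only - substitute your own devices):

   mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc1
   mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdd1 /dev/sde1
   mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/md1 /dev/md2

i.e. the mirrors at the bottom and the stripe on top - the reverse of
the 0+1 layering you listed above.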

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Change preferred minor number of an md device?

2005-01-31 Thread Neil Brown
On Monday January 31, [EMAIL PROTECTED] wrote:
 Hi to all, md gurus!
 
 Is there a way to edit the preferred minor of a stopped device?

mdadm --assemble /dev/md0 --update=super-minor /dev/

will assemble the array and update the preferred minor to 0 (from
/dev/md0).

However this won't work for you as you already have a /dev/md0
running...

 
 Alternatively, is there a way to create a raid1 device specifying the 
 preferred minor number md0, but activating it provisionally as a different 
 minor, say md5? An md0 is already running, so mdadm --create /dev/md0 
 fails...
 
 I have to dump my /dev/md0 to a different disk (/dev/md5), but when I boot 
 from the new disk, I want the kernel to autmatically detect the device 
 as /dev/md0.

If you are running 2.6, then you just need to assemble it as /dev/md0
once and that will automatically update the superblock.  You could do
this with kernel parameters of 
   raid=noautodetect md=0,/dev/firstdrive,/dev/seconddrive

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


ANNOUNCE: mdadm 1.9.0 - A tool for managing Soft RAID under Linux

2005-02-03 Thread Neil Brown


I am pleased to announce the availability of 
   mdadm version 1.9.0
It is available at
   http://www.cse.unsw.edu.au/~neilb/source/mdadm/
and
   http://www.{countrycode}.kernel.org/pub/linux/utils/raid/mdadm/

as a source tar-ball and (at the first site) as an SRPM, and as an RPM for i386.

mdadm is a tool for creating, managing and monitoring
device arrays using the md driver in Linux, also
known as Software RAID arrays.

Release 1.9.0 adds:
-   Fix rpm build problem (stray %)
-   Minor manpage updates
-   Change dirty status to active as it was confusing people.
-   --assemble --auto recognises 'standard' names and insists on using
the appropriate major/minor number for them.
-   Remove underscore from partition names, so partitions of 
foo are foo1, foo2 etc (unchanged) and partitions of
f00 are f00p1, f00p2 etc rather than f00_p1...
-   Use major, minor, makedev macros instead of 
MAJOR, MINOR, MKDEV so that large device numbers work
on 2.6 (providing you have glibc 2.3.3 or later).
-   Add some missing closes of open file descriptors.
-   Reread /proc/partition for every array assembled when using
it to find devices, rather than only once.
-   Make mdadm -Ss stop stacked devices properly, by reversing the
order in which arrays are stopped.
-   Improve some error messages.
-   Allow device name to appear before first option, so e.g.
mdadm /dev/md0 -A /dev/sd[ab]
works.
-   Assume '-Q' if just a device is given, rather than being silent.

This is based on 1.8.0 and *not* on 1.8.1 which was meant to be a pre-release 
for the upcoming 2.0.0.  The next prerelease will have a more obvious name.

Development of mdadm is sponsored by [EMAIL PROTECTED]: 
  The School of Computer Science and Engineering
at
  The University of New South Wales

NeilBrown  04 February 2005

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ANNOUNCE: mdadm 1.9.0 - A tool for managing Soft RAID under Linux

2005-02-03 Thread Neil Brown
On Friday February 4, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
 
  Release 1.9.0 adds:
 ...
  -   --assemble --auto recognises 'standard' name and insists on using
  the appropriate major/minor number for them.
 
 Is this the problem I encountered when I added auto=md to my mdadm.conf 
 file?

Probably.

 
 It caused all sorts of problems - which were recoverable, fortunately.
 
 I ended up putting a '/sbin/MAKEDEV md' into /etc/rc.sysinit just before 
 the call to mdadm, but that creates all the md devices, not just those 
 that are needed.
 
 Will this new version allow me to remove this line in rc.sysinit again 
 and put the 'auto=md' back into mdadm.conf?

I think so, yes.  It is certainly worth a try and I would appreciate
success or failure reports.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Problem with Openmosix

2005-02-14 Thread Neil Brown
On Monday February 14, [EMAIL PROTECTED] wrote:
 Hi, Neil...

Hi.

 
 I use MD driver two year ago with Debian, and run perfectly.

Great!

 
 The machine boot the new kernel a run Ok... but... if I (or another
 process) make a change/write to the raid md system, the computer crash
 with the message:
 
   hdh: Drive not ready for command.
 
 (hdh is the mirror raid1 for hdf disk).

I cannot help thinking that maybe the Drive is not ready for the
command.  i.e. it isn't an md problem.  It isn't an openmosix
problem.  It is a drive hardware problem, or maybe an IDE controller
problem.   Can you try a different drive? Can you try just putting a
filesystem on that drive alone and see if it works?

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [Bugme-new] [Bug 4211] New: md configuration destroys disk GPT label

2005-02-14 Thread Neil Brown
On Monday February 14, [EMAIL PROTECTED] wrote:
 Maybe I am confused, but if you use the whole disk, I would expect the whole
 disk could be over-written!  What am I missing?

I second that.

Once you do anything to a whole disk, whether you make an md array out of
it, or mkfs it or anything else, you can kiss any partitioning
goodbye.

Maybe what you want to do is make an md array and then partition
that.
In 2.6 you can do that directly.  In 2.4 you would need to use LVM to
partition the array.
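
For example, with a recent mdadm something like this (a sketch only)
should give you a partitionable array:

   mdadm --create /dev/md_d0 --auto=mdp --level=1 --raid-devices=2 /dev/sda /dev/sdb

which you can then carve up with fdisk just like a plain disk; the
partitions show up as /dev/md_d0p1, /dev/md_d0p2, and so on.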

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.11-rc4 md loops on missing drives

2005-02-15 Thread Neil Brown
On Tuesday February 15, [EMAIL PROTECTED] wrote:
 G'day all,
 
 I'm not really sure how it's supposed to cope with losing more disks than 
 planned, but filling the 
 syslog with nastiness is not very polite.

Thanks for the bug report.  There are actually a few problems relating
to resync/recovery when an array (raid 5 or 6) has lost too many
devices.
This patch should fix them.

NeilBrown


Make raid5 and raid6 robust against failure during recovery.

Two problems are fixed here.
1/ if the array is known to require a resync (parity update),
  but there are too many failed devices,  the resync cannot complete
  but will be retried indefinitely.
2/ if the array has too many failed drives to be usable and a spare is
  available, reconstruction will be attempted, but cannot work.  This
  also is retried indefinitely.


Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/md.c|   12 ++--
 ./drivers/md/raid5.c |   13 +
 ./drivers/md/raid6main.c |   12 
 3 files changed, 31 insertions(+), 6 deletions(-)

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~  2005-02-16 11:25:25.0 +1100
+++ ./drivers/md/md.c   2005-02-16 11:25:31.0 +1100
@@ -3655,18 +3655,18 @@ void md_check_recovery(mddev_t *mddev)
 
/* no recovery is running.
 * remove any failed drives, then
-* add spares if possible
+* add spares if possible.
+* Spare are also removed and re-added, to allow
+* the personality to fail the re-add.
 */
-   ITERATE_RDEV(mddev,rdev,rtmp) {
+   ITERATE_RDEV(mddev,rdev,rtmp)
if (rdev-raid_disk = 0 
-   rdev-faulty 
+   (rdev-faulty || ! rdev-in_sync) 
atomic_read(rdev-nr_pending)==0) {
if (mddev-pers-hot_remove_disk(mddev, 
rdev-raid_disk)==0)
rdev-raid_disk = -1;
}
-   if (!rdev-faulty  rdev-raid_disk = 0  
!rdev-in_sync)
-   spares++;
-   }
+
if (mddev-degraded) {
ITERATE_RDEV(mddev,rdev,rtmp)
if (rdev-raid_disk  0

diff ./drivers/md/raid5.c~current~ ./drivers/md/raid5.c
--- ./drivers/md/raid5.c~current~   2005-02-16 11:25:25.0 +1100
+++ ./drivers/md/raid5.c2005-02-16 11:25:31.0 +1100
@@ -1491,6 +1491,15 @@ static int sync_request (mddev_t *mddev,
unplug_slaves(mddev);
return 0;
}
+   /* if there is 1 or more failed drives and we are trying
+* to resync, then assert that we are finished, because there is
+* nothing we can do.
+*/
+   if (mddev-degraded = 1  test_bit(MD_RECOVERY_SYNC, 
mddev-recovery)) {
+   int rv = (mddev-size  1) - sector_nr;
+   md_done_sync(mddev, rv, 1);
+   return rv;
+   }
 
x = sector_nr;
chunk_offset = sector_div(x, sectors_per_chunk);
@@ -1882,6 +1891,10 @@ static int raid5_add_disk(mddev_t *mddev
int disk;
struct disk_info *p;
 
+   if (mddev-degraded  1)
+   /* no point adding a device */
+   return 0;
+
/*
 * find the disk ...
 */

diff ./drivers/md/raid6main.c~current~ ./drivers/md/raid6main.c
--- ./drivers/md/raid6main.c~current~   2005-02-16 11:25:25.0 +1100
+++ ./drivers/md/raid6main.c2005-02-16 11:25:31.0 +1100
@@ -1650,6 +1650,15 @@ static int sync_request (mddev_t *mddev,
unplug_slaves(mddev);
return 0;
}
+   /* if there are 2 or more failed drives and we are trying
+* to resync, then assert that we are finished, because there is
+* nothing we can do.
+*/
+   if (mddev-degraded = 2  test_bit(MD_RECOVERY_SYNC, 
mddev-recovery)) {
+   int rv = (mddev-size  1) - sector_nr;
+   md_done_sync(mddev, rv, 1);
+   return rv;
+   }
 
x = sector_nr;
chunk_offset = sector_div(x, sectors_per_chunk);
@@ -2048,6 +2057,9 @@ static int raid6_add_disk(mddev_t *mddev
int disk;
struct disk_info *p;
 
+   if (mddev-degraded  2)
+   /* no point adding a device */
+   return 0;
/*
 * find the disk ...
 */
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH md 9 of 9] Optimise reconstruction when re-adding a recently failed drive.

2005-02-17 Thread Neil Brown
On Thursday February 17, [EMAIL PROTECTED] wrote:
 
 NeilBrown wrote:
  When an array is degraded, bit in the intent-bitmap are
  never cleared. So if a recently failed drive is re-added, we only need
  to reconstruct the block that are still reflected in the
  bitmap.
  This patch adds support for this re-adding.
 
 Hi there -
 
 If I understand this correctly, this means that:
 
 1) if I had a raid1 mirror (for example) that has no writes to it since 
 a resync
 2) a drive fails out, and some writes occur
 3) when I re-add the drive, only the areas where the writes occurred 
 would be re-synced?
 
 I can think of a bunch of peripheral questions around this scenario, and 
 bad sectors / bad sector clearing, but I may not be understanding the 
 basic idea, so I wanted to ask first.

You seem to understand the basic idea.
I believe one of the motivators for this code (I didn't originate it)
is when a raid1 has one device locally and one device over a network
connection.

If the network connection breaks, that device has to be thrown
out. But when it comes back, we don't want to resync the whole array
over the network.  This functionality helps there (though there are a
few other things needed before that scenario can work smoothly).

You would only re-add a device if you thought it was OK.  i.e. if it
was a connection problem rather than a media problem, or if you had
resolved any media issues.


NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH md 0 of 9] Introduction

2005-02-20 Thread Neil Brown
On Friday February 18, [EMAIL PROTECTED] wrote:
 Would you recommend to apply this package 
 http://neilb.web.cse.unsw.edu.au/~neilb/patches/linux-devel/2.6/2005-02-18-00/patch-all-2005-02-18-00
 To a 2.6.10 kernel?

No.  I don't think it would apply.
That patch is mostly experimental stuff.  Only apply it if you want to
experiment with the bitmap resync code.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Raid-6 hang on write.

2005-02-27 Thread Neil Brown
On Friday February 25, [EMAIL PROTECTED] wrote:
 
 Turning on debugging in raid6main.c and md.c make it much harder to hit. So 
 I'm assuming something 
 timing related.
 
 raid6d -- md_check_recovery -- generic_make_request -- make_request -- 
 get_active_stripe

Yes, there is a real problem here.  I'll see if I can figure out the best
way to remedy it...
However I think you reported this problem against a non -mm kernel,
and the path from md_check_recovery to generic_make_request only
exists in -mm.

Could you please confirm if there is a problem with
2.6.11-rc4-bk4-bk10

as reported, and whether it seems to be the same problem.

Thanks,
NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Raid-6 hang on write.

2005-03-01 Thread Neil Brown
On Tuesday March 1, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
  
  Could you please confirm if there is a problem with
  2.6.11-rc4-bk4-bk10
  
  as reported, and whether it seems to be the same problem.
 
 Ok.. are we all ready? I had applied your development patches to all my 
 vanilla 2.6.11-rc4-* 
 kernels. Thus they all exhibited the same problem in the same way as -mm1. 
 Smacks forehead against 
 wall repeatedly

Thanks for following through with this so we know exactly where the
problem is ... and isn't.  And admitting your careless mistake in
public is a great example to all the rest of us who are too shy to do
so - thanks :-)

 
 Oh well, at least we now know about a bug in the -mm patches.
 

Yes, and very helpful to know it is.  Thanks again.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Joys of spare disks!

2005-03-01 Thread Neil Brown
On Wednesday March 2, [EMAIL PROTECTED] wrote:
 
 Is there any sound reason why this is not feasible? Is it just that 
 someone needs to write the code to implement it?

Exactly (just needs to be implemented).

NeilBrown

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Creating RAID1 with missing - mdadm 1.90

2005-03-05 Thread Neil Brown
On Saturday March 5, [EMAIL PROTECTED] wrote:
 What might the proper [or functional] syntax be to do this?
 
 I'm running 2.6.10-1.766-FC3, and mdadm 1.90.

It would help if you told us what you tried, as then we could possibly
give a more focussed answer, however:


   mdadm --create /dev/md1 --level=raid1 --raid-devices=2 /dev/sda3 missing

might be the sort of thing you want.

NeilBrown

 
 Thanks for the time.
 b-
 -
 To unsubscribe from this list: send the line unsubscribe linux-raid in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Spare disk could not sleep / standby

2005-03-07 Thread Neil Brown
On Monday March 7, [EMAIL PROTECTED] wrote:
 I have no idea, but...
 
 Is the disk IO reads or writes.  If writes, scary  Maybe data destined
 for the array goes to the spare sometimes.  I hope not.  I feel safe with my
 2.4 kernel.  :)

It is writes, but don't be scared.  It is just super-block updates.

In 2.6, the superblock is marked 'clean' whenever there is a period of
about 20ms of no write activity.  This increases the chance that a
resync won't be needed after a crash.
(unfortunately) the superblocks on the spares need to be updated too.

The only way around this that I can think of is to have the spares
attached to some other array, and have mdadm monitoring the situation
and using the SpareGroup functionality to move the spare to where it
is needed when it is needed.
This would really require having an array with spare drives but no
data drives... maybe a 1-drive raid1 with a loopback device as the
main drive, and all the spares attached to that.  There must be a
better way, or at least some sensible support in mdadm to make it not
too horrible.  I'll think about it.
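
For reference, the mdadm.conf side of the SpareGroup approach looks
roughly like this (a sketch only - the UUIDs and group name are
placeholders):

   ARRAY /dev/md0 UUID=...  spare-group=hpt375
   ARRAY /dev/md1 UUID=...  spare-group=hpt375

with mdadm --monitor --scan running; mdadm can then move a spare from
one array in the group to another array in the same group that has
lost a drive.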

NeilBrown

 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of Peter Evertz
 Sent: Monday, March 07, 2005 11:05 PM
 To: linux-raid@vger.kernel.org
 Subject: Spare disk could not sleep / standby
 
 I have 2 Raid5 arrays on a hpt375. Each has a (unused) spare disk. 
 With change from 2.4 to 2.6 I can not put the spare disk to sleep or  
 
 standby. 
 It wakes up after some seconds. 
 /proc/diskstat shows activities every 2 to 5 seconds. 
 It is a problem of the plain kernel driver ( not application ) 
 because 
 if i raidhotremove the drives it can sleep/standby and no activities 
 are 
 shown in /proc/diskstats. 
 If the whole Array is unmounted, there are no activities an all 
 drives. 
 It seems that access to md always causes an access at the spare disk 
 ?! 
 
 Any hints ? Anyone with the same problem ? 
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG (Deadlock) in 2.6.10

2005-03-07 Thread Neil Brown
On Sunday February 27, [EMAIL PROTECTED] wrote:
 Hello.
 
 Just for your information: There is a deadlock in the following situation:
 
 MD2 is Raid 0 with 3 disks. sda1 sdb1 sdc1
 MD3 is Raid 0 with 3 disks. sdd1 sde1 sdf1
 MD4 is Raid 1 with 2 disks. MD2 and MD3!!
 
 If a disk in MD2 fails, MD2 completely fails. MD4 SHOULD now mark disk 1 
 (MD2) as faulty but does 
 not. Instead there is a dead-lock. sync hangs as well. Had to
 reboot.

Sorry for the delay in replying.

I cannot reproduce this. disk '1' of md4 (which is md2) gets moved to
position '2' and marked faulty.

You are running a pure 2.6.10 kernel are you? no patches at all?
Any other details you could provide?


 
 I am now using the native Raid 10. Is this stable enough?

Should be.  It hasn't had as much testing as raid1, but I'm glad you
are helping out by giving it a bit more.

NeilBrown


 
 Best regards,
 Chris - RapidTec
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Spare disk could not sleep / standby

2005-03-07 Thread Neil Brown
On Tuesday March 8, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
  It is writes, but don't be scared.  It is just super-block updates.
  
  In 2.6, the superblock is marked 'clean' whenever there is a period of
  about 20ms of no write activity.  This increases the chance on a
  resync won't be needed after a crash.
  (unfortunately) the superblocks on the spares need to be updated too.
 
 Ack, one of the cool things that a linux md array can do that others
 can't is imho that the disks can spin down when inactive.  Granted,
 it's mostly for home users who want their desktop RAID to be quiet
 when it's not in use, and their basement multi-terabyte facility to
 use a minimum of power when idling, but anyway.
 
 Is there any particular reason to update the superblocks every 20
 msecs when they're already marked clean?


It doesn't (well, shouldn't and I don't think it does).
Before the first write, they are all marked 'active'.
Then after 20ms with no write, they are all marked 'clean'.
Then before the next write they are all marked 'active'.

As the event count needs to be updated every time the superblock is
modified, the event count will be updated for every active-clean or
clean-active transition.  All the drives in an array must have the
same value for the event count, so the spares need to be updated even
though they, themselves, aren't exactly 'active' or 'clean'.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Spare disk could not sleep / standby

2005-03-07 Thread Neil Brown
On Tuesday March 8, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
  Then after 20ms with no write, they are all marked 'clean'.
  Then before the next write they are all marked 'active'.
  
  As the event count needs to be updated every time the superblock is
  modified, the event count will be updated forever active-clean or
  clean-active transition.
 
 So..  Sorry if I'm a bit slow here.. But what you're saying is:
 
 The kernel marks the partition clean when all writes have expired to disk.
 This change is propagated through MD, and when it is, it causes the
 event counter to rise, thus causing a write, thus marking the
 superblock active.  20 msecs later, the same scenario repeats itself.
 
 Is my perception of the situation correct?

No.  Writing the superblock does not cause the array to be marked
active.
If the array is idle, the individual drives will be idle.


 
 Seems like a design flaw to me, but then again, I'm biased towards
 hating this behaviour since I really like being able to put inactive
 RAIDs to sleep..

Hmmm... maybe I misunderstood your problem.  I thought you were just
talking about a spare not being idle when you thought it should be.
Are you saying that your whole array is idle, but still seeing writes?
That would have to be something non-md-specific I think.

NeilBrown

 -
 To unsubscribe from this list: send the line unsubscribe linux-raid in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: md Grow for Raid 5

2005-03-08 Thread Neil Brown
On Tuesday March 8, [EMAIL PROTECTED] wrote:
 
 berk walker wrote:
  Have you guys seen/tried mdadm 1.90?  I am delightfully experiencing the 
 
 I believe the mdadm based grow does not work for raid5, but only for 
 raid0 or raid1. raidreconf is actually capable of adding disks to raid5 
 and re-laying out the stripes / moving parity blocks, etc

There are different dimensions for growing.
You can make the component devices bigger, or you can add component
devices.  You can increase storage or you can increase redundancy.

If you replace all the devices in a raid1, raid5, or raid6 with larger
devices (presumably one at a time allowing for a reconstruct each
time) then mdadm will allow you to grow the array to make use of the
extra space.

mdadm will also allow you to grow a raid1 array by adding extra
devices.  This only increases redundancy, not capacity.

I have code to allow you to grow a linear array by adding a drive to
it.  I'm not sure if I have submitted this code.

I plan a raid4 version that organises the data in a linear rather than
a striped fashion.   It would be quite simple for mdadm to 'grow' this
sort of array.

All of these do not require moving data around, so they are easy.
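
(For the record, the commands for those are along the lines of
   mdadm --grow /dev/mdX --size=max
once every component has been replaced with a bigger one, and
   mdadm --grow /dev/mdX --raid-devices=3
to add a third mirror to a raid1 - a sketch only, check the mdadm
manpage for the version you have.)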

Growing a raid5 or raid6 by adding another drive is conceptually
possible to do while the array is online, but I have no definite
plans to do this (I would like to).  Growing a raid5 into a raid6
would also be useful.
These require moving lots of data around, and need to be able to cope
with drive failure and system crash - a fun project...

As has been said, raidreconf does at least some of this off-line.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH md 0 of 4] Introduction

2005-03-08 Thread Neil Brown
On Monday March 7, [EMAIL PROTECTED] wrote:
 NeilBrown [EMAIL PROTECTED] wrote:
 
  The first two are trivial and should apply equally to 2.6.11
  
   The second two fix bugs that were introduced by the recent 
   bitmap-based-intent-logging patches and so are not relevant
   to 2.6.11 yet. 
 
 The changelog for the Fix typo in super_1_sync patch doesn't actually say
 what the patch does.  What are the user-visible consequences of not fixing
 this?

---
This fixes possible inconsistencies that might arise in a version-1 
superblock when devices fail and are removed.

Usage of version-1 superblocks is not yet widespread and no actual
problems have been reported.

 
 
 Is the bitmap stuff now ready for Linus?

I agree with Paul - not yet.
I'd also like to get a bit more functionality in before it goes to
Linus, as the functionality may necessitate in interface change (I'm
not sure).
Specifically, I want the bitmap to be able to live near the superblock
rather than having to be in a file on a different filesystem.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH md 0 of 4] Introduction

2005-03-08 Thread Neil Brown
On Tuesday March 8, [EMAIL PROTECTED] wrote:
 
 But I digress. My immediate problem is that writes must be queued
 first. I thought md traditionally did not queue requests, but instead
 used its own make_request substitute to dispatch incoming requests as
 they arrived.
 
 Have you remodelled the md/raid1 make_request() fn?

Somewhat.  Write requests are queued, and raid1d submits them when
it is happy that all bitmap updates have been done.

There is no '1/100th' second or anything like that.
When a write request arrives, the queue is 'plugged', requests are
queued, and bits in the in-memory bitmap are set.
When the queue is unplugged (by the filesystem or timeout) the bitmap
changes (if any) are flushed to disk, then the queued requests are
submitted. 

Bits on disk are cleaned lazily.


Note that for many applications, the bitmap does not need to be huge.
4K is enough for 1 bit per 2-3 megabytes on many large drives.
Having to sync 3 meg when just one block might be out-of-sync may seem
like a waste, but it is heaps better than syncing 100Gig!!

If a resync without bitmap logging takes 1 hour, I suspect a resync
with a 4K bitmap would have a good chance of finishing in under 1
minute (Depending on locality of references).  That is good enough for
me.
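
(To put my own numbers on that: a 4K bitmap is 32768 bits, so on a
100GB device each bit covers roughly 100GB/32768, i.e. about 3MB.
If only a handful of bits are dirty after a crash, you resync a few
tens of megabytes instead of the whole 100GB.)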

Of course, if one mirror is on the other side of the country, and a
normal sync requires 5 days over ADSL, then you would have a strong
case for a finer grained bitmap.

 
 And if so, do you also aggregate them? And what steps are taken to
 preserve write ordering constraints (do some overlying file systems
 still require these)?

filesystems have never had any write ordering constraints, except that
IO must not be processed before it is requested, nor after it has been
acknowledged.  md continues to obey these restraints.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mdadm --dangerous-no-resync equivalent

2005-03-09 Thread Neil Brown
On Thursday March 10, [EMAIL PROTECTED] wrote:
 Hi,
 
 I have an installer (http://sourceforge.net/projects/terraformix/) that
 creates Raid 1 arrays, previously the arrays were created with mkraid
 using the --dangerous-no-resync option. I am now required to build the
 arrays with mdadm and have the following questions ;
 
 1) Is there an equivalent of --dangerous-no-resync in mdadm ?

No, though there might be one day, in which case it would be
 --assume-clean
(which works with --build, but is currently ignored for --create).

 
 2) Can I just go ahead and install onto a newly created RAID 1 array
 without waiting for it to resync ?

Yes.

 
 3) Can I just go ahead and install onto a newly created RAID 5 array
 without waiting for it to resync ?

Yes.


 
 Thanks in advance.

You're welcome.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Problems with Linux RAID in kernel 2.6

2005-03-10 Thread Neil Brown
On Thursday March 10, [EMAIL PROTECTED] wrote:
 Hi,
 
   I have many problems with RAID in kernel 2.6.10.
..
   And dmesg says:
 
 md: raidstart(pid 2944) used deprecated START_ARRAY ioctl. This will not  
 <-- !!!
 be supported beyond 2.6   
 <-- !!!

Take the hint.  Don't use 'raidstart'.  It seems to work, but it will
fail you when it really counts.  In fact, I think it is failing for
you now.

Use mdadm to assemble your arrays.

 
   And with mdadm also fails:
 
 centralmad:~# mdadm -R /dev/md2 
 mdadm: failed to run array /dev/md2: Invalid argument

You are using mdadm wrongly.  
You want something like:
  mdadm --assemble /dev/md2  /dev/sdb2 /dev/sdc2




 centralmad:~# cat /proc/mdstat 
 Personalities : [raid1] 
 unused devices: <none>
 
   Moreover, dmesg says the md driver fails:
 
 md: bug in file drivers/md/md.c, line 1514

This is a (rather non-helpful) way of saying that you tried to start
an array which contained no devices.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] md bitmap bug fixes

2005-03-14 Thread Neil Brown
On Monday March 14, [EMAIL PROTECTED] wrote:
 On 2005-03-14T21:22:57, Neil Brown [EMAIL PROTECTED] wrote:
 
   Hi there, just a question about how the bitmap stuff works with
   1++-redundancy, say RAID1 with 2 mirrors, or RAID6.
  I assume you mean RAID1 with 3 drives (there isn't really one main
  drive and all the others are mirrors - all drives are nearly equal).
 
 Yeah, that's what I meant.
 
 (BTW, if they are all equal, how do you figure out where to sync
 from?

It arbitrarily chooses one.  It doesn't matter which.  The code
currently happens to choose the first, but this is not a significant choice.

 Isn't the first one also the first one to receive the writes, so
 unless it's somehow identified as bad, it's the one which will have the
 best data?)

Data is written to all drives in parallel (the request to the first
might be launched slightly before the second, but the difference is
insignificant compared to the time it takes for the write to
complete). 

There is no such thing as "best" data.
Consider the situation where you want to make a transactional update
to a file that requires writing two blocks.
If the system dies while writing the first, the "before" data is
better.  If it dies while writing the second, the "after" data is
better. 

 
  We haven't put any significant work into bitmap intent logging for
  levels other than raid1, so some of the answer may be pure theory.
 
 OK.
 
 (Though in particular for raid5 with the expensive parity and raid6 with
 the even more expensive parity this seems desirable.)

Yes.  We will get there.  We just aren't there yet so I cannot say
with confidence how it will work.

 
 I think each disk needs to have its own bitmap in the long run. On
 start, we need to merge them.

I think any scheme that involved multiple bitmaps would be introducing
too much complexity.  Certainly your examples sound very far-fetched
(as I think you admitted yourself).  But I always try to be open to
new ideas.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] md bitmap-based asynchronous writes

2005-03-20 Thread Neil Brown
On Thursday March 17, [EMAIL PROTECTED] wrote:
 These three patches provide the ability to perform asynchronous writes 
 with raid1. The asynchronous write capability is primarily useful when 
 raid1 is employed in network replication (i.e., with one or more of the 
 disks located on a remote system). It allows us to acknowledge writes 
 before they reach the remote system, thus making much more efficient use 
 of the network.

Thanks for these.

However I would like to leave them until I'm completely happy with the
bitmap resync, including
  - storing bitmap near superblock  (I have this working)
  - hot-add bitmap  (I've started on this)
  - support for raid5/6 and hopefully raid10.  (still to come)

and I would really like
   only-kick-drives-on-write-errors-not-read-errors

to be the next priority.  After that, I will definitely look at async
writes.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Strangeness when booting raid1: md5 already running?

2005-03-21 Thread Neil Brown
On Monday March 21, [EMAIL PROTECTED] wrote:
 Folks,
 
 I had to pull the plug on my box today, and when it rebooted got this
 rather strange raid issue. The box has 5 raid1 arrays, consisting of 5
 partitions on each of 2 drives. When it rebooted, the md5 array came up
 like so:
 
 raid1: raid set md0 active with 2 out of 2 mirrors
 md: ... autorun DONE.
 md: Autodetecting RAID arrays.
 md: autorun ...
 md: considering hda9 ...
 md:  adding hda9 ...
...repeated several times
 md: export_rdev(hda9)
 md: ... autorun DONE.
 EXT3-fs: INFO: recovery required on readonly filesystem.
 EXT3-fs: write access will be enabled during recovery.
 EXT3-fs: recovery complete.
 
 Now this was the first time the string md5 appears in the log. And
 indeed, it appears that hda9 has been kicked out of the array:

So was md5 actually running (what did /proc/mdstat show? What about
mdadm -D /dev/md5?).
My guess is that your startup scripts, possibly in an initrd, used the
RAID_AUTORUN command several times, and each time it found the drive
which wasn't part of an array but couldn't be added to one.  Nothing
particularly interesting.

If md5 hadn't actually been started, then that is interesting.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID1: no resync after crash?

2005-03-21 Thread Neil Brown
On Friday March 18, [EMAIL PROTECTED] wrote:
 
 Is there perhaps some bug that denies a resync on a degraded
 RAID1 even if there is more than one mirror operational?
 

Yes :-(

The following patch might fix it...
I guess I should double check and submit something to Marcelo. 
Thanks for reporting this.

NeilBrown


Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./drivers/md/raid1.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletion(-)

diff ./drivers/md/raid1.c~current~ ./drivers/md/raid1.c
--- ./drivers/md/raid1.c~current~   2004-08-16 10:17:11.0 +1000
+++ ./drivers/md/raid1.c2005-03-22 09:47:11.0 +1100
@@ -1737,10 +1737,11 @@ static int raid1_run (mddev_t *mddev)
}
}
 
-   if (!start_recovery  !(sb-state  (1  MD_SB_CLEAN)) 
+   if (!(sb-state  (1  MD_SB_CLEAN)) 
(conf-working_disks  1)) {
const char * name = raid1syncd;
 
+   start_recovery = 0;
conf-resync_thread = md_register_thread(raid1syncd, conf,name);
if (!conf-resync_thread) {
printk(THREAD_ERROR, mdidx(mddev));
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID1 and data safety?

2005-03-21 Thread Neil Brown
On Wednesday March 16, [EMAIL PROTECTED] wrote:
 Just wondering;
 
 Is there any way to tell MD to do verify-on-write and
 read-from-all-disks on a RAID1 array?

No.
I would have thought that modern disk drives did some sort of
verify-on-write (else how would they detect write errors?), and they
are certainly in the best place to do verify-on-write.
Doing it at the md level would be problematic as you would have to
ensure that you really were reading from the media and not from some
cache somewhere in the data path.  I doubt it would be a mechanism
that would actually increase confidence in the safety of the data.

read-from-all-disks would require at least three drives before there
would be any real value in it.  There would be an enormous overhead,
but possibly that could be justified in some circumstances.  If we
ever implement background-data-checking, it might become relatively
easy to implement this.

However I think that checksum based checking would be more effective,
and that it should be done at the filesystem level.

Imagine a filesystem that could access multiple devices, and where,
when it kept index information, it didn't just keep one block address,
but rather kept two block addresses, each on a different device, and a
strong checksum of the data block.  This would allow much the same
robustness as read-from-all-drives with much lower overhead.

It is very possible that Sun's new ZFS filesystem works like this,
though I haven't seen precise technical details.
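
To illustrate (purely my own sketch, not md code and certainly not
ZFS's actual on-disk format), such an index entry would look something
like:

#include <stdint.h>

/* Each logical block is recorded twice, on different devices, together
 * with a strong checksum of its contents.  A read can verify copy 0 and
 * fall back to copy 1 (or vice versa) if the checksum doesn't match.
 */
struct mirrored_block_ptr {
        uint32_t dev[2];        /* two different member devices        */
        uint64_t block[2];      /* block address of each copy          */
        uint8_t  csum[32];      /* strong checksum of the data block   */
};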

In summary:
 - you cannot do it now.
 - I don't think md is at the right level to solve these sort of problems.
   I think a filesystem could do it much better. (I'm working on a
   filesystem  slowly...)
 - read-from-all-disks might get implemented one day. verify-on-write
   is much less likely.

 
 Apologies if the answer is in the docs.

It isn't.  But it is in the list archives now

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mdadm command to trigger Raid Recovery?

2005-03-21 Thread Neil Brown
On Saturday March 19, [EMAIL PROTECTED] wrote:
 Hi,
 
 What exactly is the command to recover a raid array after system crash? I
 created a raid5 array with 3 disks.  After system crash, one disk is out
 of sync, so I tried the command
 
 mdadm --assemble --run --force --update=resync /dev/md2 /dev/sbd3  /dev/sbd4 
 /dev/sbd5
 
 However, I still saw that raid array was started with 2 disks:
 mdadm: /dev/md/2 has been started with 2 drives (out of 3).
 
 cat /proc/mdstat still shows that /dev/sbd5 is inactive.
 Personalities : [raid5]
 md2 : active raid5 sbd3[0] sbd4[1]
   3968 blocks level 5, 64k chunk, algorithm 2 [3/2] [UU_]
 
 If --update=resync isn't the right option to sync a raid array, what
 command am I supposed to use?

I think all you have to do at this point is add sbd5 back into the
array:
   mdadm /dev/md2 --add /dev/sbd5

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Questions regarding readonly/readwrite semantics

2005-03-23 Thread Neil Brown
On Tuesday March 22, [EMAIL PROTECTED] wrote:
 Hello,
 
 in the beginning I had just one simple question :) ...
 Is there any way to start RAIDs in readonly mode while autodetection
 on system boot?

No.
The read-only mode has not been well thought out in md, and I have not
yet put any effort into fixing it.  I might one day, but it is not a
high priority.

The only way you will find answers to your "why does it fail like
this" questions is to hunt through the source code.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Software RAID0 behaviour/performance on ATA drives

2005-03-24 Thread Neil Brown
On Friday March 25, [EMAIL PROTECTED] wrote:
 The recommended setup for doing RAID0 with ATA drives is
 that each hard drive should be on its own IDE channel.  The
 reasoning is so they can operate concurrently... i.e.  if you
 put them in a master-slave configuration on the same channel,
 you lose the benefits of striping since both drives cannot
 use the channel at the same time and have to wait for the
 others' read/write to finish first.
 
 QUESTION:  Is this really accurate?  Is the md driver smart/multi-threaded
 enough to read/write ahead non-consecutive sectors simultaneously?
 When writing, for example, the 1st to 4th chunks of a file across
 a striped two-drive array, would it write the 2nd and 4th
 chunks on one drive without waiting for the 1st and 3rd chunks
 to finish writing on the other?
 
 If it can't, and the writes have to proceed in sequence/lockstep,
 how can putting the striped drives on separate channels help?

The raid0 driver is 'clever' at all.
It is given requests by the filesystem or mm subsystem, maps them to
the correct device/sector, and sends them straight on to the
appropriate driver.  It never waits for requests, just maps and
forwards.
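
If it helps, the mapping itself is roughly this (a toy simplification
of mine, not the kernel's raid0 code; it assumes equal-sized member
disks and a chunk size given in sectors):

#include <stdio.h>
#include <stdint.h>

struct target { uint32_t disk; uint64_t sector; };

static struct target raid0_map(uint64_t array_sector,
                               uint32_t nr_disks, uint32_t chunk_sectors)
{
        uint64_t chunk  = array_sector / chunk_sectors;  /* which chunk            */
        uint64_t offset = array_sector % chunk_sectors;  /* position in that chunk */
        struct target t = {
                .disk   = (uint32_t)(chunk % nr_disks),  /* chunks go round-robin  */
                .sector = (chunk / nr_disks) * chunk_sectors + offset,
        };
        return t;
}

int main(void)
{
        /* e.g. a 2-disk array with 64k chunks (128 sectors of 512 bytes) */
        struct target t = raid0_map(300, 2, 128);
        printf("array sector 300 -> disk %u, sector %llu\n",
               (unsigned)t.disk, (unsigned long long)t.sector);
        return 0;
}

With a 64k chunk, array sector 300 maps to disk 0, sector 172;
consecutive chunks alternate between the disks, which is what lets
independent requests proceed in parallel.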

So if the  filesystem sends 128 4k read-ahead requests to the raid0
driver it will forward each one to the relevant device and, depending
on chunk size etc, you might get, say, 32 4K requests sent to each of
4 drives.  The drives would (depending on the internals of the driver)
processes all these requests in parallel.

In your example, if the filesystem or mm subsystem submitted writes
for 4 consecutive chunks on a two-drive raid0 array without waiting
for earlier ones to complete before submitting later ones, then they
would all get to the device driver in a timely fashion, and the device
driver(s) should be able to drive the two drives in parallel.

So if the writer handles the required parallelism, and the devices
handle the required parallelism, then the raid0 layer won't interfere
at all. 

Hope that makes it clear.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Software RAID0 behaviour/performance on ATA drives

2005-03-25 Thread Neil Brown
On Friday March 25, [EMAIL PROTECTED] wrote:
  
  The raid0 driver is 'clever' at all.

Hmm.. that should have been "The raid0 driver isn't 'clever' at all."
   ^

  It is given requests by the filesystem or mm subsystem, maps them to
  the correct device/sector, and sends them straight on to the
  appropriate driver.  It never waits for requests, just maps and
  forwards.
  
  So if the  filesystem sends 128 4k read-ahead requests to the raid0
  driver it will forward each one to the relevant device and, depending
  on chunk size etc, you might get, say, 32 4K requests sent to each of
  4 drives.  The drives would (depending on the internals of the driver)
  processes all these requests in parallel.
  
  In your example, if the filesystem or mm subsystem submitted writes
  for 4 consecutive chunks on a two-drive raid0 array without waiting
  for earlier ones to complete before submitting later ones, then they
  would all get to the device driver in a timely fashion, and the device
  driver(s) should be able to drive the two drives in parallel.
  
  So if the writer handles the required parallelism, and the devices
  handle the required parallelism, then the raid0 layer won't interfere
  at all. 
  NeilBrown
 Hi Neil
 
 I am curious about that, too. In my memory, the IDE channel can only
 allow one IDE device to read/write at once. If one drive (e.g. master)
 was writing, the second drive (e.g. slave) must wait until the master
 drive has finished...
 
 If you were right, can I plug the two disk into the same IDE channel in
 raid 1 without losing any performance?

Right about what?  I don't think that anything that I wrote can
reasonably be interpreted to say that you can just use one channel
without losing performance.

I said "if ... the devices handle the required parallelism".  I don't
know much in detail about IDE (I use SCSI mostly) but if it is true
that you cannot talk to a slave and a master at the same time, then
the drives DO NOT handle the required parallelism, so you don't get
any parallelism.  
It is basically completely out of md's hands.  It won't get in the way
of parallelism, but it won't make it magically happen if the
drives/controller/drivers cannot make it happen.

Is that any clearer?  Or did I misunderstand you?

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Raid1 problem can't add remove or mark faulty -- it did work

2005-03-26 Thread Neil Brown
On Saturday March 26, [EMAIL PROTECTED] wrote:
 i have a strange problem -- can't get a fully functional 2 drive raid
 back up and running -- it may or may not be a drive bios interaction, don't
 know. none of the mdadm manage functions will work: add, remove or mark faulty
 i have purged and reinstalled the mdadm package twice.
 
 below is all the info i could think  of .   the kernel is 2.6.10 
 patched  -- stock kanotix
 either drive will boot and the behavior is the same no matter which one 
 is active

For future reference, extra information which would be helpful
includes
   cat /proc/mdstat
   mdadm -D /dev/md0
  and any 'dmesg' messages that are generated when 'mdadm' fails.

It appears that hda1 and hdc1 are parts of the raid1, and hdc1 is the
'freshest' part so when you boot, the array is assembled with just one
drive: hdc1.  hda1 is not included because it appears to be out of
date, and so presumably failed at some time.

The thing that should be done is to add hda1 with
   mdadm /dev/md0 -a /dev/hda1

It appears that you tried this and it failed.  When it failed there
should have been kernel messages generated.  I need to see these.

 what happens when i try a hot add remove or set faulty
  
 [EMAIL PROTECTED]:/home/rob# mdadm /dev/md0 -a /dev/hda1
 mdadm: hot add failed for /dev/hda1: Invalid argument

This should have worked, but didn't.  The kernel messages should
indicate why.


 [EMAIL PROTECTED]:/home/rob# mdadm /dev/md0 -a /dev/hdc1
 mdadm: hot add failed for /dev/hdc1: Invalid argument
 

hdc1 is already part of md0. Adding it is meaningless.


 [EMAIL PROTECTED]:/home/rob# mdadm /dev/md0 -r /dev/hda1
 mdadm: hot remove failed for /dev/hda1: No such device or address

You cannot remove hda1 because it isn't part of the array.

 [EMAIL PROTECTED]:/home/rob# mdadm /dev/md0 -r /dev/hdc1
 mdadm: hot remove failed for /dev/hdc1: Device or resource busy
 

You cannot remove hdc1 because it is actively in use in the array.
You can only remove failed drives or spares.


 [EMAIL PROTECTED]:/home/rob# mdadm /dev/md0 -f /dev/hda1
 mdadm: set device faulty failed for /dev/hda1:  No such device

You cannot fail hda1 because it isn't part of md0.

 [EMAIL PROTECTED]:/home/rob# mdadm /dev/md0 -f /dev/hdc1
 mdadm: set /dev/hdc1 faulty in /dev/md0

Ooops... you just failed the only drive in the raid1 array, md0 will
no longer be functional... until you reboot and the array gets
re-assembled.  Failing a drive does not write anything to it, so you
won't have hurt any drive by doing this, just made the array stop
working for now.


NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: AW: AW: RAID1 and data safety?

2005-03-29 Thread Neil Brown
On Tuesday March 29, [EMAIL PROTECTED] wrote:
 But:
 If you have a raid1 and a journaling fs, see the following:
 If the system crashes at the end of a write transaction,
 then the end-of-transaction information may have been written
 to hda already, but not to hdb. On the next boot, the 
 journaling fs may see an overall unclean bit (*probably* a transaction
 is pending), so it reads the transaction log. 
 
 And here the fault happens:
 By chance, it reads the transaction log from hda, then sees that the
 transaction was finished, and clears the overall unclean bit. 
 This cleaning is a write, so it goes to *both* HDs.
 
 Situation now: On hdb there is a pending transaction in the transaction 
 log, but the overall unclean bit is cleared. This may not be realised
 until, by chance, a year later hda crashes, and you finally face the fact
 that there is a corrupt situation on the remaining HD.

Wrong.  There is nothing of the sort on hdb.  
Due to the system crash the data on hdb is completely ignored.  Data
from hda is copied over onto it.  Until that copy has completed,
nothing is read from hdb.

You could possibly come up with a scenario where the above happens but
while the copy from hda->hdb is happening, hda dies completely so
reads have to start happening from hdb.  
md could possibly handle this situation better (ensure a copy has
happened for any block before a read of that block succeeds), but I
don't think it is at all likely to be a real-life problem.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: once again raid5

2005-03-31 Thread Neil Brown
On Thursday March 31, [EMAIL PROTECTED] wrote:
 Hi,
 
 we still have troubles with our raid5 array. You can find the history of 
 the fault in detail in my other postings (11.3.2005).
 
 I will show you my attempts.
 
 There are 4 discs (Maxtor 250GB) in a raid5-array. One disc failed and 
 we sent it back to Maxtor. Now, the array consists of 3 discs.
 I tried to reassemble it,
 
 mdadm -A --run --force /dev/md2 /dev/hdi1 /dev/hdk1  /dev/hdo1
 
 but i got an error:
 
 -snip-
 mdadm: failed to RUN_ARRAY /dev/md2: Invalid argument
 -snap-

It looks like hdi1 doesn't think it is an active part of the array. 
It is just a spare. 
It is as though the array was not fully synced when hdm1 (?) failed.

Looking back through previous emails, it looks like you had 2 drives
fail in a raid5 array.  This means you lose. :-(

Your best bet would be:

  mdadm --create /dev/md2 --level 5 -n 4 /dev/hdi1 /dev/hdk1 missing /dev/hdo1

and hope that the data you find on md2 isn't too corrupted.  You might be
lucky, but I'm not holding my breath - sorry.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid1-diseaster on reboot: old version overwrites new version

2005-04-02 Thread Neil Brown
On Saturday April 2, [EMAIL PROTECTED] wrote:
 
 * What did I do wrong?
 
 The only explantion to me is, that I had the wrong entry in my 
 lilo.conf. I had root=/dev/hda6 there instead of root=/dev/md2
 So maybe root was always mounted as /dev/hda6 and never as /dev/md2, 
 which was started, but never had any data written to it. Is this a 
 possible explanation?

Yep, this completely explains everything.
/ was *not* on /dev/md2, it was on /dev/hda6 which also happened to be
a part of an unused raid1 array.

After a crash, the raid1 array did a resync copying from hdc6 to
hda6.  Very sad.  Very good that you had backups.

2.6 won't let you do this: you cannot have a partition in a raid array
and mounted as a filesystem at the same time.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


ANNOUNCE: mdadm 1.10.0 - A tool for managing Soft RAID under Linux

2005-04-04 Thread Neil Brown


I am pleased to announce the availability of 
   mdadm version 1.10.0
It is available at
   http://www.cse.unsw.edu.au/~neilb/source/mdadm/
and
   http://www.{countrycode}.kernel.org/pub/linux/utils/raid/mdadm/

as a source tar-ball and (at the first site) as an SRPM, and as an RPM for i386.

mdadm is a tool for creating, managing and monitoring
device arrays using the md driver in Linux, also
known as Software RAID arrays.

Release 1.10.0 adds:
-   Fix bug with --config=partitions
-   Open sub-devices with O_EXCL to detect if already in use
-   Make sure superblock updates are flushed directly to disk.

The first update is the most significant (mdadm -Esc partitions would crash).
The others are mildly useful extras.

Development of mdadm is sponsored by [EMAIL PROTECTED]: 
  The School of Computer Science and Engineering
at
  The University of New South Wales

NeilBrown  04 April 2005

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How to move a singleton raid1 drive from raid1 /dev/md2 to /dev/md1

2005-04-06 Thread Neil Brown
On Thursday April 7, [EMAIL PROTECTED] wrote:
 Hi Software Raid Gurus!:
 
 I have 
 
 A1:~# cat /proc/mdstat
 Personalities : [raid1]
 md0 : active raid1 hdb1[0] hdg1[1]
   244195904 blocks [2/2] [UU]
 
 md1 : active raid1 hdc1[0]
   244195904 blocks [1/1] [U]
 
 md2 : active raid1 hde1[0]
   244195904 blocks [1/1] [U]
 
 Now I want to take /dev/hde1 and get rid of /dev/md2 and add /dev/hde1 
 to /dev/md1.

md1 is a 1-drive array.  To make it a 2-drive array you need a
recent 2.6 kernel and a recent release of mdadm.  If you have these,
then

  mdadm -S /dev/md2  # shutdown md2 and release hde1
  mdadm /dev/md1 -a /dev/hde1 # add hde1 as a spare in md1
  mdadm -G -n2 /dev/md1 # grow md1 to have two devices.

If you don't have 2.6, then you will have to recreate md1 with
the desired number of components, which will require unmounting it.

   mdadm -S /dev/md2
   mdadm -S /dev/md1
   mdadm -C /dev/md1 -l1 -n2 /dev/hdc1 missing

   mdadm /dev/md1 -a /dev/hde1

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH md 001 of 2] Close a small race in md thread deregistration

2005-04-07 Thread Neil Brown
On Thursday April 7, [EMAIL PROTECTED] wrote:
 
 That code all seems a bit crufty to me.  Sometime it would be good to stop
 using signals in-kernel and to use the kthread API for thread startup and
 shutdown.

I've just added that to my TODO list... thanks for the suggestion.

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


ANNOUNCE: mdadm 1.11.0 - A tool for managing Soft RAID under Linux

2005-04-10 Thread Neil Brown

I am pleased (embarrassed?) to announce the availability of 
   mdadm version 1.11.0
It is available at
   http://www.cse.unsw.edu.au/~neilb/source/mdadm/
and
   http://www.{countrycode}.kernel.org/pub/linux/utils/raid/mdadm/

as a source tar-ball and (at the first site) as an SRPM, and as an RPM for i386.

mdadm is a tool for creating, managing and monitoring
device arrays using the md driver in Linux, also
known as Software RAID arrays.

Release 1.11.0 adds:
-   Fix embarrassing bug in 1.10.0 which causes --add to always fail.
  thanks to Dave Jiang djiang at mvista dot com for reporting it.

Development of mdadm is sponsored by [EMAIL PROTECTED]: 
  The School of Computer Science and Engineering
at
  The University of New South Wales

NeilBrown  11 April 2005

-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: out of sync raid 5 + xfs = kernel startup problem

2005-04-12 Thread Neil Brown
On Tuesday April 12, [EMAIL PROTECTED] wrote:
 My raid5 system recently went through a sequence of power outages.  When 
 everything came back on the drives were out of sync.  No big issue... 
 just sync them back up again.  But something is going wrong.  Any help 
 is appreciated.  dmesg provides the following (the network stuff is 
 mixed in):
 
..
 md: raidstart(pid 220) used deprecated START_ARRAY ioctl. This will not 
 be supported beyond 2.6

First hint.  Don't use 'raidstart'.  It works OK when everything is
working, but when things aren't working, raidstart makes it worse.

 md: could not bd_claim sdf2.

That's odd... Maybe it is trying to 'claim' it twice, because it
certainly seems to have got it below..

 md: autorun ...
 md: considering sdd2 ...
 md:  adding sdd2 ...
 md:  adding sde2 ...
 md:  adding sdf2 ...
 md:  adding sdc2 ...
 md:  adding sdb2 ...
 md:  adding sda2 ...
 md: created md0
 md: bind<sda2>
 md: bind<sdb2>
 md: bind<sdc2>
 md: bind<sdf2>
 md: bind<sde2>
 md: bind<sdd2>
 md: running: <sdd2><sde2><sdf2><sdc2><sdb2><sda2>
 md: kicking non-fresh sdd2 from array!

So sdd2 is not fresh.  Must have been missing at one stage, so it
probably has old data.


 md: unbind<sdd2>
 md: export_rdev(sdd2)
 md: md0: raid array is not clean -- starting background reconstruction
 raid5: device sde2 operational as raid disk 4
 raid5: device sdf2 operational as raid disk 3
 raid5: device sdc2 operational as raid disk 2
 raid5: device sdb2 operational as raid disk 1
 raid5: device sda2 operational as raid disk 0
 raid5: cannot start dirty degraded array for md0


Here's the main problem.

You've got a degraded, unclean array.  i.e. one drive is
failed/missing and md isn't confident that all the parity blocks are
correct due to an unclean shutdown (could have been in the middle of a
write). 
This means you could have undetectable data corruption.

md wants you to know this and not assume that everything is perfectly
OK.

You can still start the array, but you will need to use
  mdadm --assemble --force
which means you need to boot first ... got a boot CD?

I should add a "raid=force-start" or similar boot option, but I
haven't yet.

So, boot somehow, and
  mdadm --assemble /dev/md0 --force /dev/sd[a-f]2

  mdadm /dev/md0 -a /dev/sdd2

 wait for sync to complete (not absolutely needed).

Reboot.

 XFS: SB read failed
 Unable to handle kernel NULL pointer dereference at  RIP:
 802c4d5d{raid5_unplug_device+13}

Hmm.. This is a bit of a worry.. I should be doing
mddev->queue->unplug_fn = raid5_unplug_device;
mddev->queue->issue_flush_fn = raid5_issue_flush;
a bit later in drivers/md/raid5.c(run), after the last 'goto
abort'... I'll have to think through it a bit though to be sure.
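
Schematically, the reordering I mean looks like this (stub types and a
made-up setup_conf helper, not the real raid5.c code):

/* Only install the queue hooks once initialisation can no longer fail,
 * so raid5_unplug_device is never reachable with a half-constructed conf.
 */
struct queue { void (*unplug_fn)(void *); int (*issue_flush_fn)(void *); };
struct mddev { struct queue *queue; void *private; };

void raid5_unplug_device(void *q);
int  raid5_issue_flush(void *q);
void *setup_conf(struct mddev *mddev);          /* stands in for the real setup */

int run_sketch(struct mddev *mddev)
{
        void *conf = setup_conf(mddev);         /* can fail for many reasons    */

        if (!conf)
                goto abort;
        /* ... more validation; every failure path does 'goto abort' ... */

        /* installed here, only after the last 'goto abort': */
        mddev->queue->unplug_fn = raid5_unplug_device;
        mddev->queue->issue_flush_fn = raid5_issue_flush;
        mddev->private = conf;
        return 0;
abort:
        return -1;
}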

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

