Re: AACRAID driver broken in 2.6.22.x (and beyond?) [WAS: Re: 2.6.22.16 MD raid1 doesn't mark removed disk faulty, MD thread goes UN]

2008-01-23 Thread Mike Snitzer
. The diagnostic dump 
 reported by the Adaptec utilities should be able to point to the fault you 
 are experiencing if these appear to be the root causes.

snitzer:
It would seem that 1.1.5-2451 has the firmware reset support given the
log I provided above, no?  Anyway, with 2.6.22.16, when a drive is
pulled using the aacraid 1.1-5[2437]-mh4 driver there are absolutely no errors
from the aacraid driver; in fact the scsi layer doesn't see anything
until I force the issue with explicit reads/writes to the device that
was pulled.  It could be that on a drive pull the 1.1.5-2451 driver
results in a BlinkLED, resets the firmware, and continues.  Whereas
with the 1.1-5[2437]-mh4 I get no BlinkLED and as such Linux (both
scsi and raid1) is completely unaware of any disconnect of the
physical device.

thanks,
Mike

  -Original Message-
  From: Mike Snitzer [mailto:[EMAIL PROTECTED]
  Sent: Tuesday, January 22, 2008 7:10 PM
  To: linux-raid@vger.kernel.org; NeilBrown
  Cc: [EMAIL PROTECTED]; K. Tanaka; AACRAID;
  [EMAIL PROTECTED]
  Subject: AACRAID driver broken in 2.6.22.x (and beyond?)
  [WAS: Re: 2.6.22.16 MD raid1 doesn't mark removed disk
  faulty, MD thread goes UN]
 

  On Jan 22, 2008 12:29 AM, Mike Snitzer [EMAIL PROTECTED] wrote:
   cc'ing Tanaka-san given his recent raid1 BUG report:
   http://lkml.org/lkml/2008/1/14/515
  
  
   On Jan 21, 2008 6:04 PM, Mike Snitzer [EMAIL PROTECTED] wrote:
Under 2.6.22.16, I physically pulled a SATA disk
  (/dev/sdac, connected to
an aacraid controller) that was acting as the local raid1
  member of
/dev/md30.
   
Linux MD didn't see an /dev/sdac1 error until I tried
  forcing the issue by
doing a read (with dd) from /dev/md30:
  
   The raid1d thread is locked at line 720 in raid1.c
  (raid1d+2437); aka
   freeze_array:
  
   (gdb) l *0x2539
   0x2539 is in raid1d (drivers/md/raid1.c:720).
   715          * wait until barrier+nr_pending match nr_queued+2
   716          */
   717         spin_lock_irq(&conf->resync_lock);
   718         conf->barrier++;
   719         conf->nr_waiting++;
   720         wait_event_lock_irq(conf->wait_barrier,
   721                             conf->barrier+conf->nr_pending == conf->nr_queued+2,
   722                             conf->resync_lock,
   723                             raid1_unplug(conf->mddev->queue));
   724         spin_unlock_irq(&conf->resync_lock);
  
   Given Tanaka-san's report against 2.6.23 and me hitting
  what seems to
   be the same deadlock in 2.6.22.16; it stands to reason this affects
   raid1 in 2.6.24-rcX too.
 
  Turns out that the aacraid driver in 2.6.22.x is HORRIBLY BROKEN (when
  you pull a drive); it responds to MD's write requests with uptodate=1
  (in raid1_end_write_request) for the drive that was pulled!  I've not
  looked to see if aacraid has been fixed in newer kernels... are others
  aware of any crucial aacraid fixes in 2.6.23.x or 2.6.24?
 
  After the drive was physically pulled, and small periodic writes
  continued to the associated MD device, the raid1 MD driver did _NOT_
  detect the pulled drive's writes as having failed (verified this with
  systemtap).  MD happily thought the write completed to both members
  (so MD had no reason to mark the pulled drive faulty; or mark the
  raid degraded).
 
  Installing an Adaptec-provided 1.1-5[2451] driver enabled raid1 to
  work as expected.
 
  That said, I now have a recipe for hitting the raid1 deadlock that
  Tanaka first reported over a week ago.  I'm still surprised that all
  of this chatter about that BUG hasn't drawn interest/scrutiny from
  others!?
 
  regards,
  Mike
 



AACRAID driver broken in 2.6.22.x (and beyond?) [WAS: Re: 2.6.22.16 MD raid1 doesn't mark removed disk faulty, MD thread goes UN]

2008-01-22 Thread Mike Snitzer
On Jan 22, 2008 12:29 AM, Mike Snitzer [EMAIL PROTECTED] wrote:
 cc'ing Tanaka-san given his recent raid1 BUG report:
 http://lkml.org/lkml/2008/1/14/515


 On Jan 21, 2008 6:04 PM, Mike Snitzer [EMAIL PROTECTED] wrote:
  Under 2.6.22.16, I physically pulled a SATA disk (/dev/sdac, connected to
  an aacraid controller) that was acting as the local raid1 member of
  /dev/md30.
 
  Linux MD didn't see an /dev/sdac1 error until I tried forcing the issue by
  doing a read (with dd) from /dev/md30:

 The raid1d thread is locked at line 720 in raid1.c (raid1d+2437); aka
 freeze_array:

 (gdb) l *0x2539
 0x2539 is in raid1d (drivers/md/raid1.c:720).
 715          * wait until barrier+nr_pending match nr_queued+2
 716          */
 717         spin_lock_irq(&conf->resync_lock);
 718         conf->barrier++;
 719         conf->nr_waiting++;
 720         wait_event_lock_irq(conf->wait_barrier,
 721                             conf->barrier+conf->nr_pending == conf->nr_queued+2,
 722                             conf->resync_lock,
 723                             raid1_unplug(conf->mddev->queue));
 724         spin_unlock_irq(&conf->resync_lock);

 Given Tanaka-san's report against 2.6.23 and me hitting what seems to
 be the same deadlock in 2.6.22.16; it stands to reason this affects
 raid1 in 2.6.24-rcX too.

Turns out that the aacraid driver in 2.6.22.x is HORRIBLY BROKEN (when
you pull a drive); it responds to MD's write requests with uptodate=1
(in raid1_end_write_request) for the drive that was pulled!  I've not
looked to see if aacraid has been fixed in newer kernels... are others
aware of any crucial aacraid fixes in 2.6.23.x or 2.6.24?

After the drive was physically pulled, and small periodic writes
continued to the associated MD device, the raid1 MD driver did _NOT_
detect the pulled drive's writes as having failed (verified this with
systemtap).  MD happily thought the write completed to both members
(so MD had no reason to mark the pulled drive faulty; or mark the
raid degraded).
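
For context, a trimmed sketch of the completion path in question
(paraphrased from 2.6.22-era drivers/md/raid1.c, not a verbatim copy):
raid1 only calls md_error() -- and thus only marks the member faulty --
when the low-level driver completes the write with an error, so a driver
that reports success for a pulled drive leaves MD none the wiser:

    static int raid1_end_write_request(struct bio *bio,
                                       unsigned int bytes_done, int error)
    {
        int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
        r1bio_t *r1_bio = (r1bio_t *)(bio->bi_private);
        conf_t *conf = mddev_to_conf(r1_bio->mddev);
        int mirror;   /* index of the member this bio was sent to */
        ...
        if (!uptodate) {
            /* write failed: kick the member out of the array */
            md_error(r1_bio->mddev, conf->mirrors[mirror].rdev);
            set_bit(R1BIO_Degraded, &r1_bio->state);
        } else
            /* write reported good: nothing for MD to repair or degrade */
            set_bit(R1BIO_Uptodate, &r1_bio->state);
        ...
    }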

Installing an Adaptec-provided 1.1-5[2451] driver enabled raid1 to
work as expected.

That said, I now have a recipe for hitting the raid1 deadlock that
Tanaka first reported over a week ago.  I'm still surprised that all
of this chatter about that BUG hasn't drawn interest/scrutiny from
others!?

regards,
Mike


2.6.22.16 MD raid1 doesn't mark removed disk faulty, MD thread goes UN

2008-01-21 Thread Mike Snitzer
Under 2.6.22.16, I physically pulled a SATA disk (/dev/sdac, connected to
an aacraid controller) that was acting as the local raid1 member of
/dev/md30.

Linux MD didn't see an /dev/sdac1 error until I tried forcing the issue by
doing a read (with dd) from /dev/md30:

Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Sense Key :
Hardware Error [current]
Jan 21 17:08:07 lab17-233 kernel: Info fld=0x0
Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Add. Sense:
Internal target failure
Jan 21 17:08:07 lab17-233 kernel: end_request: I/O error, dev sdac, sector 71
Jan 21 17:08:07 lab17-233 kernel: printk: 3 messages suppressed.
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 8
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 16
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 24
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 32
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 40
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 48
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 56
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 64
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 72
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 80
Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Sense Key :
Hardware Error [current]
Jan 21 17:08:07 lab17-233 kernel: Info fld=0x0
Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Add. Sense:
Internal target failure
Jan 21 17:08:07 lab17-233 kernel: end_request: I/O error, dev sdac, sector 343
Jan 21 17:08:08 lab17-233 kernel: sd 2:0:27:0: [sdac] Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Jan 21 17:08:08 lab17-233 kernel: sd 2:0:27:0: [sdac] Sense Key :
Hardware Error [current]
Jan 21 17:08:08 lab17-233 kernel: Info fld=0x0
...
Jan 21 17:08:12 lab17-233 kernel: sd 2:0:27:0: [sdac] Add. Sense:
Internal target failure
Jan 21 17:08:12 lab17-233 kernel: end_request: I/O error, dev sdac, sector 3399
Jan 21 17:08:12 lab17-233 kernel: printk: 765 messages suppressed.
Jan 21 17:08:12 lab17-233 kernel: raid1: sdac1: rescheduling sector 3336

However, the MD layer still hasn't marked the sdac1 member faulty:

md30 : active raid1 nbd2[1](W) sdac1[0]
  4016204 blocks super 1.0 [2/2] [UU]
  bitmap: 1/8 pages [4KB], 256KB chunk

The dd I used to read from /dev/md30 is blocked on IO:

Jan 21 17:13:55 lab17-233 kernel: ddD 0afa9cf5c346
0 12337   7702 (NOTLB)
Jan 21 17:13:55 lab17-233 kernel:  81010c449868 0082
 80268f14
Jan 21 17:13:55 lab17-233 kernel:  81015da6f320 81015de532c0
0008 81012d9d7780
Jan 21 17:13:55 lab17-233 kernel:  81015fae2880 4926
81012d9d7970 0001802879a0
Jan 21 17:13:55 lab17-233 kernel: Call Trace:
Jan 21 17:13:55 lab17-233 kernel:  [80268f14] mempool_alloc+0x24/0xda
Jan 21 17:13:55 lab17-233 kernel:  [88b91381]
:raid1:wait_barrier+0x84/0xc2
Jan 21 17:13:55 lab17-233 kernel:  [8022d8fa]
default_wake_function+0x0/0xe
Jan 21 17:13:55 lab17-233 kernel:  [88b92093]
:raid1:make_request+0x83/0x5c0
Jan 21 17:13:55 lab17-233 kernel:  [80305acd]
__make_request+0x57f/0x668
Jan 21 17:13:55 lab17-233 kernel:  [80302dc7]
generic_make_request+0x26e/0x2a9
Jan 21 17:13:55 lab17-233 kernel:  [80268f14] mempool_alloc+0x24/0xda
Jan 21 17:13:55 lab17-233 kernel:  [8030db39] __next_cpu+0x19/0x28
Jan 21 17:13:55 lab17-233 kernel:  [80305162] submit_bio+0xb6/0xbd
Jan 21 17:13:55 lab17-233 kernel:  [802aba6a] submit_bh+0xdf/0xff
Jan 21 17:13:55 lab17-233 kernel:  [802ae188]
block_read_full_page+0x271/0x28e
Jan 21 17:13:55 lab17-233 kernel:  [802b0b27]
blkdev_get_block+0x0/0x46
Jan 21 17:13:55 lab17-233 kernel:  [803103ad]
radix_tree_insert+0xcb/0x18c
Jan 21 17:13:55 lab17-233 kernel:  [8026d003]
__do_page_cache_readahead+0x16d/0x1df
Jan 21 17:13:55 lab17-233 kernel:  [80248c51] getnstimeofday+0x32/0x8d
Jan 21 17:13:55 lab17-233 kernel:  [80247e5e] ktime_get_ts+0x1a/0x4e
Jan 21 17:13:55 lab17-233 kernel:  [80265543] delayacct_end+0x7d/0x88
Jan 21 17:13:55 lab17-233 kernel:  [8026d0c8]
blockable_page_cache_readahead+0x53/0xb2
Jan 21 17:13:55 lab17-233 kernel:  [8026d1a9]
make_ahead_window+0x82/0x9e
Jan 21 17:13:55 lab17-233 kernel:  [8026d34f]
page_cache_readahead+0x18a/0x1c1
Jan 21 17:13:55 lab17-233 kernel:  [8026723c]
do_generic_mapping_read+0x135/0x3fc
Jan 21 17:13:55 lab17-233 kernel:  [80266755]
file_read_actor+0x0/0x170
Jan 21 17:13:55 lab17-233 kernel:  

Re: 2.6.22.16 MD raid1 doesn't mark removed disk faulty, MD thread goes UN

2008-01-21 Thread Mike Snitzer
cc'ing Tanaka-san given his recent raid1 BUG report:
http://lkml.org/lkml/2008/1/14/515

On Jan 21, 2008 6:04 PM, Mike Snitzer [EMAIL PROTECTED] wrote:
 Under 2.6.22.16, I physically pulled a SATA disk (/dev/sdac, connected to
 an aacraid controller) that was acting as the local raid1 member of
 /dev/md30.

 Linux MD didn't see an /dev/sdac1 error until I tried forcing the issue by
 doing a read (with dd) from /dev/md30:

 Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Result:
 hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
 Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Sense Key :
 Hardware Error [current]
 Jan 21 17:08:07 lab17-233 kernel: Info fld=0x0
 Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Add. Sense:
 Internal target failure
 Jan 21 17:08:07 lab17-233 kernel: end_request: I/O error, dev sdac, sector 71
 Jan 21 17:08:07 lab17-233 kernel: printk: 3 messages suppressed.
 Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 8
 Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 16
 Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 24
 Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 32
 Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 40
 Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 48
 Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 56
 Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 64
 Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 72
 Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 80
 Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Result:
 hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
 Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Sense Key :
 Hardware Error [current]
 Jan 21 17:08:07 lab17-233 kernel: Info fld=0x0
 Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Add. Sense:
 Internal target failure
 Jan 21 17:08:07 lab17-233 kernel: end_request: I/O error, dev sdac, sector 343
 Jan 21 17:08:08 lab17-233 kernel: sd 2:0:27:0: [sdac] Result:
 hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
 Jan 21 17:08:08 lab17-233 kernel: sd 2:0:27:0: [sdac] Sense Key :
 Hardware Error [current]
 Jan 21 17:08:08 lab17-233 kernel: Info fld=0x0
 ...
 Jan 21 17:08:12 lab17-233 kernel: sd 2:0:27:0: [sdac] Add. Sense:
 Internal target failure
 Jan 21 17:08:12 lab17-233 kernel: end_request: I/O error, dev sdac, sector 
 3399
 Jan 21 17:08:12 lab17-233 kernel: printk: 765 messages suppressed.
 Jan 21 17:08:12 lab17-233 kernel: raid1: sdac1: rescheduling sector 3336

 However, the MD layer still hasn't marked the sdac1 member faulty:

 md30 : active raid1 nbd2[1](W) sdac1[0]
   4016204 blocks super 1.0 [2/2] [UU]
   bitmap: 1/8 pages [4KB], 256KB chunk

 The dd I used to read from /dev/md30 is blocked on IO:

 Jan 21 17:13:55 lab17-233 kernel: ddD 0afa9cf5c346
 0 12337   7702 (NOTLB)
 Jan 21 17:13:55 lab17-233 kernel:  81010c449868 0082
  80268f14
 Jan 21 17:13:55 lab17-233 kernel:  81015da6f320 81015de532c0
 0008 81012d9d7780
 Jan 21 17:13:55 lab17-233 kernel:  81015fae2880 4926
 81012d9d7970 0001802879a0
 Jan 21 17:13:55 lab17-233 kernel: Call Trace:
 Jan 21 17:13:55 lab17-233 kernel:  [80268f14] 
 mempool_alloc+0x24/0xda
 Jan 21 17:13:55 lab17-233 kernel:  [88b91381]
 :raid1:wait_barrier+0x84/0xc2
 Jan 21 17:13:55 lab17-233 kernel:  [8022d8fa]
 default_wake_function+0x0/0xe
 Jan 21 17:13:55 lab17-233 kernel:  [88b92093]
 :raid1:make_request+0x83/0x5c0
 Jan 21 17:13:55 lab17-233 kernel:  [80305acd]
 __make_request+0x57f/0x668
 Jan 21 17:13:55 lab17-233 kernel:  [80302dc7]
 generic_make_request+0x26e/0x2a9
 Jan 21 17:13:55 lab17-233 kernel:  [80268f14] 
 mempool_alloc+0x24/0xda
 Jan 21 17:13:55 lab17-233 kernel:  [8030db39] __next_cpu+0x19/0x28
 Jan 21 17:13:55 lab17-233 kernel:  [80305162] submit_bio+0xb6/0xbd
 Jan 21 17:13:55 lab17-233 kernel:  [802aba6a] submit_bh+0xdf/0xff
 Jan 21 17:13:55 lab17-233 kernel:  [802ae188]
 block_read_full_page+0x271/0x28e
 Jan 21 17:13:55 lab17-233 kernel:  [802b0b27]
 blkdev_get_block+0x0/0x46
 Jan 21 17:13:55 lab17-233 kernel:  [803103ad]
 radix_tree_insert+0xcb/0x18c
 Jan 21 17:13:55 lab17-233 kernel:  [8026d003]
 __do_page_cache_readahead+0x16d/0x1df
 Jan 21 17:13:55 lab17-233 kernel:  [80248c51] 
 getnstimeofday+0x32/0x8d
 Jan 21 17:13:55 lab17-233 kernel:  [80247e5e] ktime_get_ts+0x1a/0x4e
 Jan 21 17:13:55 lab17-233 kernel:  [80265543] 
 delayacct_end+0x7d/0x88
 Jan 21 17:13:55 lab17-233 kernel:  [8026d0c8]
 blockable_page_cache_readahead+0x53/0xb2
 Jan 21 17:13:55 lab17-233 kernel:  [8026d1a9]
 make_ahead_window+0x82/0x9e
 Jan 21 17:13:55 lab17-233 kernel

Re: [PATCH 003 of 3] md: Update md bitmap during resync.

2007-12-10 Thread Mike Snitzer
On Dec 7, 2007 12:42 AM, NeilBrown [EMAIL PROTECTED] wrote:

 Currently an md array with a write-intent bitmap does not update
 that bitmap to reflect successful partial resync.  Rather the entire
 bitmap is updated when the resync completes.

 This is because there is no guarantee that resync requests will
 complete in order, and tracking each request individually is
 unnecessarily burdensome.

 However there is value in regularly updating the bitmap, so add code
 to periodically pause while all pending sync requests complete, then
 update the bitmap.  Doing this only every few seconds (the same as the
 bitmap update time) does not noticeably affect resync performance.

 Signed-off-by: Neil Brown [EMAIL PROTECTED]

Hi Neil,

You forgot to export bitmap_cond_end_sync.  Please see the attached patch.

regards,
Mike
diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index f31ea4f..b596538 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -1566,3 +1566,4 @@ EXPORT_SYMBOL(bitmap_start_sync);
 EXPORT_SYMBOL(bitmap_end_sync);
 EXPORT_SYMBOL(bitmap_unplug);
 EXPORT_SYMBOL(bitmap_close_sync);
+EXPORT_SYMBOL(bitmap_cond_end_sync);
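
(The export matters because the new call sites live in the raid
personality modules' resync paths; roughly -- this is a sketch of the
shape of the call, not the exact hunk from Neil's series:

    /* drivers/md/raid1.c, sync_request(), roughly: */
    static sector_t sync_request(mddev_t *mddev, sector_t sector_nr,
                                 int *skipped, int go_faster)
    {
        ...
        /* every few seconds: wait for pending sync requests to drain,
         * then fold the completed region into the on-disk bitmap */
        bitmap_cond_end_sync(mddev->bitmap, sector_nr);
        ...
    }

Without the export, a modular raid1 build would fail to load with an
unresolved bitmap_cond_end_sync symbol.)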


Re: Time to deprecate old RAID formats?

2007-10-24 Thread Mike Snitzer
On 10/24/07, John Stoffel [EMAIL PROTECTED] wrote:
  Bill == Bill Davidsen [EMAIL PROTECTED] writes:

 Bill John Stoffel wrote:
  Why do we have three different positions for storing the superblock?

 Bill Why do you suggest changing anything until you get the answer to
 Bill this question? If you don't understand why there are three
 Bill locations, perhaps that would be a good initial investigation.

 Because I've asked this question before and not gotten an answer, nor
 is it answered in the man page for mdadm on why we have this setup.

 Bill Clearly the short answer is that they reflect three stages of
 Bill Neil's thinking on the topic, and I would bet that he had a good
 Bill reason for moving the superblock when he did it.

 So let's hear Neil's thinking about all this?  Or should I just work
 up a patch to do what I suggest and see how that flies?

 Bill Since you have to support all of them or break existing arrays,
 Bill and they all use the same format so there's no saving of code
 Bill size to mention, why even bring this up?

 Because of the confusion factor.  Again, since no one has been able to
 articulate a reason why we have three different versions of the 1.x
 superblock, nor have I seen any good reasons for why we should have
 them, I'm going by the KISS principle to reduce the options to the
 best one.

 And no, I'm not advocating getting rid of legacy support, but I AM
 advocating that we settle on ONE standard format going forward as the
 default for all new RAID superblocks.

Why exactly are you on this crusade to find the one best v1
superblock location?  Giving people the freedom to place the
superblock where they choose isn't a bad thing.  Would adding
 something like "If in doubt, 1.1 is the safest choice." to the mdadm
man page give you the KISS warm-fuzzies you're pining for?

The fact that, after you read the manpage, you didn't even know that
the only difference between the v1.x variants is the location that the
superblock is placed indicates that you're not in a position to be so
tremendously evangelical about affecting code changes that limit
existing options.

Mike


[PATCH] lvm2 support for detecting v1.x MD superblocks

2007-10-23 Thread Mike Snitzer
lvm2's MD v1.0 superblock detection doesn't work at all (because it
doesn't use v1 sb offsets).

I've tested the attached patch to work on MDs with v0.90.0, v1.0,
v1.1, and v1.2 superblocks.

please advise, thanks.
Mike
Index: lib/device/dev-md.c
===
RCS file: /cvs/lvm2/LVM2/lib/device/dev-md.c,v
retrieving revision 1.5
diff -u -r1.5 dev-md.c
--- lib/device/dev-md.c	20 Aug 2007 20:55:25 -	1.5
+++ lib/device/dev-md.c	23 Oct 2007 15:17:57 -
@@ -25,6 +25,40 @@
 #define MD_NEW_SIZE_SECTORS(x) ((x & ~(MD_RESERVED_SECTORS - 1)) \
 				- MD_RESERVED_SECTORS)
 
+int dev_has_md_sb(struct device *dev, uint64_t sb_offset, uint64_t *sb)
+{
+	int ret = 0;	
+	uint32_t md_magic;
+	/* Version 1 is little endian; version 0.90.0 is machine endian */
+	if (dev_read(dev, sb_offset, sizeof(uint32_t), &md_magic) &&
+	    ((md_magic == xlate32(MD_SB_MAGIC)) ||
+	     (md_magic == MD_SB_MAGIC))) {
+		if (sb)
+			*sb = sb_offset;
+		ret = 1;
+	}
+	return ret;
+}
+
+uint64_t v1_sb_offset(uint64_t size, int minor_version) {
+	uint64_t sb_offset;
+	switch(minor_version) {
+	case 0:
+		sb_offset = size;
+		sb_offset -= 8*2;
+		sb_offset &= ~(4*2-1);
+		break;
+	case 1:
+		sb_offset = 0;
+		break;
+	case 2:
+		sb_offset = 4*2;
+		break;
+	}
+	sb_offset <<= SECTOR_SHIFT;
+	return sb_offset;
+}
+
 /*
  * Returns -1 on error
  */
@@ -35,7 +69,6 @@
 #ifdef linux
 
 	uint64_t size, sb_offset;
-	uint32_t md_magic;
 
 	if (!dev_get_size(dev, &size)) {
 		stack;
@@ -50,16 +83,20 @@
 		return -1;
 	}
 
-	sb_offset = MD_NEW_SIZE_SECTORS(size) << SECTOR_SHIFT;
-
 	/* Check if it is an md component device. */
-	/* Version 1 is little endian; version 0.90.0 is machine endian */
-	if (dev_read(dev, sb_offset, sizeof(uint32_t), &md_magic) &&
-	    ((md_magic == xlate32(MD_SB_MAGIC)) ||
-	     (md_magic == MD_SB_MAGIC))) {
-		if (sb)
-			*sb = sb_offset;
+	/* Version 0.90.0 */
+	sb_offset = MD_NEW_SIZE_SECTORS(size) << SECTOR_SHIFT;
+	if (dev_has_md_sb(dev, sb_offset, sb)) {
 		ret = 1;
+	} else {
+		/* Version 1, try v1.0 - v1.2 */
+		int minor;
+		for (minor = 0; minor <= 2; minor++) {
+			if (dev_has_md_sb(dev, v1_sb_offset(size, minor), sb)) {
+				ret = 1;
+				break;
+			}
+		}
 	}
 
 	if (!dev_close(dev))


Re: [lvm-devel] [PATCH] lvm2 support for detecting v1.x MD superblocks

2007-10-23 Thread Mike Snitzer
On 10/23/07, Alasdair G Kergon [EMAIL PROTECTED] wrote:
 On Tue, Oct 23, 2007 at 11:32:56AM -0400, Mike Snitzer wrote:
  I've tested the attached patch to work on MDs with v0.90.0, v1.0,
  v1.1, and v1.2 superblocks.

 I'll apply this, thanks, but need to add comments (or reference) to explain
 what the hard-coded numbers are:

 sb_offset = (size - 8 * 2) & ~(4 * 2 - 1);
 etc.

All values are in terms of sectors; so that is where the * 2 is coming
from.  The v1.0 case follows the same model as the MD_NEW_SIZE_SECTORS
which is used for v0.90.0.  The difference is that the v1.0 superblock
is found at least 8K, but less than 12K, from the end of the device.
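
As a worked example (a quick standalone sketch; the 732456960K device
size is borrowed from the mdadm thread further down in this archive),
the minor_version 0 arithmetic backs off 16 sectors from the end and
rounds down to a 4K boundary:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t size = 732456960ULL * 2;    /* device size: KiB -> 512-byte sectors */
        uint64_t sb_offset = size;
        sb_offset -= 8 * 2;                  /* back off 8K (16 sectors) from the end */
        sb_offset &= ~(uint64_t)(4 * 2 - 1); /* round down to a 4K (8-sector) boundary */
        printf("v1.0 super offset: %llu sectors\n",
               (unsigned long long)sb_offset);
        return 0;
    }

This prints 1464913904, which matches the "Super Offset : 1464913904
sectors" that mdadm -E reports for the same disks further down in this
archive.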

The same switch statement is used in mdadm and is accompanied with the
following comment:

/*
 * Calculate the position of the superblock.
 * It is always aligned to a 4K boundary and
 * depending on minor_version, it can be:
 * 0: At least 8K, but less than 12K, from end of device
 * 1: At start of device
 * 2: 4K from start of device.
 */

Would it be sufficient to add that comment block above
v1_sb_offset()'s switch statement?

thanks,
Mike


Re: mdadm 2.6.x regression, fails creation of raid1 w/ v1.0 sb and internal bitmap

2007-10-22 Thread Mike Snitzer
On 10/22/07, Neil Brown [EMAIL PROTECTED] wrote:
 On Friday October 19, [EMAIL PROTECTED] wrote:
  On 10/19/07, Neil Brown [EMAIL PROTECTED] wrote:
   On Friday October 19, [EMAIL PROTECTED] wrote:
 
I'm using a stock 2.6.19.7 that I then backported various MD fixes to
from 2.6.20 - 2.6.23...  this kernel has worked great until I
attempted v1.0 sb w/ bitmap=internal using mdadm 2.6.x.
   
But would you like me to try a stock 2.6.22 or 2.6.23 kernel?
  
   Yes please.
   I'm suspecting the code in write_sb_page where it tests if the bitmap
   overlaps the data or metadata.  The only way I can see you getting the
   exact error that you do get it for that to fail.
   That test was introduced in 2.6.22.  Did you backport that?  Any
   chance it got mucked up a bit?
 
  I believe you're referring to commit
  f0d76d70bc77b9b11256a3a23e98e80878be1578.  That change actually made
  it into 2.6.23 AFAIK; but yes I actually did backport that fix (which
  depended on ab6085c795a71b6a21afe7469d30a365338add7a).
 
  If I back-out f0d76d70bc77b9b11256a3a23e98e80878be1578 I can create a
  raid1 w/ v1.0 sb and an internal bitmap.  But clearly that is just
  because I removed the negative checks that you introduced ;)
 
  For me this begs the question: what else would
  f0d76d70bc77b9b11256a3a23e98e80878be1578 depend on that I missed?  I
  included 505fa2c4a2f125a70951926dfb22b9cf273994f1 and
ab6085c795a71b6a21afe7469d30a365338add7a too.
 
  *shrug*...
 

 This is all very odd...
 I definitely tested this last week and couldn't reproduce the
 problem.  This week I can reproduce it easily.  And given the nature
 of the bug, I cannot see how it ever worked.

 Anyway, here is a fix that works for me.

Hey Neil,

Your fix works for me too.  However, I'm wondering why you held back
on fixing the same issue in the "bitmap runs in to data" comparison
that follows:

--- ./drivers/md/bitmap.c	2007-10-19 19:11:58.0 -0400
+++ ./drivers/md/bitmap.c	2007-10-22 09:53:41.0 -0400
@@ -286,7 +286,7 @@
 			/* METADATA BITMAP DATA */
 			if (rdev->sb_offset*2
 			    + bitmap->offset
-			    + page->index*(PAGE_SIZE/512) + size/512
+			    + (long)(page->index*(PAGE_SIZE/512)) + size/512
 			    > rdev->data_offset)
 				/* bitmap runs in to data */
 				return -EINVAL;

Thanks,
Mike


Re: mdadm 2.6.x regression, fails creation of raid1 w/ v1.0 sb and internal bitmap

2007-10-19 Thread Mike Snitzer
On 10/19/07, Neil Brown [EMAIL PROTECTED] wrote:
 On Friday October 19, [EMAIL PROTECTED] wrote:

  I'm using a stock 2.6.19.7 that I then backported various MD fixes to
  from 2.6.20 - 2.6.23...  this kernel has worked great until I
  attempted v1.0 sb w/ bitmap=internal using mdadm 2.6.x.
 
  But would you like me to try a stock 2.6.22 or 2.6.23 kernel?

 Yes please.
 I'm suspecting the code in write_sb_page where it tests if the bitmap
 overlaps the data or metadata.  The only way I can see you getting the
 exact error that you do get it for that to fail.
 That test was introduced in 2.6.22.  Did you backport that?  Any
 chance it got mucked up a bit?

I believe you're referring to commit
f0d76d70bc77b9b11256a3a23e98e80878be1578.  That change actually made
it into 2.6.23 AFAIK; but yes I actually did backport that fix (which
depended on ab6085c795a71b6a21afe7469d30a365338add7a).

If I back-out f0d76d70bc77b9b11256a3a23e98e80878be1578 I can create a
raid1 w/ v1.0 sb and an internal bitmap.  But clearly that is just
because I removed the negative checks that you introduced ;)

For me this begs the question: what else would
f0d76d70bc77b9b11256a3a23e98e80878be1578 depend on that I missed?  I
included 505fa2c4a2f125a70951926dfb22b9cf273994f1 and
ab6085c795a71b6a21afe7469d30a365338add7a too.

*shrug*...

Mike


Re: mdadm 2.6.x regression, fails creation of raid1 w/ v1.0 sb and internal bitmap

2007-10-18 Thread Mike Snitzer
On 10/18/07, Neil Brown [EMAIL PROTECTED] wrote:
 On Wednesday October 17, [EMAIL PROTECTED] wrote:
  mdadm 2.4.1 through 2.5.6 works. mdadm-2.6's "Improve allocation and
  use of space for bitmaps in version1 metadata"
  (199171a297a87d7696b6b8c07ee520363f4603c1) would seem like the
  offending change.  Using 1.2 metadata works.
 
  I get the following using the tip of the mdadm git repo or any other
  version of mdadm 2.6.x:
 
  # mdadm --create /dev/md2 --run -l 1 --metadata=1.0 --bitmap=internal
  -n 2 /dev/sdf --write-mostly /dev/nbd2
  mdadm: /dev/sdf appears to be part of a raid array:
  level=raid1 devices=2 ctime=Wed Oct 17 10:17:31 2007
  mdadm: /dev/nbd2 appears to be part of a raid array:
  level=raid1 devices=2 ctime=Wed Oct 17 10:17:31 2007
  mdadm: RUN_ARRAY failed: Input/output error
  mdadm: stopped /dev/md2
 
  kernel log shows:
  md2: bitmap initialized from disk: read 22/22 pages, set 715290 bits, 
  status: 0
  created bitmap (350 pages) for device md2
  md2: failed to create bitmap (-5)

 Could you please tell me the exact size of your device?  Then should
 be able to reproduce it and test a fix.

 (It works for a 734003201K device).

732456960K, it is fairly surprising that such a relatively small
difference in size would prevent it from working...

regards,
Mike


Re: kicking non-fresh member from array?

2007-10-18 Thread Mike Snitzer
On 10/18/07, Goswin von Brederlow [EMAIL PROTECTED] wrote:
 Mike Snitzer [EMAIL PROTECTED] writes:

  All,
 
  I have repeatedly seen that when a 2 member raid1 becomes degraded,
  and IO continues to the lone good member, that if the array is then
  stopped and reassembled you get:
 
  md: bind<nbd0>
  md: bind<sdc>
  md: kicking non-fresh nbd0 from array!
  md: unbind<nbd0>
  md: export_rdev(nbd0)
  raid1: raid set md0 active with 1 out of 2 mirrors
 
  I'm not seeing how one can avoid assembling such an array in 2 passes:
  1) assemble array with both members
  2) if a member was deemed non-fresh re-add that member; whereby
  triggering recovery.
 
  So why does MD kick non-fresh members out on assemble when its
  perfectly capable of recovering the non-fresh member?  Looking at
  md.c it is fairly clear there isn't a way to avoid this 2-step
  procedure.
 
  Why/how does MD benefit from this kicking non-fresh semantic?
  Should MD/mdadm be made optionally tolerant of such non-fresh members
  during assembly?
 
  Mike

 What if the disk has lots of bad blocks, just not where the meta data
 is? On every restart you would resync and fail.

 Or what if you removed a mirror to keep a snapshot of a previous
 state? If it auto resyncs you loose that snapshot.

Both of your examples are fairly tenuous given that such members
shouldn't have been provided on the --assemble command line.  I'm not
talking about auto assemble via udev or something.  But auto assemble
via udev brings up an annoying corner-case when you consider the 2
cases you pointed out.

So you have valid points.  This leads to my last question; having the
ability to _optionally_ tolerate (repair) such stale members would
allow for greater flexibility.  The current behavior isn't conducive
to repairing unprotected raids (that mdadm/md were told to assemble
with specific members) without taking steps to say "no, I really
_really_ mean it; now re-add this disk!"

Any pointers from Neil (or others) on how such a 'repair non-fresh
member(s) on assemble' override _should_ be implemented would be
helpful.  My first thought is to add a new superblock
--update=repair-non-fresh option to mdadm that would tie into a new
flag in the MD superblock.  But then it begs the question: why not
first add support to set such a superblock option at MD create-time?
The validate_super methods would also need to be trained accordingly.

regards,
Mike


Re: mdadm 2.6.x regression, fails creation of raid1 w/ v1.0 sb and internal bitmap

2007-10-18 Thread Mike Snitzer
On 10/18/07, Neil Brown [EMAIL PROTECTED] wrote:

 Sorry, I wasn't paying close enough attention and missed the obvious.
 .

 On Thursday October 18, [EMAIL PROTECTED] wrote:
  On 10/18/07, Neil Brown [EMAIL PROTECTED] wrote:
   On Wednesday October 17, [EMAIL PROTECTED] wrote:
mdadm 2.4.1 through 2.5.6 works. mdadm-2.6's "Improve allocation and
use of space for bitmaps in version1 metadata"
(199171a297a87d7696b6b8c07ee520363f4603c1) would seem like the
offending change.  Using 1.2 metadata works.
   
I get the following using the tip of the mdadm git repo or any other
version of mdadm 2.6.x:
   
# mdadm --create /dev/md2 --run -l 1 --metadata=1.0 --bitmap=internal
-n 2 /dev/sdf --write-mostly /dev/nbd2
mdadm: /dev/sdf appears to be part of a raid array:
level=raid1 devices=2 ctime=Wed Oct 17 10:17:31 2007
mdadm: /dev/nbd2 appears to be part of a raid array:
level=raid1 devices=2 ctime=Wed Oct 17 10:17:31 2007
mdadm: RUN_ARRAY failed: Input/output error
^^

 This means there was an IO error.  i.e. there is a block on the device
 that cannot be read from.
 It worked with earlier version of mdadm because they used a much
 smaller bitmap.  With the patch you mention in place, mdadm tries
 harder to find a good location and good size for a bitmap and to
 make sure that space is available.
 The important fact is that the bitmap ends up at a different
 location.

 You have a bad block at that location, it would seem.

I'm a bit skeptical of that being the case considering I get this
error on _any_ pair of disks I try in an environment where I'm
mirroring across servers that each have access to 8 of these disks.
Each of the 8 mirrors consists of a local member and a remote (nbd)
member.  I can't see all 16 disks having the very same bad block(s) at
the end of the disk ;)

It feels to me like the calculation that you're making isn't leaving
adequate room for the 128K bitmap without hitting the superblock...
but I don't have hard proof yet ;)

 I would have expected an error in the kernel logs about the read error
 though - that is strange.

What about the "md2: failed to create bitmap (-5)"?

 What do
   mdadm -E
 and
   mdadm -X

 on each device say?

# mdadm -E /dev/sdf
/dev/sdf:
  Magic : a92b4efc
Version : 1.0
Feature Map : 0x1
 Array UUID : caabb900:616bfc5a:03763b95:83ea99a7
   Name : 2
  Creation Time : Fri Oct 19 00:38:45 2007
 Raid Level : raid1
   Raid Devices : 2

  Used Dev Size : 1464913648 (698.53 GiB 750.04 GB)
 Array Size : 1464913648 (698.53 GiB 750.04 GB)
   Super Offset : 1464913904 sectors
  State : clean
Device UUID : 978cdd42:abaa82a1:4ad79285:1b56ed86

Internal Bitmap : -176 sectors from superblock
Update Time : Fri Oct 19 00:38:45 2007
   Checksum : c6bb03db - correct
 Events : 0


Array Slot : 0 (0, 1)
   Array State : Uu

# mdadm -E /dev/nbd2
/dev/nbd2:
  Magic : a92b4efc
Version : 1.0
Feature Map : 0x1
 Array UUID : caabb900:616bfc5a:03763b95:83ea99a7
   Name : 2
  Creation Time : Fri Oct 19 00:38:45 2007
 Raid Level : raid1
   Raid Devices : 2

  Used Dev Size : 1464913648 (698.53 GiB 750.04 GB)
 Array Size : 1464913648 (698.53 GiB 750.04 GB)
   Super Offset : 1464913904 sectors
  State : clean
Device UUID : 180209d2:cff9b5d0:05054d19:2e4930f2

Internal Bitmap : -176 sectors from superblock
  Flags : write-mostly
Update Time : Fri Oct 19 00:38:45 2007
   Checksum : 8416e951 - correct
 Events : 0


Array Slot : 1 (0, 1)
   Array State : uU

# mdadm -X /dev/sdf
Filename : /dev/sdf
   Magic : 6d746962
 Version : 4
UUID : caabb900:616bfc5a:03763b95:83ea99a7
  Events : 0
  Events Cleared : 0
   State : OK
   Chunksize : 1 MB
  Daemon : 5s flush period
  Write Mode : Normal
   Sync Size : 732456824 (698.53 GiB 750.04 GB)
  Bitmap : 715290 bits (chunks), 715290 dirty (100.0%)

# mdadm -X /dev/nbd2
Filename : /dev/nbd2
   Magic : 6d746962
 Version : 4
UUID : caabb900:616bfc5a:03763b95:83ea99a7
  Events : 0
  Events Cleared : 0
   State : OK
   Chunksize : 1 MB
  Daemon : 5s flush period
  Write Mode : Normal
   Sync Size : 732456824 (698.53 GiB 750.04 GB)
  Bitmap : 715290 bits (chunks), 715290 dirty (100.0%)

mdadm: stopped /dev/md2
   
kernel log shows:
md2: bitmap initialized from disk: read 22/22 pages, set 715290 bits, 
status: 0
created bitmap (350 pages) for device md2
md2: failed to create bitmap (-5)

I assumed that the RUN_ARRAY failed (via IO error) was a side-effect
of MD's inability to create the bitmap (-5):

md2: bitmap initialized from disk: read 22/22 pages, set 715290 bits, status: 0
created bitmap (350 pages) for device md2
md2: failed to create 

Re: mdadm 2.6.x regression, fails creation of raid1 w/ v1.0 sb and internal bitmap

2007-10-18 Thread Mike Snitzer
On 10/19/07, Mike Snitzer [EMAIL PROTECTED] wrote:
 On 10/18/07, Neil Brown [EMAIL PROTECTED] wrote:
 
  Sorry, I wasn't paying close enough attention and missed the obvious.
  .
 
  On Thursday October 18, [EMAIL PROTECTED] wrote:
   On 10/18/07, Neil Brown [EMAIL PROTECTED] wrote:
On Wednesday October 17, [EMAIL PROTECTED] wrote:
 mdadm 2.4.1 through 2.5.6 works. mdadm-2.6's "Improve allocation and
 use of space for bitmaps in version1 metadata"
 (199171a297a87d7696b6b8c07ee520363f4603c1) would seem like the
 offending change.  Using 1.2 metadata works.

 I get the following using the tip of the mdadm git repo or any other
 version of mdadm 2.6.x:

 # mdadm --create /dev/md2 --run -l 1 --metadata=1.0 --bitmap=internal
 -n 2 /dev/sdf --write-mostly /dev/nbd2
 mdadm: /dev/sdf appears to be part of a raid array:
 level=raid1 devices=2 ctime=Wed Oct 17 10:17:31 2007
 mdadm: /dev/nbd2 appears to be part of a raid array:
 level=raid1 devices=2 ctime=Wed Oct 17 10:17:31 2007
 mdadm: RUN_ARRAY failed: Input/output error
 ^^
 
  This means there was an IO error.  i.e. there is a block on the device
  that cannot be read from.
  It worked with earlier version of mdadm because they used a much
  smaller bitmap.  With the patch you mention in place, mdadm tries
  harder to find a good location and good size for a bitmap and to
  make sure that space is available.
  The important fact is that the bitmap ends up at a different
  location.
 
  You have a bad block at that location, it would seem.

 I'm a bit skeptical of that being the case considering I get this
 error on _any_ pair of disks I try in an environment where I'm
 mirroring across servers that each have access to 8 of these disks.
 Each of the 8 mirrors consists of a local member and a remote (nbd)
 member.  I can't see all 16 disks having the very same bad block(s) at
 the end of the disk ;)

 It feels to me like the calculation that you're making isn't leaving
 adequate room for the 128K bitmap without hitting the superblock...
 but I don't have hard proof yet ;)

To further test this I used 2 local sparse 732456960K loopback devices
and attempted to create the raid1 in the same manner.  It failed in
exactly the same way.  This should cast further doubt on the bad block
theory no?

I'm using a stock 2.6.19.7 that I then backported various MD fixes to
from 2.6.20 - 2.6.23...  this kernel has worked great until I
attempted v1.0 sb w/ bitmap=internal using mdadm 2.6.x.

But would you like me to try a stock 2.6.22 or 2.6.23 kernel?

Mike


mdadm 2.6.x regression, fails creation of raid1 w/ v1.0 sb and internal bitmap

2007-10-17 Thread Mike Snitzer
mdadm 2.4.1 through 2.5.6 works. mdadm-2.6's "Improve allocation and
use of space for bitmaps in version1 metadata"
(199171a297a87d7696b6b8c07ee520363f4603c1) would seem like the
offending change.  Using 1.2 metadata works.

I get the following using the tip of the mdadm git repo or any other
version of mdadm 2.6.x:

# mdadm --create /dev/md2 --run -l 1 --metadata=1.0 --bitmap=internal
-n 2 /dev/sdf --write-mostly /dev/nbd2
mdadm: /dev/sdf appears to be part of a raid array:
level=raid1 devices=2 ctime=Wed Oct 17 10:17:31 2007
mdadm: /dev/nbd2 appears to be part of a raid array:
level=raid1 devices=2 ctime=Wed Oct 17 10:17:31 2007
mdadm: RUN_ARRAY failed: Input/output error
mdadm: stopped /dev/md2

kernel log shows:
md2: bitmap initialized from disk: read 22/22 pages, set 715290 bits, status: 0
created bitmap (350 pages) for device md2
md2: failed to create bitmap (-5)
md: pers->run() failed ...
md: md2 stopped.
md: unbind<nbd2>
md: export_rdev(nbd2)
md: unbind<sdf>
md: export_rdev(sdf)
md: md2 stopped.


Re: mdadm 2.6.x regression, fails creation of raid1 w/ v1.0 sb and internal bitmap

2007-10-17 Thread Mike Snitzer
On 10/17/07, Bill Davidsen [EMAIL PROTECTED] wrote:
 Mike Snitzer wrote:
  mdadm 2.4.1 through 2.5.6 works. mdadm-2.6's "Improve allocation and
  use of space for bitmaps in version1 metadata"
  (199171a297a87d7696b6b8c07ee520363f4603c1) would seem like the
  offending change.  Using 1.2 metadata works.
 
  I get the following using the tip of the mdadm git repo or any other
  version of mdadm 2.6.x:
 
  # mdadm --create /dev/md2 --run -l 1 --metadata=1.0 --bitmap=internal
  -n 2 /dev/sdf --write-mostly /dev/nbd2
  mdadm: /dev/sdf appears to be part of a raid array:
  level=raid1 devices=2 ctime=Wed Oct 17 10:17:31 2007
  mdadm: /dev/nbd2 appears to be part of a raid array:
  level=raid1 devices=2 ctime=Wed Oct 17 10:17:31 2007
  mdadm: RUN_ARRAY failed: Input/output error
  mdadm: stopped /dev/md2
 
  kernel log shows:
  md2: bitmap initialized from disk: read 22/22 pages, set 715290 bits, 
  status: 0
  created bitmap (350 pages) for device md2
  md2: failed to create bitmap (-5)
  md: pers->run() failed ...
  md: md2 stopped.
  md: unbind<nbd2>
  md: export_rdev(nbd2)
  md: unbind<sdf>
  md: export_rdev(sdf)
  md: md2 stopped.
 

 I would start by retrying with an external bitmap, to see if for some
 reason there isn't room for the bitmap. If that fails, perhaps no bitmap
 at all would be a useful data point. Was the original metadata the same
 version? Things moved depending on the exact version, and some
 --zero-superblock magic might be needed. Hopefully Neil can clarify, I'm
 just telling you what I suspect is the problem, and maybe a
 non-destructive solution.

Creating with an external bitmap works perfectly fine.  As does
creating without a bitmap.  --zero-superblock hasn't helped.  Metadata
v1.1 and v1.2 works with an internal bitmap.  I'd like to use v1.0
with an internal bitmap (using an external bitmap isn't an option for
me).

It does appear that the changes to super1.c aren't leaving adequate
room for the bitmap.  Looking at the relevant diff for v1.0 metadata,
the newer super1.c code makes use of a larger bitmap (128K) for
devices > 200GB.  My blockdevice is 700GB.  So could the larger
blockdevice possibly explain why others haven't noticed this?
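
To put rough numbers on it (a standalone sketch using the figures from
the mdadm -X output shown earlier in this archive; the 256-byte bitmap
superblock and the rounding are assumptions about how the space is
carved up, not mdadm's exact code):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t sync_size_kib = 732456824ULL;  /* mdadm -X "Sync Size" */
        uint64_t chunk_kib = 1024;              /* 1 MB bitmap chunk */
        uint64_t bits = (sync_size_kib + chunk_kib - 1) / chunk_kib;
        uint64_t bytes = (bits + 7) / 8 + 256;  /* bitmap data + bitmap sb */
        uint64_t sectors = (bytes + 511) / 512;
        printf("%llu bits -> %llu sectors\n",
               (unsigned long long)bits, (unsigned long long)sectors);
        return 0;
    }

This prints "715290 bits -> 176 sectors", which lines up with the
"Bitmap : 715290 bits (chunks)" and "Internal Bitmap : -176 sectors
from superblock" that mdadm reports for these disks, i.e. roughly 88K
of bitmap has to fit in front of the v1.0 superblock at the end of the
device.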

Mike


Re: mke2fs stuck in D state while creating filesystem on md*

2007-09-19 Thread Mike Snitzer
On 9/19/07, Wiesner Thomas [EMAIL PROTECTED] wrote:
  Has there been any progress on this? I think I saw it, or something
  similar, during some testing of recent 2.6.23-rc kernels: one mke2fs took
  about 11 min longer than all the others (~2 min) and it was not
  repeatable. I worry that a process of more interest will have the same
  hang.

 Well, I must say: no. I haven't tried anything further. I've set up the
 production system a week or so ago
 which runs Debian Etch with no modifications (kernel 2.6.18 I think, the
 debian one and a mdadm 2.5.6-9).
 I didn't notice the problem while creating the raid but that doesn't mean
 anything as I didn't pay attention
 and as I wrote earlier it isn't reliably reproducible.
 (Watching it on a large storage gets boring very fast.)

 I'm not a kernel programmer but I can test another kernel or mdadm version
 if it helps, but let me know
 if you want me to do that.

If/when you experience the hang again please get a trace of all processes with:
echo t > /proc/sysrq-trigger

Of particular interest is the mke2fs trace; as well as any md threads.


Re: detecting read errors after RAID1 check operation

2007-08-25 Thread Mike Snitzer
On 8/17/07, Mike Accetta [EMAIL PROTECTED] wrote:

 Neil Brown writes:
  On Wednesday August 15, [EMAIL PROTECTED] wrote:
   Neil Brown writes:
On Wednesday August 15, [EMAIL PROTECTED] wrote:

   ...
   This happens in our old friend sync_request_write()?  I'm dealing with
 
  Yes, that would be the place.
 
   ...
   This fragment
  
   if (j < 0 || test_bit(MD_RECOVERY_CHECK, &mddev->recovery)) {
           sbio->bi_end_io = NULL;
           rdev_dec_pending(conf->mirrors[i].rdev, mddev);
   } else {
           /* fixup the bio for reuse */
           ...
   }
  
   looks suspiciously like any correction attempt for 'check' is being
   short-circuited to me, regardless of whether or not there was a read
   error.  Actually, even if the rewrite was not being short-circuited,
   I still don't see the path that would update 'corrected_errors' in this
   case.  There are only two raid1.c sites that touch 'corrected_errors', one
   is in fix_read_errors() and the other is later in sync_request_write().
   With my limited understanding of how this all works, neither of these
   paths would seem to apply here.
 
  hmmm yes
  I guess I was thinking of the RAID5 code rather than the RAID1 code.
  It doesn't do the right thing does it?
  Maybe this patch is what we need.  I think it is right.
 
  Thanks,
  NeilBrown
 
 
  Signed-off-by: Neil Brown [EMAIL PROTECTED]
 
  ### Diffstat output
   ./drivers/md/raid1.c |    3 ++-
   1 file changed, 2 insertions(+), 1 deletion(-)

  diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
  --- .prev/drivers/md/raid1.c	2007-08-16 10:29:58.0 +1000
  +++ ./drivers/md/raid1.c	2007-08-17 12:07:35.0 +1000
  @@ -1260,7 +1260,8 @@ static void sync_request_write(mddev_t *
   				j = 0;
   			if (j >= 0)
   				mddev->resync_mismatches += r1_bio->sectors;
   -			if (j < 0 || test_bit(MD_RECOVERY_CHECK, &mddev->recovery)) {
   +			if (j < 0 || (test_bit(MD_RECOVERY_CHECK, &mddev->recovery)
   +				      && test_bit(BIO_UPTODATE, &sbio->bi_flags))) {
   				sbio->bi_end_io = NULL;
   				rdev_dec_pending(conf->mirrors[i].rdev, mddev);
   			} else {

 I tried this (with the typo fixed) and it indeed issues a re-write.
 However, it doesn't seem to do anything with the corrected errors
 count if the rewrite succeeds.  Since end_sync_write() is only used one
 other place when !In_sync, I tried the following and it seems to work
 to get the error count updated.  I don't know whether this belongs in
 end_sync_write() but I'd think it needs to come after the write actually
 succeeds so that seems like the earliest it could be done.
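
(For illustration only -- this is not the patch Accetta posted, which is
elided above, and the guard and exact placement are guesses -- crediting
the repaired sectors from end_sync_write() could look roughly like:

    /* inside end_sync_write(), once 'mirror' has been resolved: */
    if (uptodate && test_bit(MD_RECOVERY_CHECK, &mddev->recovery))
        atomic_add(r1_bio->sectors,
                   &conf->mirrors[mirror].rdev->corrected_errors);

i.e. only bump rdev->corrected_errors once the re-write has actually
completed successfully.)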

Neil,

Any feedback on Mike's patch?

thanks,
Mike


Re: Need clarification on raid1 resync behavior with bitmap support

2007-08-03 Thread Mike Snitzer
On 8/3/07, Neil Brown [EMAIL PROTECTED] wrote:
 On Monday July 23, [EMAIL PROTECTED] wrote:
  On 7/23/07, Neil Brown [EMAIL PROTECTED] wrote:
   Can you test this out and report a sequence of events that causes a
   full resync?
 
  Sure, using an internal-bitmap-enabled raid1 with 2 loopback devices
  on a stock 2.6.20.1 kernel, the following sequences result in a full
  resync.  (FYI, I'm fairly certain I've seen this same behavior on
  2.6.18 and 2.6.15 kernels too but would need to retest):
 
  1)
  mdadm /dev/md0 --manage --fail /dev/loop0
  mdadm -S /dev/md0
  mdadm --assemble /dev/md0 /dev/loop0 /dev/loop1
mdadm: /dev/md0 has been started with 1 drive (out of 2).
NOTE: kernel log says:  md: kicking non-fresh loop0 from array!
  mdadm /dev/md0 --manage --re-add /dev/loop0


 sorry for the slow response.

 It looks like commit 1757128438d41670ded8bc3bc735325cc07dc8f9
 (December 2006) set conf-fullsync a litte too often.

 This seems to fix it, and I'm fairly sure it is correct.

 Thanks,
 NeilBrown

 --
 Make sure a re-add after a restart honours bitmap when resyncing.

 Commit 1757128438d41670ded8bc3bc735325cc07dc8f9 was slightly bad.
 If an array has a write-intent bitmap, and you remove a drive,
 then re-add it, only the changed parts should be resynced.
 This only works if the array has not been shut down and restarted.

 The above mentioned commit sets 'fullsync' a little more often
 than it should.  This patch is more careful.

I hand-patched your change into a 2.6.20.1 kernel (I'd imagine your
patch is against current git).  I didn't see any difference because
unfortunately both of my full resync scenarios included stopping a
degraded raid after either: 1) having failed, but not removed, a
member or 2) having failed and removed a member.  In both scenarios, if I
didn't stop the array and I just removed and re-added the faulty drive
the array would _not_ do a full resync.

My examples clearly conflict with your assertion that: This only
works if the array has not been shut down and restarted.

But shouldn't raid1 be better about leveraging the bitmap of known
good (fresh) members even after having stopped a degraded array?  Why
is it that when an array is stopped raid1 seemingly loses the required
metadata that enables bitmap resyncs to just work upon re-add IFF the
array is _not_ stopped?  Couldn't raid1 be made to assemble the array
to look like the array had never been stopped, leaving the non-fresh
members out as it already does, and only then re-add the non-fresh
members that were provided?

To be explicit: isn't the bitmap still valid on the fresh members?  If
so, why is raid1 just disregarding the fresh bitmap?

Thanks, I really appreciate your insight.
Mike


Re: Need clarification on raid1 resync behavior with bitmap support

2007-07-23 Thread Mike Snitzer

On 7/23/07, Neil Brown [EMAIL PROTECTED] wrote:

On Saturday July 21, [EMAIL PROTECTED] wrote:



 Could you share the other situations where a bitmap-enabled raid1
 _must_ perform a full recovery?

When you add a new drive.  When you create a new bitmap.  I think that
should be all.

 - Correct me if I'm wrong, but one that comes to mind is when a server
 reboots (after cleanly stopping a raid1 array that had a faulty
 member) and then either:
 1) assembles the array with the previously faulty member now
 available

 2) assembles the array with the same faulty member missing.  The user
 later re-adds the faulty member

 AFAIK both scenarios would bring about a full resync.

Only if the drive is not recognised as the original member.
Can you test this out and report a sequence of events that causes a
full resync?


Sure, using an internal-bitmap-enabled raid1 with 2 loopback devices
on a stock 2.6.20.1 kernel, the following sequences result in a full
resync.  (FYI, I'm fairly certain I've seen this same behavior on
2.6.18 and 2.6.15 kernels too but would need to retest):

1)
mdadm /dev/md0 --manage --fail /dev/loop0
mdadm -S /dev/md0
mdadm --assemble /dev/md0 /dev/loop0 /dev/loop1
 mdadm: /dev/md0 has been started with 1 drive (out of 2).
 NOTE: kernel log says:  md: kicking non-fresh loop0 from array!
mdadm /dev/md0 --manage --re-add /dev/loop0

2)
mdadm /dev/md0 --manage --fail /dev/loop0
mdadm /dev/md0 --manage --remove /dev/loop0
mdadm -S /dev/md0
mdadm --assemble /dev/md0 /dev/loop0 /dev/loop1
 mdadm: /dev/md0 has been started with 1 drive (out of 2).
 NOTE: kernel log says:  md: kicking non-fresh loop0 from array!
mdadm /dev/md0 --manage --re-add /dev/loop0

Is stopping the MD (either with mdadm -S or a server reboot) tainting
that faulty member's ability to come back in using a quick
bitmap-based resync?

Mike


Need clarification on raid1 resync behavior with bitmap support

2007-07-21 Thread Mike Snitzer

On 6/1/06, NeilBrown [EMAIL PROTECTED] wrote:


When an array has a bitmap, a device can be removed and re-added
and only blocks changed since the removal (as recorded in the bitmap)
will be resynced.


Neil,

Does the same apply when a bitmap-enabled raid1's member goes faulty?
Meaning even if a member is faulty, when the user removes and re-adds
the faulty device the raid1 rebuild _should_ leverage the bitmap
during a resync right?

I've seen messages like:
[12068875.690255] raid1: raid set md0 active with 2 out of 2 mirrors
[12068875.690284] md0: bitmap file is out of date (0 < 1) -- forcing
full recovery
[12068875.690289] md0: bitmap file is out of date, doing full recovery
[12068875.710214] md0: bitmap initialized from disk: read 5/5 pages,
set 131056 bits, status: 0
[12068875.710222] created bitmap (64 pages) for device md0

Could you share the other situations where a bitmap-enabled raid1
_must_ perform a full recovery?
- Correct me if I'm wrong, but one that comes to mind is when a server
reboots (after cleanly stopping a raid1 array that had a faulty
member) and then either:
1) assembles the array with the previously faulty member now available

2) assembles the array with the same faulty member missing.  The user
later re-adds the faulty member

AFAIK both scenarios would bring about a full resync.

regards,
Mike


Re: raid1 with nbd member hangs MD on SLES10 and RHEL5

2007-06-14 Thread Mike Snitzer

On 6/14/07, Bill Davidsen [EMAIL PROTECTED] wrote:

Mike Snitzer wrote:
 On 6/13/07, Mike Snitzer [EMAIL PROTECTED] wrote:
 On 6/13/07, Mike Snitzer [EMAIL PROTECTED] wrote:
  On 6/12/07, Neil Brown [EMAIL PROTECTED] wrote:
 ...
 On 6/12/07, Neil Brown [EMAIL PROTECTED] wrote:
  On Tuesday June 12, [EMAIL PROTECTED] wrote:
  
   I can provided more detailed information; please just ask.
  
 
  A complete sysrq trace (all processes) might help.

 Bringing this back to a wider audience.  I provided the full sysrq
 trace of the RHEL5 kernel to Neil; in it we saw that md0_raid1 had the
 following trace:

 md0_raid1 D 810026183ce0  5368 31663 11  3822
 29488 (L-TLB)
  810026183ce0 810031e9b5f8 0008 000a
  810037eef040 810037e17100 00043e64d2983c1f 4c7f
  810037eef210 00010001 00081c506640 
 Call Trace:
  [8003e371] keventd_create_kthread+0x0/0x61
  [801b9364] md_super_wait+0xa8/0xbc
  [8003e711] autoremove_wake_function+0x0/0x2e
  [801b9adb] md_update_sb+0x1dd/0x23a
  [801bed2a] md_check_recovery+0x15f/0x449
  [882a1af3] :raid1:raid1d+0x27/0xc1e
  [80233209] thread_return+0x0/0xde
  [8023279c] __sched_text_start+0xc/0xa79
  [8003e371] keventd_create_kthread+0x0/0x61
  [80233a9f] schedule_timeout+0x1e/0xad
  [8003e371] keventd_create_kthread+0x0/0x61
  [801bd06c] md_thread+0xf8/0x10e
  [8003e711] autoremove_wake_function+0x0/0x2e
  [801bcf74] md_thread+0x0/0x10e
  [8003e5e7] kthread+0xd4/0x109
  [8000a505] child_rip+0xa/0x11
  [8003e371] keventd_create_kthread+0x0/0x61
  [8003e513] kthread+0x0/0x109
  [8000a4fb] child_rip+0x0/0x11

 To which Neil had the following to say:

   md0_raid1 is holding the lock on the array and trying to write
 out the
   superblocks for some reason, and the write isn't completing.
   As it is holding the locks, mdadm and /proc/mdstat are hanging.
 ...

  We're using MD+NBD for disaster recovery (one local scsi device, one
  remote via nbd).  The nbd-server is not contributing to md0.  The
  nbd-server is connected to a remote machine that is running a raid1
  remotely

 To take this further I've now collected a full sysrq trace of this
 hang on a SLES10 SP1 RC5 2.6.16.46-0.12-smp kernel, the relevant
 md0_raid1 trace is comparable to the RHEL5 trace from above:

 md0_raid1 D 810001089780 0  8583 51  8952  8260 (L-TLB)
  810812393ca8 0046 8107b7fbac00 000a
  81081f3c6a18 81081f3c67d0 8104ffe8f100 44819ddcd5e2
  eb8b 0007028009c7
 Call Trace:
  801e1f94{generic_make_request+501}
  8026946c{md_super_wait+168}
  80145aa2{autoremove_wake_function+0}
  8026f056{write_page+128}
  80269ac7{md_update_sb+220}
  8026bda5{md_check_recovery+361}
  883a97f5{:raid1:raid1d+38}
  8013ad8f{lock_timer_base+27}
  8013ae01{try_to_del_timer_sync+81}
  8013ae16{del_timer_sync+12}
  802d9adf{schedule_timeout+146}
  801456a9{keventd_create_kthread+0}
  8026d5d8{md_thread+248}
  80145aa2{autoremove_wake_function+0}
  8026d4e0{md_thread+0}
  80145965{kthread+236}
  8010bdce{child_rip+8}
  801456a9{keventd_create_kthread+0}
  80145879{kthread+0}
  8010bdc6{child_rip+0}

 Taking a step back, here is what was done to reproduce on SLES10:
 1) establish a raid1 mirror (md0) using one local member (sdc1) and
 one remote member (nbd0)
  2) power off the remote machine, thereby severing nbd0's connection
  3) perform IO to the filesystem that is on the md0 device to induce
  the MD layer to mark the nbd device as faulty
 4) cat /proc/mdstat hangs, sysrq trace was collected and showed the
 above md0_raid1 trace.

 To be clear, the MD superblock update hangs indefinitely on RHEL5.
 But with SLES10 it eventually succeeds (and MD marks the nbd0 member
 faulty); and the other tasks that were blocking waiting for the MD
 lock (e.g. 'cat /proc/mdstat') then complete immediately.

 It should be noted that this MD+NBD configuration has worked
 flawlessly using a stock kernel.org 2.6.15.7 kernel (on top of a
 RHEL4U4 distro).  Steps have not been taken to try to reproduce with
 2.6.15.7 on SLES10; it may be useful to pursue but I'll defer to
 others to suggest I do so.

 2.6.15.7 does not have the SMP race fixes that were made in 2.6.16;
 yet both SLES10 and RHEL5 kernels do:
 
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4b2f0260c74324abca76ccaa42d426af163125e7


 If not this specific NBD change, something appears to have changed
 with how NBD behaves in the face of its connection to the server
 being lost

Re: raid1 with nbd member hangs MD on SLES10 and RHEL5

2007-06-14 Thread Mike Snitzer

On 6/14/07, Paul Clements [EMAIL PROTECTED] wrote:

Bill Davidsen wrote:

 Second, AFAIK nbd hasn't been working in a while. I haven't tried it in ages,
 but was told it wouldn't work with smp and I kind of lost interest. If
 Neil thinks it should work in 2.6.21 or later I'll test it, since I have
 a machine which wants a fresh install soon, and is both backed up and
 available.

Please stop this. nbd is working perfectly fine, AFAIK. I use it every
day, and so do 100s of our customers. What exactly is it that's not
working? If there's a problem, please send the bug report.


Paul,

This thread details what I've experienced using MD (raid1) with 2
devices; one being a local scsi device and the other an NBD device.
I've yet to put effort into pinpointing the problem in a kernel.org
kernel; however both SLES10 and RHEL5 kernels appear to be hanging in
either 1) nbd or 2) the socket layer.

Here are the steps to reproduce reliably on SLES10 SP1:
1) establish a raid1 mirror (md0) using one local member (sdc1) and
one remote member (nbd0)
2) power off the remote machine, thereby severing nbd0's connection
3) perform IO to the filesystem that is on the md0 device to induce
the MD layer to mark the nbd device as faulty
4) cat /proc/mdstat hangs, sysrq trace was collected
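
In command form the above is roughly the following (a sketch of my setup;
the nbd-server host/port and mount point are placeholders):

# nbd-client nbd-server-host 2000 /dev/nbd0
# mdadm --create /dev/md0 -l 1 -n 2 /dev/sdc1 --write-mostly /dev/nbd0
# mkfs.ext3 /dev/md0 && mount /dev/md0 /mnt/test
  ... power off the remote machine, then generate IO:
# dd if=/dev/zero of=/mnt/test/file bs=1M count=100
# cat /proc/mdstat        (hangs until the superblock write completes, if ever)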

To be clear, the MD superblock update hangs indefinitely on RHEL5.
But with SLES10 it eventually succeeds after ~5min (and MD marks the
nbd0 member faulty); and the other tasks that were blocking waiting
for the MD lock (e.g. 'cat /proc/mdstat') then complete immediately.

If you look back in this thread you'll see traces for md0_raid1 for
both SLES10 and RHEL5.  I hope to try to reproduce this issue on
kernel.org 2.6.16.46 (the basis for SLES10).  If I can I'll then git
bisect back to try to pinpoint the regression; I obviously need to
verify that 2.6.16 works in this situation on SMP.

Mike


Re: raid1 with nbd member hangs MD on SLES10 and RHEL5

2007-06-14 Thread Mike Snitzer

On 6/14/07, Paul Clements [EMAIL PROTECTED] wrote:

Mike Snitzer wrote:

 Here are the steps to reproduce reliably on SLES10 SP1:
 1) establish a raid1 mirror (md0) using one local member (sdc1) and
 one remote member (nbd0)
  2) power off the remote machine, thereby severing nbd0's connection
  3) perform IO to the filesystem that is on the md0 device to induce
  the MD layer to mark the nbd device as faulty
 4) cat /proc/mdstat hangs, sysrq trace was collected

That's working as designed. NBD works over TCP. You're going to have to
wait for TCP to time out before an error occurs. Until then I/O will hang.


With kernel.org 2.6.15.7 (uni-processor) I've not seen NBD hang in the
kernel like I am with RHEL5 and SLES10.  This hang (tcp timeout) is
indefinite on RHEL5 and ~5min on SLES10.

Should/can I be playing with TCP timeout values?  Why was this not a
concern with kernel.org 2.6.15.7?  There I saw the nbd connection break
immediately: no MD superblock update hangs, no long-winded (or
indefinite) TCP timeout.
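
If it really is just TCP retransmission timeouts, one knob that might be
worth experimenting with is net.ipv4.tcp_retries2 (a sketch; I haven't
verified that this actually shortens the nbd hang on these kernels):

# sysctl net.ipv4.tcp_retries2        (default is 15, roughly 15-30 minutes)
# sysctl -w net.ipv4.tcp_retries2=5   (give up on a dead connection much sooner)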

regards,
Mike


Re: raid1 with nbd member hangs MD on SLES10 and RHEL5

2007-06-14 Thread Mike Snitzer

On 6/14/07, Paul Clements [EMAIL PROTECTED] wrote:

Mike Snitzer wrote:
 On 6/14/07, Paul Clements [EMAIL PROTECTED] wrote:
 Mike Snitzer wrote:

  Here are the steps to reproduce reliably on SLES10 SP1:
  1) establish a raid1 mirror (md0) using one local member (sdc1) and
  one remote member (nbd0)
   2) power off the remote machine, thereby severing nbd0's connection
   3) perform IO to the filesystem that is on the md0 device to induce
   the MD layer to mark the nbd device as faulty
  4) cat /proc/mdstat hangs, sysrq trace was collected

 That's working as designed. NBD works over TCP. You're going to have to
 wait for TCP to time out before an error occurs. Until then I/O will
 hang.

 With kernel.org 2.6.15.7 (uni-processor) I've not seen NBD hang in the
 kernel like I am with RHEL5 and SLES10.  This hang (tcp timeout) is
 indefinite on RHEL5 and ~5min on SLES10.

 Should/can I be playing with TCP timeout values?  Why was this not a
 concern with kernel.org 2.6.15.7?  There I saw the nbd connection break
 immediately: no MD superblock update hangs, no long-winded (or
 indefinite) TCP timeout.

I don't know. I've never seen nbd immediately start returning I/O
errors. Perhaps something was different about the configuration?
If the other machine rebooted quickly, for instance, you'd get a
connection reset, which would kill the nbd connection.


OK, I'll retest the 2.6.15.7 setup.  As for SLES10 and RHEL5, I've
been leaving the remote server powered off.  As such I'm at the full
mercy of the TCP timeout.  It is odd that RHEL5 has been hanging
indefinitely but I'll dig deeper on that once I come to terms with how
kernel.org and SLES10 behave.

I'll update with my findings for completeness.

Thanks for your insight!
Mike


Re: raid1 with nbd member hangs MD on SLES10 and RHEL5

2007-06-13 Thread Mike Snitzer

On 6/13/07, Mike Snitzer [EMAIL PROTECTED] wrote:

On 6/12/07, Neil Brown [EMAIL PROTECTED] wrote:

...

   On 6/12/07, Neil Brown [EMAIL PROTECTED] wrote:
On Tuesday June 12, [EMAIL PROTECTED] wrote:

 I can provide more detailed information; please just ask.

   
A complete sysrq trace (all processes) might help.


Bringing this back to a wider audience.  I provided the full sysrq
trace of the RHEL5 kernel to Neil; in it we saw that md0_raid1 had the
following trace:

md0_raid1 D 810026183ce0  5368 31663 11  3822 29488 (L-TLB)
810026183ce0 810031e9b5f8 0008 000a
810037eef040 810037e17100 00043e64d2983c1f 4c7f
810037eef210 00010001 00081c506640 
Call Trace:
[8003e371] keventd_create_kthread+0x0/0x61
[801b9364] md_super_wait+0xa8/0xbc
[8003e711] autoremove_wake_function+0x0/0x2e
[801b9adb] md_update_sb+0x1dd/0x23a
[801bed2a] md_check_recovery+0x15f/0x449
[882a1af3] :raid1:raid1d+0x27/0xc1e
[80233209] thread_return+0x0/0xde
[8023279c] __sched_text_start+0xc/0xa79
[8003e371] keventd_create_kthread+0x0/0x61
[80233a9f] schedule_timeout+0x1e/0xad
[8003e371] keventd_create_kthread+0x0/0x61
[801bd06c] md_thread+0xf8/0x10e
[8003e711] autoremove_wake_function+0x0/0x2e
[801bcf74] md_thread+0x0/0x10e
[8003e5e7] kthread+0xd4/0x109
[8000a505] child_rip+0xa/0x11
[8003e371] keventd_create_kthread+0x0/0x61
[8003e513] kthread+0x0/0x109
[8000a4fb] child_rip+0x0/0x11

To which Neil had the following to say:


 md0_raid1 is holding the lock on the array and trying to write out the
 superblocks for some reason, and the write isn't completing.
 As it is holding the locks, mdadm and /proc/mdstat are hanging.

 You seem to have nbd-servers running on this machine.  Are they
 serving the device that md is using. (i.e. a loop-back situation).  I
 would expect memory deadlocks would be very easy to hit in that
 situation, but I don't know if that is what has happened.

 Nothing else stands out.

 Could you clarify the arrangement of nbd.  Where are the servers and
 what are they serving?

We're using MD+NBD for disaster recovery (one local scsi device, one
remote via nbd).  The nbd-server is not contributing to md0.  The
nbd-server is connected to a remote machine that is running a raid1
remotely


To take this further I've now collected a full sysrq trace of this
hang on a SLES10 SP1 RC5 2.6.16.46-0.12-smp kernel, the relevant
md0_raid1 trace is comparable to the RHEL5 trace from above:

md0_raid1 D 810001089780 0  8583 51  8952  8260 (L-TLB)
 810812393ca8 0046 8107b7fbac00 000a
 81081f3c6a18 81081f3c67d0 8104ffe8f100 44819ddcd5e2
 eb8b 0007028009c7
Call Trace:
 801e1f94{generic_make_request+501}
 8026946c{md_super_wait+168}
 80145aa2{autoremove_wake_function+0}
 8026f056{write_page+128}
 80269ac7{md_update_sb+220}
 8026bda5{md_check_recovery+361}
 883a97f5{:raid1:raid1d+38}
 8013ad8f{lock_timer_base+27}
 8013ae01{try_to_del_timer_sync+81}
 8013ae16{del_timer_sync+12}
 802d9adf{schedule_timeout+146}
 801456a9{keventd_create_kthread+0}
 8026d5d8{md_thread+248}
 80145aa2{autoremove_wake_function+0}
 8026d4e0{md_thread+0}
 80145965{kthread+236}
 8010bdce{child_rip+8}
 801456a9{keventd_create_kthread+0}
 80145879{kthread+0}
 8010bdc6{child_rip+0}

Taking a step back, here is what was done to reproduce on SLES10:
1) establish a raid1 mirror (md0) using one local member (sdc1) and
one remote member (nbd0)
2) power off the remote machine, thereby severing nbd0's connection
3) perform IO to the filesystem that is on the md0 device to induce
the MD layer to mark the nbd device as faulty
4) cat /proc/mdstat hangs, sysrq trace was collected and showed the
above md0_raid1 trace.

To be clear, the MD superblock update hangs indefinitely on RHEL5.
But with SLES10 it eventually succeeds (and MD marks the nbd0 member
faulty); and the other tasks that were blocking waiting for the MD
lock (e.g. 'cat /proc/mdstat') then complete immediately.

It should be noted that this MD+NBD configuration has worked
flawlessly using a stock kernel.org 2.6.15.7 kernel (on top of a
RHEL4U4 distro).  Steps have not been taken to try to reproduce with
2.6.15.7 on SLES10; it may be useful to pursue but I'll defer to
others to suggest I do so.

2.6.15.7 does not have the SMP race fixes that were made in 2.6.16;
yet both SLES10 and RHEL5 kernels do:
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4b2f0260c74324abca76ccaa42d426af163125e7

If not this specific NBD change

Re: Cluster Aware MD Driver

2007-06-13 Thread Mike Snitzer

Is the goal to have the MD device be directly accessible from all
nodes?  That strategy seems flawed in that it requires updating MD
superblocks and in-memory Linux data structures across a cluster.
The reality is that, if we're talking about shared storage, the MD
management only needs to happen on one node.  Others can weigh in on
this, but the current MD really doesn't want to be cluster-aware.

IMHO, this cluster awareness really doesn't belong in MD/mdadm.  A
high-level cluster management tool should be doing this MD
ownership/coordination work.  The MD ownership can be transferred
accordingly if/when the current owner fails, etc.  But this implies
that the MD is only ever active on one node at any given point in
time.
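
To make that single-owner model concrete, the cluster manager's failover
action would be little more than the following (a sketch; the shared-device
paths are hypothetical and fencing of the old owner is assumed to have
already happened):

(on the old owner, if it is still reachable)
# mdadm --stop /dev/md0
(on the new owner)
# mdadm --assemble /dev/md0 /dev/disk/by-id/shared-disk-a /dev/disk/by-id/shared-disk-b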

Mike

On 6/13/07, Xinwei Hu [EMAIL PROTECTED] wrote:

Hi all,

  Steven Dake proposed a solution* to make the MD layer and tools cluster
aware in early 2003.  But it seems that no progress has been made since then.
I'd like to pick this one up again. :)

  So far as I understand, Steven's proposal still mostly applies to the
current MD implementation, except that we have the bitmap now.  And the
bitmap can be worked around via set_bitmap_file.

   The problem is that it seems we need a kernel-userspace interface to sync
the mddev struct across all nodes, but I haven't found out how.

   I'm new to the MD driver, so correct me if I'm wrong.  Your suggestions
are really appreciated.

   Thanks.

* http://osdir.com/ml/raid/2003-01/msg00013.html




Re: raid1 with nbd member hangs MD on SLES10 and RHEL5

2007-06-13 Thread Mike Snitzer

On 6/13/07, Mike Snitzer [EMAIL PROTECTED] wrote:

On 6/13/07, Mike Snitzer [EMAIL PROTECTED] wrote:
 On 6/12/07, Neil Brown [EMAIL PROTECTED] wrote:
...
On 6/12/07, Neil Brown [EMAIL PROTECTED] wrote:
 On Tuesday June 12, [EMAIL PROTECTED] wrote:
 
  I can provide more detailed information; please just ask.
 

 A complete sysrq trace (all processes) might help.

Bringing this back to a wider audience.  I provided the full sysrq
trace of the RHEL5 kernel to Neil; in it we saw that md0_raid1 had the
following trace:

md0_raid1 D 810026183ce0  5368 31663 11  3822 29488 (L-TLB)
 810026183ce0 810031e9b5f8 0008 000a
 810037eef040 810037e17100 00043e64d2983c1f 4c7f
 810037eef210 00010001 00081c506640 
Call Trace:
 [8003e371] keventd_create_kthread+0x0/0x61
 [801b9364] md_super_wait+0xa8/0xbc
 [8003e711] autoremove_wake_function+0x0/0x2e
 [801b9adb] md_update_sb+0x1dd/0x23a
 [801bed2a] md_check_recovery+0x15f/0x449
 [882a1af3] :raid1:raid1d+0x27/0xc1e
 [80233209] thread_return+0x0/0xde
 [8023279c] __sched_text_start+0xc/0xa79
 [8003e371] keventd_create_kthread+0x0/0x61
 [80233a9f] schedule_timeout+0x1e/0xad
 [8003e371] keventd_create_kthread+0x0/0x61
 [801bd06c] md_thread+0xf8/0x10e
 [8003e711] autoremove_wake_function+0x0/0x2e
 [801bcf74] md_thread+0x0/0x10e
 [8003e5e7] kthread+0xd4/0x109
 [8000a505] child_rip+0xa/0x11
 [8003e371] keventd_create_kthread+0x0/0x61
 [8003e513] kthread+0x0/0x109
 [8000a4fb] child_rip+0x0/0x11

To which Neil had the following to say:

  md0_raid1 is holding the lock on the array and trying to write out the
  superblocks for some reason, and the write isn't completing.
  As it is holding the locks, mdadm and /proc/mdstat are hanging.

...


 We're using MD+NBD for disaster recovery (one local scsi device, one
 remote via nbd).  The nbd-server is not contributing to md0.  The
 nbd-server is connected to a remote machine that is running a raid1
 remotely

To take this further I've now collected a full sysrq trace of this
hang on a SLES10 SP1 RC5 2.6.16.46-0.12-smp kernel, the relevant
md0_raid1 trace is comparable to the RHEL5 trace from above:

md0_raid1 D 810001089780 0  8583 51  8952  8260 (L-TLB)
 810812393ca8 0046 8107b7fbac00 000a
 81081f3c6a18 81081f3c67d0 8104ffe8f100 44819ddcd5e2
 eb8b 0007028009c7
Call Trace:
 801e1f94{generic_make_request+501}
 8026946c{md_super_wait+168}
 80145aa2{autoremove_wake_function+0}
 8026f056{write_page+128}
 80269ac7{md_update_sb+220}
 8026bda5{md_check_recovery+361}
 883a97f5{:raid1:raid1d+38}
 8013ad8f{lock_timer_base+27}
 8013ae01{try_to_del_timer_sync+81}
 8013ae16{del_timer_sync+12}
 802d9adf{schedule_timeout+146}
 801456a9{keventd_create_kthread+0}
 8026d5d8{md_thread+248}
 80145aa2{autoremove_wake_function+0}
 8026d4e0{md_thread+0}
 80145965{kthread+236}
 8010bdce{child_rip+8}
 801456a9{keventd_create_kthread+0}
 80145879{kthread+0}
 8010bdc6{child_rip+0}

Taking a step back, here is what was done to reproduce on SLES10:
1) establish a raid1 mirror (md0) using one local member (sdc1) and
one remote member (nbd0)
2) power off the remote machine, thereby severing nbd0's connection
3) perform IO to the filesystem that is on the md0 device to induce
the MD layer to mark the nbd device as faulty
4) cat /proc/mdstat hangs, sysrq trace was collected and showed the
above md0_raid1 trace.

To be clear, the MD superblock update hangs indefinitely on RHEL5.
But with SLES10 it eventually succeeds (and MD marks the nbd0 member
faulty); and the other tasks that were blocking waiting for the MD
lock (e.g. 'cat /proc/mdstat') then complete immediately.

It should be noted that this MD+NBD configuration has worked
flawlessly using a stock kernel.org 2.6.15.7 kernel (on top of a
RHEL4U4 distro).  Steps have not been taken to try to reproduce with
2.6.15.7 on SLES10; it may be useful to pursue but I'll defer to
others to suggest I do so.

2.6.15.7 does not have the SMP race fixes that were made in 2.6.16;
yet both SLES10 and RHEL5 kernels do:
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4b2f0260c74324abca76ccaa42d426af163125e7

If not this specific NBD change, something appears to have changed
with how NBD behaves in the face of its connection to the server
being lost.  Almost like the MD superblock update that would be
written to nbd0 is blocking within nbd or the network layer because of
a network timeout issue?


Just a quick update

raid1 with nbd member hangs MD on SLES10 and RHEL5

2007-06-12 Thread Mike Snitzer

When using raid1 with one local member and one nbd member (marked as
write-mostly) MD hangs when trying to format /dev/md0 with ext3.  Both
'cat /proc/mdstat' and 'mdadm --detail /dev/md0' hang indefinitely.
I've not tried to reproduce on 2.6.18 or 2.6.19ish kernel.org kernels
yet but this issue affects both SLES10 and RHEL5.

sysrq traces for RHEL5 follow; I don't have immediate access to a
SLES10 system at the moment but I've seen this same hang with SLES10
SP1 RC4:

cat /proc/mdstat

cat   S 8100048e7de8  6208 11428  11391 (NOTLB)
8100048e7de8 076eb000 80098ea6 0008
81001ff170c0 810037e17100 00045f8d13924085 0006b89f
81001ff17290 0001 0005 
Call Trace:
[80098ea6] seq_printf+0x67/0x8f
[80233df5] __mutex_lock_interruptible_slowpath+0x7f/0xbc
[801be644] md_seq_show+0x123/0x6aa
[8009939f] seq_read+0x1b8/0x28d
[8007b7a8] vfs_read+0xcb/0x171
[8007bb87] sys_read+0x45/0x6e
[800097e1] tracesys+0xd1/0xdc

/sbin/mdadm --detail /dev/md0

mdadm S 810035a1dd78  6384  3829   3828 (NOTLB)
810035a1dd78 81003f4570c0 80094e4d 0001
81000617c870 810037e17100 00043e667c800afe 0005ae94
81000617ca40 0001 0021 
Call Trace:
[80094e4d] mntput_no_expire+0x19/0x89
[80233df5] __mutex_lock_interruptible_slowpath+0x7f/0xbc
[801be4e7] md_open+0x2e/0x68
[80082560] do_open+0x216/0x316
[8008280b] blkdev_open+0x0/0x4f
[8008282e] blkdev_open+0x23/0x4f
[80079889] __dentry_open+0xd9/0x1dc
[80079a40] do_filp_open+0x2d/0x3d
[80079a94] do_sys_open+0x44/0xbe
[800097e1] tracesys+0xd1/0xdc

I can provide more detailed information; please just ask.

thanks,
Mike


Re: raid1 with nbd member hangs MD on SLES10 and RHEL5

2007-06-12 Thread Mike Snitzer

On 6/12/07, Neil Brown [EMAIL PROTECTED] wrote:

On Tuesday June 12, [EMAIL PROTECTED] wrote:

 I can provide more detailed information; please just ask.


A complete sysrq trace (all processes) might help.


I'll send it to you off list.

thanks,
Mike


Re: [PATCH 010 of 10] md: Allow the write_mostly flag to be set via sysfs.

2006-08-05 Thread Mike Snitzer

On 8/5/06, Mike Snitzer [EMAIL PROTECTED] wrote:

Aside from this write-mostly sysfs support, is there a way to toggle
the write-mostly bit of an md member with mdadm?  I couldn't identify
a clear way to do so.

It'd be nice if mdadm --assemble would honor --write-mostly...


I went ahead and implemented the ability to toggle the write-mostly
bit for all disks in an array.  I did so by adding another type of
--update to --assemble.  This is very useful for a 2-disk raid1 (one
disk local, one remote).  When you switch the raidhost you also need
to toggle the write-mostly bit.

I've tested the attached patch with both ver 0.90 and ver 1
superblocks using mdadm 2.4.1 and 2.5.2.  The patch is against mdadm
2.4.1 but applies cleanly (with fuzz) against mdadm 2.5.2.

# cat /proc/mdstat
...
md2 : active raid1 nbd2[0] sdd[1](W)
 390613952 blocks [2/2] [UU]
 bitmap: 0/187 pages [0KB], 1024KB chunk

# mdadm -S /dev/md2
# mdadm --assemble /dev/md2 --run --update=toggle-write-mostly
/dev/sdd /dev/nbd2
mdadm: /dev/md2 has been started with 2 drives.

# cat /proc/mdstat
...
md2 : active raid1 nbd2[0](W) sdd[1]
 390613952 blocks [2/2] [UU]
 bitmap: 0/187 pages [0KB], 1024KB chunk
diff -Naur mdadm-2.4.1/mdadm.c mdadm-2.4.1_toggle_write_mostly/mdadm.c
--- mdadm-2.4.1/mdadm.c	2006-03-28 21:55:39.0 -0500
+++ mdadm-2.4.1_toggle_write_mostly/mdadm.c	2006-08-05 17:01:48.0 -0400
@@ -587,6 +587,8 @@
 				continue;
 			if (strcmp(update, "uuid")==0)
 				continue;
+			if (strcmp(update, "toggle-write-mostly")==0)
+				continue;
 			if (strcmp(update, "byteorder")==0) {
 				if (ss) {
 					fprintf(stderr, Name ": must not set metadata type with --update=byteorder.\n");
@@ -601,7 +603,7 @@
 
 				continue;
 			}
-			fprintf(stderr, Name ": '--update %s' invalid.  Only 'sparc2.2', 'super-minor', 'uuid', 'resync' or 'summaries' supported\n",update);
+			fprintf(stderr, Name ": '--update %s' invalid.  Only 'sparc2.2', 'super-minor', 'uuid', 'resync', 'summaries' or 'toggle-write-mostly' supported\n",update);
 			exit(2);
 
 		case O(ASSEMBLE,'c'): /* config file */
diff -Naur mdadm-2.4.1/super0.c mdadm-2.4.1_toggle_write_mostly/super0.c
--- mdadm-2.4.1/super0.c	2006-03-28 01:10:51.0 -0500
+++ mdadm-2.4.1_toggle_write_mostly/super0.c	2006-08-05 18:04:45.0 -0400
@@ -382,6 +382,10 @@
 			rv = 1;
 		}
 	}
+	if (strcmp(update, "toggle-write-mostly")==0) {
+		int d = info->disk.number;
+		sb->disks[d].state ^= (1<<MD_DISK_WRITEMOSTLY);
+	}
 	if (strcmp(update, "newdev") == 0) {
 		int d = info->disk.number;
 		memset(&sb->disks[d], 0, sizeof(sb->disks[d]));
diff -Naur mdadm-2.4.1/super1.c mdadm-2.4.1_toggle_write_mostly/super1.c
--- mdadm-2.4.1/super1.c	2006-04-07 00:32:06.0 -0400
+++ mdadm-2.4.1_toggle_write_mostly/super1.c	2006-08-05 18:33:21.0 -0400
@@ -446,6 +446,9 @@
 			rv = 1;
 		}
 	}
+	if (strcmp(update, "toggle-write-mostly")==0) {
+		sb->devflags ^= WriteMostly1;
+	}
 #if 0
 	if (strcmp(update, "newdev") == 0) {
 		int d = info->disk.number;


issue with mdadm ver1 sb and bitmap on x86_64

2006-08-05 Thread Mike Snitzer

FYI, with both mdadm ver 2.4.1 and 2.5.2 I can't mdadm --create with a
ver1 superblock and a write intent bitmap on x86_64.

running: mdadm --create /dev/md2 -e 1.0 -l 1 --bitmap=internal -n 2
/dev/sdd --write-mostly /dev/nbd2
I get: mdadm: RUN_ARRAY failed: Invalid argument

Mike


Re: [PATCH 010 of 10] md: Allow the write_mostly flag to be set via sysfs.

2006-08-04 Thread Mike Snitzer

Aside from this write-mostly sysfs support, is there a way to toggle
the write-mostly bit of an md member with mdadm?  I couldn't identify
a clear way to do so.

It'd be nice if mdadm --assemble would honor --write-mostly...


On 6/1/06, NeilBrown [EMAIL PROTECTED] wrote:


It appears in /sys/mdX/md/dev-YYY/state
and can be set or cleared by writing 'writemostly' or '-writemostly'
respectively.

Signed-off-by: Neil Brown [EMAIL PROTECTED]

### Diffstat output
 ./Documentation/md.txt |    5 +++++
 ./drivers/md/md.c      |   12 ++++++++++++
 2 files changed, 17 insertions(+)

diff ./Documentation/md.txt~current~ ./Documentation/md.txt
--- ./Documentation/md.txt~current~ 2006-06-01 15:05:30.0 +1000
+++ ./Documentation/md.txt  2006-06-01 15:05:30.0 +1000
@@ -309,6 +309,9 @@ Each directory contains:
  faulty   - device has been kicked from active use due to
  a detected fault
  in_sync  - device is a fully in-sync member of the array
+ writemostly - device will only be subject to read
+requests if there are no other options.
+This applies only to raid1 arrays.
  spare- device is working, but not a full member.
 This includes spares that are in the process
 of being recoverred to
@@ -316,6 +319,8 @@ Each directory contains:
 	This can be written to.
 	Writing "faulty"  simulates a failure on the device.
 	Writing "remove" removes the device from the array.
+	Writing "writemostly" sets the writemostly flag.
+	Writing "-writemostly" clears the writemostly flag.

   errors
An approximate count of read errors that have been detected on

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~  2006-06-01 15:05:30.0 +1000
+++ ./drivers/md/md.c   2006-06-01 15:05:30.0 +1000
@@ -1737,6 +1737,10 @@ state_show(mdk_rdev_t *rdev, char *page)
 		len += sprintf(page+len, "%sin_sync",sep);
 		sep = ",";
 	}
+	if (test_bit(WriteMostly, &rdev->flags)) {
+		len += sprintf(page+len, "%swrite_mostly",sep);
+		sep = ",";
+	}
 	if (!test_bit(Faulty, &rdev->flags) &&
 	    !test_bit(In_sync, &rdev->flags)) {
 		len += sprintf(page+len, "%sspare", sep);
@@ -1751,6 +1755,8 @@ state_store(mdk_rdev_t *rdev, const char
 	/* can write
 	 *  faulty  - simulates and error
 	 *  remove  - disconnects the device
+	 *  writemostly - sets write_mostly
+	 *  -writemostly - clears write_mostly
 	 */
 	int err = -EINVAL;
 	if (cmd_match(buf, "faulty") && rdev->mddev->pers) {
@@ -1766,6 +1772,12 @@ state_store(mdk_rdev_t *rdev, const char
 			md_new_event(mddev);
 			err = 0;
 		}
+	} else if (cmd_match(buf, "writemostly")) {
+		set_bit(WriteMostly, &rdev->flags);
+		err = 0;
+	} else if (cmd_match(buf, "-writemostly")) {
+		clear_bit(WriteMostly, &rdev->flags);
+		err = 0;
 	}
 	return err ? err : len;
 }
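
For what it's worth, exercising the interface above looks something like
this (a sketch using the md2/nbd2 names from my setup; the sysfs path is
assumed from the patch's Documentation/md.txt addition):

# cat /sys/block/md2/md/dev-nbd2/state
# echo writemostly > /sys/block/md2/md/dev-nbd2/state
# echo -writemostly > /sys/block/md2/md/dev-nbd2/state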




Re: [PATCH] md: new bitmap sysfs interface

2006-07-27 Thread Mike Snitzer

On 7/26/06, Paul Clements [EMAIL PROTECTED] wrote:

Mike Snitzer wrote:

 I tracked down the thread you referenced and these posts (by you)
 seem to summarize things well:
 http://marc.theaimsgroup.com/?l=linux-raidm=16563016418w=2
 http://marc.theaimsgroup.com/?l=linux-raidm=17515400864w=2

 But for clarity's sake, could you elaborate on the negative
 implications of not merging the bitmaps on the secondary server?  Will
 the previous primary's dirty blocks get dropped on the floor because
 the secondary (now the primary) doesn't have awareness of the previous
 primary's dirty blocks once it activates the raid1?

Right. At the time of the failover, there were (probably) blocks that
were out of sync between the primary and secondary.


OK, so now that I understand the need to merge the bitmaps... the
various scenarios that create this (potential) inconsistency are still
unclear to me when you consider the different flavors of raid1.  Is
this inconsistency only possible if using async (aka write-behind)
raid1?

If not, how would this difference in committed blocks occur with
normal (sync) raid1 given MD's endio acknowledges writes after they
are submitted to all raid members?  Is it merely that the bitmap is
left with dangling bits set that don't reflect reality (blocks weren't
actually changed anywhere) when a crash occurs?  Is there real
potential for inconsistent data on disk(s) when using sync raid1 (does
having an nbd member increase the likelihood)?

regards,
Mike


Re: [PATCH] md: new bitmap sysfs interface

2006-07-26 Thread Mike Snitzer

On 7/25/06, Paul Clements [EMAIL PROTECTED] wrote:

This patch (tested against 2.6.18-rc1-mm1) adds a new sysfs interface
that allows the bitmap of an array to be dirtied. The interface is
write-only, and is used as follows:

echo 1000 > /sys/block/md2/md/bitmap

(dirty the bit for chunk 1000 [offset 0] in the in-memory and on-disk
bitmaps of array md2)

echo 1000-2000 > /sys/block/md1/md/bitmap

(dirty the bits for chunks 1000-2000 in md1's bitmap)

This is useful, for example, in cluster environments where you may need
to combine two disjoint bitmaps into one (following a server failure,
after a secondary server has taken over the array). By combining the
bitmaps on the two servers, a full resync can be avoided (This was
discussed on the list back on March 18, 2005, in the "[PATCH 1/2] md bitmap bug
fixes" thread).


Hi Paul,

I tracked down the thread you referenced and these posts (by you)
seem to summarize things well:
http://marc.theaimsgroup.com/?l=linux-raidm=16563016418w=2
http://marc.theaimsgroup.com/?l=linux-raidm=17515400864w=2

But for clarity's sake, could you elaborate on the negative
implications of not merging the bitmaps on the secondary server?  Will
the previous primary's dirty blocks get dropped on the floor because
the secondary (now the primary) doesn't have awareness of the previous
primary's dirty blocks once it activates the raid1?

Also, what is the interface one should use to collect dirty bits from
the primary's bitmap?

This bitmap merge can't happen until the primary's dirty bits can be
collected, right?  Waiting for the failed server to come back so its
dirty bits can be harvested seems wrong (why fail over at all?), so I
must be missing something.

please advise, thanks.
Mike


Re: [PATCH] md: new bitmap sysfs interface

2006-07-26 Thread Mike Snitzer

On 7/26/06, Paul Clements [EMAIL PROTECTED] wrote:

Mike Snitzer wrote:

 I tracked down the thread you referenced and these posts (by you)
 seem to summarize things well:
 http://marc.theaimsgroup.com/?l=linux-raidm=16563016418w=2
 http://marc.theaimsgroup.com/?l=linux-raidm=17515400864w=2

 But for clarity's sake, could you elaborate on the negative
 implications of not merging the bitmaps on the secondary server?  Will
 the previous primary's dirty blocks get dropped on the floor because
 the secondary (now the primary) doesn't have awareness of the previous
 primary's dirty blocks once it activates the raid1?

Right. At the time of the failover, there were (probably) blocks that
were out of sync between the primary and secondary. Now, after you've
failed over to the secondary, you've got to overwrite those blocks with
data from the secondary in order to make the primary disk consistent
again. This requires that either you do a full resync from secondary to
primary (if you don't know what differs), or you merge the two bitmaps
and resync just that data.


I took more time to read the later posts in the original thread; that
coupled with your detailed response has helped a lot. thanks.


 Also, what is the interface one should use to collect dirty bits from
 the primary's bitmap?

Whatever you'd like. scp the bitmap file over or collect the ranges into
a file and scp that over, or something similar.


OK, so regardless of whether you are using an external or internal
bitmap, how does one collect the ranges from an array's bitmap?

Generally speaking I think others would have the same (naive) question,
given that we need to know what to use as input for the sysfs
interface you've kindly provided.  If it is left as an exercise to
the user that is fine; I'd imagine neilb will get our backs with a
nifty new mdadm flag if need be.
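
For the record, once the ranges have been extracted from the old primary's
bitmap (by whatever user-space means), feeding them to this interface is
the easy part (a sketch; the range file and how it is produced are exactly
the exercise discussed above):

  while read range; do
      echo "$range" > /sys/block/md2/md/bitmap    # e.g. "1000" or "1000-2000"
  done < /tmp/primary-dirty-ranges.txt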

thanks again,
Mike


Re: accessing mirrired lvm on shared storage

2006-04-13 Thread Mike Snitzer
On 4/12/06, Neil Brown [EMAIL PROTECTED] wrote:

 One thing that is on my todo list is supporting shared raid1, so that
 several nodes in the cluster can assemble the same raid1 and access it
 - providing that the clients all do proper mutual exclusion as
 e.g. OCFS does.

Very cool... that would be extremely nice to have.  Any estimate on
when you might get to this?

Mike


mdadm 2.2 ver1 superblock regression?

2006-04-06 Thread Mike Snitzer
When I try to create a RAID1 array with a ver 1.0 superblock using mdadm > 2.2 I'm getting:
WARNING - superblock isn't sized correctly

Looking at the code (and adding a bit more debugging) it is clear that
all 3 checks fail in super1.c's calc_sb_1_csum()'s "make sure I can
count..." test.

Is this a regression in mdadm 2.4, 2.3.1 and 2.3 (NOTE: mdadm 2.2's
ver1 sb works!)?

please advise, thanks.
Mike


Re: mdadm 2.2 ver1 superblock regression?

2006-04-06 Thread Mike Snitzer
On 4/7/06, Neil Brown [EMAIL PROTECTED] wrote:
 On Friday April 7, [EMAIL PROTECTED] wrote:
 
  Seeing this hasn't made it into a released kernel yet, I might just
  change it.  But I'll have to make sure that old mdadm's don't mess
  things up ... I wonder how I will do that :-(
 
  Thanks for the report.

 Yes, try 2.4.1 (just released).


Works great.. thanks for the extremely quick response and fix.

Mike