Re: [PATCH] md: new bitmap sysfs interface
On 7/25/06, Paul Clements <[EMAIL PROTECTED]> wrote:
> This patch (tested against 2.6.18-rc1-mm1) adds a new sysfs interface
> that allows the bitmap of an array to be dirtied. The interface is
> write-only, and is used as follows:
>
> echo "1000" > /sys/block/md2/md/bitmap
>   (dirty the bit for chunk 1000 [offset 0] in the in-memory and
>    on-disk bitmaps of array md2)
>
> echo "1000-2000" > /sys/block/md1/md/bitmap
>   (dirty the bits for chunks 1000-2000 in md1's bitmap)
>
> This is useful, for example, in cluster environments where you may
> need to combine two disjoint bitmaps into one (following a server
> failure, after a secondary server has taken over the array). By
> combining the bitmaps on the two servers, a full resync can be avoided
> (this was discussed on the list back on March 18, 2005, in the
> "[PATCH 1/2] md bitmap bug fixes" thread).

Hi Paul,

I tracked down the thread you referenced, and these posts (by you) seem to summarize things well:
http://marc.theaimsgroup.com/?l=linux-raid&m=16563016418&w=2
http://marc.theaimsgroup.com/?l=linux-raid&m=17515400864&w=2

But for clarity's sake, could you elaborate on the negative implications of not merging the bitmaps on the secondary server? Will the previous primary's dirty blocks get dropped on the floor because the secondary (now the primary) doesn't have awareness of the previous primary's dirty blocks once it activates the raid1?

Also, what is the interface one should use to collect dirty bits from the primary's bitmap? This bitmap merge can't happen until the primary's dirty bits can be collected, right? Waiting for the failed server to come back to harvest the dirty bits it has seems wrong (why fail over at all?), so I must be missing something.

please advise, thanks,
Mike
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] md: new bitmap sysfs interface
On 7/26/06, Paul Clements <[EMAIL PROTECTED]> wrote:
> Mike Snitzer wrote:
> > But for clarity's sake, could you elaborate on the negative
> > implications of not merging the bitmaps on the secondary server? Will
> > the previous primary's dirty blocks get dropped on the floor because
> > the secondary (now the primary) doesn't have awareness of the previous
> > primary's dirty blocks once it activates the raid1?
>
> Right. At the time of the failover, there were (probably) blocks that
> were out of sync between the primary and secondary. Now, after you've
> failed over to the secondary, you've got to overwrite those blocks with
> data from the secondary in order to make the primary disk consistent
> again. This requires that either you do a full resync from secondary to
> primary (if you don't know what differs), or you merge the two bitmaps
> and resync just that data.

I took more time to read the later posts in the original thread; that, coupled with your detailed response, has helped a lot. thanks.

> > Also, what is the interface one should use to collect dirty bits from
> > the primary's bitmap?
>
> Whatever you'd like. scp the bitmap file over, or collect the ranges
> into a file and scp that over, or something similar.

OK, so regardless of whether you are using an external or internal bitmap, how does one collect the ranges from an array's bitmap? Generally speaking, I think others would have the same (naive) question, given that we need to know what to use as input for the sysfs interface you've kindly provided. If it is left as an exercise to the user that is fine; I'd imagine neilb will get our backs with a nifty new mdadm flag if need be.

thanks again,
Mike
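Paul's "collect the ranges into a file and scp that over" suggestion can be sketched as a small shell loop. This is a sketch under stated assumptions, not part of the patch: the range-file name and location, the transfer step, and the assumption that the file holds one "N" or "N-M" range per line (the format the sysfs interface accepts) are all mine; how the ranges get harvested from the primary's bitmap is exactly the open question above.

```shell
#!/bin/sh
# Hypothetical sketch: merge the failed primary's dirty-chunk ranges
# into the secondary's bitmap via the new sysfs interface.

# Assumed: ranges were collected on the primary into a file with one
# "N" or "N-M" entry per line, then copied here, e.g.:
#   scp primary:/tmp/dirty-ranges /tmp/primary-dirty-ranges
RANGES=/tmp/primary-dirty-ranges
MD_BITMAP=/sys/block/md2/md/bitmap

while read range; do
    # each write dirties the named chunk(s) in md2's in-memory
    # and on-disk bitmaps
    echo "$range" > "$MD_BITMAP"
done < "$RANGES"
```

After the merge, a resync covers the union of both servers' dirty chunks instead of the whole array.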
Re: [PATCH] md: new bitmap sysfs interface
On 7/26/06, Paul Clements <[EMAIL PROTECTED]> wrote:
> Right. At the time of the failover, there were (probably) blocks that
> were out of sync between the primary and secondary.

OK, so now that I understand the need to merge the bitmaps... the various scenarios that create this (potential) inconsistency are still unclear to me when you consider the different flavors of raid1.

Is this inconsistency only possible if using async (aka write-behind) raid1? If not, how would this difference in committed blocks occur with normal (sync) raid1, given MD's endio acknowledges writes after they are submitted to all raid members? Is it merely that the bitmap is left with dangling bits set that don't reflect reality (blocks weren't actually changed anywhere) when a crash occurs? Is there real potential for inconsistent data on disk(s) when using sync raid1 (does having an nbd member increase the likelihood)?

regards,
Mike
Re: [PATCH 010 of 10] md: Allow the write_mostly flag to be set via sysfs.
Aside from this write-mostly sysfs support, is there a way to toggle the write-mostly bit of an md member with mdadm? I couldn't identify a clear way to do so. It'd be nice if mdadm --assemble would honor --write-mostly...

On 6/1/06, NeilBrown <[EMAIL PROTECTED]> wrote:

It appears in /sys/mdX/md/dev-YYY/state and can be set or cleared by
writing 'writemostly' or '-writemostly' respectively.

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./Documentation/md.txt |    5 +++++
 ./drivers/md/md.c      |   12 ++++++++++++
 2 files changed, 17 insertions(+)

diff ./Documentation/md.txt~current~ ./Documentation/md.txt
--- ./Documentation/md.txt~current~	2006-06-01 15:05:30.000000000 +1000
+++ ./Documentation/md.txt	2006-06-01 15:05:30.000000000 +1000
@@ -309,6 +309,9 @@ Each directory contains:
 	faulty   - device has been kicked from active use due to
 	           a detected fault
 	in_sync  - device is a fully in-sync member of the array
+	writemostly - device will only be subject to read
+	           requests if there are no other options.
+	           This applies only to raid1 arrays.
 	spare    - device is working, but not a full member.
 	           This includes spares that are in the process
 	           of being recovered to
@@ -316,6 +319,8 @@ Each directory contains:
 	This can be written to.
 	Writing "faulty" simulates a failure on the device.
 	Writing "remove" removes the device from the array.
+	Writing "writemostly" sets the writemostly flag.
+	Writing "-writemostly" clears the writemostly flag.

	errors
	     An approximate count of read errors that have been detected on

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~	2006-06-01 15:05:30.000000000 +1000
+++ ./drivers/md/md.c	2006-06-01 15:05:30.000000000 +1000
@@ -1737,6 +1737,10 @@ state_show(mdk_rdev_t *rdev, char *page)
 		len += sprintf(page+len, "%sin_sync",sep);
 		sep = ",";
 	}
+	if (test_bit(WriteMostly, &rdev->flags)) {
+		len += sprintf(page+len, "%swrite_mostly",sep);
+		sep = ",";
+	}
 	if (!test_bit(Faulty, &rdev->flags) &&
 	    !test_bit(In_sync, &rdev->flags)) {
 		len += sprintf(page+len, "%sspare", sep);
@@ -1751,6 +1755,8 @@ state_store(mdk_rdev_t *rdev, const char
 	/* can write
 	 *  faulty  - simulates an error
 	 *  remove  - disconnects the device
+	 *  writemostly - sets write_mostly
+	 *  -writemostly - clears write_mostly
 	 */
 	int err = -EINVAL;
 	if (cmd_match(buf, "faulty") && rdev->mddev->pers) {
@@ -1766,6 +1772,12 @@ state_store(mdk_rdev_t *rdev, const char
 			md_new_event(mddev);
 			err = 0;
 		}
+	} else if (cmd_match(buf, "writemostly")) {
+		set_bit(WriteMostly, &rdev->flags);
+		err = 0;
+	} else if (cmd_match(buf, "-writemostly")) {
+		clear_bit(WriteMostly, &rdev->flags);
+		err = 0;
 	}
 	return err ? err : len;
 }
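For reference, driving the interface added by this patch from a shell looks like the following (md2 and sdd are example names; per the Documentation/md.txt hunk, each member gets a /sys/block/mdX/md/dev-YYY directory). Note the asymmetry visible in the patch itself: you write "writemostly"/"-writemostly", but state_show() reports the flag as "write_mostly".

```shell
# Set and clear the write-mostly flag on a member device via its sysfs
# 'state' attribute, as documented by the patch above.
# /dev/md2 and member sdd are example names.

# set write-mostly on member sdd of array md2
echo writemostly > /sys/block/md2/md/dev-sdd/state

# clear it again
echo -writemostly > /sys/block/md2/md/dev-sdd/state

# state_show() lists the flag (as "write_mostly"), comma-separated
# with the others (faulty, in_sync, spare)
cat /sys/block/md2/md/dev-sdd/state
```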
Re: [PATCH 010 of 10] md: Allow the write_mostly flag to be set via sysfs.
On 8/5/06, Mike Snitzer <[EMAIL PROTECTED]> wrote:
> Aside from this write-mostly sysfs support, is there a way to toggle
> the write-mostly bit of an md member with mdadm? I couldn't identify a
> clear way to do so. It'd be nice if mdadm --assemble would honor
> --write-mostly...

I went ahead and implemented the ability to toggle the write-mostly bit for all disks in an array. I did so by adding another type of --update to --assemble. This is very useful for a 2-disk raid1 (one disk local, one remote): when you switch the raidhost you also need to toggle the write-mostly bit.

I've tested the attached patch to work with both ver0.90 and ver1 superblocks with mdadm 2.4.1 and 2.5.2. The patch is against mdadm 2.4.1 but applies cleanly (with fuzz) against mdadm 2.5.2.

# cat /proc/mdstat
...
md2 : active raid1 nbd2[0] sdd[1](W)
      390613952 blocks [2/2] [UU]
      bitmap: 0/187 pages [0KB], 1024KB chunk

# mdadm -S /dev/md2
# mdadm --assemble /dev/md2 --run --update=toggle-write-mostly /dev/sdd /dev/nbd2
mdadm: /dev/md2 has been started with 2 drives.

# cat /proc/mdstat
...
md2 : active raid1 nbd2[0](W) sdd[1]
      390613952 blocks [2/2] [UU]
      bitmap: 0/187 pages [0KB], 1024KB chunk

diff -Naur mdadm-2.4.1/mdadm.c mdadm-2.4.1_toggle_write_mostly/mdadm.c
--- mdadm-2.4.1/mdadm.c	2006-03-28 21:55:39.000000000 -0500
+++ mdadm-2.4.1_toggle_write_mostly/mdadm.c	2006-08-05 17:01:48.000000000 -0400
@@ -587,6 +587,8 @@
 				continue;
 			if (strcmp(update, "uuid")==0)
 				continue;
+			if (strcmp(update, "toggle-write-mostly")==0)
+				continue;
 			if (strcmp(update, "byteorder")==0) {
 				if (ss) {
 					fprintf(stderr, Name ": must not set metadata type with --update=byteorder.\n");
@@ -601,7 +603,7 @@
 				continue;
 			}
-			fprintf(stderr, Name ": '--update %s' invalid.  Only 'sparc2.2', 'super-minor', 'uuid', 'resync' or 'summaries' supported\n",update);
+			fprintf(stderr, Name ": '--update %s' invalid.  Only 'sparc2.2', 'super-minor', 'uuid', 'resync', 'summaries' or 'toggle-write-mostly' supported\n",update);
 			exit(2);
 		case O(ASSEMBLE,'c'): /* config file */

diff -Naur mdadm-2.4.1/super0.c mdadm-2.4.1_toggle_write_mostly/super0.c
--- mdadm-2.4.1/super0.c	2006-03-28 01:10:51.000000000 -0500
+++ mdadm-2.4.1_toggle_write_mostly/super0.c	2006-08-05 18:04:45.000000000 -0400
@@ -382,6 +382,10 @@
 			rv = 1;
 		}
 	}
+	if (strcmp(update, "toggle-write-mostly")==0) {
+		int d = info->disk.number;
+		sb->disks[d].state ^= (1<<MD_DISK_WRITEMOSTLY);
+	}
 		int d = info->disk.number;
 		memset(&sb->disks[d], 0, sizeof(sb->disks[d]));

diff -Naur mdadm-2.4.1/super1.c mdadm-2.4.1_toggle_write_mostly/super1.c
--- mdadm-2.4.1/super1.c	2006-04-07 00:32:06.000000000 -0400
+++ mdadm-2.4.1_toggle_write_mostly/super1.c	2006-08-05 18:33:21.000000000 -0400
@@ -446,6 +446,9 @@
 			rv = 1;
 		}
 	}
+	if (strcmp(update, "toggle-write-mostly")==0) {
+		sb->devflags ^= WriteMostly1;
+	}
 #if 0
 	if (strcmp(update, "newdev") == 0) {
 		int d = info->disk.number;
issue with mdadm ver1 sb and bitmap on x86_64
FYI, with both mdadm 2.4.1 and 2.5.2 I can't mdadm --create with a ver1 superblock and a write-intent bitmap on x86_64.

running:
mdadm --create /dev/md2 -e 1.0 -l 1 --bitmap=internal -n 2 /dev/sdd --write-mostly /dev/nbd2

I get:
mdadm: RUN_ARRAY failed: Invalid argument

Mike
Re: [patch] md: pass down BIO_RW_SYNC in raid{1,10}
On 1/8/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> On Mon, 8 Jan 2007 10:08:34 +0100 Lars Ellenberg <[EMAIL PROTECTED]> wrote:
> > md raidX make_request functions strip off the BIO_RW_SYNC flag,
> > thus introducing additional latency.
> >
> > fixing this in raid1 and raid10 seems to be straight forward enough.
> >
> > for our particular usage case in DRBD, passing this flag improved
> > some initialization time from ~5 minutes to ~5 seconds.
>
> That sounds like a significant fix.

So will this fix also improve performance associated with raid1's internal bitmap support? What is the scope of the performance problems this fix will address? That is, what are some other examples of where users might see a benefit from this patch?

regards,
Mike
raid1 with nbd member hangs MD on SLES10 and RHEL5
When using raid1 with one local member and one nbd member (marked as write-mostly), MD hangs when trying to format /dev/md0 with ext3. Both 'cat /proc/mdstat' and 'mdadm --detail /dev/md0' hang indefinitely.

I've not tried to reproduce on 2.6.18 or 2.6.19ish kernel.org kernels yet, but this issue affects both SLES10 and RHEL5. sysrq traces for RHEL5 follow (addresses were mangled in transit); I don't have immediate access to a SLES10 system at the moment but I've seen this same hang with SLES10 SP1 RC4:

cat /proc/mdstat:

cat           S 8100048e7de8  6208 11428  11391 (NOTLB)
Call Trace:
 [] seq_printf+0x67/0x8f
 [] __mutex_lock_interruptible_slowpath+0x7f/0xbc
 [] md_seq_show+0x123/0x6aa
 [] seq_read+0x1b8/0x28d
 [] vfs_read+0xcb/0x171
 [] sys_read+0x45/0x6e
 [] tracesys+0xd1/0xdc

/sbin/mdadm --detail /dev/md0:

mdadm         S 810035a1dd78  6384  3829   3828 (NOTLB)
Call Trace:
 [] mntput_no_expire+0x19/0x89
 [] __mutex_lock_interruptible_slowpath+0x7f/0xbc
 [] md_open+0x2e/0x68
 [] do_open+0x216/0x316
 [] blkdev_open+0x0/0x4f
 [] blkdev_open+0x23/0x4f
 [] __dentry_open+0xd9/0x1dc
 [] do_filp_open+0x2d/0x3d
 [] do_sys_open+0x44/0xbe
 [] tracesys+0xd1/0xdc

I can provide more detailed information; please just ask.

thanks,
Mike
Re: raid1 with nbd member hangs MD on SLES10 and RHEL5
On 6/12/07, Neil Brown <[EMAIL PROTECTED]> wrote:
> On Tuesday June 12, [EMAIL PROTECTED] wrote:
> >
> > I can provide more detailed information; please just ask.
>
> A complete sysrq trace (all processes) might help.

I'll send it to you off list.

thanks,
Mike
Re: raid1 with nbd member hangs MD on SLES10 and RHEL5
On 6/13/07, Mike Snitzer <[EMAIL PROTECTED]> wrote:
> On 6/12/07, Neil Brown <[EMAIL PROTECTED]> wrote:
> > On Tuesday June 12, [EMAIL PROTECTED] wrote:
> > >
> > > I can provide more detailed information; please just ask.
> >
> > A complete sysrq trace (all processes) might help.

Bringing this back to a wider audience.

I provided the full sysrq trace of the RHEL5 kernel to Neil; in it we saw that md0_raid1 had the following trace:

md0_raid1     D 810026183ce0  5368 31663     11  3822 29488 (L-TLB)
Call Trace:
 [] keventd_create_kthread+0x0/0x61
 [] md_super_wait+0xa8/0xbc
 [] autoremove_wake_function+0x0/0x2e
 [] md_update_sb+0x1dd/0x23a
 [] md_check_recovery+0x15f/0x449
 [] :raid1:raid1d+0x27/0xc1e
 [] thread_return+0x0/0xde
 [] __sched_text_start+0xc/0xa79
 [] keventd_create_kthread+0x0/0x61
 [] schedule_timeout+0x1e/0xad
 [] keventd_create_kthread+0x0/0x61
 [] md_thread+0xf8/0x10e
 [] autoremove_wake_function+0x0/0x2e
 [] md_thread+0x0/0x10e
 [] kthread+0xd4/0x109
 [] child_rip+0xa/0x11
 [] keventd_create_kthread+0x0/0x61
 [] kthread+0x0/0x109
 [] child_rip+0x0/0x11

To which Neil had the following to say:

> md0_raid1 is holding the lock on the array and trying to write out the
> superblocks for some reason, and the write isn't completing.
> As it is holding the locks, mdadm and /proc/mdstat are hanging.
>
> You seem to have nbd-servers running on this machine. Are they
> serving the device that md is using (i.e. a loop-back situation)? I
> would expect memory deadlocks would be very easy to hit in that
> situation, but I don't know if that is what has happened.
>
> Nothing else stands out.
>
> Could you clarify the arrangement of nbd. Where are the servers and
> what are they serving?

We're using MD+NBD for disaster recovery (one local scsi device, one remote via nbd). The nbd-server is not contributing to md0. The nbd-server is connected to a remote machine that is running a raid1 remotely.

To take this further I've now collected a full sysrq trace of this hang on a SLES10 SP1 RC5 2.6.16.46-0.12-smp kernel; the relevant md0_raid1 trace is comparable to the RHEL5 trace from above:

md0_raid1     D 810001089780     0  8583     51  8952  8260 (L-TLB)
Call Trace:
 {generic_make_request+501}
 {md_super_wait+168}
 {autoremove_wake_function+0}
 {write_page+128}
 {md_update_sb+220}
 {md_check_recovery+361}
 {:raid1:raid1d+38}
 {lock_timer_base+27}
 {try_to_del_timer_sync+81}
 {del_timer_sync+12}
 {schedule_timeout+146}
 {keventd_create_kthread+0}
 {md_thread+248}
 {autoremove_wake_function+0}
 {md_thread+0}
 {kthread+236}
 {child_rip+8}
 {keventd_create_kthread+0}
 {kthread+0}
 {child_rip+0}

Taking a step back, here is what was done to reproduce on SLES10:
1) establish a raid1 mirror (md0) using one local member (sdc1) and one remote member (nbd0)
2) power off the remote machine, whereby severing nbd0's connection
3) perform IO to the filesystem that is on the md0 device to induce the MD layer to mark the nbd device as "faulty"
4) cat /proc/mdstat hangs; a sysrq trace was collected and showed the above md0_raid1 trace.

To be clear, the MD superblock update hangs indefinitely on RHEL5. But with SLES10 it eventually succeeds (and MD marks the nbd0 member faulty); and the other tasks that were blocking waiting for the MD lock (e.g. 'cat /proc/mdstat') then complete immediately.

It should be noted that this MD+NBD configuration has worked flawlessly using a stock kernel.org 2.6.15.7 kernel (on top of a RHEL4U4 distro). Steps have not been taken to try to reproduce with 2.6.15.7 on SLES10; it may be useful to pursue but I'll defer to others to suggest I do so.

2.6.15.7 does not have the SMP race fixes that were made in 2.6.16, yet both SLES10 and RHEL5 kernels do:
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4b2f0260c74324abca76ccaa42d426af163125e7

If not this specific NBD change, something appears to have changed with how NBD behaves in the face of its connection to the server being lost. Almost like the MD superblock update that would be written to nbd0 is blocking within nbd or the network layer because of a network timeout issue? I will try to get a better understanding of what is _really_ happening with systemtap; but othe
Re: Cluster Aware MD Driver
Is the goal to have the MD device be directly accessible from all nodes? This strategy seems flawed in that it speaks to updating MD superblocks and then in-memory Linux data structures across a cluster. The reality is that if we're talking about shared storage, the MD management only needs to happen on one node. Others can weigh in on this, but the current MD really doesn't want to be cluster-aware.

IMHO, this cluster awareness really doesn't belong in MD/mdadm. A high-level cluster management tool should be doing this MD ownership/coordination work. The MD ownership can be transferred accordingly if/when the current owner fails, etc. But this implies that the MD is only ever active on one node at any given point in time.

Mike

On 6/13/07, Xinwei Hu <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> Steven Dake proposed a solution* to make the MD layer and tools
> cluster-aware in early 2003, but it seems that no progress has been
> made since then. I'd like to pick this one up again. :)
>
> So far as I understand, Steven's proposal still mostly applies to the
> current MD implementation, except we have the bitmap now. And the
> bitmap can be worked around via set_bitmap_file. The problem is that
> it seems we need a kernel<->userspace interface to sync the mddev
> struct across all nodes, but I don't find out how.
>
> I'm new to the MD driver, so correct me if I'm wrong. And your
> suggestions are really appreciated.
>
> Thanks.
>
> * http://osdir.com/ml/raid/2003-01/msg00013.html
Re: raid1 with nbd member hangs MD on SLES10 and RHEL5
On 6/13/07, Mike Snitzer <[EMAIL PROTECTED]> wrote:
> On 6/13/07, Mike Snitzer <[EMAIL PROTECTED]> wrote:
> > On 6/12/07, Neil Brown <[EMAIL PROTECTED]> wrote:
...
> To be clear, the MD superblock update hangs indefinitely on RHEL5.
> But with SLES10 it eventually succeeds (and MD marks the nbd0 member
> faulty); and the other tasks that were blocking waiting for the MD
> lock (e.g. 'cat /proc/mdstat') then complete immediately.
...
> If not this specific NBD change, something appears to have changed
> with how NBD behaves in the face of its connection to the server
> being lost. Almost like the MD superblock update that would be
> written to nbd0 is blocking within nbd or the network layer because
> of a network timeout issue?

Just a quick update; it is really starting to look like there is definitely an issue with the nbd kernel driver.

I booted the SLES10 2.6.16.46-0.12-smp kernel with maxcpus=1 to test the theory that the nbd SMP fix that went into 2.6.16 was in some way causing this MD/NBD hang. But it _still_ occurs with the 4-step process I outlined above. The nbd0 device _sho
Re: raid1 with nbd member hangs MD on SLES10 and RHEL5
On 6/14/07, Bill Davidsen <[EMAIL PROTECTED]> wrote:
Mike Snitzer wrote:
> On 6/13/07, Mike Snitzer <[EMAIL PROTECTED]> wrote:
>> On 6/13/07, Mike Snitzer <[EMAIL PROTECTED]> wrote:
>> > On 6/12/07, Neil Brown <[EMAIL PROTECTED]> wrote:
...
>> Taking a step back, here is what was done to reproduce on SLES10:
>> 1) establish a raid1 mirror (md0) using one local member (sdc1) and
>> one remote member (nbd0)
>> 2) power off the remote machine, whereby severing nbd0's connection
>> 3) perform IO to the filesystem that is on the md0 device to induce
>> the MD layer to mark the nbd device as "faulty"
>> 4) cat /proc/mdstat hangs, sysrq trace was collected and showed the
>> above md0_raid1 trace.
>>
>> To be clear, the MD superblock update hangs indefinitely on RHEL5.
>> But with SLES10 it eventually succeeds (and MD marks the nbd0 member
>> faulty); and the other tasks that were blocking waiting for the MD
>> lock (e.g. 'cat /proc/mdstat') then complete immediately.
>>
>> It should be noted that this MD+NBD configuration has worked
>> flawlessly using a stock kernel.org 2.6.15.7 kernel (on top of a
>> RHEL4U4 distro). Steps have not been taken to try to reproduce with
>> 2.6.15.7 on SLE
Re: raid1 with nbd member hangs MD on SLES10 and RHEL5
On 6/14/07, Paul Clements <[EMAIL PROTECTED]> wrote:
> Bill Davidsen wrote:
> > Second, AFAIK nbd hasn't worked in a while. I haven't tried it in
> > ages, but was told it wouldn't work with smp and I kind of lost
> > interest. If Neil thinks it should work in 2.6.21 or later I'll test
> > it, since I have a machine which wants a fresh install soon, and is
> > both backed up and available.
>
> Please stop this. nbd is working perfectly fine, AFAIK. I use it every
> day, and so do 100s of our customers. What exactly is it that's not
> working? If there's a problem, please send the bug report.

Paul,

This thread details what I've experienced using MD (raid1) with 2 devices, one being a local scsi device and the other an NBD device. I've yet to put effort into pinpointing the problem in a kernel.org kernel; however, both SLES10 and RHEL5 kernels appear to be hanging in either 1) nbd or 2) the socket layer.

Here are the steps to reproduce reliably on SLES10 SP1:
1) establish a raid1 mirror (md0) using one local member (sdc1) and one remote member (nbd0)
2) power off the remote machine, whereby severing nbd0's connection
3) perform IO to the filesystem that is on the md0 device to induce the MD layer to mark the nbd device as "faulty"
4) cat /proc/mdstat hangs; a sysrq trace was collected

To be clear, the MD superblock update hangs indefinitely on RHEL5. But with SLES10 it eventually succeeds after ~5min (and MD marks the nbd0 member faulty); and the other tasks that were blocking waiting for the MD lock (e.g. 'cat /proc/mdstat') then complete immediately. If you look back in this thread you'll see traces for md0_raid1 for both SLES10 and RHEL5.

I hope to try to reproduce this issue on kernel.org 2.6.16.46 (the basis for SLES10). If I can, I'll then git bisect back to try to pinpoint the regression; I obviously need to verify that 2.6.16 works in this situation on SMP.

Mike
Re: raid1 with nbd member hangs MD on SLES10 and RHEL5
On 6/14/07, Paul Clements <[EMAIL PROTECTED]> wrote: Mike Snitzer wrote: > Here are the steps to reproduce reliably on SLES10 SP1: > 1) establish a raid1 mirror (md0) using one local member (sdc1) and > one remote member (nbd0) > 2) power off the remote machine, thereby severing nbd0's connection > 3) perform IO to the filesystem that is on the md0 device to induce > the MD layer to mark the nbd device as "faulty" > 4) cat /proc/mdstat hangs, sysrq trace was collected That's working as designed. NBD works over TCP. You're going to have to wait for TCP to time out before an error occurs. Until then I/O will hang. With kernel.org 2.6.15.7 (uni-processor) I've not seen NBD hang in the kernel like I am with RHEL5 and SLES10. This hang (tcp timeout) is indefinite on RHEL5 and ~5min on SLES10. Should/can I be playing with TCP timeout values? Why was this not a concern with kernel.org 2.6.15.7? I was able to "feel" the nbd connection break immediately; no MD superblock update hangs, no long-winded (or indefinite) TCP timeout. regards, Mike
Re: raid1 with nbd member hangs MD on SLES10 and RHEL5
On 6/14/07, Paul Clements <[EMAIL PROTECTED]> wrote: Mike Snitzer wrote: > On 6/14/07, Paul Clements <[EMAIL PROTECTED]> wrote: >> Mike Snitzer wrote: >> >> > Here are the steps to reproduce reliably on SLES10 SP1: >> > 1) establish a raid1 mirror (md0) using one local member (sdc1) and >> > one remote member (nbd0) >> > 2) power off the remote machine, thereby severing nbd0's connection >> > 3) perform IO to the filesystem that is on the md0 device to induce >> > the MD layer to mark the nbd device as "faulty" >> > 4) cat /proc/mdstat hangs, sysrq trace was collected >> >> That's working as designed. NBD works over TCP. You're going to have to >> wait for TCP to time out before an error occurs. Until then I/O will >> hang. > > With kernel.org 2.6.15.7 (uni-processor) I've not seen NBD hang in the > kernel like I am with RHEL5 and SLES10. This hang (tcp timeout) is > indefinite on RHEL5 and ~5min on SLES10. > > Should/can I be playing with TCP timeout values? Why was this not a > concern with kernel.org 2.6.15.7? I was able to "feel" the nbd > connection break immediately; no MD superblock update hangs, no > long-winded (or indefinite) TCP timeout. I don't know. I've never seen nbd immediately start returning I/O errors. Perhaps something was different about the configuration? If the other machine rebooted quickly, for instance, you'd get a connection reset, which would kill the nbd connection. OK, I'll retest the 2.6.15.7 setup. As for SLES10 and RHEL5, I've been leaving the remote server powered off. As such I'm at the full mercy of the TCP timeout. It is odd that RHEL5 has been hanging indefinitely, but I'll dig deeper on that once I come to terms with how kernel.org and SLES10 behave. I'll update with my findings for completeness. Thanks for your insight! Mike
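An aside on the TCP timeout question above: nbd only sees an I/O error once TCP gives up retransmitting, which on Linux is governed by the sysctl net.ipv4.tcp_retries2 (default 15). A rough model of the giving-up time, assuming exponential RTO backoff from a 200ms initial value capped at 120s (the actual initial RTO scales with the measured RTT, so real-world timeouts vary):

```python
def retransmit_window(retries=15, rto=0.2, rto_max=120.0):
    """Approximate seconds TCP keeps retransmitting before erroring out:
    the retransmission timeout (RTO) doubles on each attempt, capped at rto_max."""
    total = 0.0
    for _ in range(retries):
        total += rto
        rto = min(rto * 2, rto_max)
    return total

print(round(retransmit_window() / 60, 1))  # roughly 13 minutes with the defaults
```

Lowering tcp_retries2 shortens that window; a difference in the effective retry budget between the two distro kernels might explain why one hangs for minutes and the other seemingly forever.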
Need clarification on raid1 resync behavior with bitmap support
On 6/1/06, NeilBrown <[EMAIL PROTECTED]> wrote: When an array has a bitmap, a device can be removed and re-added and only blocks changed since the removal (as recorded in the bitmap) will be resynced. Neil, Does the same apply when a bitmap-enabled raid1's member goes faulty? Meaning even if a member is faulty, when the user removes and re-adds the faulty device the raid1 rebuild _should_ leverage the bitmap during a resync, right? I've seen messages like:
[12068875.690255] raid1: raid set md0 active with 2 out of 2 mirrors
[12068875.690284] md0: bitmap file is out of date (0 < 1) -- forcing full recovery
[12068875.690289] md0: bitmap file is out of date, doing full recovery
[12068875.710214] md0: bitmap initialized from disk: read 5/5 pages, set 131056 bits, status: 0
[12068875.710222] created bitmap (64 pages) for device md0
Could you share the other situations where a bitmap-enabled raid1 _must_ perform a full recovery? - Correct me if I'm wrong, but one that comes to mind is when a server reboots (after cleanly stopping a raid1 array that had a faulty member) and then either: 1) assembles the array with the previously faulty member now available 2) assembles the array with the same faulty member missing. The user later re-adds the faulty member AFAIK both scenarios would bring about a full resync. regards, Mike
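A note on the "out of date (0 < 1)" message quoted above: the bitmap superblock carries its own event counter, and when it lags behind the array's event counter md distrusts the bitmap and forces a full recovery. A toy model of that decision (simplified for illustration; not the literal logic in drivers/md/bitmap.c):

```python
def recovery_mode(bitmap_events, array_events, dirty_chunks):
    """Toy model: an out-of-date bitmap forces full recovery; a current
    bitmap limits resync to the chunks marked dirty."""
    if bitmap_events < array_events:
        # analogous to: "bitmap file is out of date (0 < 1) -- forcing full recovery"
        return "full"
    return "partial" if dirty_chunks else "none"

print(recovery_mode(0, 1, 131056))  # the logged case above: "full"
```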
Re: Need clarification on raid1 resync behavior with bitmap support
On 7/23/07, Neil Brown <[EMAIL PROTECTED]> wrote: On Saturday July 21, [EMAIL PROTECTED] wrote: > Could you share the other situations where a bitmap-enabled raid1 > _must_ perform a full recovery? When you add a new drive. When you create a new bitmap. I think that should be all. > - Correct me if I'm wrong, but one that comes to mind is when a server > reboots (after cleanly stopping a raid1 array that had a faulty > member) and then either: > 1) assembles the array with the previously faulty member now > available > > 2) assembles the array with the same faulty member missing. The user > later re-adds the faulty member > > AFAIK both scenarios would bring about a full resync. Only if the drive is not recognised as the original member. Can you test this out and report a sequence of events that causes a full resync? Sure, using an internal-bitmap-enabled raid1 with 2 loopback devices on a stock 2.6.20.1 kernel, the following sequences result in a full resync. (FYI, I'm fairly certain I've seen this same behavior on 2.6.18 and 2.6.15 kernels too but would need to retest): 1) mdadm /dev/md0 --manage --fail /dev/loop0 mdadm -S /dev/md0 mdadm --assemble /dev/md0 /dev/loop0 /dev/loop1 mdadm: /dev/md0 has been started with 1 drive (out of 2). NOTE: kernel log says: md: kicking non-fresh loop0 from array! mdadm /dev/md0 --manage --re-add /dev/loop0 2) mdadm /dev/md0 --manage --fail /dev/loop0 mdadm /dev/md0 --manage --remove /dev/loop0 mdadm -S /dev/md0 mdadm --assemble /dev/md0 /dev/loop0 /dev/loop1 mdadm: /dev/md0 has been started with 1 drive (out of 2). NOTE: kernel log says: md: kicking non-fresh loop0 from array! mdadm /dev/md0 --manage --re-add /dev/loop0 Is stopping the MD (either with mdadm -S or a server reboot) tainting that faulty member's ability to come back in using a quick bitmap-based resync? 
Mike
Re: Need clarification on raid1 resync behavior with bitmap support
On 8/3/07, Neil Brown <[EMAIL PROTECTED]> wrote: > On Monday July 23, [EMAIL PROTECTED] wrote: > > On 7/23/07, Neil Brown <[EMAIL PROTECTED]> wrote: > > > Can you test this out and report a sequence of events that causes a > > > full resync? > > > > Sure, using an internal-bitmap-enabled raid1 with 2 loopback devices > > on a stock 2.6.20.1 kernel, the following sequences result in a full > > resync. (FYI, I'm fairly certain I've seen this same behavior on > > 2.6.18 and 2.6.15 kernels too but would need to retest): > > > > 1) > > mdadm /dev/md0 --manage --fail /dev/loop0 > > mdadm -S /dev/md0 > > mdadm --assemble /dev/md0 /dev/loop0 /dev/loop1 > > mdadm: /dev/md0 has been started with 1 drive (out of 2). > > NOTE: kernel log says: md: kicking non-fresh loop0 from array! > > mdadm /dev/md0 --manage --re-add /dev/loop0 > > > sorry for the slow response. > > It looks like commit 1757128438d41670ded8bc3bc735325cc07dc8f9 > (December 2006) set conf->fullsync a little too often. > > This seems to fix it, and I'm fairly sure it is correct. > > Thanks, > NeilBrown > > -- > Make sure a re-add after a restart honours bitmap when resyncing. > > Commit 1757128438d41670ded8bc3bc735325cc07dc8f9 was slightly bad. > If an array has a write-intent bitmap, and you remove a drive, > then re-add it, only the changed parts should be resynced. > This only works if the array has not been shut down and restarted. > > The above mentioned commit sets 'fullsync' a little more often > than it should. This patch is more careful. I hand-patched your change into a 2.6.20.1 kernel (I'd imagine your patch is against current git). I didn't see any difference because unfortunately both of my full resync scenarios included stopping a degraded raid after either: 1) failing but not removing a member, or 2) failing and removing a member. In both scenarios, if I didn't stop the array and just removed and re-added the faulty drive, the array would _not_ do a full resync. 
My examples clearly conflict with your assertion that: "This only works if the array has not been shut down and restarted." But shouldn't raid1 be better about leveraging the bitmap of known good (fresh) members even after having stopped a degraded array? Why is it that when an array is stopped raid1 seemingly loses the required metadata that enables bitmap resyncs to just work upon re-add IFF the array is _not_ stopped? Couldn't raid1 be made to assemble the array to look like the array had never been stopped, leaving the non-fresh members out as it already does, and only then re-add the "non-fresh" members that were provided? To be explicit: isn't the bitmap still valid on the fresh members? If so, why is raid1 just disregarding the fresh bitmap? Thanks, I really appreciate your insight. Mike
Re: Need clarification on raid1 resync behavior with bitmap support
On 8/3/07, Neil Brown <[EMAIL PROTECTED]> wrote: > On Friday August 3, [EMAIL PROTECTED] wrote: > > > > I hand-patched your change into a 2.6.20.1 kernel (I'd imagine your > > patch is against current git). I didn't see any difference because > > unfortunately both of my full resync scenarios included stopping a > > degraded raid after either: 1) having failed but not been removed a > > member 2) having failed and removed a member. In both scenarios if I > > didn't stop the array and I just removed and re-added the faulty drive > > the array would _not_ do a full resync. > > > > My examples clearly conflict with your assertion that: "This only > > works if the array has not been shut down and restarted." > > I think my changelog entry for the patch was poorly written. > What I meant to say was: > *before this patch* a remove and re-add only does a partial resync > if the array has not been shutdown and restarted in the interim. > The implication being that *after the patch*, a shutdown and restart > will not interfere and a remove followed by a readd will always do a > partial resync, even if the array was shutdown and restarted while > degraded. Great, thanks for clarifying. > > To be explicit: isn't the bitmap still valid on the fresh members? If > > so, why is raid1 just disregarding the fresh bitmap? > > Yes. Exactly. It is my understanding and experience that the patch I > sent fixes a bug so that it doesn't disregard the fresh bitmap. It > should fix it for 2.6.20.1 as well. > > Are you saying that you tried the same scenario with the patch applied > and it still did a full resync? How do you measure whether it did a > full resync or a partial resync? I must not have loaded the patched raid1.ko because after retesting it is clear that your patch does in fact fix the issue. FYI, before, I could just tell a full resync was occurring by looking at /proc/mdstat and the time that elapsed. Thanks for your help, any idea when this fix will make it upstream? 
regards, Mike
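For anyone repeating the experiment above: the crude /proc/mdstat eyeballing can be scripted. The helper and the sample line below are illustrative only (not output from the machines in question); the tell is that a bitmap-based partial resync reports a total far below the device size, while a full resync counts the whole device.

```python
import re

def resync_progress(mdstat_line):
    """Pull (done_blocks, total_blocks) out of a /proc/mdstat recovery line."""
    m = re.search(r"\((\d+)/(\d+)\)", mdstat_line)
    return (int(m.group(1)), int(m.group(2))) if m else None

sample = "      [==>.................]  recovery = 12.6% (1638400/13000000) finish=3.2min"
print(resync_progress(sample))  # (1638400, 13000000)
```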
Re: detecting read errors after RAID1 check operation
On 8/17/07, Mike Accetta <[EMAIL PROTECTED]> wrote: > > Neil Brown writes: > > On Wednesday August 15, [EMAIL PROTECTED] wrote: > > > Neil Brown writes: > > > > On Wednesday August 15, [EMAIL PROTECTED] wrote: > > > > > > > > ... > > > This happens in our old friend sync_request_write()? I'm dealing with > > > > Yes, that would be the place. > > > > > ... > > > This fragment > > > > > > if (j < 0 || test_bit(MD_RECOVERY_CHECK, &mddev->recovery)) { > > > sbio->bi_end_io = NULL; > > > rdev_dec_pending(conf->mirrors[i].rdev, mddev); > > > } else { > > > /* fixup the bio for reuse */ > > > ... > > > } > > > > > > looks suspicously like any correction attempt for 'check' is being > > > short-circuited to me, regardless of whether or not there was a read > > > error. Actually, even if the rewrite was not being short-circuited, > > > I still don't see the path that would update 'corrected_errors' in this > > > case. There are only two raid1.c sites that touch 'corrected_errors', one > > > is in fix_read_errors() and the other is later in sync_request_write(). > > > With my limited understanding of how this all works, neither of these > > > paths would seem to apply here. > > > > hmmm yes > > I guess I was thinking of the RAID5 code rather than the RAID1 code. > > It doesn't do the right thing does it? > > Maybe this patch is what we need. I think it is right. 
> > Thanks,
> > NeilBrown
> >
> > Signed-off-by: Neil Brown <[EMAIL PROTECTED]>
> >
> > ### Diffstat output
> >  ./drivers/md/raid1.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
> > --- .prev/drivers/md/raid1.c	2007-08-16 10:29:58.0 +1000
> > +++ ./drivers/md/raid1.c	2007-08-17 12:07:35.0 +1000
> > @@ -1260,7 +1260,8 @@ static void sync_request_write(mddev_t *
> >  					j = 0;
> >  			if (j >= 0)
> >  				mddev->resync_mismatches += r1_bio->sectors;
> > -			if (j < 0 || test_bit(MD_RECOVERY_CHECK, &mddev->recovery)) {
> > +			if (j < 0 || (test_bit(MD_RECOVERY_CHECK, &mddev->recovery)
> > +				      && text_bit(BIO_UPTODATE, &sbio->bi_flags))) {
> >  				sbio->bi_end_io = NULL;
> >  				rdev_dec_pending(conf->mirrors[i].rdev, mddev);
> >  			} else {
>
> I tried this (with the typo fixed) and it indeed issues a re-write. However, it doesn't seem to do anything with the corrected errors count if the rewrite succeeds. Since end_sync_write() is only used in one other place when !In_sync, I tried the following and it seems to work to get the error count updated. I don't know whether this belongs in end_sync_write(), but I'd think it needs to come after the write actually succeeds, so that seems like the earliest it could be done.

Neil, Any feedback on Mike's patch? thanks, Mike
Re: mke2fs stuck in D state while creating filesystem on md*
On 9/19/07, Wiesner Thomas <[EMAIL PROTECTED]> wrote: > > Has there been any progress on this? I think I saw it, or something > similar, during some testing of recent 2.6.23-rc kernels: one mke2fs took > about 11 min longer than all the others (~2 min) and it was not > repeatable. I worry that a process of more interest will have the same > hang. > > Well, I must say: no. I haven't tried anything further. I've set up the > production system a week or so ago > which runs Debian Etch with no modifications (kernel 2.6.18 I think, the > debian one and a mdadm 2.5.6-9). > I didn't notice the problem while creating the raid but that doesn't mean > anything as I didn't pay attention, > and as I wrote earlier it isn't reliably reproducible. > (Watching it on a large storage gets boring very fast.) > > I'm not a kernel programmer but I can test another kernel or mdadm version > if it helps, but let me know > if you want me to do that. If/when you experience the hang again please get a trace of all processes with: echo t > /proc/sysrq-trigger Of particular interest is the mke2fs trace, as well as any md threads.
mdadm > 2.2 ver1 superblock regression?
When I try to create a RAID1 array with ver 1.0 superblock using mdadm > 2.2 I'm getting: WARNING - superblock isn't sized correctly Looking at the code (and adding a bit more debugging) it is clear that all 3 checks fail in super1.c's calc_sb_1_csum()'s "make sure I can count..." test. Is this a regression in mdadm 2.4, 2.3.1 and 2.3 (NOTE: mdadm 2.2's ver1 sb works!)? please advise, thanks. Mike
Re: mdadm > 2.2 ver1 superblock regression?
On 4/7/06, Neil Brown <[EMAIL PROTECTED]> wrote: > On Friday April 7, [EMAIL PROTECTED] wrote: > > > > Seeing this hasn't made it into a released kernel yet, I might just > > change it. But I'll have to make sure that old mdadm's don't mess > > things up ... I wonder how I will do that :-( > > > > Thanks for the report. > > Yes, try 2.4.1 (just released). Works great.. thanks for the extremely quick response and fix. Mike
Re: accessing mirrored lvm on shared storage
On 4/12/06, Neil Brown <[EMAIL PROTECTED]> wrote: > One thing that is on my todo list is supporting shared raid1, so that > several nodes in the cluster can assemble the same raid1 and access it > - providing that the clients all do proper mutual exclusion as > e.g. OCFS does. Very cool... that would be extremely nice to have. Any estimate on when you might get to this? Mike
Re: accessing mirrored lvm on shared storage
On 4/16/06, Neil Brown <[EMAIL PROTECTED]> wrote: > On Thursday April 13, [EMAIL PROTECTED] wrote: > > On 4/12/06, Neil Brown <[EMAIL PROTECTED]> wrote: > > > > > One thing that is on my todo list is supporting shared raid1, so that > > > several nodes in the cluster can assemble the same raid1 and access it > > > - providing that the clients all do proper mutual exclusion as > > > e.g. OCFS does. > > > > Very cool... that would be extremely nice to have. Any estimate on > > when you might get to this? > > > > I'm working on it, but there are lots of distractions > > The first step is getting support into the kernel for various > operations like suspending and resuming IO and resync. > That is progressing nicely. Sounds good... will it be possible to suspend/resume IO to only specific members of the raid1 (aka partial IO/resync suspend/resume)? If not I have a tangential raid1 suspend/resume question: is there a better/cleaner way to suspend and resume a raid1 mirror than removing and re-adding a member? That is, you: 1) establish a 2 disk raid1 2) suspend the mirror but allow degraded changes to occur (remove member?) 3) after a user specified interval resume the mirror to resync (re-add member?) 4) goto 2 Using the write-intent bitmap, the resync should be relatively cheap. However, would it be better to just use mdadm to tag a member as write-mostly and enable write-behind on the raid1? BUT is there a way to set the write-behind to 0 to force a resync at a certain time (it would appear write-behind is a create-time feature)? thanks, mike
kicking non-fresh member from array?
All, I have repeatedly seen that when a 2 member raid1 becomes degraded, and IO continues to the lone good member, that if the array is then stopped and reassembled you get:
md: bind
md: bind
md: kicking non-fresh nbd0 from array!
md: unbind
md: export_rdev(nbd0)
raid1: raid set md0 active with 1 out of 2 mirrors
I'm not seeing how one can avoid assembling such an array in 2 passes: 1) assemble array with both members 2) if a member was deemed "non-fresh", re-add that member, thereby triggering recovery. So why does MD kick non-fresh members out on assemble when it's perfectly capable of recovering the "non-fresh" member? Looking at md.c it is fairly clear there isn't a way to avoid this 2-step procedure. Why/how does MD benefit from this "kicking non-fresh" semantic? Should MD/mdadm be made optionally tolerant of such non-fresh members during assembly? Mike
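For what it's worth, the freshness test itself is simple. A toy model of what assemble does with the members' superblock event counters (simplified from md.c, not the literal kernel code; md tolerates a member being one event behind, which covers a missed final clean-shutdown update):

```python
def non_fresh(events_by_dev):
    """Toy model of md assembly: members whose superblock event count is
    more than one behind the freshest member get kicked from the array."""
    freshest = max(events_by_dev.values())
    return [dev for dev, ev in sorted(events_by_dev.items()) if ev + 1 < freshest]

# after degraded IO, the surviving member's event count has advanced:
print(non_fresh({"sdc1": 42, "nbd0": 17}))  # ['nbd0'] -- "kicking non-fresh nbd0"
```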
mdadm 2.6.x regression, fails creation of raid1 w/ v1.0 sb and internal bitmap
mdadm 2.4.1 through 2.5.6 works. mdadm-2.6's "Improve allocation and use of space for bitmaps in version1 metadata" (199171a297a87d7696b6b8c07ee520363f4603c1) would seem like the offending change. Using 1.2 metadata works. I get the following using the tip of the mdadm git repo or any other version of mdadm 2.6.x:
# mdadm --create /dev/md2 --run -l 1 --metadata=1.0 --bitmap=internal -n 2 /dev/sdf --write-mostly /dev/nbd2
mdadm: /dev/sdf appears to be part of a raid array:
    level=raid1 devices=2 ctime=Wed Oct 17 10:17:31 2007
mdadm: /dev/nbd2 appears to be part of a raid array:
    level=raid1 devices=2 ctime=Wed Oct 17 10:17:31 2007
mdadm: RUN_ARRAY failed: Input/output error
mdadm: stopped /dev/md2
kernel log shows:
md2: bitmap initialized from disk: read 22/22 pages, set 715290 bits, status: 0
created bitmap (350 pages) for device md2
md2: failed to create bitmap (-5)
md: pers->run() failed ...
md: md2 stopped.
md: unbind
md: export_rdev(nbd2)
md: unbind
md: export_rdev(sdf)
md: md2 stopped.
Re: mdadm 2.6.x regression, fails creation of raid1 w/ v1.0 sb and internal bitmap
On 10/17/07, Bill Davidsen <[EMAIL PROTECTED]> wrote: > Mike Snitzer wrote: > > mdadm 2.4.1 through 2.5.6 works. mdadm-2.6's "Improve allocation and > > use of space for bitmaps in version1 metadata" > > (199171a297a87d7696b6b8c07ee520363f4603c1) would seem like the > > offending change. Using 1.2 metadata works. > > > > I get the following using the tip of the mdadm git repo or any other > > version of mdadm 2.6.x: > > > > # mdadm --create /dev/md2 --run -l 1 --metadata=1.0 --bitmap=internal > > -n 2 /dev/sdf --write-mostly /dev/nbd2 > > mdadm: /dev/sdf appears to be part of a raid array: > > level=raid1 devices=2 ctime=Wed Oct 17 10:17:31 2007 > > mdadm: /dev/nbd2 appears to be part of a raid array: > > level=raid1 devices=2 ctime=Wed Oct 17 10:17:31 2007 > > mdadm: RUN_ARRAY failed: Input/output error > > mdadm: stopped /dev/md2 > > > > kernel log shows: > > md2: bitmap initialized from disk: read 22/22 pages, set 715290 bits, > > status: 0 > > created bitmap (350 pages) for device md2 > > md2: failed to create bitmap (-5) > > md: pers->run() failed ... > > md: md2 stopped. > > md: unbind > > md: export_rdev(nbd2) > > md: unbind > > md: export_rdev(sdf) > > md: md2 stopped. > > > > I would start by retrying with an external bitmap, to see if for some > reason there isn't room for the bitmap. If that fails, perhaps no bitmap > at all would be a useful data point. Was the original metadata the same > version? Things moved depending on the exact version, and some > --zero-superblock magic might be needed. Hopefully Neil can clarify, I'm > just telling you what I suspect is the problem, and maybe a > non-destructive solution. Creating with an external bitmap works perfectly fine. As does creating without a bitmap. --zero-superblock hasn't helped. Metadata v1.1 and v1.2 work with an internal bitmap. I'd like to use v1.0 with an internal bitmap (using an external bitmap isn't an option for me). 
It does appear that the changes to super1.c aren't leaving adequate room for the bitmap. Looking at the relevant diff for v1.0 metadata, the newer super1.c code makes use of a larger bitmap (128K) for devices > 200GB. My block device is 700GB. So could the larger block device possibly explain why others haven't noticed this? Mike
Re: mdadm 2.6.x regression, fails creation of raid1 w/ v1.0 sb and internal bitmap
On 10/18/07, Neil Brown <[EMAIL PROTECTED]> wrote: > On Wednesday October 17, [EMAIL PROTECTED] wrote: > > mdadm 2.4.1 through 2.5.6 works. mdadm-2.6's "Improve allocation and > > use of space for bitmaps in version1 metadata" > > (199171a297a87d7696b6b8c07ee520363f4603c1) would seem like the > > offending change. Using 1.2 metadata works. > > > > I get the following using the tip of the mdadm git repo or any other > > version of mdadm 2.6.x: > > > > # mdadm --create /dev/md2 --run -l 1 --metadata=1.0 --bitmap=internal > > -n 2 /dev/sdf --write-mostly /dev/nbd2 > > mdadm: /dev/sdf appears to be part of a raid array: > > level=raid1 devices=2 ctime=Wed Oct 17 10:17:31 2007 > > mdadm: /dev/nbd2 appears to be part of a raid array: > > level=raid1 devices=2 ctime=Wed Oct 17 10:17:31 2007 > > mdadm: RUN_ARRAY failed: Input/output error > > mdadm: stopped /dev/md2 > > > > kernel log shows: > > md2: bitmap initialized from disk: read 22/22 pages, set 715290 bits, > > status: 0 > > created bitmap (350 pages) for device md2 > > md2: failed to create bitmap (-5) > > Could you please tell me the exact size of your device? Then I should > be able to reproduce it and test a fix. > > (It works for a 734003201K device). 732456960K. It is fairly surprising that such a relatively small difference in size would prevent it from working... regards, Mike
Re: kicking non-fresh member from array?
On 10/18/07, Goswin von Brederlow <[EMAIL PROTECTED]> wrote: > "Mike Snitzer" <[EMAIL PROTECTED]> writes: > > > All, > > > > I have repeatedly seen that when a 2 member raid1 becomes degraded, > > and IO continues to the lone good member, that if the array is then > > stopped and reassembled you get: > > > > md: bind > > md: bind > > md: kicking non-fresh nbd0 from array! > > md: unbind > > md: export_rdev(nbd0) > > raid1: raid set md0 active with 1 out of 2 mirrors > > > > I'm not seeing how one can avoid assembling such an array in 2 passes: > > 1) assemble array with both members > > 2) if a member was deemed "non-fresh", re-add that member, thereby > > triggering recovery. > > > > So why does MD kick non-fresh members out on assemble when it's > > perfectly capable of recovering the "non-fresh" member? Looking at > > md.c it is fairly clear there isn't a way to avoid this 2-step > > procedure. > > > > Why/how does MD benefit from this "kicking non-fresh" semantic? > > Should MD/mdadm be made optionally tolerant of such non-fresh members > > during assembly? > > > > Mike > > What if the disk has lots of bad blocks, just not where the meta data > is? On every restart you would resync and fail. > > Or what if you removed a mirror to keep a snapshot of a previous > state? If it auto resyncs you lose that snapshot. Both of your examples are fairly tenuous given that such members shouldn't have been provided on the --assemble commandline. I'm not talking about auto assemble via udev or something. But auto assemble via udev brings up an annoying corner-case when you consider the 2 cases you pointed out. So you have valid points. This leads to my last question: having the ability to _optionally_ tolerate (repair) such stale members would allow for greater flexibility. 
The current behavior isn't conducive to repairing unprotected raids (that mdadm/md were told to assemble with specific members) without taking steps to say "no I really _really_ mean it; now re-add this disk!". Any pointers from Neil (or others) on how such a 'repair "non-fresh" member(s) on assemble' override _should_ be implemented would be helpful. My first thought is to add a new superblock --update=repair-non-fresh option to mdadm that would tie into a new flag in the MD superblock. But then it begs the question: why not first add support to set such a superblock option at MD create-time? The validate_super methods would also need to be trained accordingly. regards, Mike
Re: mdadm 2.6.x regression, fails creation of raid1 w/ v1.0 sb and internal bitmap
On 10/19/07, Neil Brown <[EMAIL PROTECTED]> wrote: > On Friday October 19, [EMAIL PROTECTED] wrote: > > I'm using a stock 2.6.19.7 that I then backported various MD fixes to > > from 2.6.20 -> 2.6.23... this kernel has worked great until I > > attempted v1.0 sb w/ bitmap=internal using mdadm 2.6.x. > > > > But would you like me to try a stock 2.6.22 or 2.6.23 kernel? > > Yes please. > I'm suspecting the code in write_sb_page where it tests if the bitmap > overlaps the data or metadata. The only way I can see you getting the > exact error that you do get is for that to fail. > That test was introduced in 2.6.22. Did you backport that? Any > chance it got mucked up a bit? I believe you're referring to commit f0d76d70bc77b9b11256a3a23e98e80878be1578. That change actually made it into 2.6.23 AFAIK; but yes I actually did backport that fix (which depended on ab6085c795a71b6a21afe7469d30a365338add7a). If I back-out f0d76d70bc77b9b11256a3a23e98e80878be1578 I can create a raid1 w/ v1.0 sb and an internal bitmap. But clearly that is just because I removed the negative checks that you introduced ;) For me this begs the question: what else would f0d76d70bc77b9b11256a3a23e98e80878be1578 depend on that I missed? I included 505fa2c4a2f125a70951926dfb22b9cf273994f1 and ab6085c795a71b6a21afe7469d30a365338add7a too. *shrug*... Mike
Re: mdadm 2.6.x regression, fails creation of raid1 w/ v1.0 sb and internal bitmap
On 10/18/07, Neil Brown <[EMAIL PROTECTED]> wrote: > > Sorry, I wasn't paying close enough attention and missed the obvious. > . > > On Thursday October 18, [EMAIL PROTECTED] wrote: > > On 10/18/07, Neil Brown <[EMAIL PROTECTED]> wrote: > > > On Wednesday October 17, [EMAIL PROTECTED] wrote: > > > > mdadm 2.4.1 through 2.5.6 works. mdadm-2.6's "Improve allocation and > > > > use of space for bitmaps in version1 metadata" > > > > (199171a297a87d7696b6b8c07ee520363f4603c1) would seem like the > > > > offending change. Using 1.2 metdata works. > > > > > > > > I get the following using the tip of the mdadm git repo or any other > > > > version of mdadm 2.6.x: > > > > > > > > # mdadm --create /dev/md2 --run -l 1 --metadata=1.0 --bitmap=internal > > > > -n 2 /dev/sdf --write-mostly /dev/nbd2 > > > > mdadm: /dev/sdf appears to be part of a raid array: > > > > level=raid1 devices=2 ctime=Wed Oct 17 10:17:31 2007 > > > > mdadm: /dev/nbd2 appears to be part of a raid array: > > > > level=raid1 devices=2 ctime=Wed Oct 17 10:17:31 2007 > > > > mdadm: RUN_ARRAY failed: Input/output error >^^ > > This means there was an IO error. i.e. there is a block on the device > that cannot be read from. > It worked with earlier version of mdadm because they used a much > smaller bitmap. With the patch you mention in place, mdadm tries > harder to find a good location and good size for a bitmap and to > make sure that space is available. > The important fact is that the bitmap ends up at a different > location. > > You have a bad block at that location, it would seem. I'm a bit skeptical of that being the case considering I get this error on _any_ pair of disks I try in an environment where I'm mirroring across servers that each have access to 8 of these disks. Each of the 8 mirrors consists of a local member and a remote (nbd) member. 
I can't see all 16 disks having the very same bad block(s) at the end of the disk ;)

It feels to me like the calculation that you're making isn't leaving adequate room for the 128K bitmap without hitting the superblock... but I don't have hard proof yet ;)

> I would have expected an error in the kernel logs about the read error
> though - that is strange.

What about the "md2: failed to create bitmap (-5)"?

> What do
> mdadm -E
> and
> mdadm -X
> on each device say?

# mdadm -E /dev/sdf
/dev/sdf:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : caabb900:616bfc5a:03763b95:83ea99a7
           Name : 2
  Creation Time : Fri Oct 19 00:38:45 2007
     Raid Level : raid1
   Raid Devices : 2
  Used Dev Size : 1464913648 (698.53 GiB 750.04 GB)
     Array Size : 1464913648 (698.53 GiB 750.04 GB)
   Super Offset : 1464913904 sectors
          State : clean
    Device UUID : 978cdd42:abaa82a1:4ad79285:1b56ed86
Internal Bitmap : -176 sectors from superblock
    Update Time : Fri Oct 19 00:38:45 2007
       Checksum : c6bb03db - correct
         Events : 0
     Array Slot : 0 (0, 1)
    Array State : Uu

# mdadm -E /dev/nbd2
/dev/nbd2:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : caabb900:616bfc5a:03763b95:83ea99a7
           Name : 2
  Creation Time : Fri Oct 19 00:38:45 2007
     Raid Level : raid1
   Raid Devices : 2
  Used Dev Size : 1464913648 (698.53 GiB 750.04 GB)
     Array Size : 1464913648 (698.53 GiB 750.04 GB)
   Super Offset : 1464913904 sectors
          State : clean
    Device UUID : 180209d2:cff9b5d0:05054d19:2e4930f2
Internal Bitmap : -176 sectors from superblock
          Flags : write-mostly
    Update Time : Fri Oct 19 00:38:45 2007
       Checksum : 8416e951 - correct
         Events : 0
     Array Slot : 1 (0, 1)
    Array State : uU

# mdadm -X /dev/sdf
        Filename : /dev/sdf
           Magic : 6d746962
         Version : 4
            UUID : caabb900:616bfc5a:03763b95:83ea99a7
          Events : 0
  Events Cleared : 0
           State : OK
       Chunksize : 1 MB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 732456824 (698.53 GiB 750.04 GB)
          Bitmap : 715290 bits (chunks), 715290 dirty (100.0%)

# mdadm -X /dev/nbd2
        Filename : /dev/nbd2
           Magic : 6d746962
         Version : 4
            UUID : caabb900:616bfc5a:03763b95:83ea99a7
          Events : 0
  Events Cleared : 0
           State : OK
       Chunksize : 1 MB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 732456824 (698.53 GiB 750.04 GB)
          Bitmap : 715290 bits (chunks), 715290 dirty (100.0%)

> > > > mdadm: stopped /dev/md2
> > > >
> > > > kernel log shows:
> > > > md2: bitmap initialized from disk: read 22/22 pages, set 715290 bits,
> > > > status: 0
> > > > created bitmap (350 pages) for device md2
> > > > md2: failed to create bitmap (-5)

I assumed that the RUN_ARRAY failure (via IO error) was a side-effect of MD's inability to create the bitmap (-5): md2: bitmap initialized from disk: read 22/22 pages, set 715290 bits, status: 0
Re: mdadm 2.6.x regression, fails creation of raid1 w/ v1.0 sb and internal bitmap
On 10/19/07, Mike Snitzer <[EMAIL PROTECTED]> wrote: > On 10/18/07, Neil Brown <[EMAIL PROTECTED]> wrote: > > > > Sorry, I wasn't paying close enough attention and missed the obvious. > > . > > > > On Thursday October 18, [EMAIL PROTECTED] wrote: > > > On 10/18/07, Neil Brown <[EMAIL PROTECTED]> wrote: > > > > On Wednesday October 17, [EMAIL PROTECTED] wrote: > > > > > mdadm 2.4.1 through 2.5.6 works. mdadm-2.6's "Improve allocation and > > > > > use of space for bitmaps in version1 metadata" > > > > > (199171a297a87d7696b6b8c07ee520363f4603c1) would seem like the > > > > > offending change. Using 1.2 metdata works. > > > > > > > > > > I get the following using the tip of the mdadm git repo or any other > > > > > version of mdadm 2.6.x: > > > > > > > > > > # mdadm --create /dev/md2 --run -l 1 --metadata=1.0 --bitmap=internal > > > > > -n 2 /dev/sdf --write-mostly /dev/nbd2 > > > > > mdadm: /dev/sdf appears to be part of a raid array: > > > > > level=raid1 devices=2 ctime=Wed Oct 17 10:17:31 2007 > > > > > mdadm: /dev/nbd2 appears to be part of a raid array: > > > > > level=raid1 devices=2 ctime=Wed Oct 17 10:17:31 2007 > > > > > mdadm: RUN_ARRAY failed: Input/output error > >^^ > > > > This means there was an IO error. i.e. there is a block on the device > > that cannot be read from. > > It worked with earlier version of mdadm because they used a much > > smaller bitmap. With the patch you mention in place, mdadm tries > > harder to find a good location and good size for a bitmap and to > > make sure that space is available. > > The important fact is that the bitmap ends up at a different > > location. > > > > You have a bad block at that location, it would seem. > > I'm a bit skeptical of that being the case considering I get this > error on _any_ pair of disks I try in an environment where I'm > mirroring across servers that each have access to 8 of these disks. > Each of the 8 mirrors consists of a local member and a remote (nbd) > member. 
> I can't see all 16 disks having the very same bad block(s) at
> the end of the disk ;)
>
> It feels to me like the calculation that you're making isn't leaving
> adequate room for the 128K bitmap without hitting the superblock...
> but I don't have hard proof yet ;)

To further test this I used 2 local sparse 732456960K loopback devices and attempted to create the raid1 in the same manner. It failed in exactly the same way. This should cast further doubt on the bad block theory, no?

I'm using a stock 2.6.19.7 that I then backported various MD fixes to from 2.6.20 -> 2.6.23... this kernel has worked great until I attempted v1.0 sb w/ bitmap=internal using mdadm 2.6.x.

But would you like me to try a stock 2.6.22 or 2.6.23 kernel?

Mike
Re: mdadm 2.6.x regression, fails creation of raid1 w/ v1.0 sb and internal bitmap
On 10/22/07, Neil Brown <[EMAIL PROTECTED]> wrote: > On Friday October 19, [EMAIL PROTECTED] wrote: > > On 10/19/07, Neil Brown <[EMAIL PROTECTED]> wrote: > > > On Friday October 19, [EMAIL PROTECTED] wrote: > > > > > > I'm using a stock 2.6.19.7 that I then backported various MD fixes to > > > > from 2.6.20 -> 2.6.23... this kernel has worked great until I > > > > attempted v1.0 sb w/ bitmap=internal using mdadm 2.6.x. > > > > > > > > But would you like me to try a stock 2.6.22 or 2.6.23 kernel? > > > > > > Yes please. > > > I'm suspecting the code in write_sb_page where it tests if the bitmap > > > overlaps the data or metadata. The only way I can see you getting the > > > exact error that you do get it for that to fail. > > > That test was introduced in 2.6.22. Did you backport that? Any > > > chance it got mucked up a bit? > > > > I believe you're referring to commit > > f0d76d70bc77b9b11256a3a23e98e80878be1578. That change actually made > > it into 2.6.23 AFAIK; but yes I actually did backport that fix (which > > depended on ab6085c795a71b6a21afe7469d30a365338add7a). > > > > If I back-out f0d76d70bc77b9b11256a3a23e98e80878be1578 I can create a > > raid1 w/ v1.0 sb and an internal bitmap. But clearly that is just > > because I removed the negative checks that you introduced ;) > > > > For me this begs the question: what else would > > f0d76d70bc77b9b11256a3a23e98e80878be1578 depend on that I missed? I > > included 505fa2c4a2f125a70951926dfb22b9cf273994f1 and > > ab6085c795a71b6a21afe7469d30a365338add7a too. > > > > *shrug*... > > > > This is all very odd... > I definitely tested this last week and couldn't reproduce the > problem. This week I can reproduce it easily. And given the nature > of the bug, I cannot see how it ever worked. > > Anyway, here is a fix that works for me. Hey Neil, Your fix works for me too. 
However, I'm wondering why you held back on fixing the same issue in the "bitmap runs into data" comparison that follows:

--- ./drivers/md/bitmap.c	2007-10-19 19:11:58.0 -0400
+++ ./drivers/md/bitmap.c	2007-10-22 09:53:41.0 -0400
@@ -286,7 +286,7 @@
 			/* METADATA BITMAP DATA */
 			if (rdev->sb_offset*2 + bitmap->offset
-			    + page->index*(PAGE_SIZE/512) + size/512
+			    + (long)(page->index*(PAGE_SIZE/512)) + size/512
 			    > rdev->data_offset)
 				/* bitmap runs in to data */
 				return -EINVAL;

Thanks,
Mike
[PATCH] lvm2 support for detecting v1.x MD superblocks
lvm2's MD v1.0 superblock detection doesn't work at all (because it doesn't use v1 sb offsets). I've tested the attached patch to work on MDs with v0.90.0, v1.0, v1.1, and v1.2 superblocks.

please advise, thanks.
Mike

Index: lib/device/dev-md.c
===
RCS file: /cvs/lvm2/LVM2/lib/device/dev-md.c,v
retrieving revision 1.5
diff -u -r1.5 dev-md.c
--- lib/device/dev-md.c	20 Aug 2007 20:55:25 -	1.5
+++ lib/device/dev-md.c	23 Oct 2007 15:17:57 -
@@ -25,6 +25,40 @@
 #define MD_NEW_SIZE_SECTORS(x) ((x & ~(MD_RESERVED_SECTORS - 1)) \
 				- MD_RESERVED_SECTORS)
 
+int dev_has_md_sb(struct device *dev, uint64_t sb_offset, uint64_t *sb)
+{
+	int ret = 0;
+	uint32_t md_magic;
+	/* Version 1 is little endian; version 0.90.0 is machine endian */
+	if (dev_read(dev, sb_offset, sizeof(uint32_t), &md_magic) &&
+	    ((md_magic == xlate32(MD_SB_MAGIC)) ||
+	     (md_magic == MD_SB_MAGIC))) {
+		if (sb)
+			*sb = sb_offset;
+		ret = 1;
+	}
+	return ret;
+}
+
+uint64_t v1_sb_offset(uint64_t size, int minor_version) {
+	uint64_t sb_offset;
+	switch(minor_version) {
+	case 0:
+		sb_offset = size;
+		sb_offset -= 8*2;
+		sb_offset &= ~(4*2-1);
+		break;
+	case 1:
+		sb_offset = 0;
+		break;
+	case 2:
+		sb_offset = 4*2;
+		break;
+	}
+	sb_offset <<= SECTOR_SHIFT;
+	return sb_offset;
+}
+
 /*
  * Returns -1 on error
  */
@@ -35,7 +69,6 @@
 #ifdef linux
 
 	uint64_t size, sb_offset;
-	uint32_t md_magic;
 
 	if (!dev_get_size(dev, &size)) {
 		stack;
@@ -50,16 +83,20 @@
 		return -1;
 	}
 
-	sb_offset = MD_NEW_SIZE_SECTORS(size) << SECTOR_SHIFT;
-
 	/* Check if it is an md component device. */
-	/* Version 1 is little endian; version 0.90.0 is machine endian */
-	if (dev_read(dev, sb_offset, sizeof(uint32_t), &md_magic) &&
-	    ((md_magic == xlate32(MD_SB_MAGIC)) ||
-	     (md_magic == MD_SB_MAGIC))) {
-		if (sb)
-			*sb = sb_offset;
+	/* Version 0.90.0 */
+	sb_offset = MD_NEW_SIZE_SECTORS(size) << SECTOR_SHIFT;
+	if (dev_has_md_sb(dev, sb_offset, sb)) {
 		ret = 1;
+	} else {
+		/* Version 1, try v1.0 -> v1.2 */
+		int minor;
+		for (minor = 0; minor <= 2; minor++) {
+			if (dev_has_md_sb(dev, v1_sb_offset(size, minor), sb)) {
+				ret = 1;
+				break;
+			}
+		}
 	}
 
 	if (!dev_close(dev))
Re: [lvm-devel] [PATCH] lvm2 support for detecting v1.x MD superblocks
On 10/23/07, Alasdair G Kergon <[EMAIL PROTECTED]> wrote: > On Tue, Oct 23, 2007 at 11:32:56AM -0400, Mike Snitzer wrote: > > I've tested the attached patch to work on MDs with v0.90.0, v1.0, > > v1.1, and v1.2 superblocks. > > I'll apply this, thanks, but need to add comments (or reference) to explain > what the hard-coded numbers are: > > sb_offset = (size - 8 * 2) & ~(4 * 2 - 1); > etc.

All values are in terms of sectors, which is where the * 2 comes from. The v1.0 case follows the same model as the MD_NEW_SIZE_SECTORS macro used for v0.90.0. The difference is that the v1.0 superblock is found "at least 8K, but less than 12K, from the end of the device". The same switch statement is used in mdadm and is accompanied by the following comment:

/*
 * Calculate the position of the superblock.
 * It is always aligned to a 4K boundary and
 * depending on minor_version, it can be:
 * 0: At least 8K, but less than 12K, from end of device
 * 1: At start of device
 * 2: 4K from start of device.
 */

Would it be sufficient to add that comment block above v1_sb_offset()'s switch statement?

thanks,
Mike
Re: Time to deprecate old RAID formats?
On 10/24/07, John Stoffel <[EMAIL PROTECTED]> wrote: > > "Bill" == Bill Davidsen <[EMAIL PROTECTED]> writes: > > Bill> John Stoffel wrote: > >> Why do we have three different positions for storing the superblock? > > Bill> Why do you suggest changing anything until you get the answer to > Bill> this question? If you don't understand why there are three > Bill> locations, perhaps that would be a good initial investigation. > > Because I've asked this question before and not gotten an answer, nor > is it answered in the man page for mdadm on why we have this setup. > > Bill> Clearly the short answer is that they reflect three stages of > Bill> Neil's thinking on the topic, and I would bet that he had a good > Bill> reason for moving the superblock when he did it. > > So let's hear Neil's thinking about all this? Or should I just work > up a patch to do what I suggest and see how that flies? > > Bill> Since you have to support all of them or break existing arrays, > Bill> and they all use the same format so there's no saving of code > Bill> size to mention, why even bring this up? > > Because of the confusion factor. Again, since no one has been able to > articulate a reason why we have three different versions of the 1.x > superblock, nor have I seen any good reasons for why we should have > them, I'm going by the KISS principle to reduce the options to the > best one. > > And no, I'm not advocating getting rid of legacy support, but I AM > advocating that we settle on ONE standard format going forward as the > default for all new RAID superblocks.

Why exactly are you on this crusade to find the one "best" v1 superblock location? Giving people the freedom to place the superblock where they choose isn't a bad thing. Would adding something like "If in doubt, 1.1 is the safest choice." to the mdadm man page give you the KISS warm-fuzzies you're pining for?
The fact that, after you read the manpage, you didn't even know that the only difference between the v1.x variants is the location where the superblock is placed indicates that you're not in a position to be so tremendously evangelical about effecting code changes that limit existing options.

Mike
Re: [PATCH 003 of 3] md: Update md bitmap during resync.
On Dec 7, 2007 12:42 AM, NeilBrown <[EMAIL PROTECTED]> wrote: > > Currently an md array with a write-intent bitmap does not update > that bitmap to reflect successful partial resync. Rather the entire > bitmap is updated when the resync completes. > > This is because there is no guarantee that resync requests will > complete in order, and tracking each request individually is > unnecessarily burdensome. > > However there is value in regularly updating the bitmap, so add code > to periodically pause while all pending sync requests complete, then > update the bitmap. Doing this only every few seconds (the same as the > bitmap update time) does not noticeably affect resync performance. > > Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

Hi Neil,

You forgot to export bitmap_cond_end_sync. Please see the attached patch.

regards,
Mike

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index f31ea4f..b596538 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -1566,3 +1566,4 @@ EXPORT_SYMBOL(bitmap_start_sync);
 EXPORT_SYMBOL(bitmap_end_sync);
 EXPORT_SYMBOL(bitmap_unplug);
 EXPORT_SYMBOL(bitmap_close_sync);
+EXPORT_SYMBOL(bitmap_cond_end_sync);
2.6.22.16 MD raid1 doesn't mark removed disk faulty, MD thread goes UN
Under 2.6.22.16, I physically pulled a SATA disk (/dev/sdac, connected to an aacraid controller) that was acting as the local raid1 member of /dev/md30.

Linux MD didn't see a /dev/sdac1 error until I tried forcing the issue by doing a read (with dd) from /dev/md30:

Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Sense Key : Hardware Error [current]
Jan 21 17:08:07 lab17-233 kernel: Info fld=0x0
Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Add. Sense: Internal target failure
Jan 21 17:08:07 lab17-233 kernel: end_request: I/O error, dev sdac, sector 71
Jan 21 17:08:07 lab17-233 kernel: printk: 3 messages suppressed.
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 8
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 16
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 24
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 32
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 40
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 48
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 56
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 64
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 72
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 80
Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Sense Key : Hardware Error [current]
Jan 21 17:08:07 lab17-233 kernel: Info fld=0x0
Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Add. Sense: Internal target failure
Jan 21 17:08:07 lab17-233 kernel: end_request: I/O error, dev sdac, sector 343
Jan 21 17:08:08 lab17-233 kernel: sd 2:0:27:0: [sdac] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Jan 21 17:08:08 lab17-233 kernel: sd 2:0:27:0: [sdac] Sense Key : Hardware Error [current]
Jan 21 17:08:08 lab17-233 kernel: Info fld=0x0
...
Jan 21 17:08:12 lab17-233 kernel: sd 2:0:27:0: [sdac] Add. Sense: Internal target failure
Jan 21 17:08:12 lab17-233 kernel: end_request: I/O error, dev sdac, sector 3399
Jan 21 17:08:12 lab17-233 kernel: printk: 765 messages suppressed.
Jan 21 17:08:12 lab17-233 kernel: raid1: sdac1: rescheduling sector 3336

However, the MD layer still hasn't marked the sdac1 member faulty:

md30 : active raid1 nbd2[1](W) sdac1[0]
      4016204 blocks super 1.0 [2/2] [UU]
      bitmap: 1/8 pages [4KB], 256KB chunk

The dd I used to read from /dev/md30 is blocked on IO:

Jan 21 17:13:55 lab17-233 kernel: dd  D 0afa9cf5c346 0 12337 7702 (NOTLB)
Jan 21 17:13:55 lab17-233 kernel: 81010c449868 0082 80268f14
Jan 21 17:13:55 lab17-233 kernel: 81015da6f320 81015de532c0 0008 81012d9d7780
Jan 21 17:13:55 lab17-233 kernel: 81015fae2880 4926 81012d9d7970 0001802879a0
Jan 21 17:13:55 lab17-233 kernel: Call Trace:
Jan 21 17:13:55 lab17-233 kernel: [] mempool_alloc+0x24/0xda
Jan 21 17:13:55 lab17-233 kernel: [] :raid1:wait_barrier+0x84/0xc2
Jan 21 17:13:55 lab17-233 kernel: [] default_wake_function+0x0/0xe
Jan 21 17:13:55 lab17-233 kernel: [] :raid1:make_request+0x83/0x5c0
Jan 21 17:13:55 lab17-233 kernel: [] __make_request+0x57f/0x668
Jan 21 17:13:55 lab17-233 kernel: [] generic_make_request+0x26e/0x2a9
Jan 21 17:13:55 lab17-233 kernel: [] mempool_alloc+0x24/0xda
Jan 21 17:13:55 lab17-233 kernel: [] __next_cpu+0x19/0x28
Jan 21 17:13:55 lab17-233 kernel: [] submit_bio+0xb6/0xbd
Jan 21 17:13:55 lab17-233 kernel: [] submit_bh+0xdf/0xff
Jan 21 17:13:55 lab17-233 kernel: [] block_read_full_page+0x271/0x28e
Jan 21 17:13:55 lab17-233 kernel: [] blkdev_get_block+0x0/0x46
Jan 21 17:13:55 lab17-233 kernel: [] radix_tree_insert+0xcb/0x18c
Jan 21 17:13:55 lab17-233 kernel: [] __do_page_cache_readahead+0x16d/0x1df
Jan 21 17:13:55 lab17-233 kernel: [] getnstimeofday+0x32/0x8d
Jan 21 17:13:55 lab17-233 kernel: [] ktime_get_ts+0x1a/0x4e
Jan 21 17:13:55 lab17-233 kernel: [] delayacct_end+0x7d/0x88
Jan 21 17:13:55 lab17-233 kernel: [] blockable_page_cache_readahead+0x53/0xb2
Jan 21 17:13:55 lab17-233 kernel: [] make_ahead_window+0x82/0x9e
Jan 21 17:13:55 lab17-233 kernel: [] page_cache_readahead+0x18a/0x1c1
Jan 21 17:13:55 lab17-233 kernel: [] do_generic_mapping_read+0x135/0x3fc
Jan 21 17:13:55 lab17-233 kernel: [] file_read_actor+0x0/0x170
Jan 21 17:13:55 lab17-233 kernel: [] generic_file_aio_read+0x119/0x155
Jan 21 17:13:55 lab17-233 kernel: [] do_sync_read+0xc9/0x10c
Jan 21 17:13:55 lab17-233 kernel: [] autoremove_wake_function+0x0/0x2e
Jan 21 17:13:55 lab17-233 kernel: [] do_mmap_pgoff+0x639/0x7a5
Jan 21 17:13:55 lab17-233 kernel: [] vfs_read+0xcb/0x153
Jan 21 17:13:55 lab17-233 kernel: [] sys_read+0x45/0x6e
Jan 21 17:13
Re: 2.6.22.16 MD raid1 doesn't mark removed disk faulty, MD thread goes UN
cc'ing Tanaka-san given his recent raid1 BUG report: http://lkml.org/lkml/2008/1/14/515 On Jan 21, 2008 6:04 PM, Mike Snitzer <[EMAIL PROTECTED]> wrote: > Under 2.6.22.16, I physically pulled a SATA disk (/dev/sdac, connected to > an aacraid controller) that was acting as the local raid1 member of > /dev/md30. > > Linux MD didn't see an /dev/sdac1 error until I tried forcing the issue by > doing a read (with dd) from /dev/md30: > > Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Result: > hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK > Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Sense Key : > Hardware Error [current] > Jan 21 17:08:07 lab17-233 kernel: Info fld=0x0 > Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Add. Sense: > Internal target failure > Jan 21 17:08:07 lab17-233 kernel: end_request: I/O error, dev sdac, sector 71 > Jan 21 17:08:07 lab17-233 kernel: printk: 3 messages suppressed. > Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 8 > Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 16 > Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 24 > Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 32 > Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 40 > Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 48 > Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 56 > Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 64 > Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 72 > Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 80 > Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Result: > hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK > Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Sense Key : > Hardware Error [current] > Jan 21 17:08:07 lab17-233 kernel: Info fld=0x0 > Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Add. 
Sense: > Internal target failure > Jan 21 17:08:07 lab17-233 kernel: end_request: I/O error, dev sdac, sector 343 > Jan 21 17:08:08 lab17-233 kernel: sd 2:0:27:0: [sdac] Result: > hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK > Jan 21 17:08:08 lab17-233 kernel: sd 2:0:27:0: [sdac] Sense Key : > Hardware Error [current] > Jan 21 17:08:08 lab17-233 kernel: Info fld=0x0 > ... > Jan 21 17:08:12 lab17-233 kernel: sd 2:0:27:0: [sdac] Add. Sense: > Internal target failure > Jan 21 17:08:12 lab17-233 kernel: end_request: I/O error, dev sdac, sector > 3399 > Jan 21 17:08:12 lab17-233 kernel: printk: 765 messages suppressed. > Jan 21 17:08:12 lab17-233 kernel: raid1: sdac1: rescheduling sector 3336 > > However, the MD layer still hasn't marked the sdac1 member faulty: > > md30 : active raid1 nbd2[1](W) sdac1[0] > 4016204 blocks super 1.0 [2/2] [UU] > bitmap: 1/8 pages [4KB], 256KB chunk > > The dd I used to read from /dev/md30 is blocked on IO: > > Jan 21 17:13:55 lab17-233 kernel: ddD 0afa9cf5c346 > 0 12337 7702 (NOTLB) > Jan 21 17:13:55 lab17-233 kernel: 81010c449868 0082 > 80268f14 > Jan 21 17:13:55 lab17-233 kernel: 81015da6f320 81015de532c0 > 0008 81012d9d7780 > Jan 21 17:13:55 lab17-233 kernel: 81015fae2880 4926 > 81012d9d7970 0001802879a0 > Jan 21 17:13:55 lab17-233 kernel: Call Trace: > Jan 21 17:13:55 lab17-233 kernel: [] > mempool_alloc+0x24/0xda > Jan 21 17:13:55 lab17-233 kernel: [] > :raid1:wait_barrier+0x84/0xc2 > Jan 21 17:13:55 lab17-233 kernel: [] > default_wake_function+0x0/0xe > Jan 21 17:13:55 lab17-233 kernel: [] > :raid1:make_request+0x83/0x5c0 > Jan 21 17:13:55 lab17-233 kernel: [] > __make_request+0x57f/0x668 > Jan 21 17:13:55 lab17-233 kernel: [] > generic_make_request+0x26e/0x2a9 > Jan 21 17:13:55 lab17-233 kernel: [] > mempool_alloc+0x24/0xda > Jan 21 17:13:55 lab17-233 kernel: [] __next_cpu+0x19/0x28 > Jan 21 17:13:55 lab17-233 kernel: [] submit_bio+0xb6/0xbd > Jan 21 17:13:55 lab17-233 kernel: [] submit_bh+0xdf/0xff > Jan 21 17:13:55 
lab17-233 kernel: [] > block_read_full_page+0x271/0x28e > Jan 21 17:13:55 lab17-233 kernel: [] > blkdev_get_block+0x0/0x46 > Jan 21 17:13:55 lab17-233 kernel: [] > radix_tree_insert+0xcb/0x18c > Jan 21 17:13:55 lab17-233 kernel: [] > __do_page_cache_readahead+0x16d/0x1df > Jan 21 17:13:55 lab17-233 kernel: [] > getnstimeofday+0x32/0x8d > Jan 21 17:13:55 lab17-233 kernel: [] ktime_get_ts+0x1a/0x4e > Jan 21 17:13:55 lab17-233 kernel: [] > delayacct_end+0x7d/0x88 > Jan 21 17:13:55 lab17-233 kernel: [] > blockable_page_cache_readahead+0x53/0xb2 > Jan 21 17:1
AACRAID driver broken in 2.6.22.x (and beyond?) [WAS: Re: 2.6.22.16 MD raid1 doesn't mark removed disk faulty, MD thread goes UN]
On Jan 22, 2008 12:29 AM, Mike Snitzer <[EMAIL PROTECTED]> wrote:
> cc'ing Tanaka-san given his recent raid1 BUG report:
> http://lkml.org/lkml/2008/1/14/515
>
> On Jan 21, 2008 6:04 PM, Mike Snitzer <[EMAIL PROTECTED]> wrote:
> > Under 2.6.22.16, I physically pulled a SATA disk (/dev/sdac, connected to
> > an aacraid controller) that was acting as the local raid1 member of
> > /dev/md30.
> >
> > Linux MD didn't see a /dev/sdac1 error until I tried forcing the issue by
> > doing a read (with dd) from /dev/md30:
>
> The raid1d thread is locked at line 720 in raid1.c (raid1d+2437); aka freeze_array:
>
> (gdb) l *0x2539
> 0x2539 is in raid1d (drivers/md/raid1.c:720).
> 715          * wait until barrier+nr_pending match nr_queued+2
> 716          */
> 717         spin_lock_irq(&conf->resync_lock);
> 718         conf->barrier++;
> 719         conf->nr_waiting++;
> 720         wait_event_lock_irq(conf->wait_barrier,
> 721                             conf->barrier+conf->nr_pending == conf->nr_queued+2,
> 722                             conf->resync_lock,
> 723                             raid1_unplug(conf->mddev->queue));
> 724         spin_unlock_irq(&conf->resync_lock);
>
> Given Tanaka-san's report against 2.6.23 and me hitting what seems to
> be the same deadlock in 2.6.22.16, it stands to reason this affects
> raid1 in 2.6.24-rcX too.

Turns out that the aacraid driver in 2.6.22.x is HORRIBLY BROKEN (when you pull a drive); it responds to MD's write requests with uptodate=1 (in raid1_end_write_request) for the drive that was pulled! I've not looked to see if aacraid has been fixed in newer kernels... are others aware of any crucial aacraid fixes in 2.6.23.x or 2.6.24?

After the drive was physically pulled, and small periodic writes continued to the associated MD device, the raid1 MD driver did _NOT_ detect the pulled drive's writes as having failed (verified this with systemtap). MD happily thought the write completed to both members (so MD had no reason to mark the pulled drive "faulty"; or mark the raid "degraded").

Installing an Adaptec-provided 1.1-5[2451] driver enabled raid1 to work as expected.
That said, I now have a recipe for hitting the raid1 deadlock that Tanaka first reported over a week ago. I'm still surprised that all of this chatter about that BUG hasn't drawn interest/scrutiny from others!?

regards,
Mike
Re: AACRAID driver broken in 2.6.22.x (and beyond?) [WAS: Re: 2.6.22.16 MD raid1 doesn't mark removed disk faulty, MD thread goes UN]
ondition generally indicates a serious hardware problem > or target incompatibility; and is generally rare as they are a result of > corner case conditions within the Adapter Firmware. The diagnostic dump > reported by the Adaptec utilities should be able to point to the fault you > are experiencing if these appear to be the root causes. snitzer: It would seem that 1.1.5-2451 has the firmware reset support given the log I provided above, no? Anyway, with 2.6.22.16 when a drive is pulled using the aacraid 1.1-5[2437]-mh4 there is absolutely no errors from the aacraid driver; in fact the scsi layer doesn't see anything until I force the issue with explicit reads/writes to the device that was pulled. It could be that on a drive pull the 1.1.5-2451 driver results in a BlinkLED, resets the firmware, and continues. Whereas with the 1.1-5[2437]-mh4 I get no BlinkLED and as such Linux (both scsi and raid1) is completely unaware of any disconnect of the physical device. thanks, Mike > > -Original Message- > > From: Mike Snitzer [mailto:[EMAIL PROTECTED] > > Sent: Tuesday, January 22, 2008 7:10 PM > > To: linux-raid@vger.kernel.org; NeilBrown > > Cc: [EMAIL PROTECTED]; K. Tanaka; AACRAID; > > [EMAIL PROTECTED] > > Subject: AACRAID driver broken in 2.6.22.x (and beyond?) > > [WAS: Re: 2.6.22.16 MD raid1 doesn't mark removed disk > > faulty, MD thread goes UN] > > > > > On Jan 22, 2008 12:29 AM, Mike Snitzer <[EMAIL PROTECTED]> wrote: > > > cc'ing Tanaka-san given his recent raid1 BUG report: > > > http://lkml.org/lkml/2008/1/14/515 > > > > > > > > > On Jan 21, 2008 6:04 PM, Mike Snitzer <[EMAIL PROTECTED]> wrote: > > > > Under 2.6.22.16, I physically pulled a SATA disk > > (/dev/sdac, connected to > > > > an aacraid controller) that was acting as the local raid1 > > member of > > > > /dev/md30. 
> > > > > > > > Linux MD didn't see an /dev/sdac1 error until I tried > > forcing the issue by > > > > doing a read (with dd) from /dev/md30: > > > > > The raid1d thread is locked at line 720 in raid1.c > > (raid1d+2437); aka > > > freeze_array: > > > > > > (gdb) l *0x2539 > > > 0x2539 is in raid1d (drivers/md/raid1.c:720). > > > 715 * wait until barrier+nr_pending match nr_queued+2 > > > 716 */ > > > 717 spin_lock_irq(&conf->resync_lock); > > > 718 conf->barrier++; > > > 719 conf->nr_waiting++; > > > 720 wait_event_lock_irq(conf->wait_barrier, > > > 721 > > conf->barrier+conf->nr_pending == > > > conf->nr_queued+2, > > > 722 conf->resync_lock, > > > 723 > > raid1_unplug(conf->mddev->queue)); > > > 724 spin_unlock_irq(&conf->resync_lock); > > > > > > Given Tanaka-san's report against 2.6.23 and me hitting > > what seems to > > > be the same deadlock in 2.6.22.16; it stands to reason this affects > > > raid1 in 2.6.24-rcX too. > > > > Turns out that the aacraid driver in 2.6.22.x is HORRIBLY BROKEN (when > > you pull a drive); it responds to MD's write requests with uptodate=1 > > (in raid1_end_write_request) for the drive that was pulled! I've not > > looked to see if aacraid has been fixed in newer kernels... are others > > aware of any crucial aacraid fixes in 2.6.23.x or 2.6.24? > > > > After the drive was physically pulled, and small periodic writes > > continued to the associated MD device, the raid1 MD driver did _NOT_ > > detect the pulled drive's writes as having failed (verified this with > > systemtap). MD happily thought the write completed to both members > > (so MD had no reason to mark the pulled drive "faulty"; or mark the > > raid "degraded"). > > > > Installing an Adaptec-provided 1.1-5[2451] driver enabled raid1 to > > work as expected. > > > > That said, I now have a recipe for hitting the raid1 deadlock that > > Tanaka first reported over a week ago. 
I'm still surprised that all > > of this chatter about that BUG hasn't drawn interest/scrutiny from > > others!? > > regards, > > Mike