Re: [PATCH] md: new bitmap sysfs interface
On 7/25/06, Paul Clements <[EMAIL PROTECTED]> wrote:
> This patch (tested against 2.6.18-rc1-mm1) adds a new sysfs interface
> that allows the bitmap of an array to be dirtied. The interface is
> write-only, and is used as follows:
>
> echo "1000" > /sys/block/md2/md/bitmap
>   (dirty the bit for chunk 1000 [offset 0] in the in-memory and
>    on-disk bitmaps of array md2)
>
> echo "1000-2000" > /sys/block/md1/md/bitmap
>   (dirty the bits for chunks 1000-2000 in md1's bitmap)
>
> This is useful, for example, in cluster environments where you may
> need to combine two disjoint bitmaps into one (following a server
> failure, after a secondary server has taken over the array). By
> combining the bitmaps on the two servers, a full resync can be avoided
> (this was discussed on the list back on March 18, 2005, in the
> "[PATCH 1/2] md bitmap bug fixes" thread).

Hi Paul,

I tracked down the thread you referenced, and these posts (by you) seem to summarize things well:
http://marc.theaimsgroup.com/?l=linux-raid&m=16563016418&w=2
http://marc.theaimsgroup.com/?l=linux-raid&m=17515400864&w=2

But for clarity's sake, could you elaborate on the negative implications of not merging the bitmaps on the secondary server? Will the previous primary's dirty blocks get dropped on the floor because the secondary (now the primary) doesn't have awareness of the previous primary's dirty blocks once it activates the raid1?

Also, what is the interface one should use to collect dirty bits from the primary's bitmap? This bitmap merge can't happen until the primary's dirty bits can be collected, right? Waiting for the failed server to come back to harvest the dirty bits it has seems wrong (why fail over at all?), so I must be missing something.

please advise, thanks,
Mike
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] md: new bitmap sysfs interface
On 7/26/06, Paul Clements <[EMAIL PROTECTED]> wrote:
> Mike Snitzer wrote:
> > But for clarity's sake, could you elaborate on the negative
> > implications of not merging the bitmaps on the secondary server? Will
> > the previous primary's dirty blocks get dropped on the floor because
> > the secondary (now the primary) doesn't have awareness of the previous
> > primary's dirty blocks once it activates the raid1?
>
> Right. At the time of the failover, there were (probably) blocks that
> were out of sync between the primary and secondary. Now, after you've
> failed over to the secondary, you've got to overwrite those blocks with
> data from the secondary in order to make the primary disk consistent
> again. This requires that either you do a full resync from secondary to
> primary (if you don't know what differs), or you merge the two bitmaps
> and resync just that data.

I took more time to read the later posts in the original thread; that, coupled with your detailed response, has helped a lot. thanks.

> > Also, what is the interface one should use to collect dirty bits from
> > the primary's bitmap?
>
> Whatever you'd like. scp the bitmap file over, or collect the ranges
> into a file and scp that over, or something similar.

OK, so regardless of whether you are using an external or internal bitmap, how does one collect the ranges from an array's bitmap? Generally speaking, I think others would have the same (naive) question, given that we need to know what to use as input for the sysfs interface you've kindly provided. If it is left as an exercise to the user that is fine; I'd imagine neilb will get our backs with a nifty new mdadm flag if need be.

thanks again,
Mike
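Paul's "collect the ranges into a file and scp that over" suggestion can be sketched as a small shell loop. This is a sketch under stated assumptions, not part of the patch: the range-file name and location, the transfer step, and the assumption that the file holds one "N" or "N-M" range per line (the format the sysfs interface accepts) are all mine; how the ranges get harvested from the primary's bitmap is exactly the open question above.

```shell
#!/bin/sh
# Hypothetical sketch: merge the failed primary's dirty-chunk ranges
# into the secondary's bitmap via the new sysfs interface.

# Assumed: ranges were collected on the primary into a file with one
# "N" or "N-M" entry per line, then copied here, e.g.:
#   scp primary:/tmp/dirty-ranges /tmp/primary-dirty-ranges
RANGES=/tmp/primary-dirty-ranges
MD_BITMAP=/sys/block/md2/md/bitmap

while read range; do
    # each write dirties the named chunk(s) in md2's in-memory
    # and on-disk bitmaps
    echo "$range" > "$MD_BITMAP"
done < "$RANGES"
```

After the merge, a resync covers the union of both servers' dirty chunks instead of the whole array.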
Re: [PATCH] md: new bitmap sysfs interface
On 7/26/06, Paul Clements <[EMAIL PROTECTED]> wrote:
> Right. At the time of the failover, there were (probably) blocks that
> were out of sync between the primary and secondary.

OK, so now that I understand the need to merge the bitmaps... the various scenarios that create this (potential) inconsistency are still unclear to me when you consider the different flavors of raid1.

Is this inconsistency only possible if using async (aka write-behind) raid1? If not, how would this difference in committed blocks occur with normal (sync) raid1, given MD's endio acknowledges writes after they are submitted to all raid members? Is it merely that the bitmap is left with dangling bits set that don't reflect reality (blocks weren't actually changed anywhere) when a crash occurs? Is there real potential for inconsistent data on disk(s) when using sync raid1 (does having an nbd member increase the likelihood)?

regards,
Mike
Re: [PATCH 010 of 10] md: Allow the write_mostly flag to be set via sysfs.
Aside from this write-mostly sysfs support, is there a way to toggle the write-mostly bit of an md member with mdadm? I couldn't identify a clear way to do so. It'd be nice if mdadm --assemble would honor --write-mostly...

On 6/1/06, NeilBrown <[EMAIL PROTECTED]> wrote:

It appears in /sys/mdX/md/dev-YYY/state and can be set or cleared by
writing 'writemostly' or '-writemostly' respectively.

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./Documentation/md.txt |    5 +++++
 ./drivers/md/md.c      |   12 ++++++++++++
 2 files changed, 17 insertions(+)

diff ./Documentation/md.txt~current~ ./Documentation/md.txt
--- ./Documentation/md.txt~current~	2006-06-01 15:05:30.000000000 +1000
+++ ./Documentation/md.txt	2006-06-01 15:05:30.000000000 +1000
@@ -309,6 +309,9 @@ Each directory contains:
 	faulty   - device has been kicked from active use due to
 	           a detected fault
 	in_sync  - device is a fully in-sync member of the array
+	writemostly - device will only be subject to read
+	           requests if there are no other options.
+	           This applies only to raid1 arrays.
 	spare    - device is working, but not a full member.
 	           This includes spares that are in the process
 	           of being recovered to
@@ -316,6 +319,8 @@ Each directory contains:
 	This can be written to.
 	Writing "faulty" simulates a failure on the device.
 	Writing "remove" removes the device from the array.
+	Writing "writemostly" sets the writemostly flag.
+	Writing "-writemostly" clears the writemostly flag.

	errors
	     An approximate count of read errors that have been detected on

diff ./drivers/md/md.c~current~ ./drivers/md/md.c
--- ./drivers/md/md.c~current~	2006-06-01 15:05:30.000000000 +1000
+++ ./drivers/md/md.c	2006-06-01 15:05:30.000000000 +1000
@@ -1737,6 +1737,10 @@ state_show(mdk_rdev_t *rdev, char *page)
 		len += sprintf(page+len, "%sin_sync",sep);
 		sep = ",";
 	}
+	if (test_bit(WriteMostly, &rdev->flags)) {
+		len += sprintf(page+len, "%swrite_mostly",sep);
+		sep = ",";
+	}
 	if (!test_bit(Faulty, &rdev->flags) &&
 	    !test_bit(In_sync, &rdev->flags)) {
 		len += sprintf(page+len, "%sspare", sep);
@@ -1751,6 +1755,8 @@ state_store(mdk_rdev_t *rdev, const char
 	/* can write
 	 *  faulty  - simulates an error
 	 *  remove  - disconnects the device
+	 *  writemostly - sets write_mostly
+	 *  -writemostly - clears write_mostly
 	 */
 	int err = -EINVAL;
 	if (cmd_match(buf, "faulty") && rdev->mddev->pers) {
@@ -1766,6 +1772,12 @@ state_store(mdk_rdev_t *rdev, const char
 			md_new_event(mddev);
 			err = 0;
 		}
+	} else if (cmd_match(buf, "writemostly")) {
+		set_bit(WriteMostly, &rdev->flags);
+		err = 0;
+	} else if (cmd_match(buf, "-writemostly")) {
+		clear_bit(WriteMostly, &rdev->flags);
+		err = 0;
 	}
 	return err ? err : len;
 }
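For reference, driving the interface added by this patch from a shell looks like the following (md2 and sdd are example names; per the Documentation/md.txt hunk, each member gets a /sys/block/mdX/md/dev-YYY directory). Note the asymmetry visible in the patch itself: you write "writemostly"/"-writemostly", but state_show() reports the flag as "write_mostly".

```shell
# Set and clear the write-mostly flag on a member device via its sysfs
# 'state' attribute, as documented by the patch above.
# /dev/md2 and member sdd are example names.

# set write-mostly on member sdd of array md2
echo writemostly > /sys/block/md2/md/dev-sdd/state

# clear it again
echo -writemostly > /sys/block/md2/md/dev-sdd/state

# state_show() lists the flag (as "write_mostly"), comma-separated
# with the others (faulty, in_sync, spare)
cat /sys/block/md2/md/dev-sdd/state
```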
Re: [PATCH 010 of 10] md: Allow the write_mostly flag to be set via sysfs.
On 8/5/06, Mike Snitzer <[EMAIL PROTECTED]> wrote:
> Aside from this write-mostly sysfs support, is there a way to toggle
> the write-mostly bit of an md member with mdadm? I couldn't identify a
> clear way to do so. It'd be nice if mdadm --assemble would honor
> --write-mostly...

I went ahead and implemented the ability to toggle the write-mostly bit for all disks in an array. I did so by adding another type of --update to --assemble. This is very useful for a 2-disk raid1 (one disk local, one remote): when you switch the raidhost you also need to toggle the write-mostly bit.

I've tested the attached patch to work with both ver0.90 and ver1 superblocks with mdadm 2.4.1 and 2.5.2. The patch is against mdadm 2.4.1 but applies cleanly (with fuzz) against mdadm 2.5.2.

# cat /proc/mdstat
...
md2 : active raid1 nbd2[0] sdd[1](W)
      390613952 blocks [2/2] [UU]
      bitmap: 0/187 pages [0KB], 1024KB chunk

# mdadm -S /dev/md2
# mdadm --assemble /dev/md2 --run --update=toggle-write-mostly /dev/sdd /dev/nbd2
mdadm: /dev/md2 has been started with 2 drives.

# cat /proc/mdstat
...
md2 : active raid1 nbd2[0](W) sdd[1]
      390613952 blocks [2/2] [UU]
      bitmap: 0/187 pages [0KB], 1024KB chunk

diff -Naur mdadm-2.4.1/mdadm.c mdadm-2.4.1_toggle_write_mostly/mdadm.c
--- mdadm-2.4.1/mdadm.c	2006-03-28 21:55:39.000000000 -0500
+++ mdadm-2.4.1_toggle_write_mostly/mdadm.c	2006-08-05 17:01:48.000000000 -0400
@@ -587,6 +587,8 @@
 				continue;
 			if (strcmp(update, "uuid")==0)
 				continue;
+			if (strcmp(update, "toggle-write-mostly")==0)
+				continue;
 			if (strcmp(update, "byteorder")==0) {
 				if (ss) {
 					fprintf(stderr, Name ": must not set metadata type with --update=byteorder.\n");
@@ -601,7 +603,7 @@
 				continue;
 			}
-			fprintf(stderr, Name ": '--update %s' invalid.  Only 'sparc2.2', 'super-minor', 'uuid', 'resync' or 'summaries' supported\n",update);
+			fprintf(stderr, Name ": '--update %s' invalid.  Only 'sparc2.2', 'super-minor', 'uuid', 'resync', 'summaries' or 'toggle-write-mostly' supported\n",update);
 			exit(2);
 		case O(ASSEMBLE,'c'): /* config file */

diff -Naur mdadm-2.4.1/super0.c mdadm-2.4.1_toggle_write_mostly/super0.c
--- mdadm-2.4.1/super0.c	2006-03-28 01:10:51.000000000 -0500
+++ mdadm-2.4.1_toggle_write_mostly/super0.c	2006-08-05 18:04:45.000000000 -0400
@@ -382,6 +382,10 @@
 			rv = 1;
 		}
 	}
+	if (strcmp(update, "toggle-write-mostly")==0) {
+		int d = info->disk.number;
+		sb->disks[d].state ^= (1<<MD_DISK_WRITEMOSTLY);
+	}
 		int d = info->disk.number;
 		memset(&sb->disks[d], 0, sizeof(sb->disks[d]));

diff -Naur mdadm-2.4.1/super1.c mdadm-2.4.1_toggle_write_mostly/super1.c
--- mdadm-2.4.1/super1.c	2006-04-07 00:32:06.000000000 -0400
+++ mdadm-2.4.1_toggle_write_mostly/super1.c	2006-08-05 18:33:21.000000000 -0400
@@ -446,6 +446,9 @@
 			rv = 1;
 		}
 	}
+	if (strcmp(update, "toggle-write-mostly")==0) {
+		sb->devflags ^= WriteMostly1;
+	}
 #if 0
 	if (strcmp(update, "newdev") == 0) {
 		int d = info->disk.number;
issue with mdadm ver1 sb and bitmap on x86_64
FYI, with both mdadm 2.4.1 and 2.5.2 I can't mdadm --create with a ver1 superblock and a write-intent bitmap on x86_64.

running:
mdadm --create /dev/md2 -e 1.0 -l 1 --bitmap=internal -n 2 /dev/sdd --write-mostly /dev/nbd2

I get:
mdadm: RUN_ARRAY failed: Invalid argument

Mike
Re: [patch] md: pass down BIO_RW_SYNC in raid{1,10}
On 1/8/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> On Mon, 8 Jan 2007 10:08:34 +0100 Lars Ellenberg <[EMAIL PROTECTED]> wrote:
> > md raidX make_request functions strip off the BIO_RW_SYNC flag,
> > thus introducing additional latency.
> >
> > fixing this in raid1 and raid10 seems to be straight forward enough.
> >
> > for our particular usage case in DRBD, passing this flag improved
> > some initialization time from ~5 minutes to ~5 seconds.
>
> That sounds like a significant fix.

So will this fix also improve performance associated with raid1's internal bitmap support? What is the scope of the performance problems this fix will address? That is, what are some other examples of where users might see a benefit from this patch?

regards,
Mike
raid1 with nbd member hangs MD on SLES10 and RHEL5
When using raid1 with one local member and one nbd member (marked as write-mostly), MD hangs when trying to format /dev/md0 with ext3. Both 'cat /proc/mdstat' and 'mdadm --detail /dev/md0' hang indefinitely.

I've not tried to reproduce on 2.6.18 or 2.6.19ish kernel.org kernels yet, but this issue affects both SLES10 and RHEL5. sysrq traces for RHEL5 follow (addresses were mangled in transit); I don't have immediate access to a SLES10 system at the moment but I've seen this same hang with SLES10 SP1 RC4:

cat /proc/mdstat:

cat           S 8100048e7de8  6208 11428  11391 (NOTLB)
Call Trace:
 [] seq_printf+0x67/0x8f
 [] __mutex_lock_interruptible_slowpath+0x7f/0xbc
 [] md_seq_show+0x123/0x6aa
 [] seq_read+0x1b8/0x28d
 [] vfs_read+0xcb/0x171
 [] sys_read+0x45/0x6e
 [] tracesys+0xd1/0xdc

/sbin/mdadm --detail /dev/md0:

mdadm         S 810035a1dd78  6384  3829   3828 (NOTLB)
Call Trace:
 [] mntput_no_expire+0x19/0x89
 [] __mutex_lock_interruptible_slowpath+0x7f/0xbc
 [] md_open+0x2e/0x68
 [] do_open+0x216/0x316
 [] blkdev_open+0x0/0x4f
 [] blkdev_open+0x23/0x4f
 [] __dentry_open+0xd9/0x1dc
 [] do_filp_open+0x2d/0x3d
 [] do_sys_open+0x44/0xbe
 [] tracesys+0xd1/0xdc

I can provide more detailed information; please just ask.

thanks,
Mike
Re: raid1 with nbd member hangs MD on SLES10 and RHEL5
On 6/12/07, Neil Brown <[EMAIL PROTECTED]> wrote:
> On Tuesday June 12, [EMAIL PROTECTED] wrote:
> >
> > I can provide more detailed information; please just ask.
>
> A complete sysrq trace (all processes) might help.

I'll send it to you off list.

thanks,
Mike
Re: raid1 with nbd member hangs MD on SLES10 and RHEL5
On 6/13/07, Mike Snitzer <[EMAIL PROTECTED]> wrote:
> On 6/12/07, Neil Brown <[EMAIL PROTECTED]> wrote:
> > On Tuesday June 12, [EMAIL PROTECTED] wrote:
> > >
> > > I can provide more detailed information; please just ask.
> >
> > A complete sysrq trace (all processes) might help.

Bringing this back to a wider audience.

I provided the full sysrq trace of the RHEL5 kernel to Neil; in it we saw that md0_raid1 had the following trace:

md0_raid1     D 810026183ce0  5368 31663     11  3822 29488 (L-TLB)
Call Trace:
 [] keventd_create_kthread+0x0/0x61
 [] md_super_wait+0xa8/0xbc
 [] autoremove_wake_function+0x0/0x2e
 [] md_update_sb+0x1dd/0x23a
 [] md_check_recovery+0x15f/0x449
 [] :raid1:raid1d+0x27/0xc1e
 [] thread_return+0x0/0xde
 [] __sched_text_start+0xc/0xa79
 [] keventd_create_kthread+0x0/0x61
 [] schedule_timeout+0x1e/0xad
 [] keventd_create_kthread+0x0/0x61
 [] md_thread+0xf8/0x10e
 [] autoremove_wake_function+0x0/0x2e
 [] md_thread+0x0/0x10e
 [] kthread+0xd4/0x109
 [] child_rip+0xa/0x11
 [] keventd_create_kthread+0x0/0x61
 [] kthread+0x0/0x109
 [] child_rip+0x0/0x11

To which Neil had the following to say:

> md0_raid1 is holding the lock on the array and trying to write out the
> superblocks for some reason, and the write isn't completing.
> As it is holding the locks, mdadm and /proc/mdstat are hanging.
>
> You seem to have nbd-servers running on this machine. Are they
> serving the device that md is using (i.e. a loop-back situation)? I
> would expect memory deadlocks would be very easy to hit in that
> situation, but I don't know if that is what has happened.
>
> Nothing else stands out.
>
> Could you clarify the arrangement of nbd. Where are the servers and
> what are they serving?

We're using MD+NBD for disaster recovery (one local scsi device, one remote via nbd). The nbd-server is not contributing to md0. The nbd-server is connected to a remote machine that is running a raid1 remotely.

To take this further I've now collected a full sysrq trace of this hang on a SLES10 SP1 RC5 2.6.16.46-0.12-smp kernel; the relevant md0_raid1 trace is comparable to the RHEL5 trace from above:

md0_raid1     D 810001089780     0  8583     51  8952  8260 (L-TLB)
Call Trace:
 {generic_make_request+501}
 {md_super_wait+168}
 {autoremove_wake_function+0}
 {write_page+128}
 {md_update_sb+220}
 {md_check_recovery+361}
 {:raid1:raid1d+38}
 {lock_timer_base+27}
 {try_to_del_timer_sync+81}
 {del_timer_sync+12}
 {schedule_timeout+146}
 {keventd_create_kthread+0}
 {md_thread+248}
 {autoremove_wake_function+0}
 {md_thread+0}
 {kthread+236}
 {child_rip+8}
 {keventd_create_kthread+0}
 {kthread+0}
 {child_rip+0}

Taking a step back, here is what was done to reproduce on SLES10:
1) establish a raid1 mirror (md0) using one local member (sdc1) and one remote member (nbd0)
2) power off the remote machine, whereby severing nbd0's connection
3) perform IO to the filesystem that is on the md0 device to induce the MD layer to mark the nbd device as "faulty"
4) cat /proc/mdstat hangs; a sysrq trace was collected and showed the above md0_raid1 trace.

To be clear, the MD superblock update hangs indefinitely on RHEL5. But with SLES10 it eventually succeeds (and MD marks the nbd0 member faulty); and the other tasks that were blocking waiting for the MD lock (e.g. 'cat /proc/mdstat') then complete immediately.

It should be noted that this MD+NBD configuration has worked flawlessly using a stock kernel.org 2.6.15.7 kernel (on top of a RHEL4U4 distro). Steps have not been taken to try to reproduce with 2.6.15.7 on SLES10; it may be useful to pursue but I'll defer to others to suggest I do so.

2.6.15.7 does not have the SMP race fixes that were made in 2.6.16, yet both SLES10 and RHEL5 kernels do:
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4b2f0260c74324abca76ccaa42d426af163125e7

If not this specific NBD change, something appears to have changed with how NBD behaves in the face of its connection to the server being lost. Almost like the MD superblock update that would be written to nbd0 is blocking within nbd or the network layer because of a network timeout issue? I will try to get a better understanding of what is _really_ happening with systemtap; but othe
Re: Cluster Aware MD Driver
Is the goal to have the MD device be directly accessible from all nodes? This strategy seems flawed in that it speaks to updating MD superblocks and then in-memory Linux data structures across a cluster. The reality is that if we're talking about shared storage, the MD management only needs to happen on one node. Others can weigh in on this, but the current MD really doesn't want to be cluster-aware.

IMHO, this cluster awareness really doesn't belong in MD/mdadm. A high-level cluster management tool should be doing this MD ownership/coordination work. The MD ownership can be transferred accordingly if/when the current owner fails, etc. But this implies that the MD is only ever active on one node at any given point in time.

Mike

On 6/13/07, Xinwei Hu <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> Steven Dake proposed a solution* to make the MD layer and tools
> cluster-aware in early 2003, but it seems that no progress has been
> made since then. I'd like to pick this one up again. :)
>
> So far as I understand, Steven's proposal still mostly applies to the
> current MD implementation, except we have the bitmap now. And the
> bitmap can be worked around via set_bitmap_file. The problem is that
> it seems we need a kernel<->userspace interface to sync the mddev
> struct across all nodes, but I don't find out how.
>
> I'm new to the MD driver, so correct me if I'm wrong. And your
> suggestions are really appreciated.
>
> Thanks.
>
> * http://osdir.com/ml/raid/2003-01/msg00013.html
Re: raid1 with nbd member hangs MD on SLES10 and RHEL5
On 6/13/07, Mike Snitzer <[EMAIL PROTECTED]> wrote:
> On 6/13/07, Mike Snitzer <[EMAIL PROTECTED]> wrote:
> > On 6/12/07, Neil Brown <[EMAIL PROTECTED]> wrote:
...
> To be clear, the MD superblock update hangs indefinitely on RHEL5.
> But with SLES10 it eventually succeeds (and MD marks the nbd0 member
> faulty); and the other tasks that were blocking waiting for the MD
> lock (e.g. 'cat /proc/mdstat') then complete immediately.
...
> If not this specific NBD change, something appears to have changed
> with how NBD behaves in the face of its connection to the server
> being lost. Almost like the MD superblock update that would be
> written to nbd0 is blocking within nbd or the network layer because
> of a network timeout issue?

Just a quick update; it is really starting to look like there is definitely an issue with the nbd kernel driver.

I booted the SLES10 2.6.16.46-0.12-smp kernel with maxcpus=1 to test the theory that the nbd SMP fix that went into 2.6.16 was in some way causing this MD/NBD hang. But it _still_ occurs with the 4-step process I outlined above. The nbd0 device _sho
Re: raid1 with nbd member hangs MD on SLES10 and RHEL5
On 6/14/07, Bill Davidsen <[EMAIL PROTECTED]> wrote:
Mike Snitzer wrote:
> On 6/13/07, Mike Snitzer <[EMAIL PROTECTED]> wrote:
>> On 6/13/07, Mike Snitzer <[EMAIL PROTECTED]> wrote:
>> > On 6/12/07, Neil Brown <[EMAIL PROTECTED]> wrote:
...
>> Taking a step back, here is what was done to reproduce on SLES10:
>> 1) establish a raid1 mirror (md0) using one local member (sdc1) and
>> one remote member (nbd0)
>> 2) power off the remote machine, whereby severing nbd0's connection
>> 3) perform IO to the filesystem that is on the md0 device to induce
>> the MD layer to mark the nbd device as "faulty"
>> 4) cat /proc/mdstat hangs, sysrq trace was collected and showed the
>> above md0_raid1 trace.
>>
>> To be clear, the MD superblock update hangs indefinitely on RHEL5.
>> But with SLES10 it eventually succeeds (and MD marks the nbd0 member
>> faulty); and the other tasks that were blocking waiting for the MD
>> lock (e.g. 'cat /proc/mdstat') then complete immediately.
>>
>> It should be noted that this MD+NBD configuration has worked
>> flawlessly using a stock kernel.org 2.6.15.7 kernel (on top of a
>> RHEL4U4 distro). Steps have not been taken to try to reproduce with
>> 2.6.15.7 on SLE
Re: raid1 with nbd member hangs MD on SLES10 and RHEL5
On 6/14/07, Paul Clements <[EMAIL PROTECTED]> wrote:
> Bill Davidsen wrote:
> > Second, AFAIK nbd hasn't worked in a while. I haven't tried it in
> > ages, but was told it wouldn't work with smp and I kind of lost
> > interest. If Neil thinks it should work in 2.6.21 or later I'll test
> > it, since I have a machine which wants a fresh install soon, and is
> > both backed up and available.
>
> Please stop this. nbd is working perfectly fine, AFAIK. I use it every
> day, and so do 100s of our customers. What exactly is it that's not
> working? If there's a problem, please send the bug report.

Paul,

This thread details what I've experienced using MD (raid1) with 2 devices, one being a local scsi device and the other an NBD device. I've yet to put effort into pinpointing the problem in a kernel.org kernel; however, both SLES10 and RHEL5 kernels appear to be hanging in either 1) nbd or 2) the socket layer.

Here are the steps to reproduce reliably on SLES10 SP1:
1) establish a raid1 mirror (md0) using one local member (sdc1) and one remote member (nbd0)
2) power off the remote machine, whereby severing nbd0's connection
3) perform IO to the filesystem that is on the md0 device to induce the MD layer to mark the nbd device as "faulty"
4) cat /proc/mdstat hangs; a sysrq trace was collected

To be clear, the MD superblock update hangs indefinitely on RHEL5. But with SLES10 it eventually succeeds after ~5min (and MD marks the nbd0 member faulty); and the other tasks that were blocking waiting for the MD lock (e.g. 'cat /proc/mdstat') then complete immediately. If you look back in this thread you'll see traces for md0_raid1 for both SLES10 and RHEL5.

I hope to try to reproduce this issue on kernel.org 2.6.16.46 (the basis for SLES10). If I can, I'll then git bisect back to try to pinpoint the regression; I obviously need to verify that 2.6.16 works in this situation on SMP.

Mike
Re: raid1 with nbd member hangs MD on SLES10 and RHEL5
On 6/14/07, Paul Clements <[EMAIL PROTECTED]> wrote: Mike Snitzer wrote: > Here are the steps to reproduce reliably on SLES10 SP1: > 1) establish a raid1 mirror (md0) using one local member (sdc1) and > one remote member (nbd0) > 2) power off the remote machine, thereby severing nbd0's connection > 3) perform IO to the filesystem that is on the md0 device to induce > the MD layer to mark the nbd device as "faulty" > 4) cat /proc/mdstat hangs, sysrq trace was collected That's working as designed. NBD works over TCP. You're going to have to wait for TCP to time out before an error occurs. Until then I/O will hang. With kernel.org 2.6.15.7 (uni-processor) I've not seen NBD hang in the kernel like I am with RHEL5 and SLES10. This hang (tcp timeout) is indefinite on RHEL5 and ~5min on SLES10. Should/can I be playing with TCP timeout values? Why was this not a concern with kernel.org 2.6.15.7? I was able to "feel" the nbd connection break immediately; no MD superblock update hangs, no long-winded (or indefinite) TCP timeout. regards, Mike
Re: raid1 with nbd member hangs MD on SLES10 and RHEL5
On 6/14/07, Paul Clements <[EMAIL PROTECTED]> wrote: Mike Snitzer wrote: > On 6/14/07, Paul Clements <[EMAIL PROTECTED]> wrote: >> Mike Snitzer wrote: >> >> > Here are the steps to reproduce reliably on SLES10 SP1: >> > 1) establish a raid1 mirror (md0) using one local member (sdc1) and >> > one remote member (nbd0) >> > 2) power off the remote machine, thereby severing nbd0's connection >> > 3) perform IO to the filesystem that is on the md0 device to induce >> > the MD layer to mark the nbd device as "faulty" >> > 4) cat /proc/mdstat hangs, sysrq trace was collected >> >> That's working as designed. NBD works over TCP. You're going to have to >> wait for TCP to time out before an error occurs. Until then I/O will >> hang. > > With kernel.org 2.6.15.7 (uni-processor) I've not seen NBD hang in the > kernel like I am with RHEL5 and SLES10. This hang (tcp timeout) is > indefinite on RHEL5 and ~5min on SLES10. > > Should/can I be playing with TCP timeout values? Why was this not a > concern with kernel.org 2.6.15.7? I was able to "feel" the nbd > connection break immediately; no MD superblock update hangs, no > long-winded (or indefinite) TCP timeout. I don't know. I've never seen nbd immediately start returning I/O errors. Perhaps something was different about the configuration? If the other machine rebooted quickly, for instance, you'd get a connection reset, which would kill the nbd connection. OK, I'll retest the 2.6.15.7 setup. As for SLES10 and RHEL5, I've been leaving the remote server powered off. As such I'm at the full mercy of the TCP timeout. It is odd that RHEL5 has been hanging indefinitely, but I'll dig deeper on that once I come to terms with how kernel.org and SLES10 behave. I'll update with my findings for completeness. Thanks for your insight! Mike
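An aside on the TCP timeout question above: nbd only sees an I/O error once TCP gives up retransmitting, which on Linux is governed by the sysctl net.ipv4.tcp_retries2 (default 15). A rough model of the giving-up time, assuming exponential RTO backoff from a 200ms initial value capped at 120s (the actual initial RTO scales with the measured RTT, so real-world timeouts vary):

```python
def retransmit_window(retries=15, rto=0.2, rto_max=120.0):
    """Approximate seconds TCP keeps retransmitting before erroring out:
    the retransmission timeout (RTO) doubles on each attempt, capped at rto_max."""
    total = 0.0
    for _ in range(retries):
        total += rto
        rto = min(rto * 2, rto_max)
    return total

print(round(retransmit_window() / 60, 1))  # roughly 13 minutes with the defaults
```

Lowering tcp_retries2 shortens that window; a difference in the effective retry budget between the two distro kernels might explain why one hangs for minutes and the other seemingly forever.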
Need clarification on raid1 resync behavior with bitmap support
On 6/1/06, NeilBrown <[EMAIL PROTECTED]> wrote: When an array has a bitmap, a device can be removed and re-added and only blocks changed since the removal (as recorded in the bitmap) will be resynced. Neil, Does the same apply when a bitmap-enabled raid1's member goes faulty? Meaning even if a member is faulty, when the user removes and re-adds the faulty device the raid1 rebuild _should_ leverage the bitmap during a resync, right? I've seen messages like:
[12068875.690255] raid1: raid set md0 active with 2 out of 2 mirrors
[12068875.690284] md0: bitmap file is out of date (0 < 1) -- forcing full recovery
[12068875.690289] md0: bitmap file is out of date, doing full recovery
[12068875.710214] md0: bitmap initialized from disk: read 5/5 pages, set 131056 bits, status: 0
[12068875.710222] created bitmap (64 pages) for device md0
Could you share the other situations where a bitmap-enabled raid1 _must_ perform a full recovery? - Correct me if I'm wrong, but one that comes to mind is when a server reboots (after cleanly stopping a raid1 array that had a faulty member) and then either: 1) assembles the array with the previously faulty member now available 2) assembles the array with the same faulty member missing. The user later re-adds the faulty member AFAIK both scenarios would bring about a full resync. regards, Mike
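A note on the "out of date (0 < 1)" message quoted above: the bitmap superblock carries its own event counter, and when it lags behind the array's event counter md distrusts the bitmap and forces a full recovery. A toy model of that decision (simplified for illustration; not the literal logic in drivers/md/bitmap.c):

```python
def recovery_mode(bitmap_events, array_events, dirty_chunks):
    """Toy model: an out-of-date bitmap forces full recovery; a current
    bitmap limits resync to the chunks marked dirty."""
    if bitmap_events < array_events:
        # analogous to: "bitmap file is out of date (0 < 1) -- forcing full recovery"
        return "full"
    return "partial" if dirty_chunks else "none"

print(recovery_mode(0, 1, 131056))  # the logged case above: "full"
```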
Re: Need clarification on raid1 resync behavior with bitmap support
On 7/23/07, Neil Brown <[EMAIL PROTECTED]> wrote: On Saturday July 21, [EMAIL PROTECTED] wrote: > Could you share the other situations where a bitmap-enabled raid1 > _must_ perform a full recovery? When you add a new drive. When you create a new bitmap. I think that should be all. > - Correct me if I'm wrong, but one that comes to mind is when a server > reboots (after cleanly stopping a raid1 array that had a faulty > member) and then either: > 1) assembles the array with the previously faulty member now > available > > 2) assembles the array with the same faulty member missing. The user > later re-adds the faulty member > > AFAIK both scenarios would bring about a full resync. Only if the drive is not recognised as the original member. Can you test this out and report a sequence of events that causes a full resync? Sure, using an internal-bitmap-enabled raid1 with 2 loopback devices on a stock 2.6.20.1 kernel, the following sequences result in a full resync. (FYI, I'm fairly certain I've seen this same behavior on 2.6.18 and 2.6.15 kernels too but would need to retest): 1) mdadm /dev/md0 --manage --fail /dev/loop0 mdadm -S /dev/md0 mdadm --assemble /dev/md0 /dev/loop0 /dev/loop1 mdadm: /dev/md0 has been started with 1 drive (out of 2). NOTE: kernel log says: md: kicking non-fresh loop0 from array! mdadm /dev/md0 --manage --re-add /dev/loop0 2) mdadm /dev/md0 --manage --fail /dev/loop0 mdadm /dev/md0 --manage --remove /dev/loop0 mdadm -S /dev/md0 mdadm --assemble /dev/md0 /dev/loop0 /dev/loop1 mdadm: /dev/md0 has been started with 1 drive (out of 2). NOTE: kernel log says: md: kicking non-fresh loop0 from array! mdadm /dev/md0 --manage --re-add /dev/loop0 Is stopping the MD (either with mdadm -S or a server reboot) tainting that faulty member's ability to come back in using a quick bitmap-based resync? 
Mike
Re: Need clarification on raid1 resync behavior with bitmap support
On 8/3/07, Neil Brown <[EMAIL PROTECTED]> wrote: > On Monday July 23, [EMAIL PROTECTED] wrote: > > On 7/23/07, Neil Brown <[EMAIL PROTECTED]> wrote: > > > Can you test this out and report a sequence of events that causes a > > > full resync? > > > > Sure, using an internal-bitmap-enabled raid1 with 2 loopback devices > > on a stock 2.6.20.1 kernel, the following sequences result in a full > > resync. (FYI, I'm fairly certain I've seen this same behavior on > > 2.6.18 and 2.6.15 kernels too but would need to retest): > > > > 1) > > mdadm /dev/md0 --manage --fail /dev/loop0 > > mdadm -S /dev/md0 > > mdadm --assemble /dev/md0 /dev/loop0 /dev/loop1 > > mdadm: /dev/md0 has been started with 1 drive (out of 2). > > NOTE: kernel log says: md: kicking non-fresh loop0 from array! > > mdadm /dev/md0 --manage --re-add /dev/loop0 > > > sorry for the slow response. > > It looks like commit 1757128438d41670ded8bc3bc735325cc07dc8f9 > (December 2006) set conf->fullsync a little too often. > > This seems to fix it, and I'm fairly sure it is correct. > > Thanks, > NeilBrown > > -- > Make sure a re-add after a restart honours bitmap when resyncing. > > Commit 1757128438d41670ded8bc3bc735325cc07dc8f9 was slightly bad. > If an array has a write-intent bitmap, and you remove a drive, > then re-add it, only the changed parts should be resynced. > This only works if the array has not been shut down and restarted. > > The above mentioned commit sets 'fullsync' a little more often > than it should. This patch is more careful. I hand-patched your change into a 2.6.20.1 kernel (I'd imagine your patch is against current git). I didn't see any difference because unfortunately both of my full resync scenarios included stopping a degraded raid after either: 1) failing but not removing a member, or 2) failing and removing a member. In both scenarios, if I didn't stop the array and just removed and re-added the faulty drive, the array would _not_ do a full resync. 
My examples clearly conflict with your assertion that: "This only works if the array has not been shut down and restarted." But shouldn't raid1 be better about leveraging the bitmap of known good (fresh) members even after having stopped a degraded array? Why is it that when an array is stopped raid1 seemingly loses the required metadata that enables bitmap resyncs to just work upon re-add IFF the array is _not_ stopped? Couldn't raid1 be made to assemble the array to look like the array had never been stopped, leaving the non-fresh members out as it already does, and only then re-add the "non-fresh" members that were provided? To be explicit: isn't the bitmap still valid on the fresh members? If so, why is raid1 just disregarding the fresh bitmap? Thanks, I really appreciate your insight. Mike
Re: Need clarification on raid1 resync behavior with bitmap support
On 8/3/07, Neil Brown <[EMAIL PROTECTED]> wrote: > On Friday August 3, [EMAIL PROTECTED] wrote: > > > > I hand-patched your change into a 2.6.20.1 kernel (I'd imagine your > > patch is against current git). I didn't see any difference because > > unfortunately both of my full resync scenarios included stopping a > > degraded raid after either: 1) having failed but not been removed a > > member 2) having failed and removed a member. In both scenarios if I > > didn't stop the array and I just removed and re-added the faulty drive > > the array would _not_ do a full resync. > > > > My examples clearly conflict with your assertion that: "This only > > works if the array has not been shut down and restarted." > > I think my changelog entry for the patch was poorly written. > What I meant to say was: > *before this patch* a remove and re-add only does a partial resync > if the array has not been shutdown and restarted in the interim. > The implication being that *after the patch*, a shutdown and restart > will not interfere and a remove followed by a readd will always do a > partial resync, even if the array was shutdown and restarted while > degraded. Great, thanks for clarifying. > > To be explicit: isn't the bitmap still valid on the fresh members? If > > so, why is raid1 just disregarding the fresh bitmap? > > Yes. Exactly. It is my understanding and experience that the patch I > sent fixes a bug so that it doesn't disregard the fresh bitmap. It > should fix it for 2.6.20.1 as well. > > Are you saying that you tried the same scenario with the patch applied > and it still did a full resync? How do you measure whether it did a > full resync or a partial resync? I must not have loaded the patched raid1.ko because after retesting it is clear that your patch does in fact fix the issue. FYI, before, I could just tell a full resync was occurring by looking at /proc/mdstat and the time that elapsed. Thanks for your help, any idea when this fix will make it upstream? 
regards, Mike
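For anyone repeating the experiment above: the crude /proc/mdstat eyeballing can be scripted. The helper and the sample line below are illustrative only (not output from the machines in question); the tell is that a bitmap-based partial resync reports a total far below the device size, while a full resync counts the whole device.

```python
import re

def resync_progress(mdstat_line):
    """Pull (done_blocks, total_blocks) out of a /proc/mdstat recovery line."""
    m = re.search(r"\((\d+)/(\d+)\)", mdstat_line)
    return (int(m.group(1)), int(m.group(2))) if m else None

sample = "      [==>.................]  recovery = 12.6% (1638400/13000000) finish=3.2min"
print(resync_progress(sample))  # (1638400, 13000000)
```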
Re: detecting read errors after RAID1 check operation
On 8/17/07, Mike Accetta <[EMAIL PROTECTED]> wrote: > > Neil Brown writes: > > On Wednesday August 15, [EMAIL PROTECTED] wrote: > > > Neil Brown writes: > > > > On Wednesday August 15, [EMAIL PROTECTED] wrote: > > > > > > > > ... > > > This happens in our old friend sync_request_write()? I'm dealing with > > > > Yes, that would be the place. > > > > > ... > > > This fragment > > > > > > if (j < 0 || test_bit(MD_RECOVERY_CHECK, &mddev->recovery)) { > > > sbio->bi_end_io = NULL; > > > rdev_dec_pending(conf->mirrors[i].rdev, mddev); > > > } else { > > > /* fixup the bio for reuse */ > > > ... > > > } > > > > > > looks suspicously like any correction attempt for 'check' is being > > > short-circuited to me, regardless of whether or not there was a read > > > error. Actually, even if the rewrite was not being short-circuited, > > > I still don't see the path that would update 'corrected_errors' in this > > > case. There are only two raid1.c sites that touch 'corrected_errors', one > > > is in fix_read_errors() and the other is later in sync_request_write(). > > > With my limited understanding of how this all works, neither of these > > > paths would seem to apply here. > > > > hmmm yes > > I guess I was thinking of the RAID5 code rather than the RAID1 code. > > It doesn't do the right thing does it? > > Maybe this patch is what we need. I think it is right. 
> > Thanks,
> > NeilBrown
> >
> > Signed-off-by: Neil Brown <[EMAIL PROTECTED]>
> >
> > ### Diffstat output
> >  ./drivers/md/raid1.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
> > --- .prev/drivers/md/raid1.c	2007-08-16 10:29:58.0 +1000
> > +++ ./drivers/md/raid1.c	2007-08-17 12:07:35.0 +1000
> > @@ -1260,7 +1260,8 @@ static void sync_request_write(mddev_t *
> >  					j = 0;
> >  			if (j >= 0)
> >  				mddev->resync_mismatches += r1_bio->sectors;
> > -			if (j < 0 || test_bit(MD_RECOVERY_CHECK, &mddev->recovery)) {
> > +			if (j < 0 || (test_bit(MD_RECOVERY_CHECK, &mddev->recovery)
> > +				      && text_bit(BIO_UPTODATE, &sbio->bi_flags))) {
> >  				sbio->bi_end_io = NULL;
> >  				rdev_dec_pending(conf->mirrors[i].rdev, mddev);
> >  			} else {
>
> I tried this (with the typo fixed) and it indeed issues a re-write. However, it doesn't seem to do anything with the corrected errors count if the rewrite succeeds. Since end_sync_write() is only used in one other place when !In_sync, I tried the following and it seems to work to get the error count updated. I don't know whether this belongs in end_sync_write(), but I'd think it needs to come after the write actually succeeds, so that seems like the earliest it could be done.

Neil, Any feedback on Mike's patch? thanks, Mike
Re: mke2fs stuck in D state while creating filesystem on md*
On 9/19/07, Wiesner Thomas <[EMAIL PROTECTED]> wrote: > > Has there been any progress on this? I think I saw it, or something > similar, during some testing of recent 2.6.23-rc kernels: one mke2fs took > about 11 min longer than all the others (~2 min) and it was not > repeatable. I worry that a process of more interest will have the same > hang. > > Well, I must say: no. I haven't tried anything further. I've set up the > production system a week or so ago > which runs Debian Etch with no modifications (kernel 2.6.18 I think, the > debian one and a mdadm 2.5.6-9). > I didn't notice the problem while creating the raid but that doesn't mean > anything as I didn't pay attention, > and as I wrote earlier it isn't reliably reproducible. > (Watching it on a large storage gets boring very fast.) > > I'm not a kernel programmer but I can test another kernel or mdadm version > if it helps, but let me know > if you want me to do that. If/when you experience the hang again please get a trace of all processes with: echo t > /proc/sysrq-trigger Of particular interest is the mke2fs trace, as well as any md threads.
mdadm > 2.2 ver1 superblock regression?
When I try to create a RAID1 array with ver 1.0 superblock using mdadm > 2.2 I'm getting: WARNING - superblock isn't sized correctly Looking at the code (and adding a bit more debugging) it is clear that all 3 checks fail in super1.c's calc_sb_1_csum()'s "make sure I can count..." test. Is this a regression in mdadm 2.4, 2.3.1 and 2.3 (NOTE: mdadm 2.2's ver1 sb works!)? please advise, thanks. Mike
Re: mdadm > 2.2 ver1 superblock regression?
On 4/7/06, Neil Brown <[EMAIL PROTECTED]> wrote: > On Friday April 7, [EMAIL PROTECTED] wrote: > > > > Seeing this hasn't made it into a released kernel yet, I might just > > change it. But I'll have to make sure that old mdadm's don't mess > > things up ... I wonder how I will do that :-( > > > > Thanks for the report. > > Yes, try 2.4.1 (just released). Works great.. thanks for the extremely quick response and fix. Mike
Re: accessing mirrored lvm on shared storage
On 4/12/06, Neil Brown <[EMAIL PROTECTED]> wrote: > One thing that is on my todo list is supporting shared raid1, so that > several nodes in the cluster can assemble the same raid1 and access it > - providing that the clients all do proper mutual exclusion as > e.g. OCFS does. Very cool... that would be extremely nice to have. Any estimate on when you might get to this? Mike
Re: accessing mirrored lvm on shared storage
On 4/16/06, Neil Brown <[EMAIL PROTECTED]> wrote: > On Thursday April 13, [EMAIL PROTECTED] wrote: > > On 4/12/06, Neil Brown <[EMAIL PROTECTED]> wrote: > > > > > One thing that is on my todo list is supporting shared raid1, so that > > > several nodes in the cluster can assemble the same raid1 and access it > > > - providing that the clients all do proper mutual exclusion as > > > e.g. OCFS does. > > > > Very cool... that would be extremely nice to have. Any estimate on > > when you might get to this? > > > > I'm working on it, but there are lots of distractions > > The first step is getting support into the kernel for various > operations like suspending and resuming IO and resync. > That is progressing nicely. Sounds good... will it be possible to suspend/resume IO to only specific members of the raid1 (aka partial IO/resync suspend/resume)? If not I have a tangential raid1 suspend/resume question: is there a better/cleaner way to suspend and resume a raid1 mirror than removing and re-adding a member? That is, you: 1) establish a 2 disk raid1 2) suspend the mirror but allow degraded changes to occur (remove member?) 3) after a user specified interval resume the mirror to resync (re-add member?) 4) goto 2 Using the write-intent bitmap, the resync should be relatively cheap. However, would it be better to just use mdadm to tag a member as write-mostly and enable write-behind on the raid1? BUT is there a way to set the write-behind to 0 to force a resync at a certain time (it would appear write-behind is a create-time feature)? thanks, mike
kicking non-fresh member from array?
All, I have repeatedly seen that when a 2 member raid1 becomes degraded, and IO continues to the lone good member, that if the array is then stopped and reassembled you get:
md: bind
md: bind
md: kicking non-fresh nbd0 from array!
md: unbind
md: export_rdev(nbd0)
raid1: raid set md0 active with 1 out of 2 mirrors
I'm not seeing how one can avoid assembling such an array in 2 passes: 1) assemble array with both members 2) if a member was deemed "non-fresh", re-add that member, thereby triggering recovery. So why does MD kick non-fresh members out on assemble when it's perfectly capable of recovering the "non-fresh" member? Looking at md.c it is fairly clear there isn't a way to avoid this 2-step procedure. Why/how does MD benefit from this "kicking non-fresh" semantic? Should MD/mdadm be made optionally tolerant of such non-fresh members during assembly? Mike
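For what it's worth, the freshness test itself is simple. A toy model of what assemble does with the members' superblock event counters (simplified from md.c, not the literal kernel code; md tolerates a member being one event behind, which covers a missed final clean-shutdown update):

```python
def non_fresh(events_by_dev):
    """Toy model of md assembly: members whose superblock event count is
    more than one behind the freshest member get kicked from the array."""
    freshest = max(events_by_dev.values())
    return [dev for dev, ev in sorted(events_by_dev.items()) if ev + 1 < freshest]

# after degraded IO, the surviving member's event count has advanced:
print(non_fresh({"sdc1": 42, "nbd0": 17}))  # ['nbd0'] -- "kicking non-fresh nbd0"
```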
mdadm 2.6.x regression, fails creation of raid1 w/ v1.0 sb and internal bitmap
mdadm 2.4.1 through 2.5.6 works. mdadm-2.6's "Improve allocation and use of space for bitmaps in version1 metadata" (199171a297a87d7696b6b8c07ee520363f4603c1) would seem like the offending change. Using 1.2 metadata works. I get the following using the tip of the mdadm git repo or any other version of mdadm 2.6.x:
# mdadm --create /dev/md2 --run -l 1 --metadata=1.0 --bitmap=internal -n 2 /dev/sdf --write-mostly /dev/nbd2
mdadm: /dev/sdf appears to be part of a raid array:
    level=raid1 devices=2 ctime=Wed Oct 17 10:17:31 2007
mdadm: /dev/nbd2 appears to be part of a raid array:
    level=raid1 devices=2 ctime=Wed Oct 17 10:17:31 2007
mdadm: RUN_ARRAY failed: Input/output error
mdadm: stopped /dev/md2
kernel log shows:
md2: bitmap initialized from disk: read 22/22 pages, set 715290 bits, status: 0
created bitmap (350 pages) for device md2
md2: failed to create bitmap (-5)
md: pers->run() failed ...
md: md2 stopped.
md: unbind
md: export_rdev(nbd2)
md: unbind
md: export_rdev(sdf)
md: md2 stopped.
Re: mdadm 2.6.x regression, fails creation of raid1 w/ v1.0 sb and internal bitmap
On 10/17/07, Bill Davidsen <[EMAIL PROTECTED]> wrote: > Mike Snitzer wrote: > > mdadm 2.4.1 through 2.5.6 works. mdadm-2.6's "Improve allocation and > > use of space for bitmaps in version1 metadata" > > (199171a297a87d7696b6b8c07ee520363f4603c1) would seem like the > > offending change. Using 1.2 metadata works. > > > > I get the following using the tip of the mdadm git repo or any other > > version of mdadm 2.6.x: > > > > # mdadm --create /dev/md2 --run -l 1 --metadata=1.0 --bitmap=internal > > -n 2 /dev/sdf --write-mostly /dev/nbd2 > > mdadm: /dev/sdf appears to be part of a raid array: > > level=raid1 devices=2 ctime=Wed Oct 17 10:17:31 2007 > > mdadm: /dev/nbd2 appears to be part of a raid array: > > level=raid1 devices=2 ctime=Wed Oct 17 10:17:31 2007 > > mdadm: RUN_ARRAY failed: Input/output error > > mdadm: stopped /dev/md2 > > > > kernel log shows: > > md2: bitmap initialized from disk: read 22/22 pages, set 715290 bits, > > status: 0 > > created bitmap (350 pages) for device md2 > > md2: failed to create bitmap (-5) > > md: pers->run() failed ... > > md: md2 stopped. > > md: unbind > > md: export_rdev(nbd2) > > md: unbind > > md: export_rdev(sdf) > > md: md2 stopped. > > > > I would start by retrying with an external bitmap, to see if for some > reason there isn't room for the bitmap. If that fails, perhaps no bitmap > at all would be a useful data point. Was the original metadata the same > version? Things moved depending on the exact version, and some > --zero-superblock magic might be needed. Hopefully Neil can clarify, I'm > just telling you what I suspect is the problem, and maybe a > non-destructive solution. Creating with an external bitmap works perfectly fine. As does creating without a bitmap. --zero-superblock hasn't helped. Metadata v1.1 and v1.2 work with an internal bitmap. I'd like to use v1.0 with an internal bitmap (using an external bitmap isn't an option for me). 
It does appear that the changes to super1.c aren't leaving adequate room for the bitmap. Looking at the relevant diff for v1.0 metadata, the newer super1.c code makes use of a larger bitmap (128K) for devices > 200GB. My block device is 700GB. So could the larger block device possibly explain why others haven't noticed this? Mike
Re: mdadm 2.6.x regression, fails creation of raid1 w/ v1.0 sb and internal bitmap
On 10/18/07, Neil Brown <[EMAIL PROTECTED]> wrote: > On Wednesday October 17, [EMAIL PROTECTED] wrote: > > mdadm 2.4.1 through 2.5.6 works. mdadm-2.6's "Improve allocation and > > use of space for bitmaps in version1 metadata" > > (199171a297a87d7696b6b8c07ee520363f4603c1) would seem like the > > offending change. Using 1.2 metadata works. > > > > I get the following using the tip of the mdadm git repo or any other > > version of mdadm 2.6.x: > > > > # mdadm --create /dev/md2 --run -l 1 --metadata=1.0 --bitmap=internal > > -n 2 /dev/sdf --write-mostly /dev/nbd2 > > mdadm: /dev/sdf appears to be part of a raid array: > > level=raid1 devices=2 ctime=Wed Oct 17 10:17:31 2007 > > mdadm: /dev/nbd2 appears to be part of a raid array: > > level=raid1 devices=2 ctime=Wed Oct 17 10:17:31 2007 > > mdadm: RUN_ARRAY failed: Input/output error > > mdadm: stopped /dev/md2 > > > > kernel log shows: > > md2: bitmap initialized from disk: read 22/22 pages, set 715290 bits, > > status: 0 > > created bitmap (350 pages) for device md2 > > md2: failed to create bitmap (-5) > > Could you please tell me the exact size of your device? Then I should > be able to reproduce it and test a fix. > > (It works for a 734003201K device). 732456960K. It is fairly surprising that such a relatively small difference in size would prevent it from working... regards, Mike
Re: kicking non-fresh member from array?
On 10/18/07, Goswin von Brederlow <[EMAIL PROTECTED]> wrote: > "Mike Snitzer" <[EMAIL PROTECTED]> writes: > > > All, > > > > I have repeatedly seen that when a 2 member raid1 becomes degraded, > > and IO continues to the lone good member, that if the array is then > > stopped and reassembled you get: > > > > md: bind > > md: bind > > md: kicking non-fresh nbd0 from array! > > md: unbind > > md: export_rdev(nbd0) > > raid1: raid set md0 active with 1 out of 2 mirrors > > > > I'm not seeing how one can avoid assembling such an array in 2 passes: > > 1) assemble array with both members > > 2) if a member was deemed "non-fresh", re-add that member, thereby > > triggering recovery. > > > > So why does MD kick non-fresh members out on assemble when it's > > perfectly capable of recovering the "non-fresh" member? Looking at > > md.c it is fairly clear there isn't a way to avoid this 2-step > > procedure. > > > > Why/how does MD benefit from this "kicking non-fresh" semantic? > > Should MD/mdadm be made optionally tolerant of such non-fresh members > > during assembly? > > > > Mike > > What if the disk has lots of bad blocks, just not where the meta data > is? On every restart you would resync and fail. > > Or what if you removed a mirror to keep a snapshot of a previous > state? If it auto resyncs you lose that snapshot. Both of your examples are fairly tenuous given that such members shouldn't have been provided on the --assemble commandline. I'm not talking about auto assemble via udev or something. But auto assemble via udev brings up an annoying corner-case when you consider the 2 cases you pointed out. So you have valid points. This leads to my last question: having the ability to _optionally_ tolerate (repair) such stale members would allow for greater flexibility. 
The current behavior isn't conducive to repairing unprotected raids (that mdadm/md were told to assemble with specific members) without taking steps to say "no I really _really_ mean it; now re-add this disk!". Any pointers from Neil (or others) on how such a 'repair "non-fresh" member(s) on assemble' override _should_ be implemented would be helpful. My first thought is to add a new superblock --update=repair-non-fresh option to mdadm that would tie into a new flag in the MD superblock. But then it begs the question: why not first add support to set such a superblock option at MD create-time? The validate_super methods would also need to be trained accordingly. regards, Mike
Re: mdadm 2.6.x regression, fails creation of raid1 w/ v1.0 sb and internal bitmap
On 10/19/07, Neil Brown <[EMAIL PROTECTED]> wrote: > On Friday October 19, [EMAIL PROTECTED] wrote: > > I'm using a stock 2.6.19.7 that I then backported various MD fixes to > > from 2.6.20 -> 2.6.23... this kernel has worked great until I > > attempted v1.0 sb w/ bitmap=internal using mdadm 2.6.x. > > > > But would you like me to try a stock 2.6.22 or 2.6.23 kernel? > > Yes please. > I'm suspecting the code in write_sb_page where it tests if the bitmap > overlaps the data or metadata. The only way I can see you getting the > exact error that you do get is for that to fail. > That test was introduced in 2.6.22. Did you backport that? Any > chance it got mucked up a bit? I believe you're referring to commit f0d76d70bc77b9b11256a3a23e98e80878be1578. That change actually made it into 2.6.23 AFAIK; but yes I actually did backport that fix (which depended on ab6085c795a71b6a21afe7469d30a365338add7a). If I back-out f0d76d70bc77b9b11256a3a23e98e80878be1578 I can create a raid1 w/ v1.0 sb and an internal bitmap. But clearly that is just because I removed the negative checks that you introduced ;) For me this begs the question: what else would f0d76d70bc77b9b11256a3a23e98e80878be1578 depend on that I missed? I included 505fa2c4a2f125a70951926dfb22b9cf273994f1 and ab6085c795a71b6a21afe7469d30a365338add7a too. *shrug*... Mike
Re: mdadm 2.6.x regression, fails creation of raid1 w/ v1.0 sb and internal bitmap
On 10/18/07, Neil Brown <[EMAIL PROTECTED]> wrote: > > Sorry, I wasn't paying close enough attention and missed the obvious. > . > > On Thursday October 18, [EMAIL PROTECTED] wrote: > > On 10/18/07, Neil Brown <[EMAIL PROTECTED]> wrote: > > > On Wednesday October 17, [EMAIL PROTECTED] wrote: > > > > mdadm 2.4.1 through 2.5.6 works. mdadm-2.6's "Improve allocation and > > > > use of space for bitmaps in version1 metadata" > > > > (199171a297a87d7696b6b8c07ee520363f4603c1) would seem like the > > > > offending change. Using 1.2 metdata works. > > > > > > > > I get the following using the tip of the mdadm git repo or any other > > > > version of mdadm 2.6.x: > > > > > > > > # mdadm --create /dev/md2 --run -l 1 --metadata=1.0 --bitmap=internal > > > > -n 2 /dev/sdf --write-mostly /dev/nbd2 > > > > mdadm: /dev/sdf appears to be part of a raid array: > > > > level=raid1 devices=2 ctime=Wed Oct 17 10:17:31 2007 > > > > mdadm: /dev/nbd2 appears to be part of a raid array: > > > > level=raid1 devices=2 ctime=Wed Oct 17 10:17:31 2007 > > > > mdadm: RUN_ARRAY failed: Input/output error >^^ > > This means there was an IO error. i.e. there is a block on the device > that cannot be read from. > It worked with earlier version of mdadm because they used a much > smaller bitmap. With the patch you mention in place, mdadm tries > harder to find a good location and good size for a bitmap and to > make sure that space is available. > The important fact is that the bitmap ends up at a different > location. > > You have a bad block at that location, it would seem. I'm a bit skeptical of that being the case considering I get this error on _any_ pair of disks I try in an environment where I'm mirroring across servers that each have access to 8 of these disks. Each of the 8 mirrors consists of a local member and a remote (nbd) member. 
I can't see all 16 disks having the very same bad block(s) at the end of the disk ;)

It feels to me like the calculation that you're making isn't leaving adequate room for the 128K bitmap without hitting the superblock... but I don't have hard proof yet ;)

> I would have expected an error in the kernel logs about the read error
> though - that is strange.

What about the "md2: failed to create bitmap (-5)"?

> What do
> mdadm -E
> and
> mdadm -X
> on each device say?

# mdadm -E /dev/sdf
/dev/sdf:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : caabb900:616bfc5a:03763b95:83ea99a7
           Name : 2
  Creation Time : Fri Oct 19 00:38:45 2007
     Raid Level : raid1
   Raid Devices : 2
  Used Dev Size : 1464913648 (698.53 GiB 750.04 GB)
     Array Size : 1464913648 (698.53 GiB 750.04 GB)
   Super Offset : 1464913904 sectors
          State : clean
    Device UUID : 978cdd42:abaa82a1:4ad79285:1b56ed86
Internal Bitmap : -176 sectors from superblock
    Update Time : Fri Oct 19 00:38:45 2007
       Checksum : c6bb03db - correct
         Events : 0
     Array Slot : 0 (0, 1)
    Array State : Uu

# mdadm -E /dev/nbd2
/dev/nbd2:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : caabb900:616bfc5a:03763b95:83ea99a7
           Name : 2
  Creation Time : Fri Oct 19 00:38:45 2007
     Raid Level : raid1
   Raid Devices : 2
  Used Dev Size : 1464913648 (698.53 GiB 750.04 GB)
     Array Size : 1464913648 (698.53 GiB 750.04 GB)
   Super Offset : 1464913904 sectors
          State : clean
    Device UUID : 180209d2:cff9b5d0:05054d19:2e4930f2
Internal Bitmap : -176 sectors from superblock
          Flags : write-mostly
    Update Time : Fri Oct 19 00:38:45 2007
       Checksum : 8416e951 - correct
         Events : 0
     Array Slot : 1 (0, 1)
    Array State : uU

# mdadm -X /dev/sdf
        Filename : /dev/sdf
           Magic : 6d746962
         Version : 4
            UUID : caabb900:616bfc5a:03763b95:83ea99a7
          Events : 0
  Events Cleared : 0
           State : OK
       Chunksize : 1 MB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 732456824 (698.53 GiB 750.04 GB)
          Bitmap : 715290 bits (chunks), 715290 dirty (100.0%)

# mdadm -X /dev/nbd2
        Filename : /dev/nbd2
           Magic : 6d746962
         Version : 4
            UUID : caabb900:616bfc5a:03763b95:83ea99a7
          Events : 0
  Events Cleared : 0
           State : OK
       Chunksize : 1 MB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 732456824 (698.53 GiB 750.04 GB)
          Bitmap : 715290 bits (chunks), 715290 dirty (100.0%)

> > > > mdadm: stopped /dev/md2
> > > >
> > > > kernel log shows:
> > > > md2: bitmap initialized from disk: read 22/22 pages, set 715290 bits,
> > > > status: 0
> > > > created bitmap (350 pages) for device md2
> > > > md2: failed to create bitmap (-5)

I assumed that the RUN_ARRAY failure (via IO error) was a side-effect of MD's inability to create the bitmap (-5): md2: bitmap initialized from disk: read 22/22 pages, set 715290 bits, status: 0
Re: mdadm 2.6.x regression, fails creation of raid1 w/ v1.0 sb and internal bitmap
On 10/19/07, Mike Snitzer <[EMAIL PROTECTED]> wrote: > On 10/18/07, Neil Brown <[EMAIL PROTECTED]> wrote: > > > > Sorry, I wasn't paying close enough attention and missed the obvious. > > . > > > > On Thursday October 18, [EMAIL PROTECTED] wrote: > > > On 10/18/07, Neil Brown <[EMAIL PROTECTED]> wrote: > > > > On Wednesday October 17, [EMAIL PROTECTED] wrote: > > > > > mdadm 2.4.1 through 2.5.6 works. mdadm-2.6's "Improve allocation and > > > > > use of space for bitmaps in version1 metadata" > > > > > (199171a297a87d7696b6b8c07ee520363f4603c1) would seem like the > > > > > offending change. Using 1.2 metdata works. > > > > > > > > > > I get the following using the tip of the mdadm git repo or any other > > > > > version of mdadm 2.6.x: > > > > > > > > > > # mdadm --create /dev/md2 --run -l 1 --metadata=1.0 --bitmap=internal > > > > > -n 2 /dev/sdf --write-mostly /dev/nbd2 > > > > > mdadm: /dev/sdf appears to be part of a raid array: > > > > > level=raid1 devices=2 ctime=Wed Oct 17 10:17:31 2007 > > > > > mdadm: /dev/nbd2 appears to be part of a raid array: > > > > > level=raid1 devices=2 ctime=Wed Oct 17 10:17:31 2007 > > > > > mdadm: RUN_ARRAY failed: Input/output error > >^^ > > > > This means there was an IO error. i.e. there is a block on the device > > that cannot be read from. > > It worked with earlier version of mdadm because they used a much > > smaller bitmap. With the patch you mention in place, mdadm tries > > harder to find a good location and good size for a bitmap and to > > make sure that space is available. > > The important fact is that the bitmap ends up at a different > > location. > > > > You have a bad block at that location, it would seem. > > I'm a bit skeptical of that being the case considering I get this > error on _any_ pair of disks I try in an environment where I'm > mirroring across servers that each have access to 8 of these disks. > Each of the 8 mirrors consists of a local member and a remote (nbd) > member. 
> I can't see all 16 disks having the very same bad block(s) at
> the end of the disk ;)
>
> It feels to me like the calculation that you're making isn't leaving
> adequate room for the 128K bitmap without hitting the superblock...
> but I don't have hard proof yet ;)

To further test this I used 2 local sparse 732456960K loopback devices and attempted to create the raid1 in the same manner. It failed in exactly the same way. This should cast further doubt on the bad block theory, no?

I'm using a stock 2.6.19.7 that I then backported various MD fixes to from 2.6.20 -> 2.6.23... this kernel has worked great until I attempted v1.0 sb w/ bitmap=internal using mdadm 2.6.x.

But would you like me to try a stock 2.6.22 or 2.6.23 kernel?

Mike
Re: mdadm 2.6.x regression, fails creation of raid1 w/ v1.0 sb and internal bitmap
On 10/22/07, Neil Brown <[EMAIL PROTECTED]> wrote: > On Friday October 19, [EMAIL PROTECTED] wrote: > > On 10/19/07, Neil Brown <[EMAIL PROTECTED]> wrote: > > > On Friday October 19, [EMAIL PROTECTED] wrote: > > > > > > I'm using a stock 2.6.19.7 that I then backported various MD fixes to > > > > from 2.6.20 -> 2.6.23... this kernel has worked great until I > > > > attempted v1.0 sb w/ bitmap=internal using mdadm 2.6.x. > > > > > > > > But would you like me to try a stock 2.6.22 or 2.6.23 kernel? > > > > > > Yes please. > > > I'm suspecting the code in write_sb_page where it tests if the bitmap > > > overlaps the data or metadata. The only way I can see you getting the > > > exact error that you do get it for that to fail. > > > That test was introduced in 2.6.22. Did you backport that? Any > > > chance it got mucked up a bit? > > > > I believe you're referring to commit > > f0d76d70bc77b9b11256a3a23e98e80878be1578. That change actually made > > it into 2.6.23 AFAIK; but yes I actually did backport that fix (which > > depended on ab6085c795a71b6a21afe7469d30a365338add7a). > > > > If I back-out f0d76d70bc77b9b11256a3a23e98e80878be1578 I can create a > > raid1 w/ v1.0 sb and an internal bitmap. But clearly that is just > > because I removed the negative checks that you introduced ;) > > > > For me this begs the question: what else would > > f0d76d70bc77b9b11256a3a23e98e80878be1578 depend on that I missed? I > > included 505fa2c4a2f125a70951926dfb22b9cf273994f1 and > > ab6085c795a71b6a21afe7469d30a365338add7a too. > > > > *shrug*... > > > > This is all very odd... > I definitely tested this last week and couldn't reproduce the > problem. This week I can reproduce it easily. And given the nature > of the bug, I cannot see how it ever worked. > > Anyway, here is a fix that works for me. Hey Neil, Your fix works for me too. 
However, I'm wondering why you held back on fixing the same issue in the "bitmap runs into data" comparison that follows:

--- ./drivers/md/bitmap.c	2007-10-19 19:11:58.0 -0400
+++ ./drivers/md/bitmap.c	2007-10-22 09:53:41.0 -0400
@@ -286,7 +286,7 @@
 			/* METADATA BITMAP DATA */
 			if (rdev->sb_offset*2 + bitmap->offset
-			    + page->index*(PAGE_SIZE/512) + size/512
+			    + (long)(page->index*(PAGE_SIZE/512)) + size/512
 			    > rdev->data_offset)
 				/* bitmap runs in to data */
 				return -EINVAL;

Thanks,
Mike
[PATCH] lvm2 support for detecting v1.x MD superblocks
lvm2's MD v1.0 superblock detection doesn't work at all (because it doesn't use v1 sb offsets). I've tested the attached patch to work on MDs with v0.90.0, v1.0, v1.1, and v1.2 superblocks.

please advise, thanks.
Mike

Index: lib/device/dev-md.c
===
RCS file: /cvs/lvm2/LVM2/lib/device/dev-md.c,v
retrieving revision 1.5
diff -u -r1.5 dev-md.c
--- lib/device/dev-md.c	20 Aug 2007 20:55:25 -	1.5
+++ lib/device/dev-md.c	23 Oct 2007 15:17:57 -
@@ -25,6 +25,40 @@
 #define MD_NEW_SIZE_SECTORS(x) ((x & ~(MD_RESERVED_SECTORS - 1)) \
 				- MD_RESERVED_SECTORS)
 
+int dev_has_md_sb(struct device *dev, uint64_t sb_offset, uint64_t *sb)
+{
+	int ret = 0;
+	uint32_t md_magic;
+	/* Version 1 is little endian; version 0.90.0 is machine endian */
+	if (dev_read(dev, sb_offset, sizeof(uint32_t), &md_magic) &&
+	    ((md_magic == xlate32(MD_SB_MAGIC)) ||
+	     (md_magic == MD_SB_MAGIC))) {
+		if (sb)
+			*sb = sb_offset;
+		ret = 1;
+	}
+	return ret;
+}
+
+uint64_t v1_sb_offset(uint64_t size, int minor_version) {
+	uint64_t sb_offset;
+	switch(minor_version) {
+	case 0:
+		sb_offset = size;
+		sb_offset -= 8*2;
+		sb_offset &= ~(4*2-1);
+		break;
+	case 1:
+		sb_offset = 0;
+		break;
+	case 2:
+		sb_offset = 4*2;
+		break;
+	}
+	sb_offset <<= SECTOR_SHIFT;
+	return sb_offset;
+}
+
 /*
  * Returns -1 on error
  */
@@ -35,7 +69,6 @@
 #ifdef linux
 
 	uint64_t size, sb_offset;
-	uint32_t md_magic;
 
 	if (!dev_get_size(dev, &size)) {
 		stack;
@@ -50,16 +83,20 @@
 		return -1;
 	}
 
-	sb_offset = MD_NEW_SIZE_SECTORS(size) << SECTOR_SHIFT;
-
 	/* Check if it is an md component device. */
-	/* Version 1 is little endian; version 0.90.0 is machine endian */
-	if (dev_read(dev, sb_offset, sizeof(uint32_t), &md_magic) &&
-	    ((md_magic == xlate32(MD_SB_MAGIC)) ||
-	     (md_magic == MD_SB_MAGIC))) {
-		if (sb)
-			*sb = sb_offset;
+	/* Version 0.90.0 */
+	sb_offset = MD_NEW_SIZE_SECTORS(size) << SECTOR_SHIFT;
+	if (dev_has_md_sb(dev, sb_offset, sb)) {
 		ret = 1;
+	} else {
+		/* Version 1, try v1.0 -> v1.2 */
+		int minor;
+		for (minor = 0; minor <= 2; minor++) {
+			if (dev_has_md_sb(dev, v1_sb_offset(size, minor), sb)) {
+				ret = 1;
+				break;
+			}
+		}
 	}
 
 	if (!dev_close(dev))
Re: [lvm-devel] [PATCH] lvm2 support for detecting v1.x MD superblocks
On 10/23/07, Alasdair G Kergon <[EMAIL PROTECTED]> wrote: > On Tue, Oct 23, 2007 at 11:32:56AM -0400, Mike Snitzer wrote: > > I've tested the attached patch to work on MDs with v0.90.0, v1.0, > > v1.1, and v1.2 superblocks. > > I'll apply this, thanks, but need to add comments (or reference) to explain > what the hard-coded numbers are: > > sb_offset = (size - 8 * 2) & ~(4 * 2 - 1); > etc.

All values are in terms of sectors, which is where the * 2 comes from. The v1.0 case follows the same model as the MD_NEW_SIZE_SECTORS macro used for v0.90.0. The difference is that the v1.0 superblock is found "at least 8K, but less than 12K, from the end of the device". The same switch statement is used in mdadm and is accompanied by the following comment:

/*
 * Calculate the position of the superblock.
 * It is always aligned to a 4K boundary and
 * depending on minor_version, it can be:
 * 0: At least 8K, but less than 12K, from end of device
 * 1: At start of device
 * 2: 4K from start of device.
 */

Would it be sufficient to add that comment block above v1_sb_offset()'s switch statement?

thanks,
Mike
Re: Time to deprecate old RAID formats?
On 10/24/07, John Stoffel <[EMAIL PROTECTED]> wrote: > > "Bill" == Bill Davidsen <[EMAIL PROTECTED]> writes: > > Bill> John Stoffel wrote: > >> Why do we have three different positions for storing the superblock? > > Bill> Why do you suggest changing anything until you get the answer to > Bill> this question? If you don't understand why there are three > Bill> locations, perhaps that would be a good initial investigation. > > Because I've asked this question before and not gotten an answer, nor > is it answered in the man page for mdadm on why we have this setup. > > Bill> Clearly the short answer is that they reflect three stages of > Bill> Neil's thinking on the topic, and I would bet that he had a good > Bill> reason for moving the superblock when he did it. > > So let's hear Neil's thinking about all this? Or should I just work > up a patch to do what I suggest and see how that flies? > > Bill> Since you have to support all of them or break existing arrays, > Bill> and they all use the same format so there's no saving of code > Bill> size to mention, why even bring this up? > > Because of the confusion factor. Again, since no one has been able to > articulate a reason why we have three different versions of the 1.x > superblock, nor have I seen any good reasons for why we should have > them, I'm going by the KISS principle to reduce the options to the > best one. > > And no, I'm not advocating getting rid of legacy support, but I AM > advocating that we settle on ONE standard format going forward as the > default for all new RAID superblocks.

Why exactly are you on this crusade to find the one "best" v1 superblock location? Giving people the freedom to place the superblock where they choose isn't a bad thing. Would adding something like "If in doubt, 1.1 is the safest choice." to the mdadm man page give you the KISS warm-fuzzies you're pining for?
The fact that, after you read the manpage, you didn't even know that the only difference between the v1.x variants is the location where the superblock is placed indicates that you're not in a position to be so tremendously evangelical about effecting code changes that limit existing options.

Mike
Re: [PATCH 003 of 3] md: Update md bitmap during resync.
On Dec 7, 2007 12:42 AM, NeilBrown <[EMAIL PROTECTED]> wrote: > > Currently an md array with a write-intent bitmap does not update > that bitmap to reflect successful partial resync. Rather the entire > bitmap is updated when the resync completes. > > This is because there is no guarantee that resync requests will > complete in order, and tracking each request individually is > unnecessarily burdensome. > > However there is value in regularly updating the bitmap, so add code > to periodically pause while all pending sync requests complete, then > update the bitmap. Doing this only every few seconds (the same as the > bitmap update time) does not noticeably affect resync performance. > > Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

Hi Neil,

You forgot to export bitmap_cond_end_sync. Please see the attached patch.

regards,
Mike

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index f31ea4f..b596538 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -1566,3 +1566,4 @@ EXPORT_SYMBOL(bitmap_start_sync);
 EXPORT_SYMBOL(bitmap_end_sync);
 EXPORT_SYMBOL(bitmap_unplug);
 EXPORT_SYMBOL(bitmap_close_sync);
+EXPORT_SYMBOL(bitmap_cond_end_sync);
2.6.22.16 MD raid1 doesn't mark removed disk faulty, MD thread goes UN
Under 2.6.22.16, I physically pulled a SATA disk (/dev/sdac, connected to an aacraid controller) that was acting as the local raid1 member of /dev/md30.

Linux MD didn't see a /dev/sdac1 error until I tried forcing the issue by doing a read (with dd) from /dev/md30:

Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Sense Key : Hardware Error [current]
Jan 21 17:08:07 lab17-233 kernel: Info fld=0x0
Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Add. Sense: Internal target failure
Jan 21 17:08:07 lab17-233 kernel: end_request: I/O error, dev sdac, sector 71
Jan 21 17:08:07 lab17-233 kernel: printk: 3 messages suppressed.
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 8
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 16
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 24
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 32
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 40
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 48
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 56
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 64
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 72
Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 80
Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Sense Key : Hardware Error [current]
Jan 21 17:08:07 lab17-233 kernel: Info fld=0x0
Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Add. Sense: Internal target failure
Jan 21 17:08:07 lab17-233 kernel: end_request: I/O error, dev sdac, sector 343
Jan 21 17:08:08 lab17-233 kernel: sd 2:0:27:0: [sdac] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Jan 21 17:08:08 lab17-233 kernel: sd 2:0:27:0: [sdac] Sense Key : Hardware Error [current]
Jan 21 17:08:08 lab17-233 kernel: Info fld=0x0
...
Jan 21 17:08:12 lab17-233 kernel: sd 2:0:27:0: [sdac] Add. Sense: Internal target failure
Jan 21 17:08:12 lab17-233 kernel: end_request: I/O error, dev sdac, sector 3399
Jan 21 17:08:12 lab17-233 kernel: printk: 765 messages suppressed.
Jan 21 17:08:12 lab17-233 kernel: raid1: sdac1: rescheduling sector 3336

However, the MD layer still hasn't marked the sdac1 member faulty:

md30 : active raid1 nbd2[1](W) sdac1[0]
      4016204 blocks super 1.0 [2/2] [UU]
      bitmap: 1/8 pages [4KB], 256KB chunk

The dd I used to read from /dev/md30 is blocked on IO:

Jan 21 17:13:55 lab17-233 kernel: dd  D 0afa9cf5c346 0 12337 7702 (NOTLB)
Jan 21 17:13:55 lab17-233 kernel: 81010c449868 0082 80268f14
Jan 21 17:13:55 lab17-233 kernel: 81015da6f320 81015de532c0 0008 81012d9d7780
Jan 21 17:13:55 lab17-233 kernel: 81015fae2880 4926 81012d9d7970 0001802879a0
Jan 21 17:13:55 lab17-233 kernel: Call Trace:
Jan 21 17:13:55 lab17-233 kernel: [] mempool_alloc+0x24/0xda
Jan 21 17:13:55 lab17-233 kernel: [] :raid1:wait_barrier+0x84/0xc2
Jan 21 17:13:55 lab17-233 kernel: [] default_wake_function+0x0/0xe
Jan 21 17:13:55 lab17-233 kernel: [] :raid1:make_request+0x83/0x5c0
Jan 21 17:13:55 lab17-233 kernel: [] __make_request+0x57f/0x668
Jan 21 17:13:55 lab17-233 kernel: [] generic_make_request+0x26e/0x2a9
Jan 21 17:13:55 lab17-233 kernel: [] mempool_alloc+0x24/0xda
Jan 21 17:13:55 lab17-233 kernel: [] __next_cpu+0x19/0x28
Jan 21 17:13:55 lab17-233 kernel: [] submit_bio+0xb6/0xbd
Jan 21 17:13:55 lab17-233 kernel: [] submit_bh+0xdf/0xff
Jan 21 17:13:55 lab17-233 kernel: [] block_read_full_page+0x271/0x28e
Jan 21 17:13:55 lab17-233 kernel: [] blkdev_get_block+0x0/0x46
Jan 21 17:13:55 lab17-233 kernel: [] radix_tree_insert+0xcb/0x18c
Jan 21 17:13:55 lab17-233 kernel: [] __do_page_cache_readahead+0x16d/0x1df
Jan 21 17:13:55 lab17-233 kernel: [] getnstimeofday+0x32/0x8d
Jan 21 17:13:55 lab17-233 kernel: [] ktime_get_ts+0x1a/0x4e
Jan 21 17:13:55 lab17-233 kernel: [] delayacct_end+0x7d/0x88
Jan 21 17:13:55 lab17-233 kernel: [] blockable_page_cache_readahead+0x53/0xb2
Jan 21 17:13:55 lab17-233 kernel: [] make_ahead_window+0x82/0x9e
Jan 21 17:13:55 lab17-233 kernel: [] page_cache_readahead+0x18a/0x1c1
Jan 21 17:13:55 lab17-233 kernel: [] do_generic_mapping_read+0x135/0x3fc
Jan 21 17:13:55 lab17-233 kernel: [] file_read_actor+0x0/0x170
Jan 21 17:13:55 lab17-233 kernel: [] generic_file_aio_read+0x119/0x155
Jan 21 17:13:55 lab17-233 kernel: [] do_sync_read+0xc9/0x10c
Jan 21 17:13:55 lab17-233 kernel: [] autoremove_wake_function+0x0/0x2e
Jan 21 17:13:55 lab17-233 kernel: [] do_mmap_pgoff+0x639/0x7a5
Jan 21 17:13:55 lab17-233 kernel: [] vfs_read+0xcb/0x153
Jan 21 17:13:55 lab17-233 kernel: [] sys_read+0x45/0x6e
Jan 21 17:13
Re: 2.6.22.16 MD raid1 doesn't mark removed disk faulty, MD thread goes UN
cc'ing Tanaka-san given his recent raid1 BUG report: http://lkml.org/lkml/2008/1/14/515 On Jan 21, 2008 6:04 PM, Mike Snitzer <[EMAIL PROTECTED]> wrote: > Under 2.6.22.16, I physically pulled a SATA disk (/dev/sdac, connected to > an aacraid controller) that was acting as the local raid1 member of > /dev/md30. > > Linux MD didn't see an /dev/sdac1 error until I tried forcing the issue by > doing a read (with dd) from /dev/md30: > > Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Result: > hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK > Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Sense Key : > Hardware Error [current] > Jan 21 17:08:07 lab17-233 kernel: Info fld=0x0 > Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Add. Sense: > Internal target failure > Jan 21 17:08:07 lab17-233 kernel: end_request: I/O error, dev sdac, sector 71 > Jan 21 17:08:07 lab17-233 kernel: printk: 3 messages suppressed. > Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 8 > Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 16 > Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 24 > Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 32 > Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 40 > Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 48 > Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 56 > Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 64 > Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 72 > Jan 21 17:08:07 lab17-233 kernel: raid1: sdac1: rescheduling sector 80 > Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Result: > hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK > Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Sense Key : > Hardware Error [current] > Jan 21 17:08:07 lab17-233 kernel: Info fld=0x0 > Jan 21 17:08:07 lab17-233 kernel: sd 2:0:27:0: [sdac] Add. 
Sense: > Internal target failure > Jan 21 17:08:07 lab17-233 kernel: end_request: I/O error, dev sdac, sector 343 > Jan 21 17:08:08 lab17-233 kernel: sd 2:0:27:0: [sdac] Result: > hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK > Jan 21 17:08:08 lab17-233 kernel: sd 2:0:27:0: [sdac] Sense Key : > Hardware Error [current] > Jan 21 17:08:08 lab17-233 kernel: Info fld=0x0 > ... > Jan 21 17:08:12 lab17-233 kernel: sd 2:0:27:0: [sdac] Add. Sense: > Internal target failure > Jan 21 17:08:12 lab17-233 kernel: end_request: I/O error, dev sdac, sector > 3399 > Jan 21 17:08:12 lab17-233 kernel: printk: 765 messages suppressed. > Jan 21 17:08:12 lab17-233 kernel: raid1: sdac1: rescheduling sector 3336 > > However, the MD layer still hasn't marked the sdac1 member faulty: > > md30 : active raid1 nbd2[1](W) sdac1[0] > 4016204 blocks super 1.0 [2/2] [UU] > bitmap: 1/8 pages [4KB], 256KB chunk > > The dd I used to read from /dev/md30 is blocked on IO: > > Jan 21 17:13:55 lab17-233 kernel: ddD 0afa9cf5c346 > 0 12337 7702 (NOTLB) > Jan 21 17:13:55 lab17-233 kernel: 81010c449868 0082 > 80268f14 > Jan 21 17:13:55 lab17-233 kernel: 81015da6f320 81015de532c0 > 0008 81012d9d7780 > Jan 21 17:13:55 lab17-233 kernel: 81015fae2880 4926 > 81012d9d7970 0001802879a0 > Jan 21 17:13:55 lab17-233 kernel: Call Trace: > Jan 21 17:13:55 lab17-233 kernel: [] > mempool_alloc+0x24/0xda > Jan 21 17:13:55 lab17-233 kernel: [] > :raid1:wait_barrier+0x84/0xc2 > Jan 21 17:13:55 lab17-233 kernel: [] > default_wake_function+0x0/0xe > Jan 21 17:13:55 lab17-233 kernel: [] > :raid1:make_request+0x83/0x5c0 > Jan 21 17:13:55 lab17-233 kernel: [] > __make_request+0x57f/0x668 > Jan 21 17:13:55 lab17-233 kernel: [] > generic_make_request+0x26e/0x2a9 > Jan 21 17:13:55 lab17-233 kernel: [] > mempool_alloc+0x24/0xda > Jan 21 17:13:55 lab17-233 kernel: [] __next_cpu+0x19/0x28 > Jan 21 17:13:55 lab17-233 kernel: [] submit_bio+0xb6/0xbd > Jan 21 17:13:55 lab17-233 kernel: [] submit_bh+0xdf/0xff > Jan 21 17:13:55 
lab17-233 kernel: [] > block_read_full_page+0x271/0x28e > Jan 21 17:13:55 lab17-233 kernel: [] > blkdev_get_block+0x0/0x46 > Jan 21 17:13:55 lab17-233 kernel: [] > radix_tree_insert+0xcb/0x18c > Jan 21 17:13:55 lab17-233 kernel: [] > __do_page_cache_readahead+0x16d/0x1df > Jan 21 17:13:55 lab17-233 kernel: [] > getnstimeofday+0x32/0x8d > Jan 21 17:13:55 lab17-233 kernel: [] ktime_get_ts+0x1a/0x4e > Jan 21 17:13:55 lab17-233 kernel: [] > delayacct_end+0x7d/0x88 > Jan 21 17:13:55 lab17-233 kernel: [] > blockable_page_cache_readahead+0x53/0xb2 > Jan 21 17:1
AACRAID driver broken in 2.6.22.x (and beyond?) [WAS: Re: 2.6.22.16 MD raid1 doesn't mark removed disk faulty, MD thread goes UN]
On Jan 22, 2008 12:29 AM, Mike Snitzer <[EMAIL PROTECTED]> wrote:
> cc'ing Tanaka-san given his recent raid1 BUG report:
> http://lkml.org/lkml/2008/1/14/515
>
> On Jan 21, 2008 6:04 PM, Mike Snitzer <[EMAIL PROTECTED]> wrote:
> > Under 2.6.22.16, I physically pulled a SATA disk (/dev/sdac, connected to
> > an aacraid controller) that was acting as the local raid1 member of
> > /dev/md30.
> >
> > Linux MD didn't see a /dev/sdac1 error until I tried forcing the issue by
> > doing a read (with dd) from /dev/md30:
>
> The raid1d thread is locked at line 720 in raid1.c (raid1d+2437); aka freeze_array:
>
> (gdb) l *0x2539
> 0x2539 is in raid1d (drivers/md/raid1.c:720).
> 715          * wait until barrier+nr_pending match nr_queued+2
> 716          */
> 717         spin_lock_irq(&conf->resync_lock);
> 718         conf->barrier++;
> 719         conf->nr_waiting++;
> 720         wait_event_lock_irq(conf->wait_barrier,
> 721                             conf->barrier+conf->nr_pending == conf->nr_queued+2,
> 722                             conf->resync_lock,
> 723                             raid1_unplug(conf->mddev->queue));
> 724         spin_unlock_irq(&conf->resync_lock);
>
> Given Tanaka-san's report against 2.6.23 and me hitting what seems to
> be the same deadlock in 2.6.22.16, it stands to reason this affects
> raid1 in 2.6.24-rcX too.

Turns out that the aacraid driver in 2.6.22.x is HORRIBLY BROKEN (when you pull a drive); it responds to MD's write requests with uptodate=1 (in raid1_end_write_request) for the drive that was pulled! I've not looked to see if aacraid has been fixed in newer kernels... are others aware of any crucial aacraid fixes in 2.6.23.x or 2.6.24?

After the drive was physically pulled, and small periodic writes continued to the associated MD device, the raid1 MD driver did _NOT_ detect the pulled drive's writes as having failed (verified this with systemtap). MD happily thought the write completed to both members (so MD had no reason to mark the pulled drive "faulty"; or mark the raid "degraded").

Installing an Adaptec-provided 1.1-5[2451] driver enabled raid1 to work as expected.
That said, I now have a recipe for hitting the raid1 deadlock that Tanaka first reported over a week ago. I'm still surprised that all of this chatter about that BUG hasn't drawn interest/scrutiny from others!?

regards,
Mike
Re: AACRAID driver broken in 2.6.22.x (and beyond?) [WAS: Re: 2.6.22.16 MD raid1 doesn't mark removed disk faulty, MD thread goes UN]
ondition generally indicates a serious hardware problem > or target incompatibility; and is generally rare as they are a result of > corner case conditions within the Adapter Firmware. The diagnostic dump > reported by the Adaptec utilities should be able to point to the fault you > are experiencing if these appear to be the root causes. snitzer: It would seem that 1.1.5-2451 has the firmware reset support given the log I provided above, no? Anyway, with 2.6.22.16 when a drive is pulled using the aacraid 1.1-5[2437]-mh4 there is absolutely no errors from the aacraid driver; in fact the scsi layer doesn't see anything until I force the issue with explicit reads/writes to the device that was pulled. It could be that on a drive pull the 1.1.5-2451 driver results in a BlinkLED, resets the firmware, and continues. Whereas with the 1.1-5[2437]-mh4 I get no BlinkLED and as such Linux (both scsi and raid1) is completely unaware of any disconnect of the physical device. thanks, Mike > > -Original Message- > > From: Mike Snitzer [mailto:[EMAIL PROTECTED] > > Sent: Tuesday, January 22, 2008 7:10 PM > > To: linux-raid@vger.kernel.org; NeilBrown > > Cc: [EMAIL PROTECTED]; K. Tanaka; AACRAID; > > [EMAIL PROTECTED] > > Subject: AACRAID driver broken in 2.6.22.x (and beyond?) > > [WAS: Re: 2.6.22.16 MD raid1 doesn't mark removed disk > > faulty, MD thread goes UN] > > > > > On Jan 22, 2008 12:29 AM, Mike Snitzer <[EMAIL PROTECTED]> wrote: > > > cc'ing Tanaka-san given his recent raid1 BUG report: > > > http://lkml.org/lkml/2008/1/14/515 > > > > > > > > > On Jan 21, 2008 6:04 PM, Mike Snitzer <[EMAIL PROTECTED]> wrote: > > > > Under 2.6.22.16, I physically pulled a SATA disk > > (/dev/sdac, connected to > > > > an aacraid controller) that was acting as the local raid1 > > member of > > > > /dev/md30. 
> > > > > > > > Linux MD didn't see an /dev/sdac1 error until I tried > > forcing the issue by > > > > doing a read (with dd) from /dev/md30: > > > > > The raid1d thread is locked at line 720 in raid1.c > > (raid1d+2437); aka > > > freeze_array: > > > > > > (gdb) l *0x2539 > > > 0x2539 is in raid1d (drivers/md/raid1.c:720). > > > 715 * wait until barrier+nr_pending match nr_queued+2 > > > 716 */ > > > 717 spin_lock_irq(&conf->resync_lock); > > > 718 conf->barrier++; > > > 719 conf->nr_waiting++; > > > 720 wait_event_lock_irq(conf->wait_barrier, > > > 721 > > conf->barrier+conf->nr_pending == > > > conf->nr_queued+2, > > > 722 conf->resync_lock, > > > 723 > > raid1_unplug(conf->mddev->queue)); > > > 724 spin_unlock_irq(&conf->resync_lock); > > > > > > Given Tanaka-san's report against 2.6.23 and me hitting > > what seems to > > > be the same deadlock in 2.6.22.16; it stands to reason this affects > > > raid1 in 2.6.24-rcX too. > > > > Turns out that the aacraid driver in 2.6.22.x is HORRIBLY BROKEN (when > > you pull a drive); it responds to MD's write requests with uptodate=1 > > (in raid1_end_write_request) for the drive that was pulled! I've not > > looked to see if aacraid has been fixed in newer kernels... are others > > aware of any crucial aacraid fixes in 2.6.23.x or 2.6.24? > > > > After the drive was physically pulled, and small periodic writes > > continued to the associated MD device, the raid1 MD driver did _NOT_ > > detect the pulled drive's writes as having failed (verified this with > > systemtap). MD happily thought the write completed to both members > > (so MD had no reason to mark the pulled drive "faulty"; or mark the > > raid "degraded"). > > > > Installing an Adaptec-provided 1.1-5[2451] driver enabled raid1 to > > work as expected. > > > > That said, I now have a recipe for hitting the raid1 deadlock that > > Tanaka first reported over a week ago. 
I'm still surprised that all > > of this chatter about that BUG hasn't drawn interest/scrutiny from > > others!? > > regards, > > Mike