Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Tue, May 29, 2007 at 05:01:24PM -0700, [EMAIL PROTECTED] wrote: On Wed, 30 May 2007, David Chinner wrote: On Tue, May 29, 2007 at 04:03:43PM -0400, Phillip Susi wrote: David Chinner wrote: The use of barriers in XFS assumes the commit write to be on stable storage before it returns. One of the ordering guarantees that we need is that the transaction (commit write) is on disk before the metadata block containing the change in the transaction is written to disk and the current barrier behaviour gives us that. Barrier != synchronous write, Of course. FYI, XFS only issues barriers on *async* writes. But barrier semantics - as far as they've been described by everyone but you indicate that the barrier write is guaranteed to be on stable storage when it returns. this doesn't match what I have seen wtih barriers it's perfectly legal to have the following sequence of events 1. app writes block 10 to OS 2. app writes block 4 to OS 3. app writes barrier to OS 4. app writes block 5 to OS 5. app writes block 20 to OS hm - applications can't issue barriers to the filesystem. However, if you consider the barrier to be an fsync() for example, then it's still the filesystem that is issuing the barrier and there's a block that needs to be written that is associated with that barrier (either an inode or a transaction commit) that needs to be on stable storage before the filesystem returns to userspace. 6. OS writes block 4 to disk drive 7. OS writes block 10 to disk drive 8. OS writes barrier to disk drive 9. OS writes block 5 to disk drive 10. OS writes block 20 to disk drive Replace OS with filesystem, and combine 7+8 together - we don't have zero-length barriers and hence they are *always* associated with a write to a certain block on disk. i.e.: 1. FS writes block 4 to disk drive 2. FS writes block 10 to disk drive 3. FS writes *barrier* block X to disk drive 4. FS writes block 5 to disk drive 5. FS writes block 20 to disk drive The order that these are expected by the filesystem to hit stable storage are: 1. block 4 and 10 on stable storage in any order 2. barrier block X on stable storage 3. block 5 and 20 on stable storage in any order The point I'm trying to make is that in XFS, block 5 and 20 cannot be allowed to hit the disk before the barrier block because they have strict order dependency on block X being stable before them, just like block X has strict order dependency that block 4 and 10 must be stable before we start the barrier block write. 11. disk drive writes block 10 to platter 12. disk drive writes block 4 to platter 13. disk drive writes block 20 to platter 14. disk drive writes block 5 to platter if the disk drive doesn't support barriers then step #8 becomes 'issue flush' and steps 11 and 12 take place before step #9, 13, 14 No, you need a flush on either side of the block X write to maintain the same semantics as barrier writes currently have. We have filesystems that require barriers to prevent reordering of writes in both directions and to ensure that the block associated with the barrier is on stable storage when I/o completion is signalled. The existing barrier implementation (where it works) provide these requirements. We need barriers to retain these semantics, otherwise we'll still have to do special stuff in the filesystems to get the semantics that we need. Cheers, Dave. 
-- Dave Chinner Principal Engineer SGI Australian Software Group
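To make the above concrete, here is a minimal, hedged sketch (not actual XFS code; commit_end_io(), write_commit_block() and the calling convention are invented for illustration) of how a filesystem of this era submits the "barrier block X" write with the bio API being discussed:

    #include <linux/bio.h>
    #include <linux/blkdev.h>
    #include <linux/completion.h>

    static int commit_end_io(struct bio *bio, unsigned int bytes_done, int err)
    {
            if (bio->bi_size)
                    return 1;                       /* not fully done yet */
            complete(bio->bi_private);              /* wake the waiter below */
            return 0;
    }

    /* Submit the commit ("barrier") block: blocks 4 and 10 must be stable
     * before it lands, and blocks 5 and 20 must not be reordered ahead of it. */
    static int write_commit_block(struct block_device *bdev, struct page *page,
                                  sector_t commit_sector)
    {
            DECLARE_COMPLETION_ONSTACK(done);
            struct bio *bio = bio_alloc(GFP_NOIO, 1);
            int err = 0;

            bio->bi_bdev = bdev;
            bio->bi_sector = commit_sector;
            bio_add_page(bio, page, PAGE_SIZE, 0);
            bio->bi_end_io = commit_end_io;
            bio->bi_private = &done;

            /* WRITE_BARRIER == WRITE with BIO_RW_BARRIER set */
            submit_bio(WRITE_BARRIER, bio);
            wait_for_completion(&done);

            if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
                    err = -EIO;
            bio_put(bio);
            return err;
    }

Whether completion of that bio also implies the block is on the platter (as Dave argues it must be, for fsync) or only that ordering was preserved is exactly the disagreement in the rest of the thread.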
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
The order that these are expected by the filesystem to hit stable storage are: 1. block 4 and 10 on stable storage in any order 2. barrier block X on stable storage 3. block 5 and 20 on stable storage in any order The point I'm trying to make is that in XFS, block 5 and 20 cannot be allowed to hit the disk before the barrier block because they have strict order dependency on block X being stable before them, just like block X has strict order dependency that block 4 and 10 must be stable before we start the barrier block write. That would be exactly how I understand Documentation/block/barrier.txt: In other words, I/O barrier requests have the following two properties. 1. Request ordering ... 2. Forced flushing to physical medium So, I/O barriers need to guarantee that requests actually get written to non-volatile medium in order. Stefan
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
in-flight I/O to go to zero? Something like that is needed for some dm targets to support barriers. (We needn't always wait for *all* in-flight I/O.) When faced with -EOPNOTSUPP, do all callers fall back to a sync in the places a barrier would have been used, or are there any more sophisticated strategies attempting to optimise code without barriers? If I didn't misunderstand, the idea is that no caller will face an -EOPNOTSUPP in future. IOW every layer or driver somehow makes sure the right thing happens. An efficient I/O barrier implementation would not normally involve flushing AFAIK: dm surely wouldn't cause a higher layer to assume stronger semantics than are provided. Seems there are at least two assumptions about what the semantics exactly _are_. Based on Documentation/block/barrier.txt I understand a barrier implies ordering and flushing. But regardless of that, assume the (admittedly constructed) following case: You've got a linear target that consists of two disks. One drive (a) supports barriers and the other one (b) doesn't. Device-mapper just maps the requests to the appropriate disk. Now the following sequence happens: 1. block x gets mapped to drive b 2. block y (with barrier) gets mapped to drive a Since drive a supports barrier requests we don't get -EOPNOTSUPP, but the request with block y might get written before block x since the disks are independent. I guess the chances of this are quite low since at some point a barrier request will also hit drive b, but for the time being it might be better to indicate -EOPNOTSUPP right from device-mapper. Stefan
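A hedged sketch of the suggestion in that last sentence - failing barrier bios explicitly at the top of the device-mapper request path instead of mapping them - might look roughly like this (the placement and the name dm_reject_barrier() are illustrative, not actual dm code):

    /* Returns 1 if the bio was consumed (completed with -EOPNOTSUPP). */
    static int dm_reject_barrier(struct bio *bio)
    {
            if (unlikely(bio_barrier(bio))) {
                    /* 2.6.21-era completion signature: (bio, bytes_done, error) */
                    bio_endio(bio, bio->bi_size, -EOPNOTSUPP);
                    return 1;
            }
            return 0;       /* not a barrier, map it normally */
    }

That at least keeps the contract honest until dm can forward zero-length barriers to every underlying device, as Alasdair suggests later in the thread.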
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Mon, May 28 2007, Neil Brown wrote: I think the implementation priorities here are: 1/ implement a zero-length BIO_RW_BARRIER option. 2/ Use it (or otherwise) to make all dm and md modules handle barriers (and loop?). 3/ Devise and implement appropriate fall-backs with-in the block layer so that -EOPNOTSUP is never returned. 4/ Remove unneeded cruft from filesystems (and elsewhere). This is the start of 1/ above. It's very lightly tested, it's verified to DTRT here at least and not crash :-) It gets rid of the -issue_flush_fn() queue callback, all the driver knowledge resides in -prepare_flush_fn() anyways. blkdev_issue_flush() then just reuses the empty-bio approach to queue an empty barrier, this should work equally well for stacked and non-stacked devices. While this patch isn't complete yet, it's clearly the right direction to go. I didn't convert drivers/md/* to support this approach, I'm leaving that to you :-) block/elevator.c| 12 ++ block/ll_rw_blk.c | 173 ++-- drivers/ide/ide-disk.c | 29 - drivers/message/i2o/i2o_block.c | 24 drivers/scsi/scsi_lib.c | 17 --- drivers/scsi/sd.c | 15 -- fs/bio.c|8 - include/linux/bio.h | 18 ++- include/linux/blkdev.h |3 include/scsi/scsi_driver.h |1 include/scsi/sd.h |1 mm/bounce.c |6 + 12 files changed, 141 insertions(+), 166 deletions(-) diff --git a/block/elevator.c b/block/elevator.c index ce866eb..af5e58d 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -715,6 +715,18 @@ struct request *elv_next_request(request_queue_t *q) int ret; while ((rq = __elv_next_request(q)) != NULL) { + /* +* Kill the empty barrier place holder, the driver must +* not ever see it. +*/ + if (blk_fs_request(rq) blk_barrier_rq(rq) + !rq-hard_nr_sectors) { + blkdev_dequeue_request(rq); + rq-cmd_flags |= REQ_QUIET; + end_that_request_chunk(rq, 1, 0); + end_that_request_last(rq, 1); + continue; + } if (!(rq-cmd_flags REQ_STARTED)) { /* * This is the first time the device driver diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c index 6b5173a..8680083 100644 --- a/block/ll_rw_blk.c +++ b/block/ll_rw_blk.c @@ -300,23 +300,6 @@ int blk_queue_ordered(request_queue_t *q, unsigned ordered, EXPORT_SYMBOL(blk_queue_ordered); -/** - * blk_queue_issue_flush_fn - set function for issuing a flush - * @q: the request queue - * @iff: the function to be called issuing the flush - * - * Description: - * If a driver supports issuing a flush command, the support is notified - * to the block layer by defining it through this call. - * - **/ -void blk_queue_issue_flush_fn(request_queue_t *q, issue_flush_fn *iff) -{ - q-issue_flush_fn = iff; -} - -EXPORT_SYMBOL(blk_queue_issue_flush_fn); - /* * Cache flushing for ordered writes handling */ @@ -433,7 +416,8 @@ static inline struct request *start_ordered(request_queue_t *q, rq_init(q, rq); if (bio_data_dir(q-orig_bar_rq-bio) == WRITE) rq-cmd_flags |= REQ_RW; - rq-cmd_flags |= q-ordered QUEUE_ORDERED_FUA ? REQ_FUA : 0; + if (q-ordered QUEUE_ORDERED_FUA) + rq-cmd_flags |= REQ_FUA; rq-elevator_private = NULL; rq-elevator_private2 = NULL; init_request_from_bio(rq, q-orig_bar_rq-bio); @@ -445,7 +429,7 @@ static inline struct request *start_ordered(request_queue_t *q, * no fs request uses ELEVATOR_INSERT_FRONT and thus no fs * request gets inbetween ordered sequence. 
*/ - if (q-ordered QUEUE_ORDERED_POSTFLUSH) + if ((q-ordered QUEUE_ORDERED_POSTFLUSH) rq-hard_nr_sectors) queue_flush(q, QUEUE_ORDERED_POSTFLUSH); else q-ordseq |= QUEUE_ORDSEQ_POSTFLUSH; @@ -469,7 +453,7 @@ static inline struct request *start_ordered(request_queue_t *q, int blk_do_ordered(request_queue_t *q, struct request **rqp) { struct request *rq = *rqp; - int is_barrier = blk_fs_request(rq) blk_barrier_rq(rq); + const int is_barrier = blk_fs_request(rq) blk_barrier_rq(rq); if (!q-ordseq) { if (!is_barrier) @@ -2635,6 +2619,16 @@ int blk_execute_rq(request_queue_t *q, struct gendisk *bd_disk, EXPORT_SYMBOL(blk_execute_rq); +static int bio_end_empty_barrier(struct bio *bio, unsigned int bytes_done, +int err) +{ + if (err) + clear_bit(BIO_UPTODATE, bio-bi_flags); + + complete(bio-bi_private); + return 0; +} + /** * blkdev_issue_flush -
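The quoted patch is cut off above. For context only, here is a hedged reconstruction of what an empty-barrier-based blkdev_issue_flush() could look like (a sketch in the spirit of the patch, not its literal remainder; error_sector reporting is omitted):

    int blkdev_issue_flush(struct block_device *bdev, sector_t *error_sector)
    {
            DECLARE_COMPLETION_ONSTACK(wait);
            struct request_queue *q = bdev_get_queue(bdev);
            struct bio *bio;
            int ret = 0;

            if (!q)
                    return -ENXIO;

            /* zero-length barrier: ordering plus cache flush, no data payload */
            bio = bio_alloc(GFP_KERNEL, 0);
            bio->bi_end_io = bio_end_empty_barrier;
            bio->bi_private = &wait;
            bio->bi_bdev = bdev;
            submit_bio(1 << BIO_RW_BARRIER, bio);

            wait_for_completion(&wait);
            if (!bio_flagged(bio, BIO_UPTODATE))
                    ret = -EIO;
            bio_put(bio);
            return ret;
    }

The appeal of this approach is that the empty barrier bio passes through stacked drivers (dm, md, loop) exactly like any other barrier, so there is no separate issue_flush path to keep in sync.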
Re: ANNOUNCE: mdadm 2.6.2 - A tool for managing Soft RAID under Linux
Neil, On Tuesday, 29. May 2007, you wrote: cc1: warnings being treated as errors sysfs.c: In function 'sysfs_read': sysfs.c:97: warning: value computed is not used sysfs.c:119: warning: value computed is not used sysfs.c:127: warning: value computed is not used sysfs.c:133: warning: value computed is not used sysfs.c:139: warning: value computed is not used sysfs.c:178: warning: value computed is not used Those are bogus warnings. Each is strcpy(base, x); and base most certainly is used., though I can see how gcc might not notice if it is being too clever. Maybe you need to get gcc-4.1.2? or make CWFLAGS=-Wall Holger Kiehl was right, it complained about the unused return value. Please see the attached patch. Thomas diff -u -r -p mdadm-2.6.2/Detail.c mdadm.warning/Detail.c --- mdadm-2.6.2/Detail.c Mon May 21 06:25:50 2007 +++ mdadm.warning/Detail.c Wed May 30 10:52:32 2007 @@ -59,7 +59,7 @@ int Detail(char *dev, int brief, int exp void *super = NULL; int rv = test ? 4 : 1; int avail_disks = 0; - char *avail; + char *avail = NULL; if (fd 0) { fprintf(stderr, Name : cannot open %s: %s\n, diff -u -r -p mdadm-2.6.2/sysfs.c mdadm.warning/sysfs.c --- mdadm-2.6.2/sysfs.c Thu Dec 21 06:44:22 2006 +++ mdadm.warning/sysfs.c Wed May 30 10:55:43 2007 @@ -94,7 +94,7 @@ struct sysarray *sysfs_read(int fd, int sra-devs = NULL; if (options GET_VERSION) { - strcpy(base, metadata_version); + (void)strcpy(base, metadata_version); if (load_sys(fname, buf)) goto abort; if (strncmp(buf, none, 4) == 0) @@ -104,19 +104,19 @@ struct sysarray *sysfs_read(int fd, int sra-major_version, sra-minor_version); } if (options GET_LEVEL) { - strcpy(base, level); + (void)strcpy(base, level); if (load_sys(fname, buf)) goto abort; sra-level = map_name(pers, buf); } if (options GET_LAYOUT) { - strcpy(base, layout); + (void)strcpy(base, layout); if (load_sys(fname, buf)) goto abort; sra-layout = strtoul(buf, NULL, 0); } if (options GET_COMPONENT) { - strcpy(base, component_size); + (void)strcpy(base, component_size); if (load_sys(fname, buf)) goto abort; sra-component_size = strtoull(buf, NULL, 0); @@ -124,19 +124,19 @@ struct sysarray *sysfs_read(int fd, int sra-component_size *= 2; } if (options GET_CHUNK) { - strcpy(base, chunk_size); + (void)strcpy(base, chunk_size); if (load_sys(fname, buf)) goto abort; sra-chunk = strtoul(buf, NULL, 0); } if (options GET_CACHE) { - strcpy(base, stripe_cache_size); + (void)strcpy(base, stripe_cache_size); if (load_sys(fname, buf)) goto abort; sra-cache_size = strtoul(buf, NULL, 0); } if (options GET_MISMATCH) { - strcpy(base, mismatch_cnt); + (void)strcpy(base, mismatch_cnt); if (load_sys(fname, buf)) goto abort; sra-mismatch_cnt = strtoul(buf, NULL, 0); @@ -175,7 +175,7 @@ struct sysarray *sysfs_read(int fd, int dev-role = strtoul(buf, ep, 10); if (*ep) dev-role = -1; - strcpy(dbase, block/dev); + (void)strcpy(dbase, block/dev); if (load_sys(fname, buf)) goto abort; sscanf(buf, %d:%d, dev-major, dev-minor);
Creating RAID1 with bitmap fails
Hi, the following command strangely gives -EIO ...

12:27 sun:~ # mdadm -C /dev/md4 -l 1 -n 2 -e 1.0 -b internal /dev/ram0 missing
md: md4: raid array is not clean -- starting background reconstruction
md4: failed to create bitmap (-5)
md: pers->run() failed ...
mdadm: RUN_ARRAY failed: Input/output error
mdadm: stopped /dev/md4

Leaving out -b internal creates the array. /dev/ram0 or /dev/sda5 - EIO happens on both. (But the disk is fine, like ram0) Where could I start looking?

Linux sun 2.6.21-1.3149.al3.8smp #3 SMP Wed May 30 09:43:00 CEST 2007 sparc64 sparc64 sparc64 GNU/Linux
mdadm 2.5.4

Thanks, Jan
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Wed, May 30, 2007 at 11:12:37AM +0200, Stefan Bader wrote: it might be better to indicate -EOPNOTSUPP right from device-mapper. Indeed we should. For support, on receipt of a barrier, dm core should send a zero-length barrier to all active underlying paths, and delay mapping any further I/O. Alasdair -- [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: mismatch_cnt = 128 for root (/) md raid1 device
Asking again.. On Sat, 26 May 2007, Justin Piszcz wrote: Kernel 2.6.21.3

Fri May 25 20:00:02 EDT 2007: Executing RAID health check for /dev/md0...
Fri May 25 20:00:03 EDT 2007: Executing RAID health check for /dev/md1...
Fri May 25 20:00:04 EDT 2007: Executing RAID health check for /dev/md2...
Fri May 25 20:00:05 EDT 2007: Executing RAID health check for /dev/md3...
Sat May 26 04:40:09 EDT 2007: cat /sys/block/md0/md/mismatch_cnt
Sat May 26 04:40:09 EDT 2007: 0
Sat May 26 04:40:09 EDT 2007: cat /sys/block/md1/md/mismatch_cnt
Sat May 26 04:40:09 EDT 2007: 0
Sat May 26 04:40:09 EDT 2007: cat /sys/block/md2/md/mismatch_cnt
Sat May 26 04:40:09 EDT 2007: 128
Sat May 26 04:40:09 EDT 2007: cat /sys/block/md3/md/mismatch_cnt
Sat May 26 04:40:09 EDT 2007: 0
Sat May 26 04:40:09 EDT 2007: The meta-device /dev/md0 has no mismatched sectors.
Sat May 26 04:40:10 EDT 2007: The meta-device /dev/md1 has no mismatched sectors.
Sat May 26 04:40:11 EDT 2007: The meta-device /dev/md2 has 128 mismatched sectors.
Sat May 26 04:40:11 EDT 2007: Executing repair on /dev/md2
Sat May 26 04:40:12 EDT 2007: The meta-device /dev/md3 has no mismatched sectors.
Sat May 26 05:00:14 EDT 2007: cat /sys/block/md0/md/mismatch_cnt
Sat May 26 05:00:14 EDT 2007: 0
Sat May 26 05:00:14 EDT 2007: cat /sys/block/md1/md/mismatch_cnt
Sat May 26 05:00:14 EDT 2007: 0
Sat May 26 05:00:14 EDT 2007: cat /sys/block/md2/md/mismatch_cnt
Sat May 26 05:00:14 EDT 2007: 0
Sat May 26 05:00:14 EDT 2007: cat /sys/block/md3/md/mismatch_cnt
Sat May 26 05:00:14 EDT 2007: 0

I often see 128 or so for the root volume (/) for my RAID1. Any idea? I know when you reboot/shutdown a system with md raid1 it does not mount uncleanly (I believe?). Just curious why it happens on the root volume and if it's something I should be worried about? md0=swap md1=boot md2=root md3=raid5_volume
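For anyone wanting to reproduce or clear this, the usual sysfs sequence is something like the following (md2 assumed, as in the log above; wait for each pass to finish in /proc/mdstat before starting the next):

    echo check  > /sys/block/md2/md/sync_action     # read and compare the mirrors
    cat /sys/block/md2/md/mismatch_cnt              # sectors that differed

    echo repair > /sys/block/md2/md/sync_action     # rewrite differing blocks from one mirror
    echo check  > /sys/block/md2/md/sync_action     # re-check; the count should now be 0
    cat /sys/block/md2/md/mismatch_cnt

On raid1, a small non-zero count on a live filesystem or swap device is commonly attributed to pages that changed between the writes to the individual mirrors (e.g. swap or mmap'd pages), which is usually harmless; a count that keeps coming back after a repair is the case worth worrying about.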
Re: Creating RAID1 with bitmap fails
On May 30 2007 22:05, Neil Brown wrote: the following command strangely gives -EIO ... 12:27 sun:~ # mdadm -C /dev/md4 -l 1 -n 2 -e 1.0 -b internal /dev/ram0 missing md: md4: raid array is not clean -- starting background reconstruction md4: failed to create bitmap (-5) md: pers-run() failed ... mdadm: RUN_ARRAY failed: Input/output error mdadm: stopped /dev/md4 Leaving out -b internal creates the array. /dev/ram0 or /dev/sda5 - EIO happens on both. (But the disk is fine, like ram0) Where could I start looking? Linux sun 2.6.21-1.3149.al3.8smp #3 SMP Wed May 30 09:43:00 CEST 2007 sparc64 sparc64 sparc64 GNU/Linux mdadm 2.5.4 I'm fairly sure this is fixed in 2.6.2. It is certainly worth a try. The same command works on a x86_64 with mdadm 2.5.3... Jan -- - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
David Chinner wrote: Barrier != synchronous write, Of course. FYI, XFS only issues barriers on *async* writes. But barrier semantics - as far as they've been described by everyone but you indicate that the barrier write is guaranteed to be on stable storage when it returns. Hrm... I may have misunderstood the perspective you were talking from. Yes, when the bio is completed it must be on the media, but the filesystem should issue both requests, and then really not care when they complete. That is to say, the filesystem should not wait for block A to finish before issuing block B; it should issue both, and use barriers to make sure they hit the disk in the correct order. XFS relies on the block being stable before any other write goes to disk. That is the semantic that the barrier I/Os currently have. How that is implemented in the device is irrelevant to me, but if I issue a barrier I/O, I do not expect *any* I/O to be reordered around it. Right... it just needs to control the order of the requests, just not wait on one to finish before issuing the next. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Wed, 30 May 2007, David Chinner wrote: On Tue, May 29, 2007 at 05:01:24PM -0700, [EMAIL PROTECTED] wrote: On Wed, 30 May 2007, David Chinner wrote: On Tue, May 29, 2007 at 04:03:43PM -0400, Phillip Susi wrote: David Chinner wrote: The use of barriers in XFS assumes the commit write to be on stable storage before it returns. One of the ordering guarantees that we need is that the transaction (commit write) is on disk before the metadata block containing the change in the transaction is written to disk and the current barrier behaviour gives us that. Barrier != synchronous write, Of course. FYI, XFS only issues barriers on *async* writes. But barrier semantics - as far as they've been described by everyone but you indicate that the barrier write is guaranteed to be on stable storage when it returns. this doesn't match what I have seen wtih barriers it's perfectly legal to have the following sequence of events 1. app writes block 10 to OS 2. app writes block 4 to OS 3. app writes barrier to OS 4. app writes block 5 to OS 5. app writes block 20 to OS hm - applications can't issue barriers to the filesystem. However, if you consider the barrier to be an fsync() for example, then it's still the filesystem that is issuing the barrier and there's a block that needs to be written that is associated with that barrier (either an inode or a transaction commit) that needs to be on stable storage before the filesystem returns to userspace. 6. OS writes block 4 to disk drive 7. OS writes block 10 to disk drive 8. OS writes barrier to disk drive 9. OS writes block 5 to disk drive 10. OS writes block 20 to disk drive Replace OS with filesystem, and combine 7+8 together - we don't have zero-length barriers and hence they are *always* associated with a write to a certain block on disk. i.e.: 1. FS writes block 4 to disk drive 2. FS writes block 10 to disk drive 3. FS writes *barrier* block X to disk drive 4. FS writes block 5 to disk drive 5. FS writes block 20 to disk drive The order that these are expected by the filesystem to hit stable storage are: 1. block 4 and 10 on stable storage in any order 2. barrier block X on stable storage 3. block 5 and 20 on stable storage in any order The point I'm trying to make is that in XFS, block 5 and 20 cannot be allowed to hit the disk before the barrier block because they have strict order dependency on block X being stable before them, just like block X has strict order dependency that block 4 and 10 must be stable before we start the barrier block write. 11. disk drive writes block 10 to platter 12. disk drive writes block 4 to platter 13. disk drive writes block 20 to platter 14. disk drive writes block 5 to platter if the disk drive doesn't support barriers then step #8 becomes 'issue flush' and steps 11 and 12 take place before step #9, 13, 14 No, you need a flush on either side of the block X write to maintain the same semantics as barrier writes currently have. We have filesystems that require barriers to prevent reordering of writes in both directions and to ensure that the block associated with the barrier is on stable storage when I/o completion is signalled. The existing barrier implementation (where it works) provide these requirements. We need barriers to retain these semantics, otherwise we'll still have to do special stuff in the filesystems to get the semantics that we need. one of us is misunderstanding barriers here. you are understanding barriers to be the same as syncronous writes. 
(and therefore the data is on persistent media before the call returns) I am understanding barriers to only indicate ordering requirements. Things before the barrier can be reordered freely, things after the barrier can be reordered freely, but things cannot be reordered across the barrier. If I am understanding it correctly, the big win for barriers is that you do NOT have to stop and wait until the data is on persistent media before you can continue. In the past barriers have not been fully implemented in most cases, and as a result they have been simulated by forcing a full flush of the buffers to persistent media before any other writes are allowed. This has made them _in practice_ operate the same way as synchronous writes (matching your understanding), but the current thread is talking about fixing the implementation to the official semantics for all hardware that can actually support barriers (and fix it at the OS level). David Lang
Re: RAID SB 1.x autodetection
On 29 May 2007, Jan Engelhardt uttered the following: from your post at http://www.mail-archive.com/linux-raid@vger.kernel.org/msg07384.html I read that autodetecting arrays with a 1.x superblock is currently impossible. Does it at least work to force the kernel to always assume a 1.x sb? There are some 'broken' distros out there that still don't use mdadm in initramfs, and recreating the initramfs each time is a bit cumbersome... The kernel build system should be able to do that for you, shouldn't it? -- `On a scale of one to ten of usefulness, BBC BASIC was several points ahead of the competition, scoring a relatively respectable zero.' --- Peter Corlett - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Phillip Susi wrote: Hrm... I may have misunderstood the perspective you were talking from. Yes, when the bio is completed it must be on the media, but the filesystem should issue both requests, and then really not care when they complete. That is to say, the filesystem should not wait for block A to finish before issuing block B; it should issue both, and use barriers to make sure they hit the disk in the correct order. Actually now that I think about it, that wasn't correct. The request CAN be completed before the data has hit the medium. The barrier just constrains the ordering of the writes, but they can still sit in the disk write back cache for some time. Stefan Bader wrote: That would be the exactly how I understand Documentation/block/barrier.txt: In other words, I/O barrier requests have the following two properties. 1. Request ordering ... 2. Forced flushing to physical medium So, I/O barriers need to guarantee that requests actually get written to non-volatile medium in order. I think you misinterpret this, and it probably could be worded a bit better. The barrier request is about constraining the order. The forced flushing is one means to implement that constraint. The other alternative mentioned there is to use ordered tags. The key part there is requests actually get written to non-volatile medium _in order_, not before the request completes, which would be synchronous IO. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
dm-crypt issue
I'm not sure this is the right place, or that there IS a right place for this, but it involves RAID, so... ;-) I have an array I mount with cryptoloop, to hold some information with AES encryption. I keep hearing that the way to do this is with dm-crypt, but I can't find anyone who will explain how to do that. If I had the luxury of starting over I could use dm-crypt to start from scratch, but the only practical solution in terms of time and requirement for encrypted backup would be to be able to mount the same partitions, using the same encryption, just using dm-crypt. I don't think I can do that, but if someone wants to assure me that it can be done and point me to documentation, I'll be grateful. I have a LOT of partial sets of this data in the field on DVD, same requirement, I can't replace them, they have to work as is. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979
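Not an authoritative recipe, but the approach people usually describe is to build a read-only dm-crypt mapping in plain (non-LUKS) mode with parameters matching whatever losetup/cryptoloop used, and see whether the existing data mounts. The cipher, key size and passphrase hash below are only placeholders - they have to match how the original loop device derived its key:

    # read-only trial mapping of an existing cryptoloop partition
    # (cipher/key-size/hash are examples, not known-correct values)
    cryptsetup create oldvol /dev/sdXN --readonly \
            --cipher aes-cbc-plain --key-size 128 --hash ripemd160
    mount -o ro /dev/mapper/oldvol /mnt/test

If that mounts cleanly, the on-disk format never has to change - the same test works for the DVD copies, looped back read-only with losetup -r first. If it doesn't, the parameters to chase down are the IV mode and the way the passphrase was hashed into a key.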
Re: RAID SB 1.x autodetection
Nix wrote: On 29 May 2007, Jan Engelhardt uttered the following: from your post at http://www.mail-archive.com/linux-raid@vger.kernel.org/msg07384.html I read that autodetecting arrays with a 1.x superblock is currently impossible. Does it at least work to force the kernel to always assume a 1.x sb? There are some 'broken' distros out there that still don't use mdadm in initramfs, and recreating the initramfs each time is a bit cumbersome... The kernel build system should be able to do that for you, shouldn't it? That would be an improvement, yes. -- bill davidsen [EMAIL PROTECTED] CTO TMR Associates, Inc Doing interesting things with small computers since 1979 - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID SB 1.x autodetection
On May 30 2007 16:35, Bill Davidsen wrote: On 29 May 2007, Jan Engelhardt uttered the following: from your post at http://www.mail-archive.com/linux-raid@vger.kernel.org/msg07384.html I read that autodetecting arrays with a 1.x superblock is currently impossible. Does it at least work to force the kernel to always assume a 1.x sb? There are some 'broken' distros out there that still don't use mdadm in initramfs, and recreating the initramfs each time is a bit cumbersome... The kernel build system should be able to do that for you, shouldn't it? That would be an improvement, yes. Hardly, with all the Fedora specific cruft. Anyway, there was a simple patch posted in RH bugzilla, so I've gone with that. Jan -- - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RAID SB 1.x autodetection
On 30 May 2007, Bill Davidsen stated: Nix wrote: On 29 May 2007, Jan Engelhardt uttered the following: from your post at http://www.mail-archive.com/linux-raid@vger.kernel.org/msg07384.html I read that autodetecting arrays with a 1.x superblock is currently impossible. Does it at least work to force the kernel to always assume a 1.x sb? There are some 'broken' distros out there that still don't use mdadm in initramfs, and recreating the initramfs each time is a bit cumbersome... The kernel build system should be able to do that for you, shouldn't it? That would be an improvement, yes. Allow me to rephrase: the kernel build system *can* do that for you ;) that is, it can build a gzipped cpio archive from components located anywhere on the filesystem or arbitrary source located under usr/. -- `On a scale of one to ten of usefulness, BBC BASIC was several points ahead of the competition, scoring a relatively respectable zero.' --- Peter Corlett - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
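For reference, a minimal sketch of doing that (file names and paths below are examples only):

    # .config fragment
    CONFIG_BLK_DEV_INITRD=y
    CONFIG_INITRAMFS_SOURCE="/usr/src/initramfs.list"

    # /usr/src/initramfs.list, in the kernel's gen_init_cpio syntax
    dir  /dev                 0755 0 0
    nod  /dev/console         0600 0 0 c 5 1
    dir  /sbin                0755 0 0
    file /sbin/mdadm          /sbin/mdadm.static          0755 0 0
    file /init                /usr/src/initramfs-init.sh  0755 0 0

The /init script then just runs something like mdadm --assemble --scan (or an explicit mdadm -A with the array's UUID), mounts the real root and switches to it - which is exactly the mdadm-in-initramfs step the 'broken' distros are skipping.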
Re: Creating RAID1 with bitmap fails
On Wednesday May 30, [EMAIL PROTECTED] wrote: On May 30 2007 22:05, Neil Brown wrote: the following command strangely gives -EIO ... 12:27 sun:~ # mdadm -C /dev/md4 -l 1 -n 2 -e 1.0 -b internal /dev/ram0 missing md: md4: raid array is not clean -- starting background reconstruction md4: failed to create bitmap (-5) md: pers-run() failed ... mdadm: RUN_ARRAY failed: Input/output error mdadm: stopped /dev/md4 Leaving out -b internal creates the array. /dev/ram0 or /dev/sda5 - EIO happens on both. (But the disk is fine, like ram0) Where could I start looking? Linux sun 2.6.21-1.3149.al3.8smp #3 SMP Wed May 30 09:43:00 CEST 2007 sparc64 sparc64 sparc64 GNU/Linux mdadm 2.5.4 I'm fairly sure this is fixed in 2.6.2. It is certainly worth a try. The same command works on a x86_64 with mdadm 2.5.3... Are you sure? I suspect that the difference is more in the kernel version. mdadm used to create some arrays with the bitmap positioned so that it overlapped the data. Recent kernels check for that and reject the array if there is an overlap. mdadm-2.6.2 makes sure not to create any overlap. NeilBrown - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Wed, May 30, 2007 at 09:52:49AM -0700, [EMAIL PROTECTED] wrote: On Wed, 30 May 2007, David Chinner wrote: with the barrier is on stable storage when I/O completion is signalled. The existing barrier implementation (where it works) provide these requirements. We need barriers to retain these semantics, otherwise we'll still have to do special stuff in the filesystems to get the semantics that we need. one of us is misunderstanding barriers here. No, I think we are both on the same level here - it's what barriers are used for that is not clearly understood, I think. you are understanding barriers to be the same as synchronous writes. (and therefore the data is on persistent media before the call returns) No, I'm describing the high level behaviour that is expected by a filesystem. The reasons for this are below. I am understanding barriers to only indicate ordering requirements. things before the barrier can be reordered freely, things after the barrier can be reordered freely, but things cannot be reordered across the barrier. Ok, that's my understanding of how *device based barriers* can work, but there's more to it than that. As far as the filesystem is concerned the barrier write needs to *behave* exactly like a sync write because of the guarantees the filesystem has to provide userspace. Specifically - sync, sync writes and fsync. This is the big problem, right? If we use barriers for commit writes, the filesystem can return to userspace after a sync write or fsync() and an *ordered barrier device implementation* may not have written the blocks to persistent media. If we then pull the plug on the box, we've just lost data that sync or fsync said was successfully on disk. That's BAD. Right now a barrier write on the last block of the fsync/sync write is sufficient to prevent that because of the FUA on the barrier block write. A purely ordered barrier implementation does not provide this guarantee. This is the crux of my argument - from a filesystem perspective, there is a *major* difference between a barrier implemented to just guarantee ordering and a barrier implemented with a flush+FUA or flush+write+flush. IOWs, there are two parts to the problem: 1 - guaranteeing I/O ordering 2 - guaranteeing blocks are on persistent storage. Right now, a single barrier I/O is used to provide both of these guarantees. In most cases, all we really need to provide is 1); the need for 2) is a much rarer condition but still needs to be provided. if I am understanding it correctly, the big win for barriers is that you do NOT have to stop and wait until the data is on persistent media before you can continue. Yes, if we define a barrier to only guarantee 1), then yes this would be a big win (esp. for XFS). But that requires all filesystems to handle sync writes differently, and sync_blockdev() needs to call blkdev_issue_flush() as well. So, what do we do here? Do we define a barrier I/O to only provide ordering, or do we define it to also provide persistent storage writeback? Whatever we decide, it needs to be documented. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
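To spell out what ordering-only barriers would demand of the fsync path, here is a hedged sketch (a generic illustration, not code from any particular filesystem; write_and_wait_for_commit_block() is an invented helper):

    /*
     * Sketch: fsync when barriers guarantee ordering only.  The commit write
     * may still sit in the drive's write-back cache when it "completes", so
     * persistence has to be forced separately before returning to userspace.
     */
    int example_fsync(struct inode *inode, struct block_device *bdev)
    {
            int err;

            err = write_and_wait_for_commit_block(inode);   /* invented helper */
            if (err)
                    return err;

            /* without this, fsync() == 0 would no longer mean "on the platter" */
            return blkdev_issue_flush(bdev, NULL);
    }

That extra flush in every sync/fsync/sync_blockdev() path is the cost Dave is pointing at if barriers stop implying persistence.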
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Tuesday May 29, [EMAIL PROTECTED] wrote: Neil Brown wrote: md/dm modules could keep count of requests as has been suggested (though that would be a fairly big change for raid0 as it currently doesn't know when a request completes - bi_endio goes directly to the filesystem). Are you sure? I believe that dm handles bi_endio because it waits for all in progress bio to complete before switching tables. I was talking about md/raid0, not dm-stripe. md/raid0 (and md/linear) currently never know that a request has completed. NeilBrown
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Monday May 28, [EMAIL PROTECTED] wrote: There are two things I'm not sure you covered. First, disks which don't support flush but do have a cache dirty status bit you can poll at times like shutdown. If there are no drivers which support these, it can be ignored. There are really devices like that? So to implement a flush, you have to stop sending writes and wait and poll - maybe poll every millisecond? That wouldn't be very good for performance maybe you just wouldn't bother with barriers on that sort of device? Which reminds me: What is the best way to turn off barriers? Several filesystems have -o nobarriers or -o barriers=0, or the inverse. md/raid currently uses barriers to write metadata, and there is no way to turn that off. I'm beginning to wonder if that is best. Maybe barrier support should be a function of the device. i.e. the filesystem or whatever always sends barrier requests where it thinks it is appropriate, and the block device tries to honour them to the best of its ability, but if you run blockdev --enforce-barriers=no /dev/sda then you lose some reliability guarantees, but gain some throughput (a bit like the 'async' export option for nfsd). Second, NAS (including nbd?). Is there enough information to handle this really rigt? NAS means lots of things, including NFS and CIFS where this doesn't apply. For 'nbd', it is entirely up to the protocol. If the protocol allows a barrier flag to be sent to the server, then barriers should just work. If it doesn't, then either the server disables write-back caching, or flushes every request, or you lose all barrier guarantees. For 'iscsi', I guess it works just the same as SCSI... NeilBrown - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Monday May 28, [EMAIL PROTECTED] wrote: On Mon, May 28, 2007 at 12:57:53PM +1000, Neil Brown wrote: What exactly do you want to know, and why do you care? If someone explicitly mounts -o barrier and the underlying device cannot do it, then we want to issue a warning or reject the mount. I guess that makes sense. But apparently you cannot tell what a device supports until you write to it. So maybe you need to write some metadata as a barrier, then ask the device what its barrier status is. The options might be: YES - barriers are fully handled NO - best effort, but due to missing device features, it might not work DISABLED - admin has requested that barriers be ignored. ?? The idea is that every struct block_device supports barriers. If the underlying hardware doesn't support them directly, then they get simulated by draining the queue and issuing a flush. Ok. But you also seem to be implying that there will be devices that cannot support barriers. It seems there will always be hardware that doesn't meet specs. If a device doesn't support SYNCHRONIZE_CACHE or FUA, then implementing barriers all the way to the media would be hard. Even if all devices do eventually support barriers, it may take some time before we reach that goal. Why not start by making it easy to determine what the capabilities of each device are. This can then be removed once we reach the holy grail. I'd rather not add something that we plan to remove. We currently have -EOPNOTSUPP. I don't think there is much point having more than that. I would really like to get to the stage where -EOPNOTSUPP is never returned. If a filesystem cares, it could 'ask' as suggested above. What would be a good interface for asking? What if the truth changes (as can happen with md or dm)? NeilBrown
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Thu, May 31, 2007 at 10:46:04AM +1000, Neil Brown wrote: What if the truth changes (as can happen with md or dm)? You get notified in endio() that the barrier had to be emulated? Alasdair -- [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Thu, May 31, 2007 at 10:46:04AM +1000, Neil Brown wrote: If a filesystem cares, it could 'ask' as suggested above. What would be a good interface for asking? XFS already tests: bd_disk->queue->ordered == QUEUE_ORDERED_NONE Alasdair -- [EMAIL PROTECTED]
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Thu, May 31, 2007 at 02:07:39AM +0100, Alasdair G Kergon wrote: On Thu, May 31, 2007 at 10:46:04AM +1000, Neil Brown wrote: If a filesystem cares, it could 'ask' as suggested above. What would be a good interface for asking? XFS already tests: bd_disk->queue->ordered == QUEUE_ORDERED_NONE The side effects of removing that check are what started this whole discussion. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
very strange (maybe) raid1 testing results
I assembled a 3-component raid1 out of 3 4GB partitions. After syncing, I ran the following script:

for bs in 32 64 128 192 256 384 512 768 1024 ; do \
    let COUNT=2048 * 1024 / ${bs}; \
    echo -n ${bs}K bs - ; \
    dd if=/dev/md1 of=/dev/null bs=${bs}k count=$COUNT iflag=direct 2>&1 | grep 'copied' ; \
done

I also ran 'dstat' (like iostat) in another terminal. What I noticed was very unexpected to me, so I re-ran it several times. I confirmed my initial observation - every time a new dd process ran, *all* of the read I/O for that process came from a single disk. It does not (appear to) have to do with block size - if I stop and re-run the script the next drive in line will take all of the I/O - it goes sda, sdc, sdb and back to sda and so on. I am getting 70-80MB/s read rates as reported via dstat, and 60-80MB/s as reported by dd. What I don't understand is why just one disk is being used here, instead of two or more. I tried different versions of metadata, and using a bitmap makes no difference. I created the array with (allowing for variations of bitmap and metadata version):

mdadm --create --level=1 --raid-devices=3 /dev/md1 /dev/sda3 /dev/sdb3 /dev/sdc3

I am running 2.6.18.8-0.3-default on x86_64, openSUSE 10.2. Am I doing something wrong or is something weird going on? -- Jon Nelson [EMAIL PROTECTED]
Re: When does a disk get flagged as bad?
Alberto Alonso writes: OK, lets see if I can understand how a disk gets flagged as bad and removed from an array. I was under the impression that any read or write operation failure flags the drive as bad and it gets removed automatically from the array. However, as I indicated in a prior post I am having problems where the array is never degraded. Does an error of type: end_request: I/O error, dev sdb, sector not count as a read/write error? I was also under the impression that any read or write error would fail the drive out of the array but some recent experiments with error injecting seem to indicate otherwise at least for raid1. My working hypothesis is that only write errors fail the drive. Read errors appear to just redirect the sector to a different mirror. I actually ran across what looks like a bug in the raid1 recovery/check/repair read error logic that I posted about last week but which hasn't generated any response yet (cf. http://article.gmane.org/gmane.linux.raid/15354). This bug results in sending a zero length write request down to the underlying device driver. A consequence of issuing a zero length write is that it fails at the device level, which raid1 sees as a write failure, which then fails the array. The fix I proposed actually has the effect of *not* failing the array in this case since the spurious failing write is never generated. I'm not sure what is actually supposed to happen in this case. Hopefully, someone more knowledgeable will comment soon. -- Mike Accetta ECI Telecom Ltd. Data Networking Division (previously Laurel Networks) - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: very strange (maybe) raid1 testing results
Jon Nelson wrote: I am getting 70-80MB/s read rates as reported via dstat, and 60-80MB/s as reported by dd. What I don't understand is why just one disk is being used here, instead of two or more. I tried different versions of metadata, and using a bitmap makes no difference. I created the array with (allowing for variations of bitmap and metadata version): This is normal for md RAID1. What you should find is that for concurrent reads, each read will be serviced by a different disk, until no. of reads = no. of drives. Regards, Richard - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: When does a disk get flagged as bad?
On Wed, 2007-05-30 at 22:28 -0400, Mike Accetta wrote: Alberto Alonso writes: OK, lets see if I can understand how a disk gets flagged as bad and removed from an array. I was under the impression that any read or write operation failure flags the drive as bad and it gets removed automatically from the array. However, as I indicated in a prior post I am having problems where the array is never degraded. Does an error of type: end_request: I/O error, dev sdb, sector not count as a read/write error? I was also under the impression that any read or write error would fail the drive out of the array but some recent experiments with error injecting seem to indicate otherwise at least for raid1. My working hypothesis is that only write errors fail the drive. Read errors appear to just redirect the sector to a different mirror. I actually ran across what looks like a bug in the raid1 recovery/check/repair read error logic that I posted about last week but which hasn't generated any response yet (cf. http://article.gmane.org/gmane.linux.raid/15354). This bug results in sending a zero length write request down to the underlying device driver. A consequence of issuing a zero length write is that it fails at the device level, which raid1 sees as a write failure, which then fails the array. The fix I proposed actually has the effect of *not* failing the array in this case since the spurious failing write is never generated. I'm not sure what is actually supposed to happen in this case. Hopefully, someone more knowledgeable will comment soon. -- Mike Accetta I was starting to think that nobody got my posts, I know there are plenty of people that understand raid and didn't get any answers to any of my related posts. After thinking about your post, I guess I can see some logic behind not failing on the read, although I would say that after x amount of read failures a drive should be kicked out no matter what. In my case I believe the errors are during writes, which is still confusing. Unfortunately I've never done any kind of disk I/O code so I am afraid of looking at the code and getting completely lost. Alberto - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: very strange (maybe) raid1 testing results
On Thu, 31 May 2007, Richard Scobie wrote: Jon Nelson wrote: I am getting 70-80MB/s read rates as reported via dstat, and 60-80MB/s as reported by dd. What I don't understand is why just one disk is being used here, instead of two or more. I tried different versions of metadata, and using a bitmap makes no difference. I created the array with (allowing for variations of bitmap and metadata version): This is normal for md RAID1. What you should find is that for concurrent reads, each read will be serviced by a different disk, until no. of reads = no. of drives. Alright. To clarify, let's assume some process (like a single-threaded webserver) using a raid1 to store content (who knows why, let's just say it is), and also assume that the I/O load is 100% reads. Given that the server does not fork (or create a thread) for each request, does that mean that every single web request is essentially serviced from one disk, always? What mechanism determines which disk actually services the request? -- Jon Nelson [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
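The behaviour Jon is seeing falls out of md/raid1's read_balance() heuristic. Paraphrased as a simplified sketch (this is not the literal kernel code; closest_working_mirror() is an invented helper):

    static int read_balance_sketch(conf_t *conf, sector_t sector)
    {
            int disk;

            /* A sequential stream keeps matching the head position of the
             * mirror it last read from, so a lone dd sticks to one disk. */
            for (disk = 0; disk < conf->raid_disks; disk++)
                    if (conf->mirrors[disk].head_position == sector)
                            return disk;

            /* Otherwise pick a working mirror, roughly the one whose head is
             * nearest to the requested sector (faulty disks are skipped). */
            return closest_working_mirror(conf, sector);    /* invented helper */
    }

So a single-threaded reader issuing sequential I/O is deliberately kept on one spindle (it costs nothing and leaves the other disks' heads undisturbed); the benefit of the extra mirrors only appears when independent reads arrive concurrently, as Richard says above. A single-threaded web server therefore tends to be served from whichever disk its current sequential stream last touched, not from a fixed disk forever.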
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Monday May 28, [EMAIL PROTECTED] wrote: Neil Brown writes: [...] Thus the general sequence might be: a/ issue all preceding writes. b/ issue the commit write with BIO_RW_BARRIER c/ wait for the commit to complete. If it was successful - done. If it failed other than with EOPNOTSUPP, abort else continue d/ wait for all 'preceding writes' to complete e/ call blkdev_issue_flush f/ issue commit write without BIO_RW_BARRIER g/ wait for commit write to complete if it failed, abort h/ call blkdev_issue DONE steps b and c can be left out if it is known that the device does not support barriers. The only way to discover this to try and see if it fails. I don't think any filesystem follows all these steps. It seems that steps b/ -- h/ are quite generic, and can be implemented once in a generic code (with some synchronization mechanism like wait-queue at d/). Yes and no. It depends on what you mean by preceding write. If you implement this in the filesystem, the filesystem can wait only for those writes where it has an ordering dependency. If you implement it in common code, then you have to wait for all writes that were previously issued. e.g. If you have two different filesystems on two different partitions on the one device, why should writes in one filesystem wait for a barrier issued in the other filesystem. If you have a single filesystem with one thread doing lot of over-writes (no metadata changes) and the another doing lots of metadata changes (requiring journalling and barriers) why should the data write be held up by the metadata updates? So I'm not actually convinced that doing this is common code is the best approach. But it is the easiest. The common code should provide the barrier and flushing primitives, and the filesystem gets to use them however it likes. NeilBrown - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
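For reference, steps b/ through h/ of the sequence being discussed, written out as a hedged sketch of what a generic helper might look like (struct commit, write_block_sync() and wait_for_preceding_writes() are invented names):

    int submit_commit_write(struct commit *c, struct block_device *bdev)
    {
            int err;

            err = write_block_sync(c, WRITE_BARRIER);       /* b/ + c/ */
            if (err != -EOPNOTSUPP)
                    return err;                             /* done, or a real error */

            /* barrier not supported: emulate it explicitly */
            wait_for_preceding_writes(c);                   /* d/ */
            err = blkdev_issue_flush(bdev, NULL);           /* e/ */
            if (err)
                    return err;
            err = write_block_sync(c, WRITE);               /* f/ + g/ */
            if (err)
                    return err;
            return blkdev_issue_flush(bdev, NULL);          /* h/ */
    }

Neil's point stands either way: done inside the filesystem, "preceding writes" can mean only the writes this transaction actually depends on; done once in generic code, it has to mean everything previously issued to the device.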
Re: Creating RAID1 with bitmap fails
On May 31 2007 09:09, Neil Brown wrote: the following command strangely gives -EIO ... 12:27 sun:~ # mdadm -C /dev/md4 -l 1 -n 2 -e 1.0 -b internal /dev/ram0 missing Where could I start looking? Linux sun 2.6.21-1.3149.al3.8smp #3 SMP Wed May 30 09:43:00 CEST 2007 sparc64 sparc64 sparc64 GNU/Linux mdadm 2.5.4 I'm fairly sure this is fixed in 2.6.2. It is certainly worth a try. The same command works on a x86_64 with mdadm 2.5.3... [ with 2.6.18.8 ] Are you sure? I suspect that the difference is more in the kernel version. mdadm used to create some arrays with the bitmap positioned so that it overlapped the data. Recent kernels check for that and reject the array if there is an overlap. mdadm-2.6.2 makes sure not to create any overlap. Regarding above x86_64/2.5.3/2.6.18.8 created array, is there a way to check whether it overlaps? Jan -- - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
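A rough way to check an existing member from userspace (exact field names vary a little between mdadm versions):

    mdadm --examine /dev/sda5           # note Super Offset and Used Dev Size / Data Offset
    mdadm --examine-bitmap /dev/sda5    # note where the internal bitmap lives and how big it is

If the sectors the internal bitmap occupies fall inside the range covered by the data, that is the overlap newer kernels reject; the --examine output of an array freshly created with mdadm 2.6.2 gives a known-good layout to compare against.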
Re: Md corruption using RAID10 on linux-2.6.21
On Wednesday May 30, [EMAIL PROTECTED] wrote: Neil, I sent the scripts to you. Any update on this issue? Sorry, I got distracted. Your scripts are way more complicated than needed. Most of the logic in there is already in mdadm.

mdadm --assemble /dev/md_d0 --run --uuid=$BOOTUUID /dev/sd[abcd]2

can replace most of it. And you don't need to wait for resync to complete before mounting filesystems. That said: I cannot see anything in your script that would actually do the wrong thing. Hmmm... I see now I wasn't quite testing the right thing. I need to trigger a resync with one device missing. i.e.

mdadm -C /dev/md0 -l10 -n4 -p n3 /dev/sd[abcd]1
mkfs /dev/md0
mdadm /dev/md0 -f /dev/sda1
mdadm -S /dev/md0
mdadm -A /dev/md0 -R --update=resync /dev/sd[bcd]1
fsck -f /dev/md0

This fails just as you say. Following patch fixes it as well as another problem I found while doing this testing. Thanks for pursuing this. NeilBrown

diff .prev/drivers/md/raid10.c ./drivers/md/raid10.c
--- .prev/drivers/md/raid10.c	2007-05-21 11:18:23.0 +1000
+++ ./drivers/md/raid10.c	2007-05-31 15:11:42.0 +1000
@@ -1866,6 +1866,7 @@ static sector_t sync_request(mddev_t *md
 			int d = r10_bio->devs[i].devnum;
 			bio = r10_bio->devs[i].bio;
 			bio->bi_end_io = NULL;
+			clear_bit(BIO_UPTODATE, &bio->bi_flags);
 			if (conf->mirrors[d].rdev == NULL ||
 			    test_bit(Faulty, &conf->mirrors[d].rdev->flags))
 				continue;
@@ -2036,6 +2037,11 @@ static int run(mddev_t *mddev)
 	/* 'size' is now the number of chunks in the array */
 	/* calculate used chunks per device in 'stride' */
 	stride = size * conf->copies;
+
+	/* We need to round up when dividing by raid_disks to
+	 * get the stride size.
+	 */
+	stride += conf->raid_disks - 1;
 	sector_div(stride, conf->raid_disks);
 	mddev->size = stride << (conf->chunk_shift-1);