Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread David Chinner
On Tue, May 29, 2007 at 05:01:24PM -0700, [EMAIL PROTECTED] wrote:
 On Wed, 30 May 2007, David Chinner wrote:
 
 On Tue, May 29, 2007 at 04:03:43PM -0400, Phillip Susi wrote:
 David Chinner wrote:
 The use of barriers in XFS assumes the commit write to be on stable
 storage before it returns.  One of the ordering guarantees that we
 need is that the transaction (commit write) is on disk before the
 metadata block containing the change in the transaction is written
 to disk and the current barrier behaviour gives us that.
 
 Barrier != synchronous write,
 
 Of course. FYI, XFS only issues barriers on *async* writes.
 
 But barrier semantics - as far as they've been described by everyone
 but you - indicate that the barrier write is guaranteed to be on stable
 storage when it returns.
 
 this doesn't match what I have seen
 
 with barriers it's perfectly legal to have the following sequence of 
 events
 
 1. app writes block 10 to OS
 2. app writes block 4 to OS
 3. app writes barrier to OS
 4. app writes block 5 to OS
 5. app writes block 20 to OS

hm - applications can't issue barriers to the filesystem.
However, if you consider the barrier to be an fsync() for example,
then it's still the filesystem that is issuing the barrier and
there's a block that needs to be written that is associated with
that barrier (either an inode or a transaction commit) that needs to
be on stable storage before the filesystem returns to userspace.

 6. OS writes block 4 to disk drive
 7. OS writes block 10 to disk drive
 8. OS writes barrier to disk drive
 9. OS writes block 5 to disk drive
 10. OS writes block 20 to disk drive

Replace OS with filesystem, and combine 7+8 together - we don't have
zero-length barriers and hence they are *always* associated with a
write to a certain block on disk. i.e.:

1. FS writes block 4 to disk drive
2. FS writes block 10 to disk drive
3. FS writes *barrier* block X to disk drive
4. FS writes block 5 to disk drive
5. FS writes block 20 to disk drive

The order in which the filesystem expects these to hit stable
storage is:

1. blocks 4 and 10 on stable storage, in any order
2. barrier block X on stable storage
3. blocks 5 and 20 on stable storage, in any order

The point I'm trying to make is that in XFS, blocks 5 and 20 cannot
be allowed to hit the disk before the barrier block, because they
have a strict ordering dependency on block X being stable before them,
just as block X has a strict ordering dependency that blocks 4 and 10
must be stable before we start the barrier block write.

 11. disk drive writes block 10 to platter
 12. disk drive writes block 4 to platter
 13. disk drive writes block 20 to platter
 14. disk drive writes block 5 to platter

 if the disk drive doesn't support barriers then step #8 becomes 'issue 
 flush' and steps 11 and 12 take place before step #9, 13, 14

No, you need a flush on either side of the block X write to maintain
the same semantics as barrier writes currently have.

We have filesystems that require barriers to prevent reordering of
writes in both directions and to ensure that the block associated
with the barrier is on stable storage when I/O completion is
signalled.  The existing barrier implementation (where it works)
provides these requirements. We need barriers to retain these
semantics, otherwise we'll still have to do special stuff in
the filesystems to get the semantics that we need.
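
To make that concrete, here is a rough sketch - not XFS code, the
function and helpers are made up for illustration, only
blkdev_issue_flush() and submit_bio() are real interfaces of this era -
of what emulating a barrier write by hand would look like on a device
without barrier support, i.e. flush, then the commit block, then flush
again:

/*
 * Illustrative sketch only - not XFS code.  Emulate "flush / write
 * block X / flush" by hand for a device that cannot do barriers.
 * The caller builds @bio for the commit block.
 */
static int commit_end_io(struct bio *bio, unsigned int bytes_done, int err)
{
	if (bio->bi_size)
		return 1;			/* not fully done yet */
	complete(bio->bi_private);
	return 0;
}

static int emulate_barrier_commit(struct block_device *bdev, struct bio *bio)
{
	DECLARE_COMPLETION_ONSTACK(wait);
	int err;

	/* blocks 4 and 10 must be stable before the commit block goes out */
	err = blkdev_issue_flush(bdev, NULL);
	if (err)
		return err;

	bio->bi_end_io = commit_end_io;
	bio->bi_private = &wait;
	submit_bio(WRITE, bio);
	wait_for_completion(&wait);
	if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
		return -EIO;

	/* the commit block must be stable before completion is signalled
	 * (and before blocks 5 and 20 may be issued) */
	return blkdev_issue_flush(bdev, NULL);
}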

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread Stefan Bader

The order in which the filesystem expects these to hit stable
storage is:

1. blocks 4 and 10 on stable storage, in any order
2. barrier block X on stable storage
3. blocks 5 and 20 on stable storage, in any order

The point I'm trying to make is that in XFS, blocks 5 and 20 cannot
be allowed to hit the disk before the barrier block, because they
have a strict ordering dependency on block X being stable before them,
just as block X has a strict ordering dependency that blocks 4 and 10
must be stable before we start the barrier block write.



That would be exactly how I understand Documentation/block/barrier.txt:

In other words, I/O barrier requests have the following two properties.
1. Request ordering
...
2. Forced flushing to physical medium

So, I/O barriers need to guarantee that requests actually get written
to non-volatile medium in order.

Stefan


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread Stefan Bader

 in-flight I/O to go to zero?

Something like that is needed for some dm targets to support barriers.
(We needn't always wait for *all* in-flight I/O.)
When faced with -EOPNOTSUPP, do all callers fall back to a sync in
the places a barrier would have been used, or are there any more
sophisticated strategies attempting to optimise code without barriers?


If I didn't misunderstand, the idea is that no caller will face an
-EOPNOTSUPP in the future. IOW, every layer or driver somehow makes sure
the right thing happens.



An efficient I/O barrier implementation would not normally involve
flushing AFAIK: dm surely wouldn't cause a higher layer to assume
stronger semantics than are provided.


It seems there are at least two assumptions about what the semantics
exactly _are_. Based on Documentation/block/barrier.txt I understand that
a barrier implies ordering and flushing.
But regardless of that, assume the (admittedly constructed) following case:

You've got a linear target that consists of two disks. One drive (a)
supports barriers and the other one (b) doesn't. Device-mapper just
maps the requests to the appropriate disk. Now the following sequence
happens:

1. block x gets mapped to drive b
2. block y (with barrier) gets mapped to drive a

Since drive a supports barrier requests we don't get -EOPNOTSUPP, but
the request with block y might get written before block x, since the
disks are independent. I guess the chances of this are quite low, since
at some point a barrier request will also hit drive b, but for the time
being it might be better to indicate -EOPNOTSUPP right from
device-mapper.
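
For reference, a mapping like that could be created as follows (purely
illustrative - the device names and the 1000000-sector legs are made up):

# drive b (/dev/sdb, no barrier support) holds the first half,
# drive a (/dev/sda, barrier support) the second half
dmsetup create mixed <<EOF
0       1000000 linear /dev/sdb 0
1000000 1000000 linear /dev/sda 0
EOF

A block x in the first half then goes to sdb, while a barrier block y in
the second half goes to sda, which is exactly the situation described above.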

Stefan


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread Jens Axboe
On Mon, May 28 2007, Neil Brown wrote:
 I think the implementation priorities here are:
 
 1/ implement a zero-length BIO_RW_BARRIER option.
 2/ Use it (or otherwise) to make all dm and md modules handle
barriers (and loop?).
 3/ Devise and implement appropriate fall-backs within the block layer
so that -EOPNOTSUPP is never returned.
 4/ Remove unneeded cruft from filesystems (and elsewhere).

This is the start of 1/ above. It's very lightly tested; it's verified
to DTRT here, at least, and not to crash :-)

It gets rid of the ->issue_flush_fn() queue callback; all the driver
knowledge resides in ->prepare_flush_fn() anyway. blkdev_issue_flush()
then just reuses the empty-bio approach to queue an empty barrier; this
should work equally well for stacked and non-stacked devices.

While this patch isn't complete yet, it's clearly the right direction to
go.

I didn't convert drivers/md/* to support this approach, I'm leaving that
to you :-)

 block/elevator.c|   12 ++
 block/ll_rw_blk.c   |  173 ++--
 drivers/ide/ide-disk.c  |   29 -
 drivers/message/i2o/i2o_block.c |   24 
 drivers/scsi/scsi_lib.c |   17 ---
 drivers/scsi/sd.c   |   15 --
 fs/bio.c|8 -
 include/linux/bio.h |   18 ++-
 include/linux/blkdev.h  |3 
 include/scsi/scsi_driver.h  |1 
 include/scsi/sd.h   |1 
 mm/bounce.c |6 +
 12 files changed, 141 insertions(+), 166 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index ce866eb..af5e58d 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -715,6 +715,18 @@ struct request *elv_next_request(request_queue_t *q)
int ret;
 
while ((rq = __elv_next_request(q)) != NULL) {
+   /*
+* Kill the empty barrier place holder, the driver must
+* not ever see it.
+*/
+   if (blk_fs_request(rq) && blk_barrier_rq(rq) &&
+   !rq->hard_nr_sectors) {
+   blkdev_dequeue_request(rq);
+   rq->cmd_flags |= REQ_QUIET;
+   end_that_request_chunk(rq, 1, 0);
+   end_that_request_last(rq, 1);
+   continue;
+   }
if (!(rq->cmd_flags & REQ_STARTED)) {
/*
 * This is the first time the device driver
diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 6b5173a..8680083 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -300,23 +300,6 @@ int blk_queue_ordered(request_queue_t *q, unsigned ordered,
 
 EXPORT_SYMBOL(blk_queue_ordered);
 
-/**
- * blk_queue_issue_flush_fn - set function for issuing a flush
- * @q: the request queue
- * @iff:   the function to be called issuing the flush
- *
- * Description:
- *   If a driver supports issuing a flush command, the support is notified
- *   to the block layer by defining it through this call.
- *
- **/
-void blk_queue_issue_flush_fn(request_queue_t *q, issue_flush_fn *iff)
-{
-   q->issue_flush_fn = iff;
-}
-
-EXPORT_SYMBOL(blk_queue_issue_flush_fn);
-
 /*
  * Cache flushing for ordered writes handling
  */
@@ -433,7 +416,8 @@ static inline struct request *start_ordered(request_queue_t *q,
rq_init(q, rq);
if (bio_data_dir(q->orig_bar_rq->bio) == WRITE)
rq->cmd_flags |= REQ_RW;
-   rq->cmd_flags |= q->ordered & QUEUE_ORDERED_FUA ? REQ_FUA : 0;
+   if (q->ordered & QUEUE_ORDERED_FUA)
+   rq->cmd_flags |= REQ_FUA;
rq->elevator_private = NULL;
rq->elevator_private2 = NULL;
init_request_from_bio(rq, q->orig_bar_rq->bio);
@@ -445,7 +429,7 @@ static inline struct request *start_ordered(request_queue_t *q,
 * no fs request uses ELEVATOR_INSERT_FRONT and thus no fs
 * request gets inbetween ordered sequence.
 */
-   if (q->ordered & QUEUE_ORDERED_POSTFLUSH)
+   if ((q->ordered & QUEUE_ORDERED_POSTFLUSH) && rq->hard_nr_sectors)
queue_flush(q, QUEUE_ORDERED_POSTFLUSH);
else
q->ordseq |= QUEUE_ORDSEQ_POSTFLUSH;
@@ -469,7 +453,7 @@ static inline struct request *start_ordered(request_queue_t *q,
 int blk_do_ordered(request_queue_t *q, struct request **rqp)
 {
struct request *rq = *rqp;
-   int is_barrier = blk_fs_request(rq) && blk_barrier_rq(rq);
+   const int is_barrier = blk_fs_request(rq) && blk_barrier_rq(rq);
 
if (!q->ordseq) {
if (!is_barrier)
@@ -2635,6 +2619,16 @@ int blk_execute_rq(request_queue_t *q, struct gendisk *bd_disk,
 
 EXPORT_SYMBOL(blk_execute_rq);
 
+static int bio_end_empty_barrier(struct bio *bio, unsigned int bytes_done,
+int err)
+{
+   if (err)
+   clear_bit(BIO_UPTODATE, &bio->bi_flags);
+
+   complete(bio->bi_private);
+   return 0;
+}
+
 /**
  * blkdev_issue_flush - 
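
The remainder of the patch is cut off above. Purely as an illustration of
the empty-bio approach it describes - a sketch, not the actual patch body -
the reworked blkdev_issue_flush() would look roughly like this:

/*
 * Sketch only, not the actual patch: queue an empty barrier bio and
 * wait for it, instead of calling a per-driver ->issue_flush_fn().
 */
int blkdev_issue_flush(struct block_device *bdev, sector_t *error_sector)
{
	DECLARE_COMPLETION_ONSTACK(wait);
	struct bio *bio;
	int ret = 0;

	if (!bdev_get_queue(bdev))
		return -ENXIO;

	bio = bio_alloc(GFP_KERNEL, 0);		/* zero-length bio */
	if (!bio)
		return -ENOMEM;

	bio->bi_end_io = bio_end_empty_barrier;	/* from the hunk above */
	bio->bi_private = &wait;
	bio->bi_bdev = bdev;
	submit_bio(1 << BIO_RW_BARRIER, bio);	/* empty write barrier */

	wait_for_completion(&wait);
	if (!bio_flagged(bio, BIO_UPTODATE))
		ret = -EIO;

	bio_put(bio);
	return ret;
}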

Re: ANNOUNCE: mdadm 2.6.2 - A tool for managing Soft RAID under Linux

2007-05-30 Thread Thomas Jarosch
Neil,

On Tuesday, 29. May 2007, you wrote:
  cc1: warnings being treated as errors
  sysfs.c: In function 'sysfs_read':
  sysfs.c:97: warning: value computed is not used
  sysfs.c:119: warning: value computed is not used
  sysfs.c:127: warning: value computed is not used
  sysfs.c:133: warning: value computed is not used
  sysfs.c:139: warning: value computed is not used
  sysfs.c:178: warning: value computed is not used

 Those are bogus warnings. Each is
   strcpy(base, x);
 and base most certainly is used, though I can see how gcc might not
 notice if it is being too clever. Maybe you need to get gcc-4.1.2?
 or
make CWFLAGS=-Wall

Holger Kiehl was right, it complained about the unused return value.
Please see the attached patch.

Thomas
diff -u -r -p mdadm-2.6.2/Detail.c mdadm.warning/Detail.c
--- mdadm-2.6.2/Detail.c	Mon May 21 06:25:50 2007
+++ mdadm.warning/Detail.c	Wed May 30 10:52:32 2007
@@ -59,7 +59,7 @@ int Detail(char *dev, int brief, int exp
 	void *super = NULL;
 	int rv = test ? 4 : 1;
 	int avail_disks = 0;
-	char *avail;
+	char *avail = NULL;
 
 	if (fd  0) {
 		fprintf(stderr, Name ": cannot open %s: %s\n",
diff -u -r -p mdadm-2.6.2/sysfs.c mdadm.warning/sysfs.c
--- mdadm-2.6.2/sysfs.c	Thu Dec 21 06:44:22 2006
+++ mdadm.warning/sysfs.c	Wed May 30 10:55:43 2007
@@ -94,7 +94,7 @@ struct sysarray *sysfs_read(int fd, int 
 
 	sra->devs = NULL;
 	if (options & GET_VERSION) {
-		strcpy(base, "metadata_version");
+		(void)strcpy(base, "metadata_version");
 		if (load_sys(fname, buf))
 			goto abort;
 		if (strncmp(buf, "none", 4) == 0)
@@ -104,19 +104,19 @@ struct sysarray *sysfs_read(int fd, int 
 			   &sra->major_version, &sra->minor_version);
 	}
 	if (options & GET_LEVEL) {
-		strcpy(base, "level");
+		(void)strcpy(base, "level");
 		if (load_sys(fname, buf))
 			goto abort;
 		sra->level = map_name(pers, buf);
 	}
 	if (options & GET_LAYOUT) {
-		strcpy(base, "layout");
+		(void)strcpy(base, "layout");
 		if (load_sys(fname, buf))
 			goto abort;
 		sra->layout = strtoul(buf, NULL, 0);
 	}
 	if (options & GET_COMPONENT) {
-		strcpy(base, "component_size");
+		(void)strcpy(base, "component_size");
 		if (load_sys(fname, buf))
 			goto abort;
 		sra->component_size = strtoull(buf, NULL, 0);
@@ -124,19 +124,19 @@ struct sysarray *sysfs_read(int fd, int 
 		sra->component_size *= 2;
 	}
 	if (options & GET_CHUNK) {
-		strcpy(base, "chunk_size");
+		(void)strcpy(base, "chunk_size");
 		if (load_sys(fname, buf))
 			goto abort;
 		sra->chunk = strtoul(buf, NULL, 0);
 	}
 	if (options & GET_CACHE) {
-		strcpy(base, "stripe_cache_size");
+		(void)strcpy(base, "stripe_cache_size");
 		if (load_sys(fname, buf))
 			goto abort;
 		sra->cache_size = strtoul(buf, NULL, 0);
 	}
 	if (options & GET_MISMATCH) {
-		strcpy(base, "mismatch_cnt");
+		(void)strcpy(base, "mismatch_cnt");
 		if (load_sys(fname, buf))
 			goto abort;
 		sra->mismatch_cnt = strtoul(buf, NULL, 0);
@@ -175,7 +175,7 @@ struct sysarray *sysfs_read(int fd, int 
 		dev->role = strtoul(buf, &ep, 10);
 		if (*ep) dev->role = -1;
 
-		strcpy(dbase, "block/dev");
+		(void)strcpy(dbase, "block/dev");
 		if (load_sys(fname, buf))
 			goto abort;
 		sscanf(buf, "%d:%d", &dev->major, &dev->minor);


Creating RAID1 with bitmap fails

2007-05-30 Thread Jan Engelhardt
Hi,


the following command strangely gives -EIO ...
12:27 sun:~ # mdadm -C /dev/md4 -l 1 -n 2 -e 1.0 -b internal /dev/ram0 
missing

md: md4: raid array is not clean -- starting background reconstruction
md4: failed to create bitmap (-5)
md: pers->run() failed ...
mdadm: RUN_ARRAY failed: Input/output error
mdadm: stopped /dev/md4

Leaving out -b internal creates the array. /dev/ram0 or /dev/sda5 - EIO 
happens on both. (But the disk is fine, like ram0)
Where could I start looking?

Linux sun 2.6.21-1.3149.al3.8smp #3 SMP Wed May 30 09:43:00 CEST 2007 
sparc64 sparc64 sparc64 GNU/Linux
mdadm 2.5.4


Thanks,
Jan
-- 


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread Alasdair G Kergon
On Wed, May 30, 2007 at 11:12:37AM +0200, Stefan Bader wrote:
 it might be better to indicate -EOPNOTSUPP right from
 device-mapper.
 
Indeed we should.  For support, on receipt of a barrier, dm core should
send a zero-length barrier to all active underlying paths, and delay
mapping any further I/O.

Alasdair
-- 
[EMAIL PROTECTED]


Re: mismatch_cnt = 128 for root (/) md raid1 device

2007-05-30 Thread Justin Piszcz

Asking again..

On Sat, 26 May 2007, Justin Piszcz wrote:


Kernel 2.6.21.3

Fri May 25 20:00:02 EDT 2007: Executing RAID health check for /dev/md0...
Fri May 25 20:00:03 EDT 2007: Executing RAID health check for /dev/md1...
Fri May 25 20:00:04 EDT 2007: Executing RAID health check for /dev/md2...
Fri May 25 20:00:05 EDT 2007: Executing RAID health check for /dev/md3...
Sat May 26 04:40:09 EDT 2007: cat /sys/block/md0/md/mismatch_cnt
Sat May 26 04:40:09 EDT 2007: 0
Sat May 26 04:40:09 EDT 2007: cat /sys/block/md1/md/mismatch_cnt
Sat May 26 04:40:09 EDT 2007: 0
Sat May 26 04:40:09 EDT 2007: cat /sys/block/md2/md/mismatch_cnt
Sat May 26 04:40:09 EDT 2007: 128
Sat May 26 04:40:09 EDT 2007: cat /sys/block/md3/md/mismatch_cnt
Sat May 26 04:40:09 EDT 2007: 0
Sat May 26 04:40:09 EDT 2007: The meta-device /dev/md0 has no mismatched
sectors.
Sat May 26 04:40:10 EDT 2007: The meta-device /dev/md1 has no mismatched
sectors.
Sat May 26 04:40:11 EDT 2007: The meta-device /dev/md2 has 128 mismatched
sectors.
Sat May 26 04:40:11 EDT 2007: Executing repair on /dev/md2
Sat May 26 04:40:12 EDT 2007: The meta-device /dev/md3 has no mismatched
sectors.
Sat May 26 05:00:14 EDT 2007: cat /sys/block/md0/md/mismatch_cnt
Sat May 26 05:00:14 EDT 2007: 0
Sat May 26 05:00:14 EDT 2007: cat /sys/block/md1/md/mismatch_cnt
Sat May 26 05:00:14 EDT 2007: 0
Sat May 26 05:00:14 EDT 2007: cat /sys/block/md2/md/mismatch_cnt
Sat May 26 05:00:14 EDT 2007: 0
Sat May 26 05:00:14 EDT 2007: cat /sys/block/md3/md/mismatch_cnt
Sat May 26 05:00:14 EDT 2007: 0

I often see 128 or so for the root volume (/) for my RAID1.  Any idea?  I 
know when you reboot/shutdown a system with md raid1 it does not get 
unmounted cleanly (I believe?).


Just curious why it happens on the root volume and if it's something I should 
be worried about?


md0=swap
md1=boot
md2=root
md3=raid5_volume
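
For reference, the check/repair cycle those log lines come from is just the
md sysfs interface, roughly:

# start a check and wait for it to finish
echo check > /sys/block/md2/md/sync_action
cat /proc/mdstat                     # watch until the check completes
cat /sys/block/md2/md/mismatch_cnt   # sectors found out of sync

# rewrite the mismatched blocks (the count is reset when the next
# check or repair starts)
echo repair > /sys/block/md2/md/sync_action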





Re: Creating RAID1 with bitmap fails

2007-05-30 Thread Jan Engelhardt

On May 30 2007 22:05, Neil Brown wrote:
 
 the following command strangely gives -EIO ...
 12:27 sun:~ # mdadm -C /dev/md4 -l 1 -n 2 -e 1.0 -b internal /dev/ram0 
 missing
 
 md: md4: raid array is not clean -- starting background reconstruction
 md4: failed to create bitmap (-5)
 md: pers->run() failed ...
 mdadm: RUN_ARRAY failed: Input/output error
 mdadm: stopped /dev/md4
 
 Leaving out -b internal creates the array. /dev/ram0 or /dev/sda5 - EIO 
 happens on both. (But the disk is fine, like ram0)
 Where could I start looking?
 
 Linux sun 2.6.21-1.3149.al3.8smp #3 SMP Wed May 30 09:43:00 CEST 2007 
 sparc64 sparc64 sparc64 GNU/Linux
 mdadm 2.5.4

I'm fairly sure this is fixed in 2.6.2.  It is certainly worth a try.

The same command works on a x86_64 with mdadm 2.5.3...


Jan
-- 


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread Phillip Susi

David Chinner wrote:

Barrier != synchronous write,


Of course. FYI, XFS only issues barriers on *async* writes.

But barrier semantics - as far as they've been described by everyone
but you - indicate that the barrier write is guaranteed to be on stable
storage when it returns.


Hrm... I may have misunderstood the perspective you were talking from. 
Yes, when the bio is completed it must be on the media, but the 
filesystem should issue both requests, and then really not care when 
they complete.  That is to say, the filesystem should not wait for block 
A to finish before issuing block B; it should issue both, and use 
barriers to make sure they hit the disk in the correct order.



XFS relies on the block being stable before any other write
goes to disk. That is the semantic that the barrier I/Os currently
have. How that is implemented in the device is irrelevant to me,
but if I issue a barrier I/O, I do not expect *any* I/O to be
reordered around it.


Right... it just needs to control the order of the requests, not 
wait on one to finish before issuing the next.





Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread david

On Wed, 30 May 2007, David Chinner wrote:


On Tue, May 29, 2007 at 05:01:24PM -0700, [EMAIL PROTECTED] wrote:

On Wed, 30 May 2007, David Chinner wrote:


On Tue, May 29, 2007 at 04:03:43PM -0400, Phillip Susi wrote:

David Chinner wrote:

The use of barriers in XFS assumes the commit write to be on stable
storage before it returns.  One of the ordering guarantees that we
need is that the transaction (commit write) is on disk before the
metadata block containing the change in the transaction is written
to disk and the current barrier behaviour gives us that.


Barrier != synchronous write,


Of course. FYI, XFS only issues barriers on *async* writes.

But barrier semantics - as far as they've been described by everyone
but you - indicate that the barrier write is guaranteed to be on stable
storage when it returns.


this doesn't match what I have seen

with barriers it's perfectly legal to have the following sequence of
events

1. app writes block 10 to OS
2. app writes block 4 to OS
3. app writes barrier to OS
4. app writes block 5 to OS
5. app writes block 20 to OS


hm - applications can't issue barriers to the filesystem.
However, if you consider the barrier to be an fsync() for example,
then it's still the filesystem that is issuing the barrier and
there's a block that needs to be written that is associated with
that barrier (either an inode or a transaction commit) that needs to
be on stable storage before the filesystem returns to userspace.


6. OS writes block 4 to disk drive
7. OS writes block 10 to disk drive
8. OS writes barrier to disk drive
9. OS writes block 5 to disk drive
10. OS writes block 20 to disk drive


Replace OS with filesystem, and combine 7+8 together - we don't have
zero-length barriers and hence they are *always* associated with a
write to a certain block on disk. i.e.:

1. FS writes block 4 to disk drive
2. FS writes block 10 to disk drive
3. FS writes *barrier* block X to disk drive
4. FS writes block 5 to disk drive
5. FS writes block 20 to disk drive

The order in which the filesystem expects these to hit stable
storage is:

1. blocks 4 and 10 on stable storage, in any order
2. barrier block X on stable storage
3. blocks 5 and 20 on stable storage, in any order

The point I'm trying to make is that in XFS, blocks 5 and 20 cannot
be allowed to hit the disk before the barrier block, because they
have a strict ordering dependency on block X being stable before them,
just as block X has a strict ordering dependency that blocks 4 and 10
must be stable before we start the barrier block write.


11. disk drive writes block 10 to platter
12. disk drive writes block 4 to platter
13. disk drive writes block 20 to platter
14. disk drive writes block 5 to platter



if the disk drive doesn't support barriers then step #8 becomes 'issue
flush' and steps 11 and 12 take place before step #9, 13, 14


No, you need a flush on either side of the block X write to maintain
the same semantics as barrier writes currently have.

We have filesystems that require barriers to prevent reordering of
writes in both directions and to ensure that the block associated
with the barrier is on stable storage when I/O completion is
signalled.  The existing barrier implementation (where it works)
provides these requirements. We need barriers to retain these
semantics, otherwise we'll still have to do special stuff in
the filesystems to get the semantics that we need.


one of us is misunderstanding barriers here.

you are understanding barriers to be the same as synchronous writes (and 
therefore the data is on persistent media before the call returns).


I am understanding barriers to only indicate ordering requirements. things 
before the barrier can be reordered freely, things after the barrier can 
be reordered freely, but things cannot be reordered across the barrier.


if I am understanding it correctly, the big win for barriers is that you 
do NOT have to stop and wait until the data is on persistent media before 
you can continue.


in the past barriers have not been fully implemented in most cases, and as 
a result they have been simulated by forcing a full flush of the buffers 
to persistent media before any other writes are allowed. This has made 
them _in practice_ operate the same way as synchronous writes (matching 
your understanding), but the current thread is talking about fixing the 
implementation to match the official semantics for all hardware that can 
actually support barriers (and fixing it at the OS level).


David Lang


Re: RAID SB 1.x autodetection

2007-05-30 Thread Nix
On 29 May 2007, Jan Engelhardt uttered the following:

 from your post at 
 http://www.mail-archive.com/linux-raid@vger.kernel.org/msg07384.html I 
 read that autodetecting arrays with a 1.x superblock is currently 
 impossible. Does it at least work to force the kernel to always assume a 
 1.x sb? There are some 'broken' distros out there that still don't use 
 mdadm in initramfs, and recreating the initramfs each time is a bit 
 cumbersome...

The kernel build system should be able to do that for you, shouldn't it?

-- 
`On a scale of one to ten of usefulness, BBC BASIC was several points ahead
 of the competition, scoring a relatively respectable zero.' --- Peter Corlett


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread Phillip Susi

Phillip Susi wrote:
Hrm... I may have misunderstood the perspective you were talking from. 
Yes, when the bio is completed it must be on the media, but the 
filesystem should issue both requests, and then really not care when 
they complete.  That is to say, the filesystem should not wait for block 
A to finish before issuing block B; it should issue both, and use 
barriers to make sure they hit the disk in the correct order.


Actually now that I think about it, that wasn't correct.  The request 
CAN be completed before the data has hit the medium.  The barrier just 
constrains the ordering of the writes, but they can still sit in the 
disk's write-back cache for some time.


Stefan Bader wrote:

That would be exactly how I understand Documentation/block/barrier.txt:

In other words, I/O barrier requests have the following two properties.
1. Request ordering
...
2. Forced flushing to physical medium

So, I/O barriers need to guarantee that requests actually get written
to non-volatile medium in order. 


I think you misinterpret this, and it probably could be worded a bit 
better.  The barrier request is about constraining the order.  The 
forced flushing is one means to implement that constraint.  The other 
alternative mentioned there is to use ordered tags.  The key part there 
is that requests actually get written to non-volatile medium _in order_, 
not before the request completes, which would be synchronous I/O.




dm-crypt issue

2007-05-30 Thread Bill Davidsen
I'm not sure this is the right place, or that there IS a right place for 
this, but it involves RAID, so... ;-)


I have an array I mount with cryptoloop, to hold some information with 
AES encryption. I keep hearing that the way to do this is with dm-crypt, 
but I can't find anyone who will explain how to do that. If I had the 
luxury of starting over I could use dm-crypt to start from scratch, but 
the only practical solution in terms of time and requirement for 
encrypted backup would be to be able to mount the same partitions, using 
the same encryption, just using dm-crypt. I don't think I can do that, 
but if someone wants to assure me that it can be done and point me to 
documentation, I'll be grateful.


I have a LOT of partial sets of this data in the field on DVD, same 
requirement: I can't replace them, they have to work as is.
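
I can't promise this works for a volume created with cryptoloop - the
cipher mode, key size and passphrase hashing have to match exactly what
the loop setup used - but the dm-crypt equivalent of a legacy encrypted
loop mount is a plain (non-LUKS) mapping, something like:

# Illustrative only: /dev/md2 and every parameter here are guesses and
# MUST be replaced with whatever the original cryptoloop setup used,
# otherwise the mapping will just show garbage.
cryptsetup create --cipher aes-cbc-plain --key-size 128 --hash ripemd160 oldvol /dev/md2
mount -o ro /dev/mapper/oldvol /mnt/test

If a read-only trial mount like that shows the filesystem intact, nothing
on disk needs to change; if it doesn't, the parameters are wrong rather
than the data.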


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: RAID SB 1.x autodetection

2007-05-30 Thread Bill Davidsen

Nix wrote:

On 29 May 2007, Jan Engelhardt uttered the following:

  
from your post at 
http://www.mail-archive.com/linux-raid@vger.kernel.org/msg07384.html I 
read that autodetecting arrays with a 1.x superblock is currently 
impossible. Does it at least work to force the kernel to always assume a 
1.x sb? There are some 'broken' distros out there that still don't use 
mdadm in initramfs, and recreating the initramfs each time is a bit 
cumbersome...



The kernel build system should be able to do that for you, shouldn't it?

  

That would be an improvement, yes.

--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: RAID SB 1.x autodetection

2007-05-30 Thread Jan Engelhardt

On May 30 2007 16:35, Bill Davidsen wrote:
 On 29 May 2007, Jan Engelhardt uttered the following:
  from your post at
  http://www.mail-archive.com/linux-raid@vger.kernel.org/msg07384.html I
  read that autodetecting arrays with a 1.x superblock is currently
  impossible. Does it at least work to force the kernel to always assume
  a 1.x sb? There are some 'broken' distros out there that still don't
  use mdadm in initramfs, and recreating the initramfs each time is a
  bit cumbersome...
 
 The kernel build system should be able to do that for you, shouldn't it?
 
 That would be an improvement, yes.

Hardly, with all the Fedora-specific cruft. Anyway, there was a
simple patch posted in RH bugzilla, so I've gone with that.


Jan
-- 


Re: RAID SB 1.x autodetection

2007-05-30 Thread Nix
On 30 May 2007, Bill Davidsen stated:

 Nix wrote:
 On 29 May 2007, Jan Engelhardt uttered the following:


 from your post at 
 http://www.mail-archive.com/linux-raid@vger.kernel.org/msg07384.html I read 
 that autodetecting arrays with a
 1.x superblock is currently impossible. Does it at least work to force the 
 kernel to always assume a 1.x sb? There are some
 'broken' distros out there that still don't use mdadm in initramfs, and 
 recreating the initramfs each time is a bit cumbersome...

 The kernel build system should be able to do that for you, shouldn't it?

 That would be an improvement, yes.

Allow me to rephrase: the kernel build system *can* do that for you ;)
that is, it can build a gzipped cpio archive from components located
anywhere on the filesystem or arbitrary source located under usr/.
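
For example (the paths below are placeholders), pointing
CONFIG_INITRAMFS_SOURCE at a usr/gen_init_cpio file list that carries a
static mdadm and an init script is enough:

CONFIG_INITRAMFS_SOURCE="/usr/src/initramfs.list"

# /usr/src/initramfs.list, in gen_init_cpio syntax:
dir  /dev                                 0755 0 0
nod  /dev/console                         0600 0 0 c 5 1
dir  /bin                                 0755 0 0
file /bin/mdadm   /sbin/mdadm.static      0755 0 0
file /init        /usr/src/initramfs-init 0755 0 0

The /init script then just assembles the arrays (e.g. with
'mdadm --assemble --scan') and hands over to the real root.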

-- 
`On a scale of one to ten of usefulness, BBC BASIC was several points ahead
 of the competition, scoring a relatively respectable zero.' --- Peter Corlett


Re: Creating RAID1 with bitmap fails

2007-05-30 Thread Neil Brown
On Wednesday May 30, [EMAIL PROTECTED] wrote:
 
 On May 30 2007 22:05, Neil Brown wrote:
  
  the following command strangely gives -EIO ...
  12:27 sun:~ # mdadm -C /dev/md4 -l 1 -n 2 -e 1.0 -b internal /dev/ram0 
  missing
  
  md: md4: raid array is not clean -- starting background reconstruction
  md4: failed to create bitmap (-5)
  md: pers->run() failed ...
  mdadm: RUN_ARRAY failed: Input/output error
  mdadm: stopped /dev/md4
  
  Leaving out -b internal creates the array. /dev/ram0 or /dev/sda5 - EIO 
  happens on both. (But the disk is fine, like ram0)
  Where could I start looking?
  
  Linux sun 2.6.21-1.3149.al3.8smp #3 SMP Wed May 30 09:43:00 CEST 2007 
  sparc64 sparc64 sparc64 GNU/Linux
  mdadm 2.5.4
 
 I'm fairly sure this is fixed in 2.6.2.  It is certainly worth a try.
 
 The same command works on a x86_64 with mdadm 2.5.3...

Are you sure?
I suspect that the difference is more in the kernel version.
mdadm used to create some arrays with the bitmap positioned so that it
overlapped the data.  Recent kernels check for that and reject the
array if there is an overlap.  mdadm-2.6.2 makes sure not to create
any overlap.

NeilBrown


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread David Chinner
On Wed, May 30, 2007 at 09:52:49AM -0700, [EMAIL PROTECTED] wrote:
 On Wed, 30 May 2007, David Chinner wrote:
 with the barrier is on stable storage when I/O completion is
 signalled.  The existing barrier implementation (where it works)
 provides these requirements. We need barriers to retain these
 semantics, otherwise we'll still have to do special stuff in
 the filesystems to get the semantics that we need.
 
 one of us is misunderstanding barriers here.

No, I think we are both on the same level here - it's what
barriers are used for that is not clearly understood, I think.

 you are understanding barriers to be the same as syncronous writes. (and 
 therefor the data is on persistant media before the call returns)

No, I'm describing the high level behaviour that is expected by
a filesystem. The reasons for this are below

 I am understanding barriers to only indicate ordering requirements. things 
 before the barrier can be reordered freely, things after the barrier can 
 be reordered freely, but things cannot be reordered across the barrier.

Ok, that's my understanding of how *device-based barriers* can work,
but there's more to it than that. As far as the filesystem is
concerned, the barrier write needs to *behave* exactly like a sync
write because of the guarantees the filesystem has to provide to
userspace. Specifically: sync, sync writes and fsync().

This is the big problem, right? If we use barriers for commit
writes, the filesystem can return to userspace after a sync write or
fsync() and an *ordered barrier device implementation* may not have
written the blocks to persistent media. If we then pull the plug on
the box, we've just lost data that sync or fsync said was
successfully on disk. That's BAD.

Right now a barrier write on the last block of the fsync/sync write
is sufficient to prevent that because of the FUA on the barrier
block write. A purely ordered barrier implementation does not
provide this guarantee.

This is the crux of my argument - from a filesystem perspective,
there is a *major* difference between a barrier implemented to just
guaranteeing ordering and a barrier implemented with a flush+FUA or
flush+write+flush.

IOWs, there are two parts to the problem:

1 - guaranteeing I/O ordering
2 - guaranteeing blocks are on persistent storage.

Right now, a single barrier I/O is used to provide both of these
guarantees. In most cases, all we really need to provide is 1); the
need for 2) is a much rarer condition but still needs to be
provided.

 if I am understanding it correctly, the big win for barriers is that you 
 do NOT have to stop and wait until the data is on persistant media before 
 you can continue.

Yes, if we define a barrier to only guarantee 1), then yes this
would be a big win (esp. for XFS). But that requires all filesystems
to handle sync writes differently, and sync_blockdev() needs to
call blkdev_issue_flush() as well
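
To spell that out (a sketch only - every *_hyp() helper below is
hypothetical filesystem code, only blkdev_issue_flush() is a real
interface), an fsync path on top of ordering-only barriers would have to
end with an explicit flush before returning to userspace:

/*
 * Sketch: what fsync()-style completion needs if barriers guarantee
 * ordering but not persistence.  The *_hyp() helpers are made up.
 */
static int fsync_with_ordering_only_barriers_hyp(struct super_block *sb)
{
	issue_log_writes_hyp(sb);		/* ordinary async writes      */
	issue_commit_record_barrier_hyp(sb);	/* ordered after the above    */
	wait_for_commit_hyp(sb);		/* completion != stable media */

	/* ordering alone does not satisfy fsync(): force it to media */
	return blkdev_issue_flush(sb->s_bdev, NULL);
}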

So, what do we do here? Do we define a barrier I/O to only provide
ordering, or do we define it to also provide persistent storage
writeback? Whatever we decide, it needs to be documented

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread Neil Brown
On Tuesday May 29, [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
   md/dm modules could keep count of requests as has been suggested
   (though that would be a fairly big change for raid0 as it currently
   doesn't know when a request completes - bi_end_io goes directly to the
   filesystem). 
 
 Are you sure?  I believe that dm handles bi_end_io because it waits for 
 all in-progress bios to complete before switching tables.

I was talking about md/raid0, not dm-stripe.
md/raid0 (and md/linear) currently never know that a request has
completed.

NeilBrown


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread Neil Brown
On Monday May 28, [EMAIL PROTECTED] wrote:
 There are two things I'm not sure you covered.
 
 First, disks which don't support flush but do have a cache dirty 
 status bit you can poll at times like shutdown. If there are no drivers 
 which support these, it can be ignored.

There are really devices like that?  So to implement a flush, you have
to stop sending writes and wait and poll - maybe poll every
millisecond?
That wouldn't be very good for performance... maybe you just
wouldn't bother with barriers on that sort of device?

Which reminds me:  What is the best way to turn off barriers?
Several filesystems have -o nobarrier or -o barrier=0,
or the inverse.
md/raid currently uses barriers to write metadata, and there is no
way to turn that off.  I'm beginning to wonder if that is best.

Maybe barrier support should be a function of the device.  i.e. the
filesystem or whatever always sends barrier requests where it thinks
it is appropriate, and the block device tries to honour them to the
best of its ability, but if you run
   blockdev --enforce-barriers=no /dev/sda
then you lose some reliability guarantees, but gain some throughput (a
bit like the 'async' export option for nfsd).

 
 Second, NAS (including nbd?). Is there enough information to handle this 
 really right?

NAS means lots of things, including NFS and CIFS where this doesn't
apply.
For 'nbd', it is entirely up to the protocol.  If the protocol allows
a barrier flag to be sent to the server, then barriers should just
work.  If it doesn't, then either the server disables write-back
caching, or flushes every request, or you lose all barrier
guarantees. 
For 'iscsi', I guess it works just the same as SCSI...

NeilBrown


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread Neil Brown
On Monday May 28, [EMAIL PROTECTED] wrote:
 On Mon, May 28, 2007 at 12:57:53PM +1000, Neil Brown wrote:
  What exactly do you want to know, and why do you care?
 
 If someone explicitly mounts -o barrier and the underlying device
 cannot do it, then we want to issue a warning or reject the
 mount.

I guess that makes sense.
But apparently you cannot tell what a device supports until you write
to it.
So maybe you need to write some metadata as a barrier, then ask
the device what its barrier status is.  The options might be:
  YES - barriers are fully handled
  NO  - best effort, but due to missing device features, it might not
work
  DISABLED - admin has requested that barriers be ignored.

??
 
 
  The idea is that every struct block_device supports barriers.  If the
  underlying hardware doesn't support them directly, then they get
  simulated by draining the queue and issuing a flush.
 
 Ok. But you also seem to be implying that there will be devices that
 cannot support barriers.

It seems there will always be hardware that doesn't meet specs.  If a
device doesn't support SYNCHRONIZE_CACHE or FUA, then implementing
barriers all the way to the media would be hard..

 
 Even if all devices do eventually support barriers, it may take some
 time before we reach that goal.  Why not start by making it easy to
 determine what the capabilities of each device are. This can then be
 removed once we reach the holy grail

I'd rather not add something that we plan to remove.  We currently
have -EOPNOTSUPP.  I don't think there is much point having more than
that.

I would really like to get to the stage where -EOPNOTSUPP is never
returned.  If a filesystem cares, it could 'ask' as suggested above.
What would be a good interface for asking?
What if the truth changes (as can happen with md or dm)?

NeilBrown


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread Alasdair G Kergon
On Thu, May 31, 2007 at 10:46:04AM +1000, Neil Brown wrote:
 What if the truth changes (as can happen with md or dm)?

You get notified in endio() that the barrier had to be emulated?
 
Alasdair
-- 
[EMAIL PROTECTED]


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread Alasdair G Kergon
On Thu, May 31, 2007 at 10:46:04AM +1000, Neil Brown wrote:
 If a filesystem cares, it could 'ask' as suggested above.
 What would be a good interface for asking?

XFS already tests:
  bd_disk->queue->ordered == QUEUE_ORDERED_NONE

Alasdair
-- 
[EMAIL PROTECTED]


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread David Chinner
On Thu, May 31, 2007 at 02:07:39AM +0100, Alasdair G Kergon wrote:
 On Thu, May 31, 2007 at 10:46:04AM +1000, Neil Brown wrote:
  If a filesystem cares, it could 'ask' as suggested above.
  What would be a good interface for asking?
 
 XFS already tests:
   bd_disk->queue->ordered == QUEUE_ORDERED_NONE

The side effects of removing that check are what started
this whole discussion.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


very strange (maybe) raid1 testing results

2007-05-30 Thread Jon Nelson

I assembled a 3-component raid1 out of 3 4GB partitions.
After syncing, I ran the following script:

for bs in 32 64 128 192 256 384 512 768 1024 ; do \
 let COUNT="2048 * 1024 / ${bs}"; \
 echo -n "${bs}K bs - "; \
 dd if=/dev/md1 of=/dev/null bs=${bs}k count=$COUNT iflag=direct 2>&1 | \
 grep 'copied' ; \
done

I also ran 'dstat' (like iostat) in another terminal. What I noticed was 
very unexpected to me, so I re-ran it several times.  I confirmed my 
initial observation - every time a new dd process ran, *all* of the read 
I/O for that process came from a single disk. It does not (appear to) 
have to do with block size -  if I stop and re-run the script the next 
drive in line will take all of the I/O - it goes sda, sdc, sdb and back 
to sda and so on.

I am getting 70-80MB/s read rates as reported via dstat, and 60-80MB/s 
as reported by dd. What I don't understand is why just one disk is being 
used here, instead of two or more. I tried different versions of 
metadata, and using a bitmap makes no difference. I created the array 
with (allowing for variations of bitmap and metadata version):

mdadm --create --level=1 --raid-devices=3 /dev/md1 /dev/sda3 /dev/sdb3 /dev/sdc3

I am running 2.6.18.8-0.3-default on x86_64, openSUSE 10.2.

Am I doing something wrong or is something weird going on?

--
Jon Nelson [EMAIL PROTECTED]


Re: When does a disk get flagged as bad?

2007-05-30 Thread Mike Accetta
Alberto Alonso writes:
 OK, lets see if I can understand how a disk gets flagged
 as bad and removed from an array. I was under the impression
 that any read or write operation failure flags the drive as
 bad and it gets removed automatically from the array.
 
 However, as I indicated in a prior post I am having problems
 where the array is never degraded. Does an error of type:
 end_request: I/O error, dev sdb, sector 
 not count as a read/write error?

I was also under the impression that any read or write error would
fail the drive out of the array but some recent experiments with error
injecting seem to indicate otherwise at least for raid1.  My working
hypothesis is that only write errors fail the drive.  Read errors appear
to just redirect the sector to a different mirror.

I actually ran across what looks like a bug in the raid1
recovery/check/repair read error logic that I posted about
last week but which hasn't generated any response yet (cf.
http://article.gmane.org/gmane.linux.raid/15354).  This bug results in
sending a zero length write request down to the underlying device driver.
A consequence of issuing a zero length write is that it fails at the
device level, which raid1 sees as a write failure, which then fails the
array.  The fix I proposed actually has the effect of *not* failing the
array in this case since the spurious failing write is never generated.
I'm not sure what is actually supposed to happen in this case.  Hopefully,
someone more knowledgeable will comment soon.
--
Mike Accetta

ECI Telecom Ltd.
Data Networking Division (previously Laurel Networks)


Re: very strange (maybe) raid1 testing results

2007-05-30 Thread Richard Scobie

Jon Nelson wrote:

I am getting 70-80MB/s read rates as reported via dstat, and 60-80MB/s 
as reported by dd. What I don't understand is why just one disk is being 
used here, instead of two or more. I tried different versions of 
metadata, and using a bitmap makes no difference. I created the array 
with (allowing for variations of bitmap and metadata version):


This is normal for md RAID1. What you should find is that for concurrent 
reads, each read will be serviced by a different disk, until no. of 
reads = no. of drives.
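
An easy way to see that (illustrative, reusing the device from the
original post):

# three concurrent sequential readers on the 3-disk raid1 - watch
# dstat/iostat: each stream should be served by a different member disk
for skip in 0 1024 2048; do
    dd if=/dev/md1 of=/dev/null bs=1M count=1024 skip=$skip iflag=direct &
done
wait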


Regards,

Richard


Re: When does a disk get flagged as bad?

2007-05-30 Thread Alberto Alonso
On Wed, 2007-05-30 at 22:28 -0400, Mike Accetta wrote:
 Alberto Alonso writes:
  OK, lets see if I can understand how a disk gets flagged
  as bad and removed from an array. I was under the impression
  that any read or write operation failure flags the drive as
  bad and it gets removed automatically from the array.
  
  However, as I indicated in a prior post I am having problems
  where the array is never degraded. Does an error of type:
  end_request: I/O error, dev sdb, sector 
  not count as a read/write error?
 
 I was also under the impression that any read or write error would
 fail the drive out of the array but some recent experiments with error
 injecting seem to indicate otherwise at least for raid1.  My working
 hypothesis is that only write errors fail the drive.  Read errors appear
 to just redirect the sector to a different mirror.
 
 I actually ran across what looks like a bug in the raid1
 recovery/check/repair read error logic that I posted about
 last week but which hasn't generated any response yet (cf.
 http://article.gmane.org/gmane.linux.raid/15354).  This bug results in
 sending a zero length write request down to the underlying device driver.
 A consequence of issuing a zero length write is that it fails at the
 device level, which raid1 sees as a write failure, which then fails the
 array.  The fix I proposed actually has the effect of *not* failing the
 array in this case since the spurious failing write is never generated.
 I'm not sure what is actually supposed to happen in this case.  Hopefully,
 someone more knowledgeable will comment soon.
 --
 Mike Accetta

I was starting to think that nobody got my posts; I know there
are plenty of people here who understand RAID, yet I didn't get any answers
to any of my related posts.

After thinking about your post, I guess I can see some logic behind
not failing on the read, although I would say that after x amount of
read failures a drive should be kicked out no matter what.

In my case I believe the errors are during writes, which is still
confusing.

Unfortunately I've never done any kind of disk I/O code so I am
afraid of looking at the code and getting completely lost.

Alberto



Re: very strange (maybe) raid1 testing results

2007-05-30 Thread Jon Nelson
On Thu, 31 May 2007, Richard Scobie wrote:

 Jon Nelson wrote:
 
  I am getting 70-80MB/s read rates as reported via dstat, and 60-80MB/s as
  reported by dd. What I don't understand is why just one disk is being used
  here, instead of two or more. I tried different versions of metadata, and
  using a bitmap makes no difference. I created the array with (allowing for
  variations of bitmap and metadata version):
 
 This is normal for md RAID1. What you should find is that for 
 concurrent reads, each read will be serviced by a different disk, 
 until no. of reads = no. of drives.

Alright. To clarify, let's assume some process (like a single-threaded 
webserver) using a raid1 to store content (who knows why, let's just say 
it is), and also assume that the I/O load is 100% reads. Given that the 
server does not fork (or create a thread) for each request, does that 
mean that every single web request is essentially serviced from one 
disk, always? What mechanism determines which disk actually services the 
request?

--
Jon Nelson [EMAIL PROTECTED]


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread Neil Brown
On Monday May 28, [EMAIL PROTECTED] wrote:
 Neil Brown writes:
   
 
 [...]
 
   Thus the general sequence might be:
   
 a/ issue all preceding writes.
 b/ issue the commit write with BIO_RW_BARRIER
 c/ wait for the commit to complete.
If it was successful - done.
If it failed other than with EOPNOTSUPP, abort
else continue
 d/ wait for all 'preceding writes' to complete
 e/ call blkdev_issue_flush
 f/ issue commit write without BIO_RW_BARRIER
 g/ wait for commit write to complete
  if it failed, abort
 h/ call blkdev_issue
 DONE
   
   steps b and c can be left out if it is known that the device does not
   support barriers.  The only way to discover this to try and see if it
   fails.
   
   I don't think any filesystem follows all these steps.
 
 It seems that steps b/ -- h/ are quite generic, and can be implemented
 once in generic code (with some synchronization mechanism like
 wait-queue at d/).

Yes and no.
It depends on what you mean by "preceding write".

If you implement this in the filesystem, the filesystem can wait only
for those writes where it has an ordering dependency.   If you
implement it in common code, then you have to wait for all writes
that were previously issued.

e.g.
  If you have two different filesystems on two different partitions on
  the one device, why should writes in one filesystem wait for a
  barrier issued in the other filesystem?
  If you have a single filesystem with one thread doing lots of
  over-writes (no metadata changes) and another doing lots of
  metadata changes (requiring journalling and barriers), why should the
  data writes be held up by the metadata updates?

So I'm not actually convinced that doing this in common code is the
best approach.  But it is the easiest.  The common code should provide
the barrier and flushing primitives, and the filesystem gets to use
them however it likes.
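
For what it's worth, compressed into code the quoted b/ - h/ sequence looks
roughly like this (a sketch: all the *_hyp() helpers are hypothetical
filesystem-side code, only blkdev_issue_flush() is a real block-layer call):

static int commit_with_barrier_fallback_hyp(struct block_device *bdev)
{
	int err;

	err = submit_commit_write_barrier_hyp(bdev);	/* b */
	if (!err)
		err = wait_for_commit_hyp(bdev);	/* c */
	if (!err)
		return 0;				/* done */
	if (err != -EOPNOTSUPP)
		return err;				/* abort */

	/* no barrier support: fall back to drain + flush */
	wait_for_preceding_writes_hyp(bdev);		/* d */
	err = blkdev_issue_flush(bdev, NULL);		/* e */
	if (err)
		return err;
	err = submit_commit_write_plain_hyp(bdev);	/* f */
	if (!err)
		err = wait_for_commit_hyp(bdev);	/* g */
	if (err)
		return err;				/* abort */
	return blkdev_issue_flush(bdev, NULL);		/* h */
}

And as said above, where the waiting in step d/ happens - per-dependency in
the filesystem, or for everything previously issued in common code - is
exactly the open question.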

NeilBrown


Re: Creating RAID1 with bitmap fails

2007-05-30 Thread Jan Engelhardt

On May 31 2007 09:09, Neil Brown wrote:
  the following command strangely gives -EIO ...
  12:27 sun:~ # mdadm -C /dev/md4 -l 1 -n 2 -e 1.0 -b internal /dev/ram0 
  missing
  Where could I start looking?
  
  Linux sun 2.6.21-1.3149.al3.8smp #3 SMP Wed May 30 09:43:00 CEST 2007 
  sparc64 sparc64 sparc64 GNU/Linux
  mdadm 2.5.4
 
 I'm fairly sure this is fixed in 2.6.2.  It is certainly worth a try.
 
 The same command works on a x86_64 with mdadm 2.5.3...
  [ with 2.6.18.8 ]

Are you sure?
I suspect that the difference is more in the kernel version.
mdadm used to create some arrays with the bitmap positioned so that it
overlapped the data.  Recent kernels check for that and reject the
array if there is an overlap.  mdadm-2.6.2 makes sure not to create
any overlap.

Regarding the array created above with x86_64/2.5.3/2.6.18.8, is there a way to
check whether it overlaps?


Jan
-- 


Re: Md corruption using RAID10 on linux-2.6.21

2007-05-30 Thread Neil Brown
On Wednesday May 30, [EMAIL PROTECTED] wrote:
 Neil, I sent the scripts to you. Any update on this issue?

Sorry, I got distracted.

Your scripts are way more complicated than needed.  Most of the logic
in there is already in mdadm.

   mdadm --assemble /dev/md_d0 --run --uuid=$BOOTUUID /dev/sd[abcd]2

can replace most of it.  And you don't need to wait for resync to
complete before mounting filesystems.

That said: I cannot see anything in your script that would actually do
the wrong thing.

Hmmm... I see now I wasn't quite testing the right thing.  I need to
trigger a resync with one device missing.
i.e
  mdadm -C /dev/md0 -l10 -n4 -p n3 /dev/sd[abcd]1
  mkfs /dev/md0
  mdadm /dev/md0 -f /dev/sda1
  mdadm -S /dev/md0
  mdadm -A /dev/md0 -R --update=resync /dev/sd[bcd]1
  fsck -f /dev/md0

This fails just as you say.
Following patch fixes it as well as another problem I found while
doing this testing.

Thanks for pursuing this.

NeilBrown

diff .prev/drivers/md/raid10.c ./drivers/md/raid10.c
--- .prev/drivers/md/raid10.c   2007-05-21 11:18:23.0 +1000
+++ ./drivers/md/raid10.c   2007-05-31 15:11:42.0 +1000
@@ -1866,6 +1866,7 @@ static sector_t sync_request(mddev_t *md
int d = r10_bio->devs[i].devnum;
bio = r10_bio->devs[i].bio;
bio->bi_end_io = NULL;
+   clear_bit(BIO_UPTODATE, &bio->bi_flags);
if (conf->mirrors[d].rdev == NULL ||
test_bit(Faulty, &conf->mirrors[d].rdev->flags))
continue;
@@ -2036,6 +2037,11 @@ static int run(mddev_t *mddev)
/* 'size' is now the number of chunks in the array */
/* calculate used chunks per device in 'stride' */
stride = size * conf->copies;
+
+   /* We need to round up when dividing by raid_disks to
+* get the stride size.
+*/
+   stride += conf->raid_disks - 1;
sector_div(stride, conf->raid_disks);
mddev->size = stride << (conf->chunk_shift-1);
 
