Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)

2007-12-10 Thread Tejun Heo
Bill Davidsen wrote:
 Jan Engelhardt wrote:
 On Dec 1 2007 06:26, Justin Piszcz wrote:
 I ran the following:

 dd if=/dev/zero of=/dev/sdc
 dd if=/dev/zero of=/dev/sdd
 dd if=/dev/zero of=/dev/sde

 (as it is always a very good idea to do this with any new disk)

 Why would you care about what's on the disk? fdisk, mkfs and
 the day-to-day operation will overwrite it _anyway_.

 (If you think the disk is not empty, you should look at it
 and copy off all usable warez beforehand :-)

 Do you not test your drives for minimum functionality before using them?

I personally don't.

 Also, if you have the tools to check for relocated sectors before and
 after doing this, that's a good idea as well. S.M.A.R.T is your friend.
 And when writing /dev/zero to a drive, if it craps out you have less
 emotional attachment to the data.

Writing all zeros isn't too useful tho.  A drive failing reallocation on
write is a catastrophic failure.  It means that the drive wants to
relocate a sector but can't because it has used up all its spare space,
which usually indicates something else is seriously wrong with the drive.
The drive will have to go to the trash can.  This is all serious and bad,
but the catch is that in such cases the problem usually sticks out like a
sore thumb, so either the vendor doesn't ship such a drive or you'll find
the failure very early.  I personally haven't seen any such failure yet.
Maybe I'm lucky.

Most data loss occurs when the drive fails to read what it thought it
wrote successfully, so doing the opposite, periodically reading the whole
disk and dumping it to /dev/null, is probably much better than writing
zeros: it lets the drive find a deteriorating sector early, while it's
still readable, and relocate it.  But then again, I think it's overkill.
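
(For what it's worth, if you do want to do that, something as simple as
the following is enough -- /dev/sdc is just an example and the exact
SMART attribute names vary between vendors:

  # read the whole surface, discard the data
  dd if=/dev/sdc of=/dev/null bs=1M
  # compare the reallocation counters before and after
  smartctl -A /dev/sdc | grep -i -e reallocated -e pending
)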

Writing zeros to sectors is more useful as a cure than as prevention.
If your drive fails to read a sector, write any value to that sector.
The drive will forget about the data on the damaged sector, reallocate
it and write the new data to the replacement.  Of course, you lose the
data which was originally on the sector.
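
(Something along these lines usually does it; /dev/sdc and LBA 123456
are made-up examples, use the sector number reported in dmesg or in the
SMART self-test log and double-check the target device first:

  # overwrite exactly one 512-byte sector at LBA 123456
  dd if=/dev/zero of=/dev/sdc bs=512 count=1 seek=123456 oflag=direct
)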

I personally think it's enough to just throw in an extra disk, make it
RAID 1 or 5, and rebuild the array if a read fails on one of the disks.
If a write fails or read failures continue, replace the disk.  Of course,
if you wanna be extra cautious, good for you.  :-)
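
(With md that boils down to something like the following; md0, sdc1 and
sde1 are placeholders:

  mdadm /dev/md0 --fail /dev/sdc1 --remove /dev/sdc1
  mdadm /dev/md0 --add /dev/sde1      # rebuild starts automatically
)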

-- 
tejun


Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)

2007-12-10 Thread Tejun Heo
Justin Piszcz wrote:
 The badblocks run did not do anything; however, when I built a software
 raid 5 and then performed a dd:
 
 /usr/bin/time dd if=/dev/zero of=fill_disk bs=1M
 
 [42332.936615] ata5.00: exception Emask 0x2 SAct 0x7000 SErr 0x0 action
 0x2 frozen
 [42332.936706] ata5.00: spurious completions during NCQ issue=0x0
 SAct=0x7000 FIS=004040a1:0800
 
 Next test, I will turn off NCQ and try to make the problem re-occur.
 If anyone else has any thoughts here..?
 I ran long smart tests on all 3 disks, they all ran successfully.
 
 Perhaps these drives need to be NCQ BLACKLISTED with the P35 chipset?

That was me being stupid.  Patches for both upstream and -stable
branches are posted.  These will go away.
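
(If you want to rule NCQ out in the meantime, it can be disabled per
device at runtime by shrinking the queue depth; sdc is just an example:

  echo 1 > /sys/block/sdc/device/queue_depth
)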

Thanks.

-- 
tejun


Re: Possible data corruption sata_sil24?

2007-07-19 Thread Tejun Heo
David Shaw wrote:
 I'm not sure whether this is a problem of sata_sil24 or the dm layer.
 Cc'ing linux-raid for help.  How much memory do you have?  One big
 difference between ata_piix and sata_sil24 is that sil24 can handle
 64bit DMA.  Maybe DMA mapping or something interacts weirdly with dm
 there?
 
 The machine has 640 megs of RAM.  FWIW, I tried this with 512 megs of
 RAM with the same results.  Running Memtest86+ shows the memory is
 good.

Hmmm... I see, so it's not a DMA-to-the-wrong-address problem then.
Let's see whether the dm people can help us out.

Thanks.

-- 
tejun


Re: Possible data corruption sata_sil24?

2007-07-18 Thread Tejun Heo
David Shaw wrote:
 It fails whether I use a raw /dev/sdd or partition it into one large
 /dev/sdd1, or partition into multiple partitions.  sata_sil24 seems to
 work by itself, as does dm, but as soon as I mix sata_sil24+dm, I get
 corruption.
 Hmmm... Can you reproduce the corruption by accessing both devices
 simultaneously without using dm?  Considering ich5 does fine, it looks
 like a hardware and/or driver problem and I really wanna rule out dm.
 
 I think I wasn't clear enough before.  The corruption happens when I
 use dm to create two dm mappings that both reside on the same real
 device.  Using two different devices, or two different partitions on
 the same physical device works properly.  ich5 does fine with these 3
 tests, but sata_sil24 fails:
 
  * /dev/sdd, create 2 dm linear mappings on it, mke2fs and use those
dm devices == corruption
 
  * Partition /dev/sdd into /dev/sdd1 and /dev/sdd2, mke2fs and use
those partitions == no corruption
 
  * Partition /dev/sdd into /dev/sdd1 and /dev/sdd2, create 2 dm linear
mappings on /dev/sdd1, mke2fs and use those dm devices ==
corruption
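
(For anyone who wants to reproduce this, the failing configuration is
roughly the following; the sizes and offsets are illustrative, not the
exact ones David used:

  # two linear targets back to back on the same disk
  echo '0 2097152 linear /dev/sdd 0'       | dmsetup create crash0
  echo '0 2097152 linear /dev/sdd 2097152' | dmsetup create crash1
  mke2fs /dev/mapper/crash0
  mke2fs /dev/mapper/crash1
)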

I'm not sure whether this is a problem of sata_sil24 or the dm layer.
Cc'ing linux-raid for help.  How much memory do you have?  One big
difference between ata_piix and sata_sil24 is that sil24 can handle
64bit DMA.  Maybe DMA mapping or something interacts weirdly with dm
there?

Thanks.

-- 
tejun


[PATCH] block: cosmetic changes

2007-07-18 Thread Tejun Heo
Cosmetic changes.  This is taken from Jens' zero-length barrier patch.

Signed-off-by: Tejun Heo [EMAIL PROTECTED]
Cc: Jens Axboe [EMAIL PROTECTED]
---
 block/ll_rw_blk.c |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Index: work/block/ll_rw_blk.c
===
--- work.orig/block/ll_rw_blk.c
+++ work/block/ll_rw_blk.c
@@ -443,7 +443,8 @@ static inline struct request *start_orde
        rq_init(q, rq);
        if (bio_data_dir(q->orig_bar_rq->bio) == WRITE)
                rq->cmd_flags |= REQ_RW;
-       rq->cmd_flags |= q->ordered & QUEUE_ORDERED_FUA ? REQ_FUA : 0;
+       if (q->ordered & QUEUE_ORDERED_FUA)
+               rq->cmd_flags |= REQ_FUA;
        rq->elevator_private = NULL;
        rq->elevator_private2 = NULL;
        init_request_from_bio(rq, q->orig_bar_rq->bio);
@@ -3167,7 +3168,7 @@ end_io:
break;
}
 
-       if (unlikely(bio_sectors(bio) > q->max_hw_sectors)) {
+       if (unlikely(nr_sectors > q->max_hw_sectors)) {
                printk("bio too big device %s (%u > %u)\n",
                        bdevname(bio->bi_bdev, b),
                        bio_sectors(bio),


[PATCH] block: factor out bio_check_eod()

2007-07-18 Thread Tejun Heo
End of device check is done twice in __generic_make_request() and it's
fully inlined each time.  Factor out bio_check_eod().

This is taken from Jens' zero-length barrier patch.

Signed-off-by: Tejun Heo [EMAIL PROTECTED]
Cc: Jens Axboe [EMAIL PROTECTED]
---
 block/ll_rw_blk.c |   63 --
 1 file changed, 33 insertions(+), 30 deletions(-)

Index: work/block/ll_rw_blk.c
===
--- work.orig/block/ll_rw_blk.c
+++ work/block/ll_rw_blk.c
@@ -3094,6 +3094,35 @@ static inline int should_fail_request(st
 
 #endif /* CONFIG_FAIL_MAKE_REQUEST */
 
+/*
+ * Check whether this bio extends beyond the end of the device.
+ */
+static int bio_check_eod(struct bio *bio, unsigned int nr_sectors)
+{
+   sector_t maxsector;
+
+   if (!nr_sectors)
+   return 0;
+
+   /* Test device or partition size, when known. */
+       maxsector = bio->bi_bdev->bd_inode->i_size >> 9;
+       if (maxsector) {
+               sector_t sector = bio->bi_sector;
+
+               if (maxsector < nr_sectors || maxsector - nr_sectors < sector) {
+   /*
+* This may well happen - the kernel calls bread()
+* without checking the size of the device, e.g., when
+* mounting a device.
+*/
+   handle_bad_sector(bio);
+   return 1;
+   }
+   }
+
+   return 0;
+}
+
 /**
  * generic_make_request: hand a buffer to its device driver for I/O
  * @bio:  The bio describing the location in memory and on the device.
@@ -3121,27 +3150,14 @@ static inline int should_fail_request(st
 static inline void __generic_make_request(struct bio *bio)
 {
request_queue_t *q;
-   sector_t maxsector;
sector_t old_sector;
int ret, nr_sectors = bio_sectors(bio);
dev_t old_dev;
 
might_sleep();
-   /* Test device or partition size, when known. */
-       maxsector = bio->bi_bdev->bd_inode->i_size >> 9;
-       if (maxsector) {
-               sector_t sector = bio->bi_sector;
-
-               if (maxsector < nr_sectors || maxsector - nr_sectors < sector) {
-   /*
-* This may well happen - the kernel calls bread()
-* without checking the size of the device, e.g., when
-* mounting a device.
-*/
-   handle_bad_sector(bio);
-   goto end_io;
-   }
-   }
+   if (bio_check_eod(bio, nr_sectors))
+   goto end_io;
 
/*
 * Resolve the mapping until finished. (drivers are
@@ -3197,21 +3213,8 @@ end_io:
        old_sector = bio->bi_sector;
        old_dev = bio->bi_bdev->bd_dev;
 
-               maxsector = bio->bi_bdev->bd_inode->i_size >> 9;
-               if (maxsector) {
-                       sector_t sector = bio->bi_sector;
-
-                       if (maxsector < nr_sectors ||
-                           maxsector - nr_sectors < sector) {
-   /*
-* This may well happen - partitions are not
-* checked to make sure they are within the size
-* of the whole device.
-*/
-   handle_bad_sector(bio);
-   goto end_io;
-   }
-   }
+   if (bio_check_eod(bio, nr_sectors))
+   goto end_io;
 
                ret = q->make_request_fn(q, bio);
} while (ret);


Re: [PATCH] block: factor out bio_check_eod()

2007-07-18 Thread Tejun Heo
Jens Axboe wrote:
 On Wed, Jul 18 2007, Tejun Heo wrote:
 End of device check is done twice in __generic_make_request() and it's
 fully inlined each time.  Factor out bio_check_eod().
 
 Tejun, yeah I should separate the cleanups and put them in the upstream
 branch. Will do so and add your signed-off to both of them.
 

Would they be different from the one I just posted?  No big deal either
way.  I'm just basing the zero-length barrier on top of these patches.
Oh well, the changes are trivial anyway.

-- 
tejun


Re: [PATCH] block: factor out bio_check_eod()

2007-07-18 Thread Tejun Heo
Jens Axboe wrote:
 On Wed, Jul 18 2007, Tejun Heo wrote:
 Jens Axboe wrote:
 On Wed, Jul 18 2007, Tejun Heo wrote:
 End of device check is done twice in __generic_make_request() and it's
 fully inlined each time.  Factor out bio_check_eod().
 Tejun, yeah I should separate the cleanups and put them in the upstream
 branch. Will do so and add your signed-off to both of them.

 Would they be different from the one I just posted?  No big deal either
 way.  I'm just basing the zero-length barrier on top of these patches.
 Oh well, the changes are trivial anyway.
 
 This one ended up being the same, but in the first one you missed some
 of the cleanups. I ended up splitting the patch some more though, see
 the series:
 
 http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=barrier

Alright, will base on 662d5c5e6afb79d05db5563205b809c0de530286.  Thanks.

-- 
tejun


Re: [PATCH] block: factor out bio_check_eod()

2007-07-18 Thread Tejun Heo
Jens Axboe wrote:
 On Wed, Jul 18 2007, Tejun Heo wrote:
 Jens Axboe wrote:
 On Wed, Jul 18 2007, Tejun Heo wrote:
 Jens Axboe wrote:
 On Wed, Jul 18 2007, Tejun Heo wrote:
 End of device check is done twice in __generic_make_request() and it's
 fully inlined each time.  Factor out bio_check_eod().
 Tejun, yeah I should separate the cleanups and put them in the upstream
 branch. Will do so and add your signed-off to both of them.

 Would they be different from the one I just posted?  No big deal either
 way.  I'm just basing the zero-length barrier on top of these patches.
 Oh well, the changes are trivial anyway.
 This one ended up being the same, but in the first one you missed some
 of the cleanups. I ended up splitting the patch some more though, see
 the series:

 http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=barrier
 Alright, will base on 662d5c5e6afb79d05db5563205b809c0de530286.  Thanks.
 
 1781c6a39fb6e31836557618c4505f5f7bc61605, no? Unless you want to rewrite
 it completely :-)

I think I'll start from 662d5c5e and steal most parts from 1781c6a3.  I
like stealing, you know. :-)  I think 1781c6a3 can also use splitting:
zero-length barrier implementation and issue_flush conversion.

Anyways, how do I pull from git.kernel.dk?
git://git.kernel.dk/linux-2.6-block.git gives me connection reset by server.

Thanks.

-- 
tejun


Re: [PATCH] block: factor out bio_check_eod()

2007-07-18 Thread Tejun Heo
Jens Axboe wrote:
 somewhat annoying, I'll see if I can prefix it with git-daemon in the
 future.
 
 OK, now skip the /data/git/ stuff and just use
 
 git://git.kernel.dk/linux-2.6-block.git

Alright, it works like a charm now.  Thanks.

-- 
tejun


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-10 Thread Tejun Heo
[EMAIL PROTECTED] wrote:
 On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said:
 
 All of the high end arrays have non-volatile cache (read, on power loss,
 it is a promise that it will get all of your data out to permanent
 storage).  You don't need to ask this kind of array to drain the cache.
 In fact, it might just ignore you if you send it that kind of request ;-)
 
 OK, I'll bite - how does the kernel know whether the other end of that
 fiberchannel cable is attached to a DMX-3 or to some no-name product that
 may not have the same assurances?  Is there an "I'm a high-end array" bit
 in the sense data that I'm unaware of?

Well, the array just has to tell the kernel that it doesn't do write-back
caching.  The kernel automatically selects ORDERED_DRAIN in that case.
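
(You can see what a given disk reported via the sd cache_type attribute,
e.g.

  cat /sys/class/scsi_disk/0:0:0:0/cache_type

where 0:0:0:0 is just an example SCSI address; "write through" means the
kernel will pick ORDERED_DRAIN for it.)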

-- 
tejun


Re: Linux Software RAID is really RAID?

2007-07-03 Thread Tejun Heo
Mark Lord wrote:
 I believe he said it was ICH5 (different post/thread).
 
 My observation on ICH5 is that if one unplugs a drive,
 then the chipset/cpu locks up hard when toggling SRST
 in the EH code.
 
 Specifically, it locks up at the instruction
 which restores SRST back to the non-asserted state,
 which likely corresponds to the chipset finally actually
 sending a FIS to the drive.
 
 A hard(ware) lockup, not software.
 That's why Intel says ICH5 doesn't do hotplug.

OIC.  I don't think there's much left to do from the driver side then.
Or is there any workaround?

-- 
tejun


Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume

2007-07-02 Thread Tejun Heo
David Greaves wrote:
 Tejun Heo wrote:
 It's really weird tho.  The PHY RDY status changed events are coming
 from the device which is NOT used while resuming
 
 There is an obvious problem there though Tejun (the errors even when sda
 isn't involved in the OS boot) - can I start another thread about that
 issue/bug later? I need to reshuffle partitions so I'd rather get the
 hibernate working first and then go back to it if that's OK?

Yeah, sure.  The problem is that we don't know whether or how those two
are related.  It would be great if there were a way to verify that the
memory image read back during hibernation is intact.  Rafael, any ideas?

Thanks.

-- 
tejun


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-02 Thread Tejun Heo
Hello,

Jens Axboe wrote:
 Would that be very different from issuing barrier and not waiting for
 its completion?  For ATA and SCSI, we'll have to flush write back cache
 anyway, so I don't see how we can get performance advantage by
 implementing separate WRITE_ORDERED.  I think zero-length barrier
 (haven't looked at the code yet, still recovering from jet lag :-) can
 serve as genuine barrier without the extra write tho.
 
 As always, it depends :-)
 
 If you are doing pure flush barriers, then there's no difference. Unless
 you only guarantee ordering wrt previously submitted requests, in which
 case you can eliminate the post flush.
 
 If you are doing ordered tags, then just setting the ordered bit is
 enough. That is different from the barrier in that we don't need a flush
 or FUA bit set.

Hmmm... I'm feeling dense.  Zero-length barrier also requires only one
flush to separate requests before and after it (haven't looked at the
code yet, will soon).  Can you enlighten me?

Thanks.

-- 
tejun



Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-01 Thread Tejun Heo
[ cc'ing Ric Wheeler for storage array thingie.  Hi, whole thread is at
http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/3344 ]

Hello,

[EMAIL PROTECTED] wrote:
 but when you consider the self-contained disk arrays it's an entirely
 different story. you can easily have a few gig of cache and a complete
 OS pretending to be a single drive as far as you are concerned.
 
 and the price of such devices is plummeting (in large part thanks to
 Linux moving into this space), you can now readily buy a 10TB array for
 $10k that looks like a single drive.

Don't those thingies usually have NV cache or are backed by battery such
that ORDERED_DRAIN is enough?

The problem is that the interface between the host and a storage device
(ATA or SCSI) is not built to communicate that kind of information
(grouped flush, relaxed ordering...).  I think battery-backed
ORDERED_DRAIN combined with fine-grained host queue flush would be
pretty good.  It doesn't require some fancy new interface which isn't
gonna be used widely anyway, and it can achieve most of the performance
gain if the storage plays it smart.

Thanks.

-- 
tejun


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Tejun Heo
Jens Axboe wrote:
 On Thu, May 31 2007, David Chinner wrote:
 On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
 On Thu, May 31 2007, David Chinner wrote:
 IOWs, there are two parts to the problem:

1 - guaranteeing I/O ordering
2 - guaranteeing blocks are on persistent storage.

 Right now, a single barrier I/O is used to provide both of these
 guarantees. In most cases, all we really need to provide is 1); the
 need for 2) is a much rarer condition but still needs to be
 provided.

 if I am understanding it correctly, the big win for barriers is that you 
 do NOT have to stop and wait until the data is on persistent media before
 you can continue.
 Yes, if we define a barrier to only guarantee 1), then yes this
 would be a big win (esp. for XFS). But that requires all filesystems
 to handle sync writes differently, and sync_blockdev() needs to
 call blkdev_issue_flush() as well

 So, what do we do here? Do we define a barrier I/O to only provide
 ordering, or do we define it to also provide persistent storage
 writeback? Whatever we decide, it needs to be documented
 The block layer already has a notion of the two types of barriers; with
 a very small amount of tweaking we could expose that. There's absolutely
 zero reason we can't easily support both types of barriers.
 That sounds like a good idea - we can leave the existing
 WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
 behaviour that only guarantees ordering. The filesystem can then
 choose which to use where appropriate
 
 Precisely. The current definition of barriers is what Chris and I came
 up with many years ago, when solving the problem for reiserfs
 originally. It is by no means the only feasible approach.
 
 I'll add a WRITE_ORDERED command to the #barrier branch, it already
 contains the empty-bio barrier support I posted yesterday (well a
 slightly modified and cleaned up version).

Would that be very different from issuing barrier and not waiting for
its completion?  For ATA and SCSI, we'll have to flush write back cache
anyway, so I don't see how we can get performance advantage by
implementing separate WRITE_ORDERED.  I think zero-length barrier
(haven't looked at the code yet, still recovering from jet lag :-) can
serve as genuine barrier without the extra write tho.

Thanks.

-- 
tejun


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Tejun Heo
Stefan Bader wrote:
 2007/5/30, Phillip Susi [EMAIL PROTECTED]:
 Stefan Bader wrote:
 
  Since drive a supports barrier request we don't get -EOPNOTSUPP but
  the request with block y might get written before block x since the
  disks are independent. I guess the chances of this are quite low since
  at some point a barrier request will also hit drive b but for the time
  being it might be better to indicate -EOPNOTSUPP right from
  device-mapper.

 The device mapper needs to ensure that ALL underlying devices get a
 barrier request when one comes down from above, even if it has to
 construct zero length barriers to send to most of them.

 
 And somehow also make sure all of the barriers have been processed
 before returning the barrier that came in. Plus it would have to queue
 all mapping requests until the barrier is done (if strictly acting
 according to barrier.txt).
 
 But I am wondering a bit whether the requirements on barriers are
 really as tight as described in Tejun's document (a barrier request is
 only started if everything before is safe, the barrier itself isn't
 returned until it is safe, too, and all requests after the barrier
 aren't started before the barrier is done). Is it really necessary to
 defer any further requests until the barrier has been written to save
 storage? Or would it be sufficient to guarantee that, if a barrier
 request returns, everything up to (including the barrier) is on safe
 storage?

Well, what's described in barrier.txt is the currently implemented
semantics and what filesystems expect, so we can't change it underneath
them, but we can definitely introduce new, more relaxed variants.  One
thing we should bear in mind is that harddisks don't have humongous
caches or very smart controllers / instruction sets.  No matter how
relaxed an interface the block layer provides, in the end it just has to
issue a whole-sale FLUSH CACHE on the device to guarantee data ordering
on the media.

IMHO, we can do better by paying more attention to how we do things in
the request queue, which can be deeper and more intelligent than the
device queue.

Thanks.

-- 
tejun


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-26 Thread Tejun Heo
Hello, Neil Brown.

Please cc me on blkdev barriers and, if you haven't yet, reading
Documentation/block/barrier.txt can be helpful too.

Neil Brown wrote:
[--snip--]
 1/ SAFE.  With a SAFE device, there is no write-behind cache, or if
   there is it is non-volatile.  Once a write completes it is 
   completely safe.  Such a device does not require barriers
   or ->issue_flush_fn, and can respond to them either by a
 no-op or with -EOPNOTSUPP (the former is preferred).
 
 2/ FLUSHABLE.
   A FLUSHABLE device may have a volatile write-behind cache.
   This cache can be flushed with a call to blkdev_issue_flush.
 It may not support barrier requests.
 
 3/ BARRIER.
 A BARRIER device supports both blkdev_issue_flush and
   BIO_RW_BARRIER.  Either may be used to synchronise any
 write-behind cache to non-volatile storage (media).
 
 Handling of SAFE and FLUSHABLE devices is essentially the same and can
 work on a BARRIER device.  The BARRIER device has the option of more
 efficient handling.

Actually, all above three are handled by blkdev flush code.

 How does a filesystem use this?
 ===
 
[--snip--]
 2/ Set the BIO_RW_BARRIER bit in the write request for the commit
 block.
(This is more efficient on BARRIER).

This really should be enough.

 HOW DO MD or DM USE THIS
 
 
 1/ striping devices.
  This includes md/raid0 md/linear dm-linear dm-stripe and probably
  others. 
 
These devices can easily support blkdev_issue_flush by simply
calling blkdev_issue_flush on all component devices.
 
These devices would find it very hard to support BIO_RW_BARRIER.
Doing this would require keeping track of all in-flight requests
(which some, possibly all, of the above don't) and then:
  When a BIO_RW_BARRIER request arrives:
 wait for all pending writes to complete
 call blkdev_issue_flush on all devices
 issue the barrier write to the target device(s)
as BIO_RW_BARRIER,
  if that is -EOPNOTSUPP, re-issue, wait, flush.

Hmm... What do you think about introducing zero-length BIO_RW_BARRIER
for this case?

 2/ Mirror devices.  This includes md/raid1 and dm-raid1.
 
These device can trivially implement blkdev_issue_flush much like
the striping devices, and can support BIO_RW_BARRIER to some
extent.
md/raid1 currently tries.  I'm not sure about dm-raid1.
 
md/raid1 determines if the underlying devices can handle
    BIO_RW_BARRIER.  If any cannot, it rejects such requests (-EOPNOTSUPP)
itself.
If all underlying devices do appear to support barriers, md/raid1
will pass a barrier-write down to all devices.
The difficulty comes if it fails on one device, but not all
devices.  In this case it is not clear what to do.  Failing the
    request is a lie, because some data has been written (possibly too
early).  Succeeding the request (after re-submitting the failed
requests) is also a lie as the barrier wasn't really honoured.
md/raid1 currently takes the latter approach, but will only do it
once - after that it fails all barrier requests.
 
Hopefully this is unlikely to happen.  What device would work
correctly with barriers once, and then not the next time?
The answer is md/raid1.  If you remove a failed device and add a
new device that doesn't support barriers, md/raid1 will notice and
stop supporting barriers.
If md/raid1 can change from supporting barrier to not, then maybe
some other device could too?
 
I'm not sure what to do about this - maybe just ignore it...

That sounds good.  :-)

 3/ Other modules
 
Other md and dm modules (raid5, mpath, crypt) do not add anything
interesting to the above.  Either handling BIO_RW_BARRIER is
trivial, or extremely difficult.
 
 HOW DO LOW LEVEL DEVICES HANDLE THIS
 
 
 This is part of the picture that I haven't explored greatly.  My
 feeling is that most if not all devices support blkdev_issue_flush
 properly, and support barriers reasonably well providing that the
 hardware does.
 There is an exception I recently found though.
 For devices that don't support QUEUE_ORDERED_TAG (i.e. commands sent to
 the controller can be tagged as barriers), SCSI will use the
 SYNCHRONIZE_CACHE command to flush the cache after the barrier
 request (a bit like the filesystem calling blkdev_issue_flush, but at
 a lower level). However it does this without setting the SYNC_NV bit.
 This means that a device with a non-volatile cache will be required --
 needlessly -- to flush that cache to media.

Yeah, it probably needs updating but some devices might react badly too.

 So: some questions to help encourage response:
 
  - Is the above substantial correct?  Totally correct?
  - Should the various filesystems be fixed as suggested above?  Is 
 someone willing to do 

Re: Kernel 2.6.20.4: Software RAID 5: ata13.00: (irq_stat 0x00020002, failed to transmit command FIS)

2007-04-09 Thread Tejun Heo
Justin Piszcz wrote:
 
 
 On Thu, 5 Apr 2007, Justin Piszcz wrote:
 
 Had a quick question, this is the first time I have seen this happen,
 and it was not even under heavy I/O; hardly anything was going
 on with the box at the time.
 
 .. snip ..
 
 # /usr/bin/time badblocks -b 512 -s -v -w /dev/sdl
 Checking for bad blocks in read-write mode
 From block 0 to 293046768
 Testing with pattern 0xaa: done
 Reading and comparing: done
 
 Not a single bad block on the drive so far.  I have not changed anything
 on the box physically, with the exception of updating the BIOS to V1666
 for an Intel P965 motherboard (DG965WHMKR).  Any idea what or why this
 happened?  Is it a kernel or actual HW issue?  What caused this to
 occur?  Any thoughts or ideas?

My bet is on the harddisk firmware acting weird.  You can prove this by
reconnecting the disk to a known working port without cutting power.  Or,
you can prove that the original port works by removing the harddisk and
putting in a different one.  You'll need to issue a manual scan using the
SCSI sysfs node.  It's very rare, but some drives do choke like that.
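
(The manual scan is just a write to the host's sysfs scan node, e.g.

  echo '- - -' > /sys/class/scsi_host/hostX/scan

with hostX replaced by the host the port belongs to.)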

-- 
tejun


Re: 2.6.20.3 AMD64 oops in CFQ code

2007-04-04 Thread Tejun Heo
Lee Revell wrote:
 On 4/4/07, Bill Davidsen [EMAIL PROTECTED] wrote:
 I won't say that's voodoo, but if I ever did it I'd wipe down my
 keyboard with holy water afterward. ;-)

 Well, I did save the message in my tricks file, but it sounds like a
 last-ditch effort after something goes very wrong.

Which actually is true.  ATA ports failing to reset indicate something
is very wrong.  Either the attached device or the controller is broken,
and libata shuts down the port to protect the rest of the system from
it.  The manual scan request tells libata to give it one more shot, and
polling hotplug can do that automatically.  Anyways, this shouldn't
happen unless you have a broken piece of hardware.

 Would it really be an impediment to development if the kernel
 maintainers simply refused to merge patches that add new sysfs entries
 without corresponding documentation?

SCSI host scan nodes have been there for a long time.  I think it's
documented somewhere.

-- 
tejun


Re: 2.6.20.3 AMD64 oops in CFQ code

2007-04-03 Thread Tejun Heo
[EMAIL PROTECTED] wrote:
 [EMAIL PROTECTED] wrote:
 Anyway, what's annoying is that I can't figure out how to bring the
 drive back on line without resetting the box.  It's in a hot-swap enclosure,
 but power cycling the drive doesn't seem to help.  I thought libata hotplug
 was working?  (SiI3132 card, using the sil24 driver.)
 
 Yeah, it's working, but failing resets are considered highly dangerous
 (in that the controller status is unknown and may cause something
 dangerous like screaming interrupts) and the port is muted after that.  The
 plan is to handle this with polling hotplug such that libata tries to
 revive the port if a PHY status change is detected by polling.  Patches
 are available but they need other things to be resolved to get integrated.
 I think it'll happen before the summer.
 
 Anyways, you can tell libata to retry the port by manually telling it to
 rescan the port (echo '- - -' > /sys/class/scsi_host/hostX/scan).
 
 Ah, thank you!  I have to admit, that is at least as mysterious as any
 Microsoft registry tweak.

Polling hotplug should fix this.  I thought I would be able to merge it
much earlier.  I apparently was way too optimistic.  :-(

 (H'm... after rebooting, reallocated sectors jumped from 26 to 39.
 Something is up with that drive.)
 
 Yeap, seems like a broken drive to me.
 
 Actually, after a few rounds, the reallocated sectors stabilized at 56
 and all is working well again.  It's like there was a major problem with
 error handling.
 
 The problem is that I don't know where the blame lies.

I'm pretty sure it's the firmware's fault.  It's not supposed to go out
for lunch like that even when an internal error occurs.

-- 
tejun


Re: 2.6.20.3 AMD64 oops in CFQ code

2007-04-02 Thread Tejun Heo
[resending.  my mail service was down for more than a week and this
message didn't get delivered.]

[EMAIL PROTECTED] wrote:
  Anyway, what's annoying is that I can't figure out how to bring the
  drive back on line without resetting the box.  It's in a hot-swap
  enclosure, but power cycling the drive doesn't seem to help.  I thought
  libata hotplug was working?  (SiI3132 card, using the sil24 driver.)

Yeah, it's working, but failing resets are considered highly dangerous
(in that the controller status is unknown and may cause something
dangerous like screaming interrupts) and the port is muted after that.  The
plan is to handle this with polling hotplug such that libata tries to
revive the port if a PHY status change is detected by polling.  Patches
are available but they need other things to be resolved to get integrated.
I think it'll happen before the summer.

Anyways, you can tell libata to retry the port by manually telling it to
rescan the port (echo '- - -' > /sys/class/scsi_host/hostX/scan).

  (H'm... after rebooting, reallocated sectors jumped from 26 to 39.
  Something is up with that drive.)

Yeap, seems like a broken drive to me.

Thanks.

-- 
tejun


Re: Problem booting linux 2.6.19-rc5, 2.6.19-rc5-git6, 2.6.19-rc5-mm2 with md raid 1 over lvm root

2006-11-15 Thread Tejun Heo

Nicolas Mailhot wrote:

The failing kernels (I tried -rc5, -rc5-git6, -rc5-mm2) only print:

%
device-mapper: ioctl: 4.7.0-ioctl (2006-06-24) initialised:
[EMAIL PROTECTED]
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
%-

(I didn't bother copying the rest of the failing kernel dmesg, as sata
initialisation fills the first half of the screen, then dm is initialised,
then you only get the logical consequences of failing to detect the /
volume. The sata part seems fine – it prints the name of the hard drives
we want to use)

I'm attaching the dmesg for the working distro kernel (yes, I know it's
not a 100% distro kernel, but very close to one), the distro config, and
the config I used in my test. If anyone could help me figure out what's
wrong I'd be grateful.


Say 'y' not 'm' to SCSI disk support.
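
(That is, in the kernel config:

  CONFIG_BLK_DEV_SD=y

With sd built as a module and nothing loading it before the root device
is needed, the disks aren't present yet when md autodetection runs, so
the / volume can't be found.)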

--
tejun


Re: Test feedback 2.6.17.4+libata-tj-stable (EH, hotplug)

2006-07-10 Thread Tejun Heo

Christian Pernegger wrote:

The fact that the disk had changed minor numbers after it was plugged
back in bugs me a bit. (was sdc before, sde after). Additionally udev
removed the sdc device file, so I had to manually recreate it to be
able to remove the 'faulty' disk from its md array.


That's because md is still holding onto sdc in failed mode.  A hotplug
script which checks whether a removed device is in an md array and, if
so, removes it from the array would solve the problem.  Not sure whether
that would be the correct approach though.
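
(Rough, untested sketch of such a hook, assuming the hotplug/udev
environment passes the removed device in DEVNAME, e.g. /dev/sdc:

  #!/bin/sh
  disk=$(basename "$DEVNAME")
  # find array members that live on the departed disk, e.g. sdc1[2](F)
  awk -v d="$disk" '$1 ~ /^md/ { for (i = 5; i <= NF; i++) if ($i ~ "^" d) print $1, $i }' /proc/mdstat |
  while read md member; do
      part=${member%%\[*}            # strip the [n](F) suffix
      mdadm "/dev/$md" --fail "/dev/$part" --remove "/dev/$part"
  done
)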


Thanks.

--
tejun