Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)

2007-12-10 Thread Tejun Heo
Justin Piszcz wrote:
> The badblocks did not do anything; however, when I built a software raid
> 5 and the performed a dd:
> 
> /usr/bin/time dd if=/dev/zero of=fill_disk bs=1M
> 
> [42332.936615] ata5.00: exception Emask 0x2 SAct 0x7000 SErr 0x0 action
> 0x2 frozen
> [42332.936706] ata5.00: spurious completions during NCQ issue=0x0
> SAct=0x7000 FIS=004040a1:0800
> 
> Next test, I will turn off NCQ and try to make the problem re-occur.
> If anyone else has any thoughts here..?
> I ran long smart tests on all 3 disks, they all ran successfully.
> 
> Perhaps these drives need to be NCQ BLACKLISTED with the P35 chipset?

That was me being stupid.  Patches for both the upstream and -stable
branches have been posted.  These spurious completion warnings will go away.

Thanks.

-- 
tejun


Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)

2007-12-10 Thread Tejun Heo
Bill Davidsen wrote:
> Jan Engelhardt wrote:
>> On Dec 1 2007 06:26, Justin Piszcz wrote:
>>> I ran the following:
>>>
>>> dd if=/dev/zero of=/dev/sdc
>>> dd if=/dev/zero of=/dev/sdd
>>> dd if=/dev/zero of=/dev/sde
>>>
>>> (as it is always a very good idea to do this with any new disk)
>>
>> Why would you care about what's on the disk? fdisk, mkfs and
>> the day-to-day operation will overwrite it _anyway_.
>>
>> (If you think the disk is not empty, you should look at it
>> and copy off all usable warez beforehand :-)
>>
> Do you not test your drive for minimum functionality before using them?

I personally don't.

> Also, if you have the tools to check for relocated sectors before and
> after doing this, that's a good idea as well. S.M.A.R.T is your friend.
> And when writing /dev/zero to a drive, if it craps out you have less
> emotional attachment to the data.

Writing all zeros isn't too useful, though.  A drive failing reallocation on
write is a catastrophic failure.  It means the drive wants to relocate a
sector but can't because it has used up all of its spare space, which usually
indicates something else is seriously wrong with the drive.  Such a drive
will have to go to the trash can.  This is all serious and bad, but the catch
is that in such cases the problem usually sticks out like a sore thumb, so
either the vendor doesn't ship such a drive or you'll find the failure very
early.  I personally haven't seen any such failure yet.  Maybe I'm lucky.
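
For reference, the reallocation counters this kind of failure shows up in can
be watched with smartctl; a small sketch (the device name is just a
placeholder):

  # Attribute 5 (Reallocated_Sector_Ct) and 197 (Current_Pending_Sector)
  # are the ones to compare before and after a full write or read pass.
  smartctl -A /dev/sdX | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector'

A raw value that keeps climbing means the drive is busy remapping sectors.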

Most data loss occurs when the drive fails to read what it thought it had
written successfully, so the opposite approach, periodically reading the whole
disk and dumping it to /dev/null, is probably much better than writing zeros:
it lets the drive find a deteriorating sector early, while it's still
readable, and relocate it.  But then again, I think even that is overkill.
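
Such a read scan is just a sequential read of the raw device; a minimal
sketch (device name and block size are only examples):

  # Read every sector and discard the data; a sector the drive struggles
  # with gets noticed (and possibly remapped) now instead of later.
  dd if=/dev/sdX of=/dev/null bs=1M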

Writing zeros to sectors is more useful as a cure than as prevention.  If
your drive fails to read a sector, write any value to that sector.  The drive
will forget about the data on the damaged sector, reallocate it, and write
the new data there.  Of course, you lose the data which was originally on
that sector.
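
For example, if the kernel log reports a read error at a specific LBA, one
way to force the reallocation is a single-sector write (the sector number
below is made up):

  # Overwrite only the failing 512-byte sector; the drive remaps it on write.
  dd if=/dev/zero of=/dev/sdX bs=512 seek=12345678 count=1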

I personally think it's enough to just throw in an extra disk, make it
RAID 1 or 5, and rebuild the array if a read fails on one of the disks.  If
a write fails, or read failures continue, replace the disk.  Of course, if
you want to be extra cautious, good for you.  :-)
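
The replacement itself is the usual md dance; a sketch with example array
and device names:

  # Kick the bad disk out of the array and rebuild onto a fresh one.
  mdadm /dev/md0 --fail /dev/sdX --remove /dev/sdX
  mdadm /dev/md0 --add /dev/sdY    # resync starts automatically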

-- 
tejun


Re: Possible data corruption sata_sil24?

2007-07-19 Thread Tejun Heo
David Shaw wrote:
>> I'm not sure whether this is a problem of sata_sil24 or of the dm layer.
>> Cc'ing linux-raid for help.  How much memory do you have?  One big
>> difference between ata_piix and sata_sil24 is that sil24 can handle 64-bit
>> DMA.  Maybe DMA mapping or something interacts weirdly with dm there?
> 
> The machine has 640 megs of RAM.  FWIW, I tried this with 512 megs of
> RAM with the same results.  Running Memtest86+ shows the memory is
> good.

Hmmm... I see, so it's not a DMA-to-the-wrong-address problem then.  Let's
see whether the dm people can help us out.

Thanks.

-- 
tejun


Re: [PATCH] block: factor out bio_check_eod()

2007-07-18 Thread Tejun Heo
Jens Axboe wrote:
>> somewhat annoying, I'll see if I can prefix it with git-daemon in the
>> future.
> 
> OK, now skip the /data/git/ stuff and just use
> 
> git://git.kernel.dk/linux-2.6-block.git

Alright, it works like a charm now.  Thanks.

-- 
tejun


Re: [PATCH] block: factor out bio_check_eod()

2007-07-18 Thread Tejun Heo
Jens Axboe wrote:
> On Wed, Jul 18 2007, Tejun Heo wrote:
>> Jens Axboe wrote:
>>> On Wed, Jul 18 2007, Tejun Heo wrote:
>>>> Jens Axboe wrote:
>>>>> On Wed, Jul 18 2007, Tejun Heo wrote:
>>>>>> End of device check is done twice in __generic_make_request() and it's
>>>>>> fully inlined each time.  Factor out bio_check_eod().
>>>>> Tejun, yeah I should separate the cleanups and put them in the upstream
>>>>> branch. Will do so and add your signed-off to both of them.
>>>>>
>>>> Would they be different from the one I just posted?  No big deal either
>>>> way.  I'm just basing the zero-length barrier on top of these patches.
>>>> Oh well, the changes are trivial anyway.
>>> This one ended up being the same, but in the first one you missed some
>>> of the cleanups. I ended up splitting the patch some more though, see
>>> the series:
>>>
>>> http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=barrier
>> Alright, will base on 662d5c5e6afb79d05db5563205b809c0de530286.  Thanks.
> 
> 1781c6a39fb6e31836557618c4505f5f7bc61605, no? Unless you want to rewrite
> it completely :-)

I think I'll start from 662d5c5e and steal most parts from 1781c6a3.  I
like stealing, you know. :-)  I think 1781c6a3 could also use splitting into
the zero-length barrier implementation and the issue_flush conversion.

Anyways, how do I pull from git.kernel.dk?
git://git.kernel.dk/linux-2.6-block.git gives me connection reset by server.

Thanks.

-- 
tejun


Re: [PATCH] block: factor out bio_check_eod()

2007-07-18 Thread Tejun Heo
Jens Axboe wrote:
> On Wed, Jul 18 2007, Tejun Heo wrote:
>> Jens Axboe wrote:
>>> On Wed, Jul 18 2007, Tejun Heo wrote:
>>>> End of device check is done twice in __generic_make_request() and it's
>>>> fully inlined each time.  Factor out bio_check_eod().
>>> Tejun, yeah I should separate the cleanups and put them in the upstream
>>> branch. Will do so and add your signed-off to both of them.
>>>
>> Would they be different from the one I just posted?  No big deal either
>> way.  I'm just basing the zero-length barrier on top of these patches.
>> Oh well, the changes are trivial anyway.
> 
> This one ended up being the same, but in the first one you missed some
> of the cleanups. I ended up splitting the patch some more though, see
> the series:
> 
> http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=barrier

Alright, will base on 662d5c5e6afb79d05db5563205b809c0de530286.  Thanks.

-- 
tejun


Re: [PATCH] block: factor out bio_check_eod()

2007-07-18 Thread Tejun Heo
Jens Axboe wrote:
> On Wed, Jul 18 2007, Tejun Heo wrote:
>> End of device check is done twice in __generic_make_request() and it's
>> fully inlined each time.  Factor out bio_check_eod().
> 
> Tejun, yeah I should separate the cleanups and put them in the upstream
> branch. Will do so and add your signed-off to both of them.
> 

Would they be different from the one I just posted?  No big deal either
way.  I'm just basing the zero-length barrier on top of these patches.
Oh well, the changes are trivial anyway.

-- 
tejun


[PATCH] block: factor out bio_check_eod()

2007-07-18 Thread Tejun Heo
End of device check is done twice in __generic_make_request() and it's
fully inlined each time.  Factor out bio_check_eod().

This is taken from Jens' zero-length barrier patch.

Signed-off-by: Tejun Heo <[EMAIL PROTECTED]>
Cc: Jens Axboe <[EMAIL PROTECTED]>
---
 block/ll_rw_blk.c |   63 --
 1 file changed, 33 insertions(+), 30 deletions(-)

Index: work/block/ll_rw_blk.c
===
--- work.orig/block/ll_rw_blk.c
+++ work/block/ll_rw_blk.c
@@ -3094,6 +3094,35 @@ static inline int should_fail_request(st
 
 #endif /* CONFIG_FAIL_MAKE_REQUEST */
 
+/*
+ * Check whether this bio extends beyond the end of the device.
+ */
+static int bio_check_eod(struct bio *bio, unsigned int nr_sectors)
+{
+   sector_t maxsector;
+
+   if (!nr_sectors)
+   return 0;
+
+   /* Test device or partition size, when known. */
+   maxsector = bio->bi_bdev->bd_inode->i_size >> 9;
+   if (maxsector) {
+   sector_t sector = bio->bi_sector;
+
+   if (maxsector < nr_sectors || maxsector - nr_sectors < sector) {
+   /*
+* This may well happen - the kernel calls bread()
+* without checking the size of the device, e.g., when
+* mounting a device.
+*/
+   handle_bad_sector(bio);
+   return 1;
+   }
+   }
+
+   return 0;
+}
+
 /**
  * generic_make_request: hand a buffer to its device driver for I/O
  * @bio:  The bio describing the location in memory and on the device.
@@ -3121,27 +3150,14 @@ static inline int should_fail_request(st
 static inline void __generic_make_request(struct bio *bio)
 {
request_queue_t *q;
-   sector_t maxsector;
sector_t old_sector;
int ret, nr_sectors = bio_sectors(bio);
dev_t old_dev;
 
might_sleep();
-   /* Test device or partition size, when known. */
-   maxsector = bio->bi_bdev->bd_inode->i_size >> 9;
-   if (maxsector) {
-   sector_t sector = bio->bi_sector;
 
-   if (maxsector < nr_sectors || maxsector - nr_sectors < sector) {
-   /*
-* This may well happen - the kernel calls bread()
-* without checking the size of the device, e.g., when
-* mounting a device.
-*/
-   handle_bad_sector(bio);
-   goto end_io;
-   }
-   }
+   if (bio_check_eod(bio, nr_sectors))
+   goto end_io;
 
/*
 * Resolve the mapping until finished. (drivers are
@@ -3197,21 +3213,8 @@ end_io:
old_sector = bio->bi_sector;
old_dev = bio->bi_bdev->bd_dev;
 
-   maxsector = bio->bi_bdev->bd_inode->i_size >> 9;
-   if (maxsector) {
-   sector_t sector = bio->bi_sector;
-
-   if (maxsector < nr_sectors ||
-   maxsector - nr_sectors < sector) {
-   /*
-* This may well happen - partitions are not
-* checked to make sure they are within the size
-* of the whole device.
-*/
-   handle_bad_sector(bio);
-   goto end_io;
-   }
-   }
+   if (bio_check_eod(bio, nr_sectors))
+   goto end_io;
 
ret = q->make_request_fn(q, bio);
} while (ret);


[PATCH] block: cosmetic changes

2007-07-18 Thread Tejun Heo
Cosmetic changes.  This is taken from Jens' zero-length barrier patch.

Signed-off-by: Tejun Heo <[EMAIL PROTECTED]>
Cc: Jens Axboe <[EMAIL PROTECTED]>
---
 block/ll_rw_blk.c |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Index: work/block/ll_rw_blk.c
===
--- work.orig/block/ll_rw_blk.c
+++ work/block/ll_rw_blk.c
@@ -443,7 +443,8 @@ static inline struct request *start_orde
rq_init(q, rq);
if (bio_data_dir(q->orig_bar_rq->bio) == WRITE)
rq->cmd_flags |= REQ_RW;
-   rq->cmd_flags |= q->ordered & QUEUE_ORDERED_FUA ? REQ_FUA : 0;
+   if (q->ordered & QUEUE_ORDERED_FUA)
+   rq->cmd_flags |= REQ_FUA;
rq->elevator_private = NULL;
rq->elevator_private2 = NULL;
init_request_from_bio(rq, q->orig_bar_rq->bio);
@@ -3167,7 +3168,7 @@ end_io:
break;
}
 
-   if (unlikely(bio_sectors(bio) > q->max_hw_sectors)) {
+   if (unlikely(nr_sectors > q->max_hw_sectors)) {
printk("bio too big device %s (%u > %u)\n", 
bdevname(bio->bi_bdev, b),
bio_sectors(bio),


Re: Possible data corruption sata_sil24?

2007-07-18 Thread Tejun Heo
David Shaw wrote:
>>> It fails whether I use a raw /dev/sdd or partition it into one large
>>> /dev/sdd1, or partition into multiple partitions.  sata_sil24 seems to
>>> work by itself, as does dm, but as soon as I mix sata_sil24+dm, I get
>>> corruption.
>> H Can you reproduce the corruption by accessing both devices
>> simultaneously without using dm?  Considering ich5 does fine, it looks
>> like hardware and/or driver problem and I really wanna rule out dm.
> 
> I think I wasn't clear enough before.  The corruption happens when I
> use dm to create two dm mappings that both reside on the same real
> device.  Using two different devices, or two different partitions on
> the same physical device works properly.  ich5 does fine with these 3
> tests, but sata_sil24 fails:
> 
>  * /dev/sdd, create 2 dm linear mappings on it, mke2fs and use those
>dm "devices" == corruption
> 
>  * Partition /dev/sdd into /dev/sdd1 and /dev/sdd2, mke2fs and use
>those partitions == no corruption
> 
>  * Partition /dev/sdd into /dev/sdd1 and /dev/sdd2, create 2 dm linear
>mappings on /dev/sdd1, mke2fs and use those dm "devices" ==
>corruption

I'm not sure whether this is a problem of sata_sil24 or of the dm layer.
Cc'ing linux-raid for help.  How much memory do you have?  One big difference
between ata_piix and sata_sil24 is that sil24 can handle 64-bit DMA.
Maybe DMA mapping or something interacts weirdly with dm there?

Thanks.

-- 
tejun


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-10 Thread Tejun Heo
Ric Wheeler wrote:
>> Don't those thingies usually have NV cache or backed by battery such
>> that ORDERED_DRAIN is enough?
> 
> All of the high end arrays have non-volatile cache (read, on power loss,
> it is a promise that it will get all of your data out to permanent
> storage). You don't need to ask this kind of array to drain the cache.
> In fact, it might just ignore you if you send it that kind of request ;-)
> 
> The size of the NV cache can run from a few gigabytes up to hundreds of
> gigabytes, so you really don't want to invoke cache flushes here if you
> can avoid it.
> 
> For this class of device, you can get the required in order completion
> and data integrity semantics as long as we send the IO's to the device
> in the correct order.

Thanks for clarification.

>> The problem is that the interface between the host and a storage device
>> (ATA or SCSI) is not built to communicate that kind of information
>> (grouped flush, relaxed ordering...).  I think battery backed
>> ORDERED_DRAIN combined with fine-grained host queue flush would be
>> pretty good.  It doesn't require some fancy new interface which isn't
>> gonna be used widely anyway and can achieve most of performance gain if
>> the storage plays it smart.
> 
> I am not really sure that you need this ORDERED_DRAIN for big arrays...

ORDERED_DRAIN is there to properly order requests from the host request
queue (elevator/iosched).  We can make it finer-grained, but we do need to
keep some ordering restrictions.

-- 
tejun


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-10 Thread Tejun Heo
[EMAIL PROTECTED] wrote:
> On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said:
> 
>> All of the high end arrays have non-volatile cache (read, on power loss,
>> it is a promise that it will get all of your data out to permanent
>> storage).  You don't need to ask this kind of array to drain the cache.
>> In fact, it might just ignore you if you send it that kind of request ;-)
> 
> OK, I'll bite - how does the kernel know whether the other end of that
> fiberchannel cable is attached to a DMX-3 or to some no-name product that
> may not have the same assurances?  Is there a "I'm a high-end array" bit
> in the sense data that I'm unaware of?

Well, the array just has to tell the kernel that it doesn't do write-back
caching.  The kernel automatically selects ORDERED_DRAIN in that case.
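
For instance, whether a device claims a write-back cache can be checked from
userland; a small sketch (device name is a placeholder, and this only shows
what the kernel bases that decision on):

  # SCSI/arrays: WCE=0 means write-through, so drain ordering is sufficient.
  sdparm --get=WCE /dev/sdX
  # ATA: show the current write-caching setting.
  hdparm -W /dev/sdX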

-- 
tejun


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-05 Thread Tejun Heo
Hello, Jens.

Jens Axboe wrote:
> On Mon, May 28 2007, Neil Brown wrote:
>> I think the implementation priorities here are:
>>
>> 1/ implement a zero-length BIO_RW_BARRIER option.
>> 2/ Use it (or otherwise) to make all dm and md modules handle
>>barriers (and loop?).
>> 3/ Devise and implement appropriate fall-backs with-in the block layer
>>so that  -EOPNOTSUP is never returned.
>> 4/ Remove unneeded cruft from filesystems (and elsewhere).
> 
> This is the start of 1/ above. It's very lightly tested, it's verified
> to DTRT here at least and not crash :-)
> 
> It gets rid of the ->issue_flush_fn() queue callback, all the driver
> knowledge resides in ->prepare_flush_fn() anyways. blkdev_issue_flush()
> then just reuses the empty-bio approach to queue an empty barrier, this
> should work equally well for stacked and non-stacked devices.
> 
> While this patch isn't complete yet, it's clearly the right direction to
> go.

Finally took a brief look. :-)  I think the sequencing for the zero-length
barrier can be done better by pre-setting QUEUE_ORDSEQ_BAR in
start_ordered() rather than by short-circuiting the request after it's
issued.  What do you think?

Thanks.

-- 
tejun


Re: Linux Software RAID is really RAID?

2007-07-03 Thread Tejun Heo
Mark Lord wrote:
> I believe he said it was ICH5 (different post/thread).
> 
> My observation on ICH5 is that if one unplugs a drive,
> then the chipset/cpu locks up hard when toggling SRST
> in the EH code.
> 
> Specifically, it locks up at the instruction
> which restores SRST back to the non-asserted state,
> which likely corresponds to the chipset finally actually
> sending a FIS to the drive.
> 
> A hard(ware) lockup, not software.
> That's why Intel says ICH5 doesn't do hotplug.

OIC.  I don't think there's much left to do from the driver side then.
Or is there any workaround?

-- 
tejun


Re: Linux Software RAID is really RAID?

2007-07-03 Thread Tejun Heo
Brad Campbell wrote:
> Johny Mail list wrote:
>> Hello list,
>> I have a little question about software RAID on Linux.
>> I have installed software RAID on all my Dell SC1425 servers, believing
>> that md RAID was a robust driver.
>> Recently I ran a test on one server to check whether the RAID handles a
>> hard drive power failure correctly: I powered up the server and, after
>> booting to the prompt, disconnected the power cable of one SATA hard
>> drive.  Normally md should drop the failed drive from the logical drive
>> it built, and the server should keep working as if nothing had happened.
>> Oddly, the server stopped responding and I got these messages:
>> ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
>> ata4.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0
>>  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>> ata4: port is slow to respond, please be patient (Status 0xd0)
>> ata4: port failed to respond (30sec, Status 0xd0)
>> ata4: soft resetting port
>>
>> After that my system is frozen.

How hard is it frozen?  Can you blink the Numlock LED?

-- 
tejun


Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume

2007-07-02 Thread Tejun Heo
David Greaves wrote:
>> Tejun Heo wrote:
>>> It's really weird tho.  The PHY RDY status changed events are coming
>>> from the device which is NOT used while resuming
> 
> There is an obvious problem there though Tejun (the errors even when sda
> isn't involved in the OS boot) - can I start another thread about that
> issue/bug later? I need to reshuffle partitions so I'd rather get the
> hibernate working first and then go back to it if that's OK?

Yeah, sure.  The problem is that we don't know whether or how those two
are related.  It would be great if there were a way to verify that the
memory image read during resume from hibernation is intact.  Rafael, any ideas?

Thanks.

-- 
tejun


Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume

2007-06-20 Thread Tejun Heo
David Greaves wrote:
> Tejun Heo wrote:
>> Your controller is repeatedly reporting PHY readiness changed exception.
>>  Are you reading the system image from the device attached to the first
>> SATA port?
> 
> Yes if you mean 1st as in the one after the zero-th ...

I meant the first first (0th).

>> How reproducible is the problem?  Does the problem go away or occur more
>> often if you change the drive you write the memory image to?
> 
> I don't think there should be activity on the sda drive during resume
> itself.
> 
> [I broke my / md mirror and am using some of that for swap/resume for now]
> 
> I did change the swap/resume device to sdd2 (different controller,
> onboard sata_via) and there was no EH during resume. The system seemed
> OK, wrote a few Gb of video and did a kernel compile.
> I repeated this test, no EH during resume, no problems.
> I even ran xfs_fsr, the defragment utility, to stress the fs.
> 
> I retain this configuration and try again tonight but it looks like
> there _may_ be a link between EH during resume and my problems...
> 
> Of course, I don't understand why it *should* EH during resume, it
> doesn't during boot or normal operation...

EH occurs during boot, suspend and resume all the time.  It just runs in
quiet mode to avoid disturbing the users too much.  In your case, EH is
kicking in due to actual exception conditions so it's being verbose to
give clue about what's going on.

It's really weird tho.  The PHY RDY status changed events are coming
from the device which is NOT used while resuming and it's before any
actual PM events are triggered.  Your kernel just boots, swsusp realizes
it's resuming and tries to read memory image from the swap device.
While reading, the disk controller raises consecutive PHY readiness
changed interrupts.  EH recovers them alright but the end result seems
to indicate that the loaded image is corrupt.

So, there's no device suspend/resume code involved at all.  The kernel
just booted and is trying to read data from the drive.  Please try with
only the first drive attached and see what happens.

Thanks.

-- 
tejun


Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume

2007-06-19 Thread Tejun Heo
Hello,

David Greaves wrote:
>> Good :)
> Now, not so good :)

Oh, crap.  :-)

> So I hibernated last night and resumed this morning.
> Before hibernating I froze and sync'ed. After resume I thawed it. (Sorry
> Dave)
> 
> Here are some photos of the screen during resume. This is not 100%
> reproducable - it seems to occur only if the system is shutdown for
> 30mins or so.
> 
> Tejun, I wonder if error handling during resume is problematic? I got
> the same errors in 2.6.21. I have never seen these (or any other libata)
> errors other than during resume.
> 
> http://www.dgreaves.com/pub/2.6.22-rc5-resume-failure.jpg
> (hard to read, here's one from 2.6.21
> http://www.dgreaves.com/pub/2.6.21-resume-failure.jpg

Your controller is repeatedly reporting PHY readiness changed exceptions.
Are you reading the system image from the device attached to the first
SATA port?

> I _think_ I've only seen the xfs problem when a resume shows these errors.

The error handling itself tries very hard to ensure that there is no
data corruption in case of errors.  All commands which experience
exceptions are retried but if the drive itself is doing something
stupid, there's only so much the driver can do.

How reproducible is the problem?  Does the problem go away or occur more
often if you change the drive you write the memory image to?

-- 
tejun


Re: Machine hanging on synchronize cache on shutdown 2.6.22-rc4-git[45678]

2007-06-18 Thread Tejun Heo
Mikael Pettersson wrote:
> FWIW, I'm seeing scsi layer accesses (cache flushes) after things
> like rmmod sata_promise. They error out and don't seem to cause
> any harm, but the fact that they occur at all makes me nervous.

That's okay.  On rmmod, the low-level device (ATA) goes away first, just as
in hot unplug, so sd gets notified *after* the device is gone.  sd still
tries to clean up and issues those commands, which are properly rejected by
the SCSI midlayer since the device is already marked offline, so there is
nothing to worry about there.

-- 
tejun


Re: Machine hanging on synchronize cache on shutdown 2.6.22-rc4-git[45678]

2007-06-18 Thread Tejun Heo
Hello,

Mikael Pettersson wrote:
> On Sat, 16 Jun 2007 15:52:33 +0400, Brad Campbell wrote:
>> I've got a box here based on current Debian Stable.
>> It's got 15 Maxtor SATA drives in it on 4 Promise TX4 controllers.
>>
>> Using kernel 2.6.21.x it shuts down, but of course with a huge "clack" as
>> 15 drives all do emergency head parks simultaneously.  I thought I'd
>> upgrade to 2.6.22-rc to get around this but the machine just hangs up hard,
>> apparently trying to sync cache on a drive.
>>
>> I've run this process manually, so I know it is being performed properly.
>>
>> Prior to shutdown, all nfsd processes are stopped, filesystems unmounted and 
>> md arrays stopped.
>> /proc/mdstat shows
>> [EMAIL PROTECTED]:~# cat /proc/mdstat
>> Personalities : [raid6] [raid5] [raid4]
>> unused devices: <none>
>> [EMAIL PROTECTED]:~#
>>
>> Here is the final hangup.
>>
>> http://www.fnarfbargle.com/CIMG1029.JPG
> 
> Something sent a command to the disk on ata15 after the PHY had been
> offlined and the interface had been put in SLUMBER state (SStatus 614).
> Consequently the command timed out. Libata tried a soft reset, and then
> a hard reset, after which the machine hung.

Hmm... weird.  Maybe device initiated power saving (DIPS) is active?

> I don't think sata_promise is the guilty party here. Looks like some
> layer above sata_promise got confused about the state of the interface.

But locking up hard after hardreset is a problem of sata_promise, no?

Thanks.

-- 
tejun


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-04 Thread Tejun Heo
Jens Axboe wrote:
> On Sat, Jun 02 2007, Tejun Heo wrote:
>> Hello,
>>
>> Jens Axboe wrote:
>>>> Would that be very different from issuing barrier and not waiting for
>>>> its completion?  For ATA and SCSI, we'll have to flush write back cache
>>>> anyway, so I don't see how we can get performance advantage by
>>>> implementing separate WRITE_ORDERED.  I think zero-length barrier
>>>> (haven't looked at the code yet, still recovering from jet lag :-) can
>>>> serve as genuine barrier without the extra write tho.
>>> As always, it depends :-)
>>>
>>> If you are doing pure flush barriers, then there's no difference. Unless
>>> you only guarantee ordering wrt previously submitted requests, in which
>>> case you can eliminate the post flush.
>>>
>>> If you are doing ordered tags, then just setting the ordered bit is
>>> enough. That is different from the barrier in that we don't need a flush
>>> of FUA bit set.
>> Hmmm... I'm feeling dense.  Zero-length barrier also requires only one
>> flush to separate requests before and after it (haven't looked at the
>> code yet, will soon).  Can you enlighten me?
> 
> Yeah, that's what the zero-length barrier implementation I posted does.
> Not sure if you have a question beyond that, if so fire away :-)

I thought you were talking about adding BIO_RW_ORDERED instead of
exposing zero length BIO_RW_BARRIER.  Sorry about the confusion.  :-)

-- 
tejun


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-02 Thread Tejun Heo
Hello,

Jens Axboe wrote:
>> Would that be very different from issuing barrier and not waiting for
>> its completion?  For ATA and SCSI, we'll have to flush write back cache
>> anyway, so I don't see how we can get performance advantage by
>> implementing separate WRITE_ORDERED.  I think zero-length barrier
>> (haven't looked at the code yet, still recovering from jet lag :-) can
>> serve as genuine barrier without the extra write tho.
> 
> As always, it depends :-)
> 
> If you are doing pure flush barriers, then there's no difference. Unless
> you only guarantee ordering wrt previously submitted requests, in which
> case you can eliminate the post flush.
> 
> If you are doing ordered tags, then just setting the ordered bit is
> enough. That is different from the barrier in that we don't need a flush
> of FUA bit set.

Hmmm... I'm feeling dense.  Zero-length barrier also requires only one
flush to separate requests before and after it (haven't looked at the
code yet, will soon).  Can you enlighten me?

Thanks.

-- 
tejun



Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-01 Thread Tejun Heo
[EMAIL PROTECTED] wrote:
> On Fri, 01 Jun 2007 16:16:01 +0900, Tejun Heo said:
>> Don't those thingies usually have NV cache or backed by battery such
>> that ORDERED_DRAIN is enough?
> 
> Probably *most* do, but do you really want to bet the user's data on it?

Thought we were talking about high-end storage stuff.  I don't think I'd be
too uncomfortable with that.  The reason we're talking about this at all is
that high-end stuff with a fancy NV cache and a hunk of battery suffers
unnecessarily from the current barrier implementation.

>> The problem is that the interface between the host and a storage device
>> (ATA or SCSI) is not built to communicate that kind of information
>> (grouped flush, relaxed ordering...).  I think battery backed
>> ORDERED_DRAIN combined with fine-grained host queue flush would be
>> pretty good.  It doesn't require some fancy new interface which isn't
>> gonna be used widely anyway and can achieve most of performance gain if
>> the storage plays it smart.
> 
> Yes, that would probably be "pretty good".  But how do you get the storage
> device to *reliably* tell the truth about what it actually implements? 
> (Consider
> the number of devices that downright lie about their implementation of cache
> flushing)

The SCSI NV bit, or the device reporting a write-through cache?  Again,
we're talking about large arrays, and we already trust the write-through
report even on cheap single-spindle drives.  sd currently doesn't honor the
NV bit and that's causing some trouble on some arrays.  We'll probably have
to honor it at least conditionally.

-- 
tejun


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-01 Thread Tejun Heo
[ cc'ing Ric Wheeler for storage array thingie.  Hi, whole thread is at
http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/3344 ]

Hello,

[EMAIL PROTECTED] wrote:
> but when you consider the self-contained disk arrays it's an entirely
> different story. you can easily have a few gig of cache and a complete
> OS pretending to be a single drive as far as you are concerned.
> 
> and the price of such devices is plummeting (in large part thanks to
> Linux moving into this space), you can now readily buy a 10TB array for
> $10k that looks like a single drive.

Don't those thingies usually have NV cache or backed by battery such
that ORDERED_DRAIN is enough?

The problem is that the interface between the host and a storage device
(ATA or SCSI) is not built to communicate that kind of information
(grouped flush, relaxed ordering...).  I think battery backed
ORDERED_DRAIN combined with fine-grained host queue flush would be
pretty good.  It doesn't require some fancy new interface which isn't
gonna be used widely anyway and can achieve most of performance gain if
the storage plays it smart.

Thanks.

-- 
tejun


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Tejun Heo
Stefan Bader wrote:
> 2007/5/30, Phillip Susi <[EMAIL PROTECTED]>:
>> Stefan Bader wrote:
>> >
>> > Since drive a supports barrier request we don't get -EOPNOTSUPP but
>> > the request with block y might get written before block x since the
>> > disk are independent. I guess the chances of this are quite low since
>> > at some point a barrier request will also hit drive b but for the time
>> > being it might be better to indicate -EOPNOTSUPP right from
>> > device-mapper.
>>
>> The device mapper needs to ensure that ALL underlying devices get a
>> barrier request when one comes down from above, even if it has to
>> construct zero length barriers to send to most of them.
>>
> 
> And somehow also make sure all of the barriers have been processed
> before returning the barrier that came in. Plus it would have to queue
> all mapping requests until the barrier is done (if strictly acting
> according to barrier.txt).
> 
> But I am wondering a bit whether the requirements to barriers are
> really that tight as described in Tejun's document (barrier request is
> only started if everything before is safe, the barrier itself isn't
> returned until it is safe, too, and all requests after the barrier
> aren't started before the barrier is done). Is it really necessary to
> defer any further requests until the barrier has been written to save
> storage? Or would it be sufficient to guarantee that, if a barrier
> request returns, everything up to (including the barrier) is on safe
> storage?

Well, what's described in barrier.txt is the currently implemented semantics
and what filesystems expect, so we can't change it underneath them, but we
definitely can introduce new, more relaxed variants.  One thing we should
bear in mind is that harddisks don't have humongous caches or a very smart
controller / instruction set.  No matter how relaxed an interface the block
layer provides, in the end it just has to issue a wholesale FLUSH CACHE on
the device to guarantee data ordering on the media.

IMHO, we can do better by paying more attention to how we do things in the
request queue, which can be deeper and more intelligent than the device
queue.

Thanks.

-- 
tejun


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Tejun Heo
Jens Axboe wrote:
> On Thu, May 31 2007, David Chinner wrote:
>> On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
>>> On Thu, May 31 2007, David Chinner wrote:
 IOWs, there are two parts to the problem:

1 - guaranteeing I/O ordering
2 - guaranteeing blocks are on persistent storage.

 Right now, a single barrier I/O is used to provide both of these
 guarantees. In most cases, all we really need to provide is 1); the
 need for 2) is a much rarer condition but still needs to be
 provided.

> if I am understanding it correctly, the big win for barriers is that you 
> do NOT have to stop and wait until the data is on persistent media before 
> you can continue.
 Yes, if we define a barrier to only guarantee 1), then yes this
 would be a big win (esp. for XFS). But that requires all filesystems
 to handle sync writes differently, and sync_blockdev() needs to
 call blkdev_issue_flush() as well

 So, what do we do here? Do we define a barrier I/O to only provide
 ordering, or do we define it to also provide persistent storage
 writeback? Whatever we decide, it needs to be documented
>>> The block layer already has a notion of the two types of barriers, with
>>> a very small amount of tweaking we could expose that. There's absolutely
>>> zero reason we can't easily support both types of barriers.
>> That sounds like a good idea - we can leave the existing
>> WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
>> behaviour that only guarantees ordering. The filesystem can then
>> choose which to use where appropriate
> 
> Precisely. The current definition of barriers are what Chris and I came
> up with many years ago, when solving the problem for reiserfs
> originally. It is by no means the only feasible approach.
> 
> I'll add a WRITE_ORDERED command to the #barrier branch, it already
> contains the empty-bio barrier support I posted yesterday (well a
> slightly modified and cleaned up version).

Would that be very different from issuing a barrier and not waiting for its
completion?  For ATA and SCSI, we'll have to flush the write-back cache
anyway, so I don't see how we can get a performance advantage by
implementing a separate WRITE_ORDERED.  I think a zero-length barrier
(haven't looked at the code yet, still recovering from jet lag :-) can
serve as a genuine barrier without the extra write, though.

Thanks.

-- 
tejun


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-28 Thread Tejun Heo
Hello,

Neil Brown wrote:
> 1/ A BIO_RW_BARRIER request should never fail with -EOPNOTSUP.
> 
>  This is certainly a very attractive position - it makes the interface
>  cleaner and makes life easier for filesystems and other clients of
>  the block interface.
>  Currently filesystems handle -EOPNOTSUP by
>   a/ resubmitting the request without the BARRIER (after waiting for
> earlier requests to complete) and
>   b/ possibly printing an error message to the kernel logs.
> 
>  The block layer can do both of these just as easily and it does make
>  sense to do it there.

Yeah, I think doing all of the above in the block layer is the cleanest way
to solve this.  If there is a write-back cache and flush doesn't work, a
barrier is bound to fail, but the block layer can still write the barrier
block as requested (without the actual barriering), whine about it to the
user, and tell the FS that the barrier failed but the write itself went
through, so that the FS can go on without caring about it unless it wants to.

>  md/dm modules could keep count of requests as has been suggested
>  (though that would be a fairly big change for raid0 as it currently
>  doesn't know when a request completes - bi_endio goes directly to the
>  filesystem). 
>  However I think the idea of a zero-length BIO_RW_BARRIER would be a
>  good option.  raid0 could send one of these down each device, and
>  when they all return, the barrier request can be sent to it's target
>  device(s).

Yeap.

> 2/ Maybe barriers provide stronger semantics than are required.
> 
>  All write requests are synchronised around a barrier write.  This is
>  often more than is required and apparently can cause a measurable
>  slowdown.
> 
>  Also the FUA for the actual commit write might not be needed.  It is
>  important for consistency that the preceding writes are in safe
>  storage before the commit write, but it is not so important that the
>  commit write is immediately safe on storage.  That isn't needed until
>  a 'sync' or 'fsync' or similar.
> 
>  One possible alternative is:
>- writes can overtake barriers, but barrier cannot overtake writes.
>- flush before the barrier, not after.

I think we can give this property to zero length barriers.

>  This is considerably weaker, and hence cheaper. But I think it is
>  enough for all filesystems (providing it is still an option to call
>  blkdev_issue_flush on 'fsync').
> 
>  Another alternative would be to tag each bio was being in a
>  particular barrier-group.  Then bio's in different groups could
>  overtake each other in either direction, but a BARRIER request must
>  be totally ordered w.r.t. other requests in the barrier group.
>  This would require an extra bio field, and would give the filesystem
>  more appearance of control.  I'm not yet sure how much it would
>  really help...
>  It would allow us to set FUA on all bios with a non-zero
>  barrier-group.  That would mean we don't have to flush the entire
>  cache, just those blocks that are critical but I'm still not sure
>  it's a good idea.

The barrier code as it currently stands deals with two colors, so there can
be only one outstanding barrier at any given moment.  Expanding it to deal
with multiple colors and then to multiple simultaneous groups will take some
work but is definitely possible.  If FS people can make good use of it, I
think it would be worthwhile.

>  Of course, these weaker rules would only apply inside the elevator.
>  Once the request goes to the device we need to work with what the
>  device provides, which probably means total-ordering around the
>  barrier. 

Yeah, on the device side the best we can do most of the time is a full
flush, but as long as the request queue depth is much deeper than the
controller/device one, having multiple barrier groups can be helpful.
We need more input from FS people, I think.

> 3/ Do we need explicit control of the 'ordered' mode?
> 
>   Consider a SCSI device that has NV RAM cache.  mode_sense reports
>   that write-back is enabled, so _FUA or _FLUSH will be used.
>   But as it is *NV* ram, QUEUE_ORDER_DRAIN is really the best mode.
>   But it seems there is no way to query this information.
>   Using _FLUSH causes the NVRAM to be flushed to media which is a
>   terrible performance problem.

If the NV RAM can be reliably detected using one of the inquiry pages, the
sd driver can switch to DRAIN automatically.

>   Setting SYNC_NV doesn't work on the particular device in question.
>   We currently tell customers to mount with -o nobarriers, but that
>   really feels like the wrong solution.  We should be telling the scsi
>   device "don't flush".
>   An advantage of 'nobarriers' is it can go in /etc/fstab.  Where
>   would you record that a SCSI drive should be set to
>   QUEUE_ORDERD_DRAIN ??

How about exporting the ordered mode as a sysfs attribute and configuring it
with a udev rule?  It's a device property, after all.
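
Something along these lines, with a completely made-up attribute name (no
such attribute exists today; this is only a sketch of the idea):

  # Hypothetical: tell the block layer this device's cache is non-volatile,
  # so drain-only ordering is enough.  Path and value are illustrative.
  echo drain > /sys/block/sdX/queue/ordered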

Thanks.

-- 
tejun

Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-26 Thread Tejun Heo
Hello, Neil Brown.

Please cc me on blkdev barriers and, if you haven't yet, reading
Documentation/block/barrier.txt can be helpful too.

Neil Brown wrote:
[--snip--]
> 1/ SAFE.  With a SAFE device, there is no write-behind cache, or if
>   there is it is non-volatile.  Once a write completes it is 
>   completely safe.  Such a device does not require barriers
>   or ->issue_flush_fn, and can respond to them either by a
> no-op or with -EOPNOTSUPP (the former is preferred).
> 
> 2/ FLUSHABLE.
>   A FLUSHABLE device may have a volatile write-behind cache.
>   This cache can be flushed with a call to blkdev_issue_flush.
> It may not support barrier requests.
> 
> 3/ BARRIER.
> A BARRIER device supports both blkdev_issue_flush and
>   BIO_RW_BARRIER.  Either may be used to synchronise any
> write-behind cache to non-volatile storage (media).
> 
> Handling of SAFE and FLUSHABLE devices is essentially the same and can
> work on a BARRIER device.  The BARRIER device has the option of more
> efficient handling.

Actually, all three of the above are handled by the blkdev flush code.

> How does a filesystem use this?
> ===
> 
[--snip--]
> 2/ Set the BIO_RW_BARRIER bit in the write request for the commit
> block.
>(This is more efficient on BARRIER).

This really should be enough.

> HOW DO MD or DM USE THIS
> 
> 
> 1/ striping devices.
>  This includes md/raid0 md/linear dm-linear dm-stripe and probably
>  others. 
> 
>These devices can easily support blkdev_issue_flush by simply
>calling blkdev_issue_flush on all component devices.
> 
>These devices would find it very hard to support BIO_RW_BARRIER.
>Doing this would require keeping track of all in-flight requests
>(which some, possibly all, of the above don't) and then:
>  When a BIO_RW_BARRIER request arrives:
> wait for all pending writes to complete
> call blkdev_issue_flush on all devices
> issue the barrier write to the target device(s)
>as BIO_RW_BARRIER,
> if that is -EOPNOTSUP, re-issue, wait, flush.

Hmm... What do you think about introducing zero-length BIO_RW_BARRIER
for this case?

> 2/ Mirror devices.  This includes md/raid1 and dm-raid1.
> 
>These device can trivially implement blkdev_issue_flush much like
>the striping devices, and can support BIO_RW_BARRIER to some
>extent.
>md/raid1 currently tries.  I'm not sure about dm-raid1.
> 
>md/raid1 determines if the underlying devices can handle
>BIO_RW_BARRIER.  If any cannot, it rejects such requests (EOPNOTSUP)
>itself.
>If all underlying devices do appear to support barriers, md/raid1
>will pass a barrier-write down to all devices.
>The difficulty comes if it fails on one device, but not all
>devices.  In this case it is not clear what to do.  Failing the
>request is a lie, because some data has been written (possible too
>early).  Succeeding the request (after re-submitting the failed
>requests) is also a lie as the barrier wasn't really honoured.
>md/raid1 currently takes the latter approach, but will only do it
>once - after that it fails all barrier requests.
> 
>Hopefully this is unlikely to happen.  What device would work
>correctly with barriers once, and then not the next time?
>The answer is md/raid1.  If you remove a failed device and add a
>new device that doesn't support barriers, md/raid1 will notice and
>stop supporting barriers.
>If md/raid1 can change from supporting barrier to not, then maybe
>some other device could too?
> 
>I'm not sure what to do about this - maybe just ignore it...

That sounds good.  :-)

> 3/ Other modules
> 
>Other md and dm modules (raid5, mpath, crypt) do not add anything
>interesting to the above.  Either handling BIO_RW_BARRIER is
>trivial, or extremely difficult.
> 
> HOW DO LOW LEVEL DEVICES HANDLE THIS
> 
> 
> This is part of the picture that I haven't explored greatly.  My
> feeling is that most if not all devices support blkdev_issue_flush
> properly, and support barriers reasonably well providing that the
> hardware does.
> There in an exception I recently found though.
> For devices that don't support QUEUE_ORDERED_TAG (i.e. commands sent to
> the controller can be tagged as barriers), SCSI will use the
> SYNCHRONIZE_CACHE command to flush the cache after the barrier
> request (a bit like the filesystem calling blkdev_issue_flush, but at
> a lower level). However it does this without setting the SYNC_NV bit.
> This means that a device with a non-volatile cache will be required --
> needlessly -- to flush that cache to media.

Yeah, it probably needs updating but some devices might react badly too.

> So: some questions to help encourage response:
> 
>  - Is the above substantially correct?  Totally correct?

Re: Kernel 2.6.20.4: Software RAID 5: ata13.00: (irq_stat 0x00020002, failed to transmit command FIS)

2007-04-09 Thread Tejun Heo
Justin Piszcz wrote:
> 
> 
> On Thu, 5 Apr 2007, Justin Piszcz wrote:
> 
>> Had a quick question; this is the first time I have seen this happen,
>> and it was not even during heavy I/O.  Hardly anything was going on
>> with the box at the time.
> 
> .. snip ..
> 
> # /usr/bin/time badblocks -b 512 -s -v -w /dev/sdl
> Checking for bad blocks in read-write mode
> From block 0 to 293046768
> Testing with pattern 0xaa: done
> Reading and comparing: done
> 
> Not a single bad block on the drive so far.  I have not changed anything
> in the box physically, with the exception of updating the BIOS to V1666
> on an Intel P965 motherboard (DG965WHMKR).  Any idea what happened or why?
> Is it a kernel or an actual HW issue?  What caused this to occur?  Any
> thoughts or ideas?

My bet is on the harddisk firmware acting weird.  You can prove this by
reconnecting the disk to a known working port without cutting power.  Or,
you can prove that the original port works by removing the harddisk and
putting in a different one.  You'll need to issue a manual scan using the
SCSI sysfs node.  It's very rare, but some drives do choke like that.
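
The rescan itself is the same sysfs poke mentioned elsewhere in this thread
(replace hostX with the host the port belongs to):

  # Ask the SCSI layer to rescan all channels/targets/LUNs on that host.
  echo - - - > /sys/class/scsi_host/hostX/scan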

-- 
tejun


Re: 2.6.20.3 AMD64 oops in CFQ code

2007-04-04 Thread Tejun Heo
Lee Revell wrote:
> On 4/4/07, Bill Davidsen <[EMAIL PROTECTED]> wrote:
>> I won't say that's voodoo, but if I ever did it I'd wipe down my
>> keyboard with holy water afterward. ;-)
>>
>> Well, I did save the message in my tricks file, but it sounds like a
>> last ditch effort after something get very wrong.

Which actually is true.  ATA ports failing to reset indicate something
is very wrong.  Either the attached device or the controller is broken
and libata shuts down the port to protect the rest of the system from
it.  The manual scan requests tell libata to give it one more shot and
polling hotplug can do that automatically.  Anyways, this shouldn't
happen unless you have a broken piece of hardware.

> Would it really be an impediment to development if the kernel
> maintainers simply refuse to merge patches that add new sysfs entries
> without corresponding documentation?

SCSI host scan nodes have been there for a long time.  I think it's
documented somewhere.

-- 
tejun


Re: 2.6.20.3 AMD64 oops in CFQ code

2007-04-03 Thread Tejun Heo
[EMAIL PROTECTED] wrote:
> [EMAIL PROTECTED] wrote:
>>> Anyway, what's annoying is that I can't figure out how to bring the
>>> drive back on line without resetting the box.  It's in a hot-swap enclosure,
>>> but power cycling the drive doesn't seem to help.  I thought libata hotplug
>>> was working?  (SiI3132 card, using the sil24 driver.)
> 
>> Yeah, it's working but failing resets are considered highly dangerous
>> (in that the controller status is unknown and may cause something
>> dangerous like screaming interrupts) and the port is muted after that.
>> The plan is to handle this with polling hotplug such that libata tries to
>> revive the port if a PHY status change is detected by polling.  Patches
>> are available but they need other things to be resolved before they can
>> be integrated.  I think it'll happen before the summer.
> 
>> Anyways, you can tell libata to retry the port by manually telling it to
>> rescan the port (echo - - - > /sys/class/scsi_host/hostX/scan).
> 
> Ah, thank you!  I have to admit, that is at least as mysterious as any
> Microsoft registry tweak.

Polling hotplug should fix this.  I thought I would be able to merge it
much earlier.  I apparently was way too optimistic.  :-(

>>> (H'm... after rebooting, reallocated sectors jumped from 26 to 39.
>>> Something is up with that drive.)
> 
>> Yeap, seems like a broken drive to me.
> 
> Actually, after a few rounds, the reallocated sectors stabilized at 56
> and all is working well again.  It's like there was a major problem with
> error handling.
> 
> The problem is that I don't know where the blame lies.

I'm pretty sure it's the firmware's fault.  It's not supposed to go out
for lunch like that even when an internal error occurs.

-- 
tejun


Re: 2.6.20.3 AMD64 oops in CFQ code

2007-04-02 Thread Tejun Heo
[resending.  my mail service was down for more than a week and this
message didn't get delivered.]

[EMAIL PROTECTED] wrote:
> > Anyway, what's annoying is that I can't figure out how to bring the
> > drive back on line without resetting the box.  It's in a hot-swap
enclosure,
> > but power cycling the drive doesn't seem to help.  I thought libata
hotplug
> > was working?  (SiI3132 card, using the sil24 driver.)

Yeah, it's working but failing resets are considered highly dangerous
(in that the controller status is unknown and may cause something
dangerous like screaming interrupts) and the port is muted after that.  The
plan is to handle this with polling hotplug such that libata tries to
revive the port if a PHY status change is detected by polling.  Patches
are available but they need other things to be resolved before they can be
integrated.  I think it'll happen before the summer.

Anyways, you can tell libata to retry the port by manually telling it to
rescan the port (echo - - - > /sys/class/scsi_host/hostX/scan).

> > (H'm... after rebooting, reallocated sectors jumped from 26 to 39.
> > Something is up with that drive.)

Yeap, seems like a broken drive to me.

Thanks.

-- 
tejun


Re: nonzero mismatch_cnt with no earlier error

2007-03-04 Thread Tejun Heo
Eyal Lebedinsky wrote:
> I CC'ed linux-ide to see if they think the reported error was really innocent:
> 
> Question: does this error report suggest that a disk could be corrupted?
> 
> This SATA disk is part of an md raid and no error was reported by md.
> 
> [937567.332751] ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x4190002 action 
> 0x2
> [937567.354094] ata3.00: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/00 tag 0 cdb 
> 0x0 data 512 in
> [937567.354096]  res 51/04:83:45:00:00/00:00:00:00:00/a0 Emask 0x10 
> (ATA bus error)

Command 0xb0 is SMART.  The device failed some subcommand of SMART, so, no,
it isn't related to data integrity, but your link is reporting a recovered
data transmission error, a PHY-ready status change, and some other
conditions, which make libata EH mark the failure as an ATA bus error.
Care to post the full dmesg?

-- 
tejun


Re: Problem booting linux 2.6.19-rc5, 2.6.19-rc5-git6, 2.6.19-rc5-mm2 with md raid 1 over lvm root

2006-11-15 Thread Tejun Heo

Nicolas Mailhot wrote:

The failing kernels (I tried -rc5, -rc5-git6, -rc5-mm2) only print:

%<
device-mapper: ioctl: 4.7.0-ioctl (2006-06-24) initialised:
[EMAIL PROTECTED]
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
%<-

(I didn't bother copying the rest of the failing kernel dmesg, as sata
initialisation fills the first half of the screen, then dm is initialised,
then you only get the logical consequences of failing to detect the /
volume. The sata part seems fine – it prints the name of the hard drives
we want to use)

I'm attaching the dmesg for the working distro kernel (yes I know not 100%
distro kernel, but very close to one), the distro config, and the config I
used in my test.  If anyone could help me figure out what's wrong I'd be
grateful.


Say 'y' not 'm' to SCSI disk support.

--
tejun


Re: libata hotplug and md raid?

2006-09-13 Thread Tejun Heo

Ric Wheeler wrote:

(Adding Tejun & Greg KH to this thread)

Adding linux-ide to this thread.


Leon Woestenberg wrote:

[--snip--]

In short, I use ext3 over /dev/md0 over 4 SATA drives /dev/sd[a-d]
each driven by libata ahci. I unplug then replug the drive that is
rebuilding in RAID-5.

When I unplug a drive, /dev/sda is removed, hotplug seems to work to
the point where proc/mdstat shows the drive failed, but not removed.


Yeap, that sounds about right.


Every other notion of the drive (in kernel and udev /dev namespace)
seems to be gone after unplugging. I cannot manually removed the drive
using mdadm, because it tells me the drive does not exist.


I see.  That's a problem.  Can you use /dev/.static/dev/sda instead?  If 
you can't find those static nodes, just create one w/ 'mknod 
my-static-sda b 8 0' and use it.
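
Something like the following should work (the major/minor pair 8:0 is sda,
and the array name is just an example):

  # Recreate a node for the vanished disk and use it to drop the dead
  # member from the array.
  mknod /tmp/static-sda b 8 0
  mdadm /dev/md0 --remove /tmp/static-sda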



Replugging the drive brings it back as /dev/sde, md0 will not pick it up.


No, it won't.

I have a similar setup, AHCI + 4 drives but using a RAID-1 group.  The 
thing that you are looking for is "persistent device naming" and should 
work properly if you can tweak udev/hotplug correctly.


I have verified that a drive pull/drive reinsert on a mainline kernel 
with a SLES10 base does provide this (first insertion gives me sdb, pull 
followed by reinsert still is sdb), but have not tested interaction with 
RAID since I am focused on the bad block handling at the moment.  I will 
add this to my list ;-)




The expected behaviour (from me) is that the drive re-appears as 
/dev/sda.


Apart from the persistent naming Ric mentioned above, the reason why you
don't get sda back is that md is holding on to the internal device.  It's
removed from all visible namespaces, but md still holds a reference, so the
device cannot be destroyed.  So, when a new device comes along, sda is
occupied by the dead device and the new one gets the next available slot,
which happens to be sde in your case.



What is the intended behaviour of md in this case?

Should some user-space application fail-remove a drive as a pre-action
of the unplug event from udev, or should md fully remove the drive
within kernel space??


I'm curious too.  Would it be better for md to listen to hotplug events 
and auto-remove dead devices or is it something which belongs to userland?


Thanks.

--
tejun


Re: Test feedback 2.6.17.4+libata-tj-stable (EH, hotplug)

2006-07-10 Thread Tejun Heo

Christian Pernegger wrote:

The fact that the disk had changed minor numbers after it was plugged
back in bugs me a bit. (was sdc before, sde after). Additionally udev
removed the sdc device file, so I had to manually recreate it to be
able to remove the 'faulty' disk from its md array.


That's because md is still holding onto sdc in failed mode.  A hotplug
script which checks whether a removed device is in an md array and, if so,
removes it from the array would solve the problem.  Not sure whether that
would be the correct approach though.
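
A rough sketch of such a script, assuming an mdadm new enough to understand
the 'detached' keyword (everything here is illustrative, not a tested
recipe):

  #!/bin/sh
  # Run from a hotplug/udev remove event: mark members whose device nodes
  # have vanished as faulty and drop them from every array.
  for md in /dev/md*; do
      [ -b "$md" ] || continue
      mdadm "$md" --fail detached --remove detached 2>/dev/null
  done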


Thanks.

--
tejun