Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
Bill Davidsen wrote:
> Jan Engelhardt wrote:
>> On Dec 1 2007 06:26, Justin Piszcz wrote:
>>> I ran the following:
>>>
>>> dd if=/dev/zero of=/dev/sdc
>>> dd if=/dev/zero of=/dev/sdd
>>> dd if=/dev/zero of=/dev/sde
>>>
>>> (as it is always a very good idea to do this with any new disk)
>>
>> Why would you care about what's on the disk? fdisk, mkfs and the
>> day-to-day operation will overwrite it _anyway_. (If you think the
>> disk is not empty, you should look at it and copy off all usable
>> warez beforehand :-)
>
> Do you not test your drives for minimum functionality before using them?

I personally don't.

> Also, if you have the tools to check for relocated sectors before and
> after doing this, that's a good idea as well. S.M.A.R.T is your friend.
> And when writing /dev/zero to a drive, if it craps out you have less
> emotional attachment to the data.

Writing all zeros isn't too useful tho. A drive failing reallocation on write is a catastrophic failure. It means the drive wants to relocate a sector but can't, because it has used up all of its spare space, which usually indicates something else is seriously wrong with the drive. The drive will have to go to the trash can. This is all serious and bad, but the catch is that in such cases the problem usually sticks out like a sore thumb, so either the vendor doesn't ship such a drive or you'll find the failure very early. I personally haven't seen any such failure yet. Maybe I'm lucky.

Most data loss occurs when the drive fails to read what it thought it had written successfully - the opposite case. Reading and dumping the whole disk to /dev/null periodically is probably much better than writing zeros, as it lets the drive find a deteriorating sector early, while it's still readable, and relocate it. But then again, I think that's overkill.

Writing zeros to sectors is more useful as a cure than as prevention. If your drive fails to read a sector, write whatever value to that sector. The drive will forget about the data on the damaged sector, reallocate it, and write the new data. Of course, you lose the data which was originally on the sector.

I personally think it's enough to just throw in an extra disk, make it RAID1 or 5, and rebuild the array if a read fails on one of the disks. If a write fails, or read failures continue, replace the disk. Of course, if you wanna be extra cautious, good for you. :-)

--
tejun
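[For the before-and-after relocation check mentioned above, the SMART reallocation counters are the usual thing to watch. A minimal sketch, with the device name assumed:

  # snapshot the reallocation counters before the write pass
  smartctl -A /dev/sdc | grep -Ei 'Reallocated_Sector_Ct|Reallocated_Event_Count|Current_Pending_Sector'

  # zero-fill, or just read the whole disk back periodically as suggested above
  dd if=/dev/zero of=/dev/sdc bs=1M
  dd if=/dev/sdc of=/dev/null bs=1M

  # compare the same counters afterwards; any growth means the drive is remapping
  smartctl -A /dev/sdc | grep -Ei 'Reallocated_Sector_Ct|Reallocated_Event_Count|Current_Pending_Sector']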
Re: Kernel 2.6.23.9 / P35 Chipset + WD 750GB Drives (reset port)
Justin Piszcz wrote:
> The badblocks did not do anything; however, when I built a software
> RAID 5 and then performed a dd:
>
> /usr/bin/time dd if=/dev/zero of=fill_disk bs=1M
>
> [42332.936615] ata5.00: exception Emask 0x2 SAct 0x7000 SErr 0x0 action 0x2 frozen
> [42332.936706] ata5.00: spurious completions during NCQ issue=0x0 SAct=0x7000 FIS=004040a1:0800
>
> Next test, I will turn off NCQ and try to make the problem re-occur.
> If anyone else has any thoughts here..? I ran long SMART tests on all
> 3 disks; they all completed successfully. Perhaps these drives need to
> be NCQ-blacklisted with the P35 chipset?

That was me being stupid. Patches for both upstream and -stable branches are posted. These will go away.

Thanks.

--
tejun
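[For the "turn off NCQ" test mentioned above, a rough sketch of how to disable NCQ per device at runtime - the device name and the default depth of 31 are assumptions for a typical AHCI setup:

  # dropping the queue depth to 1 effectively disables NCQ for sdc
  echo 1 > /sys/block/sdc/device/queue_depth
  # rerun the workload, then restore the previous depth
  echo 31 > /sys/block/sdc/device/queue_depth]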
Re: Possible data corruption sata_sil24?
David Shaw wrote:
>> I'm not sure whether this is a problem of sata_sil24 or the dm layer.
>> Cc'ing linux-raid for help. How much memory do you have? One big
>> difference between ata_piix and sata_sil24 is that sil24 can handle
>> 64-bit DMA. Maybe DMA mapping or something interacts weirdly with dm
>> there?
>
> The machine has 640 megs of RAM. FWIW, I tried this with 512 megs of
> RAM with the same results. Running Memtest86+ shows the memory is good.

Hmmm... I see, so no DMA-to-the-wrong-address problem then. Let's see whether the dm people can help us out.

Thanks.

--
tejun
Re: Possible data corruption sata_sil24?
David Shaw wrote:
>>> It fails whether I use a raw /dev/sdd or partition it into one large
>>> /dev/sdd1, or partition it into multiple partitions. sata_sil24 seems
>>> to work by itself, as does dm, but as soon as I mix sata_sil24 + dm,
>>> I get corruption.
>>
>> Hmmm... Can you reproduce the corruption by accessing both devices
>> simultaneously without using dm? Considering ich5 does fine, it looks
>> like a hardware and/or driver problem and I really wanna rule out dm.
>
> I think I wasn't clear enough before. The corruption happens when I
> use dm to create two dm mappings that both reside on the same real
> device. Using two different devices, or two different partitions on
> the same physical device, works properly. ich5 does fine with these 3
> tests, but sata_sil24 fails:
>
> * Use /dev/sdd, create 2 dm linear mappings on it, mke2fs and use
>   those dm devices == corruption
> * Partition /dev/sdd into /dev/sdd1 and /dev/sdd2, mke2fs and use
>   those partitions == no corruption
> * Partition /dev/sdd into /dev/sdd1 and /dev/sdd2, create 2 dm linear
>   mappings on /dev/sdd1, mke2fs and use those dm devices == corruption

I'm not sure whether this is a problem of sata_sil24 or the dm layer. Cc'ing linux-raid for help.

How much memory do you have? One big difference between ata_piix and sata_sil24 is that sil24 can handle 64-bit DMA. Maybe DMA mapping or something interacts weirdly with dm there?

Thanks.

--
tejun
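[To reproduce the first (failing) case above, the two linear mappings can be set up with dmsetup along these lines - a sketch only; the disk name and the 1 GiB sizes are assumptions:

  # two 1 GiB (2097152-sector) linear targets back to back on /dev/sdd
  echo "0 2097152 linear /dev/sdd 0"       | dmsetup create test0
  echo "0 2097152 linear /dev/sdd 2097152" | dmsetup create test1
  mke2fs /dev/mapper/test0
  mke2fs /dev/mapper/test1]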
[PATCH] block: cosmetic changes
Cosmetic changes. This is taken from Jens' zero-length barrier patch.

Signed-off-by: Tejun Heo [EMAIL PROTECTED]
Cc: Jens Axboe [EMAIL PROTECTED]
---
 block/ll_rw_blk.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Index: work/block/ll_rw_blk.c
===================================================================
--- work.orig/block/ll_rw_blk.c
+++ work/block/ll_rw_blk.c
@@ -443,7 +443,8 @@ static inline struct request *start_orde
 	rq_init(q, rq);
 	if (bio_data_dir(q->orig_bar_rq->bio) == WRITE)
 		rq->cmd_flags |= REQ_RW;
-	rq->cmd_flags |= q->ordered & QUEUE_ORDERED_FUA ? REQ_FUA : 0;
+	if (q->ordered & QUEUE_ORDERED_FUA)
+		rq->cmd_flags |= REQ_FUA;
 	rq->elevator_private = NULL;
 	rq->elevator_private2 = NULL;
 	init_request_from_bio(rq, q->orig_bar_rq->bio);
@@ -3167,7 +3168,7 @@ end_io:
 		break;
 	}

-	if (unlikely(bio_sectors(bio) > q->max_hw_sectors)) {
+	if (unlikely(nr_sectors > q->max_hw_sectors)) {
 		printk("bio too big device %s (%u > %u)\n",
 		       bdevname(bio->bi_bdev, b),
 		       bio_sectors(bio),
[PATCH] block: factor out bio_check_eod()
End of device check is done twice in __generic_make_request() and it's fully inlined each time. Factor out bio_check_eod().

This is taken from Jens' zero-length barrier patch.

Signed-off-by: Tejun Heo [EMAIL PROTECTED]
Cc: Jens Axboe [EMAIL PROTECTED]
---
 block/ll_rw_blk.c |   63 ++++++++++++++++++++++++++-----------------------
 1 file changed, 33 insertions(+), 30 deletions(-)

Index: work/block/ll_rw_blk.c
===================================================================
--- work.orig/block/ll_rw_blk.c
+++ work/block/ll_rw_blk.c
@@ -3094,6 +3094,35 @@ static inline int should_fail_request(st

 #endif /* CONFIG_FAIL_MAKE_REQUEST */

+/*
+ * Check whether this bio extends beyond the end of the device.
+ */
+static int bio_check_eod(struct bio *bio, unsigned int nr_sectors)
+{
+	sector_t maxsector;
+
+	if (!nr_sectors)
+		return 0;
+
+	/* Test device or partition size, when known. */
+	maxsector = bio->bi_bdev->bd_inode->i_size >> 9;
+	if (maxsector) {
+		sector_t sector = bio->bi_sector;
+
+		if (maxsector < nr_sectors || maxsector - nr_sectors < sector) {
+			/*
+			 * This may well happen - the kernel calls bread()
+			 * without checking the size of the device, e.g., when
+			 * mounting a device.
+			 */
+			handle_bad_sector(bio);
+			return 1;
+		}
+	}
+
+	return 0;
+}
+
 /**
  * generic_make_request: hand a buffer to its device driver for I/O
  * @bio:  The bio describing the location in memory and on the device.
@@ -3121,27 +3150,14 @@ static inline int should_fail_request(st
 static inline void __generic_make_request(struct bio *bio)
 {
 	request_queue_t *q;
-	sector_t maxsector;
 	sector_t old_sector;
 	int ret, nr_sectors = bio_sectors(bio);
 	dev_t old_dev;

 	might_sleep();

-	/* Test device or partition size, when known. */
-	maxsector = bio->bi_bdev->bd_inode->i_size >> 9;
-	if (maxsector) {
-		sector_t sector = bio->bi_sector;
-		if (maxsector < nr_sectors || maxsector - nr_sectors < sector) {
-			/*
-			 * This may well happen - the kernel calls bread()
-			 * without checking the size of the device, e.g., when
-			 * mounting a device.
-			 */
-			handle_bad_sector(bio);
-			goto end_io;
-		}
-	}
+	if (bio_check_eod(bio, nr_sectors))
+		goto end_io;

 	/*
 	 * Resolve the mapping until finished. (drivers are
@@ -3197,21 +3213,8 @@ end_io:
 		old_sector = bio->bi_sector;
 		old_dev = bio->bi_bdev->bd_dev;

-		maxsector = bio->bi_bdev->bd_inode->i_size >> 9;
-		if (maxsector) {
-			sector_t sector = bio->bi_sector;
-
-			if (maxsector < nr_sectors ||
-			    maxsector - nr_sectors < sector) {
-				/*
-				 * This may well happen - partitions are not
-				 * checked to make sure they are within the size
-				 * of the whole device.
-				 */
-				handle_bad_sector(bio);
-				goto end_io;
-			}
-		}
+		if (bio_check_eod(bio, nr_sectors))
+			goto end_io;

 		ret = q->make_request_fn(q, bio);
 	} while (ret);
Re: [PATCH] block: factor out bio_check_eod()
Jens Axboe wrote:
> On Wed, Jul 18 2007, Tejun Heo wrote:
>> End of device check is done twice in __generic_make_request() and
>> it's fully inlined each time. Factor out bio_check_eod().
>
> Tejun, yeah I should separate the cleanups and put them in the
> upstream branch. Will do so and add your signed-off to both of them.

Would they be different from the one I just posted? No big deal either way. I'm just basing the zero-length barrier on top of these patches. Oh well, the changes are trivial anyway.

--
tejun
Re: [PATCH] block: factor out bio_check_eod()
Jens Axboe wrote:
> On Wed, Jul 18 2007, Tejun Heo wrote:
>> Jens Axboe wrote:
>>> On Wed, Jul 18 2007, Tejun Heo wrote:
>>>> End of device check is done twice in __generic_make_request() and
>>>> it's fully inlined each time. Factor out bio_check_eod().
>>>
>>> Tejun, yeah I should separate the cleanups and put them in the
>>> upstream branch. Will do so and add your signed-off to both of them.
>>
>> Would they be different from the one I just posted? No big deal
>> either way. I'm just basing the zero-length barrier on top of these
>> patches. Oh well, the changes are trivial anyway.
>
> This one ended up being the same, but in the first one you missed some
> of the cleanups. I ended up splitting the patch some more though, see
> the series:
>
> http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=barrier

Alright, will base on 662d5c5e6afb79d05db5563205b809c0de530286.

Thanks.

--
tejun
Re: [PATCH] block: factor out bio_check_eod()
Jens Axboe wrote:
> On Wed, Jul 18 2007, Tejun Heo wrote:
>> Jens Axboe wrote:
>>> On Wed, Jul 18 2007, Tejun Heo wrote:
>>>> Jens Axboe wrote:
>>>>> On Wed, Jul 18 2007, Tejun Heo wrote:
>>>>>> End of device check is done twice in __generic_make_request() and
>>>>>> it's fully inlined each time. Factor out bio_check_eod().
>>>>>
>>>>> Tejun, yeah I should separate the cleanups and put them in the
>>>>> upstream branch. Will do so and add your signed-off to both of
>>>>> them.
>>>>
>>>> Would they be different from the one I just posted? No big deal
>>>> either way. I'm just basing the zero-length barrier on top of these
>>>> patches. Oh well, the changes are trivial anyway.
>>>
>>> This one ended up being the same, but in the first one you missed
>>> some of the cleanups. I ended up splitting the patch some more
>>> though, see the series:
>>>
>>> http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=barrier
>>
>> Alright, will base on 662d5c5e6afb79d05db5563205b809c0de530286.
>> Thanks.
>
> 1781c6a39fb6e31836557618c4505f5f7bc61605, no? Unless you want to
> rewrite it completely :-)

I think I'll start from 662d5c5e and steal most parts from 1781c6a3. I like stealing, you know. :-)

I think 1781c6a3 also can use splitting - zero-length barrier implementation and issue_flush conversion.

Anyways, how do I pull from git.kernel.dk? git://git.kernel.dk/linux-2.6-block.git gives me "connection reset by server".

Thanks.

--
tejun
Re: [PATCH] block: factor out bio_check_eod()
Jens Axboe wrote:
> somewhat annoying, I'll see if I can prefix it with git-daemon in the
> future. OK, now skip the /data/git/ stuff and just use
>
> git://git.kernel.dk/linux-2.6-block.git

Alright, it works like a charm now.

Thanks.

--
tejun
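[For reference, fetching the barrier branch from that URL would look roughly like this - branch and directory names are assumptions:

  git clone git://git.kernel.dk/linux-2.6-block.git
  cd linux-2.6-block
  git checkout -b barrier origin/barrier]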
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
[EMAIL PROTECTED] wrote:
> On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said:
>> All of the high end arrays have non-volatile cache (read: on power
>> loss, it is a promise that it will get all of your data out to
>> permanent storage). You don't need to ask this kind of array to drain
>> the cache. In fact, it might just ignore you if you send it that kind
>> of request ;-)
>
> OK, I'll bite - how does the kernel know whether the other end of that
> Fibre Channel cable is attached to a DMX-3 or to some no-name product
> that may not have the same assurances? Is there an "I'm a high-end
> array" bit in the sense data that I'm unaware of?

Well, the array just has to tell the kernel that it doesn't do write back caching. The kernel automatically selects ORDERED_DRAIN in such a case.

--
tejun
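[What the kernel keys off is simply the cache setting the target reports; a sketch of how to inspect it from userspace - sdparm availability and the device name are assumptions:

  # WCE=0 means the target claims it does no write-back caching ...
  sdparm --get=WCE /dev/sdb
  # ... in which case sd reports "write through" here and uses drain-only ordering
  cat /sys/class/scsi_disk/*/cache_type]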
Re: Linux Software RAID is really RAID?
Mark Lord wrote:
> I believe he said it was ICH5 (different post/thread). My observation
> on ICH5 is that if one unplugs a drive, then the chipset/cpu locks up
> hard when toggling SRST in the EH code. Specifically, it locks up at
> the instruction which restores SRST back to the non-asserted state,
> which likely corresponds to the chipset finally actually sending a FIS
> to the drive. A hard(ware) lockup, not software. That's why Intel says
> ICH5 doesn't do hotplug.

OIC. I don't think there's much left to do from the driver side then. Or is there any workaround?

--
tejun
Re: [linux-lvm] 2.6.22-rc5 XFS fails after hibernate/resume
David Greaves wrote:
> Tejun Heo wrote:
>> It's really weird tho. The PHY RDY status changed events are coming
>> from the device which is NOT used while resuming.
>
> There is an obvious problem there though, Tejun (the errors even when
> sda isn't involved in the OS boot) - can I start another thread about
> that issue/bug later? I need to reshuffle partitions, so I'd rather
> get the hibernate working first and then go back to it, if that's OK?

Yeah, sure. The problem is that we don't know whether or how those two are related. It would be great if there's a way to verify that the memory image read back from hibernation is intact. Rafael, any ideas?

Thanks.

--
tejun
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Hello,

Jens Axboe wrote:
>> Would that be very different from issuing a barrier and not waiting
>> for its completion? For ATA and SCSI, we'll have to flush the write
>> back cache anyway, so I don't see how we can get a performance
>> advantage by implementing a separate WRITE_ORDERED. I think the
>> zero-length barrier (haven't looked at the code yet, still recovering
>> from jet lag :-) can serve as a genuine barrier without the extra
>> write tho.
>
> As always, it depends :-)
>
> If you are doing pure flush barriers, then there's no difference.
> Unless you only guarantee ordering wrt previously submitted requests,
> in which case you can eliminate the post flush.
>
> If you are doing ordered tags, then just setting the ordered bit is
> enough. That is different from the barrier in that we don't need a
> flush or FUA bit set.

Hmmm... I'm feeling dense. A zero-length barrier also requires only one flush to separate requests before and after it (haven't looked at the code yet, will soon). Can you enlighten me?

Thanks.

--
tejun
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
[ cc'ing Ric Wheeler for the storage array thingie. Hi, the whole thread is at
  http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/3344 ]

Hello,

[EMAIL PROTECTED] wrote:
> but when you consider the self-contained disk arrays it's an entirely
> different story. you can easily have a few gig of cache and a complete
> OS pretending to be a single drive as far as you are concerned. and
> the price of such devices is plummeting (in large part thanks to Linux
> moving into this space), you can now readily buy a 10TB array for $10k
> that looks like a single drive.

Don't those thingies usually have NV cache or are battery-backed, such that ORDERED_DRAIN is enough?

The problem is that the interface between the host and a storage device (ATA or SCSI) is not built to communicate that kind of information (grouped flush, relaxed ordering...). I think battery-backed ORDERED_DRAIN combined with fine-grained host queue flush would be pretty good. It doesn't require some fancy new interface which isn't gonna be used widely anyway, and it can achieve most of the performance gain if the storage plays it smart.

Thanks.

--
tejun
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Jens Axboe wrote:
> On Thu, May 31 2007, David Chinner wrote:
>> On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
>>> On Thu, May 31 2007, David Chinner wrote:
>>>> IOWs, there are two parts to the problem:
>>>>
>>>> 1 - guaranteeing I/O ordering
>>>> 2 - guaranteeing blocks are on persistent storage.
>>>>
>>>> Right now, a single barrier I/O is used to provide both of these
>>>> guarantees. In most cases, all we really need to provide is 1); the
>>>> need for 2) is a much rarer condition but still needs to be
>>>> provided.
>>>>
>>>>> if I am understanding it correctly, the big win for barriers is
>>>>> that you do NOT have to stop and wait until the data is on
>>>>> persistent media before you can continue.
>>>>
>>>> Yes, if we define a barrier to only guarantee 1), then yes this
>>>> would be a big win (esp. for XFS). But that requires all filesystems
>>>> to handle sync writes differently, and sync_blockdev() needs to call
>>>> blkdev_issue_flush() as well.
>>>>
>>>> So, what do we do here? Do we define a barrier I/O to only provide
>>>> ordering, or do we define it to also provide persistent storage
>>>> writeback? Whatever we decide, it needs to be documented.
>>>
>>> The block layer already has a notion of the two types of barriers,
>>> with a very small amount of tweaking we could expose that. There's
>>> absolutely zero reason we can't easily support both types of
>>> barriers.
>>
>> That sounds like a good idea - we can leave the existing WRITE_BARRIER
>> behaviour unchanged and introduce a new WRITE_ORDERED behaviour that
>> only guarantees ordering. The filesystem can then choose which to use
>> where appropriate.
>
> Precisely. The current definition of barriers are what Chris and I
> came up with many years ago, when solving the problem for reiserfs
> originally. It is by no means the only feasible approach.
>
> I'll add a WRITE_ORDERED command to the #barrier branch, it already
> contains the empty-bio barrier support I posted yesterday (well, a
> slightly modified and cleaned up version).

Would that be very different from issuing a barrier and not waiting for its completion? For ATA and SCSI, we'll have to flush the write back cache anyway, so I don't see how we can get a performance advantage by implementing a separate WRITE_ORDERED. I think the zero-length barrier (haven't looked at the code yet, still recovering from jet lag :-) can serve as a genuine barrier without the extra write tho.

Thanks.

--
tejun
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Stefan Bader wrote:
> 2007/5/30, Phillip Susi [EMAIL PROTECTED]:
>> Stefan Bader wrote:
>>> Since drive a supports barrier requests we don't get -EOPNOTSUPP,
>>> but the request with block y might get written before block x since
>>> the disks are independent. I guess the chances of this are quite low
>>> since at some point a barrier request will also hit drive b, but for
>>> the time being it might be better to indicate -EOPNOTSUPP right from
>>> device-mapper.
>>
>> The device mapper needs to ensure that ALL underlying devices get a
>> barrier request when one comes down from above, even if it has to
>> construct zero length barriers to send to most of them. And somehow
>> also make sure all of the barriers have been processed before
>> returning the barrier that came in.
>
> Plus it would have to queue all mapping requests until the barrier is
> done (if strictly acting according to barrier.txt). But I am wondering
> a bit whether the requirements on barriers are really as tight as
> described in Tejun's document (a barrier request is only started if
> everything before it is safe, the barrier itself isn't returned until
> it is safe, too, and all requests after the barrier aren't started
> before the barrier is done). Is it really necessary to defer any
> further requests until the barrier has been written to safe storage?
> Or would it be sufficient to guarantee that, if a barrier request
> returns, everything up to (and including) the barrier is on safe
> storage?

Well, what's described in barrier.txt is the currently implemented semantics and what filesystems expect, so we can't change it underneath them, but we definitely can introduce new, more relaxed variants. One thing we should bear in mind is that harddisks don't have humongous caches or a very smart controller / instruction set. No matter how relaxed an interface the block layer provides, in the end it just has to issue a whole-sale FLUSH CACHE on the device to guarantee data ordering on the media. IMHO, we can do better by paying more attention to how we do things in the request queue, which can be deeper and more intelligent than the device queue.

Thanks.

--
tejun
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Hello, Neil Brown.

Please cc me on blkdev barriers and, if you haven't yet, reading Documentation/block/barrier.txt can be helpful too.

Neil Brown wrote:
[--snip--]
> 1/ SAFE. With a SAFE device, there is no write-behind cache, or if
>    there is it is non-volatile. Once a write completes it is
>    completely safe. Such a device does not require barriers or
>    ->issue_flush_fn, and can respond to them either by a no-op or with
>    -EOPNOTSUPP (the former is preferred).
>
> 2/ FLUSHABLE. A FLUSHABLE device may have a volatile write-behind
>    cache. This cache can be flushed with a call to blkdev_issue_flush.
>    It may not support barrier requests.
>
> 3/ BARRIER. A BARRIER device supports both blkdev_issue_flush and
>    BIO_RW_BARRIER. Either may be used to synchronise any write-behind
>    cache to non-volatile storage (media).
>
> Handling of SAFE and FLUSHABLE devices is essentially the same and can
> work on a BARRIER device. The BARRIER device has the option of more
> efficient handling.

Actually, all of the above three are handled by the blkdev flush code.

> How does a filesystem use this?
> ===============================
[--snip--]
> 2/ Set the BIO_RW_BARRIER bit in the write request for the commit
>    block. (This is more efficient on BARRIER).

This really should be enough.

> HOW DO MD or DM USE THIS
>
> 1/ striping devices. This includes md/raid0 md/linear dm-linear
>    dm-stripe and probably others.
>
>    These devices can easily support blkdev_issue_flush by simply
>    calling blkdev_issue_flush on all component devices.
>
>    These devices would find it very hard to support BIO_RW_BARRIER.
>    Doing this would require keeping track of all in-flight requests
>    (which some, possibly all, of the above don't) and then:
>      When a BIO_RW_BARRIER request arrives:
>        wait for all pending writes to complete
>        call blkdev_issue_flush on all devices
>        issue the barrier write to the target device(s) as
>          BIO_RW_BARRIER
>        if that is -EOPNOTSUP, re-issue, wait, flush.

Hmm... What do you think about introducing a zero-length BIO_RW_BARRIER for this case?

> 2/ Mirror devices. This includes md/raid1 and dm-raid1.
>
>    These devices can trivially implement blkdev_issue_flush much like
>    the striping devices, and can support BIO_RW_BARRIER to some
>    extent. md/raid1 currently tries. I'm not sure about dm-raid1.
>
>    md/raid1 determines if the underlying devices can handle
>    BIO_RW_BARRIER. If any cannot, it rejects such requests (EOPNOTSUP)
>    itself. If all underlying devices do appear to support barriers,
>    md/raid1 will pass a barrier-write down to all devices.
>
>    The difficulty comes if it fails on one device, but not all
>    devices. In this case it is not clear what to do. Failing the
>    request is a lie, because some data has been written (possibly too
>    early). Succeeding the request (after re-submitting the failed
>    requests) is also a lie as the barrier wasn't really honoured.
>    md/raid1 currently takes the latter approach, but will only do it
>    once - after that it fails all barrier requests.
>
>    Hopefully this is unlikely to happen. What device would work
>    correctly with barriers once, and then not the next time? The
>    answer is md/raid1. If you remove a failed device and add a new
>    device that doesn't support barriers, md/raid1 will notice and stop
>    supporting barriers. If md/raid1 can change from supporting
>    barriers to not, then maybe some other device could too? I'm not
>    sure what to do about this - maybe just ignore it...

That sounds good. :-)

> 3/ Other modules
>
>    Other md and dm modules (raid5, mpath, crypt) do not add anything
>    interesting to the above. Either handling BIO_RW_BARRIER is
>    trivial, or extremely difficult.
>
> HOW DO LOW LEVEL DEVICES HANDLE THIS
>
>    This is part of the picture that I haven't explored greatly. My
>    feeling is that most if not all devices support blkdev_issue_flush
>    properly, and support barriers reasonably well providing that the
>    hardware does.
>
>    There is an exception I recently found though. For devices that
>    don't support QUEUE_ORDERED_TAG (i.e. commands sent to the
>    controller can be tagged as barriers), SCSI will use the
>    SYNCHRONIZE_CACHE command to flush the cache after the barrier
>    request (a bit like the filesystem calling blkdev_issue_flush, but
>    at a lower level). However it does this without setting the SYNC_NV
>    bit. This means that a device with a non-volatile cache will be
>    required -- needlessly -- to flush that cache to media.

Yeah, it probably needs updating, but some devices might react badly too.

> So: some questions to help encourage response:
>
> - Is the above substantially correct? Totally correct?
> - Should the various filesystems be fixed as suggested above? Is
>   someone willing to do
Re: Kernel 2.6.20.4: Software RAID 5: ata13.00: (irq_stat 0x00020002, failed to transmit command FIS)
Justin Piszcz wrote:
> On Thu, 5 Apr 2007, Justin Piszcz wrote:
>> Had a quick question, this is the first time I have seen this happen,
>> and it was not even during heavy I/O; hardly anything was going on
>> with the box at the time.
>
> .. snip ..
>
> # /usr/bin/time badblocks -b 512 -s -v -w /dev/sdl
> Checking for bad blocks in read-write mode
> From block 0 to 293046768
> Testing with pattern 0xaa: done
> Reading and comparing: done
>
> Not a single bad block on the drive so far. I have not changed
> anything in the box physically, with the exception of the BIOS version
> to V1666 for an Intel P965 motherboard (DG965WHMKR).
>
> Any idea what or why this happened? Is it a kernel or actual HW issue?
> What caused this to occur? Any thoughts or ideas?

My bet is on harddisk firmware acting weird. You can prove this by reconnecting the disk to a known working port without cutting power. Or, you can prove that the original port works by removing the harddisk and putting in a different one. You'll need to issue a manual scan using the SCSI sysfs node.

It's very rare but some drives do choke like that.

--
tejun
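[The manual scan goes through the SCSI host's sysfs scan node. For example - the host number is an assumption; check dmesg or /sys/class/scsi_host for the right one:

  # rescan all channels/targets/LUNs on host4 for newly attached devices
  echo "- - -" > /sys/class/scsi_host/host4/scan]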
Re: 2.6.20.3 AMD64 oops in CFQ code
Lee Revell wrote:
> On 4/4/07, Bill Davidsen [EMAIL PROTECTED] wrote:
>> I won't say that's voodoo, but if I ever did it I'd wipe down my
>> keyboard with holy water afterward. ;-) Well, I did save the message
>> in my tricks file, but it sounds like a last ditch effort after
>> something went very wrong.

Which actually is true. ATA ports failing to reset indicate something is very wrong. Either the attached device or the controller is broken, and libata shuts down the port to protect the rest of the system from it. The manual scan requests tell libata to give it one more shot, and polling hotplug can do that automatically. Anyways, this shouldn't happen unless you have a broken piece of hardware.

> Would it really be an impediment to development if the kernel
> maintainers simply refused to merge patches that add new sysfs entries
> without corresponding documentation?

SCSI host scan nodes have been there for a long time. I think it's documented somewhere.

--
tejun
Re: 2.6.20.3 AMD64 oops in CFQ code
[EMAIL PROTECTED] wrote:
>>> Anyway, what's annoying is that I can't figure out how to bring the
>>> drive back on line without resetting the box. It's in a hot-swap
>>> enclosure, but power cycling the drive doesn't seem to help. I
>>> thought libata hotplug was working? (SiI3132 card, using the sil24
>>> driver.)
>>
>> Yeah, it's working, but failing resets are considered highly
>> dangerous (in that the controller status is unknown and may cause
>> something dangerous like screaming interrupts) and the port is muted
>> after that. The plan is to handle this with polling hotplug such that
>> libata tries to revive the port if a PHY status change is detected by
>> polling. Patches are available but they need other things to be
>> resolved to get integrated. I think it'll happen before the summer.
>>
>> Anyways, you can tell libata to retry the port by manually telling it
>> to rescan the port (echo "- - -" > /sys/class/scsi_host/hostX/scan).
>
> Ah, thank you! I have to admit, that is at least as mysterious as any
> Microsoft registry tweak.

Polling hotplug should fix this. I thought I would be able to merge it much earlier. I apparently was way too optimistic. :-(

>>> (H'm... after rebooting, reallocated sectors jumped from 26 to 39.
>>> Something is up with that drive.)
>>
>> Yeap, seems like a broken drive to me.
>
> Actually, after a few rounds, the reallocated sectors stabilized at 56
> and all is working well again. It's like there was a major problem
> with error handling.

The problem is that I don't know where the blame lies. I'm pretty sure it's the firmware's fault. It's not supposed to go out for lunch like that even when an internal error occurs.

--
tejun
Re: 2.6.20.3 AMD64 oops in CFQ code
[resending. my mail service was down for more than a week and this message didn't get delivered.]

[EMAIL PROTECTED] wrote:
> Anyway, what's annoying is that I can't figure out how to bring the
> drive back on line without resetting the box. It's in a hot-swap
> enclosure, but power cycling the drive doesn't seem to help. I thought
> libata hotplug was working? (SiI3132 card, using the sil24 driver.)

Yeah, it's working, but failing resets are considered highly dangerous (in that the controller status is unknown and may cause something dangerous like screaming interrupts) and the port is muted after that. The plan is to handle this with polling hotplug such that libata tries to revive the port if a PHY status change is detected by polling. Patches are available but they need other things to be resolved to get integrated. I think it'll happen before the summer.

Anyways, you can tell libata to retry the port by manually telling it to rescan the port (echo "- - -" > /sys/class/scsi_host/hostX/scan).

> (H'm... after rebooting, reallocated sectors jumped from 26 to 39.
> Something is up with that drive.)

Yeap, seems like a broken drive to me.

Thanks.

--
tejun
Re: Problem booting linux 2.6.19-rc5, 2.6.19-rc5-git6, 2.6.19-rc5-mm2 with md raid 1 over lvm root
Nicolas Mailhot wrote:
> The failing kernels (I tried -rc5, -rc5-git6, -rc5-mm2) only print:
>
> %----
> device-mapper: ioctl: 4.7.0-ioctl (2006-06-24) initialised: [EMAIL PROTECTED]
> md: Autodetecting RAID arrays.
> md: autorun ...
> md: ... autorun DONE.
> %----
>
> (I didn't bother copying the rest of the failing kernel dmesg, as sata
> initialisation fills the first half of the screen, then dm is
> initialised, then you only get the logical consequences of failing to
> detect the / volume. The sata part seems fine – it prints the names of
> the hard drives we want to use.)
>
> I'm attaching the dmesg for the working distro kernel (yes, I know,
> not 100% a distro kernel, but very close to one), the distro config,
> and the config I used in my test. If anyone could help me figure out
> what's wrong I'd be grateful.

Say 'y' not 'm' to SCSI disk support.

--
tejun
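[In .config terms that means building the sd driver in rather than as a module, i.e. something like:

  CONFIG_SCSI=y
  # SCSI disk support built in, not =m
  CONFIG_BLK_DEV_SD=y]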
Re: Test feedback 2.6.17.4+libata-tj-stable (EH, hotplug)
Christian Pernegger wrote:
> The fact that the disk had changed minor numbers after it was plugged
> back in bugs me a bit (it was sdc before, sde after). Additionally,
> udev removed the sdc device file, so I had to manually recreate it to
> be able to remove the 'faulty' disk from its md array.

That's because md is still holding onto sdc in failed mode. A hotplug script which checks whether a removed device is in an md array and, if so, removes it from the array will solve the problem. Not sure whether that would be the correct approach though.

Thanks.

--
tejun
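[A rough sketch of such a hotplug script, assuming it is called from a udev remove rule with the disk's kernel name (e.g. "sdc") as its argument - the matching logic and paths are assumptions, not a tested recipe:

  #!/bin/sh
  # $1 is the kernel name of the disk that just went away, e.g. "sdc".
  DEV="$1"
  # Find every md member (whole disk or partition) on that disk and kick it
  # out so the slot is free when the disk reappears under a new name.
  awk -v d="^$DEV" '/^md/ { for (i = 5; i <= NF; i++) if ($i ~ d) { split($i, a, "["); print $1, a[1] } }' /proc/mdstat |
  while read md member; do
      # If udev has already deleted the node, it may need to be recreated
      # (mknod) or a newer mdadm's "detached" keyword used instead.
      mdadm "/dev/$md" --fail "/dev/$member"
      mdadm "/dev/$md" --remove "/dev/$member"
  done]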