Re: end to end error recovery musings
James Bottomley wrote:
> On Wed, 2007-02-28 at 17:28 -0800, H. Peter Anvin wrote:
>> James Bottomley wrote:
>>> On Wed, 2007-02-28 at 12:42 -0500, Martin K. Petersen wrote:
>>>> 4104. It's 8 bytes per hardware sector. At least for T10...
>>>
>>> Er ... that won't look good to the 512 ATA compatibility remapping ...
>>
>> Well, in that case you'd only see 8x512 data bytes, no metadata...
>
> i.e. no support for block guard in the 512 byte sector emulation mode ...

That makes sense, though... if the raw sector size is 4096 bytes, that
metadata would presumably not exist on a per-sector basis.

	-hpa
Re: end to end error recovery musings
On Wed, 2007-02-28 at 17:28 -0800, H. Peter Anvin wrote:
> James Bottomley wrote:
> > On Wed, 2007-02-28 at 12:42 -0500, Martin K. Petersen wrote:
> >> 4104. It's 8 bytes per hardware sector. At least for T10...
> >
> > Er ... that won't look good to the 512 ATA compatibility remapping ...
>
> Well, in that case you'd only see 8x512 data bytes, no metadata...

i.e. no support for block guard in the 512 byte sector emulation mode ...

James
Re: end to end error recovery musings
James Bottomley wrote:
> On Wed, 2007-02-28 at 12:42 -0500, Martin K. Petersen wrote:
>> 4104. It's 8 bytes per hardware sector. At least for T10...
>
> Er ... that won't look good to the 512 ATA compatibility remapping ...

Well, in that case you'd only see 8x512 data bytes, no metadata...

	-hpa
Re: end to end error recovery musings
On Wed, 2007-02-28 at 12:42 -0500, Martin K. Petersen wrote:
> 4104. It's 8 bytes per hardware sector. At least for T10...

Er ... that won't look good to the 512 ATA compatibility remapping ...

James
Re: end to end error recovery musings
> "James" == James Bottomley <[EMAIL PROTECTED]> writes: James> However, I could see the SATA manufacturers selling capacity at James> 512 (or the new 4096) sectors but allowing their OEMs to James> reformat them 520 (or 4160) 4104. It's 8 bytes per hardware sector. At least for T10... -- Martin K. Petersen Oracle Linux Engineering - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: end to end error recovery musings
On Wed, 2007-02-28 at 12:16 -0500, Martin K. Petersen wrote:
> It's cool that it's on the radar in terms of the protocol.
>
> That doesn't mean that drive manufacturers are going to implement it,
> though. The ones I've talked to were unwilling to sacrifice capacity
> because that's the main competitive factor in the SATA/consumer space.
>
> Maybe we'll see it in the nearline product ranges? That would be a
> good start...

They wouldn't necessarily have to sacrifice capacity per se. The
current problem is that, unlike SCSI disks, you can't seem to reformat
SATA ones to arbitrary sector sizes. However, I could see the SATA
manufacturers selling capacity at 512 (or the new 4096) sectors but
allowing their OEMs to reformat them to 520 (or 4160) and then
implementing block guard on top of this. The OEMs who did this would
obviously lose 1.6% of the capacity, but that would be their choice ...

James
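(To check that figure: the 8 extra bytes per sector are 8/512 = 1.5625%
of the data payload, or 8/520 = 1.54% of the reformatted sector, so the
quoted 1.6% is the right ballpark either way.)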
Re: end to end error recovery musings
> "Doug" == Douglas Gilbert <[EMAIL PROTECTED]> writes: Doug> Work on SAT-2 is now underway and one of the agenda items is Doug> "end to end data protection" and is in the hands of the t13 Doug> ATA8-ACS technical editor. So it looks like data integrity is on Doug> the radar in the SATA world. It's cool that it's on the radar in terms of the protocol. That doesn't mean that drive manufacturers are going to implement it, though. The ones I've talked to were unwilling to sacrifice capacity because that's the main competitive factor in the SATA/consumer space. Maybe we'll see it in the nearline product ranges? That would be a good start... -- Martin K. Petersen http://mkp.net/ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: end to end error recovery musings
On Tuesday, February 27, 2007 12:07 PM, Martin K. Petersen wrote:
>
> Not sure you're up-to-date on the T10 data integrity feature.
> Essentially it's an extension of the 520 byte sectors common in disk
> arrays. For each 512 byte sector (or 4KB ditto) you get 8 bytes of
> protection data. There's a 2 byte CRC (GUARD tag), a 2 byte
> user-defined tag (APP) and a 4-byte reference tag (REF). Depending on
> how the drive is formatted, the REF tag usually needs to match the
> lower 32-bits of the target sector #.
>

From the SCSI LLD perspective, all we need is 32-byte CDB support and a
mechanism to pass the tags down from above. It appears our driver to
firmware interface only provides the reference and application tags.
The guard tag is not present, so I guess the mpt fusion controller
firmware is setting it (I will have to check with others). I assume
that for transfers greater than a sector, the controller firmware
updates the tags for all the other sectors within the boundary. I
assume the flags tell whether EEDP is enabled or not. I will have to
check whether there are manufacturing pages that say whether the
controller is capable of EEDP (as not all our controllers support it).

Here are the EEDP associated fields we provide in our scsi passthru, as
well as target assist:

	u32 SecondaryReferenceTag
	u16 SecondaryApplicationTag
	u16 EEDPFlags
	u16 ApplicationTagTranslationMask
	u32 EEDPBlockSize
Re: end to end error recovery musings
Martin K. Petersen wrote:
>> "Alan" == Alan <[EMAIL PROTECTED]> writes:
>
>>> Not sure you're up-to-date on the T10 data integrity feature.
>>> Essentially it's an extension of the 520 byte sectors common in
>>> disk
>
> [...]
>
> Alan> but here's a minor bit of passing bad news - quite a few older
> Alan> ATA controllers can't issue DMA transfers that are not a
> Alan> multiple of 512 bytes without crapping themselves (eg
> Alan> READ_LONG). Guess we may need to add
> Alan> ap->i_do_not_suck or similar 8)
>
> I'm afraid it stops even before you get that far. There doesn't seem
> to be any interest in adopting the Data Integrity Feature (or anything
> similar) in the ATA camp. So for now it's a SCSI-only thing.
>
> I encourage people to lean on their favorite disk manufacturer. This
> would be a great feature to have on SATA too...

Martin,
SCSI to ATA Translation (SAT) is now a standard (ANSI INCITS 431-2007)
[and libata is somewhat short of compliance]. Work on SAT-2 is now
underway and one of the agenda items is "end to end data protection",
which is in the hands of the t13 ATA8-ACS technical editor. So it
looks like data integrity is on the radar in the SATA world.

See http://www.t10.org/ftp/t10/document.06/06-497r4.pdf for more
evidence of how SAS and SATA are converging at the command and feature
set level.

Doug Gilbert
Re: end to end error recovery musings
> "Alan" == Alan <[EMAIL PROTECTED]> writes: >> Not sure you're up-to-date on the T10 data integrity feature. >> Essentially it's an extension of the 520 byte sectors common in >> disk [...] Alan> but here's a minor bit of passing bad news - quite a few older Alan> ATA controllers can't issue DMA transfers that are not a Alan> multiple of 512 bytes without crapping themselves (eg Alan> READ_LONG). Guess we may need to add Alan> ap-> i_do_not_suck or similar 8) I'm afraid it stops even before you get that far. There doesn't seem to be any interest in adopting the Data Integrity Feature (or anything similar) in the ATA camp. So for now it's a SCSI-only thing. I encourage people to lean on their favorite disk manufacturer. This would be a great feature to have on SATA too... -- Martin K. Petersen Oracle Linux Engineering - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: end to end error recovery musings
> Not sure you're up-to-date on the T10 data integrity feature.
> Essentially it's an extension of the 520 byte sectors common in disk

I saw the basics but not the detail. Thanks for the explanation; it
was most helpful and promises to fix a few things for some
controllers... but here's a minor bit of passing bad news - quite a few
older ATA controllers can't issue DMA transfers that are not a multiple
of 512 bytes without crapping themselves (eg READ_LONG). Guess we may
need to add ap->i_do_not_suck or similar 8)

On the bright side, I believe the Intel ICH is the only one with this
problem (and a workaround) that is SATA capable 8)

Alan
Re: end to end error recovery musings
> "Alan" == Alan <[EMAIL PROTECTED]> writes: >> These features make the most sense in terms of WRITE. Disks >> already have plenty of CRC on the data so if a READ fails on a >> regular drive we already know about it. Alan> Don't bet on it. This is why I mentioned that I want to expose the protection data to the host. As written, DIF only protects the path between initiator and target. See below... Alan> If you want to do this seriously you need an end to end (media Alan> to host ram) checksum. We do see bizarre and quite evil things Alan> happen to people occasionally because they rely on bus level Alan> protection - both faulty network cards and faulty disk or Alan> controller RAM can cause very bad things to happen in a critical Alan> environment and are very very hard to detect and test for. Not sure you're up-to-date on the T10 data integrity feature. Essentially it's an extension of the 520 byte sectors common in disk arrays. For each 512 byte sector (or 4KB ditto) you get 8 bytes of protection data. There's a 2 byte CRC (GUARD tag), a 2 byte user-defined tag (APP) and a 4-byte reference tag (REF). Depending on how the drive is formatted, the REF tag usually needs to match the lower 32-bits of the target sector #. For each sector coming in the disk firmware verifies that the CRC and the reference tags are in accordance with the contents of the sector and the CDB start sector + offset. If they don't match the drive will reject the request. If an HBA is capable of exposing the protection tuples to the host we can precalculate the checksum and the LBA when submitting a WRITE. My current proposal involves passing them down in two separate buffers to minimize the risk of in-memory corruption (Besides, it would suck if you had to interleave data and protection data. The scatterlists would become long and twisted). And that's when the READ case becomes interesting. Because then the fs can verify that the checksum of the in-buffer matches of the GUARD tag. In that case we'll know there's been no corruption in the middle. And of course this also opens up using the APP field to tag sector contents. -- Martin K. Petersen Oracle Linux Engineering - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: end to end error recovery musings
Martin K. Petersen wrote: "Eric" == Moore, Eric <[EMAIL PROTECTED]> writes: Eric> Martin K. Petersen on Data Intergrity Feature, which is also Eric> called EEDP(End to End Data Protection), which he presented some Eric> ideas/suggestions of adding an API in linux for this. T10 DIF is interesting for a few things: - Ensuring that the data integrity is preserved when writing a buffer to disk - Ensuring that the write ends up on the right hardware sector These features make the most sense in terms of WRITE. Disks already have plenty of CRC on the data so if a READ fails on a regular drive we already know about it. There are paths through a read that could still benefit from the extra data integrity. The CRC gets validated on the physical sector, but we don't have the same level of strict data checking once it is read into the disk's write cache or being transferred out of cache on the way to the transport... We can, however, leverage DIF with my proposal to expose the protection data to host memory. This will allow us to verify the data integrity information before passing it to the filesystem or application. We can say "this is really the information the disk sent. It hasn't been mangled along the way". And by using the APP tag we can mark a sector as - say - metadata or data to ease putting the recovery puzzle back together. It would be great if the app tag was more than 16 bits. Ted mentioned that ideally he'd like to store the inode number in the app tag. But as it stands there isn't room. In any case this is all slightly orthogonal to Ric's original post about finding the right persistence heuristics in the error handling path... Still all a very relevant discussion - I agree that we could really use more than just 16 bits... ric - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: end to end error recovery musings
On Feb 27, 2007 19:02 +0000, Alan wrote:
> > It would be great if the app tag was more than 16 bits. Ted
> > mentioned that ideally he'd like to store the inode number in the
> > app tag. But as it stands there isn't room.
>
> The lowest few bits are the most important with ext2/ext3 because you
> normally lose a sector of inodes, which means you've got dangly bits
> associated with a sequence of inodes sharing the same upper bits.
> More problematic is losing indirect blocks, and being able to keep
> some kind of [inode low bits/block index] would help put stuff back
> together.

In the ext4 extents format there is the ability (not implemented yet)
to add some extra information into the extent index blocks (previously
referred to as the ext3_extent_tail). This is planned to be a checksum
of the index block, and a back-pointer to the inode which is using this
extent block. This allows online detection of corrupt index blocks,
and also detection of an index block that is written to the wrong
location.

There is as yet no plan that I'm aware of to have in-filesystem
checksums of the extent data.

Cheers, Andreas
-- 
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
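Purely as an illustration of the shape such a tail could take (the real
on-disk format, field names and widths are not settled by anything
above, so treat this as a guess):

#include <stdint.h>

/* Hypothetical layout -- not the real ext4 format, just the two items
 * described above: a checksum of the index block and a back-pointer
 * to the owning inode. */
struct ext_extent_index_tail {
	uint32_t et_checksum;	/* checksum of the index block contents */
	uint32_t et_ino;	/* inode this extent tree belongs to */
};

The checksum catches a corrupt index block; the back-pointer catches an
index block written to the wrong location, since it won't match the
inode doing the lookup.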
Re: end to end error recovery musings
> These features make the most sense in terms of WRITE. Disks already
> have plenty of CRC on the data so if a READ fails on a regular drive
> we already know about it.

Don't bet on it.

If you want to do this seriously you need an end to end (media to host
ram) checksum. We do see bizarre and quite evil things happen to
people occasionally because they rely on bus level protection - both
faulty network cards and faulty disk or controller RAM can cause very
bad things to happen in a critical environment and are very very hard
to detect and test for.

IDE has another hideously evil feature in this area. Command blocks
are sent by PIO cycles, and are therefore unprotected from corruption.
So while a data burst with corruption will error and retry, a command
which corrupts the block number (very much less likely - fewer bits at
much lower speed) will not be caught on a PATA system for read or for
write, and will hit the wrong block.

With networking you can turn off hardware IP checksumming (and many
cluster people do); with disks we don't yet have a proper end to end
checksum-to-media system in the fs or block layers.

> It would be great if the app tag was more than 16 bits. Ted mentioned
> that ideally he'd like to store the inode number in the app tag. But
> as it stands there isn't room.

The lowest few bits are the most important with ext2/ext3 because you
normally lose a sector of inodes, which means you've got dangly bits
associated with a sequence of inodes sharing the same upper bits. More
problematic is losing indirect blocks, and being able to keep some kind
of [inode low bits/block index] would help put stuff back together.

Alan
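As a toy illustration of how little room 16 bits gives, here is one
possible packing of that [inode low bits/block index] idea in C; the
10/6 split is arbitrary, chosen only to show the mechanics, and the
names are invented:

#include <stdint.h>

#define APP_INO_BITS 10		/* low bits of the inode number */
#define APP_IDX_BITS 6		/* low bits of the block index */

/* Squeeze inode low bits and block index into the 16-bit APP tag. */
static inline uint16_t app_tag_pack(uint32_t ino, uint32_t blk_index)
{
	return ((ino & ((1u << APP_INO_BITS) - 1)) << APP_IDX_BITS) |
	       (blk_index & ((1u << APP_IDX_BITS) - 1));
}

static inline uint32_t app_tag_ino_bits(uint16_t tag)
{
	return tag >> APP_IDX_BITS;
}

static inline uint32_t app_tag_idx_bits(uint16_t tag)
{
	return tag & ((1u << APP_IDX_BITS) - 1);
}

With only 10 inode bits, a recovery tool can narrow a stray sector to
one inode in every 1024, which is exactly the "dangly bits with the
same upper bits" problem described above.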
Re: end to end error recovery musings
> "Eric" == Moore, Eric <[EMAIL PROTECTED]> writes: Eric> Martin K. Petersen on Data Intergrity Feature, which is also Eric> called EEDP(End to End Data Protection), which he presented some Eric> ideas/suggestions of adding an API in linux for this. T10 DIF is interesting for a few things: - Ensuring that the data integrity is preserved when writing a buffer to disk - Ensuring that the write ends up on the right hardware sector These features make the most sense in terms of WRITE. Disks already have plenty of CRC on the data so if a READ fails on a regular drive we already know about it. We can, however, leverage DIF with my proposal to expose the protection data to host memory. This will allow us to verify the data integrity information before passing it to the filesystem or application. We can say "this is really the information the disk sent. It hasn't been mangled along the way". And by using the APP tag we can mark a sector as - say - metadata or data to ease putting the recovery puzzle back together. It would be great if the app tag was more than 16 bits. Ted mentioned that ideally he'd like to store the inode number in the app tag. But as it stands there isn't room. In any case this is all slightly orthogonal to Ric's original post about finding the right persistence heuristics in the error handling path... -- Martin K. Petersen Oracle Linux Engineering - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: end to end error recovery musings
On Monday, February 26, 2007 9:42 AM, Ric Wheeler wrote:
> Which brings us back to a recent discussion at the file system
> workshop on being more repair oriented in file system design so we
> can survive situations like this a bit more reliably ;-)

On the second day of the workshop, there was a presentation given by
Martin K. Petersen on the Data Integrity Feature, also called EEDP (End
to End Data Protection), in which he presented some ideas/suggestions
for adding an API in Linux for this. I have his presentation if anyone
is interested. One thing is that the SCSI mid layer needs 32-byte CDB
support.

mpt fusion supports EEDP for some versions of Fibre products, and we
plan to add this for next generation SAS products. We support EEDP in
the Windows driver, where the driver generates its own tags. Our Linux
driver doesn't.

Here is our 32-byte passthru structure for SCSI_IO, defined in
mpi_init.h, which as you may notice has some tags and flags for EEDP:

typedef struct _MSG_SCSI_IO32_REQUEST
{
    U8                      Port;                          /* 00h */
    U8                      Reserved1;                     /* 01h */
    U8                      ChainOffset;                   /* 02h */
    U8                      Function;                      /* 03h */
    U8                      CDBLength;                     /* 04h */
    U8                      SenseBufferLength;             /* 05h */
    U8                      Flags;                         /* 06h */
    U8                      MsgFlags;                      /* 07h */
    U32                     MsgContext;                    /* 08h */
    U8                      LUN[8];                        /* 0Ch */
    U32                     Control;                       /* 14h */
    MPI_SCSI_IO32_CDB_UNION CDB;                           /* 18h */
    U32                     DataLength;                    /* 38h */
    U32                     BidirectionalDataLength;       /* 3Ch */
    U32                     SecondaryReferenceTag;         /* 40h */
    U16                     SecondaryApplicationTag;       /* 44h */
    U16                     Reserved2;                     /* 46h */
    U16                     EEDPFlags;                     /* 48h */
    U16                     ApplicationTagTranslationMask; /* 4Ah */
    U32                     EEDPBlockSize;                 /* 4Ch */
    MPI_SCSI_IO32_ADDRESS   DeviceAddress;                 /* 50h */
    U8                      SGLOffset0;                    /* 58h */
    U8                      SGLOffset1;                    /* 59h */
    U8                      SGLOffset2;                    /* 5Ah */
    U8                      SGLOffset3;                    /* 5Bh */
    U32                     Reserved3;                     /* 5Ch */
    U32                     Reserved4;                     /* 60h */
    U32                     SenseBufferLowAddr;            /* 64h */
    SGE_IO_UNION            SGL;                           /* 68h */
} MSG_SCSI_IO32_REQUEST, MPI_POINTER PTR_MSG_SCSI_IO32_REQUEST,
  SCSIIO32Request_t, MPI_POINTER pSCSIIO32Request_t;
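To make the EEDP plumbing concrete, here is a hedged sketch of how a
driver might fill those fields for a protected transfer. The field
semantics are one reading of the structure above; mpt_fill_eedp() is a
made-up helper, and the real EEDPFlags bit values live in the MPI
headers and are deliberately not reproduced here.

#include "mpi_init.h"	/* MSG_SCSI_IO32_REQUEST, U16/U32 types (assumed) */

/* Hypothetical helper: shows which field carries what, nothing more. */
static void mpt_fill_eedp(MSG_SCSI_IO32_REQUEST *req,
			  U32 start_lba, U16 eedp_flags)
{
	req->EEDPBlockSize = 512;		/* bytes per logical block */
	req->SecondaryReferenceTag = start_lba;	/* seed for the REF tag,
						   advanced per block by
						   firmware */
	req->SecondaryApplicationTag = 0;	/* APP tag, if used */
	req->ApplicationTagTranslationMask = 0xffff;
	req->EEDPFlags = eedp_flags;		/* enable/check bits from
						   the MPI headers */
}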
Re: end to end error recovery musings
> One interesting counter example is a smaller write than a full page -
> say 512 bytes out of 4k.
>
> If we need to do a read-modify-write and it just so happens that 1 of
> the 7 sectors we need to read is flaky, will this "look" like a write
> failure?

The current core kernel code can't handle propagating sub-page sized
errors up to the file system layers (there is nowhere in the page cache
to store 'part of this page is missing'). This is a long standing
(four year plus) problem with CD-RW support as well.

For ATA we can at least retrieve the true media sector size now, which
may be helpful at the physical layer, but the page cache would need to
grow some brains to do anything with it.
Re: end to end error recovery musings
Jeff Garzik wrote:
> Theodore Tso wrote:
>> Can someone with knowledge of current disk drive behavior confirm
>> that for all drives that support bad block sparing, if an attempt
>> to write to a particular spot on disk results in an error due to
>> bad media at that spot, the disk drive will automatically rewrite
>> the sector to a sector in its spare pool, and automatically
>> redirect that sector to the new location. I believe this should
>> always be true, so presumably with all modern disk drives a write
>> error should mean something very serious has happened.
>
> This is what will /probably/ happen. The drive should indeed find a
> spare sector and remap it, if the write attempt encounters a bad spot
> on the media.
>
> However, with a large enough write, a large enough bad-spot-on-media,
> and firmware programmed to never take more than X seconds to complete
> their enterprise customers' I/O, it might just fail.
>
> IMO, somewhere in the kernel, when we receive a read-op or write-op
> media error, we should immediately try to plaster that area with
> small writes. Sure, if it's a read-op you lost data, but this method
> will maximize the chance that you can refresh/reuse the logical
> sectors in question.
>
> Jeff

One interesting counter example is a smaller write than a full page -
say 512 bytes out of 4k.

If we need to do a read-modify-write and it just so happens that 1 of
the 7 sectors we need to read is flaky, will this "look" like a write
failure?

ric
Re: end to end error recovery musings
Theodore Tso wrote:
> Can someone with knowledge of current disk drive behavior confirm
> that for all drives that support bad block sparing, if an attempt to
> write to a particular spot on disk results in an error due to bad
> media at that spot, the disk drive will automatically rewrite the
> sector to a sector in its spare pool, and automatically redirect that
> sector to the new location. I believe this should always be true, so
> presumably with all modern disk drives a write error should mean
> something very serious has happened.

This is what will /probably/ happen. The drive should indeed find a
spare sector and remap it, if the write attempt encounters a bad spot
on the media.

However, with a large enough write, a large enough bad-spot-on-media,
and firmware programmed to never take more than X seconds to complete
their enterprise customers' I/O, it might just fail.

IMO, somewhere in the kernel, when we receive a read-op or write-op
media error, we should immediately try to plaster that area with small
writes. Sure, if it's a read-op you lost data, but this method will
maximize the chance that you can refresh/reuse the logical sectors in
question.

	Jeff
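A minimal sketch of that "plaster it with small writes" recovery,
assuming a hypothetical write_sector() helper at whatever layer would
own this:

#include <stdint.h>
#include <errno.h>

struct disk;	/* stand-in for the real device handle */
extern int write_sector(struct disk *d, uint64_t lba, const void *buf);

/* Rewrite [lba, lba + n) one sector at a time after a media error, so
 * each bad spot gets an independent chance to be remapped by the
 * drive instead of taking the whole request down with it. */
static int plaster_region(struct disk *d, uint64_t lba, unsigned int n,
			  const unsigned char *buf)
{
	unsigned int i, failed = 0;

	for (i = 0; i < n; i++)
		if (write_sector(d, lba + i, buf + i * 512) < 0)
			failed++;

	return failed ? -EIO : 0;  /* caller decides what to do with the
				      sectors that still won't stick */
}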
Re: end to end error recovery musings
Theodore Tso wrote:
> In any case, the reason why I bring this up is that it would be
> really nice if there was a way with a single laptop drive to be able
> to do snapshots and background fsck's without having to use initrd's
> with device mapper.

This is a major part of why I've been trying to push integrated klibc:
to have all that stuff as a unified "kernel" deliverable.
Unfortunately, as you know, Linus apparently rejected the concept "at
least for now" at LKS last year.

With klibc this stuff could still be in one single wrapper without
funny dependencies, but wouldn't have to be ported to kernel space.

	-hpa
Re: end to end error recovery musings
Alan wrote:
>> I think that this is mostly true, but we also need to balance this
>> against the need for higher levels to get a timely response. In a
>> really large IO, a naive retry of a very large write could lead to
>> a non-responsive system for a very large time...
>
> And losing the I/O could result in a system that is non responsive
> until the tape restore completes two days later

Which brings us back to a recent discussion at the file system workshop
on being more repair oriented in file system design, so we can survive
situations like this a bit more reliably ;-)

ric
Re: end to end error recovery musings
> I think that this is mostly true, but we also need to balance this
> against the need for higher levels to get a timely response. In a
> really large IO, a naive retry of a very large write could lead to a
> non-responsive system for a very large time...

And losing the I/O could result in a system that is non responsive
until the tape restore completes two days later
Re: end to end error recovery musings
Alan wrote:
>> the new location. I believe this should always be true, so
>> presumably with all modern disk drives a write error should mean
>> something very serious has happened.
>
> Not quite that simple.

I think that write errors are normally quite serious, but there are
exceptions which might be able to be worked around with retries. To
Ted's point, in general, a write to a bad spot on the media will cause
a remapping which should be transparent (if a bit slow) to us.

> If you write a block aligned size the same size as the physical media
> block size maybe this is true. If you write a sector on a device with
> physical sector size larger than logical block size (as allowed by
> say ATA7) then it's less clear what happens. I don't know if the
> drive firmware implements multiple "tails" in this case.
>
> On a read error it is worth trying the other parts of the I/O.

I think that this is mostly true, but we also need to balance this
against the need for higher levels to get a timely response. In a
really large IO, a naive retry of a very large write could lead to a
non-responsive system for a very large time...

ric
Re: end to end error recovery musings
On Mon, 2007-02-26 at 08:25 -0500, Theodore Tso wrote:
> Somewhat off-topic, but my one big regret with how the dm vs. evms
> competition settled out was that evms had the ability to perform
> block device snapshots using a non-LVM volume as the base --- and
> that EVMS allowed a single drive to be partially managed by the LVM
> layer, and partially managed by evms.

If all you want is a snapshot, md can do this today ... you just create
a RAID-1, resync it and then break it ... of course, you have to have
the filesystem mounted above an md device initially ...

James
Re: end to end error recovery musings
> the new location. I believe this should always be true, so presumably
> with all modern disk drives a write error should mean something very
> serious has happened.

Not quite that simple.

If you write a block aligned size the same size as the physical media
block size maybe this is true. If you write a sector on a device with
physical sector size larger than logical block size (as allowed by say
ATA7) then it's less clear what happens. I don't know if the drive
firmware implements multiple "tails" in this case.

On a read error it is worth trying the other parts of the I/O.

Alan
Re: end to end error recovery musings
On Mon, Feb 26, 2007 at 04:33:37PM +1100, Neil Brown wrote:
> Do we want a path in the other direction to handle write errors? The
> file system could say "Don't worry too much if this block cannot be
> written, just return an error and I will write it somewhere else"?
> This might allow md not to fail a whole drive if there is a single
> write error.

Can someone with knowledge of current disk drive behavior confirm that
for all drives that support bad block sparing, if an attempt to write
to a particular spot on disk results in an error due to bad media at
that spot, the disk drive will automatically rewrite the sector to a
sector in its spare pool, and automatically redirect that sector to the
new location. I believe this should always be true, so presumably with
all modern disk drives a write error should mean something very serious
has happened. (Or that someone was in the middle of reconfiguring a FC
network and they're running a kernel that doesn't understand why
short-duration FC timeouts should be retried. :-)

> Or is that completely unnecessary, as all modern devices do bad-block
> relocation for us? Is there any need for a bad-block-relocating layer
> in md or dm?

That's the question. It wouldn't be that hard for filesystems to be
able to remap a data block, but (a) it would be much more difficult for
fundamental metadata (for example, the inode table), and (b) it's
unnecessary complexity if the lower levels in the storage stack should
always be doing this for us in the case of media errors anyway.

> What about corrected-error counts? Drives provide them with SMART.
> The SCSI layer could provide some as well. Md can do a similar thing
> to some extent. Whether these are actually useful predictors of
> pending failure is unclear, but there could be some value. E.g.
> after a certain number of recovered errors raid5 could trigger a
> background consistency check, or a filesystem could trigger a
> background fsck should it support that.

Somewhat off-topic, but my one big regret with how the dm vs. evms
competition settled out was that evms had the ability to perform block
device snapshots using a non-LVM volume as the base --- and that EVMS
allowed a single drive to be partially managed by the LVM layer, and
partially managed by evms. What this allowed is the ability to do
device snapshots and therefore background fsck's without needing to
convert the entire laptop disk to an LVM solution (since to this day I
still don't trust initrd's to always do the right thing when I am
constantly replacing the kernel for kernel development).

I know, I'm weird; distro users have initrds that seem to mostly work,
and it's only weird developers who try to use bleeding edge kernels
with a RHEL4 userspace that suffer. But it's one of the reasons why
I've avoided initrd's like the plague --- I've wasted entire days
trying to debug problems with the userspace-provided initrd being too
old to support newer 2.6 development kernels.

In any case, the reason why I bring this up is that it would be really
nice if there was a way with a single laptop drive to be able to do
snapshots and background fsck's without having to use initrd's with
device mapper.

	- Ted
Re: end to end error recovery musings
H. Peter Anvin wrote:
> Ric Wheeler wrote:
>>
>> We still have the following challenges:
>>
>> (1) read-ahead often means that we will retry every bad sector at
>> least twice from the file system level. The first time, the fs read
>> ahead request triggers a speculative read that includes the bad
>> sector (triggering the error handling mechanisms) right before the
>> real application triggers a read that does the same thing. Not sure
>> what the answer is here since read-ahead is obviously a huge win in
>> the normal case.
>
> Probably the only sane thing to do is to remember the bad sectors and
> avoid attempting reading them; that would mean marking "automatic"
> versus "explicitly requested" requests to determine whether or not to
> filter them against a list of discovered bad blocks.

Some disks are doing their own "read-ahead" in the form of a background
media scan. Scans are done on request or periodically (e.g. once per
day or once per week) and we have tools that can fetch the scan results
from a disk (e.g. a list of unreadable sectors). What we don't have is
any way to feed such information to a file system that may be impacted.

Doug Gilbert
Re: end to end error recovery musings
On Friday February 23, [EMAIL PROTECTED] wrote:
> On Fri, Feb 23, 2007 at 05:37:23PM -0700, Andreas Dilger wrote:
> > > Probably the only sane thing to do is to remember the bad sectors
> > > and avoid attempting reading them; that would mean marking
> > > "automatic" versus "explicitly requested" requests to determine
> > > whether or not to filter them against a list of discovered bad
> > > blocks.
> >
> > And clearing this list when the sector is overwritten, as it will
> > almost certainly be relocated at the disk level. For that matter,
> > a huge win would be to have the MD RAID layer rewrite only the bad
> > sector (in hopes of the disk relocating it) instead of failing the
> > whole disk. Otherwise, a few read errors on different disks in a
> > RAID set can take the whole system offline. Apologies if this is
> > already done in recent kernels...

Yes, current md does this.

> And having a way of making this list available to both the filesystem
> and to a userspace utility, so they can more easily deal with doing a
> forced rewrite of the bad sector, after determining which file is
> involved and perhaps doing something intelligent (up to and including
> automatically requesting a backup system to fetch a backup version of
> the file, and if it can be determined that the file shouldn't have
> been changed since the last backup, automatically fixing up the
> corrupted data block :-).
>
> - Ted

So we want a clear path for media read errors from the device up to
user-space. Stacked devices (like md) would do appropriate mappings,
maybe (for raid0/linear at least; other levels wouldn't tolerate
errors).

There would need to be a limit on the number of 'bad blocks' that is
recorded. Maybe a mechanism to clear old bad blocks from the list is
needed.

Maybe if generic_make_request gets a request for a block which overlaps
a 'bad block' it returns an error immediately.

Do we want a path in the other direction to handle write errors? The
file system could say "Don't worry too much if this block cannot be
written, just return an error and I will write it somewhere else"?
This might allow md not to fail a whole drive if there is a single
write error.

Or is that completely unnecessary, as all modern devices do bad-block
relocation for us? Is there any need for a bad-block-relocating layer
in md or dm?

What about corrected-error counts? Drives provide them with SMART.
The SCSI layer could provide some as well. Md can do a similar thing
to some extent. Whether these are actually useful predictors of
pending failure is unclear, but there could be some value. E.g. after
a certain number of recovered errors raid5 could trigger a background
consistency check, or a filesystem could trigger a background fsck
should it support that.

Lots of interesting questions... not so many answers.

NeilBrown
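To make the bookkeeping concrete, a minimal sketch in C of the bad-block
list behavior described above: record on media error, clear on
successful overwrite, and fail overlapping requests early. A fixed
array stands in for whatever structure md/dm would really use, and all
names are illustrative.

#include <stdint.h>
#include <errno.h>

#define BB_MAX 512		/* the "limit on the number recorded" */

static uint64_t bb_list[BB_MAX];
static unsigned int bb_count;

/* Does [sector, sector + nr) overlap any recorded bad block? */
static int bb_contains(uint64_t sector, unsigned int nr)
{
	unsigned int i;

	for (i = 0; i < bb_count; i++)
		if (bb_list[i] >= sector && bb_list[i] < sector + nr)
			return 1;
	return 0;
}

/* Record a media error; what to do when full (age out? refuse?) is
 * exactly the open policy question raised above. */
static void bb_record(uint64_t sector)
{
	if (bb_count < BB_MAX && !bb_contains(sector, 1))
		bb_list[bb_count++] = sector;
}

/* Clear an entry after the sector is successfully rewritten, since the
 * drive has almost certainly relocated it. */
static void bb_clear(uint64_t sector)
{
	unsigned int i;

	for (i = 0; i < bb_count; i++)
		if (bb_list[i] == sector) {
			bb_list[i] = bb_list[--bb_count];
			return;
		}
}

/* generic_make_request()-style early failure for known-bad ranges. */
static int submit_read(uint64_t sector, unsigned int nr)
{
	if (bb_contains(sector, nr))
		return -EIO;	/* fail fast, don't touch the media */
	/* ... issue the real I/O ... */
	return 0;
}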
Re: end to end error recovery musings
On Fri, Feb 23, 2007 at 09:32:29PM -0500, Theodore Tso wrote:
> And having a way of making this list available to both the
> filesystem and to a userspace utility, so they can more easily deal
> with doing a forced rewrite of the bad sector, after determining
> which file is involved and perhaps doing something intelligent (up
> to and including automatically requesting a backup system to fetch a
> backup version of the file, and if it can be determined that the
> file shouldn't have been changed since the last backup,
> automatically fixing up the corrupted data block :-).

I had a small C program + perl script that would take a badblocks list
and figure out which files on an xfs filesystem were trashed, though in
the case of xfs it's somewhat easier because you can dump the extents
for a file.

Something more generic wouldn't be hard to make work. It also wouldn't
be hard to extend this to inodes in some cases, though I'm not sure
there is much you can do there beyond fsck.
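A generic version of that idea can be sketched with the FIBMAP ioctl
(root-only, and only on filesystems that implement bmap): walk a file's
logical blocks, map each to a filesystem block, and check the
corresponding sectors against the bad list. is_bad() and the
sector-unit convention are assumptions for illustration; a real tool
would have to convert badblocks output into the same units.

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <linux/fs.h>		/* FIBMAP, FIGETBSZ */

extern int is_bad(uint64_t sector);	/* lookup in the badblocks list */

int report_damage(const char *path)
{
	struct stat st;
	int fd, bsz, blk;
	unsigned int i, s, spb, nblocks;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0 || ioctl(fd, FIGETBSZ, &bsz) < 0) {
		close(fd);
		return -1;
	}

	spb = bsz / 512;			/* sectors per fs block */
	nblocks = (st.st_size + bsz - 1) / bsz;

	for (i = 0; i < nblocks; i++) {
		blk = i;			/* in: logical block */
		if (ioctl(fd, FIBMAP, &blk) < 0)
			break;
		if (!blk)			/* hole, nothing mapped */
			continue;
		for (s = 0; s < spb; s++)	/* out: physical block */
			if (is_bad((uint64_t)blk * spb + s)) {
				printf("%s: file block %u hits a bad "
				       "sector\n", path, i);
				break;
			}
	}
	close(fd);
	return 0;
}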
Re: end to end error recovery musings
On Fri, Feb 23, 2007 at 05:37:23PM -0700, Andreas Dilger wrote:
> > Probably the only sane thing to do is to remember the bad sectors
> > and avoid attempting reading them; that would mean marking
> > "automatic" versus "explicitly requested" requests to determine
> > whether or not to filter them against a list of discovered bad
> > blocks.
>
> And clearing this list when the sector is overwritten, as it will
> almost certainly be relocated at the disk level. For that matter, a
> huge win would be to have the MD RAID layer rewrite only the bad
> sector (in hopes of the disk relocating it) instead of failing the
> whole disk. Otherwise, a few read errors on different disks in a
> RAID set can take the whole system offline. Apologies if this is
> already done in recent kernels...

And having a way of making this list available to both the filesystem
and to a userspace utility, so they can more easily deal with doing a
forced rewrite of the bad sector, after determining which file is
involved and perhaps doing something intelligent (up to and including
automatically requesting a backup system to fetch a backup version of
the file, and if it can be determined that the file shouldn't have been
changed since the last backup, automatically fixing up the corrupted
data block :-).

	- Ted
Re: end to end error recovery musings
Andreas Dilger wrote:
> And clearing this list when the sector is overwritten, as it will
> almost certainly be relocated at the disk level.

Certainly if the overwrite is successful.

	-hpa
Re: end to end error recovery musings
On Feb 23, 2007 16:03 -0800, H. Peter Anvin wrote:
> Ric Wheeler wrote:
> > (1) read-ahead often means that we will retry every bad sector at
> > least twice from the file system level. The first time, the fs
> > read ahead request triggers a speculative read that includes the
> > bad sector (triggering the error handling mechanisms) right before
> > the real application triggers a read that does the same thing. Not
> > sure what the answer is here since read-ahead is obviously a huge
> > win in the normal case.
>
> Probably the only sane thing to do is to remember the bad sectors and
> avoid attempting reading them; that would mean marking "automatic"
> versus "explicitly requested" requests to determine whether or not to
> filter them against a list of discovered bad blocks.

And clearing this list when the sector is overwritten, as it will
almost certainly be relocated at the disk level. For that matter, a
huge win would be to have the MD RAID layer rewrite only the bad sector
(in hopes of the disk relocating it) instead of failing the whole disk.
Otherwise, a few read errors on different disks in a RAID set can take
the whole system offline. Apologies if this is already done in recent
kernels...

Cheers, Andreas
-- 
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Re: end to end error recovery musings
Ric Wheeler wrote:
> We still have the following challenges:
>
> (1) read-ahead often means that we will retry every bad sector at
> least twice from the file system level. The first time, the fs read
> ahead request triggers a speculative read that includes the bad
> sector (triggering the error handling mechanisms) right before the
> real application triggers a read that does the same thing. Not sure
> what the answer is here since read-ahead is obviously a huge win in
> the normal case.

Probably the only sane thing to do is to remember the bad sectors and
avoid attempting reading them; that would mean marking "automatic"
versus "explicitly requested" requests to determine whether or not to
filter them against a list of discovered bad blocks.

	-hpa
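A rough sketch of what that filtering could look like at the request
level, reusing a bad-block lookup like the one sketched earlier in the
thread; the flag name and return conventions are invented for
illustration:

#include <stdint.h>
#include <errno.h>

extern int bb_contains(uint64_t sector, unsigned int nr);

#define REQ_SPECULATIVE 0x1	/* set by read-ahead, never by read() */

static int submit_read_filtered(uint64_t sector, unsigned int nr,
				unsigned int flags)
{
	if (bb_contains(sector, nr)) {
		if (flags & REQ_SPECULATIVE)
			return -EAGAIN;	/* quietly drop the read-ahead */
		return -EIO;		/* explicit read: report the error */
	}
	/* ... issue the real I/O ... */
	return 0;
}

That way the speculative pass never re-triggers error handling, while
an explicit read of a known-bad sector still fails promptly and
honestly.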
end to end error recovery musings
In the IO/FS workshop, one idea we kicked around is the need to provide
better and more specific error messages between the IO stack and the
file system layer. My group has been working to stabilize a relatively
up to date libata + MD based box, so I can try to lay out at least one
"appliance like" typical configuration to help frame the issue. We are
working on a relatively large appliance, but you can buy (or build)
similar home appliances that use Linux to provide a NAS in a box for
end users.

The use case that we have is an ICH6R/AHCI box with 4 large (500+ GB)
drives, with some of the small system partitions on a 4-way RAID1
device. The libata version we have is a backport of 2.6.18 onto
SLES10, so the error handling at the libata level is a huge improvement
over what we had before. Each box has a watchdog timer that can be set
to fire after at most 2 minutes. (We have a second flavor of this box
with an ICH5 and P-ATA drives using the non-libata drivers that has a
similar use case.)

Using the patches that Mark sent around recently for error injection,
we inject media errors into one or more drives and try to see how
smoothly error handling runs and, importantly, whether or not the error
handling will complete before the watchdog fires and reboots the box.
If you want to be especially mean, inject errors into the RAID
superblocks on 3 out of the 4 drives.

We still have the following challenges:

(1) read-ahead often means that we will retry every bad sector at
least twice from the file system level. The first time, the fs read
ahead request triggers a speculative read that includes the bad sector
(triggering the error handling mechanisms) right before the real
application triggers a read that does the same thing. Not sure what
the answer is here since read-ahead is obviously a huge win in the
normal case.

(2) the patches that were floating around on how to make sure that we
effectively handle single sector errors in a large IO request are
critical. On one hand, we want to combine adjacent IO requests into
larger IOs whenever possible. On the other hand, when the combined IO
fails, we need to isolate the error to the correct range, avoid
reissuing a request that touches that sector again, and communicate up
the stack to the file system/MD what really failed. All of this needs
to complete in tens of seconds, not multiple minutes. (One way to do
the isolation is sketched at the end of this note.)

(3) the timeout values on the failed IOs need to be tuned well (as was
discussed in an earlier linux-ide thread). We cannot afford to hang
for 30 seconds, especially in the MD case, since you might need to fail
more than one device for a single IO. Prompt error propagation (say
that 4 times quickly!) can allow MD to mask the underlying errors as
you would hope; hanging on too long will almost certainly cause a
watchdog reboot...

(4) the newish libata+SCSI stack is pretty good at handling disk
errors, but adding in MD can actually reduce the reliability of your
system unless you tune the error handling correctly.

We will follow up with specific issues as they arise, but I wanted to
lay out a use case that can help frame part of the discussion. I also
want to encourage people to inject real disk errors with Mark's patches
so we can share the pain ;-)

ric
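As a concrete illustration of the error isolation item (2) asks for,
here is a minimal sketch that bisects a failed combined IO down to the
offending sector(s) with O(log n) retries per bad sector, feeding the
result to a bad-block list. All names are invented; read_range() is a
stand-in for reissuing part of the original request.

#include <stdint.h>

struct disk;	/* stand-in for the real device handle */
extern int read_range(struct disk *d, uint64_t lba, unsigned int nr);
extern void report_bad_sector(struct disk *d, uint64_t lba);

/* Called after read_range(d, lba, nr) has already failed once:
 * recursively retry each half until single sectors are pinned down. */
static void isolate_failure(struct disk *d, uint64_t lba, unsigned int nr)
{
	unsigned int half;

	if (nr == 1) {
		report_bad_sector(d, lba);  /* feed the bad-block list */
		return;
	}
	half = nr / 2;
	if (read_range(d, lba, half) < 0)
		isolate_failure(d, lba, half);
	if (read_range(d, lba + half, nr - half) < 0)
		isolate_failure(d, lba + half, nr - half);
}

Each retry still pays the device's full command timeout, which is
exactly why the timeout tuning in item (3) matters so much.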