Re: end to end error recovery musings
James Bottomley wrote:
> On Wed, 2007-02-28 at 17:28 -0800, H. Peter Anvin wrote:
>> James Bottomley wrote:
>>> On Wed, 2007-02-28 at 12:42 -0500, Martin K. Petersen wrote:
>>>> 4104. It's 8 bytes per hardware sector. At least for T10...
>>>
>>> Er ... that won't look good to the 512 ATA compatibility remapping ...
>>
>> Well, in that case you'd only see 8x512 data bytes, no metadata...
>
> i.e. no support for block guard in the 512 byte sector emulation mode ...

That makes sense, though... if the raw sector size is 4096 bytes, that
metadata would presumably not exist on a per-sector basis.

	-hpa
Re: end to end error recovery musings
On Wed, 2007-02-28 at 17:28 -0800, H. Peter Anvin wrote:
> James Bottomley wrote:
> > On Wed, 2007-02-28 at 12:42 -0500, Martin K. Petersen wrote:
> >> 4104. It's 8 bytes per hardware sector. At least for T10...
> >
> > Er ... that won't look good to the 512 ATA compatibility remapping ...
>
> Well, in that case you'd only see 8x512 data bytes, no metadata...

i.e. no support for block guard in the 512 byte sector emulation mode ...

James
Re: end to end error recovery musings
James Bottomley wrote:
> On Wed, 2007-02-28 at 12:42 -0500, Martin K. Petersen wrote:
>> 4104. It's 8 bytes per hardware sector. At least for T10...
>
> Er ... that won't look good to the 512 ATA compatibility remapping ...

Well, in that case you'd only see 8x512 data bytes, no metadata...

	-hpa
Re: end to end error recovery musings
On Wed, 2007-02-28 at 12:42 -0500, Martin K. Petersen wrote:
> 4104. It's 8 bytes per hardware sector. At least for T10...

Er ... that won't look good to the 512 ATA compatibility remapping ...

James
Re: end to end error recovery musings
> "James" == James Bottomley <[EMAIL PROTECTED]> writes: James> However, I could see the SATA manufacturers selling capacity at James> 512 (or the new 4096) sectors but allowing their OEMs to James> reformat them 520 (or 4160) 4104. It's 8 bytes per hardware sector. At least for T10... -- Martin K. Petersen Oracle Linux Engineering - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: end to end error recovery musings
On Wed, 2007-02-28 at 12:16 -0500, Martin K. Petersen wrote:
> It's cool that it's on the radar in terms of the protocol.
>
> That doesn't mean that drive manufacturers are going to implement it,
> though. The ones I've talked to were unwilling to sacrifice capacity
> because that's the main competitive factor in the SATA/consumer space.
>
> Maybe we'll see it in the nearline product ranges? That would be a
> good start...

They wouldn't necessarily have to sacrifice capacity per se. The
current problem is that, unlike SCSI disks, you can't seem to reformat
SATA ones to arbitrary sector sizes. However, I could see the SATA
manufacturers selling capacity at 512 (or the new 4096) sectors but
allowing their OEMs to reformat them to 520 (or 4160) and then
implementing block guard on top of this. The OEMs who did this would
obviously lose 1.6% of the capacity, but that would be their choice ...

James
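(To check that figure: the 8 extra bytes per sector are 8/512 = 1.5625%
of the data payload, or 8/520 = 1.54% of the reformatted sector, so the
quoted 1.6% is the right ballpark either way.)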
Re: end to end error recovery musings
> "Doug" == Douglas Gilbert <[EMAIL PROTECTED]> writes: Doug> Work on SAT-2 is now underway and one of the agenda items is Doug> "end to end data protection" and is in the hands of the t13 Doug> ATA8-ACS technical editor. So it looks like data integrity is on Doug> the radar in the SATA world. It's cool that it's on the radar in terms of the protocol. That doesn't mean that drive manufacturers are going to implement it, though. The ones I've talked to were unwilling to sacrifice capacity because that's the main competitive factor in the SATA/consumer space. Maybe we'll see it in the nearline product ranges? That would be a good start... -- Martin K. Petersen http://mkp.net/ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: end to end error recovery musings
On Tuesday, February 27, 2007 12:07 PM, Martin K. Petersen wrote:
>
> Not sure you're up-to-date on the T10 data integrity feature.
> Essentially it's an extension of the 520 byte sectors common in disk
> arrays. For each 512 byte sector (or 4KB ditto) you get 8 bytes of
> protection data. There's a 2 byte CRC (GUARD tag), a 2 byte
> user-defined tag (APP) and a 4-byte reference tag (REF). Depending on
> how the drive is formatted, the REF tag usually needs to match the
> lower 32-bits of the target sector #.
>

From the SCSI LLD perspective, all we need is 32-byte CDB support and a
mechanism to pass the tags down from above. It appears our driver to
firmware interface only provides the reference and application tags.
The guard tag is not present, so I guess the mpt fusion controller
firmware is setting it (I will have to check with others). I assume
that for transfers greater than a sector, the controller firmware
updates the tags for all the other sectors within the boundary. I
assume the flags tell whether EEDP is enabled or not. I will have to
check whether there are manufacturing pages that say whether the
controller is capable of EEDP (as not all our controllers support it).

Here are the EEDP associated fields we provide in our scsi passthru, as
well as target assist:

	u32 SecondaryReferenceTag
	u16 SecondaryApplicationTag
	u16 EEDPFlags
	u16 ApplicationTagTranslationMask
	u32 EEDPBlockSize
Re: end to end error recovery musings
Martin K. Petersen wrote:
>> "Alan" == Alan <[EMAIL PROTECTED]> writes:
>
>>> Not sure you're up-to-date on the T10 data integrity feature.
>>> Essentially it's an extension of the 520 byte sectors common in
>>> disk
>
> [...]
>
> Alan> but here's a minor bit of passing bad news - quite a few older
> Alan> ATA controllers can't issue DMA transfers that are not a
> Alan> multiple of 512 bytes without crapping themselves (eg
> Alan> READ_LONG). Guess we may need to add
> Alan> ap->i_do_not_suck or similar 8)
>
> I'm afraid it stops even before you get that far. There doesn't seem
> to be any interest in adopting the Data Integrity Feature (or anything
> similar) in the ATA camp. So for now it's a SCSI-only thing.
>
> I encourage people to lean on their favorite disk manufacturer. This
> would be a great feature to have on SATA too...

Martin,
SCSI to ATA Translation (SAT) is now a standard (ANSI INCITS 431-2007)
[and libata is somewhat short of compliance]. Work on SAT-2 is now
underway and one of the agenda items is "end to end data protection",
which is in the hands of the t13 ATA8-ACS technical editor. So it
looks like data integrity is on the radar in the SATA world.

See http://www.t10.org/ftp/t10/document.06/06-497r4.pdf for more
evidence of how SAS and SATA are converging at the command and feature
set level.

Doug Gilbert
Re: end to end error recovery musings
> "Alan" == Alan <[EMAIL PROTECTED]> writes: >> Not sure you're up-to-date on the T10 data integrity feature. >> Essentially it's an extension of the 520 byte sectors common in >> disk [...] Alan> but here's a minor bit of passing bad news - quite a few older Alan> ATA controllers can't issue DMA transfers that are not a Alan> multiple of 512 bytes without crapping themselves (eg Alan> READ_LONG). Guess we may need to add Alan> ap-> i_do_not_suck or similar 8) I'm afraid it stops even before you get that far. There doesn't seem to be any interest in adopting the Data Integrity Feature (or anything similar) in the ATA camp. So for now it's a SCSI-only thing. I encourage people to lean on their favorite disk manufacturer. This would be a great feature to have on SATA too... -- Martin K. Petersen Oracle Linux Engineering - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: end to end error recovery musings
> Not sure you're up-to-date on the T10 data integrity feature.
> Essentially it's an extension of the 520 byte sectors common in disk

I saw the basics but not the detail. Thanks for the explanation; it
was most helpful and promises to fix a few things for some
controllers... but here's a minor bit of passing bad news - quite a few
older ATA controllers can't issue DMA transfers that are not a multiple
of 512 bytes without crapping themselves (eg READ_LONG). Guess we may
need to add ap->i_do_not_suck or similar 8)

On the bright side, I believe the Intel ICH is the only one with this
problem (and a workaround) that is SATA capable 8)

Alan
Re: end to end error recovery musings
> "Alan" == Alan <[EMAIL PROTECTED]> writes: >> These features make the most sense in terms of WRITE. Disks >> already have plenty of CRC on the data so if a READ fails on a >> regular drive we already know about it. Alan> Don't bet on it. This is why I mentioned that I want to expose the protection data to the host. As written, DIF only protects the path between initiator and target. See below... Alan> If you want to do this seriously you need an end to end (media Alan> to host ram) checksum. We do see bizarre and quite evil things Alan> happen to people occasionally because they rely on bus level Alan> protection - both faulty network cards and faulty disk or Alan> controller RAM can cause very bad things to happen in a critical Alan> environment and are very very hard to detect and test for. Not sure you're up-to-date on the T10 data integrity feature. Essentially it's an extension of the 520 byte sectors common in disk arrays. For each 512 byte sector (or 4KB ditto) you get 8 bytes of protection data. There's a 2 byte CRC (GUARD tag), a 2 byte user-defined tag (APP) and a 4-byte reference tag (REF). Depending on how the drive is formatted, the REF tag usually needs to match the lower 32-bits of the target sector #. For each sector coming in the disk firmware verifies that the CRC and the reference tags are in accordance with the contents of the sector and the CDB start sector + offset. If they don't match the drive will reject the request. If an HBA is capable of exposing the protection tuples to the host we can precalculate the checksum and the LBA when submitting a WRITE. My current proposal involves passing them down in two separate buffers to minimize the risk of in-memory corruption (Besides, it would suck if you had to interleave data and protection data. The scatterlists would become long and twisted). And that's when the READ case becomes interesting. Because then the fs can verify that the checksum of the in-buffer matches of the GUARD tag. In that case we'll know there's been no corruption in the middle. And of course this also opens up using the APP field to tag sector contents. -- Martin K. Petersen Oracle Linux Engineering - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: end to end error recovery musings
Martin K. Petersen wrote: "Eric" == Moore, Eric <[EMAIL PROTECTED]> writes: Eric> Martin K. Petersen on Data Intergrity Feature, which is also Eric> called EEDP(End to End Data Protection), which he presented some Eric> ideas/suggestions of adding an API in linux for this. T10 DIF is interesting for a few things: - Ensuring that the data integrity is preserved when writing a buffer to disk - Ensuring that the write ends up on the right hardware sector These features make the most sense in terms of WRITE. Disks already have plenty of CRC on the data so if a READ fails on a regular drive we already know about it. There are paths through a read that could still benefit from the extra data integrity. The CRC gets validated on the physical sector, but we don't have the same level of strict data checking once it is read into the disk's write cache or being transferred out of cache on the way to the transport... We can, however, leverage DIF with my proposal to expose the protection data to host memory. This will allow us to verify the data integrity information before passing it to the filesystem or application. We can say "this is really the information the disk sent. It hasn't been mangled along the way". And by using the APP tag we can mark a sector as - say - metadata or data to ease putting the recovery puzzle back together. It would be great if the app tag was more than 16 bits. Ted mentioned that ideally he'd like to store the inode number in the app tag. But as it stands there isn't room. In any case this is all slightly orthogonal to Ric's original post about finding the right persistence heuristics in the error handling path... Still all a very relevant discussion - I agree that we could really use more than just 16 bits... ric - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: end to end error recovery musings
On Feb 27, 2007 19:02 +0000, Alan wrote:
> > It would be great if the app tag was more than 16 bits. Ted
> > mentioned that ideally he'd like to store the inode number in the
> > app tag. But as it stands there isn't room.
>
> The lowest few bits are the most important with ext2/ext3 because you
> normally lose a sector of inodes, which means you've got dangly bits
> associated with a sequence of inodes sharing the same upper bits.
> More problematic is losing indirect blocks, and being able to keep
> some kind of [inode low bits/block index] would help put stuff back
> together.

In the ext4 extents format there is the ability (not implemented yet)
to add some extra information into the extent index blocks (previously
referred to as the ext3_extent_tail). This is planned to be a checksum
of the index block, and a back-pointer to the inode which is using this
extent block. This allows online detection of corrupt index blocks,
and also detection of an index block that is written to the wrong
location.

There is as yet no plan that I'm aware of to have in-filesystem
checksums of the extent data.

Cheers, Andreas
-- 
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
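Purely as an illustration of the shape such a tail could take (the real
on-disk format, field names and widths are not settled by anything
above, so treat this as a guess):

#include <stdint.h>

/* Hypothetical layout -- not the real ext4 format, just the two items
 * described above: a checksum of the index block and a back-pointer
 * to the owning inode. */
struct ext_extent_index_tail {
	uint32_t et_checksum;	/* checksum of the index block contents */
	uint32_t et_ino;	/* inode this extent tree belongs to */
};

The checksum catches a corrupt index block; the back-pointer catches an
index block written to the wrong location, since it won't match the
inode doing the lookup.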
Re: end to end error recovery musings
> These features make the most sense in terms of WRITE. Disks already
> have plenty of CRC on the data so if a READ fails on a regular drive
> we already know about it.

Don't bet on it.

If you want to do this seriously you need an end to end (media to host
ram) checksum. We do see bizarre and quite evil things happen to
people occasionally because they rely on bus level protection - both
faulty network cards and faulty disk or controller RAM can cause very
bad things to happen in a critical environment and are very very hard
to detect and test for.

IDE has another hideously evil feature in this area. Command blocks
are sent by PIO cycles, and are therefore unprotected from corruption.
So while a data burst with corruption will error and retry, a command
which corrupts the block number (very much less likely - fewer bits at
much lower speed) will not be caught on a PATA system for read or for
write, and will hit the wrong block.

With networking you can turn off hardware IP checksumming (and many
cluster people do); with disks we don't yet have a proper end to end
checksum-to-media system in the fs or block layers.

> It would be great if the app tag was more than 16 bits. Ted mentioned
> that ideally he'd like to store the inode number in the app tag. But
> as it stands there isn't room.

The lowest few bits are the most important with ext2/ext3 because you
normally lose a sector of inodes, which means you've got dangly bits
associated with a sequence of inodes sharing the same upper bits. More
problematic is losing indirect blocks, and being able to keep some kind
of [inode low bits/block index] would help put stuff back together.

Alan
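As a toy illustration of how little room 16 bits gives, here is one
possible packing of that [inode low bits/block index] idea in C; the
10/6 split is arbitrary, chosen only to show the mechanics, and the
names are invented:

#include <stdint.h>

#define APP_INO_BITS 10		/* low bits of the inode number */
#define APP_IDX_BITS 6		/* low bits of the block index */

/* Squeeze inode low bits and block index into the 16-bit APP tag. */
static inline uint16_t app_tag_pack(uint32_t ino, uint32_t blk_index)
{
	return ((ino & ((1u << APP_INO_BITS) - 1)) << APP_IDX_BITS) |
	       (blk_index & ((1u << APP_IDX_BITS) - 1));
}

static inline uint32_t app_tag_ino_bits(uint16_t tag)
{
	return tag >> APP_IDX_BITS;
}

static inline uint32_t app_tag_idx_bits(uint16_t tag)
{
	return tag & ((1u << APP_IDX_BITS) - 1);
}

With only 10 inode bits, a recovery tool can narrow a stray sector to
one inode in every 1024, which is exactly the "dangly bits with the
same upper bits" problem described above.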
Re: end to end error recovery musings
> "Eric" == Moore, Eric <[EMAIL PROTECTED]> writes: Eric> Martin K. Petersen on Data Intergrity Feature, which is also Eric> called EEDP(End to End Data Protection), which he presented some Eric> ideas/suggestions of adding an API in linux for this. T10 DIF is interesting for a few things: - Ensuring that the data integrity is preserved when writing a buffer to disk - Ensuring that the write ends up on the right hardware sector These features make the most sense in terms of WRITE. Disks already have plenty of CRC on the data so if a READ fails on a regular drive we already know about it. We can, however, leverage DIF with my proposal to expose the protection data to host memory. This will allow us to verify the data integrity information before passing it to the filesystem or application. We can say "this is really the information the disk sent. It hasn't been mangled along the way". And by using the APP tag we can mark a sector as - say - metadata or data to ease putting the recovery puzzle back together. It would be great if the app tag was more than 16 bits. Ted mentioned that ideally he'd like to store the inode number in the app tag. But as it stands there isn't room. In any case this is all slightly orthogonal to Ric's original post about finding the right persistence heuristics in the error handling path... -- Martin K. Petersen Oracle Linux Engineering - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: end to end error recovery musings
On Monday, February 26, 2007 9:42 AM, Ric Wheeler wrote:
> Which brings us back to a recent discussion at the file system
> workshop on being more repair oriented in file system design so we
> can survive situations like this a bit more reliably ;-)

On the second day of the workshop, there was a presentation given by
Martin K. Petersen on the Data Integrity Feature, also called EEDP (End
to End Data Protection), in which he presented some ideas/suggestions
for adding an API in Linux for this. I have his presentation if anyone
is interested. One thing is that the SCSI mid layer needs 32-byte CDB
support.

mpt fusion supports EEDP for some versions of Fibre products, and we
plan to add this for next generation SAS products. We support EEDP in
the Windows driver, where the driver generates its own tags. Our Linux
driver doesn't.

Here is our 32-byte passthru structure for SCSI_IO, defined in
mpi_init.h, which as you may notice has some tags and flags for EEDP:

typedef struct _MSG_SCSI_IO32_REQUEST
{
    U8                      Port;                          /* 00h */
    U8                      Reserved1;                     /* 01h */
    U8                      ChainOffset;                   /* 02h */
    U8                      Function;                      /* 03h */
    U8                      CDBLength;                     /* 04h */
    U8                      SenseBufferLength;             /* 05h */
    U8                      Flags;                         /* 06h */
    U8                      MsgFlags;                      /* 07h */
    U32                     MsgContext;                    /* 08h */
    U8                      LUN[8];                        /* 0Ch */
    U32                     Control;                       /* 14h */
    MPI_SCSI_IO32_CDB_UNION CDB;                           /* 18h */
    U32                     DataLength;                    /* 38h */
    U32                     BidirectionalDataLength;       /* 3Ch */
    U32                     SecondaryReferenceTag;         /* 40h */
    U16                     SecondaryApplicationTag;       /* 44h */
    U16                     Reserved2;                     /* 46h */
    U16                     EEDPFlags;                     /* 48h */
    U16                     ApplicationTagTranslationMask; /* 4Ah */
    U32                     EEDPBlockSize;                 /* 4Ch */
    MPI_SCSI_IO32_ADDRESS   DeviceAddress;                 /* 50h */
    U8                      SGLOffset0;                    /* 58h */
    U8                      SGLOffset1;                    /* 59h */
    U8                      SGLOffset2;                    /* 5Ah */
    U8                      SGLOffset3;                    /* 5Bh */
    U32                     Reserved3;                     /* 5Ch */
    U32                     Reserved4;                     /* 60h */
    U32                     SenseBufferLowAddr;            /* 64h */
    SGE_IO_UNION            SGL;                           /* 68h */
} MSG_SCSI_IO32_REQUEST, MPI_POINTER PTR_MSG_SCSI_IO32_REQUEST,
  SCSIIO32Request_t, MPI_POINTER pSCSIIO32Request_t;
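To make the EEDP plumbing concrete, here is a hedged sketch of how a
driver might fill those fields for a protected transfer. The field
semantics are one reading of the structure above; mpt_fill_eedp() is a
made-up helper, and the real EEDPFlags bit values live in the MPI
headers and are deliberately not reproduced here.

#include "mpi_init.h"	/* MSG_SCSI_IO32_REQUEST, U16/U32 types (assumed) */

/* Hypothetical helper: shows which field carries what, nothing more. */
static void mpt_fill_eedp(MSG_SCSI_IO32_REQUEST *req,
			  U32 start_lba, U16 eedp_flags)
{
	req->EEDPBlockSize = 512;		/* bytes per logical block */
	req->SecondaryReferenceTag = start_lba;	/* seed for the REF tag,
						   advanced per block by
						   firmware */
	req->SecondaryApplicationTag = 0;	/* APP tag, if used */
	req->ApplicationTagTranslationMask = 0xffff;
	req->EEDPFlags = eedp_flags;		/* enable/check bits from
						   the MPI headers */
}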
Re: end to end error recovery musings
> One interesting counter example is a smaller write than a full page -
> say 512 bytes out of 4k.
>
> If we need to do a read-modify-write and it just so happens that 1 of
> the 7 sectors we need to read is flaky, will this "look" like a write
> failure?

The current core kernel code can't handle propagating sub-page sized
errors up to the file system layers (there is nowhere in the page cache
to store 'part of this page is missing'). This is a long standing
(four year plus) problem with CD-RW support as well.

For ATA we can at least retrieve the true media sector size now, which
may be helpful at the physical layer, but the page cache would need to
grow some brains to do anything with it.
Re: end to end error recovery musings
Jeff Garzik wrote:
> Theodore Tso wrote:
>> Can someone with knowledge of current disk drive behavior confirm
>> that for all drives that support bad block sparing, if an attempt
>> to write to a particular spot on disk results in an error due to
>> bad media at that spot, the disk drive will automatically rewrite
>> the sector to a sector in its spare pool, and automatically
>> redirect that sector to the new location. I believe this should
>> always be true, so presumably with all modern disk drives a write
>> error should mean something very serious has happened.
>
> This is what will /probably/ happen. The drive should indeed find a
> spare sector and remap it, if the write attempt encounters a bad spot
> on the media.
>
> However, with a large enough write, a large enough bad-spot-on-media,
> and firmware programmed to never take more than X seconds to complete
> their enterprise customers' I/O, it might just fail.
>
> IMO, somewhere in the kernel, when we receive a read-op or write-op
> media error, we should immediately try to plaster that area with
> small writes. Sure, if it's a read-op you lost data, but this method
> will maximize the chance that you can refresh/reuse the logical
> sectors in question.
>
> Jeff

One interesting counter example is a smaller write than a full page -
say 512 bytes out of 4k.

If we need to do a read-modify-write and it just so happens that 1 of
the 7 sectors we need to read is flaky, will this "look" like a write
failure?

ric
Re: end to end error recovery musings
Theodore Tso wrote:
> Can someone with knowledge of current disk drive behavior confirm
> that for all drives that support bad block sparing, if an attempt to
> write to a particular spot on disk results in an error due to bad
> media at that spot, the disk drive will automatically rewrite the
> sector to a sector in its spare pool, and automatically redirect that
> sector to the new location. I believe this should always be true, so
> presumably with all modern disk drives a write error should mean
> something very serious has happened.

This is what will /probably/ happen. The drive should indeed find a
spare sector and remap it, if the write attempt encounters a bad spot
on the media.

However, with a large enough write, a large enough bad-spot-on-media,
and firmware programmed to never take more than X seconds to complete
their enterprise customers' I/O, it might just fail.

IMO, somewhere in the kernel, when we receive a read-op or write-op
media error, we should immediately try to plaster that area with small
writes. Sure, if it's a read-op you lost data, but this method will
maximize the chance that you can refresh/reuse the logical sectors in
question.

	Jeff
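A minimal sketch of that "plaster it with small writes" recovery,
assuming a hypothetical write_sector() helper at whatever layer would
own this:

#include <stdint.h>
#include <errno.h>

struct disk;	/* stand-in for the real device handle */
extern int write_sector(struct disk *d, uint64_t lba, const void *buf);

/* Rewrite [lba, lba + n) one sector at a time after a media error, so
 * each bad spot gets an independent chance to be remapped by the
 * drive instead of taking the whole request down with it. */
static int plaster_region(struct disk *d, uint64_t lba, unsigned int n,
			  const unsigned char *buf)
{
	unsigned int i, failed = 0;

	for (i = 0; i < n; i++)
		if (write_sector(d, lba + i, buf + i * 512) < 0)
			failed++;

	return failed ? -EIO : 0;  /* caller decides what to do with the
				      sectors that still won't stick */
}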
Re: end to end error recovery musings
Theodore Tso wrote:
> In any case, the reason why I bring this up is that it would be
> really nice if there was a way with a single laptop drive to be able
> to do snapshots and background fsck's without having to use initrd's
> with device mapper.

This is a major part of why I've been trying to push integrated klibc:
to have all that stuff as a unified "kernel" deliverable.
Unfortunately, as you know, Linus apparently rejected the concept "at
least for now" at LKS last year.

With klibc this stuff could still be in one single wrapper without
funny dependencies, but wouldn't have to be ported to kernel space.

	-hpa
Re: end to end error recovery musings
Alan wrote:
>> I think that this is mostly true, but we also need to balance this
>> against the need for higher levels to get a timely response. In a
>> really large IO, a naive retry of a very large write could lead to
>> a non-responsive system for a very large time...
>
> And losing the I/O could result in a system that is non responsive
> until the tape restore completes two days later

Which brings us back to a recent discussion at the file system workshop
on being more repair oriented in file system design, so we can survive
situations like this a bit more reliably ;-)

ric
Re: end to end error recovery musings
> I think that this is mostly true, but we also need to balance this
> against the need for higher levels to get a timely response. In a
> really large IO, a naive retry of a very large write could lead to a
> non-responsive system for a very large time...

And losing the I/O could result in a system that is non responsive
until the tape restore completes two days later
Re: end to end error recovery musings
Alan wrote:
>> the new location. I believe this should always be true, so
>> presumably with all modern disk drives a write error should mean
>> something very serious has happened.
>
> Not quite that simple.

I think that write errors are normally quite serious, but there are
exceptions which might be able to be worked around with retries. To
Ted's point, in general, a write to a bad spot on the media will cause
a remapping which should be transparent (if a bit slow) to us.

> If you write a block aligned size the same size as the physical media
> block size maybe this is true. If you write a sector on a device with
> physical sector size larger than logical block size (as allowed by
> say ATA7) then it's less clear what happens. I don't know if the
> drive firmware implements multiple "tails" in this case.
>
> On a read error it is worth trying the other parts of the I/O.

I think that this is mostly true, but we also need to balance this
against the need for higher levels to get a timely response. In a
really large IO, a naive retry of a very large write could lead to a
non-responsive system for a very large time...

ric
Re: end to end error recovery musings
On Mon, 2007-02-26 at 08:25 -0500, Theodore Tso wrote:
> Somewhat off-topic, but my one big regret with how the dm vs. evms
> competition settled out was that evms had the ability to perform
> block device snapshots using a non-LVM volume as the base --- and
> that EVMS allowed a single drive to be partially managed by the LVM
> layer, and partially managed by evms.

If all you want is a snapshot, md can do this today ... you just create
a RAID-1, resync it and then break it ... of course, you have to have
the filesystem mounted above an md device initially ...

James
Re: end to end error recovery musings
> the new location. I believe this should always be true, so presumably
> with all modern disk drives a write error should mean something very
> serious has happened.

Not quite that simple.

If you write a block aligned size the same size as the physical media
block size maybe this is true. If you write a sector on a device with
physical sector size larger than logical block size (as allowed by say
ATA7) then it's less clear what happens. I don't know if the drive
firmware implements multiple "tails" in this case.

On a read error it is worth trying the other parts of the I/O.

Alan
Re: end to end error recovery musings
On Mon, Feb 26, 2007 at 04:33:37PM +1100, Neil Brown wrote:
> Do we want a path in the other direction to handle write errors? The
> file system could say "Don't worry too much if this block cannot be
> written, just return an error and I will write it somewhere else"?
> This might allow md not to fail a whole drive if there is a single
> write error.

Can someone with knowledge of current disk drive behavior confirm that
for all drives that support bad block sparing, if an attempt to write
to a particular spot on disk results in an error due to bad media at
that spot, the disk drive will automatically rewrite the sector to a
sector in its spare pool, and automatically redirect that sector to the
new location. I believe this should always be true, so presumably with
all modern disk drives a write error should mean something very serious
has happened. (Or that someone was in the middle of reconfiguring a FC
network and they're running a kernel that doesn't understand why
short-duration FC timeouts should be retried. :-)

> Or is that completely unnecessary, as all modern devices do bad-block
> relocation for us? Is there any need for a bad-block-relocating layer
> in md or dm?

That's the question. It wouldn't be that hard for filesystems to be
able to remap a data block, but (a) it would be much more difficult for
fundamental metadata (for example, the inode table), and (b) it's
unnecessary complexity if the lower levels in the storage stack should
always be doing this for us in the case of media errors anyway.

> What about corrected-error counts? Drives provide them with SMART.
> The SCSI layer could provide some as well. Md can do a similar thing
> to some extent. Whether these are actually useful predictors of
> pending failure is unclear, but there could be some value. E.g.
> after a certain number of recovered errors raid5 could trigger a
> background consistency check, or a filesystem could trigger a
> background fsck should it support that.

Somewhat off-topic, but my one big regret with how the dm vs. evms
competition settled out was that evms had the ability to perform block
device snapshots using a non-LVM volume as the base --- and that EVMS
allowed a single drive to be partially managed by the LVM layer, and
partially managed by evms. What this allowed is the ability to do
device snapshots and therefore background fsck's without needing to
convert the entire laptop disk to an LVM solution (since to this day I
still don't trust initrd's to always do the right thing when I am
constantly replacing the kernel for kernel development).

I know, I'm weird; distro users have initrds that seem to mostly work,
and it's only weird developers who try to use bleeding edge kernels
with a RHEL4 userspace that suffer. But it's one of the reasons why
I've avoided initrd's like the plague --- I've wasted entire days
trying to debug problems with the userspace-provided initrd being too
old to support newer 2.6 development kernels.

In any case, the reason why I bring this up is that it would be really
nice if there was a way with a single laptop drive to be able to do
snapshots and background fsck's without having to use initrd's with
device mapper.

	- Ted
Re: end to end error recovery musings
H. Peter Anvin wrote:
> Ric Wheeler wrote:
>>
>> We still have the following challenges:
>>
>> (1) read-ahead often means that we will retry every bad sector at
>> least twice from the file system level. The first time, the fs read
>> ahead request triggers a speculative read that includes the bad
>> sector (triggering the error handling mechanisms) right before the
>> real application triggers a read that does the same thing. Not sure
>> what the answer is here since read-ahead is obviously a huge win in
>> the normal case.
>
> Probably the only sane thing to do is to remember the bad sectors and
> avoid attempting reading them; that would mean marking "automatic"
> versus "explicitly requested" requests to determine whether or not to
> filter them against a list of discovered bad blocks.

Some disks are doing their own "read-ahead" in the form of a background
media scan. Scans are done on request or periodically (e.g. once per
day or once per week) and we have tools that can fetch the scan results
from a disk (e.g. a list of unreadable sectors). What we don't have is
any way to feed such information to a file system that may be impacted.

Doug Gilbert
Re: end to end error recovery musings
On Friday February 23, [EMAIL PROTECTED] wrote:
> On Fri, Feb 23, 2007 at 05:37:23PM -0700, Andreas Dilger wrote:
> > > Probably the only sane thing to do is to remember the bad sectors
> > > and avoid attempting reading them; that would mean marking
> > > "automatic" versus "explicitly requested" requests to determine
> > > whether or not to filter them against a list of discovered bad
> > > blocks.
> >
> > And clearing this list when the sector is overwritten, as it will
> > almost certainly be relocated at the disk level. For that matter,
> > a huge win would be to have the MD RAID layer rewrite only the bad
> > sector (in hopes of the disk relocating it) instead of failing the
> > whole disk. Otherwise, a few read errors on different disks in a
> > RAID set can take the whole system offline. Apologies if this is
> > already done in recent kernels...

Yes, current md does this.

> And having a way of making this list available to both the filesystem
> and to a userspace utility, so they can more easily deal with doing a
> forced rewrite of the bad sector, after determining which file is
> involved and perhaps doing something intelligent (up to and including
> automatically requesting a backup system to fetch a backup version of
> the file, and if it can be determined that the file shouldn't have
> been changed since the last backup, automatically fixing up the
> corrupted data block :-).
>
> - Ted

So we want a clear path for media read errors from the device up to
user-space. Stacked devices (like md) would do appropriate mappings,
maybe (for raid0/linear at least; other levels wouldn't tolerate
errors).

There would need to be a limit on the number of 'bad blocks' that is
recorded. Maybe a mechanism to clear old bad blocks from the list is
needed.

Maybe if generic_make_request gets a request for a block which overlaps
a 'bad block' it returns an error immediately.

Do we want a path in the other direction to handle write errors? The
file system could say "Don't worry too much if this block cannot be
written, just return an error and I will write it somewhere else"?
This might allow md not to fail a whole drive if there is a single
write error.

Or is that completely unnecessary, as all modern devices do bad-block
relocation for us? Is there any need for a bad-block-relocating layer
in md or dm?

What about corrected-error counts? Drives provide them with SMART.
The SCSI layer could provide some as well. Md can do a similar thing
to some extent. Whether these are actually useful predictors of
pending failure is unclear, but there could be some value. E.g. after
a certain number of recovered errors raid5 could trigger a background
consistency check, or a filesystem could trigger a background fsck
should it support that.

Lots of interesting questions... not so many answers.

NeilBrown
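To make the bookkeeping concrete, a minimal sketch in C of the bad-block
list behavior described above: record on media error, clear on
successful overwrite, and fail overlapping requests early. A fixed
array stands in for whatever structure md/dm would really use, and all
names are illustrative.

#include <stdint.h>
#include <errno.h>

#define BB_MAX 512		/* the "limit on the number recorded" */

static uint64_t bb_list[BB_MAX];
static unsigned int bb_count;

/* Does [sector, sector + nr) overlap any recorded bad block? */
static int bb_contains(uint64_t sector, unsigned int nr)
{
	unsigned int i;

	for (i = 0; i < bb_count; i++)
		if (bb_list[i] >= sector && bb_list[i] < sector + nr)
			return 1;
	return 0;
}

/* Record a media error; what to do when full (age out? refuse?) is
 * exactly the open policy question raised above. */
static void bb_record(uint64_t sector)
{
	if (bb_count < BB_MAX && !bb_contains(sector, 1))
		bb_list[bb_count++] = sector;
}

/* Clear an entry after the sector is successfully rewritten, since the
 * drive has almost certainly relocated it. */
static void bb_clear(uint64_t sector)
{
	unsigned int i;

	for (i = 0; i < bb_count; i++)
		if (bb_list[i] == sector) {
			bb_list[i] = bb_list[--bb_count];
			return;
		}
}

/* generic_make_request()-style early failure for known-bad ranges. */
static int submit_read(uint64_t sector, unsigned int nr)
{
	if (bb_contains(sector, nr))
		return -EIO;	/* fail fast, don't touch the media */
	/* ... issue the real I/O ... */
	return 0;
}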
Re: end to end error recovery musings
On Fri, Feb 23, 2007 at 09:32:29PM -0500, Theodore Tso wrote:
> And having a way of making this list available to both the
> filesystem and to a userspace utility, so they can more easily deal
> with doing a forced rewrite of the bad sector, after determining
> which file is involved and perhaps doing something intelligent (up
> to and including automatically requesting a backup system to fetch a
> backup version of the file, and if it can be determined that the
> file shouldn't have been changed since the last backup,
> automatically fixing up the corrupted data block :-).

I had a small C program + perl script that would take a badblocks list
and figure out which files on an xfs filesystem were trashed, though in
the case of xfs it's somewhat easier because you can dump the extents
for a file.

Something more generic wouldn't be hard to make work. It also wouldn't
be hard to extend this to inodes in some cases, though I'm not sure
there is much you can do there beyond fsck.
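A generic version of that idea can be sketched with the FIBMAP ioctl
(root-only, and only on filesystems that implement bmap): walk a file's
logical blocks, map each to a filesystem block, and check the
corresponding sectors against the bad list. is_bad() and the
sector-unit convention are assumptions for illustration; a real tool
would have to convert badblocks output into the same units.

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <linux/fs.h>		/* FIBMAP, FIGETBSZ */

extern int is_bad(uint64_t sector);	/* lookup in the badblocks list */

int report_damage(const char *path)
{
	struct stat st;
	int fd, bsz, blk;
	unsigned int i, s, spb, nblocks;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0 || ioctl(fd, FIGETBSZ, &bsz) < 0) {
		close(fd);
		return -1;
	}

	spb = bsz / 512;			/* sectors per fs block */
	nblocks = (st.st_size + bsz - 1) / bsz;

	for (i = 0; i < nblocks; i++) {
		blk = i;			/* in: logical block */
		if (ioctl(fd, FIBMAP, &blk) < 0)
			break;
		if (!blk)			/* hole, nothing mapped */
			continue;
		for (s = 0; s < spb; s++)	/* out: physical block */
			if (is_bad((uint64_t)blk * spb + s)) {
				printf("%s: file block %u hits a bad "
				       "sector\n", path, i);
				break;
			}
	}
	close(fd);
	return 0;
}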
Re: end to end error recovery musings
On Fri, Feb 23, 2007 at 05:37:23PM -0700, Andreas Dilger wrote:
> > Probably the only sane thing to do is to remember the bad sectors
> > and avoid attempting reading them; that would mean marking
> > "automatic" versus "explicitly requested" requests to determine
> > whether or not to filter them against a list of discovered bad
> > blocks.
>
> And clearing this list when the sector is overwritten, as it will
> almost certainly be relocated at the disk level. For that matter, a
> huge win would be to have the MD RAID layer rewrite only the bad
> sector (in hopes of the disk relocating it) instead of failing the
> whole disk. Otherwise, a few read errors on different disks in a
> RAID set can take the whole system offline. Apologies if this is
> already done in recent kernels...

And having a way of making this list available to both the filesystem
and to a userspace utility, so they can more easily deal with doing a
forced rewrite of the bad sector, after determining which file is
involved and perhaps doing something intelligent (up to and including
automatically requesting a backup system to fetch a backup version of
the file, and if it can be determined that the file shouldn't have been
changed since the last backup, automatically fixing up the corrupted
data block :-).

	- Ted
Re: end to end error recovery musings
Andreas Dilger wrote:
> And clearing this list when the sector is overwritten, as it will
> almost certainly be relocated at the disk level.

Certainly if the overwrite is successful.

	-hpa
Re: end to end error recovery musings
On Feb 23, 2007 16:03 -0800, H. Peter Anvin wrote:
> Ric Wheeler wrote:
> > (1) read-ahead often means that we will retry every bad sector at
> > least twice from the file system level. The first time, the fs
> > read ahead request triggers a speculative read that includes the
> > bad sector (triggering the error handling mechanisms) right before
> > the real application triggers a read that does the same thing. Not
> > sure what the answer is here since read-ahead is obviously a huge
> > win in the normal case.
>
> Probably the only sane thing to do is to remember the bad sectors and
> avoid attempting reading them; that would mean marking "automatic"
> versus "explicitly requested" requests to determine whether or not to
> filter them against a list of discovered bad blocks.

And clearing this list when the sector is overwritten, as it will
almost certainly be relocated at the disk level. For that matter, a
huge win would be to have the MD RAID layer rewrite only the bad sector
(in hopes of the disk relocating it) instead of failing the whole disk.
Otherwise, a few read errors on different disks in a RAID set can take
the whole system offline. Apologies if this is already done in recent
kernels...

Cheers, Andreas
-- 
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Re: end to end error recovery musings
Ric Wheeler wrote:
> We still have the following challenges:
>
> (1) read-ahead often means that we will retry every bad sector at
> least twice from the file system level. The first time, the fs read
> ahead request triggers a speculative read that includes the bad
> sector (triggering the error handling mechanisms) right before the
> real application triggers a read that does the same thing. Not sure
> what the answer is here since read-ahead is obviously a huge win in
> the normal case.

Probably the only sane thing to do is to remember the bad sectors and
avoid attempting reading them; that would mean marking "automatic"
versus "explicitly requested" requests to determine whether or not to
filter them against a list of discovered bad blocks.

	-hpa
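A rough sketch of what that filtering could look like at the request
level, reusing a bad-block lookup like the one sketched earlier in the
thread; the flag name and return conventions are invented for
illustration:

#include <stdint.h>
#include <errno.h>

extern int bb_contains(uint64_t sector, unsigned int nr);

#define REQ_SPECULATIVE 0x1	/* set by read-ahead, never by read() */

static int submit_read_filtered(uint64_t sector, unsigned int nr,
				unsigned int flags)
{
	if (bb_contains(sector, nr)) {
		if (flags & REQ_SPECULATIVE)
			return -EAGAIN;	/* quietly drop the read-ahead */
		return -EIO;		/* explicit read: report the error */
	}
	/* ... issue the real I/O ... */
	return 0;
}

That way the speculative pass never re-triggers error handling, while
an explicit read of a known-bad sector still fails promptly and
honestly.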
end to end error recovery musings
In the IO/FS workshop, one idea we kicked around is the need to provide
better and more specific error messages between the IO stack and the
file system layer. My group has been working to stabilize a relatively
up to date libata + MD based box, so I can try to lay out at least one
"appliance like" typical configuration to help frame the issue. We are
working on a relatively large appliance, but you can buy (or build)
similar home appliances that use Linux to provide a NAS in a box for
end users.

The use case that we have is an ICH6R/AHCI box with 4 large (500+ GB)
drives, with some of the small system partitions on a 4-way RAID1
device. The libata version we have is a backport of 2.6.18 onto
SLES10, so the error handling at the libata level is a huge improvement
over what we had before. Each box has a watchdog timer that can be set
to fire after at most 2 minutes. (We have a second flavor of this box
with an ICH5 and P-ATA drives using the non-libata drivers that has a
similar use case.)

Using the patches that Mark sent around recently for error injection,
we inject media errors into one or more drives and try to see how
smoothly error handling runs and, importantly, whether or not the error
handling will complete before the watchdog fires and reboots the box.
If you want to be especially mean, inject errors into the RAID
superblocks on 3 out of the 4 drives.

We still have the following challenges:

(1) read-ahead often means that we will retry every bad sector at
least twice from the file system level. The first time, the fs read
ahead request triggers a speculative read that includes the bad sector
(triggering the error handling mechanisms) right before the real
application triggers a read that does the same thing. Not sure what
the answer is here since read-ahead is obviously a huge win in the
normal case.

(2) the patches that were floating around on how to make sure that we
effectively handle single sector errors in a large IO request are
critical. On one hand, we want to combine adjacent IO requests into
larger IOs whenever possible. On the other hand, when the combined IO
fails, we need to isolate the error to the correct range, avoid
reissuing a request that touches that sector again, and communicate up
the stack to the file system/MD what really failed. All of this needs
to complete in tens of seconds, not multiple minutes. (One way to do
the isolation is sketched at the end of this note.)

(3) the timeout values on the failed IOs need to be tuned well (as was
discussed in an earlier linux-ide thread). We cannot afford to hang
for 30 seconds, especially in the MD case, since you might need to fail
more than one device for a single IO. Prompt error propagation (say
that 4 times quickly!) can allow MD to mask the underlying errors as
you would hope; hanging on too long will almost certainly cause a
watchdog reboot...

(4) the newish libata+SCSI stack is pretty good at handling disk
errors, but adding in MD can actually reduce the reliability of your
system unless you tune the error handling correctly.

We will follow up with specific issues as they arise, but I wanted to
lay out a use case that can help frame part of the discussion. I also
want to encourage people to inject real disk errors with Mark's patches
so we can share the pain ;-)

ric
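As a concrete illustration of the error isolation item (2) asks for,
here is a minimal sketch that bisects a failed combined IO down to the
offending sector(s) with O(log n) retries per bad sector, feeding the
result to a bad-block list. All names are invented; read_range() is a
stand-in for reissuing part of the original request.

#include <stdint.h>

struct disk;	/* stand-in for the real device handle */
extern int read_range(struct disk *d, uint64_t lba, unsigned int nr);
extern void report_bad_sector(struct disk *d, uint64_t lba);

/* Called after read_range(d, lba, nr) has already failed once:
 * recursively retry each half until single sectors are pinned down. */
static void isolate_failure(struct disk *d, uint64_t lba, unsigned int nr)
{
	unsigned int half;

	if (nr == 1) {
		report_bad_sector(d, lba);  /* feed the bad-block list */
		return;
	}
	half = nr / 2;
	if (read_range(d, lba, half) < 0)
		isolate_failure(d, lba, half);
	if (read_range(d, lba + half, nr - half) < 0)
		isolate_failure(d, lba + half, nr - half);
}

Each retry still pays the device's full command timeout, which is
exactly why the timeout tuning in item (3) matters so much.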