Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

I just had a talk with a colleague, John Palmer, who worked on disk drive 
design for about 5 years in the '90s and he gave me a very confident, 
credible explanation of some of the things we've been wondering about disk 
drive power loss in this thread, complete with demonstrations of various 
generations of disk drives, dismantled.

First of all, it is plain to see that there is no spring capable of 
parking the head, and there is no capacitor that looks big enough to 
possibly supply the energy to park the head, in any of the models I looked 
at.  Since parking of the heads is essential, we can only conclude that 
the myth of the kinetic energy of the disks being used for that (turned 
into electricity by the drive motor) is true.  The energy required is not 
just to move the heads to the parking zone, but to latch them there as 
well.

The myth is probably just that that energy is used for anything else; it's 
really easy to build a dumb circuit to park the heads using that power; 
keeping a computer running is something else.
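
For a rough sense of scale, the kinetic energy stored in a spinning platter can be estimated from E = 1/2 I w^2, treating the platter as a uniform solid disk (I = 1/2 m r^2). This is only a back-of-the-envelope sketch: the mass, radius and spindle speed below are illustrative assumptions, not measurements of any of the drives that were dismantled.

```python
import math

def platter_kinetic_energy(mass_kg, radius_m, rpm):
    """Kinetic energy in joules of a uniform solid disk spinning at `rpm`."""
    moment_of_inertia = 0.5 * mass_kg * radius_m ** 2   # I = 1/2 m r^2
    omega = rpm * 2.0 * math.pi / 60.0                   # rad/s
    return 0.5 * moment_of_inertia * omega ** 2          # E = 1/2 I w^2

# Assumed figures for a single 3.5" platter (hypothetical, for illustration):
# 20 g mass, 47.5 mm radius, 7200 rpm.
energy = platter_kinetic_energy(mass_kg=0.02, radius_m=0.0475, rpm=7200)
print(f"stored energy per platter: {energy:.1f} J")
```

Under those assumptions a single platter holds a few joules, which is consistent with the idea that the spindle can power a simple parking circuit for a moment but not much else.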

The drive does drop a write in the middle of the sector if it is writing 
at the time of power loss.  The designers were too conservative to keep 
writing as power fails -- there's no telling what damage you might do.  So 
the drive cuts the power to the heads at the first sign of power loss.  If 
a write was in progress, this means there is one garbage sector on the 
disk.  It can't be read.

Trying to finish writing the sector is something I can imagine some drive 
model somewhere trying to do, but if even _some_ take the conservative 
approach, everyone has to design for it, so it doesn't matter.

A device might then reassign that sector the next time you try to write to 
it (after failing to read it), thinking the medium must be bad.  But there 
are various algorithms for deciding when to reassign a sector, so it might 
not.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-18 Thread Jeff Garzik


Ric Wheeler wrote:
> Theodore Tso wrote:
>> On Thu, Jan 17, 2008 at 04:31:48PM -0800, Bryan Henderson wrote:
>>> But I heard some years ago from a disk drive engineer that that is a
>>> myth just like the rotational energy thing.  I added that to the
>>> discussion, but admitted that I haven't actually seen a disk drive
>>> write a partial sector.
>>
>> Well, it would be impossible or at least very hard to see that in
>> practice, right?  My understanding is that drives do sector-level
>> checksums, so if there was a partially written sector, the checksum
>> would be bogus and the drive would return an error when you tried to
>> read from it.
>
> There is extensive per sector error correction on each sector written.
> What you would see in this case (or many, many other possible ways
> drives can corrupt media) is a "media error" on the next read.

Correct.

> You would never get back the partially written contents of that sector
> at the host.

Correct.

> Having our tools (fsck especially) be resilient in the face of media
> errors is really critical. Although I don't think the scenario of a
> partially written sector is common, media errors in general are common
> and can develop over time.

Agreed.

Jeff




Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

"H. Peter Anvin" <[EMAIL PROTECTED]> wrote on 01/18/2008 07:08:30 AM:

> Bryan Henderson wrote:
> > 
> > We weren't actually talking about writing out the cache.  While that was
> > part of an earlier thread which ultimately conceded that disk drives most
> > probably do not use the spinning disk energy to write out the cache, the
> > claim was then made that the drive at least survives long enough to finish
> > writing the sector it was writing, thereby maintaining the integrity of
> > the data at the drive level.  People often say that a disk drive
> > guarantees atomic writes at the sector level even in the face of a power
> > failure.
> > 
> > But I heard some years ago from a disk drive engineer that that is a myth
> > just like the rotational energy thing.  I added that to the discussion,
> > but admitted that I haven't actually seen a disk drive write a partial
> > sector.
> > 
> 
> A disk drive whose power is cut needs to have enough residual power to
> park its heads (or *massive* data loss will occur), and at that point it
> might as well keep enough on hand to finish an in-progress sector write.
> 
> There are two possible sources of onboard temporary power: a large
> enough capacitor, or the rotational energy of the platters (an
> electrical motor also being a generator.)  I don't care which one they
> use, but they need to do something.

I believe the power for that comes from a third source: a spring.  Parking 
the heads is too important to leave to active circuits.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems


Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-18 Thread Ric Wheeler


Theodore Tso wrote:
> On Thu, Jan 17, 2008 at 04:31:48PM -0800, Bryan Henderson wrote:
>> But I heard some years ago from a disk drive engineer that that is a myth
>> just like the rotational energy thing.  I added that to the discussion,
>> but admitted that I haven't actually seen a disk drive write a partial
>> sector.
>
> Well, it would be impossible or at least very hard to see that in
> practice, right?  My understanding is that drives do sector-level
> checksums, so if there was a partially written sector, the checksum
> would be bogus and the drive would return an error when you tried to
> read from it.

There is extensive per sector error correction on each sector written. 
What you would see in this case (or many, many other possible ways 
drives can corrupt media) is a "media error" on the next read.

You would never get back the partially written contents of that sector 
at the host.

Having our tools (fsck especially) be resilient in the face of media 
errors is really critical. Although I don't think the scenario of a 
partially written sector is common, media errors in general are common 
and can develop over time.

>> Ted brought up the separate issue of the host sending garbage to the disk
>> device because its own power is failing at the same time, which makes the
>> integrity at the disk level moot (or even undesirable, as you'd rather
>> write a bad sector than a good one with the wrong data).
>
> Yep, exactly.  It would be interesting to see if this happens on
> modern hardware; all of the evidence I've had for this is years old at
> this point.
>
> - Ted

See the NetApp paper from Sigmetrics 2007 for some interesting analysis...


ric

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-18 Thread H. Peter Anvin


Bryan Henderson wrote:
> We weren't actually talking about writing out the cache.  While that was
> part of an earlier thread which ultimately conceded that disk drives most
> probably do not use the spinning disk energy to write out the cache, the
> claim was then made that the drive at least survives long enough to finish
> writing the sector it was writing, thereby maintaining the integrity of
> the data at the drive level.  People often say that a disk drive
> guarantees atomic writes at the sector level even in the face of a power
> failure.
>
> But I heard some years ago from a disk drive engineer that that is a myth
> just like the rotational energy thing.  I added that to the discussion,
> but admitted that I haven't actually seen a disk drive write a partial
> sector.

Did he work for Maxtor, by any chance?  :-/

A disk drive whose power is cut needs to have enough residual power to 
park its heads (or *massive* data loss will occur), and at that point it 
might as well keep enough on hand to finish an in-progress sector write.

There are two possible sources of onboard temporary power: a large 
enough capacitor, or the rotational energy of the platters (an 
electrical motor also being a generator.)  I don't care which one they 
use, but they need to do something.


-hpa


Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-18 Thread Theodore Tso

On Thu, Jan 17, 2008 at 04:31:48PM -0800, Bryan Henderson wrote:
> But I heard some years ago from a disk drive engineer that that is a myth 
> just like the rotational energy thing.  I added that to the discussion, 
> but admitted that I haven't actually seen a disk drive write a partial 
> sector.

Well, it would be impossible or at least very hard to see that in
practice, right?  My understanding is that drives do sector-level
checksums, so if there was a partially written sector, the checksum
would be bogus and the drive would return an error when you tried to
read from it.
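
The argument above can be illustrated with a toy model: if each sector is stored together with a check value, a sector that is half new data and half old data fails verification and is reported as an error instead of being returned. This is only a sketch. It uses CRC-32 from Python's zlib as a stand-in for whatever error-correcting code a real drive uses, and the 512-byte sector layout, `MediaError`, and the helper names are all assumptions made for illustration.

```python
import zlib

SECTOR_SIZE = 512  # assumed sector payload size

class MediaError(Exception):
    """Raised when a stored sector fails its check value (hypothetical)."""

def encode_sector(data):
    """Return the on-media image: payload followed by a 4-byte CRC-32."""
    assert len(data) == SECTOR_SIZE
    return data + zlib.crc32(data).to_bytes(4, "big")

def read_sector(raw):
    """Return the payload, or raise MediaError if the check value is wrong."""
    data = raw[:SECTOR_SIZE]
    stored = int.from_bytes(raw[SECTOR_SIZE:], "big")
    if zlib.crc32(data) != stored:
        raise MediaError("checksum mismatch")
    return data

def is_readable(raw):
    try:
        read_sector(raw)
        return True
    except MediaError:
        return False

old = encode_sector(b"A" * SECTOR_SIZE)
new = encode_sector(b"B" * SECTOR_SIZE)

# Power fails 200 bytes into the write: the head of the sector is new,
# the tail (including the old check value) is still the previous contents.
torn = new[:200] + old[200:]

print(is_readable(old), is_readable(new), is_readable(torn))
```

In this model the host never sees the half-written payload; it only sees that the read failed.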

> Ted brought up the separate issue of the host sending garbage to the disk 
> device because its own power is failing at the same time, which makes the 
> integrity at the disk level moot (or even undesirable, as you'd rather 
> write a bad sector than a good one with the wrong data).

Yep, exactly.  It would be interesting to see if this happens on
modern hardware; all of the evidence I've had for this is years old at
this point.  

- Ted

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

Ric Wheeler <[EMAIL PROTECTED]> wrote on 01/17/2008 03:18:05 PM:

> Theodore Tso wrote:
> > On Wed, Jan 16, 2008 at 09:02:50PM -0500, Daniel Phillips wrote:
> >> Have you observed that in the wild?  A former engineer of a disk drive
> >> company suggests to me that the capacitors on the board provide enough
> >> power to complete the last sector, even to park the head.
> >>
> 
> Even if true (which I doubt), this is not implemented.
> 
> A modern drive can have 16-32 MB of write cache. Worst case, those 
> sectors are not sequential which implies lots of head movement.

We weren't actually talking about writing out the cache.  While that was 
part of an earlier thread which ultimately conceded that disk drives most 
probably do not use the spinning disk energy to write out the cache, the 
claim was then made that the drive at least survives long enough to finish 
writing the sector it was writing, thereby maintaining the integrity of 
the data at the drive level.  People often say that a disk drive 
guarantees atomic writes at the sector level even in the face of a power 
failure.

But I heard some years ago from a disk drive engineer that that is a myth 
just like the rotational energy thing.  I added that to the discussion, 
but admitted that I haven't actually seen a disk drive write a partial 
sector.

Ted brought up the separate issue of the host sending garbage to the disk 
device because its own power is failing at the same time, which makes the 
integrity at the disk level moot (or even undesirable, as you'd rather 
write a bad sector than a good one with the wrong data).

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems


Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-17 Thread Ric Wheeler


Theodore Tso wrote:
> On Wed, Jan 16, 2008 at 09:02:50PM -0500, Daniel Phillips wrote:
>> Have you observed that in the wild?  A former engineer of a disk drive
>> company suggests to me that the capacitors on the board provide enough
>> power to complete the last sector, even to park the head.

Even if true (which I doubt), this is not implemented.

A modern drive can have 16-32 MB of write cache. Worst case, those 
sectors are not sequential, which implies lots of head movement.

> The problem isn't with the disk drive; it's from the DRAM, which tends
> to be much more voltage sensitive than the hard drives --- so it's
> quite likely that you could end up DMA'ing garbage from the memory.
> In fact, the fact that the disk drive lasts longer due to capacitors
> on the board, rotational inertia of the platters, etc., is part of the
> problem.

I can tell you directly that when you drop power to a drive, you will 
lose write cache data if the write cache is enabled. With barriers 
enabled, our testing shows that file systems survive power failures 
which routinely caused corruption without them ;-)

ric



Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-17 Thread Alan Cox


> interrupt which caused the Irix to run around frantically shutting
> down DMA's for a controlled shutdown.  Of course, PC-class hardware
> has none of this.  My source for this was Jim Mostek, one of the

PC class hardware has a power good signal which drops just before the
rest.

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-17 Thread Daniel Phillips

On Jan 17, 2008 7:29 AM, Szabolcs Szakacsits <[EMAIL PROTECTED]> wrote:
> Similarly to ZFS, Windows Server 2008 also has self-healing NTFS:

I guess that is enough votes to justify going ahead and trying an
implementation of the reverse mapping ideas I posted.  But of course
more votes for this is better.  If online incremental fsck is
something people want, then please speak up here and that will very
definitely help make it happen.

On the walk-before-run principle, it would initially just be
filesystem checking, not repair.  But even this would help, by setting
per-group checked flags that offline fsck could use to do a much
quicker repair pass.  And it will let you know when a volume needs to
be taken offline without having to build in planned downtime just in
case, which already eats a bunch of nines.
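
The per-group checked flags could look roughly like the sketch below. It is purely illustrative: the `BlockGroup` record, the `checked_clean` flag and the function names are hypothetical, not existing ext3 structures. The online pass only records what it found, and the offline pass uses the flags to decide which groups still need attention.

```python
from dataclasses import dataclass

@dataclass
class BlockGroup:
    number: int
    checked_clean: bool = False   # hypothetical flag set by the online checker

def online_check(groups, is_consistent):
    """Check every group while mounted; record results, repair nothing."""
    for group in groups:
        group.checked_clean = is_consistent(group)

def mark_modified(group):
    """Any later write to a group invalidates its earlier clean result."""
    group.checked_clean = False

def groups_needing_offline_pass(groups):
    """Groups an offline fsck would still have to visit."""
    return [g.number for g in groups if not g.checked_clean]

groups = [BlockGroup(n) for n in range(8)]
# Stand-in consistency test for the example: pretend group 5 is damaged.
online_check(groups, is_consistent=lambda g: g.number != 5)
print(groups_needing_offline_pass(groups))
```

An empty list would mean no downtime is needed at all; a short list bounds how much work the offline repair pass has to do.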

Regards,

Daniel

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-17 Thread Theodore Tso

On Wed, Jan 16, 2008 at 09:02:50PM -0500, Daniel Phillips wrote:
> 
> Have you observed that in the wild?  A former engineer of a disk drive
> company suggests to me that the capacitors on the board provide enough
> power to complete the last sector, even to park the head.
> 

The problem isn't with the disk drive; it's from the DRAM, which tends
to be much more voltage sensitive than the hard drives --- so it's
quite likely that you could end up DMA'ing garbage from the memory.
In fact, the fact that the disk drive lasts longer due to capacitors
on the board, rotational inertia of the platters, etc., is part of the
problem.

It was observed in the wild by SGI, many years ago on their hardware.
They later added extra capacitors on the motherboard and a powerfail
interrupt which caused the Irix to run around frantically shutting
down DMA's for a controlled shutdown.  Of course, PC-class hardware
has none of this.  My source for this was Jim Mostek, one of the
original Linux XFS porters.  He had given me source code to a test
program that would show this; it basically zeroed out a region of disk,
then started writing a series of patterns on that part of the disk, and
you kicked out the power cord, and then looked to see if there was any
garbage on the disk.  If you saw something that wasn't one of the patterns
being written to the disk, then you knew you had a problem.  I can't
find the program any more, but it wouldn't be hard to write.
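
A test along the lines described might be sketched as follows. This is not the original program, which is lost; the block size, the patterns and the function names are assumptions. The writer runs until the power cord is pulled, and the verifier is run after reboot on whatever the region then contains.

```python
import os

BLOCK_SIZE = 512
# Assumed patterns: every block is filled with one repeated byte.
# PATTERNS[0] is the all-zero block written when the region is cleared.
PATTERNS = [bytes([b]) * BLOCK_SIZE for b in (0x00, 0x55, 0xAA, 0xFF)]

def write_patterns_until_power_loss(path, nblocks):
    """Phase 1: zero the region, then overwrite it with patterns forever.

    A real test would probably also want O_DIRECT or O_SYNC so that the
    page cache does not hide what reaches the disk; that is omitted here.
    """
    fd = os.open(path, os.O_WRONLY)
    try:
        os.pwrite(fd, PATTERNS[0] * nblocks, 0)
        os.fsync(fd)
        while True:  # runs until the power cord is kicked out
            for pattern in PATTERNS[1:]:
                for block in range(nblocks):
                    os.pwrite(fd, pattern, block * BLOCK_SIZE)
                os.fsync(fd)
    finally:
        os.close(fd)

def find_garbage(region):
    """Phase 2: return the indexes of blocks that match none of the patterns."""
    bad = []
    for offset in range(0, len(region), BLOCK_SIZE):
        if region[offset:offset + BLOCK_SIZE] not in PATTERNS:
            bad.append(offset // BLOCK_SIZE)
    return bad
```

After reboot, reading the region back and passing it to `find_garbage` gives the list of suspicious blocks; a block that cannot be read at all would show up as an I/O error before this check even runs.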

I do know that I have seen reports from many ext2 users in the field
that could only be explained by the hard drive scribbling garbage onto
the inode table.  Ext3 solves this problem because of its physical
block journaling.

- Ted

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

"Daniel Phillips" <[EMAIL PROTECTED]> wrote on 01/16/2008 06:02:50 PM:

> On Jan 16, 2008 2:06 PM, Bryan Henderson <[EMAIL PROTECTED]> wrote:
> > >The "disk motor as a generator" tale may not be purely folklore. When
> > >an IDE drive is not in writeback mode, something special needs to be done
> > >to ensure the last write to media is not a scribble.
> >
> > No it doesn't.  The last write _is_ a scribble.
> 
> Have you observed that in the wild?  A former engineer of a disk drive
> company suggests to me that the capacitors on the board provide enough
> power to complete the last sector, even to park the head.

No, I haven't.  It's hearsay, and from about 3 years ago.

As for parking the head, that's hard to believe, since it's so easy and 
more reliable to use a spring and an electromagnet.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems


Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-17 Thread Pavel Machek

On Tue 2008-01-15 20:36:16, Chris Mason wrote:
> On Tue, 15 Jan 2008 20:24:27 -0500
> "Daniel Phillips" <[EMAIL PROTECTED]> wrote:
> 
> > On Jan 15, 2008 7:15 PM, Alan Cox <[EMAIL PROTECTED]> wrote:
> > > > Writeback cache on disk in itself is not bad, it only gets bad
> > > > if the disk is not engineered to save all its dirty cache on
> > > > power loss, using the disk motor as a generator or alternatively
> > > > a small battery. It would be awfully nice to know which brands
> > > > fail here, if any, because writeback cache is a big performance
> > > > booster.
> > >
> > > AFAIK no drive saves the cache. The worst case cache flush for
> > > drives is several seconds with no retries and a couple of minutes
> > > if something really bad happens.
> > >
> > > This is why the kernel has some knowledge of barriers and uses them
> > > to issue flushes when needed.
> > 
> > Indeed, you are right, which is supported by actual measurements:
> > 
> > http://sr5tech.com/write_back_cache_experiments.htm
> > 
> > Sorry for implying that anybody has engineered a drive that can do
> > such a nice thing with writeback cache.
> > 
> > The "disk motor as a generator" tale may not be purely folklore.  When
> > an IDE drive is not in writeback mode, something special needs to be done
> > to ensure the last write to media is not a scribble.
> > 
> > A small UPS can make writeback mode actually reliable, provided the
> > system is smart enough to take the drives out of writeback mode when
> > the line power is off.
> 
> We've had mount -o barrier=1 for ext3 for a while now, it makes
> writeback caching safe.  XFS has this on by default, as does reiserfs.

Maybe ext3 should do barriers by default? Having ext3 in "let's corrupt
data by default"... seems like a bad idea.
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-17 Thread Szabolcs Szakacsits

On Tue, 15 Jan 2008, Daniel Phillips wrote:

> Along with this effort, could you let me know if the world actually
> cares about online fsck? Now we know how to do it I think, but is it
> worth the effort.

Most users seem to care deeply about "things just work". Here is why
ntfs-3g also took the online fsck path some time ago.

NTFS support had a very bad reputation on Linux, so the new code was
written with rigid sanity checks and extensive automatic regression
testing. One of the consequences is that we're detecting way too many
inconsistencies left behind by Windows and other NTFS drivers,
hardware faults, and device drivers.

To better utilize the non-existing developer resources, it was obvious to
suggest the already existing Windows fsck (chkdsk) in such cases. Simple
and safe, as most people like us who never used Windows would think.

However, years of experience show that depending on several factors chkdsk
may start or not, may report the real problems or not, but on the other
hand it may report bogus issues, it may run long or just forever, and it
may even remove completely valid files. So one could perhaps even consider
a suggestion to run chkdsk a call to play Russian roulette.

Thankfully NTFS has some level of metadata redundancy with signatures and
weak "checksums" which make it possible to correct some common and obvious
corruptions on the fly.

Similarly to ZFS, Windows Server 2008 also has self-healing NTFS:
http://technet2.microsoft.com/windowsserver2008/en/library/6f883d0d-3668-4e15-b7ad-4df0f6e6805d1033.mspx?mfr=true

Szaka

--
NTFS-3G: http://ntfs-3g.org

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

Daniel Phillips <[EMAIL PROTECTED]> wrote on 01/16/2008 06:02:50 PM:

> On Jan 16, 2008 2:06 PM, Bryan Henderson <[EMAIL PROTECTED]> wrote:
> > >The "disk motor as a generator" tale may not be purely folklore.  When
> > >an IDE drive is not in writeback mode, something special needs to be
> > >done to ensure the last write to media is not a scribble.
> >
> > No it doesn't.  The last write _is_ a scribble.
>
> Have you observed that in the wild?  A former engineer of a disk drive
> company suggests to me that the capacitors on the board provide enough
> power to complete the last sector, even to park the head.

No, I haven't.  It's hearsay, and from about 3 years ago.

As for parking the head, that's hard to believe, since it's so easy and 
more reliable to use a spring and an electromagnet.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems


Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-17 Thread Daniel Phillips

On Jan 17, 2008 7:29 AM, Szabolcs Szakacsits <[EMAIL PROTECTED]> wrote:
> Similarly to ZFS, Windows Server 2008 also has self-healing NTFS:

I guess that is enough votes to justify going ahead and trying an
implementation of the reverse mapping ideas I posted.  But of course
more votes for this is better.  If online incremental fsck is
something people want, then please speak up here and that will very
definitely help make it happen.

On the walk-before-run principle, it would initially just be
filesystem checking, not repair.  But even this would help, by setting
per-group checked flags that offline fsck could use to do a much
quicker repair pass.  And it will let you know when a volume needs to
be taken offline without having to build in planned downtime just in
case, which already eats a bunch of nines.

Regards,

Daniel

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-17 Thread Theodore Tso

On Wed, Jan 16, 2008 at 09:02:50PM -0500, Daniel Phillips wrote:
>
> Have you observed that in the wild?  A former engineer of a disk drive
> company suggests to me that the capacitors on the board provide enough
> power to complete the last sector, even to park the head.
>

The problem isn't with the disk drive; it's from the DRAM, which tends
to be much more voltage sensitive than the hard drives --- so it's
quite likely that you could end up DMA'ing garbage from the memory.
In fact, the fact that the disk drives last longer due to capacitors
on the board, rotational inertia of the platters, etc., is part of the
problem.

It was observed in the wild by SGI, many years ago on their hardware.
They later added extra capacitors on the motherboard and a powerfail
interrupt which caused the Irix to run around frantically shutting
down DMA's for a controlled shutdown.  Of course, PC-class hardware
has none of this.  My source for this was Jim Mostek, one of the
original Linux XFS porters.  He had given me source code to a test
program that would show this; basically it zeroed out a region of disk,
then started writing a series of patterns on that part of the disk, and
then you kicked out the power cord and looked to see if there was any
garbage on the disk.  If you saw something that wasn't one of the patterns
being written to the disk, then you knew you had a problem.  I can't
find the program any more, but it wouldn't be hard to write.  

I do know that I have seen reports from many ext2 users in the field
that could only be explained by the hard drive scribbling garbage onto
the inode table.  Ext3 solves this problem because of its physical
block journaling.

- Ted

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

Ric Wheeler <[EMAIL PROTECTED]> wrote on 01/17/2008 03:18:05 PM:

> Theodore Tso wrote:
> > On Wed, Jan 16, 2008 at 09:02:50PM -0500, Daniel Phillips wrote:
> >> Have you observed that in the wild?  A former engineer of a disk drive
> >> company suggests to me that the capacitors on the board provide enough
> >> power to complete the last sector, even to park the head.
>
> Even if true (which I doubt), this is not implemented.
>
> A modern drive can have 16-32 MB of write cache. Worst case, those
> sectors are not sequential which implies lots of head movement.

We weren't actually talking about writing out the cache.  While that was 
part of an earlier thread which ultimately conceded that disk drives most 
probably do not use the spinning disk energy to write out the cache, the 
claim was then made that the drive at least survives long enough to finish 
writing the sector it was writing, thereby maintaining the integrity of 
the data at the drive level.  People often say that a disk drive 
guarantees atomic writes at the sector level even in the face of a power 
failure.

But I heard some years ago from a disk drive engineer that that is a myth 
just like the rotational energy thing.  I added that to the discussion, 
but admitted that I haven't actually seen a disk drive write a partial 
sector.

Ted brought up the separate issue of the host sending garbage to the disk 
device because its own power is failing at the same time, which makes the 
integrity at the disk level moot (or even undesirable, as you'd rather 
write a bad sector than a good one with the wrong data).

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems


Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-17 Thread Ric Wheeler


Theodore Tso wrote:
> On Wed, Jan 16, 2008 at 09:02:50PM -0500, Daniel Phillips wrote:
>> Have you observed that in the wild?  A former engineer of a disk drive
>> company suggests to me that the capacitors on the board provide enough
>> power to complete the last sector, even to park the head.



Even if true (which I doubt), this is not implemented.

A modern drive can have 16-32 MB of write cache. Worst case, those 
sectors are not sequential which implies lots of head movement.
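
To put a rough number on "lots of head movement", here is a
back-of-the-envelope sketch. The 4 KiB size per scattered write and the
10 ms per head movement are assumptions picked for illustration, not
measurements from any drive discussed in this thread.

```python
def worst_case_flush_seconds(cache_bytes, write_bytes, access_seconds):
    """Time to drain a full cache when every write needs its own head movement."""
    writes = -(-cache_bytes // write_bytes)   # ceiling division
    return writes * access_seconds


MIB = 1024 * 1024

if __name__ == "__main__":
    for cache_mib in (16, 32):
        # Assumed: 4 KiB per scattered write, 10 ms per seek plus rotation.
        t = worst_case_flush_seconds(cache_mib * MIB, 4096, 0.010)
        print(f"{cache_mib} MiB cache: about {t:.0f} s")
```

Under those assumptions a full 32 MiB cache takes on the order of 80
seconds to drain; larger contiguous runs bring that down quickly, so the
figure is only as good as the assumed write size.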




> The problem isn't with the disk drive; it's from the DRAM, which tends
> to be much more voltage sensitive than the hard drives --- so it's
> quite likely that you could end up DMA'ing garbage from the memory.
> In fact, the fact that the disk drives last longer due to capacitors
> on the board, rotational inertia of the platters, etc., is part of the
> problem.


I can tell you directly that when you drop power to a drive, you will 
lose write cache data if the write cache is enabled. With barriers 
enabled, our testing shows that file systems survive power failures 
which routinely caused corruption without them ;-)


ric



Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-17 Thread Alan Cox


> interrupt which caused the Irix to run around frantically shutting
> down DMA's for a controlled shutdown.  Of course, PC-class hardware
> has none of this.  My source for this was Jim Mostek, one of the

PC class hardware has a power good signal which drops just before the
rest.

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-16 Thread Andreas Dilger

On Jan 15, 2008  22:05 -0500, Rik van Riel wrote:
> With a filesystem that is compartmentalized and checksums metadata,
> I believe that an online fsck is absolutely worth having.
> 
> Instead of the filesystem resorting to mounting the whole volume
> read-only on certain errors, part of the filesystem can be offlined
> while an fsck runs.  This could even be done automatically in many
> situations.

In ext4 we store per-group state flags in each group, and the group
descriptor is checksummed (to detect spurious flags), so it should
be relatively straightforward to store an "error" flag in a single
group and have it become read-only.

As a starting point, it would be worthwhile to check instances of
ext4_error() to see how many of them can be targeted at a specific
group.  I'd guess most of them could be (corrupt inodes, directory
and indirect blocks, incorrect bitmaps).
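
The per-group idea can be illustrated with a small model. This is not ext4
code: the GroupDesc class, the GROUP_ERROR bit and the use of CRC32 are all
hypothetical stand-ins, chosen only to show how a checksummed flag could
take one group read-only while the rest stay writable.

```python
import struct
import zlib
from dataclasses import dataclass

GROUP_ERROR = 0x0001   # hypothetical flag bit, not an ext4 constant


@dataclass
class GroupDesc:
    """Toy group descriptor: a group number, state flags and a checksum."""
    group_no: int
    flags: int = 0
    checksum: int = 0

    def _compute(self):
        # CRC32 is a stand-in here; it is not the checksum ext4 uses.
        return zlib.crc32(struct.pack("<IH", self.group_no, self.flags))

    def seal(self):
        self.checksum = self._compute()

    def is_valid(self):
        return self.checksum == self._compute()


def mark_group_error(desc):
    """Record an error against a single group and re-checksum its descriptor."""
    desc.flags |= GROUP_ERROR
    desc.seal()


def group_writable(desc):
    """Allow writes only if the descriptor verifies and carries no error flag."""
    return desc.is_valid() and not (desc.flags & GROUP_ERROR)
```

In this sketch a descriptor whose checksum does not verify is also treated
as read-only, since its flags cannot be trusted; that policy is an
assumption of the example.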

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-16 Thread Daniel Phillips

On Jan 16, 2008 2:06 PM, Bryan Henderson <[EMAIL PROTECTED]> wrote:
> >The "disk motor as a generator" tale may not be purely folklore.  When
> >an IDE drive is not in writeback mode, something special needs to be
> >done to ensure the last write to media is not a scribble.
>
> No it doesn't.  The last write _is_ a scribble.

Have you observed that in the wild?  A former engineer of a disk drive
company suggests to me that the capacitors on the board provide enough
power to complete the last sector, even to park the head.

Regards,

Daniel

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-16 Thread Eric Sandeen

Alan Cox wrote:
>> Writeback cache on disk in itself is not bad, it only gets bad if the
>> disk is not engineered to save all its dirty cache on power loss,
>> using the disk motor as a generator or alternatively a small battery.
>> It would be awfully nice to know which brands fail here, if any,
>> because writeback cache is a big performance booster.
> 
> AFAIK no drive saves the cache. The worst case cache flush for drives is
> several seconds with no retries and a couple of minutes if something
> really bad happens.
> 
> This is why the kernel has some knowledge of barriers and uses them to
> issue flushes when needed.

Problem is, ext3 has barriers off by default so it's not saving most people.

And then if you turn them on, but have your filesystem on an lvm device,
lvm strips them out again.
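
For anyone wanting to see which ext3 mounts are affected, a small sketch
follows. It parses text in the /proc/mounts layout (device, mount point,
type, options) and lists ext3 entries that lack barrier=1. That an enabled
barrier shows up as "barrier=1" in the options field is an assumption
taken from the mount option named in this thread; and as noted above, a
layer underneath such as lvm may still drop the barriers, which this check
cannot see.

```python
def ext3_without_barriers(mounts_text):
    """Return (device, mountpoint) pairs for ext3 mounts lacking barrier=1."""
    missing = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue   # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        # Assumption: enabled barriers are reported as the option "barrier=1".
        if fstype == "ext3" and "barrier=1" not in options.split(","):
            missing.append((device, mountpoint))
    return missing
```

On a running system the input would come from reading /proc/mounts and
passing its contents to ext3_without_barriers().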

-Eric

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-16 Thread Valerie Henson

On Jan 16, 2008 3:49 AM, Pavel Machek <[EMAIL PROTECTED]> wrote:
>
> ext3's "let's fsck on every 20 mounts" is a good idea, but it can be
> annoying when developing. Having the option to fsck while the filesystem
> is online takes that annoyance away.

I'm sure everyone on cc: knows this, but for the record you can change
ext3's fsck on N mounts or every N days to something that makes sense
for your use case.  Usually I just turn it off entirely and run fsck
by hand when I'm worried:

# tune2fs -c 0 -i 0 /dev/whatever

-VAL

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-16 Thread Alan Cox

> And I think there's a problem with drives that, upon sensing the 
> unreadable sector, assign an alternate even though the sector is fine, and 
> you eventually run out of spares.

You are assuming drives can't tell the difference between stray data loss
and sectors that can't be recovered by rewriting and reuse. I was under
the impression modern drives could do this?

Alan

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-16 Thread Bryan Henderson

>The "disk motor as a generator" tale may not be purely folklore.  When
>an IDE drive is not in writeback mode, something special needs to be
>done to ensure the last write to media is not a scribble.

No it doesn't.  The last write _is_ a scribble.  Systems that make atomic 
updates to disk drives use a shadow update mechanism and write the master 
sector twice.  If the power fails in the middle of writing one, it will 
almost certainly be unreadable due to a CRC failure, and the other one 
will have either the old or new master block contents.

And I think there's a problem with drives that, upon sensing the 
unreadable sector, assign an alternate even though the sector is fine, and 
you eventually run out of spares.


Incidentally, while this primitive behavior applies to IDE (ATA et al) 
drives, that isn't the only thing people put filesystems on.  Many 
important filesystems go on higher level storage subsystems that contain 
IDE drives and cache memory and batteries.  A device like this _does_ make 
sure that all data that it says has been written is actually retrievable 
even if there's a subsequent power outage, even while giving the 
performance of writeback caching.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems


Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-16 Thread Christoph Hellwig

On Wed, Jan 16, 2008 at 08:43:25AM +1100, David Chinner wrote:
> ext3 is not the only filesystem that will have trouble due to
> volatile write caches. We see problems often enough with XFS
> due to volatile write caches that it's in our FAQ:

In fact it will hit every filesystem.  A write-back cache that can't
be forced to write back by the filesystem will cause corruption on
uncontained power loss, period.


Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-16 Thread Valdis . Kletnieks

On Wed, 16 Jan 2008 12:51:44 +0100, Pavel Machek said:

> I guess I should try to measure it. (Linux already does writeback
> caching, with 2GB of memory. I wonder how important the disk's 2MB of
> cache can be).

It serves essentially the same purpose as the 'async' option in /etc/exports
(i.e. we declare it "done" when the other end of the wire says it's caught
the data, not when it's actually committed), with similar latency wins.  Of
course, it's impedance-matching for bursty traffic - the 2M doesn't do much
at all if you're streaming data to it.  For what it's worth, the 80G Seagate
drive in my laptop claims it has 8M, so it probably does 4 times as much
good as 2M. ;)



Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

On Tue 2008-01-15 18:44:26, Daniel Phillips wrote:
> On Jan 15, 2008 6:07 PM, Pavel Machek <[EMAIL PROTECTED]> wrote:
> > I had write cache enabled on my main computer. Oops. I guess that
> > means we do need better documentation.
> 
> Writeback cache on disk in itself is not bad, it only gets bad if the
> disk is not engineered to save all its dirty cache on power loss,
> using the disk motor as a generator or alternatively a small battery.
> It would be awfully nice to know which brands fail here, if any,
> because writeback cache is a big performance booster.

Is it?

I guess I should try to measure it. (Linux already does writeback
caching, with 2GB of memory. I wonder how important the disk's 2MB of
cache can be).
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

Hi!

> Along with this effort, could you let me know if the world actually
> cares about online fsck?  

I'm not the world's spokesperson (yet ;-).

> Now we know how to do it I think, but is it
> worth the effort.

ext3's "let's fsck on every 20 mounts" is a good idea, but it can be
annoying when developing. Having the option to fsck while the filesystem
is online takes that annoyance away.

So yes, it would be very useful for me...

For long-running servers, this may be less of a problem... but OTOH
their filesystems are not checked at all as long as the servers are
online... so online fsck is actually important there, too, but for
other reasons.

So yes, it is very useful for the world.

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-16 Thread Eric Sandeen

Alan Cox wrote:
 Writeback cache on disk in iteself is not bad, it only gets bad if the
 disk is not engineered to save all its dirty cache on power loss,
 using the disk motor as a generator or alternatively a small battery.
 It would be awfully nice to know which brands fail here, if any,
 because writeback cache is a big performance booster.
 
 AFAIK no drive saves the cache. The worst case cache flush for drives is
 several seconds with no retries and a couple of minutes if something
 really bad happens.
 
 This is why the kernel has some knowledge of barriers and uses them to
 issue flushes when needed.

Problem is, ext3 has barriers off by default so it's not saving most people.

And then if you turn them on, but have your filesystem on an lvm device,
lvm strips them out again.

-Eric

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-16 Thread Daniel Phillips

On Jan 16, 2008 2:06 PM, Bryan Henderson [EMAIL PROTECTED] wrote:
> > The "disk motor as a generator" tale may not be purely folklore.  When
> > an IDE drive is not in writeback mode, something special needs to be
> > done to ensure the last write to media is not a scribble.
>
> No it doesn't.  The last write _is_ a scribble.

Have you observed that in the wild?  A former engineer of a disk drive
company suggests to me that the capacitors on the board provide enough
power to complete the last sector, even to park the head.

Regards,

Daniel

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-16 Thread Andreas Dilger

On Jan 15, 2008  22:05 -0500, Rik van Riel wrote:
> With a filesystem that is compartmentalized and checksums metadata,
> I believe that an online fsck is absolutely worth having.
>
> Instead of the filesystem resorting to mounting the whole volume
> read-only on certain errors, part of the filesystem can be offlined
> while an fsck runs.  This could even be done automatically in many
> situations.

In ext4 we store per-group state flags in each group, and the group
descriptor is checksummed (to detect spurious flags), so it should
be relatively straightforward to store an error flag in a single
group and have it become read-only.

As a starting point, it would be worthwhile to check instances of
ext4_error() to see how many of them can be targeted at a specific
group.  I'd guess most of them could be (corrupt inodes, directory
and indirect blocks, incorrect bitmaps).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
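
A rough sketch of the per-group idea above, under loud assumptions: the
names GroupDesc, GROUP_ERROR, mark_group_error and group_writable are
hypothetical, the flag bit is invented, and zlib.crc32 merely stands in
for whatever checksum the on-disk group descriptor really uses. It only
illustrates that a checksummed descriptor lets one group go read-only
while a flag that appears without a matching checksum is treated as
untrustworthy.

```python
import struct
import zlib
from dataclasses import dataclass

GROUP_ERROR = 0x8000  # hypothetical flag bit, not a real ext4 constant


@dataclass
class GroupDesc:
    group_no: int
    flags: int = 0
    checksum: int = 0

    def _compute(self):
        # Stand-in checksum over the group number and flags.
        return zlib.crc32(struct.pack("<IH", self.group_no, self.flags))

    def seal(self):
        """Recompute the checksum after a legitimate update."""
        self.checksum = self._compute()

    def is_valid(self):
        return self.checksum == self._compute()


def mark_group_error(desc):
    """Record an error against one group only, keeping the checksum valid."""
    desc.flags |= GROUP_ERROR
    desc.seal()


def group_writable(desc):
    # A descriptor with a bad checksum cannot be trusted either way,
    # so this sketch refuses writes to it as well.
    return desc.is_valid() and not (desc.flags & GROUP_ERROR)


groups = [GroupDesc(n) for n in range(4)]
for g in groups:
    g.seal()
mark_group_error(groups[2])
print([group_writable(g) for g in groups])
```

Only group 2 stops accepting writes; the rest of the volume stays online.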


Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-15 Thread Rik van Riel

On Tue, 15 Jan 2008 20:44:38 -0500
"Daniel Phillips" <[EMAIL PROTECTED]> wrote:

> Along with this effort, could you let me know if the world actually
> cares about online fsck?  Now we know how to do it I think, but is it
> worth the effort?

With a filesystem that is compartmentalized and checksums metadata,
I believe that an online fsck is absolutely worth having.

Instead of the filesystem resorting to mounting the whole volume
read-only on certain errors, part of the filesystem can be offlined
while an fsck runs.  This could even be done automatically in many
situations.

-- 
All rights reversed.

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

Hi Pavel,

Along with this effort, could you let me know if the world actually
cares about online fsck?  Now we know how to do it, I think, but is it
worth the effort?

Regards,

Daniel

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-15 Thread Chris Mason

On Tue, 15 Jan 2008 20:24:27 -0500
"Daniel Phillips" <[EMAIL PROTECTED]> wrote:

> On Jan 15, 2008 7:15 PM, Alan Cox <[EMAIL PROTECTED]> wrote:
> > > Writeback cache on disk in itself is not bad, it only gets bad
> > > if the disk is not engineered to save all its dirty cache on
> > > power loss, using the disk motor as a generator or alternatively
> > > a small battery. It would be awfully nice to know which brands
> > > fail here, if any, because writeback cache is a big performance
> > > booster.
> >
> > AFAIK no drive saves the cache. The worst case cache flush for
> > drives is several seconds with no retries and a couple of minutes
> > if something really bad happens.
> >
> > This is why the kernel has some knowledge of barriers and uses them
> > to issue flushes when needed.
> 
> Indeed, you are right, which is supported by actual measurements:
> 
> http://sr5tech.com/write_back_cache_experiments.htm
> 
> Sorry for implying that anybody has engineered a drive that can do
> such a nice thing with writeback cache.
> 
> The "disk motor as a generator" tale may not be purely folklore.  When
> an IDE drive is not in writeback mode, something special needs to be
> done to ensure the last write to media is not a scribble.
> 
> A small UPS can make writeback mode actually reliable, provided the
> system is smart enough to take the drives out of writeback mode when
> the line power is off.

We've had mount -o barrier=1 for ext3 for a while now; it makes
writeback caching safe.  XFS has this on by default, as does reiserfs.

-chris
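
A small sketch for spotting ext3 mounts that lack the option, assuming
the kernel lists it literally as barrier=1 in /proc/mounts-style output
(that reporting is an assumption and may differ between kernel versions).
The function name and the sample text are made up for illustration. Note
that the mount option alone does not prove barriers reach the disk, for
example when a layer underneath the filesystem drops them.

```python
SAMPLE_MOUNTS = """\
/dev/sda1 / ext3 rw,barrier=1,data=ordered 0 0
/dev/sda2 /home ext3 rw,data=ordered 0 0
/dev/sdb1 /scratch xfs rw 0 0
"""


def ext3_mounts_without_barrier(mounts_text):
    """Return mount points of ext3 filesystems whose options lack barrier=1."""
    missing = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Expected fields: device, mount point, fs type, options, dump, pass.
        if len(fields) < 4 or fields[2] != "ext3":
            continue
        if "barrier=1" not in fields[3].split(","):
            missing.append(fields[1])
    return missing


print(ext3_mounts_without_barrier(SAMPLE_MOUNTS))
```

On a live system the same function could be fed the contents of
/proc/mounts instead of the sample text.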

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

On Jan 15, 2008 7:15 PM, Alan Cox <[EMAIL PROTECTED]> wrote:
> > Writeback cache on disk in itself is not bad, it only gets bad if the
> > disk is not engineered to save all its dirty cache on power loss,
> > using the disk motor as a generator or alternatively a small battery.
> > It would be awfully nice to know which brands fail here, if any,
> > because writeback cache is a big performance booster.
>
> AFAIK no drive saves the cache. The worst case cache flush for drives is
> several seconds with no retries and a couple of minutes if something
> really bad happens.
>
> This is why the kernel has some knowledge of barriers and uses them to
> issue flushes when needed.

Indeed, you are right, which is supported by actual measurements:

http://sr5tech.com/write_back_cache_experiments.htm

Sorry for implying that anybody has engineered a drive that can do
such a nice thing with writeback cache.

The "disk motor as a generator" tale may not be purely folklore.  When
an IDE drive is not in writeback mode, something special needs to be done
to ensure the last write to media is not a scribble.

A small UPS can make writeback mode actually reliable, provided the
system is smart enough to take the drives out of writeback mode when
the line power is off.

Regards,

Daniel
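
A minimal sketch of the UPS policy described above, assuming hdparm's
-W0/-W1 switches are what toggles the drive write cache and that some
other component (not shown, and not specified in this thread) reports
whether line power is present. The function name is hypothetical and the
sketch only builds the commands; it does not run them.

```python
def write_cache_commands(devices, on_line_power):
    """Build one hdparm invocation per drive; nothing is executed here."""
    # -W1 enables the drive's write cache, -W0 disables it.
    flag = "-W1" if on_line_power else "-W0"
    return [["hdparm", flag, dev] for dev in devices]


# On battery: ask each drive to turn its write cache off.
for argv in write_cache_commands(["/dev/sda", "/dev/sdb"], on_line_power=False):
    print(" ".join(argv))
```

The returned lists could be handed to subprocess.run() by whatever
monitors the UPS.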

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-15 Thread Alan Cox

> Writeback cache on disk in itself is not bad, it only gets bad if the
> disk is not engineered to save all its dirty cache on power loss,
> using the disk motor as a generator or alternatively a small battery.
> It would be awfully nice to know which brands fail here, if any,
> because writeback cache is a big performance booster.

AFAIK no drive saves the cache. The worst case cache flush for drives is
several seconds with no retries and a couple of minutes if something
really bad happens.

This is why the kernel has some knowledge of barriers and uses them to
issue flushes when needed.
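
To make the ordering problem concrete, here is a toy simulation. It is a
sketch under stated assumptions, not a model of any real drive or of the
kernel's barrier code. CachedDisk and crash_after_commit_write are
made-up names, and the drive is assumed to write back an arbitrary subset
of its volatile cache when power fails. The point it illustrates: if the
cache is flushed before the commit record is issued, a commit record
found on the media implies the journal blocks it covers are there too.

```python
import random


class CachedDisk:
    """Toy disk with a volatile write-back cache."""

    def __init__(self):
        self.media = {}  # sector -> data that survives power loss
        self.cache = {}  # sector -> data acknowledged but still volatile

    def write(self, sector, data):
        self.cache[sector] = data  # acknowledged at once, not yet durable

    def flush(self):
        self.media.update(self.cache)  # everything cached reaches the media
        self.cache.clear()

    def power_loss(self, rng):
        # Assumption: each cached write independently may or may not have
        # reached the media before the power went away.
        for sector, data in self.cache.items():
            if rng.random() < 0.5:
                self.media[sector] = data
        self.cache.clear()


def crash_after_commit_write(use_barrier, seed):
    """Lose power right after the commit record is issued.

    Returns True if the media is consistent, meaning a commit record on
    the media implies every journal block is on the media as well.
    """
    rng = random.Random(seed)
    disk = CachedDisk()
    blocks = {n: f"journal-{n}" for n in range(1, 9)}
    for sector, data in blocks.items():
        disk.write(sector, data)
    if use_barrier:
        disk.flush()  # journal blocks are durable before the commit record
    disk.write(0, "COMMIT")
    disk.power_loss(rng)
    committed = disk.media.get(0) == "COMMIT"
    complete = all(disk.media.get(s) == d for s, d in blocks.items())
    return (not committed) or complete


print(all(crash_after_commit_write(True, s) for s in range(200)))
print(all(crash_after_commit_write(False, s) for s in range(200)))
```

With the flush in place every simulated crash is consistent by
construction; without it, the commit record can land while some of the
journal blocks it vouches for are lost.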

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

On Jan 15, 2008 6:07 PM, Pavel Machek <[EMAIL PROTECTED]> wrote:
> I had write cache enabled on my main computer. Oops. I guess that
> means we do need better documentation.

Writeback cache on disk in itself is not bad, it only gets bad if the
disk is not engineered to save all its dirty cache on power loss,
using the disk motor as a generator or alternatively a small battery.
It would be awfully nice to know which brands fail here, if any,
because writeback cache is a big performance booster.

Regards,

Daniel

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

Hi!

> > > > What are ext3 expectations of disk (is there doc somewhere)? For
> > > > example... if disk does not lie, but powerfail during write damages
> > > > the sector -- is ext3 still going to work properly?
> > > 
> > > Nope. However the few disks that did this rapidly got firmware updates
> > > because there are other OS's that can't cope.
> > > 
> > > > If disk does not lie, but powerfail during write may cause random
> > > > numbers to be returned on read -- can fsck handle that?
> > > 
> > > most of the time. and fsck knows about writing sectors to remove read
> > > errors in metadata blocks.
> > > 
> > > > What about a disk that kills 5 sectors around the sector being
> > > > written during powerfail; can ext3 survive that?
> > > 
> > > generally. Note btw that for added fun there is nothing that guarantees
> > > the blocks around a block on the media are sequentially numbered. They
> > > usually are but you never know.
> > 
> > Ok, should something like this be added to the documentation?
> > 
> > It would be cool to be able to include a few examples (modern SATA disks
> > support barriers so are safe, any IDE from 1989 is unsafe), but I do
> > not know enough about hw...
> 
> ext3 is not the only filesystem that will have trouble due to
> volatile write caches. We see problems often enough with XFS
> due to volatile write caches that it's in our FAQ:
> 
> http://oss.sgi.com/projects/xfs/faq.html#wcache

Nice FAQ, yep. Perhaps you should move parts of it to Documentation/ ,
and I could then make ext3 FAQ point to it?

I had write cache enabled on my main computer. Oops. I guess that
means we do need better documentation.
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-15 Thread David Chinner

On Tue, Jan 15, 2008 at 09:16:53PM +0100, Pavel Machek wrote:
> Hi!
> 
> > > What are ext3 expectations of disk (is there doc somewhere)? For
> > > example... if disk does not lie, but powerfail during write damages
> > > the sector -- is ext3 still going to work properly?
> > 
> > Nope. However the few disks that did this rapidly got firmware updates
> > because there are other OS's that can't cope.
> > 
> > > If disk does not lie, but powerfail during write may cause random
> > > numbers to be returned on read -- can fsck handle that?
> > 
> > most of the time. and fsck knows about writing sectors to remove read
> > errors in metadata blocks.
> > 
> > > What about a disk that kills 5 sectors around the sector being
> > > written during powerfail; can ext3 survive that?
> > 
> > generally. Note btw that for added fun there is nothing that guarantees
> > the blocks around a block on the media are sequentially numbered. They
> > usually are but you never know.
> 
> Ok, should something like this be added to the documentation?
> 
> It would be cool to be able to include a few examples (modern SATA disks
> support barriers so are safe, any IDE from 1989 is unsafe), but I do
> not know enough about hw...

ext3 is not the only filesystem that will have trouble due to
volatile write caches. We see problems often enough with XFS
due to volatile write caches that it's in our FAQ:

http://oss.sgi.com/projects/xfs/faq.html#wcache

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

[Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

Hi!

> > What are ext3 expectations of disk (is there doc somewhere)? For
> > example... if disk does not lie, but powerfail during write damages
> > the sector -- is ext3 still going to work properly?
> 
> Nope. However the few disks that did this rapidly got firmware updates
> because there are other OS's that can't cope.
> 
> > If disk does not lie, but powerfail during write may cause random
> > numbers to be returned on read -- can fsck handle that?
> 
> most of the time. and fsck knows about writing sectors to remove read
> errors in metadata blocks.
> 
> > What about a disk that kills 5 sectors around the sector being
> > written during powerfail; can ext3 survive that?
> 
> generally. Note btw that for added fun there is nothing that guarantees
> the blocks around a block on the media are sequentially numbered. They
> usually are but you never know.

Ok, should something like this be added to the documentation?

It would be cool to be able to include a few examples (modern SATA disks
support barriers so are safe, any IDE from 1989 is unsafe), but I do
not know enough about hw...

Signed-off-by: Pavel Machek <[EMAIL PROTECTED]>

diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index b45f3c1..adfcc9d 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -183,6 +183,18 @@ mke2fs:create a ext3 partition with th
 debugfs:   ext2 and ext3 file system debugger.
 ext2online:online (mounted) ext2 and ext3 filesystem resizer
 
+Requirements
+============
+
+Ext3 needs disk that does not do write-back caching or disk that
+supports barriers and Linux configuration that can use them.
+
+* if disk damages the sector being written during powerfail, ext3
+  can't cope with that.  Fortunately, such disks got firmware updates
+  to fix this long time ago.
+
+* if disk writes random data during powerfail, ext3 should survive
+  that most of the time.
 
 References
 ==========


-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)