Re: critical bugs in md raid5 and ATA disk failure/recovery modes

2005-01-29 Thread Pavel Machek
Hi!

> > Well, you could set stripe size to 512B; that way, RAID-5 would be
> > *very* slow, but it should have same characteristics as normal disc
> > w.r.t. crash. Unrelated data would not be lost, and you'd either get
> > old data or new data...
> 
> When you lose a disk during recovery you can still lose
> unrelated data (any "sibling" in a stripe set because its parity
> information is incomplete).  RAID-1 doesn't have this problem though.

You are right, I'd have to do something very special... Like if I know
it is a 4K filesystem, raid-5 over 5 disks could do the trick. Like this:

Disk1        Disk2        Disk3        Disk4        Disk5
bytes 0-511  512-1023     1024-1535    1536-2047    parity
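
To make the arithmetic visible, here is a rough user-space sketch of that
layout (512-byte chunks, four data disks plus one parity disk; the fixed
parity disk and the chunk numbering are assumptions for illustration only,
a real raid-5 layout rotates parity across the members). Note that with
512-byte chunks a single 4K block already spans two such stripes:

/* Rough sketch of the layout above: 512-byte chunks, Disk1..Disk4 data,
 * Disk5 parity. Fixed parity placement is an illustration-only assumption. */
#include <stdio.h>

#define NDATA 4                 /* data disks per stripe */
#define CHUNK 512               /* chunk size in bytes   */

int main(void)
{
    long offset;

    /* walk one 4K filesystem block and show where each piece would land */
    for (offset = 0; offset < 4096; offset += CHUNK) {
        long chunk  = offset / CHUNK;
        long stripe = chunk / NDATA;
        int  disk   = (int)(chunk % NDATA) + 1;   /* Disk1..Disk4 hold data */

        printf("bytes %4ld-%4ld -> Disk%d (stripe %ld, parity on Disk5)\n",
               offset, offset + CHUNK - 1, disk, stripe);
    }
    return 0;
}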


no, even that does not work. You could add a single bit for each 4K
saying "this stripe is being written" (with barriers etc.); returning
read errors while the bit is set might actually do the trick, but that's
no longer raid-5... (Can ext3 handle an error in the journal?)
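
To make that idea a bit more concrete, here is a rough user-space model of
the per-4K "being written" bit (only an illustration of the bookkeeping, not
raid-5 and not the md driver; barrier() stands in for whatever ordering or
flush primitive would really be needed):

#include <errno.h>
#include <stdio.h>
#include <string.h>

#define NBLOCKS 8

static unsigned char dirty[NBLOCKS];      /* one "being written" bit per 4K */
static char data[NBLOCKS][4096];          /* fake on-disk contents          */

static void barrier(void) { /* stand-in for a real write barrier/flush */ }

static int write_block(int n, const char *buf)
{
    dirty[n] = 1;                 /* 1. persist the "being written" bit     */
    barrier();
    memcpy(data[n], buf, 4096);   /* 2. write the data chunks and parity    */
    barrier();
    dirty[n] = 0;                 /* 3. clear the bit again                 */
    return 0;
}

static int read_block(int n, char *buf)
{
    if (dirty[n])                 /* crashed in the middle of a write?      */
        return -EIO;              /* report it instead of returning garbage */
    memcpy(buf, data[n], 4096);
    return 0;
}

int main(void)
{
    char newdata[4096] = "new contents";
    char buf[4096];

    write_block(0, newdata);
    dirty[1] = 1;                 /* pretend block 1 was hit by a crash     */

    printf("read 0 -> %d\n", read_block(0, buf));   /* 0: consistent data   */
    printf("read 1 -> %d\n", read_block(1, buf));   /* -EIO                 */
    return 0;
}

This amounts to a write-intent bit per block; the cost is the extra ordered
bit update in front of every write, and the filesystem above has to cope
with the resulting read errors, which is exactly the ext3 question.
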
Pavel
-- 
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!


Re: critical bugs in md raid5 and ATA disk failure/recovery modes

2005-01-29 Thread Andi Kleen
> Well, you could set stripe size to 512B; that way, RAID-5 would be
> *very* slow, but it should have same characteristics as normal disc
> w.r.t. crash. Unrelated data would not be lost, and you'd either get
> old data or new data...

When you lose a disk during recovery you can still lose
unrelated data (any "sibling" in a stripe set because its parity
information is incomplete).  RAID-1 doesn't have this problem though.
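
A worked example of that, with one byte standing in for each chunk (plain
user-space arithmetic, nothing md-specific): the crash lands between the
data write and the parity write, and the stale parity then reconstructs a
block nobody was writing as garbage.

#include <stdio.h>

int main(void)
{
    unsigned char d1 = 0x11, d2 = 0x22, d3 = 0x33, d4 = 0x44;
    unsigned char p, d2_rebuilt;

    p = d1 ^ d2 ^ d3 ^ d4;          /* parity while everything is consistent */

    d1 = 0xaa;                      /* new data for D1 reaches the disk...   */
                                    /* ...crash: the parity update is lost   */

    d2_rebuilt = d1 ^ d3 ^ d4 ^ p;  /* the disk holding D2 dies, rebuild it  */

    printf("D2 was 0x%02x, rebuilt as 0x%02x\n", d2, d2_rebuilt);
    /* prints 0x22 vs 0x99: the untouched "sibling" block comes back wrong */
    return 0;
}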

-Andi


Re: critical bugs in md raid5 and ATA disk failure/recovery modes

2005-01-29 Thread Pavel Machek
Hi!

> > The nasty part there is that it can affect completely unrelated
> > data too (on a traditional disk you normally only lose the data
> > that is currently being written) because of the relationship
> > between stripes on different disks.

Well, you could set stripe size to 512B; that way, RAID-5 would be
*very* slow, but it should have same characteristics as normal disc
w.r.t. crash. Unrelated data would not be lost, and you'd either get
old data or new data...

The nasty part might be that if it went into degraded mode (before the
resync is done), data on disk might silently change; that's bad, I guess.

Performance would not be good, either.
Pavel
-- 
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!


Re: critical bugs in md raid5

2005-01-27 Thread pcg
On Thu, Jan 27, 2005 at 10:51:02AM +0100, Andi Kleen <[EMAIL PROTECTED]> wrote:
> The nasty part there is that it can affect completely unrelated
> data too (on a traditional disk you normally only lose the data
> that is currently being written) because of the relationship
> between stripes on different disks.

Sorry, I must be a bit dense at times. I understand that now: you meant the
case where parity is lost and you then have an I/O error, as opposed to the
other cases.

> There were some suggestions in the past 
> to be a bit nicer on read IO errors - often if a read fails and you rewrite 
> the block from the reconstructed data the disk would allocate a new block
> and then be error free again.
> 
> The problem is just that when there are user visible IO errors
> on a modern disk something is very wrong and it will likely run quickly out 

Also, linux already re-writes failed parity blocks automatically after a
crash, so whatever damage you think might be done to the disk is already
being done on numerous occasions, as neither linux in general nor the
raid driver in particular checks for bad blocks before rewriting (I don't
suggest that it does, just that linux already rewrites failed blocks if it
doesn't know about them, and this hasn't been a particularly bad problem).
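
Roughly what such a post-crash resync amounts to, modelled in user space
(this is not the md code; the flat fixed-parity layout and the names are
just for illustration):

#include <stddef.h>
#include <string.h>

#define NDATA    4
#define CHUNK    512
#define NSTRIPES 1024

/* fake array members: [disk][stripe][byte]; member NDATA holds parity */
static unsigned char member[NDATA + 1][NSTRIPES][CHUNK];

static void resync(void)
{
    size_t s, b;
    int d;

    for (s = 0; s < NSTRIPES; s++) {
        unsigned char parity[CHUNK];

        /* recompute parity from the data chunks... */
        memset(parity, 0, sizeof(parity));
        for (d = 0; d < NDATA; d++)
            for (b = 0; b < CHUNK; b++)
                parity[b] ^= member[d][s][b];

        /* ...and rewrite it unconditionally - no bad-block check first */
        memcpy(member[NDATA][s], parity, CHUNK);
    }
}

int main(void)
{
    resync();
    return 0;
}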

-- 
The choice of a
  -==- _GNU_
  ==-- _   generation Marc Lehmann
  ---==---(_)__  __   __  [EMAIL PROTECTED]
  --==---/ / _ \/ // /\ \/ /  http://schmorp.de/
  -=/_/_//_/\_,_/ /_/\_\  XX11-RIPE


Re: critical bugs in md raid5 and ATA disk failure/recovery modes

2005-01-27 Thread Marc Lehmann
On Thu, Jan 27, 2005 at 10:51:02AM +0100, Andi Kleen <[EMAIL PROTECTED]> wrote:
> > I disagree. When not working in degraded mode, it's absolutely reasonable
> > to e.g. use only the non-parity data. A crash with raid5 is in no way
> 
> Yep. But when you go into degraded mode during the crash recovery 
> (before the RAID is fully synced again) you lose.

Hi, see below.

> 
> > different to a crash without raid5 then: either the old data is on the
> > disk, the new data is on the disk, or you had some catastrophic disk event
> > and no data is on the disk.
> 
> No, that's not how RAID-5 works. For its redundancy it requires

Hi, I think we might have misunderstood each other.

In fact, I fully agree with you on how raid5 works. However, the current
linux raid behaviour is highly suboptimal, and offers much less than your
description of raid5 would enable it to do:

- it shouldn't do any kind of re-syncing with two failed disks (as it did
  in the case I described). That makes no sense and possibly destroys more
  data.

- it should still satisfy read requests whenever possible (right now,
  the device is often fully dead, despite the data being there in maybe
  100% of the cases).

  The thing is, the md raid5 driver itself is able to satisfy most read
  requests and even some write requests in >= two-failed-disk mode, but it
  is not so when the failure happens during reconstruction, so in a case
  where much *more* data can safely be provided it offers *less* than when
  two disks have failed.

- Vital information about the disk order that might be required for repairing
  is being destroyed.

I think all of these points are valid: despite the theoretical deficiencies
of raid5 protection, the linux raid behaviour is much worse in practice than
it needs to be.

> The nasty part there is that it can affect completely unrelated
> data too (on a traditional disk you normally only lose the data
> that is currently being written) because of the relationship
> between stripes on different disks.

Hmm.. indeed, I do not understand this. My reasoning would be as follows:

If I had a bad block, I either lose parity (== no data loss) or I lose a
data block (== this data block is lost when the machine crashes).

If unrelated data can get lost (as it is right now, as the device
basically is lost), then this seems like a deficiency in the driver.

> > The case I reported was not a catastrophic failure: either the old or new
> > data was on the disk, and the filesystem journaling (which is ext3) will
> > take care of it. Even if the parity information is not in sync, either old 
> > or
> > new data is on the disk.
> 
> But you lost a disk in the middle of recovery (any IO error is
> a lost disk) 

Yes, and I hit a bug, which I reported.

> > Indeed, but I think linux' behaviour is especially poor. For example, the
> > renumbering of the devices or the strange rebuild-restart behaviour (which
> > is definitely a bug) will make recovery unnecessarily complicated.
> 
> There were some suggestions in the past 
> to be a bit nicer on read IO errors - often if a read fails and you rewrite 
> the block from the reconstructed data the disk would allocate a new block
> and then be error free again.

(I am not asking for this kind of automatic recovery, but here are some
thoughts on the above):

With modern IDE drives, trying to correct is actually the right thing to
do.

At least when the device indicates that it is still working fine, as
opposed to being in pre-failure mode, for example due to a lack of
replacement blocks.

> The problem is just that when there are user visible IO errors
> on a modern disk something is very wrong and it will likely run quickly out 

No, the disk will likely just re-write the block. There are different
failure modes on IDE drives; the most likely failure on a crash is that
some block couldn't be written due to, say, a power outage, or a hard or
soft reset in the middle of a write (sad but true, many IDE disks act
like that; there have been discussions about this on LK).

No replacement block will need to be allocated in that case, just the
currently written data is lost. And nothing at all will be wrong with the
disk in that case, either. So I dispute the "on a modern disk something is
very wrong" because that is normal operation of an ATA disk, mandated by
standards.

Also, the number of used replacement blocks can be queried on basically
all modern ATA disks, and there is a method in place to warn about
possible failures, namely SMART.
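
For what it's worth, a quick way to look at that counter from user space is
to parse smartctl output; this assumes smartmontools is installed and that
the drive is /dev/hda (adjust as needed), and it deliberately stays away
from the ioctl level:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* smartctl -A prints the vendor attribute table, including attribute 5 */
    FILE *p = popen("smartctl -A /dev/hda", "r");
    char line[256];

    if (!p)
        return 1;
    while (fgets(line, sizeof(line), p))
        if (strstr(line, "Reallocated_Sector_Ct"))
            fputs(line, stdout);     /* raw value = sectors already remapped */
    pclose(p);
    return 0;
}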

Please note that most disks can be made to regularly scan their surface and
replace blocks, so the device might run out of replacement blocks without any
write access from the driver.

So this kind of danger is already possible and likely without linux trying
to repair the block; repairing the block is just normal operation for the
drive.

What the drive does in many failure cases is simply tag the block as
unreadable (mostly because the checksum/ECC data does not match) and correct
this on write. Most drives will also check the surface 

Re: critical bugs in md raid5

2005-01-27 Thread Andi Kleen
> I disagree. When not working in degraded mode, it's absolutely reasonable
> to e.g. use only the non-parity data. A crash with raid5 is in no way

Yep. But when you go into degraded mode during the crash recovery 
(before the RAID is fully synced again) you lose.

> different to a crash without raid5 then: either the old data is on the
> disk, the new data is on the disk, or you had some catastrophic disk event
> and no data is on the disk.

No, that's not how RAID-5 works. For its redundancy it requires
coordinated writes of full stripes (= bigger than fs block) over
multiple disks. When you crash in the middle of a write and you
lose a disk during crash recovery there is no way to fully
reconstruct all the data because the XOR data recovery requires
valid data on all disks.

The nasty part there is that it can affect completely unrelated
data too (on a traditional disk you normally only lose the data
that is currently being written) because of the relationship
between stripes on different disks.

> 
> The case I reported was not a catastrophic failure: either the old or new
> data was on the disk, and the filesystem journaling (which is ext3) will
> take care of it. Even if the parity information is not in sync, either old or
> new data is on the disk.

But you lost a disk in the middle of recovery (any IO error is
a lost disk) 

> Indeed, but I think linux' behaviour is especially poor. For example, the
> renumbering of the devices or the strange rebuild-restart behaviour (which
> is definitely a bug) will make recovery unnecessarily complicated.

There were some suggestions in the past 
to be a bit nicer on read IO errors - often if a read fails and you rewrite 
the block from the reconstructed data the disk would allocate a new block
and then be error free again.
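
A toy model of that suggestion (entirely in memory; disk_read()/disk_write()
and the single simulated bad chunk are stand-ins, not md interfaces): on a
read error the chunk is rebuilt from the surviving members and written back,
which is what gives the drive the chance to remap the sector and look
healthy again.

#include <stdio.h>
#include <string.h>

#define NDISKS 5                 /* 4 data + 1 parity */
#define CHUNK  16                /* tiny chunks keep the demo readable */

static unsigned char media[NDISKS][CHUNK];
static int bad_disk = 2;         /* pretend disk 2 has an unreadable sector */

static int disk_read(int d, unsigned char *buf)
{
    if (d == bad_disk)
        return -1;               /* simulated media error */
    memcpy(buf, media[d], CHUNK);
    return 0;
}

static int disk_write(int d, const unsigned char *buf)
{
    memcpy(media[d], buf, CHUNK);
    bad_disk = -1;               /* the rewrite "fixes" (or remaps) the sector */
    return 0;
}

/* on a failed read, rebuild the chunk from the other members (XOR of the
 * remaining data and parity) and write it back */
static int read_chunk(int d, unsigned char *buf)
{
    unsigned char tmp[CHUNK];
    int i, b;

    if (disk_read(d, buf) == 0)
        return 0;                /* common case: the read just works */

    memset(buf, 0, CHUNK);
    for (i = 0; i < NDISKS; i++) {
        if (i == d)
            continue;
        if (disk_read(i, tmp) != 0)
            return -1;           /* a second failure: nothing we can do */
        for (b = 0; b < CHUNK; b++)
            buf[b] ^= tmp[b];
    }
    return disk_write(d, buf);   /* write the reconstructed data back */
}

int main(void)
{
    unsigned char buf[CHUNK];
    int b;

    memset(media[0], 0x11, CHUNK);
    memset(media[1], 0x22, CHUNK);
    memset(media[2], 0x33, CHUNK);
    memset(media[3], 0x44, CHUNK);
    for (b = 0; b < CHUNK; b++)   /* parity member */
        media[4][b] = media[0][b] ^ media[1][b] ^ media[2][b] ^ media[3][b];

    if (read_chunk(2, buf) == 0)
        printf("rebuilt chunk reads 0x%02x (expected 0x33)\n", buf[0]);
    return 0;
}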

The problem is just that when there are user-visible IO errors
on a modern disk, something is very wrong and it will likely run quickly out
of replacement blocks, and will eventually fail. That is why
Linux "forces" early replacement of the disk on any error - it is the
safest thing to do.


> > problem though (e.g. when file system metadata is affected)
> 
> Of course, but that's supposed to be worked around by using a journaling
> file system, right?

Nope, journaling is no magical fix for meta data corruption.

-Andi



Re: critical bugs in md raid5

2005-01-26 Thread pcg
On Thu, Jan 27, 2005 at 06:11:34AM +0100, Andi Kleen <[EMAIL PROTECTED]> wrote:
> Marc Lehmann <[EMAIL PROTECTED]> writes:
> >
> > The summary seems to be that the linux raid driver only protects your data
> > as long as all disks are fine and the machine never crashes.
> 
> "as long as the machine never crashes". That's correct. If you think

Thanks for your thoughts, btw :)

I forgot to mention that even if data is known to be lost it's much better
to return, say, EIO to higher levels than to completely shut down the
device (after all, that is no different from how other block devices behave).

Also, it's still likely that some old error can be repaired, as the broken
non-parity block might be old. This is probably better handled in
userspace, though, with special tools. But for them it might be vital to
get the correct disk index, to be able to detect the stripe layout.

It's usually much faster to repair and verify, as opposed to format and
restore, of course.

-- 
The choice of a
  -==- _GNU_
  ==-- _   generation Marc Lehmann
  ---==---(_)__  __   __  [EMAIL PROTECTED]
  --==---/ / _ \/ // /\ \/ /  http://schmorp.de/
  -=/_/_//_/\_,_/ /_/\_\  XX11-RIPE


Re: critical bugs in md raid5

2005-01-26 Thread Marc Lehmann
On Thu, Jan 27, 2005 at 06:11:34AM +0100, Andi Kleen <[EMAIL PROTECTED]> wrote:
> Marc Lehmann <[EMAIL PROTECTED]> writes:
> > The summary seems to be that the linux raid driver only protects your data
> > as long as all disks are fine and the machine never crashes.
> 
> "as long as the machine never crashes". That's correct. If you think
> about how RAID 5 works there is no way around it. When a write to 

I disagree. When not working in degraded mode, it's absolutely reasonable
to e.g. use only the non-parity data. A crash with raid5 is in no way
different to a crash without raid5 then: either the old data is on the
disk, the new data is on the disk, or you had some catastrophic disk event
and no data is on the disk.

The case I reported was not a catastrophic failure: either the old or new
data was on the disk, and the filesystem journaling (which is ext3) will
take care of it. Even if the parity information is not in sync, either old or
new data is on the disk.

> a single stripe is interrupted (machine crash) and you lose a disk
> during the recovery a lot of data (even unrelated to the data just written)
> is lost.

This is not what I described, in fact, I haven't lost any data, despite
having had a number of such problems (I did verify that afterwards, and
found no differences. Maybe this is luck, but it seems to happen in the
majority of cases, and I had a similar problem at least 5 or 6 times
because I didn't encounter the bug I reported).

> But that's nothing inherent in Linux RAID5. It's a generic problem.
> Pretty much all Software RAID5 implementations have it.

Indeed, but I think linux' behaviour is especially poor. For example, the
renumbering of the devices or the strange rebuild-restart behaviour (which
is definitely a bug) will make recovery unnecessarily complicated.

> RAID-1 helps a bit, because you either get the old or the new data,
> but not some corruption.

You don't get any magical corruption with RAID5 either... the data contents
will either be old or new. The difference is that you cannot trust parity.

> In practice even old data can be a big
> problem though (e.g. when file system metadata is affected)

Of course, but that's supposed to be worked around by using a journaling
file system, right?

> Morale: if you really care about your data backup very often and
> use RAID-1 or get an expensive hardware RAID with battery backup
> (all the cheap "hardware RAIDs" are equally useless for this) 

Yes, I have been thinking about that for some time now, but always had a problem
because the affordable ones have low performance. But given linux'
effective slower-than-a-single-disk performance it shouldn't be hard to
beat nowadays.

There is, however, at least the resyncing with only 4 out of 5 disks; that
is doubtless a bug somewhere.

-- 
The choice of a
  -==- _GNU_
  ==-- _   generation Marc Lehmann
  ---==---(_)__  __   __  [EMAIL PROTECTED]
  --==---/ / _ \/ // /\ \/ /  http://schmorp.de/
  -=/_/_//_/\_,_/ /_/\_\  XX11-RIPE


Re: critical bugs in md raid5

2005-01-26 Thread Andi Kleen
Marc Lehmann <[EMAIL PROTECTED]> writes:
>
> The summary seems to be that the linux raid driver only protects your data
> as long as all disks are fine and the machine never crashes.

"as long as the machine never crashes". That's correct. If you think
about how RAID 5 works there is no way around it. When a write to 
a single stripe is interrupted (machine crash) and you lose a disk
during the recovery a lot of data (even unrelated to the data just written)
is lost. That is because there is no way to figure out what part
of the data on the stripe belonged to the old and what part to 
the new write.

But that's nothing inherent in Linux RAID5. It's a generic problem.
Pretty much all Software RAID5 implementations have it.

The only way around it is to journal all writes, to make stripe
updates atomic, but in general that's too slow unless you have a
battery backed up journal device. 
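
A toy model of that journalling idea, entirely in user space (the journal
struct, the journal_valid flag and flush() are made-up names for the
illustration, nothing here corresponds to md): the commit record makes the
whole stripe update replayable, so a crash can no longer leave data and
parity disagreeing.

#include <stdio.h>
#include <string.h>

#define NDATA 4
#define CHUNK 16

struct stripe { unsigned char d[NDATA][CHUNK], p[CHUNK]; };

static struct stripe journal;        /* battery-backed NVRAM in real life */
static int           journal_valid;  /* the atomic commit record          */
static struct stripe array;          /* what is on the raid members       */

static void flush(void) { /* stand-in for a cache flush / write barrier */ }

static void stripe_write(const struct stripe *upd)
{
    journal = *upd;                  /* 1. log new data + parity          */
    flush();
    journal_valid = 1;               /* 2. commit record                  */
    flush();
    array = *upd;                    /* 3. write the members in place;    */
    flush();                         /*    a crash here can be replayed   */
    journal_valid = 0;               /* 4. retire the journal entry       */
}

static void crash_recovery(void)
{
    if (journal_valid)               /* interrupted after the commit:     */
        array = journal;             /* replay, the stripe stays whole    */
    /* otherwise the update never committed and the old stripe is intact  */
}

int main(void)
{
    struct stripe upd;
    int i;

    memset(&upd, 0xab, sizeof(upd));
    stripe_write(&upd);

    /* simulate a crash in the middle of step 3, then a reboot */
    journal_valid = 1;
    memset(&journal, 0xcd, sizeof(journal));
    memset(array.d[0], 0xee, CHUNK);     /* one member got the new data... */
    crash_recovery();                    /* ...but recovery replays it all */

    for (i = 0; i < NDATA; i++)
        printf("member %d after recovery: 0x%02x\n", i, array.d[i][0]);
    return 0;
}

The obvious cost is that every stripe gets written twice, which is why it is
only bearable with something like the battery-backed journal device
mentioned above.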

There are some tricks to avoid this (e.g. always write to a new disk
location and update a disk index atomically), but they tend to be
heavily patented and are slower too. They also go far beyond RAID-5
(use disk space less efficiently etc.)  and typically need support
from the file system to be efficient.

RAID-1 helps a bit, because you either get the old or the new data,
but not some corruption. In practice even old data can be a big
problem though (e.g. when file system metadata is affected)

Morale: if you really care about your data backup very often and
use RAID-1 or get an expensive hardware RAID with battery backup
(all the cheap "hardware RAIDs" are equally useless for this) 

-Andi

