Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-16 Thread Stephen C. Tweedie

Hi,

Chris Wedgwood writes:

   This may affect data which was not being written at the time of the
   crash.  Only raid 5 is affected.
  
  Long term -- if you journal to something outside the RAID5 array (ie.
  to raid-1 protected log disks) then you should be safe against this
  type of failure?

Indeed.  The jfs journaling layer in ext3 is a completely generic
block device journaling layer which could be used for such a purpose
(and raid/LVM journaling is one of the reasons it was designed this
way).
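
Roughly, a client of that layer drives it like this (a sketch only -- the
journal_* names follow the JBD-style interface used with ext3, and the
exact signatures here are assumptions for illustration, not a quote of
the current jfs code):

    /*
     * Sketch: a block-device client of a generic journaling layer.
     * The journal_* calls and their signatures are assumed.
     */
    #include <linux/jbd.h>

    static int journalled_update(journal_t *journal, struct buffer_head *bh)
    {
            handle_t *handle;
            int err;

            /* Reserve log space for one block's worth of updates. */
            handle = journal_start(journal, 1);
            if (IS_ERR(handle))
                    return PTR_ERR(handle);

            /* Declare intent to modify the buffer ... */
            err = journal_get_write_access(handle, bh);
            if (!err) {
                    /* ... change bh->b_data here ... */

                    /* ... then hand it to the journal: from now on the
                     * journal, not the caller, decides when the buffer
                     * may reach its home location on disk. */
                    err = journal_dirty_metadata(handle, bh);
            }

            journal_stop(handle);
            return err;
    }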

--Stephen



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-15 Thread Stephen C. Tweedie

Hi,

Benno Senoner writes:

  wow, really good idea to journal to a RAID1 array !
  
  do you think it is possible to do the following:
  
  - N disks holding a soft RAID5  array.
  - reserve a small partition on at least 2 disks of the array to hold a RAID1
  array.
  - keep the journal on this partition.

Yes.  My jfs code will eventually support this.  The main thing it is
missing right now is the ability to journal multiple devices to a
single journal: the on-disk structure is already designed with that in
mind but the code does not yet support it.
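
Purely as an illustration of the idea (this is NOT the real on-disk
format), a journal descriptor-block tag that names the owning device
might look like:

    /*
     * Hypothetical only: a per-block tag like this in each journal
     * descriptor block would let several block devices share one
     * journal.  Not the actual jfs/ext3 layout.
     */
    #include <linux/types.h>

    struct multi_dev_journal_tag {
            __u32 t_dev;            /* index into a device table kept in
                                       the journal superblock            */
            __u32 t_blocknr;        /* home block number on that device  */
            __u32 t_flags;          /* escape / last-tag flags           */
    };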

--Stephen



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-14 Thread Benno Senoner

Chris Wedgwood wrote:

  In the power+disk failure case, there is a very narrow window in which
  parity may be incorrect, so loss of the disk may result in inability to
  correctly restore the lost data.

 For some people, this very narrow window may still be a problem.
 Especially when you consider that the same power surge that causes the
 outage can also kill a drive.

  This may affect data which was not being written at the time of the
  crash.  Only raid 5 is affected.

 Long term -- if you journal to something outside the RAID5 array (ie.
 to raid-1 protected log disks) then you should be safe against this
 type of failure?

 -cw

wow, really good idea to journal to a RAID1 array !

do you think it is possible to do the following:

- N disks holding a soft RAID5  array.
- reserve a small partition on at least 2 disks of the array to hold a RAID1
array.
- keep the journal on this partition.

do you think that this will be possible ?
are ext3 / reiserfs capable of keeping the journal on a different
partition than the one holding the FS ?

That would really be great !

Benno.




Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-14 Thread D. Lance Robinson

Ingo,

I can fairly regularly generate corruption (data or ext2 filesystem) on a busy
RAID-5 by adding a spare drive to a degraded array and letting it build the
parity. Could the problem be from the bad (illegal) buffer interactions you
mentioned, or are there other areas that need fixing as well? I have been
looking into this issue for a long time without resolution. Since you may be
aware of possible problem areas: any ideas, code or encouragement is greatly
welcome.

 Lance.


Ingo Molnar wrote:

 On Wed, 12 Jan 2000, Gadi Oxman wrote:

  As far as I know, we took care not to poke into the buffer cache to
  find clean buffers -- in raid5.c, the only code which does a find_buffer()
  is:

 yep, this is still the case. (Sorry Stephen, my bad.) We will have these
 problems once we try to eliminate the current copying overhead.
 Nevertheless there are bad (illegal) interactions between the RAID code
 and the buffer cache; I'm cleaning this up for 2.3 right now. Especially
 the reconstruction code is a rathole. Unfortunately blocking
 reconstruction until b_count == 0 is not acceptable because several
 filesystems (such as ext2fs) keep metadata caches around (eg. the block
 group descriptors in the ext2fs case) which have b_count == 1 for long
 periods.



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-13 Thread Stephen C. Tweedie

Hi,

On Wed, 12 Jan 2000 22:09:35 +0100, Benno Senoner [EMAIL PROTECTED]
said:

 Sorry for my ignorance, I got a little confused by this post:

 Ingo said we are 100% journal-safe, you said the contrary,

Raid resync is safe in the presence of journaling.  Journaling is not
safe in the presence of raid resync.

 can you or Ingo please explain to us in which situations (power-loss)
 running linux-raid + journaled FS we risk a corrupted filesystem ?

Please read my previous reply on the subject (the one that started off
with "I'm tired of answering the same question a million times so here's
a definitive answer").  Basically, there will always be a small risk of
data loss if power-down is accompanied by loss of a disk (it's a
double-failure); and the current implementation of raid resync means
that journaling will be broken by the raid1 or raid5 resync code after a
reboot on a journaled filesystem (ext3 is likely to panic, reiserfs will
not but will still get its IO ordering requirements messed up by the
resync). 

 After the reboot, if all disks remain physically intact, will we only
 lose the data that was being written, or could we end up with a
 corrupted filesystem which could cause more damage in the future ?

In the power+disk failure case, there is a very narrow window in which
parity may be incorrect, so loss of the disk may result in inability to
correctly restore the lost data.  This may affect data which was not
being written at the time of the crash.  Only raid 5 is affected.

--Stephen



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-12 Thread Bryce Willing

- Original Message -
From: "Benno Senoner" [EMAIL PROTECTED]
To: "Stephen C. Tweedie" [EMAIL PROTECTED]
Cc: "Linux RAID" [EMAIL PROTECTED];
[EMAIL PROTECTED]; "Ingo Molnar" [EMAIL PROTECTED]
Sent: Tuesday, January 11, 2000 1:17 PM
Subject: Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure
= problems ?


-- much snippage here


 The problem is that power outages are unpredictable even in the presence
 of UPSes; therefore it is important to have some protection against
 power losses.

 regards,
 Benno.



I run an MGE UPS on my RH6.1 box running RAID 1; MGE supplies software for
Linux that communicates with the UPS and performs an orderly system shutdown
if the box goes on battery and stays on battery for a given (user
selectable) length of time. I have tested and verified that this actually
works; it's a Good Thing(tm).
I did have to cut one pin on the standard RS-232 cable that came with the
UPS for use on the Linux box, and download and install the software
(scripted, easy...)

bwilling




Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-12 Thread Benno Senoner

James Manning wrote:

 [ Tuesday, January 11, 2000 ] Benno Senoner wrote:
  The problem is that power outages are unpredictable even in the presence
  of UPSes; therefore it is important to have some protection against
  power losses.

 I gotta ask: dying power supply? cord getting ripped out?
 Most ppl run serial lines (of course :) and with powerd they
 get nice shutdowns :)

 Just wanna make sure I'm understanding you...

 James
 --
 Miscellaneous Engineer --- IBM Netfinity Performance Development

yep, obviously the UPS has a serial line to shut down the machine nicely
before a failure, but it happened to me that the serial cable was
disconnected and the power outage lasted SEVERAL hours during a weekend,
when no one was in the machine room (of an ISP).

you know murphy's law ...
:-)

But I am mainly interested in power-failure protection in the case where
you want to set up a workstation with a reliable disk array (soft raid5)
and do not always have a UPS handy.

You will lose the file that was being written, but the important thing is
that the disk array remains in a safe state, just like a single disk +
journaled FS.

Stephen Tweedie said that this is possible (by fixing the remaining races
in the RAID code); if these problems are fixed sometime, then our fears of
a corrupted soft-RAID array in the case of a power failure on a machine
without UPS will go away completely.

cheers,
Benno.







Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-12 Thread Ingo Molnar


On Wed, 12 Jan 2000, Gadi Oxman wrote:

 As far as I know, we took care not to poke into the buffer cache to
 find clean buffers -- in raid5.c, the only code which does a find_buffer()
 is:

yep, this is still the case. (Sorry Stephen, my bad.) We will have these
problems once we try to eliminate the current copying overhead.
Nevertheless there are bad (illegal) interactions between the RAID code
and the buffer cache; I'm cleaning this up for 2.3 right now. Especially
the reconstruction code is a rathole. Unfortunately blocking
reconstruction until b_count == 0 is not acceptable because several
filesystems (such as ext2fs) keep metadata caches around (eg. the block
group descriptors in the ext2fs case) which have b_count == 1 for long
periods.

If both power and a disk fail at once then we still might get local
corruption for partially written RAID5 stripes. If either power or a disk
fails, then the Linux RAID5 code is safe wrt. journalling, because it
behaves like an ordinary disk. We are '100% journal-safe' if power fails
during resync. We are also 100% journal-safe if power fails during
reconstruction of a failed disk or in degraded mode.

the 2.3 buffer-cache enhancements I wrote ensure that 'cache snooping' and
adding to the buffer-cache can be done safely by 'external' cache
managers. I also added means to do atomic IO operations which are in fact
several underlying IO operations - without the need to allocate a
separate bh. The RAID code uses these facilities now.

Ingo



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-12 Thread mauelsha

"Stephen C. Tweedie" wrote:
 
 Hi,
 
 On Tue, 11 Jan 2000 15:03:03 +0100, mauelsha
 [EMAIL PROTECTED] said:
 
  THIS IS EXPECTED.  RAID-5 isn't proof against multiple failures, and the
  only way you can get bitten by this failure mode is to have a system
  failure and a disk failure at the same time.
 
  To try to avoid this kind of problem some brands do have additional
  logging (to disk which is slow for sure or to NVRAM) in place, which
  enables them to at least recognize the fault to avoid the
  reconstruction of invalid data or even enables them to recover the
  data by using redundant copies of it in NVRAM + logging information
  about what could be written to the disks and what not.
 
 Absolutely: the only way to avoid it is to make the data+parity updates
 atomic, either in NVRAM or via transactions.  I'm not aware of any
 software RAID solutions which do such logging at the moment: do you know
 of any?
 

AFAIK Veritas only does the first part of what I mentioned above (invalid
on-disk data recognition).

They do logging by default for RAID5 volumes and optionally also for
RAID1 volumes.

In the RAID5 (with logging) case they can figure out whether an n-1 disk
write took place and can rebuild the data. In case an n-m (1 < m < n) disk
write took place they can therefore at least recognize the disaster ;-)

In the RAID1 (with logging) scenario they are able to recognize which of
the n mirrors have current data and which ones don't, to deliver the
current data to the user and to try to make the other mirrors
consistent.

But because it's a software solution without any NVRAM support, they
can't handle the second case -- recovering the data from redundant
copies.

Heinz



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-12 Thread Stephen C. Tweedie

Hi,

On Wed, 12 Jan 2000 00:12:55 +0200 (IST), Gadi Oxman
[EMAIL PROTECTED] said:

 Stephen, I'm afraid that there are some misconceptions about the
 RAID-5 code.

I don't think so --- I've been through this with Ingo --- but I
appreciate your feedback since I'm getting inconsistent advice here!
Please let me explain...

 In an early pre-release version of the RAID code (more than two years
 ago?), which didn't protect against that race, we indeed saw locked
 buffers changing under us from the point in which we computed the
 parity till the point in which they were actually written to the disk,
 leading to a corrupted parity.

That is not the race.  The race has nothing at all to do with buffers
changing while they are being used for parity: that's a different
problem, long ago fixed by copying the buffers.

The race I'm concerned about could occur when the raid driver wants to
compute parity for a stripe and finds some of the blocks are present,
and clean, in the buffer cache.  Raid assumes that those buffers
represent what is on disk, naturally enough.  So, it uses them to
calculate parity without rereading all of the disk blocks in the stripe.

The trouble is that the standard practice in the kernel, when modifying
a buffer, is to make the change and _then_ mark the buffer dirty.  If
you hit that window, then the raid driver will find a buffer which
doesn't match what is on disk, and will compute parity from that buffer
rather than from the on-disk contents.
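
In kernel idiom the window looks like this (a sketch; mark_buffer_dirty()
is shown with its 2.2-era two-argument form):

    /*
     * Sketch of the racy-but-standard idiom described above.
     */
    #include <linux/fs.h>
    #include <linux/string.h>

    static void update_block(struct buffer_head *bh,
                             const char *new_data, size_t len)
    {
            memcpy(bh->b_data, new_data, len);  /* buffer now differs
                                                   from disk            */

            /*
             * WINDOW: if raid5 snoops the buffer cache here, it finds
             * a CLEAN buffer whose contents no longer match the
             * platter, and folds the wrong data into the parity.
             */

            mark_buffer_dirty(bh, 0);           /* only now marked dirty */
    }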

 1. n dirty blocks are scheduled for a stripe write.

That's not the race.  The problem occurs when only one single dirty
block is scheduled for a write, and we need to find the contents of the
rest of the stripe to compute parity.

 Point (2) is also incorrect; we have taken care *not* to peek into
 the buffer cache to find clean buffers and use them for parity
 calculations. We make no such assumptions.

Not according to Ingo --- can we get a definitive answer on this,
please?

Many thanks,
  Stephen



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-12 Thread Mark Ferrell

  Perhaps I am confused.  How is it that a power outage while attached
to the UPS becomes "unpredictable"?

  We run a Dell PowerEdge 2300/400 using Linux software raid and the
system monitors its own UPS.  When a power failure occurs the system
will bring itself down to a minimal state (runlevel 1) after the
batteries are below 50% .. and once below 15% it will shut down, which
turns off the UPS.  When power comes back on the UPS fires up and the
system resumes as normal.

  Admittedly this won't prevent issues like god reaching out and slapping
my system via lightning or something, nor will it resolve issues where
someone decides to grab the power cable and swing around on it, severing
the connection from the UPS to the system .. but for the most part it
has thus far proven to be a fairly decent configuration.

Benno Senoner wrote:
 
 "Stephen C. Tweedie" wrote:
 
 (...)
 
 
  3) The soft-raid background rebuild code reads and writes through the
 buffer cache with no synchronisation at all with other fs activity.
 After a crash, this background rebuild code will kill the
 write-ordering attempts of any journalling filesystem.
 
 This affects both ext3 and reiserfs, under both RAID-1 and RAID-5.
 
  Interaction 3) needs a bit more work from the raid core to fix, but it's
  still not that hard to do.
 
  So, can any of these problems affect other, non-journaled filesystems
  too?  Yes, 1) can: throughout the kernel there are places where buffers
  are modified before the dirty bits are set.  In such places we will
  always mark the buffers dirty soon, so the window in which an incorrect
  parity can be calculated is _very_ narrow (almost non-existent on
  non-SMP machines), and the window in which it will persist on disk is
  also very small.
 
  This is not a problem.  It is just another example of a race window
  which exists already with _all_ non-battery-backed RAID-5 systems (both
  software and hardware): even with perfect parity calculations, it is
  simply impossible to guarantee that an entire stripe update on RAID-5
  completes in a single, atomic operation.  If you write a single data
  block and its parity block to the RAID array, then on an unexpected
  reboot you will always have some risk that the parity will have been
  written, but not the data.  On a reboot, if you lose a disk then you can
  reconstruct it incorrectly due to the bogus parity.
 
  THIS IS EXPECTED.  RAID-5 isn't proof against multiple failures, and the
  only way you can get bitten by this failure mode is to have a system
  failure and a disk failure at the same time.
 
 
 
  --Stephen
 
 thank you very much for these clear explanations,
 
 Last doubt: :-)
 Assume all RAID code - FS interaction problems get fixed,
 since a linux soft-RAID5 box has no battery backup,
 does this mean that we will lose data
 ONLY if there is a power failure AND a subsequent disk failure ?
 If we lose the power and then after reboot all disks remain intact
 can the RAID layer reconstruct all information in a safe way ?
 
 The problem is that power outages are unpredictable even in the presence
 of UPSes; therefore it is important to have some protection against
 power losses.
 
 regards,
 Benno.



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-12 Thread Benno Senoner

"Stephen C. Tweedie" wrote:

 Ideally, what I'd like to see the reconstruction code do is to:

 * lock a stripe
 * read a new copy of that stripe locally
 * recalc parity and write back whatever disks are necessary for the stripe
 * unlock the stripe

 so that the data never goes through the buffer cache at all, but that
 the stripe is locked with respect to other IOs going on below the level
 of ll_rw_block (remember there may be IOs coming in to ll_rw_block which
 are not from the buffer cache, eg. swap or journal IOs).

  We are '100% journal-safe' if power fails during resync.

 Except for the fact that resync isn't remotely journal-safe in the first
 place, yes.  :-)

 --Stephen

Sorry for my ignorance, I got a little confused by this post:

Ingo said we are 100% journal-safe, you said the contrary,

can you or Ingo please explain to us in which situations (power-loss)
running linux-raid + journaled FS we risk a corrupted filesystem ?

I am interested in what happens if the power goes down while you write
heavily to an ext3/reiserfs (journaled FS) on a soft-raid5 array.

After the reboot, if all disks remain physically intact,
will we only lose the data that was being written, or could we end up
with a corrupted filesystem which could cause more damage in the future ?

(or do we need to wait for the raid code in 2.3 ?)

sorry for re-asking that question, but I am still confused.

regards,
Benno.





Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-12 Thread Stephen C. Tweedie

Hi,

On Tue, 11 Jan 2000 16:41:55 -0600, "Mark Ferrell"
[EMAIL PROTECTED] said:

   Perhaps I am confused.  How is it that a power outage while attached
 to the UPS becomes "unpredictable"?  

One of the most common ways to get an outage while on a UPS is somebody
tripping over, or otherwise removing, the cable between the UPS and the
computer.  How exactly is that predictable?

Just because you reduce the risk of unexpected power outage doesn't mean
we can ignore the possibility.

--Stephen



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-12 Thread Stephen C. Tweedie

Hi,

On Wed, 12 Jan 2000 07:21:17 -0500 (EST), Ingo Molnar [EMAIL PROTECTED]
said:

 On Wed, 12 Jan 2000, Gadi Oxman wrote:

 As far as I know, we took care not to poke into the buffer cache to
 find clean buffers -- in raid5.c, the only code which does a find_buffer()
 is:

 yep, this is still the case.

OK, that's good to know.

 Especially the reconstruction code is a rathole. Unfortunately
 blocking reconstruction until b_count == 0 is not acceptable because
 several filesystems (such as ext2fs) keep metadata caches around
 (eg. the block group descriptors in the ext2fs case) which have
 b_count == 1 for long periods.

That's not a problem: we don't need reconstruction to interact with the
buffer cache at all.

Ideally, what I'd like to see the reconstruction code do is to:

* lock a stripe
* read a new copy of that stripe locally
* recalc parity and write back whatever disks are necessary for the stripe
* unlock the stripe

so that the data never goes through the buffer cache at all, but that
the stripe is locked with respect to other IOs going on below the level
of ll_rw_block (remember there may be IOs coming in to ll_rw_block which
are not from the buffer cache, eg. swap or journal IOs).
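
In code, the loop would look something like this (sketch only; none of
these stripe helpers exist in the current md driver -- they stand in for
locking that the raid core would have to grow):

    struct stripe;      /* per-stripe state; details omitted */

    extern void lock_stripe(struct stripe *s);
    extern void read_stripe_direct(struct stripe *s);   /* private
                                            buffers, bypassing the cache */
    extern void compute_parity(struct stripe *s);
    extern void write_back_changed_disks(struct stripe *s);
    extern void unlock_stripe(struct stripe *s);

    static void resync_stripe(struct stripe *s)
    {
            lock_stripe(s);         /* blocks ALL other I/O to the stripe,
                                       including swap and journal I/O that
                                       enters below ll_rw_block           */
            read_stripe_direct(s);
            compute_parity(s);
            write_back_changed_disks(s);
            unlock_stripe(s);
    }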

 We are '100% journal-safe' if power fails during resync. 

Except for the fact that resync isn't remotely journal-safe in the first
place, yes.  :-)

--Stephen



Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-11 Thread Benno Senoner

"Stephen C. Tweedie" wrote:

 Hi,

 On Fri, 07 Jan 2000 13:26:21 +0100, Benno Senoner [EMAIL PROTECTED]
 said:

  what happens when I run RAID5 + journaled FS and the box is just writing
  data to the disk and then a power outage occurs ?

  Will this lead to a corrupted filesystem or will only the data which
  was just written, be lost ?

 It's more complex than that.  Right now, without any other changes, the
 main danger is that the raid code can sometimes lead to the filesystem's
 updates being sent to disk in the wrong order, so that on reboot, the
 journaling corrupts things unpredictably and silently.

 There is a second effect, which is that if the journaling code tries to
 prevent a buffer being written early by keeping its dirty bit clear,
 then raid can miscalculate parity by assuming that the buffer matches
 what is on disk, and that can actually cause damage to other data than
 the data being written if a disk dies and we have to start using parity
 for that stripe.

do you know if using soft RAID5 + regular ext2 causes the same sort of
damage, or if the corruption chances are lower when using a non-journaled
FS ?

is the potential corruption caused by the RAID layer or by the FS layer ?
(does the FS code or the RAID code need to be fixed ?)

if it's caused by the FS layer, how do XFS (not here yet ;-) ) or
ReiserFS behave in this case ?

cheers,
Benno.




 Both are fixable, but for now, be careful...

 --Stephen





Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-11 Thread Benno Senoner

"Stephen C. Tweedie" wrote:

(...)


 3) The soft-raid background rebuild code reads and writes through the
buffer cache with no synchronisation at all with other fs activity.
After a crash, this background rebuild code will kill the
write-ordering attempts of any journalling filesystem.

This affects both ext3 and reiserfs, under both RAID-1 and RAID-5.

 Interaction 3) needs a bit more work from the raid core to fix, but it's
 still not that hard to do.

 So, can any of these problems affect other, non-journaled filesystems
 too?  Yes, 1) can: throughout the kernel there are places where buffers
 are modified before the dirty bits are set.  In such places we will
 always mark the buffers dirty soon, so the window in which an incorrect
 parity can be calculated is _very_ narrow (almost non-existent on
 non-SMP machines), and the window in which it will persist on disk is
 also very small.

 This is not a problem.  It is just another example of a race window
 which exists already with _all_ non-battery-backed RAID-5 systems (both
 software and hardware): even with perfect parity calculations, it is
 simply impossible to guarantee that an entire stripe update on RAID-5
 completes in a single, atomic operation.  If you write a single data
 block and its parity block to the RAID array, then on an unexpected
 reboot you will always have some risk that the parity will have been
 written, but not the data.  On a reboot, if you lose a disk then you can
 reconstruct it incorrectly due to the bogus parity.

 THIS IS EXPECTED.  RAID-5 isn't proof against multiple failures, and the
 only way you can get bitten by this failure mode is to have a system
 failure and a disk failure at the same time.



 --Stephen

thank you very much for these clear explanations,

Last doubt: :-)
Assume all RAID code - FS interaction problems get fixed,
since a linux soft-RAID5 box has no battery backup,
does this mean that we will lose data
ONLY if there is a power failure AND a subsequent disk failure ?
If we lose the power and then after reboot all disks remain intact,
can the RAID layer reconstruct all information in a safe way ?

The problem is that power outages are unpredictable even in the presence
of UPSes; therefore it is important to have some protection against
power losses.

regards,
Benno.






[FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-11 Thread Stephen C. Tweedie

Hi,

This is a FAQ: I've answered it several times, but in different places,
so here's a definitive answer which will be my last one: future
questions will be directed to the list archives. :-)

On Tue, 11 Jan 2000 16:20:35 +0100, Benno Senoner [EMAIL PROTECTED]
said:

 then raid can miscalculate parity by assuming that the buffer matches
 what is on disk, and that can actually cause damage to other data
 than the data being written if a disk dies and we have to start using
 parity for that stripe.

 do you know if using soft RAID5 + regular ext2 causes the same sort of
 damage, or if the corruption chances are lower when using a
 non-journaled FS ?

Sort of.  See below.

 is the potential corruption caused by the RAID layer or by the FS
 layer ?  (does the FS code or the RAID code need to be fixed ?)

It is caused by neither: it is an interaction effect.

 if it's caused by the FS layer, how do XFS (not here yet ;-)
 ) or ReiserFS behave in this case ?

They will both fail in the same way.

Right, here's the problem:

The semantics of the linux-2.2 buffer cache are not well defined with
respect to write ordering.  There is no policy to guide what gets
written and when: the writeback caching can trickle to disk at any time,
and other system components such as filesystems and the VM can force a
write-back of data to disk at any time.

Journaling imposes write ordering constraints which insist that data in
the buffer cache *MUST NOT* be written to disk unless the filesystem
explicitly says so.

RAID-5 needs to interact directly with the buffer cache in order to be
able to improve performance.

There are three nasty interactions which result:

1) RAID-5 tries to bunch writes of dirty buffers up so that all the data
   in a stripe gets written to disk at once.  For RAID-5, this is very
   much faster than dribbling the stripe back one disk at a time.
   Unfortunately, this can result in dirty buffers being written to disk
   earlier than the filesystem expected, with the result that on a
   crash, the filesystem journal may not be entirely consistent.

   This interaction hits ext3, which stores its pending transaction
   buffer updates in the buffer cache with the b_dirty bit set.

2) RAID-5 peeks into the buffer cache to look for buffer contents in
   order to calculate parity without reading all of the disks in a
   stripe.  If a journaling system tries to prevent modified data from
   being flushed to disk by deferring the setting of the buffer dirty
   flag, then RAID-5 will think that the buffer, being clean, matches
   the state of the disk and so it will calculate parity which doesn't
   actually match what is on disk.  If we crash and one disk fails on
   reboot, wrong parity may prevent recovery of the lost data.

   This interaction hits reiserfs, which stores its pending transaction
   buffer updates in the buffer cache with the b_dirty bit clear.

Both interactions 1) and 2) can be solved by making RAID-5 completely
avoid buffers which have an incremented b_count reference count, and
making sure that the filesystems all hold that count raised when the
buffers are in an inconsistent or pinned state.
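
A sketch of that rule, in 2.2-style buffer-cache terms (get_hash_table()
takes a reference, so b_count > 1 means someone else also holds the
buffer):

    #include <linux/fs.h>

    /*
     * Sketch: raid5 looks for cached stripe data, but refuses to
     * trust any buffer that is dirty or pinned by a filesystem.
     */
    static struct buffer_head *
    stripe_cache_lookup(kdev_t dev, int block, int size)
    {
            struct buffer_head *bh = get_hash_table(dev, block, size);

            if (!bh)
                    return NULL;    /* not cached: read the disk */

            if (buffer_dirty(bh) || bh->b_count > 1) {
                    /*
                     * Dirty, or held by someone (a journaling fs keeps
                     * b_count raised while the buffer may not match
                     * the disk): fall back to reading the disk.
                     */
                    brelse(bh);
                    return NULL;
            }
            return bh;              /* clean and unpinned: safe to XOR */
    }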

3) The soft-raid background rebuild code reads and writes through the
   buffer cache with no synchronisation at all with other fs activity.
   After a crash, this background rebuild code will kill the
   write-ordering attempts of any journalling filesystem.  

   This affects both ext3 and reiserfs, under both RAID-1 and RAID-5.

Interaction 3) needs a bit more work from the raid core to fix, but it's
still not that hard to do.


So, can any of these problems affect other, non-journaled filesystems
too?  Yes, 1) can: throughout the kernel there are places where buffers
are modified before the dirty bits are set.  In such places we will
always mark the buffers dirty soon, so the window in which an incorrect
parity can be calculated is _very_ narrow (almost non-existent on
non-SMP machines), and the window in which it will persist on disk is
also very small.

This is not a problem.  It is just another example of a race window
which exists already with _all_ non-battery-backed RAID-5 systems (both
software and hardware): even with perfect parity calculations, it is
simply impossible to guarantee that an entire stripe update on RAID-5
completes in a single, atomic operation.  If you write a single data
block and its parity block to the RAID array, then on an unexpected
reboot you will always have some risk that the parity will have been
written, but not the data.  On a reboot, if you lose a disk then you can
reconstruct it incorrectly due to the bogus parity.
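
For the single-block case the arithmetic is (plain C, just to show why
the update is inherently two writes):

    #include <stddef.h>

    /*
     * Read-modify-write parity update for one data block in a stripe.
     * 'parity' and 'old_data' hold what is currently on disk.
     */
    void rmw_parity(unsigned char *parity, const unsigned char *old_data,
                    const unsigned char *new_data, size_t n)
    {
            size_t i;

            /* new parity = old parity XOR old data XOR new data */
            for (i = 0; i < n; i++)
                    parity[i] ^= old_data[i] ^ new_data[i];

            /*
             * The driver must now write new_data and parity to two
             * different disks.  Lose power between those writes and
             * the parity no longer matches the stripe: harmless
             * unless a disk then dies and reconstruction trusts the
             * stale parity.
             */
    }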

THIS IS EXPECTED.  RAID-5 isn't proof against multiple failures, and the
only way you can get bitten by this failure mode is to have a system
failure and a disk failure at the same time.


--Stephen



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-11 Thread mauelsha

"Stephen C. Tweedie" wrote:
 
 Hi,
 
 This is a FAQ: I've answered it several times, but in different places,

SNIP

 THIS IS EXPECTED.  RAID-5 isn't proof against multiple failures, and the
 only way you can get bitten by this failure mode is to have a system
 failure and a disk failure at the same time.
 

To try to avoid this kind of problem some brands do have additional
logging (to disk, which is slow for sure, or to NVRAM) in place, which
enables them to at least recognize the fault, to avoid the reconstruction
of invalid data, or even enables them to recover the data by using
redundant copies of it in NVRAM + logging information about what could be
written to the disks and what not.

Heinz



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-11 Thread Stephen C. Tweedie

Hi,

On Tue, 11 Jan 2000 15:03:03 +0100, mauelsha
[EMAIL PROTECTED] said:

 THIS IS EXPECTED.  RAID-5 isn't proof against multiple failures, and the
 only way you can get bitten by this failure mode is to have a system
 failure and a disk failure at the same time.

 To try to avoid this kind of problem some brands do have additional
 logging (to disk which is slow for sure or to NVRAM) in place, which
 enables them to at least recognize the fault to avoid the
 reconstruction of invalid data or even enables them to recover the
 data by using redundant copies of it in NVRAM + logging information
 about what could be written to the disks and what not.

Absolutely: the only way to avoid it is to make the data+parity updates
atomic, either in NVRAM or via transactions.  I'm not aware of any
software RAID solutions which do such logging at the moment: do you know
of any?

--Stephen



Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-07 Thread Stephen C. Tweedie

Hi,

On Fri, 07 Jan 2000 13:26:21 +0100, Benno Senoner [EMAIL PROTECTED]
said:

 what happens when I run RAID5 + journaled FS and the box is just writing
 data to the disk and then a power outage occurs ?

 Will this lead to a corrupted filesystem or will only the data which
 was just written, be lost ?

It's more complex than that.  Right now, without any other changes, the
main danger is that the raid code can sometimes lead to the filesystem's
updates being sent to disk in the wrong order, so that on reboot, the
journaling corrupts things unpredictably and silently.

There is a second effect, which is that if the journaling code tries to
prevent a buffer being written early by keeping its dirty bit clear,
then raid can miscalculate parity by assuming that the buffer matches
what is on disk, and that can actually cause damage to other data than
the data being written if a disk dies and we have to start using parity
for that stripe.

Both are fixable, but for now, be careful...

--Stephen



soft RAID5 + journalled FS + power failure = problems ?

2000-01-07 Thread Benno Senoner

Hi,
I was just thinking about the following :

There will soon be at least one stable journaled FS available for linux
out of the box. Of course we want to run our soft-RAID5 array with
journaling in order to avoid long fsck runs and speed up the boot
process.

My question:

what happens when I run RAID5 + journaled FS and the box is just writing
data to the disk and then a power outage occurs ?

Will this lead to a corrupted filesystem, or will only the data which was
just being written be lost ?

I know that a single disk + journaled FS does not lead to a corrupted
disk,

but what about the RAID array case ?

The Software-RAID HOWTO says:
"In particular, note that RAID is  designed to protect against *disk*
failures, and not against
*power* failures or *operator* mistakes."

What happens if a newly written block was only committed to 1 disk out of
the 4 disks present in an array (RAID5) ?

Will this block be marked as free after the array resync, or will it lead
to problems leaving the md device corrupted ?

If the software RAID in linux doesn't already guarantee this, could it be
added with a technique similar to the one a journaled FS uses ?

I am thinking about keeping a journal of the blocks committed to disk;
if a power failure occurs, just wipe out these blocks and make them free
again.
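
(Purely hypothetical -- no such structure exists in md today -- but the
intent record might look like:)

    #include <linux/types.h>

    /*
     * Hypothetical intent-log record, one per in-flight stripe write.
     * After a crash, any record not marked committed identifies
     * blocks to treat as free/uncommitted.
     */
    struct md_write_intent {
            __u32 magic;        /* identifies a valid record            */
            __u32 seq;          /* monotonically increasing sequence    */
            __u32 first_block;  /* first block of the in-flight write   */
            __u32 nr_blocks;    /* length of the write, in blocks       */
            __u32 committed;    /* set and flushed once all disks ack   */
    };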

A journaled FS on top of the raid device will automatically avoid placing
files in these "uncommitted disk blocks".

But since md is a "virtual device" the situation might be more
complicated.

I think quite a few of us are interested in this "raid reliability on
power failure during disk writes" topic.

Thank you in advance for your explanations.

regards,
Benno.