Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-06 Thread Luca Berra

On Mon, Feb 04, 2008 at 07:38:40PM +0300, Michael Tokarev wrote:

Eric Sandeen wrote:
[]

http://oss.sgi.com/projects/xfs/faq.html#nulls

and note that recent fixes have been made in this area (also noted in
the faq)

Also - the above all assumes that when a drive says it's written/flushed
data, that it truly has.  Modern write-caching drives can wreak havoc
with any journaling filesystem, so that's one good reason for a UPS.  If


Unfortunately a UPS does not *really* help here.  Because unless
it has a control program which properly shuts the system down on loss
of input power, and the battery really has the capacity to power the
system while it shuts down (has anyone tested this?  With a new UPS?
And after a year of use, when the battery is no longer new?) -- unless
the UPS actually has the capacity to shut the system down, it will cut
the power at an unexpected time, while the disk(s) still have dirty
caches...


If the UPS is supported by nut (http://www.networkupstools.org) you can
do this easily.
Obviously you should tune the timeout to give your systems enough time
to shut down in the event of a power outage, periodically check your
battery's actual run time (that means real tests), re-tune the nut
configuration accordingly, and replace the battery when you discover it is dead.
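
As a rough sketch of the relevant knobs (the UPS name, credentials and
paths below are only placeholders -- adjust them to your driver and
distribution, and see upsmon.conf(5); the user/password must match an
entry in upsd.users):

# /etc/nut/upsmon.conf (minimal example)
MONITOR myups@localhost 1 upsmaster secret master
MINSUPPLIES 1
# run when the UPS reports on-battery + low-battery
SHUTDOWNCMD "/sbin/shutdown -h +0"
POLLFREQ 5
# flag file telling the halt scripts to also power the UPS off
POWERDOWNFLAG /etc/killpower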

L.

--
Luca Berra -- [EMAIL PROTECTED]
   Communication Media  Services S.r.l.
/"\
\ /  ASCII RIBBON CAMPAIGN
 X   AGAINST HTML MAIL
/ \


Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-05 Thread Linda Walsh



Michael Tokarev wrote:

note that with some workloads, write caching in
the drive actually makes write speed worse, not better - namely,
in case of massive writes.


With write barriers enabled, I did a quick test of
a large copy from one backup filesystem to another.
I'm not sure what you mean by large, but
this disk has 387G used in 975 files, averaging about 406MB/file.

I was copying from /hde (ATA100, 750G) to
/sdb (SATA-300, 750G) (both basically the same underlying model).

Of course your mileage may vary; these are averages over
12 runs each (with and without write caching):

(write cache on)            write    read
dev          TPS     MB/s   MB/s
hde ave     64.67    30.94    0.0
sdb ave    249.51     0.24   30.93

(write cache off)           write    read
dev          TPS     MB/s   MB/s
hde ave     45.63    21.81    0.0
sdb ave    177.76     0.24   21.96

write w/cache   = (30.94-21.81)/21.81     = ~42% faster
w/o write cache = 100-(100*21.81/30.94)   = ~30% slower

These disks have barrier support, so I'd guess the differences would
have been even greater if one didn't worry about losing write-cache contents.

If  barrier support doesn't work and one has to disable write-caching,
that is a noticeable performance penalty.

All writes with noatime, nodiratime, logbufs=8.
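
For anyone who wants to repeat the comparison, the cache toggling and the
throughput numbers can be had from stock tools -- something like the
following (device names are just examples; whether hdparm -W works on a
given SATA controller depends on the driver):

# disable / re-enable the drive's write cache
hdparm -W0 /dev/hde
hdparm -W1 /dev/hde

# run the copy while watching per-device tps and kB/s in 5-second samples
cp -a /backup1/. /backup2/ &
iostat -dk 5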


FWIW, and slightly OT: under Windows, comparing write-through (FAT32) against
write-back caching (NTFS), FAT32 was about 60% faster than NTFS, or put the
other way, NTFS was ~40% slower than FAT32 (with options set for no
last-access updates and no 3.1-style (8.3) short-filename creation).





Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-05 Thread Linda Walsh


Michael Tokarev wrote:

Unfortunately a UPS does not *really* help here.  Because unless
it has a control program which properly shuts the system down on loss
of input power, and the battery really has the capacity to power the
system while it shuts down (has anyone tested this?


Yes.  I must say, I am not connected or paid by APC.


With a new UPS?
And after a year of use, when the battery is no longer new?) -- unless
the UPS actually has the capacity to shut the system down, it will cut
the power at an unexpected time, while the disk(s) still have dirty
caches...


If you have a SmartUPS by APC, there is a freeware daemon that monitors
its status.  The UPS has USB and serial connections.
It's included in some distributions (SuSE).  The config
file is pretty straightforward.

I recommend the 1000XL over the 1500XL, because with the 1000XL you can
buy several add-on batteries that plug into the back.  (The 1000 refers to
the peak volt-amp load it can handle -- usually at startup; note that this
is not the same as watts, as some of us were taught in basic electronics
class, since the load isn't a simple resistor like a light bulb.)

One minor (but not fatal) design flaw: the add-on batteries give no indication
that they are live.  (I knocked a cord loose on one, and only got 7 minutes
of uptime before things shut down instead of my expected 20.)
I have 3 cells total (the controller plus 1 extra pack).  So why is my run time
so short?  I am being lazy about buying more extension packs.
The UPS is running 3 computers, the house phone (answering machine and wireless
handsets), a digital clock, and one LCD (usually off).  The real killer is a
new workstation with two dual-core Core 2 chips and other comparable equipment.

The 1500XL doesn't allow for adding more power packs.
The 2200XL does allow extra packs but comes in a rack-mount format.

It's not just a battery backup -- it conditions the power, filtering out
spikes and emitting a pure sine wave.  It will kick in during over- or under-
voltage conditions (you can set the sensitivity).  The on-battery alarm is
adjustable, as is the output voltage (115, 120, 230 or 240 V).  It
self-tests at least every 2 weeks, or more often if you prefer.

It also has a network feature (which I haven't gotten to work yet -- they just
changed the format) that allows other computers on the same network to be
notified and take action as well.

You specify what scripts to run at what times (power off, power on, getting
critically low, etc).

It hasn't failed me 'yet' -- except when a charger died and was replaced free
of charge (under warranty).  I have a separate setup in another room for
another computer.

The upspowerd daemon runs on Linux or Windows (under Cygwin, I think).


You can specify when to shut down -- like 5 minutes of battery life left.
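
(If the daemon in use is apcupsd -- that's my guess from the description;
the directive names below are from its stock configuration and the values
are only examples -- those thresholds live in /etc/apcupsd/apcupsd.conf:)

# /etc/apcupsd/apcupsd.conf (excerpt)
UPSCABLE usb
UPSTYPE usb
# DEVICE is left blank for USB
DEVICE
# shut down at 20% charge remaining, or 5 minutes of estimated runtime;
# TIMEOUT 0 means "don't shut down after a fixed number of seconds on battery"
BATTERYLEVEL 20
MINUTES 5
TIMEOUT 0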

The controller unit has 1 battery, but the add-ons have 2 batteries
each, so the first add-on roughly triples the run time.  When my system
did shut down prematurely, it went through the full halt sequence,
which I'd presume flushes disk caches.




the drive claims to have metadata safe on disk but actually does not,
and you lose power, the data claimed safe will evaporate, there's not
much the fs can do.  IO write barriers address this by forcing the drive
to flush order-critical data before continuing; xfs has them on by
default, although they are tested at mount time and if you have
something in between xfs and the disks which does not support barriers
(i.e. lvm...) then they are disabled again, with a notice in the logs.



Note also that with linux software raid barriers are NOT supported.

--
Are you sure about this?  My system used to have 3 new IDE drives
and one older one; at boot, XFS checked each drive for barriers
and turned off barriers for the disk that didn't support them.  ...or
are you referring specifically to linux-raid setups?

Would it be possible, on boot, to have xfs physically probe the RAID array
to see if barriers are really supported (or not), and disable
them if they are not (and optionally disable write caching, though that's
a major performance hit in my experience)?
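
In the meantime, what the kernel decided can at least be seen after the
fact, and barriers can be forced off by hand -- roughly like this (the
exact log message and the mount point are only examples, and vary by
kernel version):

dmesg | grep -i barrier
# e.g.: Filesystem "md0": Disabling barriers, not supported by the underlying device

# or disable them explicitly at mount time:
mount -o nobarrier /dev/md0 /data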

Linda


Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-05 Thread Michael Tokarev
Linda Walsh wrote:
 
 Michael Tokarev wrote:
 Unfortunately a UPS does not *really* help here.  Because unless
 it has a control program which properly shuts the system down on loss
 of input power, and the battery really has the capacity to power the
 system while it shuts down (has anyone tested this?
 
 Yes.  I must say, I am not connected or paid by APC.
 
 With a new UPS?
 And after a year of use, when the battery is no longer new?) -- unless
 the UPS actually has the capacity to shut the system down, it will cut
 the power at an unexpected time, while the disk(s) still have dirty
 caches...
 
 If you have a SmartUPS by APC, there is a freeware daemon that monitors
[...]

Good stuff.  I knew at least SOME UPSes are good... ;)
Too bad I rarely see such stuff in use by regular
home users...
[]
 Note also that with linux software raid barriers are NOT supported.
 --
 Are you sure about this?  My system used to have 3 new IDE drives
 and one older one; at boot, XFS checked each drive for barriers
 and turned off barriers for the disk that didn't support them.  ...or
 are you referring specifically to linux-raid setups?

I'm referring especially to linux-raid setups (software raid).
md devices don't support barriers, for a very simple
reason: once more than one disk drive is involved, the md layer
can't guarantee ordering ACROSS drives as well.  The problem is
that in case of power loss during writes, when an array needs
recovery/resync (at least for the parts which were being written,
if bitmaps are in use), the md layer will choose an arbitrary drive
as the master and will copy its data to the other drive (speaking
of the simplest case, a 2-drive raid1 array).  But the thing
is that one drive may have the last two barriers written (I mean
the data that was associated with the barriers), and the
other neither of the two - in two different places.  And
hence we may see quite some inconsistency here.

This is regardless of whether the underlying component devices
support barriers or not.

 Would it be possible, on boot, to have xfs physically probe the RAID array
 to see if barriers are really supported (or not), and disable
 them if they are not (and optionally disable write caching, though that's
 a major performance hit in my experience)?

Xfs already probes the devices as you describe, exactly the
same way as you've seen with your ide disks, and disables
barriers when they aren't supported.

The question, and the confusion, was about what happens when
barriers are disabled (provided, again, that we don't rely
on a UPS and other external things).  As far as I understand,
when barriers are working properly, xfs should be safe wrt
power losses (I'm still a bit unsure about this).  Now, when
barriers are turned off (for whatever reason), is it still
as safe?  I don't know.  Does it use regular cache flushes
in place of barriers in that case (which ARE supported by
the md layer)?

Generally, it has been said numerous times that XFS is not
power-cut-friendly, and that it should only be used when everything
is stable, including power.  Hence I'm afraid to deploy it
where I know the power is not stable (we have about 70 such
places here, with servers in each, where UPS batteries don't always
get replaced in time - ext3fs has never crashed there so
far, while ext2 did).

Thanks.

/mjt


Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-04 Thread Michael Tokarev
Moshe Yudkowsky wrote:
[]
 But that's *exactly* what I have -- well, 5GB -- and which failed. I've
 modified /etc/fstab system to use data=journal (even on root, which I
 thought wasn't supposed to work without a grub option!) and I can
 power-cycle the system and bring it up reliably afterwards.

Note also that data=journal effectively doubles the write time.
It's a bit faster for small writes (because all writes are first
done into the journal, i.e. into the same place, so no seeking
is needed), but for larger writes, the journal will become full
and the data in it needs to be written out to its proper place, to free
space for new data.  Here, if you continue writing, you will
see more than a 2x speed degradation, because of a) double writes,
and b) more seeking.
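
For reference, this is just a mount option (ext3 and reiserfs both accept
it); for a non-root filesystem it goes in fstab, while for the root fs it
has to reach the kernel before mounting, hence the bootloader parameter.
Roughly (device names and paths are only examples):

# /etc/fstab -- journal file data as well as metadata
/dev/md1   /var   ext3   defaults,data=journal   0 2

# for the root fs, append to the kernel line in menu.lst (or lilo.conf):
kernel /vmlinuz root=/dev/md0 ro rootflags=data=journal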

/mjt


Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-04 Thread Moshe Yudkowsky

Michael Tokarev wrote:

Moshe Yudkowsky wrote:
[]

But that's *exactly* what I have -- well, 5GB -- and which failed. I've
modified /etc/fstab system to use data=journal (even on root, which I
thought wasn't supposed to work without a grub option!) and I can
power-cycle the system and bring it up reliably afterwards.


Note also that data=journal effectively doubles the write time.
It's a bit faster for small writes (because all writes are first
done into the journal, i.e. into the same place, so no seeking
is needed), but for larger writes, the journal will become full
and the data in it needs to be written out to its proper place, to free
space for new data.  Here, if you continue writing, you will
see more than a 2x speed degradation, because of a) double writes,
and b) more seeking.


The alternative seems to be that portions of the / file system won't 
mount because the file system is corrupted on a crash while writing.


If I'm reading the man pages, Wikis, READMEs and mailing lists correctly 
--  not necessarily the case -- the ext3 file system uses the equivalent 
of data=journal as a default.


The question then becomes what data scheme to use with reiserfs on the
remainder of the file system: /usr, /var, /home, and the others. If they
can recover on a reboot using fsck and the default configuration of
reiserfs, then I have no problem using them. But my understanding is
that data can be lost or destroyed if there's a crash during a
write; in that case there's little point in running a RAID system that can
collect corrupt data.


Another way to phrase this: unless you're running data-center grade 
hardware and have absolute confidence in your UPS, you should use 
data=journal for reiserfs and perhaps avoid XFS entirely.



--
Moshe Yudkowsky * [EMAIL PROTECTED] * www.pobox.com/~moshe
Right in the middle of a large field where there had never been a 
trench was a
shell hole... 8 feet deep by 15 across. On the edge of it was a dead... 
rat not over
twice the size of a mouse. No wonder the war costs so much. Col. George 
Patton



Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-04 Thread Moshe Yudkowsky

Robin, thanks for the explanation. I have a further question.

Robin Hill wrote:


Once the file system is mounted then hdX,Y maps according to the
device.map file (which may actually bear no resemblance to the drive
order at boot - I've had issues with this before).  At boot time it maps
to the BIOS boot order though, and (in my experience anyway) hd0 will
always map to the drive the BIOS is booting from.


At the time that I use grub to write to the MBR, hd2,1 is /dev/sdc1. 
Therefore, I don't quite understand why this would not work:


grub <<EOF
root(hd2,1)
setup(hd2)
EOF

This would seem to be a command to have the MBR on hd2 written to use 
the boot on hd2,1. It's valid when written. Are you saying that it's a 
command for the MBR on /dev/sdc to find the data on (hd2,1), the 
location of which might change at any time? That's... a  very strange 
way to write the tool. I thought it would be a command for the MBR on 
hd2 (sdc) to look at hd2,1 (sdc1) to find its data, regardless of the 
boot order that caused sdc to be the boot disk.


--
Moshe Yudkowsky * [EMAIL PROTECTED] * www.pobox.com/~moshe
 Bring me the head of Prince Charming.
-- Robert Sheckley  Roger Zelazny


Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-04 Thread Michael Tokarev
Moshe Yudkowsky wrote:
[]
 If I'm reading the man pages, Wikis, READMEs and mailing lists correctly
 --  not necessarily the case -- the ext3 file system uses the equivalent
 of data=journal as a default.

ext3 defaults to data=ordered, not data=journal.  ext2 doesn't have
a journal at all.

 The question then becomes what data scheme to use with reiserfs on the

I'd say don't use reiserfs in the first place ;)

 Another way to phrase this: unless you're running data-center grade
 hardware and have absolute confidence in your UPS, you should use
 data=journal for reiserfs and perhaps avoid XFS entirely.

By the way, even if you do have a good UPS, there should be some
control program for it, to properly shut down your system when
the UPS loses AC power.  So far, I've seen no such programs...

/mjt


Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-04 Thread Moshe Yudkowsky

Eric,

Thanks very much for your note. I'm becoming very leery of reiserfs at
the moment... I'm about to run another series of crash tests.


Eric Sandeen wrote:

Justin Piszcz wrote:


Why avoid XFS entirely?

esandeen, any comments here?


Heh; well, it's the meme.


Well, yeah...


Note also that ext3 has the barrier option as well, but it is not
enabled by default due to performance concerns.  Barriers also affect
xfs performance, but enabling them in the non-battery-backed-write-cache
scenario is the right thing to do for filesystem integrity.


So if I understand you correctly, you're stating that currently the most 
reliable fs in its default configuration, in terms of protection against 
power-loss scenarios, is XFS?



--
Moshe Yudkowsky * [EMAIL PROTECTED] * www.pobox.com/~moshe
 There is something fundamentally wrong with a country [USSR] where
  the citizens want to buy your underwear.  -- Paul Thereaux


Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-04 Thread Michael Tokarev
Eric Sandeen wrote:
 Moshe Yudkowsky wrote:
 So if I understand you correctly, you're stating that currently the most 
 reliable fs in its default configuration, in terms of protection against 
 power-loss scenarios, is XFS?
 
 I wouldn't go that far without some real-world poweroff testing, because
 various fs's are probably more or less tolerant of a write-cache
 evaporation.  I suppose it'd depend on the size of the write cache as well.

I know of no filesystem which is, as you say, tolerant of write-cache
evaporation.  If a drive says the data is written but in fact it's
not, it's a Bad Drive (tm) and it should be thrown away immediately.
Fortunately, almost all modern disk drives don't lie this way.  The
only thing the filesystem needs to do is tell the drive to flush
its cache at the appropriate time, and actually wait for the flush
to complete.  Barriers (mentioned in this thread) are just another,
somewhat more efficient, way to do so, but a normal cache
flush will do as well.  That is, IF write caching is enabled in the
first place - note that with some workloads, write caching in
the drive actually makes write speed worse, not better - namely,
in the case of massive writes.

Speaking of XFS (and of ext3fs with write barriers enabled) -
I'm confused here as well, and the answers to my questions didn't
help either.  As far as I understand, XFS only uses barriers,
not regular cache flushes, hence without write barrier support
(which is not there for linux software raid, as explained
elsewhere) it's unsafe -- and probably the same applies to ext3
with barrier support enabled.  But I'm not sure I got it all
correctly.

/mjt


Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-04 Thread Richard Scobie

Michael Tokarev wrote:


Unfortunately a UPS does not *really* help here.  Because unless
it has a control program which properly shuts the system down on loss
of input power, and the battery really has the capacity to power the
system while it shuts down (has anyone tested this?  With a new UPS?
And after a year of use, when the battery is no longer new?) -- unless
the UPS actually has the capacity to shut the system down, it will cut
the power at an unexpected time, while the disk(s) still have dirty
caches...


I'm unsure what you mean here. The Network UPS Tools project
http://www.networkupstools.org/ has been supplying software to do this
for years.

In addition, a number of UPS manufacturers including APC, one of the
larger ones, provide Linux management and monitoring software with the UPS.

As far as worrying whether a one year old battery has enough capacity to
hold up while the system shuts down, there is no reason why you cannot
set it to shut the system down gracefully after maybe 30 seconds of
power loss if you feel it is necessary.
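
With NUT, that kind of fixed grace period is what upssched is for.  A
sketch (the script path and timer name are made up, and the usual
PIPEFN/LOCKFN lines are omitted):

# /etc/nut/upsmon.conf
NOTIFYCMD /sbin/upssched
NOTIFYFLAG ONBATT SYSLOG+EXEC
NOTIFYFLAG ONLINE SYSLOG+EXEC

# /etc/nut/upssched.conf
CMDSCRIPT /usr/local/bin/upssched-cmd
# 30 seconds on battery, then act; cancel if power returns first
AT ONBATT * START-TIMER onbatt-shutdown 30
AT ONLINE * CANCEL-TIMER onbatt-shutdown

The CMDSCRIPT is handed the timer name and can call "upsmon -c fsd" (or
plain shutdown) when the timer fires.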

A reputable brand UPS with a correctly sized battery capacity will have
no trouble in this scenario unless the battery is faulty, in which case
that will probably be picked up during automated load tests. As long as
the manufacturer's battery replacement schedule is followed, genuine
replacement batteries are used and automated regular UPS tests are
enabled, the risks of failure are small.

Regards,

Richard




Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-04 Thread Eric Sandeen
Moshe Yudkowsky wrote:
 So if I understand you correctly, you're stating that currently the most 
 reliable fs in its default configuration, in terms of protection against 
 power-loss scenarios, is XFS?

I wouldn't go that far without some real-world poweroff testing, because
various fs's are probably more or less tolerant of a write-cache
evaporation.  I suppose it'd depend on the size of the write cache as well.

-Eric


Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-04 Thread Robin Hill
On Mon Feb 04, 2008 at 05:06:09AM -0600, Moshe Yudkowsky wrote:

 Robin, thanks for the explanation. I have a further question.

 Robin Hill wrote:

 Once the file system is mounted then hdX,Y maps according to the
 device.map file (which may actually bear no resemblance to the drive
 order at boot - I've had issues with this before).  At boot time it maps
 to the BIOS boot order though, and (in my experience anyway) hd0 will
 always map to the drive the BIOS is booting from.

 At the time that I use grub to write to the MBR, hd2,1 is /dev/sdc1. 
 Therefore, I don't quite understand why this would not work:

 grub <<EOF
 root(hd2,1)
 setup(hd2)
 EOF

 This would seem to be a command to have the MBR on hd2 written to use the 
 boot on hd2,1. It's valid when written. Are you saying that it's a command 
 for the MBR on /dev/sdc to find the data on (hd2,1), the location of which 
 might change at any time? That's... a  very strange way to write the tool. 
 I thought it would be a command for the MBR on hd2 (sdc) to look at hd2,1 
 (sdc1) to find its data, regardless of the boot order that caused sdc to be 
 the boot disk.

This is exactly what it does, yes - the hdX,Y are mapped by GRUB into
BIOS disk interfaces (0x80 being the first, 0x81 the second and so on)
and it writes (to hdc in this case) the instructions to look on the
first partition of BIOS drive 0x82 (whichever drive that ends up being)
for the rest of the bootloader.

It is a bit of a strange way to work, but it's really the only way it
_can_ work (and cover all circumstances).  Unfortunately when you start
playing with bootloaders you have to get down to the BIOS level, and
things weren't written to make sense at that level (after all, when
these standards were put in place everyone was booting from a single
floppy disk system).  If EFI becomes more standard then hopefully this
will simplify but we're stuck with things as they are for now.
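
For what it's worth, the mapping GRUB uses once the system is up lives in
/boot/grub/device.map, and inside the grub shell it can be overridden with
the device command before running setup (device names below are examples):

# /boot/grub/device.map
(hd0)   /dev/sda
(hd1)   /dev/sdb
(hd2)   /dev/sdc

# in the grub shell, remap before installing:
grub> device (hd0) /dev/sdc
grub> root (hd0,1)
grub> setup (hd0)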

Cheers,
Robin

-- 
Robin Hill    [EMAIL PROTECTED]
| Little Jim says ... He fallen in de water !! |




Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-04 Thread Justin Piszcz



On Mon, 4 Feb 2008, Michael Tokarev wrote:


Moshe Yudkowsky wrote:
[]

If I'm reading the man pages, Wikis, READMEs and mailing lists correctly
--  not necessarily the case -- the ext3 file system uses the equivalent
of data=journal as a default.


ext3 defaults to data=ordered, not data=journal.  ext2 doesn't have
journal at all.


The question then becomes what data scheme to use with reiserfs on the


I'd say don't use reiserfs in the first place ;)


Another way to phrase this: unless you're running data-center grade
hardware and have absolute confidence in your UPS, you should use
data=journal for reiserfs and perhaps avoid XFS entirely.


By the way, even if you do have a good UPS, there should be some
control program for it, to properly shut down your system when
UPS loses the AC power.  So far, I've seen no such programs...

/mjt


Why avoid XFS entirely?

esandeen, any comments here?

Justin.


Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-04 Thread Michael Tokarev
Eric Sandeen wrote:
[]
 http://oss.sgi.com/projects/xfs/faq.html#nulls
 
 and note that recent fixes have been made in this area (also noted in
 the faq)
 
 Also - the above all assumes that when a drive says it's written/flushed
 data, that it truly has.  Modern write-caching drives can wreak havoc
 with any journaling filesystem, so that's one good reason for a UPS.  If

Unfortunately a UPS does not *really* help here.  Because unless
it has a control program which properly shuts the system down on loss
of input power, and the battery really has the capacity to power the
system while it shuts down (has anyone tested this?  With a new UPS?
And after a year of use, when the battery is no longer new?) -- unless
the UPS actually has the capacity to shut the system down, it will cut
the power at an unexpected time, while the disk(s) still have dirty
caches...

 the drive claims to have metadata safe on disk but actually does not,
 and you lose power, the data claimed safe will evaporate, there's not
 much the fs can do.  IO write barriers address this by forcing the drive
 to flush order-critical data before continuing; xfs has them on by
 default, although they are tested at mount time and if you have
 something in between xfs and the disks which does not support barriers
 (i.e. lvm...) then they are disabled again, with a notice in the logs.

Note also that with linux software raid barriers are NOT supported.

/mjt


Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-04 Thread Eric Sandeen
Eric Sandeen wrote:
 Justin Piszcz wrote:
 
 Why avoid XFS entirely?

 esandeen, any comments here?
 
 Heh; well, it's the meme.
 
 see:
 
 http://oss.sgi.com/projects/xfs/faq.html#nulls
 
 and note that recent fixes have been made in this area (also noted in
 the faq)

Actually, continue reading past that specific entry to the next several;
they cover all this quite well.

-Eric



Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-04 Thread Eric Sandeen
Justin Piszcz wrote:

 Why avoid XFS entirely?
 
 esandeen, any comments here?

Heh; well, it's the meme.

see:

http://oss.sgi.com/projects/xfs/faq.html#nulls

and note that recent fixes have been made in this area (also noted in
the faq)

Also - the above all assumes that when a drive says it's written/flushed
data, that it truly has.  Modern write-caching drives can wreak havoc
with any journaling filesystem, so that's one good reason for a UPS.  If
the drive claims to have metadata safe on disk but actually does not,
and you lose power, the data claimed safe will evaporate, there's not
much the fs can do.  IO write barriers address this by forcing the drive
to flush order-critical data before continuing; xfs has them on by
default, although they are tested at mount time and if you have
something in between xfs and the disks which does not support barriers
(i.e. lvm...) then they are disabled again, with a notice in the logs.

Note also that ext3 has the barrier option as well, but it is not
enabled by default due to performance concerns.  Barriers also affect
xfs performance, but enabling them in the non-battery-backed-write-cache
scenario is the right thing to do for filesystem integrity.
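
Concretely, these are just mount options (for kernels of roughly this
vintage; device names and mount points below are examples only):

# ext3: barriers are off by default, turn them on explicitly
mount -o barrier=1 /dev/sda3 /data

# xfs: barriers are on by default; disable only if you have a
# battery-backed write cache
mount -o nobarrier /dev/sdb1 /scratch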

-Eric

 Justin.
 



Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-04 Thread Justin Piszcz



On Mon, 4 Feb 2008, Michael Tokarev wrote:


Eric Sandeen wrote:
[]

http://oss.sgi.com/projects/xfs/faq.html#nulls

and note that recent fixes have been made in this area (also noted in
the faq)

Also - the above all assumes that when a drive says it's written/flushed
data, that it truly has.  Modern write-caching drives can wreak havoc
with any journaling filesystem, so that's one good reason for a UPS.  If


Unfortunately a UPS does not *really* help here.  Because unless
it has a control program which properly shuts the system down on loss
of input power, and the battery really has the capacity to power the
system while it shuts down (has anyone tested this?  With a new UPS?
And after a year of use, when the battery is no longer new?) -- unless
the UPS actually has the capacity to shut the system down, it will cut
the power at an unexpected time, while the disk(s) still have dirty
caches...
If you use nut and a UPS large enough to handle the load of the system, it
shuts the machine down just fine.





the drive claims to have metadata safe on disk but actually does not,
and you lose power, the data claimed safe will evaporate, there's not
much the fs can do.  IO write barriers address this by forcing the drive
to flush order-critical data before continuing; xfs has them on by
default, although they are tested at mount time and if you have
something in between xfs and the disks which does not support barriers
(i.e. lvm...) then they are disabled again, with a notice in the logs.


Note also that with linux software raid barriers are NOT supported.

/mjt





RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-03 Thread Moshe Yudkowsky
I've been reading the draft and checking it against my experience. 
Because of local power fluctuations, I've just accidentally checked my 
system:  My system does *not* survive a power hit. This has happened 
twice already today.


I've got /boot and a few other pieces in a 4-disk RAID 1 (three running, 
one spare). This partition is on /dev/sd[abcd]1.


I've used grub to install grub on all three running disks:

grub --no-floppy <<EOF
root (hd0,1)
setup (hd0)
root (hd1,1)
setup (hd1)
root (hd2,1)
setup (hd2)
EOF

(To those reading this thread to find out how to recover: According to 
grub's map option, /dev/sda1 maps to hd0,1.)



After the power hit, I get:

 Error 16
 Inconsistent filesystem mounted

I then tried to boot up on hda1,1, hdd2,1 -- none of them worked.

The culprit, in my opinion, is the reiserfs file system. During the 
power hit, the reiserfs file system of /boot was left in an inconsistent 
state; this meant I had up to three bad copies of /boot.


Recommendations:

1. I'm going to try adding a data=journal option to the reiserfs file 
systems, including the /boot. If this does not work, then /boot must be 
ext3 in order to survive a power hit.


2. We discussed what should be on the RAID1 bootable portion of the 
filesystem. True, it's nice to have the ability to boot from just the 
RAID1 portion. But if that RAID1 portion can't survive a power hit, 
there's little sense. It might make a lot more sense to put /boot on its 
own tiny partition.


The Fix:

The way to fix this problem with booting is to get the reiserfs file
system back into sync. I did this by booting to my emergency single-disk
partition ((hd0,0) if you must know) and then mounting the /dev/md/root
that contains /boot. This forced a reiserfs consistency check and
journal replay, and let me reboot without problems.
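
In other words, something along these lines from the emergency system
(device names here are only illustrative):

# assemble the array and mount it, which replays the reiserfs journal
mdadm --assemble /dev/md/root /dev/sda1 /dev/sdb1 /dev/sdc1
mount -t reiserfs /dev/md/root /mnt
umount /mnt

# if mounting alone doesn't clean things up, check it by hand
reiserfsck --check /dev/md/root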




--
Moshe Yudkowsky * [EMAIL PROTECTED] * www.pobox.com/~moshe
A gun is, in many people's minds, like a magic wand. If you point it at 
people,

they are supposed to do your bidding.
-- Edwin E. Moise, _Tonkin Gulf_


Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-03 Thread Robin Hill
On Sun Feb 03, 2008 at 01:15:10PM -0600, Moshe Yudkowsky wrote:

 I've been reading the draft and checking it against my experience. Because 
 of local power fluctuations, I've just accidentally checked my system:  My 
 system does *not* survive a power hit. This has happened twice already 
 today.

 I've got /boot and a few other pieces in a 4-disk RAID 1 (three running, 
 one spare). This partition is on /dev/sd[abcd]1.

 I've used grub to install grub on all three running disks:

 grub --no-floppy <<EOF
 root (hd0,1)
 setup (hd0)
 root (hd1,1)
 setup (hd1)
 root (hd2,1)
 setup (hd2)
 EOF

 (To those reading this thread to find out how to recover: According to 
 grub's map option, /dev/sda1 maps to hd0,1.)

This is wrong - the disk you boot from will always be hd0 (no matter
what the map file says - that's only used after the system's booted).
You need to remap the hd0 device for each disk:

grub --no-floppy <<EOF
root (hd0,1)
setup (hd0)
device (hd0) /dev/sdb
root (hd0,1)
setup (hd0)
device (hd0) /dev/sdc
root (hd0,1)
setup (hd0)
device (hd0) /dev/sdd
root (hd0,1)
setup (hd0)
EOF


 After the power hit, I get:

  Error 16
  Inconsistent filesystem mounted

 I then tried to boot up on hda1,1, hdd2,1 -- none of them worked.

 The culprit, in my opinion, is the reiserfs file system. During the power 
 hit, the reiserfs file system of /boot was left in an inconsistent state; 
 this meant I had up to three bad copies of /boot.

Could well be - I always use ext2 for the /boot filesystem and don't
have it automounted.  I only mount the partition to install a new
kernel, then unmount it again.
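
In fstab terms that's something like this (the md device name is just an
example):

/dev/md0   /boot   ext2   noauto   0 0

# only when installing a new kernel:
mount /boot
# (install the kernel, re-run grub/lilo)
umount /boot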

Cheers,
Robin
-- 
Robin Hill    [EMAIL PROTECTED]
| Little Jim says ... He fallen in de water !! |




Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-03 Thread Michael Tokarev
Moshe Yudkowsky wrote:
 I've been reading the draft and checking it against my experience.
 Because of local power fluctuations, I've just accidentally checked my
 system:  My system does *not* survive a power hit. This has happened
 twice already today.
 
 I've got /boot and a few other pieces in a 4-disk RAID 1 (three running,
 one spare). This partition is on /dev/sd[abcd]1.
 
 I've used grub to install grub on all three running disks:
 
 grub --no-floppy <<EOF
 root (hd0,1)
 setup (hd0)
 root (hd1,1)
 setup (hd1)
 root (hd2,1)
 setup (hd2)
 EOF
 
 (To those reading this thread to find out how to recover: According to
 grub's map option, /dev/sda1 maps to hd0,1.)

I usually install all the drives identically in this regard -
each to be treated as the first BIOS disk (disk 0x80).  As already
pointed out in this thread, not all BIOSes are able to boot
off a second or third disk, so if your first disk (sda) fails,
your only option is to put your sdb in place of sda and boot
from it - and for that, grub needs to think it's the first boot drive
too.

By the way, lilo works here more easily and more reliably.
You just install a standard mbr (lilo ships one too) which simply
boots the active partition, install lilo onto the
raid array, and tell it to NOT do anything fancy with raid
at all (raid-extra-boot none).  But for this to work, you
have to have identical partitions with identical offsets -
at least for the boot partitions.
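
A minimal lilo.conf for that arrangement looks roughly like this
(untested as written; device names are examples, and the generic MBR
itself is installed separately, e.g. with lilo -M or install-mbr):

# /etc/lilo.conf
# install the boot sector into the raid1 array itself,
# and do not touch the component drives' MBRs
boot=/dev/md0
raid-extra-boot=none
image=/boot/vmlinuz
    root=/dev/md0
    label=linux
    read-only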

 After the power hit, I get:
 
 Error 16
 Inconsistent filesystem mounted

But did it actually mount it?

 I then tried to boot up on hda1,1, hdd2,1 -- none of them worked.

Which is in fact expected after the above.  You have 3 identical
copies (thanks to raid) of your boot filesystem, all 3 equally
broken.  When it boots, it assembles your /boot raid array - the
same array regardless of whether you boot off hda, hdb or hdc.

 The culprit, in my opinion, is the reiserfs file system. During the
 power hit, the reiserfs file system of /boot was left in an inconsistent
 state; this meant I had up to three bad copies of /boot.

I've never seen any problem with ext[23] wrt unexpected power loss, so
far.  I'm running several hundred different systems, some since 1998, some
since 2000.  Sure, there were several inconsistencies, and sometimes (maybe
once or twice) some minor data loss (only a few newly created files were
lost), but the most serious was finding a few items in lost+found after an
fsck - and that's ext2; I've never seen that with ext3.

What's more, I tried hard to force a power failure at an unexpected time, by
doing massive write operations and cutting power while at it - I was
never able to trigger any problem this way, at all.

In any case, even if ext[23] is somewhat damaged, it can still be
mounted - access to some files may return I/O errors (in the parts
where it's really damaged), but the rest will work.

On the other hand, I had several immediate issues with reiserfs.  That
was a long time ago, when the filesystem was first included in the
mainline kernel, so it doesn't reflect the current situation.  Yet even
at that stage, reiserfs was declared stable by the authors.  Issues
were trivially triggerable by cutting the power at an unexpected
time, and on several occasions fsck didn't help.

So I tend to avoid reiserfs - due to my own experience, and due to
numerous problems elsewhere.

 Recommendations:
 
 1. I'm going to try adding a data=journal option to the reiserfs file
 systems, including the /boot. If this does not work, then /boot must be
 ext3 in order to survive a power hit.

By the way, if your /boot is a separate filesystem (i.e. there's nothing
more there), I see absolutely zero reason for it to crash.  /boot
is modified VERY rarely (only when installing a kernel), and only while
it's being modified is there a chance for it to be damaged somehow.  The
rest of the time it's constant, and a power cut should not hurt
it at all.  If reiserfs shows such behaviour even for an unmodified
filesystem (

 2. We discussed what should be on the RAID1 bootable portion of the
 filesystem. True, it's nice to have the ability to boot from just the
 RAID1 portion. But if that RAID1 portion can't survive a power hit,
 there's little sense. It might make a lot more sense to put /boot on its
 own tiny partition.

Hehe.

/boot doesn't matter really.  A separate /boot was used for 3 purposes:

1) to work around BIOS 1024th-cylinder issues (long gone with LBA)
2) to be able to put the rest of the system onto a filesystem/raid/lvm/etc.
 that the bootloader doesn't support.  Like, lilo didn't support
 reiserfs (and still doesn't with tail packing enabled), so if you
 want to use reiserfs for your root fs, put /boot into a separate
 ext2fs.  The same is true for raid - you can put the rest of the
 system into a raid5 array (unsupported by grub/lilo), and, in order
 to boot, create a small raid1 (or any other supported level) /boot.
3) to keep it as non-volatile as possible.  Like, an area of the
 disk which never changes (except in a few very rare cases).  For
 

Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-03 Thread Moshe Yudkowsky

Robin Hill wrote:


This is wrong - the disk you boot from will always be hd0 (no matter
what the map file says - that's only used after the system's booted).
You need to remap the hd0 device for each disk:

grub --no-floppy <<EOF
root (hd0,1)
setup (hd0)
device (hd0) /dev/sdb
root (hd0,1)
setup (hd0)
device (hd0) /dev/sdc
root (hd0,1)
setup (hd0)
device (hd0) /dev/sdd
root (hd0,1)
setup (hd0)
EOF


For my enlightenment: if the file system is mounted, then hd2,1 is a 
sensible grub operation, isn't it? For the record, given my original 
script when I boot I am able to edit the grub boot options to read


root (hd2,1)

and proceed to boot.


--
Moshe Yudkowsky * [EMAIL PROTECTED] * www.pobox.com/~moshe
 I love deadlines... especially the whooshing sound they
  make as they fly past.
-- Dermot Dobson


Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-03 Thread Moshe Yudkowsky

Michael Tokarev wrote:


Speaking of repairs.  As I already mentioned, I always use small
(256M..1G) raid1 array for my root partition, including /boot,
/bin, /etc, /sbin, /lib and so on (/usr, /home, /var are on
their own filesystems).  And I had the following scenarios
happened already:


But that's *exactly* what I have -- well, 5GB -- and it failed. I've
modified /etc/fstab to use data=journal (even on root, which I
thought wasn't supposed to work without a grub option!) and I can
power-cycle the system and bring it up reliably afterwards.


So I'm a little suspicious of this theory that /etc and others can be on 
the same partition as /boot in a non-ext3 file system.


--
Moshe Yudkowsky * [EMAIL PROTECTED] * www.pobox.com/~moshe
 Thanks to radio, TV, and the press we can now develop absurd
  misconceptions about peoples and governments we once hardly knew
  existed.-- Charles Fair, _From the Jaws of Victory_


Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-03 Thread Michael Tokarev
Moshe Yudkowsky wrote:
 Michael Tokarev wrote:
 
 Speaking of repairs.  As I already mentioned, I always use small
 (256M..1G) raid1 array for my root partition, including /boot,
 /bin, /etc, /sbin, /lib and so on (/usr, /home, /var are on
 their own filesystems).  And I had the following scenarios
 happened already:
 
 But that's *exactly* what I have -- well, 5GB -- and which failed. I've
 modified /etc/fstab system to use data=journal (even on root, which I
 thought wasn't supposed to work without a grub option!) and I can
 power-cycle the system and bring it up reliably afterwards.
 
 So I'm a little suspicious of this theory that /etc and others can be on
 the same partition as /boot in a non-ext3 file system.

If even your separate /boot failed (which should NEVER fail), what can
one say about the rest?

I mean, even if you save your /boot, what help will it be to you if
your root fs is damaged?

That's why I said /boot is mostly irrelevant.

Well.  You can have some recovery stuff in your initrd/initramfs - that's
for sure (and for that to work, you can make your /boot more reliable by
creating a separate filesystem for it).  But if you go this route, it's
better to boot off some recovery CD instead of attempting recovery with the
very limited toolset available in your initramfs.

/mjt


Re: RAID needs more to survive a power hit, different /boot layout for example (was Re: draft howto on making raids for surviving a disk crash)

2008-02-03 Thread Robin Hill
On Sun Feb 03, 2008 at 02:46:54PM -0600, Moshe Yudkowsky wrote:

 Robin Hill wrote:

 This is wrong - the disk you boot from will always be hd0 (no matter
 what the map file says - that's only used after the system's booted).
 You need to remap the hd0 device for each disk:
 grub --no-floppy <<EOF
 root (hd0,1)
 setup (hd0)
 device (hd0) /dev/sdb
 root (hd0,1)
 setup (hd0)
 device (hd0) /dev/sdc
 root (hd0,1)
 setup (hd0)
 device (hd0) /dev/sdd
 root (hd0,1)
 setup (hd0)
 EOF

 For my enlightenment: if the file system is mounted, then hd2,1 is a 
 sensible grub operation, isn't it? For the record, given my original script 
 when I boot I am able to edit the grub boot options to read

 root (hd2,1)

 and proceed to boot.

Once the file system is mounted then hdX,Y maps according to the
device.map file (which may actually bear no resemblance to the drive
order at boot - I've had issues with this before).  At boot time it maps
to the BIOS boot order though, and (in my experience anyway) hd0 will
always map to the drive the BIOS is booting from.

So initially you may have:
SATA-1: hd0
SATA-2: hd1
SATA-3: hd2

Now, if the SATA-1 drive dies totally you will have:
SATA-1: -
SATA-2: hd0
SATA-3: hd1

or if SATA-2 dies:
SATA-1: hd0
SATA-2: -
SATA-3: hd1

Note that in the case where the drive is still detected but fails to
boot, the behaviour seems to be very BIOS-dependent - some will
continue on to drive 2 as above, whereas others will just sit and complain.

So to answer the second part of your question, yes - at boot time
currently you can do root (hd2,1) or root (hd3,1).  If a disk dies,
however (whichever disk it is), then root (hd3,1) will fail to work.

Note also that the above is only my experience - if you're depending on
certain behaviour under these circumstances then you really need to test
it out on your hardware by disconnecting drives, substituting
non-bootable drives, etc.

HTH,
Robin
-- 
Robin Hill    [EMAIL PROTECTED]
| Little Jim says ... He fallen in de water !! |

