Re: [zfs-discuss] ZFS, power failures, and UPSes (and ZFS recovery guide links)

2009-07-01 Thread Haudy Kazemi

Ian Collins wrote:

David Magda wrote:

On Jun 30, 2009, at 14:08, Bob Friesenhahn wrote:

I have seen UPSs help quite a lot for short glitches lasting 
seconds, or a minute.  Otherwise the outage is usually longer than 
the UPSs can stay up since the problem required human attention.


A standby generator is needed for any long outages.


Can't remember where I read the claim, but supposedly if power isn't 
restored within about ten minutes, then it will probably be out for a 
few hours. If this 'statistic' is true, it would mean that your UPS 
should last (say) fifteen minutes, and after that you really need a 
generator.
Or run your systems off DC and get as much backup as you have room (and 
budget!) for batteries.  I once visited a central exchange with 48 
hours of battery capacity...


The way Google handles UPSes is to have a small 12v battery integrated 
with each PC power supply.  When the machine is on, the battery has its 
charge maintained.  Not unlike a laptop in that it has a built-in 
battery backup, but using an inexpensive sealed lead acid battery 
instead of lithium ion.  Here is info along with photos of the Google 
server internals:

http://news.cnet.com/8301-1001_3-10209580-92.html
http://willysr.blogspot.com/2009/04/googles-server-design.html

(IIRC there have been power supply UPSes since at least the late 1980s 
which had an internal battery.  Either that or they were UPSes that fit 
inside the standard PC (AT) compatible desktop case, making the power 
protection system entirely internal to the computer.  I think I saw 
these models one time while browsing late 1980s or early 1990s issues of 
PC Magazine that reviewed UPSes.  They still exist...one company selling 
them is http://www.globtek.com/html/ups.html .  A Google search for 
'power supply built in UPS' would likely find more.)


I also did additional searches in the zfs-discuss archives and found a 
thread from mid-February, which led me to other threads.  It looks like 
there are still scattered instances where ZFS has not recovered 
gracefully from power failures or other failures, where it became 
necessary to perform a manual transaction group (txg) rollback.  Here is 
a consolidated list of links related to manual uberblock transaction 
group (txg) rollback and similar ZFS data recovery guides, including 
undeleting:


Section 1: Nathan Hand's guide and related thread
Nathan Hand's guide to invalidating uberblocks (Dec 2008 thread)
http://www.opensolaris.org/jive/thread.jspa?threadID=85794
or http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg22153.html


Section 2: Victor Latushkin's guide and related threads
Thread: zpool unimportable (corrupt zpool metadata??) but no zdb -l 
device problems (Oct 2008 to Feb 2009 thread)

http://www.opensolaris.org/jive/thread.jspa?threadID=76960
or http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg19839.html

Repair report: Re: Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
http://www.opensolaris.org/jive/message.jspa?messageID=289537#289537

Some recovery discussion by Victor: zdb -bv alone took several hours to 
walk the block tree

http://www.opensolaris.org/jive/message.jspa?messageID=292991#292991
or http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/022365.html
or http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg20095.html

Victor Latushkin's guide: thanks to the COW nature of ZFS it was possible 
to successfully recover a pool state which was only 5 seconds older than 
the last unopenable one.

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/022331.html
or http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg20061.html


Section 3: reliability debates, recovery tool planning, uberblock info
Thread: Availability: ZFS needs to handle disk removal / driver failure 
better (August 2008 thread)

http://www.opensolaris.org/jive/thread.jspa?threadID=70811
or http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg19057.html

Thread: ZFS: unreliable for professional usage? (Feb 2009 thread)
http://www.opensolaris.org/jive/thread.jspa?threadID=91426
or http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg23833.html

Richard Elling's post that uberblocks are kept in a 128-entry circular 
queue which is 4x redundant with 2 copies each at the beginning and end 
of the vdev. Other metadata, by default, is 2x redundant and spatially 
diverse.

http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg24145.html
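A toy Python model of the circular queue Richard describes may help show why a rollback can only reach back a limited distance. It assumes, as a simplification, that transaction group N writes its uberblock into slot `N % 128` of every label and that all four label copies stay identical; the real on-disk details (slot count depending on sector size, checksums, other label contents) are not modeled, and `write_txgs` and `surviving_txgs` are hypothetical names.

```python
RING_SLOTS = 128   # slots per uberblock ring, per the post above
LABELS = 4         # two labels at the start of the vdev, two at the end


def write_txgs(last_txg, slots=RING_SLOTS, labels=LABELS):
    """Simulate committing txgs 1..last_txg and return every label's ring.

    Simplifying assumption: txg N overwrites slot N % slots in each label.
    """
    rings = [[None] * slots for _ in range(labels)]
    for txg in range(1, last_txg + 1):
        for ring in rings:
            ring[txg % slots] = txg
    return rings


def surviving_txgs(ring):
    """Txgs whose uberblocks are still present in one ring, oldest first."""
    return sorted(t for t in ring if t is not None)


rings = write_txgs(300)
kept = surviving_txgs(rings[0])
print(len(kept), kept[0], kept[-1])  # 128 173 300
```

In this model, once 128 newer txgs have been committed an older uberblock is gone from all four labels at the same time, so the extra copies guard against damage to one region of the disk rather than extending how far back a rollback can reach.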

Jeff Bonwick's post about Bug ID 6667683
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg23961.html

Bug ID 6667683: need a way to rollback to an uberblock from a previous txg
Description: If we are unable to open the pool based on the most recent 
uberblock then it might be useful to try an older txg uberblock as it 
might provide a better view of the world. Having a utility to reset the 
uberblock to a previous txg might provide a nice recovery mechanism.
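The rollback this bug asks for can be pictured as a search over the uberblocks the labels still hold: drop any whose own checksum fails, then try them newest first until one describes a pool that can be opened. The Python below only illustrates that search; `Uberblock`, `find_usable_uberblock` and the `can_open` callback are hypothetical stand-ins, not ZFS, zdb or zpool interfaces, and in practice deciding whether an older txg is usable means walking its block tree, which is the slow part Victor describes above.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Uberblock:
    txg: int            # transaction group this uberblock points at
    checksum_ok: bool   # did the uberblock's own checksum verify?


def find_usable_uberblock(
    candidates: Iterable[Uberblock],
    can_open: Callable[[Uberblock], bool],
    max_txg: Optional[int] = None,
) -> Optional[Uberblock]:
    """Return the newest uberblock that verifies and opens, or None.

    max_txg optionally caps the search, i.e. "roll back to at most txg N".
    """
    # The same txg can appear in several labels; keep one verified copy each.
    unique = {ub.txg: ub for ub in candidates if ub.checksum_ok}
    for txg in sorted(unique, reverse=True):  # newest first
        if max_txg is not None and txg > max_txg:
            continue
        if can_open(unique[txg]):
            return unique[txg]
    return None


# Example: txg 1000 is newest but its tree is damaged; 999 fails its checksum.
ubs = [Uberblock(1000, True), Uberblock(999, False), Uberblock(998, True)]
damaged = {1000}
best = find_usable_uberblock(ubs, lambda ub: ub.txg not in damaged)
print(best)  # Uberblock(txg=998, checksum_ok=True)
```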


Re: [zfs-discuss] ZFS, power failures, and UPSes

2009-07-01 Thread Andre van Eyssen

On Thu, 2 Jul 2009, Ian Collins wrote:


5+ is typical for telco use.


Aah, but we start getting into rooms full of giant 2V wet lead-acid cells 
and giant busbars the size of railway tracks.


--
Andre van Eyssen.
mail: an...@purplecow.org  jabber: an...@interact.purplecow.org
purplecow.org: UNIX for the masses http://www2.purplecow.org
purplecow.org: PCOWpix http://pix.purplecow.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS, power failures, and UPSes

2009-06-30 Thread Haudy Kazemi

Hello,

I've looked around Google and the zfs-discuss archives but have not been 
able to find a good answer to this question (and the related questions 
that follow it):


How well does ZFS handle unexpected power failures? (e.g. environmental 
power failures, power supply dying, etc.)

Does it consistently gracefully recover?
Should having a UPS be considered a (strong) recommendation or a "don't 
even think about running without it" item?
Are there any communications/interfacing caveats to be aware of when 
choosing the UPS?


In this particular case, we're talking about a home file server running 
OpenSolaris 2009.06.  Actual environment power failures are generally  
1 per year.  I know there are a few blog articles about this type of 
application, but I don't recall seeing any (or any detailed) discussion 
about power failures and UPSes as they relate to ZFS.  I did see that 
the ZFS Evil Tuning Guide says cache flushes are done every 5 seconds.


Here is one post that didn't get any replies about a year ago after 
someone had a power failure, then UPS battery failure while copying data 
to a ZFS pool:

http://lists.macosforge.org/pipermail/zfs-discuss/2008-July/000670.html

Both theoretical answers and real life experiences would be appreciated 
as the former tells me where ZFS is headed while the latter tells me 
where it has been or is now.


Thanks,

-hk


Re: [zfs-discuss] ZFS, power failures, and UPSes

2009-06-30 Thread Ross
I've seen enough people suffer from corrupted pools that a UPS is definitely 
good advice.  However, I'm running a (very low usage) ZFS server at home and 
it's suffered through at least half a dozen power outages without any problems 
at all.

I do plan to buy a UPS as soon as I can, but it seems to be surviving very well 
so far.
-- 
This message posted from opensolaris.org


Re: [zfs-discuss] ZFS, power failures, and UPSes

2009-06-30 Thread Monish Shah

A related question:  If you are on a UPS, is it OK to disable ZIL?

The evil tuning guide says "The ZIL is an essential part of ZFS and should 
never be disabled."  However, if you have a UPS, what can go wrong that 
really requires ZIL?


Opinions?

Monish

- Original Message - 
From: Ross <no-re...@opensolaris.org>

To: zfs-discuss@opensolaris.org
Sent: Tuesday, June 30, 2009 3:04 PM
Subject: Re: [zfs-discuss] ZFS, power failures, and UPSes


I've seen enough people suffer from corrupted pools that a UPS is 
definitely good advice.  However, I'm running a (very low usage) ZFS 
server at home and it's suffered through at least half a dozen power 
outages without any problems at all.


I do plan to buy a UPS as soon as I can, but it seems to be surviving very 
well so far.



Re: [zfs-discuss] ZFS, power failures, and UPSes

2009-06-30 Thread Scott Lawson



Haudy Kazemi wrote:

Hello,

I've looked around Google and the zfs-discuss archives but have not 
been able to find a good answer to this question (and the related 
questions that follow it):


How well does ZFS handle unexpected power failures? (e.g. 
environmental power failures, power supply dying, etc.)

Does it consistently gracefully recover?
Mostly. Unless you are unlucky. Backups are your friend in *any* 
environment though.
Should having a UPS be considered a (strong) recommendation or a 
"don't even think about running without it" item?
There has been quite an interesting thread on this over the last few 
months. I won't repeat my comments, but it is there in digital posterity 
on the zfs-discuss archives.


Certainly in a large environment with a lot of data being written, one 
should consider this a mandatory requirement if you care about your 
data, particularly if there are many links in your storage chain that 
could cause data corruption on a power failure.


Are there any communications/interfacing caveats to be aware of when 
choosing the UPS?


In this particular case, we're talking about a home file server 
running OpenSolaris 2009.06.  
As far as a home server goes, particularly if it is not write intensive, 
you will 'most likely' be fine. I have one at home, a v120 running 
S10 u6 with a D1000 and 7 x 300 GB SCSI disks in a RAIDZ2, that has 
seen numerous power interruptions with no faults. This machine is a 
Samba server for my Macs and printing business.

I also have a mail / web server on another v120 which experiences the 
same power faults and regularly bounces back without issues.  But your 
mileage may vary. It really depends on how much you care about the data.

I haven't used OpenSolaris specifically, however, as I prefer the 
generally better-supported S10 releases. (Yes, I know you can get 
support for OS, but I tend to be conservative and standardize as much 
as possible. I do have millions of files stored on ZFS volumes for our 
Uni and I sleep well ;))


Actual environment power failures are generally  1 per year.  I know 
there are a few blog articles about this type of application, but I 
don't recall seeing any (or any detailed) discussion about power 
failures and UPSes as they relate to ZFS.  I did see that the ZFS Evil 
Tuning Guide says cache flushes are done every 5 seconds.
The flush interval you mention applies to older versions of ZFS; newer 
ones can, I believe, now have a flush interval as long as 30 seconds.
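As a back-of-the-envelope illustration of why the flush interval matters in a power failure: asynchronous writes still buffered in memory when the power drops are lost, while the pool itself comes back as of the last completed transaction group. The Python sketch below is a rough estimate under the loud assumption of a steady write rate and that only data written since the last completed flush is lost; real ZFS can have more than one transaction group in flight, and synchronous writes are covered by the ZIL, so treat the result as an order of magnitude only. The function name and the 40 MB/s figure are hypothetical.

```python
def async_data_at_risk_mb(write_rate_mb_s: float, flush_interval_s: float) -> float:
    """Rough MB of buffered async writes lost on sudden power loss.

    Assumes a steady write rate and that only data written since the last
    completed flush is lost (a simplification; see the text above).
    """
    if write_rate_mb_s < 0 or flush_interval_s < 0:
        raise ValueError("rate and interval must be non-negative")
    return write_rate_mb_s * flush_interval_s


# A home server copying files at a hypothetical 40 MB/s:
for interval in (5, 30):
    print(f"{interval:>2} s interval: ~{async_data_at_risk_mb(40, interval):.0f} MB at risk")
```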


Here is one post that didn't get any replies about a year ago after 
someone had a power failure, then UPS battery failure while copying 
data to a ZFS pool:

http://lists.macosforge.org/pipermail/zfs-discuss/2008-July/000670.html

Both theoretical answers and real life experiences would be 
appreciated as the former tells me where ZFS is headed while the latter 
tells me where it has been or is now.


Thanks,

-hk


Re: [zfs-discuss] ZFS, power failures, and UPSes

2009-06-30 Thread Andre van Eyssen

On Tue, 30 Jun 2009, Monish Shah wrote:

The evil tuning guide says "The ZIL is an essential part of ZFS and should 
never be disabled."  However, if you have a UPS, what can go wrong that 
really requires ZIL?


Without addressing a single ZFS-specific issue:

* panics
* crashes
* hardware failures
- dead RAM
- dead CPU
- dead systemboard
- dead something else
* natural disasters
* UPS failure
* UPS failure (must be said twice)
* Human error (what does this button do?)
* Cabling problems (say, where did my disks go?)
* Malicious actions (Fired? Let me turn their power off!)

That's just a warm-up; I'm sure people can add both the ZFS-specific 
reasons and also the fallacy that a UPS does anything more than mitigate 
one particular single point of failure.


Don't forget to buy two UPSes and split your machine across both. And 
don't forget to actually maintain the UPS. And check the batteries. And 
schedule a load test.


The single best way to learn about the joys of UPS behaviour is to sit 
down and have a drink with a facilities manager who has been doing the job 
for at least ten years. At least you'll hear some funny stories about the 
day a loose screw on one floor took out a house UPS and 100+ hosts and NEs 
with it.


Andre.




Re: [zfs-discuss] ZFS, power failures, and UPSes

2009-06-30 Thread Doug Baker - Sun UK - Support Engineer

Monish Shah wrote:

A related question:  If you are on a UPS, is it OK to disable ZIL?

The evil tuning guide says "The ZIL is an essential part of ZFS and 
should never be disabled."  However, if you have a UPS, what can go 
wrong that really requires ZIL?


The UPS.



Opinions?

Monish

- Original Message - From: Ross <no-re...@opensolaris.org>
To: zfs-discuss@opensolaris.org
Sent: Tuesday, June 30, 2009 3:04 PM
Subject: Re: [zfs-discuss] ZFS, power failures, and UPSes


I've seen enough people suffer from corrupted pools that a UPS is 
definitely good advice.  However, I'm running a (very low usage) ZFS 
server at home and it's suffered through at least half a dozen power 
outages without any problems at all.


I do plan to buy a UPS as soon as I can, but it seems to be surviving 
very well so far.




--
Dr Doug Baker
Sun Microsystems Systems Support Engineer.
UK Mission Critical Solution Centre.
Tel : 0870 600 3222


Re: [zfs-discuss] ZFS, power failures, and UPSes

2009-06-30 Thread Neal Pollack

On 06/30/09 03:00 AM, Andre van Eyssen wrote:

On Tue, 30 Jun 2009, Monish Shah wrote:

The evil tuning guide says "The ZIL is an essential part of ZFS and 
should never be disabled."  However, if you have a UPS, what can go 
wrong that really requires ZIL?


Without addressing a single ZFS-specific issue:

* panics
* crashes
* hardware failures
- dead RAM
- dead CPU
- dead systemboard
- dead something else
* natural disasters
* UPS failure
* UPS failure (must be said twice)
* Human error (what does this button do?)
* Cabling problems (say, where did my disks go?)
* Malicious actions (Fired? Let me turn their power off!)

That's just a warm-up; I'm sure people can add both the ZFS-specific 
reasons and also the fallacy that a UPS does anything more than 
mitigate one particular single point of failure.


Actually, they do quite a bit more than that.
They create jobs, generate revenue for battery manufacturers, and techs 
who change batteries and do PM maintenance on the large units.  Let's 
not forget that they add significant revenue to the transportation 
industry, given their weight for shipping.


In the last 28 years of doing this stuff, I've found a few times that 
the UPS has actually worked and lasted as long as the outage.  Many 
other times, the unit has failed (circuits), or the batteries are 
beyond their service life.  But really, something approaching 40% of 
the time they actually work out OK.

So they also create repair and recycling jobs. :-)




Don't forget to buy two UPSes and split your machine across both. And 
don't forget to actually maintain the UPS. And check the batteries. 
And schedule a load test.


The single best way to learn about the joys of UPS behaviour is to sit 
down and have a drink with a facilities manager who has been doing the 
job for at least ten years. At least you'll hear some funny stories 
about the day a loose screw on one floor took out a house UPS and 100+ 
hosts and NEs with it.


Andre.






Re: [zfs-discuss] ZFS, power failures, and UPSes

2009-06-30 Thread Bob Friesenhahn

On Tue, 30 Jun 2009, Neal Pollack wrote:

Actually, they do quite a bit more than that. They create jobs, 
generate revenue for battery manufacturers, and techs who change 
batteries and do PM maintenance on the large units.  Let's not


It sounds like this is a responsibility which should be moved to the 
US federal government since UPSs create jobs.


In the last 28 years of doing this stuff, I've found a few times 
that the UPS has actually worked and lasted as long as the outage.


I have seen UPSs help quite a lot for short glitches lasting seconds, 
or a minute.  Otherwise the outage is usually longer than the UPSs can 
stay up since the problem required human attention.


A standby generator is needed for any long outages.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] ZFS, power failures, and UPSes

2009-06-30 Thread Erik Trimble

Bob Friesenhahn wrote:

On Tue, 30 Jun 2009, Neal Pollack wrote:

Actually, they do quite a bit more than that. They create jobs, 
generate revenue for battery manufacturers, and techs who change 
batteries and do PM maintenance on the large units.  Let's not


It sounds like this is a responsibility which should be moved to the 
US federal government since UPSs create jobs.


Actually, I think UPS already employs some 410,000+ people, making it 
the 3rd largest private employer in the USA. (5th overall, if you 
include the Federal Gov't and the US Postal Service).


<wink>


In the last 28 years of doing this stuff, I've found a few times that 
the UPS has actually worked and lasted as long as the outage.


I have seen UPSs help quite a lot for short glitches lasting seconds, 
or a minute.  Otherwise the outage is usually longer than the UPSs can 
stay up since the problem required human attention.


A standby generator is needed for any long outages.

Bob


As someone who has spent enough time doing data center work, I can 
attest to the fact that UPSes are really useful only as 
extremely-short-interval solutions. A dozen or so minutes, at best.


The best design I've seen was for an old BBN (hey, remember them!) site 
just outside of Cambridge, MA.  It took in utility power, ran it through 
a conditioner setup, and then through this nice switch thing.  The 
switch took three inputs:  Utility, a local diesel generator, and a line 
of marine batteries.  The switch itself was internally redundant (which 
isn't hard to do, it's '50s tech), so you could draw power from any (or 
even all 3 at once).  Nothing really fancy; it was simple, with no 
semiconductor stuff to fail - just '50s-era hardwired circuitry. I 
don't even think there was a transistor in the whole shebang. Lots of 
capacitors, though.   :-)



The gist of the whole thing was that if utility power was out more than 
5 minutes, there was no good predictor of how long it would remain out 
- I saw a nice little graph that showed no real good prediction of 
outage time based on existing outage length (i.e. if the power has been 
out X minutes, you can expect it to be restored in Y minutes...).   I 
suspect it was something like 20 years of accumulated data or so...


The end of this is simple:  UPSes should give you enough time to start 
the gen-pack.  If you are having problems with your gen-pack, you'll 
never have enough UPS time to fix it (and, it's not cost-effective to 
try to make it so), so FIX YOUR GEN PACK BEFORE the outage.  Which means 
- TEST it, and TEST it, and TEST it again!



For home use, I set my UPS to immediately shut down anything attached to 
it for /any/ service outage.  Large enough batteries to handle anything 
more than a couple of minutes are frankly a fire-hazard for the home, 
not to mention a maintenance PITA.


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



Re: [zfs-discuss] ZFS, power failures, and UPSes

2009-06-30 Thread Jason King
On Tue, Jun 30, 2009 at 1:36 PM, Erik Trimble <erik.trim...@sun.com> wrote:
 Bob Friesenhahn wrote:

 On Tue, 30 Jun 2009, Neal Pollack wrote:

 Actually, they do quite a bit more than that. They create jobs, generate
 revenue for battery manufacturers, and techs who change batteries and do
 PM maintenance on the large units.  Let's not

 It sounds like this is a responsibility which should be moved to the US
 federal government since UPSs create jobs.

 Actually, I think UPS already employs some 410,000+ people, making it the
 3rd largest private employer in the USA. (5th overall, if you include the
 Federal Gov't and the US Postal Service).

 <wink>


 In the last 28 years of doing this stuff, I've found a few times that the
 UPS has actually worked and lasted as long as the outage.

 I have seen UPSs help quite a lot for short glitches lasting seconds, or a
 minute.  Otherwise the outage is usually longer than the UPSs can stay up
 since the problem required human attention.

 A standby generator is needed for any long outages.

 Bob

 As someone who has spent enough time doing data center work, I can attest to
 the fact that UPSes are really useful only as extremely-short-interval
 solutions. A dozen or so minutes, at best.

 The best design I've seen was for an old BBN (hey, remember them!) site just
 outside of Cambridge, MA.  It took in utility power, ran it through a
 conditioner setup, and then through this nice switch thing.  The switch took
 three inputs:  Utility, a local diesel generator, and a line of marine
 batteries.  The switch itself was internally redundant (which isn't hard to
 do, it's '50s tech), so you could draw power from any (or even all 3 at
 once).  Nothing really fancy; it was simple, with no semiconductor stuff to
 fail - just '50s-era hardwired circuitry. I don't even think there was a
 transistor in the whole shebang. Lots of capacitors, though.   :-)


 The gist of the whole thing was that if utility power was out more than 5
 minutes, there was no good predictor of how long it would remain out - I
 saw a nice little graph that showed no real good prediction of outage time
 based on existing outage length (i.e. if the power has been out X minutes,
 you can expect it to be restored in Y minutes...).   I suspect it was
 something like 20 years of accumulated data or so...

 The end of this is simple:  UPSes should give you enough time to start the
 gen-pack.  If you are having problems with your gen-pack, you'll never have
 enough UPS time to fix it (and, it's not cost-effective to try to make it
 so), so FIX YOUR GEN PACK BEFORE the outage.  Which means - TEST it, and
 TEST it, and TEST it again!

Slight corollary -- just because you have a generator and test it
doesn't mean you can assume you can get fuel in a timely manner (so
still be prepared to shut down if needed).  I have seen places whose DR
plans completely rely on the assumption there will never be any
problems refueling their generators.  However, last year after Hurricane
Ike hit, one of AT&T's central offices lost power because it ran out of
fuel (and couldn't get refilled in time).



 For home use, I set my UPS to immediately shut down anything attached to it
 for /any/ service outage.  Large enough batteries to handle anything more
 than a couple of minutes are frankly a fire-hazard for the home, not to
 mention a maintenance PITA.



Re: [zfs-discuss] ZFS, power failures, and UPSes

2009-06-30 Thread Miles Nordin
 ms == Monish Shah <mon...@indranetworks.com> writes:
 sl == Scott Lawson <scott.law...@manukau.ac.nz> writes:
 np == Neal Pollack <neal.poll...@sun.com> writes:

ms> If you are on a UPS, is it OK to disable ZIL?

sl> I have seen numerous UPS failures over the years,

yeah at my place in NYC we've had more problems with the UPS than with
the service.  At the very least a UPS needs to switch off for new
batteries every two years, and the raw service does not go out that
often for me.

It starts to make more sense to use a UPS if you have dual power
supplies, dual UPSes, bypass switches.  Or crappy aboveground power.

anyway, typical machines panic because of bugs a lot more often than
either UPS or line problems.

**BUT THIS IS ALL BESIDE THE POINT**!

The ZIL is for implementing fsync() for databases and also the part of
NFS that allows servers to reboot without client data loss.  It has
*NOTHING TO DO* with losing your entire pool.  Disabling the ZIL does
not make catastrophic pool loss more likely, not even a little bit!

Unfortunately some software developer decided to write a bunch of DIRE
WARNINGS to SCARE PEOPLE INTO ASSUMPTIONS leading them to use the
maximum amount of code of which said developer is justly proud,
regardless of whether they're using it for the right reason or not.

oddly, I don't think disabling ZIL will make catastrophic loss more
likely for databases running above the ZFS, either, because unlike
non-COW filesystems ZFS never recovers to a state where writes appear
to have happened out-of-order prior to the crash.  Yes, disabling the
ZIL could break the 'D' in ACID for databases running above that ZFS,
but in a way that rolls them back in time, not makes them become
corrupt.  Running without ZIL is as if a snapshot were taken at each
TXG commit time, and on reboot after a crash you recover to the most
recent TXG-snapshot that fully committed, thus databases will be
``crash-consistent'' even without the ZIL, unless I'm mistaken.

Adding an SSD *does* make catastrophic pool loss more likely, because
if you break the SSD and then export the pool, you can never import it
again.  so, adding an SSD for the ZIL as a suggestive good-little-boy
alternative to disabling the ZIL makes catastrophic loss of the entire
pool more likely, not less.

The advantage of rolling with ZIL is, if you're using NFS you should
be able to crash and reboot the server without the clients noticing.
Also MTA's that accept messages, databases that confirm orders and
bookings, won't lose anything they've accepted or confirmed in the
crash (if everything else works).  I wish ZIL could be enabled and
disabled per filesystem instead of per kernel.
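Miles's point about what the ZIL does and does not protect can be illustrated with a toy Python model. It is a deliberate simplification, not how ZFS is implemented: writes accumulate in an open transaction group, a commit makes them durable, and `fsync` either copies the pending writes to an intent log (ZIL enabled) or merely returns (ZIL disabled). `ToyPool` and its methods are hypothetical names, and the model simply encodes his "roll back in time, not become corrupt" reasoning rather than verifying it.

```python
class ToyPool:
    """Toy model of txg commits plus an optional intent log (not real ZFS)."""

    def __init__(self, zil_enabled):
        self.zil_enabled = zil_enabled
        self.committed = []  # writes made durable by completed txg commits
        self.pending = []    # writes in the open txg, held only in RAM
        self.log = []        # pending writes copied to stable storage by fsync
        self.acked = 0       # how many writes fsync() has reported as durable

    def write(self, record):
        self.pending.append(record)

    def fsync(self):
        if self.zil_enabled:
            # Log everything not yet covered by a txg commit.
            self.log = list(self.pending)
        # Either way the application is told its data is safe.
        self.acked = len(self.committed) + len(self.pending)

    def txg_commit(self):
        self.committed.extend(self.pending)
        self.pending.clear()
        self.log.clear()  # the committed txg now covers the logged writes

    def crash_and_recover(self):
        # RAM contents are lost; replay the log on top of the last commit.
        return self.committed + self.log


history = ["order-1", "order-2", "order-3"]
for zil in (True, False):
    pool = ToyPool(zil_enabled=zil)
    pool.write(history[0])
    pool.txg_commit()
    pool.write(history[1])
    pool.write(history[2])
    pool.fsync()  # the application now believes all three orders are durable
    state = pool.crash_and_recover()
    # Either way the recovered state is a prefix of history (crash-consistent).
    assert state == history[: len(state)]
    print(f"ZIL enabled={zil}: recovered {state}, acknowledged {pool.acked}")
```

With the log enabled, all three acknowledged orders survive the simulated crash; with it disabled, the pool comes back as of the last commit, consistent but missing writes that `fsync` had already acknowledged, which is the lost 'D' in ACID described above.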




Re: [zfs-discuss] ZFS, power failures, and UPSes

2009-06-30 Thread David Magda

On Jun 30, 2009, at 14:08, Bob Friesenhahn wrote:

I have seen UPSs help quite a lot for short glitches lasting  
seconds, or a minute.  Otherwise the outage is usually longer than  
the UPSs can stay up since the problem required human attention.


A standby generator is needed for any long outages.


Can't remember where I read the claim, but supposedly if power isn't  
restored within about ten minutes, then it will probably be out for a  
few hours. If this 'statistic' is true, it would mean that your UPS  
should last (say) fifteen minutes, and after that you really need a  
generator.


At $WORK we currently have about thirty minutes worth of juice at full  
load, but as time drags on and we start shutting down less essential  
stuff we can increase that. The PBX and security system have their own  
UPSes in their own racks, so there are two layers of battery there.
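The effect of shedding less essential load can be put into rough numbers. The Python sketch below assumes a battery of known usable energy, a fixed inverter efficiency, and runtime inversely proportional to load; real lead-acid batteries deliver noticeably less energy at high discharge rates (the Peukert effect) and lose capacity with age, so real full-load runtimes are usually shorter than this linear estimate. The function names, the 0.85 efficiency and the rack figures are all hypothetical.

```python
def runtime_minutes(battery_wh: float, load_w: float, efficiency: float = 0.85) -> float:
    """Linear UPS runtime estimate: usable energy / load, in minutes.

    Assumes a constant load and ignores battery ageing and the Peukert effect.
    """
    if load_w <= 0:
        raise ValueError("load must be positive")
    return battery_wh * efficiency / load_w * 60


def runtime_with_shedding(battery_wh, full_load_w, shed_after_min, reduced_load_w,
                          efficiency=0.85):
    """Run at full load for shed_after_min minutes, then at the reduced load."""
    usable_wh = battery_wh * efficiency
    used_wh = full_load_w * shed_after_min / 60
    if used_wh >= usable_wh:
        # Batteries run flat before any load is shed.
        return runtime_minutes(battery_wh, full_load_w, efficiency)
    remaining_wh = usable_wh - used_wh
    return shed_after_min + remaining_wh / reduced_load_w * 60


# Hypothetical rack: 2 kWh of battery, 4 kW at full load, 1.5 kW after shedding.
print(f"{runtime_minutes(2000, 4000):.1f} min at full load")
print(f"{runtime_with_shedding(2000, 4000, 10, 1500):.1f} min if load is shed after 10 min")
```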




Re: [zfs-discuss] ZFS, power failures, and UPSes

2009-06-30 Thread Scott Lawson



David Magda wrote:

On Jun 30, 2009, at 14:08, Bob Friesenhahn wrote:

I have seen UPSs help quite a lot for short glitches lasting seconds, 
or a minute.  Otherwise the outage is usually longer than the UPSs 
can stay up since the problem required human attention.


A standby generator is needed for any long outages.


Can't remember where I read the claim, but supposedly if power isn't 
restored within about ten minutes, then it will probably be out for a 
few hours. If this 'statistic' is true, it would mean that your UPS 
should last (say) fifteen minutes, and after that you really need a 
generator.
Most UPSes from any vendor are designed to run for around 12 minutes at 
full load. So that would appear to back that claim up, and from my 
experience that is pretty much on the money...


At $WORK we currently have about thirty minutes worth of juice at full 
load, but as time drags on and we start shutting down less essential 
stuff we can increase that. The PBX and security system have their own 
UPSes in their own racks, so there are two layers of battery there.
The problem comes when the power cut comes and you aren't there in the 
middle of the night. Then you either need an automated shutdown system 
instigated by traps from the UPS (shutting things down in the correct 
order) or a generator. About here the generator becomes a very good 
option. The above no-generator scenario needs to be consistently tested 
to maintain its validity, which is a royal pain in the neck. Gen sets 
are worth their weight in gold. I can't even think how many times in the 
last few years they have saved our bacon (through both planned and 
unplanned outages).
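An automated shutdown "in the correct order" amounts to stopping dependents before the things they depend on, i.e. the reverse of a valid startup order. A minimal Python sketch using the standard `graphlib` module (Python 3.9+); the host names, the dependency map and the 10-minute threshold are hypothetical, and how the UPS notification actually arrives (SNMP trap, serial cable, vendor daemon) is not modeled.

```python
from graphlib import TopologicalSorter

# Hypothetical site: each host maps to the hosts it depends on.
DEPENDS_ON = {
    "web": {"db", "nfs"},
    "mail": {"nfs"},
    "db": {"zfs-storage"},
    "nfs": {"zfs-storage"},
    "zfs-storage": set(),
}


def shutdown_order(depends_on):
    """Dependents first, storage last: the reverse of a valid startup order."""
    startup = list(TopologicalSorter(depends_on).static_order())
    return list(reversed(startup))


def on_ups_event(on_battery, runtime_left_min, threshold_min=10.0):
    """Return the hosts to stop, in order, or [] if no action is needed yet."""
    if on_battery and runtime_left_min <= threshold_min:
        return shutdown_order(DEPENDS_ON)
    return []


print(on_ups_event(on_battery=True, runtime_left_min=8))
```

Passing `threshold_min=float("inf")` turns this into the shut-down-on-any-outage policy Erik describes for home use.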




Re: [zfs-discuss] ZFS, power failures, and UPSes

2009-06-30 Thread Ian Collins

David Magda wrote:

On Jun 30, 2009, at 14:08, Bob Friesenhahn wrote:

I have seen UPSs help quite a lot for short glitches lasting seconds, 
or a minute.  Otherwise the outage is usually longer than the UPSs 
can stay up since the problem required human attention.


A standby generator is needed for any long outages.


Can't remember where I read the claim, but supposedly if power isn't 
restored within about ten minutes, then it will probably be out for a 
few hours. If this 'statistic' is true, it would mean that your UPS 
should last (say) fifteen minutes, and after that you really need a 
generator.
Or run your systems off DC and get as much backup as you have room (and 
budget!) for batteries.  I once visited a central exchange with 48 hours 
of battery capacity...


--
Ian.
