[zfs-discuss] ZFS on SAN? Availability edition.

2009-02-18 Thread Robin Harris
Calculating the availability and economic trade-offs of configurations  
is hard. Rule of thumb seems to rule.


I recently profiled an availability/reliability tool on  
StorageMojo.com that uses Bayesian analysis to estimate datacenter  
availability. You can quickly (minutes, not days) model systems and  
compare availability and recovery times as well as OpEx and CapEx  
implications.


One hole: AFAIK, ZFS isn't in their product catalog. There's a free  
version of the tool at http://www.twinstrata.com/


Feedback on the tool from this group is invited.

Robin
StorageMojo.com



Date: Tue, 17 Feb 2009 21:36:38 -0800
From: Richard Elling richard.ell...@gmail.com
To: Toby Thain t...@telegraphics.com.au
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] ZFS on SAN?
Message-ID: 499b9e66.2010...@gmail.com
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Toby Thain wrote:

Not at all. You've convinced me. Your servers will never, ever lose
power unexpectedly.


Methinks living in Auckland has something to do with that :-)
http://en.wikipedia.org/wiki/1998_Auckland_power_crisis

When services are reliable, then complacency brings risk.
My favorite example recently is the levees in New Orleans.
Katrina didn't top the levees, they were undermined.
-- richard
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS on SAN? Availability edition.

2009-02-18 Thread Scott Lawson

Robin,

From recollection, the business case for investment in power protection
technology was relatively simple.


We calculated what each hour of downtime was worth and how frequently
outages happened. We used to have several incidents per year, if not
more, and they would cause major system outages. When you have over
1000 staff and multiple remote sites depending on your data center (now
data centers, plural), the cost per hour in staff wages alone makes the
investment quite easy to justify. (I am not even going to factor in
loss of reputation and the media in this, or our most important
customers: our students.)
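
As a rough back-of-envelope (a sketch of mine in Python; every number
below is an illustrative assumption, not our actual figures), the
wages-only math looks something like this:

    # Illustrative downtime-cost estimate; all inputs are assumed.
    staff = 1000              # people idled by an outage (assumed)
    loaded_wage = 30.0        # assumed average hourly cost per person
    outages_per_year = 3      # assumed historical incident rate
    hours_per_outage = 4      # assumed mean time to restore service

    annual_cost = staff * loaded_wage * outages_per_year * hours_per_outage
    print(f"Wages alone: ${annual_cost:,.0f}/year")
    # -> Wages alone: $360,000/year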


I cannot *stress* enough how important power and environmental
protection is to data. It is the main consideration I take into account
when deploying new sites. (This discussion went off-list yesterday and
I was mentioning these same things there.) My analogy here: what is the
first thing NASA designs into a new spacecraft? Life support. Without
it you don't even leave the ground. Electricity *is* the lifeblood of
available storage.


Case in point: last year we had an arsonist set fire to a critical
point in our campus infrastructure, burning down a building that just
happened to have one of the main communication and power trenches
running through it. That knocked out around five buildings on that
campus for two weeks, and immense upheaval and disruption followed. Our
brand new DR data center was on that site. It kept running because of
redundant fibre paths to the SAN switches and core routers, so we could
still provide service to the rest of the campus and maintain active DR
to our primary site. Emergency power via generator was also available
until mains power could be rerouted to the data center.

I will take a look at the TwinStrata website (as should others).

Sorry to all if we are diverging too much from zfs-discuss.

/Scott

This stuff does happen. When you have been around for a while you see it.

Robin Harris wrote:
Calculating the availability and economic trade-offs of configurations 
is hard. Rule of thumb seems to rule.


I recently profiled an availability/reliability tool 
on StorageMojo.com that uses Bayesian analysis to estimate datacenter 
availability. You can quickly (minutes, not days) model systems and 
compare availability and recovery times as well as OpEx and CapEx 
implications. 

One hole: AFAIK, ZFS isn't in their product catalog. There's a free 
version of the tool at http://www.twinstrata.com/


Feedback on the tool from this group is invited.

Robin
StorageMojo.com



Date: Tue, 17 Feb 2009 21:36:38 -0800
From: Richard Elling richard.ell...@gmail.com
To: Toby Thain t...@telegraphics.com.au
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] ZFS on SAN?
Message-ID: 499b9e66.2010...@gmail.com
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Toby Thain wrote:
Not at all. You've convinced me. Your servers will never, ever lose 
power unexpectedly.


Methinks living in Auckland has something to do with that :-) 
http://en.wikipedia.org/wiki/1998_Auckland_power_crisis


When services are reliable, then complacency brings risk.
My favorite example recently is the levees in New Orleans.
Katrina didn't top the levees, they were undermined.
-- richard


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
  


--
___


Scott Lawson
Systems Architect
Manukau Institute of Technology
Information Communication Technology Services
Private Bag 94006, Manukau City, Auckland, New Zealand

Phone  : +64 09 968 7611
Fax    : +64 09 968 7641
Mobile : +64 27 568 7611

mailto:sc...@manukau.ac.nz

http://www.manukau.ac.nz




perl -e 'print
$i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'

 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS on SAN? Availability edition.

2009-02-18 Thread Miles Nordin
 sl == Scott Lawson scott.law...@manukau.ac.nz writes:

sl Electricity *is* the lifeblood of available storage.

I never meant to suggest computing machinery could run without
electricity.  My suggestion is, if your focus is _reliability_ rather
than availability, meaning you don't want to lose the contents of a
pool, you should think about what happens when power goes out, not
just how to make sure power Never goes out Ever Absolutely because we
Paid and our power is PERFECT.

 * pools should not go corrupt when power goes out.

 * UPS does not replace the need for NVRAMs to have batteries in them,
   because there are things between the UPS and the NVRAM like cords
   and power supplies, the UPSes themselves are not reliable enough
   if you have only one, and the controller containing the NVRAM may
   need to be hard-booted because of bugs.

 * supplying superexpensive futuristic infallible fancypower to all
   disk shelves does not mean the SYNC CACHE command can be thrown
   out (a sketch of what that command protects follows this list).
   Maybe the power is still not infallible, or maybe there will be
   SAN outages or blown controllers or shelves with junky software
   in them that hang the whole array when one drive goes bad.
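
To make the SYNC CACHE point concrete, here is a minimal sketch of
mine (the path is hypothetical) of the durability contract an
application relies on.  fsync() only means something if every layer
beneath the filesystem, array controller and disk write cache included,
honors the flush instead of silently discarding it:

    import os

    # Durability contract: data is safe only once fsync() returns AND
    # every cache below the filesystem has really been flushed.  On
    # ZFS the commit ends with a cache-flush request to the device; an
    # array that ignores SYNC CACHE breaks this contract invisibly.
    fd = os.open("/tank/important.dat", os.O_WRONLY | os.O_CREAT, 0o600)
    os.write(fd, b"committed record\n")
    os.fsync(fd)  # must not return until the write is on stable storage
    os.close(fd)
    # Only now may the application tell its client the record is saved.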

If you really care about availability:

 * reliability crosses into availability if you are planning to have
   fragile pools backed by a single SAN LUN, which may become corrupt
   if they lose power.  Maybe you're planning to destroy the pool and
   restore from backup in that case, and you have some
   carefully-planned offsite backup hierarchy that's always recent
   enough to capture all the data you care about.  But a restore
   could take days, which turns two minutes of unavailable power into
   one day of unavailable data.  If there were no reliability problem
   causing pool loss during power loss, two minutes of unavailable
   power maybe means 10min of unavailable data (the arithmetic is
   sketched after this list).

 * there are reported problems with systems that take hours to boot
   up, ex. with thousands of filesystems, snapshots, or nfs exports,
   which isn't exactly a reliability problem, but is a problem.  That
   open issue falls into the above outage-magnification category, too.
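
The outage-magnification arithmetic, sketched with assumed recovery
times (the restore and reboot figures are assumptions, not
measurements):

    # Same 2-minute power blip, two designs.
    power_outage_min = 2

    # Pool survives power loss: downtime ~ outage + reboot/remount.
    survives_min = power_outage_min + 8          # ~10 min data outage

    # Pool corrupted by power loss: downtime ~ restore from backup.
    restore_hours = 24                           # assumed restore time
    lost_min = power_outage_min + restore_hours * 60

    print(f"survives: {survives_min} min, lost: {lost_min} min, "
          f"magnification: {lost_min / survives_min:.0f}x")  # -> 144x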

I just don't like the idea people are building fancy space-age data
centers and then thinking they can safely run crappy storage software
that won't handle power outages because they're above having to worry
about all that little-guy nonsense.  A big selling point of the last
step-forward in filesystems (metadata logging) was that they'd handle
power failures with better consistency guarantees and faster
reboots---at the time, did metadata logging appeal only to people with
unreliable power?  I hope not.

never mind those of us who find these filesystem features important
because we'd like cheaper or smaller systems, with cords that we
sometimes trip over, that are still useful.  I think having such
protections in the storage software, and having them actually fully
working, not just imaginary or fragile, is always useful.  It isn't
something you can put yourself above by ``careful power design'' or
``paying for it'', because without them, in a disaster you've got this
brittle house-of-cards system that cracks once you deviate from the
specific procedures you've planned.

I'm glad your disaster planning has stood the test of practice so
well.  But we're supposed to have an industry baseline right now that
databases and MTAs and NFS servers and their underlying filesystems
can lose power without losing any data, and I think we should stick to
that rather than letting it slip.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS on SAN? Availability edition.

2009-02-18 Thread Bob Friesenhahn

On Wed, 18 Feb 2009, Miles Nordin wrote:


I just don't like the idea people are building fancy space-age data
centers and then thinking they can safely run crappy storage software
that won't handle power outages because they're above having to worry
about all that little-guy nonsense.  A big selling point of the last


Luckily that is not a concern for the members of this list since you 
posted to the ZFS mailing list and we all use ZFS rather than some 
crappy storage software.


Thanks for expressing your concern though.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS on SAN? Availability edition.

2009-02-18 Thread Scott Lawson



Miles Nordin wrote:

sl == Scott Lawson scott.law...@manukau.ac.nz writes:



sl Electricity *is* the lifeblood of available storage.

I never meant to suggest computing machinery could run without
electricity.  My suggestion is, if your focus is _reliability_ rather
than availability, meaning you don't want to lose the contents of a
pool, you should think about what happens when power goes out, not
just how to make sure power Never goes out Ever Absolutely because we
Paid and our power is PERFECT.
  

My focus is on both. And I understand that nothing is ever perfect,
only that one should strive for it where possible. But when one lives
in a place like NZ, where our power grid is creaky, power starts
becoming a real liability that needs mitigation, that's all. I am sure
there are plenty of ZFS users in the same boat.

 * pools should not go corrupt when power goes out.
  

Absolutely agree.

 * UPS does not replace the need for NVRAMs to have batteries in them,
   because there are things between the UPS and the NVRAM like cords
   and power supplies, the UPSes themselves are not reliable enough
   if you have only one, and the controller containing the NVRAM may
   need to be hard-booted because of bugs.
  
Fully understand this too. If, as I do, you use hardware RAID arrays
behind zpool vdevs, then it is very important that this equipment is
maintained: the batteries backing the RAID array write caches must be
good, and power must remain available long enough for the controllers
to flush cache to disk before the batteries go flat. This is certainly
true of any file system that is built upon LUNs from hardware-backed
RAID arrays.

 * supplying superexpensive futuristic infallible fancypower to all
   disk shelves does not mean the SYNC CACHE command can be thrown
   out.  Maybe the power is still not infallible, or maybe there will
   be SAN outages or blown controllers or shelves with junky software
   in them that hang the whole array when one drive goes bad.
  
This is, in general, why I use mirrored vdevs with LUNs provided from
two geographically isolated arrays; that is hopefully less likely to be
a problem. But yes, anything that ignores SYNC CACHE could pose a
serious problem if it is hidden from ZFS by an array controller.

If you really care about availability:

 * reliability crosses into availability if you are planning to have
   fragile pools backed by a single SAN LUN, which may become corrupt
   if they lose power.  Maybe you're planning to destroy the pool and
   restore from backup in that case, and you have some
   carefully-planned offsite backup hierarchy that's always recent
   enough to capture all the data you care about.  But a restore
   could take days, which turns two minutes of unavailable power into
   one day of unavailable data.  If there were no reliability problem
   causing pool loss during power loss, two minutes of unavailable
   power maybe means 10min of unavailable data.
  
Agreed, and this is why I would recommend against a single hardware
RAID SAN LUN for a zpool. At bare minimum for this you would want to
use copies=2 if you really care about your data. If you don't care
about the data then no problem, go ahead. I do use zpools for transient
data that I don't care about, favoring capacity over resiliency. (The
main thing I want for these is L2ARC; think squid proxy server caches.)

 * there are reported problems with systems that take hours to boot
   up, ex. with thousands of filesystems, snapshots, or nfs exports,
   which isn't exactly a reliability problem, but is a problem.  That
   open issue falls into the above outage-magnification category, too.
  
Have seen this myself. Not nice after a system reboot. I can't recall
whether I have seen it recently though; I seem to recall it was more
around S10 U2 or U3.

I just don't like the idea people are building fancy space-age data
centers and then thinking they can safely run crappy storage software
that won't handle power outages because they're above having to worry
about all that little-guy nonsense.  A big selling point of the last
step-forward in filesystems (metadata logging) was that they'd handle
power failures with better consistency guarantees and faster
reboots---at the time, did metadata logging appeal only to people with
unreliable power?  I hope not.
  
I am just trying to put forward the perspective of a big user here.
This has already generated numerous off-list posts from people wanting
more info on the methodology that we like to use. If I can be of help
to people, I will.

never mind those of us who find these filesystem features important
because we'd like cheaper or smaller systems, with cords that we
sometimes trip over, that are still useful.  I think having such
protections in the storage software and having them actually fully
working not just imaginary or fragile, is always useful,
Absolutely.  It is all part of the big picture, albeit probably *the*
most important part. Consistency of your