Re: [zfs-discuss] [RFC] Backup solution

2010-10-10 Thread Bob Friesenhahn

On Sat, 9 Oct 2010, Richard Elling wrote:


On Oct 8, 2010, at 10:01 AM, Bob Friesenhahn wrote:

Regardless, nothing beats raidz3 based on computable statistics.


Well, no, not really. It all depends on the number of sets and the MTTR.


Well, ok.  I should have appended "except for 3-way mirrors". :-)

3-way mirrors seem like an expensive solution for bulk data backup, 
except that if the current data fits (with plenty of headroom) on the 
3-way mirror, zfs snapshots (with compression enabled) turn out to be 
an excellent way to capture the incremental changes over time.  This 
requires care in how updates are applied to the backup pool so that 
unchanged data blocks are not overwritten.  Backed-up data usually 
does not change rapidly, so the incremental snapshots don't require 
much space.
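A minimal sketch of that update scheme, assuming a source pool named tank and a backup pool named backup (both names are hypothetical), and assuming yesterday's snapshot already exists on both sides.  The commands are echoed rather than executed so the sketch is safe to paste; drop the echos to use it for real:

```shell
# Incremental snapshot backup sketch (dry run: commands are echoed only).
# Pool names and snapshot labels are illustrative assumptions.
SRC=tank            # source pool
DST=backup          # 3-way-mirror backup pool, compression enabled
PREV=2010-10-09     # most recent snapshot already present on $DST
NOW=2010-10-10      # snapshot being taken now

echo zfs snapshot -r "$SRC@$NOW"
# 'zfs send -i' transmits only blocks changed since $PREV, so unchanged
# blocks on the backup pool are never rewritten and each incremental
# snapshot stays small.
echo "zfs send -R -i $SRC@$PREV $SRC@$NOW | zfs receive -u -d $DST"
```

The -u on the receive keeps the backup datasets from being mounted, which avoids accidental writes to the backup pool between updates.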


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] [RFC] Backup solution

2010-10-09 Thread Richard Elling
On Oct 8, 2010, at 10:01 AM, Bob Friesenhahn wrote:
> Regardless, nothing beats raidz3 based on computable statistics.

Well, no, not really. It all depends on the number of sets and the MTTR.
Consider the case where you have 1 set of raidz3 and 2 sets of 3-way
mirrors.  The raidz3 set can only stand to lose 3 disks, whereas the mirrored
sets can stand to lose 4 disks (2 per mirror).  The answer is not immediately
intuitive because it does depend on the MTTR in practical cases.
 -- richard



Re: [zfs-discuss] [RFC] Backup solution

2010-10-08 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Roy Sigurd Karlsbakk
> 
> In addition to this comes another aspect: what if one drive fails and
> you find bad data on another in the same VDEV while resilvering?  This
> is quite common these days, and for mirrors that will mean data loss
> unless you mirror 3-way or more, which will be rather costly.

Like resilvers, scrubs go faster with mirrors.  Scrub regularly.



Re: [zfs-discuss] [RFC] Backup solution

2010-10-08 Thread Bob Friesenhahn

On Fri, 8 Oct 2010, Roy Sigurd Karlsbakk wrote:

In addition to this comes another aspect: what if one drive fails 
and you find bad data on another in the same VDEV while resilvering? 
This is quite common these days, and for mirrors that will mean 
data loss unless you mirror 3-way or more, which will be rather 
costly.


The "answer" to this is to schedule a periodic scrub.  It is of course 
not a complete answer, since a drive may degrade after the previous 
scrub and you might still lose some (or even all!) data.  If you use 
mirrors or raidz1 you should definitely include a periodic scrub in 
the plan.  The good news is that mirrors scrub quickly, with far fewer 
I/Os and less system impact than raidz?.
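As a concrete example, a root crontab entry along these lines schedules a weekly scrub (the pool name "tank" and the Sunday 03:00 slot are just illustrative):

```shell
# root crontab entry (pool name and schedule are illustrative):
0 3 * * 0 /usr/sbin/zpool scrub tank
```

The results show up in 'zpool status', which is worth checking after each run for any checksum errors the scrub surfaced.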


Regardless, nothing beats raidz3 based on computable statistics.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [zfs-discuss] [RFC] Backup solution

2010-10-08 Thread Roy Sigurd Karlsbakk
> Now, the above does not include proper statistics: the chances of
> that 2nd and 3rd disk failing may be higher (even correlated) than
> our 'flat-line' %/hr. based on a 1-year MTBF, for instance if all
> the disks were purchased in the same lot at the same time, so their
> chances of failing around the same time are higher, etc.

In addition to this comes another aspect: what if one drive fails and you find 
bad data on another in the same VDEV while resilvering?  This is quite common 
these days, and for mirrors that will mean data loss unless you mirror 3-way 
or more, which will be rather costly.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. 
It is an elementary imperative for all pedagogues to avoid excessive use of 
idioms of foreign origin. In most cases, adequate and relevant synonyms exist 
in Norwegian.


Re: [zfs-discuss] [RFC] Backup solution

2010-10-08 Thread Scott Meilicke

On Oct 8, 2010, at 8:25 AM, Bob Friesenhahn wrote:
> 
> It also does not include the "human factor" which is still the most 
> significant contributor to data loss.  This is the most difficult factor to 
> diminish.  If the humans have difficulty understanding the system or the 
> hardware, then they are more likely to do something wrong which damages the 
> data.

This is often overlooked during system design. It is very easy to lose your 
head during a high-stress moment and pull the wrong drive (I, of course, have 
never done that...). Having raidz2 (or raidz3) / triple mirrors, diagrams 
of which disk has failed, working LED failure lights, and letting a hot spare 
finish resilvering before replacing a disk are all good countermeasures.

> It also does not account for an OS kernel which caches quite a lot of data in 
> memory (relying on ECC for reliability), and which may have bugs.

At some point you have to rely on your backups for the unexpected and 
unforeseen. Make sure they are good!

Michael, nice reliability write up!

--

Scott Meilicke





Re: [zfs-discuss] [RFC] Backup solution

2010-10-08 Thread Bob Friesenhahn

On Fri, 8 Oct 2010, Michael DeMan wrote:

Now, the above does not include proper statistics: the chances of 
that 2nd and 3rd disk failing may be higher (even correlated) than 
our 'flat-line' %/hr. based on a 1-year MTBF, for instance if all 
the disks were purchased in the same lot at the same time, so their 
chances of failing around the same time are higher, etc.


It also does not include the "human factor" which is still the most 
significant contributor to data loss.  This is the most difficult 
factor to diminish.  If the humans have difficulty understanding the 
system or the hardware, then they are more likely to do something 
wrong which damages the data.


It also does not account for an OS kernel which caches quite a lot of 
data in memory (relying on ECC for reliability), and which may have 
bugs.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [zfs-discuss] [RFC] Backup solution

2010-10-08 Thread Michael DeMan

On Oct 8, 2010, at 4:33 AM, Edward Ned Harvey wrote:

>> From: Peter Jeremy [mailto:peter.jer...@alcatel-lucent.com]
>> Sent: Thursday, October 07, 2010 10:02 PM
>> 
>> On 2010-Oct-08 09:07:34 +0800, Edward Ned Harvey 
>> wrote:
>>> If you're going raidz3, with 7 disks, then you might as well just make
>>> mirrors instead, and eliminate the slow resilver.
>> 
>> There is a difference in reliability:  raidzN means _any_ N disks can
>> fail, whereas mirror means one disk in each mirror pair can fail.
>> With a mirror, Murphy's Law says that the second disk to fail will be
>> the pair of the first disk :-).
> 
> Maybe.  But in reality, you're just guessing the probability of a single
> failure, the probability of multiple failures, and the probability of
> multiple failures within the critical time window and critical redundancy
> set.
> 
> The probability of a 2nd failure within the critical time window is smaller
> whenever the critical time window is decreased, and the probability of that
> failure being within the critical redundancy set is smaller whenever your
> critical redundancy set is smaller.  So if raidz2 takes twice as long to
> resilver as a mirror, and has a larger critical redundancy set, then you
> haven't gained any probable resiliency over a mirror.
> 
> Although it's true that, with mirrors, it's possible for 2 disks to fail and
> result in loss of the pool, I think the probability of that happening is smaller
> than the probability of a 3-disk failure in the raidz2.
> 
> How much longer does a 7-disk raidz2 take to resilver as compared to a
> mirror?  According to my calculations, it's in the vicinity of 10x longer.  
> 

This article has been posted elsewhere and is about 10 months old, but it is a 
good read:

http://queue.acm.org/detail.cfm?id=1670144



Really, there should be a ballpark / back-of-the-napkin formula for calculating 
this.  I've been curious about this too, so here goes a first cut...



DR = disk reliability, in terms of the chance of the disk dying in any given 
time period, say any given hour.

DFW = disk full write: the time to write every sector on the disk.  This will 
vary depending on system load, but it is still an input that can be determined 
by some testing.


RSM = resilver time for a mirror of two of the given disks
RSZ1 = resilver time for a raidz1 vdev of seven of the given disks
RSZ2 = resilver time for a raidz2 vdev of seven of the given disks


chance of losing all data in a mirror: DLM = RSM * DR
chance of losing all data in a raidz1: DLRZ1 = RSZ1 * (6 * DR)  (any of the six remaining disks)
chance of losing all data in a raidz2: DLRZ2 = RSZ2 * (6 * DR) * (5 * DR)



Now, for the above, I'll make some other assumptions...


Let's just guess at a 1-year MTBF for our disks and, for purposes here, 
flat-line that as a constant chance of failure per hour throughout the year.

Let's presume rebuilding a mirror takes one hour.
Let's presume that a 7-disk raidz1 takes 24 times longer to rebuild one disk 
than a mirror; I think this would be a 'safe' ratio to the benefit of the 
mirror.
Let's presume that a 7-disk raidz2 takes 72 times longer to rebuild one disk 
than a mirror; this should be 'safe' and again benefit the mirror.




DR for a one-hour period = 1 / (24 hours * 365 days) = .000114, the chance a 
disk might die in any given hour.


DLM = one hour * DR = .000114

DLRZ1 = 24 hours * (.000114 * 6) (x6 because there are six more drives in 
the vdev, and any one of them could fail)

DLRZ2 = 72 hours * (.000114 * 6) * (.000114 * 5) 
= a much tinier chance of losing all that data.





Maybe a better way to think about it:

Based on our 1-year flat-line MTBF for disks, to figure out how much faster the 
mirror must rebuild for its reliability to equal that of a raidz2...

DLM = DLRZ2

.000114 * RSM = RSZ2 * (.000114 * 6) * (.000114 * 5)

RSM / RSZ2 = (.000114 * 6) * (.000114 * 5) / .000114 = .000114 * 30 = .00342

So the mirror would have to resilver roughly three hundred times faster than 
the raidz2 (1 / .00342 = ~292) in order for it to offer the same level of 
reliability in regards to the chances of losing the entire vdev to additional 
disk failures during a resilver.
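This arithmetic is easy to get wrong by a decimal place, so here is the whole back-of-the-napkin model as a small awk script, using the same assumptions (1-year flat-line MTBF, 1-hour mirror resilver, 72-hour 7-disk raidz2 resilver).  The numbers, like the model, are only ballpark:

```shell
# Back-of-the-napkin vdev-loss odds under the assumptions stated above.
awk 'BEGIN {
  DR   = 1 / (24 * 365)   # per-hour failure chance from a flat 1-year MTBF
  RSM  = 1                # mirror resilver time, hours (assumed)
  RSZ2 = 72               # 7-disk raidz2 resilver time, hours (assumed)

  DLM   = RSM * DR                      # mirror: the 1 surviving disk fails
  DLRZ2 = RSZ2 * (6 * DR) * (5 * DR)    # raidz2: 2 more of the 6, then 5

  printf "DR               = %.6f per hour\n", DR
  printf "DLM   (mirror)   = %.3g\n", DLM
  printf "DLRZ2 (raidz2)   = %.3g\n", DLRZ2
  # Break-even: how many times faster the mirror must resilver than the
  # raidz2 for the two loss chances to be equal.
  printf "break-even ratio = %.0f\n", 1 / (30 * DR)
}'
```

Even with a 72-hour resilver, the raidz2 comes out roughly 4x less likely to lose the vdev than the 1-hour mirror under this model; the ~292x figure is the resilver-speed ratio at which the two would break even.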





The governing thing here is the second-order level of reliability (two more 
disks must fail during the window) for raidz2, vs. first-order for mirrors 
and raidz1.

Note that the above is second-order for raidz2 and first-order for 
mirror/raidz1 because we are working on the assumption that we have already 
lost one disk.

With raidz3, we would gain another factor of ( 1 / (.000114 * 4 disks 
remaining in the vdev) ), or about 2,000 times more reliability.




Now, the above does not include proper statistics: the chances of that 2nd 
and 3rd disk failing may be higher (even correlated) than our 'flat-line' 
%/hr. based on a 1-year MTBF, for instance if all the disks were purchased 
in the same lot at the same time, so their chances of failing around the 
same time are higher, etc.

Re: [zfs-discuss] [RFC] Backup solution

2010-10-08 Thread Bob Friesenhahn

On Thu, 7 Oct 2010, Edward Ned Harvey wrote:


If you're going raidz3, with 7 disks, then you might as well just make
mirrors instead, and eliminate the slow resilver.


While the math supports using raidz3, practicality (other than storage 
space) supports using mirrors.  Mirrors are just much more agile and 
easier to maintain.  Having one or two hot spares that zfs can 
resilver to right away will help improve mirrored pool reliability.
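Adding the spares is a one-liner; the pool and device names below are placeholders, not from this thread:

```shell
# Attach two hot spares to an existing pool (names are illustrative).
zpool add tank spare c7t0d0 c7t1d0
```

On Solaris, when a disk faults, the fault-management agent will normally kick in an available spare and begin resilvering to it automatically.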



Mirrors resilver enormously faster than raidzN.  At least for now, until
maybe one day the raidz resilver code might be rewritten.


The resilver algorithm is closely aligned with the zfs data storage 
model, so it is unlikely to improve dramatically.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/


Re: [zfs-discuss] [RFC] Backup solution

2010-10-08 Thread Edward Ned Harvey
> From: Peter Jeremy [mailto:peter.jer...@alcatel-lucent.com]
> Sent: Thursday, October 07, 2010 10:02 PM
> 
> On 2010-Oct-08 09:07:34 +0800, Edward Ned Harvey 
> wrote:
> >If you're going raidz3, with 7 disks, then you might as well just make
> >mirrors instead, and eliminate the slow resilver.
> 
> There is a difference in reliability:  raidzN means _any_ N disks can
> fail, whereas mirror means one disk in each mirror pair can fail.
> With a mirror, Murphy's Law says that the second disk to fail will be
> the pair of the first disk :-).

Maybe.  But in reality, you're just guessing the probability of a single
failure, the probability of multiple failures, and the probability of
multiple failures within the critical time window and critical redundancy
set.

The probability of a 2nd failure within the critical time window is smaller
whenever the critical time window is decreased, and the probability of that
failure being within the critical redundancy set is smaller whenever your
critical redundancy set is smaller.  So if raidz2 takes twice as long to
resilver as a mirror, and has a larger critical redundancy set, then you
haven't gained any probable resiliency over a mirror.

Although it's true that, with mirrors, it's possible for 2 disks to fail and
result in loss of the pool, I think the probability of that happening is smaller
than the probability of a 3-disk failure in the raidz2.

How much longer does a 7-disk raidz2 take to resilver as compared to a
mirror?  According to my calculations, it's in the vicinity of 10x longer.  



Re: [zfs-discuss] [RFC] Backup solution

2010-10-07 Thread Peter Jeremy
On 2010-Oct-08 09:07:34 +0800, Edward Ned Harvey  wrote:
>If you're going raidz3, with 7 disks, then you might as well just make
>mirrors instead, and eliminate the slow resilver.

There is a difference in reliability:  raidzN means _any_ N disks can
fail, whereas mirror means one disk in each mirror pair can fail.
With a mirror, Murphy's Law says that the second disk to fail will be
the pair of the first disk :-).

-- 
Peter Jeremy




Re: [zfs-discuss] [RFC] Backup solution

2010-10-07 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Ian Collins
> 
> I would seriously consider raidz3, given I typically see 80-100 hour
> resilver times for 500G drives in raidz2 vdevs.  If you haven't
> already,

If you're going raidz3, with 7 disks, then you might as well just make
mirrors instead, and eliminate the slow resilver.

Mirrors resilver enormously faster than raidzN.  At least for now, until
maybe one day the raidz resilver code might be rewritten.



Re: [zfs-discuss] [RFC] Backup solution

2010-10-07 Thread Ian Collins

On 10/ 8/10 11:22 AM, Scott Meilicke wrote:

Those must be pretty busy drives. I had a recent failure of a 1.5T disk in a 7 
disk raidz2 vdev that took about 16 hours to resilver. There was very little IO 
on the array, and it had maybe 3.5T of data to resilver.

On Oct 7, 2010, at 3:17 PM, Ian Collins wrote:
   

I would seriously consider raidz3, given I typically see 80-100 hour resilver 
times for 500G drives in raidz2 vdevs.
 





It's a backup staging server (a Thumper), so it's receiving a steady 
stream of snapshots and rsyncs (from Windows).  That's why it typically 
gets to 100% complete halfway through the actual resilver!


--
Ian.



Re: [zfs-discuss] [RFC] Backup solution

2010-10-07 Thread Scott Meilicke
Those must be pretty busy drives. I had a recent failure of a 1.5T disk in a 7 
disk raidz2 vdev that took about 16 hours to resilver. There was very little IO 
on the array, and it had maybe 3.5T of data to resilver.

On Oct 7, 2010, at 3:17 PM, Ian Collins wrote:  
> I would seriously consider raidz3, given I typically see 80-100 hour resilver 
> times for 500G drives in raidz2 vdevs.  If you haven't already, read Adam 
> Leventhal's paper:
> 
> http://queue.acm.org/detail.cfm?id=1670144
> 
> -- 
> Ian.
> 

Scott Meilicke





Re: [zfs-discuss] [RFC] Backup solution

2010-10-07 Thread Ian Collins

On 10/ 8/10 11:06 AM, Roy Sigurd Karlsbakk wrote:

- Original Message -
   

On 10/ 8/10 10:54 AM, Roy Sigurd Karlsbakk wrote:
 

Hi all

I'm setting up a couple of 110TB servers and I just want some
feedback in case I have forgotten something.

The servers (two of them) will, as of current plans, be using 11
VDEVs with 7 2TB WD Blacks each, with a couple of Crucial RealSSD
256GB SSDs for the L2ARC and another couple of 100GB OCZ Vertex 2
Pro for the SLOG (I know, it's way too much, but they will wear out
more slowly, and there aren't fast SSDs around that are small). There
will be 48 gigs of RAM for each box on recent Xeon CPUs.
   

What configuration are you proposing for the vdevs? Don't forget you
will have very long resilver times with those drives.
 

RAIDz2 on each VDEV. I'm aware that the resilver time will be worse than with 
10k or 15k drives, but then, those 2TB drives aren't available in anything 
faster than 7200 rpm.

   
I would seriously consider raidz3, given I typically see 80-100 hour 
resilver times for 500G drives in raidz2 vdevs.  If you haven't already, 
read Adam Leventhal's paper:


http://queue.acm.org/detail.cfm?id=1670144

--
Ian.



Re: [zfs-discuss] [RFC] Backup solution

2010-10-07 Thread Roy Sigurd Karlsbakk
- Original Message -
> On 10/ 8/10 10:54 AM, Roy Sigurd Karlsbakk wrote:
> > Hi all
> >
> > I'm setting up a couple of 110TB servers and I just want some
> > feedback in case I have forgotten something.
> >
> > The servers (two of them) will, as of current plans, be using 11
> > VDEVs with 7 2TB WD Blacks each, with a couple of Crucial RealSSD
> > 256GB SSDs for the L2ARC and another couple of 100GB OCZ Vertex 2
> > Pro for the SLOG (I know, it's way too much, but they will wear out
> > slowlier and there aren't fast SSDs around that are small). There
> > will be 48 gigs of RAM for each box on recent Xeon CPUs.
>
> What configuration are you proposing for the vdevs? Don't forget you
> will have very long resilver times with those drives.

RAIDz2 on each VDEV. I'm aware that the resilver time will be worse than with 
10k or 15k drives, but then, those 2TB drives aren't available in anything 
faster than 7200 rpm.
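For reference, one way a pool like that might be created, shown with only 2 of the 11 raidz2 vdevs and with the SSDs as a mirrored slog plus l2arc. All pool and device names are placeholders, not from this thread:

```shell
# Illustrative only: 7-disk raidz2 vdevs (x11), mirrored SSD slog, SSD l2arc.
zpool create tank \
  raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 \
  raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 \
  log mirror c8t0d0 c8t1d0 \
  cache c9t0d0 c9t1d0
```

Mirroring the slog protects in-flight synchronous writes against an SSD failure; the l2arc devices need no redundancy since they only cache data that also lives in the pool.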

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/


Re: [zfs-discuss] [RFC] Backup solution

2010-10-07 Thread Ian Collins

On 10/ 8/10 10:54 AM, Roy Sigurd Karlsbakk wrote:

Hi all

I'm setting up a couple of 110TB servers and I just want some feedback in case 
I have forgotten something.

The servers (two of them) will, as of current plans, be using 11 VDEVs with 7 
2TB WD Blacks each, with a couple of Crucial RealSSD 256GB SSDs for the L2ARC 
and another couple of 100GB OCZ Vertex 2 Pro for the SLOG (I know, it's way too 
much, but they will wear out more slowly, and there aren't fast SSDs around 
that are small). There will be 48 gigs of RAM for each box on recent Xeon CPUs.

   
What configuration are you proposing for the vdevs?  Don't forget you 
will have very long resilver times with those drives.


--
Ian.
