Re: [zfs-discuss] Large scale performance query

2011-08-09 Thread Richard Elling
On Aug 8, 2011, at 4:01 PM, Peter Jeremy wrote:

 On 2011-Aug-08 17:12:15 +0800, Andrew Gabriel andrew.gabr...@oracle.com 
 wrote:
 periodic scrubs to cater for this case. I do a scrub via cron once a 
 week on my home system. Having almost completely filled the pool, this 
 was taking about 24 hours. However, now that I've replaced the disks and 
 done a send/recv of the data across to a new larger pool which is only 
 1/3rd full, that's dropped down to 2 hours.
 
 FWIW, scrub time is more related to how fragmented a pool is, rather
 than how full it is.  My main pool is only at 61% (of 5.4TiB) and has
 never been much above that but has lots of snapshots and a fair amount
 of activity.  A scrub takes around 17 hours.

Don't forget, scrubs are throttled on later versions of ZFS.

In a former life, we did a study of when to scrub, and the answer was about once
a year for enterprise-grade storage. Once a week is OK for the paranoid.

 
 This is another area where the mythical block rewrite would help a lot.

Maybe; by then I'll be retired and fishing somewhere, scaring the children with
stories about how hard we had it back in the days when we stored data on
spinning platters :-)
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-08-08 Thread Andrew Gabriel

Alexander Lesle wrote:

And what is your suggestion for scrubbing a mirror pool?
Once per month, every two weeks, or every week?


There isn't just one answer.

For a pool with redundancy, you need to do a scrub just before the 
redundancy is lost, so you can be reasonably sure the remaining data is 
correct and can rebuild the redundancy.


The problem comes with knowing when this might happen. Of course, if you 
are doing some planned maintenance which will reduce the pool 
redundancy, then always do a scrub before that. However, in most cases, 
the redundancy is lost without prior warning, and you need to do 
periodic scrubs to cater for this case. I do a scrub via cron once a 
week on my home system. Having almost completely filled the pool, this 
was taking about 24 hours. However, now that I've replaced the disks and 
done a send/recv of the data across to a new larger pool which is only 
1/3rd full, that's dropped down to 2 hours.


For a pool with no redundancy, where you rely only on backups for 
recovery, the scrub needs to be integrated into the backup cycle, such 
that you will discover corrupt data before it has crept too far through 
your backup cycle to still be able to find a non-corrupt version of the data.


When you have a new hardware setup, I would perform scrubs more 
frequently as a further check that the hardware doesn't have any 
systemic problems, until you have gained confidence in it.
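
For anyone copying the weekly-scrub approach, the cron entry looks something
like the following minimal sketch (the pool name "tank" and the schedule are
placeholders; adjust to taste):

  # root crontab entry: scrub the pool every Sunday at 02:00
  0 2 * * 0 /usr/sbin/zpool scrub tank

Progress can be checked afterwards with "zpool status tank", which reports the
scrub in its scan line.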


--
Andrew Gabriel
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-08-08 Thread Peter Jeremy
On 2011-Aug-08 17:12:15 +0800, Andrew Gabriel andrew.gabr...@oracle.com wrote:
periodic scrubs to cater for this case. I do a scrub via cron once a 
week on my home system. Having almost completely filled the pool, this 
was taking about 24 hours. However, now that I've replaced the disks and 
done a send/recv of the data across to a new larger pool which is only 
1/3rd full, that's dropped down to 2 hours.

FWIW, scrub time is more related to how fragmented a pool is, rather
than how full it is.  My main pool is only at 61% (of 5.4TiB) and has
never been much above that but has lots of snapshots and a fair amount
of activity.  A scrub takes around 17 hours.

This is another area where the mythical block rewrite would help a lot.

-- 
Peter Jeremy


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-08-07 Thread Alexander Lesle
Hello Bob Friesenhahn and List,

On August 6, 2011, at 20:41, Bob Friesenhahn wrote in [1]:

 I think that this depends on the type of hardware you have, how much
 new data is written over a period of time, the typical I/O load on the
 server (i.e. does scrubbing impact usability?), and how critical the 
 data is to you.  Even power consumption and air conditioning can be a 
 factor since scrubbing is an intensive operation which will increase 
 power consumption.

Thanks, Bob, for answering.

The hardware is an SM board with a Xeon, 16 GB reg. RAM, an LSI 9211-8i HBA,
and 6x Hitachi 2 TB Deskstar 5K3000 (HDS5C3020ALA632) drives.
The server stands in the basement at 32°C.
The HDs are filled to 80% and the workload is mostly reading.

What's best: scrubbing every week, every second week, or once a month?

-- 
Best Regards
Alexander
August 7, 2011

[1] mid:alpine.gso.2.01.1108061337570.1...@freddy.simplesystems.org


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-08-07 Thread Roy Sigurd Karlsbakk
 The hardware is an SM board with a Xeon, 16 GB reg. RAM, an LSI 9211-8i HBA,
 and 6x Hitachi 2 TB Deskstar 5K3000 (HDS5C3020ALA632) drives.
 The server stands in the basement at 32°C.
 The HDs are filled to 80% and the workload is mostly reading.
 
 What's best: scrubbing every week, every second week, or once a month?

Generally, you can't scrub too often. If you have a set of striped mirrors, the 
scrub shouldn't take too long. The extra stress on the drives during a scrub 
shouldn't matter much; drives are made to be used. By the way, 32°C is a bit 
high for most servers. Could you check the drive temperatures with smartctl or 
ipmi tools?
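
For example, something along these lines - an untested sketch, where the device
paths, the -d option, and the IPMI sensor names are placeholders that depend on
your controller and chassis:

  # per-drive temperature via SMART (Solaris rdsk paths shown as an example)
  for d in /dev/rdsk/c1t*d0; do
      echo "== $d"
      smartctl -A -d sat "$d" | grep -i temperature
  done

  # or ask the chassis management controller for its temperature sensors
  ipmitool sdr type Temperature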

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It 
is an elementary imperative for all pedagogues to avoid excessive use of idioms 
of foreign origin. In most cases, adequate and relevant synonyms exist in 
Norwegian.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-08-07 Thread Alexander Lesle
Hello Roy Sigurd Karlsbakk and List,

On August 7, 2011, at 19:27, Roy Sigurd Karlsbakk wrote in [1]:

 Generally, you can't scrub too often. If you have a set of striped
 mirrors, the scrub shouldn't take too long. The extra stress on the
 drives during a scrub shouldn't matter much; drives are made to be
 used. By the way, 32°C is a bit high for most servers. Could you
 check the drive temperatures with smartctl or ipmi tools?

Thanks, Roy, for answering.
The temperatures are between 27°C and 31°C, checked with smartctl.
At the moment I scrub every Sunday.
-- 
Best Regards
Alexander
August 8, 2011

[1] mid:8906420.12.1312738051044.JavaMail.root@zimbra


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-08-06 Thread Orvar Korvar
Ok, so mirrors resilver faster.

But it is not uncommon that another disk shows problems during a resilver (for 
instance, r/w errors); this scenario would mean your entire raid is gone, right? 
If you are using mirrors, and one disk crashes and you start the resilver, then 
the other disk shows r/w errors because of the increased load - then you are 
screwed? Because large disks take a long time to resilver, possibly weeks?

In that case, it would be preferable to use mirrors with 3 disks in each vdev - 
tri-mirrors. Each vdev should be one raidz3.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-08-06 Thread Mark Sandrock
Shouldn't the choice of RAID type also
be based on the I/O requirements?

Anyway, with RAID-10, even a second
failed disk is not catastrophic, so long
as it is not the counterpart of the first
failed disk, no matter the number of disks.
(With 2-way mirrors.)

But that's why we do backups, right?

Mark

Sent from my iPhone

On Aug 6, 2011, at 7:01 AM, Orvar Korvar knatte_fnatte_tja...@yahoo.com wrote:

 Ok, so mirrors resilver faster.
 
 But it is not uncommon that another disk shows problems during a resilver (for 
 instance, r/w errors); this scenario would mean your entire raid is gone, 
 right? If you are using mirrors, and one disk crashes and you start the 
 resilver, then the other disk shows r/w errors because of the increased load - 
 then you are screwed? Because large disks take a long time to resilver, 
 possibly weeks?
 
 In that case, it would be preferable to use mirrors with 3 disks in each 
 vdev - tri-mirrors. Each vdev should be one raidz3.
 -- 
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-08-06 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Orvar Korvar
 
 Ok, so mirrors resilver faster.
 
 But it is not uncommon that another disk shows problems during a resilver (for
 instance, r/w errors); this scenario would mean your entire raid is gone,
 right?

Imagine you have 8 disks configured as 4x 2-way mirrors.  Capacity of 4 disks.
Imagine for comparison you have 6 disks configured as raidz2.  Capacity of 4
disks.
Imagine, in the event of a disk failure, the mirrored configuration resilvers
4x faster, which is a good estimate because each mirrored vdev has 1/4 as many
objects on it.

Yes, it's possible for two disk failures to destroy the mirrored configuration,
if they happen to both be partners of each other.  But the probability of a
2nd disk failure being the partner of the first failed disk is only 1/7, and
it only results in pool failure if it occurs within the resilver window,
which is 4x less probable.

You can work out the probabilities, but suffice it to say, the probability
of pool failure using the mirrored configuration is not dramatically
different from the probability of pool failure with the raidz configuration.
If you want to know the precise probabilities, you have to fill in all the
variables...  Number of drives in each configuration, resilver times, MTTDL
for each drive...  etc.  Sometimes the mirrors are more reliable, sometimes
the raid is more reliable.

Performance of the mirrors is always equal or better than performance of the
raidz.  Cost of the mirrors is always equal or higher than the cost of the
raidz.


 If you are using mirrors, and one disk crashes and you start the resilver,
 then the other disk shows r/w errors because of the increased load - then you
 are screwed? Because large disks take a long time to resilver, possibly weeks?

If one disk fails in a mirror, then one disk has increased load.
If one disk fails in a raidz, then N disks have increased load.  So no, I
don't think this is a solid argument against mirrors.   ;-)

Incidentally, large disks only take weeks to resilver in a large raid
configuration.  That never happens in a mirrored configuration.  ;-)


 In that case, it would be preferable to use mirrors with 3 disks in each
 vdev - tri-mirrors. Each vdev should be one raidz3.

If I'm not mistaken, a 3-way mirror is not implemented behind the scenes in
the same way as a 3-disk raidz3.  You should use a 3-way mirror instead of a
3-disk raidz3.
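
For reference, creating a pool of 3-way mirrors is just a matter of listing
three devices per mirror vdev - a sketch with placeholder device names:

  # two vdevs, each a 3-way mirror (six disks total)
  zpool create tank \
      mirror c0t0d0 c0t1d0 c0t2d0 \
      mirror c0t3d0 c0t4d0 c0t5d0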

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-08-06 Thread Rob Cohen
I may have RAIDZ reading wrong here.  Perhaps someone could clarify.

For a read-only workload, does each RAIDZ drive act like a stripe, similar to 
RAID5/6?  Do they have independent queues?

It would seem that there is no escaping read/modify/write operations for 
sub-block writes, forcing the RAIDZ group to act like a single stripe.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-08-06 Thread Rob Cohen
RAIDZ has to rebuild data by reading all drives in the group, and 
reconstructing from parity.  Mirrors simply copy a drive.

Compare 3 TB mirrors vs. a 9x 3 TB RAIDZ2.

Mirrors:
Read 3 TB
Write 3 TB

RAIDZ2:
Read 24 TB
Reconstruct data on CPU
Write 3 TB

In this case, RAIDZ is at least 8x slower to resilver (assuming CPU work and 
writing happen in parallel).  In the meantime, performance for the array is 
severely degraded for RAIDZ, but not for mirrors.

Aside from resilvering, for many workloads, I have seen over 10x (!) better 
performance from mirrors.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-08-06 Thread Rob Cohen
 I may have RAIDZ reading wrong here.  Perhaps someone
 could clarify.
 
 For a read-only workload, does each RAIDZ drive act
 like a stripe, similar to RAID5/6?  Do they have
 independent queues?
 
 It would seem that there is no escaping
 read/modify/write operations for sub-block writes,
 forcing the RAIDZ group to act like a single stripe.

Can RAIDZ even do a partial block read?  Perhaps it needs to read the full 
block (from all drives) in order to verify the checksum.  If so, then RAIDZ 
groups would always act like one stripe, unlike RAID5/6.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-08-06 Thread Bob Friesenhahn

On Sat, 6 Aug 2011, Orvar Korvar wrote:


Ok, so mirrors resilver faster.

 But it is not uncommon that another disk shows problems during a 
 resilver (for instance, r/w errors); this scenario would mean your 
 entire raid is gone, right? If you are using mirrors, and one disk 
 crashes and you start the resilver, then the other disk shows r/w errors


Those using mirrors or raidz1 are best advised to perform periodic 
scrubs.  This helps avoid future media read errors and also helps 
flush out failing hardware.


Regardless, it is true that two hard failures can take out your whole 
pool.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-08-06 Thread Bob Friesenhahn

On Sat, 6 Aug 2011, Rob Cohen wrote:


I may have RAIDZ reading wrong here.  Perhaps someone could clarify.

For a read-only workload, does each RAIDZ drive act like a stripe, 
 similar to RAID5/6?  Do they have independent queues?


They act like a stripe, as in RAID5/6.

It would seem that there is no escaping read/modify/write operations 
for sub-block writes, forcing the RAIDZ group to act like a single 
stripe.


True.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-08-06 Thread Bob Friesenhahn

On Sat, 6 Aug 2011, Rob Cohen wrote:


Can RAIDZ even do a partial block read?  Perhaps it needs to read 
the full block (from all drives) in order to verify the checksum. 
If so, then RAIDZ groups would always act like one stripe, unlike 
RAID5/6.


ZFS does not do partial block reads/writes.  It must read the whole 
block in order to validate the checksum.  If there is a checksum 
failure, then RAID5 type algorithms are used to produce a corrected 
block.


For this reason, it is wise to make sure that the zfs filesystem 
blocksize is appropriate for the task, and make sure that the system 
has sufficient RAM that the zfs ARC can cache enough data that it does 
not need to re-read from disk for recently accessed files.
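
As a concrete example (the dataset name and value are placeholders - match the
record size to the application's I/O size, e.g. a database page size), on
Solaris-derived systems something like:

  # set a smaller record size on a dataset that does small random I/O
  zfs set recordsize=16K tank/db

  # check how much data the ARC is currently holding
  kstat -p zfs:0:arcstats:size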


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-08-06 Thread Rob Cohen
Thanks for clarifying.

If a block is spread across all drives in a RAIDZ group, and there are no 
partial block reads, how can each drive in the group act like a stripe?  Many 
RAID5/6 implementations can do partial block reads, allowing for parallel 
random reads across drives (as long as there are no writes in the queue).

Perhaps you are saying that they act like stripes for bandwidth purposes, but 
not for read ops/sec?
-Rob

-Original Message-
From: Bob Friesenhahn [mailto:bfrie...@simple.dallas.tx.us] 
Sent: Saturday, August 06, 2011 11:41 AM
To: Rob Cohen
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Large scale performance query

On Sat, 6 Aug 2011, Rob Cohen wrote:

 Can RAIDZ even do a partial block read?  Perhaps it needs to read the 
 full block (from all drives) in order to verify the checksum.
 If so, then RAIDZ groups would always act like one stripe, unlike 
 RAID5/6.

ZFS does not do partial block reads/writes.  It must read the whole block in 
order to validate the checksum.  If there is a checksum failure, then RAID5 
type algorithms are used to produce a corrected block.

For this reason, it is wise to make sure that the zfs filesystem blocksize is 
appropriate for the task, and make sure that the system has sufficient RAM that 
the zfs ARC can cache enough data that it does not need to re-read from disk 
for recently accessed files.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-08-06 Thread Alexander Lesle
Hello Bob Friesenhahn and List,

On August 6, 2011, at 18:34, Bob Friesenhahn wrote in [1]:

 Those using mirrors or raidz1 are best advised to perform periodic
 scrubs.  This helps avoid future media read errors and also helps 
 flush out failing hardware.

And what is your suggestion for scrubbing a mirror pool?
Once per month, every two weeks, or every week?

 Regardless, it is true that two hard failures can take out your whole 
 pool.

In RAIDZ1, but not in a mirror?

-- 
Best Regards
Alexander
August 6, 2011

[1] mid:alpine.gso.2.01.1108061131170.1...@freddy.simplesystems.org


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-08-06 Thread Alexander Lesle
Hello Rob Cohen and List,

On August 6, 2011, at 17:32, Rob Cohen wrote in [1]:

 In this case, RAIDZ is at least 8x slower to resilver (assuming CPU
 and writing happen in parallel).  In the mean time, performance for
 the array is severely degraded for RAIDZ, but not for mirrors.

 Aside from resilvering, for many workloads, I have seen over 10x
 (!) better performance from mirrors.

Horrible.
My little pool needs more than 8 hours for a scrub with no workload.
The pool has six Hitachi 2 TB drives:

# zpool status archepool
  pool: archepool
 state: ONLINE
  scan: scrub repaired 0 in 8h14m with 0 errors on Sun Jul 31 19:14:47 2011
config:

        NAME                         STATE     READ WRITE CKSUM
        archepool                    ONLINE       0     0     0
          mirror-0                   ONLINE       0     0     0
            c1t50024E9003CE0317d0    ONLINE       0     0     0
            c1t50024E9003CF7685d0    ONLINE       0     0     0
          mirror-1                   ONLINE       0     0     0
            c1t50024E9003CE031Bd0    ONLINE       0     0     0
            c1t50024E9003CE0368d0    ONLINE       0     0     0
          mirror-2                   ONLINE       0     0     0
            c1t5000CCA369CA262Bd0    ONLINE       0     0     0
            c1t5000CCA369CBF60Cd0    ONLINE       0     0     0

errors: No known data errors

How much time would the thread opener need with his config?
 Technical Specs:
 216x 3TB 7k3000 HDDs
 24x 9 drive RAIDZ3

I suspect the resilver would need weeks, and the chance that a second or
third HD crashes in that time is high. Murphy's Law.

-- 
Best Regards
Alexander
August 6, 2011

[1] mid:1688088365.31312644757091.JavaMail.Twebapp@sf-app1


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-08-06 Thread Roy Sigurd Karlsbakk
 How much time would the thread opener need with his config?
  Technical Specs:
  216x 3TB 7k3000 HDDs
  24x 9 drive RAIDZ3
 
 I suspect the resilver would need weeks, and the chance that a second or
 third HD crashes in that time is high. Murphy's Law.

With a full pool, perhaps a couple of weeks, but unless the pool is full 
(something that's strongly discouraged), a few days should do. I'm currently 
replacing WD drives on a server with four 9-drive RAIDZ2 vdevs, and it takes 
about two days. A single drive replacement takes about 24 hours.
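
For the archives, the replacement itself is a one-liner per drive, and the
resilver can be watched from zpool status (the pool and device names below are
placeholders):

  # swap the old disk for the new one and start the resilver
  zpool replace tank c2t3d0 c2t9d0

  # watch progress; the scan line reports speed and estimated time to go
  zpool status -v tank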

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It 
is an elementary imperative for all pedagogues to avoid excessive use of idioms 
of foreign origin. In most cases, adequate and relevant synonyms exist in 
Norwegian.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-08-06 Thread Bob Friesenhahn

On Sat, 6 Aug 2011, Rob Cohen wrote:


Perhaps you are saying that they act like stripes for bandwidth purposes, but 
not for read ops/sec?


Exactly.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-08-06 Thread Bob Friesenhahn

On Sat, 6 Aug 2011, Alexander Lesle wrote:



Those using mirrors or raidz1 are best advised to perform periodic
scrubs.  This helps avoid future media read errors and also helps
flush out failing hardware.


And what is your suggestion for scrubbing a mirror pool?
Once per month, every two weeks, or every week?


I think that this depends on the type of hardware you have, how much 
new data is written over a period of time, the typical I/O load on the 
server (i.e. does scrubbing impact usability?), and how critical the 
data is to you.  Even power consumption and air conditioning can be a 
factor since scrubbing is an intensive operation which will increase 
power consumption.


Written data which has not been scrubbed at least once becomes subject 
to the possibility that it was not written correctly in the first 
place.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-08-06 Thread Rob Cohen
 If I'm not mistaken, a 3-way mirror is not implemented behind the scenes
 in the same way as a 3-disk raidz3.  You should use a 3-way mirror
 instead of a 3-disk raidz3.

RAIDZ2 requires at least 4 drives, and RAIDZ3 requires at least 5 drives.  But, 
yes, a 3-way mirror is implemented totally differently.  Mirrored drives have 
identical copies of the data.  RAIDZ drives store the data once, plus parity 
data.  A 3-way mirror gives imporved redundancy and read performance, but at a 
high capacity cost, and slower writes than a 2-way mirror.

It's more common to do 2-way mirrors + hot spare.  This gives comparable 
protection to RAIDZ2, but with MUCH better performance.
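
A sketch of that layout, with placeholder device names - four 2-way mirrors
plus a shared hot spare:

  zpool create tank \
      mirror c0t0d0 c0t1d0 \
      mirror c0t2d0 c0t3d0 \
      mirror c0t4d0 c0t5d0 \
      mirror c0t6d0 c0t7d0 \
      spare  c0t8d0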

Of course, mirrors cost more capacity, but it helps that ZFS's compression and 
thin provisioning can often offset the loss in capacity, without sacrificing 
performance (especially when used in combination with L2ARC).
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-08-05 Thread Orvar Korvar
Are mirrors really a realistic alternative? I mean, if I have to resilver a raid 
with 3 TB discs, it can take days, I suspect. With 4 TB disks it can take a week, 
maybe. So if I use a mirror and one disk breaks, then I have no redundancy left 
while the mirror repairs. The repair will take a long time and will stress the 
disks, which means the other disk might malfunction.

Therefore, I think raidz2 or raidz3, which allow 2 or 3 disks to break while you 
resilver, are preferable. Hence, mirrors are not a realistic alternative when 
using large disks.

True/false? What do you guys say?
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-08-05 Thread Ian Collins

 On 08/06/11 10:42 AM, Orvar Korvar wrote:

Are mirrors really a realistic alternative?


To what?  Some context would be helpful.


I mean, if I have to resilver a raid with 3 TB discs, it can take days, I 
suspect. With 4 TB disks it can take a week, maybe. So if I use a mirror and one 
disk breaks, then I have no redundancy left while the mirror repairs. The repair 
will take a long time and will stress the disks, which means the other disk 
might malfunction.

Therefore, I think raidz2 or raidz3, which allow 2 or 3 disks to break while you 
resilver, are preferable. Hence, mirrors are not a realistic alternative when 
using large disks.

True/false? What do you guys say?


I don't have any exact like-for-like comparison data, but from what I've 
seen a mirror resilvers a lot faster than a drive in a raidz(2) vdev.


--
Ian.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-08-05 Thread Rob Cohen
Generally, mirrors resilver MUCH faster than RAIDZ, and you only lose 
redundancy on that stripe, so combined, you're much closer to RAIDZ2 odds than 
you might think, especially with hot spare(s), which I'd recommend.

When you're talking about IOPS, each stripe can support 1 simultaneous user.

Writing:
Each RAIDZ group = 1 stripe.
Each mirror group = 1 stripe.
So, 216 drives can be 24 stripes or 108 stripes.

Reading:
Each RAIDZ group = 1 stripe.
Each mirror group = 1 stripe per drive.
So, 216 drives can be 24 stripes or 216 stripes.

Actually, reads from mirrors are even more efficient than reads from stripes, 
because the software can optimally load balance across mirrors.
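
An easy way to see this on a running pool is to watch per-device read counts
while clients are active (pool name is a placeholder):

  # per-vdev / per-disk I/O, refreshed every 5 seconds
  zpool iostat -v tank 5

On mirrors, the read ops spread across both sides of each mirror; on a raidz
group, every disk in the group is touched for each block read.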

So, back to the original poster's question, 9 stripes might be enough to 
support 5 clients, but 216 stripes could support many more.

Actually, this is an area where RAID5/6 has an advantage over RAIDZ, if I 
understand correctly, because for RAID5/6 on read-only workloads, each drive 
acts like a stripe.  For workloads with writing, though, RAIDZ is significantly 
faster than RAID5/6, but mirrors/RAID10 give the best performance for all 
workloads.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-08-04 Thread Rob Cohen
Try mirrors.  You will get much better multi-user performance, and you can 
easily split the mirrors across enclosures.

If your priority is performance over capacity, you could experiment with n-way 
mirrors, since more mirrors will load balance reads better than more stripes.
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-07-31 Thread Evgueni Martynov

On 25/07/2011 2:34 AM, Phil Harrison wrote:

Hi All,

Hoping to gain some insight from some people who have done large scale systems
before? I'm hoping to get some performance estimates, suggestions and/or general
discussion/feedback. I cannot discuss the exact specifics of the purpose but
will go into as much detail as I can.

Technical Specs:
216x 3TB 7k3000 HDDs
24x 9 drive RAIDZ3
4x JBOD Chassis (45 bay)
1x server (36 bay)
2x AMD 12 Core CPU
128GB ECC RAM
2x 480GB SSD Cache
10Gbit NIC

Workloads:

Mainly streaming compressed data. That is, pulling compressed data in a
sequential manner; however, there could be multiple streams happening at once,
making it somewhat random. We are hoping to have 5 clients pull 500 Mbit/s
sustained.

Considerations:

The main reason RAIDZ3 was chosen was so we can distribute the parity across
the JBOD enclosures. With this method, even if an entire JBOD enclosure is
taken offline, the data is still accessible.


What kind of 45-bay enclosures?
Have you tested this by taking an enclosure out?

Thanks
Evgueni


Questions:

How to manage the physical locations of such a vast number of drives? I have 
read this 
(http://blogs.oracle.com/eschrock/entry/external_storage_enclosures_in_solaris) 
and am hoping someone can shed some light on whether the SES2 enclosure 
identification has worked for them? (enclosures are SES2)

What kind of performance would you expect from this setup? I know we can 
multiply the base IOPS by 24, but what about max sequential read/write?

Thanks,

Phil



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-07-26 Thread Rocky Shek
Phil,

 

Recently, we built a large configuration on a 4-way Xeon server with 8x 4U
24-bay JBODs. We are using 2x LSI 6160 SAS switches so we can easily expand
the storage in the future.

 

1)  If you are planning to expand your storage, you should consider
using an LSI SAS switch for easy future expansion.

2)  We carefully pick one HD from each JBOD to create each RAIDZ2, so we can
lose two JBODs at the same time while the data is still accessible. It is good
to know you have the same idea.

3)  Seq. read/write is currently limited by the 10G NIC. Local storage can
easily hit 1500 MB/s+ with even a small number of HDs. Again, 10G is the
bottleneck.

4)  I recommend you use native SAS HDs in a large scale system if possible.
Native SAS HDs work better.

5)  We are using DSM to locate failed disks and monitor the FRUs of the JBODs:
http://dataonstorage.com/dsm.

 

I hope the above points can help

 

The configuration is similar to configuration 3 in the following link:
http://dataonstorage.com/dataon-solutions/lsi-6gb-sas-switch-sas6160-storage.html

 

Technical Specs:

DNS-4800 4-way Intel Xeon 7550 server with 256 GB RAM 

2x LSI 9200-8E HBA

2x LSI 6160 SAS Switch

8x DNS-1600 4U 24-bay JBODs (dual I/O with MPxIO) with 2 TB Seagate SAS HDs in RAIDZ2

STEC Zeus RAM for ZIL

Intel 320 SSD for L2ARC   

10G NIC

 

Rocky 

 

From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Phil Harrison
Sent: Sunday, July 24, 2011 11:34 PM
To: zfs-discuss@opensolaris.org
Subject: [zfs-discuss] Large scale performance query

 

Hi All,

 

Hoping to gain some insight from some people who have done large scale
systems before? I'm hoping to get some performance estimates, suggestions
and/or general discussion/feedback. I cannot discuss the exact specifics of
the purpose but will go into as much detail as I can.

 

Technical Specs:

216x 3TB 7k3000 HDDs

24x 9 drive RAIDZ3

4x JBOD Chassis (45 bay)

1x server (36 bay)

2x AMD 12 Core CPU

128GB ECC RAM

2x 480GB SSD Cache

10Gbit NIC

 

Workloads:

 

Mainly streaming compressed data. That is, pulling compressed data in a
sequential manner; however, there could be multiple streams happening at once,
making it somewhat random. We are hoping to have 5 clients pull 500 Mbit/s
sustained. 

 

Considerations:

 

The main reason RAIDZ3 was chosen was so we can distribute the parity across
the JBOD enclosures. With this method, even if an entire JBOD enclosure is
taken offline, the data is still accessible. 

 

Questions:

 

How to manage the physical locations of such a vast number of drives? I have
read this
(http://blogs.oracle.com/eschrock/entry/external_storage_enclosures_in_solaris)
and am hoping someone can shed some light on whether the SES2 enclosure
identification has worked for them? (enclosures are SES2)

 

What kind of performance would you expect from this setup? I know we can
multiply the base IOPS by 24, but what about max sequential read/write?


Thanks, 

 

Phil

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-07-25 Thread Orvar Korvar
Wow. If you ever finish this monster, I would really like to hear more about 
the performance and how you connected everything. Could be useful as a 
reference for anyone else building big stuff.

*drool*
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-07-25 Thread Roberto Waltman

Phil Harrison wrote:
 Hi All,

 Hoping to gain some insight from some people who have done large scale
 systems before? I'm hoping to get some performance estimates, suggestions
 and/or general discussion/feedback.

No personal experience, but you may find this useful:
Petabytes on a budget

http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/

-- 

Roberto Waltman

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-07-25 Thread Tiernan OToole
They don't go into too much detail on their setup, and they are not running
Solaris, but they do mention how their SATA cards see different drives,
based on where they are placed. They also have a second revision at
http://blog.backblaze.com/2011/07/20/petabytes-on-a-budget-v2-0revealing-more-secrets/
which talks about building their system with 135 TB in a single 45-bay 4U box...

I am also interested in this kind of scale... Looking at the Backblaze box,
I am thinking of building something like this, but not in one go... so,
anything you do find out in your build, keep us informed! :)

--Tiernan

On Mon, Jul 25, 2011 at 4:25 PM, Roberto Waltman li...@rwaltman.com wrote:


 Phil Harrison wrote:
  Hi All,
 
  Hoping to gain some insight from some people who have done large scale
  systems before? I'm hoping to get some performance estimates, suggestions
  and/or general discussion/feedback.

 No personal experience, but you may find this useful:
 Petabytes on a budget


 http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/

 --

 Roberto Waltman

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




-- 
Tiernan O'Toole
blog.lotas-smartman.net
www.geekphotographer.com
www.tiernanotoole.ie
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-07-25 Thread Brandon High
On Sun, Jul 24, 2011 at 11:34 PM, Phil Harrison philha...@gmail.com wrote:

 What kind of performance would you expect from this setup? I know we can
 multiply the base IOPS by 24, but what about max sequential read/write?


You should have a theoretical max close to 144x single-disk throughput. Each
raidz3 has 6 data drives which can be read from simultaneously, multiplied
by your 24 vdevs. Of course, you'll hit your controllers' limits well before
that.

Even with a controller per JBOD, you'll be limited by the SAS connection.
The 7k3000 has throughput from 115 - 150 MB/s, meaning each of your JBODs
will be capable of 5.2 GB/sec - 6.8 GB/sec, roughly 10 times the bandwidth
of a single SAS 6g connection. Use multipathing if you can to increase the
bandwidth to each JBOD.

Depending on the types of access that clients are performing, your cache
devices may not be any help. If the data is read multiple times by multiple
clients, then you'll see some benefit. If it's only being read infrequently
or by one client, it probably won't help much at all. That said, if your
access is mostly sequential then random access latency shouldn't affect you
too much, and you will still have more bandwidth from your main storage
pools than from the cache devices.
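
Back-of-envelope check of those numbers (this just re-derives the figures
quoted above, it isn't a measurement):

  awk 'BEGIN {
      low = 115; high = 150                        # MB/s per 7k3000, as quoted above
      printf "144 data drives:  %.1f - %.1f GB/s\n", 144*low/1000, 144*high/1000
      printf "45-drive JBOD:    %.1f - %.1f GB/s\n",  45*low/1000,  45*high/1000
      printf "one 6Gb SAS lane: ~0.6 GB/s\n"
  }'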

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-07-25 Thread Roy Sigurd Karlsbakk
 Workloads:
 
 Mainly streaming compressed data. That is, pulling compressed data in
 a sequential manner; however, there could be multiple streams happening at
 once, making it somewhat random. We are hoping to have 5 clients pull
 500 Mbit/s sustained.

That shouldn't be much of a problem with that number of drives. I have a couple 
of smaller setups with 11x 7-drive raidz2, about 100 TiB each, and even they can 
handle a 2.5 Gbps load.

 Considerations:
 
 The main reason RAIDZ3 was chosen was so we can distribute the parity
 across the JBOD enclosures. With this method, even if an entire JBOD
 enclosure is taken offline, the data is still accessible.

Sounds like a good idea to me.

 How to manage the physical locations of such a vast number of drives?
 I have read this (
 http://blogs.oracle.com/eschrock/entry/external_storage_enclosures_in_solaris
 ) and am hoping someone can shed some light on whether the SES2 enclosure
 identification has worked for them? (enclosures are SES2)

Which enclosures will you be using? From the data you've posted, it looks like 
SuperMicro, and AFAIK the ones we have don't support SES2.

 What kind of performance would you expect from this setup? I know we
 can multiply the base IOPS by 24, but what about max sequential
 read/write?

Parallel read/write from several clients will look like random I/O on the 
server. If bandwidth is crucial, use RAID1+0.

Also, it looks to me like you're planning to fill up all external bays with data 
drives - where do you plan to put the root? If you're looking at the SuperMicro 
SC847 line, there's indeed room for a couple of 2.5" drives inside, but the 
chassis is screwed tightly together and doesn't allow for opening during 
runtime. Also, those drives are placed in a rather awkward slot.

If planning to use RAIDZ, a couple of SSDs for the SLOG will help write 
performance a lot, especially during scrub/resilver. For streaming, L2ARC won't 
be of much use, though.

Finally, a few spares won't hurt, even with redundancy levels as high as RAIDZ3. 
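
For completeness, adding those after the fact is straightforward (device names
are placeholders; mirror the SLOG since it holds in-flight writes):

  zpool add tank log mirror c3t0d0 c3t1d0     # mirrored SLOG
  zpool add tank cache c3t2d0                 # L2ARC (of limited use here)
  zpool add tank spare c4t0d0 c4t1d0          # hot spares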

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It 
is an elementary imperative for all pedagogues to avoid excessive use of idioms 
of foreign origin. In most cases, adequate and relevant synonyms exist in 
Norwegian.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large scale performance query

2011-07-25 Thread Roy Sigurd Karlsbakk
 Even with a controller per JBOD, you'll be limited by the SAS
 connection. The 7k3000 has throughput from 115 - 150 MB/s, meaning
 each of your JBODs will be capable of 5.2 GB/sec - 6.8 GB/sec, roughly
 10 times the bandwidth of a single SAS 6g connection. Use multipathing
 if you can to increase the bandwidth to each JBOD.

With (something like) the LSI 9211 and those SuperMicro babies I guess he's 
planning on using, you'll have one quad-port SAS2 cable to each backplane/SAS 
expander, one in front and one in the back, meaning a theoretical 24 Gbps (or 
2.4 GB/s) to each backplane. With a maximum of 24 drives per backplane, this 
should probably suffice, since you'll never get 150 MB/s sustained from all 
drives.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It 
is an elementary imperative for all pedagogues to avoid excessive use of idioms 
of foreign origin. In most cases, adequate and relevant synonyms exist in 
Norwegian.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss