Re: [zfs-discuss] Large scale performance query

2011-08-06 Thread Rob Cohen
> If I'm not mistaken, a 3-way mirror is not
> implemented behind the scenes in
> the same way as a 3-disk raidz3.  You should use a
> 3-way mirror instead of a
> 3-disk raidz3.

RAIDZ2 requires at least 4 drives, and RAIDZ3 requires at least 5 drives.  But, 
yes, a 3-way mirror is implemented totally differently.  Mirrored drives have 
identical copies of the data.  RAIDZ drives store the data once, plus parity 
data.  A 3-way mirror gives improved redundancy and read performance, but at a 
high capacity cost, and slower writes than a 2-way mirror.

It's more common to do 2-way mirrors + hot spare.  This gives comparable 
protection to RAIDZ2, but with MUCH better performance.

Of course, mirrors cost more capacity, but it helps that ZFS's compression and 
thin provisioning can often offset the loss in capacity, without sacrificing 
performance (especially when used in combination with L2ARC).
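
As a rough sketch (device names here are invented), a pool of 2-way mirrors with a 
hot spare and compression enabled would look something like:

  # zpool create tank \
        mirror c1t0d0 c1t1d0 \
        mirror c1t2d0 c1t3d0 \
        spare  c1t4d0
  # zfs set compression=on tank

More mirror pairs can be appended to the same "zpool create" line, and L2ARC devices 
added later with "zpool add tank cache <device>".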


Re: [zfs-discuss] Large scale performance query

2011-08-06 Thread Rob Cohen
Thanks for clarifying.

If a block is spread across all drives in a RAIDZ group, and there are no 
partial block reads, how can each drive in the group act like a stripe?  Many 
RAID5&6 implementations can do partial block reads, allowing for parallel 
random reads across drives (as long as there are no writes in the queue).

Perhaps you are saying that they act like stripes for bandwidth purposes, but 
not for read ops/sec?
-Rob

-Original Message-
From: Bob Friesenhahn [mailto:bfrie...@simple.dallas.tx.us] 
Sent: Saturday, August 06, 2011 11:41 AM
To: Rob Cohen
Cc: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] Large scale performance query

On Sat, 6 Aug 2011, Rob Cohen wrote:
>
> Can RAIDZ even do a partial block read?  Perhaps it needs to read the 
> full block (from all drives) in order to verify the checksum.
> If so, then RAIDZ groups would always act like one stripe, unlike 
> RAID5/6.

ZFS does not do partial block reads/writes.  It must read the whole block in 
order to validate the checksum.  If there is a checksum failure, then RAID5 
type algorithms are used to produce a corrected block.

For this reason, it is wise to make sure that the zfs filesystem blocksize is 
appropriate for the task, and make sure that the system has sufficient RAM that 
the zfs ARC can cache enough data that it does not need to re-read from disk 
for recently accessed files.
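
(For illustration only, with a hypothetical dataset name: the per-dataset block size 
is tuned with the recordsize property, and ARC hit rates can be watched with kstat.)

  # zfs set recordsize=8k tank/db
  # kstat -n arcstats | egrep 'hits|misses'

Note that recordsize only affects files written after the property is changed.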

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/



Re: [zfs-discuss] Large scale performance query

2011-08-06 Thread Rob Cohen
> I may have RAIDZ reading wrong here.  Perhaps someone
> could clarify.
> 
> For a read-only workload, does each RAIDZ drive act
> like a stripe, similar to RAID5/6?  Do they have
> independent queues?
> 
> It would seem that there is no escaping
> read/modify/write operations for sub-block writes,
> forcing the RAIDZ group to act like a single stripe.

Can RAIDZ even do a partial block read?  Perhaps it needs to read the full 
block (from all drives) in order to verify the checksum.  If so, then RAIDZ 
groups would always act like one stripe, unlike RAID5/6.


Re: [zfs-discuss] Large scale performance query

2011-08-06 Thread Rob Cohen
RAIDZ has to rebuild data by reading all drives in the group, and 
reconstructing from parity.  Mirrors simply copy a drive.

Compare 3TB mirrors vs. 9x 3TB RAIDZ2.

Mirrors:
Read 3TB
Write 3TB

RAIDZ2:
Read 24TB
Reconstruct data on CPU
Write 3TB

In this case, RAIDZ is at least 8x slower to resilver (assuming the CPU 
reconstruction and writing happen in parallel).  In the meantime, performance for 
the array is severely degraded for RAIDZ, but not for mirrors.

Aside from resilvering, for many workloads, I have seen over 10x (!) better 
performance from mirrors.
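
A minimal sketch of what the repair looks like from the command line (pool and 
device names are made up); resilver progress shows up in "zpool status" either way:

  # zpool replace tank c2t5d0 c2t9d0
  # zpool status tank

For a mirror, the resilver only has to copy the surviving side of that one pair; for 
RAIDZ2 it has to read every remaining drive in the group.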


Re: [zfs-discuss] Large scale performance query

2011-08-06 Thread Rob Cohen
I may have RAIDZ reading wrong here.  Perhaps someone could clarify.

For a read-only workload, does each RAIDZ drive act like a stripe, similar to 
RAID5/6?  Do they have independent queues?

It would seem that there is no escaping read/modify/write operations for 
sub-block writes, forcing the RAIDZ group to act like a single stripe.


Re: [zfs-discuss] Large scale performance query

2011-08-05 Thread Rob Cohen
Generally, mirrors resilver MUCH faster than RAIDZ, and during a resilver you only 
lose redundancy on that one stripe.  Combined, that puts you much closer to RAIDZ2 
odds than you might think, especially with hot spare(s), which I'd recommend.

When you're talking about IOPS, each stripe can support one simultaneous user.

Writing:
Each RAIDZ group = 1 stripe.
Each mirror group = 1 stripe.
So, 216 drives can be 24 stripes or 108 stripes.

Reading:
Each RAIDZ group = 1 stripe.
Each mirror group = 1 stripe per drive.
So, 216 drives can be 24 stripes or 216 stripes.

Actually, reads from mirrors are even more efficient than reads from stripes, 
because the software can optimally load balance across mirrors.

So, back to the original poster's question, 9 stripes might be enough to 
support 5 clients, but 216 stripes could support many more.

Actually, this is an area where RAID5/6 has an advantage over RAIDZ, if I 
understand correctly, because for RAID5/6 on read-only workloads, each drive 
acts like a stripe.  For workloads with writing, though, RAIDZ is significantly 
faster than RAID5/6, but mirrors/RAID10 give the best performance for all 
workloads.
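
For concreteness, the two layouts would be created along these lines (device names 
invented, and only the first vdevs of each layout shown):

  # zpool create tank raidz2 d01 d02 d03 d04 d05 d06 d07 d08 d09 \
                 raidz2 d10 d11 d12 d13 d14 d15 d16 d17 d18 ...

  # zpool create tank mirror d01 d02 mirror d03 d04 mirror d05 d06 ...

Each "raidz2 ..." or "mirror ..." group on the command line becomes one top-level 
vdev, which is the unit ZFS stripes writes across.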


Re: [zfs-discuss] Large scale performance query

2011-08-04 Thread Rob Cohen
Try mirrors.  You will get much better multi-user performance, and you can 
easily split the mirrors across enclosures.

If your priority is performance over capacity, you could experiment with n-way 
mirrors, since more mirrors will load-balance reads better than more stripes.
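
For example (device names invented), a pool of 3-way mirrors, with each side of a 
mirror ideally coming from a different enclosure, is just:

  # zpool create tank mirror c1t0d0 c2t0d0 c3t0d0 \
                 mirror c1t1d0 c2t1d0 c3t1d0

Reads get load-balanced across all three sides of each mirror.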


Re: [zfs-discuss] problem adding second MD1000 enclosure to LSI 9200-16e

2011-01-10 Thread Rob Cohen
As a follow-up, I tried a SuperMicro enclosure (SC847E26-RJBOD1).  I have 3 
sets of 15 drives.  I got the same results when I loaded the second set of 
drives (going from 15 to 30).

Then, I tried changing the LSI 9200's BIOS setting for max INT 13 drives from 
24 (the default) to 15.  From then on, the SuperMicro enclosure worked fine, 
even with all 45 drives, and no kernel hangs.

I suspect that the BIOS setting would have worked with >1 MD1000 enclosure, but 
I never tested the MD1000s, after I had the SuperMicro enclosure running.

I'm not sure if the kernel hang with max INT 13 drives = 24 was a hardware problem, 
or a Solaris bug.
  - Rob

> I have 15x SAS drives in a Dell MD1000 enclosure,
> attached to an LSI 9200-16e.  This has been working
> well.  The system is booting off of internal drives,
> on a Dell SAS 6ir.
> 
> I just tried to add a second storage enclosure, with
> 15 more SAS drives, and I got a lockup during Loading
> Kernel.  I got the same results, whether I daisy
> chained the enclosures, or plugged them both directly
> into the LSI 9200.  When I removed the second
> enclosure, it booted up fine.
> 
> I also have an LSI MegaRAID 9280-8e I could use, but
> I don't know if there is a way to pass the drives
> through, without creating RAID0 virtual drives for
> each drive, which would complicate replacing disks.
> The 9280 boots up fine, and the system can see new
>  virtual drives.
> 
> Any suggestions?  Is there some sort of boot
> procedure, in order to get the system to recognize
> the second enclosure without locking up?  Is there a
> special way to configure one of these LSI boards?


Re: [zfs-discuss] problem adding second MD1000 enclosure to LSI 9200-16e

2010-11-21 Thread Rob Cohen
Markus,
I'm pretty sure that I have the MD1000 plugged in properly, especially since 
the same connection works on the 9280 and Perc 6/e.  It's not in split mode.

Thanks for the suggestion, though.


[zfs-discuss] problem adding second MD1000 enclosure to LSI 9200-16e

2010-11-21 Thread Rob Cohen
I have 15x SAS drives in a Dell MD1000 enclosure, attached to an LSI 9200-16e.  
This has been working well.  The system is booting off of internal drives, on 
a Dell SAS 6ir.

I just tried to add a second storage enclosure, with 15 more SAS drives, and I 
got a lockup during Loading Kernel.  I got the same results, whether I daisy 
chained the enclosures, or plugged them both directly into the LSI 9200.  When 
I removed the second enclosure, it booted up fine.

I also have an LSI MegaRAID 9280-8e I could use, but I don't know if there is a 
way to pass the drives through, without creating RAID0 virtual drives for each 
drive, which would complicate replacing disks.  The 9280 boots up fine, and the 
system can see new virtual drives.

Any suggestions?  Is there some sort of boot procedure, in order to get the 
system to recognize the second enclosure without locking up?  Is there a 
special way to configure one of these LSI boards?

Thanks,
   Rob


[zfs-discuss] l2arc_noprefetch

2010-11-21 Thread Rob Cohen
When running real data, as opposed to benchmarks, I notice that my l2arc stops 
filling, even though the majority of my reads are still going to primary 
storage.  I'm using 5 SSDs for L2ARC, so I'd expect to get good throughput, 
even with sequential reads.

I'd like to experiment with disabling the l2arc_noprefetch feature, to see how 
the performance compares by caching more data.  How exactly do I do that?

Right now, I added the following line to /etc/system, but it doesn't seem to 
have made a difference.  I'm still seeing most of my reads go to primary 
storage, even though my cache should be warm by now, and my SSDs are far from 
full.

set zfs:l2arc_noprefetch = 0

Am I setting this wrong?  Am I misunderstanding this option?
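
(For reference, and assuming an OpenSolaris-era kernel with mdb available, the live 
value of the tunable can be checked, and flipped without a reboot, like this:)

  # echo "l2arc_noprefetch/D" | mdb -k
  # echo "l2arc_noprefetch/W0t0" | mdb -kw

L2ARC fill and hit rates can then be watched with "kstat -n arcstats | grep l2_".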

Thanks,
  Rob


Re: [zfs-discuss] zfs record size implications

2010-11-10 Thread Rob Cohen
Thanks, Richard.  Your answers were very helpful.


[zfs-discuss] zfs record size implications

2010-11-04 Thread Rob Cohen
I have read some conflicting things regarding the ZFS record size setting.  
Could you guys verify/correct these statements:

(These reflect my understanding, not necessarily the facts!)

1) The ZFS record size in a zvol is the unit that dedup happens at.  So, for a 
volume that is shared to an NTFS machine, if the NTFS cluster size is smaller 
than the zvol record size, dedup will get dramatically worse, since it won't 
dedup clusters that are positioned differently in zvol records.

2) For shared folders, the record size is the allocation unit size, so large 
records can waste a substantial amount of space, in cases with lots of very 
small files.  This is different than a HW raid stripe size, which only affects 
performance, not space usage.

3) Although small record sizes have a large RAM overhead for dedup tables, as 
long as the dedup table working set fits in RAM, and the rest fits in L2ARC, 
performance will be good.
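
Regarding statement 1, a sketch (names and sizes are hypothetical): for a zvol the 
block size is the volblocksize property, fixed at creation time, so it could be set 
to match the NTFS cluster size:

  # zfs create -V 500g -o volblocksize=4k tank/ntfsvol

For regular filesystems (statement 2), the equivalent knob is the recordsize 
property, e.g. "zfs set recordsize=8k tank/fs".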

Thanks,
   Rob


Re: [zfs-discuss] stripes of different size mirror groups

2010-10-28 Thread Rob Cohen
Thanks, Ian.

If I understand correctly, the performance would then drop to the same level as 
if I set them up as separate volumes in the first place.

So, I get double the performance for 75% of my data, and equal performance for 
25% of my data, and my L2ARC will adapt to my working set across both 
enclosures.

That sounds like all upside, and no downside, unless I'm missing something.

Are there any other problems?


[zfs-discuss] stripes of different size mirror groups

2010-10-28 Thread Rob Cohen
I have a couple drive enclosures:
15x 450GB 15k RPM SAS
15x 600GB 15k RPM SAS

I'd like to set them up like RAID10.  Previously, I was using two hardware 
RAID10 volumes, with the 15th drive as a hot spare, in each enclosure.

Using ZFS, it could be nice to make them a single volume, so that I could share 
L2ARC and ZIL devices, rather than buy two sets.

It appears possible to set up 7x 450GB mirrored sets and 7x 600GB mirrored sets 
in the same volume, without losing capacity.  Is that a bad idea?  Is there a 
problem with having different stripe sizes, like this?
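
A sketch of that single-pool layout (device names invented, only the first mirror 
pairs shown), with the cache and log devices shared by all the mirrors:

  # zpool create tank mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0 \
                 mirror c2t0d0 c2t1d0 mirror c2t2d0 c2t3d0
  # zpool add tank cache c3t0d0
  # zpool add tank log mirror c3t1d0 c3t2d0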

Thanks,
Rob