Re: [zfs-discuss] Zpool LUN Sizes

2012-10-27 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
 From: Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
 
 Performance is much better if you use mirrors instead of raid.  (Sequential
 performance is just as good either way, but sequential IO is unusual for most
 use cases. Random IO is much better with mirrors, and that includes scrubs &
 resilvers.)

Even if you think you use sequential IO...  If you use snapshots...  Thanks to 
the nature of snapshot creation & deletion & the nature of COW, you probably 
don't have much sequential IO in your system, after a couple months of actual 
usage.  Some people use raidzN, but I always use mirrors.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Zpool LUN Sizes

2012-10-27 Thread Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Fajar A. Nugraha
 
 So my
 suggestion is actually just present one huge 25TB LUN to zfs and let
 the SAN handle redundancy.

Oh - No

Definitely let zfs handle the redundancy.  Because ZFS is doing the 
checksumming, if it finds a cksum error, it needs access to the redundant copy 
in order to correct it.  If you let the SAN handle the redundancy, then when zfs 
finds a cksum error, your data is unrecoverable.  (Just the file in 
question, not the whole pool or anything like that.)
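
A rough sketch in Python of what that means in practice (purely conceptual - 
the helper names are made up and this is nothing like the real ZFS code path), 
just to show why self-healing needs a redundant copy that ZFS itself controls:

import hashlib

def checksum(block):
    # stand-in for whatever checksum the dataset actually uses
    return hashlib.sha256(block).digest()

def read_with_self_heal(copies, expected_cksum):
    # copies: one reader per redundant copy that ZFS itself manages
    # (the other side of a mirror, or a reconstruction from raidz parity)
    for read_copy in copies:
        block = read_copy()
        if checksum(block) == expected_cksum:
            return block   # good copy found; ZFS would also rewrite the bad one
    raise IOError("every copy failed the checksum - unrecoverable")

good = lambda: b"hello"
bad = lambda: b"hellp"
data = read_with_self_heal([bad, good], checksum(b"hello"))

With one big LUN there is only one entry in 'copies', so a checksum mismatch 
goes straight to the unrecoverable case, no matter how much redundancy the SAN 
has underneath it.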

The answer to Morris's question, about size of LUNs and so forth...  It really 
doesn't matter what size the LUNs are.  Just choose based on your redundancy 
and performance requirements.  Best would be to go JBOD, or if that's not 
possible, create a bunch of 1-disk volumes and let ZFS handle them as if 
they're JBOD.

Performance is much better if you use mirrors instead of raid.  (Sequential 
performance is just as good either way, but sequential IO is unusual for most 
use cases. Random IO is much better with mirrors, and that includes scrubs &
resilvers.)



Re: [zfs-discuss] Scrub and checksum permutations

2012-10-27 Thread Ray Arachelian
On 10/26/2012 04:29 AM, Karl Wagner wrote:

 Does it not store a separate checksum for a parity block? If so, it
 should not even need to recalculate the parity: assuming checksums
 match for all data and parity blocks, the data is good.

 I could understand why it would not store a checksum for a parity
 block. It is not really necessary: Parity is only used to reconstruct
 a corrupted block, so you can reconstruct the block and verify the
 data checksum. But I can also see why they would: Simplified logic,
 faster identification of corrupt parity blocks (more useful for
 RAIDZ2 and greater), and the general principle that all blocks are
 checksummed.

 If this were the case, it should mean that RAIDZ scrub is faster than
 mirror scrub, which I don't think it is. So this post is probably
 redundant (pun intended)


Parity is very simple to calculate and doesn't use a lot of CPU - just
slightly more work than reading all the blocks: read all the stripe
blocks on all the drives involved in a stripe, then do a simple XOR
operation across all the data.  The actual checksums are more expensive
as they're MD5 - much nicer when these can be hardware accelerated.

Also, on x86, there are SSE block operations that make XORing for a
whole block a lot faster by doing a whole chunk at a time, so you don't
need a loop to do it - not sure which ZFS implementations take advantage
of these, but in the end XOR is not an expensive operation, while MD5 is, by
several orders of magnitude.
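
To make the "slightly more work than reading the blocks" point concrete, here 
is a toy single-parity sketch in Python (illustrative only - the real 
vdev_raidz.c code is nothing like this):

def xor_parity(columns):
    # XOR equal-length data columns into one parity column
    parity = bytearray(len(columns[0]))
    for col in columns:
        for i, b in enumerate(col):
            parity[i] ^= b
    return bytes(parity)

def reconstruct(columns, missing, parity):
    # rebuild one lost column: XOR the parity with every surviving column
    survivors = [c for i, c in enumerate(columns) if i != missing]
    return xor_parity(survivors + [parity])

data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
p = xor_parity(data)
assert reconstruct(data, 1, p) == data[1]

One pass, one XOR per byte, so the cost really is dominated by reading the 
columns rather than by the math.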


Re: [zfs-discuss] Scrub and checksum permutations

2012-10-27 Thread Toby Thain

On 27/10/12 11:56 AM, Ray Arachelian wrote:

On 10/26/2012 04:29 AM, Karl Wagner wrote:


Does it not store a separate checksum for a parity block? If so, it
should not even need to recalculate the parity: assuming checksums
match for all data and parity blocks, the data is good.
...



Parity is very simple to calculate and doesn't use a lot of CPU - just
slightly more work than reading all the blocks: read all the stripe
blocks on all the drives involved in a stripe, then do a simple XOR
operation across all the data.  The actual checksums are more expensive
as they're MD5 - much nicer when these can be hardware accelerated.


Checksums are MD5??

--Toby



Also, on x86,  ...



Re: [zfs-discuss] Scrub and checksum permutations

2012-10-27 Thread Jim Klimov

2012-10-27 20:54, Toby Thain wrote:

Parity is very simple to calculate and doesn't use a lot of CPU - just
slightly more work than reading all the blocks: read all the stripe
blocks on all the drives involved in a stripe, then do a simple XOR
operation across all the data.  The actual checksums are more expensive
as they're MD5 - much nicer when these can be hardware accelerated.


Checksums are MD5??


No, they are fletcher variants or sha256, with more probably coming
up soon, and some of these might also be boosted by certain hardware
capabilities. But I tend to agree that parity calculations are likely
faster (even if not all parities are simple XORs - that would be
silly for double- or triple-parity sets, which may use different
algos just to be sure).

//Jim


Re: [zfs-discuss] Zpool LUN Sizes

2012-10-27 Thread Timothy Coalson
On Sat, Oct 27, 2012 at 9:21 AM, Edward Ned Harvey
(opensolarisisdeadlongliveopensolaris) 
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

  From: Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
 
  Performance is much better if you use mirrors instead of raid.
  (Sequential
  performance is just as good either way, but sequential IO is unusual for
 most
  use cases. Random IO is much better with mirrors, and that includes
  scrubs &
  resilvers.)

 Even if you think you use sequential IO...  If you use snapshots...
  Thanks to the nature of snapshot creation & deletion & the nature of COW,
 you probably don't have much sequential IO in your system, after a couple
 months of actual usage.  Some people use raidzN, but I always use mirrors.


This may be the case if you often rewrite portions of files, as with database
usage, but if you generally write entire new files rather than modifying old
ones, I wouldn't expect fragmentation to be that bad.  The particular workload
I have is like this: if a file is changed, it is overwritten entirely, so I
went with raidz2 vdevs for more capacity.  However, I'm not exactly pushing
the limits of the pool performance, as my bottleneck is the network.

Tim


Re: [zfs-discuss] Scrub and checksum permutations

2012-10-27 Thread Timothy Coalson
On Sat, Oct 27, 2012 at 12:35 PM, Jim Klimov jimkli...@cos.ru wrote:

 2012-10-27 20:54, Toby Thain wrote:

 Parity is very simple to calculate and doesn't use a lot of CPU - just
 slightly more work than reading all the blocks: read all the stripe
 blocks on all the drives involved in a stripe, then do a simple XOR
 operation across all the data.  The actual checksums are more expensive
 as they're MD5 - much nicer when these can be hardware accelerated.


 Checksums are MD5??


 No, they are fletcher variants or sha256, with more probably coming
 up soon, and some of these might also be boosted by certain hardware
 capabilities. But I tend to agree that parity calculations are likely
 faster (even if not all parities are simple XORs - that would be
 silly for double- or triple-parity sets, which may use different
 algos just to be sure).


I would expect raidz2 and 3 to use the same math as traditional raid6 for
parity: https://en.wikipedia.org/wiki/Raid6#RAID_6 .  In particular, the
sentence "For a computer scientist, a good way to think about this is that ⊕
is a bitwise XOR operator and g^i is the action of a linear feedback shift
register on a chunk of data."  If I understood it
correctly, it does a different number of iterations of the LFSR on each
sector, depending on which sector among the data sectors it is, and that
the LFSR is applied independently to small groups of bytes in each sector,
and then does the XOR to get the second parity sector (and for third
parity, I believe it needs to use a different generator polynomial for the
LFSR).  For small numbers of iterations, multiple iterations of the LFSR
can be optimized to a single shift and an XOR with a lookup value on the
lowest bits.  For larger numbers of iterations (if you have, say, 28 disks
in a raidz3), it could construct the 25th iteration by doing 10, 10, 5, but
I have no idea how ZFS actually implements it.
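
To be concrete, the per-byte LFSR step is just a shift plus a conditional XOR
with the reduction polynomial. Here is a sketch of the generic RAID-6 recipe as
I read that article (I have not checked whether RAID-Z2 uses this exact
generator or column ordering, so take it as an illustration, not the ZFS
implementation):

def gf_mul2(b):
    # one LFSR step: multiply by g = 0x02 in GF(2^8), polynomial 0x11d
    b <<= 1
    if b & 0x100:
        b ^= 0x11d
    return b & 0xff

def gf_pow2(b, i):
    # apply the LFSR step i times, i.e. multiply by g**i
    for _ in range(i):
        b = gf_mul2(b)
    return b

def pq_parity(columns):
    # columns: equal-length byte strings, one per data disk
    p = bytearray(len(columns[0]))
    q = bytearray(len(columns[0]))
    for i, col in enumerate(columns):      # i picks the power of g
        for j, byte in enumerate(col):
            p[j] ^= byte                   # first parity: plain XOR
            q[j] ^= gf_pow2(byte, i)       # second parity: XOR of g**i * data
    return bytes(p), bytes(q)

p, q = pq_parity([b"\x01\x02", b"\x10\x20", b"\xff\x00"])

Real implementations replace the gf_pow2 loop with the shift-and-lookup trick
mentioned above (or log/exp tables); I haven't looked at what coefficients
RAID-Z3 uses for its third parity.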

As I understand it, fletcher checksums are extremely simple: basically two
additions and two modulo operations per word (however many bytes it processes
at a time), so I wouldn't be surprised if fletcher were about the same speed
as computing second/third parity.  SHA256 I don't know; I would expect it to
be more expensive, simply because it is a cryptographic hash.
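
For what it's worth, the textbook Fletcher inner loop really is that small.
A quick sketch over 16-bit words (the zfs fletcher2/fletcher4 variants differ
in word size and accumulator handling, so this is the generic algorithm rather
than the zfs code):

def fletcher32(data):
    # textbook Fletcher-32: two adds and two mods per 16-bit word
    if len(data) % 2:
        data += b"\x00"            # pad to a whole number of words
    sum1, sum2 = 0, 0
    for i in range(0, len(data), 2):
        word = int.from_bytes(data[i:i + 2], "little")
        sum1 = (sum1 + word) % 65535
        sum2 = (sum2 + sum1) % 65535
    return (sum2 << 16) | sum1

print(hex(fletcher32(b"abcde")))   # 0xf04fc729

SHA256, by contrast, runs 64 mixing rounds over every 64-byte block, which is
where the extra cost comes from unless it is hardware accelerated.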

Tim