Re: [zfs-discuss] sharenfs option rw,root=host1 don't take effect

2010-04-08 Thread John Doh
If you're getting nobody:nobody ownership on an NFS mount, you have an NFS
version mismatch (usually between V3 and V4). To get around this, use the
following mount options on the client:

hard,bg,intr,vers=3

e.g:

mount -o hard,bg,intr,vers=3 server:/pool/zfs /mountpoint
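
If you want to confirm which version a client actually negotiated before
forcing one, nfsstat is the usual tool on both Solaris and Linux clients
(a sketch):

  nfsstat -m
  # look for vers=3 or vers=4 in the flags of the mount in question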


Re: [zfs-discuss] compression property not received

2010-04-08 Thread Cindy Swearingen

Hi Daniel,

D'oh...

I found a related bug when I looked at this yesterday but I didn't think
it was your problem because you didn't get a busy message.

See this RFE:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6700597


Cindy
On 04/07/10 17:59, Daniel Bakken wrote:
We have found the problem. The mountpoint property on the sender was at
one time changed from the default, then later changed back to the default
value using zfs set instead of zfs inherit. Therefore, zfs send included
that locally set property in the stream, even though its value is
effectively the default. This caused the receiver to stop processing
subsequent properties in the stream, because the mountpoint isn't valid
on the receiver.


I tested this theory with a spare zpool. First I used zfs inherit 
mountpoint promise1/archive to remove the local setting (which was 
exactly the same value as the default). This time the compression=gzip 
property was correctly received.
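
For reference, the sequence I used boils down to this (the snapshot name and
the spare pool are illustrative; I'm also assuming a replication stream via
-R, since that is what carries properties):

  # clear the locally set (but default-valued) mountpoint on the sender
  zfs inherit mountpoint promise1/archive

  # resend and check that compression now arrives
  zfs send -R promise1/archive@snap | zfs receive -d sparepool
  zfs get compression sparepool/archive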


It seems like a bug to me that one failed property in a stream prevents 
the rest from being applied. I should have used zfs inherit, but it 
would be best if zfs receive handled failures more gracefully, and 
attempted to set as many properties as possible.


Thanks to Cindy and Tom for their help.

Daniel

On Wed, Apr 7, 2010 at 2:31 AM, Tom Erickson <thomas.erick...@oracle.com> wrote:



Now I remember that 'zfs receive' used to give up after the first
property it failed to set. If I'm remembering correctly, then, in
this case, if the mountpoint was invalid on the receive side, 'zfs
receive' would not even try to set the remaining properties.

I'd try the following in the source dataset:

zfs inherit mountpoint promise1/archive

to clear the explicit mountpoint and prevent it from being included
in the send stream. Later set it back the way it was. (Soon there
will be an option to take care of that; see CR 6883722 want 'zfs
recv -o prop=value' to set initial property values of received
dataset.) Then see if you receive the compression property successfully.
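
Once that RFE integrates, the receive side should be able to override the
offending property directly. Going by the CR title, the usage would look
roughly like this (illustrative only, not shipping syntax yet):

  zfs send -R promise1/archive@snap | zfs receive -o mountpoint=/archive sparepool/archive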






Re: [zfs-discuss] compression property not received

2010-04-08 Thread Tomas Ögren
On 08 April, 2010 - Cindy Swearingen sent me these 2,6K bytes:

 Hi Daniel,

 D'oh...

 I found a related bug when I looked at this yesterday but I didn't think
 it was your problem because you didn't get a busy message.

 See this RFE:

 http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6700597

Solaris 10 'man zfs', under 'receive':

     -u      File system that is associated with the received
             stream is not mounted.
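
So, assuming the same sort of replication stream as in this thread (names
illustrative), receiving with -u at least avoids the mount attempt on the
receiving side; whether it also lets the remaining properties through I
haven't verified:

  zfs send -R promise1/archive@snap | zfs receive -u -d sparepool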

/Tomas
-- 
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se


Re: [zfs-discuss] ZFS RaidZ recommendation

2010-04-08 Thread Bob Friesenhahn

On Thu, 8 Apr 2010, Erik Trimble wrote:
While that's great in theory, there's getting to be a consensus that 1TB 
7200RPM 3.5" SATA drives are really going to be the last usable capacity.


Agreed.  The 2.5" form factor is rapidly emerging.  I see that 
enterprise 6-Gb/s SAS drives are available with 600GB capacity 
already.  It won't be long until they also reach your 1TB barrier.


So, while it's nice that you can indeed seamlessly swap up drive sizes (and 
your recommendation of using 2x7 helps that process), in reality, it's not a 
good idea to upgrade from his existing 1TB drives.


It would make more sense to add a new chassis, or replace the existing 
chassis with one which supports more (physically smaller) drives. 
While products are often sold based on their ability to be upgraded, 
upgrades often don't make sense.


Now, in the Real Near Future when we have 1TB+ SSDs that are 1cent/GB, well, 
then, it will be nice to swap up.  But not until then...


I don't see that happening any time soon.  FLASH is close to hitting 
the wall on device geometries and tri-level and quad-level only gets 
you so far.  A new type of device will need to be invented.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] ZFS RaidZ recommendation

2010-04-08 Thread Richard Elling
On Apr 8, 2010, at 8:52 AM, Bob Friesenhahn wrote:

 On Thu, 8 Apr 2010, Erik Trimble wrote:
 While that's great in theory, there's getting to be a consensus that 1TB 
 7200RPM 3.5 Sata drives are really going to be the last usable capacity.

I doubt that 1TB (or even 1.5TB) 3.5" disks are being manufactured anymore.
These have dropped to the $100 price barrier already.  2TB are hanging out
around $150.

 Agreed.  The 2.5 form factor is rapidly emerging.  I see that enterprise 
 6-Gb/s SAS drives are available with 600GB capacity already.  It won't be 
 long until they also reach your 1TB barrier.

Yep, seeing some nice movement in this space.

 So, while it's nice that you can indeed seemlessly swap up drives sizes (and 
 your recommendation of using 2x7 helps that process), in reality, it's not a 
 good idea to upgrade from his existing 1TB drives.
 
 It would make more sense to add a new chassis, or replace the existing 
 chassis with one which supports more (physically smaller) drives. While 
 products are often sold based on their ability to be upgraded, upgrades often 
 don't make sense.
 
 Now, in the Real Near Future when we have 1TB+ SSDs that are 1cent/GB, well, 
 then, it will be nice to swap up.  But not until then...
 
 I don't see that happening any time soon.  FLASH is close to hitting the wall 
 on device geometries and tri-level and quad-level only gets you so far.  A 
 new type of device will need to be invented.

It is a good idea to not bet against Moore's Law :-)
The current state of the art is an 8GB (byte, not bit) MLC flash chip which
is 162 mm^2. In the space of a 2.5" disk, with some clever packaging, you
could pack dozens of TB.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 







Re: [zfs-discuss] sharenfs option rw,root=host1 don't take effect

2010-04-08 Thread Ragnar Sundblad

On 12 mar 2010, at 03.58, Damon Atkins wrote:
...
 Unfortunately DNS spoofing exists, which means forward lookups can be poison.

And IP address spoofing, and...

 The best (maybe only) way to make NFS secure is NFSv4 and Kerb5 used together.

Amen!

DNS is NOT an authentication system!
IP is NOT an authentication system!

I don't think the (rw|root|...)=(hostname|address) kind of functionality
has any place in a system from after the 80's, when the world got
connected and security became an issue for the masses. It should be an
extra feature, marked with a big "insecure" label, that you should have to
enable through a very cumbersome process.

Instead, use Kerberos, or if that is not possible, at least use IPSEC to
make IP address spoofing harder.

/ragge



[zfs-discuss] ZFS monitoring - best practices?

2010-04-08 Thread Ray Van Dolson
We're starting to grow our ZFS environment and really need to start
standardizing our monitoring procedures.

OS tools are great for spot troubleshooting and sar can be used for
some trending, but we'd really like to tie this into an SNMP based
system that can generate graphs for us (via RRD or other).

Whether or not we do this via our standard enterprise monitoring tool
or write some custom scripts I don't really care... but I do have the
following questions:

- What metrics are you guys tracking?  I'm thinking:
- IOPS
- ZIL statistics
- L2ARC hit ratio
- Throughput
- IO Wait (I know there's probably a better term here)
- How do you gather this information?  Some but not all is
  available via SNMP.  Has anyone written a ZFS specific MIB or
  plugin to make the info available via the standard Solaris SNMP
  daemon?  What information is available only via zdb/mdb?
- Anyone have any RRD-based setups for monitoring their ZFS
  environments they'd be willing to share or talk about?

Thanks in advance,
Ray


Re: [zfs-discuss] ZFS monitoring - best practices?

2010-04-08 Thread Joel Buckley

Ray,

Here is my short list of Performance Metrics I track on 7410 Performance 
Rigs via 7000 Analytics.



m:analytics datasets ls
Datasets:

DATASET STATE   INCORE ONDISK NAME
dataset-000 active   1016K  75.9M arc.accesses[hit/miss]
dataset-001 active390K  37.9M arc.l2_accesses[hit/miss]
dataset-002 active242K  13.7M arc.l2_size
dataset-003 active242K  13.7M arc.size
dataset-004 active958K  86.1M arc.size[component]
dataset-005 active242K  13.7M cpu.utilization
dataset-006 active477K  46.2M cpu.utilization[mode]
dataset-007 active648K  59.7M dnlc.accesses[hit/miss]
dataset-008 active242K  13.7M fc.bytes
dataset-009 active242K  13.7M fc.ops
dataset-010 active242K  12.8M fc.ops[latency]
dataset-011 active242K  12.8M fc.ops[op]
dataset-012 active242K  13.7M ftp.kilobytes
dataset-013 active242K  12.8M ftp.kilobytes[op]
dataset-014 active242K  13.7M http.reqs
dataset-015 active242K  12.8M http.reqs[latency]
dataset-016 active242K  12.8M http.reqs[op]
dataset-017 active242K  13.7M io.bytes
dataset-018 active439K  43.7M io.bytes[op]
dataset-019 active308K  29.6M io.disks[utilization=95][disk]
dataset-020 active   2.93M  87.2M io.disks[utilization]
dataset-021 active242K  13.7M io.ops
dataset-022 active   9.85M   274M io.ops[disk]
dataset-023 active   20.0M   827M io.ops[latency]
dataset-024 active438K  43.6M io.ops[op]
dataset-025 active242K  13.7M iscsi.bytes
dataset-026 active242K  13.7M iscsi.ops
dataset-027 active   1.45M  91.1M iscsi.ops[latency]
dataset-028 active248K  14.8M iscsi.ops[op]
dataset-029 active242K  13.7M ndmp.diskkb
dataset-030 active242K  13.8M nfs2.ops
dataset-031 active242K  12.8M nfs2.ops[latency]
dataset-032 active242K  13.8M nfs2.ops[op]
dataset-033 active242K  13.8M nfs3.ops
dataset-034 active   8.82M   163M nfs3.ops[latency]
dataset-035 active327K  18.1M nfs3.ops[op]
dataset-036 active242K  13.8M nfs4.ops
dataset-037 active   2.31M  97.8M nfs4.ops[latency]
dataset-038 active311K  17.2M nfs4.ops[op]
dataset-039 active242K  13.7M nic.kilobytes
dataset-040 active970K  84.5M nic.kilobytes[device]
dataset-041 active943K  77.1M nic.kilobytes[direction=in][device]
dataset-042 active457K  31.1M nic.kilobytes[direction=out][device]
dataset-043 active503K  49.1M nic.kilobytes[direction]
dataset-044 active242K  13.7M sftp.kilobytes
dataset-045 active242K  12.8M sftp.kilobytes[op]
dataset-046 active242K  13.7M smb.ops
dataset-047 active242K  12.8M smb.ops[latency]
dataset-048 active242K  13.7M smb.ops[op]
dataset-049 active242K  12.8M srp.bytes
dataset-050 active242K  12.8M srp.ops[latency]
dataset-051 active242K  12.8M srp.ops[op]

Cheers,
Joel.

On 04/08/10 14:06, Ray Van Dolson wrote:

We're starting to grow our ZFS environment and really need to start
standardizing our monitoring procedures.

OS tools are great for spot troubleshooting and sar can be used for
some trending, but we'd really like to tie this into an SNMP based
system that can generate graphs for us (via RRD or other).

Whether or not we do this via our standard enterprise monitoring tool
or write some custom scripts I don't really care... but I do have the
following questions:

- What metrics are you guys tracking?  I'm thinking:
- IOPS
- ZIL statistics
- L2ARC hit ratio
- Throughput
- IO Wait (I know there's probably a better term here)
  

Utilize Latency instead of IO Wait.

- How do you gather this information?  Some but not all is
  available via SNMP.  Has anyone written a ZFS specific MIB or
  plugin to make the info available via the standard Solaris SNMP
  daemon?  What information is available only via zdb/mdb?
  

On 7000 appliances, this is easy via Analytics.

On Solaris, you need to pull data from kstats and/or DTrace scripts and then
archive the data in similar manner...
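
For example, the ARC/L2ARC hit counters are plain kstats and trivial to poll
into RRD or anything similar (a sketch; these are the standard arcstats names):

  kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses \
          zfs:0:arcstats:l2_hits zfs:0:arcstats:l2_misses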


- Anyone have any RRD-based setups for monitoring their ZFS
  environments they'd be willing to share or talk about?

Thanks in advance,
Ray



--

Joel Buckley | +1.303.272.5556
Oracle Open Storage Systems
500 Eldorado Blvd

Broomfield, CO 80021-3400



Re: [zfs-discuss] ZFS RaidZ recommendation

2010-04-08 Thread Miles Nordin
 dm == David Magda dma...@ee.ryerson.ca writes:
 bf == Bob Friesenhahn bfrie...@simple.dallas.tx.us writes:

dm OP may also want to look into the multi-platform pkgsrc for
dm third-party open source software:

+1.  jucr.opensolaris.org seems to be based on RPM which is totally
fail.  RPM is the oldest, crappiest, most frustrating thing!  packages
are always frustrating but pkgsrc is designed to isolate itself from
the idiosyncrasies of each host platform, through factoring.

Its major weakness is upgrades, but with Solaris you can use zones and
snapshots to make this a lot less painful:

 * run their ``bulk build'' inside a zone.  The ``bulk build'' feature
   is like the jucr: it downloads stuff from all over the internet and
   builds it, generates a tree of static web pages to report its
   results, plus a repository of binary packages.  Like jucr it does
   not build packages on an ordinary machine, but in a well-specified
   minimal environment which has installed only the packages named as
   build dependencies---between each package build the bulk scripts
   remove all not-needed packages.  Thus you really need a separate
   machine, like a zone, for bulk building.  There is a non-bulk way
   to build pkgsrc, but it's not as good.

   Except that unlike the jucr, the implementation of the bulk build
   is included in the pkgsrc distribution and supported and ordinary
   people who run pkgsrc are expected to use it themselves.

 * clone a zone, upgrade the packages inside it using the binary
   packages produced by the bulk build, and cut services over to the
   clone only after everything's working right.

Both of these things are a bit painful with pkgsrc on normal systems
and much easier with zones and ZFS.  The type of upgrade that's
guaranteed to work on pkgsrc, is:

 * to take a snapshot of /usr/pkgsrc which *is* pkgsrc, all packages'
   build instructions, and no binaries under this tree

 * ``bulk build''

 * replace all your current running packages with the new binary
   packages in the repository the bulk build made.

In practice people usually rebuild less than that to upgrade a
package, and it often works anyway, but if it doesn't work then you're
left wondering ``is pkgsrc just broken again, or will a more thorough 
upgrade actually work?''

The coolest immediate trick is that you can run more than one bulk
build with different starting options, ex SunPro vs gcc, 32 vs 64-bit.
The first step of using pkgsrc is to ``bootstrap'' it, and during
bootstrap you choose the C compiler and also whether to use host's or
pkgsrc's versions of things like perl and pax and awk.

You also choose prefixes for /usr /var and /etc and /var/db/pkg that
will isolate all pkgsrc files from the rest of the system.  In general
this level of pathname flexibility is only achievable at build time,
so only a source-based package system can pull off this trick.  The
corollary is that you can install more than one pkgsrc on a single
system and choose between them with PATH.  pkgsrc is generally
designed to embed full pathnames of its shared libs, so this has got a
good shot of working.  
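
For example, two independent trees can coexist like this (paths illustrative;
check ./bootstrap --help for the exact option spelling in your pkgsrc version):

  cd pkgsrc/bootstrap
  ./bootstrap --prefix=/usr/pkg-gcc  --pkgdbdir=/var/db/pkg-gcc
  ./bootstrap --prefix=/usr/pkg-spro --pkgdbdir=/var/db/pkg-spro --compiler=sunpro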

You could have /usr/pkg64 and /usr/pkg32, or /usr/pkg-gcc and
/usr/pkg-spro.  pkgsrc will also build pkg_add, pkg_info, etc. under
/usr/pkg-gcc/bin which will point to /var/db/pkg-gcc or whatever to
track what's installed, so you can have more than one pkg_add on a
single system pointing to different sets of directories.  You could
also do weirder things like use different paths every time you do a
bulk build, like /usr/pkg-20100130 and /usr/pkg-20100408, although
it's very strange to do that so far.

It would also be possible to use ugly post-Unix directory layouts, ex
/pkg/marker/usr/bin and /pkg/marker/etc and
/pkg/marker/var/db/pkg, and then make /pkg/marker into a ZFS that
could be snapshotted and rolled back.  It is odd in pkgsrc world to
put /var/db/pkg tracking-database of what's installed into the same
subtree as the installed stuff itself, but in the context of ZFS it
makes sense to do that.  However the pathnames will be fixed for a
given set of binary packages, so whatever you do with the ZFS the
results of bulk builds sharing a common ``bootstrap'' phase would have
to stay mounted on the same directory.  You cannot clone something to
a new directory then add/remove packages.  There was an attempt called
``pkgviews'' to do something like this, but I think it's ultimately
doomed because the idea's not compartmentalized enough to work with
every package.

In general pkgsrc gives you a toolkit for dealing with suboptimal
package trees where a lot of shit is broken.  It's well-adapted to the
ugly modern way we run Unixes, sealed, with only web facing the users,
because you can dedicate an entire bulk build to one user-facing app.
If you have an app that needs a one-line change to openldap, pkgsrc
makes it easy to perform this 1-line change and rebuild 100
interdependent packages linked to your mutant library

Re: [zfs-discuss] sharenfs option rw,root=host1 don't take effect

2010-04-08 Thread Miles Nordin
 rs == Ragnar Sundblad ra...@csc.kth.se writes:

rs use IPSEC to make IP address spoofing harder.

IPsec with channel binding is win, but not until SA's are offloaded to
the NIC and all NIC's can do IPsec AES at line rate.  Until this
happens you need to accept there will be some protocols used on SAN
that are not on ``the Internet'' and for which your axiomatic security
declarations don't apply, where the relevant features are things like
doing the DNS lookup in the proper .rhosts manner and doing uRPF,
minimum, and more optimistically stop adding new protocols without
IPv6 support, and start adding support for multiple IP stacks / VRF's.
If saying ``the only way to do any given thing is twicecrypted
kerberized ipsec within dnssec namespaces'' is blocking doing these
immediate plaintext things that allow a host to participate in both
the internet and a SAN at once, well that's no good either.




Re: [zfs-discuss] L2ARC Workingset Size

2010-04-08 Thread Abdullah Al-Dahlawi
Hi Richard

Thanks for your comments. OK, ZFS is COW, I understand, but this also means
a waste of valuable space on my L2ARC SSD device: more than 60% of the space
is consumed by COW! I don't get it.

On Sat, Apr 3, 2010 at 11:35 PM, Richard Elling richard.ell...@gmail.comwrote:

 On Apr 1, 2010, at 9:41 PM, Abdullah Al-Dahlawi wrote:

  Hi all
 
  I ran a workload that reads and writes within 10 files, each file is 256M,
 i.e. (10 * 256M = 2.5GB total dataset size).
 
  I have set the ARC max size to 1 GB  on  etc/system file
 
  In the worse case, let us assume that the whole dataset is hot, meaning
 my workingset size= 2.5GB
 
  My SSD flash size = 8GB and being used for L2ARC
 
  No slog is used in the pool
 
  My file system record size = 8K, meaning 2.5% of 8GB is used for the L2ARC
 directory in ARC, which ultimately means that available ARC is 1024M - 204.8M
 = 819.2M available ARC (am I right?)

 this is worst case
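
 Spelling out the arithmetic behind that 2.5% (rough, and assuming something
 like 200 bytes of ARC header per cached record, which is where the rule of
 thumb comes from):

   8 GB L2ARC / 8 KB recordsize     = ~1,000,000 records
   1,000,000 records * ~205 bytes   = ~205 MB of ARC held by L2ARC headers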

  Now the Question ...
 
  After running the workload for 75 minutes, I have noticed that L2ARC
 device has grown to 6 GB !!!   

 You're not interpreting the values properly, see below.

  What is in L2ARC beyond my 2.5GB workingset? Something else has been
 added to L2ARC.

 ZFS is COW, so modified data is written to disk and the L2ARC.

  Here is a 5 minute interval of Zpool iostat

 [snip]
  Also, a FULL  Kstat ZFS for 5 minutes Interval

 [snip]
  module: zfs instance: 0
  name:   arcstats                class:    misc
  c   1073741824
  c_max   1073741824

  Max ARC size is limited to 1GB

  c_min   134217728
  crtime  28.083178473
  data_size   955407360
  deleted 966956
  demand_data_hits843880
  demand_data_misses  452182
  demand_metadata_hits68572
  demand_metadata_misses  5737
  evict_skip  82548
  hash_chain_max  18
  hash_chains 61732
  hash_collisions 1444874
  hash_elements   329553
  hash_elements_max   329561
  hdr_size46553328
  hits978241
  l2_abort_lowmem 0
  l2_cksum_bad0
  l2_evict_lock_retry 0
  l2_evict_reading0
  l2_feeds4738
  l2_free_on_write184
  l2_hdr_size 17024784

 size of L2ARC headers is approximately 17MB

  l2_hits 252839
  l2_io_error 0
  l2_misses   203767
  l2_read_bytes   2071482368
  l2_rw_clash 13
  l2_size 2632226304

 currently, there is approximately 2.5GB in the L2ARC

  l2_write_bytes  6486009344

 total amount of data written to L2ARC since boot is 6+ GB

  l2_writes_done  4127
  l2_writes_error 0
  l2_writes_hdr_miss  21
  l2_writes_sent  4127
  memory_throttle_count   0
  mfu_ghost_hits  120524
  mfu_hits500516
  misses  468227
  mru_ghost_hits  61398
  mru_hits412112
  mutex_miss  511
  other_size  56325712
  p   775528448
  prefetch_data_hits  50804
  prefetch_data_misses7819
  prefetch_metadata_hits  14985
  prefetch_metadata_misses2489
  recycle_miss13096
  size1073830768

 ARC size is 1GB

 The best way to understand these in detail is to look at the source which
 is nicely commented. L2ARC design is commented near

 http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#3590
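
 A quick way to watch the distinction above (current L2ARC contents vs. the
 cumulative bytes ever written to it) is to poll the two kstats side by side,
 e.g. every 10 seconds:

   kstat -p zfs:0:arcstats:l2_size zfs:0:arcstats:l2_write_bytes 10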

  -- richard

 ZFS storage and performance consulting at 
 http://www.RichardElling.comhttp://www.richardelling.com/
 ZFS training on deduplication, NexentaStor, and NAS performance
 Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com








-- 
Abdullah Al-Dahlawi
PhD Candidate
George Washington University
Department of Electrical & Computer Engineering

Check The Fastest 500 Super Computers Worldwide
http://www.top500.org/list/2009/11/100


Re: [zfs-discuss] ZFS RaidZ recommendation

2010-04-08 Thread Daniel Carosone
On Thu, Apr 08, 2010 at 12:14:55AM -0700, Erik Trimble wrote:
 Daniel Carosone wrote:
 Go with the 2x7 raidz2.  When you start to really run out of space,
 replace the drives with bigger ones.

 While that's great in theory, there's getting to be a consensus that 1TB  
 7200RPM 3.5 Sata drives are really going to be the last usable capacity. 

I dunno.  The 'forces' and issues you describe are real, but 'usable'
depends very heavily on the user's requirements.  

For example, a large amount of the extra space available on a larger
drive may be very rarely accessed in normal use (scrubs and resilvers
aside).  In the OP's example of an ever-expanding home media
collection, much of it will never or very rarely get
re-watched. Another common use for the extra space is simply storing 
more historical snapshots, against the unlikely future need to access
them.  For such data, speed is really not a concern at all.

For the subset of users for whom these forces are not overwhelming for
real usage, that leaves scrubs and resilvers.  There is room for
improvement in zfs here, too - a more sequential streaming access
pattern would help.

To me, the biggest issue you left unmentioned is the problem of
backup.  There's little option for backing up these larger drives,
other than more of the same drives.  In turn, lots of the use such
drives will be put to, is for backing up other data stores, and there
again, the usage pattern fits the above profile well.

Another usage pattern we may see more of, and that helps address some
of the performance issues, is this.  Say I currently have 2 pools of
1TB disks, one as a backup for the other.  I want to expand the
space.  I replace all the disks with 2TB units, but I also change my
data distribution as it grows: now, each pool is to be at most
half-full of data, and the other half is used as a backup of the
opposite pool.  ZFS send is fast enough that the backup windows are
short, and I now have effectively twice as many spindles in active
service. 
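
As a sketch of that cross-backup arrangement (pool and snapshot names are
invented for illustration):

  zfs snapshot -r tank@2010-04-08
  zfs send -R -I tank@2010-04-01 tank@2010-04-08 | zfs receive -du othertank/backup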

 [..] it looks like hard drives are really at the end of their
 advancement, as far as capacities per drive go.

The challenges are undeniable, but that's way too big a call.  Those
are words you will regret in future; at least, I hope the future will
be one in which those words are regrettable. :-)

 1TB drives currently have excessively long resilver time, inferior  
 reliability (for the most part), and increased power consumption.

Yes, for the most part.  However, a 2TB drive has dramatically less
power consumption than 2x1TB drives (and less of other valuable
resources, like bays and controller slots). 

 I'd generally recommend that folks NOT step beyond the 1TB capacity
 at the 3.5 hard drive format.

A general recommendation is fine, and this is one I agree with for
many scenarios.  At least, I'd recommend that folks look more closely
at alternatives using 2.5" drives and SAS expander bays than they
might otherwise.

 So, while it's nice that you can indeed seemlessly swap up drives sizes  
 (and your recommendation of using 2x7 helps that process), in reality,  
 it's not a good idea to upgrade from his existing 1TB drives.

So what does he do instead, when he's running out of space and 1TB
drives are hard to come by?   The advice still stands, as far as I'm
concerned: do something now, that will leave you room for different
expansion choices later - and evaluate the best expansion choice
later, when the parameters of the time are known.

--
Dan.



Re: [zfs-discuss] ZFS RaidZ recommendation

2010-04-08 Thread Erik Trimble
On Fri, 2010-04-09 at 08:07 +1000, Daniel Carosone wrote:
 On Thu, Apr 08, 2010 at 12:14:55AM -0700, Erik Trimble wrote:
  Daniel Carosone wrote:
  Go with the 2x7 raidz2.  When you start to really run out of space,
  replace the drives with bigger ones.
 
  While that's great in theory, there's getting to be a consensus that 1TB  
  7200RPM 3.5 Sata drives are really going to be the last usable capacity. 
 
 I dunno.  The 'forces' and issues you describe are real, but 'usable'
 depends very heavily on the user's requirements.  
 

Well

The problem is (and this isn't just a ZFS issue) that resilver and scrub
times /are/ very bad for 1TB disks.  This goes directly to the problem
of redundancy - if you don't really care about resilver/scrub issues,
then you really shouldn't bother to use Raidz or mirroring.  It's pretty
much in the same ballpark.


That is, 1TB 3.5" drives have such long resilver/scrub times that with
ZFS, it's a good bet you can kill a second (or third) drive before you
can scrub or resilver in time to compensate for the already-failed one.
Put it another way, you get more errors before you have time to fix the
old ones, which effectively means you now can't fix errors before they
become permanent. Permanent errors = data loss.
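
A rough back-of-envelope (my numbers, purely illustrative):

  1 TB / 100 MB/s sustained  =  ~10,000 s  =  ~2.8 hours  (ideal, fully sequential)

A real resilver on a busy pool is largely random I/O, so the effective rate
can be a small fraction of that, stretching the window to a day or more.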




 For example, a large amount of the extra space available on a larger
 drive may be very rarely accessed in normal use (scrubs and resilvers
 aside).  In the OP's example of an ever-expanding home media
 collection, much of it will never or very rarely get
 re-watched. Another common use for the extra space is simply storing 
 more historical snapshots, against the unlikely future need to access
 them.  For such data, speed is really not a concern at all.
 

Yes, it is. It's still a concern, and not just in the scrub/resilver
arena. Big drives have considerably lower performance, to the point
that replacing 1TB drives with 2TB drives may very well drop them
below the threshold where they start to see stutter.  That is, while the
setup may work with 1TB drives, it won't with 2TB drives.  It's not a
no-brainer to just upgrade the size.

For example, the 2TB 5900RPM 3.5" drives are (on average) over 2x as
slow as the 1TB 7200RPM 3.5" drives for most operations. Access time is
slower by 40%, and throughput is slower by 30-50%.


 For the subset of users for whom these forces are not overwhelming for
 real usage, that leaves scrubs and resilvers.  There is room for
 improvement in zfs here, too - a more sequential streaming access
 pattern would help.
 

While ZFS certainly has problems with randomly written small-data pools,
scrubs and resilvers on large streaming writes (like the media server) are
rather straightforward. Note that RAID-6 and many RAID-5/3 hardware
setups have similar issues.

In any case, resilver/scrub times are becoming the dominant factor in
reliability of these large drives.


 To me, the biggest issue you left unmentioned is the problem of
 backup.  There's little option for backing up these larger drives,
 other than more of the same drives.  In turn, lots of the use such
 drives will be put to, is for backing up other data stores, and there
 again, the usage pattern fits the above profile well.
 
 Another usage pattern we may see more of, and that helps address some
 of the performance issues, is this.  Say I currently have 2 pools of
 1TB disks, one as a backup for the other.  I want to expand the
 space.  I replace all the disks with 2TB units, but I also change my
 data distribution as it grows: now, each pool is to be at most
 half-full of data, and the other half is used as a backup of the
 opposite pool.  ZFS send is fast enough that the backup windows are
 short, and I now have effectively twice as many spindles in active
 service. 
 

Don't count on 'zfs send' being fast enough. Even for liberal values of
fast enough - it's highly data dependent.  For the situation you
describe, you're actually making it worse - now, both pools have a
backup I/O load which reduces their available throughput. If you're
talking about a pool that's already 50% slower than one made of 1TB
drives, then, well, you're hosed.


  [..] it looks like hard drives are really at the end of their
  advancement, as far as capacities per drive go.
 
 The challenges are undeniable, but that's way too big a call.  Those
 are words you will regret in future; at least, I hope the future will
 be one in which those words are regrettable. :-)
 

Honestly, from what I've seen and heard both here and on other forums,
the writing is on the wall, the fat lady has sung, and Mighty Casey has
struck out.  The 3.5" winchester hard drive is on terminal life support
for use in enterprises. It will linger a little longer in commodity
places, where its cost/GB overcomes its weaknesses.  2.5" HDs will last
out the decade, as their slightly higher performance/GB and
space/power savings will allow them to hold off solid-state media for a
bit.  But solid-state is the future, and 

Re: [zfs-discuss] L2ARC Workingset Size

2010-04-08 Thread Tomas Ögren
On 08 April, 2010 - Abdullah Al-Dahlawi sent me these 12K bytes:

 Hi Richard
 
 Thanks for your comments. OK ZFS is COW, I understand, but, this also means
 a waste of valuable space of my L2ARC SSD device, more than 60% of the space
 is consumed by COW !!!. I do not get it ?

The rest can and will be used if L2ARC needs it. It's not wasted, it's
just a number that doesn't match what you think it should be.

/Tomas
-- 
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se


Re: [zfs-discuss] sharenfs option rw,root=host1 don't take effect

2010-04-08 Thread Harry Putnam
mingli liming...@gmail.com writes:

 Thanks Erik, and I will try it, but the new question is that root
 on the NFS server is mapped as nobody at the NFS client.

 For this issue, I set up a new test NFS server and NFS client, and
 with the same options, in that test environment, the file owner
 mapped correctly, which confuses me.

From the original post in this thread it wasn't clear if you're doing
this on a local lan, and if both server and client are opensolaris
machines.

Maybe I missed it.

I don't have any problems now and don't use any of the options to
sharenfs that you showed. 

  zfs get sharenfs z3/projects
  NAME PROPERTY  VALUE SOURCE
  z3/projects  sharenfs  onlocal

Just a simple `on'.

At first, I had all kinds of problems and being a newbie nfs user
seemed to see all kinds of strange phenomena, including seeing 
  `nobody:nobody' as owner:group

I had the version for nfs set properly on the opensolaris server but
it turned to be only set for the server:

grep NFS_SERVER_VERSMAX /etc/default/nfs
  #NFS_SERVER_VERSMAX=4
  NFS_SERVER_VERSMAX=3

But somehow had completely overlooked the CLIENT setting:

grep NFS_CLIENT_VERSMAX /etc/default/nfs
  # NFS_CLIENT_VERSMAX=4
  # NFS_CLIENT_VERSMAX=3

I'd been running with both commented out instead of what I needed,
like this: 
 NFS_CLIENT_VERSMAX=3
 (uncommented)

The client was a linux machine and it was the client trying to mount
the share as version 4.

What tipped me off was accidentally seeing something in the output of the
linux `mount' cmd that indicated the share was mounted as version 4
nfs.

Once I made the correct setting for NFS_CLIENT... things just started
working.
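
On the Linux client side, the negotiated version is easy to confirm, e.g.:

  nfsstat -m            # shows vers=3 or vers=4 per mount
  grep nfs /proc/mounts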



 









Re: [zfs-discuss] L2ARC Workingset Size

2010-04-08 Thread Richard Elling
On Apr 8, 2010, at 3:23 PM, Tomas Ögren wrote:
 On 08 April, 2010 - Abdullah Al-Dahlawi sent me these 12K bytes:
 
 Hi Richard
 
 Thanks for your comments. OK ZFS is COW, I understand, but, this also means
 a waste of valuable space of my L2ARC SSD device, more than 60% of the space
 is consumed by COW !!!. I do not get it ?
 
 The rest can and will be used if L2ARC needs it. It's not wasted, it's
 just a number that doesn't match what you think it should be.

Another way to look at it is: all cache space is wasted by design.  If the 
backing 
store for the cache were performant, there wouldn't be a cache.  So caches waste
space to gain performance. Space, dependability, performance: pick two.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 







Re: [zfs-discuss] sharenfs option rw,root=host1 don't take effect

2010-04-08 Thread Ragnar Sundblad

On 8 apr 2010, at 23.21, Miles Nordin wrote:

 rs == Ragnar Sundblad ra...@csc.kth.se writes:
 
rs use IPSEC to make IP address spoofing harder.
 
 IPsec with channel binding is win, but not until SA's are offloaded to
 the NIC and all NIC's can do IPsec AES at line rate.  Until this
 happens you need to accept there will be some protocols used on SAN
 that are not on ``the Internet'' and for which your axiomatic security
 declarations don't apply, where the relevant features are things like
 doing the DNS lookup in the proper .rhosts manner and doing uRPF,
 minimum, and more optimistically stop adding new protocols without
 IPv6 support, and start adding support for multiple IP stacks / VRF's.
 If saying ``the only way to do any given thing is twicecrypted
 kerberized ipsec within dnssec namespaces'' is blocking doing these
 immediate plaintext things that allow a host to participate in both
 the internet and a SAN at once, well that's no good either.

I totally agree.

Since DNS, fqdn, and the like were mentioned, I don't think this
was intended for a SAN, not-on-the-internet, environment.

uRPF and other filters may of course harden your environment.
Let's hope everyone using the NFS features in question all use
them in a completely non-spoofable (L1..L3 and name resolver)
setup, then! ;-)

/ragge



[zfs-discuss] ZFS kstat Stats

2010-04-08 Thread Tony MacDoodle
Do the following ZFS stats look ok?

> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     106619               832   28%
ZFS File Data               79817               623   21%
Anon                        28553               223    7%
Exec and libs                3055                23    1%
Page cache                  18024               140    5%
Free (cachelist)             2880                22    1%
Free (freelist)            146309              1143   38%

Total                      385257              3009
Physical                   367243              2869
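
For anyone wanting to reproduce this, the table comes from the kernel
debugger; a non-interactive equivalent is:

  echo ::memstat | mdb -k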


Re: [zfs-discuss] ZFS kstat Stats

2010-04-08 Thread Dennis Clarke

 Do the following ZFS stats look ok?

  ::memstat
 Page Summary                Pages                MB  %Tot
 ------------     ----------------  ----------------  ----
 Kernel                     106619               832   28%
 ZFS File Data               79817               623   21%
 Anon                        28553               223    7%
 Exec and libs                3055                23    1%
 Page cache                  18024               140    5%
 Free (cachelist)             2880                22    1%
 Free (freelist)            146309              1143   38%

 Total                      385257              3009
 Physical                   367243              2869

Looks beautiful.

Just for giggles try this :

r...@aequitas:/root# uname -a
SunOS aequitas 5.11 snv_136 i86pc i386 i86pc Solaris
r...@aequitas:/root#
r...@aequitas:/root# /bin/printf ::kmastat\n | mdb -k
cache                       buf    buf    buf    memory    alloc  alloc
name                       size  in use  total   in use  succeed   fail
------------------------- ----- ------ ------ ---------- -------- -----
kmem_magazine_1               8   8595   8736    212992B     8595     0
kmem_magazine_3              16   3697   3780    122880B     3697     0
kmem_magazine_7              32   7633   7686    499712B     7633     0
kmem_magazine_15             64  11642  11656   1540096B    11642     0
.
. etc etc
.
nfs4_access_cache 32  0  0  0B 0 0
client_handle4_cache  16  0  0  0B 0 0
nfs4_ace4vals_cache   36  0  0  0B 0 0
nfs4_ace4_list_cache 176  0  0  0B 0 0
NFS_idmap_cache   24  0  0  0B 0 0
pty_map   48  0 64   4096B 1 0
-- - -- -- -- - -
Total [hat_memload]974848B   1306984 0
Total [kmem_msb] 56860672B506215 0
Total [kmem_va]  78249984B 12180 0
Total [kmem_default] 76316672B   8546762 0
Total [kmem_io_1G]   36712448B  8643 0
Total [bp_map]  0B   212 0
Total [segkp] 6356992B186825 0
Total [umem_np] 0B   148 0
Total [ip_minor_arena_sa]  64B   180 0
Total [spdsock] 0B 1 0
Total [namefs_inodes]  64B18 0
-- - -- -- -- - -
.
. etc etc
.

Dennis




Re: [zfs-discuss] ZFS RaidZ recommendation

2010-04-08 Thread Ian Collins

On 04/ 9/10 10:48 AM, Erik Trimble wrote:

Well

The problem is (and this isn't just a ZFS issue) that resilver and scrub
times /are/ very bad for 1TB disks.  This goes directly to the problem
of redundancy - if you don't really care about resilver/scrub issues,
then you really shouldn't bother to use Raidz or mirroring.  It's pretty
much in the same ballpark.


That is, 1TB 3.5" drives have such long resilver/scrub times that with
ZFS, it's a good bet you can kill a second (or third) drive before you
can scrub or resilver in time to compensate for the already-failed one.
Put it another way, you get more errors before you have time to fix the
old ones, which effectively means you now can't fix errors before they
become permanent. Permanent errors = data loss.


   
That's one of the big problems with the "build it now, expand with bigger 
drives later" approach.  If you were designing from scratch with 2TB 
drives, you would be wise to consider triple-parity raid, where double 
parity has acceptable reliability for 1TB drives.  Each time drive 
capacity doubles (and performance does not), an extra level of parity is 
required.  I guess this extrapolates to one data and N parity drives...
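
For what it's worth, triple parity is just another vdev type at pool creation 
time (device names hypothetical):

  zpool create tank raidz3 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0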


--
Ian.



Re: [zfs-discuss] ZFS RaidZ recommendation

2010-04-08 Thread Daniel Carosone
On Thu, Apr 08, 2010 at 03:48:54PM -0700, Erik Trimble wrote:
 Well

To be clear, I don't disagree with you; in fact for a specific part of
the market (at least) and a large part of your commentary, I agree.  I
just think you're overstating the case for the rest.
 
 The problem is (and this isn't just a ZFS issue) that resilver and scrub
 times /are/ very bad for 1TB disks.  This goes directly to the problem
 of redundancy - if you don't really care about resilver/scrub issues,
 then you really shouldn't bother to use Raidz or mirroring.  It's pretty
 much in the same ballpark.

Sure, and that's why you have raidz3 now; also why multi-way mirrors
are getting more attention, as the drives are getting large enough
that capacities and redundancies previously only available via raidz
constructions can now be had with mirrors and a reasonable number of
spindles. 

Large drives (with the constraints you describe) certainly change the
deployment scenarios.  I don't agree that they shouldn't be deployed
at all, ever - which seems to be what you're saying.

Take 6x1TB in raidz2, replace with 6x2TB in three-way-mirror.  Chances
are, you've just improved performance.  I'm just trying to show it's
really not all that black and white.

As for error rates, this is something zfs should not be afraid
of. Indeed, many of us would be happy to get drives with less internal
ECC overhead and complexity for greater capacity, and tolerate the
resultant higher error rates, specifically for use with zfs (sector
errors, not overall drive failure, of course).  Even if it means I
need raidz4, and wind up with the same overall usable space, I may
prefer the redundancy across drives rather than within.

 That is, 1TB 3.5 drives have such long resilver/scrub times that with
 ZFS, it's a good bet you can kill a second (or third) drive before you
 can scrub or resilver in time to compensate for the already-failed one.
 Put it another way, you get more errors before you have time to fix the
 old ones, which effectively means you now can't fix errors before they
 become permanent. Permanent errors = data loss.

Again, potential zfs improvements could help here:
 - resilver in parallel for multiply redundant vdevs with multiple
   failures/replacements (currently, I think resilver restarts in this
   case?)
 - scrub a (top level) vdev at a time, rather than a whole pool. If I
   know I'm about to replace a drive, perhaps for capacity upgrade,
   I'll scrub first to minimise the chances of tripping over a latent
   error, especially on the previous drive i just replaced. No need to
   scrub other vdevs right now. 
 - scrub/resilver selectively by dataset, to allow higher priority
   data to be given better protection.

 For example, the 2TB 5900RPM 3.5 drives are (on average) over 2x as
 slow as the 1TB 7200RPM 3.5 drives for most operations. Access time is
 slower by 40%, and throughput is slower on by 30-50%.

Please, be fair and compare like with like -  say replacing 5400rpm
1TB drives.  Your same problem would apply if replacing 1TB 7200's
with 1TB 5400's; it has little to do with the capacity.  Indeed, at
the same rpm, the higher density has the potential to be faster.

 In any case, resilver/scrub times are becoming the dominant factor in
 reliability of these large drives.

Agreed; I'd argue they have been for some time (ie, even at the 1TB
size). 

 As a practical matter, small setups are for the most part not
 expandable/upgradable much, if at all. Buy what you need now, and plan
 on rebuying something new in 5-10 years, but don't think that what you
 put together now can be continuously upgraded for a decade. 

On this, I agree completely, even on a shorter time-scale (say 3-5
years). On each generation, repurpose the previous generation for
backup or something else as appropriate.  This applies to drives, and
to the boxes that house them.  Even so, leave yourself wiggle room
for upgrades and other unanticipated devlopments in the meantime where
you can.

--
Dan.



pgpLw78wUivGj.pgp
Description: PGP signature
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS RaidZ recommendation

2010-04-08 Thread Jason S
Well, I would like to thank everyone for their comments and ideas.

I finally have this machine up and running with Nexenta Community edition and 
am really liking the GUI for administering it. It suits my needs perfectly and 
is running very well. I ended up going with 2 X 7 RaidZ2 vdevs in one pool for 
a total capacity of 10 TB.

One thing I have noticed that seems a little different from my previous 
hardware raid controller (Areca) is that the data is not constantly being written 
to the spindles. For example, I am copying some large files to the array right now 
(approx 4 gigs a file) and my network performance is showing a transfer rate of 
75MB/s on average. When I physically watch the server I only see a 1-2 second 
flurry of activity on the drives, then about 10 seconds of no activity. Is this 
the nature of ZFS? 

Thanks for all the help!


Re: [zfs-discuss] ZFS RaidZ recommendation

2010-04-08 Thread Bob Friesenhahn

On Thu, 8 Apr 2010, Jason S wrote:
One thing i have noticed that seems a littler different from my 
previous hardware raid controller (Areca) is the data is not 
constantly being written to the spindles. For example i am copying 
some large files to the array right now (approx 4 gigs a file) and 
my network performance is showing a transfer rate on average of 
75MB/s. When i physically watch the server i only see a 1-2 second 
flury of activity on the drives then about 10 seconds of no 
activity. Is this the nature of ZFS?


Yes, this is the nature of ZFS.  ZFS batches up writes and writes them 
in bulk.  On a large memory system and with a very high write rate, up 
to 5 seconds worth of low-level write may be batched up.  With a slow 
write rate, up to 30 seconds of user-level writes may be batched up.


The reasons for doing this become obvious when you think about it a 
bit.  Zfs writes data as large transactions (transaction groups) and 
uses copy on write (COW).  Batching up the writes allows more 
full-blocks to be written, which decreases fragmentation, improves 
space allocation efficiency, improves write performance, and uses 
fewer precious IOPS.  The main drawback is that reads/writes are 
temporarily stalled during part of the TXG write cycle.
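
You can watch those transaction-group bursts directly, for example with 
(pool name illustrative):

  zpool iostat tank 1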


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] ZFS RaidZ recommendation

2010-04-08 Thread Richard Elling
On Apr 8, 2010, at 6:19 PM, Daniel Carosone wrote:
 
 As for error rates, this is something zfs should not be afraid
 of. Indeed, many of us would be happy to get drives with less internal
 ECC overhead and complexity for greater capacity, and tolerate the
 resultant higher error rates, specifically for use with zfs (sector
 errors, not overall drive failure, of course).  Even if it means I
 need raidz4, and wind up with the same overall usable space, I may
 prefer the redundancy across drives rather than within.

Disagree. Reliability trumps availability every time. And the problem
with the availability provided by redundancy techniques is that the
amount of work needed to recover is increasing.  This work is limited
by latency and HDDs are not winning any latency competitions anymore.

To combat this, some vendors are moving to an overprovision model.
Current products deliver multiple disks in a single FRU with builtin, 
fine-grained redundancy. Because the size and scope of the FRU is 
bounded, the recovery can be optimized and the reliability of the FRU 
is increased. From a market perspective, these solutions are not suitable 
for the home user because the size and cost of the FRU is high. It remains 
to be seen how such products survive in the enterprise space as HDDs
become relegated to backup roles.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 







Re: [zfs-discuss] ZFS RaidZ recommendation

2010-04-08 Thread Daniel Carosone
On Thu, Apr 08, 2010 at 08:36:43PM -0700, Richard Elling wrote:
 On Apr 8, 2010, at 6:19 PM, Daniel Carosone wrote:
  
  As for error rates, this is something zfs should not be afraid
  of. Indeed, many of us would be happy to get drives with less internal
  ECC overhead and complexity for greater capacity, and tolerate the
  resultant higher error rates, specifically for use with zfs (sector
  errors, not overall drive failure, of course).  Even if it means I
  need raidz4, and wind up with the same overall usable space, I may
  prefer the redundancy across drives rather than within.
 
 Disagree. Reliability trumps availability every time. 

Often, but not sure about every. The economics shift around too fast
for such truisms to be reliable, and there's always room for an
upstart (often in a niche) to make great economic advantages out of
questioning this established wisdom.  The oft-touted example is
Google's servers, but there are many others.

 And the problem
 with the availability provided by redundancy techniques is that the
 amount of work needed to recover is increasing.  This work is limited
 by latency and HDDs are not winning any latency competitions anymore.

We're talking about generalities; the niche can be very important to
enable these kinds of tricks by holding some of the other troubling
variables constant (e.g. application/programming platform).  It
doesn't really matter whether you're talking about 1 dual-PSU server
vs 2 single-PSU servers, or whole datacentres - except that solid
large-scale diversity tends to lessen your concentration (and perhaps
spend) on internal redundancy within a datacentre (or disk).

Put another way: some application niches are much more able to adopt
redundancy techniques that don't require so much work. 

Again, for the google example: if you're big and diverse enough that
shifting load between data centres on failure is no work, then
moving the load for other reasons is viable too - such as moving
to where it's night time and power and cooling are cheaper.  The work
has been done once, up front, and the benefits are repeatable.

 To combat this, some vendors are moving to an overprovision model.
 Current products deliver multiple disks in a single FRU with builtin, 
 fine-grained redundancy. Because the size and scope of the FRU is 
 bounded, the recovery can be optimized and the reliability of the FRU 
 is increased. 

That's not new.  Past examples in the direct experience of this
community include the BladeStor and SSA-1000 storage units, which
aggregated disks into failure domains (e.g. drawers) for a (big)
density win.

--
Dan.



Re: [zfs-discuss] ZFS RaidZ recommendation

2010-04-08 Thread Eric Andersen
I thought I might chime in with my thoughts and experiences.  For starters, I 
am very new to both OpenSolaris and ZFS, so take anything I say with a grain of 
salt.  I have a home media server / backup server very similar to what the OP 
is looking for.  I am currently using 4 x 1TB and 4 x 2TB drives set up as 
mirrors.  Tomorrow, I'm going to wipe my pool and go to 4 x 1TB and 4 x 2TB in 
two 4 disk raidz's.

I backup my pool to 2 external 2TB drives that are simply striped using zfs 
send/receive followed by a scrub.  As of right now, I only have 1.58TB of 
actual data.  ZFS send over USB2.0 capped out at 27MB/s.  The scrub for 1.5TB 
of backup data on the USB drives took roughly 14 hours.  As needed, I'll 
destroy the backup pool and add more drives as needed.  I looked at a lot of 
different options for external backup, and decided to go with cheap (USB).
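
In case it helps anyone, the send/receive-then-scrub cycle is roughly the 
following (names illustrative, not my exact script):

  zfs snapshot -r tank@weekly
  zfs send -R tank@weekly | zfs receive -Fdu backup
  zpool scrub backup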

I am using 1TB and 2TB WD Caviar Green drives for my storage pool, which are 
about the cheapest and probably close to the slowest consumer drives you can 
buy.  I've only been at this for about 4-5 months now, and thankfully I haven't 
had a drive fail yet so I cannot attest to resilver times.  I do weekly scrubs 
on both my rpool and storage pool via a script called through cron.  I just set 
things up to do scrubs during a timeframe when I know I'm not going to be using 
it for anything.  I can't recall the exact times it took for the scrubs to 
complete, but it wasn't anything that interfered with my usage (yet...)

The vast majority of any streaming media I do (up to 1080p) is over wireless-n. 
 Occasionally, I will get stuttering (on the HD stuff), but I haven't looked 
into whether it was due to a network or I/O bottleneck.  Personally, I would 
think it was due to network traffic, but that is pure speculation.  The vast 
majority of the time, I don't have any issues whatsoever.  The main point I'm 
trying to make is that I'm not I/O bound at this point.  I'm also not streaming 
to 4 media players simultaneously.

I currently have far more storage space than I am using.  When I do end up 
running low on space, I plan to start with replacing the 1TB drives with, 
hopefully much cheaper at that point, 2TB drives.  If using 2 x raidz vdevs 
doesn't work well for me, I'll go back to mirrors and start looking at other 
options for expansion.

I find Erik Trimble's statement regarding a 1TB limit on drives to be a very 
bold one.  I don't have the knowledge or the inclination to argue the 
point, but I am betting that we will continue to see advances in storage 
technology on par with what we have seen in the past.  If we still are capped 
out at 2TB as the limit for a physical device in 2 years, I solemnly pledge now 
that I will drink a six-pack of beer in his name.  Again, I emphasize that this 
assumption is not based on any sort of knowledge other than past experience 
with the ever growing storage capacity of physical disks.

My personal advice to the OP would be to set up three 4 x 1TB raidz vdevs, and 
investing in a reasonable backup solution.  If you have to use the last two 
drives, set them up as a mirror.  Redundancy is great, but in my humble 
opinion, for the home user that is using cheap hardware, it's not as critical 
as performance and available storage space.  That particular configuration 
would give you more IOPS than just two raidz2 vdevs, with slightly less 
redundancy and slightly more storage space.  For my own needs, I don't see 
redundancy as being as high a priority as IOPS and available storage space.  
Everyone has to make their own decision on that, and the ability of ZFS to 
accommodate a vast array of different individual needs is a big part of what 
makes it such an excellent filesystem.  With a solid backup, there is really no 
reason you can't redesign your pool at a later date if need be.  Try out what 
you think will work best, and if that configuration doesn't work well in 
some way, adjust and move on...

There are a few different schools of thought on how to backup ZFS filesystems.  
ZFS send/receive works for me, but there are certainly weaknesses with using it 
as a backup solution (as has been much discussed on this list.)

Hopefully, in the future it will be possible to remove vdevs from a pool and to 
restripe data across a pool.  Those particular features would certainly be 
great for me.

Just my thoughts.

Eric


Re: [zfs-discuss] ZFS RaidZ recommendation

2010-04-08 Thread Richard Elling
On Apr 8, 2010, at 9:06 PM, Daniel Carosone wrote:

 On Thu, Apr 08, 2010 at 08:36:43PM -0700, Richard Elling wrote:
 On Apr 8, 2010, at 6:19 PM, Daniel Carosone wrote:
 
 As for error rates, this is something zfs should not be afraid
 of. Indeed, many of us would be happy to get drives with less internal
 ECC overhead and complexity for greater capacity, and tolerate the
 resultant higher error rates, specifically for use with zfs (sector
 errors, not overall drive failure, of course).  Even if it means I
 need raidz4, and wind up with the same overall usable space, I may
 prefer the redundancy across drives rather than within.
 
 Disagree. Reliability trumps availability every time. 
 
 Often, but not sure about every.

I am quite sure.

 The economics shift around too fast
 for such truisms to be reliable, and there's always room for an
 upstart (often in a niche) to make great economic advantages out of
 questioning this established wisdom.  The oft-touted example  is
 google's servers, but there are many others.

A small change in reliability for massively parallel systems has a
significant, multiplicative effect on the overall system.  Companies
like Google weigh many factors, including component reliability,
when designing the systems.

 
 And the problem
 with the availability provided by redundancy techniques is that the
 amount of work needed to recover is increasing.  This work is limited
 by latency and HDDs are not winning any latency competitions anymore.
 
 We're talking about generalities; the niche can be very important to
 enable these kinds of tricks by holding some of the other troubling
 variables constant (e.g. application/programming platform).  It
 doesn't really matter whether you're talking about 1 dual-PSU server
 vs 2 single-PSU servers, or whole datacentres - except that solid
 large-scale diversity tends to lessen your concentration (and perhaps
 spend) on internal redundancy within a datacentre (or disk).
 
 Put another way: some application niches are much more able to adopt
 redundancy techniques that don't require so much work. 

At the other extreme, if disks were truly reliable, the only RAID that
would matter is RAID-0.

 Again, for the google example: if you're big and diverse enough that
 shifting load between data centres on failure is no work, then
 moving the load for other reasons is viable too - such as moving
 to where it's night time and power and cooling are cheaper.  The work
 has been done once, up front, and the benefits are repeatable.

Most folks never even get to a decent disaster recovery design, let
alone a full datacenter mirror :-(

 To combat this, some vendors are moving to an overprovision model.
 Current products deliver multiple disks in a single FRU with builtin, 
 fine-grained redundancy. Because the size and scope of the FRU is 
 bounded, the recovery can be optimized and the reliability of the FRU 
 is increased. 
 
 That's not new.  Past examplees in the direct experience of this
 community include the BladeStor and SSA-1000 storage units, which
 aggregated disks into failure domains (e.g. drawers) for a (big)
 density win.

Nope. The FRUs for BladeStor and SSA-100 were traditional disks.
To see something different you need to rethink the disk -- something
like a Xiotech ISE.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com 




