Re: [zfs-discuss] sharenfs option rw,root=host1 don't take effect
If you're getting nobody:nobody ownership on an NFS mount, you have an NFS version mismatch (usually between v3 and v4). To work around it, use the following mount options on the client: hard,bg,intr,vers=3. For example:

  mount -o hard,bg,intr,vers=3 server:/pool/zfs /mountpoint
Re: [zfs-discuss] compression property not received
Hi Daniel,

D'oh... I found a related bug when I looked at this yesterday, but I didn't think it was your problem because you didn't get a busy message. See this RFE:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6700597

Cindy

On 04/07/10 17:59, Daniel Bakken wrote:

We have found the problem. The mountpoint property on the sender was at one time changed from the default, then later changed back to defaults using zfs set instead of zfs inherit. Therefore, zfs send included these local non-default properties in the stream, even though the local properties are effectively set at defaults. This caused the receiver to stop processing subsequent properties in the stream because the mountpoint isn't valid on the receiver.

I tested this theory with a spare zpool. First I used zfs inherit mountpoint promise1/archive to remove the local setting (which was exactly the same value as the default). This time the compression=gzip property was correctly received.

It seems like a bug to me that one failed property in a stream prevents the rest from being applied. I should have used zfs inherit, but it would be best if zfs receive handled failures more gracefully and attempted to set as many properties as possible.

Thanks to Cindy and Tom for their help.

Daniel

On Wed, Apr 7, 2010 at 2:31 AM, Tom Erickson thomas.erick...@oracle.com wrote:

Now I remember that 'zfs receive' used to give up after the first property it failed to set. If I'm remembering correctly, then, in this case, if the mountpoint was invalid on the receive side, 'zfs receive' would not even try to set the remaining properties. I'd try the following in the source dataset:

  zfs inherit mountpoint promise1/archive

to clear the explicit mountpoint and prevent it from being included in the send stream. Later set it back the way it was. (Soon there will be an option to take care of that; see CR 6883722, want 'zfs recv -o prop=value' to set initial property values of received dataset.) Then see if you receive the compression property successfully.
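For anyone who hits the same thing, a rough way to spot and clear the offending local property before sending (the snapshot name and receive-side dataset below are placeholders, and the exact send options will depend on what you are replicating):

  # list properties whose source is 'local' on the sending dataset
  zfs get -s local all promise1/archive

  # clear a local mountpoint that merely repeats the default
  zfs inherit mountpoint promise1/archive

  # then resend; the remaining properties (e.g. compression) should apply
  zfs send promise1/archive@snap | ssh receiver zfs receive -d backup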
Re: [zfs-discuss] compression property not received
On 08 April, 2010 - Cindy Swearingen sent me these 2,6K bytes:

Hi Daniel, D'oh... I found a related bug when I looked at this yesterday but I didn't think it was your problem because you didn't get a busy message. See this RFE: http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6700597

Solaris 10 'man zfs', under 'receive':

  -u    File system that is associated with the received stream is not mounted.

/Tomas
--
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
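A related knob worth knowing about: on releases that have it, the -u flag keeps the received file system from being mounted at all, which sidesteps an invalid receive-side mountpoint. A minimal sketch (dataset and snapshot names invented); whether it also avoids the property-handling problem discussed above is release-dependent, so treat it as something to test:

  zfs send promise1/archive@snap | zfs receive -u backup/archive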
Re: [zfs-discuss] ZFS RaidZ recommendation
On Thu, 8 Apr 2010, Erik Trimble wrote:

While that's great in theory, there's getting to be a consensus that 1TB 7200RPM 3.5" SATA drives are really going to be the last usable capacity.

Agreed. The 2.5" form factor is rapidly emerging. I see that enterprise 6-Gb/s SAS drives are available with 600GB capacity already. It won't be long until they also reach your 1TB barrier.

So, while it's nice that you can indeed seamlessly swap up drive sizes (and your recommendation of using 2x7 helps that process), in reality, it's not a good idea to upgrade from his existing 1TB drives. It would make more sense to add a new chassis, or replace the existing chassis with one which supports more (physically smaller) drives. While products are often sold based on their ability to be upgraded, upgrades often don't make sense. Now, in the Real Near Future when we have 1TB+ SSDs that are 1 cent/GB, well, then, it will be nice to swap up. But not until then...

I don't see that happening any time soon. FLASH is close to hitting the wall on device geometries, and tri-level and quad-level cells only get you so far. A new type of device will need to be invented.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] ZFS RaidZ recommendation
On Apr 8, 2010, at 8:52 AM, Bob Friesenhahn wrote:

On Thu, 8 Apr 2010, Erik Trimble wrote: While that's great in theory, there's getting to be a consensus that 1TB 7200RPM 3.5" SATA drives are really going to be the last usable capacity.

I doubt that 1TB (or even 1.5TB) 3.5" disks are being manufactured anymore. These have dropped to the $100 price barrier already. 2TB are hanging out around $150.

Agreed. The 2.5" form factor is rapidly emerging. I see that enterprise 6-Gb/s SAS drives are available with 600GB capacity already. It won't be long until they also reach your 1TB barrier.

Yep, seeing some nice movement in this space.

So, while it's nice that you can indeed seamlessly swap up drive sizes (and your recommendation of using 2x7 helps that process), in reality, it's not a good idea to upgrade from his existing 1TB drives. It would make more sense to add a new chassis, or replace the existing chassis with one which supports more (physically smaller) drives. While products are often sold based on their ability to be upgraded, upgrades often don't make sense. Now, in the Real Near Future when we have 1TB+ SSDs that are 1 cent/GB, well, then, it will be nice to swap up. But not until then... I don't see that happening any time soon. FLASH is close to hitting the wall on device geometries and tri-level and quad-level only gets you so far. A new type of device will need to be invented.

It is a good idea to not bet against Moore's Law :-) The current state of the art is an 8GB (byte, not bit) MLC flash chip which is 162 mm^2. In the space of a 2.5" disk, with some clever packaging, you could pack dozens of TB.

-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
Re: [zfs-discuss] sharenfs option rw,root=host1 don't take effect
On 12 mar 2010, at 03.58, Damon Atkins wrote:

... Unfortunately DNS spoofing exists, which means forward lookups can be poisoned. And IP address spoofing, and... The best (maybe only) way to make NFS secure is NFSv4 and Kerb5 used together.

Amen! DNS is NOT an authentication system! IP is NOT an authentication system!

I don't think the (rw|root|...)=(hostname|address) kind of functionality has any place in a system from after the 80's, when the world got connected and security became an issue for the masses. It should be an extra feature marked with a big "insecure" label that you should have to enable through a very cumbersome process.

Instead, use Kerberos, or if that is not possible, at least use IPsec to make IP address spoofing harder.

/ragge
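Since the thread keeps coming back to this: on OpenSolaris the Kerberos requirement can be pushed straight into the sharenfs property, so the export itself refuses non-Kerberos clients. A minimal sketch, assuming the server is already joined to a Kerberos realm with an nfs/ service principal (the dataset name is invented):

  # require Kerberos authentication for the share
  zfs set sharenfs=sec=krb5,rw tank/export
  # or sec=krb5i for integrity, sec=krb5p for privacy as well

Without the Kerberos plumbing in place first, clients will simply fail to mount, so this is the last step rather than the first.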
[zfs-discuss] ZFS monitoring - best practices?
We're starting to grow our ZFS environment and really need to start standardizing our monitoring procedures. OS tools are great for spot troubleshooting and sar can be used for some trending, but we'd really like to tie this into an SNMP-based system that can generate graphs for us (via RRD or other). Whether or not we do this via our standard enterprise monitoring tool or write some custom scripts I don't really care... but I do have the following questions:

- What metrics are you guys tracking? I'm thinking:
  - IOPS
  - ZIL statistics
  - L2ARC hit ratio
  - Throughput
  - IO Wait (I know there's probably a better term here)

- How do you gather this information? Some but not all is available via SNMP. Has anyone written a ZFS-specific MIB or plugin to make the info available via the standard Solaris SNMP daemon? What information is available only via zdb/mdb?

- Anyone have any RRD-based setups for monitoring their ZFS environments they'd be willing to share or talk about?

Thanks in advance,
Ray
Re: [zfs-discuss] ZFS monitoring - best practices?
Ray,

Here is my short list of Performance Metrics I track on 7410 Performance Rigs via 7000 Analytics.

Cheers,
Joel.

m:analytics datasets ls
Datasets:
DATASET      STATE    INCORE  ONDISK  NAME
dataset-000  active    1016K   75.9M  arc.accesses[hit/miss]
dataset-001  active     390K   37.9M  arc.l2_accesses[hit/miss]
dataset-002  active     242K   13.7M  arc.l2_size
dataset-003  active     242K   13.7M  arc.size
dataset-004  active     958K   86.1M  arc.size[component]
dataset-005  active     242K   13.7M  cpu.utilization
dataset-006  active     477K   46.2M  cpu.utilization[mode]
dataset-007  active     648K   59.7M  dnlc.accesses[hit/miss]
dataset-008  active     242K   13.7M  fc.bytes
dataset-009  active     242K   13.7M  fc.ops
dataset-010  active     242K   12.8M  fc.ops[latency]
dataset-011  active     242K   12.8M  fc.ops[op]
dataset-012  active     242K   13.7M  ftp.kilobytes
dataset-013  active     242K   12.8M  ftp.kilobytes[op]
dataset-014  active     242K   13.7M  http.reqs
dataset-015  active     242K   12.8M  http.reqs[latency]
dataset-016  active     242K   12.8M  http.reqs[op]
dataset-017  active     242K   13.7M  io.bytes
dataset-018  active     439K   43.7M  io.bytes[op]
dataset-019  active     308K   29.6M  io.disks[utilization=95][disk]
dataset-020  active    2.93M   87.2M  io.disks[utilization]
dataset-021  active     242K   13.7M  io.ops
dataset-022  active    9.85M    274M  io.ops[disk]
dataset-023  active    20.0M    827M  io.ops[latency]
dataset-024  active     438K   43.6M  io.ops[op]
dataset-025  active     242K   13.7M  iscsi.bytes
dataset-026  active     242K   13.7M  iscsi.ops
dataset-027  active    1.45M   91.1M  iscsi.ops[latency]
dataset-028  active     248K   14.8M  iscsi.ops[op]
dataset-029  active     242K   13.7M  ndmp.diskkb
dataset-030  active     242K   13.8M  nfs2.ops
dataset-031  active     242K   12.8M  nfs2.ops[latency]
dataset-032  active     242K   13.8M  nfs2.ops[op]
dataset-033  active     242K   13.8M  nfs3.ops
dataset-034  active    8.82M    163M  nfs3.ops[latency]
dataset-035  active     327K   18.1M  nfs3.ops[op]
dataset-036  active     242K   13.8M  nfs4.ops
dataset-037  active    2.31M   97.8M  nfs4.ops[latency]
dataset-038  active     311K   17.2M  nfs4.ops[op]
dataset-039  active     242K   13.7M  nic.kilobytes
dataset-040  active     970K   84.5M  nic.kilobytes[device]
dataset-041  active     943K   77.1M  nic.kilobytes[direction=in][device]
dataset-042  active     457K   31.1M  nic.kilobytes[direction=out][device]
dataset-043  active     503K   49.1M  nic.kilobytes[direction]
dataset-044  active     242K   13.7M  sftp.kilobytes
dataset-045  active     242K   12.8M  sftp.kilobytes[op]
dataset-046  active     242K   13.7M  smb.ops
dataset-047  active     242K   12.8M  smb.ops[latency]
dataset-048  active     242K   13.7M  smb.ops[op]
dataset-049  active     242K   12.8M  srp.bytes
dataset-050  active     242K   12.8M  srp.ops[latency]
dataset-051  active     242K   12.8M  srp.ops[op]

On 04/08/10 14:06, Ray Van Dolson wrote:

We're starting to grow our ZFS environment and really need to start standardizing our monitoring procedures. [...]

- What metrics are you guys tracking? I'm thinking: IOPS, ZIL statistics, L2ARC hit ratio, Throughput, IO Wait (I know there's probably a better term here)

Utilize Latency instead of IO Wait.

- How do you gather this information? Some but not all is available via SNMP. Has anyone written a ZFS specific MIB or plugin to make the info available via the standard Solaris SNMP daemon? What information is available only via zdb/mdb?

On 7000 appliances, this is easy via Analytics.
On Solaris, you need to pull data from kstats and/or DTrace scripts and then archive the data in a similar manner...

- Anyone have any RRD-based setups for monitoring their ZFS environments they'd be willing to share or talk about?

Thanks in advance,
Ray

--
Joel Buckley | +1.303.272.5556
Oracle Open Storage Systems
500 Eldorado Blvd
Broomfield, CO 80021-3400
http://www.oracle.com/
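To expand on the kstat suggestion for plain Solaris/OpenSolaris boxes: most of the ARC and L2ARC numbers those analytics datasets expose are also visible under the zfs:0:arcstats kstat, so a cron-driven collector can be very small. A rough sketch (the output path and the since-boot ratio are just illustrations):

  #!/bin/sh
  # dump the raw ARC kstats; each line is module:instance:name:statistic value
  kstat -p zfs:0:arcstats > /var/tmp/arcstats.$(date +%Y%m%d%H%M)

  # example: ARC hit ratio since boot
  hits=$(kstat -p zfs:0:arcstats:hits | awk '{print $2}')
  misses=$(kstat -p zfs:0:arcstats:misses | awk '{print $2}')
  echo "ARC hit ratio: $(echo "scale=2; 100*$hits/($hits+$misses)" | bc)%"

Feeding the hits and misses counters into rrdtool as COUNTER data sources gives the per-interval ratio rather than the since-boot one.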
Re: [zfs-discuss] ZFS RaidZ recommendation
dm == David Magda dma...@ee.ryerson.ca writes:
bf == Bob Friesenhahn bfrie...@simple.dallas.tx.us writes:

dm> OP may also want to look into the multi-platform pkgsrc for
dm> third-party open source software:

+1. jucr.opensolaris.org seems to be based on RPM, which is totally fail. RPM is the oldest, crappiest, most frustrating thing! Packages are always frustrating, but pkgsrc is designed to isolate itself from the idiosyncrasies of each host platform, through factoring. Its major weakness is upgrades, but with Solaris you can use zones and snapshots to make this a lot less painful:

* run their ``bulk build'' inside a zone. The ``bulk build'' feature is like the jucr: it downloads stuff from all over the internet and builds it, generates a tree of static web pages to report its results, plus a repository of binary packages. Like jucr it does not build packages on an ordinary machine, but in a well-specified minimal environment which has installed only the packages named as build dependencies---between each package build the bulk scripts remove all not-needed packages. Thus you really need a separate machine, like a zone, for bulk building. There is a non-bulk way to build pkgsrc, but it's not as good. Except that unlike the jucr, the implementation of the bulk build is included in the pkgsrc distribution and supported, and ordinary people who run pkgsrc are expected to use it themselves.

* clone a zone, upgrade the packages inside it using the binary packages produced by the bulk build, and cut services over to the clone only after everything's working right.

Both of these things are a bit painful with pkgsrc on normal systems and much easier with zones and ZFS. The type of upgrade that's guaranteed to work on pkgsrc is:

* take a snapshot of /usr/pkgsrc, which *is* pkgsrc: all packages' build instructions, and no binaries under this tree

* ``bulk build''

* replace all your current running packages with the new binary packages in the repository the bulk build made.

In practice people usually rebuild less than that to upgrade a package, and it often works anyway, but if it doesn't work then you're left wondering ``is pkgsrc just broken again, or will a more thorough upgrade actually work?''

The coolest immediate trick is that you can run more than one bulk build with different starting options, ex SunPro vs gcc, 32 vs 64-bit. The first step of using pkgsrc is to ``bootstrap'' it, and during bootstrap you choose the C compiler and also whether to use the host's or pkgsrc's versions of things like perl and pax and awk. You also choose prefixes for /usr, /var, /etc and /var/db/pkg that will isolate all pkgsrc files from the rest of the system. In general this level of pathname flexibility is only achievable at build time, so only a source-based package system can pull off this trick. The corollary is that you can install more than one pkgsrc on a single system and choose between them with PATH. pkgsrc is generally designed to embed full pathnames of its shared libs, so this has got a good shot of working. You could have /usr/pkg64 and /usr/pkg32, or /usr/pkg-gcc and /usr/pkg-spro. pkgsrc will also build pkg_add, pkg_info, and so on under /usr/pkg-gcc/bin which will point to /var/db/pkg-gcc or whatever to track what's installed, so you can have more than one pkg_add on a single system pointing to different sets of directories.
You could also do weirder things like use different paths every time you do a bulk build, like /usr/pkg-20100130 and /usr/pkg-20100408, although it's very strange to do that so far. It would also be possible to use ugly post-Unix directory layouts, ex /pkg/marker/usr/bin and /pkg/marker/etc and /pkg/marker/var/db/pkg, and then make /pkg/marker into a ZFS that could be snapshotted and rolled back. It is odd in pkgsrc world to put the /var/db/pkg tracking database of what's installed into the same subtree as the installed stuff itself, but in the context of ZFS it makes sense to do that. However the pathnames will be fixed for a given set of binary packages, so whatever you do with the ZFS, the results of bulk builds sharing a common ``bootstrap'' phase would have to stay mounted on the same directory. You cannot clone something to a new directory then add/remove packages. There was an attempt called ``pkgviews'' to do something like this, but I think it's ultimately doomed because the idea's not compartmentalized enough to work with every package.

In general pkgsrc gives you a toolkit for dealing with suboptimal package trees where a lot of shit is broken. It's well-adapted to the ugly modern way we run Unixes, sealed, with only web facing the users, because you can dedicate an entire bulk build to one user-facing app. If you have an app that needs a one-line change to openldap, pkgsrc makes it easy to perform this 1-line change and rebuild 100 interdependent packages linked to your mutant library
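For the multiple-prefix trick described above, the bootstrap invocation looks roughly like the following. The option names are from memory of the pkgsrc bootstrap script and may differ between pkgsrc branches, so check bootstrap --help first:

  cd /usr/pkgsrc/bootstrap
  # one tree built with the Sun compiler...
  ./bootstrap --prefix /usr/pkg-spro \
      --pkgdbdir /var/db/pkg-spro --compiler sunpro
  # ...and a second, independent tree built with gcc
  ./bootstrap --prefix /usr/pkg-gcc \
      --pkgdbdir /var/db/pkg-gcc --compiler gcc

Afterwards you pick a tree by putting /usr/pkg-gcc/bin or /usr/pkg-spro/bin first in PATH; each tree's pkg_add and pkg_info only know about their own /var/db/pkg-* database.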
Re: [zfs-discuss] sharenfs option rw,root=host1 don't take effect
rs == Ragnar Sundblad ra...@csc.kth.se writes:

rs> use IPSEC to make IP address spoofing harder.

IPsec with channel binding is win, but not until SA's are offloaded to the NIC and all NIC's can do IPsec AES at line rate. Until this happens you need to accept there will be some protocols used on SAN that are not on ``the Internet'' and for which your axiomatic security declarations don't apply, where the relevant features are things like doing the DNS lookup in the proper .rhosts manner and doing uRPF, minimum, and more optimistically stop adding new protocols without IPv6 support, and start adding support for multiple IP stacks / VRF's.

If saying ``the only way to do any given thing is twicecrypted kerberized ipsec within dnssec namespaces'' is blocking doing these immediate plaintext things that allow a host to participate in both the internet and a SAN at once, well that's no good either.
Re: [zfs-discuss] L2ARC Workingset Size
Hi Richard

Thanks for your comments. OK, ZFS is COW, I understand, but this also means a waste of valuable space on my L2ARC SSD device; more than 60% of the space is consumed by COW!!! I do not get it?

On Sat, Apr 3, 2010 at 11:35 PM, Richard Elling richard.ell...@gmail.com wrote:

On Apr 1, 2010, at 9:41 PM, Abdullah Al-Dahlawi wrote:

Hi all

I ran a workload that reads/writes within 10 files; each file is 256M, i.e. (10 * 256M = 2.5GB total dataset size). I have set the ARC max size to 1 GB in the /etc/system file.

In the worst case, let us assume that the whole dataset is hot, meaning my workingset size = 2.5GB.

My SSD flash size = 8GB and is being used for L2ARC. No slog is used in the pool. My file system recordsize = 8K, meaning 2.5% of 8GB is used for the L2ARC directory in ARC, which ultimately means that the available ARC is 1024M - 204.8M = 819.2M (am I right?) -- this is the worst case.

Now the question... After running the workload for 75 minutes, I have noticed that the L2ARC device has grown to 6 GB!!!

You're not interpreting the values properly, see below.

What is in L2ARC beyond my 2.5GB workingset?? Something else has been added to L2ARC.

ZFS is COW, so modified data is written to disk and the L2ARC.

Here is a 5 minute interval of zpool iostat [snip]

Also, a full kstat ZFS for a 5 minute interval [snip]

module: zfs    instance: 0
name: arcstats    class: misc
  c                             1073741824
  c_max                         1073741824    <-- max ARC size is limited to 1GB
  c_min                          134217728
  crtime                      28.083178473
  data_size                      955407360
  deleted                           966956
  demand_data_hits                  843880
  demand_data_misses                452182
  demand_metadata_hits               68572
  demand_metadata_misses              5737
  evict_skip                         82548
  hash_chain_max                        18
  hash_chains                        61732
  hash_collisions                  1444874
  hash_elements                     329553
  hash_elements_max                 329561
  hdr_size                        46553328
  hits                              978241
  l2_abort_lowmem                        0
  l2_cksum_bad                           0
  l2_evict_lock_retry                    0
  l2_evict_reading                       0
  l2_feeds                            4738
  l2_free_on_write                     184
  l2_hdr_size                     17024784    <-- size of L2ARC headers is approximately 17MB
  l2_hits                           252839
  l2_io_error                            0
  l2_misses                         203767
  l2_read_bytes                 2071482368
  l2_rw_clash                           13
  l2_size                       2632226304    <-- currently, there is approximately 2.5GB in the L2ARC
  l2_write_bytes                6486009344    <-- total amount of data written to L2ARC since boot is 6+ GB
  l2_writes_done                      4127
  l2_writes_error                        0
  l2_writes_hdr_miss                    21
  l2_writes_sent                      4127
  memory_throttle_count                  0
  mfu_ghost_hits                    120524
  mfu_hits                          500516
  misses                            468227
  mru_ghost_hits                     61398
  mru_hits                          412112
  mutex_miss                           511
  other_size                      56325712
  p                              775528448
  prefetch_data_hits                 50804
  prefetch_data_misses                7819
  prefetch_metadata_hits             14985
  prefetch_metadata_misses            2489
  recycle_miss                       13096
  size                          1073830768    <-- ARC size is 1GB

The best way to understand these in detail is to look at the source, which is nicely commented. The L2ARC design is commented near http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#3590

-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com

--
Abdullah Al-Dahlawi
PhD Candidate
George Washington University
Department of Electrical & Computer Engineering
Check The Fastest 500 Super Computers Worldwide http://www.top500.org/list/2009/11/100
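As a side note for anyone following along at home, the counters Richard annotates above can be pulled without mdb; a small sketch, with nothing in it specific to this particular workload:

  # current L2ARC payload and header overhead, in bytes
  kstat -p zfs:0:arcstats:l2_size zfs:0:arcstats:l2_hdr_size

  # rough L2ARC hit ratio since boot
  l2h=$(kstat -p zfs:0:arcstats:l2_hits | awk '{print $2}')
  l2m=$(kstat -p zfs:0:arcstats:l2_misses | awk '{print $2}')
  echo "L2ARC hit ratio: $(echo "scale=2; 100*$l2h/($l2h+$l2m)" | bc)%"

l2_write_bytes is cumulative since boot, which is why it can exceed both the device size and the working set, while l2_size is the figure that reflects what is actually cached right now.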
Re: [zfs-discuss] ZFS RaidZ recommendation
On Thu, Apr 08, 2010 at 12:14:55AM -0700, Erik Trimble wrote:

Daniel Carosone wrote: Go with the 2x7 raidz2. When you start to really run out of space, replace the drives with bigger ones.

While that's great in theory, there's getting to be a consensus that 1TB 7200RPM 3.5" SATA drives are really going to be the last usable capacity.

I dunno. The 'forces' and issues you describe are real, but 'usable' depends very heavily on the user's requirements.

For example, a large amount of the extra space available on a larger drive may be very rarely accessed in normal use (scrubs and resilvers aside). In the OP's example of an ever-expanding home media collection, much of it will never or very rarely get re-watched. Another common use for the extra space is simply storing more historical snapshots, against the unlikely future need to access them. For such data, speed is really not a concern at all.

For the subset of users for whom these forces are not overwhelming for real usage, that leaves scrubs and resilvers. There is room for improvement in zfs here, too - a more sequential streaming access pattern would help.

To me, the biggest issue you left unmentioned is the problem of backup. There's little option for backing up these larger drives, other than more of the same drives. In turn, lots of the use such drives will be put to, is for backing up other data stores, and there again, the usage pattern fits the above profile well.

Another usage pattern we may see more of, and that helps address some of the performance issues, is this. Say I currently have 2 pools of 1TB disks, one as a backup for the other. I want to expand the space. I replace all the disks with 2TB units, but I also change my data distribution as it grows: now, each pool is to be at most half-full of data, and the other half is used as a backup of the opposite pool. ZFS send is fast enough that the backup windows are short, and I now have effectively twice as many spindles in active service.

[..] it looks like hard drives are really at the end of their advancement, as far as capacities per drive go.

The challenges are undeniable, but that's way too big a call. Those are words you will regret in future; at least, I hope the future will be one in which those words are regrettable. :-)

1TB drives currently have excessively long resilver time, inferior reliability (for the most part), and increased power consumption.

Yes, for the most part. However, a 2TB drive has dramatically less power consumption than 2x1TB drives (and less of other valuable resources, like bays and controller slots).

I'd generally recommend that folks NOT step beyond the 1TB capacity at the 3.5" hard drive format.

A general recommendation is fine, and this is one I agree with for many scenarios. At least, I'd recommend that folks look more closely at alternatives using 2.5" drives and SAS expander bays than they might otherwise.

So, while it's nice that you can indeed seamlessly swap up drive sizes (and your recommendation of using 2x7 helps that process), in reality, it's not a good idea to upgrade from his existing 1TB drives.

So what does he do instead, when he's running out of space and 1TB drives are hard to come by? The advice still stands, as far as I'm concerned: do something now, that will leave you room for different expansion choices later - and evaluate the best expansion choice later, when the parameters of the time are known.

-- Dan.
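For the mutual-backup layout Dan describes above (each pool half data, half backup of the other), the send/receive loop is straightforward; a hedged sketch, with pool, dataset and snapshot names invented for illustration:

  # take a new snapshot on the primary side
  zfs snapshot -r poolA/data@2010-04-08

  # first full copy into the other pool's backup area
  zfs send -R poolA/data@2010-04-08 | zfs receive -duF poolB/backup-of-A

  # next run only sends the delta between the last two snapshots
  zfs snapshot -r poolA/data@2010-04-09
  zfs send -R -i @2010-04-08 poolA/data@2010-04-09 | zfs receive -duF poolB/backup-of-A

The -i incremental keeps the backup window proportional to the day's churn rather than the pool size, which is what makes the twice-as-many-spindles argument workable.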
Re: [zfs-discuss] ZFS RaidZ recommendation
On Fri, 2010-04-09 at 08:07 +1000, Daniel Carosone wrote:

On Thu, Apr 08, 2010 at 12:14:55AM -0700, Erik Trimble wrote: Daniel Carosone wrote: Go with the 2x7 raidz2. When you start to really run out of space, replace the drives with bigger ones. While that's great in theory, there's getting to be a consensus that 1TB 7200RPM 3.5" SATA drives are really going to be the last usable capacity.

I dunno. The 'forces' and issues you describe are real, but 'usable' depends very heavily on the user's requirements.

Well, the problem is (and this isn't just a ZFS issue) that resilver and scrub times /are/ very bad for 1TB disks. This goes directly to the problem of redundancy - if you don't really care about resilver/scrub issues, then you really shouldn't bother to use raidz or mirroring. It's pretty much in the same ballpark. That is, 1TB 3.5" drives have such long resilver/scrub times that with ZFS, it's a good bet you can kill a second (or third) drive before you can scrub or resilver in time to compensate for the already-failed one. Put it another way, you get more errors before you have time to fix the old ones, which effectively means you now can't fix errors before they become permanent. Permanent errors = data loss.

For example, a large amount of the extra space available on a larger drive may be very rarely accessed in normal use (scrubs and resilvers aside). In the OP's example of an ever-expanding home media collection, much of it will never or very rarely get re-watched. Another common use for the extra space is simply storing more historical snapshots, against the unlikely future need to access them. For such data, speed is really not a concern at all.

Yes, it is. It's still a concern, and not just in the scrub/resilver arena. Big drives have considerably lower performance, to the point where replacing 1TB drives with 2TB drives may very well drop them below the threshold where they start to see stutter. That is, while the setup may work with 1TB drives, it won't with 2TB drives. It's not a no-brainer to just upgrade the size. For example, the 2TB 5900RPM 3.5" drives are (on average) over 2x as slow as the 1TB 7200RPM 3.5" drives for most operations. Access time is slower by 40%, and throughput is slower by 30-50%.

For the subset of users for whom these forces are not overwhelming for real usage, that leaves scrubs and resilvers. There is room for improvement in zfs here, too - a more sequential streaming access pattern would help.

While ZFS certainly has problems with randomly written small-data pools, scrubs and resilvers on large streaming writes (like the media server) are rather straightforward. Note that RAID-6 and many RAID-5/3 hardware setups have similar issues. In any case, resilver/scrub times are becoming the dominant factor in reliability of these large drives.

To me, the biggest issue you left unmentioned is the problem of backup. There's little option for backing up these larger drives, other than more of the same drives. In turn, lots of the use such drives will be put to, is for backing up other data stores, and there again, the usage pattern fits the above profile well. Another usage pattern we may see more of, and that helps address some of the performance issues, is this. Say I currently have 2 pools of 1TB disks, one as a backup for the other. I want to expand the space.
I replace all the disks with 2TB units, but I also change my data distribution as it grows: now, each pool is to be at most half-full of data, and the other half is used as a backup of the opposite pool. ZFS send is fast enough that the backup windows are short, and I now have effectively twice as many spindles in active service.

Don't count on 'zfs send' being fast enough. Even for liberal values of fast enough - it's highly data dependent. For the situation you describe, you're actually making it worse - now, both pools have a backup I/O load which reduces their available throughput. If you're talking about a pool that's already 50% slower than one made of 1TB drives, then, well, you're hosed.

[..] it looks like hard drives are really at the end of their advancement, as far as capacities per drive go. The challenges are undeniable, but that's way too big a call. Those are words you will regret in future; at least, I hope the future will be one in which those words are regrettable. :-)

Honestly, from what I've seen and heard both here and on other forums, the writing is on the wall, the fat lady has sung, and Mighty Casey has struck out. The 3.5" winchester hard drive is on terminal life support for use in enterprises. It will linger a little longer in commodity places, where its cost/GB overcomes its weaknesses. 2.5" HDs will last out the decade, as they're slightly higher performance/GB and space/power savings will allow them to hold off solid-state media for a bit. But solid-state is the future, and
Re: [zfs-discuss] L2ARC Workingset Size
On 08 April, 2010 - Abdullah Al-Dahlawi sent me these 12K bytes:

Hi Richard Thanks for your comments. OK ZFS is COW, I understand, but, this also means a waste of valuable space of my L2ARC SSD device, more than 60% of the space is consumed by COW !!!. I do not get it ?

The rest can and will be used if L2ARC needs it. It's not wasted, it's just a number that doesn't match what you think it should be.

/Tomas
--
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
Re: [zfs-discuss] sharenfs option rw,root=host1 don't take effect
mingli liming...@gmail.com writes:

Thanks Erik, and I will try it, but the new question is that the root of the NFS server is mapped as nobody at the NFS client. For this issue, I set up a new test NFS server and NFS client, and with the same option, in this test environment, the file owner mapped correctly. It confused me.

From the original post in this thread it wasn't clear if you're doing this on a local lan, and if both server and client are opensolaris machines. Maybe I missed it.

I don't have any problems now and don't use any of the options to sharenfs that you showed:

  zfs get sharenfs z3/projects
  NAME         PROPERTY  VALUE  SOURCE
  z3/projects  sharenfs  on     local

Just a simple `on'. At first, I had all kinds of problems and being a newbie nfs user seemed to see all kinds of strange phenomena, including seeing `nobody:nobody' as owner:group.

I had the version for nfs set properly on the opensolaris server, but it turned out to be only set for the server:

  grep NFS_SERVER_VERSMAX /etc/default/nfs
  #NFS_SERVER_VERSMAX=4
  NFS_SERVER_VERSMAX=3

But somehow I had completely overlooked the CLIENT setting:

  grep NFS_CLIENT_VERSMAX /etc/default/nfs
  # NFS_CLIENT_VERSMAX=4
  # NFS_CLIENT_VERSMAX=3

I'd been running with both commented out instead of what I needed, like this:

  NFS_CLIENT_VERSMAX=3    (uncommented)

The client was a linux machine and it was the client trying to mount the share as version 4. What tipped me off was accidentally seeing something in the output of the linux `mount' cmd that indicated the share was mounted as version 4 nfs. Once I made the correct setting for NFS_CLIENT... things just started working.
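One cross-check that would have shortened the hunt above: both Solaris and Linux clients can report which NFS version a share actually mounted with, so you can confirm the client side without guessing (the mount point below is a placeholder):

  # on the client: show protocol options, including vers=, per mounted share
  nfsstat -m /mnt/projects

  # or force the version explicitly at mount time while testing
  mount -o vers=3 server:/z3/projects /mnt/projects

If nfsstat -m reports vers=4 while the server is effectively limited to v3, that mismatch is the nobody:nobody symptom described earlier in the thread.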
Re: [zfs-discuss] L2ARC Workingset Size
On Apr 8, 2010, at 3:23 PM, Tomas Ögren wrote:

On 08 April, 2010 - Abdullah Al-Dahlawi sent me these 12K bytes: Hi Richard Thanks for your comments. OK ZFS is COW, I understand, but, this also means a waste of valuable space of my L2ARC SSD device, more than 60% of the space is consumed by COW !!!. I do not get it ?

The rest can and will be used if L2ARC needs it. It's not wasted, it's just a number that doesn't match what you think it should be.

Another way to look at it is: all cache space is wasted by design. If the backing store for the cache were performant, there wouldn't be a cache. So caches waste space to gain performance. Space, dependability, performance: pick two.

-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
Re: [zfs-discuss] sharenfs option rw,root=host1 don't take effect
On 8 apr 2010, at 23.21, Miles Nordin wrote:

rs == Ragnar Sundblad ra...@csc.kth.se writes:
rs> use IPSEC to make IP address spoofing harder.

IPsec with channel binding is win, but not until SA's are offloaded to the NIC and all NIC's can do IPsec AES at line rate. Until this happens you need to accept there will be some protocols used on SAN that are not on ``the Internet'' and for which your axiomatic security declarations don't apply, where the relevant features are things like doing the DNS lookup in the proper .rhosts manner and doing uRPF, minimum, and more optimistically stop adding new protocols without IPv6 support, and start adding support for multiple IP stacks / VRF's. If saying ``the only way to do any given thing is twicecrypted kerberized ipsec within dnssec namespaces'' is blocking doing these immediate plaintext things that allow a host to participate in both the internet and a SAN at once, well that's no good either.

I totally agree. Since DNS, fqdn, and the like were mentioned, I don't think this was intended for a SAN, not-on-the-internet, environment. uRPF and other filters may of course harden your environment.

Let's hope everyone using the NFS features in question all use them in a completely non-spoofable (L1..L3 and name resolver) setup, then! ;-)

/ragge
[zfs-discuss] ZFS kstat Stats
Do the following ZFS stats look ok?

::memstat
Page Summary        Pages     MB   %Tot
Kernel             106619    832    28%
ZFS File Data       79817    623    21%
Anon                28553    223     7%
Exec and libs        3055     23     1%
Page cache          18024    140     5%
Free (cachelist)     2880     22     1%
Free (freelist)    146309   1143    38%
Total              385257   3009
Physical           367243   2869
Re: [zfs-discuss] ZFS kstat Stats
Do the following ZFS stats look ok?

::memstat
Page Summary        Pages     MB   %Tot
Kernel             106619    832    28%
ZFS File Data       79817    623    21%
Anon                28553    223     7%
Exec and libs        3055     23     1%
Page cache          18024    140     5%
Free (cachelist)     2880     22     1%
Free (freelist)    146309   1143    38%
Total              385257   3009
Physical           367243   2869

Looks beautiful. Just for giggles try this:

r...@aequitas:/root# uname -a
SunOS aequitas 5.11 snv_136 i86pc i386 i86pc Solaris
r...@aequitas:/root# /bin/printf "::kmastat\n" | mdb -k
cache                      buf     buf     buf     memory      alloc  alloc
name                      size  in use   total     in use    succeed   fail
------------------------ ----- ------- ------- ---------- ---------- ------
kmem_magazine_1              8    8595    8736    212992B       8595      0
kmem_magazine_3             16    3697    3780    122880B       3697      0
kmem_magazine_7             32    7633    7686    499712B       7633      0
kmem_magazine_15            64   11642   11656   1540096B      11642      0
.
. etc etc
.
nfs4_access_cache           32       0       0         0B          0      0
client_handle4_cache        16       0       0         0B          0      0
nfs4_ace4vals_cache         36       0       0         0B          0      0
nfs4_ace4_list_cache       176       0       0         0B          0      0
NFS_idmap_cache             24       0       0         0B          0      0
pty_map                     48       0      64      4096B          1      0
------------------------ ----- ------- ------- ---------- ---------- ------
Total [hat_memload]                            974848B      1306984      0
Total [kmem_msb]                             56860672B       506215      0
Total [kmem_va]                              78249984B        12180      0
Total [kmem_default]                         76316672B      8546762      0
Total [kmem_io_1G]                           36712448B         8643      0
Total [bp_map]                                      0B          212      0
Total [segkp]                                 6356992B       186825      0
Total [umem_np]                                     0B          148      0
Total [ip_minor_arena_sa]                          64B          180      0
Total [spdsock]                                     0B            1      0
Total [namefs_inodes]                              64B           18      0
------------------------ ----- ------- ------- ---------- ---------- ------
.
. etc etc
.

Dennis
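If the question is specifically how much kernel memory ZFS itself is holding, it can help to filter the kmastat output down to the ZFS-related caches rather than reading the whole list; a rough sketch (cache names vary somewhat between builds, so treat the egrep pattern as a starting point):

  # ZFS buffer caches: zio_buf_*, zio_data_buf_*, plus ARC/DMU structures
  echo "::kmastat" | mdb -k | egrep "zio_buf|zio_data_buf|arc_buf|dnode_t|dmu_buf_impl|zfs_znode"

  # and the page-level summary shown above
  echo "::memstat" | mdb -k

Between ::memstat's ZFS File Data line and the zio_* caches you get a reasonable picture of how much RAM the ARC is really consuming.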
Re: [zfs-discuss] ZFS RaidZ recommendation
On 04/ 9/10 10:48 AM, Erik Trimble wrote:

Well, the problem is (and this isn't just a ZFS issue) that resilver and scrub times /are/ very bad for 1TB disks. This goes directly to the problem of redundancy - if you don't really care about resilver/scrub issues, then you really shouldn't bother to use raidz or mirroring. It's pretty much in the same ballpark. That is, 1TB 3.5" drives have such long resilver/scrub times that with ZFS, it's a good bet you can kill a second (or third) drive before you can scrub or resilver in time to compensate for the already-failed one. Put it another way, you get more errors before you have time to fix the old ones, which effectively means you now can't fix errors before they become permanent. Permanent errors = data loss.

That's one of the big problems with the build it now, expand with bigger drives later approach. If you were designing from scratch with 2TB drives, you would be wise to consider triple parity raid, where double parity has acceptable reliability for 1TB drives. Each time drive capacity doubles (and performance does not) an extra level of parity is required. I guess this extrapolates to one data and N parity drives..

-- Ian.
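For what it's worth, triple parity is just another keyword at pool creation time, so the step up costs nothing but one more drive per vdev; a hedged sketch with made-up device names, for builds recent enough to have raidz3:

  # double parity: survives any 2 drive failures per vdev
  zpool create tank raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0

  # triple parity: same idea, survives any 3 failures per vdev
  zpool create tank raidz3 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 c0t7d0

The extra parity only pays off if the vdev is wide enough that the lost capacity is tolerable, which is part of why the 2x7 layouts discussed earlier in the thread are attractive.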
Re: [zfs-discuss] ZFS RaidZ recommendation
On Thu, Apr 08, 2010 at 03:48:54PM -0700, Erik Trimble wrote:

Well

To be clear, I don't disagree with you; in fact for a specific part of the market (at least) and a large part of your commentary, I agree. I just think you're overstating the case for the rest.

The problem is (and this isn't just a ZFS issue) that resilver and scrub times /are/ very bad for 1TB disks. This goes directly to the problem of redundancy - if you don't really care about resilver/scrub issues, then you really shouldn't bother to use Raidz or mirroring. It's pretty much in the same ballpark.

Sure, and that's why you have raidz3 now; also why multi-way mirrors are getting more attention, as the drives are getting large enough that capacities and redundancies previously only available via raidz constructions can now be had with mirrors and a reasonable number of spindles. Large drives (with the constraints you describe) certainly change the deployment scenarios. I don't agree that they shouldn't be deployed at all, ever - which seems to be what you're saying. Take 6x1TB in raidz2, replace with 6x2TB in three-way-mirror. Chances are, you've just improved performance. I'm just trying to show it's really not all that black and white.

As for error rates, this is something zfs should not be afraid of. Indeed, many of us would be happy to get drives with less internal ECC overhead and complexity for greater capacity, and tolerate the resultant higher error rates, specifically for use with zfs (sector errors, not overall drive failure, of course). Even if it means I need raidz4, and wind up with the same overall usable space, I may prefer the redundancy across drives rather than within.

That is, 1TB 3.5" drives have such long resilver/scrub times that with ZFS, it's a good bet you can kill a second (or third) drive before you can scrub or resilver in time to compensate for the already-failed one. Put it another way, you get more errors before you have time to fix the old ones, which effectively means you now can't fix errors before they become permanent. Permanent errors = data loss.

Again, potential zfs improvements could help here:

- resilver in parallel for multiply redundant vdevs with multiple failures/replacements (currently, I think resilver restarts in this case?)

- scrub a (top level) vdev at a time, rather than a whole pool. If I know I'm about to replace a drive, perhaps for capacity upgrade, I'll scrub first to minimise the chances of tripping over a latent error, especially on the previous drive I just replaced. No need to scrub other vdevs right now.

- scrub/resilver selectively by dataset, to allow higher priority data to be given better protection.

For example, the 2TB 5900RPM 3.5" drives are (on average) over 2x as slow as the 1TB 7200RPM 3.5" drives for most operations. Access time is slower by 40%, and throughput is slower by 30-50%.

Please, be fair and compare like with like - say replacing 5400rpm 1TB drives. Your same problem would apply if replacing 1TB 7200's with 1TB 5400's; it has little to do with the capacity. Indeed, at the same rpm, the higher density has the potential to be faster.

In any case, resilver/scrub times are becoming the dominant factor in reliability of these large drives.

Agreed; I'd argue they have been for some time (ie, even at the 1TB size).

As a practical matter, small setups are for the most part not expandable/upgradable much, if at all.
Buy what you need now, and plan on rebuying something new in 5-10 years, but don't think that what you put together now can be continuously upgraded for a decade. On this, I agree completely, even on a shorter time-scale (say 3-5 years). On each generation, repurpose the previous generation for backup or something else as appropriate. This applies to drives, and to the boxes that house them. Even so, leave yourself wiggle room for upgrades and other unanticipated devlopments in the meantime where you can. -- Dan. pgpLw78wUivGj.pgp Description: PGP signature ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS RaidZ recommendation
Well, I would like to thank everyone for their comments and ideas. I finally have this machine up and running with Nexenta Community edition and am really liking the GUI for administering it. It suits my needs perfectly and is running very well. I ended up going with 2 x 7 RaidZ2 vdevs in one pool for a total capacity of 10 TB.

One thing I have noticed that seems a little different from my previous hardware raid controller (Areca) is that the data is not constantly being written to the spindles. For example, I am copying some large files to the array right now (approx 4 gigs a file) and my network performance is showing a transfer rate of 75MB/s on average. When I physically watch the server I only see a 1-2 second flurry of activity on the drives, then about 10 seconds of no activity. Is this the nature of ZFS?

Thanks for all the help!
Re: [zfs-discuss] ZFS RaidZ recommendation
On Thu, 8 Apr 2010, Jason S wrote:

One thing I have noticed that seems a little different from my previous hardware raid controller (Areca) is that the data is not constantly being written to the spindles. For example, I am copying some large files to the array right now (approx 4 gigs a file) and my network performance is showing a transfer rate of 75MB/s on average. When I physically watch the server I only see a 1-2 second flurry of activity on the drives, then about 10 seconds of no activity. Is this the nature of ZFS?

Yes, this is the nature of ZFS. ZFS batches up writes and writes them in bulk. On a large memory system and with a very high write rate, up to 5 seconds worth of low-level writes may be batched up. With a slow write rate, up to 30 seconds of user-level writes may be batched up. The reasons for doing this become obvious when you think about it a bit. Zfs writes data as large transactions (transaction groups) and uses copy on write (COW). Batching up the writes allows more full blocks to be written, which decreases fragmentation, improves space allocation efficiency, improves write performance, and uses fewer precious IOPS. The main drawback is that reads/writes are temporarily stalled during part of the TXG write cycle.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
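If you want to confirm that the bursts line up with transaction group commits rather than anything wrong with the pool, watching the pool at one-second granularity makes the pattern obvious (the pool name here is a placeholder):

  # per-second I/O to the pool; expect several idle seconds, then a burst
  zpool iostat tank 1

  # the same view broken out per vdev/disk
  zpool iostat -v tank 1

Compare that against the steady inbound network rate and you can see the data accumulating in memory between TXG commits.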
Re: [zfs-discuss] ZFS RaidZ recommendation
On Apr 8, 2010, at 6:19 PM, Daniel Carosone wrote:

As for error rates, this is something zfs should not be afraid of. Indeed, many of us would be happy to get drives with less internal ECC overhead and complexity for greater capacity, and tolerate the resultant higher error rates, specifically for use with zfs (sector errors, not overall drive failure, of course). Even if it means I need raidz4, and wind up with the same overall usable space, I may prefer the redundancy across drives rather than within.

Disagree. Reliability trumps availability every time. And the problem with the availability provided by redundancy techniques is that the amount of work needed to recover is increasing. This work is limited by latency, and HDDs are not winning any latency competitions anymore. To combat this, some vendors are moving to an overprovision model. Current products deliver multiple disks in a single FRU with builtin, fine-grained redundancy. Because the size and scope of the FRU is bounded, the recovery can be optimized and the reliability of the FRU is increased. From a market perspective, these solutions are not suitable for the home user because the size and cost of the FRU is high. It remains to be seen how such products survive in the enterprise space as HDDs become relegated to backup roles.

-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
Re: [zfs-discuss] ZFS RaidZ recommendation
On Thu, Apr 08, 2010 at 08:36:43PM -0700, Richard Elling wrote:

On Apr 8, 2010, at 6:19 PM, Daniel Carosone wrote: As for error rates, this is something zfs should not be afraid of. [...] I may prefer the redundancy across drives rather than within.

Disagree. Reliability trumps availability every time.

Often, but not sure about every. The economics shift around too fast for such truisms to be reliable, and there's always room for an upstart (often in a niche) to make great economic advantages out of questioning this established wisdom. The oft-touted example is google's servers, but there are many others.

And the problem with the availability provided by redundancy techniques is that the amount of work needed to recover is increasing. This work is limited by latency and HDDs are not winning any latency competitions anymore.

We're talking about generalities; the niche can be very important to enable these kinds of tricks by holding some of the other troubling variables constant (e.g. application/programming platform). It doesn't really matter whether you're talking about 1 dual-PSU server vs 2 single-PSU servers, or whole datacentres - except that solid large-scale diversity tends to lessen your concentration (and perhaps spend) on internal redundancy within a datacentre (or disk). Put another way: some application niches are much more able to adopt redundancy techniques that don't require so much work.

Again, for the google example: if you're big and diverse enough that shifting load between data centres on failure is no work, then moving the load for other reasons is viable too - such as moving to where it's night time and power and cooling are cheaper. The work has been done once, up front, and the benefits are repeatable.

To combat this, some vendors are moving to an overprovision model. Current products deliver multiple disks in a single FRU with builtin, fine-grained redundancy. Because the size and scope of the FRU is bounded, the recovery can be optimized and the reliability of the FRU is increased.

That's not new. Past examples in the direct experience of this community include the BladeStor and SSA-1000 storage units, which aggregated disks into failure domains (e.g. drawers) for a (big) density win.

-- Dan.
Re: [zfs-discuss] ZFS RaidZ recommendation
I thought I might chime in with my thoughts and experiences. For starters, I am very new to both OpenSolaris and ZFS, so take anything I say with a grain of salt.

I have a home media server / backup server very similar to what the OP is looking for. I am currently using 4 x 1TB and 4 x 2TB drives set up as mirrors. Tomorrow, I'm going to wipe my pool and go to 4 x 1TB and 4 x 2TB in two 4-disk raidz's. I back up my pool to 2 external 2TB drives that are simply striped, using zfs send/receive followed by a scrub. As of right now, I only have 1.58TB of actual data. ZFS send over USB 2.0 capped out at 27MB/s. The scrub for 1.5TB of backup data on the USB drives took roughly 14 hours. I'll destroy the backup pool and add more drives as needed. I looked at a lot of different options for external backup, and decided to go with cheap (USB). I am using 1TB and 2TB WD Caviar Green drives for my storage pool, which are about the cheapest and probably close to the slowest consumer drives you can buy.

I've only been at this for about 4-5 months now, and thankfully I haven't had a drive fail yet, so I cannot attest to resilver times. I do weekly scrubs on both my rpool and storage pool via a script called through cron. I just set things up to do scrubs during a timeframe when I know I'm not going to be using it for anything. I can't recall the exact times it took for the scrubs to complete, but it wasn't anything that interfered with my usage (yet...)

The vast majority of any streaming media I do (up to 1080p) is over wireless-n. Occasionally, I will get stuttering (on the HD stuff), but I haven't looked into whether it was due to a network or I/O bottleneck. Personally, I would think it was due to network traffic, but that is pure speculation. The vast majority of the time, I don't have any issues whatsoever. The main point I'm trying to make is that I'm not I/O bound at this point. I'm also not streaming to 4 media players simultaneously. I currently have far more storage space than I am using. When I do end up running low on space, I plan to start with replacing the 1TB drives with, hopefully much cheaper at that point, 2TB drives. If using 2 x raidz vdevs doesn't work well for me, I'll go back to mirrors and start looking at other options for expansion.

I find Erik Trimble's statements regarding a 1 TB limit on drives to be very bold. I don't have the knowledge or the inclination to argue the point, but I am betting that we will continue to see advances in storage technology on par with what we have seen in the past. If we are still capped at 2TB as the limit for a physical device in 2 years, I solemnly pledge now that I will drink a six-pack of beer in his name. Again, I emphasize that this assumption is not based on any sort of knowledge other than past experience with the ever-growing storage capacity of physical disks.

My personal advice to the OP would be to set up three 4 x 1TB raidz vdevs, and to invest in a reasonable backup solution. If you have to use the last two drives, set them up as a mirror. Redundancy is great, but in my humble opinion, for the home user that is using cheap hardware, it's not as critical as performance and available storage space. That particular configuration would give you more IOPS than just two raidz2 vdevs, with slightly less redundancy and slightly more storage space. For my own needs, I don't see redundancy as being as high a priority as IOPS and available storage space.
Everyone has to make their own decision on that, and the ability of ZFS to accommodate a vast array of different individual needs is a big part of what makes it such an excellent filesystem. With a solid backup, there is really no reason you can't redesign your pool at a later date if need be. Try out what you think will work best, and if that configuration doesn't work well in some way, adjust and move on...

There are a few different schools of thought on how to back up ZFS filesystems. ZFS send/receive works for me, but there are certainly weaknesses with using it as a backup solution (as has been much discussed on this list.) Hopefully, in the future it will be possible to remove vdevs from a pool and to restripe data across a pool. Those particular features would certainly be great for me.

Just my thoughts.

Eric
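For anyone wanting to copy the weekly cron-driven scrub Eric describes, a minimal sketch follows; the pool names and schedule are invented, and you may want to check that a previous scrub isn't still running before kicking off another:

  #!/bin/sh
  # /root/scrub-pools.sh - start scrubs on the listed pools
  for pool in rpool tank; do
      zpool scrub $pool
  done

  # crontab entry: every Sunday at 03:00
  # 0 3 * * 0 /root/scrub-pools.sh

Checking zpool status afterwards (or mailing its output from the script) is the easy way to see whether the scrub turned up any checksum errors.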
Re: [zfs-discuss] ZFS RaidZ recommendation
On Apr 8, 2010, at 9:06 PM, Daniel Carosone wrote:

On Thu, Apr 08, 2010 at 08:36:43PM -0700, Richard Elling wrote: On Apr 8, 2010, at 6:19 PM, Daniel Carosone wrote: As for error rates, this is something zfs should not be afraid of. [...] I may prefer the redundancy across drives rather than within.

Disagree. Reliability trumps availability every time.

Often, but not sure about every.

I am quite sure.

The economics shift around too fast for such truisms to be reliable, and there's always room for an upstart (often in a niche) to make great economic advantages out of questioning this established wisdom. The oft-touted example is google's servers, but there are many others.

A small change in reliability for massively parallel systems has a significant, multiplicative effect on the overall system. Companies like Google weigh many factors, including component reliability, when designing the systems.

And the problem with the availability provided by redundancy techniques is that the amount of work needed to recover is increasing. This work is limited by latency and HDDs are not winning any latency competitions anymore.

We're talking about generalities; the niche can be very important to enable these kinds of tricks by holding some of the other troubling variables constant (e.g. application/programming platform). It doesn't really matter whether you're talking about 1 dual-PSU server vs 2 single-PSU servers, or whole datacentres - except that solid large-scale diversity tends to lessen your concentration (and perhaps spend) on internal redundancy within a datacentre (or disk). Put another way: some application niches are much more able to adopt redundancy techniques that don't require so much work.

At the other extreme, if disks were truly reliable, the only RAID that would matter is RAID-0.

Again, for the google example: if you're big and diverse enough that shifting load between data centres on failure is no work, then moving the load for other reasons is viable too - such as moving to where it's night time and power and cooling are cheaper. The work has been done once, up front, and the benefits are repeatable.

Most folks never even get to a decent disaster recovery design, let alone a full datacenter mirror :-(

To combat this, some vendors are moving to an overprovision model. Current products deliver multiple disks in a single FRU with builtin, fine-grained redundancy. Because the size and scope of the FRU is bounded, the recovery can be optimized and the reliability of the FRU is increased.

That's not new. Past examples in the direct experience of this community include the BladeStor and SSA-1000 storage units, which aggregated disks into failure domains (e.g. drawers) for a (big) density win.

Nope. The FRUs for BladeStor and SSA-1000 were traditional disks. To see something different you need to rethink the disk -- something like a Xiotech ISE.

-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com