Re: [zfs-discuss] Zvol vs zfs send/zfs receive
On 09/14/12 22:39, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Dave Pooser
>>
>> Unfortunately I did not realize that zvols require disk space sufficient to duplicate the zvol, and my zpool wasn't big enough. After a false start (zpool add is dangerous when low on sleep) I added a 250GB mirror and a pair of 3GB mirrors to miniraid and was able to successfully snapshot the zvol: miniraid/RichRAID@exportable
>
> This doesn't make any sense to me. The snapshot should not take up any (significant) space on the sending side. It's only on the receiving side, trying to receive a snapshot, that you require space. Because it won't clobber the existing zvol on the receiving side until the complete new zvol was received to clobber it with. But simply creating the snapshot on the sending side should be no problem.

By default, zvols have reservations equal to their size (so that writes don't fail due to the pool being out of space). Creating a snapshot in the presence of a reservation requires reserving enough space to overwrite every block on the device.

You can remove or shrink the reservation if you know that the entire device won't be overwritten.

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
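As a rough illustration of the reservation adjustment described in the post above (the zvol name comes from the post; the 100G figure is just an example, and whether dropping the refreservation entirely is safe depends on how much of the device will really be overwritten):

    # check the current reservation on the zvol
    zfs get refreservation,usedbyrefreservation miniraid/RichRAID

    # either remove the refreservation entirely (allows overcommit; writes
    # can then fail if the pool fills up) ...
    zfs set refreservation=none miniraid/RichRAID

    # ... or shrink it to something smaller than the full volume size
    zfs set refreservation=100G miniraid/RichRAID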
Re: [zfs-discuss] Very poor small-block random write performance
On 07/19/12 18:24, Traffanstead, Mike wrote: iozone doesn't vary the blocksize during the test, it's a very artificial test but it's useful for gauging performance under different scenarios. So for this test all of the writes would have been 64k blocks, 128k, etc. for that particular step. Just as another point of reference I reran the test with a Crucial M4 SSD and the results for 16G/64k were 35mB/s (x5 improvement). I'll rerun that part of the test with zpool iostat and see what it says. For random writes to work without forcing a lot of read i/o and read-modify-write sequences, set the recordsize on the filesystem used for the test to match the iozone recordsize. For instance: zfs set recordsize=64k $fsname and ensure that the files used for the test are re-created after you make this setting change ("recordsize" is sticky at file creation time). ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New fast hash algorithm - is it needed?
On 07/11/12 02:10, Sašo Kiselkov wrote:
> Oh jeez, I can't remember how many times this flame war has been going
> on on this list. Here's the gist: SHA-256 (or any good hash) produces a
> near uniform random distribution of output. Thus, the chances of getting
> a random hash collision are around 2^-256 or around 10^-77.

I think you're correct that most users don't need to worry about this -- sha-256 dedup without verification is not going to cause trouble for them.

But your analysis is off. You're citing the chance that two blocks picked at random will have the same hash. But that's not what dedup does; it compares the hash of a new block to a possibly-large population of other hashes, and that gets you into the realm of the "birthday problem" or "birthday paradox". See http://en.wikipedia.org/wiki/Birthday_problem for formulas. So, maybe somewhere between 10^-50 and 10^-55 for there being at least one collision in really large collections of data - still not likely enough to worry about.

Of course, that assumption goes out the window if you're concerned that an adversary may develop practical ways to find collisions in sha-256 within the deployment lifetime of a system. sha-256 is, more or less, a scaled-up sha-1, and sha-1 is known to be weaker than the ideal 2^80 collision resistance you'd expect from a 160-bit hash; the best credible attack is somewhere around 2^57.5 (see http://en.wikipedia.org/wiki/SHA-1#SHA-1).

on a somewhat less serious note, perhaps zfs dedup should contain "chinese lottery" code (see http://tools.ietf.org/html/rfc3607 for one explanation) which asks the sysadmin to report a detected sha-256 collision to eprint.iacr.org or the like...

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Advanced Format HDD's - are we there yet? (or - how to buy a drive that won't be teh sux0rs on zfs)
On 05/28/12 17:13, Daniel Carosone wrote:
> There are two problems using ZFS on drives with 4k sectors:
>
> 1) if the drive lies and presents 512-byte sectors, and you don't manually force ashift=12, then the emulation can be slow (and possibly error prone). There is essentially an internal RMW cycle when a 4k sector is partially updated. We use ZFS to get away from the perils of RMW :)
>
> 2) with ashift=12, whether forced manually or automatically because the disks present 4k sectors, ZFS is less space-efficient for metadata and keeps fewer historical uberblocks.

two, more specific, problems I've run into recently:

1) if you move a disk with an ashift=9 pool on it from a controller/enclosure/.. combo where it claims to have 512 byte sectors to a path where it is detected as having 4k sectors (even if it can cope with 512-byte aligned I/O), the pool will fail to import and appear to be gravely corrupted; the error message you get will make no mention of the sector size change. Move the disk back to the original location and it imports cleanly.

2) if you have a pool with ashift=9 and a disk dies, and the intended replacement is detected as having 4k sectors, it will not be possible to attach the disk as a replacement drive..

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
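Before moving disks between enclosures, it can help to know which ashift a pool was created with. One way to check (pool and device names here are placeholders, and the exact output format varies by release):

    # dump the cached pool configuration and look for the ashift value
    zdb -C tank | grep ashift

    # or inspect the labels on a specific device
    zdb -l /dev/rdsk/c0t0d0s0 | grep ashift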
Re: [zfs-discuss] zfs receive slowness - lots of systime spent in genunix`list_next ?
On 12/05/11 10:47, Lachlan Mulcahy wrote:
> zfs`lzjb_decompress                10   0.0%
> unix`page_nextn                    31   0.0%
> genunix`fsflush_do_pages           37   0.0%
> zfs`dbuf_free_range               183   0.1%
> genunix`list_next                5822   3.7%
> unix`mach_cpu_idle             150261  96.1%

your best bet in a situation like this -- where there's a lot of cpu time spent in a generic routine -- is to use an alternate profiling method that shows complete stack traces rather than just the top function on the stack. often the names of functions two or three or four deep in the stack will point at what's really responsible.

something as simple as:

    dtrace -n 'profile-1001 { @[stack()] = count(); }'

(let it run for a bit, then interrupt it) should show who's calling list_next() so much.

- Bill

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] "zfs diff" performance disappointing
On 09/26/11 12:31, Nico Williams wrote: > On Mon, Sep 26, 2011 at 1:55 PM, Jesus Cea wrote: >> Should I disable "atime" to improve "zfs diff" performance? (most data >> doesn't change, but "atime" of most files would change). > > atime has nothing to do with it. based on my experiences with time-based snapshots and atime on a server which had cron-driven file tree walks running every night, I can easily believe atime has a lot to do with it - the atime updates associated with a tree walk will mean that that much of a filesystem's metadata will diverge between the writeable filesystem and its last snapshot. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
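If the nightly cron-driven tree walks mentioned above don't actually need access times, turning atime off on the affected filesystem avoids that metadata churn entirely (dataset name below is a placeholder):

    zfs get atime tank/export      # check the current setting
    zfs set atime=off tank/export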
Re: [zfs-discuss] Encryption accelerator card recommendations.
On 06/27/11 15:24, David Magda wrote: > Given the amount of transistors that are available nowadays I think > it'd be simpler to just create a series of SIMD instructions right > in/on general CPUs, and skip the whole co-processor angle. see: http://en.wikipedia.org/wiki/AES_instruction_set Present in many current Intel CPUs; also expected to be present in AMD's "Bulldozer" based CPUs. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] OpenIndiana | ZFS | scrub | network | awful slow
On 06/16/11 15:36, Sven C. Merckens wrote: > But is the L2ARC also important while writing to the device? Because > the storeges are used most of the time only for writing data on it, > the Read-Cache (as I thought) isn´t a performance-factor... Please > correct me, if my thoughts are wrong. if you're using dedup, you need a large read cache even if you're only doing application-layer writes, because you need fast random read access to the dedup tables while you write. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
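One inexpensive way to get that fast random read access for the dedup tables is to add an SSD as a cache (L2ARC) device; a minimal sketch, with placeholder pool and device names:

    zpool add tank cache c3t5d0
    zpool iostat -v tank     # the cache device and its activity show up in the per-vdev statistics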
Re: [zfs-discuss] Disk replacement need to scan full pool ?
On 06/14/11 04:15, Rasmus Fauske wrote: > I want to replace some slow consumer drives with new edc re4 ones but > when I do a replace it needs to scan the full pool and not only that > disk set (or just the old drive) > > Is this normal ? (the speed is always slow in the start so thats not > what I am wondering about, but that it needs to scan all of my 18.7T to > replace one drive) This is normal. The resilver is not reading all data blocks; it's reading all of the metadata blocks which contain one or more block pointers, which is the only way to find all the allocated data (and in the case of raidz, know precisely how it's spread and encoded across the members of the vdev). And it's reading all the data blocks needed to reconstruct the disk to be replaced. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Wired write performance problem
On 06/08/11 01:05, Tomas Ögren wrote:
> And if pool usage is >90%, then there's another problem (change of finding free space algorithm).

Another (less satisfying) workaround is to increase the amount of free space in the pool, either by reducing usage or adding more storage.

Observed behavior is that allocation is fast until usage crosses a threshold, then performance hits a wall. I have a small sample size (maybe 2-3 samples), but the threshold point varies from pool to pool, though it tends to be consistent for a given pool. I suspect some artifact of layout/fragmentation is at play. I've seen things hit the wall at as low as 70% on one pool.

The original poster's pool is about 78% full. If possible, try freeing stuff until usage goes back under 75% or 70% and see if your performance returns.

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Available space confusion
On 06/06/11 08:07, Cyril Plisko wrote:
> zpool reports space usage on disks, without taking into account RAIDZ overhead. zfs reports net capacity available, after RAIDZ overhead accounted for.

Yup. Going back to the original numbers:

nebol@filez:/$ zfs list tank2
NAME    USED  AVAIL  REFER  MOUNTPOINT
tank2  3.12T   902G  32.9K  /tank2

Given that it's a 4-disk raidz1, you have (roughly) one block of parity for every three blocks of data. 3.12T / 3 = 1.04T, so 3.12T + 1.04T = 4.16T, which is close to the 4.18T showed by zpool list:

NAME    SIZE   USED  AVAIL   CAP  HEALTH  ALTROOT
tank2  5.44T  4.18T  1.26T   76%  ONLINE  -

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Is another drive worth anything?
On 05/31/11 09:01, Anonymous wrote: > Hi. I have a development system on Intel commodity hardware with a 500G ZFS > root mirror. I have another 500G drive same as the other two. Is there any > way to use this disk to good advantage in this box? I don't think I need any > more redundancy, I would like to increase performance if possible. I have > only one SATA port left so I can only use 3 drives total unless I buy a PCI > card. Would you please advise me. Many thanks. I'd use the extra SATA port for an ssd, and use that ssd for some combination of boot/root, ZIL, and L2ARC. I have a couple systems in this configuration now and have been quite happy with the config. While slicing an ssd and using one slice for root, one slice for zil, and one slice for l2arc isn't optimal from a performance standpoint and won't scale up to a larger configuration, it is a noticeable improvement from a 2-disk mirror. I used an 80G intel X25-M, with 1G for zil, with the rest split roughly 50:50 between root pool and l2arc for the data pool. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
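A sketch of how the ZIL and L2ARC slices from such an ssd get attached to the data pool (the slice and device names here are made up; the root pool slice would be set up at install time):

    # small slice as a separate intent log for the data pool
    zpool add tank log c1t1d0s3

    # remaining slice as L2ARC
    zpool add tank cache c1t1d0s4

    zpool status tank    # the log and cache vdevs show up in the pool configuration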
Re: [zfs-discuss] Format returning bogus controller info
On 02/26/11 17:21, Dave Pooser wrote: While trying to add drives one at a time so I can identify them for later use, I noticed two interesting things: the controller information is unlike any I've seen before, and out of nine disks added after the boot drive all nine are attached to c12 -- and no single controller has more than eight ports. on your system, c12 is the mpxio virtual controller; any disk which is potentially multipath-able (and that includes the SAS drives) will appear as a child of the virtual controller (rather than appear as the child of two or more different physical controllers). see stmsboot(1m) for information on how to turn that off if you don't need multipathing and don't like the longer device names. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
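From memory, the relevant stmsboot(1m) invocations look roughly like the following (check the man page before running them; disabling multipathing takes effect after a reboot):

    stmsboot -L    # list the mapping between multipathed and non-multipathed device names
    stmsboot -d    # disable mpxio/multipathing
    stmsboot -e    # re-enable it later if needed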
Re: [zfs-discuss] ZFS send/recv initial data load
On 02/16/11 07:38, white...@gmail.com wrote:
> Is it possible to use a portable drive to copy the initial zfs filesystem(s) to the remote location and then make the subsequent incrementals over the network?

Yes.

> If so, what would I need to do to make sure it is an exact copy? Thank you,

Rough outline:

 - plug removable storage into source or a system near the source.
 - zpool create backup pool on removable storage
 - use an appropriate combination of zfs send & zfs receive to copy bits.
 - zpool export backup pool.
 - unplug removable storage, move it, plug it in to remote server
 - zpool import backup pool
 - use zfs send -i to verify that incrementals work

(I did something like the above when setting up my home backup because I initially dinked around with the backup pool hooked up to a laptop and then moved it to a desktop system).

optional: use zpool attach to mirror the removable storage to something faster/better/..., then after the mirror completes zpool detach to free up the removable storage.

- Bill

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
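Spelled out as commands, the outline above might look roughly like this (pool names, snapshot names, device names, and the remote host are all placeholders, and the exact send/receive flags depend on whether the whole dataset hierarchy should be replicated):

    # on (or near) the source host
    zpool create backup c5t0d0              # pool on the removable drive
    zfs snapshot -r tank@base
    zfs send -R tank@base | zfs receive -d backup
    zpool export backup

    # physically move the drive, then on the remote host
    zpool import backup

    # later, incrementals go over the network
    zfs snapshot -r tank@next
    zfs send -R -i tank@base tank@next | ssh remotehost zfs receive -d backup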
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On 02/07/11 12:49, Yi Zhang wrote:
> If buffering is on, the running time of my app doesn't reflect the actual I/O cost. My goal is to accurately measure the time of I/O. With buffering on, ZFS would batch up a bunch of writes and change both the original I/O activity and the time.

if batching main pool writes improves the overall throughput of the system over a more naive i/o scheduling model, don't you want your users to see the improvement in performance from that batching?

why not set up a steady-state sustained workload that will run for hours, and measure how long it takes the system to commit each 1000 or 10000 transactions in the middle of the steady state workload?

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Understanding directio, O_DSYNC and zfs_nocacheflush on ZFS
On 02/07/11 11:49, Yi Zhang wrote: The reason why I tried that is to get the side effect of no buffering, which is my ultimate goal. ultimate = "final". you must have a goal beyond the elimination of buffering in the filesystem. if the writes are made durable by zfs when you need them to be durable, why does it matter that it may buffer data while it is doing so? - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS advice for laptop
On 01/04/11 18:40, Bob Friesenhahn wrote:
> Zfs will disable write caching if it sees that a partition is being used

This is backwards. ZFS will enable write caching on a disk if a single pool believes it owns the whole disk. Otherwise, it will do nothing to caching. You can enable it yourself with the format command and ZFS won't disable it.

- Bill

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Crypto in Oracle Solaris 11 Express
On 11/17/10 12:04, Miles Nordin wrote:
> black-box crypto is snake oil at any level, IMNSHO.

Absolutely. Congrats again on finishing your project, but every other disk encryption framework I've seen taken remotely seriously has a detailed paper describing the algorithm, not just a list of features and a configuration guide. It should be a requirement for anything treated as more than a toy. I might have missed yours, or maybe it's coming soon.

In particular, the mechanism by which dedup-friendly block IV's are chosen based on the plaintext needs public scrutiny. Knowing Darren, it's very likely that he got it right, but in crypto, all the details matter and if a spec detailed enough to allow for interoperability isn't available, it's safest to assume that some of the details are wrong.

- Bill

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] resilver = defrag?
On 09/09/10 20:08, Edward Ned Harvey wrote:
> Scores so far:
>   2 No
>   1 Yes

No. resilver does not re-layout your data or change whats in the block pointers on disk. if it was fragmented before, it will be fragmented after.

> C) Does zfs send zfs receive mean it will defrag?
> Scores so far:
>   1 No
>   2 Yes

"maybe". If there is sufficient contiguous freespace in the destination pool, files may be less fragmented. But if you do incremental sends of multiple snapshots, you may well replicate some or all the fragmentation on the origin (because snapshots only copy the blocks that change, and receiving an incremental send does the same). And if the destination pool is short on space you may end up more fragmented than the source.

- Bill

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS with Equallogic storage
On 08/21/10 10:14, Ross Walker wrote: I am trying to figure out the best way to provide both performance and resiliency given the Equallogic provides the redundancy. (I have no specific experience with Equallogic; the following is just generic advice) Every bit stored in zfs is checksummed at the block level; zfs will not use data or metadata if the checksum doesn't match. zfs relies on redundancy (storing multiple copies) to provide resilience; if it can't independently read the multiple copies and pick the one it likes, it can't recover from bitrot or failure of the underlying storage. if you want resilience, zfs must be responsible for redundancy. You imply having multiple storage servers. The simplest thing to do is export one large LUN from each of two different storage servers, and have ZFS mirror them. While this reduces the available space, depending on your workload, you can make some of it back by enabling compression. And, given sufficiently recent software, and sufficient memory and/or ssd for l2arc, you can enable dedup. Of course, the effectiveness of both dedup and compression depends on your workload. Would I be better off forgoing resiliency for simplicity, putting all my faith into the Equallogic to handle data resiliency? IMHO, no; the resulting system will be significantly more brittle. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
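A minimal sketch of the layout suggested above, assuming one large iSCSI LUN from each of the two arrays shows up as a local disk (the device names are placeholders):

    # one LUN from each storage server, mirrored by zfs
    zpool create tank mirror c4t600A0B8000001111d0 c4t600A0B8000002222d0

    # optionally claw back some of the capacity lost to mirroring
    zfs set compression=on tank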
Re: [zfs-discuss] Increase resilver priority
On 07/23/10 02:31, Giovanni Tirloni wrote: We've seen some resilvers on idle servers that are taking ages. Is it possible to speed up resilver operations somehow? Eg. iostat shows<5MB/s writes on the replaced disks. What build of opensolaris are you running? There were some recent improvements (notably the addition of prefetch to the pool traverse used by scrub and resilver) which sped this up significantly for my systems. Also: if there are large numbers of snapshots, pools seem to take longer to resilver, particularly when there's a lot of metadata divergence between snapshots. Turning off atime updates (if you and your applications can cope with this) may also help going forward. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] L2ARC and ZIL on same SSD?
On 07/22/10 04:00, Orvar Korvar wrote:
> Ok, so the bandwidth will be cut in half, and some people use this configuration. But, how bad is it to have the bandwidth cut in half? Will it hardly notice?

For a home server, I doubt you'll notice. I've set up several systems (desktop & home server) as follows:

 - two large conventional disks, mirrored, as data pool.
 - single X25-M, 80GB, divided in three slices:
     50% in slice 0 as root pool (with dedup & compression enabled, and copies=2 for rpool/ROOT)
     1GB in slice 3 as ZIL for data pool
     remainder in slice 4 as L2ARC for data pool.

two conventional disks + 1 ssd performs much better than two disks alone. If I needed more space (I haven't, yet), I'd add another mirror pair or two to the data pool. I've been very happy with the results.

- Bill

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool throughput: snv 134 vs 138 vs 143
On 07/20/10 14:10, Marcelo H Majczak wrote:
> It also seems to be issuing a lot more writing to rpool, though I can't tell what. In my case it causes a lot of read contention since my rpool is a USB flash device with no cache. iostat says something like up to 10w/20r per second. Up to 137 the performance has been enough, so far, for my purposes on this laptop.

if pools are more than about 60-70% full, you may be running into 6962304

workaround: add the following to /etc/system, run bootadm update-archive, and reboot

-cut here-
* Work around 6962304
set zfs:metaslab_min_alloc_size=0x1000
* Work around 6965294
set zfs:metaslab_smo_bonus_pct=0xc8
-cut here-

no guarantees, but it's helped a few systems..

- Bill

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dedup... still in beta status
On 06/15/10 10:52, Erik Trimble wrote:
> Frankly, dedup isn't practical for anything but enterprise-class machines. It's certainly not practical for desktops or anything remotely low-end.

We're certainly learning a lot about how zfs dedup behaves in practice.

I've enabled dedup on two desktops and a home server and so far haven't regretted it on those three systems. However, they each have more than typical amounts of memory (4G and up), a data pool on two or more large-capacity SATA drives, plus an X25-M ssd sliced into a root pool as well as l2arc and slog slices for the data pool (see below: [1])

I tried enabling dedup on a smaller system (with only 1G memory and a single very slow disk), observed serious performance problems, and turned it off pretty quickly.

I think, with current bits, it's not a simple matter of "ok for enterprise, not ok for desktops". with an ssd for either main storage or l2arc, and/or enough memory, and/or a not very demanding workload, it seems to be ok.

For one such system, I'm seeing:

# zpool list z
NAME  SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
z     464G   258G  206G  55%  1.25x  ONLINE  -

# zdb -D z
DDT-sha256-zap-duplicate: 432759 entries, size 304 on disk, 156 in core
DDT-sha256-zap-unique: 1094244 entries, size 298 on disk, 151 in core

dedup = 1.25, compress = 1.44, copies = 1.00, dedup * compress / copies = 1.80

- Bill

[1] To forestall responses of the form "you're nuts for putting a slog on an x25-m", which is off-topic for this thread and being discussed elsewhere: Yes, I'm aware of the write cache issues on power fail on the x25-m. For my purposes, it's a better robustness/performance tradeoff than either zil-on-spinning-rust or zil disabled, because:

 a) for many potential failure cases on whitebox hardware running bleeding edge opensolaris bits, the x25-m will not lose power and thus the write cache will stay intact across a crash.
 b) even if it loses power and loses some writes-in-flight, it's not likely to lose *everything* since the last txg sync.

It's good enough for my personal use. Your mileage will vary. As always, system design involves tradeoffs.

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] New SSD options
On 05/20/10 12:26, Miles Nordin wrote: I don't know, though, what to do about these reports of devices that almost respect cache flushes but seem to lose exactly one transaction. AFAICT this should be a works/doesntwork situation, not a continuum. But there's so much brokenness out there. I've seen similar "tail drop" behavior before -- last write or two before a hardware reset goes into the bit bucket, but ones before that are durable. So, IMHO, a cheap consumer ssd used as a zil may still be worth it (for some use cases) to narrow the window of data loss from ~30 seconds to a sub-second value. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS root ARC memory usage on VxFS system...
On 05/07/10 15:05, Kris Kasner wrote: Is ZFS swap cached in the ARC? I can't account for data in the ZFS filesystems to use as much ARC as is in use without the swap files being cached.. seems a bit redundant? There's nothing to explicitly disable caching just for swap; from zfs's point of view, the swap zvol is just like any other zvol. But, you can turn this off (assuming sufficiently recent zfs). try: zfs set primarycache=metadata rpool/swap (or whatever your swap zvol is named). (you probably want metadata rather than "none" so that things like indirect blocks for the swap device get cached). - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Single-disk pool corrupted after controller failure
On 05/01/10 13:06, Diogo Franco wrote:
> After seeing that on some cases labels were corrupted, I tried running zdb -l on mine: ... (labels 0, 1 not there, labels 2, 3 are there). I'm looking for pointers on how to fix this situation, since the disk still has available metadata.

there are two reasons why you could get this:

 1) the labels are gone.
 2) the labels are not at the start of what solaris sees as p1, and thus are somewhere else on the disk.

I'd look more closely at how freebsd computes the start of the partition or slice '/dev/ad6s1d' that contains the pool. I think #2 is somewhat more likely.

- Bill

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSD best practices
On 04/17/10 07:59, Dave Vrona wrote:
> 1) Mirroring. Leaving cost out of it, should ZIL and/or L2ARC SSDs be mirrored?

L2ARC cannot be mirrored -- and doesn't need to be. The contents are checksummed; if the checksum doesn't match, it's treated as a cache miss and the block is re-read from the main pool disks.

The ZIL can be mirrored, and mirroring it improves your ability to recover the pool in the face of multiple failures.

> 2) ZIL write cache. It appears some have disabled the write cache on the X-25E. This results in a 5 fold performance hit but it eliminates a potential mechanism for data loss. Is this valid?

With the ZIL disabled, you may lose the last ~30s of writes to the pool (the transaction group being assembled and written at the time of the crash).

With the ZIL on a device with a write cache that ignores cache flush requests, you may lose the tail of some of the intent logs, starting with the first block in each log which wasn't readable after the restart. (I say "may" rather than "will" because some failures may not result in the loss of the write cache).

Depending on how quickly your ZIL device pushes writes from cache to stable storage, this may narrow the window from ~30s to less than 1s, but doesn't close the window entirely.

> If I can mirror ZIL, I imagine this is no longer a concern?

Mirroring a ZIL device with a volatile write cache doesn't eliminate this risk. Whether it reduces the risk depends on precisely *what* caused your system to crash and reboot; if the failure also causes loss of the write cache contents on both sides of the mirror, mirroring won't help.

- Bill

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
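For completeness, attaching a mirrored log looks something like this (device names are placeholders); an existing single log device can also be turned into a mirror after the fact:

    # add a mirrored pair of devices as the intent log
    zpool add tank log mirror c6t0d0 c6t1d0

    # or mirror an existing log device later
    zpool attach tank c6t0d0 c6t1d0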
Re: [zfs-discuss] Is it safe/possible to idle HD's in a ZFS Vdev to save wear/power?
On 04/16/10 20:26, Joe wrote:
> I was just wondering if it is possible to spindown/idle/sleep hard disks that are part of a Vdev & pool SAFELY?

it's possible. my ultra24 desktop has this enabled by default (because it's a known desktop type).

see the power.conf man page; I think you may need to add an "autopm enable" if the system isn't recognized as a known desktop. the disks spin down when the system is idle; there's a delay of a few seconds when they spin back up.

- Bill

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
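A sketch of what that might look like in /etc/power.conf (the device path and idle threshold are made-up examples; the exact path syntax for device-thresholds is worth checking against the power.conf man page for your release, and pmconfig applies the file after editing):

    # /etc/power.conf excerpt
    autopm              enable
    # spin this disk down after 30 minutes of idle time
    device-thresholds   /dev/dsk/c1t1d0s0   1800s

    # then, as root:
    pmconfig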
Re: [zfs-discuss] dedup screwing up snapshot deletion
On 04/14/10 19:51, Richard Jahnel wrote: This sounds like the known issue about the dedupe map not fitting in ram. Indeed, but this is not correct: When blocks are freed, dedupe scans the whole map to ensure each block is not is use before releasing it. That's not correct. dedup uses a data structure which is indexed by the hash of the contents of each block. That hash function is effectively random, so it needs to access a *random* part of the map for each free which means that it (as you correctly stated): ... takes a veeery long time if the map doesn't fit in ram. If you can try adding more ram to the system. Adding a flash-based ssd as an cache/L2ARC device is also very effective; random i/o to ssd is much faster than random i/o to spinning rust. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Suggestions about current ZFS setup
On 04/14/10 12:37, Christian Molson wrote:
> First I want to thank everyone for their input, It is greatly appreciated. To answer a few questions:
>
> Chassis I have: http://www.supermicro.com/products/chassis/4U/846/SC846E2-R900.cfm
> Motherboard: http://www.tyan.com/product_board_detail.aspx?pid=560
> RAM: 24 GB (12 x 2GB)
> 10 x 1TB Seagates 7200.11
> 10 x 1TB Hitachi
> 4 x 2TB WD WD20EARS (4K blocks)

If you have the spare change for it I'd add one or two SSD's to the mix, with space on them allocated to the root pool plus l2arc cache, and slog for the data pool(s).

- Bill

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Secure delete?
On 04/11/10 12:46, Volker A. Brandt wrote: The most paranoid will replace all the disks and then physically destroy the old ones. I thought the most paranoid will encrypt everything and then forget the key... :-) Actually, I hear that the most paranoid encrypt everything *and then* destroy the physical media when they're done with it. Seriously, once encrypted zfs is integrated that's a viable method. It's certainly a new tool to help with the problem, but consider that forgetting a key requires secure deletion of the key. Like most cryptographic techniques, filesystem encryption only changes the size of the problem we need to solve. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Secure delete?
On 04/11/10 10:19, Manoj Joseph wrote:
> Earlier writes to the file might have left older copies of the blocks lying around which could be recovered.

Indeed; to be really sure you need to overwrite all the free space in the pool.

If you limit yourself to worrying about data accessible via a regular read on the raw device, it's possible to do this without an outage if you have a spare disk and a lot of time. rough process:

 0) delete the files and snapshots containing the data you wish to purge.
 1) replace a previously unreplaced disk in the pool with the spare disk using "zpool replace"
 2) wait for the replace to complete
 3) wipe the removed disk, using the "purge" command of format(1m)'s analyze subsystem or equivalent; the wiped disk is now the spare disk.
 4) if all disks have not been replaced yet, go back to step 1.

This relies on the fact that the resilver kicked off by "zpool replace" copies only allocated data.

There are some assumptions in the above. For one, I'm assuming that all disks in the pool are the same size. A bigger one is that a "purge" is sufficient to wipe the disks completely -- probably the biggest single assumption, given that the underlying storage devices themselves are increasingly using copy-on-write techniques.

The most paranoid will replace all the disks and then physically destroy the old ones.

- Bill

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
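One iteration of the replace-and-wipe loop described above, expressed as commands (the pool and disk names are placeholders, and the format steps are interactive):

    # step 1: swap the spare in for one not-yet-replaced disk
    zpool replace tank c2t3d0 c2t9d0

    # step 2: watch until the resilver completes
    zpool status tank

    # step 3: wipe the disk that was just replaced; within format,
    # select c2t3d0, then: analyze -> purge
    format

    # c2t3d0 now becomes the spare for the next iteration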
Re: [zfs-discuss] SSD sale on newegg
On 04/06/10 17:17, Richard Elling wrote: You could probably live with an X25-M as something to use for all three, but of course you're making tradeoffs all over the place. That would be better than almost any HDD on the planet because the HDD tradeoffs result in much worse performance. Indeed. I've set up a couple small systems (one a desktop workstation, and the other a home fileserver) with root pool plus the l2arc and slog for a data pool on an 80G X25-M and have been very happy with the result. The recipe I'm using is to slice the ssd, with the rpool in s0 with roughly half the space, 1GB in s3 for slog, and the rest of the space as L2ARC in s4. That may actually be overly generous for the root pool, but I run with copies=2 on rpool/ROOT and I tend to keep a bunch of BE's around. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Tuning the ARC towards LRU
On 04/05/10 15:24, Peter Schuller wrote: In the urxvt case, I am basing my claim on informal observations. I.e., "hit terminal launch key, wait for disks to rattle, get my terminal". Repeat. Only by repeating it very many times in very rapid succession am I able to coerce it to be cached such that I can immediately get my terminal. And what I mean by that is that it keeps necessitating disk I/O for a long time, even on rapid successive invocations. But once I have repeated it enough times it seems to finally enter the cache. Are you sure you're not seeing unrelated disk update activity like atime updates, mtime updates on pseudo-terminals, etc., ? I'd want to start looking more closely at I/O traces (dtrace can be very helpful here) before blaming any specific system component for the unexpected I/O. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposition of a new zpool property.
On 03/22/10 11:02, Richard Elling wrote: > Scrub tends to be a random workload dominated by IOPS, not bandwidth. you may want to look at this again post build 128; the addition of metadata prefetch to scrub/resilver in that build appears to have dramatically changed how it performs (largely for the better). - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] sympathetic (or just multiple) drive failures
On 03/19/10 19:07, zfs ml wrote: What are peoples' experiences with multiple drive failures? 1985-1986. DEC RA81 disks. Bad glue that degraded at the disk's operating temperature. Head crashes. No more need be said. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Scrub not completing?
On 03/17/10 14:03, Ian Collins wrote: I ran a scrub on a Solaris 10 update 8 system yesterday and it is 100% done, but not complete: scrub: scrub in progress for 23h57m, 100.00% done, 0h0m to go Don't panic. If "zpool iostat" still shows active reads from all disks in the pool, just step back and let it do its thing until it says the scrub is complete. There's a bug open on this: 6899970 scrub/resilver percent complete reporting in zpool status can be overly optimistic scrub/resilver progress reporting compares the number of blocks read so far to the number of blocks currently allocated in the pool. If blocks that have already been visited are freed and new blocks are allocated, the seen:allocated ratio is no longer an accurate estimate of how much more work is needed to complete the scrub. Before the scrub prefetch code went in, I would routinely see scrubs last 75 hours which had claimed to be "100.00% done" for over a day. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] terrible ZFS performance compared to UFS on ramdisk (70% drop)
On 03/08/10 17:57, Matt Cowger wrote: Change zfs options to turn off checksumming (don't want it or need it), atime, compression, 4K block size (this is the applications native blocksize) etc. even when you disable checksums and compression through the zfs command, zfs will still compress and checksum metadata. the evil tuning guide describes an unstable interface to turn off metadata compression, but I don't see anything in there for metadata checksums. if you have an actual need for an in-memory filesystem, will tmpfs fit the bill? - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
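If an in-memory filesystem really is what's wanted, mounting an additional tmpfs instance is a one-liner (the mount point and size cap here are arbitrary examples):

    mkdir /ramdisk
    mount -F tmpfs -o size=4096m swap /ramdisk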
Re: [zfs-discuss] Snapshot recycle freezes system activity
On 03/08/10 12:43, Tomas Ögren wrote: So we tried adding 2x 4GB USB sticks (Kingston Data Traveller Mini Slim) as metadata L2ARC and that seems to have pushed the snapshot times down to about 30 seconds. Out of curiosity, how much physical memory does this system have? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] swap across multiple pools
On 03/03/10 05:19, Matt Keenan wrote:
> In a multipool environment, would it make sense to add swap to a pool outside of the root pool, either as the sole swap dataset to be used or as extra swap?

Yes. I do it routinely, primarily to preserve space on boot disks on large-memory systems. swap can go in any pool, while dump has the same limitations as root: single top-level vdev, single-disk or mirrors only.

> Would this have any performance implications?

If the non-root pool has many spindles, random read I/O should be faster and thus swap i/o should be faster. I haven't attempted to measure if this makes a difference. I generally set primarycache=metadata on swap zvols but I also haven't been able to measure whether it makes any difference.

My users do complain when /tmp fills because there isn't sufficient swap, so I do know I need large amounts of swap on these systems. (when migrating one such system from Nevada to Opensolaris recently I forgot to add swap to /etc/vfstab).

- Bill

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
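A sketch of adding swap from a non-root pool (the pool name and size are placeholders; the vfstab line makes it persistent across reboots):

    # create a zvol sized for swap in the data pool
    zfs create -V 32G datapool/swap
    zfs set primarycache=metadata datapool/swap

    # add it as swap now
    swap -a /dev/zvol/dsk/datapool/swap

    # and in /etc/vfstab so it comes back at boot:
    # /dev/zvol/dsk/datapool/swap  -  -  swap  -  no  -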
Re: [zfs-discuss] compressed root pool at installation time with flash archive predeployment script
On 03/02/10 12:57, Miles Nordin wrote: "cc" == chad campbell writes: cc> I was trying to think of a way to set compression=on cc> at the beginning of a jumpstart. are you sure grub/ofwboot/whatever can read compressed files? Grub and the sparc zfs boot blocks can read lzjb-compressed blocks in zfs. I have compression=on (and copies=2) for both sparc and x86 roots; I'm told that grub's zfs support also knows how to fall back to ditto blocks if the first copy fails to be readable or has a bad checksum. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Who is using ZFS ACL's in production?
On 03/02/10 08:13, Fredrich Maney wrote: Why not do the same sort of thing and use that extra bit to flag a file, or directory, as being an ACL only file and will negate the rest of the mask? That accomplishes what Paul is looking for, without breaking the existing model for those that need/wish to continue to use it? While we're designing on the fly: Another possibility would be to use an additional umask bit or two to influence the mode-bit - acl interaction. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Who is using ZFS ACL's in production?
On 03/01/10 13:50, Miles Nordin wrote: "dd" == David Dyer-Bennet writes: dd> Okay, but the argument goes the other way just as well -- when dd> I run "chmod 6400 foobar", I want the permissions set that dd> specific way, and I don't want some magic background feature dd> blocking me. This will be true either way. Even if chmod isn't ignored, it will reach into the nest of ACL's and mangle them in some non-obvious way with unpredictable consequences, and the mangling will be implemented by a magical background feature. actually, you can be surprised even if there are no acls in use -- if, unbeknownst to you, some user has been granted file_dac_read or file_dac_write privilege, they will be able to bypass the file modes for read and/or for write. Likewise if that user has been delegated zfs "send" rights on the filesystem the file is in, they'll be able to read every bit of the file. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS compression and deduplication on root pool on SSD
On 02/28/10 15:58, valrh...@gmail.com wrote:
> Also, I don't have the numbers to prove this, but it seems to me that the actual size of rpool/ROOT has grown substantially since I did a clean install of build 129a (I'm now at build 133). Without compression, either, that was around 24 GB, but things seem to have accumulated by an extra 11 GB or so.

One common source for this is slowly accumulating files under /var/pkg/download.

Clean out /var/pkg/download and delete all but the most recent boot environment to recover space (you need to do this to get the space back because the blocks are referenced by the snapshots used by each clone as its base version).

To avoid this in the future, set PKG_CACHEDIR in your environment to point at a filesystem which isn't cloned by beadm -- something outside rpool/ROOT, for instance. On several systems which have two pools (root & data) I've relocated it to the data pool - it doesn't have to be part of the root pool. This has significantly slimmed down my root filesystem on systems which are chasing the dev branch of opensolaris.

> At present, my rpool/ROOT has no compression, and no deduplication. I was wondering about whether it would be a good idea, from a performance and data integrity standpoint, to use one, the other, or both, on the root pool.

I've used the combination of copies=2 and compression=yes on rpool/ROOT for a while and have been happy with the result. On one system I recently moved to an ssd root, I also turned on dedup and it seems to be doing just fine:

NAME  SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
r2     37G  14.7G  22.3G  39%  1.31x  ONLINE  -

(the relatively high dedup ratio is because I have one live upgrade BE with nevada build 130, and a beadm BE with opensolaris build 130, which is mostly the same)

- Bill

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
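For example, relocating the pkg cache and reclaiming the space already consumed might look like this (the dataset name and BE name are made up; PKG_CACHEDIR just needs to point somewhere outside the cloned rpool/ROOT hierarchy):

    # put the pkg download cache in the data pool instead of rpool/ROOT
    zfs create -o compression=on tank/pkgcache
    export PKG_CACHEDIR=/tank/pkgcache

    # reclaim space already used by old downloads and old boot environments
    rm -rf /var/pkg/download/*
    beadm destroy opensolaris-129a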
Re: [zfs-discuss] Who is using ZFS ACL's in production?
On 02/26/10 17:38, Paul B. Henson wrote: As I wrote in that new sub-thread, I see no option that isn't surprising in some way. My preference would be for what I labeled as option (b). And I think you absolutely should be able to configure your fileserver to implement your preference. Why shouldn't I be able to configure my fileserver to implement mine :)? acl-chmod interactions have been mishandled so badly in the past that i think a bit of experimentation with differing policies is in order. Based on the amount of wailing I see around acls, I think that, based on personal experience with both systems, AFS had it more or less right and POSIX got it more or less wrong -- once you step into the world of acls, the file mode should be mostly ignored, and an accidental chmod should *not* destroy carefully crafted acls. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Freeing unused space in thin provisioned zvols
On 02/26/10 11:42, Lutz Schumann wrote:
> Idea:
>  - If the guest writes a block with 0's only, the block is freed again
>  - if someone reads this block again - it will get the same 0's it would get if the 0's would be written
>  - The checksum of an "all 0" block can be hard coded for SHA1 / Fletcher, so the comparison for "is this a 0-only block" is easy.
>
> With this in place, a host wishing to free thin provisioned zvol space can fill the unused blocks with 0s easily with simple tools (e.g. dd if=/dev/zero of=/MYFILE bs=1M; rm /MYFILE) and the space is freed again on the zvol side.

You've just described how ZFS behaves when compression is enabled -- a block of zeros is compressed to a hole represented by an all-zeros block pointer.

> Does anyone know why this is not incorporated into ZFS?

It's in there. Turn on compression to use it.

- Bill

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Who is using ZFS ACL's in production?
On 02/26/10 10:45, Paul B. Henson wrote: I've already posited as to an approach that I think would make a pure-ACL deployment possible: http://mail.opensolaris.org/pipermail/zfs-discuss/2010-February/037206.html Via this concept or something else, there needs to be a way to configure ZFS to prevent the attempted manipulation of legacy permission mode bits from breaking the security policy of the ACL. I believe this proposal is sound. In it, you wrote: The feedback was that the internal Sun POSIX compliance police wouldn't like that ;). There are already per-filesystem tunables for ZFS which allow the system to escape the confines of POSIX (noatime, for one); I don't see why a "chmod doesn't truncate acls" option couldn't join it so long as it was off by default and left off while conformance tests were run. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ZIL + L2ARC SSD Setup
On 02/12/10 09:36, Felix Buenemann wrote:
> given I've got ~300GB L2ARC, I'd need about 7.2GB RAM, so upgrading to 8GB would be enough to satisfy the L2ARC.

But that would only leave ~800MB free for everything else the server needs to do.

- Bill

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Reading ZFS config for an extended period
On 02/11/10 10:33, Lori Alt wrote: This bug is closed as a dup of another bug which is not readable from the opensolaris site, (I'm not clear what makes some bugs readable and some not). the other bug in question was opened yesterday and probably hasn't had time to propagate. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] most of my space is gone
On 02/06/10 08:38, Frank Middleton wrote: AFAIK there is no way to get around this. You can set a flag so that pkg tries to empty /var/pkg/downloads, but even though it looks empty, it won't actually become empty until you delete the snapshots, and IIRC you still have to manually delete the contents. I understand that you can try creating a separate dataset and mounting it on /var/pkg, but I haven't tried it yet, and I have no idea if doing so gets around the BE snapshot problem. You can set the environment variable PKG_CACHEDIR to place the cache in an alternate filesystem. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] server hang with compression on, ping timeouts from remote machine
On 01/31/10 07:07, Christo Kutrovsky wrote: I've also experienced similar behavior (short freezes) when running zfs send|zfs receive with compression on LOCALLY on ZVOLs again. Has anyone else experienced this ? Know any of bug? This is on snv117. you might also get better results after the fix to: 6881015 ZFS write activity prevents other threads from running in a timely manner which was fixed in build 129. As a workaround, try a lower gzip compression level -- higher gzip levels usually burn lots more CPU without significantly increasing the compression ratio. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
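If gzip is wanted at all, picking a lower level is a one-line change per dataset (the dataset name is a placeholder):

    zfs set compression=gzip-1 tank/data   # gzip-1 .. gzip-9 are accepted; lzjb is the cheaper default
    zfs get compression tank/data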
Re: [zfs-discuss] zvol being charged for double space
On 01/27/10 21:17, Daniel Carosone wrote:
> This is as expected. Not expected is that:
>
>   usedbyrefreservation = refreservation
>
> I would expect this to be 0, since all the reserved space has been allocated.

This would be the case if the volume had no snapshots.

> As a result, used is over twice the size of the volume (+ a few small snapshots as well).

I'm seeing essentially the same thing with a recently-created zvol with snapshots that I export via iscsi for time machine backups on a mac.

% zfs list -r -o name,refer,used,usedbyrefreservation,refreservation,volsize z/tm/mcgarrett
NAME            REFER   USED  USEDREFRESERV  REFRESERV  VOLSIZE
z/tm/mcgarrett  26.7G  88.2G            60G        60G      60G

The actual volume footprint is a bit less than half of the volume size, but the refreservation ensures that there is enough free space in the pool to allow me to overwrite every block of the zvol with uncompressable data without any writes failing due to the pool being out of space.

If you were to disable time-based snapshots and then overwrite a measurable fraction of the zvol, I'd expect "USEDBYREFRESERVATION" to shrink as the reserved blocks were actually used.

If you want to allow for overcommit, you need to delete the refreservation.

- Bill

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Degrated pool menbers excluded from writes ?
On 01/24/10 12:20, Lutz Schumann wrote: One can see that the degrated mirror is excluded from the writes. I think this is expected behaviour right ? (data protection over performance) That's correct. It will use the space if it needs to but it prefers to avoid "sick" top-level vdevs if there are healthy ones available. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Disks and caches
On Thu, 2010-01-07 at 11:07 -0800, Anil wrote: > There is talk about using those cheap disks for rpool. Isn't rpool > also prone to a lot of writes, specifically when the /tmp is in a SSD? Huh? By default, solaris uses tmpfs for /tmp, /var/run, and /etc/svc/volatile; writes to those filesystems won't hit the SSD unless the system is short on physical memory. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zpool fragmentation issues?
On Tue, 2009-12-15 at 17:28 -0800, Bill Sprouse wrote: > After > running for a while (couple of months) the zpool seems to get > "fragmented", backups take 72 hours and a scrub takes about 180 > hours. Are there periodic snapshots being created in this pool? Can they run with atime turned off? (file tree walks performed by backups will update the atime of all directories; this will generate extra write traffic and also cause snapshots to diverge from their parents and take longer to scrub). - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs on ssd
On Fri, 2009-12-11 at 13:49 -0500, Miles Nordin wrote: > > "sh" == Seth Heeren writes: > > sh> If you don't want/need log or cache, disable these? You might > sh> want to run your ZIL (slog) on ramdisk. > > seems quite silly. why would you do that instead of just disabling > the ZIL? I guess it would give you a way to disable it pool-wide > instead of system-wide. > > A per-filesystem ZIL knob would be awesome. for what it's worth, there's already a per-filesystem ZIL knob: the "logbias" property. It can be set either to "latency" or "throughput". ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
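For example (the dataset name is a placeholder):

    zfs set logbias=throughput tank/oradata   # steer large synchronous writes to the main pool devices
    zfs get logbias tank/oradata              # "latency" is the default and prefers the slog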
Re: [zfs-discuss] Resilver/scrub times?
Yesterday's integration of 6678033 ("resilver code should prefetch") as part of changeset 74e8c05021f1 (which should be in build 129 when it comes out) may improve scrub times, particularly if you have a large number of small files and a large number of snapshots. I recently tested an early version of the fix, and saw one pool go from an elapsed time of 85 hours to 20 hours; another (with many fewer snapshots) went from 35 to 17.

- Bill

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs eradication
On Wed, 2009-11-11 at 10:29 -0800, Darren J Moffat wrote: > Joerg Moellenkamp wrote: > > Hi, > > > > Well ... i think Darren should implement this as a part of > zfs-crypto. Secure Delete on SSD looks like quite challenge, when wear > leveling and bad block relocation kicks in ;) > > No I won't be doing that as part of the zfs-crypto project. As I said > some jurisdictions are happy that if the data is encrypted then > overwrite of the blocks isn't required. For those that aren't use > dd(1M) or format(1M) may be sufficient - if that isn't then nothing > short of physical destruction is likely good enough. note that "eradication" via overwrite makes no sense if the underlying storage uses copy-on-write, because there's no guarantee that the newly written block actually will overlay the freed block. IMHO the sweet spot here may be to overwrite once with zeros (allowing the block to be compressed out of existance if the underlying storage is a compressed zvol or equivalent) or to use the TRIM command. (It may also be worthwhile for zvols exported via various protocols to themselves implement the TRIM command -- freeing the underlying storage). - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] This is the scrub that never ends...
On Fri, 2009-09-11 at 13:51 -0400, Will Murnane wrote:
> On Thu, Sep 10, 2009 at 13:06, Will Murnane wrote:
>> On Wed, Sep 9, 2009 at 21:29, Bill Sommerfeld wrote:
>>>> Any suggestions?
>>>
>>> Let it run for another day.
>> I'll let it keep running as long as it wants this time.
>
> scrub: scrub completed after 42h32m with 0 errors on Thu Sep 10 17:20:19 2009
>
> And the people rejoiced. So I guess the issue is more "scrubs may report ETA very inaccurately" than "scrubs never finish". Thanks for the suggestions and support.

One of my pools routinely does this -- the scrub gets to 100% after about 50 hours but keeps going for another day or more after that.

It turns out that zpool reports "number of blocks visited" vs "number of blocks allocated", but clamps the ratio at 100%. If there is substantial turnover in the pool, it appears you may end up needing to visit more blocks than are actually allocated at any one point in time.

I made a modified version of the zpool command and this is what it prints for me:

 ...
 scrub: scrub in progress for 74h25m, 119.90% done, 0h0m to go
	 5428197411840 blocks examined, 4527262118912 blocks allocated
 ...

This is the (trivial) source change I made to see what's going on under the covers:

diff -r 12fb4fb507d6 usr/src/cmd/zpool/zpool_main.c
--- a/usr/src/cmd/zpool/zpool_main.c	Mon Oct 26 22:25:39 2009 -0700
+++ b/usr/src/cmd/zpool/zpool_main.c	Tue Nov 10 17:07:59 2009 -0500
@@ -2941,12 +2941,15 @@
 	if (examined == 0)
 		examined = 1;
-	if (examined > total)
-		total = examined;

 	fraction_done = (double)examined / total;
-	minutes_left = (uint64_t)((now - start) *
-	    (1 - fraction_done) / fraction_done / 60);
+	if (fraction_done < 1) {
+		minutes_left = (uint64_t)((now - start) *
+		    (1 - fraction_done) / fraction_done / 60);
+	} else {
+		minutes_left = 0;
+	}
+
 	minutes_taken = (uint64_t)((now - start) / 60);

 	(void) printf(gettext("%s in progress for %lluh%um, %.2f%% done, "
@@ -2954,6 +2957,9 @@
 	    scrub_type, (u_longlong_t)(minutes_taken / 60),
 	    (uint_t)(minutes_taken % 60), 100 * fraction_done,
 	    (u_longlong_t)(minutes_left / 60), (uint_t)(minutes_left % 60));
+	(void) printf(gettext("\t %lld blocks examined, %lld blocks allocated\n"),
+	    examined,
+	    total);
 }

 static void

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] dedupe question
On Sat, 2009-11-07 at 17:41 -0500, Dennis Clarke wrote: > Does the dedupe functionality happen at the file level or a lower block > level? it occurs at the block allocation level. > I am writing a large number of files that have the fol structure : > > -- file begins > 1024 lines of random ASCII chars 64 chars long > some tilde chars .. about 1000 of then > some text ( english ) for 2K > more text ( english ) for 700 bytes or so > -- ZFS's default block size is 128K and is controlled by the "recordsize" filesystem property. Unless you changed "recordsize", each of the files above would be a single block distinct from the others. you may or may not get better dedup ratios with a smaller recordsize depending on how the common parts of the file line up with block boundaries. the cost of additional indirect blocks might overwhelm the savings from deduping a small common piece of the file. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
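A hedged sketch of the experiment, with hypothetical pool and dataset names; whether a smaller recordsize helps depends entirely on whether the files' common sections land on block boundaries:

    zfs create -o dedup=on -o recordsize=128k tank/dedup128k
    zfs create -o dedup=on -o recordsize=8k tank/dedup8k
    # copy the same file set into each, then compare
    zpool get dedupratio tank
    zfs get recordsize,used tank/dedup128k tank/dedup8k

Since dedupratio is reported per pool, to compare the two settings cleanly you would want to load them one at a time (or use separate pools).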
Re: [zfs-discuss] sched regularily writing a lots of MBs to the pool?
zfs groups writes together into transaction groups; the physical writes to disk are generally initiated by kernel threads (which appear in dtrace as threads of the "sched" process). Changing the attribution is not going to be simple, because a single physical write to the pool may contain data and metadata changes triggered by multiple user processes. You need to go up a level of abstraction and look at the vnode layer to attribute writes to particular processes. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
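For example, a rough way to attribute write activity to processes above the transaction-group machinery is to aggregate at the system-call layer with dtrace (a sketch; it counts bytes requested by each process, not bytes that eventually reach the pool):

    dtrace -n 'syscall::write:entry,syscall::pwrite:entry { @bytes[execname] = sum(arg2); }'

Let it run for a while and interrupt it to see the per-process byte counts.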
Re: [zfs-discuss] Resilvering, amount of data on disk, etc.
On Mon, 2009-10-26 at 10:24 -0700, Brian wrote: > Why does resilvering an entire disk, yield different amounts of data that was > resilvered each time. > I have read that ZFS only resilvers what it needs to, but in the case of > replacing an entire disk with another formatted clean disk, you would think > the amount of data would be the same each time a disk is replaced with an > empty formatted disk. > I'm getting different results when viewing the 'zpool status' info (below) replacing a disk adds an entry to the "zpool history" log, which requires allocating blocks, which will change what's stored in the pool. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Which directories must be part of rpool?
On Fri, 2009-09-25 at 14:39 -0600, Lori Alt wrote: > The list of datasets in a root pool should look something like this: ... > rpool/swap I've had success with putting swap into other pools. I believe others have, as well. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
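As a sketch (pool name and size hypothetical), moving swap off the root pool is just a matter of creating a zvol elsewhere and pointing swap(1M) at it:

    zfs create -V 8G otherpool/swap
    swap -a /dev/zvol/dsk/otherpool/swap
    swap -l
    # remove the old rpool swap device once the new one is active
    swap -d /dev/zvol/dsk/rpool/swap

Remember to update the swap entry in /etc/vfstab so the change survives a reboot.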
Re: [zfs-discuss] RAIDZ versus mirrroed
On Wed, 2009-09-16 at 14:19 -0700, Richard Elling wrote: > Actually, I had a ton of data on resilvering which shows mirrors and > raidz equivalently bottlenecked on the media write bandwidth. However, > there are other cases which are IOPS bound (or CR bound :-) which > cover some of the postings here. I think Sommerfeld has some other > data which could be pertinent. I'm not sure I have data, but I have anecdotes and observations, and a few large production pools used for solaris development by me and my coworkers. the biggest one (by disk count) takes 80-100 hours to scrub and/or resilver. my working hypothesis is that pools which: 1) have a lot of files, directories, filesystems, and periodic snapshots 2) have atime updates enabled (default config) 3) have regular (daily) jobs doing large-scale filesystem tree-walks wind up rewriting most blocks of the dnode files on every tree walk (because of the atime updates), so the dnode file (but not most of the blocks it points to) differs greatly from daily snapshot to daily snapshot. as a result, scrub/resilver traversals end up spending most of their time doing random reads of the dnode files of each snapshot. here are some bugs that, if fixed, might help: 6678033 resilver code should prefetch 6730737 investigate colocating directory dnodes - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
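If that hypothesis holds, one mitigation (at the cost of losing access-time semantics) is simply to stop the tree-walk jobs from dirtying the dnode files at all, e.g. (dataset name hypothetical):

    zfs set atime=off tank/builds
    zfs get atime tank/builds

The daily tree walks then become pure reads, and successive snapshots share the same dnode file blocks instead of diverging.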
Re: [zfs-discuss] This is the scrub that never ends...
On Wed, 2009-09-09 at 21:30 +, Will Murnane wrote: > Some hours later, here I am again: > scrub: scrub in progress for 18h24m, 100.00% done, 0h0m to go > Any suggestions? Let it run for another day. A pool on a build server I manage takes about 75-100 hours to scrub, but typically starts reporting "100.00% done, 0h0m to go" at about the 50-60 hour point. I suspect the combination of frequent time-based snapshots and a pretty active set of users causes the progress estimate to be off.. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] zfs kernel compilation issue
On Fri, 2009-08-28 at 23:12 -0700, P. Anil Kumar wrote: > I would like to know why its picking up amd64 config params from the > Makefile, while uname -a clearly shows that its i386 ? it's behaving as designed. on solaris, uname -a always shows i386 regardless of whether the system is in 32-bit or 64-bit mode. you can use the isainfo command to tell if amd64 is available. on i386, we always build both 32-bit and 64-bit kernel modules; the bootloader will figure out which kernel to load. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
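For example, on an x86 box booted 64-bit you would typically see something like:

    $ uname -p
    i386
    $ isainfo -kv
    64-bit amd64 kernel modules
    $ isainfo -b
    64

so isainfo, not uname, is the right way to decide whether amd64 bits are in use.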
Re: [zfs-discuss] avail drops to 32.1T from 40.8T after create -o mountpoint
On Wed, 2009-07-29 at 06:50 -0700, Glen Gunselman wrote: > There was a time when manufacturers know about base-2 but those days > are long gone. Oh, they know all about base-2; it's just that disks seem bigger when you use base-10 units. Measure a disk's size in 10^(3n)-based KB/MB/GB/TB units, and you get a bigger number than its size in the natural-for-software 2^(10n)-sized units. So it's obvious which numbers end up on the marketing glossies, and it's all downhill from there... - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
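The arithmetic is easy to check; for a marketing "1 TB" (10^12 byte) drive:

    $ echo 'scale=3; 10^12 / 2^40' | bc
    .909

i.e. the same drive is only about 0.91 TB in the base-2 units that tools like zfs(1M) and zpool(1M) report, before any filesystem overhead is subtracted.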
Re: [zfs-discuss] Speeding up resilver on x4500
On Mon, 2009-06-22 at 06:06 -0700, Richard Elling wrote: > Nevertheless, in my lab testing, I was not able to create a random-enough > workload to not be write limited on the reconstructing drive. Anecdotal > evidence shows that some systems are limited by the random reads. Systems I've run which have random-read-limited reconstruction have a combination of: - regular time-based snapshots - daily cron jobs which walk the filesystem, accessing all directories and updating all directory atimes in the process. Because the directory dnodes are randomly distributed through the dnode file, each block of the dnode file likely contains at least one directory dnode, and as a result each of the tree walk jobs causes the entire dnode file to diverge from the previous day's snapshot. If the underlying filesystems are mostly static and there are dozens of snapshots, a pool traverse spends most of its time reading the dnode files and finding block pointers to older blocks which it knows it has already seen. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] compression at zfs filesystem creation
On Wed, 2009-06-17 at 12:35 +0200, casper@sun.com wrote: > I still use "disk swap" because I have some bad experiences > with ZFS swap. (ZFS appears to cache and that is very wrong) I'm experimenting with running zfs swap with the primarycache attribute set to "metadata" instead of the default "all". aka: zfs set primarycache=metadata rpool/swap seems like that would be more likely to behave appropriately. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] schedulers [was: zfs related google summer of code ideas - your vote]
On Wed, 2009-03-04 at 12:49 -0800, Richard Elling wrote: > But I'm curious as to why you would want to put both the slog and > L2ARC on the same SSD? Reducing part count in a small system. For instance: adding L2ARC+slog to a laptop. I might only have one slot free to allocate to an ssd. IMHO the right administrative interface for this is for zpool to allow you to add the same device to a pool as both cache and log, and let zfs figure out how to not step on itself when allocating blocks. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
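Today the closest you can get is to carve the SSD into two slices with format(1M) and add them separately (device names hypothetical):

    zpool add mypool log c2t1d0s0
    zpool add mypool cache c2t1d0s1

which works, but leaves the log/cache split static instead of letting zfs balance the space itself.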
Re: [zfs-discuss] ZFS: unreliable for professional usage?
On Thu, 2009-02-12 at 17:35 -0500, Blake wrote: > That does look like the issue being discussed. > > It's a little alarming that the bug was reported against snv54 and is > still not fixed :( bugs.opensolaris.org's information about this bug is out of date. It was fixed in snv_54:

changeset:   3169:1dea14abfe17
user:        phitran
date:        Sat Nov 25 11:05:17 2006 -0800
files:       usr/src/uts/common/io/scsi/targets/sd.c
6424510 usb ignores DKIOCFLUSHWRITECACHE

- Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Problems at 90% zpool capacity 2008.05
On Tue, 2009-01-06 at 22:18 -0700, Neil Perrin wrote: > I vaguely remember a time when UFS had limits to prevent > ordinary users from consuming past a certain limit, allowing > only the super-user to use it. Not that I'm advocating that > approach for ZFS. looks to me like zfs already provides a mechanism for this (quotas and reservations); it's up to the sysadmin to decide on policy. Don't want the last 10% of the pool used? Create a "ballast" zvol or filesystem with a big reservation, and don't put anything in it.. Of course, some degree of experimentation may be necessary before you figure out what policy makes sense for your system or site. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
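For instance (names and size hypothetical), to keep roughly the last 200GB of a pool out of the hands of ordinary users:

    zfs create mypool/ballast
    zfs set reservation=200G mypool/ballast

and if you ever need the space back in an emergency:

    zfs set reservation=none mypool/ballast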
Re: [zfs-discuss] Tool to figure out optimum ZFS recordsize for a Mail server Maildir tree?
On Wed, 2008-10-22 at 09:46 -0700, Mika Borner wrote: > If I turn zfs compression on, does the recordsize influence the > compressratio in anyway? zfs conceptually chops the data into recordsize chunks, then compresses each chunk independently, allocating on disk only the space needed to store each compressed block. On average, I'd expect to get a better compression ratio with a larger block size since typical compression algorithms will have more chance to find redundancy in a larger block of text. as always your mileage may vary. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
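An easy way to see the effect on your own data (dataset names hypothetical): create two filesystems that differ only in recordsize, load the same files into each, and compare the reported ratios:

    zfs create -o compression=on -o recordsize=128k tank/rs128k
    zfs create -o compression=on -o recordsize=8k tank/rs8k
    # ... copy the same data set into both ...
    zfs get compressratio,used tank/rs128k tank/rs8k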
Re: [zfs-discuss] Disabling COMMIT at NFS level, or disabling ZIL on a per-filesystem basis
On Wed, 2008-10-22 at 10:45 -0600, Neil Perrin wrote: > Yes: 6280630 zil synchronicity > > Though personally I've been unhappy with the exposure that zil_disable has > got. > It was originally meant for debug purposes only. So providing an official > way to make synchronous behaviour asynchronous is to me dangerous. It seems far more dangerous to only provide a global knob instead of a local knob. I want it in conjunction with bulk operations (like an ON "nightly" build, database reloads, etc.) where the response to a partial failure will be to rm -rf and start over. Any time spent waiting for intermediate states of the filesystem to be committed to stable store is wasted time. > >Once Admins start to disable the ZIL for whole pools because the extra > >performance is too tempting, wouldn't it be the lesser evil to let them > >disable it on a per filesystem basis? Agreed. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
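For reference, the only knob that exists today is the global one, which is exactly the problem: it turns off the ZIL for every pool and filesystem on the host, e.g. via /etc/system (takes effect after a reboot):

    set zfs:zil_disable = 1

6280630 is about replacing that sledgehammer with a per-dataset property.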
Re: [zfs-discuss] Setting per-file record size / querying fs/file record size?
On Wed, 2008-10-22 at 10:30 +0100, Darren J Moffat wrote: > I'm assuming this is local filesystem rather than ZFS backed NFS (which > is what I have). Correct, on a laptop. > What has setting the 32KB recordsize done for the rest of your home > dir, or did you give the evolution directory its own dataset ? The latter, though it occurs to me that I could set the recordsize back up to 128K once the databases (one per mail account) are created -- the recordsize dataset property is read only at file create time when the file's recordsize is set. (Having a new interface to set the file's recordsize directly at create time would bypass this sort of gyration). (Apparently the sqlite file format uses 16-bit within-page offsets; 32kb is its current maximum page size and 64k may be as large as it can go without significant renovations..) - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
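The gyration in question looks roughly like this (paths hypothetical, and assuming the evolution data lives in its own dataset):

    zfs create -o recordsize=32k rpool/export/home/bill/.evolution
    # ... let evolution create its sqlite databases ...
    zfs set recordsize=128k rpool/export/home/bill/.evolution

Files created while recordsize was 32k keep 32k blocks; anything created afterwards gets the 128k default again.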
Re: [zfs-discuss] Setting per-file record size / querying fs/file record size?
On Mon, 2008-10-20 at 16:57 -0500, Nicolas Williams wrote: > I've a report that the mismatch between SQLite3's default block size and > ZFS' causes some performance problems for Thunderbird users. I was seeing a severe performance problem with sqlite3 databases as used by evolution (not thunderbird). It appears that reformatting the evolution databases to a 32KB database page size and setting zfs's record size to a matching 32KB has done wonders for evolution performance to a ZFS home directory. > It'd be great if there was an API by which SQLite3 could set its block > size to match the hosting filesystem or where it could set the DB file's > record size to match the SQLite3/app default block size (1KB). IMHO some of the fix has to involve sqlite3 using a larger page size by default when creating the database -- it seems to be a lot more efficient with the larger page size. Databases like sqlite3 are being used "under the covers" by growing numbers of applications -- it seems like there's a missing interface here if we want decent out-of-the-box performance of end-user apps like tbird and evolution using databases on zfs. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
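If the application can't be changed right away, the database itself can be rebuilt at a larger page size from the sqlite3 shell (file name hypothetical; 32768 is currently the largest page size sqlite3 accepts):

    sqlite3 folders.db 'PRAGMA page_size = 32768; VACUUM;'
    sqlite3 folders.db 'PRAGMA page_size;'

combined with a matching recordsize=32k on the dataset holding the databases.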
Re: [zfs-discuss] Quantifying ZFS reliability
On Wed, 2008-10-01 at 11:54 -0600, Robert Thurlow wrote: > > like they are not good enough though, because unless this broken > > router that Robert and Darren saw was doing NAT, yeah, it should not > > have touch the TCP/UDP checksum. NAT was not involved. > I believe we proved that the problem bit flips were such > that the TCP checksum was the same, so the original checksum > still appeared correct. That's correct. The pattern we found in corrupted data was that there would be two offsetting bit-flips. A 0->1 was followed 256 or 512 or 1024 bytes later by a 1->0 Or vice-versa. (It was always the same bit; in the cases I analyzed, the corrupted files contained C source code and the bit-flips were obvious). Under the 16-bit one's-complement checksum used by TCP, these two changes cancel each other out and the resulting packet has the same checksum. > > BTW which router was it, or you > > can't say because you're in the US? :) > > I can't remember; it was aging at the time. to use excruciatingly precise terminology, I believe the switch in question is marketed as a combo L2 bridge/L3 router but in this case may have been acting as a bridge rather than a router. After we noticed the data corruption we looked at TCP counters on hosts on that subnet and noticed a high rate of failed checksums, so clearly the TCP checksum was catching *most* of the corrupted packets; we just didn't look at the counters until after we saw data corruption. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] resilver speed.
On Fri, 2008-09-05 at 09:41 -0700, Richard Elling wrote: > > Also does the resilver deliberately pause? Running iostat I see > that it will pause for five to ten seconds where no IO is done at all, > then it continues on at a more reasonable pace. > I have not seen such behaviour during resilver characterization. I have, post nv_94, and I filed a bug: 6729696 sync causes scrub or resilver to pause for up to 30s - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sidebar to ZFS Availability discussion
On Sun, 2008-08-31 at 15:03 -0400, Miles Nordin wrote: > It's sort of like network QoS, but not quite, because: > > (a) you don't know exactly how big the ``pipe'' is, only > approximately, In an ip network, end nodes generally know no more than the pipe size of the first hop -- and in some cases (such as true CSMA networks like classical ethernet or wireless) only have an upper bound on the pipe size. beyond that, they can only estimate the characteristics of the rest of the network by observing its behavior - all they get is end-to-end latency, and *maybe* a 'congestion observed' mark set by an intermediate system. > (c) all the fabrics are lossless, so while there are queues which > undesireably fill up during congestion, these queues never drop > ``packets'' but instead exert back-pressure all the way up to > the top of the stack. hmm. I don't think the back pressure makes it all the way up to zfs (the top of the block storage stack) except as added latency. (on the other hand, if it did, zfs could schedule around it both for reads and writes, avoiding pouring more work on already-congested paths..) > I'm surprised we survive as well as we do without disk QoS. Are the > storage vendors already doing it somehow? I bet that (as with networking) in many/most cases overprovisioning the hardware and running at lower average utilization is often cheaper in practice than running close to the edge and spending a lot of expensive expert time monitoring performance and tweaking QoS parameters. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Sidebar to ZFS Availability discussion
On Sun, 2008-08-31 at 12:00 -0700, Richard Elling wrote: > 2. The algorithm *must* be computationally efficient. >We are looking down the tunnel at I/O systems that can >deliver on the order of 5 Million iops. We really won't >have many (any?) spare cycles to play with. If you pick the constants carefully (powers of two) you can do the TCP RTT + variance estimation using only a handful of shifts, adds, and subtracts. > In both of these cases, the solutions imply multi-minute timeouts are > required to maintain a stable system. Again, there are different uses for timeouts: 1) how long should we wait on an ordinary request before deciding to try "plan B" and go elsewhere (a la B_FAILFAST) 2) how long should we wait (while trying all alternatives) before declaring an overall failure and giving up. The RTT estimation approach is really only suitable for the former, where you have some alternatives available (retransmission in the case of TCP; trying another disk in the case of mirrors, etc.,). when you've tried all the alternatives and nobody's responding, there's no substitute for just retrying for a long time. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Availability: ZFS needs to handle disk removal / driver failure better
On Thu, 2008-08-28 at 13:05 -0700, Eric Schrock wrote: > A better option would be to not use this to perform FMA diagnosis, but > instead work into the mirror child selection code. This has already > been alluded to before, but it would be cool to keep track of latency > over time, and use this to both a) prefer one drive over another when > selecting the child and b) proactively timeout/ignore results from one > child and select the other if it's taking longer than some historical > standard deviation. This keeps away from diagnosing drives as faulty, > but does allow ZFS to make better choices and maintain response times. > It shouldn't be hard to keep track of the average and/or standard > deviation and use it for selection; proactively timing out the slow I/Os > is much trickier. tcp has to solve essentially the same problem: decide when a response is "overdue" based only on the timing of recent successful exchanges in a context where it's difficult to make assumptions about "reasonable" expected behavior of the underlying network. it tracks both the smoothed round trip time and the variance, and declares a response overdue after (SRTT + K * variance). I think you'd probably do well to start with something similar to what's described in http://www.ietf.org/rfc/rfc2988.txt and then tweak based on experience. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Best layout for 15 disks?
On Thu, 2008-08-21 at 21:15 -0700, mike wrote: > I've seen 5-6 disk zpools are the most recommended setup. This is incorrect. Much larger zpools built out of striped redundant vdevs (mirror, raidz1, raidz2) are recommended and also work well. raidz1 or raidz2 vdevs of more than a single-digit number of drives are not recommended. so, for instance, the following is an appropriate use of 12 drives in two raidz2 sets of 6 disks, with 8 disks worth of raw space available: zpool create mypool raidz2 disk0 disk1 disk2 disk3 disk4 disk5 zpool add mypool raidz2 disk6 disk7 disk8 disk9 disk10 disk11 > In traditional RAID terms, I would like to do RAID5 + hot spare (13 > disks usable) out of the 15 disks (like raidz2 I suppose). What would > make the most sense to setup 15 disks with ~ 13 disks of usable space? Enable compression, and set up multiple raidz2 groups. Depending on what you're storing, you may get back more than you lose to parity. > This is for a home fileserver, I do not need HA/hotplugging/etc. so I > can tolerate a failure and replace it with plenty of time. It's not > mission critical. That's a lot of spindles for a home fileserver. I'd be inclined to go with a smaller number of larger disks in mirror pairs, allowing me to buy larger disks in pairs as they come on the market to increase capacity. > Same question, but 10 disks, and I'd sacrifice one for parity then. > Not two. so ~9 disks usable roughly (like raidz) zpool create mypool raidz1 disk0 disk1 disk2 disk3 disk4 zpool add mypool raidz1 disk5 disk6 disk7 disk8 disk9 8 disks raw capacity, can survive the loss of any one disk or the loss of two disks in different raidz groups. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] more ZFS recovery
On Thu, 2008-08-07 at 11:34 -0700, Richard Elling wrote: > How would you describe the difference between the data recovery > utility and ZFS's normal data recovery process? I'm not Anton but I think I see what he's getting at. Assume you have disks which once contained a pool but all of the uberblocks have been clobbered. So you don't know where the root of the block tree is, but all the actual data is there, intact, on the disks. Given the checksums you could rebuild one or more plausible structure of the pool from the bottom up. I'd think that you could construct an offline zpool data recovery tool where you'd start with N disk images and a large amount of extra working space, compute checksums of all possible data blocks on the images, scan the disk images looking for things that might be valid block pointers, and attempt to stitch together subtrees of the filesystem and recover as much as you can even if many upper nodes in the block tree have had holes shot in them by a miscreant device. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Block unification in ZFS
See the long thread titled "ZFS deduplication", last active approximately 2 weeks ago. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Checksum error: which of my files have failed scrubbing?
On Tue, 2008-08-05 at 12:11 -0700, soren wrote: > > soren wrote: > > > ZFS has detected that my root filesystem has a > > small number of errors. Is there a way to tell which > > specific files have been corrupted? > > > > After a scrub a zpool status -v should give you a > > list of files with > > unrecoverable errors. > > Hmm, I just tried that. Perhaps "No known data errors" means that my files > are OK. In that case I wonder what the checksum failure was from. If this is build 94 and you have one or more unmounted filesystems, (such as alternate boot environments), these errors are false positives. There is no actual error; the scrubber misinterpreted the end of an intent log block chain as a checksum error. the bug id is: 6727872 zpool scrub: reports checksum errors for pool with zfs and unplayed ZIL This bug is fixed in build 95. One workaround is to mount the filesystems and then unmount them to apply the intent log changes. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
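A sketch of that workaround (BE and pool names hypothetical): mount each inactive boot environment so its intent log gets replayed, unmount it again, then re-run the scrub:

    lumount snv_93 /mnt
    luumount snv_93
    zpool scrub rpool
    zpool status -v rpool

For ordinary (non-BE) unmounted filesystems, a plain zfs mount / zfs umount pair does the same thing.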
Re: [zfs-discuss] Can I trust ZFS?
On Sun, 2008-08-03 at 11:42 -0500, Bob Friesenhahn wrote: > Zfs makes human error really easy. For example > >$ zpool destroy mypool Note that "zpool destroy" can be undone by "zpool import -D" (if you get to it before the disks are overwritten). ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
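For example (pool name hypothetical):

    zpool import -D          # lists destroyed pools that can still be recovered
    zpool import -D mypool   # re-import one of them

which works as long as the member disks haven't been reused or overwritten in the meantime.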
Re: [zfs-discuss] checksum errors on root pool after upgrade to snv_94
On Fri, 2008-07-18 at 10:28 -0700, Jürgen Keil wrote: > > I ran a scrub on a root pool after upgrading to snv_94, and got checksum > > errors: > > Hmm, after reading this, I started a zpool scrub on my mirrored pool, > on a system that is running post snv_94 bits: It also found checksum errors > > # zpool status files > pool: files > state: DEGRADED > status: One or more devices has experienced an unrecoverable error. An > attempt was made to correct the error. Applications are unaffected. > action: Determine if the device needs to be replaced, and clear the errors > using 'zpool clear' or replace the device with 'zpool replace'. >see: http://www.sun.com/msg/ZFS-8000-9P > scrub: scrub completed after 0h46m with 9 errors on Fri Jul 18 13:33:56 2008 > config: > > NAME STATE READ WRITE CKSUM > files DEGRADED 0 018 > mirror DEGRADED 0 018 > c8t0d0s6 DEGRADED 0 036 too many errors > c9t0d0s6 DEGRADED 0 036 too many errors > > errors: No known data errors out of curiosity, is this a root pool? A second system of mine with a mirrored root pool (and an additional large multi-raidz pool) shows the same symptoms on the mirrored root pool only. once is accident. twice is coincidence. three times is enemy action :-) I'll file a bug as soon as I can (I'm travelling at the moment with spotty connectivity), citing my and your reports. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] checksum errors on root pool after upgrade to snv_94
I ran a scrub on a root pool after upgrading to snv_94, and got checksum errors: pool: r00t state: ONLINE status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://www.sun.com/msg/ZFS-8000-9P scrub: scrub completed after 0h26m with 1 errors on Thu Jul 17 14:52:14 2008 config: NAME STATE READ WRITE CKSUM r00t ONLINE 0 0 2 mirror ONLINE 0 0 2 c4t0d0s0 ONLINE 0 0 4 c4t1d0s0 ONLINE 0 0 4 I ran it again, and it's now reporting the same errors, but still says "applications are unaffected": pool: r00t state: ONLINE status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'. see: http://www.sun.com/msg/ZFS-8000-9P scrub: scrub completed after 0h27m with 2 errors on Thu Jul 17 20:24:15 2008 config: NAME STATE READ WRITE CKSUM r00t ONLINE 0 0 4 mirror ONLINE 0 0 4 c4t0d0s0 ONLINE 0 0 8 c4t1d0s0 ONLINE 0 0 8 errors: No known data errors I wonder if I'm running into some combination of: 6725341 Running 'zpool scrub' repeatedly on a pool show an ever increasing error count and maybe: 6437568 ditto block repair is incorrectly propagated to root vdev Any way to dig further to determine what's going on? - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
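One place to look for more detail than zpool status gives you is the FMA error telemetry, which records an ereport for each checksum mismatch, including the vdev and offset involved (a sketch):

    fmdump -eV | less
    fmdump -eV -c ereport.fs.zfs.checksum

to see whether the reported errors cluster on particular devices or offsets.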
Re: [zfs-discuss] J4500 device renumbering
On Tue, 2008-07-15 at 15:32 -0500, Bob Friesenhahn wrote: > On Tue, 15 Jul 2008, Ross Smith wrote: > > > > > It sounds like you might be interested to read up on Eric Schrock's work. > > I read today about some of the stuff he's been doing to bring integrated > > fault management to Solaris: > > http://blogs.sun.com/eschrock/entry/external_storage_enclosures_in_solaris > > His last paragraph is great to see, Sun really do seem to be headed in the > > right direction: > > That does sound good. It seems like this effort is initially limited > to SAS enclosures. It seems to get some info from a SE3510 jbod (fiberchannel), but doesn't identify which disk is in each drive slot: # /usr/lib/fm/fmd/fmtopo -V '*/ses-enclosure=0/bay=0' TIME UUID Jul 15 17:33:37 6033e234-94a3-ca79-9138-af1ee7f95b8d hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/bay=0 group: protocol version: 1 stability: Private/Private resource fmri hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/bay=0 label stringDisk Drives 0 FRU fmri hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/bay=0 group: authority version: 1 stability: Private/Private product-idstringSUN-StorEdge-3510F-D chassis-idstring205000c0ff086b4a server-id string group: sesversion: 1 stability: Private/Private node-id uint640x3 target-path string/dev/es/ses0 # /usr/lib/fm/fmd/fmtopo '*/ses-enclosure=0/*' TIME UUID Jul 15 17:35:23 16ff7d01-7f1d-e8ef-f8a5-d60a01d99b68 hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/psu=0 hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/psu=1 hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/fan=0 hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/fan=1 hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/fan=2 hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/fan=3 hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/bay=0 hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/bay=1 hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/bay=2 hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/bay=3 hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/bay=4 hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/bay=5 hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/bay=6 hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/bay=7 hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/bay=8 hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/bay=9 hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/bay=10 hc://:product-id=SUN-StorEdge-3510F-D:chassis-id=205000c0ff086b4a:server-id=/ses-enclosure=0/bay=11 - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] [caiman-discuss] swap & dump on ZFS volume
On Tue, 2008-06-24 at 09:41 -0700, Richard Elling wrote: > IMHO, you can make dump optional, with no dump being default. > Before Sommerfeld pounces on me (again :-)) actually, in the case of virtual machines, doing the dump *in* the virtual machine into preallocated virtual disk blocks is silly. if you can break the abstraction barriers a little, I'd think it would make more sense for the virtual machine infrastructure to create some sort of snapshot at the time of failure which could then be converted into a form that mdb can digest... - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Growing root pool ?
On Wed, 2008-06-11 at 07:40 -0700, Richard L. Hamilton wrote: > > I'm not even trying to stripe it across multiple > > disks, I just want to add another partition (from the > > same physical disk) to the root pool. Perhaps that > > is a distinction without a difference, but my goal is > > to grow my root pool, not stripe it across disks or > > enable raid features (for now). > > > > Currently, my root pool is using c1t0d0s4 and I want > > to add c1t0d0s0 to the pool, but can't. > > > > -Wyllys > > Right, that's how it is right now (which the other guy seemed to > be suggesting might change eventually, but nobody knows when > because it's just not that important compared to other things). > > AFAIK, if you could shrink the partition whose data is after > c1t0d0s4 on the disk, you could grow c1t0d0s4 by that much, > and I _think_ zfs would pick up the growth of the device automatically. This works. ZFS doesn't notice the size increase until you reboot. I've been installing systems over the past year with a slice arrangement intended to make it easy to go to zfs root: s0 with a ZFS pool at start of disk s1 swap s3 UFS boot environment #1 s4 UFS boot environment #2 s7 SVM metadb (if mirrored root) I was happy to discover that this paid off. Once I upgraded a BE to nv_90 and was running on it, it was a matter of: lucreate -p $pool -n nv_90zfs luactivate nv_90zfs init 6 (reboot) ludelete other BE's format format> partition reboot and you're all ZFS all the time. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS root compressed ?
On Thu, 2008-06-05 at 23:04 +0300, Cyril Plisko wrote: > 1. Are there any reasons to *not* enable compression by default ? Not exactly an answer: Most of the systems I'm running today on ZFS root have compression=on and copies=2 for rpool/ROOT > 2. How can I do it ? (I think I can run "zfs set compression=on > rpool/ROOT/snv_90" in the other window, right after the installation > begins, but I would like less hacky way.) what I did was to migrate via live upgrade, creating the pool and the pool/ROOT filesystem myself, tweaking both copies and compression on pool/ROOT before using lucreate. I haven't tried this on a fresh install yet. after install, I'd think you could play games with zfs send | zfs receive on an inactive BE to rewrite everything with the desired attributes (more important for copies than compression). - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
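A sketch of that migration path (device and BE names hypothetical):

    zpool create rpool c4t0d0s0
    zfs create -o compression=on -o copies=2 rpool/ROOT
    lucreate -p rpool -n nv_90zfs
    luactivate nv_90zfs
    init 6

Datasets lucreate makes underneath rpool/ROOT inherit the compression and copies settings, so the new BE ends up compressed without any post-install rewriting.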
Re: [zfs-discuss] disk names?
On Wed, 2008-06-04 at 23:12 +, A Darren Dunham wrote: > Best story I've heard is that it dates from before the time when > modifiable (or at least *easily* modifiable) slices didn't exist. No > hopping into 'format' or using 'fmthard'. Instead, your disk came with > an entry in 'format.dat' with several fixed slices. format.dat? bah. in some systems I used - notably 4.2/4.3BSD on the vax and some even more obscure hardware - the partition table was *compiled into the device driver* (one table per known disk type). Don't like the partition layout? you have kernel source, you can change it... Disk labels didn't turn up until after BSD4.3. > So you could use the entire disk with any of: > a,b,d,e,f,g > a,b,d,e,h > c Right. You'd typically use the a/b/d/e/f/g or a/b/d/e/h slice on your boot disk and the c slice on additional disks. > without having to change the label. And the reason why changing the label was avoided was because it required recompiling the kernel and rebooting. > I speculate that then utilities were written that used c/2 for > information about the entire disk and people thought keeping the > convention going was good. it's more like it was too painful to change. > You can later use access to block 0 (via any slice) to corrupt (...er > *modify*) that label, but that's not a feature of s2. s0 would do it as > well with the way most disks are labled (because it also contains > cylinder 0/block 0.) and why didn't this get fixed? inertia. because slices are implemented in the disk driver by looking at the low order bits of the disk minor number, you couldn't just wedge in an additional device instance for the unsliced disk without taking away one slice or re-creating *all* of your disk block & character devices. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Hardware Check, OS X Compatibility, NEWBIE!!
On Wed, 2008-06-04 at 11:52 -0400, Bill McGonigle wrote: > but we got one server in > where 4 of the 8 drives failed in the first two months, at which > point we called Seagate and they were happy to swap out all 8 drives > for us. I suspect a bad lot, and even found some other complaints > about the lot on Google. Problems like that seem to pop up with disturbing regularity, and have done so for decades. (Anyone else remember the DEC RA81 glue problem in around 1985-1986?) I've thought for some time that a good way to defend against the "bad lot" problem (if you can manage it) is to buy half of your disks from each of two manufacturers and then set up mirror pairs containing one disk of each model... - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
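In zpool terms the layout is trivial; the only discipline is in which physical drive goes where (device names hypothetical, c1* from one vendor and c2* from the other):

    zpool create tank mirror c1t0d0 c2t0d0 mirror c1t1d0 c2t1d0

so a bad batch from either vendor can only take out one side of each mirror.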
Re: [zfs-discuss] What is a vdev?
On Fri, 2008-05-23 at 13:45 -0700, Orvar Korvar wrote: > Ok, so i make one vdev out of 8 discs. And I combine all vdevs into one large > zpool? Is it correct? > > I have 8 port SATA card. I have 4 drives into one zpool. zpool create mypool raidz1 disk0 disk1 disk2 disk3 you have a pool consisting of one vdev made up of 4 disks. > That is one vdev, right? Now I can add 4 new drives and make them > into one zpool. you could do that and keep the pool separate, or you could add them as a single vdev to the existing pool: zpool add mypool raidz1 disk4 disk5 disk6 disk7 - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Cifs and Solaris
On Fri, 2008-04-18 at 09:26 -0500, Tim wrote: > Correct me if I'm wrong, but just to clarify a bit for those currently > thinking "WHAT, NEVER IN MAINLINE!?" The main line of solaris development *is* SunOS 5.11/solaris express/nevada/opensolaris/whatever we're calling it this week. Development targets nevada first, and then selected features are backported to an update release. update releases are (conceptually) a branch/fork off of the main line. > It will make it back to mainline, but just not until the next solaris > release (something other than 10updateX), correct? it's already in the solaris mainline. it's not going into a solaris 10 update branch. the mainline is released via SXDE/SXCE and other future release vehicles. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS ACLs/Samba integration
On Fri, 2008-03-14 at 18:11 -0600, Mark Shellenbaum wrote: > > I think it is a misnomer to call the current > > implementation of ZFS a "pure ACL" system, as clearly the ACLs are heavily > > contaminated by legacy mode bits. > > Feel free to open an RFE. It may be a tough sell with PSARC, but maybe > if we have enough customer requests maybe they can be won over. It is always wrong to have a mental model of PSARC as a monolithic entity. I suspect at least some of the membership would be interested in this sort of extension and it shouldn't be that hard to "sell" if it's not the default behavior and it's clearly documented that turning it on (probably on a fs-by-fs basis like every other zfs tunable) takes you out of POSIX land. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Can ZFS be event-driven or not?
On Wed, 2008-02-27 at 13:43 -0500, Kyle McDonald wrote: > How was it MVFS could do this without any changes to the shells or any > other programs? > > I ClearCase could 'grep FOO /dir1/dir2/file@@/main/*' to see which > version of 'file' added FOO. > (I think @@ was the special hidden key. It might have been something > else though.) When I last used clearcase (on the order of 12 years ago) foo@@/ only worked within clearcase mvfs filesystems. It behaved as if the filesystem created a "foo@@" virtual directory for each real "foo" directory entry, but then filtered those names out of directory listings. Doing the same as an alternate "view" on snapshot space would be a simple matter of programming within ZFS, though the magic token/suffix to get you into version/snapshot space would likely not be POSIX compliant.. - Bill ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
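ZFS already does a limited version of this for snapshots via the hidden .zfs directory at the root of each dataset (paths and snapshot names hypothetical):

    ls /export/home/bill/.zfs/snapshot/
    cat /export/home/bill/.zfs/snapshot/yesterday/src/foo.c
    zfs set snapdir=visible tank/export/home/bill   # make .zfs show up in directory listings

but, as with the @@ trick, the magic name lives outside what POSIX applications expect to find.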