Re: [zfs-discuss] How does ZFS dedup space accounting work with quota?
On 4/25/2011 6:23 PM, Ian Collins wrote:
> On 04/26/11 01:13 PM, Fred Liu wrote:
>> Hmm, it seems dedup is pool-based not filesystem-based.
> That's correct. Although it can be turned off and on at the filesystem level (assuming it is enabled for the pool).

Which is effectively the same as choosing per-filesystem dedup, just the inverse. You turn it on at the pool level and off at the filesystem level, which is identical to the "off at the pool level, on at the filesystem level" that NetApp does.

>> If it can have fine-grained granularity (like based on fs), that will be great!
>> It is a pity! NetApp is sweet in this aspect.

> So what happens to user B's quota if user B stores a ton of data that is a duplicate of user A's data and then user A deletes the original?

Actually, right now, nothing happens to B's quota. He's always charged the un-deduped amount for his quota usage, whether or not dedup is enabled, and regardless of how much of his data is actually deduped. Which is as it should be, as quotas are about limiting how much a user is consuming, not how much the backend needs to store that data.

e.g. A, B, C, & D all have 100MB of data in the pool, with dedup on.
  20MB of storage has a dedup factor of 3:1 (common to A, B, & C)
  50MB of storage has a dedup factor of 2:1 (common to A & B)

Thus, the amount of unique data would be:
  A: 100 - 20 - 50 = 30MB
  B: 100 - 20 - 50 = 30MB
  C: 100 - 20 = 80MB
  D: 100MB

Summing it all up, you would have an actual storage consumption of 70MB (50+20 deduped) + 30+30+80+100 (unique data) = 310MB, for 400MB of apparent storage (i.e. a dedup ratio of 1.29:1).

A, B, C, & D would each still have a quota usage of 100MB.

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA
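For illustration (the pool and dataset names here are hypothetical), the two views in Erik's example can be compared directly from the CLI: per-dataset accounting, which quotas are charged against, ignores dedup, while pool-level accounting shows the physical savings:

# zfs list -o name,used,quota tank/home/A tank/home/B tank/home/C tank/home/D
# zpool list tank            <- ALLOC shows the deduped, physical consumption
# zpool get dedupratio tank

In the example above, each user would still show about 100MB used against their quota, while the pool's ALLOC would be roughly 310MB.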
Re: [zfs-discuss] Spare drives sitting idle in raidz2 with failed drive
Thanks Brandon,

On 04/25/2011 05:47 PM, Brandon High wrote:
> On Mon, Apr 25, 2011 at 4:56 PM, Lamp Zy wrote:
>> I'd expect the spare drives to auto-replace the failed one but this is not happening.
>> What am I missing?
> Is the autoreplace property set to 'on'?
> # zpool get autoreplace fwgpool0
> # zpool set autoreplace=on fwgpool0

Yes, autoreplace is on. I should have mentioned it in my original post:

# zpool get autoreplace fwgpool0
NAME      PROPERTY     VALUE  SOURCE
fwgpool0  autoreplace  on     local

>> I really would like to get the pool back in a healthy state using the spare drives before trying to identify which one is the failed drive in the storage array and trying to replace it. How do I do this?
> Turning on autoreplace might start the replace. If not, the following will replace the failed drive with the first spare. (I'd suggest verifying the device names before running it.)
> # zpool replace fwgpool0 c4t5000C5001128FE4Dd0 c4t5000C50014D70072d0

I thought about doing that. My understanding is that this command should be used to replace a drive with a brand new one, i.e. a drive that is not known to the raidz configuration. Should I somehow unconfigure one of the spare drives so it is just a loose drive and not a raidz spare before running the command (and if so, how do I do it)? Or is it safe to just run the replace command and let ZFS take care of the details, like noticing that one of the spares has been manually re-purposed to replace a failed drive?

Thank you
Peter
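For what it's worth, the usual hands-on sequence with a hot spare is roughly the following (device names as in this thread; verify them with zpool status first). The spare does not need to be unconfigured first; zpool replace recognizes it as a spare and lists it as INUSE:

# zpool replace fwgpool0 c4t5000C5001128FE4Dd0 c4t5000C50014D70072d0
# zpool status fwgpool0                          <- the spare resilvers in under the failed disk
# zpool detach fwgpool0 c4t5000C5001128FE4Dd0    <- after the resilver: promote the spare to a permanent member

When the failed disk is eventually swapped out, the replacement can be added back as a new spare with 'zpool add fwgpool0 spare <new-device>'.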
Re: [zfs-discuss] Drive replacement speed
On Mon, Apr 25, 2011 at 5:26 PM, Brandon High wrote:
> Setting zfs_resilver_delay seems to have helped some, based on the
> iostat output. Are there other tunables?

I found zfs_resilver_min_time_ms while looking. I've tried bumping it up considerably, without much change. 'zpool status' is still showing:

 scan: resilver in progress since Sat Apr 23 17:03:13 2011
    6.06T scanned out of 6.40T at 36.0M/s, 2h46m to go
    769G resilvered, 94.64% done

'iostat -xn' shows asvc_t under 10ms still. Increasing the per-device queue depth has increased the asvc_t but hasn't done much to affect the throughput. I'm using:

echo zfs_vdev_max_pending/W0t35 | pfexec mdb -kw

-B

-- Brandon High : bh...@freaks.com
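For reference, the same mdb pattern can be used to read a tunable back before and after changing it (the value written below is only an example):

echo zfs_resilver_min_time_ms/D | pfexec mdb -k         <- print the current value in decimal
echo zfs_resilver_min_time_ms/W0t5000 | pfexec mdb -kw  <- e.g. allow 5000ms of resilver work per txg
echo zfs_resilver_delay/D | pfexec mdb -k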
Re: [zfs-discuss] How does ZFS dedup space accounting work with quota?
On 04/26/11 01:13 PM, Fred Liu wrote:
> Hmm, it seems dedup is pool-based not filesystem-based.

That's correct. Although it can be turned off and on at the filesystem level (assuming it is enabled for the pool).

> If it can have fine-grained granularity (like based on fs), that will be great!
> It is a pity! NetApp is sweet in this aspect.

So what happens to user B's quota if user B stores a ton of data that is a duplicate of user A's data and then user A deletes the original?

-- Ian.
Re: [zfs-discuss] How does ZFS dedup space accounting work with quota?
Hmm, it seems dedup is pool-based not filesystem-based. If it can have fine-grained granularity (like based on fs), that will be great! It is a pity! NetApp is sweet in this aspect.

Thanks.

Fred

> -----Original Message-----
> From: Brandon High [mailto:bh...@freaks.com]
> Sent: Tuesday, April 26, 2011 8:50
> To: Fred Liu
> Cc: cindy.swearin...@oracle.com; ZFS discuss
> Subject: Re: [zfs-discuss] How does ZFS dedup space accounting work
> with quota?
>
> On Mon, Apr 25, 2011 at 4:53 PM, Fred Liu wrote:
> > So how can I set the quota size on a file system with dedup enabled?
>
> I believe the quota applies to the non-dedup'd data size. If a user
> stores 10G of data, it will use 10G of quota, regardless of whether it
> dedups at 100:1 or 1:1.
>
> -B
>
> --
> Brandon High : bh...@freaks.com
Re: [zfs-discuss] How does ZFS dedup space accounting work with quota?
On Mon, Apr 25, 2011 at 4:53 PM, Fred Liu wrote: > So how can I set the quota size on a file system with dedup enabled? I believe the quota applies to the non-dedup'd data size. If a user stores 10G of data, it will use 10G of quota, regardless of whether it dedups at 100:1 or 1:1. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
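For illustration (the dataset name is hypothetical), the quota is set on the filesystem exactly as it would be without dedup, and the dedup savings never show up in its 'used' figure:

# zfs set dedup=on tank/home/fred
# zfs set quota=100G tank/home/fred
# zfs get used,quota tank/home/fred
# zpool get dedupratio tank      <- the savings are only visible at the pool level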
Re: [zfs-discuss] Spare drives sitting idle in raidz2 with failed drive
On Mon, Apr 25, 2011 at 4:56 PM, Lamp Zy wrote: > I'd expect the spare drives to auto-replace the failed one but this is not > happening. > > What am I missing? Is the autoreplace property set to 'on'? # zpool get autoreplace fwgpool0 # zpool set autoreplace=on fwgpool0 > I really would like to get the pool back in a healthy state using the spare > drives before trying to identify which one is the failed drive in the > storage array and trying to replace it. How do I do this? Turning on autoreplace might start the replace. If not, the following will replace the failed drive with the first spare. (I'd suggest verifying the device names before running it.) # zpool replace fwgpool0 c4t5000C5001128FE4Dd0 c4t5000C50014D70072d0 -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Drive replacement speed
On Mon, Apr 25, 2011 at 4:45 PM, Richard Elling wrote: > If there is other work going on, then you might be hitting the resilver > throttle. By default, it will delay 2 clock ticks, if needed. It can be turned There is some other access to the pool from nfs and cifs clients, but not much, and mostly reads. Setting zfs_resilver_delay seems to have helped some, based on the iostat output. Are there other tunables? > Probably won't work because it does not make the resilvering drive > any faster. It doesn't seem like the devices are the bottleneck, even with the delay turned off. $ iostat -xn 60 3 extended device statistics r/sw/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 369.2 11.5 5577.0 71.3 0.7 0.71.91.9 14 29 c2t0d0 371.9 11.5 5570.3 71.3 0.7 0.71.71.8 13 29 c2t1d0 369.9 11.5 5574.4 71.3 0.7 0.71.81.9 14 29 c2t2d0 370.7 11.5 5573.9 71.3 0.7 0.71.81.9 14 29 c2t3d0 368.0 11.5 5553.1 71.3 0.7 0.71.81.9 14 29 c2t4d0 196.1 172.8 2825.5 2436.6 0.3 1.10.83.0 6 26 c2t5d0 183.6 184.9 2717.6 2674.7 0.5 1.31.43.5 11 31 c2t6d0 393.0 11.2 5540.7 71.3 0.5 0.61.31.5 12 26 c2t7d0 95.81.2 95.6 16.2 0.0 0.00.20.2 0 1 c0t0d0 0.91.23.6 16.2 0.0 0.07.51.9 0 0 c0t1d0 extended device statistics r/sw/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 891.2 11.8 2386.9 64.4 0.0 1.20.01.3 1 36 c2t0d0 919.9 12.1 2351.8 64.6 0.0 1.10.01.2 0 35 c2t1d0 906.9 12.1 2346.1 64.6 0.0 1.20.01.3 0 36 c2t2d0 877.9 11.6 2351.0 64.5 0.7 0.50.80.6 23 35 c2t3d0 883.4 12.0 2322.0 64.4 0.2 1.00.21.1 7 35 c2t4d0 0.8 758.00.8 1910.4 0.2 5.00.26.6 3 72 c2t5d0 882.7 11.4 2355.1 64.4 0.8 0.40.90.4 27 34 c2t6d0 907.8 11.4 2373.1 64.5 0.7 0.30.80.4 23 30 c2t7d0 1607.89.4 1568.2 83.0 0.1 0.20.10.1 3 18 c0t0d0 7.39.1 23.5 83.0 0.1 0.06.01.4 2 2 c0t1d0 extended device statistics r/sw/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 960.3 12.7 2868.0 59.0 1.1 0.71.20.8 37 52 c2t0d0 963.2 12.7 2877.5 59.1 1.1 0.81.10.8 36 51 c2t1d0 960.3 12.6 2844.7 59.1 1.1 0.71.10.8 37 52 c2t2d0 1000.1 12.8 2827.1 59.0 0.6 1.20.61.2 21 52 c2t3d0 960.9 12.3 2811.1 59.0 1.3 0.61.30.6 42 51 c2t4d0 0.5 962.20.4 2418.3 0.0 4.10.04.3 0 59 c2t5d0 1014.2 12.3 2820.6 59.1 0.8 0.80.80.8 28 48 c2t6d0 1031.2 12.5 2822.0 59.1 0.8 0.80.70.8 26 45 c2t7d0 1836.40.0 1783.40.0 0.0 0.20.00.1 1 19 c0t0d0 5.30.05.30.0 0.0 0.01.11.5 1 1 c0t1d0 -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Spare drives sitting idle in raidz2 with failed drive
Hi, One of my drives failed in Raidz2 with two hot spares: # zpool status pool: fwgpool0 state: DEGRADED status: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state. action: Attach the missing device and online it using 'zpool online'. see: http://www.sun.com/msg/ZFS-8000-2Q scrub: resilver completed after 0h0m with 0 errors on Mon Apr 25 14:45:44 2011 config: NAME STATE READ WRITE CKSUM fwgpool0 DEGRADED 0 0 0 raidz2 DEGRADED 0 0 0 c4t5000C500108B406Ad0 ONLINE 0 0 0 c4t5000C50010F436E2d0 ONLINE 0 0 0 c4t5000C50011215B6Ed0 ONLINE 0 0 0 c4t5000C50011234715d0 ONLINE 0 0 0 c4t5000C50011252B4Ad0 ONLINE 0 0 0 c4t5000C500112749EDd0 ONLINE 0 0 0 c4t5000C5001128FE4Dd0 UNAVAIL 0 0 0 cannot open c4t5000C500112C4959d0 ONLINE 0 0 0 c4t5000C50011318199d0 ONLINE 0 0 0 c4t5000C500113C0E9Dd0 ONLINE 0 0 0 c4t5000C500113D0229d0 ONLINE 0 0 0 c4t5000C500113E97B8d0 ONLINE 0 0 0 c4t5000C50014D065A9d0 ONLINE 0 0 0 c4t5000C50014D0B3B9d0 ONLINE 0 0 0 c4t5000C50014D55DEFd0 ONLINE 0 0 0 c4t5000C50014D642B7d0 ONLINE 0 0 0 c4t5000C50014D64521d0 ONLINE 0 0 0 c4t5000C50014D69C14d0 ONLINE 0 0 0 c4t5000C50014D6B2CFd0 ONLINE 0 0 0 c4t5000C50014D6C6D7d0 ONLINE 0 0 0 c4t5000C50014D6D486d0 ONLINE 0 0 0 c4t5000C50014D6D77Fd0 ONLINE 0 0 0 spares c4t5000C50014D70072d0AVAIL c4t5000C50014D7058Dd0AVAIL errors: No known data errors I'd expect the spare drives to auto-replace the failed one but this is not happening. What am I missing? I really would like to get the pool back in a healthy state using the spare drives before trying to identify which one is the failed drive in the storage array and trying to replace it. How do I do this? Thanks for any hints. -- Peter ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
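As a hint for later mapping the UNAVAIL device to a physical disk (standard Solaris tools, shown only as a sketch): the fault manager and per-device error/inquiry data usually give you the fault record and the drive's serial number, which can then be matched against the labels in the storage array:

# fmadm faulty
# iostat -En c4t5000C5001128FE4Dd0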
[zfs-discuss] How does ZFS dedup space accounting work with quota?
Cindy,

Following is quoted from the ZFS Dedup FAQ:

"Deduplicated space accounting is reported at the pool level. You must use the zpool list command rather than the zfs list command to identify disk space consumption when dedup is enabled. If you use the zfs list command to review deduplicated space, you might see that the file system appears to be increasing because we're able to store more data on the same physical device. Using the zpool list will show you how much physical space is being consumed and it will also show you the dedup ratio. The df command is not dedup-aware and will not provide accurate space accounting."

So how can I set the quota size on a file system with dedup enabled?

Thanks.

Fred
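For what it's worth, the difference the FAQ describes is easy to see side by side (pool and filesystem names hypothetical):

# zfs list tank/fs       <- logical view: what users, quotas and df effectively see
# df -h /tank/fs         <- not dedup-aware
# zpool list tank        <- physical view: ALLOC, FREE and the DEDUP ratio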
Re: [zfs-discuss] Drive replacement speed
On Apr 25, 2011, at 2:52 PM, Brandon High wrote: > I'm in the process of replacing drive in a pool, and the resilver > times seem to have increased with each device. The way that I'm doing > this is by pulling a drive, physically replacing it, then doing > 'cfgadm -c configure ; zpool replace tank '. I don't have any > hot-swap bays available, so I'm physically replacing the device before > doing a 'zpool replace'. > > I'm replacing Western Digital WD10EADS 1TB drives with Hitachi 5K3000 > 3TB drives. Neither device is fast, but they aren't THAT slow. wsvc_t > and asvc_t both look fairly healthy giving the device types. Look for 10-12 ms for asvc_t. In my experience, SATA disks tend to not handle NCQ as well as SCSI disks handle TCQ -- go figure. In your iostats below, you are obviously not bottlenecking on the disks. > > Replacing the first device (took about 20 hours) went about as > expected. The second took about 44 hours. The third is still running > and should finish in slightly over 48 hours. If there is other work going on, then you might be hitting the resilver throttle. By default, it will delay 2 clock ticks, if needed. It can be turned off temporarily using: echo zfs_resilver_delay/W0t0 | mdb -kw to return to normal: echo zfs_resilver_delay/W0t2 | mdb -kw > I'm wondering if the following would help for the next drive: > # zpool offline tank c2t4d0 > # cfgadm -c unconfigure sata3/4::dsk/c2t4d0 > > At this point pull the drive and put it into an external USB adapter. > Put the new drive in the hot-swap bay. The USB adapter shows up as > c4t0d0. > > # zpool online tank c4t0d0 > > This should re-add it to the pool and resilver the last few > transactions that may have been missed, right? > > Then I want to actually replace the drive in the zpool: > # cfgadm -c configure sata3/4 > # zpool replace tank c4t0d0 c2t4d0 > > Will this work? Will the replace go faster, since it won't need to > resilver from the parity data? Probably won't work because it does not make the resilvering drive any faster. -- richard > > > $ zpool list tank > NAME SIZE ALLOC FREECAP DEDUP HEALTH ALTROOT > tank 7.25T 6.40T 867G88% 1.11x DEGRADED - > $ zpool status -x > pool: tank > state: DEGRADED > status: One or more devices is currently being resilvered. The pool will >continue to function, possibly in a degraded state. > action: Wait for the resilver to complete. 
> scan: resilver in progress since Sat Apr 23 17:03:13 2011 >5.91T scanned out of 6.40T at 38.0M/s, 3h42m to go >752G resilvered, 92.43% done > config: > >NAME STATE READ WRITE CKSUM >tank DEGRADED 0 0 0 > raidz2-0DEGRADED 0 0 0 >c2t0d0ONLINE 0 0 0 >c2t1d0ONLINE 0 0 0 >c2t2d0ONLINE 0 0 0 >c2t3d0ONLINE 0 0 0 >c2t4d0ONLINE 0 0 0 >replacing-5 DEGRADED 0 0 0 > c2t5d0/old FAULTED 0 0 0 corrupted data > c2t5d0 ONLINE 0 0 0 (resilvering) >c2t6d0ONLINE 0 0 0 >c2t7d0ONLINE 0 0 0 > > errors: No known data errors > $ zpool iostat -v tank 60 3 > capacity operationsbandwidth > pool alloc free read write read write > - - - - - - > tank 6.40T 867G566 25 32.2M 156K > raidz2 6.40T 867G566 25 32.2M 156K >c2t0d0- -362 11 5.56M 71.6K >c2t1d0- -365 11 5.56M 71.6K >c2t2d0- -363 11 5.56M 71.6K >c2t3d0- -363 11 5.56M 71.6K >c2t4d0- -361 11 5.54M 71.6K >replacing - - 0492 8.28K 4.79M > c2t5d0/old - -202 5 2.84M 36.7K > c2t5d0 - - 0315 8.66K 4.78M >c2t6d0- -170190 2.68M 2.69M >c2t7d0- -386 10 5.53M 71.6K > - - - - - - > > capacity operationsbandwidth > pool alloc free read write read write > - - - - - - > tank 6.40T 867G612 14 8.43M 70.7K > raidz2 6.40T 867G612 14 8.43M 70.7K >c2t0d0- -411 11 1.51M 57.9K >c2t1d0- -414 11 1.50M 58.0K >c2t2d0- -385 11 1.51M 57.9K >c2t3d0- -412 11 1.50M 58.0K >c2t4d0- -412 11 1.45M 57.8K >
[zfs-discuss] arcstat updates
Hi ZFSers,

I've been working on merging the Joyent arcstat enhancements with some of my own and am now to the point where it is time to broaden the requirements gathering. The result is to be merged into the illumos tree.

arcstat is a perl script to show the value of ARC kstats as they change over time. This is similar to the ideas behind mpstat, iostat, vmstat, and friends.

The current usage is:
Usage: arcstat [-hvx] [-f fields] [-o file] [interval [count]]

Field definitions are as follows:
 mtxmis : mutex_miss per second
 arcsz : ARC size
 mrug : MRU ghost list hits per second
 l2hit% : L2ARC access hit percentage
 mh% : Metadata hit percentage
 l2miss% : L2ARC access miss percentage
 read : Total ARC accesses per second
 l2hsz : L2ARC header size
 c : ARC target size
 mfug : MFU ghost list hits per second
 miss : ARC misses per second
 dm% : Demand data miss percentage
 hsz : ARC header size
 dhit : Demand data hits per second
 pread : Prefetch accesses per second
 dread : Demand data accesses per second
 l2miss : L2ARC misses per second
 pmis : Prefetch misses per second
 time : Time
 l2bytes : Bytes read per second from the L2ARC
 pm% : Prefetch miss percentage
 mm% : Metadata miss percentage
 hits : ARC reads per second
 throt : Memory throttles per second
 mfu : MFU list hits per second
 l2read : Total L2ARC accesses per second
 mmis : Metadata misses per second
 rmis : recycle_miss per second
 mhit : Metadata hits per second
 dmis : Demand data misses per second
 mru : MRU list hits per second
 ph% : Prefetch hits percentage
 eskip : evict_skip per second
 l2size : L2ARC size
 l2hits : L2ARC hits per second
 hit% : ARC hit percentage
 miss% : ARC miss percentage
 dh% : Demand data hit percentage
 mread : Metadata accesses per second
 phit : Prefetch hits per second

Some questions for the community:
1. Should there be flag compatibility with vmstat, iostat, mpstat, and friends?
2. What is missing?
3. Is it ok if the man page explains the meanings of each field, even though it might be many pages long?
4. Is there a common subset of columns that are regularly used that would justify a shortcut option? Or do we even need shortcuts? (eg -x)
5. Who wants to help with this little project?

-- Richard Elling rich...@nexenta.com +1-760-896-4422
Nexenta European User Conference, Amsterdam, May 20 www.nexenta.com/corp/european-user-conference-2011
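A couple of illustrative invocations, using the usage line and field names above (the intervals and counts are arbitrary):

$ arcstat 5                                        <- default columns, every 5 seconds
$ arcstat -f time,read,hit%,miss%,arcsz,c 1 10     <- chosen fields, ten one-second samples
$ arcstat -f time,l2read,l2hits,l2miss%,l2size -o /tmp/l2arc.log 60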
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
On Mon, Apr 25, 2011 at 8:20 AM, Edward Ned Harvey wrote: > and 128k assuming default recordsize. (BTW, recordsize seems to be a zfs > property, not a zpool property. So how can you know or configure the > blocksize for something like a zvol iscsi target?) zvols use the 'volblocksize' property, which defaults to 8k. A 1TB zvol is therefore 2^27 blocks and would require ~ 34 GB for the ddt (assuming that a ddt entry is 270 bytes). The zfs man page for the property reads: volblocksize=blocksize For volumes, specifies the block size of the volume. The blocksize cannot be changed once the volume has been written, so it should be set at volume creation time. The default blocksize for volumes is 8 Kbytes. Any power of 2 from 512 bytes to 128 Kbytes is valid. This property can also be referred to by its shortened column name, volblock. -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
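To make the arithmetic concrete (the dataset name is hypothetical): the block size is fixed at creation time with -b, and the DDT estimate scales inversely with it:

# zfs create -b 64K -V 1T tank/vol1

1TB / 64KB = ~16.8 million blocks; at ~270 bytes per DDT entry that is roughly 4.5GB of DDT, versus the ~34GB quoted above for the default 8KB volblocksize.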
[zfs-discuss] Drive replacement speed
I'm in the process of replacing drive in a pool, and the resilver times seem to have increased with each device. The way that I'm doing this is by pulling a drive, physically replacing it, then doing 'cfgadm -c configure ; zpool replace tank '. I don't have any hot-swap bays available, so I'm physically replacing the device before doing a 'zpool replace'. I'm replacing Western Digital WD10EADS 1TB drives with Hitachi 5K3000 3TB drives. Neither device is fast, but they aren't THAT slow. wsvc_t and asvc_t both look fairly healthy giving the device types. Replacing the first device (took about 20 hours) went about as expected. The second took about 44 hours. The third is still running and should finish in slightly over 48 hours. I'm wondering if the following would help for the next drive: # zpool offline tank c2t4d0 # cfgadm -c unconfigure sata3/4::dsk/c2t4d0 At this point pull the drive and put it into an external USB adapter. Put the new drive in the hot-swap bay. The USB adapter shows up as c4t0d0. # zpool online tank c4t0d0 This should re-add it to the pool and resilver the last few transactions that may have been missed, right? Then I want to actually replace the drive in the zpool: # cfgadm -c configure sata3/4 # zpool replace tank c4t0d0 c2t4d0 Will this work? Will the replace go faster, since it won't need to resilver from the parity data? $ zpool list tank NAME SIZE ALLOC FREECAP DEDUP HEALTH ALTROOT tank 7.25T 6.40T 867G88% 1.11x DEGRADED - $ zpool status -x pool: tank state: DEGRADED status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state. action: Wait for the resilver to complete. scan: resilver in progress since Sat Apr 23 17:03:13 2011 5.91T scanned out of 6.40T at 38.0M/s, 3h42m to go 752G resilvered, 92.43% done config: NAME STATE READ WRITE CKSUM tank DEGRADED 0 0 0 raidz2-0DEGRADED 0 0 0 c2t0d0ONLINE 0 0 0 c2t1d0ONLINE 0 0 0 c2t2d0ONLINE 0 0 0 c2t3d0ONLINE 0 0 0 c2t4d0ONLINE 0 0 0 replacing-5 DEGRADED 0 0 0 c2t5d0/old FAULTED 0 0 0 corrupted data c2t5d0 ONLINE 0 0 0 (resilvering) c2t6d0ONLINE 0 0 0 c2t7d0ONLINE 0 0 0 errors: No known data errors $ zpool iostat -v tank 60 3 capacity operationsbandwidth pool alloc free read write read write - - - - - - tank 6.40T 867G566 25 32.2M 156K raidz2 6.40T 867G566 25 32.2M 156K c2t0d0- -362 11 5.56M 71.6K c2t1d0- -365 11 5.56M 71.6K c2t2d0- -363 11 5.56M 71.6K c2t3d0- -363 11 5.56M 71.6K c2t4d0- -361 11 5.54M 71.6K replacing - - 0492 8.28K 4.79M c2t5d0/old - -202 5 2.84M 36.7K c2t5d0 - - 0315 8.66K 4.78M c2t6d0- -170190 2.68M 2.69M c2t7d0- -386 10 5.53M 71.6K - - - - - - capacity operationsbandwidth pool alloc free read write read write - - - - - - tank 6.40T 867G612 14 8.43M 70.7K raidz2 6.40T 867G612 14 8.43M 70.7K c2t0d0- -411 11 1.51M 57.9K c2t1d0- -414 11 1.50M 58.0K c2t2d0- -385 11 1.51M 57.9K c2t3d0- -412 11 1.50M 58.0K c2t4d0- -412 11 1.45M 57.8K replacing - - 0574366 852K c2t5d0/old - - 0 0 0 0 c2t5d0 - - 0324366 852K c2t6d0- -427 11 1.45M 57.8K c2t7d0- -431 11 1.49M 57.9K - - - - - - capacity operationsbandwidth pool alloc free read write read write - - - - - - tank 6.40T 867G 1.02K 12 11.1M 69.4K raidz2 6.40T 867G 1.02K 12 11.1M 69.4K c2t0d0- -772 10 1.99M 59.3K c2t1d0- -771 10 1.99M 59.4K c2t2d0-
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
On Mon, Apr 25, 2011 at 10:55 AM, Erik Trimble wrote: > Min block size is 512 bytes. Technically, isn't the minimum block size 2^(ashift value)? Thus, on 4 KB disks where the vdevs have an ashift=12, the minimum block size will be 4 KB. -- Freddie Cash fjwc...@gmail.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
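The ashift actually in use can be checked from the pool configuration (pool name hypothetical; 12 means 4KB sectors, 9 means 512-byte sectors):

# zdb -C tank | grep ashift
            ashift: 12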
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
On 04/25/11 11:55, Erik Trimble wrote: On 4/25/2011 8:20 AM, Edward Ned Harvey wrote: And one more comment: Based on what's below, it seems that the DDT gets stored on the cache device and also in RAM. Is that correct? What if you didn't have a cache device? Shouldn't it *always* be in ram? And doesn't the cache device get wiped every time you reboot? It seems to me like putting the DDT on the cache device would be harmful... Is that really how it is? Nope. The DDT is stored only in one place: cache device if present, /or/ RAM otherwise (technically, ARC, but that's in RAM). If a cache device is present, the DDT is stored there, BUT RAM also must store a basic lookup table for the DDT (yea, I know, a lookup table for a lookup table). No, that's not true. The DDT is just like any other ZFS metadata and can be split over the ARC, cache device (L2ARC) and the main pool devices. An infrequently referenced DDT block will get evicted from the ARC to the L2ARC then evicted from the L2ARC. Neil. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
On 4/25/2011 8:20 AM, Edward Ned Harvey wrote: There are a lot of conflicting references on the Internet, so I'd really like to solicit actual experts (ZFS developers or people who have physical evidence) to weigh in on this... After searching around, the reference I found to be the most seemingly useful was Erik's post here: http://opensolaris.org/jive/thread.jspa?threadID=131296 Unfortunately it looks like there's an arithmetic error (1TB of 4k blocks means 268million blocks, not 1 billion). Also, IMHO it seems important make the distinction, #files != #blocks. Due to the existence of larger files, there will sometimes be more than one block per file; and if I'm not mistaken, thanks to write aggregation, there will sometimes be more than one file per block. YMMV. Average block size could be anywhere between 1 byte and 128k assuming default recordsize. (BTW, recordsize seems to be a zfs property, not a zpool property. So how can you know or configure the blocksize for something like a zvol iscsi target?) I said 2^30, which is roughly a quarter billion. But, I should have been more exact. And, the file != block difference is important to note. zvols also take a Recordsize attribute. And, zvols tend to be sticklers about all blocks being /exactly/ the recordsize value, unlike filesystems, which use it as a *maximum* block size. Min block size is 512 bytes. (BTW, is there any way to get a measurement of number of blocks consumed per zpool? Per vdev? Per zfs filesystem?) The calculations below are based on assumption of 4KB blocks adding up to a known total data consumption. The actual thing that matters is the number of blocks consumed, so the conclusions drawn will vary enormously when people actually have average block sizes != 4KB. you need to use zdb to see what the current block usage is for a filesystem. I'd have to look up the particular CLI usage for that, as I don't know what it is off the top of my head. And one more comment: Based on what's below, it seems that the DDT gets stored on the cache device and also in RAM. Is that correct? What if you didn't have a cache device? Shouldn't it *always* be in ram? And doesn't the cache device get wiped every time you reboot? It seems to me like putting the DDT on the cache device would be harmful... Is that really how it is? Nope. The DDT is stored only in one place: cache device if present, /or/ RAM otherwise (technically, ARC, but that's in RAM). If a cache device is present, the DDT is stored there, BUT RAM also must store a basic lookup table for the DDT (yea, I know, a lookup table for a lookup table). My minor corrections here: The rule-of-thumb is 270 bytes/DDT entry, and 200 bytes of ARC for every L2ARC entry, since the DDT is stored on the cache device. the DDT itself doesn't consume any ARC space usage if stored in a L2ARC cache E.g.: I have 1TB of 4k blocks that are to be deduped, and it turns out that I have about a 5:1 dedup ratio. I'd also like to see how much ARC usage I eat up with using a 160GB L2ARC to store my DDT on. (1) How many entries are there in the DDT? 1TB of 4k blocks means there are 268million blocks. However, at a 5:1 dedup ratio, I'm only actually storing 20% of that, so I have about 54 million blocks. Thus, I need a DDT of about 270bytes * 54 million =~ 14GB in size (2) How much ARC space does this DDT take up? The 54 million entries in my DDT take up about 200bytes * 54 million =~ 10G of ARC space, so I need to have 10G of RAM dedicated just to storing the references to the DDT in the L2ARC. 
(3) How much space do I have left on the L2ARC device, and how many blocks can that hold? Well, I have 160GB - 14GB (DDT) = 146GB of cache space left on the device, which, assuming I'm still using 4k blocks, means I can cache about 37 million 4k blocks, or about 66% of my total data. This extra cache of blocks in the L2ARC would eat up 200 b * 37 million =~ 7.5GB of ARC entries. Thus, for the aforementioned dedup scenario, I'd better spec it with (whatever base RAM for basic OS and ordinary ZFS cache and application requirements) at least a 14G L2ARC device for dedup + 10G more of RAM for the DDT L2ARC requirements + 1GB of RAM for every 20GB of additional space in the L2ARC cache beyond that used by the DDT. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
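Regarding the zdb usage Erik mentions above, the relevant invocations are roughly these (they traverse pool metadata, so expect them to take a while on a large pool; the pool name is hypothetical):

# zdb -b tank      <- walks the pool and reports block counts, sizes and average block size
# zdb -DD tank     <- prints DDT statistics and a histogram for a pool with dedup enabled
# zdb -S tank      <- simulates dedup, to estimate the ratio and DDT size before enabling it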
Re: [zfs-discuss] zfs problem vdev I/O failure
So, I installed FreeBSD 8.2 with the ZFS v28 patch and get this error message along with a complete freeze of the ZFS system:

Solaris: Warning: can`t open object for zroot/var/crash
log_sysevent: type19 is not emplement
log_sysevent: type19 is not emplement
log_sysevent: type19 is not emplement
log_sysevent: type19 is not emplement
log_sysevent: type19 is not emplement
Solaris: Warning: can`t open object for zroot/var/crash
log_sysevent: type19 is not emplement
log_sysevent: type19 is not emplement
log_sysevent: type19 is not emplement
log_sysevent: type19 is not emplement
log_sysevent: type19 is not emplement

2011/4/24 Pawel Tyll:
> Hi Konstantin,
>
>> zpool status:
>> Flash# zpool status
>>   pool: zroot
>>  state: DEGRADED
>> status: One or more devices are faulted in response to IO failures.
>> action: Make sure the affected devices are connected, then run 'zpool clear'.
>>    see: http://www.sun.com/msg/ZFS-8000-HC
>>  scrub: resilver in progress for 0h6m, 0.00% done, 1582566h29m to go
>> config:
>>
>>         NAME                     STATE     READ WRITE CKSUM
>>         zroot                    DEGRADED    12     0     1
>>           mirror                 DEGRADED    36     0     4
>>             7159451150335751026  UNAVAIL      0     0     0  was /dev/gpt/disk0
>>             gpt/disk1            ONLINE       0     0    40
>>
>> errors: 12 data errors, use '-v' for a list
>>
>> Zpool scrub freeze and time to resilver up in time...
>> How i can repair it, if zpool scrub -s zroot and detach don`t work...and don`t work all of zfs commands =\
>
> Try booting mfsBSD and fixing there, http://mfsbsd.vx.sk/
>
> http://mfsbsd.vx.sk/iso/mfsbsd-8.2-zfsv28-i386.iso
> http://mfsbsd.vx.sk/iso/mfsbsd-se-8.2-zfsv28-amd64.iso

-- Best regards, Konstantin Kuklin.
Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)
> After modifications that I hope are corrections, I think the post
> should look like this:
>
> The rule-of-thumb is 270 bytes/DDT entry, and 200 bytes of ARC for
> every L2ARC entry.
>
> DDT doesn't count for this ARC space usage.
>
> E.g.: I have 1TB of 4k blocks that are to be deduped, and it turns out
> that I have about a 5:1 dedup ratio. I'd also like to see how much ARC
> usage I eat up with a 160GB L2ARC.
>
> (1) How many entries are there in the DDT:
>
> 1TB of 4k blocks means there are 268 million blocks. However, at a 5:1
> dedup ratio, I'm only actually storing 20% of that, so I have about 54
> million blocks. Thus, I need a DDT of about 270 bytes * 54 million =~
> 14GB in size.
>
> (2) My L2ARC is 160GB in size, but I'm using 14GB for the DDT. Thus, I
> have 146GB free for use as a data cache. 146GB / 4k =~ 38 million
> blocks can be stored in the remaining L2ARC space. However, 38 million
> blocks take up: 200 bytes * 38 million =~ 7GB of space in ARC.
>
> Thus, I'd better spec my system with (whatever base RAM for basic OS and
> cache and application requirements) + 14G because of dedup + 7G
> because of L2ARC.

Thanks, but one more thing: add some tuning parameters, such as "set zfs:zfs_arc_meta_limit = somevalue" in /etc/system, to help ZFS use more memory for its metadata (like the DDT), as it won't use more than (RAM - 1GB) / 4 by default.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases adequate and relevant synonyms exist in Norwegian.
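A concrete (and arbitrary) example of that tuning: in /etc/system,

* hypothetical: allow ~10GB of ARC metadata; takes effect at the next boot
set zfs:zfs_arc_meta_limit = 0x280000000

and the current metadata usage and limit can be checked on a running system with:

# echo ::arc | mdb -k | grep meta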
[zfs-discuss] Dedup and L2ARC memory requirements (again)
There are a lot of conflicting references on the Internet, so I'd really like to solicit actual experts (ZFS developers or people who have physical evidence) to weigh in on this...

After searching around, the reference I found to be the most seemingly useful was Erik's post here:
http://opensolaris.org/jive/thread.jspa?threadID=131296

Unfortunately it looks like there's an arithmetic error (1TB of 4k blocks means 268 million blocks, not 1 billion). Also, IMHO it seems important to make the distinction that #files != #blocks. Due to the existence of larger files, there will sometimes be more than one block per file; and if I'm not mistaken, thanks to write aggregation, there will sometimes be more than one file per block. YMMV. Average block size could be anywhere between 1 byte and 128k assuming default recordsize. (BTW, recordsize seems to be a zfs property, not a zpool property. So how can you know or configure the blocksize for something like a zvol iscsi target?)

(BTW, is there any way to get a measurement of number of blocks consumed per zpool? Per vdev? Per zfs filesystem?)

The calculations below are based on the assumption of 4KB blocks adding up to a known total data consumption. The actual thing that matters is the number of blocks consumed, so the conclusions drawn will vary enormously when people actually have average block sizes != 4KB.

And one more comment: Based on what's below, it seems that the DDT gets stored on the cache device and also in RAM. Is that correct? What if you didn't have a cache device? Shouldn't it *always* be in RAM? And doesn't the cache device get wiped every time you reboot? It seems to me like putting the DDT on the cache device would be harmful... Is that really how it is?

After modifications that I hope are corrections, I think the post should look like this:

The rule-of-thumb is 270 bytes/DDT entry, and 200 bytes of ARC for every L2ARC entry.

DDT doesn't count for this ARC space usage.

E.g.: I have 1TB of 4k blocks that are to be deduped, and it turns out that I have about a 5:1 dedup ratio. I'd also like to see how much ARC usage I eat up with a 160GB L2ARC.

(1) How many entries are there in the DDT:

1TB of 4k blocks means there are 268 million blocks. However, at a 5:1 dedup ratio, I'm only actually storing 20% of that, so I have about 54 million blocks. Thus, I need a DDT of about 270 bytes * 54 million =~ 14GB in size.

(2) My L2ARC is 160GB in size, but I'm using 14GB for the DDT. Thus, I have 146GB free for use as a data cache. 146GB / 4k =~ 38 million blocks can be stored in the remaining L2ARC space. However, 38 million blocks take up: 200 bytes * 38 million =~ 7GB of space in ARC.

Thus, I'd better spec my system with (whatever base RAM for basic OS and cache and application requirements) + 14G because of dedup + 7G because of L2ARC.
Re: [zfs-discuss] zfs problem vdev I/O failure
Hi Konstantin, > zpool status: > Flash# zpool status > pool: zroot > state: DEGRADED > status: One or more devices are faulted in response to IO failures. > action: Make sure the affected devices are connected, then run 'zpool > clear'. >see: http://www.sun.com/msg/ZFS-8000-HC > scrub: resilver in progress for 0h6m, 0.00% done, 1582566h29m to go > config: > NAME STATE READ WRITE CKSUM > zrootDEGRADED12 0 1 > mirror DEGRADED36 0 4 > 7159451150335751026 UNAVAIL 0 0 0 was > /dev/gpt/disk0 > gpt/disk1ONLINE 0 040 > errors: 12 data errors, use '-v' for a list > Zpool scrub freeze and time to resilver up in time... > How i can repair it, if zpool scrub -s zroot and detach don`t work...and > don`t work all of zfs commands =\ Try booting mfsBSD and fixing there, http://mfsbsd.vx.sk/ http://mfsbsd.vx.sk/iso/mfsbsd-8.2-zfsv28-i386.iso http://mfsbsd.vx.sk/iso/mfsbsd-se-8.2-zfsv28-amd64.iso ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
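Once booted from the mfsBSD image, the repair Pawel suggests would look roughly like this (a sketch only; the guid is the one shown in the zpool status output above, and /mnt is an arbitrary altroot):

# zpool import -f -R /mnt zroot
# zpool clear zroot
# zpool detach zroot 7159451150335751026
# zpool scrub zroot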
[zfs-discuss] zfs problem vdev I/O failure
Good morning, I have a problem with ZFS: ZFS filesystem version 4 ZFS storage pool version 15 Yesterday my comp with Freebsd 8.2 releng shutdown with ad4 error detached,when I copy a big file... and after reboot in 2 wd green 1tb say me goodbye. One of them die and other with zfs errors: Apr 24 04:53:41 Flash root: ZFS: vdev I/O failure, zpool=zroot path= offset=187921768448 size=512 error=6 Apr 24 04:53:41 Flash root: ZFS: vdev I/O failure, zpool=zroot path= offset=187921768960 size=512 error=6 Apr 24 04:53:41 Flash root: ZFS: vdev I/O failure, zpool=zroot path= offset=311738368 size=21504 error=6 Apr 24 04:53:41 Flash root: ZFS: zpool I/O failure, zpool=zroot error=6 Apr 24 04:53:41 Flash root: ZFS: vdev I/O failure, zpool=zroot path= offset= size= error= Apr 24 04:53:41 Flash root: ZFS: vdev I/O failure, zpool=zroot path= offset=635155456 size=3072 error=6 Apr 24 04:53:41 Flash root: ZFS: zpool I/O failure, zpool=zroot error=6 Apr 24 04:53:41 Flash root: ZFS: vdev I/O failure, zpool=zroot path= offset= size= error= Apr 24 04:53:41 Flash root: ZFS: vdev I/O failure, zpool=zroot path= offset=635158528 size=12288 error=6 Apr 24 04:53:41 Flash root: ZFS: zpool I/O failure, zpool=zroot error=6 Apr 24 04:53:41 Flash root: ZFS: vdev I/O failure, zpool=zroot path= offset= size= error= Apr 24 04:53:41 Flash root: ZFS: vdev I/O failure, zpool=zroot path= offset=635170816 size=512 error=6 Apr 24 04:53:41 Flash root: ZFS: zpool I/O failure, zpool=zroot error=6 Apr 24 04:53:41 Flash root: ZFS: vdev I/O failure, zpool=zroot path= offset= size= error= Apr 24 04:53:41 Flash root: ZFS: vdev I/O failure, zpool=zroot path= offset=635171328 size=512 error=6 Apr 24 04:53:41 Flash root: ZFS: vdev I/O failure, zpool=zroot path= offset=635171840 size=512 error=6 Apr 24 04:53:41 Flash root: ZFS: zpool I/O failure, zpool=zroot error=6 zpool status: Flash# zpool status pool: zroot state: DEGRADED status: One or more devices are faulted in response to IO failures. action: Make sure the affected devices are connected, then run 'zpool clear'. see: http://www.sun.com/msg/ZFS-8000-HC scrub: resilver in progress for 0h6m, 0.00% done, 1582566h29m to go config: NAME STATE READ WRITE CKSUM zrootDEGRADED12 0 1 mirror DEGRADED36 0 4 7159451150335751026 UNAVAIL 0 0 0 was /dev/gpt/disk0 gpt/disk1ONLINE 0 040 errors: 12 data errors, use '-v' for a list Zpool scrub freeze and time to resilver up in time... How i can repair it, if zpool scrub -s zroot and detach don`t work...and don`t work all of zfs commands =\ Thx ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss