[zfs-discuss] zpool errors without fmdump or dmesg errors
Hi all,

I am running S11 on a Dell PE650. It has 5 zpools attached, made up of 240 drives in total and connected via fibre channel. On Thursday, all of a sudden, two out of the three zpools on one FC channel showed numerous errors, and one of them showed this:

root@solaris11a:~# zpool status vsmPool01
  pool: vsmPool01
 state: SUSPENDED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Jan 19 08:53:28 2013
        344G scanned out of 24,7T at 128M/s, 55h18m to go
        45,9G resilvered, 1,36% done
config:

        NAME                         STATE     READ WRITE CKSUM
        vsmPool01                    UNAVAIL      0     0     0  experienced I/O failures
          mirror-0                   UNAVAIL      0     0     0  experienced I/O failures
            c0t201A001378E06A18d0    UNAVAIL      0     0     0  experienced I/O failures
            c0t2006001378E0E198d0    UNAVAIL      0     0     0  experienced I/O failures
            c0t2005001378E0DE98d0    UNAVAIL      0     0     0  experienced I/O failures
          mirror-1                   UNAVAIL      0     0     0  experienced I/O failures
            c0t2006001378E0DE98d0    UNAVAIL      0     0     0  experienced I/O failures
            c0t201B001378E06A18d0    UNAVAIL      0     0     0  experienced I/O failures
            c0t2007001378E0E198d0    UNAVAIL      0     0     0  experienced I/O failures
          mirror-2                   UNAVAIL      0     0     0  experienced I/O failures
            c0t2007001378E0DE98d0    UNAVAIL      0     0     0  experienced I/O failures
            c0t201C001378E06A18d0    UNAVAIL      0     0     0  experienced I/O failures
            c0t2008001378E0E198d0    UNAVAIL      0     0     0  experienced I/O failures
          mirror-3                   UNAVAIL      0     0     0  experienced I/O failures
            c0t2008001378E0DE98d0    UNAVAIL      0     0     0  experienced I/O failures
            c0t2009001378E0E198d0    UNAVAIL      0     0     0  experienced I/O failures
            c0t201D001378E06A18d0    UNAVAIL      0     0     0  experienced I/O failures
          mirror-4                   UNAVAIL      0     0     0  experienced I/O failures
            c0t2009001378E0DE98d0    UNAVAIL      0     0     0  experienced I/O failures
            c0t201E001378E06A18d0    UNAVAIL      0     0     0  experienced I/O failures
            c0t200A001378E0E198d0    UNAVAIL      0     0     0  experienced I/O failures
          mirror-5                   UNAVAIL      0     0     0  experienced I/O failures
            c0t200A001378E0DE98d0    UNAVAIL      0     0     0  experienced I/O failures
            spare-1                  UNAVAIL      0     0     0  experienced I/O failures
              c0t201F001378E06A18d0  UNAVAIL      0     0     0  experienced I/O failures
              c0t2015001378E0E198d0  UNAVAIL      0     0     0  experienced I/O failures  (resilvering)
            c0t200B001378E0E198d0    UNAVAIL      0     0     0  experienced I/O failures
          mirror-6                   UNAVAIL      0     0     0  experienced I/O failures
            c0t200B001378E0DE98d0    UNAVAIL      0     0     0  experienced I/O failures
            c0t2020001378E06A18d0    UNAVAIL      0     0     0  experienced I/O failures
            c0t200C001378E0E198d0    UNAVAIL      0     0     0  experienced I/O failures
          mirror-7                   UNAVAIL      0     0     0  experienced I/O failures
            spare-0                  UNAVAIL      0     0     0  experienced I/O failures
              c0t2021001378E06A18d0  UNAVAIL      0     0     0  experienced I/O failures
              c0t2014001378E0DE98d0  UNAVAIL      0     0     0  experienced I/O failures  (resilvering)
            c0t200D001378E0E198d0    UNAVAIL      0     0     0  experienced I/O failures
            c0t200C001378E0DE98d0    UNAVAIL      0     0     0  experienced I/O failures
          mirror-8                   UNAVAIL      0     0     0  experienced I/O failures
            c0t200D001378E0DE98d0    UNAVAIL      0     0     0  experienced I/O failures
            c0t2022001378E06A18d0    UNAVAIL      0     0     0  experienced I/O failures
            c0t200E001378E0E198d0    UNAVAIL      0     0     0  experienced I/O failures
          mirror-9                   UNAVAIL      0     0     0  experienced I/O failures
            c0t200E001378E0DE98d0    UNAVAIL      0     0     0  experienced I/O failures
            c0t2023001378E06A18d0    UNAVAIL      0     0     0  experienced I/O failures
            c0t200F001378E0E198d0    UNAVAIL      0     0     0  experienced I/O failures
          mirror-10                  UNAVAIL      0     0     0  experienced I/O failures
            c0t200F001378E0DE98d0    UNAVAIL      0     0     0  experienced I/O failures
            c0t2024001378E06A18d0    UNAVAIL      0     0     0  experienced I/O failures
            c0t2010001378E0E198d0    UNAVAIL      0     0     0
Re: [zfs-discuss] iSCSI access patterns and possible improvements?
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Bob Friesenhahn
>
> If almost all of the I/Os are 4K, maybe your ZVOLs should use a
> volblocksize of 4K? This seems like the most obvious improvement.

Oh, I forgot to mention - the above logic only makes sense for mirrors and stripes, not for raidz (or raid-5/6/dp in general).

If you have a pool of mirrors or stripes, the system isn't forced to subdivide a 4k block onto multiple disks, so it works very well. But if you have a pool blocksize of 4k and, let's say, a 5-disk raidz (capacity of 4 disks), then the 4k block gets divided into 1k on each data disk plus 1k of parity on the parity disk. Now, since the hardware only supports block sizes of 4k ... you can see there's a lot of wasted space, and if you do a lot of this, you'll also have a lot of time wasted waiting on seeks/latency.
[zfs-discuss] Resilver w/o errors vs. scrub with errors
Hi,

I am always experiencing chksum errors while scrubbing my zpool(s), but I have never experienced chksum errors while resilvering. Does anybody know why that would be? This happens on all of my servers - Sun Fire 4170M2, Dell PE 650 - and on any FC storage that I have.

Currently I have a major issue where two of my zpools have been suspended because every single drive had been marked as UNAVAIL due to "experienced I/O failures". Now, this zpool is made of 3-way mirrors and currently 13 out of 15 vdevs are resilvering (which they had already gone through yesterday as well), and I never got any error while resilvering. I have been all over the setup to find any glitch or bad part, but I couldn't come up with anything significant.

Doesn't this sound improbable? Wouldn't one expect to encounter chksum errors while the resilver is running, too?
[zfs-discuss] RFE: Un-dedup for unique blocks
Hello all,

While revising my home NAS, which had dedup enabled before I gathered that its RAM capacity was too puny for the task, I found that there is some deduplication among the data bits I uploaded there (makes sense, since it holds backups of many of the computers I've worked on - some of my homedirs' contents were bound to intersect). However, a lot of the blocks are in fact unique - they have entries in the DDT with count=1 and the dedup bit set in their blkptr_t. In fact they are not deduped, and with my pouring of backups complete, they are unlikely to ever become deduped.

Thus these many unique "deduped" blocks are just a burden: when my system writes into the datasets with dedup enabled, when it walks the superfluously large DDT, when it has to store this DDT on disk and in ARC, maybe during scrubbing... These entries bring lots of headache (or performance degradation) for zero gain.

So I thought it would be a nice feature to let ZFS go over the DDT (I won't care if it requires offlining/exporting the pool) and evict the entries with count==1, as well as locate the corresponding block-pointer tree entries on disk and clear their dedup bits, making such blocks into regular unique ones. This would require rewriting metadata (a smaller DDT, new block pointers) but should not touch or reallocate the already-saved userdata (the blocks' contents) on disk. The new BP without the dedup bit set would have the same contents in its other fields (though its parents would of course have to change more - new DVAs, new checksums...).

In the end my pool would only track as deduped those blocks which do already have two or more references - which, given the static nature of such a backup box, should be enough (i.e. new full backups of the same source data would remain deduped and use no extra space, while unique data won't waste resources by being accounted as deduped).

What do you think?
//Jim
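P.S. To make the proposal concrete, here is a minimal sketch of the pass I have in mind. All the names (ddt_entries, walk_block_pointers, rewrite_bp, ddt_remove) are made up for illustration - this is not a real ZFS interface, just the two steps spelled out: clear the dedup bit in BPs whose DDT entry has count==1, then evict those entries.

# Hypothetical sketch of the proposed offline "un-dedup" pass.
# None of these names are real ZFS interfaces; they only illustrate
# the steps described above.

def undedup_pass(pool):
    # Pass 1: find DDT entries that were never actually deduplicated.
    unique_hashes = {e.checksum for e in pool.ddt_entries() if e.refcount == 1}

    # Pass 2: walk the block-pointer tree and rewrite only the metadata.
    for bp in pool.walk_block_pointers():
        if bp.dedup and bp.checksum in unique_hashes:
            bp.dedup = False          # block becomes a regular unique block
            pool.rewrite_bp(bp)       # new BP -> parents get new DVAs/checksums,
                                      # but the user data itself is not moved

    # Pass 3: evict the now-unreferenced entries from the on-disk DDT.
    for h in unique_hashes:
        pool.ddt_remove(h)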
Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors
On Sat, 19 Jan 2013, Stephan Budach wrote:

> Now, this zpool is made of 3-way mirrors and currently 13 out of 15 vdevs
> are resilvering (which they had gone through yesterday as well) and I
> never got any error while resilvering. I have been all over the setup to
> find any glitch or bad part, but I couldn't come up with anything
> significant. Doesn't this sound improbable, wouldn't one expect to
> encounter other chksum errors while resilvering is running?

I can't attest to chksum errors since I have yet to see one on my machines (I have seen several complete disk failures, and disks faulted by the system, though). Checksum errors are bad, and not seeing them should be the normal case.

Resilver may in fact just be verifying that the pool disks are coherent via metadata. This might happen if the fiber channel is flapping.

Regarding the dire fiber channel issue: are you using fiber channel switches or direct connections to the storage array(s)? If you are using switches, are they stable, or are they doing something terrible like resetting? Do you have duplex connectivity? Have you verified that your FC HBAs' firmware is correct?

Did you check for messages in /var/adm/messages which might indicate when and how FC connectivity has been lost?

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors
On 2013-01-19 18:17, Bob Friesenhahn wrote:
> Resilver may in fact be just verifying that the pool disks are coherent
> via metadata. This might happen if the fiber channel is flapping.

Correction: that (verification) would be scrubbing ;)

The way I get it, resilvering is related to scrubbing but limited in impact, such that it rebuilds a particular top-level vdev (i.e. one of the component mirrors) with an assigned-bad or new device. So both should walk the block-pointer tree from the uberblock (the current BP tree root) until they have ultimately read all the BP entries and validated the userdata against its checksums. But while a scrub walks and verifies the whole pool and fixes discrepancies (logging checksum errors), the resilver verifies a particular TLVDEV (and maybe has a cut-off earliest TXG for disks which fell out of the pool and later returned to it - with a known latest TXG that is assumed valid on this disk), and the process expects there to be errors - it is intent on (partially) rewriting one of the devices in it. Hmmm... Maybe that's why there are no errors logged? I don't know :)

As for practice, I also have one Thumper that logs errors on a couple of drives upon every scrub. I think it was related to the connectors; at least replugging the disks helped a lot (counts went from tens per scrub to 0-3). One of the original 250Gb disks was replaced with a 3Tb one, and a 250Gb partition became part of the old pool (the remainder became a new test pool over a single device). Scrubbing the pools yields errors in that new 250Gb partition, but never on the 2.75Tb single-disk pool... so go figure :)

Overall, intermittent errors might be attributed to non-ECC RAM/CPUs (not our case), temperature affecting the mechanics and electronics (conditioned server room - not our case), electric power variations and noise (other systems in the room on the same and other UPSes don't complain like this), and cable/connector/HBA degradation (oxidation, wear, etc. - likely all that remains as our cause). This example regards the internal disks of the Thumper, so at least we can be certain not to attribute the problems to further breakable components - external cables, disk trays, etc...

HTH,
//Jim
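P.S. To spell out my speculation, here is a simplified model of the difference, assuming each block pointer carries a birth TXG and a list of copies (DVAs). The pool methods are made-up placeholders, not actual ZFS code:

# Simplified model of scrub vs. resilver as described above (not actual ZFS code).
# Both walk the block-pointer tree; they differ in which copies they touch and
# what they do on a mismatch.

def scrub(pool):
    for bp in pool.walk_block_pointers():
        for copy in bp.dvas:                      # every copy on every vdev
            data = pool.read(copy)
            if pool.checksum(data) != bp.checksum:
                pool.log_cksum_error(copy)        # scrub reports the error...
                pool.repair(copy, bp)             # ...and fixes it from a good copy

def resilver(pool, target_vdev, last_good_txg):
    for bp in pool.walk_block_pointers():
        if bp.birth_txg <= last_good_txg:
            continue                              # the returning disk already has this
        for copy in bp.dvas:
            if copy.vdev == target_vdev:          # only the device being rebuilt
                pool.repair(copy, bp)             # data is *expected* to be missing here,
                                                  # so nothing is counted as an error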
Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors
On 19.01.13 18:17, Bob Friesenhahn wrote:
> On Sat, 19 Jan 2013, Stephan Budach wrote:
>> Now, this zpool is made of 3-way mirrors and currently 13 out of 15 vdevs
>> are resilvering (which they had gone through yesterday as well) and I
>> never got any error while resilvering. I have been all over the setup to
>> find any glitch or bad part, but I couldn't come up with anything
>> significant. Doesn't this sound improbable, wouldn't one expect to
>> encounter other chksum errors while resilvering is running?
>
> I can't attest to chksum errors since I have yet to see one on my machines
> (have seen several complete disk failures, or disks faulted by the system
> though). Checksum errors are bad and not seeing them should be the normal
> case.

I know, and it's really bugging me that I seem to have these chksum errors on all of my machines, be it Sun gear or Dell.

> Resilver may in fact be just verifying that the pool disks are coherent
> via metadata. This might happen if the fiber channel is flapping.
>
> Regarding the dire fiber channel issue, are you using fiber channel
> switches or direct connections to the storage array(s)? If you are using
> switches, are they stable or are they doing something terrible like
> resetting? Do you have duplex connectivity? Have you verified that your
> FC HBA's firmware is correct?

Looking at my FC switches, I am noticing errors like these:

[656][Thu Dec 06 03:33:04.795 UTC 2012][I][8600.001E][Port][Port: 2][PortID 0x30200 PortWWN 10:00:00:06:2b:12:d3:55 logged out of nameserver.]
[657][Thu Dec 06 03:33:05.829 UTC 2012][I][8600.0020][Port][Port: 2][SYNC_LOSS]
[658][Thu Dec 06 03:37:08.077 UTC 2012][I][8600.001F][Port][Port: 2][SYNC_ACQ]
[659][Thu Dec 06 03:37:10.582 UTC 2012][I][8600.001D][Port][Port: 2][PortID 0x30200 PortWWN 10:00:00:06:2b:12:d3:55 logged into nameserver.]
[660][Sun Dec 09 04:18:32.324 UTC 2012][I][8600.001E][Port][Port: 10][PortID 0x30a00 PortWWN 21:01:00:1b:32:22:30:53 logged out of nameserver.]
[661][Sun Dec 09 04:18:32.326 UTC 2012][I][8600.0020][Port][Port: 10][SYNC_LOSS]
[662][Sun Dec 09 04:18:32.913 UTC 2012][I][8600.001F][Port][Port: 10][SYNC_ACQ]
[663][Sun Dec 09 04:18:33.024 UTC 2012][I][8600.001D][Port][Port: 10][PortID 0x30a00 PortWWN 21:01:00:1b:32:22:30:53 logged into nameserver.]

Just ignore the timestamps, as it seems the time is not set correctly on the switch, but the dates match my two issues from today and Thursday, which accounts for three days. I didn't catch that before, but it seems to clearly indicate a problem with the FC connection… But what do I make of this information?

> Did you check for messages in /var/adm/messages which might indicate when
> and how FC connectivity has been lost?

Well, this is the most scary part to me: neither fmdump nor dmesg showed anything that would indicate a connectivity issue - at least not the last time.

Thanks,
Stephan
Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors
On Sat, 19 Jan 2013, Jim Klimov wrote:
> On 2013-01-19 18:17, Bob Friesenhahn wrote:
>> Resilver may in fact be just verifying that the pool disks are coherent
>> via metadata. This might happen if the fiber channel is flapping.
>
> Correction: that (verification) would be scrubbing ;)

I don't think that zfs would call it scrubbing unless the user requested scrubbing.

Unplugging a USB drive which is part of a mirror for a short while results in considerable activity when it is plugged back in. It is as if zfs does not trust the device which was temporarily unplugged and does a full validation of it.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors
On Sat, 19 Jan 2013, Stephan Budach wrote:
> Just ignore the timestamps, as it seems the time is not set correctly, but
> the dates match my two issues from today and Thursday, which accounts for
> three days. I didn't catch that before, but it seems to clearly indicate a
> problem with the FC connection… But what do I make of this information?

I don't know, but the issue/problem seems to be below the zfs level, so you need to fix that lower level before worrying about zfs.

>> Did you check for messages in /var/adm/messages which might indicate when
>> and how FC connectivity has been lost?
>
> Well, this is the most scary part to me: neither fmdump nor dmesg showed
> anything that would indicate a connectivity issue - at least not the last
> time.

Weird. I wonder if multipathing is working for you at all. With my direct-connect setup, if a path is lost, then there is quite a lot of messaging to /var/adm/messages. I also see a lot of messaging related to multipathing when the system boots and first starts using the array.

However, with the direct-connect setup, the HBA can report problems immediately if it sees a loss of signal. Your issues might be on the other side of the switch (on the storage array side), so the local HBA does not see the problem and timeouts are used instead. Make sure to check the logs in your storage array to see if it is encountering resets or flapping connectivity.

Do you have duplex switches so that there are fully-redundant paths, or is only one switch used?

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors
On 2013-01-19 20:08, Bob Friesenhahn wrote:
> On Sat, 19 Jan 2013, Jim Klimov wrote:
>> On 2013-01-19 18:17, Bob Friesenhahn wrote:
>>> Resilver may in fact be just verifying that the pool disks are coherent
>>> via metadata. This might happen if the fiber channel is flapping.
>>
>> Correction: that (verification) would be scrubbing ;)
>
> I don't think that zfs would call it scrubbing unless the user requested
> scrubbing. Unplugging a USB drive which is part of a mirror for a short
> while results in considerable activity when it is plugged back in. It is
> as if zfs does not trust the device which was temporarily unplugged and
> does a full validation of it.

Now, THAT would be resilvering - and by default it should be a limited one, with a cutoff at the last TXG known to the disk that went MIA/AWOL. The disk's copy of the pool label (4 copies, in fact) records the last TXG it knew of safely. So the resilver should only try to validate and copy over the blocks whose BP entries' birth TXG number is above that. And since these blocks' components (mirror copies or raidz parity/data parts) are expected to be missing on this device, mismatches are likely not reported - I am not sure there's any attempt to even detect them.

//Jim
Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors
On 19.01.13 20:18, Bob Friesenhahn wrote:
> On Sat, 19 Jan 2013, Stephan Budach wrote:
>> Just ignore the timestamps, as it seems the time is not set correctly,
>> but the dates match my two issues from today and Thursday, which accounts
>> for three days. I didn't catch that before, but it seems to clearly
>> indicate a problem with the FC connection… But what do I make of this
>> information?
>
> I don't know, but the issue/problem seems to be below the zfs level, so
> you need to fix that lower level before worrying about zfs.

Yes, I do think that as well.

>>> Did you check for messages in /var/adm/messages which might indicate
>>> when and how FC connectivity has been lost?
>>
>> Well, this is the most scary part to me: neither fmdump nor dmesg showed
>> anything that would indicate a connectivity issue - at least not the last
>> time.
>
> Weird. I wonder if multipathing is working for you at all. With my
> direct-connect setup, if a path is lost, then there is quite a lot of
> messaging to /var/adm/messages. I also see a lot of messaging related to
> multipathing when the system boots and first starts using the array.
>
> However, with the direct-connect setup, the HBA can report problems
> immediately if it sees a loss of signal. Your issues might be on the other
> side of the switch (on the storage array side), so the local HBA does not
> see the problem and timeouts are used instead. Make sure to check the logs
> in your storage array to see if it is encountering resets or flapping
> connectivity.

I will check that.

> Do you have duplex switches so that there are fully-redundant paths, or is
> only one switch used?

Well, no… I don't have enough switch ports on my FC SAN, but we will replace these SANboxes with Nexus switches from Cisco this year, and I will have multipathing then.

Thanks,
Stephan
Re: [zfs-discuss] Resilver w/o errors vs. scrub with errors
On 2013-01-19 20:23, Jim Klimov wrote:
> On 2013-01-19 20:08, Bob Friesenhahn wrote:
>> On Sat, 19 Jan 2013, Jim Klimov wrote:
>>> On 2013-01-19 18:17, Bob Friesenhahn wrote:
>>>> Resilver may in fact be just verifying that the pool disks are coherent
>>>> via metadata. This might happen if the fiber channel is flapping.
>>>
>>> Correction: that (verification) would be scrubbing ;)
>>
>> I don't think that zfs would call it scrubbing unless the user requested
>> scrubbing. Unplugging a USB drive which is part of a mirror for a short
>> while results in considerable activity when it is plugged back in. It is
>> as if zfs does not trust the device which was temporarily unplugged and
>> does a full validation of it.
>
> Now, THAT would be resilvering - and by default it should be a limited
> one, with a cutoff at the last TXG known to the disk that went MIA/AWOL.
> The disk's copy of the pool label (4 copies, in fact) records the last TXG
> it knew of safely. So the resilver should only try to validate and copy
> over the blocks whose BP entries' birth TXG number is above that. And
> since these blocks' components (mirror copies or raidz parity/data parts)
> are expected to be missing on this device, mismatches are likely not
> reported - I am not sure there's any attempt to even detect them.

And regarding the considerable activity - AFAIK there is little way for ZFS to reliably read and test TXGs newer than X other than to walk the whole current tree of block pointers and go deeper into those that match the filter (TLVDEV number in the DVA, and optionally TXG numbers in the birth/physical fields). So likely the resilver does much of the same work that a full scrub would - at least in terms of reading all of the pool's metadata (though maybe not all copies thereof).

My 2c and my speculation,
//Jim
Re: [zfs-discuss] iSCSI access patterns and possible improvements?
On Jan 19, 2013, at 7:16 AM, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Bob Friesenhahn
>>
>> If almost all of the I/Os are 4K, maybe your ZVOLs should use a
>> volblocksize of 4K? This seems like the most obvious improvement.
>
> Oh, I forgot to mention - the above logic only makes sense for mirrors and
> stripes, not for raidz (or raid-5/6/dp in general).
>
> If you have a pool of mirrors or stripes, the system isn't forced to
> subdivide a 4k block onto multiple disks, so it works very well. But if
> you have a pool blocksize of 4k and, let's say, a 5-disk raidz (capacity
> of 4 disks), then the 4k block gets divided into 1k on each data disk plus
> 1k of parity on the parity disk. Now, since the hardware only supports
> block sizes of 4k ... you can see there's a lot of wasted space, and if
> you do a lot of this, you'll also have a lot of time wasted waiting on
> seeks/latency.

This is not quite true for raidz. If there is a 4k write to a raidz comprised of 4k-sector disks, then there will be one data and one parity block. There will not be 4 data + 1 parity with 75% space wastage. Rather, the space allocation more closely resembles a variant of mirroring, like what some vendors call RAID-1E.
-- richard

--
richard.ell...@richardelling.com
+1-760-896-4422
Re: [zfs-discuss] iSCSI access patterns and possible improvements?
On 2013-01-19 23:39, Richard Elling wrote:
> This is not quite true for raidz. If there is a 4k write to a raidz
> comprised of 4k-sector disks, then there will be one data and one parity
> block. There will not be 4 data + 1 parity with 75% space wastage. Rather,
> the space allocation more closely resembles a variant of mirroring, like
> what some vendors call RAID-1E

I agree with this exact reply, but as I posted sometime late last year, reporting on my digging in the bowels of ZFS and my problematic pool, for a 6-disk raidz2 set I only saw allocations (including the two parity disks) divisible by 3 sectors, even if the amount of (compressed) userdata was not so rounded. I.e. I had either miniature files or tails of files fitting into one sector plus two parities (overall a 3-sector allocation), or tails ranging from 2-4 sectors and occupying 6 with parity (while 2 or 3 sectors could have used just 4 or 5 with parities, respectively).

I am not sure what these numbers mean - 3 being a case for one userdata sector plus both parities, or for half of a 6-disk stripe - both such explanations fit in my case. But yes, with the current raidz allocation there are many ways to waste space. And those small percentages (or not so small) do add up.

Rectifying this example, i.e. allocating only as much as is used, does not seem like an incompatible on-disk format change, and should be doable within the write-queue logic. Maybe it would cause tradeoffs in efficiency; however, ZFS does explicitly rotate the starting disks of allocations every few megabytes in order to even out the load among spindles (normally parity disks don't have to be accessed - unless mismatches occur on data disks). Dropping such padding would only help achieve this goal and save space at the same time...

My 2c,
//Jim
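P.S. A rough model of the allocations I saw: data sectors plus one parity sector per stripe row, with the total then padded up to a multiple of (nparity + 1) sectors. A small sketch of that arithmetic for a 6-disk raidz2 with 4K sectors - illustrative only, not taken from the ZFS source:

# Rough model of the raidz2 allocations observed above (6 disks, ashift=12).
# Data sectors get nparity parity sectors per stripe row, and the total is
# then padded up to a multiple of (nparity + 1) sectors.

def raidz_asize(data_sectors, ndisks=6, nparity=2):
    rows = -(-data_sectors // (ndisks - nparity))   # ceil: stripe rows needed
    total = data_sectors + rows * nparity           # data + parity sectors
    unit = nparity + 1
    return -(-total // unit) * unit                 # pad to a multiple of nparity+1

for d in range(1, 5):
    print(d, "data sector(s) ->", raidz_asize(d), "sectors allocated")
# 1 -> 3, 2 -> 6, 3 -> 6, 4 -> 6, matching the 3- and 6-sector allocations above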
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
I've wanted a system where dedup applies only to blocks being written that have a good chance of being dups of others.

I think one way to do this would be to keep a scalable Bloom filter (on disk) into which one inserts block hashes. To decide if a block needs dedup one would first check the Bloom filter; if the block is in it, use the dedup code path, else use the non-dedup code path and insert the block in the Bloom filter. This means that the filesystem would store *two* copies of any deduplicatious block, with one of those not being in the DDT. This would allow most writes of non-duplicate blocks to be faster than normal dedup writes, but still slower than normal non-dedup writes: the Bloom filter will add some cost. The nice thing about this is that Bloom filters can be sized to fit in main memory, and will be much smaller than the DDT.

It's very likely that this is a bit too obvious to just work.

Of course, it is easier to just use flash. It's also easier to just not dedup: the most highly deduplicatious data (VM images) is relatively easy to manage using clones and snapshots, to a point anyways.

Nico
--
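P.S. A minimal sketch of the write path I have in mind, using a plain (non-scalable) Bloom filter for brevity; the pool.dedup_write()/plain_write() calls are hypothetical stand-ins, not real ZFS interfaces:

import hashlib

class BloomFilter:
    def __init__(self, nbits=1 << 24, nhashes=4):
        self.bits = bytearray(nbits // 8)
        self.nbits, self.nhashes = nbits, nhashes

    def _positions(self, key):
        # Derive nhashes bit positions from the key.
        for i in range(self.nhashes):
            h = hashlib.sha256(key + bytes([i])).digest()
            yield int.from_bytes(h[:8], "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

def write_block(pool, bloom, data):
    cksum = hashlib.sha256(data).digest()
    if cksum in bloom:
        # Probably seen before: take the (slower) dedup path. The first copy
        # was written outside the DDT, so a duplicate gets stored once more.
        pool.dedup_write(data, cksum)
    else:
        bloom.add(cksum)            # remember the hash for future writes
        pool.plain_write(data)      # fast path: no DDT lookup or update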
Re: [zfs-discuss] RFE: Un-dedup for unique blocks
bloom filters are a great fit for this :-)
-- richard

On Jan 19, 2013, at 5:59 PM, Nico Williams n...@cryptonector.com wrote:
> I've wanted a system where dedup applies only to blocks being written that
> have a good chance of being dups of others.
>
> I think one way to do this would be to keep a scalable Bloom filter (on
> disk) into which one inserts block hashes. To decide if a block needs
> dedup one would first check the Bloom filter; if the block is in it, use
> the dedup code path, else use the non-dedup code path and insert the block
> in the Bloom filter. This means that the filesystem would store *two*
> copies of any deduplicatious block, with one of those not being in the
> DDT. This would allow most writes of non-duplicate blocks to be faster
> than normal dedup writes, but still slower than normal non-dedup writes:
> the Bloom filter will add some cost. The nice thing about this is that
> Bloom filters can be sized to fit in main memory, and will be much smaller
> than the DDT.
>
> It's very likely that this is a bit too obvious to just work.
>
> Of course, it is easier to just use flash. It's also easier to just not
> dedup: the most highly deduplicatious data (VM images) is relatively easy
> to manage using clones and snapshots, to a point anyways.
>
> Nico
> --