Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements
On 06-05-11 05:44, Richard Elling wrote:
> As the size of the data grows, the need to have the whole DDT in RAM or
> L2ARC decreases. With one notable exception, destroying a dataset or
> snapshot requires the DDT entries for the destroyed blocks to be
> updated. This is why people can go for months or years and not see a
> problem, until they try to destroy a dataset.

So what you are saying is: you, with your RAM-starved system, don't even
try to start using snapshots on that system. Right?

-- 
No part of this copyright message may be reproduced, read or seen, dead
or alive or by any means, including but not limited to telepathy,
without the benevolence of the author.
Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements
> On 06-05-11 05:44, Richard Elling wrote:
>> As the size of the data grows, the need to have the whole DDT in RAM
>> or L2ARC decreases. With one notable exception, destroying a dataset
>> or snapshot requires the DDT entries for the destroyed blocks to be
>> updated. This is why people can go for months or years and not see a
>> problem, until they try to destroy a dataset.
>
> So what you are saying is: you, with your RAM-starved system, don't
> even try to start using snapshots on that system. Right?

I think it's more like: don't use dedup when you don't have the RAM.
(It is not possible to not use snapshots in Solaris; they are used for
everything.)

Casper
Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements
On 5/6/2011 1:37 AM, casper@oracle.com wrote:
>> On 06-05-11 05:44, Richard Elling wrote:
>>> As the size of the data grows, the need to have the whole DDT in RAM
>>> or L2ARC decreases. With one notable exception, destroying a dataset
>>> or snapshot requires the DDT entries for the destroyed blocks to be
>>> updated. This is why people can go for months or years and not see a
>>> problem, until they try to destroy a dataset.
>> So what you are saying is: you, with your RAM-starved system, don't
>> even try to start using snapshots on that system. Right?
>
> I think it's more like: don't use dedup when you don't have the RAM.
> (It is not possible to not use snapshots in Solaris; they are used for
> everything.)
>
> Casper

Casper and Richard are correct - RAM starvation seriously impacts
snapshot or dataset deletion when a pool has dedup enabled.

The reason behind this is that ZFS needs to scan the entire DDT to check
whether it can actually delete each block in the to-be-deleted
snapshot/dataset, or whether it just needs to update the dedup reference
count. If it can't store the entire DDT in either the ARC or L2ARC, it
will be forced to do considerable I/O to disk as it brings in the
appropriate DDT entries. Worst case, insufficient ARC/L2ARC space can
increase deletion times by many orders of magnitude - e.g. days, weeks,
or even months to do a deletion.

If dedup isn't enabled, snapshot and data deletion is very light on RAM
requirements, and generally won't need to do much (if any) disk I/O.
Such deletion should take milliseconds to a minute or so.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
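To put rough numbers on the above, a back-of-the-envelope sketch in
Python. Everything here is an assumption: ~320 bytes per in-core DDT
entry is the figure commonly quoted on this list (it varies by release),
and the pool parameters are made up:

    # Rough check: does the DDT fit in ARC?  All numbers are assumptions.
    BYTES_PER_DDT_ENTRY = 320        # commonly quoted in-core cost per entry

    pool_bytes = 10 * 2**40          # hypothetical 10 TiB of deduped data
    avg_block  = 64 * 2**10          # hypothetical 64 KiB average block size

    entries = pool_bytes // avg_block      # one DDT entry per unique block
    ddt_ram = entries * BYTES_PER_DDT_ENTRY

    print("DDT entries: %d" % entries)                    # ~168 million
    print("RAM for DDT: %.0f GiB" % (ddt_ram / 2.0**30))  # ~50 GiB

If ARC plus L2ARC can't hold that ~50 GiB, each DDT update during a
destroy risks a random disk read, which is where the pathological
deletion times come from.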
Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements
On 06 May, 2011 - Erik Trimble sent me these 1,8K bytes:

> If dedup isn't enabled, snapshot and data deletion is very light on RAM
> requirements, and generally won't need to do much (if any) disk I/O.
> Such deletion should take milliseconds to a minute or so.

.. or hours. We've had problems on an old raidz2 where a recursive
snapshot creation over ~800 filesystems could take quite some time, up
until the SATA-SCSI disk box ate the pool. Now we're using raid10 on a
SCSI box, and it takes 3-15 minutes or so, during which sync writes
(NFS) are almost unusable. We're using 2 fast USB sticks as L2ARC, and
waiting for a Vertex2 EX and a Vertex3 to arrive for ZIL/L2ARC testing.

I/O to the filesystems is quite low (50 writes, 500k data per second on
average), but snapshot times go way up during backups.

/Tomas
-- 
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
Re: [zfs-discuss] Faster copy from UFS to ZFS
On 05/ 5/11 10:02 PM, Joerg Schilling wrote:
> Ian Collins <i...@ianshome.com> wrote:
>>>> ufsrestore works fine on ZFS filesystems (although I haven't tried
>>>> it with any POSIX ACLs on the original UFS filesystem, which would
>>>> probably simply get lost).
>>> star -copy -no-fsync is typically 30% faster than ufsdump |
>>> ufsrestore.
>> Does it preserve ACLs?
>>> Star supports ACLs from the withdrawn POSIX draft.
>> So star would work moving data from UFS to ZFS, assuming it uses
>> acl_get/set to read and write the ACLs.
>
> Star could already support ZFS ACLs if Sun had offered a correctly
> working ACL support library when they introduced ZFS ACLs.
> Unfortunately it took some time until this lib was fixed, and since
> then I have had other projects that took my time. ZFS ACLs are not
> forgotten, however.

Um, I thought the acl_totext and acl_fromtext functions had been around
for many years.

-- 
Ian.
Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements
> From: Richard Elling [mailto:richard.ell...@gmail.com]
>
> --- To calculate size of DDT ---
> zdb -S poolname
> Look at total blocks allocated.

It is rounded, and uses a suffix like K, M, G, but it's in decimal
(powers of 10) notation, so you have to remember that... So I prefer the
zdb -D method below, but this works too: total blocks allocated * mem
requirement per DDT entry gives you the mem requirement to hold the
whole DDT in RAM.

> zdb -DD poolname

This just gives you the -S output and the -D output all in one go. So I
recommend using -DD, and base your calculations on #duplicate and
#unique, as mentioned below. Consider the histogram to be informational.

> zdb -D poolname

This gives you a number of duplicate blocks and a number of unique
blocks. Add them to get the total number of blocks, multiply by the mem
requirement per DDT entry, and you have the mem requirement to hold the
whole DDT in RAM.
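The -D method is easy to script. A sketch, assuming a pool named "tank"
and that zdb prints summary lines of the form
"DDT-sha256-zap-unique: N entries, ..." - that matches builds of this
era, but verify against yours before trusting the numbers:

    # Estimate RAM needed to hold the whole DDT, from 'zdb -D' output.
    import re
    import subprocess

    BYTES_PER_DDT_ENTRY = 320   # assumed in-core cost per entry, as above

    out = subprocess.check_output(["zdb", "-D", "tank"])  # hypothetical pool
    blocks = 0
    for line in out.decode().splitlines():
        m = re.search(r"DDT-\S+-(unique|duplicate): (\d+) entries", line)
        if m:
            blocks += int(m.group(2))   # unique + duplicate = total entries

    print("total DDT entries: %d" % blocks)
    print("RAM to hold DDT:   %.1f GiB"
          % (blocks * BYTES_PER_DDT_ENTRY / 2.0**30))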
Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
>
>> zdb -DD poolname
>
> This just gives you the -S output and the -D output all in one go.

Sorry - zdb -DD only works for pools that are already dedup'd. If you
want to get a measurement for a pool that is not already dedup'd, you
have to use -S.
[zfs-discuss] Recommended eSATA PCI cards
Hi all,

I'm looking at replacing my old D1000 array with some new external
drives, most likely these:
http://www.g-technology.com/products/g-drive.cfm .

In the immediate term, I'm planning to use USB 2.0 connections, but the
drive I'm considering also supports eSATA, which is MUCH faster than
USB, but also (I think, please correct me if I'm wrong) more reliable.
Neither of the machines I'll be using as my server (currently an SB1000
but will be an Ultra 20 M2 soon; this is my home network, very light
workload) has an integrated eSATA port, so I must turn to add-on PCI
cards. What are people recommending? I need to attach at least two
drives (I'll be mirroring them), preferably three or more.

The machines are currently running SXCE snv_b130, with an upgrade to
Solaris Express 11 not too far away.

Thanks!

-- 
Rich Teer, Publisher
Vinylphile Magazine
www.vinylphilemag.com
Re: [zfs-discuss] Extremely Slow ZFS Performance
Sounds like a nasty bug, and not one I've seen in illumos or
NexentaStor. What build are you running?

 - Garrett

On Wed, 2011-05-04 at 15:40 -0700, Adam Serediuk wrote:
> Dedup is disabled (confirmed to be). Doing some digging, it looks like
> this is a very similar issue to
> http://forums.oracle.com/forums/thread.jspa?threadID=2200577&tstart=0.
>
> On May 4, 2011, at 2:26 PM, Garrett D'Amore wrote:
>> My first thought is dedup... perhaps you've got dedup enabled and the
>> DDT no longer fits in RAM? That would create a huge performance cliff.
>>
>> -----Original Message-----
>> From: zfs-discuss-boun...@opensolaris.org on behalf of Eric D. Mudama
>> Sent: Wed 5/4/2011 12:55 PM
>> To: Adam Serediuk
>> Cc: zfs-discuss@opensolaris.org
>> Subject: Re: [zfs-discuss] Extremely Slow ZFS Performance
>>
>> On Wed, May 4 at 12:21, Adam Serediuk wrote:
>>> Both iostat and zpool iostat show very little to zero load on the
>>> devices even while blocking. Any suggestions on avenues of approach
>>> for troubleshooting?
>>
>> Is 'iostat -en' error free?
>>
>> --
>> Eric D. Mudama
>> edmud...@bounceswoosh.org
Re: [zfs-discuss] Deduplication Memory Requirements
On Wed, May 04, 2011 at 08:49:03PM -0700, Edward Ned Harvey wrote:
>> From: Tim Cook [mailto:t...@cook.ms]
>>
>> That's patently false. VM images are the absolute best use-case for
>> dedup outside of backup workloads. I'm not sure who told you/where you
>> got the idea that VM images are not ripe for dedup, but it's wrong.
>
> Well, I got that idea from this list. I said a little bit about why I
> believed it was true ... about dedup being ineffective for VMs ...
> Would you care to describe a use case where dedup would be effective
> for a VM? Or perhaps cite something specific, instead of just wiping
> the whole thing away and saying "patently false"? I don't feel like
> this comment was productive...

We use dedupe on our VMware datastores and typically see 50% savings,
oftentimes more. We do of course keep like VMs on the same volume (at
this point nothing more than groups of Windows VMs, Linux VMs, and so
on). Note that this isn't on ZFS (yet), but we hope to begin
experimenting with it soon (using NexentaStor).

Apologies for devolving the conversation too much in the NetApp
direction -- it was simply a point of reference for me to get a better
understanding of things on the ZFS side. :)

Ray
Re: [zfs-discuss] Recommended eSATA PCI cards
Hi Rich,

With the Ultra 20M2 there is a very cheap/easy alternative that might
work for you (until you need to expand past 2 more external devices,
anyway). Pick up an eSATA PCI bracket cable adapter, something like
this:

http://www.newegg.com/Product/Product.aspx?Item=N82E16812226003&cm_re=eSATA-_-12-226-003-_-Product

(I haven't used this specific product, but it was the first example I
found.)

The U20M2 has slots for just 2 internal SATA drives, but the motherboard
has a total of 4 SATA connectors, so there are two that normally go
unused. Connect these to the bracket and connect your external eSATA
enclosures to them. You'll get two eSATA ports without needing to use
any PCI slots, and I believe that if you use the very bottom PCI slot
opening you won't even block any of the actual PCI slots from future
use.

-Mark D.

On 05/ 6/11 12:04 PM, Rich Teer wrote:
> Hi all,
>
> I'm looking at replacing my old D1000 array with some new external
> drives, most likely these:
> http://www.g-technology.com/products/g-drive.cfm .
>
> In the immediate term, I'm planning to use USB 2.0 connections, but
> the drive I'm considering also supports eSATA, which is MUCH faster
> than USB, but also (I think, please correct me if I'm wrong) more
> reliable. Neither of the machines I'll be using as my server
> (currently an SB1000 but will be an Ultra 20 M2 soon; this is my home
> network, very light workload) has an integrated eSATA port, so I must
> turn to add-on PCI cards. What are people recommending? I need to
> attach at least two drives (I'll be mirroring them), preferably three
> or more.
>
> The machines are currently running SXCE snv_b130, with an upgrade to
> Solaris Express 11 not too far away.
>
> Thanks!
Re: [zfs-discuss] Deduplication Memory Requirements
On Fri, May 6, 2011 at 9:15 AM, Ray Van Dolson <rvandol...@esri.com> wrote:
> We use dedupe on our VMware datastores and typically see 50% savings,
> oftentimes more. We do of course keep like VMs on the same volume

I think NetApp uses 4k blocks by default, so the block size and
alignment should match up for most filesystems and yield better savings.

Your server's resource requirements for ZFS and dedup will be much
higher due to the large DDT, as you initially suspected.

If bp_rewrite is ever completed and released, this might change. It
should allow for offline dedup, which may make dedup usable in more
situations.

> Apologies for devolving the conversation too much in the NetApp
> direction -- it was simply a point of reference for me to get a better
> understanding of things on the ZFS side. :)

It's good to compare the two, since they have a pretty large overlap in
functionality but sometimes very different implementations.

-B

-- 
Brandon High : bh...@freaks.com
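The alignment point is easy to demonstrate: block-level dedup matches
whole-block checksums, so identical data at a misaligned offset shares
nothing. A toy sketch in Python (the 4 KiB block size and 512-byte shift
are arbitrary illustrative choices, not anything read from a real pool):

    # Identical data, misaligned by 512 bytes: essentially zero blocks dedup.
    import hashlib
    import os

    BLOCK = 4096
    data = os.urandom(2**20)     # 1 MiB standing in for a guest disk image

    def block_hashes(buf):
        # checksum of every BLOCK-sized chunk, as dedup would see them
        return set(hashlib.sha256(buf[i:i + BLOCK]).digest()
                   for i in range(0, len(buf), BLOCK))

    aligned    = block_hashes(data)
    misaligned = block_hashes(b"\0" * 512 + data)  # same data, shifted

    print("blocks in common: %d" % len(aligned & misaligned))  # almost surely 0

This is why keeping like VMs on the same volume, with partitions aligned
to the block size, matters so much for the savings Ray reports.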
Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements
One of the quoted participants is Richard Elling, the other is Edward
Ned Harvey, but my quoting was screwed up enough that I don't know which
is which. Apologies.

>> zdb -DD poolname
>>
>> This just gives you the -S output, and the -D output all in one go.
>
> Sorry - zdb -DD only works for pools that are already dedup'd. If you
> want to get a measurement for a pool that is not already dedup'd, you
> have to use -S.

And since zdb -S runs for 2 hours and dumps core (without results), the
correct answer remains:

    zdb -bb poolname | grep 'bp count'

as was given in the summary. The theoretical output of zdb -S may be
superior if you have a version that works, but I haven't seen anyone
mention on-list which version(s) that is, or if/how it can be obtained,
short of recompiling it yourself.
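For completeness, that fallback wraps up the same way as the -D method
above. The pool name is hypothetical, and the assumption that the "bp
count" line is colon-separated is mine - check your zdb output:

    # Size the DDT from the block-pointer count that 'zdb -bb' reports.
    import subprocess

    out = subprocess.check_output(["zdb", "-bb", "tank"])  # hypothetical pool
    for line in out.decode().splitlines():
        if "bp count" in line:
            blocks = int(line.split(":")[1].replace(",", "").strip())
            # worst case: every block unique, one ~320-byte DDT entry each
            print("%d blocks -> ~%.1f GiB of DDT"
                  % (blocks, blocks * 320 / 2.0**30))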
Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements
On May 6, 2011, at 3:24 AM, Erik Trimble <erik.trim...@oracle.com> wrote:
> On 5/6/2011 1:37 AM, casper@oracle.com wrote:
>>> On 06-05-11 05:44, Richard Elling wrote:
>>>> As the size of the data grows, the need to have the whole DDT in
>>>> RAM or L2ARC decreases. With one notable exception, destroying a
>>>> dataset or snapshot requires the DDT entries for the destroyed
>>>> blocks to be updated. This is why people can go for months or years
>>>> and not see a problem, until they try to destroy a dataset.
>>> So what you are saying is: you, with your RAM-starved system, don't
>>> even try to start using snapshots on that system. Right?
>> I think it's more like: don't use dedup when you don't have the RAM.
>> (It is not possible to not use snapshots in Solaris; they are used
>> for everything.) :-)
>>
>> Casper
>
> Casper and Richard are correct - RAM starvation seriously impacts
> snapshot or dataset deletion when a pool has dedup enabled.
>
> The reason behind this is that ZFS needs to scan the entire DDT to
> check whether it can actually delete each block in the to-be-deleted
> snapshot/dataset, or whether it just needs to update the dedup
> reference count.

AIUI, the issue is not that the DDT is scanned; it is an AVL tree for a
reason. The issue is that each reference update means that one small bit
of data is changed. If the reference is not already in the ARC, then a
small, probably random read is needed. If you have a typical consumer
disk, especially a green disk, and have not tuned zfs_vdev_max_pending,
then that itty-bitty read can easily take more than 100 milliseconds(!).
Consider that you can have thousands or millions of reference updates to
do during a zfs destroy, and the math gets ugly. This is why fast SSDs
make good dedup candidates.

> If it can't store the entire DDT in either the ARC or L2ARC, it will
> be forced to do considerable I/O to disk, as it brings in the
> appropriate DDT entry. Worst case, insufficient ARC/L2ARC space can
> increase deletion times by many orders of magnitude - e.g. days,
> weeks, or even months to do a deletion.

I've never seen months, but I have seen days, especially for low-perf
disks.

> If dedup isn't enabled, snapshot and data deletion is very light on
> RAM requirements, and generally won't need to do much (if any) disk
> I/O. Such deletion should take milliseconds to a minute or so.

Yes, perhaps a bit longer for recursive destruction, but everyone here
knows recursion is evil, right? :-)
 -- richard
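To see just how ugly the math gets, a hypothetical worked example. The
update count and miss rate are invented for illustration; the 100 ms
figure is Richard's worst case above:

    # Back-of-the-envelope 'zfs destroy' time when DDT lookups miss the ARC.
    updates   = 5 * 10**6   # hypothetical references to update in the destroy
    miss_rate = 0.8         # assumed fraction of DDT lookups that go to disk
    latency   = 0.100       # ~100 ms per random read, untuned green disk

    seconds = updates * miss_rate * latency
    print("%.0f hours (~%.1f days)" % (seconds / 3600, seconds / 86400.0))
    # ~111 hours, about 4.6 days, for a destroy that is near-free without dedup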
Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements
On 5/6/2011 5:46 PM, Richard Elling wrote:
> On May 6, 2011, at 3:24 AM, Erik Trimble <erik.trim...@oracle.com> wrote:
>> Casper and Richard are correct - RAM starvation seriously impacts
>> snapshot or dataset deletion when a pool has dedup enabled. The reason
>> behind this is that ZFS needs to scan the entire DDT to check whether
>> it can actually delete each block in the to-be-deleted
>> snapshot/dataset, or whether it just needs to update the dedup
>> reference count.
> AIUI, the issue is not that the DDT is scanned; it is an AVL tree for a
> reason. The issue is that each reference update means that one small
> bit of data is changed. If the reference is not already in the ARC,
> then a small, probably random read is needed. If you have a typical
> consumer disk, especially a green disk, and have not tuned
> zfs_vdev_max_pending, then that itty-bitty read can easily take more
> than 100 milliseconds(!). Consider that you can have thousands or
> millions of reference updates to do during a zfs destroy, and the math
> gets ugly. This is why fast SSDs make good dedup candidates.

Just out of curiosity - I'm assuming that a delete works like this:

(1) Find the list of blocks associated with the file to be deleted.
(2) Using the DDT, find out if any other files are using those blocks.
(3) Delete/update any metadata associated with the file (dirents, ACLs,
    etc.).
(4) For each block in the file:
    (4a) if the DDT indicates there ARE other files using this block,
         update the DDT entry to change the refcount;
    (4b) if the DDT indicates there AREN'T any other files, move the
         physical block to the free list, and delete the DDT entry.

In a bulk delete scenario (not just snapshot deletion), I'd presume #1
above almost always causes a random I/O request to disk, as all the
relevant metadata for every to-be-deleted file is unlikely to be stored
in the ARC. If you can't fit the DDT in ARC/L2ARC, #2 above would
require you to pull in the remainder of the DDT info from disk, right?
#3 and #4 can be batched up, so they don't hurt that much.

Is that a (roughly) correct deletion methodology? Or can someone give a
more accurate view of what's actually going on?

>> If it can't store the entire DDT in either the ARC or L2ARC, it will
>> be forced to do considerable I/O to disk, as it brings in the
>> appropriate DDT entry. Worst case, insufficient ARC/L2ARC space can
>> increase deletion times by many orders of magnitude - e.g. days,
>> weeks, or even months to do a deletion.
> I've never seen months, but I have seen days, especially for low-perf
> disks.

I've seen an estimate of 5 weeks for removing a snapshot on a 1 TB dedup
pool made up of 1 disk. Not an optimal setup. :-)

>> If dedup isn't enabled, snapshot and data deletion is very light on
>> RAM requirements, and generally won't need to do much (if any) disk
>> I/O. Such deletion should take milliseconds to a minute or so.
> Yes, perhaps a bit longer for recursive destruction, but everyone here
> knows recursion is evil, right? :-)
> -- richard

You, my friend, have obviously never worshipped at the Temple of the
Lambda Calculus, nor been exposed to the Holy Writ that is Structure and
Interpretation of Computer Programs
(http://mitpress.mit.edu/sicp/full-text/book/book.html). I sentence you
to a semester of 6.001 problem sets, written by Prof. Sussman sometime
in the 1980s.

(Yes, I went to MIT.)

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
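Erik's four steps, restated as pseudocode. This is a sketch of the logic
as described in his mail, not the actual ZFS implementation; every name
in it is invented for illustration:

    # Pseudocode for the dedup-aware delete sequence described above.
    def delete_file(f, ddt, free_list):
        blocks = f.block_list()               # step 1: random metadata reads
        f.delete_metadata()                   # step 3: dirents, ACLs (batched)
        for blk in blocks:                    # step 4
            entry = ddt.lookup(blk.checksum)  # step 2: disk read on ARC miss
            if entry.refcount > 1:
                entry.refcount -= 1           # 4a: others still use the block
            else:
                free_list.append(blk)         # 4b: last ref; free the block
                ddt.remove(entry)

The per-block ddt.lookup() in the loop is where an undersized ARC/L2ARC
hurts: with dedup off, the whole loop collapses to appending blocks to
the free list, which is why non-dedup deletion stays cheap.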