Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Ian Collins
>
> Add to that: if running dedup, get plenty of RAM and cache.

Add plenty of RAM. And tweak your arc_meta_limit. You can at least get dedup performance that's on the same order of magnitude as performance without dedup.

Cache devices don't really help dedup very much, because each DDT entry stored in ARC/L2ARC takes 376 bytes, and each reference to an L2ARC entry requires 176 bytes of ARC. So in order to prevent an individual DDT entry from being evicted to disk, you must either keep the 376 bytes in ARC, or evict it to L2ARC and keep 176 bytes in ARC. That is a very small payload. A good payload would be to evict a 128k data block from ARC into L2ARC, keeping only the 176 bytes in ARC.

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
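The per-entry figures quoted above (376 bytes per DDT entry held in ARC, 176 bytes of ARC per entry evicted to L2ARC) can be turned into a quick memory estimator. This is a rough sketch; the example pool size and block size are illustrative assumptions, not figures from the thread:

```python
# Estimate ARC memory pinned by the DDT, using the per-entry sizes
# quoted in the message: 376 bytes for an entry held in ARC, and
# 176 bytes of ARC for each entry evicted to L2ARC.

def ddt_arc_bytes(unique_blocks, frac_in_arc=1.0):
    """ARC bytes consumed when frac_in_arc of the DDT stays in ARC
    and the rest is evicted to L2ARC (each still costing 176 bytes)."""
    in_arc = int(unique_blocks * frac_in_arc)
    in_l2arc = unique_blocks - in_arc
    return in_arc * 376 + in_l2arc * 176

# Illustrative example: a 10 TiB pool of fully unique 128 KiB blocks.
blocks = 10 * 2**40 // (128 * 2**10)
print("all entries in ARC:   %.1f GiB" % (ddt_arc_bytes(blocks) / 2**30))
print("all evicted to L2ARC: %.1f GiB" % (ddt_arc_bytes(blocks, 0.0) / 2**30))
```

Either way the DDT's ARC footprint is within a small factor of the same size, which is the point being made: L2ARC only shrinks the per-entry cost from 376 to 176 bytes, rather than freeing the RAM.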
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
Edward Ned Harvey wrote:
> So I'm getting comparisons of write speeds for 10G files, sampling at 100G intervals. For a 6x performance degradation, it would be 7 sec to write without dedup, and 40-45 sec to write with dedup.

For a totally unscientific data point:

The HW: Supermicro server motherboard, Intel 920 CPU, 6 GB memory, 1 x 16 GB SSD as a boot device, 8 x 2 TB green (5400 RPM?) hard drives. The disks are configured with 3 equal-size partitions: all p1's in one raidz2 pool, all p2's in another, all p3's in another. (Done to improve performance by limiting head movement when most of the disk activity is in one pool.)

The SW: the last release of OpenSolaris. (Current at the time; I have since moved to Solaris 11.)

The test: back up an almost-full 750 GB external hard disk formatted as a single NTFS volume. The disk was connected via eSATA to a fast computer (also a Supermicro + i920) running Ubuntu. The Ubuntu machine had access to the file server via NFS. The NFS-exported file system was created new for this backup, with dedup enabled, encryption and compression disabled, atime=off. This was the first (and last) time I tried enabling dedup.

From previous similar transfers (without dedup), I expected the backup to be finished in a few hours overnight, with the bottlenecks being the NTFS-3G driver in Ubuntu and the 100 Mbit ethernet connection. It took more than EIGHT DAYS, without any other activity going on on either machine.

(My) conclusion: running low on storage? Get more/bigger disks.

-- Roberto Waltman
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
On Jul 9, 2011 1:56 PM, Edward Ned Harvey opensolarisisdeadlongliveopensola...@nedharvey.com wrote:
> Given the abysmal performance, I have to assume there is a significant number of overhead reads or writes in order to maintain the DDT for each actual block write operation. Something I didn't mention in the other email is that I also tracked iostat throughout the whole operation. It's all writes (or at least 99.9% writes). So I am forced to conclude it's a bunch of small DDT maintenance writes taking place and incurring access time penalties in addition to each intended single block access time penalty. The nature of the DDT is that it's a bunch of small blocks that tend to be scattered randomly and require maintenance in order to do anything else. This sounds like precisely the usage pattern that benefits from low-latency devices such as SSDs.

The DDT should be written to in COW fashion, and asynchronously, so there should be no access time penalty. Or so ISTM it should be.

Dedup is necessarily slower for writing because of the deduplication table lookups. Those are synchronous lookups, but for async writes you'd think that total write throughput would only be affected by (a) the additional read load (which is zero in your case) and (b) any inability to put together large transactions due to the high latency of each logical write. But (b) shouldn't happen, particularly if the DDT fits in RAM or L2ARC, as it does in your case.

So, at first glance, my guess is that ZFS is leaving dedup write performance on the table, most likely due to implementation reasons, not design reasons.

Nico --
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
On 07/25/11 04:21 AM, Roberto Waltman wrote:
> [...]
> From previous similar transfers (without dedup), I expected the backup to be finished in a few hours overnight ... It took more than EIGHT DAYS, without any other activity going on on either machine.
> (My) conclusion: running low on storage? Get more/bigger disks.

Add to that: if running dedup, get plenty of RAM and cache.

I'm still seeing similar performance on my test system with and without dedup enabled. Snapshot deletion appears slightly slower, but I have yet to run timed tests.

-- Ian.
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
On 07/10/11 04:04 AM, Edward Ned Harvey wrote:
> There were a lot of useful details put into the thread "Summary: Dedup and L2ARC memory requirements". Please refer to that thread as necessary... After much discussion leading up to that thread, I thought I had enough understanding to make dedup useful, but then in practice it didn't work out. Now I've done a lot more work on it, reduced it all to practice, and I finally feel I can draw up conclusions that are actually useful: I am testing on a Sun Oracle server, X4270, 1 Xeon 4-core 2.4 GHz, 24G RAM, 12 disks ea 2T SAS 7.2k rpm. Solaris 11 Express snv_151a.

Can you provide more details of your tests?

I'm currently testing a couple of slightly better configured X4270s (2 CPUs, 96 GB RAM and a flash accelerator card) using real data from an existing server. So far, I haven't seen the levels of performance fall-off you report. I currently have about 5 TB of uncompressed data in the pool (a stripe of 5 mirrors) and throughput is similar to the existing Solaris 10 servers. The pool dedup ratio is 1.7, so there's a good mix of unique and duplicate blocks.

-- Ian.
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
> From: Ian Collins [mailto:i...@ianshome.com]
> Sent: Saturday, July 23, 2011 4:02 AM
>
> Can you provide more details of your tests?

Here's everything: http://dl.dropbox.com/u/543241/dedup%20tests/dedup%20tests.zip In particular, look under the "work server" directory.

The basic concept goes like this: Find some amount of data that takes approx 10 sec to write. I don't know the size; I just kept increasing a block counter till I got times I felt were reasonable, so let's suppose it's 10G. Then:

1. Time writing that much without dedup (all unique). Remove the file.
2. Time writing that much with dedup (sha256, no verify) (all unique). Remove the file.
3. Write 10x that much with dedup (all unique). Don't remove the file.
4. Repeat.

So I'm getting comparisons of write speeds for 10G files, sampling at 100G intervals. For a 6x performance degradation, it would be 7 sec to write without dedup, and 40-45 sec to write with dedup. I am doing fflush() and fsync() at the end of every file write, to ensure results are not skewed by write buffering.
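The timing loop described above can be sketched roughly as follows. This is not Edward's actual test harness (that is in the linked zip); the file path and sizes here are placeholders, and the unique-data generation is a simplification:

```python
import os
import time

def timed_write(path, size_bytes, chunk=128 * 1024):
    """Write size_bytes of unique data, fflush+fsync, return elapsed seconds."""
    base = os.urandom(chunk)  # random base so dedup finds nothing to match
    start = time.time()
    with open(path, "wb") as f:
        written = 0
        counter = 0
        while written < size_bytes:
            # stamp a counter into each chunk so every block is unique
            buf = counter.to_bytes(8, "big") + base[8:]
            f.write(buf)
            written += len(buf)
            counter += 1
        f.flush()              # the fflush() step
        os.fsync(f.fileno())   # the fsync() step: don't let buffering skew timing
    return time.time() - start

# placeholder size: a real run would use whatever takes ~10 s to write
elapsed = timed_write("/tmp/dedup_test.bin", 16 * 1024 * 1024)
print("wrote 16 MiB in %.3f s" % elapsed)
```

Run once on a dataset with dedup=off and once with dedup=sha256 and compare the elapsed times; the fsync before stopping the clock is what makes the comparison honest.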
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
On 15-07-11 04:27, Edward Ned Harvey wrote:
> Is anyone from Oracle reading this? I understand if you can't say what you're working on and stuff like that. But I am merely hopeful this work isn't going into a black hole... Anyway. Thanks for listening (I hope.) ttyl

If they aren't, maybe someone from an open source Solaris version is :)

-- No part of this copyright message may be reproduced, read or seen, dead or alive or by any means, including but not limited to telepathy, without the benevolence of the author.
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
If you clone zones from a golden image using ZFS cloning, you get fast, efficient dedup for free. Sparse root always was a horrible hack!

- Reply message -
From: Jim Klimov jimkli...@cos.ru
To: Cc: zfs-discuss@opensolaris.org
Subject: [zfs-discuss] Summary: Dedup memory and performance (again, again)
Date: Tue, Jul 12, 2011 14:05

This dedup discussion (and my own bad experience) have also left me with another grim thought: some time ago, sparse-root zone support was ripped out of OpenSolaris. Among the published rationales were the transition to IPS and the assumption that most people used sparse roots to save on disk space (the notion of saving RAM on shared objects was somehow dismissed). Regarding the disk savings, it was said that dedup would solve the problem, at least for those systems which use dedup on the zoneroot dataset (and preferably that would be in the rpool, too).

On one hand, storing zoneroots in the rpool was never practical for us because we tend to keep the rpool small and un-clobbered; on the other hand, adding dedup to the rpool now would seem like shooting oneself in the foot with a salt-loaded shotgun. Maybe it won't kill, but it would hurt a lot and for a long time. On the third hand ;) with a small rpool hosting zoneroots as well, the DDT would reasonably be small too, and might actually boost performance while saving space. But lots of attention should then be paid to separating /opt, parts of /var and such into delegated datasets from a larger data pool. And software like Sun JES, which installs into a full-root zone's /usr, might overwhelm a small rpool as well.

Anyhow, Edward, is there a test for this scenario - i.e. a 10Gb pool with lots of non-unique data in small blocks?

Thanks, //Jim
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
2011-07-15 11:10, phil.har...@gmail.com wrote:
> If you clone zones from a golden image using ZFS cloning, you get fast, efficient dedup for free. Sparse root always was a horrible hack!

Sounds like a holy war is flaming up ;)

From what I heard, sparse-root zones with shared common system libraries allowed savings not only on disk space but also on RAM. Can't vouch; I never tested extensively myself.

Cloning of golden zones is of course used in our systems. But this approach breaks badly upon any major system update (i.e. LiveUpgrade to a new release): many of the binaries change, and you either suddenly have the zones (wanting to) consume many gigabytes of disk space which are not there on a small rpool or a busy data pool, or you have to make a new golden image, clone a new set of zones and reinstall/migrate all applications and settings. True, this is a no-brainer for zones running a single task, like an /opt/tomcat directory which can be tossed around to any OS, but it becomes tedious for software with many packages and complicated settings, especially if (in some extremity) it was homebrewn and/or home-compiled and unpackaged ;)

I am not the first (or probably the last) to write about the inconvenience of zone upgrades, which loses the cloning benefit, and much of the same is true for upgrading cloned/deduped VM golden images as well, where the golden image is just some common baseline OS but the clones all run different software. And it is this different software which makes them useful and unique, and too distinct to maintain a dozen golden images efficiently (i.e. there might be just 2 or 3 clones of each gold).
But in general, the problem is there: you either accept that your OS images in effect won't be deduped, much or at all, after some lifespan involving OS upgrades; or you don't update them often (which may be unacceptable for security-conscious and/or paranoid deployments); or you use some trickery to update frequently and not lose much disk space, such as automating the migration of software and configs from one clone (of the old gold) to another clone (of the new gold). Dedup was a promising variant in this context, unless it kills performance and/or stability... which was the subject of this thread, with Edward's research into the performance of the current dedup implementation (and perhaps some baseline to test whether real improvements appear in the future).

And in terms of performance there's some surprise in Edward's findings regarding, e.g., reads from the deduped data. For infrequent maintenance (e.g. monthly upgrades), zoneroots (the OS image part) would be read-mostly, and the write performance of dedup may not matter much. If the updates must pour in often for whatever reason, then the write and delete performance of dedup may begin to matter.

Sorry about straying the discussion into zones. They, their performance, and coping with the changes introduced during their lifetime (see OS upgrades) are one good example for a discussion of dedup, and one application of it which may be commonly useful on any server or workstation, not only on hardware built for dedicated storage. Sparse-root vs. full-root zones, or disk images of VMs; whether they are stuffed in one rpool or spread between rpool and data pools - that detail is not actually the point of the thread. The actual usability of dedup for savings and gains on these tasks (preferably working also on low- to mid-range boxes, where adding a good enterprise SSD would double the server cost, not only on big systems with tens of GB of RAM), and hopefully simplifying system configuration and maintenance - that is indeed the point in question.
//Jim
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
On Fri, Jul 15, 2011 at 5:19 AM, Jim Klimov jimkli...@cos.ru wrote:
> 2011-07-15 11:10, phil.har...@gmail.com wrote:
>> If you clone zones from a golden image using ZFS cloning, you get fast, efficient dedup for free. Sparse root always was a horrible hack!
>
> Sounds like a holy war is flaming up ;)
> From what I heard, sparse-root zones with shared common system libraries allowed savings not only on disk space but also on RAM. Can't vouch; never tested extensively myself.

There may be some benefit to that; I'd argue that most of the time there's not that much. Using what is surely an imperfect way of measuring, I took a look at a zone on a Solaris 10 box that I happen to be logged into. I found it is using about 52 MB of memory in mappings of executables and libraries. By disabling webconsole (a Java program that has an RSS of 100+ MB), the shared mappings drop to 40 MB.

# cd /proc
# pmap -xa * | grep r.x | grep -v ' anon ' | grep -v ' stack ' | grep -v ' heap ' | sort -u | nawk '{ t += $3 } END { print t / 1024, "MB" }'
pmap: cannot examine 22427: system process
40.3281 MB

If you are running the same large application (large executable + libraries resident in memory) in many zones, you may see additional benefit.

Solaris 10 was released in 2005, meaning that sparse root zones were conceived sometime in the years leading up to that. Since then, entry-level servers have gone from 1-2 GB of memory (e.g. a V210 or V240) to 12-16+ GB of memory (X2270 M2, T3-1). Further, large systems tend to have NUMA characteristics that challenge the logic of trying to maintain only one copy of hot read-only executable pages. It just doesn't make sense to constrain the design of zones around something that is going to save 0.3% of the memory of an entry-level server. Even in 2005, I'm not so sure it was a strong argument.

Disk space is another issue. Jim does a fine job of describing the issues around that.

> Cloning of golden zones is of course used in our systems.
> [... the remainder of Jim's message, quoted in full ...]
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
On 12-07-11 13:40, Jim Klimov wrote:
> Even if I batch background RM's so a hundred processes hang and then they all at once complete in a minute or two.

Hmmm. I only run one rm process at a time. You think running more processes at the same time would be faster?
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
2011-07-14 11:54, Frank Van Damme wrote:
> Hmmm. I only run one rm process at a time. You think running more processes at the same time would be faster?

Yes, quite often it seems so. Whenever my slow dcpool decides to accept a write, it processes a hundred pending deletions instead of one ;)

Even so, it took quite a few pool or iSCSI hangs, then reboots of both server and client, and about a week overall, to remove a 50Gb dir with 400k small files from a deduped pool served over iSCSI from a volume in a physical pool. Just completed this night ;)
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
On 14-07-11 12:28, Jim Klimov wrote:
> Yes, quite often it seems so. Whenever my slow dcpool decides to accept a write, it processes a hundred pending deletions instead of one ;)

It seems counter-intuitive - you'd say concurrent disk access makes things only slower - but it turns out to be true. I'm deleting a dozen times faster than before. How completely ridiculous. Thank you :-)
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
2011-07-14 15:48, Frank Van Damme wrote:
> It seems counter-intuitive - you'd say concurrent disk access makes things only slower - but it turns out to be true. I'm deleting a dozen times faster than before. How completely ridiculous. Thank you :-)

Well, look at it this way: it is not only about singular disk accesses (i.e., unlike other FSes, you do not modify a directory entry in place); with ZFS COW it is about rewriting a tree of block pointers, with any new writes going into free (unreferenced ATM) disk blocks anyway. So by hoarding writes you have a chance to reduce the mechanical IOPS required for your tasks. Until you run out of RAM ;)

Just in case it helps - to quickly fire up removals of the specific directory after yet another reboot of the box, without overwhelming it with hundreds of thousands of queued rm processes either - I made this script as /bin/RM (reconstructed here; the mail mangled the shell operators, and the rm is backgrounded so removals queue up):

===
#!/bin/sh
SLEEP=10
[ x"$1" != x ] && SLEEP="$1"
A=0
# To rm small files first: find ... -size -10
find /export/OLD/PATH/TO/REMOVE -type f | while read LINE; do
    du -hs "$LINE"
    rm -f "$LINE" &
    A=$(($A+1))
    [ $A -ge 100 ] && ( date
        while [ `ps -ef | grep -wc rm` -gt 50 ]; do
            echo "Sleep $SLEEP..."; ps -ef | grep -wc rm
            sleep "$SLEEP"; ps -ef | grep -wc rm
        done
        date )
    A=`ps -ef | grep -wc rm`
done; date
===

Essentially, after firing up 100 backgrounded rm attempts, it waits for the rm process count to go below 50, then goes on. Sizing may vary between systems, the phase of the moon and the computer's attitude. Sometimes I had 700 processes stacked and processed quickly. Sometimes it hung on 50...

HTH, //Jim
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
um, this is what xargs -P is for ...

-- Dan.

On Thu, Jul 14, 2011 at 07:24:52PM +0400, Jim Klimov wrote:
> [Jim's explanation and RM throttling script, quoted in full]
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Edward Ned Harvey

I understand the argument: the DDT must be stored in the primary storage pool so you can increase the size of the storage pool without running out of space to hold the DDT... But it's a fatal design flaw as long as you care about performance. If you don't care about performance, you might as well use the NetApp and do offline dedup. The point of online dedup is to gain performance, so in ZFS you have to care about performance.

There are only two possible ways to fix the problem. Either...

The DDT must be changed so it can be stored entirely in a designated sequential area of disk, and maintained entirely in RAM, so all DDT reads/writes can be infrequent and serial in nature. This would solve the case of async writes and large sync writes, but would still perform poorly for small sync writes. And it would be memory intensive. But it should perform very nicely given those limitations. ;-)

Or...

The DDT stays as it is now (highly scattered small blocks), and there needs to be an option to store it entirely on low-latency devices such as dedicated SSDs, eliminating the need for the DDT to reside on the slow primary storage pool disks. I understand you must consider what happens when the dedicated SSD gets full. The obvious choices would be either (a) dedup turns off whenever the "metadata device" is full, or (b) it defaults to writing blocks in the main storage pool. Maybe that could even be a configurable behavior. Either way, there's a very realistic use case here. For some people in some situations, it may be acceptable to say: "I have a 32G mirrored metadata device; divided by 137 bytes per entry, I can dedup up to a maximum of 218M unique blocks in the pool, and if I estimate a 100K average block size, that means up to 20T of primary pool storage. If I reach that limit, I'll add more metadata device."
Both of those options would also go a long way toward eliminating the surprise "delete performance black hole."

Is anyone from Oracle reading this? I understand if you can't say what you're working on and stuff like that. But I am merely hopeful this work isn't going into a black hole... Anyway. Thanks for listening (I hope.) ttyl
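Edward's back-of-the-envelope sizing can be sketched as arithmetic. The 137 bytes per entry and 100K average block size are his figures (not authoritative ZFS constants); note the exact arithmetic lands near 234M entries and 24T rather than his rounder 218M/20T:

```python
# Rough capacity estimate for a dedicated DDT "metadata device",
# using the figures from the message: 137 bytes per on-disk DDT entry
# and a 100 KB average block size. Both are the poster's assumptions.

def dedup_capacity(ssd_bytes, entry_bytes=137, avg_block_bytes=100 * 1024):
    entries = ssd_bytes // entry_bytes       # max unique blocks the device tracks
    pool_bytes = entries * avg_block_bytes   # primary pool storage covered
    return entries, pool_bytes

entries, pool = dedup_capacity(32 * 10**9)   # a 32G mirrored device
print("unique blocks trackable: %.0fM" % (entries / 1e6))
print("primary pool coverage:   %.1fT" % (pool / 1e12))
```

The useful property of this scheme is that the limit is explicit: when the pool outgrows the coverage figure, you add more metadata device rather than silently spilling DDT blocks onto slow pool disks.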
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
2011-07-15 6:21, Daniel Carosone wrote:
> um, this is what xargs -P is for ...

Thanks for the hint. True, I don't often use xargs. However, from the man pages I don't see a -P option on OpenSolaris boxes of different releases; there is only a -p (prompt) mode. I am not eager to enter "yes" 40 times ;)

The way I had this script in practice, I could enter RM once and it worked till the box hung. Even then, a watchdog script could often have it rebooted without my interaction, so it could continue in the next lifetime ;)

//Jim
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
On Fri, Jul 15, 2011 at 07:56:25AM +0400, Jim Klimov wrote:
> 2011-07-15 6:21, Daniel Carosone wrote:
>> um, this is what xargs -P is for ...
> Thanks for the hint. True, I don't often use xargs. However, from the man pages I don't see a -P option on OpenSolaris boxes of different releases; there is only a -p (prompt) mode. I am not eager to enter "yes" 40 times ;)

You want the /usr/gnu/{bin,share/man} version, at least in this case.

-- Dan.
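With GNU xargs available, the hand-rolled throttling script reduces to a one-liner. A sketch (the target path is a placeholder, and -P/-0/-print0 are GNU extensions, hence the /usr/gnu/bin note above):

```shell
# parallel_rm DIR: delete all files under DIR with up to 50 concurrent
# rm processes, batching 100 filenames per rm invocation.
# -print0 / -0 keep filenames with spaces or newlines safe.
parallel_rm() {
    find "$1" -type f -print0 | xargs -0 -n 100 -P 50 rm -f
}

# usage: parallel_rm /export/OLD/PATH/TO/REMOVE
```

This gets the same write-batching effect Jim observed (many deletions pending per transaction group) without counting rm processes by hand.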
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
2011-07-09 20:04, Edward Ned Harvey wrote:
> --- Performance gain: Unfortunately there was only one area where I found any performance gain. When you read back duplicate data that was previously written with dedup, you get a lot more cache hits, and as a result the reads go faster. Unfortunately these gains are diminished... I don't know by what... But you only have about a 2x to 4x performance gain reading previously dedup'd data, as compared to reading the same data which was never dedup'd. Even when repeatedly reading the same file which is 100% duplicate data (created by dd from /dev/zero), so all the data is 100% in cache... I still see only a 2x to 4x performance gain with dedup.

First of all, thanks for all the experimental research and results, even if the outlook is grim. I'd love to see comments about those systems which use dedup and actually gain benefits, and how much they gain (i.e. VM farms, etc.), and what may differ in terms of setup (i.e. at least 256Gb RAM or whatever).

Hopefully the discrepancy between the blissful hopes (I had) - that dedup would save disk space and boost the system somewhat, kinda like online compression can do - and cruel reality would result in some improvement project. Perhaps it would be an offline dedup implementation (perhaps with an online-dedup option that can be turned off), as recently discussed on the list.

Deleting stuff is still a pain, though. For the past week my box has been trying to delete an rsynced backup of a Linux machine, some 300k files summing up to 50Gb. Deleting large files was rather quick, but those consuming just a few blocks are really slow. Even if I batch background rm's so a hundred processes hang and then all complete at once in a minute or two. And quite often the iSCSI initiator or target goes crazy, so one of the boxes (or both) has to be rebooted, about thrice a day.
I described my setup before, won't clobber it into here ;)

Regarding the low read performance gain, you suggested in a later post that this could be due to the RAM and disk bandwidth difference in your machine. I for one think that (without sufficient ARC block-caching) dedup reading would also suffer greatly from fragmentation - any one large file with some or all deduped data is basically guaranteed to have its blocks scattered across all of your storage. At least if this file was committed to the deduped pool late in its life, when most or all of the blocks were already there.

By the way, did you estimate how much is dedup's overhead in terms of metadata blocks? For example, it was often said on the list that you shouldn't bother with dedup unless your data can be deduped 2x or better, and if you're lucky to already have it on ZFS you can estimate the reduction with zdb. Now, I wonder where the number comes from - is it empirical, or would dedup metadata take approx 1x the data space, thus under 2x reduction you gain little or nothing? ;)

Thanks for the research,
//Jim
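The zdb estimate mentioned above refers to its simulated-dedup report; a minimal sketch, assuming a pool named "tank" (placeholder name):

```shell
# zdb -S walks an existing pool (dedup need not be enabled) and prints the
# DDT histogram it *would* have, ending with an overall "dedup = N.NN"
# ratio estimate. Read-only, but it can run for a long time on a big pool.
zdb -S tank
```

If the reported ratio comes out well under 2x, the thread's rule of thumb says dedup is unlikely to be worth its metadata overhead on that data.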
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Jim Klimov
> By the way, did you estimate how much is dedup's overhead in terms of metadata blocks? For example, it was often said on the list that you shouldn't bother with dedup unless your data can be deduped 2x or better, and if you're lucky to already have it on ZFS you can estimate the reduction with zdb. Now, I wonder where the number comes from - is it empirical, or would dedup metadata take approx 1x the data space, thus under 2x reduction you gain little or nothing? ;)

You and I seem to have different interpretations of the empirical 2x soft-requirement to make dedup worthwhile. I always interpreted it like this: if read/write of DUPLICATE blocks with dedup enabled yields a 4x performance gain, and read/write of UNIQUE blocks with dedup enabled yields a 4x performance loss, then you need a 50/50 mix of unique and duplicate blocks in the system in order to break even. This is the same as having a 2x dedup ratio. Unfortunately, based on this experience, I would now say something like a dedup ratio of 10x is more likely the break-even point.

Ideally, read/write of unique blocks should be just as fast with or without dedup. Ideally, read/write of duplicate blocks would be an order of magnitude (or more) faster with dedup. It's not there right now... But I still have high hopes.

You know what? A year ago I would have said dedup still wasn't stable enough for production. Now I would say it's plenty stable enough... But it needs performance enhancement before it's truly useful for most cases.
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
This dedup discussion (and my own bad experience) has also left me with another grim thought: some time ago sparse-root zone support was ripped out of OpenSolaris. Among the published rationales were the transition to IPS and the assumption that most people used them to save on disk space (the notion of saving RAM on shared objects was somehow dismissed). Regarding the disk savings, it was said that dedup would solve the problem, at least for those systems which use dedup on the zoneroot dataset (and preferably that would be in the rpool, too).

On one hand, storing zoneroots in the rpool was never practical for us because we tend to keep the rpool small and un-clobbered; on the other hand, now adding dedup to the rpool would seem like shooting oneself in the foot with a salt-loaded shotgun. Maybe it won't kill, but it would hurt a lot and for a long time.

On the third hand ;) with a small rpool hosting zoneroots as well, the DDT would reasonably be small too, and may actually boost performance while saving space. But lots of attention should now be paid to separating /opt, parts of /var and such into delegated datasets from a larger data pool. And software like Sun JES, which installs into a full-root zone's /usr, might overwhelm a small rpool as well.

Anyhow, Edward, is there a test for this scenario - i.e. a 10Gb pool with lots of non-unique data in small blocks?

Thanks,
//Jim
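For what it's worth, the test data for such a scenario - many small, fully duplicate files, as with cloned zone roots - is easy to generate. A sketch on a scratch directory (in a real test this directory would sit on a dataset of the small deduped pool):

```shell
# Generate 200 identical 4 KiB files: every data block is a duplicate, so
# this models a DDT-heavy, small-block workload.
DIR=$(mktemp -d)   # stand-in for a dataset on the deduped test pool
i=0
while [ "$i" -lt 200 ]; do
  dd if=/dev/zero of="$DIR/file.$i" bs=4k count=1 2>/dev/null
  i=$((i + 1))
done
ls "$DIR" | wc -l    # prints 200
```

Scaling the loop count up until the pool approaches 10Gb would approximate the case Jim asks about.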
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
> You and I seem to have different interpretations of the empirical 2x soft-requirement to make dedup worthwhile.

Well, until recently I had little interpretation for it at all, so your approach may be better. I hope the authors of the requirement statement will step forward and explain what it is about under the hood, and why 2x ;)

> You know what? A year ago I would have said dedup still wasn't stable enough for production. Now I would say it's plenty stable enough... But it needs performance enhancement before it's truly useful for most cases.

Well, not that this would contradict you, but on my oi_148a (which may be based on code close to a year old), it seems rather unstable, with systems either freezing or slowing down after some writes and having to be rebooted in order to work (fresh after boot, writes are usually relatively good, i.e. 5Mb/s vs. 100k/s). On the iSCSI server side, the LUN and STMF service often lock up with "device busy" even though the volume pool/dcpool is not itself deduped. For me this is only solved by a reboot... And reboots of the VM client which fights its way through deleting files from the deduped datasets inside dcpool (imported over iSCSI) are beyond counting.

Actually, in a couple of weeks I might be passing by that machine and may have a chance to update it to oi_151-dev. Would that buy me any improvements, or potentially worsen my situation? ;)

//Jim
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
On Tue, 12 Jul 2011, Edward Ned Harvey wrote:
> You know what? A year ago I would have said dedup still wasn't stable enough for production. Now I would say it's plenty stable enough... But it needs performance enhancement before it's truly useful for most cases.

What has changed for you to change your mind? Did the zfs code change in the past year, or is this based on experience with the same old stagnant code?

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
From: Bob Friesenhahn [mailto:bfrie...@simple.dallas.tx.us]
Sent: Tuesday, July 12, 2011 9:58 AM
>> You know what? A year ago I would have said dedup still wasn't stable enough for production. Now I would say it's plenty stable enough... But it needs performance enhancement before it's truly useful for most cases.
> What has changed for you to change your mind? Did the zfs code change in the past year, or is this based on experience with the same old stagnant code?

No idea. I assume they've been patching, and I don't hear many people complaining of dedup instability on this list anymore. But the other option is that nothing's changed and only my perception has changed. I acknowledge that's possible.
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Edward Ned Harvey

--- Performance loss: I ran one more test that is rather enlightening. I repeated test #2 (tweak arc_meta_limit, use the default primarycache=all), but this time I wrote 100% duplicate data instead of unique. Dedup=sha256 (no verify). Ideally, you would expect this to write very, very fast... Because it's all duplicate data and it's all async, the system should just buffer a bunch of tiny metadata changes, aggregate them, and occasionally write a single serial block when it flushes the TXG. It should be much faster to write with dedup.

The results are: with dedup, it writes several times slower. Just the same as test #2, minus the amount of time it takes to write the actual data. For example, here's one datapoint, which is representative of the whole test:

time to write unique data without dedup:  7.090 sec
time to write unique data with dedup:    47.379 sec
time to write duplic data without dedup:  7.016 sec
time to write duplic data with dedup:    39.852 sec

This clearly breaks it down:
7 sec to write the actual data
40 sec overhead caused by dedup
1 sec is about how fast it should have been writing duplicate data
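The datapoints above come from timing large sequential writes with a flush at the end. A minimal harness in that spirit - the dataset creation and the dedup=sha256 vs. dedup=off setting happen outside the snippet, and the path and size are placeholders:

```shell
# Time writing N MiB of zeros (100% duplicate data) to a file, syncing at
# the end so buffered writes are counted, then report elapsed seconds.
write_test() {   # usage: write_test <file> <MiB>
  start=$(date +%s)
  dd if=/dev/zero of="$1" bs=1048576 count="$2" 2>/dev/null
  sync   # flush so the pool, not the page cache, absorbs the write
  end=$(date +%s)
  echo "$((end - start)) sec"
}
write_test /tmp/dup.bin 16
```

Running the same function against files on a deduped and a non-deduped dataset gives the four-way comparison shown in the datapoint.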
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
From: Roy Sigurd Karlsbakk [mailto:r...@karlsbakk.net]
Sent: Saturday, July 09, 2011 3:44 PM
>>> Could you test with some SSD SLOGs and see how well or badly the system performs?
>> These are all async writes, so a slog won't be used. Async writes that have a single fflush() and fsync() at the end to ensure system buffering is not skewing the results.
> Sorry, my bad, I meant L2ARC to help buffer the DDT

Oh - it just so happens I don't have one available, but that doesn't mean I can't talk about it. ;-)

For quite a lot of these tests, all the data resides in the ARC, period. The only area where the L2ARC would have an effect is after that region... When I'm pushing the limits of ARC, then there may be some benefit from the use of L2ARC. So... it is distinctly possible the L2ARC might help soften the brick wall. When reaching arc_meta_limit, some of the metadata might have been pushed out to L2ARC in order to leave a (slightly) smaller footprint in the ARC... I doubt it, but maybe there could be some gain here.

It is distinctly possible the L2ARC might help test #2 approach the performance of test #3 (test #2 had primarycache=all and suffered approx 10x write performance degradation, while test #3 had primarycache=metadata and suffered approx 6x write performance degradation). But there's positively no way the L2ARC would come into play on test #3. In this situation, all the metadata - the complete DDT - resides in RAM. So with or without the cache device, the best case we're currently looking at is approx 6x write performance degradation.
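For reference, the arc_meta_limit tweak used in these tests was typically applied at runtime with mdb on Solaris-era kernels. A sketch, with an illustrative (not recommended) 4 GiB value:

```shell
# Raise arc_meta_limit to 4 GiB (value in bytes, written as a 64-bit word).
# Requires root on the live kernel; the change does not survive a reboot.
echo "arc_meta_limit/Z 0x100000000" | mdb -kw
# Inspect the ARC counters (including arc_meta_used/arc_meta_limit):
echo "::arc" | mdb -k | grep -i meta
```

Since this pokes a live kernel variable, it is strictly a test-box maneuver; sizing it wrong can starve the ARC's data cache.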
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
From: Roy Sigurd Karlsbakk [mailto:r...@karlsbakk.net]
Sent: Saturday, July 09, 2011 3:44 PM
> Sorry, my bad, I meant L2ARC to help buffer the DDT

Also, bear in mind, the L2ARC is only for reads. So it can't help accelerate writing updates to the DDT. Those updates need to hit the pool, period.

Yes, on test 1 and test 2 there were significant regions where reads were taking place. (Basically the whole test, approx 25% to 30% reads.) On test 3, there were absolutely no reads up till 75M entries (9.07T used; arc_meta_used = 12960 MB). Up to this point, it was a 4x write performance degradation. Then it suddenly started performing about 5% reads and 95% writes, and suddenly jumped to a 6x write performance degradation.
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
> Given the abysmal performance, I have to assume there is a significant number of overhead reads or writes in order to maintain the DDT for each actual block write operation.

Something I didn't mention in the other email is that I also tracked iostat throughout the whole operation. It's all writes (or at least 99.9% writes). So I am forced to conclude it's a bunch of small DDT maintenance writes taking place, incurring access time penalties in addition to each intended single-block access time penalty.

The nature of the DDT is that it's a bunch of small blocks that tend to be scattered randomly and require maintenance in order to do anything else. This sounds like precisely the usage pattern that benefits from low-latency devices such as SSDs. I understand the argument - the DDT must be stored in the primary storage pool so you can increase the size of the storage pool without running out of space to hold the DDT... But it's a fatal design flaw as long as you care about performance. If you don't care about performance, you might as well use the NetApp and do offline dedup. The point of online dedup is to gain performance, so in ZFS you have to care about the performance.

There are only two possible ways to fix the problem. Either...

The DDT must be changed so it can be stored entirely in a designated sequential area of disk, and maintained entirely in RAM, so all DDT reads/writes can be infrequent and serial in nature. This would solve the case of async writes and large sync writes, but would still perform poorly for small sync writes. And it would be memory intensive. But it should perform very nicely given those limitations. ;-)

Or...

The DDT stays as it is now - highly scattered small blocks - and there needs to be an option to store it entirely on low-latency devices such as dedicated SSDs. Eliminate the need for the DDT to reside on the slow primary storage pool disks. I understand you must consider what happens when the dedicated SSD gets full.
The obvious choices would be either (a) dedup turns off whenever the metadata device is full, or (b) it defaults to writing blocks in the main storage pool. Maybe that could even be a configurable behavior. Either way, there's a very realistic use case here. For some people in some situations, it may be acceptable to say: "I have a 32G mirrored metadata device; divided by 137 bytes per entry, I can dedup up to a maximum of 218M unique blocks in the pool, and if I estimate a 100K average block size, that means up to 20T of primary pool storage. If I reach that limit, I'll add more metadata device."

Both of those options would also go a long way toward eliminating the surprise delete performance black hole.
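The sizing arithmetic above can be checked directly; the constants below are the ones quoted in the post (how close the result lands to the 218M/20T figures depends on GB-vs-GiB and rounding conventions, so treat it as an estimate of the same order, not an exact reproduction):

```shell
# metadata-device capacity / bytes per DDT entry = max dedupable unique
# blocks; multiplied by an assumed average block size = max pool data.
awk 'BEGIN {
  dev    = 32 * 2^30;   # 32 GiB mirrored metadata device
  entry  = 137;         # bytes per DDT entry, figure quoted in the post
  avgblk = 100e3;       # assumed 100K average block size
  blocks = dev / entry;
  printf "max unique blocks: %.0f million\n", blocks / 1e6;
  printf "max pool data:     %.1f TB\n", blocks * avgblk / 1e12;
}'
```

Swapping in the 376-byte in-ARC entry size mentioned earlier in the thread shrinks the block budget by almost 3x, which is why the per-entry figure matters so much for this kind of planning.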
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
> When you read back duplicate data that was previously written with dedup, then you get a lot more cache hits, and as a result, the reads go faster. Unfortunately these gains are diminished... I don't know by what... But you only have about a 2x to 4x performance gain reading previously dedup'd data, as compared to reading the same data which was never dedup'd. Even when repeatedly reading the same file which is 100% duplicate data (created by dd from /dev/zero), so all the data is 100% in cache... I still see only a 2x to 4x performance gain with dedup.

For what it's worth: I also repeated this without dedup. Created a large file (17G, just big enough that it will fit entirely in my ARC). Rebooted. Timed reading it. Now it's entirely in cache. Timed reading it again. When it's not cached, of course the read time was equal to the original write time. When it's cached, it goes 4x faster.

Perhaps this is only because I'm testing on a machine that has super fast storage... 11 striped SAS disks yielding 8Gbit/sec, as compared to all-RAM which yielded 31.2Gbit/sec. It seems in this case RAM is only 4x faster than the storage itself... But I would have expected a couple orders of magnitude... So perhaps my expectations are off, or the ARC itself simply incurs overhead. Either way, dedup is not to blame for obtaining merely a 2x or 4x performance gain over the non-dedup equivalent.
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
> When it's not cached, of course the read time was equal to the original write time. When it's cached, it goes 4x faster. Perhaps this is only because I'm testing on a machine that has super fast storage... 11 striped SAS disks yielding 8Gbit/sec, as compared to all-RAM which yielded 31.2Gbit/sec. It seems in this case RAM is only 4x faster than the storage itself... But I would have expected a couple orders of magnitude... So perhaps my expectations are off, or the ARC itself simply incurs overhead. Either way, dedup is not to blame for obtaining merely a 2x or 4x performance gain over the non-dedup equivalent.

Could you test with some SSD SLOGs and see how well or badly the system performs?

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin; in most cases adequate and relevant synonyms exist in Norwegian.
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
From: Roy Sigurd Karlsbakk [mailto:r...@karlsbakk.net]
Sent: Saturday, July 09, 2011 2:33 PM
> Could you test with some SSD SLOGs and see how well or badly the system performs?

These are all async writes, so a slog won't be used. Async writes that have a single fflush() and fsync() at the end to ensure system buffering is not skewing the results.
Re: [zfs-discuss] Summary: Dedup memory and performance (again, again)
> From: Roy Sigurd Karlsbakk [mailto:r...@karlsbakk.net]
> Sent: Saturday, July 09, 2011 2:33 PM
>> Could you test with some SSD SLOGs and see how well or badly the system performs?
> These are all async writes, so a slog won't be used. Async writes that have a single fflush() and fsync() at the end to ensure system buffering is not skewing the results.

Sorry, my bad, I meant L2ARC to help buffer the DDT

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/