Re: [zfs-discuss] ZFS deduplication ratio on Server 2008 backup VHD files
The problem is that the Windows Server backup seems to choose dynamic VHD (which would make sense in most cases) and I don't know if there is a way to change that. Using iSCSI volumes won't help in my case since the servers are running on physical hardware.

On 27.04.2010 01:54, Brandon High wrote:
> On Mon, Apr 26, 2010 at 8:51 AM, tim Kries <tim.kr...@gmx.de> wrote:
>> I am kinda confused over the change of dedup ratio from changing the record size, since it should dedup 256-bit blocks.
> Dedup works on the blocks of either recordsize or volblocksize. The checksum is made per block written, and those checksums are used to dedup the data. With a recordsize of 128k, two blocks with a one-byte difference would not dedup. With an 8k recordsize, 15 out of 16 blocks would dedup. Repeat over the entire VHD.
> Setting the record size equal to a multiple of the VHD's internal block size and ensuring that the internal filesystem is block aligned will probably help to improve dedup ratios. So for an NTFS guest with 4k blocks, use a 4k, 8k or 16k record size and ensure, when you install in the VHD, that its partitions are block aligned for the recordsize you're using.
> VHD supports fixed-size and dynamic-size images. If you're using a fixed image, the space is pre-allocated. This doesn't mean you'll waste unused space on ZFS with compression, since all those zeros will take up almost no space. Your VHD file should remain block-aligned, however. I'm not sure that a dynamic-size image will block align if there is empty space. Using compress=zle will only compress the zeros with almost no CPU penalty.
> Using a COMSTAR iSCSI volume is probably an even better idea, since you won't have the POSIX layer in the path, and you won't have the VHD file header throwing off your block alignment.
> -B

___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS deduplication ratio on Server 2008 backup VHD files
Tim.Kreis <tim.kr...@gmx.de> wrote:
> The problem is that the Windows Server backup seems to choose dynamic VHD (which would make sense in most cases) and I don't know if there is a way to change that. Using iSCSI volumes won't help in my case since the servers are running on physical hardware.

It should work well anyway, if you (a) fill up the server with memory and (b) reduce the block size to 8k or even less. But do (a) before (b). Dedup is very memory hungry.

roy
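Roy's "memory hungry" warning can be made concrete with some arithmetic. The sketch below estimates the RAM the dedup table (DDT) needs, using the commonly cited rule of thumb of roughly 320 bytes of core per unique block; the 320-byte figure is an assumption for illustration, not an exact number for every ZFS release, but it shows why step (a) must come before step (b): halving the block size doubles the DDT.

```python
def ddt_ram_bytes(pool_bytes, block_size, dedup_ratio=1.0, entry_bytes=320):
    """Rough DDT core-memory estimate: one entry per unique block.

    entry_bytes=320 is a rule-of-thumb figure, not an exact constant.
    """
    unique_blocks = pool_bytes / block_size / dedup_ratio
    return unique_blocks * entry_bytes

TB = 1024 ** 4
# 1 TB of data at 8k records needs 16x the DDT RAM of 128k records:
print(ddt_ram_bytes(1 * TB, 8 * 1024) / 1024 ** 3)    # 40.0 (GiB)
print(ddt_ram_bytes(1 * TB, 128 * 1024) / 1024 ** 3)  # 2.5 (GiB)
```

So dropping recordsize from 128k to 8k to improve the dedup hit rate multiplies the table that has to stay resident (or spill to L2ARC) by sixteen.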
Re: [zfs-discuss] ZFS deduplication ratio on Server 2008 backup VHD files
Hi Tim,

thanks for sharing your dedup experience. Especially for virtualization, having a good pool of experience will help a lot of people.

So you see a dedup ratio of 1.29 for two installations of Windows Server 2008 on the same ZFS backing store, if I understand you correctly. What dedup ratios do you see for the third, fourth and fifth server installation?

Also, maybe dedup is not the only way to save space. What compression rate do you get? And: have you tried setting up a Windows system, then setting up the next one based on a ZFS clone of the first one?

Hope this helps,
Constantin

On 04/23/10 08:13 PM, tim Kries wrote:
> Dedup is a key element for my purpose, because i am planning a central repository for like 150 Windows Server 2008 (R2) servers which would take a lot less storage if they dedup right.

--
Constantin Gonzalez, Sun Microsystems GmbH, Germany
Principal Field Technologist
Blog: constantin.glez.de  Tel.: +49 89/4 60 08-25 91  Twitter: @zalez
Sitz d. Ges.: Sun Microsystems GmbH, Sonnenallee 1, 85551 Kirchheim-Heimstetten
Amtsgericht Muenchen: HRB 161028, Geschaeftsfuehrer: Jürgen Kunz
Re: [zfs-discuss] ZFS deduplication ratio on Server 2008 backup VHD files
Hi,

The setting was this: fresh installation of 2008 R2 - server backup with the backup feature - move VHD to ZFS - install Active Directory role - backup again - move VHD to the same share.

I am kinda confused over the change of dedup ratio from changing the record size, since it should dedup 256-bit blocks. I have to set up the OpenSolaris again since it died in my VirtualBox (not sure why), so I can't test more server installations at the moment.

Compression seemed to work pretty well (I used gzip-6) and I think the compression ratio was ~4, but I don't think that would work well for productive systems, since you would need some serious CPU power to handle it. I will set up another test in a few hours.

Personally I am not sure if using clones is a good idea for Windows Server 2008, with all these problems around SIDs...

-- This message posted from opensolaris.org
Re: [zfs-discuss] ZFS deduplication ratio on Server 2008 backup VHD files
I found the VHD specification here: http://download.microsoft.com/download/f/f/e/ffef50a5-07dd-4cf8-aaa3-442c0673a029/Virtual%20Hard%20Disk%20Format%20Spec_10_18_06.doc

I am not sure if I understand it right, but it seems like data on disk gets compacted into the VHD (no empty space), so even a slight difference near the beginning of the file will shift everything after it and ruin the pattern for block-based dedup. As I am not an expert on file systems, it would be appreciated if someone with more expertise could look at this. It would be a real shame.
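The "slight shift ruins everything" suspicion is easy to demonstrate without parsing a real VHD. The sketch below (a toy model, not VHD-specific) checksums fixed-size blocks of the same payload twice: once starting on a block boundary, and once pushed forward by a 512-byte prefix, as a dynamic VHD's headers and compaction can do. Because dedup matches whole blocks by checksum, a sub-block shift makes every block differ.

```python
import hashlib
import random

def block_hashes(data, bs):
    """Checksum of each fixed-size block, the unit block-level dedup matches on."""
    return [hashlib.sha256(data[i:i + bs]).digest() for i in range(0, len(data), bs)]

random.seed(42)
payload = random.randbytes(128 * 1024)      # 128 KiB of "guest" data

aligned = payload                           # fixed image: data starts on a boundary
shifted = b'\x00' * 512 + payload           # dynamic image: a 512-byte offset in front

a = block_hashes(aligned, 4096)
b = block_hashes(shifted, 4096)
shared = len(set(a) & set(b))
print(shared)  # 0 -- a half-kilobyte shift leaves no block in common
```

Identical data, zero deduplicatable blocks: exactly the 1.00x ratio Tim saw before tuning.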
Re: [zfs-discuss] ZFS deduplication ratio on Server 2008 backup VHD files
On Mon, Apr 26, 2010 at 8:51 AM, tim Kries <tim.kr...@gmx.de> wrote:
> I am kinda confused over the change of dedup ratio from changing the record size, since it should dedup 256-bit blocks.

Dedup works on the blocks of either recordsize or volblocksize. The checksum is made per block written, and those checksums are used to dedup the data. With a recordsize of 128k, two blocks with a one-byte difference would not dedup. With an 8k recordsize, 15 out of 16 blocks would dedup. Repeat over the entire VHD.

Setting the record size equal to a multiple of the VHD's internal block size and ensuring that the internal filesystem is block aligned will probably help to improve dedup ratios. So for an NTFS guest with 4k blocks, use a 4k, 8k or 16k record size and ensure, when you install in the VHD, that its partitions are block aligned for the recordsize you're using.

VHD supports fixed-size and dynamic-size images. If you're using a fixed image, the space is pre-allocated. This doesn't mean you'll waste unused space on ZFS with compression, since all those zeros will take up almost no space. Your VHD file should remain block-aligned, however. I'm not sure that a dynamic-size image will block align if there is empty space. Using compress=zle will only compress the zeros with almost no CPU penalty.

Using a COMSTAR iSCSI volume is probably an even better idea, since you won't have the POSIX layer in the path, and you won't have the VHD file header throwing off your block alignment.

-B

--
Brandon High : bh...@freaks.com
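Brandon's recordsize argument can be simulated directly. The sketch below (a toy model of block-level dedup, not ZFS itself) takes two copies of a 1 MiB "image" that differ in a single byte and computes the achievable dedup ratio, total blocks written divided by unique blocks stored, at a 128k versus an 8k block size:

```python
import hashlib
import random

def dedup_ratio(streams, bs):
    """Model ratio = blocks written / unique blocks stored at block size bs."""
    blocks = [s[i:i + bs] for s in streams for i in range(0, len(s), bs)]
    unique = {hashlib.sha256(bytes(b)).digest() for b in blocks}
    return len(blocks) / len(unique)

random.seed(0)
image_a = bytearray(random.randbytes(1024 * 1024))   # 1 MiB "VHD"
image_b = bytearray(image_a)
image_b[500_000] ^= 0xFF                             # a single-byte difference

print(round(dedup_ratio([image_a, image_b], 128 * 1024), 2))  # 1.78
print(round(dedup_ratio([image_a, image_b], 8 * 1024), 2))    # 1.98
```

One changed byte spoils a whole 128k block but only one of 128 blocks at 8k, so the smaller recordsize approaches the ideal 2.00x; this is the per-block tradeoff that, repeated over the entire VHD, explains the jump Tim saw at recordsize=4k.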
[zfs-discuss] ZFS deduplication ratio on Server 2008 backup VHD files
Hi,

I have been playing with OpenSolaris for a while now. Today I tried to deduplicate the backup VHD files Windows Server 2008 generates. I made a backup before and after installing the AD role and copied the files to the share on OpenSolaris (build 134). First I got a straight 1.00x; then I set recordsize to 4k (to match NTFS) and it jumped up to 1.29x. But it should be a lot better, right? Is there something I missed?

Regards
Tim
Re: [zfs-discuss] ZFS deduplication ratio on Server 2008 backup VHD files
You might note that dedup only dedupes data that is written after the flag is set. It does not retroactively dedupe already-written data.
Re: [zfs-discuss] ZFS deduplication ratio on Server 2008 backup VHD files
It was active all the time. Made a new zfs with -o dedup=on, copied with the default record size, got no dedup; deleted the files, set recordsize to 4k, got a dedup ratio of 1.29x.
Re: [zfs-discuss] ZFS deduplication ratio on Server 2008 backup VHD files
A few things come to mind...

1. A lot better than... what? Setting the recordsize to 4K got you some deduplication, but maybe the pertinent question is: what were you expecting?

2. Dedup is fairly new. I haven't seen any reports of experiments like yours, so... CONGRATULATIONS!! You're probably the first. Or at least the first willing to discuss it with the world as a matter of public record. Since dedup is new, you can't expect much in the way of previous experience with it. I also haven't seen coordinated experiments comparing various configurations with dedup off and then on.

In the end, the question is going to be whether that level of dedup is enough for you. Is dedup even important? Is it just a gravy feature or a key requirement? You're in unexplored territory, it appears.

On Fri, Apr 23, 2010 at 11:41, tim Kries <tim.kr...@gmx.de> wrote:
> Hi, I am playing with opensolaris a while now. Today i tried to deduplicate the backup VHD files Windows Server 2008 generates. I made a backup before and after installing AD-role and copied the files to the share on opensolaris (build 134). First i got a straight 1.00x, then i set recordsize to 4k (to be like NTFS), it jumped up to 1.29x after that. But it should be a lot better right? Is there something i missed? Regards Tim

--
You can choose your friends, you can choose the deals. - Equity Private
If Linux is faster, it's a Solaris bug. - Phil Harman
Blog - http://whatderass.blogspot.com/
Twitter - @khyron4eva
Re: [zfs-discuss] ZFS deduplication ratio on Server 2008 backup VHD files
Dedup is a key element for my purpose, because I am planning a central repository for like 150 Windows Server 2008 (R2) servers, which would take a lot less storage if they dedup right.
Re: [zfs-discuss] ZFS Deduplication Replication
Hi Darren,

Could you post the -D part of the man pages? I have no access to a system (yet) with the latest man pages. http://docs.sun.com/app/docs/doc/819-2240/zfs-1m has not been updated yet.

Regards
Peter

Darren J Moffat wrote:
> Steven Sim wrote:
>> Hello; Dedup on ZFS is an absolutely wonderful feature! Is there a way to conduct dedup replication across boxes from one dedup ZFS data set to another?
> Pass the '-D' argument to 'zfs send'.

--
Regards, Peter Brouwer, Sun Microsystems Linlithgow
Principal Storage Architect, ABCP DRII Consultant
Office: +44 (0) 1506 672767  Mobile: +44 (0) 7720 598226
Skype: flyingdutchman_, flyingdutchman_l
[zfs-discuss] ZFS Deduplication Replication
Hello;

Dedup on ZFS is an absolutely wonderful feature! Is there a way to conduct dedup replication across boxes from one dedup ZFS data set to another?

Warmest Regards
Steven Sim
Re: [zfs-discuss] ZFS Deduplication Replication
Steven Sim wrote:
> Hello; Dedup on ZFS is an absolutely wonderful feature! Is there a way to conduct dedup replication across boxes from one dedup ZFS data set to another?

Pass the '-D' argument to 'zfs send'.

--
Darren J Moffat
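The idea behind a deduplicated send stream, which is what `zfs send -D` enables, can be sketched as follows. This is a toy model, not the real stream format: each unique block travels once with its full payload, and later occurrences of the same checksum are sent as short back-references (the real stream uses WRITE_BYREF-style records):

```python
import hashlib

def dedup_stream(blocks):
    """Toy deduplicated replication stream: full data once, refs thereafter."""
    seen, stream = set(), []
    for blk in blocks:
        key = hashlib.sha256(blk).digest()
        if key in seen:
            stream.append(('ref', key))    # back-reference to an earlier block
        else:
            seen.add(key)
            stream.append(('data', blk))   # full block payload
    return stream

blocks = [b'A' * 8192, b'B' * 8192, b'A' * 8192, b'A' * 8192]
stream = dedup_stream(blocks)
payload_blocks = sum(1 for kind, _ in stream if kind == 'data')
print(payload_blocks)  # 2 -- the duplicate 'A' blocks travel as references
```

The receiver reconstructs duplicates from the references, so dedup survives the trip between boxes without re-transmitting shared blocks.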
Re: [zfs-discuss] Zfs deduplication
Dave McDorman wrote:
> I don't think Sun is at liberty to discuss ZFS Deduplication at this point in time.

Did Jeff Bonwick and Bill Moore give a presentation at kernel.conf.au or not? If so, did anyone see the presentation? Did the conference attendees all sign NDAs or something?

Wes Felter
Re: [zfs-discuss] Zfs deduplication
On Mon, 03 Aug 2009 18:26:44 -0500, Wes Felter <wes...@felter.org> wrote:
> Did Jeff Bonwick and Bill Moore give a presentation at kernel.conf.au or not?

Yes they did - a keynote, and they participated in a panel discussion with Pawel Dawidek as well.

> If so, did anyone see the presentation?

Yes. Everybody who attended.

> Did the conference attendees all sign NDAs or something?

Don't be ridiculous. We did actually have problems with the video quality, for a variety of reasons. The video from the session is now being edited; however, with more than one hour of footage to process, and only one person to do it, this is taking a while to get finished.

James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp  http://www.jmcp.homeunix.com/blog
Kernel Conference Australia - http://au.sun.com/sunnews/events/2009/kernel
Re: [zfs-discuss] Zfs deduplication
On Tue, 4 Aug 2009, James C. McPherson wrote:
>> If so, did anyone see the presentation?
> Yes. Everybody who attended.

You know, I think we might even have some evidence of their attendance!

http://mexico.purplecow.org/static/kca_spk/tn/IMG_2177.jpg.html
http://mexico.purplecow.org/static/kca_spk/tn/IMG_2178.jpg.html
http://mexico.purplecow.org/static/kca_spk/tn/IMG_2179.jpg.html
http://mexico.purplecow.org/static/kca_spk/tn/IMG_2184.jpg.html
http://mexico.purplecow.org/static/kca_spk/tn/IMG_2186.jpg.html
http://mexico.purplecow.org/static/kca_spk/tn/IMG_2228.jpg.html

So they obviously attended, but it takes time to get video and documentation out the door. You can already watch their participation in the ZFS panel online: http://www.ustream.tv/recorded/1810931

--
Andre van Eyssen
mail: an...@purplecow.org  jabber: an...@interact.purplecow.org
purplecow.org: UNIX for the masses http://www2.purplecow.org
purplecow.org: PCOWpix http://pix.purplecow.org
[zfs-discuss] Zfs deduplication
Will the material ever be posted? It looks like there are some huge bugs with ZFS deduplication that the organizers do not want to publicize, and there is no indication on the Sun website of whether there will be a deduplication feature. I think it's best they concentrate on improving ZFS performance and speed with compression enabled.
Re: [zfs-discuss] Zfs deduplication
I don't think Sun is at liberty to discuss ZFS Deduplication at this point in time: http://www.itworld.com/storage/71307/sun-tussles-de-duplication-startup

Hopefully, the matter is resolved and discussions can proceed openly.

Send lawyers, guns and money. - Warren Zevon
Re: [zfs-discuss] ZFS deduplication
Ok, thank you Nils and Wade for the concise replies.

After much reading I agree that the queued ZFS development features deserve a higher ranking on the priority list (pool shrinking/disk removal and user/group quotas would be my favourites), so the deduplication tool I'd need would probably indeed be some community-contributed script which does many hash checks in zone-root file systems and does what Nils described: calculate the most-common template filesystem and derive zone roots as minimal changes to it.

Does anybody with a wider awareness know of such readily-available scripts on some blog? :)

Does some script-usable ZFS API (if any) provide for fetching block/file hashes (checksums) stored in the filesystem itself? In fact, am I wrong to expect file checksums to be readily available?
Re: [zfs-discuss] ZFS deduplication
Jim Klimov wrote:
> Does some script-usable ZFS API (if any) provide for fetching block/file hashes (checksums) stored in the filesystem itself? In fact, am I wrong to expect file-checksums to be readily available?

Yes. Files are not checksummed; blocks are checksummed.
-- richard
Re: [zfs-discuss] ZFS deduplication
>> Does some script-usable ZFS API (if any) provide for fetching block/file hashes (checksums) stored in the filesystem itself? In fact, am I wrong to expect file-checksums to be readily available?
> Yes. Files are not checksummed, blocks are checksummed. -- richard

Further, even if they were file-level checksums, the default checksums in ZFS are too collision-prone to be used for that purpose. If I were to write such a script, I would hash with md5+sha and then bit-level compare collisions to be safe.

-Wade
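A minimal version of the script Wade describes might look like the sketch below (a hypothetical helper, not an existing tool). It groups files by size and SHA-256, then confirms each candidate group with a byte-level compare, so even a hash collision cannot cause a false merge. A hardlink/clone step to reclaim the space would follow, which is omitted here:

```python
import filecmp
import hashlib
import os
from collections import defaultdict

def find_duplicate_files(root):
    """Yield groups of byte-identical files under root (hash, then verify)."""
    by_key = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            with open(path, 'rb') as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            by_key[(os.path.getsize(path), digest)].append(path)
    for paths in by_key.values():
        # bit-level compare guards against checksum collisions
        if len(paths) > 1 and all(
                filecmp.cmp(paths[0], p, shallow=False) for p in paths[1:]):
            yield paths
```

For multi-gigabyte zone roots you would hash in chunks rather than read whole files into memory, but the structure is the same.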
Re: [zfs-discuss] ZFS deduplication
On Tue, 26 Aug 2008, Darren J Moffat wrote:
> zfs set checksum=sha256

Expect performance to really suck after setting this.

Bob
==
Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] ZFS deduplication
Bob Friesenhahn wrote:
> On Tue, 26 Aug 2008, Darren J Moffat wrote:
>> zfs set checksum=sha256
> Expect performance to really suck after setting this.

Do you have evidence of that? What kind of workload, and how did you test it? I've recently been benchmarking using the filebench filemicro and filemacro workloads for ZFS Crypto, and as part of setting my baseline I compared the default checksum (fletcher2) with sha256, and I didn't see a big enough difference to classify it as "sucks".

Here is my evidence for the filebench filemacro workload: http://cr.opensolaris.org/~darrenm/zfs-checksum-compare.html

This was done on an X4500 running the zfs-crypto development binaries. In the interest of full disclosure, I have changed the sha256.c in the ZFS source to use the default kernel one via the crypto framework rather than a private copy. I wouldn't expect that to have too big an impact (I will be verifying it; I just didn't have the data to hand quickly).

--
Darren J Moffat
Re: [zfs-discuss] ZFS deduplication
On Aug 26, 2008, at 9:58 AM, Darren J Moffat wrote:
> than a private copy. I wouldn't expect that to have too big an impact (I

On a SPARC CMT (Niagara 1+) based system, wouldn't that be likely to have a large impact?

--
Keith H. Bierman [EMAIL PROTECTED] | AIM kbiermank
5430 Nassau Circle East | Cherry Hills Village, CO 80113 | 303-997-2749
speaking for myself* Copyright 2008
Re: [zfs-discuss] ZFS deduplication
On Tue, Aug 26, 2008 at 10:58 AM, Darren J Moffat [EMAIL PROTECTED] wrote:
> In the interest of full disclosure I have changed the sha256.c in the ZFS source to use the default kernel one via the crypto framework rather than a private copy. I wouldn't expect that to have too big an impact (I will be verifying it I just didn't have the data to hand quickly).

Would this also make it so that it would use hardware-assisted sha256 on capable (e.g. N2) platforms? Is that the same as this change from long ago? http://mail.opensolaris.org/pipermail/zfs-code/2007-March/000448.html

--
Mike Gerdts http://mgerdts.blogspot.com/
Re: [zfs-discuss] ZFS deduplication
On Tue, 26 Aug 2008, Darren J Moffat wrote:
> Bob Friesenhahn wrote:
>> Expect performance to really suck after setting this.
> Do you have evidence of that? What kind of workload and how did you test it?

I did some random I/O throughput testing using iozone. While I saw similar I/O performance degradation to what you did (similar to your large_db_oltp_8k_cached), I did observe high CPU usage. The default fletcher algorithm uses hardly any CPU. In a dedicated file server, CPU usage is not a problem unless it slows subsequent requests. In a desktop system or compute workstation, filesystem CPU usage competes with application CPU usage. With Solaris 10, enabling sha256 resulted in jerky mouse and desktop application behavior.

Bob
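The CPU gap Bob describes follows from the algorithms themselves. A Fletcher-style checksum is a couple of integer additions per machine word, while SHA-256 runs 64 rounds of mixing per 64-byte chunk. The sketch below shows a simplified Fletcher-style sum (an illustration only, not bit-compatible with ZFS's actual fletcher2); as a side effect it also shows why such checksums are weak for dedup, the concern raised earlier in the thread: an all-zero block checksums to zero, as does any input whose words happen to cancel.

```python
import hashlib

MASK64 = 0xFFFFFFFFFFFFFFFF

def fletcher2_like(data):
    """Simplified Fletcher-style checksum over 64-bit little-endian words.

    Two adds per word -- cheap, but far more collision-prone than SHA-256.
    """
    a = b = 0
    for i in range(0, len(data) - 7, 8):
        word = int.from_bytes(data[i:i + 8], 'little')
        a = (a + word) & MASK64
        b = (b + a) & MASK64
    return (b << 64) | a

block = bytes(128 * 1024)                      # an all-zero 128k record
weak = fletcher2_like(block)                   # 0 -- degenerate checksum
strong = hashlib.sha256(block).hexdigest()     # unique-looking 256-bit digest
```

This is why sha256 (or verify=on) matters once checksums are trusted for dedup, and why fletcher stays the default where they only guard against corruption.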
Re: [zfs-discuss] ZFS deduplication
Keith Bierman wrote:
> On a SPARC CMT (Niagara 1+) based system wouldn't that be likely to have a large impact?

UltraSPARC T1 has no hardware SHA256, so I wouldn't expect any real change from running the private software sha256 copy in ZFS versus the software sha256 in the crypto framework. The software sha256 in the crypto framework has very little (if any) optimization for sun4v.

An UltraSPARC T2 has on-chip SHA256, and it should have a good impact on performance to use the crypto framework. I don't have the data to hand at the moment.

--
Darren J Moffat
Re: [zfs-discuss] ZFS deduplication
Mike Gerdts wrote:
> Would this also make it so that it would use hardware assisted sha256 on capable (e.g N2) platforms?

Yes.

> Is that the same as this change from long ago? http://mail.opensolaris.org/pipermail/zfs-code/2007-March/000448.html

Slightly different implementation - in particular it doesn't use PKCS#11 in userland, only libmd. It also falls back to direct sha256 if the crypto framework crypto_mech2id() call fails - this is needed to support ZFS boot.

--
Darren J Moffat
Re: [zfs-discuss] ZFS deduplication
On Tue, Aug 26, 2008 at 10:11 AM, Darren J Moffat [EMAIL PROTECTED] wrote:
> Keith Bierman wrote:
>> On a SPARC CMT (Niagara 1+) based system wouldn't that be likely to have a large impact?
> UltraSPARC T1 has no hardware SHA256 so I wouldn't expect any real change from running the private software sha256 copy in ZFS versus the software sha256 in the crypto framework.

Sorry for the typo (or thinko; I did know that, but it's possible that it slipped my mind in the moment). Admittedly most community members probably don't have an N2 to play with, but it might well be available in the DC.

--
Keith Bierman [EMAIL PROTECTED] kbiermank AIM
Re: [zfs-discuss] ZFS deduplication
[EMAIL PROTECTED] wrote on 08/22/2008 04:26:35 PM:
> Just my 2c: Is it possible to do an offline dedup, kind of like snapshotting? What I mean in practice is: we make many Solaris full-root zones. They share a lot of data as complete files. ... It seems reasonable to make some dedup process which would create a least-common-denominator snapshot for all the datasets involved (zone roots), of which all other datasets' current data are to be dubbed clones with modified data. ... Hope this idea makes sense, and perhaps makes its way into code sometime :)

Jim,

There have been a few long threads about this in the past on this list. My take is that it is worthwhile, but should (or really needs to) wait until the resilver/resize/evac code is done and the ZFS libs are stabilized and public (meaning people can actually write non-throwaway code against them). Some people feel that dedup is overextending the premise of the filesystem (and would unnecessarily complicate the code). Some feel that the benefits would be less than we suspect. I would expect the first dedup code you see to be written by non-Sun people -- and then, if it is enticing enough, to be backported to trunk (maybe). There are a bunch of need-to-haves sitting in the queue that Sun needs to focus on, such as real user/group quotas, disk shrink/evac, and a utility/toolkit for failed pool recovery (beyond skull-and-bones forensic tools), that should be way ahead in line versus dedup.

-Wade
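Jim's least-common-denominator idea can be sketched as a small computation over per-zone file inventories. Below is a hypothetical helper (names and the hash-map input are assumptions, not an existing API): given a map of relative path to content hash for each zone root, it keeps every (path, hash) pair common to all zones as the template, and reduces each zone to the delta it would carry as a clone of that template.

```python
def split_template(zones):
    """zones: {zone_name: {relative_path: content_hash}}.

    Returns (template, deltas): the pairs shared by every zone, and
    each zone's differences from that least-common-denominator.
    """
    common = set.intersection(*(set(z.items()) for z in zones.values()))
    template = dict(common)
    deltas = {name: {p: h for p, h in z.items() if template.get(p) != h}
              for name, z in zones.items()}
    return template, deltas

zones = {
    'zone1': {'/usr/bin/ls': 'aaa', '/etc/hosts': 'h1'},
    'zone2': {'/usr/bin/ls': 'aaa', '/etc/hosts': 'h2'},
}
template, deltas = split_template(zones)
print(template)         # {'/usr/bin/ls': 'aaa'}
print(deltas['zone2'])  # {'/etc/hosts': 'h2'}
```

The hard part Wade points at is not this calculation but rewriting live datasets so they become clones of the computed template, which is exactly where stable, public ZFS libraries would be needed.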
Re: [zfs-discuss] ZFS deduplication
Just my 2c: Is it possible to do an offline dedup, kind of like snapshotting?

What I mean in practice is: we make many Solaris full-root zones. They share a lot of data as complete files. This makes it kind of easy to save space - make one zone as a template, snapshot/clone its dataset, make new zones. However, as projects evolve (software installed, etc.) these zones are filled with many similar files, many of which are duplicates.

It seems reasonable to make some dedup process which would create a least-common-denominator snapshot for all the datasets involved (zone roots), of which all the other datasets' current data would be dubbed clones with modified data. For the system (and user) it should be perceived just the same as if these datasets were clones with modified data of the original template zone-root dataset. Only the template becomes different...

Hope this idea makes sense, and perhaps makes its way into code sometime :)
Re: [zfs-discuss] ZFS deduplication
> with other Word files. You will thus end up seeking all over the disk to read _most_ Word files. Which really sucks.
[snip]
> very limited, constrained usage. Disk is just so cheap, that you _really_ have to have an enormous amount of dup before the performance penalties of dedup are countered.

Neither of these holds true for SSDs, though, do they? Seeks are essentially free, and the devices are not cheap.

cheers,
--justin
Re: [zfs-discuss] ZFS deduplication
On Tue, Jul 22, 2008 at 10:44 PM, Erik Trimble [EMAIL PROTECTED] wrote:
> More than anything, Bob's reply is my major feeling on this. Dedup may indeed turn out to be quite useful, but honestly, there's no broad data which says that it is a Big Win (tm) _right_now_, compared to finishing other features. I'd really want an engineering study about the real-world use (i.e. what percentage of the userbase _could_ use such a feature, what percentage _would_ use it, and exactly how useful each segment would find it...) before bumping it up in the priority queue of work to be done on ZFS.

I get this. However, for most of my uses of clones, dedup is considered finishing the job. Without it, I run the risk of having way more writable data than I can restore. Another solution to this is to consider the output of zfs send to be a stable format and get integration with enterprise backup software that can perform restores in a way that maintains space efficiency.

--
Mike Gerdts http://mgerdts.blogspot.com/
Re: [zfs-discuss] ZFS deduplication
> Hi All, Is there any hope for deduplication on ZFS? - Mertol Ozyoney, Storage Practice - Sales Manager, Sun Microsystems, Email [EMAIL PROTECTED]

There is always hope.

Seriously though, looking at http://en.wikipedia.org/wiki/Comparison_of_revision_control_software there are a lot of choices for how we could implement this. SVN/K, Mercurial and Sun Teamware all come to mind. Simply ;) merge one of those with ZFS. It _could_ be as simple (with SVN as an example) as using directory listings to produce files which were then 'diffed'. You could then view the diffs as though they were changes made to lines of source code. Just add a tree subroutine to allow you to grab all the diffs that referenced changes to file 'xyz' and you would have easy access to all the changes of a particular file (or directory). With the speed-optimized ability added to use ZFS snapshots with the tree subroutine to roll back a single file (or directory), you could undo/redo your way through the filesystem.

Using an LKCD (http://www.faqs.org/docs/Linux-HOWTO/Linux-Crash-HOWTO.html) you could sit out the play and watch from the sidelines -- returning to the OS when you thought you were 'safe' (and if not, jumping back out).

Thus, Mertol, it is possible (and could work very well).

Rob
Re: [zfs-discuss] ZFS deduplication
[EMAIL PROTECTED] wrote on 07/22/2008 08:05:01 AM: Hi All Is there any hope for deduplication on ZFS ? Mertol Ozyoney Storage Practice - Sales Manager Sun Microsystems Email [EMAIL PROTECTED] There is always hope. Seriously though, looking at http://en.wikipedia.org/wiki/Comparison_of_revision_control_software there are a lot of choices for how we could implement this. SVN/K, Mercurial and Sun Teamware all come to mind. Simply ;) merge one of those with ZFS. It _could_ be as simple (with SVN as an example) as using directory listings to produce files which were then 'diffed'. You could then view the diffs as though they were changes made to lines of source code. Just add a tree subroutine to allow you to grab all the diffs that referenced changes to file 'xyz' and you would have easy access to all the changes of a particular file (or directory). With the speed-optimized ability to use ZFS snapshots with the tree subroutine to roll back a single file (or directory) you could undo / redo your way through the filesystem. Dedup is not revision control; you seem to completely misunderstand the problem. Using an LKCD (http://www.faqs.org/docs/Linux-HOWTO/Linux-Crash-HOWTO.html) you could sit out on the play and watch from the sidelines -- returning to the OS when you thought you were 'safe' (and if not, jumping back out). Now it seems you have veered even further off course. What does the LKCD have to do with ZFS, Solaris, or dedup, let alone revision control software? -Wade
Re: [zfs-discuss] ZFS deduplication
To do dedup properly, it seems like there would have to be some overly complicated methodology for a sort of delayed dedup of the data. For speed, you'd want your writes to go straight into the cache and get flushed out as quickly as possible, keeping everything as ACID as possible. Then, a dedup scrubber would take what was written, do the voodoo magic of checksumming the new data, scan the tree to see if there are any matches, lock the duplicates, run the usage counters up or down for that block of data, swap out inodes, and mark the duplicate data as free space. It's a lofty goal, but one that is doable. I guess this is only necessary if deduplication is done at the file level. If done at the block level, it could possibly be done on the fly, what with the already-implemented checksumming at the block level, but then your reads will suffer because pieces of files can potentially be spread all over hell and half of Georgia on the zdevs. Deduplication is going to require the judicious application of hallucinogens and man hours. I expect that someone is up to the task. On Tue, Jul 22, 2008 at 10:39 AM, [EMAIL PROTECTED] wrote: [...] -- chris -at- microcozm -dot- net === Si Hoc Legere Scis Nimium Eruditionis Habes
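The "delayed dedup scrubber" described above can be sketched as a toy model. This is purely illustrative Python, not how ZFS implements anything; the integer block addresses, the in-memory tables, and SHA-256 as the per-block checksum are all assumptions made for the sketch:

```python
import hashlib

def dedup_scrub(blocks):
    """Walk already-written blocks, checksum each one, point duplicates at
    a single canonical copy, bump its usage counter, and record the freed
    addresses. `blocks` maps a block address to its bytes."""
    seen = {}        # checksum -> canonical block address
    block_map = {}   # block address -> canonical address it now points to
    refcount = {}    # canonical address -> number of references
    freed = []       # addresses whose space can be returned as free
    for addr, data in blocks.items():
        digest = hashlib.sha256(data).hexdigest()   # per-block checksum
        if digest in seen:
            canonical = seen[digest]
            block_map[addr] = canonical   # duplicate now points at the canonical copy
            refcount[canonical] += 1      # usage counter runs up
            freed.append(addr)            # duplicate's space is marked free
        else:
            seen[digest] = addr           # first time we've seen this content
            block_map[addr] = addr
            refcount[addr] = 1
    return block_map, refcount, freed
```

Note the scrubber only ever compares checksums, never file contents, which is why the file-level vs. block-level distinction discussed below matters so much.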
Re: [zfs-discuss] ZFS deduplication
[EMAIL PROTECTED] wrote on 07/22/2008 09:58:53 AM: To do dedup properly, it seems like there would have to be some overly complicated methodology for a sort of delayed dedup of the data. For speed, you'd want your writes to go straight into the cache and get flushed out as quickly as possible, keep everything as ACID as possible. Then, a dedup scrubber would take what was written, do the voodoo magic of checksumming the new data, scanning the tree to see if there are any matches, locking the duplicates, run the usage counters up or down for that block of data, swapping out inodes, and marking the duplicate data as free space. I agree, but what you are describing is file-based dedup; ZFS already has the groundwork for dedup in the system (block-level checksumming and pointers). It's a lofty goal, but one that is doable. I guess this is only necessary if deduplication is done at the file level. If done at the block level, it could possibly be done on the fly, what with the already implemented checksumming at the block level, Exactly -- that is why it is attractive for ZFS: so much of the groundwork is done and needed for the fs/pool already. but then your reads will suffer because pieces of files can potentially be spread all over hell and half of Georgia on the zdevs. I don't know that you can make this statement without some study of an actual implementation on real-world data -- and because it is block based, you should see varying degrees of this dedup fragmentation depending on data/usage. For instance, I would imagine that in many scenarios much of the deduped data blocks would belong to the same or very similar files. In that case the blocks were written as best they could on the first write, and the deduped blocks would point to a fairly sequential line of blocks. Now, some files may have duplicate headers or similar portions of data -- these may cause you to jump around the disk, but I do not know how much this would be hit or how much it would impact real-world usage. Deduplication is going to require the judicious application of hallucinogens and man hours. I expect that someone is up to the task. I would prefer the coder(s) not be seeing pink elephants while writing this, but yes, it can and will be done. It (I believe) will be easier after the grow/shrink/evac code paths are in place, though. Also, the grow/shrink/evac path allows (if it is done right) for other cool things, like a base on which to build a roaming defrag that takes into account snaps, clones, live data and the like. I know that some feel that the grow/shrink/evac code is more important for home users, but I think that it is super important for most of these additional features. -Wade
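Wade's point that the block-level groundwork (checksums plus block pointers) is already there can be illustrated with a minimal content-addressed write path. This is a hypothetical sketch, not ZFS code; the dict-based tables and SHA-256 checksum are assumptions of the model:

```python
import hashlib

class DedupStore:
    """Content-addressed write path: each write is keyed by its block
    checksum, so a block whose checksum is already in the table costs
    only a reference-count bump instead of a new allocation."""

    def __init__(self):
        self.by_sum = {}   # checksum -> the single stored copy
        self.refs = {}     # checksum -> reference count

    def write(self, data):
        key = hashlib.sha256(data).hexdigest()
        if key in self.by_sum:
            self.refs[key] += 1        # duplicate write: dedup on the fly
        else:
            self.by_sum[key] = data    # first copy: actually store it
            self.refs[key] = 1
        return key                     # caller keeps this as its "block pointer"

    def read(self, key):
        return self.by_sum[key]

    def physical_blocks(self):
        return len(self.by_sum)        # space actually consumed, in blocks
```

The sketch shows why on-the-fly dedup is cheap for writes once per-block checksums exist, while saying nothing about the read-fragmentation concern debated below.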
Re: [zfs-discuss] ZFS deduplication
On Tue, Jul 22, 2008 at 11:19 AM, [EMAIL PROTECTED] wrote: [...] but then your reads will suffer because pieces of files can potentially be spread all over hell and half of Georgia on the zdevs. I don't know that you can make this statement without some study of an actual implementation on real-world data -- and because it is block based, you should see varying degrees of this dedup fragmentation depending on data/usage. It's just a NonScientificWAG. I agree that most of the duplicated blocks will in most cases be part of identical files anyway, and thus lined up exactly as you'd want them. I was just free thinking and typing. [...] I would prefer the coder(s) not be seeing pink elephants while writing this, but yes, it can and will be done. It (I believe) will be easier after the grow/shrink/evac code paths are in place, though. Also, the grow/shrink/evac path allows (if it is done right) for other cool things, like a base on which to build a roaming defrag that takes into account snaps, clones, live data and the like. I know that some feel that the grow/shrink/evac code is more important for home users, but I think that it is super important for most of these additional features. The elephants are just there to keep the coders company. There are tons of benefits for dedup, both for home and non-home users. I'm happy that it's going to be done. I expect the first complaints will come from people who don't understand it, when their df and du numbers look different from their zpool status ones. Perhaps df/du will just have to be faked out for those folks, or we can apply the same hallucinogens to them instead.
Re: [zfs-discuss] ZFS deduplication
Chris Cosby wrote: [...] It's just a NonScientificWAG. I agree that most of the duplicated blocks will in most cases be part of identical files anyway, and thus lined up exactly as you'd want them. I was just free thinking and typing. No, you are right to be concerned over block-level dedup seriously impacting seeks. The problem is that, given many common storage scenarios, you will have not just similar files, but multiple common sections of many files. Things such as the various standard productivity app documents will not just have the same header sections; internally, there will be significant duplications of considerable length with other documents from the same application. Your 5MB Word file is thus likely to share several (actually, many) multi-kB segments with other Word files. You will thus end up seeking all over the disk to read _most_ Word files. Which really sucks. I can list at least a couple more common scenarios where dedup has the potential to save at least some reasonable amount of space, yet will absolutely kill performance. [...] There are tons of benefits for dedup, both for home and non-home users. I'm happy that it's going to be done. I expect the first complaints will come from those people who don't understand it, when their df and du numbers look different from their zpool status ones. Perhaps df/du will just have to be faked out for those folks, or we just apply the same hallucinogens to them instead. I'm still not convinced that dedup is really worth it for anything but very limited, constrained usage. Disk is just so cheap that you _really_ have to have an enormous amount of dup before the performance penalties of dedup are countered. This in many ways reminds me of last year's discussion over file versioning in the
Re: [zfs-discuss] ZFS deduplication
FWIW, Sun's VTL products use ZFS and offer de-duplication services. http://www.sun.com/aboutsun/pr/2008-04/sunflash.20080407.2.xml -- richard
Re: [zfs-discuss] ZFS deduplication
[EMAIL PROTECTED] wrote on 07/22/2008 11:48:30 AM: [...] The problem is that, given many common storage scenarios, you will have not just similar files, but multiple common sections of many files. Things such as the various standard productivity app documents will not just have the same header sections, but internally, there will be significant duplications of considerable length with other documents from the same application. Your 5MB Word file is thus likely to share several (actually, many) multi-kB segments with other Word files. You will thus end up seeking all over the disk to read _most_ Word files. Which really sucks. I can list at least a couple more common scenarios where dedup has the potential to save at least some reasonable amount of space, yet will absolutely kill performance. While you may have a point on some data sets, actual testing of this type of data (28,000+ actual end-user doc files) using xdelta with 4k and 8k block sizes shows that the similar blocks in these files are in the 2% range (~6% for 4k). That means a full read of each file would on average require 6% seeks to other disk areas. That is not bad, and it is the worst-case picture, as those duplicate blocks would need to live at the same offsets and have the same block boundaries to match under the proposed algo. To me this means Word docs are not a good candidate for dedup at the block level -- but the actual cost of deduping them anyway seems small. Of course you could come up with data that is pathologically bad for these benchmarks, but I do not believe it would be nearly as bad as you are making it out to be on real-world data. [...]
Re: [zfs-discuss] ZFS deduplication
On Tue, Jul 22, 2008 at 11:48 AM, Erik Trimble [EMAIL PROTECTED] wrote: No, you are right to be concerned over block-level dedup seriously impacting seeks. The problem is that, given many common storage scenarios, you will have not just similar files, but multiple common sections of many files. Things such as the various standard productivity app documents will not just have the same header sections, but internally, there will be significant duplications of considerable length with other documents from the same application. Your 5MB Word file is thus likely to share several (actually, many) multi-kB segments with other Word files. You will thus end up seeking all over the disk to read _most_ Word files. Which really sucks. I can list at least a couple more common scenarios where dedup has the potential to save at least some reasonable amount of space, yet will absolutely kill performance. This would actually argue in favor of dedup... If the blocks are common, they are more likely to be in the ARC with dedup, thus avoiding a read altogether. There would likely be greater overhead in assembling smaller packets, though. Here's some real life... I have 442 Word documents created by me and others over several years. Many were created from the same corporate templates. I generated the MD5 hash of every 8 KB of each file and came up with a total of 8409 hashes -- implying 65 MB of Word documents. Taking those hashes through sort | uniq -c | sort -n led to the following:

3 p9I7HgbxFme7TlPZmsD6/Q
3 sKE3RBwZt8A6uz+tAihMDA
3 uA4PK1+SQqD+h1Nv6vJ6fQ
3 wQoU2g7f+dxaBMzY5rVE5Q
3 yM0csnXKtRxjpSxg1Zma0g
3 yyokNamrTcD7lQiitcVgqA
4 jdsZZfIHtshYZiexfX3bQw
17 pohs0DWPFwF8HJ8p/HnFKw
19 s0eKyh/vT1LothTvsqtZOw
64 CCn3F0CqsauYsz6uId7hIg

Note that CCn3F0CqsauYsz6uId7hIg is the MD5 hash of 8 KB of zeros. If compression is used as well, this block would not even be stored.
If 512-byte blocks are used, the story is a bit different:

81 DEf6rofNmnr1g5f7oaV75w
109 3gP+ZaZ2XKqMkTQ6zGLP/A
121 ypk+0ryBeMVRnnjYQD2ZEA
124 HcuMdyNKV7FDYcPqvb2o3Q
371 s0eKyh/vT1LothTvsqtZOw
372 ozgGMCCoc+0/RFbFDO8MsQ
8535 v2GerAzfP2jUluqTRBN+iw

As you might guess, the most common hash is a block of zeros. Most likely, however, these files will end up using 128K blocks for the first part of the file, smaller for the portions that don't fit. When I look at just 128K...

1 znJqBX8RtPrAOV2I6b5Wew
2 6tuJccWHGVwv3v4nee6B9w
2 Qr//PMqqhMtuKfgKhUIWVA
2 idX0awfYjjFmwHwi60MAxg
2 s0eKyh/vT1LothTvsqtZOw
3 +Q/cXnknPr/uUCARsaSIGw
3 /kyIGuWnPH/dC5ETtMqqLw
3 4G/QmksvChYvfhAX+rfgzg
3 SCMoKuvPepBdQEBVrTccvA
3 vbaNWd5IQvsGdQ9R8dIqhw

There is actually very little duplication in Word files. Many of the dupes above are from various revisions of the same files. Dedup Advantages: (1) save space relative to the amount of duplication. This is highly dependent on workload, and ranges from 0% to 99%, but the distribution of possibilities isn't a bell curve (i.e. the average space saved isn't 50%). I have evidence that shows 75% duplicate data on (mostly sparse) zone roots created and maintained over an 18-month period. I show other evidence above that it is not nearly as good for one person's copy of Word documents. I suspect that it would be different if the file system I did this on was a file server where all of my colleagues also stored their documents (and the revisions of mine that they have reviewed). (2) noticeable write performance penalty (assuming block-level dedup on write), with potential write cache issues. Depends on the approach taken. (3) very significant post-write dedup time, at least on the order of 'zfs scrub'. Also, during such a post-write scenario, it more or less takes the zpool out of usage. The ZFS competition that has this in shipping product today does not quiesce the file system during dedup passes.
(4) If dedup is done at the block level, not the file level, it kills read performance, effectively turning all dedup'd files from sequential reads into random reads. That is, block-level dedup drastically accelerates filesystem fragmentation. Absent data that shows this, I don't accept this claim. Arguably, the blocks that are duplicates are more likely to be in cache. I think that my analysis above shows that this is not a concern for my data set. (5) Something no one has talked about, but is of concern: by removing duplication, you increase the likelihood that loss of the master segment will corrupt many more files. Yes, ZFS has self-healing and such. But, particularly in the case where there is no ZFS pool redundancy (or pool-level redundancy has been compromised), loss of one block can thus be many times more severe. I believe this is true and likely a good topic for discussion. We need to think long and hard about what the real widespread benefits are of dedup
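Mike's chunk-hashing experiment (MD5 of every 8 KB, then sort | uniq -c | sort -n) is easy to reproduce. Here is a rough Python equivalent; the function name, the summary line, and taking file paths on the command line are conveniences of the sketch, not anything from the original post:

```python
import collections
import hashlib
import sys

def chunk_hashes(paths, blocksize=8192):
    """Hash every fixed-size chunk of every file and count how often each
    digest occurs; the surplus of total over unique chunks is a rough
    estimate of block-level dedup potential at that block size."""
    counts = collections.Counter()
    for path in paths:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(blocksize)
                if not chunk:
                    break
                counts[hashlib.md5(chunk).hexdigest()] += 1
    return counts

if __name__ == "__main__":
    counts = chunk_hashes(sys.argv[1:])
    total, unique = sum(counts.values()), len(counts)
    if total:
        print(f"{total} chunks, {unique} unique, "
              f"{100.0 * (total - unique) / total:.1f}% duplicated")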
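```

Rerunning it at 512 and 131072 bytes reproduces the three tables above; note that, like the original experiment, it assumes duplicate data sits at matching block boundaries.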
Re: [zfs-discuss] ZFS deduplication
On 7/22/08 11:48 AM, Erik Trimble [EMAIL PROTECTED] wrote: I'm still not convinced that dedup is really worth it for anything but very limited, constrained usage. Disk is just so cheap that you _really_ have to have an enormous amount of dup before the performance penalties of dedup are countered. Again, I will argue that the spinning rust itself isn't expensive, but data management is. If I am looking to protect multiple PB (through remote data replication and backup), I need more than just the rust to store that. I need to copy this data, which takes time and effort. If the system can say these 500K blocks are the same as those 500K, don't bother copying them to the DR site AGAIN, then I have a less daunting data management task. De-duplication makes a lot of sense at some layer(s) within the data management scheme. Charles
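Charles's "don't copy blocks the DR site already has" idea is essentially checksum-based replication. A hedged sketch follows; modelling the remote side as a set of checksums and the wire transfer as a `send` callback are assumptions of the sketch, not any particular product's protocol:

```python
import hashlib

def replicate(blocks, remote_has, send):
    """For each block, check its checksum against what the DR site already
    holds (`remote_has`, a set of checksums); call `send` -- the expensive
    wire transfer -- only for blocks the remote has never seen."""
    sent = skipped = 0
    for data in blocks:
        digest = hashlib.sha256(data).hexdigest()
        if digest in remote_has:
            skipped += 1             # "don't bother copying them AGAIN"
        else:
            send(data)               # full block crosses the WAN once
            remote_has.add(digest)
            sent += 1
    return sent, skipped
```

Only checksums need to be compared to decide what to ship, which is why dedup metadata pays off at the replication layer even when on-disk space is cheap.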
Re: [zfs-discuss] ZFS deduplication
On Tue, 22 Jul 2008, Erik Trimble wrote: Dedup Disadvantages: Obviously you do not work in the Sun marketing department, which is interested in this feature (due to some other companies marketing it). Note that the topic-starter post came from someone in Sun's marketing department. I think that deduplication is a potential diversion which draws attention away from the core ZFS things which are still not ideally implemented or do not yet exist at all. Compared with other filesystems, ZFS is still a toddler, since it has only been deployed for a few years. ZFS is intended to be an enterprise filesystem, so let's give it more time to mature before hitting it with the feature stick. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] ZFS deduplication
et == Erik Trimble [EMAIL PROTECTED] writes: et Dedup Advantages: et (1) save space (2) coalesce data which is frequently used by many nodes in a large cluster into a small nugget of common data which can fit into RAM or L2 fast disk (3) back up non-ZFS filesystems that don't have snapshots and clones (4) make offsite replication easier on the WAN But, yeah, aside from imagining ahead to possible disastrous problems with the final implementation, the imagined use cases should probably be carefully compared to existing large installations. Firstly, dedup may be more tempting as a bulleted marketing feature or a bloggable/banterable boasting point than it is valuable to real people. Secondly, the comparison may drive the implementation. For example, should dedup happen at write time and be something that doesn't happen to data written before it's turned on, like recordsize or compression, to make it simpler in the user interface and avoid problems with scrubs making pools uselessly slow? Or should it be scrub-like, so that already-written filesystems can be thrown into the dedup bag and slowly squeezed, or so that dedup can run slowly during the business day over data written quickly at night (fast outside-business-hours backup)?
Re: [zfs-discuss] ZFS deduplication
On Tue, 22 Jul 2008, Miles Nordin wrote: scrubs making pools uselessly slow? Or should it be scrub-like so that already-written filesystems can be thrown into the dedup bag and slowly squeezed, or so that dedup can run slowly during the business day over data written quickly at night (fast outside-business-hours backup)? I think that the scrub-like model makes the most sense, since ZFS write performance should not be penalized. It is useful to implement score-boarding so that a block is not considered for de-duplication until it has been duplicated a certain number of times. In order to decrease resource consumption, it is useful to perform de-duplication over a span of multiple days or multiple weeks, doing just part of the job each time around. Deduping a petabyte of data seems quite challenging, yet ZFS needs to be scalable to these levels. Bob
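Bob's scrub-like, score-boarded approach might look something like this toy pass. The threshold, the batch cap, and the in-memory Counter are all assumptions of the sketch; a real implementation would persist the scoreboard and spread passes over days or weeks as he suggests:

```python
import collections
import hashlib

def scoreboard_dedup(blocks, threshold=2, batch=1000):
    """One pass of a scrub-like dedup: count how often each checksum has
    been seen (the scoreboard), and only coalesce a block once it has been
    duplicated `threshold` or more times. `batch` caps how many blocks a
    single pass examines, so the work can be split across many runs."""
    scoreboard = collections.Counter()
    for addr, data in blocks.items():
        scoreboard[hashlib.sha256(data).hexdigest()] += 1

    deduped = {}     # address -> canonical address, only for "hot" checksums
    canonical = {}   # checksum -> address of the copy we keep
    for addr, data in list(blocks.items())[:batch]:
        digest = hashlib.sha256(data).hexdigest()
        if scoreboard[digest] < threshold:
            continue                     # not duplicated enough times yet
        if digest in canonical:
            deduped[addr] = canonical[digest]
        else:
            canonical[digest] = addr     # first hot copy becomes canonical
    return deduped
```

The scoreboard keeps cold, once-only blocks out of the dedup table entirely, which is the resource-consumption win Bob describes.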
Re: [zfs-discuss] ZFS deduplication
Bob Friesenhahn wrote: On Tue, 22 Jul 2008, Erik Trimble wrote: Dedup Disadvantages: Obviously you do not work in the Sun marketing department, which is interested in this feature (due to some other companies marketing it). Note that the topic starter post came from someone in Sun's marketing department. I think that deduplication is a potential diversion which draws attention away from the core ZFS things which are still not ideally implemented or do not yet exist at all. Compared with other filesystems, ZFS is still a toddler, since it has only been deployed for a few years. ZFS is intended to be an enterprise filesystem, so let's give it more time to mature before hitting it with the feature stick. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ More than anything, Bob's reply is my major feeling on this. Dedup may indeed turn out to be quite useful, but honestly, there's no broad data which says that it is a Big Win (tm) _right_now_, compared to finishing other features. I'd really want an Engineering Study about the real-world use (i.e. what percentage of the userbase _could_ use such a feature, what percentage _would_ use it, and exactly how useful each segment would find it...) before bumping it up in the priority queue of work to be done on ZFS. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
Re: [zfs-discuss] ZFS deduplication
On Tue, 22 Jul 2008, Miles Nordin wrote: scrubs making pools uselessly slow? Or should it be scrub-like so that already-written filesystems can be thrown into the dedup bag and slowly squeezed, or so that dedup can run slowly during the business day over data written quickly at night (fast outside-business-hours backup)? I think that the scrub-like model makes the most sense since ZFS write performance should not be penalized. It is useful to implement score-boarding so that a block is not considered for de-duplication until it has been duplicated a certain number of times. In order to decrease resource consumption, it is useful to perform de-duplication over a span of multiple days or multiple weeks doing just part of the job each time around. Deduping a petabyte of data seems quite challenging yet ZFS needs to be scalable to these levels. -- Bob Friesenhahn. In case anyone (other than Bob) missed it, this is why I suggested File-Level Dedup: ... using directory listings to produce files which were then 'diffed'. You could then view the diffs as though they were changes made ... We could have: Block-Level (if we wanted to restore an exact copy of the drive - duplicate the 'dd' command) or Byte-Level (if we wanted to use compression - duplicate the 'zfs set compression=on rpool' _or_ 'bzip' commands) ... etc... assuming we wanted to duplicate commands which already implement those features, and provide more than we (the filesystem) needs at a very high cost (performance). So I agree with your comment about the need to be mindful of resource consumption; the ability to do this over a period of days is also useful.
Indeed the Plan9 filesystem simply snapshots to WORM and has no delete - nor are they able to fill their drives faster than they can afford to buy new ones: Venti Filesystem http://www.cs.bell-labs.com/who/seanq/p9trace.html Rob
Re: [zfs-discuss] ZFS deduplication
Raw storage space is cheap. Managing the data is what is expensive. Not for my customer. Internal accounting means that the storage team gets paid for each allocated GB on a monthly basis. They have stacks of IO bandwidth and CPU cycles to spare outside of their daily busy period. I can't think of a better spend of their time than a scheduled dedup. Perhaps deduplication is a response to an issue which should be solved elsewhere? I don't think you can make this generalisation. For most people, yes, but not everyone. cheers, --justin
Re: [zfs-discuss] ZFS deduplication
Does anyone know a tool that can look over a dataset and give duplication statistics? I'm not looking for something incredibly efficient but I'd like to know how much it would actually benefit our Check out the following blog: http://blogs.sun.com/erickustarz/entry/how_dedupalicious_is_your_pool
Re: [zfs-discuss] ZFS deduplication
Justin Stringfellow wrote: Raw storage space is cheap. Managing the data is what is expensive. Not for my customer. Internal accounting means that the storage team gets paid for each allocated GB on a monthly basis. They have stacks of IO bandwidth and CPU cycles to spare outside of their daily busy period. I can't think of a better spend of their time than a scheduled dedup. Perhaps deduplication is a response to an issue which should be solved elsewhere? I don't think you can make this generalisation. For most people, yes, but not everyone. cheers, --justin Frankly, while I tend to agree with Bob that backend dedup is something that ever-cheaper disks and client-side misuse make unnecessary, I would _very_ much like us to have some sort of 'pay-per-feature' system, so people who disagree with me can still get what they want. grin By that, I mean something along the lines of a 'bounty' system where folks pony up cash for features. I'd love to have many more outside (from Sun) contributors to the OpenSolaris base, ZFS in particular. Right now, virtually all the development work is being driven by internal-to-Sun priorities, which, given that Sun pays the developers, is OK. However, I would really like to have some direct method where outsiders can show to Mgmt that there is direct cash for certain improvements. For Justin, it sounds like being able to pony up several thousand (minimum) for a desired feature would be no problem. And, for the rest of us, I can think that a couple of hundred of us putting up $100 each to get RAIDZ expansion might move it to the front of the TODO list. wink Plus, we might be able to attract some more interest from the hobbyist folks that way. :-) Buying a service contract and then bugging your service rep doesn't say the same thing as "I'm willing to pony up $10k right now for feature X."
Big customers have weight to throw around, but we need some mechanism where a mid/small guy can make a real statement, and back it up. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
Re: [zfs-discuss] ZFS deduplication
Just going to make a quick comment here. It's a good point about wanting backup software to support this, we're a much smaller company but it's already more difficult to manage the storage needed for backups than our live storage. However, we're actively planning that over the next 12 months, ZFS will actually *be* our backup system, so for us just ZFS and send/receive supporting de-duplication would be great :) In fact, I can see that being useful for a number of places. ZFS send/receive is already a good way to stream incremental changes and keep filesystems in sync. Having de-duplication built into that can only be a good thing. PS. Yes, we'll still have off-site tape backups just in case, but the vast majority of our backup restore functionality (including two off-site backups) will be just ZFS.
Re: [zfs-discuss] ZFS deduplication
Even better would be using the ZFS block checksums (assuming we are only summing the data, not its position or time :)... Then we could have two files that have 90% the same blocks, and still get some dedup value... ;) Yes, but you will need to add some sort of highly collision-resistant checksum (sha+md5 maybe) and code to (a) bit-level compare blocks on collision (100% bit verification) and (b) handle linked or cascaded collision tables (2+ blocks with the same hash but differing bits). I actually coded some of this and was playing with it. My testbed relied on another internal data store to track hash maps, collisions (dedup lists) and collision cascades (kind of like what perl does with hash key collisions). It turned out to be a real pain when taking into account snaps and clones. I decided to wait until the resilver/grow/remove code was in place, as this seems to be part of the puzzle. -Wade
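Wade's two requirements — bit-level verification on a hash match, and a collision chain so two distinct blocks sharing a digest can coexist — can be illustrated with a toy in-memory table. This is only a sketch of the data structure he describes, not how any shipping dedup implementation stores its state; all names are made up.

```python
import hashlib

class DedupTable:
    """Toy dedup table: checksum lookup, then a 100% bit-level compare,
    with a per-digest collision chain so two different blocks that
    happen to share a digest can both be stored."""

    def __init__(self):
        self.buckets = {}            # digest -> list of stored blocks

    def insert(self, block):
        """Return True if 'block' deduped against an existing copy."""
        digest = hashlib.sha256(block).digest()
        chain = self.buckets.setdefault(digest, [])
        for existing in chain:
            if existing == block:    # verify every bit, not just the hash
                return True
        chain.append(block)          # new block (or a true hash collision)
        return False
```

With a 256-bit hash the chains will essentially always have length one; the structure only matters if you use a weaker, faster checksum and rely on the bit-compare for correctness.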
Re: [zfs-discuss] ZFS deduplication
[EMAIL PROTECTED] wrote on 07/08/2008 03:08:26 AM: Does anyone know a tool that can look over a dataset and give duplication statistics? I'm not looking for something incredibly efficient but I'd like to know how much it would actually benefit our Check out the following blog: http://blogs.sun.com/erickustarz/entry/how_dedupalicious_is_your_pool Just want to add, while this is OK to give you a ballpark dedup number -- fletcher2 is notoriously collision prone on real data sets. It is meant to be fast at the expense of collisions. This issue can show much more dedup possible than really exists on large datasets.
Re: [zfs-discuss] ZFS deduplication
[EMAIL PROTECTED] wrote: [EMAIL PROTECTED] wrote on 07/08/2008 03:08:26 AM: Does anyone know a tool that can look over a dataset and give duplication statistics? I'm not looking for something incredibly efficient but I'd like to know how much it would actually benefit our Check out the following blog: http://blogs.sun.com/erickustarz/entry/how_dedupalicious_is_your_pool Just want to add, while this is OK to give you a ballpark dedup number -- fletcher2 is notoriously collision prone on real data sets. It is meant to be fast at the expense of collisions. This issue can show much more dedup possible than really exists on large datasets. Doing this using sha256 as the checksum algorithm would be much more interesting. I'm going to try that now and see how it compares with fletcher2 for a small contrived test. -- Darren J Moffat
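In the same spirit as the zdb-based estimate above, here is a hedged sketch of a standalone estimator: chunk each file at a chosen "recordsize", checksum every chunk with sha256, and report total blocks over unique blocks. It ignores real on-disk block alignment and compression, so like the fletcher2 count it only gives a ballpark ratio; the function name is invented for illustration.

```python
import hashlib
from collections import Counter

def estimate_dedup_ratio(paths, recordsize=128 * 1024):
    """Rough dedup ratio: total blocks / unique blocks when each file
    is chunked at a fixed recordsize and checksummed with sha256."""
    counts = Counter()
    for path in paths:
        with open(path, "rb") as f:
            while chunk := f.read(recordsize):
                counts[hashlib.sha256(chunk).digest()] += 1
    total = sum(counts.values())
    return total / len(counts) if counts else 1.0
```

A ratio of 2.0 would mean every block exists, on average, twice — i.e. dedup could roughly halve the space used, before metadata overhead.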
Re: [zfs-discuss] ZFS deduplication
Justin Stringfellow wrote: Raw storage space is cheap. Managing the data is what is expensive. Not for my customer. Internal accounting means that the storage team gets paid for each allocated GB on a monthly basis. They have stacks of IO bandwidth and CPU cycles to spare outside of their daily busy period. I can't think of a better spend of their time than a scheduled dedup. [donning my managerial accounting hat] It is not a good idea to design systems based upon someone's managerial accounting whims. These are subject to change in illogical ways at unpredictable intervals. This is why managerial accounting can be so much fun for people who want to hide costs. For example, some bright manager decided that they should charge $100/month/port for ethernet drops. So now, instead of having a centralized, managed network with well defined port mappings, every cube has an el-cheapo ethernet switch. Saving money? Not really, but this can be hidden by the accounting. In the interim, I think you will find that if the goal is to reduce the number of bits stored on some expensive storage, there is more than one way to accomplish that goal. -- richard
Re: [zfs-discuss] ZFS deduplication
On Jul 8, 2008, at 11:00 AM, Richard Elling wrote: much fun for people who want to hide costs. For example, some bright manager decided that they should charge $100/month/port for ethernet drops. So now, instead of having a centralized, managed network with well defined port mappings, every cube has an el-cheapo ethernet switch. Saving money? Not really, but this can be hidden by the accounting. Indeed, it actively hurts performance (mixing sunray, mobile, and fixed units on the same subnets rather than segregation by type). -- Keith H. Bierman [EMAIL PROTECTED] | AIM kbiermank 5430 Nassau Circle East | Cherry Hills Village, CO 80113 | 303-997-2749 speaking for myself* Copyright 2008
Re: [zfs-discuss] ZFS deduplication
On Tue, 8 Jul 2008, Richard Elling wrote: [donning my managerial accounting hat] It is not a good idea to design systems based upon someone's managerial accounting whims. These are subject to change in illogical ways at unpredictable intervals. This is why managerial accounting can be so Managerial accounting whims can be put to good use. If there is desire to reduce the amount of disk space consumed, then the accounting whims should make sure that those who consume the disk space get to pay for it. Apparently this is not currently the case, or else there would not be so much blatant waste. On the flip-side, the approach which results in so much blatant waste may be extremely profitable, so the waste does not really matter. Imagine if university students were allowed to use as much space as they wanted but had to pay a per-megabyte charge every two weeks or their account is terminated? This would surely result in a huge reduction in disk space consumption. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] ZFS deduplication
Something else came to mind which is a negative regarding deduplication. When zfs writes new sequential files, it should try to allocate blocks in a way which minimizes fragmentation (disk seeks). Disk seeks are the bane of existing storage systems since they come out of the available IOPS budget, which is only a couple hundred ops/second per drive. The deduplication algorithm will surely result in increasing effective fragmentation (decreasing sequential performance) since duplicated blocks will result in a seek to the master copy of the block followed by a seek to the next block. Disk seeks will remain an issue until rotating media goes away, which (in spite of popular opinion) is likely quite a while from now. Someone has to play devil's advocate here. :-) Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
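To put a rough number on Bob's seek argument: if every record read costs one seek, throughput collapses to the IOPS budget times the recordsize. This back-of-envelope sketch assumes a hypothetical 200 IOPS drive (the figure Bob cites) and common recordsizes; the function name is invented.

```python
def fragmented_throughput(iops=200, recordsize=128 * 1024):
    """MB/s achievable if every record read costs one full seek,
    i.e. the worst case for a heavily deduped/fragmented file."""
    return iops * recordsize / 1e6

# A modern drive streams ~100 MB/s sequentially. Fully fragmented:
#   128k records: 200 * 128 KiB ~= 26 MB/s  (a 4x hit)
#     8k records: 200 *   8 KiB ~= 1.6 MB/s (a ~60x hit)
```

So the penalty grows sharply as the recordsize shrinks, which is exactly the regime where dedup ratios are best — the tradeoff Bob is pointing at.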
Re: [zfs-discuss] ZFS deduplication
Hmmn, you might want to look at Andrew Tridgell's thesis (yes, Andrew of Samba fame), as he had to solve this very question to be able to select an algorithm to use inside rsync. --dave Darren J Moffat wrote: [EMAIL PROTECTED] wrote: [EMAIL PROTECTED] wrote on 07/08/2008 03:08:26 AM: Does anyone know a tool that can look over a dataset and give duplication statistics? I'm not looking for something incredibly efficient but I'd like to know how much it would actually benefit our Check out the following blog: http://blogs.sun.com/erickustarz/entry/how_dedupalicious_is_your_pool Just want to add, while this is OK to give you a ballpark dedup number -- fletcher2 is notoriously collision prone on real data sets. It is meant to be fast at the expense of collisions. This issue can show much more dedup possible than really exists on large datasets. Doing this using sha256 as the checksum algorithm would be much more interesting. I'm going to try that now and see how it compares with fletcher2 for a small contrived test. -- David Collier-Brown | Always do right. This will gratify Sun Microsystems, Toronto | some people and astonish the rest [EMAIL PROTECTED] | -- Mark Twain (905) 943-1983, cell: (647) 833-9377, (800) 555-9786 x56583 bridge: (877) 385-4099 code: 506 9191#
Re: [zfs-discuss] ZFS deduplication
[EMAIL PROTECTED] wrote on 07/08/2008 01:26:15 PM: Something else came to mind which is a negative regarding deduplication. When zfs writes new sequential files, it should try to allocate blocks in a way which minimizes fragmentation (disk seeks). Disk seeks are the bane of existing storage systems since they come out of the available IOPS budget, which is only a couple hundred ops/second per drive. The deduplication algorithm will surely result in increasing effective fragmentation (decreasing sequential performance) since duplicated blocks will result in a seek to the master copy of the block followed by a seek to the next block. Disk seeks will remain an issue until rotating media goes away, which (in spite of popular opinion) is likely quite a while from now. Yes, I think it should be close to common sense to realize that you are trading speed for space (but it should be well documented if dedup/squash ever makes it into the codebase). You find these types of tradeoffs in just about every area of disk administration, from the type of raid you select, inode numbers, block size, to the number of spindles and size of disk you use. The key here is that it would be a choice, just as compression is per fs -- let the administrator choose her path. In some situations it would make sense, in others not. -Wade Someone has to play devil's advocate here. :-) Debate is welcome, it is the only way to flesh out the issues. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] ZFS deduplication
Bob Friesenhahn wrote: Something else came to mind which is a negative regarding deduplication. When zfs writes new sequential files, it should try to allocate blocks in a way which minimizes fragmentation (disk seeks). It should, but because of its copy-on-write nature, fragmentation is a significant part of the ZFS data lifecycle. There was a discussion of this on this list at the beginning of the year... http://mail.opensolaris.org/pipermail/zfs-discuss/2007-November/044077.html Disk seeks are the bane of existing storage systems since they come out of the available IOPS budget, which is only a couple hundred ops/second per drive. The deduplication algorithm will surely result in increasing effective fragmentation (decreasing sequential performance) since duplicated blocks will result in a seek to the master copy of the block followed by a seek to the next block. Disk seeks will remain an issue until rotating media goes away, which (in spite of popular opinion) is likely quite a while from now. On ZFS, sequential files are rarely sequential anyway. The SPA tries to keep blocks nearby, but when dealing with snapshotted sequential files being rewritten, there is no way to keep everything in order. But if you read through the thread referenced above, you'll see that there's no clear data about just how that impacts performance (I still owe Mr. Elling a filebench run on one of my spare servers) --Joe
Re: [zfs-discuss] ZFS deduplication
On Tue, 8 Jul 2008, Moore, Joe wrote: On ZFS, sequential files are rarely sequential anyway. The SPA tries to keep blocks nearby, but when dealing with snapshotted sequential files being rewritten, there is no way to keep everything in order. I think that rewriting files (updating existing blocks) is pretty rare. Only limited types of applications do such things. That is a good thing since zfs is not so good at rewriting files. The most common situation is that a new file is written, even if selecting save for an existing file in an application. Even if the user thinks that the file is being re-written, usually the application writes to a new temporary file and moves it into place once it is known to be written correctly. The majority of files will be written sequentially and most files will be small enough that zfs will see all the data before it outputs to disk. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] ZFS deduplication
On Tue, Jul 8, 2008 at 12:25 PM, Bob Friesenhahn [EMAIL PROTECTED] wrote: On Tue, 8 Jul 2008, Richard Elling wrote: [donning my managerial accounting hat] It is not a good idea to design systems based upon someone's managerial accounting whims. These are subject to change in illogical ways at unpredictable intervals. This is why managerial accounting can be so Managerial accounting whims can be put to good use. If there is desire to reduce the amount of disk space consumed, then the accounting whims should make sure that those who consume the disk space get to pay for it. Apparently this is not currently the case or else there would not be so much blatant waste. On the flip-side, the approach which results in so much blatant waste may be extremely profitable so the waste does not really matter. The existence of the waste paves the way for new products to come in and offer competitive advantage over in-place solutions. When companies aren't buying anything due to budget constraints, the only way to make sales is to show businesses that by buying something they will save money - and quickly. Imagine if university students were allowed to use as much space as they wanted but had to pay a per-megabyte charge every two weeks or their account is terminated? This would surely result in huge reduction in disk space consumption. If you can offer the perception of more storage because efficiencies of the storage devices make it the same cost as less storage, then perhaps allocating more per student is feasible. Or maybe tuition could drop by a few bucks. -- Mike Gerdts http://mgerdts.blogspot.com/
Re: [zfs-discuss] ZFS deduplication
On Tue, Jul 8, 2008 at 1:26 PM, Bob Friesenhahn [EMAIL PROTECTED] wrote: Something else came to mind which is a negative regarding deduplication. When zfs writes new sequential files, it should try to allocate blocks in a way which minimizes fragmentation (disk seeks). Disk seeks are the bane of existing storage systems since they come out of the available IOPS budget, which is only a couple hundred ops/second per drive. The deduplication algorithm will surely result in increasing effective fragmentation (decreasing sequential performance) since duplicated blocks will result in a seek to the master copy of the block followed by a seek to the next block. Disk seeks will remain an issue until rotating media goes away, which (in spite of popular opinion) is likely quite a while from now. Someone has to play devil's advocate here. :-) With L2ARC on SSD, seeks are free and IOPs are quite cheap (compared to spinning rust). Cold reads may be a problem, but there is a reasonable chance that L2ARC sizing can be helpful here. Also, the blocks that are likely to be duplicate are going to be the same file but just with a different offset. That is, this file is going to be the same in every one of my LDom disk images. # du -h /usr/jdk/instances/jdk1.5.0/jre/lib/rt.jar 38M /usr/jdk/instances/jdk1.5.0/jre/lib/rt.jar There is a pretty good chance that the first copy will be sequential and as a result all of the deduped copies would be sequential as well. What's more - it is quite likely to be in the ARC or L2ARC. -- Mike Gerdts http://mgerdts.blogspot.com/
Re: [zfs-discuss] ZFS deduplication
Tim Spriggs wrote: Does anyone know a tool that can look over a dataset and give duplication statistics? I'm not looking for something incredibly efficient but I'd like to know how much it would actually benefit our dataset: HiRISE has a large set of spacecraft data (images) that could potentially have large amounts of redundancy, or not. Also, other up-and-coming missions have a large data volume with a lot of duplicate image info and a small budget; with d11p in OpenSolaris there is a good business case to invest in Sun/OpenSolaris rather than buy the cheaper storage (+ linux?) that can simply hold everything as is. If someone feels like coding up a tool that basically makes a file of checksums and counts how many times a particular checksum gets hit over a dataset, I would be willing to run it and provide feedback. :) -Tim Me too. Our data profile is just like Tim's: terabytes of satellite data. I'm going to guess that the d11p ratio won't be fantastic for us. I sure would like to measure it though. Jon -- - _/ _/ / - Jonathan Loran - - -/ / /IT Manager - - _ / _ / / Space Sciences Laboratory, UC Berkeley -/ / / (510) 643-5146 [EMAIL PROTECTED] - __/__/__/ AST:7731^29u18e3
Re: [zfs-discuss] ZFS deduplication
Justin Stringfellow wrote: Does anyone know a tool that can look over a dataset and give duplication statistics? I'm not looking for something incredibly efficient but I'd like to know how much it would actually benefit our Check out the following blog: http://blogs.sun.com/erickustarz/entry/how_dedupalicious_is_your_pool Unfortunately we are on Solaris 10 :( Can I get a zdb for zfs V4 that will dump those checksums? Jon -- - _/ _/ / - Jonathan Loran - - -/ / /IT Manager - - _ / _ / / Space Sciences Laboratory, UC Berkeley -/ / / (510) 643-5146 [EMAIL PROTECTED] - __/__/__/ AST:7731^29u18e3
Re: [zfs-discuss] ZFS deduplication
Moore, Joe wrote: On ZFS, sequential files are rarely sequential anyway. The SPA tries to keep blocks nearby, but when dealing with snapshotted sequential files being rewritten, there is no way to keep everything in order. In some cases, a d11p system could actually speed up data reads and writes. If you are repeatedly accessing duplicate data, then you will more likely hit your ARC, and not have to go to disk. With your data d11p'd, the ARC can hold a significantly higher percentage of your data set, just like the disks. For a d11p ARC, I would expire based upon block reference count. If a block has few references, it should expire first, and vice versa, blocks with many references should be the last out. With all the savings on disks, think how much RAM you could buy ;) Jon -- - _/ _/ / - Jonathan Loran - - -/ / /IT Manager - - _ / _ / / Space Sciences Laboratory, UC Berkeley -/ / / (510) 643-5146 [EMAIL PROTECTED] - __/__/__/ AST:7731^29u18e3
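Jon's eviction idea — expire lightly-referenced blocks first and keep heavily-shared blocks longest — amounts to ordering the cache by reference count. A minimal sketch (the name `evict_order` is made up, and real ARC eviction also weighs recency and frequency, which this ignores):

```python
import heapq

def evict_order(refcounts):
    """Yield block ids in eviction order for a dedup-aware cache:
    blocks with the fewest dedup references go first, and heavily
    shared blocks are kept until last."""
    heap = [(count, block_id) for block_id, count in refcounts.items()]
    heapq.heapify(heap)              # min-heap on reference count
    while heap:
        _, block_id = heapq.heappop(heap)
        yield block_id
```

The intuition is that a block with many dedup references is, in expectation, many times more likely to be requested again, so evicting it costs more future disk reads than evicting a singly-referenced block.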
Re: [zfs-discuss] ZFS deduplication
Mike Gerdts wrote: [I agree with the comments in this thread, but... I think we're still being old fashioned...] Imagine if university students were allowed to use as much space as they wanted but had to pay a per-megabyte charge every two weeks or their account is terminated? This would surely result in huge reduction in disk space consumption. If you can offer the perception of more storage because efficiencies of the storage devices make it the same cost as less storage, then perhaps allocating more per student is feasible. Or maybe tuition could drop by a few bucks. hmm... well, having spent the past two years at the University, I can provide the observation that: 0. Tuition never drops. 1. Everybody (yes everybody) had a laptop. I would say the average hard disk size per laptop was 100 GBytes. 2. Everybody (yes everybody) had USB flash drives. In part because the school uses them for recruitment tools (give-aways), but they are inexpensive, too. 3. Everybody (yes everybody) had an MP3 player of some magnitude. Many were disk-based, but there were many iPod Nanos, too. 4. 50% had smart phones -- crackberries, iPhones, etc. 5. The school actually provides some storage space, but I don't know anyone who took advantage of the service. E-mail and document sharing was outsourced to google -- no perceptible shortage of space there. Even Microsoft charges only $3/user/month for exchange and sharepoint services. I think many businesses would be hard-pressed to match that sort of efficiency. Unlike my undergraduate days, where we had to make trade-offs between beer and floppy disks, there does not seem to be a shortage of storage space amongst the university students today -- in spite of the rise of beer prices recently (hops shortage, they claim ;-O). Is the era of centralized home directories for students over?
I think that the normal enterprise backup scenarios are more likely to gain from de-dup, in part because they tend to make full backups of systems and end up with zillions of copies of (static) OS files. Actual work files tend to be smaller, for many businesses. De-dup on my desktop seems to be a non-issue. Has anyone done a full value chain or data path analysis for de-dup? Will de-dup grow beyond the backup function? Will the performance penalty of SHA-256 and bit comparison kill all interactive performance? Should I set aside a few acres at the ranch to grow hops? So many good questions, so little time... -- richard
[zfs-discuss] ZFS deduplication
Hi All ; Is there any hope for deduplication on ZFS ? Mertol Mertol Ozyoney Storage Practice - Sales Manager Sun Microsystems, TR Istanbul TR Phone +902123352200 Mobile +905339310752 Fax +90212335 Email [EMAIL PROTECTED]
Re: [zfs-discuss] ZFS deduplication
Mertol, Yes, dedup is certainly on our list and has been actively discussed recently, so there's hope and some forward progress. It would be interesting to see where it fits into our customers' priorities for ZFS. We have a long laundry list of projects. In addition, there are bug fixes and performance changes that customers are demanding. Neil. Mertol Ozyoney wrote: Hi All ; Is there any hope for deduplication on ZFS ? Mertol *Mertol Ozyoney * Storage Practice - Sales Manager *Sun Microsystems, TR* Istanbul TR Phone +902123352200 Mobile +905339310752 Fax +90212335 Email [EMAIL PROTECTED]
Re: [zfs-discuss] ZFS deduplication
A really smart nexus for dedup is right when archiving takes place. For systems like EMC Centera, dedup is basically a byproduct of checksumming. Two files with similar metadata that have the same hash? They're identical. Charles On 7/7/08 4:25 PM, Neil Perrin [EMAIL PROTECTED] wrote: Mertol, Yes, dedup is certainly on our list and has been actively discussed recently, so there's hope and some forward progress. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS deduplication
Even better would be using the ZFS block checksums (assuming we are only summing the data, not its position or time :)... Then we could have two files that have 90% the same blocks, and still get some dedup value... ;) Nathan. Charles Soto wrote: A really smart nexus for dedup is right when archiving takes place. For systems like EMC Centera, dedup is basically a byproduct of checksumming. Two files with similar metadata that have the same hash? They're identical. Charles ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS deduplication
Neil Perrin wrote: Yes, dedup is certainly on our list and has been actively discussed recently, so there's hope and some forward progress. I want to cast my vote for getting dedup on ZFS. One place we currently use ZFS is as nearline storage for backup data. I have a 16TB server that provides a file store for an EMC Networker server. I'm seeing a compressratio of 1.73, which is mighty impressive, since we also use native EMC compression during the backups. But with dedup, we should see way more. Here at UCB SSL, we have demoed and investigated various dedup products, hardware and software, but they are all steep on the ROI curve. I would be very excited to see block-level ZFS deduplication roll out. Especially since we already have the infrastructure in place using Solaris/ZFS. Cheers, Jon -- Jonathan Loran - IT Manager - Space Sciences Laboratory, UC Berkeley - (510) 643-5146 [EMAIL PROTECTED] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS deduplication
On Tue, 8 Jul 2008, Nathan Kroenert wrote: Even better would be using the ZFS block checksums (assuming we are only summing the data, not it's position or time :)... Then we could have two files that have 90% the same blocks, and still get some dedup value... ;) It seems that the hard problem is not if ZFS has the structure to support it (the implementation seems pretty obvious), but rather that ZFS is supposed to be able to scale to extremely large sizes. If you have a petabyte of storage in the pool, then the data structure to keep track of block similarity could grow exceedingly large. The block checksums are designed to be as random as possible so their value does not suggest anything regarding the similarity of the data unless the values are identical. The checksums have enough bits and randomness that binary trees would not scale. Except for the special case of backups or cloned server footprints, it does not seem that data deduplication is going to save the 90% (or more) space that Quantum claims at http://www.quantum.com/Solutions/datadeduplication/Index.aspx. ZFS clones already provide a form of data deduplication. The actual benefit of data deduplication to an enterprise seems negligible unless the backup system directly supports it. In the enterprise the cost of storage has more to do with backing up the data than the amount of storage media consumed. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
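Bob's point about checksum randomness is easy to demonstrate. A small Python illustration (SHA-256 standing in for the ZFS block checksum; the 4 KB block size is just an example):

```python
import hashlib

def block_checksum(data: bytes) -> str:
    """Checksum a block the way a dedup index would key it."""
    return hashlib.sha256(data).hexdigest()

a = b"\x42" * 4096                 # a 4 KB block
b = bytearray(a)
b[0] ^= 1                          # the same block with a single bit flipped
c = b"\x42" * 4096                 # an exact duplicate of a

# A good checksum is effectively random: a one-bit change yields a
# completely unrelated digest, so nothing short of an exact match
# tells you anything about how similar two blocks are.
print(block_checksum(a) == block_checksum(bytes(b)))  # False
print(block_checksum(a) == block_checksum(c))         # True
```

This is why a checksum index can only find exact duplicate blocks, never "90% similar" ones, and why an ordered structure like a binary tree buys nothing over a flat hash table: neighboring keys imply nothing about neighboring data.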
Re: [zfs-discuss] ZFS deduplication
On Mon, 7 Jul 2008, Jonathan Loran wrote: One place we currently use ZFS is as nearline storage for backup data. [...] But with dedup, we should see way more. I was going to say something smart about how zfs could contribute to improved serialized compression. However, I retract that and think that when one starts with order, it is best to preserve order and not attempt to re-create order once things have devolved into utter chaos. This deduplication technology seems similar to the Microsoft ads I see on TV which advertise how their new technology saves the customer 20% of the 500% additional cost incurred by Microsoft's previous technology (which was itself a band-aid to a previous technology). Sun/Solaris should be about being smarter rather than working harder. If data devolution is a problem, it is most likely that the solution is to investigate the root causes and provide solutions which do not lead to devolution. For example, if Berkeley has 30,000 students which all require a home directory with similar stuff, perhaps they can be initialized using ZFS clones so that there is little waste of space until a student modifies an existing file. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS deduplication
On Mon, Jul 07, 2008 at 07:56:26PM -0500, Bob Friesenhahn wrote: This deduplication technology seems similar to the Microsoft ads I see on TV which advertise how their new technology saves the customer Quantum's claim of 20:1 just doesn't jibe in my head, either, for some reason. -brian ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS deduplication
On Mon, Jul 7, 2008 at 7:40 PM, Bob Friesenhahn [EMAIL PROTECTED] wrote: The actual benefit of data deduplication to an enterprise seems negligible unless the backup system directly supports it. In the enterprise the cost of storage has more to do with backing up the data than the amount of storage media consumed.

Real data... I did a survey of about 120 (mostly sparse) zone roots installed over an 18 month period and used for normal enterprise activity. Each zone root is installed into its own SVM soft partition with a strong effort to isolate application data elsewhere. Each zone's /var (including /var/tmp) was included in the survey. My mechanism involved calculating the md5 checksum of every 4 KB block from the SVM raw device. This size was chosen because it is the fixed block size of the player in the market that does deduplication of live data today. My result was that I had 75% duplicate data - with no special effort to minimize duplicate data. If other techniques were applied to minimize duplicate data (e.g. periodic write of zeros over free space, extending the file system to do the same for freed blocks, mounting with noatime, etc.) or full root zones (or LDoms) were the subject of the test, I would expect a higher level of duplication.

Supposition... As I have considered deduplication for application data I see several things happen in various areas.

- Multiple business application areas use very similar software. When looking at various applications that directly (conscious choice) or indirectly (embedded in some other product) use various web servers, application servers, databases, etc., each application administrator uses the same installation media to perform an installation into a private (but commonly NFS-mounted) area. Many/most of these applications do a full installation of Java, which is a statistically significant portion of the installation.

- Maintenance activity creates duplicate data. When patching, upgrading, or otherwise performing maintenance, it is common to make a full copy or a fresh installation of the software. This allows most of the maintenance activity to be performed while the workload is live, as well as rapid fallback by making small configuration changes. The vast majority of the data in these multiple versions is identical (e.g. a small percentage of jars updated, maybe a bit of the included documentation, etc.).

- Application distribution tools create duplicate data. Some application-level clustering technologies cause a significant amount of data to be sent from the administrative server to the various cluster members. By application server design, this is duplicate data. If that data all resides on the same highly redundant storage frame, it could be reduced back down to one (or fewer) copies.

- Multiple development and release trees are duplicate. When various developers check out code from a source code repository, or a single developer has multiple copies to work on different releases, the checked-out code is nearly 100% duplicate and objects created during builds may be highly duplicate.

- Relying on storage-based snapshots and clones is impractical. There tend to be organizational walls between those that manage storage and those that consume it. As storage is distributed across a network (NFS, iSCSI, FC), things like delegated datasets and RBAC are of limited practical use. Due to these factors and likely others, storage snapshots and clones are only used for the few cases where there is a huge financial incentive with minimal administrative effort. Deduplication could be deployed on the back end to do what clones can't do due to non-technical reasons.

- Clones diverge permanently but shouldn't. If I have a 3 GB OS image (inside an 8 GB block device) that I am patching, there is a reasonable chance that I unzip 500 MB of patches to the system, apply the patches, then remove them. If deduplication is done at the block device level (e.g. iSCSI LUNs shared from a storage server), the space un-cloned by extracting the patches remains per-server used space. Additionally, the other space used by the installed patches remains used. Deduplication can reclaim the majority of the space.

-- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
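Mike's survey methodology (md5 over every fixed 4 KB block of a raw device) is simple to reproduce. A hedged Python sketch, not his actual script; the device path you'd pass it is site-specific:

```python
import hashlib
from collections import Counter

BLOCK = 4096  # fixed 4 KB blocks, matching the survey methodology

def duplicate_ratio(path: str) -> float:
    """Scan a raw device (or image file) and report the fraction of
    blocks that are duplicates of some block seen earlier."""
    counts = Counter()
    total = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK)
            if len(block) < BLOCK:
                break  # ignore any short tail
            counts[hashlib.md5(block).digest()] += 1
            total += 1
    # every block beyond the first copy of each checksum is a duplicate
    dupes = total - len(counts)
    return dupes / total if total else 0.0
```

Run against a zone root's raw device (e.g. something like `duplicate_ratio("/dev/md/rdsk/d100")`, path illustrative), a result of 0.75 would match the 75% duplication Mike reports. Note the Counter of digests is exactly the in-memory structure whose size Bob worries about at pool scale.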
Re: [zfs-discuss] ZFS deduplication
I second this, provided we also check that the data is in fact identical as well. Checksum collisions are likely given the sizes of disks and the sizes of checksums; and some users actually deliberately generate data with colliding checksums (researchers and nefarious users). Dedup must be absolutely safe and users should decide if they want the cost of checking blocks versus the space saving. Maurice On 08/07/2008, at 10:00 AM, Nathan Kroenert wrote: Even better would be using the ZFS block checksums (assuming we are only summing the data, not it's position or time :)... Then we could have two files that have 90% the same blocks, and still get some dedup value... ;) Nathan. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
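Maurice's "check the data is in fact identical" policy is cheap to express. A toy sketch (the in-memory dict stands in for the on-disk block store; this is an illustration of the policy, not ZFS code - though ZFS's eventual dedup implementation did ship a `verify` option with exactly this semantics):

```python
import hashlib

# Toy block store: checksum -> stored block. With verify enabled, two
# blocks are only coalesced after a full byte-for-byte comparison, so
# even a deliberately engineered hash collision cannot lose data.
store = {}

def dedup_write(block: bytes, verify: bool = True) -> bytes:
    """Write a block, returning its checksum key; duplicates are
    detected by checksum and (optionally) confirmed by comparison."""
    key = hashlib.sha256(block).digest()
    if key in store:
        if not verify or store[key] == block:
            return key  # duplicate: just reference the existing copy
        # checksums matched but data differs: a real collision
        raise RuntimeError("checksum collision: refusing to coalesce")
    store[key] = block
    return key
```

The trade-off Maurice names is visible here: with `verify=True` every duplicate write costs a read and compare of the existing block; with `verify=False` you trust the checksum and skip that I/O.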
Re: [zfs-discuss] ZFS deduplication
Good points. I see the archival process as a good candidate for adding dedup because it is essentially doing what a stage/release archiving system already does - faking the existence of data via metadata. Those blocks aren't actually there, but they're still accessible because they're *somewhere* the system knows about (i.e. the other twin). Currently in SAMFS, if I store two identical files on the archiving filesystem and my policy generates 4 copies, I will have created 8 copies of the file (albeit with different metadata). Dedup would help immensely here. And as archiving (data management) is inherently a costly operation, it's used where potentially slower access to data is acceptable. Another system that comes to mind that utilizes dedup is Xythos WebFS. As Bob points out, keeping track of dupes is a chore. IIRC, WebFS uses a relational database to track this (among much of its other metadata). Charles On 7/7/08 7:40 PM, Bob Friesenhahn [EMAIL PROTECTED] wrote: It seems that the hard problem is not if ZFS has the structure to support it (the implementation seems pretty obvious), but rather that ZFS is supposed to be able to scale to extremely large sizes. [...] ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS deduplication
On Mon, 7 Jul 2008, Mike Gerdts wrote: As I have considered deduplication for application data I see several things happen in various areas. You have provided an excellent description of gross inefficiencies in the way systems and software are deployed today, resulting in massive duplication. Massive duplication is used to ease service deployment and management. Most of this massive duplication is not technically necessary. There tend to be organizational walls between those that manage storage and those that consume it. As storage is distributed across a network (NFS, iSCSI, FC) things like delegated datasets and RBAC are of limited practical use. Due to these factors and likely It seems that deduplication on the server does not provide much benefit to the client since the client always sees a duplicate. It does not know that it doesn't need to cache or copy a block twice because it is a duplicate. Only the server benefits from the deduplication except that maybe server-side caching improves and provides the client with a bit more performance. While deduplication can obviously save server storage space, it does not seem to help much for backups, and it does not really help the user manage all of that data. It does help the user in terms of less raw storage space but there is surely a substantial run-time cost associated with the deduplication mechanism. None of the existing applications (based on POSIX standards) has any understanding of deduplication so they won't benefit from it. If you use tar, cpio, or 'cp -r', to copy the contents of a directory tree, they will transmit just as much data as before and if the destintation does real-time deduplication, then the copy will be slower. If the copy is to another server, then the copy time will be huge, just like before. Unless the backup system fully understands and has access to the filesystem deduplication mechanism, it will be grossly inefficient just like before. 
Recovery from a backup stored in a sequential (e.g. tape) format which does understand deduplication would be quite interesting indeed. Raw storage space is cheap. Managing the data is what is expensive. Perhaps deduplication is a response to an issue which should be solved elsewhere? Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer,http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS deduplication
On Mon, Jul 7, 2008 at 9:24 PM, Bob Friesenhahn [EMAIL PROTECTED] wrote: On Mon, 7 Jul 2008, Mike Gerdts wrote: There tend to be organizational walls between those that manage storage and those that consume it. As storage is distributed across a network (NFS, iSCSI, FC) things like delegated datasets and RBAC are of limited practical use. Due to these factors and likely It seems that deduplication on the server does not provide much benefit to the client since the client always sees a duplicate. It does not know that it doesn't need to cache or copy a block twice because it is a duplicate. Only the server benefits from the deduplication except that maybe server-side caching improves and provides the client with a bit more performance. I want the deduplication to happen where it can be most efficient. Just like with snapshots and clones, the client will have no idea that multiple metadata sets point to the same data. If deduplication makes it so that each GB of perceived storage is cheaper, clients benefit because the storage provider is (or should be) charging less. While deduplication can obviously save server storage space, it does not seem to help much for backups, and it does not really help the user manage all of that data. It does help the user in terms of less raw storage space but there is surely a substantial run-time cost associated with the deduplication mechanism. None of the existing applications (based on POSIX standards) has any understanding of deduplication so they won't benefit from it. If you use tar, cpio, or 'cp -r', to copy the contents of a directory tree, they will transmit just as much data as before and if the destintation does real-time deduplication, then the copy will be slower. If the copy is to another server, then the copy time will be huge, just like before. I agree. Follow-on work needs to happen in the backup and especially restore areas. 
The first phase of work in this area is complete when a full restore of all data (including snapshots and clones) takes the same amount of space as was occupied during backup. I suspect that if you take a look at the processor utilization on most storage devices you will find that there are lots of times that the processors are relatively idle. Deduplication can happen in real time when the processors are not very busy, but dirty block analysis should be queued during times of high processor utilization. If you find that the processor can't keep up with the deduplication workload it suggests that your processors aren't fast/plentiful enough or you have deduplication enabled on inappropriate data sets. The same goes for I/O induced by the dedupe process. In another message it was suggested that the size of the checksum employed by zfs is so large that maintaining a database of the checksums would be too costly. It may be that a multi-level checksum scheme is needed. That is, perhaps the database of checksums uses a 32-bit or 64-bit hash of the 256 bit checksum. If a hash collision occurs then normal I/O routines are used for comparing the checksums. If they are also the same, then compare the data. It may be that the intermediate comparison is more overhead than is needed because one set of data is already in cache and in the worst case an I/O is needed for the checksum or the data. Why do two I/O's if only one is needed? Unless the backup system fully understands and has access to the filesystem deduplication mechanism, it will be grossly inefficient just like before. Recovery from a backup stored in a sequential (e.g. tape) format which does understand deduplication would be quite interesting indeed. Right now it is a mess. Take a look at the situation for restoring snapshots/clones and you will see that unless you use deduplication during restore you need to go out and buy a lot of storage to do a restore or highly duplicate data. Raw storage space is cheap. 
Managing the data is what is expensive. The systems that make the raw storage scale to petabytes of fault tolerant storage are very expensive and sometimes quite complex. Needing fewer or smaller spindles should mean less energy consumption, less space, lower MTTR, higher MTTDL, and less complexity in all the hardware used to string it all together. Perhaps deduplication is a response to an issue which should be solved elsewhere? Perhaps. However, I take a look at my backup and restore options for zfs today and don't think the POSIX API is the right way to go - at least as I've seen it used so far. Unless something happens that makes restores of clones retain the initial space efficiency or deduplication hides the problem, clones are useless in most environments. If this problem is solved via fixing backups and restores, deduplication seems even more like the next step to take for storage efficiency. If it is solved by adding deduplication then we get the other benefits of deduplication at the same time.
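Mike's multi-level checksum idea - keep only a compact 64-bit hash of the 256-bit checksum in memory, and fall back to the full checksum and then the data only on a short-hash collision - can be sketched as follows. This is a toy illustrating the lookup layering, not a proposed ZFS data structure:

```python
import hashlib

class TwoLevelIndex:
    """Multi-level dedup lookup: a 64-bit prefix of the 256-bit block
    checksum keys the in-memory table; the full checksum and, finally,
    the data itself are only compared when the short key collides."""

    def __init__(self):
        # 64-bit key -> list of (full 256-bit checksum, stored block)
        self.short = {}

    def lookup_or_add(self, block: bytes) -> bool:
        full = hashlib.sha256(block).digest()
        key = full[:8]  # first 64 bits serve as the compact in-memory key
        for stored_full, stored_block in self.short.get(key, []):
            # second level: full checksum; third level: the data itself
            if stored_full == full and stored_block == block:
                return True   # duplicate found
        self.short.setdefault(key, []).append((full, block))
        return False          # first copy of this block
```

The memory saving is the point: the resident table holds 8-byte keys rather than 32-byte checksums, and (as Mike notes) the full-checksum comparison could even be skipped in favor of comparing the data directly, since the worst case needs one I/O either way.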
Re: [zfs-discuss] ZFS deduplication
Oh, I agree. Much of the duplication described is clearly the result of bad design in many of our systems. After all, most of an OS can be served off the network (diskless systems etc.). But much of the dupe I'm talking about is less about not using the most efficient system administration tricks. Rather, it's about the fact that software (e.g. Samba) is used by people, and people don't always do things efficiently. Case in point: students in one of our courses were hitting their quota by growing around 8GB per day. Rather than simply agree that these kids need more space, we had a look at the files. Turns out just about every student copied a 600MB file into their own directories, as it was created by another student to be used as a template for many of their projects. Nobody understood that they could use the file right where it sat. Nope. 7GB of dupe data. And these students are even familiar with our practice of putting class media on a read-only share (these files serve as similar templates for their own projects - you can create a full video project with just a few MB in your project file this way). So, while much of the situation is caused by bad data management, there aren't always systems we can employ that prevent it. Done right, dedup can certainly be worth it for my operations. Yes, teaching the user the right thing is useful, but that user isn't there to know how to manage data for my benefit. They're there to learn how to be filmmakers, journalists, speech pathologists, etc. Charles On 7/7/08 9:24 PM, Bob Friesenhahn [EMAIL PROTECTED] wrote: On Mon, 7 Jul 2008, Mike Gerdts wrote: As I have considered deduplication for application data I see several things happen in various areas. You have provided an excellent description of gross inefficiencies in the way systems and software are deployed today, resulting in massive duplication. Massive duplication is used to ease service deployment and management. 
Most of this massive duplication is not technically necessary. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS deduplication
On Mon, Jul 7, 2008 at 11:07 PM, Charles Soto [EMAIL PROTECTED] wrote: So, while much of the situation is caused by bad data management, there aren't always systems we can employ that prevent it. Done right, dedup can certainly be worth it for my operations. Yes, teaching the user the right thing is useful, but that user isn't there to know how to manage data for my benefit. They're there to learn how to be filmmakers, journalists, speech pathologists, etc. Well said. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS deduplication
Does anyone know a tool that can look over a dataset and give duplication statistics? I'm not looking for something incredibly efficient but I'd like to know how much it would actually benefit our dataset: HiRISE has a large set of spacecraft data (images) that could potentially have large amounts of redundancy, or not. Also, other up and coming missions have a large data volume that have a lot of duplicate image info and a small budget; with d11p in OpenSolaris there is a good business case to invest in Sun/OpenSolaris rather than buy the cheaper storage (+ linux?) that can simply hold everything as is. If someone feels like coding a tool up that basically makes a file of checksums and counts how many times a particular checksum gets hit over a dataset, I would be willing to run it and provide feedback. :) -Tim Charles Soto wrote: So, while much of the situation is caused by bad data management, there aren't always systems we can employ that prevent it. Done right, dedup can certainly be worth it for my operations. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
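The tool Tim describes - checksum every chunk of every file and count how often each checksum repeats - is a short script. A rough Python sketch under stated assumptions (recordsize-aligned chunking, SHA-256; the function name and default are illustrative, and the estimate is approximate since files are chunked independently and tails are short):

```python
import hashlib
import os
from collections import Counter

def dedup_stats(root: str, recordsize: int = 128 * 1024):
    """Walk a directory tree, checksum every recordsize-sized chunk of
    every regular file, and return (total chunks, unique chunks)."""
    counts = Counter()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    while chunk := f.read(recordsize):
                        counts[hashlib.sha256(chunk).digest()] += 1
            except OSError:
                continue  # unreadable file: skip it, as a scanner must
    total = sum(counts.values())
    return total, len(counts)

# Example report: total/unique approximates the achievable dedup ratio.
# total, unique = dedup_stats("/hirise/data")   # path illustrative
# print(f"{total} chunks, {unique} unique, ~{total / unique:.2f}x dedup")
```

Running it at several recordsize values would also show the effect Brandon describes in the VHD thread: smaller records generally find more duplicate blocks, at the cost of a larger checksum table.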