Re: Offline Deduplication for Btrfs
Hi,

I like your idea and implementation for offline deduplication a lot. I think it will save me 50% of my backup storage!

Your code walks/scans the directory/file tree of the filesystem. Would it be possible to walk/scan the disk extents sequentially, in disk order?

- This would be more I/O-efficient.
- It would save you reading previously deduped/snapshotted/hardlinked files more than once.
- Maybe it would make it possible to deduplicate directories as well.

Met vriendelijke groet (kind regards),
Arjen Nienhuis

P.S. The NTFS implementation on Windows has ioctls to read the MFT sequentially in disk order, and it's *fast*. It's being used for things like defrag.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Offline Deduplication for Btrfs
I think that dedup has a variety of use cases that are all very dependent on your workload. The approach you have here seems to be quite a reasonable one.

I did not see it in the code, but it would be great to be able to collect statistics on how effective your hash is, and any counters for the extra IO imposed.

It would also be very useful to have a paranoid mode where, when you see a hash collision (dedup candidate), you fall back to a byte-by-byte compare to verify that the blocks really are identical. Keeping stats on how often the hash match turns out to be a false collision would be quite interesting as well :)

Ric
Re: Offline Deduplication for Btrfs
On Mon, Jan 10, 2011 at 10:28:14AM -0500, Ric Wheeler wrote:
> I think that dedup has a variety of use cases that are all very dependent on your workload. The approach you have here seems to be quite a reasonable one. I did not see it in the code, but it would be great to be able to collect statistics on how effective your hash is, and any counters for the extra IO imposed.

So I have counters for how many extents are deduped and the overall file savings; is that what you are talking about?

> It would also be very useful to have a paranoid mode where, when you see a hash collision (dedup candidate), you fall back to a byte-by-byte compare to verify that the blocks really are identical. Keeping stats on how often the hash match turns out to be a false collision would be quite interesting as well :)

So I've always done a byte-by-byte compare, first in userspace but now in the kernel, because frankly I don't trust hashing algorithms with my data. It would be simple enough to keep statistics on how often the byte-by-byte compare comes out wrong, but really the compare is there to catch changes to the file, so I suspect most of those statistics would simply mean that the file changed, not that the hash collided.

Thanks,

Josef
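The pipeline Josef and Ric are discussing (hash to find duplicate candidates, then a byte-by-byte compare before trusting the match, with a counter for false hits) can be sketched in a few lines. This is an illustrative model, not the actual btrfs-progs code; the block size and SHA-256 are assumptions:

```python
import hashlib
from collections import defaultdict

BLOCK = 4096  # hypothetical fixed block size; the real tool lets this vary


def find_duplicate_blocks(buffers):
    """Group equal-sized blocks by hash, then verify byte-by-byte.

    `buffers` maps a name to a bytes object standing in for file data.
    Returns (verified duplicate pairs, count of false hash collisions).
    """
    by_hash = defaultdict(list)
    for name, data in buffers.items():
        for off in range(0, len(data), BLOCK):
            block = data[off:off + BLOCK]
            by_hash[hashlib.sha256(block).digest()].append((name, off, block))

    verified, false_hits = [], 0
    for candidates in by_hash.values():
        first = candidates[0]
        for other in candidates[1:]:
            # Paranoid mode: never trust the hash alone.
            if first[2] == other[2]:
                verified.append((first[:2], other[:2]))
            else:
                false_hits += 1  # the stat Ric asked for; ~never with SHA-256
    return verified, false_hits
```

With two buffers sharing one 4 KiB block's content, the verifier confirms the duplicates and the false-collision counter stays at zero.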
Re: Offline Deduplication for Btrfs
Excerpts from Josef Bacik's message of 2011-01-10 10:37:31 -0500:
> On Mon, Jan 10, 2011 at 10:28:14AM -0500, Ric Wheeler wrote:
>> [...] It would also be very useful to have a paranoid mode where, when you see a hash collision (dedup candidate), you fall back to a byte-by-byte compare to verify that the blocks really are identical. [...]
>
> So I've always done a byte-by-byte compare, first in userspace but now in the kernel, because frankly I don't trust hashing algorithms with my data. It would be simple enough to keep statistics on how often the byte-by-byte compare comes out wrong, but really the compare is there to catch changes to the file, so I suspect most of those statistics would simply mean that the file changed, not that the hash collided.

At least in the kernel, if you're comparing extents on disk that are from a committed transaction, the contents won't change. We could read into a private buffer instead of into the file's address space to make this more reliable/strict.

-chris
Re: Offline Deduplication for Btrfs
On Mon, Jan 10, 2011 at 10:39:56AM -0500, Chris Mason wrote:
> Excerpts from Josef Bacik's message of 2011-01-10 10:37:31 -0500:
>> So I've always done a byte-by-byte compare, first in userspace but now in the kernel, because frankly I don't trust hashing algorithms with my data. [...]
>
> At least in the kernel, if you're comparing extents on disk that are from a committed transaction, the contents won't change. We could read into a private buffer instead of into the file's address space to make this more reliable/strict.

Right, sorry, I was talking about the userspace case.

Thanks,

Josef
Re: Offline Deduplication for Btrfs
On Thursday, January 06, 2011 01:35:15 pm Chris Mason wrote:
>> What is the smallest granularity that the Data Domain searches for in terms of dedup?
>
> Josef's current setup isn't restricted to a specific block size, but there is a min match of 4k.

I talked to a few people I know and didn't get a clear answer either... However, 512 bytes came up more than once. I'm not really worried about the size of the region to be used, but about offsetting it... it's so easy to create large tars, ... where the content is offset by a few bytes, multiples of 512 and such.

Peter.

--
Censorship: noun, circa 1591. a: Relief of the burden of independent thinking.
Re: Offline Deduplication for Btrfs
On Thu, Jan 6, 2011 at 12:36 AM, Josef Bacik jo...@redhat.com wrote:
> Here are patches to do offline deduplication for Btrfs. It works well for the cases it's expected to. I'm looking for feedback on the ioctl interface and such; I'm well aware there are missing features in the userspace app (like being able to set a different blocksize). If this interface is acceptable I will flesh out the userspace app a little more, but I believe the kernel side is ready to go.
>
> Basically I think online dedup is a huge waste of time and completely useless. You are going to want to do different things with different data. For example, for a mail server you are going to want very small blocksizes, but for, say, a virtualization image store you are going to want much larger blocksizes. And let's not get into heterogeneous environments; those just get much too complicated. So my solution is batched dedup, where a user just runs this command and it dedups everything at that point. This avoids the very costly overhead of having to hash and look up duplicate extents online, and lets us be _much_ more flexible about what we want to deduplicate and how we want to do it.
>
> For the userspace app it only does 64k blocks, or whatever the largest area it can read out of a file is. I'm going to extend this to do the following things in the near future:
>
> 1) Take the blocksize as an argument so we can have bigger/smaller blocks
> 2) Have an option to _only_ honor the blocksize, and not try to dedup smaller blocks
> 3) Use fiemap to try to dedup extents as a whole, ignoring specific blocksizes
> 4) Use fiemap to determine what would be the most optimal blocksize for the data you want to dedup
>
> I've tested this out on my setup and it seems to work well. I appreciate any feedback you may have. Thanks,

FYI: using the clone ioctl can do the same thing (except for reading the data and computing the hash in user space).

Yan, Zheng
Re: Offline Deduplication for Btrfs
> I have been thinking a lot about de-duplication for a backup application I am writing. I wrote a little script to figure out how much it would save me. For my laptop home directory, about 100 GiB of data, it was a couple of percent, depending a bit on the size of the chunks. With 4 KiB chunks, I would save about two gigabytes. (That's assuming no MD5 hash collisions.) I don't have VM images, but I do have a fair bit of saved e-mail. So, for backups, I concluded it was worth it to provide an option to do this. I have no opinion on whether it is worthwhile to do in btrfs.

Online deduplication is very useful for backups of big, multi-gigabyte files which change constantly. Some mail servers store files this way; some MUAs store their files like this; databases also commonly pack everything into big files which tend to change here and there almost all the time.

Multi-gigabyte files which only have a few megabytes changed can't be hardlinked; simple maths shows that even compressing multiple files which have few differences will use more space than deduplication, where each extra copy costs only its few changed megabytes (because everything else is deduplicated). And I don't even want to think about the IO needed to offline-dedup multi-terabyte storage (1 TB disks and bigger are becoming standard nowadays), e.g. daily, especially when the storage is already heavily loaded in IO terms.

Now, one popular tool which can deal with small changes in files is rsync. It can be used to copy files over the network, so that if you want to copy/update a multi-gigabyte file which only has a few changes, rsync would need to transfer just a few megabytes. On disk, however, rsync creates a temporary copy of the original file, where it packs unchanged contents together with any changes made. For example, while it copies/updates a file, we will have:

original_file.bin
.temporary_random_name

Later, original_file.bin is removed, and .temporary_random_name is renamed to original_file.bin. There goes any deduplication we had so far; we have to start the IO all over again.

--
Tomasz Chmielewski
http://wpkg.org
Re: Offline Deduplication for Btrfs
Chris Mason wrote:
> Excerpts from Gordan Bobic's message of 2011-01-05 12:42:42 -0500:
>> Josef Bacik wrote:
>>> Basically I think online dedup is a huge waste of time and completely useless.
>> I couldn't disagree more. First, let's consider what the general-purpose use-case of data deduplication is. What are the resource requirements to perform it? How do these resource requirements differ between online and offline?
>
> I don't really agree with Josef that dedup is dumb, but I do think his current approach is the most reasonable. Dedup has a few very valid use cases, which I think break down to:
>
> 1) backups
> 2) VM images
>
> The backup farm use case is the best candidate for dedup in general because backups are generally write once and hopefully read never. Fragmentation for reading doesn't matter at all, and we're really very sure we're going to back up the same files over and over again. But it's also something that will be dramatically more efficient when the backup server helps out. The backup server knows two files have the same name and same size, and can guess with very high accuracy that they will be the same. So it is a very good candidate for Josef's offline dedup, because it can just do the dedup right after writing the file.

File-level deduplication in addition to block-level would be great, no argument there. This can again be done more efficiently in-line, though, as the files come in.

> Next is the VM images. This is actually a much better workload for online dedup, except for the part where our poor storage server would be spending massive amounts of CPU deduping blocks for all the VMs on the machine. In this case the storage server doesn't know the filenames; it just has bunches of blocks that are likely to be the same across VMs.

I'm still unconvinced that deduping's major cost is CPU. I think in any real case it'll be disk I/O.

> So, it seems a bit silly to do this out of band, where we wander through the FS and read a bunch of blocks in hopes of finding ones with the same hash.

Except you can get this almost for free. How about this approach:

1) Store a decent-size hash for each block (checksums for ECC - something like this already happens, it's just a question of what hashing algorithm to use).

2) Keep a (btree?) index of all known hashes (this doesn't happen at the moment, AFAIK, so this would be the bulk of the new cost for dedup).

Now there are two options:

3a) Offline - go through the index, find the blocks with duplicate hashes, relink the pointers to one of them and free the rest. There is no need to actually read/write any data unless we are doing a full block compare; only metadata needs to be updated. The problem with this is that you would still have to do a full scan of the index to find all the duplicates, unless there is a second index specifically listing the duplicate blocks (maintained at insertion time).

3b) Online - look up whether the hash for the current block is already in the index (O(log(n))), and if it is, don't bother writing the data block, only add a pointer to the existing block. No need for a second index of duplicate blocks in this case, either.

> But, one of the things on our features-to-implement page is to wander through the FS and read all the blocks from time to time. We want to do this in the background to make sure the bits haven't rotted on disk. By scrubbing from time to time we are able to make sure that when a disk does die, other disks in the FS are likely to have a good copy.

Scrubbing the whole FS seems like an overly expensive way to do things, and it also requires low-load periods (which don't necessarily exist). How about scrubbing disk-by-disk, rather than the whole FS? If we keep checksums per block, then each disk can be checked for rot independently. It also means that if redundancy is used, the system doesn't end up anywhere near as crippled during the scrubbing, as requests can be served from other disks that are part of the FS (e.g. the mirrored pair in RAID1, or the parity blocks in higher-level RAIDs).

> So again, Josef's approach actually works very well. His dedup util could be the scrubbing util and we'll get two projects for the price of one.

Indeed, the scrubber would potentially give deduping functionality for free, but I'm not convinced that having deduping depend on scrubbing is the way forward. This is where we get to multi-tier deduping again - perhaps have things markable for online or offline dedupe, as deemed more appropriate?

> As for the security of hashes, we're unlikely to find a collision on a sha256 that wasn't made maliciously. If the system's data is controlled and you're not worried about evil people putting files on there, extra reads really aren't required.

If you manage to construct one maliciously, that's pretty bad in itself, though. :) But then again, extra reads are a good thing (see above about scrubbing). Only under very, very controlled
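Gordan's option 3b, the online path, can be modeled with a toy block store. The structures here are hypothetical stand-ins (a real filesystem would keep the index in a btree and refcount extents), but the control flow is the one described: look up the hash at write time, and either take a reference or store new data:

```python
import hashlib


class ToyDedupStore:
    """In-memory model of an online-dedup write path (option 3b above)."""

    def __init__(self):
        self.blocks = {}   # block_id -> data
        self.refs = {}     # block_id -> refcount
        self.index = {}    # hash -> block_id (the btree index in a real FS)
        self.next_id = 0

    def write_block(self, data):
        h = hashlib.sha256(data).digest()
        bid = self.index.get(h)
        if bid is not None and self.blocks[bid] == data:  # paranoid verify
            self.refs[bid] += 1       # duplicate: just add a pointer
            return bid
        bid, self.next_id = self.next_id, self.next_id + 1
        self.blocks[bid] = data       # new content: store it
        self.refs[bid] = 1
        self.index[h] = bid
        return bid
```

Writing the same block twice yields one stored copy with a refcount of two; the duplicate write costs one index lookup instead of a data write, which is exactly the trade-off being argued over in this thread.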
Re: Offline Deduplication for Btrfs
On Thu, Jan 06, 2011 at 10:37:46AM +0100, Tomasz Chmielewski wrote:
> [...]
>
> Now, one popular tool which can deal with small changes in files is rsync. It can be used to copy files over the network, so that if you want to copy/update a multi-gigabyte file which only has a few changes, rsync would need to transfer just a few megabytes. On disk, however, rsync creates a temporary copy of the original file, where it packs unchanged contents together with any changes made. For example, while it copies/updates a file, we will have:
>
> original_file.bin
> .temporary_random_name
>
> Later, original_file.bin is removed, and .temporary_random_name is renamed to original_file.bin. There goes any deduplication we had so far; we have to start the IO all over again.

Sounds like all you need is cp --reflink=always and rsync --inplace. Haven't tested whether that works well, though.

Mike
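Mike's suggestion can be spelled out as a command sequence. This is a sketch of the idea, not a tested recipe from the thread: `cp --reflink=always` makes a snapshot copy that shares all extents with the original (it requires a reflink-capable filesystem such as Btrfs), and `rsync --inplace` then rewrites only the changed regions, so only those extents become unshared. The paths are illustrative:

```shell
# Keep yesterday's state as a reflink copy: no data blocks are duplicated;
# both names initially point at the same extents (Btrfs/reflink only).
cp --reflink=always /backup/current/huge.img /backup/2011-01-06/huge.img

# Update in place: rsync writes changed regions directly into the existing
# file instead of building .temporary_random_name and renaming it over the
# original, so unchanged extents stay shared with the reflinked snapshot.
# (--no-whole-file forces the delta algorithm even for local copies.)
rsync --inplace --no-whole-file user@source:/data/huge.img /backup/current/huge.img
```

Note that --inplace has its own caveats (the destination is inconsistent while the transfer runs), which is presumably part of what Mike means by "haven't tested whether that works well".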
Re: Offline Deduplication for Btrfs
Spelic wrote:
> On 01/06/2011 02:03 AM, Gordan Bobic wrote:
>> That's just alarmist. AES is being cryptanalyzed because everything uses it. And the news of its insecurity is somewhat exaggerated (for now, at least).
>
> Who cares... the fact of not being much used is a benefit for RIPEMD / Blowfish-Twofish then. Nobody makes viruses for Linux because they target Windows. Same thing... RIPEMD still has an advantage over SHA imho, and Blowfish over AES.

Just because nobody has attacked it yet doesn't justify complacency.

> If there is a full-block compare, a simpler/faster algorithm could be chosen, like MD5. Or even an "MD-64bits", which I don't think exists, but you could take MD4 and xor the first 8 bytes with the second 8 bytes so as to reduce it to 8 bytes only. This is just because it saves 60% of the RAM occupied during dedup, which is expected to be large, and collisions are still insignificant at 64 bits. Clearly you need to do a full-block compare after that.
>
>> I really don't think the cost in terms of a few bytes per file for the hashes is that significant.
>
> 20 to 8 = 12 bytes saved per *filesystem block*, I think.

Aren't we talking about block-level deduplication?

> For every TB of filesystem you occupy 2GB of RAM with hashes instead of 5.3GB (I am assuming 4K blocks; I don't remember how big btrfs blocks are). For a 24 * 2TB storage you occupy 96GB instead of 254GB of RAM. It might be the edge between feasible and not feasible. Actually it might not be feasible anyway... an option to store hashes on an SSD should be provided then...

You wouldn't necessarily have to keep the whole index in RAM, but if you don't, you get hit with an extra O(log(n)) disk seeks.

Gordan
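Spelic's arithmetic is easy to check: one hash per 4 KiB block, 20 bytes for a SHA-1-sized digest versus 8 bytes truncated. His per-TB numbers match if "TB" is read as TiB (2^40 bytes); the sketch below reproduces them, ignoring the pointer and bucket overhead a real in-memory index would add on top:

```python
def index_ram_bytes(fs_bytes, block=4096, hash_len=20):
    """RAM needed to hold one hash per filesystem block (hash payload
    only; a real hash table or btree adds substantial overhead)."""
    return (fs_bytes // block) * hash_len


TiB = 1 << 40

# ~5.37e9 bytes per TiB with 20-byte hashes (Spelic's "5.3GB")
print(index_ram_bytes(TiB, hash_len=20))
# ~2.15e9 bytes per TiB with 8-byte hashes (Spelic's "2GB")
print(index_ram_bytes(TiB, hash_len=8))
# 24 x 2 TiB array with 20-byte hashes: ~258e9 bytes (his "254GB" ballpark)
print(index_ram_bytes(24 * 2 * TiB, hash_len=20))
```

The factor of 2.5 between the two hash widths is exactly the "edge between feasible and not feasible" being debated.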
Re: Offline Deduplication for Btrfs
Tomasz Chmielewski wrote:
> [...]
>
> Now, one popular tool which can deal with small changes in files is rsync. [...] On disk, however, rsync creates a temporary copy of the original file, where it packs unchanged contents together with any changes made. For example, while it copies/updates a file, we will have:
>
> original_file.bin
> .temporary_random_name
>
> Later, original_file.bin is removed, and .temporary_random_name is renamed to original_file.bin. There goes any deduplication we had so far; we have to start the IO all over again.

You can tell rsync to either modify the file in place (--inplace) or to put the temp file somewhere else (--temp-dir=DIR).

Gordan
Re: Offline Deduplication for Btrfs
Gordan Bobic wrote:
> Josef Bacik wrote:
>> Basically I think online dedup is a huge waste of time and completely useless.
>
> I couldn't disagree more. First, let's consider what the general-purpose use-case of data deduplication is. What are the resource requirements to perform it? How do these resource requirements differ between online and offline?

[snip]

> As an aside, zfs and lessfs both do online deduping, presumably for a good reason. Then again, for a lot of use-cases there are perhaps better ways to achieve the target goal than deduping at the FS level, e.g. snapshotting or something like fl-cow: http://www.xmailserver.org/flcow.html

Just a small point: Josef's work provides a building block for a userspace notify-based online dedupe daemon.

The basic idea is to use fanotify/inotify (whichever of the notification systems works for this) to track which inodes have been written to. The daemon can then mmap() the changed data (before it's been dropped from RAM) and do the same process as an offline dedupe (hash, check for matches, call the dedupe extent ioctl). If you've got enough CPU (maybe running with realtime privs), you should be able to do this before the writes actually hit the disk.

Further, a userspace daemon can do more sophisticated online dedupe than is reasonable in the kernel - e.g. queue the dedupe-ioctl phase for idle time, only dedupe inodes that have been left unwritten for x minutes, or apply different policies to different bits of the filesystem (dedupe crontabs immediately on write, dedupe the outgoing mail spool only when the mail sticks around for a while, dedupe all incoming mail immediately, dedupe logfiles after rotation only - whatever is appropriate). It can also do more intelligent trickery than is reasonable in-kernel - e.g. if you know that you're deduping e-mail (line-based), you can search line-by-line for dedupe blocks, rather than byte-by-byte.

Having said all that, you may well find that having implemented a userspace online dedupe daemon, there are things the kernel can do to help; you may even find that you do need to move it entirely into the kernel. Just don't think that this ioctl rules out online dedupe - in fact, it enables it.

--
Simon Farnsworth
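The daemon Simon describes can be sketched as a skeleton that decouples the notification side from the dedupe side. This is a toy model under stated assumptions: the event source is a plain queue standing in for fanotify/inotify (which the Python standard library does not expose), reading a file stands in for mmap() of the dirty pages, and recording a pair stands in for the dedupe-extent ioctl:

```python
import hashlib
import queue
from collections import defaultdict

BLOCK = 4096


class DedupDaemon:
    """Skeleton of the userspace online-dedup daemon described above.

    Real code would feed `events` from fanotify/inotify and issue the
    dedupe-extent ioctl where `deduped` is appended to; both ends are
    stubbed here so the queueing/policy structure is visible.
    """

    def __init__(self):
        self.events = queue.Queue()       # paths (inodes) with fresh writes
        self.by_hash = defaultdict(list)  # block hash -> [(path, offset)]
        self.deduped = []                 # would be ioctl calls in real life

    def notify_written(self, path):
        self.events.put(path)             # called from the notification side

    def run_once(self, read_file):
        """Process one event; idle-time or per-path policy would go here."""
        path = self.events.get()
        data = read_file(path)            # stand-in for mmap() of the file
        for off in range(0, len(data), BLOCK):
            h = hashlib.sha256(data[off:off + BLOCK]).digest()
            for prior in self.by_hash[h]:
                if prior != (path, off):
                    self.deduped.append((prior, (path, off)))
                    break
            self.by_hash[h].append((path, off))
```

The point of the structure is that `run_once` can be scheduled however policy dictates (immediately, at idle, after x minutes) without touching the notification side.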
Re: Offline Deduplication for Btrfs
Simon Farnsworth wrote:
> The basic idea is to use fanotify/inotify (whichever of the notification systems works for this) to track which inodes have been written to. The daemon can then mmap() the changed data (before it's been dropped from RAM) and do the same process as an offline dedupe (hash, check for matches, call the dedupe extent ioctl). If you've got enough CPU (maybe running with realtime privs), you should be able to do this before the writes actually hit the disk.

I'm not convinced that racing against the disk write is the way forward here. As for having enough CPU to do this, a lot of modern CPUs (ARM, SPARC, Xeon) actually have hardware crypto acceleration/offload, so calculating checksums is fast and cheap.

Gordan
Re: Offline Deduplication for Btrfs
Gordan Bobic wrote:
> Simon Farnsworth wrote:
>> The basic idea is to use fanotify/inotify (whichever of the notification systems works for this) to track which inodes have been written to. [...] If you've got enough CPU (maybe running with realtime privs), you should be able to do this before the writes actually hit the disk.
>
> I'm not convinced that racing against the disk write is the way forward here.

The point is that implementing a userspace online dedupe daemon that races against the disk write is something that can be done by anyone who cares, as soon as Josef's patch is in place. If it turns out that the userspace daemon does something simple enough to put in the kernel (e.g. fixed-block-size dedupe), and that the extra complexity doesn't gain enough to be worthwhile, the code can be ported into the kernel before it gets posted here.

Similarly, if you're convinced that it has to be in-kernel (I'm not a dedupe or filesystems expert, so there may be good reasons I'm unaware of), you can reuse parts of Josef's code to write a patch that creates a kernel thread to do the work.

If it turns out that complex algorithms for online dedupe are worth the effort (like line-by-line e-mail dedupe), then you've got a starting point for writing something more sophisticated, and for determining just what the kernel needs to provide to make things nice - maybe it'll become clear that you need an interface that lets you hold up a write while you do the simple end of the dedupe work, or maybe some other kernel interface, more generic than "dedupe fixed-size blocks", is needed for efficient work.

Either way, Josef's work is a good starting point for online dedupe; you can experiment *now* without going into kernel code (heck, maybe not even in C - Python or Perl would be OK for algorithm exploration), and use the offline dedupe support to simplify a patch for online dedupe.

--
Simon Farnsworth
Re: Offline Deduplication for Btrfs
On Thursday, January 06, 2011 05:48:18 am you wrote:
> Can you elaborate what you're talking about here? How does the length of a directory name affect alignment of file block contents? I don't see how variability of length matters, other than to make things a lot more complicated.

I'm saying in a filesystem it doesn't matter - if you bundle everything into a backup stream, it does. Think of tar: 512-byte alignment. I tar up a directory with 8TB total size. No big deal. Now I create a new, empty file in this dir with a name that just happens to sort first in the dir. This adds 512 bytes close to the beginning of the tar file the second time I run tar. Now the remainder of the archive is all offset by 512 bytes, and if you dedupe on fs-block-sized chunks larger than those 512 bytes, not a single byte will be de-duped. I know it's a stupid example, but it matches the backup-target and database-dump usage pattern really well. Files backing iSCSI show similar dedupe behavior. Essentially, every time you bundle multiple files together into one, you run into things like that.

> Have you some real research/scientifically gathered data (non-hearsay/non-anecdotal) on the underlying reasons for the discrepancy in the deduping effectiveness you describe? A 3-17x difference doesn't plausibly come purely from fixed vs. variable length block sizes.

Personal experience isn't hearsay :) NetApp publishes a whitepaper against the 7000 making this a big point, but that isn't publicly available. Try searching "zfs dedupe +netbackup" or "zfs dedupe +datadomain" and similar - you will hear of hundreds of people all complaining about the same thing.

> The only case where I'd bother to consider variable-length deduping is file deduping (rather than block); in this case we can just make a COW hard-link and it's _really_ cheap and effective.

I take your word for this :)

--
Censorship: noun, circa 1591. a: Relief of the burden of independent thinking.
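Peter's tar scenario is easy to reproduce in miniature: adding one empty file shifts every later byte of the archive by 512, and fixed 4 KiB block hashes stop matching almost entirely. A sketch using small in-memory archives (1 MiB payload rather than 8 TB, but the effect is the same):

```python
import hashlib
import io
import os
import tarfile


def make_tar(members):
    """Build an uncompressed USTAR archive in memory from (name, data) pairs."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as t:
        for name, data in members:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            t.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def block_hashes(raw, block=4096):
    """Hashes of fixed-size blocks, i.e. what a fixed-block dedup would index."""
    return {hashlib.sha256(raw[i:i + block]).digest()
            for i in range(0, len(raw), block)}


payload = os.urandom(1 << 20)                 # one 1 MiB "backed-up" file
run1 = make_tar([("big.bin", payload)])
run2 = make_tar([("new-empty-file", b""),     # one 512-byte header prepended
                 ("big.bin", payload)])

shared = block_hashes(run1) & block_hashes(run2)
print(len(shared), "of", len(block_hashes(run1)), "blocks still dedupe")
```

Even though the two archives are byte-identical except for one inserted 512-byte header, essentially no 4 KiB blocks line up afterwards (only zero-padding blocks at the tail still match).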
Re: Offline Deduplication for Btrfs
Peter A wrote: On Thursday, January 06, 2011 05:48:18 am you wrote:

Can you elaborate what you're talking about here? How does the length of a directory name affect alignment of file block contents? I don't see how variability of length matters, other than to make things a lot more complicated.

I'm saying in a filesystem it doesn't matter - if you bundle everything into a backup stream, it does. Think of tar. 512 byte alignment. I tar up a directory with 8TB total size. No big deal. Now I create a new, empty file in this dir with a name that just happens to be first in the dir. This adds 512 bytes close to the beginning of the tar file the second time I run tar. Now the remainder of the stream is offset by 512 bytes and, if you do dedupe on fs-block-sized chunks larger than those 512 bytes, not a single byte will be de-duped.

OK, I get what you mean now. And I don't think this is something that should be solved in the file system.

I know it's a stupid example but it matches the backup-target and database dump usage pattern really well. Files backing iSCSI show similar dedupe behavior. Essentially, every time you bundle multiple files together into one, you run into things like that.

I suspect a large part of the underlying cause of this is that most things still operate on 512 byte sectors. Once that is replaced with 4KB sectors, it will probably go away. And if this is the case, then perhaps making the block size variable across 512, 1024, 2048 and 4096 bytes (probably tried in reverse order) would achieve most of that benefit. Whether that is a worthwhile thing to do for poorly designed backup solutions is debatable, and I'm not convinced about the general use-case. It'd be very expensive and complicated for seemingly very limited benefit. Do you have any real research or scientifically gathered data (non-hearsay, non-anecdotal) on the underlying reasons for the discrepancy in the deduping effectiveness you describe? A 3-17x difference doesn't plausibly come purely from fixed vs. variable length block sizes.

Personal experience isn't hearsay :) Netapp publishes a whitepaper against the 7000 making this a big point, but that isn't publicly available. Try searching for zfs dedupe +netbackup or zfs dedupe +datadomain and similar - you will find hundreds of people all complaining about the same thing.

Typical. And no doubt they complain that ZFS isn't doing what they want, rather than netbackup not co-operating. The solution to one misdesign isn't an expensive bodge. The solution to this particular problem is to make netbackup work on a per-file rather than per-stream basis.

Gordan

-- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
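The offset effect Peter describes is easy to demonstrate. A small simulation (not tied to any real tar output): hash the 4 KiB blocks of a byte stream, insert 512 bytes near the front, and count how many block hashes survive.

```python
import hashlib
import random

BLOCK = 4096

def block_hashes(data: bytes) -> set[bytes]:
    """Digest of every fixed-size block in a byte stream."""
    return {hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)}

random.seed(0)
stream = random.randbytes(1 << 20)          # a 1 MiB "backup stream"

# The same stream with 512 bytes inserted near the front, as happens
# when a new tar header appears early in the archive.
shifted = stream[:512] + b"\x00" * 512 + stream[512:]

common = block_hashes(stream) & block_hashes(shifted)
print(f"blocks shared after a 512-byte insert: {len(common)}")  # 0
```

Every block boundary after the insertion point moves by 512 bytes, so with fixed 4 KiB blocks not a single hash matches, exactly as described.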
Re: Offline Deduplication for Btrfs
On Thu, Jan 06, 2011 at 12:18:34PM +, Simon Farnsworth wrote: Gordan Bobic wrote: Josef Bacik wrote: snip

Then again, for a lot of use-cases there are perhaps better ways to achieve the target goal than deduping on FS level, e.g. snapshotting or something like fl-cow: http://www.xmailserver.org/flcow.html

Where VMs are concerned, fl-cow is a poor replacement for deduping. Upgrading packages? The first VM upgrades and copies the changed files. After a while the second upgrades and copies files too. More and more becomes duped again. If you host multiple distributions, you need to translate that /usr/share/bin/foo in foonux is /us/bin/bar in barux. And the primary reason to dedupe is not to reduce space usage but to improve caching. Why should machine A read a file if machine B read it five minutes ago?

-- user to computer ratio too low.
Re: Offline Deduplication for Btrfs
Ondřej Bílka wrote:

Then again, for a lot of use-cases there are perhaps better ways to achieve the target goal than deduping on FS level, e.g. snapshotting or something like fl-cow: http://www.xmailserver.org/flcow.html

Where VMs are concerned, fl-cow is a poor replacement for deduping.

Depends on your VM. If your VM uses monolithic images, then you're right. For a better solution, take a look at vserver's hashify feature for something that does this very well in its own context.

Upgrading packages? The first VM upgrades and copies the changed files. After a while the second upgrades and copies files too. More and more becomes duped again.

So you want online dedupe, then. :)

If you host multiple distributions you need to translate that /usr/share/bin/foo in foonux is /us/bin/bar in barux.

The chances of the binaries being the same between distros are between slim and none. In the context of VMs where you have access to raw files, as I said, look at vserver's hashify feature. It doesn't care about file names; it will COW hard-link all files with identical content. This doesn't even require an exhaustive check of all the files' contents - you can start with file sizes. Files that have different sizes can't have the same contents, so you can discard most of the comparing before you even open the file; most of the work gets done based on metadata alone.

And the primary reason to dedupe is not to reduce space usage but to improve caching. Why should machine A read a file if machine B read it five minutes ago?

Couldn't agree more. This is what I was trying to explain earlier. Even if deduping did cause more fragmentation (and I don't think that is the case to any significant extent), the improved caching efficiency would more than offset this.

Gordan
Re: Offline Deduplication for Btrfs
Tomasz Torcz wrote: On Thu, Jan 06, 2011 at 02:19:04AM +0100, Spelic wrote:

CPU can handle considerably more than 250 block hashings per second. You could argue that this changes in cases of sequential I/O on big files, but a 1.86GHz Core2 can churn through 111MB/s of SHA256, which even SSDs will struggle to keep up with.

A normal 1TB disk with platters can do 130MB/sec sequential, no problem. An SSD can do more like 200MB/sec write, 280MB/sec read, sequential or random, and is actually limited only by the SATA 3.0gbit/sec interface, but soon enough they will have SATA/SAS 6.0gbit/sec.

By “soon enough” you really meant “a year ago”, I think: http://www.anandtech.com/show/3812/the-ssd-diaries-crucials-realssd-c300 Current 6Gbps SSDs are doing 415 MB/s sequential: http://www.anandtech.com/show/4086/microns-realssd-c400-uses-25nm-nand-at-161gb-offers-415mbs-reads or even claim 550MB/s: http://www.anandtech.com/show/4100/ocz-vertex-pro-3-demo-worlds-first-sandforce-sf2000 (funny bit: Sandforce SSD controllers dedup internally). Anyway, 6Gbps is not a future tale, but something long available. And these are not the fastest kids on the block: filesystems built today must deal with storage providing many gigabytes per second. Think of massive disk arrays or stuff like the Oracle F5100, claiming 12.8GB/sec read and ~10GB/s write (in one rack unit).

Sequential figures look nice and impressive, but we all know they are meaningless for most real world workloads. IOPS are where it's at. And maybe you can get 100,000 IOPS out of an SSD. But that still means 100,000 SHA256 hashes/second. That's 3.2MB/s of SHA256 hashes, or about 2% of what a modern x64 CPU will do, assuming it doesn't have a suitable hardware crypto accelerator for that algorithm. So on a reasonably recent quad core CPU you would probably be able to comfortably handle about 200x that before it starts becoming an issue. If you're that concerned about space requirements, doing LZO compression will still be much more expensive.
And that's only for writes - on reads we don't need to do any hashing (although it's useful to do for the disk error checking reasons explained earlier).

Gordan
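The IOPS arithmetic above can be reproduced directly (these are the post's own assumed figures, not measurements):

```python
# Figures quoted in the discussion above - assumptions, not benchmarks.
disk_iops = 250            # theoretical ceiling of a 15,000 rpm disk
ssd_iops = 100_000         # optimistic random IOPS for an SSD
digest_size = 32           # bytes per SHA256 digest

# 100,000 hashes/second emit 3.2 MB/s of digest data, which is the
# "3.2MB/s of SHA256 hashes" in the paragraph above.
digest_rate_mb_s = ssd_iops * digest_size / 1e6
print(digest_rate_mb_s)    # 3.2
```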
Re: Offline Deduplication for Btrfs
On Thursday, January 06, 2011 09:00:47 am you wrote: Peter A wrote:

I'm saying in a filesystem it doesn't matter - if you bundle everything into a backup stream, it does. Think of tar. 512 byte alignment. I tar up a directory with 8TB total size. No big deal. Now I create a new, empty file in this dir with a name that just happens to be first in the dir. This adds 512 bytes close to the beginning of the tar file the second time I run tar. Now the remainder of the stream is offset by 512 bytes and, if you do dedupe on fs-block-sized chunks larger than those 512 bytes, not a single byte will be de-duped.

OK, I get what you mean now. And I don't think this is something that should be solved in the file system. snip Whether that is a worthwhile thing to do for poorly designed backup solutions is debatable, and I'm not convinced about the general use-case. It'd be very expensive and complicated for seemingly very limited benefit.

Glad I finally explained myself properly... Unfortunately I disagree with you on the rest. If you take that logic, then I could claim dedupe is nothing a file system should handle - after all, it's the user's poorly designed applications that store multiple copies of data. Why should the fs take care of that?

The problem doesn't just affect backups. It affects everything where you have large data files that are not forced to align with filesystem blocks. In addition to the case I mentioned above, this affects, with pretty much the same severity:

* Database dumps
* Video editing
* Files backing iSCSI volumes
* VM images (fs blocks inside the VM rarely align with fs blocks in the backing storage)

Our VM environment is backed with a 7410 and we get only about 10% dedupe. Copying the same images to a DataDomain results in a 60% reduction in space used. Basically, every time I end up using a lot of storage space, it's in a scenario where fs-block based dedupe is not very effective. I also have to argue against the point that these usages are poorly designed.
"Poorly designed" can only apply to technologies that existed or were talked about at the time the design was made. Tar and such have been around for a long time, way before anyone even thought of dedupe. In addition, until there is a commonly accepted/standard API to query the block size so apps can generate files appropriately laid out for the backing filesystem, what is the application supposed to do? If anything, I would actually argue the opposite, that fixed block dedupe is a poor design:

* The problem was known at the time the design was made
* No alternative can be offered, as tar, netbackup, video editing, ... have been around for a long time and are unlikely to change in the near future
* There is no standard API to query the alignment parameters (and even that would not be great, since copying a file aligned for 8k to a 16k-aligned filesystem would potentially cause the same issue again)

Also, from the human perspective it's hard to make end users understand your point of view. I promote the 7000 series of storage and I know how hard it is to explain the dedupe behavior there. They see that Datadomain does it, and does it well. So why can't solution xyz do it just as well?

Typical. And no doubt they complain that ZFS isn't doing what they want, rather than netbackup not co-operating. The solution to one misdesign isn't an expensive bodge. The solution to this particular problem is to make netbackup work on a per-file rather than per-stream basis.

I'd agree if it was just limited to netbackup... I know variable block length is a significantly more difficult problem than block level. That's why the ZFS team made the design choice they did. Variable length is also the reason why the DataDomain solution is a scale-out rather than scale-up approach. However, CPUs get faster and faster - eventually they'll be able to handle it.
So the right solution (from my limited point of view; as I said, I'm not a filesystem design expert) would be to implement the data structures to handle variable length. Then, in the first iteration, implement the dedupe algorithm to only search on filesystem blocks using existing checksums and such. Less CPU usage, quicker development, easier debugging. Once that is stable and proven, you can then, without requiring the user to reformat, go ahead and implement variable length dedupe...

Btw, thanks for your time, Gordan :)

Peter.

-- Censorship: noun, circa 1591. a: Relief of the burden of independent thinking.
Re: Offline Deduplication for Btrfs
Peter A wrote: On Thursday, January 06, 2011 09:00:47 am you wrote: Peter A wrote:

I'm saying in a filesystem it doesn't matter - if you bundle everything into a backup stream, it does. Think of tar. 512 byte alignment. I tar up a directory with 8TB total size. No big deal. Now I create a new, empty file in this dir with a name that just happens to be first in the dir. This adds 512 bytes close to the beginning of the tar file the second time I run tar. Now the remainder of the stream is offset by 512 bytes and, if you do dedupe on fs-block-sized chunks larger than those 512 bytes, not a single byte will be de-duped.

OK, I get what you mean now. And I don't think this is something that should be solved in the file system. snip Whether that is a worthwhile thing to do for poorly designed backup solutions is debatable, and I'm not convinced about the general use-case. It'd be very expensive and complicated for seemingly very limited benefit.

Glad I finally explained myself properly... Unfortunately I disagree with you on the rest. If you take that logic, then I could claim dedupe is nothing a file system should handle - after all, it's the user's poorly designed applications that store multiple copies of data. Why should the fs take care of that?

There is merit in that point. Some applications do in fact do their own deduplication, as mentioned previously in this thread.

The problem doesn't just affect backups. It affects everything where you have large data files that are not forced to align with filesystem blocks. In addition to the case I mentioned above, this affects, with pretty much the same severity:

* Database dumps
* Video editing
* Files backing iSCSI volumes
* VM images (fs blocks inside the VM rarely align with fs blocks in the backing storage)

Our VM environment is backed with a 7410 and we get only about 10% dedupe. Copying the same images to a DataDomain results in a 60% reduction in space used.
I'd be interested to hear about the relative write performance on the variable block size.

I also have to argue against the point that these usages are poorly designed. "Poorly designed" can only apply to technologies that existed or were talked about at the time the design was made. Tar and such have been around for a long time, way before anyone even thought of dedupe. In addition, until there is a commonly accepted/standard API to query the block size so apps can generate files appropriately laid out for the backing filesystem, what is the application supposed to do?

Indeed, this goes philosophically in the direction that storage vendors have (unsuccessfully) been trying to shift the industry for decades - object based storage. But, sadly, it hasn't happened (yet?).

If anything, I would actually argue the opposite, that fixed block dedupe is a poor design:

* The problem was known at the time the design was made
* No alternative can be offered, as tar, netbackup, video editing, ... have been around for a long time and are unlikely to change in the near future
* There is no standard API to query the alignment parameters (and even that would not be great, since copying a file aligned for 8k to a 16k-aligned filesystem would potentially cause the same issue again)

Also, from the human perspective it's hard to make end users understand your point of view. I promote the 7000 series of storage and I know how hard it is to explain the dedupe behavior there. They see that Datadomain does it, and does it well. So why can't solution xyz do it just as well?

I'd be interested to see the evidence of the variable length argument. I have a sneaky suspicion that it actually falls back to 512 byte blocks, which are much more likely to align, when more sensibly sized blocks fail. The downside is that you don't really want to store a 32 byte hash key with every 512 bytes of data, so you could peel 512 byte blocks off the front in the hope that a bigger block that follows will match.
Thinking about it, this might actually not be too expensive to do. If the 4KB block doesn't match, check 512 byte sub-blocks, and try peeling them, to make the next one line up. If it doesn't, store the mismatch as a full 4KB block and resume. If you do find a match, save the peeled 512 byte blocks separately and dedupe the 4KB block. In fact, it's rather like the loop peeling optimization in a compiler, which allows you to align the data to the boundary suitable for vectorizing.

Typical. And no doubt they complain that ZFS isn't doing what they want, rather than netbackup not co-operating. The solution to one misdesign isn't an expensive bodge. The solution to this particular problem is to make netbackup work on a per-file rather than per-stream basis.

I'd agree if it was just limited to netbackup... I know variable block length is a significantly more difficult problem than block level. That's why the ZFS team made the design choice they did. Variable length is also the reason why the DataDomain solution is a scale-out rather than scale-up approach.
Re: Offline Deduplication for Btrfs
On Thu, Jan 06, 2011 at 02:41:28PM +, Gordan Bobic wrote: Ondřej Bílka wrote:

Then again, for a lot of use-cases there are perhaps better ways to achieve the target goal than deduping on FS level, e.g. snapshotting or something like fl-cow: http://www.xmailserver.org/flcow.html

Where VMs are concerned, fl-cow is a poor replacement for deduping.

Depends on your VM. If your VM uses monolithic images, then you're right. For a better solution, take a look at vserver's hashify feature for something that does this very well in its own context.

Upgrading packages? The first VM upgrades and copies the changed files. After a while the second upgrades and copies files too. More and more becomes duped again.

So you want online dedupe, then. :)

If you host multiple distributions you need to translate that /usr/share/bin/foo in foonux is /us/bin/bar in barux.

The chances of the binaries being the same between distros are between slim and none. In the context of VMs where you have access to raw files, as I said, look at vserver's hashify feature. It doesn't care about file names; it will COW hard-link all files with identical content. This doesn't even require an exhaustive check of all the files' contents - you can start with file sizes. Files that have different sizes can't have the same contents, so you can discard most of the comparing before you even open the file; most of the work gets done based on metadata alone.

Yes, I wrote this as a quick example. On second thought, files shared between distros are typically write-only (like manpages).

And the primary reason to dedupe is not to reduce space usage but to improve caching. Why should machine A read a file if machine B read it five minutes ago?

Couldn't agree more. This is what I was trying to explain earlier. Even if deduping did cause more fragmentation (and I don't think that is the case to any significant extent), the improved caching efficiency would more than offset this.
Gordan

-- Program load too heavy for processor to lift.
Re: Offline Deduplication for Btrfs
On Thursday, January 06, 2011 10:07:03 am you wrote:

I'd be interested to see the evidence of the variable length argument. I have a sneaky suspicion that it actually falls back to 512 byte blocks, which are much more likely to align, when more sensibly sized blocks fail. The downside is that you don't really want to store a 32 byte hash key with every 512 bytes of data, so you could peel 512 byte blocks off the front in the hope that a bigger block that follows will match.

Thinking about it, this might actually not be too expensive to do. If the 4KB block doesn't match, check 512 byte sub-blocks, and try peeling them, to make the next one line up. If it doesn't, store the mismatch as a full 4KB block and resume. If you do find a match, save the peeled 512 byte blocks separately and dedupe the 4KB block. In fact, it's rather like the loop peeling optimization in a compiler, which allows you to align the data to the boundary suitable for vectorizing.

I'm not sure about this, but to be honest I cannot see any other way. Otherwise, how would you ever find a match? You cannot store checksums of random sub-sections hoping that eventually stuff will match up... 512 bytes is probably the best choice, as it's been a known block size since the dawn of unix.

Actually, see above - I believe I was wrong about how expensive variable length block size is likely to be. It's more expensive, sure, but not orders of magnitude more expensive, and as discussed earlier, given the CPU isn't really the key bottleneck here, I think it'd be quite workable.

Hmm - from my work with DD systems, it seems to be the CPU that ends up being the limiting factor on dedupe performance... But again, it seems that you have much deeper insight into this topic and have already drawn up an algorithm in your mind :)

Peter.

-- Censorship: noun, circa 1591. a: Relief of the burden of independent thinking.
Re: Offline Deduplication for Btrfs
On Thursday 06 of January 2011 10:51:04 Mike Hommey wrote: On Thu, Jan 06, 2011 at 10:37:46AM +0100, Tomasz Chmielewski wrote:

I have been thinking a lot about de-duplication for a backup application I am writing. I wrote a little script to figure out how much it would save me. For my laptop home directory, about 100 GiB of data, it was a couple of percent, depending a bit on the size of the chunks. With 4 KiB chunks, I would save about two gigabytes. (That's assuming no MD5 hash collisions.) I don't have VM images, but I do have a fair bit of saved e-mail. So, for backups, I concluded it was worth it to provide an option to do this. I have no opinion on whether it is worthwhile to do in btrfs.

Online deduplication is very useful for backups of big, multi-gigabyte files which change constantly. Some mail servers store files this way; some MUAs store files like this; databases also commonly pack everything into big files which tend to change here and there almost all the time. Multi-gigabyte files which have only a few megabytes changed can't be hardlinked; simple maths shows that even compressing multiple files which have only a few differences will lead to greater space usage than a few extra megabytes in each (because everything else is deduplicated). And I don't even want to think about the IO needed to offline dedup multi-terabyte storage (1 TB disks and bigger are becoming standard nowadays), say, daily, especially when the storage is already heavily used in IO terms.

Now, one popular tool which can deal with small changes in files is rsync. It can be used to copy files over the network - so that if you want to copy/update a multi-gigabyte file which only has a few changes, rsync would need to transfer just a few megabytes. On disk, however, rsync creates a temporary copy of the original file, where it packs unchanged contents together with any changes made.
For example, while it copies/updates a file, we will have: original_file.bin .temporary_random_name Later, original_file.bin would be removed, and .temporary_random_name would be renamed to original_file.bin. There goes any deduplication we had so far; we have to start the IO over again.

Sounds like all you need is cp --reflink=always and rsync --inplace. Haven't tested if that works well, though.

It works very well: btrfs with snapshots, compression and rsync --inplace has better storage utilisation than lessfs at around 10-15 snapshots with around 600GB of test data in small files.

-- Hubert Kario QBS - Quality Business Software 02-656 Warszawa, ul. Ksawerów 30/85 tel. +48 (22) 646-61-51, 646-74-24 www.qbs.com.pl
Offline Deduplication for Btrfs V2
Just a quick update: I've dropped the hashing stuff in favor of doing a memcmp in the kernel to make sure the data is still the same. The thing that takes a while is reading the data up from disk, so doing a memcmp of the entire buffer isn't that big of a deal, not to mention there's a possibility of abuse by malicious users if there is a problem with the hashing algorithms we use. Plus this makes the interface simpler. Thanks,

Josef
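The hash-then-verify scheme discussed in this thread (a hash only nominates a candidate; a full byte comparison decides) can be modelled in a few lines. This is a toy sketch with a dict standing in for on-disk extents, not Josef's actual kernel code:

```python
import hashlib

class DedupStore:
    """Toy block store: the hash index finds dedup candidates, but a
    full byte comparison (the memcmp step) is the final authority."""

    def __init__(self):
        self.blocks = {}       # block id -> bytes ("extents")
        self.by_hash = {}      # digest -> block id
        self.false_hits = 0    # hash matched but bytes differed

    def write(self, data: bytes) -> int:
        h = hashlib.sha256(data).digest()
        cand = self.by_hash.get(h)
        if cand is not None:
            if self.blocks[cand] == data:    # byte-by-byte verification
                return cand                  # dedup: reuse the existing block
            self.false_hits += 1             # collision, or the data changed
        new_id = len(self.blocks)
        self.blocks[new_id] = data
        self.by_hash[h] = new_id
        return new_id
```

Dropping the hash entirely, as the V2 patches do, just moves candidate selection elsewhere; the byte comparison stays the deciding step either way.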
Re: Offline Deduplication for Btrfs
Excerpts from Peter A's message of 2011-01-05 22:58:36 -0500: On Wednesday, January 05, 2011 08:19:04 pm Spelic wrote:

I'd just make it always use the fs block size. No point in making it variable.

Agreed. What is the reason for variable block size?

First post on this list - I mostly was just reading so far to learn more about fs design, but this is one topic I (unfortunately) have experience with... You wouldn't believe the difference variable block size dedupe makes. For a pure fileserver, it's ok to dedupe on block level, but for most other uses, variable is king. One big example is backups. Netbackup and most others produce one stream with all data, even when backing up to disk. Imagine you move a whole lot of data from one dir to another. Think a directory with huge video files. As a filesystem it would be de-duped nicely. The backup stream however may or may not have matching fs blocks. If the directory name before and after has the same length and such - then yeah, dedupe works. Directory name is a byte shorter? Everything in the stream will be offset by one byte - and no dedupe will occur at all on the whole dataset. In the real world, just compare the dedupe performance of an Oracle 7000 (zfs and therefore fs block based) to a DataDomain (variable length) in this usage scenario. Among our customers we see something like a 3 to 17x dedupe ratio on the DD, 1.02 - 1.05 on the 7000.

What is the smallest granularity that the DataDomain searches for in terms of dedup? Josef's current setup isn't restricted to a specific block size, but there is a min match of 4k.

-chris
Offline Deduplication for Btrfs
Here are patches to do offline deduplication for Btrfs. It works well for the cases it's expected to. I'm looking for feedback on the ioctl interface and such; I'm well aware there are missing features in the userspace app (like being able to set a different blocksize). If this interface is acceptable I will flesh out the userspace app a little more, but I believe the kernel side is ready to go.

Basically I think online dedup is a huge waste of time and completely useless. You are going to want to do different things with different data. For example, for a mailserver you are going to want very small blocksizes, but for, say, a virtualization image store you are going to want much larger blocksizes. And let's not get into heterogeneous environments; those just get much too complicated. So my solution is batched dedup, where a user just runs this command and it dedups everything at that point. This avoids the very costly overhead of having to hash and look up duplicate extents online and lets us be _much_ more flexible about what we want to deduplicate and how we want to do it.

For the userspace app it only does 64k blocks, or whatever the largest area it can read out of a file. I'm going to extend this to do the following things in the near future:

1) Take the blocksize as an argument so we can have bigger/smaller blocks
2) Have an option to _only_ honor the blocksize, don't try and dedup smaller blocks
3) Use fiemap to try and dedup extents as a whole and just ignore specific blocksizes
4) Use fiemap to determine what would be the most optimal blocksize for the data you want to dedup

I've tested this out on my setup and it seems to work well. I appreciate any feedback you may have. Thanks,

Josef
Re: Offline Deduplication for Btrfs
Josef Bacik wrote:

Basically I think online dedup is a huge waste of time and completely useless.

I couldn't disagree more. First, let's consider what the general-purpose use-case of data deduplication is. What are the resource requirements to perform it? How do these resource requirements differ between online and offline?

The only sane way to keep track of hashes of existing blocks is using an index. Searching through an index containing evenly distributed data (such as hashes) is pretty fast (log(N)), and this has to be done regardless of whether the dedupe is online or offline. It also goes without saying that all the blocks being deduplicated need to be hashed, and the cost of this is also the same whether the block is hashed online or offline. Let's look at the relative merits:

1a) Offline

We have to copy the entire data set. This means we are using the full amount of disk writes that the data set size dictates. Do we do the hashing of current blocks at this point to create the indexes? Or do we defer it until some later time? Doing it at the point of writes is cheaper - we already have the data in RAM and we can calculate the hashes as we are writing each block. The performance implications of this are fairly analogous to the parity RAID RMW performance issue - to achieve decent performance you have to write the parity at the same time as the rest of the stripe, otherwise you have to read the part of the stripe you didn't write before calculating the checksum. So by doing the hash indexing offline, the total amount of disk I/O required effectively doubles, and the amount of CPU spent on doing the hashing is in no way reduced. How is this in any way advantageous?

1b) Online

As we are writing the data, we calculate the hashes for each block. (See 1a for the argument of why I believe this is saner and cheaper than doing it offline.)
Since we already have these hashes, we can do a look-up in the hash index, and either write out the block as is (if that hash isn't already in the index) or simply write the pointer to an existing suitable block (if it already exists). This saves us writing out that block - fewer writes to the disk, not to mention we don't later have to re-read the block to dedupe it. So in this case, instead of the write-read-relink of the offline scenario, we simply do a relink on duplicate blocks.

There is another reason to favour the online option due to its lower write stress - SSDs. Why hammer the SSD with totally unnecessary writes?

The _only_ reason to defer deduping is that hashing costs CPU time. But the chances are that a modern CPU core can churn out MD5 and/or SHA256 hashes faster than a modern mechanical disk can keep up. A 15,000rpm disk can theoretically handle 250 IOPS. A modern CPU can handle considerably more than 250 block hashings per second. You could argue that this changes in cases of sequential I/O on big files, but a 1.86GHz Core2 can churn through 111MB/s of SHA256, which even SSDs will struggle to keep up with. I don't think that the realtime performance argument withstands scrutiny.

You are going to want to do different things with different data. For example, for a mailserver you are going to want very small blocksizes, but for, say, a virtualization image store you are going to want much larger blocksizes. And let's not get into heterogeneous environments; those just get much too complicated.

In terms of deduplication, IMO it should really all be uniform, transparent, and block based. In terms of specifying which subtrees to dedupe, that should really be a per-subdirectory hereditary attribute, kind of like compression was supposed to work with chattr +c in the past.

So my solution is batched dedup, where a user just runs this command and it dedups everything at that point.
This avoids the very costly overhead of having to hash and look up duplicate extents online and lets us be _much_ more flexible about what we want to deduplicate and how we want to do it.

I don't see that it adds any flexibility compared to the hereditary deduping attribute. I also don't see that it is any cheaper. It's actually more expensive, according to the reasoning above. As an aside, zfs and lessfs both do online deduping, presumably for a good reason.

Then again, for a lot of use-cases there are perhaps better ways to achieve the target goal than deduping on FS level, e.g. snapshotting or something like fl-cow: http://www.xmailserver.org/flcow.html

Personally, I would still like to see an fl-cow-like solution that actually preserves the inode numbers of duplicate files while providing COW functionality that breaks this unity (and inode number identity) upon writes, specifically because it saves page cache (only have to cache one copy) and in case of DLLs on chroot style virtualization (OpenVZ, Vserver, LXC) means that identical DLLs in all the guests are all mapped into the same memory, thus
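The write-amplification argument in 1a/1b above can be made concrete with a toy count of device block writes under each scheme (pure simulation; the numbers depend entirely on the assumed duplicate ratio):

```python
import hashlib
import random

def online_writes(blocks) -> int:
    """Online scheme from 1b: hash at write time; a duplicate block is
    never written to the device, only referenced."""
    seen, writes = set(), 0
    for b in blocks:
        h = hashlib.sha256(b).digest()
        if h not in seen:
            seen.add(h)
            writes += 1
    return writes

def offline_writes(blocks) -> int:
    """Offline scheme from 1a: every block is written once up front;
    the later dedup pass then re-reads everything and relinks, adding
    I/O rather than saving it."""
    return len(blocks)   # plus a full re-read during the dedup pass

random.seed(0)
unique = [random.randbytes(4096) for _ in range(8)]
workload = unique * 4                     # 75% duplicate blocks

print(offline_writes(workload), online_writes(workload))  # 32 8
```

On this (deliberately duplicate-heavy) workload the online path issues a quarter of the writes, which is the SSD-wear point made above; a workload with few duplicates narrows the gap accordingly.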
Re: Offline Deduplication for Btrfs
On Wed, Jan 05, 2011 at 05:42:42PM +, Gordan Bobic wrote: Josef Bacik wrote: Basically I think online dedup is a huge waste of time and completely useless. I couldn't disagree more. First, let's consider what is the general-purpose use-case of data deduplication. What are the resource requirements to perform it? How do these resource requirements differ between online and offline? The only sane way to keep track of hashes of existing blocks is using an index. Searching through an index containing evenly distributed data (such as hashes) is pretty fast (O(log N)), and this has to be done regardless of whether the dedupe is online or offline. It also goes without saying that all the blocks being deduplicated need to be hashed, and the cost of this is also the same whether the block is hashed online or offline. Let's look at the relative merits: 1a) Offline We have to copy the entire data set. This means we are using the full amount of disk writes that the data set size dictates. Do we do the hashing of current blocks at this point to create the indexes? Or do we defer it until some later time? Doing it at the point of writes is cheaper - we already have the data in RAM and we can calculate the hashes as we are writing each block. Performance implications of this are fairly analogous to the parity RAID RMW performance issue - to achieve decent performance you have to write the parity at the same time as the rest of the stripe, otherwise you have to read the part of the stripe you didn't write, before calculating the checksum. So by doing the hash indexing offline, the total amount of disk I/O required effectively doubles, and the amount of CPU spent on doing the hashing is in no way reduced. How is this in any way advantageous? 1b) Online As we are writing the data, we calculate the hashes for each block. (See 1a for the argument for why I believe this is saner and cheaper than doing it offline.) 
Since we already have these hashes, we can do a look-up in the hash index, and either write out the block as is (if that hash isn't already in the index) or simply write a pointer to an existing suitable block (if it already exists). This saves us writing out that block - fewer writes to the disk, not to mention we don't later have to re-read the block to dedupe it. So in this case, instead of the write-read-relink of the offline scenario, we simply relink on duplicate blocks. There is another reason to favour the online option due to its lower write stress - SSDs. Why hammer the SSD with totally unnecessary writes? The _only_ reason to defer deduping is that hashing costs CPU time. But the chances are that a modern CPU core can churn out MD5 and/or SHA256 hashes faster than a modern mechanical disk can keep up. A 15,000rpm disk can theoretically handle 250 IOPS. A modern CPU can handle considerably more than 250 block hashings per second. You could argue that this changes in cases of sequential I/O on big files, but a 1.86GHz Core2 can churn through 111MB/s of SHA256, which even SSDs will struggle to keep up with. I don't think that the realtime performance argument withstands scrutiny. You are going to want to do different things with different data. For example, for a mailserver you are going to want to have very small blocksizes, but for, say, a virtualization image store you are going to want much larger blocksizes. And let's not get into heterogeneous environments, those just get much too complicated. In terms of deduplication, IMO it should really all be uniform, transparent, and block based. In terms of specifying which subtrees to dedupe, that should really be a per-subdirectory hereditary attribute, kind of like compression was supposed to work with chattr +c in the past. So my solution is batched dedup, where a user just runs this command and it dedups everything at this point. 
This avoids the very costly overhead of having to hash and look up duplicate extents online and lets us be _much_ more flexible about what we want to deduplicate and how we want to do it. I don't see that it adds any flexibility compared to the hereditary deduping attribute. I also don't see that it is any cheaper. It's actually more expensive, according to the reasoning above. As an aside, zfs and lessfs both do online deduping, presumably for a good reason. Then again, for a lot of use-cases there are perhaps better ways to achieve the intended goal than deduping on FS level, e.g. snapshotting or something like fl-cow: http://www.xmailserver.org/flcow.html Personally, I would still like to see a fl-cow like solution that actually preserves the inode numbers of duplicate files while providing COW functionality that breaks this unity (and inode number identity) upon writes, specifically because it saves page cache (only have to cache one copy) and, in the case of DLLs on chroot-style virtualization (OpenVZ, Vserver, LXC), means that identical DLLs in all the guests are mapped into the same memory.
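The online write path argued for above ("instead of write-read-relink of the offline scenario, we simply relink on duplicate blocks") can be sketched as a toy in-memory model. This is purely illustrative: `block_index`, `storage`, and `write_block` are hypothetical names, not btrfs code.

```python
import hashlib

# Toy model of the online dedup write path: hash each incoming block,
# look it up in the index, and either store the block (new hash) or
# return a reference to the existing copy (duplicate), skipping the write.

block_index = {}   # SHA256 digest -> physical block id
storage = []       # stands in for blocks actually written to disk

def write_block(data: bytes) -> int:
    """Return the physical block id the data ends up referencing."""
    digest = hashlib.sha256(data).digest()
    if digest in block_index:
        return block_index[digest]          # relink: no write issued
    storage.append(data)                    # unique block: pay the write
    block_index[digest] = len(storage) - 1
    return block_index[digest]
```

Writing the same 4 KiB block twice issues only one physical write; a paranoid version would memcmp against the stored block before relinking, as discussed elsewhere in the thread.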
Re: Offline Deduplication for Btrfs
On Wed, Jan 05, 2011 at 07:41:13PM +0100, Diego Calleja wrote: On Miércoles, 5 de Enero de 2011 18:42:42 Gordan Bobic escribió: So by doing the hash indexing offline, the total amount of disk I/O required effectively doubles, and the amount of CPU spent on doing the hashing is in no way reduced. But there are people who might want to temporarily avoid the extra cost of online dedup, and do it offline when the server load is smaller. In my opinion, both online and offline dedup have valid use cases, and the best choice is probably to implement both. Question from an end-user. When we say offline deduplication, are we talking about post-process deduplication (a la what Data ONTAP does with their SIS implementation), during which the underlying file system data continues to be available, or a process that needs exclusive access to the blocks to do its job? Ray -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Offline Deduplication for Btrfs
On ke, 2011-01-05 at 14:46 -0500, Josef Bacik wrote: Blah blah blah, I'm not having an argument about which is better because I simply do not care. I think dedup is silly to begin with, and online dedup even sillier. The only reason I did offline dedup was because I was just toying around with a simple userspace app to see exactly how much I would save if I did dedup on my normal system, and with 107 gigabytes in use, I'd save 300 megabytes. I'll say that again, with 107 gigabytes in use, I'd save 300 megabytes. So in the normal user case dedup would have been wholly useless to me. I have been thinking a lot about de-duplication for a backup application I am writing. I wrote a little script to figure out how much it would save me. For my laptop home directory, about 100 GiB of data, it was a couple of percent, depending a bit on the size of the chunks. With 4 KiB chunks, I would save about two gigabytes. (That's assuming no MD5 hash collisions.) I don't have VM images, but I do have a fair bit of saved e-mail. So, for backups, I concluded it was worth it to provide an option to do this. I have no opinion on whether it is worthwhile to do in btrfs. (For my script, see find-duplicate-chunks in http://code.liw.fi/debian/pool/main/o/obnam/obnam_0.14.tar.gz or get the current code using bzr get http://code.liw.fi/obnam/bzr/trunk/. http://braawi.org/obnam/ is the home page of the backup app.) -- Blog/wiki/website hosting with ikiwiki (free for free software): http://www.branchable.com/
Re: Offline Deduplication for Btrfs
On Wed, Jan 5, 2011 at 11:46 AM, Josef Bacik jo...@redhat.com wrote: Dedup is only useful if you _know_ you are going to have duplicate information, so the two major use cases that come to mind are 1) Mail server. You have small files, probably less than 4k (blocksize), that you are storing hundreds to thousands of. Using dedup would be good for this case, and you'd have to have a small dedup blocksize for it to be useful. 2) Virtualized guests. If you have 5 different RHEL5 virt guests, chances are you are going to share data between them, but unlike with the mail server example, you are likely to find much larger chunks that are the same, so you'd want a larger dedup blocksize, say 64k. You want this because if you did just 4k you'd end up with a ridiculous amount of fragmentation and performance would go down the toilet, so you need a larger dedup blocksize to make for better performance. You missed out on the most obvious, and useful, use case for dedupe: a central backup server. Our current backup server does an rsync backup of 127 servers every night into a single ZFS pool. 90+ of those servers are identical Debian installs (school servers), 20-odd of those are identical FreeBSD installs (firewalls/routers), and the rest are mail/web/db servers (Debian, Ubuntu, RedHat, Windows). Just as a test, we copied a week of backups to a Linux box running ZFS-fuse with dedupe enabled, and had a combined dedupe/compress ratio in the low double digits (11 or 12x, something like that). Now we're just waiting for ZFSv22+ to hit FreeBSD to enable dedupe on the backup server. For backups, and central storage for VMs, online dedupe is a massive win. Offline, maybe. Either way, dedupe is worthwhile. -- Freddie Cash fjwc...@gmail.com
Re: Offline Deduplication for Btrfs
On Wed, Jan 05, 2011 at 07:58:13PM +, Lars Wirzenius wrote: On ke, 2011-01-05 at 14:46 -0500, Josef Bacik wrote: Blah blah blah, I'm not having an argument about which is better because I simply do not care. I think dedup is silly to begin with, and online dedup even sillier. The only reason I did offline dedup was because I was just toying around with a simple userspace app to see exactly how much I would save if I did dedup on my normal system, and with 107 gigabytes in use, I'd save 300 megabytes. I'll say that again, with 107 gigabytes in use, I'd save 300 megabytes. So in the normal user case dedup would have been wholly useless to me. I have been thinking a lot about de-duplication for a backup application I am writing. I wrote a little script to figure out how much it would save me. For my laptop home directory, about 100 GiB of data, it was a couple of percent, depending a bit on the size of the chunks. With 4 KiB chunks, I would save about two gigabytes. (That's assuming no MD5 hash collisions.) I don't have VM images, but I do have a fair bit of saved e-mail. So, for backups, I concluded it was worth it to provide an option to do this. I have no opinion on whether it is worthwhile to do in btrfs. Yeah, for things where you are talking about sending it over the network or something like that, every little bit helps. I think deduplication is far more interesting and useful at an application level than at a filesystem level. For example, with a mail server there is a good chance that the files will be smaller than a blocksize and not be able to be deduped, but if the application that was storing them recognized that it had the same messages and just linked everything in its own stuff, then that would be cool. Thanks, Josef
Re: Offline Deduplication for Btrfs
On 01/05/2011 06:41 PM, Diego Calleja wrote: On Miércoles, 5 de Enero de 2011 18:42:42 Gordan Bobic escribió: So by doing the hash indexing offline, the total amount of disk I/O required effectively doubles, and the amount of CPU spent on doing the hashing is in no way reduced. But there are people who might want to temporarily avoid the extra cost of online dedup, and do it offline when the server load is smaller. The point is that the offline dedup is actually twice as expensive, and the hashing part is nowhere near as expensive as disk I/O. Disk I/O is very limited today, compared to CPU time. Gordan
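The "hashing is nowhere near as expensive as disk I/O" claim is easy to check on your own hardware with a microbenchmark along these lines (a rough sketch; absolute numbers depend entirely on your CPU and hashlib build):

```python
import hashlib
import time

def hash_throughput_mb_s(total_mb: int = 64, block_size: int = 4096) -> float:
    """SHA256-hash total_mb MiB of data in block_size chunks; return MB/s."""
    block = b"\x00" * block_size
    n_blocks = (total_mb * 1024 * 1024) // block_size
    start = time.perf_counter()
    for _ in range(n_blocks):
        hashlib.sha256(block).digest()
    return total_mb / (time.perf_counter() - start)
```

Comparing the reported figure against what a single disk can stream gives a feel for which side of the write path is actually the bottleneck.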
Re: Offline Deduplication for Btrfs
On 01/05/2011 07:01 PM, Ray Van Dolson wrote: On Wed, Jan 05, 2011 at 07:41:13PM +0100, Diego Calleja wrote: On Miércoles, 5 de Enero de 2011 18:42:42 Gordan Bobic escribió: So by doing the hash indexing offline, the total amount of disk I/O required effectively doubles, and the amount of CPU spent on doing the hashing is in no way reduced. But there are people who might want to temporarily avoid the extra cost of online dedup, and do it offline when the server load is smaller. In my opinion, both online and offline dedup have valid use cases, and the best choice is probably to implement both. Question from an end-user. When we say offline deduplication, are we talking about post-process deduplication (a la what Data ONTAP does with their SIS implementation), during which the underlying file system data continues to be available, or a process that needs exclusive access to the blocks to do its job? I was assuming it was a regular cron-job that grinds away on the disks but doesn't require downtime. Gordan
Re: Offline Deduplication for Btrfs
On Wed, Jan 05, 2011 at 11:01:41AM -0800, Ray Van Dolson wrote: On Wed, Jan 05, 2011 at 07:41:13PM +0100, Diego Calleja wrote: On Miércoles, 5 de Enero de 2011 18:42:42 Gordan Bobic escribió: So by doing the hash indexing offline, the total amount of disk I/O required effectively doubles, and the amount of CPU spent on doing the hashing is in no way reduced. But there are people who might want to temporarily avoid the extra cost of online dedup, and do it offline when the server load is smaller. In my opinion, both online and offline dedup have valid use cases, and the best choice is probably to implement both. Question from an end-user. When we say offline deduplication, are we talking about post-process deduplication (a la what Data ONTAP does with their SIS implementation), during which the underlying file system data continues to be available, or a process that needs exclusive access to the blocks to do its job? Yeah, it's just a post-process thing; you run it when you care to run it and it doesn't make anything unavailable. Thanks, Josef
Re: Offline Deduplication for Btrfs
On Wed, Jan 5, 2011 at 12:15 PM, Josef Bacik jo...@redhat.com wrote: Yeah, for things where you are talking about sending it over the network or something like that, every little bit helps. I think deduplication is far more interesting and useful at an application level than at a filesystem level. For example, with a mail server there is a good chance that the files will be smaller than a blocksize and not be able to be deduped, but if the application that was storing them recognized that it had the same messages and just linked everything in its own stuff, then that would be cool. Thanks, Cyrus IMAP and Zimbra (and probably a lot of others) already do that, hard-linking identical message bodies. The e-mail server use case for dedupe is pretty much covered already. -- Freddie Cash fjwc...@gmail.com
Re: Offline Deduplication for Btrfs
On 01/05/2011 07:46 PM, Josef Bacik wrote: Blah blah blah, I'm not having an argument about which is better because I simply do not care. I think dedup is silly to begin with, and online dedup even sillier. Offline dedup is more expensive - so why are you of the opinion that it is less silly? And comparison by silliness quotient still sounds like an argument over which is better. The only reason I did offline dedup was because I was just toying around with a simple userspace app to see exactly how much I would save if I did dedup on my normal system, and with 107 gigabytes in use, I'd save 300 megabytes. I'll say that again, with 107 gigabytes in use, I'd save 300 megabytes. So in the normal user case dedup would have been wholly useless to me. Dedup isn't for an average desktop user. Dedup is for backup storage and virtual images. I don't remember anyone ever saying it is for the average desktop user. I am amazed you got even that much saving - I wouldn't expect there to be any duplicate files on a normal system. Compression is a feature that desktop users would benefit from, not deduplication. Dedup is only useful if you _know_ you are going to have duplicate information, so the two major use cases that come to mind are 1) Mail server. You have small files, probably less than 4k (blocksize), that you are storing hundreds to thousands of. Using dedup would be good for this case, and you'd have to have a small dedup blocksize for it to be useful. Explain to me why you think this would yield duplicate blocks. If your server is Maildir, headers will be in the mail files, and because all emails went to different users, they'd have different headers, and thus not be dedupable. 2) Virtualized guests. If you have 5 different RHEL5 virt guests, chances are you are going to share data between them, but unlike with the mail server example, you are likely to find much larger chunks that are the same, so you'd want a larger dedup blocksize, say 64k. 
You want this because if you did just 4k you'd end up with a ridiculous amount of fragmentation and performance would go down the toilet, so you need a larger dedup blocksize to make for better performance. Fragmentation will cause you problems anyway; the argument in the UNIX world since year dot was that defragging doesn't make a damn worth of difference when you have a hundred users hammering away on a machine that has to skip between all their collective files. If you have VM image files a-la vmware/xen/kvm, then using blocks of the same size as the guests is the only way that you are going to get sane deduplication performance. Otherwise the blocks won't line up. If the dedupe block size is 4KB and the guest fs block size is 4KB, that's a reasonably clean case. The biggest win by far, however, would be when using chroot type guests, as I mentioned. So you'd want an online implementation to give you a choice of dedup blocksize, which seems to me to be overly complicated. I'd just make it always use the fs block size. No point in making it variable. And then let's bring up the fact that you _have_ to manually compare any data you are going to dedup. I don't care if you think you have the greatest hashing algorithm known to man, you are still going to have collisions somewhere at some point, so in order to make sure you don't lose data, you have to manually memcmp the data. So if you are doing this online, that means reading back the copy you want to dedup in the write path so you can do the memcmp before you write. That is going to make your write performance _suck_. IIRC, this is configurable in ZFS so that you can switch off the physical block comparison. If you use SHA256, the probability of a collision (unless SHA is broken, in which case we have much bigger problems) is 1 in 2^128. Times 4KB blocks, that is one collision in 10^24 Exabytes. That's one trillion trillion (that's double trillion) Exabytes. 
That is considerably more storage space than there is likely to be available on the planet for some time. And just for good measure, you could always up the hash to SHA512 or use two different hashes (e.g. a combination of SHA256 and MD5). Do I think offline dedup is awesome? Hell no, but I got distracted doing it as a side project so I figured I'd finish it, and I did it in under 1400 lines. I dare you to do the same with an online implementation. Offline is simpler to implement and simpler to debug if something goes wrong, and has an overall easier-to-control impact on the system. It is also better done outside the FS if you're not going to do it properly, using FL-COW or fuse-based lessfs. Gordan
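The collision argument above can be made concrete with the birthday bound: for an n-bit hash and k distinct blocks, the chance of any accidental collision is roughly k^2 / 2^(n+1). A quick back-of-envelope calculation (assumptions: SHA256, 4 KiB blocks, hash behaves ideally):

```python
def collision_probability(n_bits: int, num_blocks: int) -> float:
    """Birthday-bound estimate: P(any collision) ~ k^2 / 2^(n+1)."""
    return num_blocks * num_blocks / 2.0 ** (n_bits + 1)

# One exabyte of unique 4 KiB blocks:
blocks_per_eb = (10 ** 18) // 4096
p = collision_probability(256, blocks_per_eb)
# p comes out on the order of 1e-49: accidental SHA256 collisions are
# not the practical risk; a cryptographically broken hash is.
```

This is consistent with the thread's figure: you need on the order of 2^128 blocks (about 10^24 exabytes at 4 KiB per block) before a collision becomes likely.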
Re: Offline Deduplication for Btrfs
On ke, 2011-01-05 at 19:58 +, Lars Wirzenius wrote: (For my script, see find-duplicate-chunks in http://code.liw.fi/debian/pool/main/o/obnam/obnam_0.14.tar.gz or get the current code using bzr get http://code.liw.fi/obnam/bzr/trunk/. http://braawi.org/obnam/ is the home page of the backup app.) If I may add: it would perhaps be good to collect numbers on the amount of duplication (for various block sizes) there is on different kinds of systems: random laptops, file servers for small companies and large companies, mail servers, backup servers, VM servers, etc. Would anyone be interested in collecting such numbers? A script like mine would be a bit heavy to run, but not too much so, I bet. It would be good to have hard numbers as a basis of discussion rather than guesses and assumptions. Or perhaps someone's already collected the data?
Re: Offline Deduplication for Btrfs
On 01/05/2011 09:14 PM, Diego Calleja wrote: In fact, there are cases where online dedup is clearly much worse. For example, cases where people suffer duplication, but it takes a lot of time (several months) to hit it. With online dedup, you need to enable it all the time to get deduplication, and the useless resource waste offsets the other advantages. With offline dedup, you only deduplicate when the system really needs it. My point is that on a file server you don't need to worry about the CPU cost of deduplication because you'll run out of I/O long before you run out of CPU. And I can also imagine some unrealistic but theoretically valid cases, like for example an embedded device that for some weird reason needs deduplication but doesn't want online dedup because it needs to save as much power as possible. But it can run an offline dedup when the batteries are charging. That's _very_ theoretical. It's clear to me that if you really want a perfect deduplication solution you need both systems. I'm not against having both. :) Gordan
Re: Offline Deduplication for Btrfs
On 01/06/2011 12:22 AM, Spelic wrote: On 01/05/2011 09:46 PM, Gordan Bobic wrote: On 01/05/2011 07:46 PM, Josef Bacik wrote: Offline dedup is more expensive - so why are you of the opinion that it is less silly? And comparison by silliness quotient still sounds like an argument over which is better. If I can say my opinion, I wouldn't want dedup to be enabled online for the whole filesystem. Three reasons: 1- Virtual machine disk images should not get deduplicated imho, if you care about performance, because fragmentation is more important in that case. I disagree. You'll gain much, much more from improved caching and reduced page cache usage than you'll lose from fragmentation. So offline dedup is preferable IMHO. Or at least online dedup should happen only on configured paths. Definitely agree that it should be a per-directory option, rather than per mount. 2- I don't want performance to drop all the time. I would run dedup periodically on less active hours, hence, offline. A rate limiter should also be implemented so as not to thrash the drives too much. Also a stop-and-continue should be implemented, so that a dedup which couldn't finish within a certain time-frame (e.g. one night) can continue the night after without restarting from the beginning. This is the point I was making - you end up paying double the cost in disk I/O and the same cost in CPU terms if you do it offline. And I am not convinced the overhead of calculating checksums is that great. There are already similar overheads in checksums being calculated to enable smart data recovery in case of silent disk corruption. Now that I mention it, that's an interesting point. Could these be unified? If we crank up the checksums on files a bit, to something suitably useful for deduping, it could make the deduping feature almost free. As for restarting deduping (e.g. you chattr -R a directory to specify it for deduping): since the contents aren't already deduped (the files' entries aren't in the hash index), it'd be obvious what still needs to be deduped and what doesn't. 3- Only some directories should be deduped, for performance reasons. You can foresee where duplicate blocks can exist and where not. Backup directories typically, or mailserver directories. The rest is probably a waste of time. Indeed, see above. I think it should be a per-file setting/attribute, hereditary from the parent directory. Also, the OS is small even if identical on multiple virtual images; how much is it going to occupy anyway? Less than 5GB per disk image usually. And that's the only thing that would be deduped, because data is likely to be different on each instance. How many VMs do you have running? 20? That's at most 100GB saved one-time at the cost of a lot of fragmentation. That's also 100GB fewer disk blocks in contention for page cache. If you're hitting the disks, you're already going to slow down by several orders of magnitude. Better to make the caching more effective. So if you are doing this online, that means reading back the copy you want to dedup in the write path so you can do the memcmp before you write. That is going to make your write performance _suck_. IIRC, this is configurable in ZFS so that you can switch off the physical block comparison. If you use SHA256, the probability of a collision (unless SHA is broken, in which case we have much bigger problems) is 1 in 2^128. Times 4KB blocks, that is one collision in 10^24 Exabytes. That's one trillion trillion (that's double trillion) Exabytes. I like mathematics, but I don't care this time. I would never enable dedup without full-block compare. I think most users and most companies would do the same. I understand where you are coming from, but by that reasoning you could also argue that AES256 isn't good enough to keep your data confidential. 
It is a virtual certainty that you will lose far more data through catastrophic disk+raid+backup failures than through finding a hash collision. If there is full-block compare, a simpler/faster algorithm could be chosen, like MD5. Or even an md-64bits, which I don't think exists, but you can take MD4 and then XOR the first 8 bytes with the second 8 bytes to reduce it to 8 bytes only. This is just because it saves 60% of the RAM occupation during dedup, which is expected to be large, and the collisions are still insignificant at 64 bits. Clearly you need to do a full-block compare after that. I really don't think the cost in terms of a few bytes per file for the hashes is that significant. Note that deduplication IS a cryptographically sensitive matter, because if SHA-1 is cracked, people can nuke (or maybe even alter, and with this, hack privileges on) other users' files by providing blocks with the same SHA and waiting for dedup to pass. Same thing for AES btw, it is showing weaknesses: use blowfish or twofish. SHA1 and AES are two wrong standards... That's just alarmist. AES
Re: Offline Deduplication for Btrfs
On 01/05/2011 09:46 PM, Gordan Bobic wrote: On 01/05/2011 07:46 PM, Josef Bacik wrote: Offline dedup is more expensive - so why are you of the opinion that it is less silly? And comparison by silliness quotient still sounds like an argument over which is better. If I can say my opinion, I wouldn't want dedup to be enabled online for the whole filesystem. Three reasons: 1- Virtual machine disk images should not get deduplicated imho, if you care about performance, because fragmentation is more important in that case. So offline dedup is preferable IMHO. Or at least online dedup should happen only on configured paths. 2- I don't want performance to drop all the time. I would run dedup periodically on less active hours, hence, offline. A rate limiter should also be implemented so as not to thrash the drives too much. Also a stop-and-continue should be implemented, so that a dedup which couldn't finish within a certain time-frame (e.g. one night) can continue the night after without restarting from the beginning. 3- Only some directories should be deduped, for performance reasons. You can foresee where duplicate blocks can exist and where not. Backup directories typically, or mailserver directories. The rest is probably a waste of time. Dedup isn't for an average desktop user. Dedup is for backup storage and virtual images Not virtual images imho, for the reason above. Also, the OS is small even if identical on multiple virtual images; how much is it going to occupy anyway? Less than 5GB per disk image usually. And that's the only thing that would be deduped, because data is likely to be different on each instance. How many VMs do you have running? 20? That's at most 100GB saved one-time at the cost of a lot of fragmentation. Now if you back up those images periodically, that's a place where I would run dedup. I'd just make it always use the fs block size. No point in making it variable. Agreed. What is the reason for variable block size? 
And then let's bring up the fact that you _have_ to manually compare any data you are going to dedup. I don't care if you think you have the greatest hashing algorithm known to man, you are still going to have collisions somewhere at some point, so in order to make sure you don't lose data, you have to manually memcmp the data. Totally agreed. So if you are doing this online, that means reading back the copy you want to dedup in the write path so you can do the memcmp before you write. That is going to make your write performance _suck_. IIRC, this is configurable in ZFS so that you can switch off the physical block comparison. If you use SHA256, the probability of a collision (unless SHA is broken, in which case we have much bigger problems) is 1 in 2^128. Times 4KB blocks, that is one collision in 10^24 Exabytes. That's one trillion trillion (that's double trillion) Exabytes. I like mathematics, but I don't care this time. I would never enable dedup without full-block compare. I think most users and most companies would do the same. If there is full-block compare, a simpler/faster algorithm could be chosen, like MD5. Or even an md-64bits, which I don't think exists, but you can take MD4 and then XOR the first 8 bytes with the second 8 bytes to reduce it to 8 bytes only. This is just because it saves 60% of the RAM occupation during dedup, which is expected to be large, and the collisions are still insignificant at 64 bits. Clearly you need to do a full-block compare after that. BTW if you want to allow (as an option) dedup without full-block compare, SHA1 is not so good: SHA-0 already had problems, and now SHA-1 has problems; I almost wouldn't suggest it for cryptographically secure stuff, foreseeing the future. Use RIPEMD-160 or, even better, RIPEMD-256, which is even faster according to http://www.cryptopp.com/benchmarks.html - the RIPEMDs are much better algorithms than the SHAs: they have no known weaknesses. 
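The "md-64bits" idea above (fold a 128-bit digest to 8 bytes by XORing its two halves, halve index RAM, and rely on a full byte-by-byte compare to reject false matches) can be sketched like this. MD5 stands in for MD4 here, since MD4 is often unavailable in modern hashlib builds; the helper names are illustrative.

```python
import hashlib

def hash64(block: bytes) -> bytes:
    """Fold a 128-bit MD5 digest to 8 bytes by XORing the two halves."""
    d = hashlib.md5(block).digest()          # 16 bytes
    return bytes(a ^ b for a, b in zip(d[:8], d[8:]))

def is_duplicate(block: bytes, index: dict) -> bool:
    """index maps 8-byte folded hashes to candidate block contents.

    A hash match alone is never trusted: the full-block compare decides."""
    key = hash64(block)
    candidate = index.get(key)
    if candidate is None:
        index[key] = block
        return False
    return candidate == block               # memcmp equivalent: reject false matches
```

The trade-off is exactly the one described: half the index RAM of a 128-bit digest, more frequent (but harmless) false candidates, and correctness guaranteed by the full compare rather than the hash.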
Note that deduplication IS a cryptographically sensitive matter, because if SHA-1 is cracked, people can nuke (or maybe even alter, and with this, escalate privileges via) other users' files by providing blocks with the same hash and waiting for dedup to pass. Same thing for AES, BTW; it is showing weaknesses: use Blowfish or Twofish. SHA-1 and AES are two wrong standards...

Dedup without a full block compare seems indeed suited to online dedup (which I wouldn't enable, now for one more reason) because with full block compares performance would really suck. But please leave the full block compare in for offline dedup.

Also, I could suggest a third type of deduplication, but this is harder... it's a file-level deduplication which works like xdelta, that is, it is capable of recognizing pieces of identical data in two files which are not at the same offset and which are not even aligned at a block boundary. For this, a rolling hash like the one in rsync, or the xdelta 3.0 algorithm, could be used. For this to
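The rsync-style rolling hash mentioned above can be sketched like this: the weak checksum of a sliding window is updated in O(1) per byte rather than recomputed from scratch, which is what makes scanning every byte offset of a file affordable.

```python
def weak_rolling(data: bytes, size: int):
    """Yield the rsync-style weak checksum of every `size`-byte window
    of `data`. `a` is the byte sum and `b` the sum of running sums,
    both mod 2^16; sliding the window updates both in O(1)."""
    M = 1 << 16
    a = sum(data[:size]) % M
    b = sum((size - i) * data[i] for i in range(size)) % M
    yield (b << 16) | a
    for i in range(len(data) - size):
        out, new = data[i], data[i + size]
        a = (a - out + new) % M          # drop the byte leaving the window,
        b = (b - size * out + a) % M     # account for the one entering it
        yield (b << 16) | a
```

Only windows whose weak checksum matches would then be checked with a strong hash, so identical runs are found even when they sit at different, unaligned offsets in the two files.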
Re: Offline Deduplication for Btrfs
Excerpts from Gordan Bobic's message of 2011-01-05 12:42:42 -0500:

Josef Bacik wrote: Basically I think online dedup is a huge waste of time and completely useless.

I couldn't disagree more. First, let's consider what the general-purpose use case of data deduplication is. What are the resource requirements to perform it? How do these resource requirements differ between online and offline?

I don't really agree with Josef that dedup is dumb, but I do think his current approach is the most reasonable. Dedup has a few very valid use cases, which I think break down to: 1) backups, 2) VM images.

The backup farm use case is the best candidate for dedup in general, because backups are generally write once and hopefully read never. Fragmentation for reading doesn't matter at all, and we're really very sure we're going to back up the same files over and over again. But it's also something that will be dramatically more efficient when the backup server helps out. The backup server knows two files have the same name and the same size, and can guess with very high accuracy that they will be the same. So it is a very good candidate for Josef's offline dedup, because it can just do the dedup right after writing the file. In a backup farm, whole files are very likely to be identical, which again is very easy to optimize with Josef's approach.

Next is VM images. This is actually a much better workload for online dedup, except for the part where our poor storage server would be spending massive amounts of CPU deduping blocks for all the VMs on the machine. In this case the storage server doesn't know the filenames; it just has bunches of blocks that are likely to be the same across VMs. So it seems a bit silly to do this out of band, where we wander through the FS and read a bunch of blocks in hopes of finding ones with the same hash. But one of the things on our features-to-implement page is to wander through the FS and read all the blocks from time to time.
We want to do this in the background to make sure the bits haven't rotted on disk. By scrubbing from time to time we are able to make sure that when a disk does die, the other disks in the FS are likely to have a good copy. So again, Josef's approach actually works very well. His dedup util could be the scrubbing util, and we'd get two projects for the price of one.

As for the security of hashes: we're unlikely to find a collision on a sha256 that wasn't made maliciously. If the system's data is controlled and you're not worried about evil people putting files on there, extra reads really aren't required. But then again, extra reads are a good thing (see above about scrubbing). The complexity of the whole operation goes down dramatically when we do the verifications, because hash index corruptions (this extent has this hash) will be found instead of blindly trusted.

None of this means that online dedup is out of the question; I just think the offline stuff is a great way to start.

-chris
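The backup-server advantage described above - same name plus same size is a very strong duplicate signal - can be sketched as a two-stage candidate filter: group by the metadata the server already has, then confirm with a content hash before any dedup happens. The function and grouping policy here are hypothetical, for illustration only.

```python
import hashlib
import os
from collections import defaultdict

def whole_file_candidates(paths):
    """Group files by (basename, size) - the cheap signals a backup
    server already has - then confirm each group with a content hash.
    Returns only hash-confirmed groups of two or more files."""
    groups = defaultdict(list)
    for p in paths:
        groups[(os.path.basename(p), os.path.getsize(p))].append(p)
    confirmed = defaultdict(list)
    for members in groups.values():
        if len(members) < 2:
            continue                      # no cheap-signal match: skip the read
        for p in members:
            with open(p, "rb") as f:      # hash only the plausible duplicates
                h = hashlib.sha256(f.read()).hexdigest()
            confirmed[h].append(p)
    return {h: ps for h, ps in confirmed.items() if len(ps) > 1}
```

Files that never share a (name, size) pair are never even read, which is why letting the backup server help is so much cheaper than hashing every block blindly.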
Re: Offline Deduplication for Btrfs
On 01/06/2011 02:03 AM, Gordan Bobic wrote: That's just alarmist. AES is being cryptanalyzed because everything uses it. And the news of its insecurity is somewhat exaggerated (for now at least).

Who cares... the fact that they are not widely used is then a benefit for RIPEMD and Blowfish/Twofish. Nobody writes viruses for Linux because they target Windows. Same thing... RIPEMD still has an advantage over SHA, IMHO, and Blowfish over AES.

If there is a full block compare, a simpler/faster algorithm could be chosen, like MD5. Or even a 64-bit digest, which I don't think exists as a standard, but you can take MD4 and then XOR the first 8 bytes with the second 8 bytes to reduce it to 8 bytes. This is just because it saves 60% of the RAM occupied during dedup, which is expected to be large, and the collisions are still insignificant at 64 bits. Clearly you need to do a full block compare after that.

I really don't think the cost in terms of a few bytes per file for the hashes is that significant.

It's 20 down to 8 bytes = 12 bytes saved per *filesystem block*, I think. Aren't we talking about block-level deduplication? For every TB of filesystem you occupy 2GB of RAM with hashes instead of 5.3GB (I am assuming 4K blocks; I don't remember how big btrfs blocks are). For a 24 x 2TB storage array, you occupy 96GB instead of 254GB of RAM. It might be the edge between feasible and not feasible. Actually, it might not be feasible anyway... an option to store the hashes on an SSD should be provided then...
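The RAM arithmetic above checks out if you assume 4K blocks and one in-core digest per block (the post mixes binary terabytes with decimal gigabytes, which is why the figures look slightly uneven):

```python
BLOCK = 4096        # assumed filesystem block size
TIB = 2 ** 40       # the post's "TB" reads as a binary terabyte

def digest_ram(fs_bytes: int, digest_bytes: int) -> int:
    """Bytes of RAM needed for one in-core digest per filesystem block."""
    return fs_bytes // BLOCK * digest_bytes

# Per binary terabyte: ~2.1 GB with 8-byte digests vs ~5.4 GB with 20-byte.
per_tib_8 = digest_ram(TIB, 8) / 1e9
per_tib_20 = digest_ram(TIB, 20) / 1e9
# A 24 x 2TB array: ~103 GB vs ~258 GB (the post rounds these to 96 and 254).
array_8 = digest_ram(24 * 2 * TIB, 8) / 1e9
array_20 = digest_ram(24 * 2 * TIB, 20) / 1e9
```

Either way the table is far too large for RAM on a big array, which supports the suggestion of spilling the hash table to an SSD.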
Re: Offline Deduplication for Btrfs
On Wed, Jan 5, 2011 at 5:03 PM, Gordan Bobic gor...@bobich.net wrote: On 01/06/2011 12:22 AM, Spelic wrote: Definitely agree that it should be a per-directory option, rather than per mount.

JOOC, would the dedupe table be done per directory, per mount, per sub-volume, or per volume? The larger the pool of data to check against, the better your dedupe ratios will be. I'm not up to date on all the terminology that btrfs uses and how it compares to ZFS (disks - vdevs - pool - filesystem/volume), so the terms above may be incorrect. :)

In the ZFS world, dedupe is done pool-wide in that any block in the pool is a candidate for dedupe, but the dedupe property can be enabled/disabled on a per-filesystem basis. Thus, only blocks in filesystems with the dedupe property enabled will be deduped, but blocks from any filesystem can be compared against.

This is the point I was making - you end up paying double the cost in disk I/O and the same cost in CPU terms if you do it offline. And I am not convinced the overhead of calculating checksums is that great. There are already similar overheads in checksums being calculated to enable smart data recovery in case of silent disk corruption. Now that I mention it, that's an interesting point. Could these be unified? If we crank up the checksums on files a bit, to something suitably useful for deduping, it could make the deduping feature almost free.

This is what ZFS does. Every block in the pool has a checksum attached to it. Originally, the default algorithm was fletcher2, with fletcher4 and sha256 as alternates. When dedupe was enabled, the default was changed to fletcher4. Dedupe also came with the option to enable/disable a byte-for-byte verify when the hashes match. By switching the checksum algorithm for the pool to sha256 ahead of time, you can enable dedupe and get the dedupe checksumming for free. :)

Also, the OS is small even if identical on multiple virtual images; how much space is it going to occupy anyway?
Less than 5GB per disk image, usually. And that's the only thing that would be deduped, because the data is likely to be different on each instance. How many VMs do you have running? 20? That's at most 100GB saved, one time, at the cost of a lot of fragmentation.

That's also 100GB fewer disk blocks in contention for the page cache. If you're hitting the disks, you're already going to slow down by several orders of magnitude; better to make the caching more effective. If you set up your VMs as diskless images, using NFS off a storage server running whatever FS with dedupe, you can get a lot more out of it than using disk image files (where you have all the block sizes and alignment to worry about). And then you can use all the fancy snapshotting, cloning, etc. features of whatever FS as well.

--
Freddie Cash
fjwc...@gmail.com
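The "checksumming for free" point above rests on the per-block checksum ZFS already keeps. A sketch of the fletcher4 recurrence shows why it is so cheap (four additions per 32-bit word) and why it is only a first-pass filter rather than a collision-proof identity; this is an illustrative reimplementation, not ZFS code.

```python
import struct

MASK64 = (1 << 64) - 1

def fletcher4(data: bytes):
    """ZFS-style fletcher4 sketch: four 64-bit running sums over the
    input taken as 32-bit little-endian words. Fast but weak, so a
    dedup match on it still needs a stronger hash or a verify pass."""
    assert len(data) % 4 == 0, "fletcher4 runs over whole 32-bit words"
    a = b = c = d = 0
    for (w,) in struct.iter_unpack("<I", data):
        a = (a + w) & MASK64     # plain sum
        b = (b + a) & MASK64     # sum of sums
        c = (c + b) & MASK64     # ... and so on, catching reorderings
        d = (d + c) & MASK64
    return (a, b, c, d)
```

Because the higher-order sums weight each word by its position, swapped or shifted words change the result, which a plain sum would miss.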
Re: Offline Deduplication for Btrfs
On Wednesday, January 05, 2011 08:19:04 pm Spelic wrote: I'd just make it always use the fs block size. No point in making it variable. Agreed. What is the reason for variable block size?

First post on this list - I mostly was just reading so far, to learn more about fs design, but this is one topic I (unfortunately) have experience with... You wouldn't believe the difference variable-block-size dedupe makes. For a pure file server it's OK to dedupe at the block level, but for most other uses, variable is king.

One big example is backups. NetBackup and most others produce one stream with all the data, even when backing up to disk. Imagine you move a whole lot of data from one dir to another - think a directory with huge video files. As a filesystem it would be deduped nicely. The backup stream, however, may or may not have matching fs blocks. If the directory name before and after has the same length and such, then yeah, dedupe works. Directory name is a byte shorter? Everything in the stream will be offset by one byte, and no dedupe will occur at all on the whole dataset.

In the real world, just compare the dedupe performance of an Oracle 7000 (ZFS, and therefore fs-block based) to a DataDomain (variable length) in this usage scenario. Among our customers we see something like a 3 to 17x dedupe ratio on the DD, and 1.02 - 1.05 on the 7000.

There are many other examples, and in the end it comes down to whether you want general-purpose dedupe (e.g. something useful when you serve iSCSI LUNs, serve as a backup target, ...) or a pure file store; for the latter, you're probably going to be OK with fixed block lengths...

Hope that helps,
Peter.

--
Censorship: noun, circa 1591. a: Relief of the burden of independent thinking.
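Peter's one-byte-offset example is exactly what content-defined (variable-length) chunking avoids: cut points are chosen from local content, so an insertion shifts everything without moving later cuts. A toy illustration follows - window and chunk sizes are shrunk for the example, and a real chunker would use a rolling hash instead of rehashing every window.

```python
import hashlib

WIN = 16      # bytes examined to decide a cut point (toy size)
MASK = 0x1F   # cut when these bits of the window hash are all zero

def cdc_chunks(data: bytes):
    """Content-defined chunking: cut after any position where a hash of
    the previous WIN bytes matches a fixed pattern. Cut points depend
    only on nearby content, never on absolute offsets in the stream."""
    out, start = [], 0
    for i in range(WIN - 1, len(data)):
        window = data[i - WIN + 1:i + 1]
        if hashlib.md5(window).digest()[0] & MASK == 0:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out
```

Insert one byte at the front of a stream and every fixed-size block changes, but with content-defined cuts every chunk after the first boundary is byte-identical to before, so it still dedupes - which is the behavior gap between the fixed-block and variable-length products described above.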