Re: Data Deduplication with the help of an online filesystem check
Chris Mason wrote:
> On Thu, Jun 04, 2009 at 10:49:19AM +0200, Thomas Glanzmann wrote:
>> Hello Chris,
>> My question is now: how often can a block in btrfs be referenced?
>
> The exact answer depends on whether we are referencing it from a single
> file or from multiple files, but either way it is roughly 2^32.
>
>> Could you please explain what underlying data structure is used to
>> track whether a block is still referenced or already free? Is it a
>> counter, a bitmap (but that can't be, if it is 2^32), or some sort of
>> list? I assume a counter is used. If so, I assume that when, for
>> example, a snapshot is deleted, the reference counter of every block
>> referenced by that snapshot is decremented by one. Is this correct,
>> or am I missing something?
>
> It is a counter and a back reference. With Yan Zheng's new format work,
> the limit is now 2^64. When a snapshot is deleted, the btree is walked
> to efficiently drop the references on the blocks it referenced.
>
> From a dedup point of view, we'll want the dedup file to hold a
> reference on the file extents. The kernel ioctl side of things will
> take care of that part.

I wonder how well deduplication would work with defragmentation? One
excludes the other to some extent.

-- 
Tomasz Chmielewski
http://wpkg.org

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Data Deduplication with the help of an online filesystem check
On Fri, Jun 05, 2009 at 02:20:48PM +0200, Tomasz Chmielewski wrote:
> I wonder how well deduplication would work with defragmentation? One
> excludes the other to some extent.

Very much so ;) Ideally we end up doing dedup in large extents, but it
will definitely increase the overall fragmentation of the FS.

-chris
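The "counter and a back reference" scheme Chris describes can be sketched in miniature. This is an illustrative toy model, not btrfs code; all class and method names here are invented for the example.

```python
# Toy model of reference-counted blocks with back references, loosely
# following the scheme described above. Not btrfs code.

class BlockPool:
    def __init__(self):
        self.refcount = {}   # block number -> reference count
        self.backrefs = {}   # block number -> set of owners (e.g. snapshots)

    def take_ref(self, block, owner):
        self.refcount[block] = self.refcount.get(block, 0) + 1
        self.backrefs.setdefault(block, set()).add(owner)

    def drop_ref(self, block, owner):
        self.refcount[block] -= 1
        self.backrefs[block].discard(owner)
        if self.refcount[block] == 0:
            # last reference gone: the block is free
            del self.refcount[block]
            del self.backrefs[block]

    def delete_snapshot(self, owner, blocks):
        # deleting a snapshot drops its reference on every block it used
        for b in blocks:
            self.drop_ref(b, owner)

pool = BlockPool()
for b in (1, 2, 3):
    pool.take_ref(b, "subvol")
for b in (1, 2):
    pool.take_ref(b, "snap1")     # snapshot shares blocks 1 and 2

pool.delete_snapshot("snap1", [1, 2])
print(sorted(pool.refcount))      # [1, 2, 3]: still held by "subvol"
```

The back references are what let the real filesystem walk from a block back to its owners; the counter alone only says when a block becomes free.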
Re: Data Deduplication with the help of an online filesystem check
Chris Mason wrote:
>> I wonder how well deduplication would work with defragmentation? One
>> excludes the other to some extent.
>
> Very much so ;) Ideally we end up doing dedup in large extents, but it
> will definitely increase the overall fragmentation of the FS.

Defragmentation could lead to interesting problems if it's not aware of
deduplication.

I can imagine freeing up space (i.e., as seen by userspace) where
duplicated blocks are found, but keeping track of duplicated blocks
internally (and not allowing any duplicated block to be overwritten).
Only when free space is really needed would blocks which have more
copies be deduplicated and allowed to be overwritten with data.

Perhaps complicated, though.

-- 
Tomasz Chmielewski
Re: Data Deduplication with the help of an online filesystem check
Hello Chris,

>> My question is now: how often can a block in btrfs be referenced?
>
> The exact answer depends on whether we are referencing it from a single
> file or from multiple files, but either way it is roughly 2^32.

Could you please explain what underlying data structure is used to track
whether a block is still referenced or already free? Is it a counter, a
bitmap (but that can't be, if it is 2^32), or some sort of list? I
assume a counter is used. If so, I assume that when, for example, a
snapshot is deleted, the reference counter of every block referenced by
that snapshot is decremented by one. Is this correct, or am I missing
something?

Thomas
Re: Data Deduplication with the help of an online filesystem check
Hello Heinz,

> Hi, during the last half year I thought a little bit about doing dedup
> for my backup program: not only with fixed blocks (which is
> implemented), but with moving blocks (with all offsets in a file: 1
> byte, 2 byte, ...). That means I have to do *lots* of comparisons
> (size of file - blocksize). Even though it's not the same, it must be
> very fast, and that's the same problem as the one discussed here.

Because I just stumbled across it, I wanted to let you know about an
interesting approach that NetApp is using for its Virtual Tape Library:

http://www.netapp.com/us/communities/tech-ontap/vtl-dedupe.html

Thomas
Re: Data Deduplication with the help of an online filesystem check
Heinz-Josef Claes wrote (ao):
> Am Dienstag, 28. April 2009 19:38:24 schrieb Chris Mason:
>> On Tue, 2009-04-28 at 19:34 +0200, Thomas Glanzmann wrote:
>>> Hello,
>>>
>>>> I wouldn't rely on crc32: it is not a strong hash. Such
>>>> deduplication can lead to various problems, including security
>>>> ones.
>>>
>>> Sure thing. Did you think of replacing crc32 with sha1 or md5? Is
>>> this even possible at the moment with btrfs (is there enough space
>>> reserved so that the change can be done without changing the
>>> filesystem layout)?
>>
>> It is possible; there's room in the metadata for about 4k of checksum
>> for each 4k of data. The initial btrfs code used sha256, but the real
>> limiting factor is the CPU time used.
>>
>> -chris
>
> It's not only CPU time, it's also memory. You need 32 bytes for each
> 4k block, and it needs to be in RAM for performance reasons.

Less so with an SSD, I would assume.

-- 
Humilis IT Services and Solutions
http://www.humilis.net
Re: Data Deduplication with the help of an online filesystem check
On Tue, 5 May 2009 07:29:45 +1000 Dmitri Nikulin <dniku...@gmail.com> wrote:
> On Tue, May 5, 2009 at 5:11 AM, Heinz-Josef Claes <hjcl...@web.de> wrote:
>> Hi, during the last half year I thought a little bit about doing
>> dedup for my backup program: not only with fixed blocks (which is
>> implemented), but with moving blocks (with all offsets in a file: 1
>> byte, 2 byte, ...). That means I have to do *lots* of comparisons
>> (size of file - blocksize). Even though it's not the same, it must
>> be very fast, and that's the same problem as the one discussed here.
>>
>> My solution (not yet implemented) is as follows (hopefully I
>> remember well): I calculate a checksum of 24 bits (it could be
>> another size). This means I can have 2^24 different checksums.
>> Therefore, I hold a bit vector of 0.5 GB in memory (I hope I
>> remember well, I'm just in a hotel and have no calculator): one bit
>> for each possibility. This vector is initialized with zeros. For
>> each calculated checksum of a block, I set the corresponding bit in
>> the bit vector. It's very fast to check whether a block with a
>> particular checksum exists in the filesystem (backup, for me) by
>> checking the appropriate bit in the bit vector. If it doesn't exist,
>> it's a new block. If it exists, there needs to be a separate "real"
>> check of whether it's really the same block (which is slow, but
>> that's happening 1% of the time).
>
> Which means you have to refer to each block in some unique way from
> the bit vector, making it a block pointer vector instead. That's only
> 64 times more expensive for a 64-bit offset...

It was not the idea to have a pointer vector, only a bit vector. A
pointer vector would be too big to hold in RAM. Therefore I need access
to the disk after using the more exact md5sum (which I wanted to use).
The bit vector is only needed to give a very quick decision in most
cases (a speedup). But I have no idea whether it fits this use case:
I'm not a filesystem developer ;-)

> Since the overwhelming majority of combinations will never appear in
> practice, you are much better served with a self-sizing data
> structure like a hash map, or even a binary tree, or a hash map with
> each bucket being a binary tree, etc. You can use any sized hash and
> it won't affect the number of nodes you have to store. You can trade
> off CPU or RAM easily as required, just by selecting an appropriate
> data structure.
>
> A bit vector, and especially a pointer vector, has extremely bad
> any-case RAM requirements, because even if you're deduping a mere 10
> blocks you're still allocating and initialising 2^24 offsets. The
> least you could do is adaptively switch to a more efficient data
> structure if you see the number of blocks is low enough.
>
> -- 
> Dmitri Nikulin
> Centre for Synchrotron Science
> Monash University
> Victoria 3800, Australia

-- 
Heinz-Josef Claes <hjcl...@web.de>
Re: Data Deduplication with the help of an online filesystem check
Hello Jan,

* Jan-Frode Myklebust <janfr...@tanso.net> [090504 20:20]:
> "Thin or shallow clones" sounds more like sparse images. I believe
> "linked clones" is the term for running multiple virtual machines off
> a single gold image. Ref, the "VMware View Composer" section of:
> http://www.vmware.com/products/view/whatsincluded.html

Not exactly. VMware has one golden image and then accounts for the
blocks that differ from the golden image in a snapshot file, on a
block-level basis. So yes, it is a "sparse file", but implemented in
userland.

> All desktops that are linked to a master image can be patched or
> updated simply by updating the master image, without affecting users'
> settings, data or applications.

True for their desktops, because each desktop consists of a virtual
machine which is created when a user connects and destroyed when he
disconnects. Bottom line: the VM for the thin client only exists while
the user is connected. That, of course, makes it easy to update the
master image. But for VMs that are not desktops and don't get
created/destroyed at least once a day, this doesn't scale.

Thomas
Re: Data Deduplication with the help of an online filesystem check
Ric Wheeler schrieb:
> One thing in the above scheme that would be really interesting for
> all possible hash functions is maintaining good stats on hash
> collisions, effectiveness of the hash, etc. There has been a lot of
> press about MD5 hash collisions, for example - it would be really
> neat to be able to track real-world data on those.

See here ("The hashing function"):
http://backuppc.sourceforge.net/faq/BackupPC.html#some_design_issues

It's not real-world data, but it gives some overview which applies here.

-- 
Tomasz Chmielewski
http://wpkg.org
Re: Data Deduplication with the help of an online filesystem check
Hello Ric,

> (1) Block level or file level dedup?

What is the difference between the two?

> (2) Inband dedup (during a write) or background dedup?

I think inband dedup is way too intensive on resources (memory) and
would also kill every performance benchmark. So I think offline dedup
is the right way to go.

> (3) How reliably can you protect the pool of blocks? How reliably can
> you protect the database that maps hashes to blocks?

You have to lock down the I/O requests for the blocks in question and
compare them byte by byte anyway, just to make sure.

> (4) Can you give users who are somewhat jaded confidence in your
> solution (this is where stats come in very handy!)

For virtual machines you can reduce the used data by 1/3. Of course it
can blow up in your face if you don't watch your physical resources
closely.

Thomas
Re: Data Deduplication with the help of an online filesystem check
Thomas Glanzmann wrote:
> Hello Ric,
>
>> (1) Block level or file level dedup?
>
> What is the difference between the two?
>
>> (2) Inband dedup (during a write) or background dedup?
>
> I think inband dedup is way too intensive on resources (memory) and
> would also kill every performance benchmark. So I think offline dedup
> is the right way to go.

I would not categorize it as offline, but just not as inband (i.e., you
can run a low-priority background process to handle dedup). Offline
windows are extremely rare in production sites these days, and it could
take a very long time to do dedup at the block level over a large file
system :-)

>> (3) How reliably can you protect the pool of blocks? How reliably
>> can you protect the database that maps hashes to blocks?
>
> You have to lock down the I/O requests for the blocks in question and
> compare them byte by byte anyway, just to make sure.

Yes. One advantage we had in Centera was that the objects were
read-only and whole files, so this was not an issue for us.

>> (4) Can you give users who are somewhat jaded confidence in your
>> solution (this is where stats come in very handy!)
>
> For virtual machines you can reduce the used data by 1/3. Of course
> it can blow up in your face if you don't watch your physical
> resources closely.

1/3 is not sufficient for dedup in my opinion - you can get that with
normal compression at the block level. Put another way, if the baseline
is bzip2 levels of compression for the block device, when is the
complexity of dedup going to pay off?

I think that Chris already mentioned that you can (for virt OS images)
also imagine using copy-on-write snapshots (of something that is mostly
read-only, like the OS system partitions). Make 50 copy-on-write
snapshots of / (excluding /home) and you have an effective compression
of 98% - until someone rudely starts writing, at least :-)

ric
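Ric's 98% figure is just the sharing ratio of N copy-on-write clones of a single image: only one physical copy exists, so the saving is (N - 1) / N. The arithmetic can be checked directly (the 10 GB image size below is an assumption for illustration):

```python
# Effective "compression" of N copy-on-write snapshots of one image.
image_gb = 10        # assumed image size, purely for illustration
n_snapshots = 50

logical = image_gb * n_snapshots   # what userspace appears to have
physical = image_gb                # what is actually stored (before writes)
saving = (logical - physical) / logical

print(f"{saving:.0%}")   # 98%
```

The saving is independent of the image size; it only depends on the number of snapshots, and it erodes as soon as the clones diverge through writes.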
Re: Data Deduplication with the help of an online filesystem check
On 05/04/2009 10:39 AM, Tomasz Chmielewski wrote:
> Ric Wheeler schrieb:
>> One thing in the above scheme that would be really interesting for
>> all possible hash functions is maintaining good stats on hash
>> collisions, effectiveness of the hash, etc.
>
> See here ("The hashing function"):
> http://backuppc.sourceforge.net/faq/BackupPC.html#some_design_issues
>
> It's not real-world data, but it gives some overview which applies
> here.

I have lots of first-hand real-world data from 5 years at EMC in the
Centera group, where we used various types of hashes to do single
instancing at the file level, but those are unfortunately not public
:-) Note that other parts of EMC do dedup at the block level (Avamar,
most notably).

The key to doing dedup at scale is answering a bunch of questions:

(1) Block level or file level dedup?
(2) Inband dedup (during a write) or background dedup?
(3) How reliably can you protect the pool of blocks? How reliably can
    you protect the database that maps hashes to blocks?
(4) Can you give users who are somewhat jaded confidence in your
    solution (this is where stats come in very handy!)
(5) In the end, dedup is basically a data compression trick - how
    effective is it (including the costs of the metadata, etc.)
    compared to less complex schemes? What does it cost in CPU, DRAM
    and impact on the foreground workload?

Dedup is a very active area in the storage world, but you need to be
extremely robust in the face of failure, since a single block failure
could (worst case!) lose all stored data in the pool :-)

Regards,
Ric
Re: Data Deduplication with the help of an online filesystem check
On 04/28/2009 01:41 PM, Michael Tharp wrote:
> Thomas Glanzmann wrote:
>> No, I just used the md5 checksum. And even if I have a hash
>> collision, which is highly unlikely, it still gives a good ballpark
>> figure.
>
> I'd start with a crc32 and/or MD5 to find candidate blocks, then do a
> bytewise comparison before actually merging them. Even the risk of an
> accidental collision is too high, and considering there are plenty of
> birthday-style MD5 attacks, it would not be extraordinarily difficult
> to construct a block that collides with e.g. a system library. Keep
> in mind that although digests do a fairly good job of making unique
> identifiers for larger chunks of data, they can only hold so many
> unique combinations. Considering you're comparing blocks of a few
> kibibytes in size, it's best to just do a foolproof comparison.
> There's nothing wrong with using a checksum/digest as a screening
> mechanism, though.
>
> -- m. tharp

One thing in the above scheme that would be really interesting for all
possible hash functions is maintaining good stats on hash collisions,
effectiveness of the hash, etc. There has been a lot of press about MD5
hash collisions, for example - it would be really neat to be able to
track real-world data on those.

Ric
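Michael's screen-then-verify approach is straightforward to sketch: the digest only nominates candidate pairs, and a bytewise comparison makes the final call before any merge. A minimal in-memory sketch (function name and return shape are invented for the example):

```python
import hashlib

def dedup_candidates(blocks):
    """Group blocks by MD5, then verify bytewise before merging.

    Returns (kept_index, duplicate_index) pairs. The digest is only a
    screening mechanism; the byte comparison is the foolproof test.
    """
    seen = {}        # digest -> indices of distinct blocks with that digest
    merges = []
    for i, blk in enumerate(blocks):
        d = hashlib.md5(blk).digest()
        for j in seen.get(d, []):
            if blocks[j] == blk:        # foolproof bytewise comparison
                merges.append((j, i))
                break
        else:
            # new digest, or a digest collision with different bytes
            seen.setdefault(d, []).append(i)
    return merges

blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096]
print(dedup_candidates(blocks))   # [(0, 2)]
```

Note that a genuine MD5 collision (same digest, different bytes) lands in the `else` branch and is kept as a distinct block, which is exactly the safety property argued for above.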
Re: Data Deduplication with the help of an online filesystem check
Hello Andrey,

> As far as I understand, VMware already ships this "gold image" feature
> (as they call it) for Windows environments and claims it to be very
> efficient.

They call it "thin or shallow clones" and ship it with desktop
virtualization (one VM per thin-client user) and with VMware Lab
Manager. Looking at the website content, it also revealed that VMware
will have a similar feature for their workhorse, "ESX Server", in the
upcoming release. However, my point still stands: ship out a service
pack for Windows and you have 1.5 GB of modified data that is not
deduped, because the cloning is a once-in-a-VM's-life opportunity.

Thomas
Re: Data Deduplication with the help of an online filesystem check
Ric,

> I would not categorize it as offline, but just not as inband (i.e.,
> you can run a low-priority background process to handle dedup).
> Offline windows are extremely rare in production sites these days,
> and it could take a very long time to do dedup at the block level
> over a large file system :-)

Let me rephrase: by offline I meant asynchronous, during off hours.

> 1/3 is not sufficient for dedup in my opinion - you can get that with
> normal compression at the block level.

1/3 is the real-world figure from a production environment with a mixed
VM setup, without compression.

Thomas
Re: Data Deduplication with the help of an online filesystem check
On 2009-05-04, Thomas Glanzmann <tho...@glanzmann.de> wrote:
>> As far as I understand, VMware already ships this "gold image"
>> feature (as they call it) for Windows environments and claims it to
>> be very efficient.
>
> They call it "thin or shallow clones"

"Thin or shallow clones" sounds more like sparse images. I believe
"linked clones" is the term for running multiple virtual machines off a
single gold image. Ref, the "VMware View Composer" section of:

http://www.vmware.com/products/view/whatsincluded.html

> Looking at the website content, it also revealed that VMware will
> have a similar feature for their workhorse, "ESX Server", in the
> upcoming release. However, my point still stands: ship out a service
> pack for Windows and you have 1.5 GB of modified data that is not
> deduped.

"All desktops that are linked to a master image can be patched or
updated simply by updating the master image, without affecting users'
settings, data or applications."

-jf
Re: Data Deduplication with the help of an online filesystem check
Thomas Glanzmann schrieb:
> Ric,
>
>> I would not categorize it as offline, but just not as inband (i.e.,
>> you can run a low-priority background process to handle dedup).
>> Offline windows are extremely rare in production sites these days,
>> and it could take a very long time to do dedup at the block level
>> over a large file system :-)
>
> Let me rephrase: by offline I meant asynchronous, during off hours.

Hi, during the last half year I thought a little bit about doing dedup
for my backup program: not only with fixed blocks (which is
implemented), but with moving blocks (with all offsets in a file: 1
byte, 2 byte, ...). That means I have to do *lots* of comparisons (size
of file - blocksize). Even though it's not the same, it must be very
fast, and that's the same problem as the one discussed here.

My solution (not yet implemented) is as follows (hopefully I remember
well):

I calculate a checksum of 24 bits (it could be another size). This
means I can have 2^24 different checksums. Therefore, I hold a bit
vector of 0.5 GB in memory (I hope I remember well, I'm just in a hotel
and have no calculator): one bit for each possibility. This vector is
initialized with zeros. For each calculated checksum of a block, I set
the corresponding bit in the bit vector.

It's then very fast to check whether a block with a particular checksum
exists in the filesystem (backup, for me) by checking the appropriate
bit in the bit vector:

- If it doesn't exist, it's a new block.
- If it exists, there needs to be a separate "real" check of whether it
  is really the same block (which is slow, but that happens perhaps 1%
  of the time).

I hope it is possible to understand my thoughts. I'm in a hotel and I
possibly cannot track the emails in this list in the next hours or
days.

Regards,
HJC
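Heinz's bit vector is essentially a one-hash Bloom filter: a 24-bit checksum indexes into 2^24 bits (2 MB of RAM), a set bit means only "possibly seen before", and a clear bit means "definitely new". A sketch of the idea; using the low 24 bits of CRC32 as the checksum is my assumption, since the email doesn't name a function:

```python
import zlib

BITS = 24
vector = bytearray(2 ** BITS // 8)   # 2^24 bits = 2 MB, all zeros

def checksum24(block):
    # 24-bit checksum; here simply the low 24 bits of CRC32 (an
    # assumption - any cheap function of this width would do)
    return zlib.crc32(block) & 0xFFFFFF

def possibly_seen(block):
    """True means "maybe seen before" (a real comparison is still
    needed); False means definitely a new block."""
    c = checksum24(block)
    byte, bit = divmod(c, 8)
    seen = bool(vector[byte] & (1 << bit))
    vector[byte] |= 1 << bit         # record this block's checksum
    return seen

print(possibly_seen(b"hello"))   # False: first time seen
print(possibly_seen(b"hello"))   # True: candidate, verify bytewise
```

The memory cost is fixed regardless of how many blocks exist, which is the appeal; the cost is the false-positive path, where a slow exact check (e.g. md5 plus disk access, as described above) resolves the ambiguity.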
Re: Data Deduplication with the help of an online filesystem check
On Mon, May 4, 2009 at 10:06 PM, Jan-Frode Myklebust <janfr...@tanso.net> wrote:
>> Looking at the website content, it also revealed that VMware will
>> have a similar feature for their workhorse, "ESX Server", in the
>> upcoming release. However, my point still stands: ship out a service
>> pack for Windows and you have 1.5 GB of modified data that is not
>> deduped.
>
> "All desktops that are linked to a master image can be patched or
> updated simply by updating the master image, without affecting users'
> settings, data or applications."

Writable clone support in the filesystem, coupled with a hierarchical
settings/data/apps layout, and you have what's described above.

Regards,
Andrey
Re: Data Deduplication with the help of an online filesystem check
On Tue, May 5, 2009 at 5:11 AM, Heinz-Josef Claes <hjcl...@web.de> wrote:
> Hi, during the last half year I thought a little bit about doing dedup
> for my backup program: not only with fixed blocks (which is
> implemented), but with moving blocks (with all offsets in a file: 1
> byte, 2 byte, ...). That means I have to do *lots* of comparisons
> (size of file - blocksize). Even though it's not the same, it must be
> very fast, and that's the same problem as the one discussed here.
>
> My solution (not yet implemented) is as follows (hopefully I remember
> well): I calculate a checksum of 24 bits (it could be another size).
> This means I can have 2^24 different checksums. Therefore, I hold a
> bit vector of 0.5 GB in memory (I hope I remember well, I'm just in a
> hotel and have no calculator): one bit for each possibility. This
> vector is initialized with zeros. For each calculated checksum of a
> block, I set the corresponding bit in the bit vector. It's very fast
> to check whether a block with a particular checksum exists in the
> filesystem (backup, for me) by checking the appropriate bit in the
> bit vector. If it doesn't exist, it's a new block. If it exists,
> there needs to be a separate "real" check of whether it's really the
> same block (which is slow, but that's happening 1% of the time).

Which means you have to refer to each block in some unique way from the
bit vector, making it a block pointer vector instead. That's only 64
times more expensive for a 64-bit offset...

Since the overwhelming majority of combinations will never appear in
practice, you are much better served with a self-sizing data structure
like a hash map, or even a binary tree, or a hash map with each bucket
being a binary tree, etc. You can use any sized hash and it won't
affect the number of nodes you have to store. You can trade off CPU or
RAM easily as required, just by selecting an appropriate data structure.

A bit vector, and especially a pointer vector, has extremely bad
any-case RAM requirements, because even if you're deduping a mere 10
blocks you're still allocating and initialising 2^24 offsets. The least
you could do is adaptively switch to a more efficient data structure if
you see the number of blocks is low enough.

-- 
Dmitri Nikulin
Centre for Synchrotron Science
Monash University
Victoria 3800, Australia
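Dmitri's point - size the structure to the data actually seen, not to the keyspace - falls out naturally from a hash map: memory grows with the number of distinct blocks inserted, not with the 2^24 possible checksums. A sketch using a Python dict as the self-sizing map (class and method names are invented for the example):

```python
import hashlib

class DedupIndex:
    """Self-sizing dedup index: digest -> location of first occurrence.

    Memory is proportional to the number of distinct blocks inserted,
    not to the size of the hash space - deduping 10 blocks stores ~10
    entries, regardless of the digest width chosen.
    """
    def __init__(self):
        self.index = {}

    def insert(self, block, location):
        """Return the location of an identical earlier block, or None."""
        d = hashlib.sha1(block).digest()
        prior = self.index.get(d)
        if prior is not None:
            return prior      # candidate: caller should verify bytewise
        self.index[d] = location
        return None

idx = DedupIndex()
print(idx.insert(b"A" * 4096, ("vm.img", 0)))       # None: new block
print(idx.insert(b"A" * 4096, ("vm.img", 4096)))    # ('vm.img', 0)
print(len(idx.index))                               # 1 entry, not 2^24
```

Widening the digest (sha1 here, arbitrarily) costs a few bytes per stored entry but nothing per possible value, which is exactly the trade-off the bit vector cannot make.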
Re: Data Deduplication with the help of an online filesystem check
Thomas Glanzmann wrote:
> Looking at this picture, when I'm going to implement the dedup code,
> do I also have to take care to spread the blocks over the different
> devices, or is there already infrastructure in place that automates
> that process?

If you somehow had blocks duplicated exactly across two devices, such
that the deduplicator discarded all the blocks from one disk and kept
all the blocks from the other disk, then there would be a problem. The
best way to solve this would be to always keep the block on the
least-full device and discard the rest, which shouldn't be too
difficult (especially since this would be happening in kernelspace);
but the cheapest solution is to randomize the choice, which would be
sufficient to prevent any further imbalance from developing.

-- m. tharp
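The two policies above are one line each; a sketch of choosing which duplicate copy to keep (the device names and free-space figures are invented for illustration):

```python
import random

# free space per device, in bytes (invented figures for illustration)
free_space = {"dev1": 120 * 2**30, "dev2": 40 * 2**30}

def keep_on_least_full(copies):
    """Best policy: keep the copy on the device with the most free
    space, so dedup actively rebalances usage."""
    return max(copies, key=lambda dev: free_space[dev])

def keep_random(copies):
    """Cheapest policy: a random choice doesn't rebalance, but it
    prevents dedup from introducing any systematic imbalance."""
    return random.choice(copies)

copies = ["dev1", "dev2"]           # devices holding identical blocks
print(keep_on_least_full(copies))   # dev1
assert keep_random(copies) in copies
```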
Re: Data Deduplication with the help of an online filesystem check
On Wed, 2009-04-29 at 14:03 +0200, Thomas Glanzmann wrote:
> Hello Chris,
>
>> You can start with the code documentation section on
>> http://btrfs.wiki.kernel.org
>
> I read through this, and at the moment one question comes to mind:
>
> http://btrfs.wiki.kernel.org/images-btrfs/7/72/Chunks-overview.png
>
> Looking at this picture, when I'm going to implement the dedup code,
> do I also have to take care to spread the blocks over the different
> devices, or is there already infrastructure in place that automates
> that process?

The layering inside of btrfs means that you don't need to worry about
chunks or multiple devices. But, in your ioctls you want to deal with
[file, offset, len], not directly with block numbers. COW means that
blocks can move around without you knowing, and some of the btrfs
internals will COW files in order to relocate storage.

So, what you want is a dedup file (or files) where your DB knows that a
given offset in the file has a given csum.

-chris
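Chris's "dedup file plus DB" suggestion maps naturally onto a small table keyed by checksum. A sketch using sqlite3; the schema and function names are invented for illustration, and the actual clone ioctl call is left out:

```python
import hashlib
import sqlite3

# Userland dedup DB: which (file, offset, len) extents carry which csum.
# Schema is an invented example, not anything btrfs defines.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE extents
              (csum BLOB, file TEXT, offset INTEGER, length INTEGER)""")

def record_extent(path, offset, data):
    csum = hashlib.sha256(data).digest()
    db.execute("INSERT INTO extents VALUES (?, ?, ?, ?)",
               (csum, path, offset, len(data)))
    return csum

def duplicates(csum):
    """All extents sharing a csum: candidates for the clone ioctl,
    after a bytewise re-check."""
    return db.execute("SELECT file, offset, length FROM extents"
                      " WHERE csum = ?", (csum,)).fetchall()

c = record_extent("a.img", 0, b"Z" * 4096)
record_extent("b.img", 8192, b"Z" * 4096)
print(duplicates(c))   # [('a.img', 0, 4096), ('b.img', 8192, 4096)]
```

Keying everything by [file, offset, len] rather than block numbers is the point: the DB stays valid even when COW relocates the underlying blocks.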
Re: Data Deduplication with the help of an online filesystem check
Hello Chris,

> But, in your ioctls you want to deal with [file, offset, len], not
> directly with block numbers. COW means that blocks can move around
> without you knowing, and some of the btrfs internals will COW files
> in order to relocate storage. So, what you want is a dedup file (or
> files) where your DB knows that a given offset in the file has a
> given csum.

How do I track whether a certain block has already been deduplicated?
That is, if I only have file, offset, len and not the block number, is
there a way from userland to tell whether two blocks already point to
the same block?

Thomas
Re: Data Deduplication with the help of an online filesystem check
On Wed, 2009-04-29 at 15:58 +0200, Thomas Glanzmann wrote:
> Hello Chris,
>
> How do I track whether a certain block has already been deduplicated?

Your database should know, and the ioctl could check to see if the
source and destination already point to the same thing before doing
anything expensive.

> So, if I only have file, offset, len and not the block number, is
> there a way from userland to tell whether two blocks already point to
> the same block?

You can use the fiemap ioctl.

-chris
Re: Data Deduplication with the help of an online filesystem check
Hello Chris, Your database should know, and the ioctl could check to see if the source and destination already point to the same thing before doing anything expensive. I see. So, if I only have file, offset, len and not the block number, is there a way from userland to tell if two blocks already point to the same block? You can use the fiemap ioctl. I see. One more question: in my dedup test VMs, I had blocks that were referenced very often; these are the top blocks. The last one alone saves 3 Gbyte of data. My question is now: how often can a block in btrfs be referenced? top 4 block counts for 8 Kbyte: 32055 59514 63606 415252 Thomas
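The 3 Gbyte figure follows directly from the top reference count, since every reference but one can share a single copy of the block:

```python
refs = 415_252              # references to the single hottest 8 Kbyte block
block = 8 * 1024            # 8 Kbyte block size
saved = (refs - 1) * block  # all but one copy is eliminated by dedup
# saved / 2**30 ≈ 3.17 GiB — consistent with the ~3 Gbyte quoted above
```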
Re: Data Deduplication with the help of an online filesystem check
Andrey Kuzmin wrote: On Tue, Apr 28, 2009 at 2:02 PM, Chris Mason chris.ma...@oracle.com wrote: On Tue, 2009-04-28 at 07:22 +0200, Thomas Glanzmann wrote: Hello Chris, There is a btrfs ioctl to clone individual files, and this could be used to implement an online dedup. But, since it is happening from userland, you can't lock out all of the other users of a given file. So, the dedup application would be responsible for making sure a given file was not being changed while the dedup scan was running. I see, does that mean that I can not do ,,dedup'' for files that are currently opened by a userland program? No, but it does mean the dedup done from userland is racey. Race disappears if (background) dedupe is run against snapshot(s). Regards, Andrey Only if the snapshots are made read-only; btrfs has writable snapshots. jim Picture this: process A: create some_file (some_file matches the contents of another file) dedup proc: check some_file, decide to start block dedup process A: modify some_file dedup proc: progress through block dedup So, this will happily replace blocks in some_file with the dedup blocks. But there's no way to atomically swap them. We could create new ioctls for this, basically a variant of the clone file ioctl that makes sure a given set of pages has a given sum (or strict memory contents) before doing the swap. But they don't exist yet. -chris
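The ioctl variant Chris describes — clone only if the destination still has the expected contents — amounts to a compare-and-swap on file data. A sketch of the intended semantics (purely illustrative; no such btrfs ioctl existed at the time, and a dict stands in for the filesystem):

```python
import hashlib


def clone_if_unchanged(store, path, offset, expected_csum, shared_block):
    """Compare-and-clone: replace a block only if its current contents still
    match what the dedup scanner saw.  In the kernel, the check and the clone
    would happen together under the page lock, closing the race above."""
    current = store[(path, offset)]
    if hashlib.sha256(current).hexdigest() != expected_csum:
        return False  # file changed since the scan — refuse, avoiding the race
    store[(path, offset)] = shared_block  # a real FS would share the extent
    return True
```

If process A modifies the file between the scan and the swap, the checksum no longer matches and the swap is refused rather than clobbering new data.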
Re: Data Deduplication with the help of an online filesystem check
Chris, what blocksizes can I choose with btrfs? Do you think that it is possible for an outsider like me to submit patches to btrfs which enable dedup in three fulltime days? Thomas
Re: Data Deduplication with the help of an online filesystem check
Thomas Glanzmann schrieb: 300 Gbyte of used storage of several productive VMs with the following operating systems running: \begin{itemize} \item Red Hat Linux 32 and 64 Bit (Release 3, 4 and 5) \item SuSE Linux 32 and 64 Bit (SLES 9 and 10) \item Windows 2003 Std. Edition 32 Bit \item Windows 2003 Enterprise Edition 64 Bit \end{itemize}

blocksize  deduplicated data
128k       29.9 G
64k        41.3 G
32k        59.2 G
16k        82 G
8k         112 G

Bottom line: with 8 K blocksize you can get more than 33% of deduped data running a productive set of VMs. Did you just compare checksums, or did you also compare the data bit after bit if the checksums matched? -- Tomasz Chmielewski
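The bottom-line claim checks out against the table: at 8k blocks, 112 G of 300 G dedupes away, which is over a third.

```python
used = 300.0  # Gbyte of VM data scanned
deduped = {"128k": 29.9, "64k": 41.3, "32k": 59.2, "16k": 82.0, "8k": 112.0}
ratios = {bs: gb / used for bs, gb in deduped.items()}
# ratios["8k"] ≈ 0.37, i.e. the "more than 33%" bottom line;
# smaller blocks find substantially more duplicate data
```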
Re: Data Deduplication with the help of an online filesystem check
Tomasz Chmielewski wrote: Thomas Glanzmann schrieb: 300 Gbyte of used storage of several productive VMs with the following operating systems running: \begin{itemize} \item Red Hat Linux 32 and 64 Bit (Release 3, 4 and 5) \item SuSE Linux 32 and 64 Bit (SLES 9 and 10) \item Windows 2003 Std. Edition 32 Bit \item Windows 2003 Enterprise Edition 64 Bit \end{itemize}

blocksize  deduplicated data
128k       29.9 G
64k        41.3 G
32k        59.2 G
16k        82 G
8k         112 G

Bottom line: with 8 K blocksize you can get more than 33% of deduped data running a productive set of VMs. Did you just compare checksums, or did you also compare the data bit after bit if the checksums matched? I wouldn't rely on crc32: it is not a strong hash. Such deduplication can lead to various problems, including security ones.
Re: Data Deduplication with the help of an online filesystem check
Hello, I wouldn't rely on crc32: it is not a strong hash. Such deduplication can lead to various problems, including security ones. sure thing. Did you think of replacing crc32 with sha1 or md5? Is this even possible (is there enough space reserved so that the change can be done without changing the filesystem layout) at the moment with btrfs? Thomas
Re: Data Deduplication with the help of an online filesystem check
Hello Chris, Is there a checksum for every block in btrfs? Yes, but they are only crc32c. I see, is it easily possible to exchange that with sha-1 or md5? Is it possible to retrieve these checksums from userland? Not today. The sage developers sent a patch to make an ioctl for this, but since it was hard coded to crc32c I haven't taken it yet. I see. Yes, btrfs uses extents but for the purposes of dedup, 4k blocksizes are fine. Does that mean that I can dedup 4k blocks even if you use extents? Virtual machines are the ideal dedup workload. But, you do get a big portion of the dedup benefits by just starting with a common image and cloning it instead of doing copies of each vm. True, the operating system can be almost completely deduped but as soon as you start patching you lose the benefit. Thomas
Re: Data Deduplication with the help of an online filesystem check
On Tue, 2009-04-28 at 19:34 +0200, Thomas Glanzmann wrote: Hello, I wouldn't rely on crc32: it is not a strong hash. Such deduplication can lead to various problems, including security ones. sure thing. Did you think of replacing crc32 with sha1 or md5? Is this even possible (is there enough space reserved so that the change can be done without changing the filesystem layout) at the moment with btrfs? It is possible, there's room in the metadata for about 4k of checksum for each 4k of data. The initial btrfs code used sha256, but the real limiting factor is the CPU time used. -chris
Re: Data Deduplication with the help of an online filesystem check
On Tue, 2009-04-28 at 19:37 +0200, Thomas Glanzmann wrote: Hello Chris, Is there a checksum for every block in btrfs? Yes, but they are only crc32c. I see, is it easily possible to exchange that with sha-1 or md5? Yes, but for the purposes of dedup, it's not exactly what you want. You want an index by checksum, and the current btrfs code indexes by logical byte number in the disk. So you need an extra index either way. It makes sense to keep the crc32c csums for fast verification of the data read from disk and only use the expensive csums for dedup. Is it possible to retrieve these checksums from userland? Not today. The sage developers sent a patch to make an ioctl for this, but since it was hard coded to crc32c I haven't taken it yet. I see. Yes, btrfs uses extents but for the purposes of dedup, 4k blocksizes are fine. Does that mean that I can dedup 4k blocks even if you use extents? Yes. Virtual machines are the ideal dedup workload. But, you do get a big portion of the dedup benefits by just starting with a common image and cloning it instead of doing copies of each vm. True, the operating system can be almost completely deduped but as soon as you start patching you lose the benefit. -chris
Re: Data Deduplication with the help of an online filesystem check
Hello, It is possible, there's room in the metadata for about 4k of checksum for each 4k of data. The initial btrfs code used sha256, but the real limiting factor is the CPU time used. I see. There are very efficient md5 implementations out there, especially if the code is written in assembler. Thomas
Re: Data Deduplication with the help of an online filesystem check
Am Dienstag, 28. April 2009 19:38:24 schrieb Chris Mason: On Tue, 2009-04-28 at 19:34 +0200, Thomas Glanzmann wrote: Hello, I wouldn't rely on crc32: it is not a strong hash. Such deduplication can lead to various problems, including security ones. sure thing. Did you think of replacing crc32 with sha1 or md5? Is this even possible (is there enough space reserved so that the change can be done without changing the filesystem layout) at the moment with btrfs? It is possible, there's room in the metadata for about 4k of checksum for each 4k of data. The initial btrfs code used sha256, but the real limiting factor is the CPU time used. -chris It's not only cpu time, it's also memory. You need 32 bytes for each 4k block. It needs to be in RAM for performance reasons. hjc
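Heinz's memory point scales like this: 32 bytes of sha256 per 4k block adds up fast.

```python
fs_bytes = 1 << 40          # a 1 TiB filesystem
block = 4096
csum = 32                   # bytes per sha256 checksum
blocks = fs_bytes // block  # 2**28 ≈ 268 million blocks
ram = blocks * csum         # 8 GiB of checksums to keep resident in RAM
```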
Re: Data Deduplication with the help of an online filesystem check
Thomas Glanzmann wrote: no, I just used the md5 checksum. And even if I have a hash collision, which is highly unlikely, it still gives a good ballpark figure. I'd start with a crc32 and/or MD5 to find candidate blocks, then do a bytewise comparison before actually merging them. Even the risk of an accidental collision is too high, and considering there are plenty of birthday-style MD5 attacks it would not be extraordinarily difficult to construct a block that collides with e.g. a system library. Keep in mind that although digests do a fairly good job of making unique identifiers for larger chunks of data, they can only hold so many unique combinations. Considering you're comparing blocks of a few kibibytes in size, it's best to just do a foolproof comparison. There's nothing wrong with using a checksum/digest as a screening mechanism though. -- m. tharp
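The screen-then-verify approach suggested above fits in a few lines. The digest only prunes the search; equality is always decided by the raw bytes, so a collision can never cause a bad merge (an illustrative sketch; the choice of md5 as the screen follows the discussion above):

```python
import hashlib


def can_merge(block_a, block_b):
    """Screen candidate blocks with a digest, then confirm byte-for-byte
    before merging — the 'foolproof comparison' described above."""
    if hashlib.md5(block_a).digest() != hashlib.md5(block_b).digest():
        return False           # cheap rejection for the common case
    return block_a == block_b  # final word belongs to the raw bytes
```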
Re: Data Deduplication with the help of an online filesystem check
Hello Chris, Right now the blocksize can only be the same as the page size. For this external dedup program you have in mind, you could use any multiple of the page size. perfect. Exactly what I need. Three days is probably not quite enough ;) I'd honestly prefer the dedup happen entirely in the kernel in a setup similar to the compression code. I see. I think that it wouldn't scale, because then all the checksums need to be stored in memory or at least in an efficient b*tree. For a 1 Tbyte filesystem with 4 kbyte blocks that would mean more than 5 G (!) of memory (assuming a 16 byte checksum and a 4 byte block identifier, and that leaves out the b*tree overhead for fast searching). But, that would use _lots_ of CPU, so an offline dedup is probably a good feature even if we have transparent dedup. I think that is the right way to go. Wire up a userland database that stores checksums and points to (file, offset) tuples. Exactly. And if there is a way to retrieve the already calculated checksums from kernel land, then it would be possible to implement a ,,systemcall'' that gives the kernel a hint of a possible duplicated block (like providing a list of lists of blocks to the kernel that might be duplicated because they have the same checksum). Then the kernel code could dedup the block after byte-by-byte comparing it. Make the ioctl to replace a given file extent if and only if the file contents match a given checksum over a range of bytes. The ioctl should be able to optionally do a byte compare of the src and destination pages to make 100% sure the data is really the same. Exactly. Make another ioctl to report on which parts of a file have changed since a given transaction. This will greatly reduce the time spent scanning for new blocks. That would be perfect. Even better would be a systemcall that reports all the blocks that have been touched since a specific transaction. Like a bitmap that sets a ,,1'' for every block that has been touched.
It isn't painfully hard, but you're looking at about 3 weeks total time. I see, so no quick hack to get it going. Thomas
Re: Data Deduplication with the help of an online filesystem check
Hello Heinz, It's not only cpu time, it's also memory. You need 32 bytes for each 4k block. It needs to be in RAM for performance reasons. exactly, and that is not going to scale. Thomas
Re: Data Deduplication with the help of an online filesystem check
Hello, * Thomas Glanzmann tho...@glanzmann.de [090428 22:10]: exactly. And if there is a way to retrieve the already calculated checksums from kernel land, then it would be possible to implement a ,,systemcall'' that gives the kernel a hint of a possible duplicated block (like providing a list of lists of blocks to the kernel that might be duplicated because they have the same checksum). Then the kernel code could dedup the block after byte-by-byte comparing it. how hard would it be to implement a systemcall that gets a list of blocks and checks if it is possible to dedup them (especially for the implemented raid code, etc.)? Probably it would be smarter to report one set of blocks at a time instead of a list of sets of blocks, because otherwise the systemcall would take forever to finish. That is not really an issue in itself, but I assume that the dedup systemcall needs to get a global lock that most probably blocks any i/o operations? Thomas
Re: Data Deduplication with the help of an online filesystem check
Am Dienstag, 28. April 2009 22:16:19 schrieb Thomas Glanzmann: Hello Heinz, It's not only cpu time, it's also memory. You need 32 bytes for each 4k block. It needs to be in RAM for performance reasons. exactly, and that is not going to scale. Thomas Hi Thomas, I wrote a backup tool which uses dedup, so I know a little bit about the problem and the performance impact if the checksums are not in memory (optionally in that tool). http://savannah.gnu.org/projects/storebackup Dedup really helps a lot - I think more than I could imagine before I was engaged in this kind of backup. You would not believe how many identical files are in a filesystem, to give a simple example. EMC has very big boxes for this with lots of RAM in them. I think the first problem which has to be solved is the memory problem. Perhaps something asynchronous to find identical blocks and storing the checksums on disk? Heinz
Re: Data Deduplication with the help of an online filesystem check
On Tue, 2009-04-28 at 22:52 +0200, Thomas Glanzmann wrote: Hello Heinz, I wrote a backup tool which uses dedup, so I know a little bit about the problem and the performance impact if the checksums are not in memory (optionally in that tool). http://savannah.gnu.org/projects/storebackup Dedup really helps a lot - I think more than I could imagine before I was engaged in this kind of backup. You would not believe how many identical files are in a filesystem, to give a simple example. I saw it with my own eyes (see my previous e-mail). EMC has very big boxes for this with lots of RAM in them. I think the first problem which has to be solved is the memory problem. Perhaps something asynchronous to find identical blocks and storing the checksums on disk? I think we already have a very nice solution in order to solve that issue: - Implement a system call that reports all checksums and unique block identifiers for all stored blocks. This would require storing the larger checksums in the filesystem. It is much better done in the dedup program. - Implement another system call that reports all checksums and unique identifiers for all stored blocks since the last report. This can be easily implemented: This is racey because there's no way to prevent new changes. Use a block bitmap: for every block on the filesystem use one bit. If the block is modified set the bit to one; when the bitmap is retrieved simply zero it out. Assuming a 4 kbyte block size that would mean for a 1 Tbyte filesystem: 1 Tbyte / 4096 / 8 = 32 Mbyte of memory (this should of course be saved to disk from time to time and be restored on startup). Sorry, a 1TB drive is teeny, I don't think a bitmap is practical across the whole FS. Btrfs has metadata that can quickly and easily tell you which files and which blocks in which files have changed since a given transaction id. This is how you want to find new things.
But, the ioctl to actually do the dedup needs to be able to verify a given block has the contents you expect it to. The only place you can lock down the pages in the file and prevent new changes is inside the kernel. -chris
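Both sides of the bitmap sizing argument above can be checked with the same back-of-envelope formula — one dirty bit per 4k block is small for 1 TB but grows linearly with the filesystem:

```python
def bitmap_bytes(fs_bytes, block=4096):
    """One dirty bit per block, as in the bitmap proposal above."""
    return fs_bytes // block // 8

# 1 TiB  -> 32 MiB of bitmap (Thomas's number)
# 100 TiB -> 3200 MiB, ~3 GiB — why Chris calls a 1 TB drive "teeny"
```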
Re: Data Deduplication with the help of an online filesystem check
Hello, - Implement a system call that reports all checksums and unique block identifiers for all stored blocks. This would require storing the larger checksums in the filesystem. It is much better done in the dedup program. I think I misunderstood something here. I thought the checksums per block would already be stored somewhere in btrfs? - Implement another system call that reports all checksums and unique identifiers for all stored blocks since the last report. This can be easily implemented: This is racey because there's no way to prevent new changes. I got the point. Use a block bitmap: for every block on the filesystem use one bit. If the block is modified set the bit to one; when the bitmap is retrieved simply zero it out. Assuming a 4 kbyte block size that would mean for a 1 Tbyte filesystem: 1 Tbyte / 4096 / 8 = 32 Mbyte of memory (this should of course be saved to disk from time to time and be restored on startup). Sorry, a 1TB drive is teeny, I don't think a bitmap is practical across the whole FS. Btrfs has metadata that can quickly and easily tell you which files and which blocks in which files have changed since a given transaction id. This is how you want to find new things. You're right, the bitmap would not scale. So what is missing is a systemcall to report the changes to userland? (Is this feature used to generate off-site backups as well?) But, the ioctl to actually do the dedup needs to be able to verify a given block has the contents you expect it to. The only place you can lock down the pages in the file and prevent new changes is inside the kernel. I totally agree with that. How much time would it consume to implement such a systemcall? Thomas
Re: Data Deduplication with the help of an online filesystem check
On Wed, Apr 29, 2009 at 3:43 AM, Chris Mason chris.ma...@oracle.com wrote: So you need an extra index either way. It makes sense to keep the crc32c csums for fast verification of the data read from disk and only use the expensive csums for dedup. What about self-healing? With only a CRC32 to distinguish a good block from a bad one, statistically you're likely to get an incorrectly healed block once in only every few billion blocks. And that may not be your machine, but it'll be somebody's, since the probability is way too high for it not to happen to somebody. Even just a 64 bit checksum would drop the probability plenty, but I'd really only start with 128 bits. NetApp does 64 bits for 4k of data, ZFS does 256 bits per block, and this traces back to the root like a highly dynamic Merkle tree. In the CRC case the only safe redundancy is one that has 3+ copies of the block, to compare the raw data itself, at which point you may as well have just been using healing RAID1 without checksums. -- Dmitri Nikulin Centre for Synchrotron Science Monash University Victoria 3800, Australia
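Dmitri's "every few billion blocks" estimate follows from the checksum width: assuming corruptions map uniformly onto checksum values, a random bad block passes an n-bit check with probability about 2^-n (a back-of-envelope model, not a claim about any specific CRC polynomial):

```python
def undetected_odds(bits):
    """Approximate chance a random corruption passes an n-bit checksum,
    assuming checksum values are uniform over corruptions."""
    return 2.0 ** -bits

# undetected_odds(32) ≈ 2.3e-10 — roughly one bad heal per ~4 billion
# corrupt blocks; 64 and 128 bits shrink this to practical impossibility
```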
Re: Data Deduplication with the help of an online filesystem check
On Tue, 2009-04-28 at 23:12 +0200, Thomas Glanzmann wrote: Hello, - Implement a system call that reports all checksums and unique block identifiers for all stored blocks. This would require storing the larger checksums in the filesystem. It is much better done in the dedup program. I think I misunderstood something here. I thought the checksums per block would already be stored somewhere in btrfs? They are, but only the crc32c are stored today. - Implement another system call that reports all checksums and unique identifiers for all stored blocks since the last report. This can be easily implemented: This is racey because there's no way to prevent new changes. I got the point. Use a block bitmap: for every block on the filesystem use one bit. If the block is modified set the bit to one; when the bitmap is retrieved simply zero it out. Assuming a 4 kbyte block size that would mean for a 1 Tbyte filesystem: 1 Tbyte / 4096 / 8 = 32 Mbyte of memory (this should of course be saved to disk from time to time and be restored on startup). Sorry, a 1TB drive is teeny, I don't think a bitmap is practical across the whole FS. Btrfs has metadata that can quickly and easily tell you which files and which blocks in which files have changed since a given transaction id. This is how you want to find new things. You're right, the bitmap would not scale. So what is missing is a systemcall to report the changes to userland? (Is this feature used to generate off-site backups as well?) Yes, that's the idea. An ioctl to walk the tree and report on changes, but this doesn't have to be done with version 1 of the dedup code, you can just scan the file based on mtime/ctime. But, the ioctl to actually do the dedup needs to be able to verify a given block has the contents you expect it to. The only place you can lock down the pages in the file and prevent new changes is inside the kernel. I totally agree with that. How much time would it consume to implement such a systemcall?
It is probably a 3 week to one month effort. -chris
Re: Data Deduplication with the help of an online filesystem check
On Wed, 2009-04-29 at 00:14 +0200, Thomas Glanzmann wrote: Hello Chris, They are, but only the crc32c are stored today. maybe crc32c is good enough to identify duplicated blocks, I mean we only need a hint, the dedup ioctl does the double checking. I will write a perl script tomorrow, compare the results to the one that uses md5, and report back. It's a start at least. Yes, that's the idea. An ioctl to walk the tree and report on changes, but this doesn't have to be done with version 1 of the dedup code, you can just scan the file based on mtime/ctime. Good point. But, the ioctl to actually do the dedup needs to be able to verify a given block has the contents you expect it to. The only place you can lock down the pages in the file and prevent new changes is inside the kernel. I totally agree with that. How much time would it consume to implement such a systemcall? It is probably a 3 week to one month effort. I'm taking the challenge. Is there a document that I can read that introduces me to the structures used in btrfs, or can someone walk me through on the phone to get a quick start? Great to hear. It's an ambitious project, but I'll definitely help explain things. You can start with the code documentation section on http://btrfs.wiki.kernel.org I'll write up my ideas on how userspace controlled dedup should work. I also would like to retrieve the checksums and identify the potential blocks, and after that work is done (even in a very preliminary state) in a way that someone can work with it, I would like to move on to the dedup ioctl. Sounds fair, I'll forward the original patch. -chris
Re: Data Deduplication with the help of an online filesystem check
On Tue, Apr 28, 2009 at 04:58:15PM -0400, Chris Mason wrote: Assuming a 4 kbyte block size that would mean for a 1 Tbyte filesystem: 1 Tbyte / 4096 / 8 = 32 Mbyte of memory (this should of course be saved to disk from time to time and be restored on startup). Sorry, a 1TB drive is teeny, I don't think a bitmap is practical across the whole FS. 32 Mbyte of memory is teeny too. You can pick up a server with 32GB for about $5k these days. Using 10% of that gives you room for 100TB of storage, which is getting to be a reasonable size. More than that and you would be thinking of bigger machines anyway to have enough memory for a decent cache. (32GB seems to be the nice price point, because 4GB dimms are cheap as chips, and 8GB are way expensive. We're buying 32GB machines at the moment) Bron ( ok, so I don't work with giant enterprise systems. I can see that there are cases where this might suck, but memory seems to be growing as fast as disk in the low end )
Re: Data Deduplication with the help of an online filesystem check
On Mon, 2009-04-27 at 05:33 +0200, Thomas Glanzmann wrote: Hello, I would like to know if it would be possible to implement the following feature in btrfs: Have an online filesystem check which accounts for possible duplicated data blocks (maybe with the help of the already implemented checksums: are these checksums for the whole file or block based?) and deduplicates these blocks? There is a btrfs ioctl to clone individual files, and this could be used to implement an online dedup. But, since it is happening from userland, you can't lock out all of the other users of a given file. So, the dedup application would be responsible for making sure a given file was not being changed while the dedup scan was running. -chris
Re: Data-deduplication?
On Thu, Oct 16, 2008 at 03:30:49PM -0400, Chris Mason wrote: On Thu, 2008-10-16 at 15:25 -0400, Valerie Aurora Henson wrote: Both deduplication and compression have an interesting side effect in which a write to a previously allocated block can return ENOSPC. This is even more exciting when you factor in mmap. Any thoughts on how to handle this? Unfortunately we'll have a number of places where ENOSPC will jump in where people don't expect it, and this includes any COW overwrite of an existing extent. The old extent isn't freed until snapshot deletion time, which won't happen until after the current transaction commits. Another example is fallocate. The extent will have a little flag that says I'm a preallocated extent, which is how we'll know we're allowed to overwrite it directly instead of doing COW. But, to write to the fallocated extent, we'll have to clear the flag. So, we'll have to cow the block that holds the file extent pointer, which means we can enospc. I'm sure you know this, but for the peanut gallery: You can avoid some of these sorts of purely copy-on-write ENOSPC cases. Any operation where the space used afterwards is less than or equal to the space used before - like in your fallocate case - can avoid ENOSPC as long as you reserve a certain amount of space on the fs and break down the changes into small enough groups. Most file systems don't let you fill up beyond 90-95% anyway because performance goes to hell. You also need to do this so you can delete when your file system is full. In general, it'd be nice to say that if your app can't handle surprise ENOSPC, then if you run without snapshots, compression, or data dedup, we guarantee you'll only get ENOSPC in the normal cases. What do you think? -VAL
Re: Data-deduplication?
On Thu, Oct 16, 2008 at 03:25:01PM -0400, Valerie Aurora Henson wrote: Both deduplication and compression have an interesting side effect in which a write to a previously allocated block can return ENOSPC. This is even more exciting when you factor in mmap. Any thoughts on how to handle this? Note that this can already happen in today's filesystems. Writing into some preallocated space can always cause splits of the allocation or bmap btrees, as the previous big preallocated extent is now split into one allocated and at least one (or two, if writing into the middle) preallocated extents.
Re: Data-deduplication?
On Wed, Oct 15, 2008 at 03:39:16PM +0200, Avi Kivity wrote:
> Andi Kleen wrote:
> > Ray Van Dolson [EMAIL PROTECTED] writes:
> > > I recall there being a thread here a number of months back regarding data-deduplication support for btrfs. Did anyone end up picking that up and giving it a go? Block-level data dedup would be *awesome* in a Linux filesystem. It does wonders for storing virtual machines w/ NetApp and WAFL, and even ZFS doesn't have this feature yet (although I've read discussions on them looking to add it).
> > There are some patches to do this in QEMU's cow format for KVM. That's user level only.
> And thus, doesn't work for sharing between different images, especially at runtime.

It would work if the images are all based once on a reference image, won't it? I would imagine that's the common situation for installing lots of VMs.

> I'd really, really [any number of reallies], really like to see btrfs deduplication.

Sure, it would be useful for a couple of things.

-Andi
--
[EMAIL PROTECTED]
Re: Data-deduplication?
On Wed, Oct 15, 2008 at 3:15 PM, Andi Kleen [EMAIL PROTECTED] wrote:
> On Wed, Oct 15, 2008 at 03:39:16PM +0200, Avi Kivity wrote:
> > Andi Kleen wrote:
> > > Ray Van Dolson [EMAIL PROTECTED] writes:
> > > > I recall there being a thread here a number of months back regarding data-deduplication support for btrfs. Did anyone end up picking that up and giving it a go? Block-level data dedup would be *awesome* in a Linux filesystem. It does wonders for storing virtual machines w/ NetApp and WAFL, and even ZFS doesn't have this feature yet (although I've read discussions on them looking to add it).
> > > There are some patches to do this in QEMU's cow format for KVM. That's user level only.
> > And thus, doesn't work for sharing between different images, especially at runtime.
> It would work if the images are all based once on a reference image, won't it? I would imagine that's the common situation for installing lots of VMs.

Like using bcp (btrfs-specific cp) for creating new images from a base one? Will that suffice? With modifications after that being COW, that could be a simple, hackish form of deduplication.

> > I'd really, really [any number of reallies], really like to see btrfs deduplication.
> Sure, it would be useful for a couple of things.

--
Miguel Sousa Filipe
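Beyond the COW-from-a-base-image hack, the general block-level dedup idea discussed in this thread can be illustrated with a naive content-hash index (this is a toy sketch, not the proposed btrfs design; the class and method names are invented): identical blocks are stored once and shared via a reference count.

```python
# Naive illustration of block-level dedup: index blocks by content
# hash; a duplicate block bumps a refcount instead of being stored
# a second time.

import hashlib

class DedupStore:
    def __init__(self):
        self.blocks = {}    # digest -> stored data
        self.refs = {}      # digest -> reference count

    def write_block(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        if digest in self.blocks:
            self.refs[digest] += 1        # duplicate: just add a ref
        else:
            self.blocks[digest] = data    # first copy: store it
            self.refs[digest] = 1
        return digest

store = DedupStore()
base = b"\x00" * 4096                 # e.g. identical pages across VM images
h1 = store.write_block(base)
h2 = store.write_block(base)
print(h1 == h2, len(store.blocks), store.refs[h1])   # True 1 2
```

A real implementation would also have to handle hash collisions (by verifying block contents before sharing) and unsharing on write, which is exactly where the per-extent reference counting and back references discussed earlier in the thread come in.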
Re: Data-deduplication?
Andi Kleen wrote:
> > > There are some patches to do this in QEMU's cow format for KVM. That's user level only.
> > And thus, doesn't work for sharing between different images, especially at runtime.
> It would work if the images are all based once on a reference image, won't it?

Yes and no: it allows one qemu to access another guest's data (read-only), and it's almost impossible to do at runtime.

--
error compiling committee.c: too many arguments to function
Re: Data-deduplication?
On Sat, 2008-10-11 at 19:06 -0700, Ray Van Dolson wrote:
> I recall there being a thread here a number of months back regarding data-deduplication support for btrfs. Did anyone end up picking that up and giving it a go? Block-level data dedup would be *awesome* in a Linux filesystem. It does wonders for storing virtual machines w/ NetApp and WAFL, and even ZFS doesn't have this feature yet (although I've read discussions on them looking to add it).

So far nobody has grabbed this one, but I've had more requests (no shocker there; the kvm people are interested in it too). It probably won't make 1.0, but the disk format will be able to support it.

-chris
Re: Data-deduplication?
Ray Van Dolson [EMAIL PROTECTED] writes:
> I recall there being a thread here a number of months back regarding data-deduplication support for btrfs. Did anyone end up picking that up and giving it a go? Block-level data dedup would be *awesome* in a Linux filesystem. It does wonders for storing virtual machines w/ NetApp and WAFL, and even ZFS doesn't have this feature yet (although I've read discussions on them looking to add it).

There are some patches to do this in QEMU's cow format for KVM. That's user level only.

-Andi
--
[EMAIL PROTECTED]