Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread jim owens
Andrey Kuzmin wrote: On Tue, Apr 28, 2009 at 2:02 PM, Chris Mason chris.ma...@oracle.com wrote: On Tue, 2009-04-28 at 07:22 +0200, Thomas Glanzmann wrote: Hello Chris, There is a btrfs ioctl to clone individual files, and this could be used to implement an online dedup. But, since it is
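
The clone ioctl Chris refers to is callable from userland. A minimal sketch of whole-file cloning in Python, assuming the standard fcntl module and the BTRFS_IOC_CLONE ioctl number (0x40049409, the value later reused generically as FICLONE); treat the constant as an assumption and check your kernel headers:

    import fcntl, os

    # BTRFS_IOC_CLONE == _IOW(0x94, 9, int); assumed value, verify locally.
    BTRFS_IOC_CLONE = 0x40049409

    def clone_file(src_path, dst_path):
        """Make dst share src's extents; no data blocks are copied."""
        src = os.open(src_path, os.O_RDONLY)
        dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            # Fails unless both files live on the same btrfs filesystem.
            fcntl.ioctl(dst, BTRFS_IOC_CLONE, src)
        finally:
            os.close(src)
            os.close(dst)

An online dedup tool could use this to collapse identical files into shared extents; block-level sharing needs the range variant of the ioctl.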

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Chris, what blocksizes can I choose with btrfs? Do you think that it is possible for an outsider like me to submit patches to btrfs which enable dedup in three full-time days? Thomas

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Tomasz Chmielewski
Thomas Glanzmann wrote: 300 Gbyte of used storage of several productive VMs with the following operating systems running: \begin{itemize} \item Red Hat Linux 32 and 64 Bit (Release 3, 4 and 5) \item SuSE Linux 32 and 64 Bit (SLES 9 and 10) \item Windows 2003 Std.

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Edward Shishkin
Tomasz Chmielewski wrote: Thomas Glanzmann wrote: 300 Gbyte of used storage of several productive VMs with the following operating systems running: \begin{itemize} \item Red Hat Linux 32 and 64 Bit (Release 3, 4 and 5) \item SuSE Linux 32 and 64 Bit (SLES 9 and 10)

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello, I wouldn't rely on crc32: it is not a strong hash. Such deduplication can lead to various problems, including security ones. Sure thing; did you think of replacing crc32 with sha1 or md5? Is this even possible (is there enough space reserved so that the change can be done without

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello Chris, Is there a checksum for every block in btrfs? Yes, but they are only crc32c. I see, is it easily possible to exchange that with sha-1 or md5? Is it possible to retrieve these checksums from userland? Not today. The Sage developers sent a patch to make an ioctl for this,
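
Until such an ioctl exists, a userland dedup scanner has to recompute checksums itself by reading the data back. A minimal sketch of that fallback, assuming 4 KiB blocks and SHA-1 from Python's standard hashlib (block size and hash choice are illustrative):

    import hashlib

    BLOCK = 4096  # the btrfs block size discussed in this thread

    def block_digests(path):
        """Yield (offset, sha1 hex digest) for each 4 KiB block of a file."""
        with open(path, 'rb') as f:
            offset = 0
            while True:
                data = f.read(BLOCK)
                if not data:
                    break
                yield offset, hashlib.sha1(data).hexdigest()
                offset += len(data)

This costs a full read of the filesystem, which is exactly why exporting the already-computed csums would be attractive.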

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Chris Mason
On Tue, 2009-04-28 at 19:34 +0200, Thomas Glanzmann wrote: Hello, I wouldn't rely on crc32: it is not a strong hash. Such deduplication can lead to various problems, including security ones. Sure thing; did you think of replacing crc32 with sha1 or md5? Is this even possible (is there

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Chris Mason
On Tue, 2009-04-28 at 19:37 +0200, Thomas Glanzmann wrote: Hello Chris, Is there a checksum for every block in btrfs? Yes, but they are only crc32c. I see, is it easily possible to exchange that with sha-1 or md5? Yes, but for the purposes of dedup, it's not exactly what you want.

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello, It is possible, there's room in the metadata for about 4k of checksum for each 4k of data. The initial btrfs code used sha256, but the real limiting factor is the CPU time used. I see. There are very efficient md5 algorithms out there, for example, especially if the code is
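
The relative CPU cost is easy to measure from userland. A rough micro-benchmark sketch comparing crc32, md5 and sha256 over 4 KiB blocks, using only Python's standard zlib and hashlib (absolute timings will vary by machine):

    import hashlib, os, time, zlib

    block = os.urandom(4096)
    N = 100_000  # about 400 MB of input

    for name, fn in [('crc32', lambda b: zlib.crc32(b)),
                     ('md5', lambda b: hashlib.md5(b).digest()),
                     ('sha256', lambda b: hashlib.sha256(b).digest())]:
        start = time.perf_counter()
        for _ in range(N):
            fn(block)
        print(f'{name}: {time.perf_counter() - start:.2f}s for {N} blocks')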

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Heinz-Josef Claes
On Tuesday, 28 April 2009 19:38:24, Chris Mason wrote: On Tue, 2009-04-28 at 19:34 +0200, Thomas Glanzmann wrote: Hello, I wouldn't rely on crc32: it is not a strong hash. Such deduplication can lead to various problems, including security ones. Sure thing; did you think of

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Michael Tharp
Thomas Glanzmann wrote: no, I just used the md5 checksum. And even if I have a hash collision, which is highly unlikely, it still gives a good ballpark figure. I'd start with a crc32 and/or MD5 to find candidate blocks, then do a bytewise comparison before actually merging them. Even the risk of
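
Michael's two-stage scheme maps directly onto code. A sketch, assuming a single file scanned at a fixed 4 KiB block size: the weak crc32 only nominates candidates, and blocks are grouped by their actual bytes before anything would be merged, so a crc collision can never cause a bogus merge:

    import zlib
    from collections import defaultdict

    BLOCK = 4096

    def find_duplicate_blocks(path):
        """Return groups of offsets whose blocks are byte-for-byte identical."""
        candidates = defaultdict(list)  # crc32 -> [offset, ...]
        with open(path, 'rb') as f:
            offset = 0
            while (data := f.read(BLOCK)):
                candidates[zlib.crc32(data)].append(offset)
                offset += len(data)
        confirmed = []
        with open(path, 'rb') as f:
            for offsets in candidates.values():
                if len(offsets) < 2:
                    continue
                # Weak hash matched: re-read and group by the real bytes.
                groups = defaultdict(list)
                for off in offsets:
                    f.seek(off)
                    groups[f.read(BLOCK)].append(off)
                confirmed += [g for g in groups.values() if len(g) > 1]
        return confirmed

Only the confirmed groups would then be handed to the clone/dedup ioctl.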

kernel bug in file-item.c

2009-04-28 Thread Marc R. O'Connor
I have had two 'kernel bug' issues today, both referencing file-item.c. The first oops happened when I was cp'ing from an external HD (ext3) to an ext3 partition. The second happened during boot up. I have attached them both. I'm using btrfs that was merged into my kernel yesterday. -- Marc

Re: kernel bug in file-item.c

2009-04-28 Thread Chris Mason
On Tue, 2009-04-28 at 13:39 -0400, Marc R. O'Connor wrote: I have had two 'kernel bug' issues today, both referencing file-item.c. The first oops happened when I was cp'ing from an external HD (ext3) to an ext3 partition. The second happened during boot up. I have attached them both. I'm

Re: kernel bug in file-item.c

2009-04-28 Thread Marc R. O'Connor
Chris Mason wrote: On Tue, 2009-04-28 at 13:39 -0400, Marc R. O'Connor wrote: I have had two 'kernel bug' issues today, both referencing file-item.c. The first oops happened when I was cp'ing from an external HD (ext3) to an ext3 partition. The second happened during boot up. I have attached

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello Chris, Right now the blocksize can only be the same as the page size. For this external dedup program you have in mind, you could use any multiple of the page size. perfect. Exactly what I need. Three days is probably not quite enough ;) I'd honestly prefer the dedup happen

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello Heinz, It's not only CPU time, it's also memory. You need 32 bytes for each 4k block. It needs to be in RAM for performance reasons. exactly, and that is not going to scale. Thomas
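
Put in numbers, Heinz's point is stark. A back-of-the-envelope sketch for a 1 TB filesystem, assuming 32-byte (sha256-sized) checksums over 4 KiB blocks:

    blocks = (1 << 40) // 4096   # 268,435,456 blocks in 1 TiB
    table = blocks * 32          # checksum table size in bytes
    print(table / (1 << 30))     # -> 8.0 GiB of RAM just for checksums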

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello, * Thomas Glanzmann tho...@glanzmann.de [090428 22:10]: exactly. And if there is a way to retrieve the already calculated checksums from kernel land, then it would be possible to implement a "system call" that gives the kernel a hint of a possible duplicated block (like providing a
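
The hint interface Thomas describes is essentially what much later reached mainline Linux as the dedupe-range ioctl (FIDEDUPERANGE): userland nominates a candidate pair, and the kernel compares the bytes itself before sharing the extents. A hedged sketch of driving it from Python; the ioctl number and struct layout follow my reading of linux/fs.h and should be treated as assumptions:

    import fcntl, struct

    # FIDEDUPERANGE == _IOWR(0x94, 54, struct file_dedupe_range); assumed value.
    FIDEDUPERANGE = 0xC0189436

    def dedupe_hint(src_fd, src_off, length, dst_fd, dst_off):
        """Suggest that two ranges are identical; the kernel verifies itself."""
        # struct file_dedupe_range: u64 src_offset, u64 src_length,
        #   u16 dest_count, u16 reserved1, u32 reserved2, then info records.
        req = struct.pack('=QQHHI', src_off, length, 1, 0, 0)
        # struct file_dedupe_range_info: s64 dest_fd, u64 dest_offset,
        #   u64 bytes_deduped, s32 status, u32 reserved.
        req += struct.pack('=qQQiI', dst_fd, dst_off, 0, 0, 0)
        res = fcntl.ioctl(src_fd, FIDEDUPERANGE, req)
        bytes_deduped, status = struct.unpack_from('=Qi', res, 40)
        return status, bytes_deduped  # status 0: merged; 1: bytes differed

A false hint from a weak checksum merely costs a wasted comparison; it can never corrupt data.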

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Heinz-Josef Claes
On Tuesday, 28 April 2009 22:16:19, Thomas Glanzmann wrote: Hello Heinz, It's not only CPU time, it's also memory. You need 32 bytes for each 4k block. It needs to be in RAM for performance reasons. exactly, and that is not going to scale. Thomas Hi Thomas, I wrote a backup

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Chris Mason
On Tue, 2009-04-28 at 22:52 +0200, Thomas Glanzmann wrote: Hello Heinz, I wrote a backup tool which uses dedup, so I know a little bit about the problem and the performance impact if the checksums are not in memory (optionally in that tool). http://savannah.gnu.org/projects/storebackup

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello, - Implement a system call that reports all checksums and unique block identifiers for all stored blocks. This would require storing the larger checksums in the filesystem. It is much better done in the dedup program. I think I misunderstood something here. I

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Dmitri Nikulin
On Wed, Apr 29, 2009 at 3:43 AM, Chris Mason chris.ma...@oracle.com wrote: So you need an extra index either way.  It makes sense to keep the crc32c csums for fast verification of the data read from disk and only use the expensive csums for dedup. What about self-healing? With only a CRC32 to

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Chris Mason
On Tue, 2009-04-28 at 23:12 +0200, Thomas Glanzmann wrote: Hello, - Implement a system call that reports all checksums and unique block identifiers for all stored blocks. This would require storing the larger checksums in the filesystem. It is much better done in

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Chris Mason
On Wed, 2009-04-29 at 00:14 +0200, Thomas Glanzmann wrote: Hello Chris, They are, but only the crc32c are stored today. maybe crc32c is good enough to identify duplicated blocks, I mean we only need a hint, the dedup ioctl does the double checking. I will write a perl script tomorrow and

Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Bron Gondwana
On Tue, Apr 28, 2009 at 04:58:15PM -0400, Chris Mason wrote: Assuming a 4 kbyte block size that would mean for a 1 Tbyte filesystem: 1Tbyte / 4096 / 8 = 32 Mbyte of memory (this should of course be saved to disk from time to time and be restored on
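
Bron's arithmetic holds if the in-memory table keeps just one bit per block (a "seen" bitmap) rather than a full checksum; a quick check:

    blocks = (1 << 40) // 4096    # 268,435,456 blocks in 1 TiB
    bitmap = blocks // 8          # one bit per block
    print(bitmap / (1 << 20))     # -> 32.0 MiB, vs ~8 GiB for 32-byte csums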