Re: Offline Deduplication for Btrfs

2011-01-15 Thread Arjen Nienhuis
Hi, I like your idea and implementation for offline deduplication a lot. I think it will save me 50% of my backup storage! Your code walks/scans the directory/file tree of the filesystem. Would it be possible to walk/scan the disk extents sequentially in disk order? - This would be more

Re: Offline Deduplication for Btrfs

2011-01-10 Thread Ric Wheeler
I think that dedup has a variety of use cases that are all very dependent on your workload. The approach you have here seems to be a quite reasonable one. I did not see it in the code, but it is great to be able to collect statistics on how effective your hash is and any counters for the

Re: Offline Deduplication for Btrfs

2011-01-10 Thread Josef Bacik
On Mon, Jan 10, 2011 at 10:28:14AM -0500, Ric Wheeler wrote: I think that dedup has a variety of use cases that are all very dependent on your workload. The approach you have here seems to be a quite reasonable one. I did not see it in the code, but it is great to be able to collect

Re: Offline Deduplication for Btrfs

2011-01-10 Thread Chris Mason
Excerpts from Josef Bacik's message of 2011-01-10 10:37:31 -0500: On Mon, Jan 10, 2011 at 10:28:14AM -0500, Ric Wheeler wrote: I think that dedup has a variety of use cases that are all very dependent on your workload. The approach you have here seems to be a quite reasonable one. I

Re: Offline Deduplication for Btrfs

2011-01-10 Thread Josef Bacik
On Mon, Jan 10, 2011 at 10:39:56AM -0500, Chris Mason wrote: Excerpts from Josef Bacik's message of 2011-01-10 10:37:31 -0500: On Mon, Jan 10, 2011 at 10:28:14AM -0500, Ric Wheeler wrote: I think that dedup has a variety of use cases that are all very dependent on your workload. The

Re: Offline Deduplication for Btrfs

2011-01-07 Thread Peter A
On Thursday, January 06, 2011 01:35:15 pm Chris Mason wrote: What is the smallest granularity that the datadomain searches for in terms of dedup? Josef's current setup isn't restricted to a specific block size, but there is a min match of 4k. I talked to a few people I know and didn't get a

Re: Offline Deduplication for Btrfs

2011-01-06 Thread Yan, Zheng
On Thu, Jan 6, 2011 at 12:36 AM, Josef Bacik jo...@redhat.com wrote: Here are patches to do offline deduplication for Btrfs.  It works well for the cases it's expected to, I'm looking for feedback on the ioctl interface and such, I'm well aware there are missing features for the userspace app

Re: Offline Deduplication for Btrfs

2011-01-06 Thread Tomasz Chmielewski
I have been thinking a lot about de-duplication for a backup application I am writing. I wrote a little script to figure out how much it would save me. For my laptop home directory, about 100 GiB of data, it was a couple of percent, depending a bit on the size of the chunks. With 4 KiB chunks, I
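The "little script" mentioned above is not reproduced in this snippet. As a rough illustration only, a savings estimate of that kind could hash fixed-size chunks and count repeats. The function name `estimate_dedup_savings` and the choice of SHA-256 are assumptions for this sketch, not details taken from the original script.

```python
import hashlib

def estimate_dedup_savings(paths, chunk_size=4096):
    """Return (total_bytes, duplicate_bytes) over fixed-size chunks of the given files.

    Hypothetical helper: a chunk counts as a duplicate when a chunk with the
    same SHA-256 digest has already been seen in any of the files.
    """
    seen = set()
    total = 0
    duplicate = 0
    for path in paths:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
                digest = hashlib.sha256(chunk).digest()
                if digest in seen:
                    duplicate += len(chunk)
                else:
                    seen.add(digest)
    return total, duplicate
```

Running it with several values of `chunk_size` would show how the estimate depends on chunk size, which is the effect the message describes.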

Re: Offline Deduplication for Btrfs

2011-01-06 Thread Gordan Bobic
Chris Mason wrote: Excerpts from Gordan Bobic's message of 2011-01-05 12:42:42 -0500: Josef Bacik wrote: Basically I think online dedup is huge waste of time and completely useless. I couldn't disagree more. First, let's consider what is the general-purpose use-case of data deduplication.

Re: Offline Deduplication for Btrfs

2011-01-06 Thread Mike Hommey
On Thu, Jan 06, 2011 at 10:37:46AM +0100, Tomasz Chmielewski wrote: I have been thinking a lot about de-duplication for a backup application I am writing. I wrote a little script to figure out how much it would save me. For my laptop home directory, about 100 GiB of data, it was a couple of

Re: Offline Deduplication for Btrfs

2011-01-06 Thread Gordan Bobic
Spelic wrote: On 01/06/2011 02:03 AM, Gordan Bobic wrote: That's just alarmist. AES is being cryptanalyzed because everything uses it. And the news of its insecurity are somewhat exaggerated (for now at least). Who cares... the fact of not being much used is a benefit for RIPEMD /

Re: Offline Deduplication for Btrfs

2011-01-06 Thread Gordan Bobic
Tomasz Chmielewski wrote: I have been thinking a lot about de-duplication for a backup application I am writing. I wrote a little script to figure out how much it would save me. For my laptop home directory, about 100 GiB of data, it was a couple of percent, depending a bit on the size of the

Re: Offline Deduplication for Btrfs

2011-01-06 Thread Simon Farnsworth
Gordan Bobic wrote: Josef Bacik wrote: Basically I think online dedup is huge waste of time and completely useless. I couldn't disagree more. First, let's consider what is the general-purpose use-case of data deduplication. What are the resource requirements to perform it? How do these

Re: Offline Deduplication for Btrfs

2011-01-06 Thread Gordan Bobic
Simon Farnsworth wrote: The basic idea is to use fanotify/inotify (whichever of the notification systems works for this) to track which inodes have been written to. It can then mmap() the changed data (before it's been dropped from RAM) and do the same process as an offline dedupe (hash,

Re: Offline Deduplication for Btrfs

2011-01-06 Thread Simon Farnsworth
Gordan Bobic wrote: Simon Farnsworth wrote: The basic idea is to use fanotify/inotify (whichever of the notification systems works for this) to track which inodes have been written to. It can then mmap() the changed data (before it's been dropped from RAM) and do the same process as an

Re: Offline Deduplication for Btrfs

2011-01-06 Thread Peter A
On Thursday, January 06, 2011 05:48:18 am you wrote: Can you elaborate what you're talking about here? How does the length of a directory name affect alignment of file block contents? I don't see how variability of length matters, other than to make things a lot more complicated. I'm saying in

Re: Offline Deduplication for Btrfs

2011-01-06 Thread Gordan Bobic
Peter A wrote: On Thursday, January 06, 2011 05:48:18 am you wrote: Can you elaborate what you're talking about here? How does the length of a directory name affect alignment of file block contents? I don't see how variability of length matters, other than to make things a lot more complicated.

Re: Offline Deduplication for Btrfs

2011-01-06 Thread Ondřej Bílka
On Thu, Jan 06, 2011 at 12:18:34PM +, Simon Farnsworth wrote: Gordan Bobic wrote: Josef Bacik wrote: snip Then again, for a lot of use-cases there are perhaps better ways to achieve the targeted goal than deduping on FS level, e.g. snapshotting or something like fl-cow:

Re: Offline Deduplication for Btrfs

2011-01-06 Thread Gordan Bobic
Ondřej Bílka wrote: Then again, for a lot of use-cases there are perhaps better ways to achieve the targeted goal than deduping on FS level, e.g. snapshotting or something like fl-cow: http://www.xmailserver.org/flcow.html As VM are concerned fl-cow is poor replacement of deduping. Depends on

Re: Offline Deduplication for Btrfs

2011-01-06 Thread Gordan Bobic
Tomasz Torcz wrote: On Thu, Jan 06, 2011 at 02:19:04AM +0100, Spelic wrote: CPU can handle considerably more than 250 block hashings per second. You could argue that this changes in cases of sequential I/O on big files, but a 1.86GHz Core2 can churn through 111MB/s of SHA256, which even
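The throughput figure quoted above is specific to the poster's hardware. A minimal sketch for checking the same kind of number on another machine follows; the function name `sha256_throughput_mb_s` and the 4 KiB block size are assumptions for illustration, and the result will vary with the CPU and the hash implementation.

```python
import hashlib
import time

def sha256_throughput_mb_s(total_mb=16, block_size=4096):
    """Hash total_mb MiB of zero-filled blocks and return MiB per second.

    Hypothetical micro-benchmark: it measures only hashing cost, not disk I/O.
    """
    block = bytes(block_size)
    blocks = (total_mb * 1024 * 1024) // block_size
    start = time.perf_counter()
    for _ in range(blocks):
        hashlib.sha256(block).digest()
    elapsed = time.perf_counter() - start
    return total_mb / elapsed
```

Comparing the returned value with the sequential read rate of the storage gives a rough sense of whether hashing or I/O would be the bottleneck.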

Re: Offline Deduplication for Btrfs

2011-01-06 Thread Peter A
On Thursday, January 06, 2011 09:00:47 am you wrote: Peter A wrote: I'm saying in a filesystem it doesn't matter - if you bundle everything into a backup stream, it does. Think of tar. 512 byte alignment. I tar up a directory with 8TB total size. No big deal. Now I create a new, empty
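The message above is cut off, so the following is only one reading of the tar argument: if a stream is shifted by a single 512-byte record, fixed 4 KiB blocks stop lining up while 512-byte blocks still match. The helper names and the synthetic payload are assumptions made for this sketch.

```python
import hashlib

def chunk_digests(data, size):
    """SHA-256 digest of each fixed-size chunk of data (the last chunk may be short)."""
    return [hashlib.sha256(data[i:i + size]).digest() for i in range(0, len(data), size)]

def shared_chunks(old, new, size):
    """Count chunks of `new` whose digest also occurs among the chunks of `old`."""
    known = set(chunk_digests(old, size))
    return sum(1 for d in chunk_digests(new, size) if d in known)

# Deterministic, non-repeating 64 KiB payload built from 32-byte digests.
payload = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(2048))
# The same payload preceded by one extra 512-byte record, as in a tar stream.
shifted = bytes(512) + payload

print(shared_chunks(payload, shifted, 4096))  # 0: no 4 KiB chunk lines up any more
print(shared_chunks(payload, shifted, 512))   # 128: every 512-byte chunk still matches
```

This only shows the alignment effect for fixed-size blocks; it says nothing about how any particular backup product actually chooses its block boundaries.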

Re: Offline Deduplication for Btrfs

2011-01-06 Thread Gordan Bobic
Peter A wrote: On Thursday, January 06, 2011 09:00:47 am you wrote: Peter A wrote: I'm saying in a filesystem it doesn't matter - if you bundle everything into a backup stream, it does. Think of tar. 512 byte alignment. I tar up a directory with 8TB total size. No big deal. Now I create a

Re: Offline Deduplication for Btrfs

2011-01-06 Thread Ondřej Bílka
On Thu, Jan 06, 2011 at 02:41:28PM +, Gordan Bobic wrote: Ondřej Bílka wrote: Then again, for a lot of use-cases there are perhaps better ways to achieve the targeted goal than deduping on FS level, e.g. snapshotting or something like fl-cow: http://www.xmailserver.org/flcow.html As VM

Re: Offline Deduplication for Btrfs

2011-01-06 Thread Peter A
On Thursday, January 06, 2011 10:07:03 am you wrote: I'd be interested to see the evidence of the variable length argument. I have a sneaky suspicion that it actually falls back to 512 byte blocks, which are much more likely to align, when more sensibly sized blocks fail. The downside is that

Re: Offline Deduplication for Btrfs

2011-01-06 Thread Hubert Kario
On Thursday 06 of January 2011 10:51:04 Mike Hommey wrote: On Thu, Jan 06, 2011 at 10:37:46AM +0100, Tomasz Chmielewski wrote: I have been thinking a lot about de-duplication for a backup application I am writing. I wrote a little script to figure out how much it would save me. For my

Offline Deduplication for Btrfs V2

2011-01-06 Thread Josef Bacik
Just a quick update, I've dropped the hashing stuff in favor of doing a memcmp in the kernel to make sure the data is still the same. The thing that takes a while is reading the data up from disk, so doing a memcmp of the entire buffer isn't that big of a deal, not to mention there's a possibility
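The V2 change described above happens inside the kernel and its code is not shown here. As a userspace analogy only, the same "compare the bytes before sharing them" check could look like the sketch below; `ranges_identical` and its parameters are hypothetical names, not part of the patch or of any Btrfs interface.

```python
def ranges_identical(path_a, offset_a, path_b, offset_b, length, buf_size=1 << 20):
    """Return True only if both files hold the same `length` bytes at the given offsets.

    Hypothetical helper: a byte-for-byte comparison, so a hash collision
    cannot cause two different ranges to be treated as duplicates.
    """
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        fa.seek(offset_a)
        fb.seek(offset_b)
        remaining = length
        while remaining > 0:
            n = min(buf_size, remaining)
            a = fa.read(n)
            b = fb.read(n)
            # A short read means one range runs past end of file: not a match.
            if len(a) != n or len(b) != n or a != b:
                return False
            remaining -= n
    return True
```

Since both ranges have to be read from disk anyway, the comparison adds little cost on top of the I/O, which matches the reasoning given in the message.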

Re: Offline Deduplication for Btrfs

2011-01-06 Thread Chris Mason
Excerpts from Peter A's message of 2011-01-05 22:58:36 -0500: On Wednesday, January 05, 2011 08:19:04 pm Spelic wrote: I'd just make it always use the fs block size. No point in making it variable. Agreed. What is the reason for variable block size? First post on this list - I

Offline Deduplication for Btrfs

2011-01-05 Thread Josef Bacik
Here are patches to do offline deduplication for Btrfs. It works well for the cases it's expected to, I'm looking for feedback on the ioctl interface and such, I'm well aware there are missing features for the userspace app (like being able to set a different blocksize). If this interface

Re: Offline Deduplication for Btrfs

2011-01-05 Thread Gordan Bobic
Josef Bacik wrote: Basically I think online dedup is huge waste of time and completely useless. I couldn't disagree more. First, let's consider what is the general-purpose use-case of data deduplication. What are the resource requirements to perform it? How do these resource requirements

Re: Offline Deduplication for Btrfs

2011-01-05 Thread Josef Bacik
On Wed, Jan 05, 2011 at 05:42:42PM +, Gordan Bobic wrote: Josef Bacik wrote: Basically I think online dedup is huge waste of time and completely useless. I couldn't disagree more. First, let's consider what is the general-purpose use-case of data deduplication. What are the resource

Re: Offline Deduplication for Btrfs

2011-01-05 Thread Ray Van Dolson
On Wed, Jan 05, 2011 at 07:41:13PM +0100, Diego Calleja wrote: On Miércoles, 5 de Enero de 2011 18:42:42 Gordan Bobic escribió: So by doing the hash indexing offline, the total amount of disk I/O required effectively doubles, and the amount of CPU spent on doing the hashing is in no way

Re: Offline Deduplication for Btrfs

2011-01-05 Thread Lars Wirzenius
On ke, 2011-01-05 at 14:46 -0500, Josef Bacik wrote: Blah blah blah, I'm not having an argument about which is better because I simply do not care. I think dedup is silly to begin with, and online dedup even sillier. The only reason I did offline dedup was because I was just toying around

Re: Offline Deduplication for Btrfs

2011-01-05 Thread Freddie Cash
On Wed, Jan 5, 2011 at 11:46 AM, Josef Bacik jo...@redhat.com wrote: Dedup is only useful if you _know_ you are going to have duplicate information, so the two major usecases that come to mind are 1) Mail server.  You have small files, probably less than 4k (blocksize) that you are storing

Re: Offline Deduplication for Btrfs

2011-01-05 Thread Josef Bacik
On Wed, Jan 05, 2011 at 07:58:13PM +, Lars Wirzenius wrote: On ke, 2011-01-05 at 14:46 -0500, Josef Bacik wrote: Blah blah blah, I'm not having an argument about which is better because I simply do not care. I think dedup is silly to begin with, and online dedup even sillier. The

Re: Offline Deduplication for Btrfs

2011-01-05 Thread Gordan Bobic
On 01/05/2011 06:41 PM, Diego Calleja wrote: On Miércoles, 5 de Enero de 2011 18:42:42 Gordan Bobic escribió: So by doing the hash indexing offline, the total amount of disk I/O required effectively doubles, and the amount of CPU spent on doing the hashing is in no way reduced. But there are

Re: Offline Deduplication for Btrfs

2011-01-05 Thread Gordan Bobic
On 01/05/2011 07:01 PM, Ray Van Dolson wrote: On Wed, Jan 05, 2011 at 07:41:13PM +0100, Diego Calleja wrote: On Miércoles, 5 de Enero de 2011 18:42:42 Gordan Bobic escribió: So by doing the hash indexing offline, the total amount of disk I/O required effectively doubles, and the amount of CPU

Re: Offline Deduplication for Btrfs

2011-01-05 Thread Josef Bacik
On Wed, Jan 05, 2011 at 11:01:41AM -0800, Ray Van Dolson wrote: On Wed, Jan 05, 2011 at 07:41:13PM +0100, Diego Calleja wrote: On Miércoles, 5 de Enero de 2011 18:42:42 Gordan Bobic escribió: So by doing the hash indexing offline, the total amount of disk I/O required effectively

Re: Offline Deduplication for Btrfs

2011-01-05 Thread Freddie Cash
On Wed, Jan 5, 2011 at 12:15 PM, Josef Bacik jo...@redhat.com wrote: Yeah for things where you are talking about sending it over the network or something like that every little bit helps.  I think deduplication is far more interesting and useful at an application level than at a filesystem

Re: Offline Deduplication for Btrfs

2011-01-05 Thread Gordan Bobic
On 01/05/2011 07:46 PM, Josef Bacik wrote: Blah blah blah, I'm not having an argument about which is better because I simply do not care. I think dedup is silly to begin with, and online dedup even sillier. Offline dedup is more expensive - so why are you of the opinion that it is less

Re: Offline Deduplication for Btrfs

2011-01-05 Thread Lars Wirzenius
On ke, 2011-01-05 at 19:58 +, Lars Wirzenius wrote: (For my script, see find-duplicate-chunks in http://code.liw.fi/debian/pool/main/o/obnam/obnam_0.14.tar.gz or get the current code using bzr get http://code.liw.fi/obnam/bzr/trunk/. http://braawi.org/obnam/ is the home page of the backup

Re: Offline Deduplication for Btrfs

2011-01-05 Thread Gordan Bobic
On 01/05/2011 09:14 PM, Diego Calleja wrote: In fact, there are cases where online dedup is clearly much worse. For example, cases where people suffer duplication, but it takes a lot of time (several months) to hit it. With online dedup, you need to enable it all the time to get deduplication,

Re: Offline Deduplication for Btrfs

2011-01-05 Thread Gordan Bobic
On 01/06/2011 12:22 AM, Spelic wrote: On 01/05/2011 09:46 PM, Gordan Bobic wrote: On 01/05/2011 07:46 PM, Josef Bacik wrote: Offline dedup is more expensive - so why are you of the opinion that it is less silly? And comparison by silliness quotient still sounds like an argument over which is

Re: Offline Deduplication for Btrfs

2011-01-05 Thread Spelic
On 01/05/2011 09:46 PM, Gordan Bobic wrote: On 01/05/2011 07:46 PM, Josef Bacik wrote: Offline dedup is more expensive - so why are you of the opinion that it is less silly? And comparison by silliness quotient still sounds like an argument over which is better. If I can say my opinion, I

Re: Offline Deduplication for Btrfs

2011-01-05 Thread Chris Mason
Excerpts from Gordan Bobic's message of 2011-01-05 12:42:42 -0500: Josef Bacik wrote: Basically I think online dedup is huge waste of time and completely useless. I couldn't disagree more. First, let's consider what is the general-purpose use-case of data deduplication. What are the

Re: Offline Deduplication for Btrfs

2011-01-05 Thread Spelic
On 01/06/2011 02:03 AM, Gordan Bobic wrote: That's just alarmist. AES is being cryptanalyzed because everything uses it. And the news of its insecurity are somewhat exaggerated (for now at least). Who cares... the fact of not being much used is a benefit for RIPEMD / blowfish-twofish

Re: Offline Deduplication for Btrfs

2011-01-05 Thread Freddie Cash
On Wed, Jan 5, 2011 at 5:03 PM, Gordan Bobic gor...@bobich.net wrote: On 01/06/2011 12:22 AM, Spelic wrote: Definitely agree that it should be a per-directory option, rather than per mount. JOOC, would the dedupe table be done per directory, per mount, per sub-volume, or per volume? The

Re: Offline Deduplication for Btrfs

2011-01-05 Thread Peter A
On Wednesday, January 05, 2011 08:19:04 pm Spelic wrote: I'd just make it always use the fs block size. No point in making it variable. Agreed. What is the reason for variable block size? First post on this list - I mostly was just reading so far to learn more on fs design but this is