Re: [GSoC 2015] Btrfs content based storage

2015-04-14 Thread harshad shirwadkar
Apr 13, 2015 at 10:47 AM, David Sterba wrote: > On Fri, Mar 27, 2015 at 10:58:42AM -0400, harshad shirwadkar wrote: >> I am a CS graduate student from Carnegie Mellon University. I am >> hoping to build the feature - "Content based storage mode" under >> Google Summer

Re: [GSoC 2015] Btrfs content based storage

2015-04-13 Thread David Sterba
On Fri, Mar 27, 2015 at 10:58:42AM -0400, harshad shirwadkar wrote: > I am a CS graduate student from Carnegie Mellon University. I am > hoping to build the feature - "Content based storage mode" under > Google Summer of Code 2015. This project has also been listed as an > id

[GSoC 2015] Btrfs content based storage

2015-03-27 Thread harshad shirwadkar
Hello All, I am a CS graduate student from Carnegie Mellon University. I am hoping to build the feature - "Content based storage mode" under Google Summer of Code 2015. This project has also been listed as an idea on the BTRFS ideas page. However, I have not found a mentor yet, and without

Re: Content based storage

2010-03-20 Thread Boyd Waters
I realize that I've posted some dumb things in this thread so here's a re-cast summary: 1) In the past, I experimented with filesystem backups, using my own file-level checksumming that would detect when a file was already in the backup repository, and add a hard link rather than allocate new bloc
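
A minimal sketch of the file-level approach described above: hash each source file and, when the digest has already been stored in the backup repository, add a hard link instead of copying the data again. The function names and layout here are hypothetical, not Boyd Waters' actual script.

import hashlib
import os
import shutil

def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()

def backup(src_root, repo_root):
    """Copy src_root into repo_root, hard-linking files whose content
    has already been stored instead of allocating new blocks."""
    seen = {}  # digest -> path of the first copy stored in the repo
    for dirpath, _dirs, files in os.walk(src_root):
        for name in files:
            src = os.path.join(dirpath, name)
            dst = os.path.join(repo_root, os.path.relpath(src, src_root))
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            digest = file_digest(src)
            if digest in seen:
                os.link(seen[digest], dst)   # duplicate content: hard link
            else:
                shutil.copy2(src, dst)       # first occurrence: store the data
                seen[digest] = dst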

Re: Content based storage

2010-03-20 Thread Ric Wheeler
On 03/20/2010 06:16 PM, Ric Wheeler wrote: On 03/20/2010 05:24 PM, Boyd Waters wrote: On Mar 20, 2010, at 9:05 AM, Ric Wheeler wrote: My dataset reported a dedup factor of 1.28 for about 4TB, meaning that almost a third of the dataset was duplicated. It is always interesting to compare thi

Re: Content based storage

2010-03-20 Thread Ric Wheeler
On 03/20/2010 05:24 PM, Boyd Waters wrote: On Mar 20, 2010, at 9:05 AM, Ric Wheeler wrote: My dataset reported a dedup factor of 1.28 for about 4TB, meaning that almost a third of the dataset was duplicated. It is always interesting to compare this to the rate you would get with old fashion

Re: Content based storage

2010-03-20 Thread Boyd Waters
On Mar 20, 2010, at 9:05 AM, Ric Wheeler wrote: >> >> My dataset reported a dedup factor of 1.28 for about 4TB, meaning >> that >> almost a third of the dataset was duplicated. > It is always interesting to compare this to the rate you would get > with old fashioned compression to see how effecti
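
The arithmetic behind a number like that, assuming the reported dedup factor is logical data referenced divided by unique data actually stored (the usual ZFS reading), works out roughly as follows:

logical_tb = 4.0                           # data as seen by the filesystem
dedup_factor = 1.28                        # reported ratio
unique_tb = logical_tb / dedup_factor      # ~3.1 TB actually stored
saved_tb = logical_tb - unique_tb          # ~0.9 TB not allocated twice
# saved_tb / logical_tb ~= 0.22, so about 22% of the logical data is duplicate copies;
# saved_tb / unique_tb  ~= 0.28, i.e. almost a third of a TB is referenced more
# than once for every TB of unique data.
print(unique_tb, saved_tb)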

Re: Content based storage

2010-03-20 Thread Ric Wheeler
On 03/19/2010 10:46 PM, Boyd Waters wrote: 2010/3/17 Hubert Kario: Read further, Sun did provide a way to enable the compare step by using "verify" instead of "on": zfs set dedup=verify I have tested ZFS deduplication on the same data set that I'm using to test btrfs. I used a 5-eleme

Re: Content based storage

2010-03-19 Thread Boyd Waters
2010/3/17 Hubert Kario : > > Read further, Sun did provide a way to enable the compare step by using > "verify" instead of "on": > zfs set dedup=verify I have tested ZFS deduplication on the same data set that I'm using to test btrfs. I used a 5-element raidz, dedup=on, which uses SHA256 for ZFS
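
The difference between dedup=on and dedup=verify discussed above is whether a checksum match alone is trusted, or whether the candidate blocks are also compared byte for byte before being shared. A rough illustration of that write path (illustrative only, not ZFS or btrfs code; the in-memory dict stands in for the on-disk dedup table):

import hashlib

dedup_table = {}  # SHA-256 digest -> block contents already stored

def write_block(data, verify=True):
    """Store a block, sharing an existing one when the content matches.

    With verify=False a checksum match alone is trusted ("on"); with
    verify=True the blocks are also compared before sharing ("verify")."""
    key = hashlib.sha256(data).digest()
    existing = dedup_table.get(key)
    if existing is not None:
        if not verify or existing == data:
            return key              # duplicate: reference the existing block
        # Checksum collision with different contents; a real filesystem
        # would store this block separately under a distinct key.
        raise RuntimeError("SHA-256 collision detected")
    dedup_table[key] = data
    return key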

Re: Content based storage

2010-03-17 Thread Hubert Kario
On Wednesday 17 March 2010 16:33:41 Leszek Ciesielski wrote: > On Wed, Mar 17, 2010 at 4:25 PM, Hubert Kario wrote: > > On Wednesday 17 March 2010 09:48:18 Heinz-Josef Claes wrote: > >> Hi, > >> > >> just want to add one correction to your thoughts: > >> > >> Storage is not cheap if you think abou

Re: Content based storage

2010-03-17 Thread Leszek Ciesielski
On Wed, Mar 17, 2010 at 4:25 PM, Hubert Kario wrote: > On Wednesday 17 March 2010 09:48:18 Heinz-Josef Claes wrote: >> Hi, >> >> just want to add one correction to your thoughts: >> >> Storage is not cheap if you think about enterprise storage on a SAN, >> replicated to another data centre. Using

Re: Content based storage

2010-03-17 Thread Hubert Kario
On Wednesday 17 March 2010 09:48:18 Heinz-Josef Claes wrote: > Hi, > > just want to add one correction to your thoughts: > > Storage is not cheap if you think about enterprise storage on a SAN, > replicated to another data centre. Using dedup on the storage boxes leads > to performance issues an

Re: Content based storage

2010-03-17 Thread Heinz-Josef Claes
are more. > >> > >> Developers often have multiple copies of source code trees as branches, > >> snapshots, etc. For larger projects (I have multiple "buildroot" trees > >> for one project) this can take a lot of space. Content-based storage

Re: Content based storage

2010-03-17 Thread David Brown
On 17/03/2010 01:45, Hubert Kario wrote: On Tuesday 16 March 2010 10:21:43 David Brown wrote: Hi, I was wondering if there has been any thought or progress in content-based storage for btrfs beyond the suggestion in the "Project ideas" wiki page? The basic idea, as I understand it,

Re: Content based storage

2010-03-17 Thread David Brown
On 16/03/2010 23:45, Fabio wrote: Some years ago I was searching for that kind of functionality and found an experimental ext3 patch to allow the so-called COW-links: http://lwn.net/Articles/76616/ I'd read about the COW patches for ext3 before. While there is certainly some similarity here,

Re: Content based storage

2010-03-16 Thread Hubert Kario
On Tuesday 16 March 2010 10:21:43 David Brown wrote: > Hi, > > I was wondering if there has been any thought or progress in > content-based storage for btrfs beyond the suggestion in the "Project > ideas" wiki page? > > The basic idea, as I understand it, is that

Re: Content based storage

2010-03-16 Thread Fabio
uld really make Btrfs FLY on Hard Disk and make SSD devices possible for storage (because of the space efficiency). -- Fabio David Brown wrote: Hi, I was wondering if there has been any thought or progress in content-based storage for btrfs beyond the suggestion in the "Project idea

Content based storage

2010-03-16 Thread David Brown
Hi, I was wondering if there has been any thought or progress in content-based storage for btrfs beyond the suggestion in the "Project ideas" wiki page? The basic idea, as I understand it, is that a longer data extent checksum is used (long enough to make collisions unrealistic),
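
The idea sketched in this first post is an extent index keyed by a long content checksum: a write whose checksum is already present takes another reference to the existing extent instead of allocating new space. A compact model of that write path, under the post's assumption that collisions can be treated as impossible (an illustration only, not btrfs code):

import hashlib
from dataclasses import dataclass

@dataclass
class Extent:
    data: bytes
    refs: int = 1           # how many logical writes point at this extent

class ContentAddressedStore:
    def __init__(self):
        self.by_checksum = {}   # SHA-256 digest -> Extent

    def write(self, data: bytes) -> bytes:
        """Store data, sharing an existing extent if the checksum matches."""
        key = hashlib.sha256(data).digest()
        extent = self.by_checksum.get(key)
        if extent is not None:
            extent.refs += 1    # duplicate content: no new allocation
        else:
            self.by_checksum[key] = Extent(data)
        return key              # files would record this key in their metadata

    def read(self, key: bytes) -> bytes:
        return self.by_checksum[key].data

For example, writing the same block twice through this model leaves a single stored extent with refs == 2, which is the space saving the thread is after.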