Hi,
I like your idea and implementation for offline deduplication a lot. I
think it will save me 50% of my backup storage!
Your code walks/scans the directory/file tree of the filesystem. Would
it be possible to walk/scan the disk extents sequentially in disk
order?
- This would be more
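As a rough illustration of that disk-order idea: one could sort files by the
physical offset of their first extent (via the generic FIEMAP ioctl, which
btrfs supports) before the dedup pass reads them. This is a hypothetical
sketch, not part of the posted patches:

#!/usr/bin/env python3
# Hypothetical sketch: order files by the physical offset of their first
# extent (FIEMAP), so the dedup scan reads them roughly in disk order.
import fcntl, os, struct, sys

FS_IOC_FIEMAP = 0xC020660B              # _IOWR('f', 11, struct fiemap)
FIEMAP_HDR = struct.Struct("=QQIIII")   # fm_start, fm_length, fm_flags,
                                        # fm_mapped_extents, fm_extent_count, reserved
EXTENT = struct.Struct("=QQQQQIIII")    # fe_logical, fe_physical, fe_length, ...

def first_physical_offset(path):
    """Physical byte offset of the file's first extent, or None."""
    req = FIEMAP_HDR.pack(0, 0xFFFFFFFFFFFFFFFF, 0, 0, 1, 0)
    buf = bytearray(req) + bytearray(EXTENT.size)
    try:
        with open(path, "rb") as f:
            fcntl.ioctl(f.fileno(), FS_IOC_FIEMAP, buf)
    except OSError:
        return None
    if FIEMAP_HDR.unpack_from(buf)[3] == 0:   # fm_mapped_extents
        return None
    return EXTENT.unpack_from(buf, FIEMAP_HDR.size)[1]   # fe_physical

def files_in_disk_order(root):
    found = []
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                offset = first_physical_offset(path)
                if offset is not None:
                    found.append((offset, path))
    found.sort()
    return [path for _offset, path in found]

if __name__ == "__main__":
    for path in files_in_disk_order(sys.argv[1]):
        print(path)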
I think that dedup has a variety of use cases that are all very dependent on
your workload. The approach you have here seems to be a quite reasonable one.
I did not see it in the code, but it would be great to be able to collect
statistics on how effective your hash is, and any counters for the

On Mon, Jan 10, 2011 at 10:28:14AM -0500, Ric Wheeler wrote:
I think that dedup has a variety of use cases that are all very dependent
on your workload. The approach you have here seems to be a quite
reasonable one.
I did not see it in the code, but it is great to be able to collect
Excerpts from Josef Bacik's message of 2011-01-10 10:37:31 -0500:
On Mon, Jan 10, 2011 at 10:28:14AM -0500, Ric Wheeler wrote:
I think that dedup has a variety of use cases that are all very dependent
on your workload. The approach you have here seems to be a quite
reasonable one.
I
On Mon, Jan 10, 2011 at 10:39:56AM -0500, Chris Mason wrote:
Excerpts from Josef Bacik's message of 2011-01-10 10:37:31 -0500:
On Mon, Jan 10, 2011 at 10:28:14AM -0500, Ric Wheeler wrote:
I think that dedup has a variety of use cases that are all very dependent
on your workload. The
On Thursday, January 06, 2011 01:35:15 pm Chris Mason wrote:
What is the smallest granularity that the Data Domain searches for in
terms of dedup?
Josef's current setup isn't restricted to a specific block size, but
there is a min match of 4k.
I talked to a few people I know and didn't get a
On Thu, Jan 6, 2011 at 12:36 AM, Josef Bacik jo...@redhat.com wrote:
Here are patches to do offline deduplication for Btrfs. It works well for the
cases it's expected to, I'm looking for feedback on the ioctl interface and
such, I'm well aware there are missing features for the userspace app
I have been thinking a lot about de-duplication for a backup application
I am writing. I wrote a little script to figure out how much it would
save me. For my laptop home directory, about 100 GiB of data, it was a
couple of percent, depending a bit on the size of the chunks. With 4 KiB
chunks, I
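For anyone who wants to reproduce that kind of estimate, a small stand-in for
such a script could look like the following (a toy sketch, not the actual
find-duplicate-chunks script): hash fixed-size chunks and compare unique
against total bytes.

#!/usr/bin/env python3
# Toy estimate of fixed-chunk dedup savings for a directory tree.
import hashlib, os, sys

def estimate(root, chunk_size):
    seen = set()
    total_bytes = unique_bytes = 0
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path) or os.path.islink(path):
                continue
            try:
                with open(path, "rb") as f:
                    while True:
                        chunk = f.read(chunk_size)
                        if not chunk:
                            break
                        total_bytes += len(chunk)
                        digest = hashlib.sha256(chunk).digest()
                        if digest not in seen:
                            seen.add(digest)
                            unique_bytes += len(chunk)
            except OSError:
                continue
    return total_bytes, unique_bytes

if __name__ == "__main__":
    for size in (4096, 65536):
        total, unique = estimate(sys.argv[1], size)
        if total:
            print(f"{size:>6}-byte chunks: {100 * (1 - unique / total):.1f}% duplicated")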
Chris Mason wrote:
Excerpts from Gordan Bobic's message of 2011-01-05 12:42:42 -0500:
Josef Bacik wrote:
Basically I think online dedup is a huge waste of time and completely useless.
I couldn't disagree more. First, let's consider what is the
general-purpose use-case of data deduplication.
On Thu, Jan 06, 2011 at 10:37:46AM +0100, Tomasz Chmielewski wrote:
I have been thinking a lot about de-duplication for a backup application
I am writing. I wrote a little script to figure out how much it would
save me. For my laptop home directory, about 100 GiB of data, it was a
couple of
Spelic wrote:
On 01/06/2011 02:03 AM, Gordan Bobic wrote:
That's just alarmist. AES is being cryptanalyzed because everything
uses it. And the news of its insecurity is somewhat exaggerated (for
now at least).
Who cares... the fact that it is not much used is a benefit for RIPEMD /
Tomasz Chmielewski wrote:
I have been thinking a lot about de-duplication for a backup application
I am writing. I wrote a little script to figure out how much it would
save me. For my laptop home directory, about 100 GiB of data, it was a
couple of percent, depending a bit on the size of the
Gordan Bobic wrote:
Josef Bacik wrote:
Basically I think online dedup is a huge waste of time and completely
useless.
I couldn't disagree more. First, let's consider what is the
general-purpose use-case of data deduplication. What are the resource
requirements to perform it? How do these
Simon Farnsworth wrote:
The basic idea is to use fanotify/inotify (whichever of the notification
systems works for this) to track which inodes have been written to. It can
then mmap() the changed data (before it's been dropped from RAM) and do the
same process as an offline dedupe (hash,
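A toy sketch of that notification-driven approach, using inotify via ctypes
(fanotify would be needed to cover a whole mount, and for simplicity this
reads the file normally rather than mmap()ing it; all names here are
illustrative, not from the posted patches): watch a directory for completed
writes, then hash the modified file's blocks just as an offline pass would.

#!/usr/bin/env python3
# Illustrative sketch: hash the blocks of files as they are written.
import ctypes, ctypes.util, hashlib, os, struct, sys

IN_CLOSE_WRITE = 0x00000008           # a file opened for writing was closed
EVENT_HDR = struct.Struct("iIII")     # wd, mask, cookie, length of name field

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

def hash_blocks(path, block_size=4096):
    """Per-block hashes, as an offline dedup pass would index them."""
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            digests.append(hashlib.sha256(block).hexdigest()[:16])
    return digests

def watch(directory):
    fd = libc.inotify_init()
    if fd < 0:
        raise OSError(ctypes.get_errno(), "inotify_init failed")
    if libc.inotify_add_watch(fd, directory.encode(), IN_CLOSE_WRITE) < 0:
        raise OSError(ctypes.get_errno(), "inotify_add_watch failed")
    while True:
        buf = os.read(fd, 4096)           # one or more packed inotify events
        offset = 0
        while offset < len(buf):
            _wd, mask, _cookie, name_len = EVENT_HDR.unpack_from(buf, offset)
            name = buf[offset + EVENT_HDR.size:
                       offset + EVENT_HDR.size + name_len].rstrip(b"\0").decode()
            offset += EVENT_HDR.size + name_len
            if mask & IN_CLOSE_WRITE and name:
                path = os.path.join(directory, name)
                try:
                    print(path, hash_blocks(path)[:4], "...")
                except OSError:
                    pass                  # file vanished before we could read it

if __name__ == "__main__":
    watch(sys.argv[1])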
Gordan Bobic wrote:
Simon Farnsworth wrote:
The basic idea is to use fanotify/inotify (whichever of the notification
systems works for this) to track which inodes have been written to. It
can then mmap() the changed data (before it's been dropped from RAM) and
do the same process as an
On Thursday, January 06, 2011 05:48:18 am you wrote:
Can you elaborate what you're talking about here? How does the length of
a directory name affect alignment of file block contents? I don't see
how variability of length matters, other than to make things a lot more
complicated.
I'm saying in
Peter A wrote:
On Thursday, January 06, 2011 05:48:18 am you wrote:
Can you elaborate what you're talking about here? How does the length of
a directory name affect alignment of file block contents? I don't see
how variability of length matters, other than to make things a lot more
complicated.
On Thu, Jan 06, 2011 at 12:18:34PM +0000, Simon Farnsworth wrote:
Gordan Bobic wrote:
Josef Bacik wrote:
snip
Then again, for a lot of use-cases there are perhaps better ways to
achieve the targeted goal than deduping on FS level, e.g. snapshotting or
something like fl-cow:
Ondřej Bílka wrote:
Then again, for a lot of use-cases there are perhaps better ways to
achieve the targeted goal than deduping on FS level, e.g. snapshotting or
something like fl-cow:
http://www.xmailserver.org/flcow.html
As far as VMs are concerned, fl-cow is a poor replacement for deduping.
Depends on
Tomasz Torcz wrote:
On Thu, Jan 06, 2011 at 02:19:04AM +0100, Spelic wrote:
CPU can handle considerably more than 250 block hashings per
second. You could argue that this changes in cases of sequential
I/O on big files, but a 1.86 GHz Core2 can churn through
111 MB/s of SHA256, which even
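To put numbers on that claim: at 111 MB/s and 4 KiB blocks, that is roughly
111 * 1024 / 4 ≈ 28,000 block hashes per second, two orders of magnitude
above the ~250/s figure. A quick, rough single-core check of the throughput
number on your own machine (a sketch; results vary by CPU and build):

import hashlib, os, time

# Rough single-core SHA-256 throughput check.
buf = os.urandom(64 * 1024 * 1024)            # 64 MiB of test data
start = time.perf_counter()
hashlib.sha256(buf).digest()
elapsed = time.perf_counter() - start
print(f"SHA-256: {len(buf) / (1 << 20) / elapsed:.0f} MiB/s")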
On Thursday, January 06, 2011 09:00:47 am you wrote:
Peter A wrote:
I'm saying in a filesystem it doesn't matter - if you bundle everything
into a backup stream, it does. Think of tar. 512-byte alignment. I tar
up a directory with 8 TB total size. No big deal. Now I create a new,
empty
Peter A wrote:
On Thursday, January 06, 2011 09:00:47 am you wrote:
Peter A wrote:
I'm saying in a filesystem it doesn't matter - if you bundle everything
into a backup stream, it does. Think of tar. 512-byte alignment. I tar
up a directory with 8 TB total size. No big deal. Now I create a
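The alignment point is easy to demonstrate: tar pads every member to 512
bytes, so prepending a small file shifts all later content by a multiple of
512 that is usually not a multiple of 4096. A toy sketch with illustrative
names and sizes (not anyone's real data):

#!/usr/bin/env python3
# Toy demonstration: a prepended small file shifts later tar content by a
# multiple of 512 bytes, but (here) not by a multiple of 4096.
import hashlib, io, random, tarfile

def make_tar(members):
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, size in members:
            payload = random.Random(name).randbytes(size)  # deterministic, Python 3.9+
            info = tarfile.TarInfo(name)
            info.size = size
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()

def chunk_hashes(data, size):
    return {hashlib.sha256(data[i:i + size]).digest()
            for i in range(0, len(data), size)}

old = make_tar([("alpha", 200 * 1024), ("beta", 100 * 1024)])
new = make_tar([("tiny", 100), ("alpha", 200 * 1024), ("beta", 100 * 1024)])

for size in (512, 4096):
    new_chunks = chunk_hashes(new, size)
    shared = len(new_chunks & chunk_hashes(old, size))
    print(f"{size:>4}-byte chunks: {100 * shared / len(new_chunks):.0f}% already stored")

With 512-byte chunks almost everything after the new member still matches the
first archive; with 4 KiB chunks almost nothing does, which is why fixed 4 KiB
chunking struggles with tar streams while 512-byte chunking does not.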
On Thu, Jan 06, 2011 at 02:41:28PM +, Gordan Bobic wrote:
Ondřej Bílka wrote:
Then again, for a lot of use-cases there are perhaps better ways to
achieve the targed goal than deduping on FS level, e.g. snapshotting or
something like fl-cow:
http://www.xmailserver.org/flcow.html
As VM
On Thursday, January 06, 2011 10:07:03 am you wrote:
I'd be interested to see the evidence of the variable length argument.
I have a sneaking suspicion that it actually falls back to 512-byte
blocks, which are much more likely to align, when more sensibly sized
blocks fail. The downside is that
On Thursday 06 of January 2011 10:51:04 Mike Hommey wrote:
On Thu, Jan 06, 2011 at 10:37:46AM +0100, Tomasz Chmielewski wrote:
I have been thinking a lot about de-duplication for a backup application
I am writing. I wrote a little script to figure out how much it would
save me. For my
Just a quick update, I've dropped the hashing stuff in favor of doing a memcmp
in the kernel to make sure the data is still the same. The thing that takes a
while is reading the data up from disk, so doing a memcmp of the entire buffer
isn't that big of a deal, not to mention there's a possibility
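In userspace terms the same belt-and-braces idea looks roughly like this
(a hypothetical helper, not the kernel code): any earlier hash or heuristic
match is treated only as a hint, and the candidate ranges are byte-compared
before anything is deduped.

def ranges_identical(path_a, off_a, path_b, off_b, length, chunk=1 << 20):
    """Byte-compare two file ranges; a prior hash match is a hint only."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        fa.seek(off_a)
        fb.seek(off_b)
        remaining = length
        while remaining > 0:
            n = min(chunk, remaining)
            a = fa.read(n)
            b = fb.read(n)
            if len(a) != n or a != b:
                return False      # mismatch or short read: do not dedupe
            remaining -= n
    return True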
Excerpts from Peter A's message of 2011-01-05 22:58:36 -0500:
On Wednesday, January 05, 2011 08:19:04 pm Spelic wrote:
I'd just make it always use the fs block size. No point in making it
variable.
Agreed. What is the reason for variable block size?
First post on this list - I
Here are patches to do offline deduplication for Btrfs. It works well for the
cases it's expected to, I'm looking for feedback on the ioctl interface and
such, I'm well aware there are missing features for the userspace app (like
being able to set a different blocksize). If this interface
Josef Bacik wrote:
Basically I think online dedup is a huge waste of time and completely useless.
I couldn't disagree more. First, let's consider what is the
general-purpose use-case of data deduplication. What are the resource
requirements to perform it? How do these resource requirements
On Wed, Jan 05, 2011 at 05:42:42PM +, Gordan Bobic wrote:
Josef Bacik wrote:
Basically I think online dedup is a huge waste of time and completely useless.
I couldn't disagree more. First, let's consider what is the
general-purpose use-case of data deduplication. What are the resource
On Wed, Jan 05, 2011 at 07:41:13PM +0100, Diego Calleja wrote:
On Wednesday, 5 January 2011 18:42:42 Gordan Bobic wrote:
So by doing the hash indexing offline, the total amount of disk I/O
required effectively doubles, and the amount of CPU spent on doing the
hashing is in no way
On Wed, 2011-01-05 at 14:46 -0500, Josef Bacik wrote:
Blah blah blah, I'm not having an argument about which is better because I
simply do not care. I think dedup is silly to begin with, and online dedup
even
sillier. The only reason I did offline dedup was because I was just toying
around
On Wed, Jan 5, 2011 at 11:46 AM, Josef Bacik jo...@redhat.com wrote:
Dedup is only useful if you _know_ you are going to have duplicate information,
so the two major use cases that come to mind are
1) Mail server. You have small files, probably less than 4k (blocksize) that
you are storing
On Wed, Jan 05, 2011 at 07:58:13PM +, Lars Wirzenius wrote:
On Wed, 2011-01-05 at 14:46 -0500, Josef Bacik wrote:
Blah blah blah, I'm not having an argument about which is better because I
simply do not care. I think dedup is silly to begin with, and online dedup
even
sillier. The
On 01/05/2011 06:41 PM, Diego Calleja wrote:
On Wednesday, 5 January 2011 18:42:42 Gordan Bobic wrote:
So by doing the hash indexing offline, the total amount of disk I/O
required effectively doubles, and the amount of CPU spent on doing the
hashing is in no way reduced.
But there are
On 01/05/2011 07:01 PM, Ray Van Dolson wrote:
On Wed, Jan 05, 2011 at 07:41:13PM +0100, Diego Calleja wrote:
On Wednesday, 5 January 2011 18:42:42 Gordan Bobic wrote:
So by doing the hash indexing offline, the total amount of disk I/O
required effectively doubles, and the amount of CPU
On Wed, Jan 05, 2011 at 11:01:41AM -0800, Ray Van Dolson wrote:
On Wed, Jan 05, 2011 at 07:41:13PM +0100, Diego Calleja wrote:
On Wednesday, 5 January 2011 18:42:42 Gordan Bobic wrote:
So by doing the hash indexing offline, the total amount of disk I/O
required effectively
On Wed, Jan 5, 2011 at 12:15 PM, Josef Bacik jo...@redhat.com wrote:
Yeah for things where you are talking about sending it over the network or
something like that every little bit helps. I think deduplication is far more
interesting and useful at an application level than at a filesystem
On 01/05/2011 07:46 PM, Josef Bacik wrote:
Blah blah blah, I'm not having an argument about which is better because I
simply do not care. I think dedup is silly to begin with, and online dedup even
sillier.
Offline dedup is more expensive - so why are you of the opinion that it
is less
On Wed, 2011-01-05 at 19:58 +0000, Lars Wirzenius wrote:
(For my script, see find-duplicate-chunks in
http://code.liw.fi/debian/pool/main/o/obnam/obnam_0.14.tar.gz or get the
current code using "bzr get http://code.liw.fi/obnam/bzr/trunk/".
http://braawi.org/obnam/ is the home page of the backup
On 01/05/2011 09:14 PM, Diego Calleja wrote:
In fact, there are cases where online dedup is clearly much worse. For
example, cases where people suffer duplication, but it takes a lot of
time (several months) to hit it. With online dedup, you need to enable
it all the time to get deduplication,
On 01/06/2011 12:22 AM, Spelic wrote:
On 01/05/2011 09:46 PM, Gordan Bobic wrote:
On 01/05/2011 07:46 PM, Josef Bacik wrote:
Offline dedup is more expensive - so why are you of the opinion that
it is less silly? And comparison by silliness quotient still sounds
like an argument over which is
On 01/05/2011 09:46 PM, Gordan Bobic wrote:
On 01/05/2011 07:46 PM, Josef Bacik wrote:
Offline dedup is more expensive - so why are you of the opinion that
it is less silly? And comparison by silliness quotient still sounds
like an argument over which is better.
If I can say my opinion, I
Excerpts from Gordan Bobic's message of 2011-01-05 12:42:42 -0500:
Josef Bacik wrote:
Basically I think online dedup is a huge waste of time and completely useless.
I couldn't disagree more. First, let's consider what is the
general-purpose use-case of data deduplication. What are the
On 01/06/2011 02:03 AM, Gordan Bobic wrote:
That's just alarmist. AES is being cryptanalyzed because everything
uses it. And the news of its insecurity is somewhat exaggerated (for
now at least).
Who cares... the fact that it is not much used is a benefit for RIPEMD /
Blowfish/Twofish
On Wed, Jan 5, 2011 at 5:03 PM, Gordan Bobic gor...@bobich.net wrote:
On 01/06/2011 12:22 AM, Spelic wrote:
Definitely agree that it should be a per-directory option, rather than per
mount.
JOOC, would the dedupe table be done per directory, per mount, per
sub-volume, or per volume? The
On Wednesday, January 05, 2011 08:19:04 pm Spelic wrote:
I'd just make it always use the fs block size. No point in making it
variable.
Agreed. What is the reason for variable block size?
First post on this list - I mostly was just reading so far to learn more on fs
design but this is