Re: Offline Deduplication for Btrfs

2011-01-15 Thread Arjen Nienhuis
Hi,

I like your idea and implementation for offline deduplication a lot. I
think it will save me 50% of my backup storage!

Your code walks/scans the directory/file tree of the filesystem. Would
it be possible to walk/scan the disk extents sequentially in disk
order?

- This would be more I/O-efficient
- This would save you from reading previously
deduped/snapshotted/hardlinked files more than once.
- Maybe this would make it possible to deduplicate directories as well.
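Roughly what I have in mind, as an illustrative Python sketch (gathering the
extent list with physical offsets is assumed to happen elsewhere, e.g. via the
FIEMAP ioctl; this only shows the ordering idea, not a real tool):

import hashlib

def hash_extents_sequentially(extents):
    """Hash every distinct physical extent exactly once, reading in disk order.

    `extents` is a list of (physical_offset, length, path, file_offset) tuples.
    """
    seen, digests = set(), {}
    for phys, length, path, off in sorted(extents):   # disk order
        if phys in seen:           # extent shared by a snapshot/reflink/dedup
            continue               # -> already hashed, skip the extra read
        seen.add(phys)
        with open(path, "rb") as f:
            f.seek(off)
            digests[phys] = hashlib.sha256(f.read(length)).digest()
    return digests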

Kind regards,
Arjen Nienhuis

P.S. The NTFS implementation on Windows has 'ioctls' to read the MFT
sequentially in disk order and it's *fast*. It's being used for things
like defrag.


Re: Offline Deduplication for Btrfs

2011-01-10 Thread Ric Wheeler


I think that dedup has a variety of use cases that are all very dependent on 
your workload. The approach you have here seems to be a quite reasonable one.


I did not see it in the code, but it would be great to be able to collect statistics 
on how effective your hash is, and to have counters for the extra IO imposed.


It would also be very useful to have a paranoid mode where, when you see a hash collision 
(dedup candidate), you fall back to a byte-by-byte compare to verify that the 
collision is correct.  Keeping stats on how often this is a false collision 
would be quite interesting as well :)
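A minimal sketch of such a paranoid mode (illustrative Python only, not Josef's
actual tool; the counter names are made up):

import hashlib

stats = {"hash_matches": 0, "verified_dups": 0, "false_matches": 0}

def paranoid_is_duplicate(candidate: bytes, existing: bytes) -> bool:
    # A hash match only makes the pair a dedup *candidate* ...
    if hashlib.sha256(candidate).digest() != hashlib.sha256(existing).digest():
        return False
    stats["hash_matches"] += 1
    # ... the byte-by-byte compare is what actually decides.
    if candidate == existing:
        stats["verified_dups"] += 1
        return True
    stats["false_matches"] += 1   # hash collision, or the data changed underneath
    return False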


Ric



Re: Offline Deduplication for Btrfs

2011-01-10 Thread Josef Bacik
On Mon, Jan 10, 2011 at 10:28:14AM -0500, Ric Wheeler wrote:

 I think that dedup has a variety of use cases that are all very dependent 
 on your workload. The approach you have here seems to be a quite 
 reasonable one.

 I did not see it in the code, but it is great to be able to collect 
 statistics on how effective your hash is and any counters for the extra 
 IO imposed.


So I have counters for how many extents are deduped and for the overall file
savings; is that what you are talking about?

 Also very useful to have a paranoid mode where when you see a hash 
 collision (dedup candidate), you fall back to a byte-by-byte compare to 
 verify that the collision is correct.  Keeping stats on how often 
 this is a false collision would be quite interesting as well :)


So I've always done a byte-by-byte compare, first in userspace but now it's in the
kernel, because frankly I don't trust hashing algorithms with my data.  It would
be simple enough to keep statistics on how often the byte-by-byte compare comes
out wrong, but really this is to catch changes to the file, so I have a
suspicion that most of these statistics would simply show that the file changed,
not that the hash was a collision.  Thanks,

Josef


Re: Offline Deduplication for Btrfs

2011-01-10 Thread Chris Mason
Excerpts from Josef Bacik's message of 2011-01-10 10:37:31 -0500:
 On Mon, Jan 10, 2011 at 10:28:14AM -0500, Ric Wheeler wrote:
 
  I think that dedup has a variety of use cases that are all very dependent 
  on your workload. The approach you have here seems to be a quite 
  reasonable one.
 
  I did not see it in the code, but it is great to be able to collect 
  statistics on how effective your hash is and any counters for the extra 
  IO imposed.
 
 
 So I have counters for how many extents are deduped and the overall file
 savings, is that what you are talking about?
 
  Also very useful to have a paranoid mode where when you see a hash 
  collision (dedup candidate), you fall back to a byte-by-byte compare to 
  verify that the collision is correct.  Keeping stats on how often 
  this is a false collision would be quite interesting as well :)
 
 
 So I've always done a byte-by-byte compare, first in userspace but now it's in
 the kernel, because frankly I don't trust hashing algorithms with my data.  It
 would be simple enough to keep statistics on how often the byte-by-byte
 compare comes out wrong, but really this is to catch changes to the file, so I
 have a suspicion that most of these statistics would simply show that the file
 changed, not that the hash was a collision.  Thanks,

At least in the kernel, if you're comparing extents on disk that are
from a committed transaction, the contents won't change.  We could read
into a private buffer instead of into the file's address space to make
this more reliable/strict.

-chris


Re: Offline Deduplication for Btrfs

2011-01-10 Thread Josef Bacik
On Mon, Jan 10, 2011 at 10:39:56AM -0500, Chris Mason wrote:
 Excerpts from Josef Bacik's message of 2011-01-10 10:37:31 -0500:
  On Mon, Jan 10, 2011 at 10:28:14AM -0500, Ric Wheeler wrote:
  
   I think that dedup has a variety of use cases that are all very dependent 
   on your workload. The approach you have here seems to be a quite 
   reasonable one.
  
   I did not see it in the code, but it is great to be able to collect 
   statistics on how effective your hash is and any counters for the extra 
   IO imposed.
  
  
  So I have counters for how many extents are deduped and the overall file
  savings, is that what you are talking about?
  
   Also very useful to have a paranoid mode where when you see a hash 
   collision (dedup candidate), you fall back to a byte-by-byte compare to 
   verify that the collision is correct.  Keeping stats on how often 
   this is a false collision would be quite interesting as well :)
  
  
  So I've always done a byte-by-byte compare, first in userspace but now it's
  in the kernel, because frankly I don't trust hashing algorithms with my data.
  It would be simple enough to keep statistics on how often the byte-by-byte
  compare comes out wrong, but really this is to catch changes to the file, so
  I have a suspicion that most of these statistics would simply show that the
  file changed, not that the hash was a collision.  Thanks,
 
 At least in the kernel, if you're comparing extents on disk that are
 from a committed transaction, the contents won't change.  We could read
 into a private buffer instead of into the file's address space to make
 this more reliable/strict.
 

Right, sorry, I was talking about the userspace case.  Thanks,

Josef


Re: Offline Deduplication for Btrfs

2011-01-07 Thread Peter A
On Thursday, January 06, 2011 01:35:15 pm Chris Mason wrote:
 What is the smallest granularity that the datadomain searches for in
 terms of dedup?
 
 Josef's current setup isn't restricted to a specific block size, but
 there is a min match of 4k.
I talked to a few people I know and didn't get a clear answer either... 
However, 512 bytes came up more than once. 

I'm not really worried about the size of the region to be used, but about 
offsetting it... it's so easy to create large tars, ... where the content is 
offset by a few bytes, multiples of 512 and such.

Peter.

-- 
Censorship: noun, circa 1591. a: Relief of the burden of independent thinking.


Re: Offline Deduplication for Btrfs

2011-01-06 Thread Yan, Zheng
On Thu, Jan 6, 2011 at 12:36 AM, Josef Bacik jo...@redhat.com wrote:
 Here are patches to do offline deduplication for Btrfs.  It works well for the
 cases it's expected to, I'm looking for feedback on the ioctl interface and
 such, I'm well aware there are missing features for the userspace app (like
 being able to set a different blocksize).  If this interface is acceptable I
 will flesh out the userspace app a little more, but I believe the kernel side
 is ready to go.

 Basically I think online dedup is a huge waste of time and completely useless.
 You are going to want to do different things with different data.  For
 example, for a mailserver you are going to want to have very small blocksizes,
 but for say a virtualization image store you are going to want much larger
 blocksizes.  And let's not get into heterogeneous environments, those just get
 much too complicated.  So my solution is batched dedup, where a user just runs
 this command and it dedups everything at this point.  This avoids the very
 costly overhead of having to hash and look up duplicate extents online and
 lets us be _much_ more flexible about what we want to deduplicate and how we
 want to do it.

 For the userspace app it only does 64k blocks, or whatever is the largest area
 it can read out of a file.  I'm going to extend this to do the following
 things in the near future

 1) Take the blocksize as an argument so we can have bigger/smaller blocks
 2) Have an option to _only_ honor the blocksize, don't try and dedup smaller
 blocks
 3) Use fiemap to try and dedup extents as a whole and just ignore specific
 blocksizes
 4) Use fiemap to determine what would be the most optimal blocksize for the
 data you want to dedup.

 I've tested this out on my setup and it seems to work well.  I appreciate any
 feedback you may have.  Thanks,


FYI: using the clone ioctl you can do the same thing (except for reading the data
and computing the hashes in user space).

Yan, Zheng


Re: Offline Deduplication for Btrfs

2011-01-06 Thread Tomasz Chmielewski

I have been thinking a lot about de-duplication for a backup application
I am writing. I wrote a little script to figure out how much it would
save me. For my laptop home directory, about 100 GiB of data, it was a
couple of percent, depending a bit on the size of the chunks. With 4 KiB
chunks, I would save about two gigabytes. (That's assuming no MD5 hash
collisions.) I don't have VM images, but I do have a fair bit of saved
e-mail. So, for backups, I concluded it was worth it to provide an
option to do this. I have no opinion on whether it is worthwhile to do
in btrfs.


Online deduplication is very useful for backups of big, multi-gigabyte 
files which change constantly.
Some mail servers store files this way; some MUAs store files like 
this; databases also commonly pack everything into big files which 
tend to change here and there almost all the time.


Multi-gigabyte files which only have a few megabytes changed can't be 
hardlinked; simple maths shows that even compressing multiple files 
which have few differences will lead to greater space usage than a few 
megabytes extra in each (because everything else is deduplicated).


And I don't even want to think about the IO needed to offline-dedup 
multi-terabyte storage (1 TB disks and bigger are becoming standard 
nowadays), e.g. daily, especially when the storage is already heavily 
used in IO terms.



Now, one popular tool which can deal with small changes in files is 
rsync. It can be used to copy files over the network - so that if you 
want to copy/update a multi-gigabyte file which only has a few changes, 
rsync would need to transfer just a few megabytes.


On disk however, rsync creates a temporary copy of the original file, 
where it packs unchanged contents together with any changes made. For 
example, while it copies/updates a file, we will have:


original_file.bin
.temporary_random_name

Later, original_file.bin would be removed, and .temporary_random_name 
would be renamed to original_file.bin. There goes any deduplication 
we had so far; we have to start the IO over again.



--
Tomasz Chmielewski
http://wpkg.org


Re: Offline Deduplication for Btrfs

2011-01-06 Thread Gordan Bobic

Chris Mason wrote:

Excerpts from Gordan Bobic's message of 2011-01-05 12:42:42 -0500:

Josef Bacik wrote:


Basically I think online dedup is a huge waste of time and completely useless.

I couldn't disagree more. First, let's consider: what is the 
general-purpose use-case of data deduplication? What are the resource 
requirements to perform it? How do these resource requirements differ 
between online and offline?


I don't really agree with Josef that dedup is dumb, but I do think his
current approach is the most reasonable.  Dedup has a few very valid use
cases, which I think break down to:

1) backups
2) VM images.

The backup farm use case is the best candidate for dedup in general
because they are generally write once and hopefully read never.
Fragmentation for reading doesn't matter at all and we're really very
sure we're going to backup the same files over and over again.

But, it's also something that will be dramatically more efficient when
the backup server helps out.  The backup server knows two files have the
same name, same size and can guess with very high accuracy that they
will be the same.  So it is a very good candidate for Josef's offline
dedup because it can just do the dedup right after writing the file.


File level deduplication in addition to block level would be great, no 
argument there. This can again be done more efficiently in-line, though, 
as the files come in.



Next is the VM images.  This is actually a much better workload for
online dedup, except for the part where our poor storage server would be
spending massive amounts of CPU deduping blocks for all the VMs on the
machine.  In this case the storage server doesn't know the
filenames, it just has bunches of blocks that are likely to be the same
across VMs.


I'm still unconvinced that deduping's major cost is CPU. I think in any 
real case it'll be disk I/O.



So, it seems a bit silly to do this out of band, where we wander through
the FS and read a bunch of blocks in hopes of finding ones with the same
hash.


Except you can get this almost for free. How about this approach:

1) Store a decent size hash for each block (checksum for ECC - something 
like this already happens, it's just a question of what hashing 
algorithm to use)


2) Keep a (btree?) index of all known hashes (this doesn't happen at the 
moment, AFAIK, so this would be the bulk of the new cost for dedup).


Now there are 2 options:

3a) Offline - go through the index, find the blocks with duplicate 
hashes, relink the pointers to one of them and free the rest. There is 
no need to actually read/write any data unless we are doing a full block 
compare; only metadata needs to be updated. The problem with this is 
that you would still have to do a full scan of the index to find 
all the duplicates, unless there is a second index specifically listing 
the duplicate blocks (maintained at insertion time).


3b) Online - look up whether the hash for the current block is already 
in the index ( O(log(n)) ), and if it is, don't bother writing the data 
block, only add a pointer to the existing block. No need for a second 
index with duplicate blocks in this case, either.
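A sketch of option 3b in Python (illustrative only; real code would sit in the
write path and track block references on disk rather than keeping everything in
dictionaries):

import hashlib

index = {}    # hash -> address of a block already on disk
blocks = {}   # address -> data, standing in for the actual disk

def write_block(addr, data):
    """Online path: look the hash up before writing; on a hit, just reference
    the existing block instead of writing a new one."""
    h = hashlib.sha256(data).digest()
    if h in index and blocks[index[h]] == data:   # optional full-block compare
        return index[h]           # point the new extent at the existing block
    index[h] = addr               # new content: remember it for later writes
    blocks[addr] = data
    return addr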



But, one of the things on our features-to-implement page is to wander
through the FS and read all the blocks from time to time.  We want to do
this in the background to make sure the bits haven't rotted on disk.  By
scrubbing from time to time we are able to make sure that when a disk
does die, other disks in the FS are likely to have a good copy.


Scrubbing the whole FS seems like an overly expensive way to do things 
and it also requires low-load times (which don't necessarily exist). How 
about scrubbing disk-by-disk, rather than the whole FS? If we keep 
checksums per block, then each disk can be checked for rot 
independently. It also means that if redundancy is used, the system 
doesn't end up anywhere nearly as crippled during the scrubbing as the 
requests can be served from other disks that are a part of that FS (e.g. 
the mirrored pair in RAID1, or the parity blocks in higher level RAIDs).



So again, Josef's approach actually works very well.  His dedup util
could be the scrubbing util and we'll get two projects for the price of
one.


Indeed, the scrubber would potentially give deduping functionality for 
free, but I'm not convinced that having deduping depend on scrubbing is 
the way forward. This is where we get to multi-tier deduping again - 
perhaps have things markable for online or offline dedupe, as deemed 
more appropriate?



As for the security of hashes, we're unlikely to find a collision on a
sha256 that wasn't made maliciously.  If the system's data is
controlled and you're not worried about evil people putting files on
there, extra reads really aren't required.


If you manage to construct one maliciously that's pretty bad in itself, 
though. :)



But then again, extra reads are a good thing (see above about
scrubbing).


Only under very, very controlled 

Re: Offline Deduplication for Btrfs

2011-01-06 Thread Mike Hommey
On Thu, Jan 06, 2011 at 10:37:46AM +0100, Tomasz Chmielewski wrote:
 I have been thinking a lot about de-duplication for a backup application
 I am writing. I wrote a little script to figure out how much it would
 save me. For my laptop home directory, about 100 GiB of data, it was a
 couple of percent, depending a bit on the size of the chunks. With 4 KiB
 chunks, I would save about two gigabytes. (That's assuming no MD5 hash
 collisions.) I don't have VM images, but I do have a fair bit of saved
 e-mail. So, for backups, I concluded it was worth it to provide an
 option to do this. I have no opinion on whether it is worthwhile to do
 in btrfs.
 
 Online deduplication is very useful for backups of big,
 multi-gigabyte files which change constantly.
 Some mail servers store files this way; some MUA store the files
 like this; databases are also common to pack everything in big files
 which tend to change here and there almost all the time.
 
 Multi-gigabyte files which only have few megabytes changed can't be
 hardlinked; simple maths shows that even compressing multiple files
 which have few differences will lead to greater space usage than a
 few megabytes extra in each (because everything else is
 deduplicated).
 
 And I don't even want to think about IO needed to offline dedup a
 multi-terabyte storage (1 TB disks and bigger are becoming standard
 nowadays) i.e. daily, especially when the storage is already heavily
 used in IO terms.
 
 
 Now, one popular tool which can deal with small changes in files is
 rsync. It can be used to copy files over the network - so that if
 you want to copy/update a multi-gigabyte file which only has a few
 changes, rsync would need to transfer just a few megabytes.
 
 On disk however, rsync creates a temporary copy of the original
 file, where it packs unchanged contents together with any changes
 made. For example, while it copies/updates a file, we will have:
 
 original_file.bin
 .temporary_random_name
 
 Later, original_file.bin would be removed, and
 .temporary_random_name would be renamed to original_file.bin. Here
 goes away any deduplication we had so far, we have to start the IO
 over again.

Sounds like all you need is cp --reflink=always and rsync --inplace.

Haven't tested if that works well, though.

Mike


Re: Offline Deduplication for Btrfs

2011-01-06 Thread Gordan Bobic

Spelic wrote:

On 01/06/2011 02:03 AM, Gordan Bobic wrote:


That's just alarmist. AES is being cryptanalyzed because everything 
uses it. And the news of its insecurity is somewhat exaggerated (for 
now at least).


Who cares... the fact of not being much used is a benefit for RIPEMD / 
Blowfish/Twofish then.

Nobody makes viruses for Linux because they target Windows. Same thing...
RIPEMD still has an advantage over SHA imho, and Blowfish over AES.


Just because nobody has attacked it yet doesn't justify complacency.


If there is a full-block compare, a simpler/faster algorithm could be
chosen, like MD5. Or even a 64-bit digest, which I don't think exists as such,
but you can take MD4 and then XOR the first 8 bytes with the second 8 bytes
to reduce it to 8 bytes only. This is just because it saves 60% of
the RAM occupied during dedup, which is expected to be large, and the
collisions are still insignificant at 64 bits. Clearly you need to do
a full-block compare after that.


I really don't think the cost in terms of a few bytes per file for the 
hashes is that significant.


20 to 8 = 12 bytes saved per *filesystem block*, I think.
Aren't we talking about block-level deduplication?
For every TB of filesystem you occupy 2GB of RAM with hashes instead of 
5.3GB (I am assuming 4K blocks; I don't remember how big btrfs blocks are).
For a 24 * 2TB storage you occupy 96GB instead of 254GB of RAM. It might 
be the edge between feasible and not feasible.
Actually it might not be feasible anyway... an option to store hashes 
on an SSD should be provided then...


You wouldn't necessarily have to keep the whole index in RAM, but if you 
don't, you'd get hit with extra O(log(n)) disk seeks.
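For what it's worth, the folding trick and the RAM figures quoted above work
out roughly like this (illustrative Python; MD5 stands in for MD4 since MD4 is
not always available in hashlib, and 4 KiB blocks are assumed):

import hashlib

def folded64(block: bytes) -> bytes:
    """Fold a 128-bit digest down to 8 bytes by XORing its two halves."""
    d = hashlib.md5(block).digest()       # MD4 in the original suggestion
    return bytes(a ^ b for a, b in zip(d[:8], d[8:16]))

blocks_per_tib = (1 << 40) // 4096        # 268,435,456 four-KiB blocks per TiB
print(blocks_per_tib * 8 / 1e9)           # ~2.1 GB of 8-byte hashes per TiB
print(blocks_per_tib * 20 / 1e9)          # ~5.4 GB of 20-byte hashes per TiB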


Gordan


Re: Offline Deduplication for Btrfs

2011-01-06 Thread Gordan Bobic

Tomasz Chmielewski wrote:

I have been thinking a lot about de-duplication for a backup application
I am writing. I wrote a little script to figure out how much it would
save me. For my laptop home directory, about 100 GiB of data, it was a
couple of percent, depending a bit on the size of the chunks. With 4 KiB
chunks, I would save about two gigabytes. (That's assuming no MD5 hash
collisions.) I don't have VM images, but I do have a fair bit of saved
e-mail. So, for backups, I concluded it was worth it to provide an
option to do this. I have no opinion on whether it is worthwhile to do
in btrfs.


Online deduplication is very useful for backups of big, multi-gigabyte 
files which change constantly.
Some mail servers store files this way; some MUA store the files like 
this; databases are also common to pack everything in big files which 
tend to change here and there almost all the time.


Multi-gigabyte files which only have few megabytes changed can't be 
hardlinked; simple maths shows that even compressing multiple files 
which have few differences will lead to greater space usage than a few 
megabytes extra in each (because everything else is deduplicated).


And I don't even want to think about IO needed to offline dedup a 
multi-terabyte storage (1 TB disks and bigger are becoming standard 
nowadays) i.e. daily, especially when the storage is already heavily 
used in IO terms.



Now, one popular tool which can deal with small changes in files is 
rsync. It can be used to copy files over the network - so that if you 
want to copy/update a multi-gigabyte file which only has a few changes, 
rsync would need to transfer just a few megabytes.


On disk however, rsync creates a temporary copy of the original file, 
where it packs unchanged contents together with any changes made. For 
example, while it copies/updates a file, we will have:


original_file.bin
.temporary_random_name

Later, original_file.bin would be removed, and .temporary_random_name 
would be renamed to original_file.bin. Here goes away any deduplication 
we had so far, we have to start the IO over again.


You can tell rsync to either modify the file in place (--inplace) or to 
put the temp file somewhere else (--temp-dir=DIR).


Gordan


Re: Offline Deduplication for Btrfs

2011-01-06 Thread Simon Farnsworth
Gordan Bobic wrote:

 Josef Bacik wrote:
 
 Basically I think online dedup is huge waste of time and completely
 useless.
 
 I couldn't disagree more. First, let's consider what is the
 general-purpose use-case of data deduplication. What are the resource
 requirements to perform it? How do these resource requirements differ
 between online and offline?
snip

 As an aside, zfs and lessfs both do online deduping, presumably for a
 good reason.
 
 Then again, for a lot of use-cases there are perhaps better ways to
 achieve the targeted goal than deduping on FS level, e.g. snapshotting or
 something like fl-cow:
 http://www.xmailserver.org/flcow.html
 
Just a small point; Josef's work provides a building block for a userspace 
notify-based online dedupe daemon.

The basic idea is to use fanotify/inotify (whichever of the notification 
systems works for this) to track which inodes have been written to. It can 
then mmap() the changed data (before it's been dropped from RAM) and do the 
same process as an offline dedupe (hash, check for matches, call dedupe 
extent ioctl). If you've got enough CPU (maybe running with realtime privs), 
you should be able to do this before writes actually hit the disk.
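A toy sketch of the shape of such a daemon (illustrative Python; a simple mtime
poll stands in for fanotify/inotify, and dedupe_extents() is a stub for the
proposed ioctl, not a real interface):

import hashlib, os, time

def dedupe_extents(path, offset, length, digest):
    """Stub standing in for the proposed dedupe-extent ioctl."""
    print("would dedupe", path, offset, length, digest.hex()[:16])

def watch_and_dedupe(root, block=65536, interval=5):
    seen = {}                                  # path -> last mtime handled
    while True:
        for dirpath, _, names in os.walk(root):
            for name in names:
                p = os.path.join(dirpath, name)
                m = os.stat(p).st_mtime
                if seen.get(p) == m:
                    continue                   # unchanged since last pass
                seen[p] = m
                with open(p, "rb") as f:       # hash the (re)written file
                    off = 0
                    while chunk := f.read(block):
                        dedupe_extents(p, off, len(chunk),
                                       hashlib.sha256(chunk).digest())
                        off += len(chunk)
        time.sleep(interval)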

Further, a userspace daemon can do more sophisticated online dedupe than is 
reasonable in the kernel - e.g. queue the dedupe extent ioctl phase for idle 
time, only dedupe inodes that have been left unwritten for x minutes, 
different policies for different bits of the filesystem (dedupe crontabs 
immediately on write, dedupe outgoing mail spool only when the mail sticks 
around for a while, dedupe all incoming mail immediately, dedupe logfiles 
after rotation only, whatever is appropriate).

It can also do more intelligent trickery than is reasonable in-kernel - e.g. 
if you know that you're deduping e-mail (line-based), you can search line-
by-line for dedupe blocks, rather than byte-by-byte.

Having said all that, you may well find that having implemented a userspace 
online dedupe daemon, there are things the kernel can do to help; you may 
even find that you do need to move it entirely into the kernel. Just don't 
think that this ioctl rules out online dedupe - in fact, it enables it.

-- 
Simon Farnsworth



Re: Offline Deduplication for Btrfs

2011-01-06 Thread Gordan Bobic

Simon Farnsworth wrote:

The basic idea is to use fanotify/inotify (whichever of the notification 
systems works for this) to track which inodes have been written to. It can 
then mmap() the changed data (before it's been dropped from RAM) and do the 
same process as an offline dedupe (hash, check for matches, call dedupe 
extent ioctl). If you've got enough CPU (maybe running with realtime privs), 
you should be able to do this before writes actually hit the disk.


I'm not convinced that racing against the disk write is the way forward 
here.


As for having enough CPU to do this, a lot of modern CPUs (ARM, SPARC, 
Xeon) actually have hardware crypto acceleration/offload, so calculating 
checksums is fast and cheap.


Gordan


Re: Offline Deduplication for Btrfs

2011-01-06 Thread Simon Farnsworth
Gordan Bobic wrote:

 Simon Farnsworth wrote:
 
 The basic idea is to use fanotify/inotify (whichever of the notification
 systems works for this) to track which inodes have been written to. It
 can then mmap() the changed data (before it's been dropped from RAM) and
 do the same process as an offline dedupe (hash, check for matches, call
 dedupe extent ioctl). If you've got enough CPU (maybe running with
 realtime privs), you should be able to do this before writes actually hit
 the disk.
 
 I'm not convinced that racing against the disk write is the way forward
 here.
 
The point is that implementing a userspace online dedupe daemon that races 
against the disk write is something that can be done by anyone who cares as 
soon as Josef's patch is in place; if it's clear that the userspace daemon 
just does something simple enough to put in the kernel (e.g. a fixed block 
size dedupe), and that extra complexity doesn't gain enough to be 
worthwhile, the code can be ported into the kernel before it gets posted 
here.

Similarly, if you're convinced that it has to be in kernel (I'm not a dedupe 
or filesystems expert, so there may be good reasons I'm unaware of), you can 
reuse parts of Josef's code to write your patch that creates a kernel thread 
to do the work.

If it turns out that complex algorithms for online dedupe are worth the 
effort (like line-by-line e-mail dedupe), then you've got a starting point 
for writing something more complex, and determining just what the kernel 
needs to provide to make things nice - maybe it'll be clear that you need an 
interface that lets you hold up a write while you do the simple end of the 
dedupe work, maybe there will be some other kernel interface that's more 
generic than dedupe fixed size blocks that's needed for efficient work.

Either way, Josef's work is a good starting point for online dedupe; you can 
experiment *now* without going into kernel code (heck, maybe even not C - 
Python or Perl would be OK for algorithm exploration), and use the offline 
dedupe support to simplify a patch for online dedupe.

-- 
Simon Farnsworth



Re: Offline Deduplication for Btrfs

2011-01-06 Thread Peter A
On Thursday, January 06, 2011 05:48:18 am you wrote:
 Can you elaborate what you're talking about here? How does the length of
 a directory name affect alignment of file block contents? I don't see
 how variability of length matters, other than to make things a lot more
 complicated.
I'm saying in a filesystem it doesn't matter - if you bundle everything into a 
backup stream, it does. Think of tar. 512-byte alignment. I tar up a 
directory with 8TB total size. No big deal. Now I create a new, empty file in 
this dir with a name that just happens to be the first in the dir. This adds 
512 bytes close to the beginning of the tar file the second time I run tar. Now 
the remainder of the archive is all offset by 512 bytes and, if you do dedupe on 
fs-block-sized chunks larger than the 512 bytes, not a single byte will be 
de-duped.
I know it's a stupid example but it matches the backup-target and database dump 
usage pattern really well. Files backing iSCSI show similar dedupe behavior. 
Essentially every time you bundle multiple files together into one you run into 
things like that.
 
 Have you some real research/scientifically gathered data
 (non-hearsay/non-anecdotal) on the underlying reasons for the
 discrepancy in the deduping effectiveness you describe? 3-17x difference
 doesn't plausibly come purely from fixed vs. variable length block sizes.
Personal experience isn't hearsay :) Netapp publishes a whitepaper against the 
7000 making this a big point, but that isn't publicly available. Try searching for 
zfs dedupe +netbackup or zfs dedupe +datadomain and similar - you will 
hear hundreds of people all complaining about the same thing.

 The only case where I'd bother consider variable length deduping is in
 file deduping (rather than block), in this case we can just make a COW
 hard-link and it's _really_ cheap and effective.
I take your word for this :) 

 
 Gordan

-- 
Censorship: noun, circa 1591. a: Relief of the burden of independent thinking.


Re: Offline Deduplication for Btrfs

2011-01-06 Thread Gordan Bobic

Peter A wrote:

On Thursday, January 06, 2011 05:48:18 am you wrote:

Can you elaborate what you're talking about here? How does the length of
a directory name affect alignment of file block contents? I don't see
how variability of length matters, other than to make things a lot more
complicated.


I'm saying in a filesystem it doesn't matter - if you bundle everything into a 
backup stream, it does. Think of tar. 512 byte allignment. I tar up a 
directory with 8TB total size. No big deal. Now I create a new, empty file in 
this dir with a name that just happens to be the first in the dir. This adds 
512 bytes close to the beginning of the tar file the second time I run tar. Now 
the remainder of the is all offset by 512bytes and, if you do dedupe on fs-

block sized chunks larger than the 512bytes, not a single byte will be de-
duped. 


OK, I get what you mean now. And I don't think this is something that 
should be solved in the file system.


I know its a stupid example but it matches the backup-target and database dump 
usage pattern really well. Files backing iSCSI shows similar dedupe behavior. 
Essentially every time you bundle mutliple files together into one you run into 
things like that.


I suspect a large part of the underlying cause of this is that most 
things still operate on 512 byte sectors. Once that is replaced with 4KB 
sectors, that will probably go away. And if this is the case, then 
perhaps making the block size variable to 512, 1024, 2048 and 4096 
bytes (probably in reverse order) will achieve most of that benefit.


Whether that is a worthwhile thing to do for poorly designed backup 
solutions is debatable, but I'm not convinced about the general use-case. It'd be 
very expensive and complicated for seemingly very limited benefit.



Have you some real research/scientifically gathered data
(non-hearsay/non-anecdotal) on the underlying reasons for the
discrepancy in the deduping effectiveness you describe? 3-17x difference
doesn't plausibly come purely from fixed vs. variable length block sizes.


Personal experience isn't hearsay :) Netapp publishes a whitepaper against the 
7000 making this a big point but that isn't publicly available. Try search 
zfs dedupe +netbackup or zfs dedupe +datadomain and similar - you will 
hear of hundreds of people all complain about the same thing.


Typical. And no doubt they complain that ZFS isn't doing what they want, 
rather than netbackup not co-operating. The solution to one misdesign 
isn't an expensive bodge. The solution to this particular problem is to 
make netbackup work on a per-file rather than per-stream basis.


Gordan


Re: Offline Deduplication for Btrfs

2011-01-06 Thread Ondřej Bílka
On Thu, Jan 06, 2011 at 12:18:34PM +, Simon Farnsworth wrote:
 Gordan Bobic wrote:
 
  Josef Bacik wrote:
  
snip
  
  Then again, for a lot of use-cases there are perhaps better ways to
  achieve the targeted goal than deduping on FS level, e.g. snapshotting or
  something like fl-cow:
  http://www.xmailserver.org/flcow.html
  
As far as VMs are concerned, fl-cow is a poor replacement for deduping.

Upgrading packages? The 1st VM upgrades and copies the changed files.
After a while the second one upgrades and copies its files too. More and more 
becomes duped again.

If you host multiple distributions you also need to translate:
/usr/share/bin/foo in foonux is /usr/bin/bar in barux.

And the primary reason to dedupe is not to reduce space usage but to
improve caching. Why should machine A read a file if machine B read it five 
minutes ago?
 

-- 

user to computer ration too low.


Re: Offline Deduplication for Btrfs

2011-01-06 Thread Gordan Bobic

Ondřej Bílka wrote:


Then again, for a lot of use-cases there are perhaps better ways to
achieve the targeted goal than deduping on FS level, e.g. snapshotting or
something like fl-cow:
http://www.xmailserver.org/flcow.html


As VM are concerned fl-cow is poor replacement of deduping.


Depends on your VM. If your VM uses monolithic images, then you're 
right. For a better solution, take a look at vserver's hashify feature 
for something that does this very well in its own context.



Upgrading packages? 1st vm upgrades and copies changed files.
After while second upgrades and copies files too. More and more becomes duped 
again.


So you want online dedupe, then. :)


If you host multiple distributions you need to translate
that /usr/share/bin/foo in foonux is /us/bin/bar in barux


The chances of the binaries being the same between distros are between 
slim and none. In the context of VMs where you have access to raw files, 
as I said, look at vserver's hashify feature. It doesn't care about file 
names; it will COW hard-link all files with identical content. This 
doesn't even require an exhaustive check of all the files' contents - 
you can start with file sizes. Files that have different sizes can't 
have the same contents, so you can discard most of the comparing before 
you even open the file; most of the work gets done based on metadata alone.
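A sketch of that size-first filtering (illustrative Python; os.link() stands in
for a copy-on-write, break-on-modify hard link, which a plain hard link is not):

import hashlib, os
from collections import defaultdict

def cow_link_identical_files(root):
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):       # metadata pass: sizes only
        for name in names:
            p = os.path.join(dirpath, name)
            if os.path.isfile(p) and not os.path.islink(p):
                by_size[os.path.getsize(p)].append(p)
    for paths in by_size.values():
        if len(paths) < 2:
            continue                              # unique size => unique content
        by_hash = defaultdict(list)
        for p in paths:                           # content pass, only where needed
            with open(p, "rb") as f:
                by_hash[hashlib.sha256(f.read()).digest()].append(p)
        for keep, *dups in by_hash.values():
            for d in dups:
                os.remove(d)
                os.link(keep, d)                  # stand-in for a COW hard link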



And primary reason to dedupe is not to reduce space usage but to
improve caching. Why should machine A read file if machine B read it five 
minutes ago.


Couldn't agree more. This is what I was trying to explain earlier. Even 
if deduping did cause more fragmentation (and I don't think that is the 
case to any significant extent), the improved caching efficiency would 
more than offset this.


Gordan


Re: Offline Deduplication for Btrfs

2011-01-06 Thread Gordan Bobic

Tomasz Torcz wrote:

On Thu, Jan 06, 2011 at 02:19:04AM +0100, Spelic wrote:

CPU can handle considerably more than 250 block hashings per
second. You could argue that this changes in cases of sequential
I/O on big files, but a 1.86GHz GHz Core2 can churn through
111MB/s of SHA256, which even SSDs will struggle to keep up with.

A normal 1TB disk with platters can do 130MB/sec sequential, no problems.
A SSD can do more like 200MB/sec write 280MB/sec read sequential or
random and is actually limited only by the SATA 3.0gbit/sec but soon
enough they will have SATA/SAS 6.0gbit/sec.


  By “soon enough” you really meant “a year ago”, I think:
http://www.anandtech.com/show/3812/the-ssd-diaries-crucials-realssd-c300
Current 6Gbps SSD are doing 415 MB/s sequential:
http://www.anandtech.com/show/4086/microns-realssd-c400-uses-25nm-nand-at-161gb-offers-415mbs-reads
or even claim 550MB/s:
http://www.anandtech.com/show/4100/ocz-vertex-pro-3-demo-worlds-first-sandforce-sf2000
(funny bit: Sandforce SSD controllers dedup internally). 


  Anyway, 6Gbps is not a future tale, but something long available.
And not the fastest kids on the block: currently built filesystems
must deal with storage providing many gigabytes per second.  Think
of massive disk arrays or stuff like the Oracle F5100, claiming
12.8GB/sec read and ~10GB/s write (in one rack unit).


Sequential figures look nice and impressive but we all know they are 
meaningless for most real world workloads. IOPS are where it's at. And 
maybe you can get 100,000 IOPS out of an SSD. But that still means 
100,000 SHA256 hashes/second. That's 3.2MB/s of SHA256 hashes, or about 
2% of what a modern x64 CPU will do, assuming it doesn't have a suitable 
hardware crypto accelerator for that algorithm. So on a reasonably 
recent quad core CPU you would probably be able to comfortably handle 
about 200x that before it starts becoming an issue. If you're that 
concerned about space requirements, doing LZO compression will still be 
much more expensive.


And that's only for writes - on reads we don't need to do any hashing 
(although it's useful to do for the disk error checking reasons 
explained earlier).


Gordan


Re: Offline Deduplication for Btrfs

2011-01-06 Thread Peter A
On Thursday, January 06, 2011 09:00:47 am you wrote:
 Peter A wrote:
  I'm saying in a filesystem it doesn't matter - if you bundle everything
  into a backup stream, it does. Think of tar. 512 byte allignment. I tar
  up a directory with 8TB total size. No big deal. Now I create a new,
  empty file in this dir with a name that just happens to be the first in
  the dir. This adds 512 bytes close to the beginning of the tar file the
  second time I run tar. Now the remainder of the is all offset by
  512bytes and, if you do dedupe on fs- block sized chunks larger than the
  512bytes, not a single byte will be de- duped.
 
 OK, I get what you mean now. And I don't think this is something that
 should be solved in the file system.
snip
 Whether than is a worthwhile thing to do for poorly designed backup
 solutions, but I'm not convinced about the general use-case. It'd be
 very expensive and complicated for seemingly very limited benefit.
Glad I finally explained myself properly... Unfortunately I disagree with you 
on the rest. If you take that logic, then I could claim dedupe is nothing a 
file system should handle - after all, it's the user's poorly designed 
applications that store multiple copies of data. Why should the fs take care 
of that? 

The problem doesn't just affect backups. It affects everything where you have 
large data files that are not forced to align with filesystem blocks. In 
addition to the case I mentioned above, it affects the following in pretty much 
the same way:
* Database dumps 
* Video editing 
* Files backing iSCSI volumes
* VM images (fs blocks inside the VM rarely align with fs blocks in the 
backing storage). Our VM environment is backed with a 7410 and we get only 
about 10% dedupe. Copying the same images to a DataDomain results in a 60% 
reduction in space used.

Basically, every time I end up using a lot of storage space, its in a scenario 
where fs-block based dedupe is not very effective.

I also have to argue the point that these usages are poorly designed. Poorly 
designed can only apply to technologies that existed or were talked about at 
the time the design was made. Tar and such have been around for a long time, 
way before anyone even thought of dedupe. In addition, until there is a 
commonly accepted/standard API to query the block size so apps can generate 
files appropriately laid out for the backing filesystem, what is the 
application 
supposed to do? 
If anything, I would actually argue the opposite, that fixed-block dedupe is a 
poor design:
* The problem was known at the time the design was made
* No alternative can be offered, as tar, netbackup, video editing, ... have been 
around for a long time and are unlikely to change in the near future
* There is no standard API to query the alignment parameters (and even that 
would not be great, since copying a file aligned for 8k to a 16k-aligned 
filesystem would potentially cause the same issue again)

Also, from the human perspective, it's hard to make end users understand your 
point of view. I promote the 7000 series of storage and I know how hard it is 
to explain the dedupe behavior there. They see that DataDomain does it, and 
does it well. So why can't solution xyz do it just as well?

 Typical. And no doubt they complain that ZFS isn't doing what they want,
 rather than netbackup not co-operating. The solution to one misdesign
 isn't an expensive bodge. The solution to this particular problem is to
 make netbackup work on per-file rather than per stream basis.
I'd agree if it was just limited to netbackup... I know variable block length 
is a significantly more difficult problem than block level. That's why the ZFS 
team made the design choice they did. Variable length is also the reason why 
the DataDomain solution is a scale-out rather than a scale-up approach. 
However, CPUs get faster and faster - eventually they'll be able to handle it. 
So the right solution (from my limited point of view; as I said, I'm not a 
filesystem design expert) would be to implement the data structures to handle 
variable length. Then, in the first iteration, implement the dedupe algorithm to 
only search on filesystem blocks using existing checksums and such. Less CPU 
usage, quicker development, easier debugging. Once that is stable and proven, 
you can then, without requiring the user to reformat, go ahead and implement 
variable-length dedupe...

Btw, thanks for your time, Gordan :)

Peter.

-- 
Censorship: noun, circa 1591. a: Relief of the burden of independent thinking.


Re: Offline Deduplication for Btrfs

2011-01-06 Thread Gordan Bobic

Peter A wrote:

On Thursday, January 06, 2011 09:00:47 am you wrote:

Peter A wrote:

I'm saying in a filesystem it doesn't matter - if you bundle everything
into a backup stream, it does. Think of tar. 512 byte allignment. I tar
up a directory with 8TB total size. No big deal. Now I create a new,
empty file in this dir with a name that just happens to be the first in
the dir. This adds 512 bytes close to the beginning of the tar file the
second time I run tar. Now the remainder of the is all offset by
512bytes and, if you do dedupe on fs- block sized chunks larger than the
512bytes, not a single byte will be de- duped.

OK, I get what you mean now. And I don't think this is something that
should be solved in the file system.

snip

Whether than is a worthwhile thing to do for poorly designed backup
solutions, but I'm not convinced about the general use-case. It'd be
very expensive and complicated for seemingly very limited benefit.
Glad I finally explained myself properly... Unfortunately I disagree with you 
on the rest. If you take that logic, then I could claim dedupe is nothing a 
file system should handle - after all, its the user's poorly designed 
applications that store multiple copies of data. Why should the fs take care 
of that? 


There is merit in that point. Some applications do in fact do their own 
deduplication, as mentioned previously on this thread.


The problem doesn't just affect backups. It affects everything where you have 
large data files that are not forced to allign with filesystem blocks. In 
addition to the case I mentioned above this affects in pretty much the same 
effectiveness:
* Database dumps 
* Video Editing 
* Files backing iSCSI volumes
* VM Images (fs blocks inside the VM rarely align with fs blocks in the 
backing storage). Our VM environment is backed with a 7410 and we get only 
about 10% dedupe. Copying the same images to a DataDomain results in a 60% 
reduction in space used.


I'd be interested to hear about the relative write performance on the 
variable block size.


I also have to argue the point that these usages are poorly designed. Poorly 
designed can only apply to technologies that existed or were talked about at 
the time the design was made. Tar and such have been around for a long time, 
way before anyone even though of dedupe. In addition, until there is a 
commonly accepted/standard API to query the block size so apps can generate 
files appropriately laid out for the backing filesystem, what is the application 
supposed to do?


Indeed, this goes philosophically in the direction that storage vendors 
have been (unsuccessfully) trying to shift the industry for decades 
- object-based storage. But, sadly, it hasn't happened (yet?).


If anything, I would actually argue the opposite, that fixed block dedupe is a 
poor design:

* The problem is known at the time the design was made
* No alternative can be offered as tar, netbackup, video editing, ... has been 
around for a long time and is unlikely to change in the near future
* There is no standard API to query the allignment parameters (and even that 
would not be great since copying a file alligned for 8k to a 16k alligned 
filesystem, would potentially cause the same issue again)


Also from the human perspective its hard to make end users understand your 
point of view. I promote the 7000 series of storage and I know how hard it is 
to explain the dedupe behavior there. They see that Datadomain does it, and 
does it well. So why can't solution xyz do it just as good?


I'd be interested to see the evidence for the variable length argument. 
I have a sneaking suspicion that it actually falls back to 512-byte 
blocks, which are much more likely to align, when more sensibly sized 
blocks fail. The downside is that you don't really want to store a 32-byte 
hash key with every 512 bytes of data, so you could peel 512-byte 
blocks off the front in the hope that a bigger block that follows 
will match.


Thinking about it, this might actually not be too expensive to do. If 
the 4KB block doesn't match, check 512 byte sub-blocks, and try peeling 
them, to make the next one line up. If it doesn't, store the mismatch as 
a full 4KB block and resume. If you do find a match, save the peeled 512 
byte blocks separately and dedupe the 4KB block.


In fact, it's rather like the loop peeling optimization in a compiler, 
which allows you to align the data to a boundary suitable for vectorizing.
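A sketch of that peeling scan (illustrative Python; `index` maps the hash of a
4 KiB block to its location on disk):

import hashlib

def peel_and_dedupe(data, index, block=4096, sub=512):
    out, pos = [], 0                  # ("dup", location) / ("literal", bytes)
    while pos + block <= len(data):
        matched = False
        for shift in range(0, block, sub):     # try the eight 512-byte offsets
            end = pos + shift + block
            if end > len(data):
                break
            h = hashlib.sha256(data[pos + shift:end]).digest()
            if h in index:
                if shift:                      # emit the peeled 512-byte prefix
                    out.append(("literal", data[pos:pos + shift]))
                out.append(("dup", index[h]))  # and dedupe the realigned block
                pos, matched = end, True
                break
        if not matched:                        # nothing lines up: store as-is
            out.append(("literal", data[pos:pos + block]))
            pos += block
    if pos < len(data):
        out.append(("literal", data[pos:]))    # short tail
    return out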



Typical. And no doubt they complain that ZFS isn't doing what they want,
rather than netbackup not co-operating. The solution to one misdesign
isn't an expensive bodge. The solution to this particular problem is to
make netbackup work on per-file rather than per stream basis.


I'd agree if it was just limited to netbackup... I know variable block length 
is a significantly more difficult problem than block level. That's why the ZFS 
team made the design choice they did. Variable length is also the reason why 

Re: Offline Deduplication for Btrfs

2011-01-06 Thread Ondřej Bílka
On Thu, Jan 06, 2011 at 02:41:28PM +, Gordan Bobic wrote:
 Ondřej Bílka wrote:
 
 Then again, for a lot of use-cases there are perhaps better ways to
 achieve the targeted goal than deduping on FS level, e.g. snapshotting or
 something like fl-cow:
 http://www.xmailserver.org/flcow.html
 
 As VM are concerned fl-cow is poor replacement of deduping.
 
 Depends on your VM. If your VM uses monolithic images, then you're
 right. For a better solution, take a look at vserver's hashify
 feature for something that does this very well in it's own context.
 
 Upgrading packages? 1st vm upgrades and copies changed files.
 After while second upgrades and copies files too. More and more becomes 
 duped again.
 
 So you want online dedupe, then. :)
 
 If you host multiple distributions you need to translate
 that /usr/share/bin/foo in foonux is /us/bin/bar in barux
 
 The chances of the binaries being the same between distros are
 between slim and none. In the context of VMs where you have access
 to raw files, as I said, look at vserver's hashify feature. It
 doesn't care about file names, it will COW hard-link all files with
 identical content. This doesn't even require an exhaustive check of
 all the files' contents - you can start with file sizes. Files that
 have different sizes can't have the same contents, so you can
 discard most of the comparing before you even open the file, most of
 the work gets done based on metadata alone.
 
Yes, I wrote this as a quick example. On second thought, files shared 
between distros are typically read-only (like manpages).


 And primary reason to dedupe is not to reduce space usage but to
 improve caching. Why should machine A read file if machine B read it five 
 minutes ago.
 
 Couldn't agree more. This is what I was trying to explain earlier.
 Even if deduping did cause more fragmentation (and I don't think
 that is the case to any significant extent), the improved caching
 efficiency would more than offset this.
 
 Gordan

-- 

Program load too heavy for processor to lift.


Re: Offline Deduplication for Btrfs

2011-01-06 Thread Peter A
On Thursday, January 06, 2011 10:07:03 am you wrote:
 I'd be interested to see the evidence of the variable length argument.
 I have a sneaky suspicion that it actually falls back to 512 byte
 blocks, which are much more likely to align, when more sensibly sized
 blocks fail. The downside is that you don't really want to store a 32
 byte hash key with every 512 bytes of data, so you could peel off 512
 byte blocks off the front in a hope that a bigger block that follows
 will match.
 
 Thinking about it, this might actually not be too expensive to do. If
 the 4KB block doesn't match, check 512 byte sub-blocks, and try peeling
 them, to make the next one line up. If it doesn't, store the mismatch as
 a full 4KB block and resume. If you do find a match, save the peeled 512
 byte blocks separately and dedupe the 4KB block.
 
 In fact, it's rather like the loop peeling optimization on a compiler,
 that allows you to align the data to the boundary suitable for vectorizing.
I'm not sure about this but to be honest I cannot see any other way. 
Otherwise, how would you ever find a match? You cannot store checksums of 
random sub-sections hoping that eventually stuff will match up... 512 bytes is 
probably the best choice as it's been a known block size since the dawn of 
Unix.


 Actually, see above - I believe I was wrong about how expensive
 variable length block size is likely to be. It's more expensive, sure,
 but not orders of magnitude more expensive, and as discussed earlier,
 given the CPU isn't really the key bottleneck here, I think it'd be
 quite workable.
Hmm - from my work with DD systems, it seems to be the CPU that ends up being 
the limiting factor on dedupe performance... But again, it seems that you have 
much deeper insight into this topic and have already drawn up an algorithm in 
your mind :)

Peter.

-- 
Censorship: noun, circa 1591. a: Relief of the burden of independent thinking.


Re: Offline Deduplication for Btrfs

2011-01-06 Thread Hubert Kario
On Thursday 06 of January 2011 10:51:04 Mike Hommey wrote:
 On Thu, Jan 06, 2011 at 10:37:46AM +0100, Tomasz Chmielewski wrote:
  I have been thinking a lot about de-duplication for a backup application
  I am writing. I wrote a little script to figure out how much it would
  save me. For my laptop home directory, about 100 GiB of data, it was a
  couple of percent, depending a bit on the size of the chunks. With 4 KiB
  chunks, I would save about two gigabytes. (That's assuming no MD5 hash
  collisions.) I don't have VM images, but I do have a fair bit of saved
  e-mail. So, for backups, I concluded it was worth it to provide an
  option to do this. I have no opinion on whether it is worthwhile to do
  in btrfs.
  
  Online deduplication is very useful for backups of big,
  multi-gigabyte files which change constantly.
  Some mail servers store files this way; some MUA store the files
  like this; databases are also common to pack everything in big files
  which tend to change here and there almost all the time.
  
  Multi-gigabyte files which only have few megabytes changed can't be
  hardlinked; simple maths shows that even compressing multiple files
  which have few differences will lead to greater space usage than a
  few megabytes extra in each (because everything else is
  deduplicated).
  
  And I don't even want to think about IO needed to offline dedup a
  multi-terabyte storage (1 TB disks and bigger are becoming standard
  nowadays) i.e. daily, especially when the storage is already heavily
  used in IO terms.
  
  
  Now, one popular tool which can deal with small changes in files is
  rsync. It can be used to copy files over the network - so that if
  you want to copy/update a multi-gigabyte file which only has a few
  changes, rsync would need to transfer just a few megabytes.
  
  On disk however, rsync creates a temporary copy of the original
  file, where it packs unchanged contents together with any changes
  made. For example, while it copies/updates a file, we will have:
  
  original_file.bin
  .temporary_random_name
  
  Later, original_file.bin would be removed, and
  .temporary_random_name would be renamed to original_file.bin. Here
  goes away any deduplication we had so far, we have to start the IO
  over again.
 
 Sounds like all you need is cp --reflink=always and rsync --inplace.
 
 Haven't tested if that works well, though.

It works very well: btrfs with snapshots, compression and rsync --inplace has 
better storage utilisation than lessfs at around 10-15 snapshots with around 
600GB of test data in small files.

-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Offline Deduplication for Btrfs V2

2011-01-06 Thread Josef Bacik
Just a quick update, I've dropped the hashing stuff in favor of doing a memcmp
in the kernel to make sure the data is still the same.  The thing that takes a
while is reading the data up from disk, so doing a memcmp of the entire buffer
isn't that big of a deal, not to mention there's a possibility for malicious
users if there is a problem with the hashing algorithms we use.  Plus this makes
the interface simpler.  Thanks,

Josef
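
(For illustration only: in this patch set the byte-for-byte check happens 
inside the kernel, but a userspace equivalent of the idea is simple enough 
to sketch. The helper below is hypothetical, not part of the actual tool.)

    def regions_identical(path_a, off_a, path_b, off_b, length, chunk=1 << 20):
        """Compare two file regions byte for byte, in chunks, before
        treating them as duplicates - the userspace analogue of the
        in-kernel memcmp described above."""
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            fa.seek(off_a)
            fb.seek(off_b)
            remaining = length
            while remaining > 0:
                n = min(chunk, remaining)
                a = fa.read(n)
                b = fb.read(n)
                if len(a) != n or a != b:
                    return False
                remaining -= n
        return True
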
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Offline Deduplication for Btrfs

2011-01-06 Thread Chris Mason
Excerpts from Peter A's message of 2011-01-05 22:58:36 -0500:
 On Wednesday, January 05, 2011 08:19:04 pm Spelic wrote:
   I'd just make it always use the fs block size. No point in making it 
   variable.
  
  Agreed. What is the reason for variable block size?
 
 First post on this list - I mostly was just reading so far to learn more on 
 fs 
 design but this is one topic I (unfortunately) have experience with... 
 
 You wouldn't believe the difference variable block size dedupe makes. For a 
 pure fileserver, its ok to dedupe on block level but for most other uses, 
 variable is king. One big example is backups. Netbackup and most others 
 produce one stream with all data even when backing up to disk. Imagine you 
 move a whole lot of data from one dir to another. Think a directory with huge 
 video files. As a filesystem it would be de-duped nicely. The backup stream 
 however may and may not have matching fs blocks. If the directory name before 
 and after has the same lengths and such - then yeah, dedupe works. Directory 
 name is a byte shorter? Everything in the stream will be offset by one byte - 
 and no dedupe will occur at all on the whole dataset. In real world just 
 compare the dedupe performance of an Oracle 7000 (zfs and therefore fs block 
 based) to a DataDomain (variable lenght) in this usage scenario. Among our 
 customers we see something like 3 to 17x dedupe ration on the DD, 1.02 - 1.05 
 in the 7000.

What is the smallest granularity that the datadomain searches for in
terms of dedup?

Josef's current setup isn't restricted to a specific block size, but
there is a min match of 4k.

-chris
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Offline Deduplication for Btrfs

2011-01-05 Thread Josef Bacik
Here are patches to do offline deduplication for Btrfs.  It works well for the
cases it's expected to, I'm looking for feedback on the ioctl interface and
such, I'm well aware there are missing features for the userspace app (like
being able to set a different blocksize).  If this interface is acceptable I
will flesh out the userspace app a little more, but I believe the kernel side is
ready to go.

Basically I think online dedup is a huge waste of time and completely useless.
You are going to want to do different things with different data.  For example,
for a mailserver you are going to want to have very small blocksizes, but for
say a virtualization image store you are going to want much larger blocksizes.
And let's not get into heterogeneous environments, those just get much too
complicated.  So my solution is batched dedup, where a user just runs this
command and it dedups everything at this point.  This avoids the very costly
overhead of having to hash and lookup for duplicate extents online and lets us
be _much_ more flexible about what we want to deduplicate and how we want to do
it.

For the userspace app it only does 64k blocks, or whatever the largest area it
can read out of a file.  I'm going to extend this to do the following things in
the near future

1) Take the blocksize as an argument so we can have bigger/smaller blocks
2) Have an option to _only_ honor the blocksize, don't try and dedup smaller
blocks
3) Use fiemap to try and dedup extents as a whole and just ignore specific
blocksizes
4) Use fiemap to determine what would be the most optimal blocksize for the data
you want to dedup.

I've tested this out on my setup and it seems to work well.  I appreciate any
feedback you may have.  Thanks,

Josef
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Offline Deduplication for Btrfs

2011-01-05 Thread Gordan Bobic

Josef Bacik wrote:


Basically I think online dedup is huge waste of time and completely useless.


I couldn't disagree more. First, let's consider what is the 
general-purpose use-case of data deduplication. What are the resource 
requirements to perform it? How do these resource requirements differ 
between online and offline?


The only sane way to keep track of hashes of existing blocks is using an 
index. Searches through an index containing evenly distributed data 
(such as hashes) are pretty fast (O(log N)), and this has to be done 
regardless of whether the dedupe is online or offline. It also goes 
without saying that all the blocks being deduplicated need to be hashed, 
and the cost of this is also the same whether the block is hashed online 
or offline.


Let's look at the relative merits:

1a) Offline
We have to copy the entire data set. This means we are using the full 
amount of disk writes that the data set size dictates. Do we do the 
hashing of current blocks at this point to create the indexes? Or do we 
defer it until some later time?


Doing it at the point of writes is cheaper - we already have the data in 
RAM and we can calculate the hashes as we are writing each block. 
Performance implications of this are fairly analogous to the parity RAID 
RMW performance issue - to achieve decent performance you have to write 
the parity at the same time as the rest of the stripe, otherwise you 
have to read the part of the stripe you didn't write, before calculating 
the checksum.


So by doing the hash indexing offline, the total amount of disk I/O 
required effectively doubles, and the amount of CPU spent on doing the 
hashing is in no way reduced.


How is this in any way advantageous?

1b) Online
As we are writing the data, we calculate the hashes for each block. (See 
1a for argument of why I believe this is saner and cheaper than doing it 
offline.) Since we already have these hashes, we can do a look-up in the 
hash-index, and either write out the block as is (if that hash isn't 
already in the index) or simply write the pointer to an existing 
suitable block (if it already exists). This saves us writing out that 
block - fewer writes to the disk, not to mention we don't later have to 
re-read the block to dedupe it.


So in this case, instead of write-read-relink of the offline scenario, 
we simply do relink on duplicate blocks.
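
(To make the comparison concrete, the online path being argued for here 
amounts to something like the sketch below. `allocate` and `relink` are 
hypothetical stand-ins for the filesystem's block allocator and 
extent-reference code, and the in-memory dict stands in for the persistent 
hash index.)

    import hashlib

    index = {}   # digest -> block address (stand-in for the on-disk index)

    def write_block(data, allocate, relink):
        """Hash at write time, then either reference an existing block
        (duplicate) or write the data out and record its digest (new)."""
        digest = hashlib.sha256(data).digest()
        if digest in index:
            relink(index[digest])            # duplicate: add a reference, no write
        else:
            index[digest] = allocate(data)   # unique: write it and index it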


There is another reason to favour the online option due to its lower 
write stress - SSDs. Why hammer the SSD with totally unnecessary writes?


The _only_ reason to defer deduping is that hashing costs CPU time. But 
the chances are that a modern CPU core can churn out MD5 and/or SHA256 
hashes faster than a modern mechanical disk can keep up. A 15,000rpm 
disk can theoretically handle 250 IOPS. A modern CPU can handle 
considerably more than 250 block hashings per second. You could argue 
that this changes in cases of sequential I/O on big files, but a 1.86GHz 
Core2 can churn through 111MB/s of SHA256, which even SSDs will 
struggle to keep up with.
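
(The hashing-throughput claim is easy enough to sanity-check on your own 
hardware with a few lines of Python; the exact number will obviously vary 
with CPU and hash choice.)

    import hashlib, os, time

    buf = os.urandom(64 * 1024 * 1024)          # 64 MiB of random data
    start = time.perf_counter()
    hashlib.sha256(buf).hexdigest()
    elapsed = time.perf_counter() - start
    print("SHA256: %.1f MB/s" % (len(buf) / elapsed / 1e6))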


I don't think that the realtime performance argument withstands scrutiny.


You are going to want to do different things with different data.  For example,
for a mailserver you are going to want to have very small blocksizes, but for
say a virtualization image store you are going to want much larger blocksizes.
And lets not get into heterogeneous environments, those just get much too
complicated.


In terms of deduplication, IMO it should really all be uniform, 
transparent, and block based. In terms of specifying which subtrees to 
dedupe, that should really be a per subdirectory hereditary attribute, 
kind of like compression was supposed to work with chattr +c in the past.



So my solution is batched dedup, where a user just runs this
command and it dedups everything at this point.  This avoids the very costly
overhead of having to hash and lookup for duplicate extents online and lets us
be _much_ more flexible about what we want to deduplicate and how we want to do
it.


I don't see that it adds any flexibility compared to the hereditary 
deduping attribute. I also don't see that it is any cheaper. It's 
actually more expensive, according to the reasoning above.


As an aside, zfs and lessfs both do online deduping, presumably for a 
good reason.


Then again, for a lot of use-cases there are perhaps better ways to 
achieve the targeted goal than deduping on FS level, e.g. snapshotting or 
something like fl-cow:

http://www.xmailserver.org/flcow.html

Personally, I would still like to see a fl-cow like solution that 
actually preserves the inode numbers of duplicate files while providing 
COW functionality that breaks this unity (and inode number identity) 
upon writes, specifically because it saves page cache (only have to 
cache one copy) and in case of DLLs on chroot style virtualization 
(OpenVZ, Vserver, LXC) means that identical DLLs in all the guests are 
all mapped into the same memory, thus 

Re: Offline Deduplication for Btrfs

2011-01-05 Thread Josef Bacik
On Wed, Jan 05, 2011 at 05:42:42PM +, Gordan Bobic wrote:
 Josef Bacik wrote:

 Basically I think online dedup is huge waste of time and completely useless.

 I couldn't disagree more. First, let's consider what is the  
 general-purpose use-case of data deduplication. What are the resource  
 requirements to perform it? How do these resource requirements differ  
 between online and offline?

 The only sane way to keep track of hashes of existing blocks is using an  
 index. Searches through an index containing evenly distributed data  
 (such as hashes) is pretty fast (log(N)), and this has to be done  
 regardless of whether the dedupe is online or offline. It also goes  
 without saying that all the blocks being deduplicated need to be hashed,  
 and the cost of this is also the same whether the block is hashes online  
 or offline.

 Let's look at the relative merits:

 1a) Offline
 We have to copy the entire data set. This means we are using the full  
 amount of disk writes that the data set size dictates. Do we do the  
 hashing of current blocks at this point to create the indexes? Or do we  
 defer it until some later time?

 Doing it at the point of writes is cheaper - we already have the data in  
 RAM and we can calculate the hashes as we are writing each block.  
 Performance implications of this are fairly analogous to the parity RAID  
 RMW performance issue - to achieve decent performance you have to write  
 the parity at the same time as the rest of the stripe, otherwise you  
 have to read the part of the stripe you didn't write, before calculating  
 the checksum.

 So by doing the hash indexing offline, the total amount of disk I/O  
 required effectively doubles, and the amount of CPU spent on doing the  
 hashing is in no way reduced.

 How is this in any way advantageous?

 1b) Online
 As we are writing the data, we calculate the hashes for each block. (See  
 1a for argument of why I believe this is saner and cheaper than doing it  
 offline.) Since we already have these hashes, we can do a look-up in the  
 hash-index, and either write out the block as is (if that hash isn't  
 already in the index) or simply write the pointer to an existing  
 suitable block (if it already exists). This saves us writing out that  
 block - fewer writes to the disk, not to mention we don't later have to  
 re-read the block to dedupe it.

 So in this case, instead of write-read-relink of the offline scenario,  
 we simply do relink on duplicate blocks.

 There is another reason to favour the online option due to it's lower  
 write stress - SSDs. Why hammer the SSD with totally unnecessary writes?

 The _only_ reason to defer deduping is that hashing costs CPU time. But  
 the chances are that a modern CPU core can churn out MD5 and/or SHA256  
 hashes faster than a modern mechanical disk can keep up. A 15,000rpm  
 disk can theoretically handle 250 IOPS. A modern CPU can handle  
 considerably more than 250 block hashings per second. You could argue  
 that this changes in cases of sequential I/O on big files, but a 1.86GHz  
 GHz Core2 can churn through 111MB/s of SHA256, which even SSDs will  
 struggle to keep up with.

 I don't think that the realtime performance argument withstands scrutiny.

 You are going to want to do different things with different data.  For 
 example,
 for a mailserver you are going to want to have very small blocksizes, but for
 say a virtualization image store you are going to want much larger 
 blocksizes.
 And lets not get into heterogeneous environments, those just get much too
 complicated.

 In terms of deduplication, IMO it should really all be uniform,  
 transparent, and block based. In terms of specifying which subtrees to  
 dedupe, that should really be a per subdirectory hereditary attribute,  
 kind of like compression was supposed to work with chattr +c in the past.

 So my solution is batched dedup, where a user just runs this
 command and it dedups everything at this point.  This avoids the very costly
 overhead of having to hash and lookup for duplicate extents online and lets 
 us
 be _much_ more flexible about what we want to deduplicate and how we want to 
 do
 it.

 I don't see that it adds any flexibility compared to the hereditary  
 deduping attribute. I also don't see that it is any cheaper. It's  
 actually more expensive, according to the reasoning above.

 As an aside, zfs and lessfs both do online deduping, presumably for a  
 good reason.

 Then again, for a lot of use-cases there are perhaps better ways to  
 achieve the targed goal than deduping on FS level, e.g. snapshotting or  
 something like fl-cow:
 http://www.xmailserver.org/flcow.html

 Personally, I would still like to see a fl-cow like solution that  
 actually preserves the inode numbers of duplicate files while providing  
 COW functionality that breaks this unity (and inode number identity)  
 upon writes, specifically because it saves page cache (only have to  
 

Re: Offline Deduplication for Btrfs

2011-01-05 Thread Ray Van Dolson
On Wed, Jan 05, 2011 at 07:41:13PM +0100, Diego Calleja wrote:
 On Wednesday, 5 January 2011 18:42:42 Gordan Bobic wrote:
  So by doing the hash indexing offline, the total amount of disk I/O 
  required effectively doubles, and the amount of CPU spent on doing the 
  hashing is in no way reduced.
 
 But there are people who might want to avoid temporally the extra cost
 of online dedup, and do it offline when the server load is smaller.
 
 In my opinion, both online and offline dedup have valid use cases, and
 the best choice is probably implement both.

Question from an end-user.  When we say offline deduplication, are we
talking about post-process deduplication (a la what Data ONTAP does
with their SIS implementation) during which the underlying file system
data continues to be available, or a process that needs exclusive
access to the blocks to do its job?

Ray
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Offline Deduplication for Btrfs

2011-01-05 Thread Lars Wirzenius
On ke, 2011-01-05 at 14:46 -0500, Josef Bacik wrote:
 Blah blah blah, I'm not having an argument about which is better because I
 simply do not care.  I think dedup is silly to begin with, and online dedup 
 even
 sillier.  The only reason I did offline dedup was because I was just toying
 around with a simple userspace app to see exactly how much I would save if I 
 did
 dedup on my normal system, and with 107 gigabytes in use, I'd save 300
 megabytes.  I'll say that again, with 107 gigabytes in use, I'd save 300
 megabytes.  So in the normal user case dedup would have been wholey useless to
 me.

I have been thinking a lot about de-duplication for a backup application
I am writing. I wrote a little script to figure out how much it would
save me. For my laptop home directory, about 100 GiB of data, it was a
couple of percent, depending a bit on the size of the chunks. With 4 KiB
chunks, I would save about two gigabytes. (That's assuming no MD5 hash
collisions.) I don't have VM images, but I do have a fair bit of saved
e-mail. So, for backups, I concluded it was worth it to provide an
option to do this. I have no opinion on whether it is worthwhile to do
in btrfs.

(For my script, see find-duplicate-chunks in
http://code.liw.fi/debian/pool/main/o/obnam/obnam_0.14.tar.gz or get the
current code using "bzr get http://code.liw.fi/obnam/bzr/trunk/".
http://braawi.org/obnam/ is the home page of the backup app.)
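
(For readers who just want a quick estimate of their own data, a minimal 
stand-in for this kind of measurement is sketched below. It is not the 
find-duplicate-chunks script itself, just the same idea: hash every 
fixed-size chunk and add up what a dedup at that chunk size could save. The 
final partial chunk of each file is counted at full chunk size, so the 
result is approximate.)

    import hashlib, os, sys
    from collections import Counter

    def duplicate_bytes(root, chunk_size=4096):
        """Walk a tree, hash every chunk_size-byte chunk, and estimate how
        many bytes a dedup at this chunk size could reclaim."""
        counts = Counter()
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                try:
                    with open(os.path.join(dirpath, name), "rb") as f:
                        while True:
                            chunk = f.read(chunk_size)
                            if not chunk:
                                break
                            counts[hashlib.md5(chunk).digest()] += 1
                except OSError:
                    continue
        return sum((n - 1) * chunk_size for n in counts.values() if n > 1)

    if __name__ == "__main__":
        print(duplicate_bytes(sys.argv[1]), "bytes duplicated")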

-- 
Blog/wiki/website hosting with ikiwiki (free for free software):
http://www.branchable.com/

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Offline Deduplication for Btrfs

2011-01-05 Thread Freddie Cash
On Wed, Jan 5, 2011 at 11:46 AM, Josef Bacik jo...@redhat.com wrote:
 Dedup is only usefull if you _know_ you are going to have duplicate 
 information,
 so the two major usecases that come to mind are

 1) Mail server.  You have small files, probably less than 4k (blocksize) that
 you are storing hundreds to thousands of.  Using dedup would be good for this
 case, and you'd have to have a small dedup blocksize for it to be usefull.

 2) Virtualized guests.  If you have 5 different RHEL5 virt guests, chances are
 you are going to share data between them, but unlike with the mail server
 example, you are likely to find much larger chunks that are the same, so you'd
 want a larger dedup blocksize, say 64k.  You want this because if you did just
 4k you'd end up with a ridiculous amount of framentation and performance would
 go down the toilet, so you need a larger dedup blocksize to make for better
 performance.

You missed out on the most obvious, and useful, use case for dedupe:
  central backups server.

Our current backup server does an rsync backup of 127 servers every
night into a single ZFS pool.  90+ of those servers are identical
Debian installs (school servers), 20-odd of those are identical
FreeBSD installs (firewalls/routers), and the rest are mail/web/db
servers (Debian, Ubuntu, RedHat, Windows).

Just as a test, we copied a week of backups to a Linux box running
ZFS-fuse with dedupe enabled, and had a combined dedupe/compress
ratio in the low double digits (11 or 12x, something like that).  Now
we're just waiting for ZFSv22+ to hit FreeBSD to enable dedupe on the
backups server.

For backups, and central storage for VMs, online dedupe is a massive
win.  Offline, maybe.  Either way, dedupe is worthwhile.

-- 
Freddie Cash
fjwc...@gmail.com
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Offline Deduplication for Btrfs

2011-01-05 Thread Josef Bacik
On Wed, Jan 05, 2011 at 07:58:13PM +, Lars Wirzenius wrote:
 On ke, 2011-01-05 at 14:46 -0500, Josef Bacik wrote:
  Blah blah blah, I'm not having an argument about which is better because I
  simply do not care.  I think dedup is silly to begin with, and online dedup 
  even
  sillier.  The only reason I did offline dedup was because I was just toying
  around with a simple userspace app to see exactly how much I would save if 
  I did
  dedup on my normal system, and with 107 gigabytes in use, I'd save 300
  megabytes.  I'll say that again, with 107 gigabytes in use, I'd save 300
  megabytes.  So in the normal user case dedup would have been wholey useless 
  to
  me.
 
 I have been thinking a lot about de-duplication for a backup application
 I am writing. I wrote a little script to figure out how much it would
 save me. For my laptop home directory, about 100 GiB of data, it was a
 couple of percent, depending a bit on the size of the chunks. With 4 KiB
 chunks, I would save about two gigabytes. (That's assuming no MD5 hash
 collisions.) I don't have VM images, but I do have a fair bit of saved
 e-mail. So, for backups, I concluded it was worth it to provide an
 option to do this. I have no opinion on whether it is worthwhile to do
 in btrfs.
 

Yeah, for things where you are talking about sending it over the network or
something like that, every little bit helps.  I think deduplication is far more
interesting and useful at an application level than at a filesystem level.  For
example with a mail server, there is a good chance that the files will be
smaller than a blocksize and not be able to be deduped, but if the application
that was storing them recognized that it had the same messages and just linked
everything in its own stuff then that would be cool.  Thanks,

Josef
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Offline Deduplication for Btrfs

2011-01-05 Thread Gordan Bobic

On 01/05/2011 06:41 PM, Diego Calleja wrote:

On Wednesday, 5 January 2011 18:42:42 Gordan Bobic wrote:

So by doing the hash indexing offline, the total amount of disk I/O
required effectively doubles, and the amount of CPU spent on doing the
hashing is in no way reduced.


But there are people who might want to avoid temporally the extra cost
of online dedup, and do it offline when the server load is smaller.


The point is that the offline dedup is actually twice as expensive, and 
the hashing part is nowhere near as expensive as disk I/O. Disk I/O is 
very limited today, compared to CPU time.


Gordan
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Offline Deduplication for Btrfs

2011-01-05 Thread Gordan Bobic

On 01/05/2011 07:01 PM, Ray Van Dolson wrote:

On Wed, Jan 05, 2011 at 07:41:13PM +0100, Diego Calleja wrote:

On Wednesday, 5 January 2011 18:42:42 Gordan Bobic wrote:

So by doing the hash indexing offline, the total amount of disk I/O
required effectively doubles, and the amount of CPU spent on doing the
hashing is in no way reduced.


But there are people who might want to avoid temporally the extra cost
of online dedup, and do it offline when the server load is smaller.

In my opinion, both online and offline dedup have valid use cases, and
the best choice is probably implement both.


Question from an end-user.  When we say offline deduplication, are we
talking about post-process deduplication (a la what Data ONTAP does
with their SIS implementation) during which the underlying file system
data continues to be available, or a process that needs exclusive
access ot the blocks to do its job?


I was assuming it was a regular cron-job that grinds away on the disks 
but doesn't require downtime.


Gordan
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Offline Deduplication for Btrfs

2011-01-05 Thread Josef Bacik
On Wed, Jan 05, 2011 at 11:01:41AM -0800, Ray Van Dolson wrote:
 On Wed, Jan 05, 2011 at 07:41:13PM +0100, Diego Calleja wrote:
  On Wednesday, 5 January 2011 18:42:42 Gordan Bobic wrote:
   So by doing the hash indexing offline, the total amount of disk I/O 
   required effectively doubles, and the amount of CPU spent on doing the 
   hashing is in no way reduced.
  
  But there are people who might want to avoid temporally the extra cost
  of online dedup, and do it offline when the server load is smaller.
  
  In my opinion, both online and offline dedup have valid use cases, and
  the best choice is probably implement both.
 
 Question from an end-user.  When we say offline deduplication, are we
 talking about post-process deduplication (a la what Data ONTAP does
 with their SIS implementation) during which the underlying file system
 data continues to be available, or a process that needs exclusive
 access ot the blocks to do its job?
 

Yeah, it's just a post-process thing; you run it when you care to run it and it
doesn't make anything unavailable.  Thanks,

Josef
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Offline Deduplication for Btrfs

2011-01-05 Thread Freddie Cash
On Wed, Jan 5, 2011 at 12:15 PM, Josef Bacik jo...@redhat.com wrote:
 Yeah for things where you are talking about sending it over the network or
 something like that every little bit helps.  I think deduplication is far more
 interesting and usefull at an application level than at a filesystem level.  
 For
 example with a mail server, there is a good chance that the files will be
 smaller than a blocksize and not be able to be deduped, but if the application
 that was storing them recognized that it had the same messages and just linked
 everything in its own stuff then that would be cool.  Thanks,

Cyrus IMAP and Zimbra (probably a lot of others) already do that,
hard-linking identical message bodies.  The e-mail server use-case
for dedupe is pretty much covered already.


-- 
Freddie Cash
fjwc...@gmail.com
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Offline Deduplication for Btrfs

2011-01-05 Thread Gordan Bobic

On 01/05/2011 07:46 PM, Josef Bacik wrote:


Blah blah blah, I'm not having an argument about which is better because I
simply do not care.  I think dedup is silly to begin with, and online dedup even
sillier.


Offline dedup is more expensive - so why are you of the opinion that it 
is less silly? And comparison by silliness quotient still sounds like an 
argument over which is better.



The only reason I did offline dedup was because I was just toying
around with a simple userspace app to see exactly how much I would save if I did
dedup on my normal system, and with 107 gigabytes in use, I'd save 300
megabytes.  I'll say that again, with 107 gigabytes in use, I'd save 300
megabytes.  So in the normal user case dedup would have been wholey useless to
me.


Dedup isn't for an average desktop user. Dedup is for backup storage and 
virtual images. I don't remember anyone ever saying it is for the 
average desktop user. I am amazed you got even that much saving - I 
wouldn't expect there to be any duplicate files on a normal system. 
Compression is a feature that desktop users would benefit from, not 
deduplication.



Dedup is only usefull if you _know_ you are going to have duplicate information,
so the two major usecases that come to mind are

1) Mail server.  You have small files, probably less than 4k (blocksize) that
you are storing hundreds to thousands of.  Using dedup would be good for this
case, and you'd have to have a small dedup blocksize for it to be usefull.


Explain to me why you think this would yield duplicate blocks. If your 
server is Maildir, headers will be in the mail files, and because all 
emails went to different users, they'd have different headers, and thus 
not be dedupable.



2) Virtualized guests.  If you have 5 different RHEL5 virt guests, chances are
you are going to share data between them, but unlike with the mail server
example, you are likely to find much larger chunks that are the same, so you'd
want a larger dedup blocksize, say 64k.  You want this because if you did just
4k you'd end up with a ridiculous amount of framentation and performance would
go down the toilet, so you need a larger dedup blocksize to make for better
performance.


Fragmentation will cause you problems anyway; the argument in the UNIX 
world since year dot has been that defragging doesn't make a damn's worth of 
difference when you have a hundred users hammering away on a machine 
that has to skip between all their collective files.


If you have VM image files a-la vmware/xen/kvm, then using blocks of the 
same size as the guests is the only way that you are going to get sane 
deduplication performance. Otherwise the blocks won't line up. If the 
dedupe block size is 4KB and guest fs block size is 4KB, that's a 
reasonably clean case.


The biggest win by far, however, would be when using chroot type guests, 
as I mentioned.



So you'd want an online implementation to give you a choice of dedup blocksize,
which seems to me to be overly complicated.


I'd just make it always use the fs block size. No point in making it 
variable.



And then lets bring up the fact that you _have_ to manually compare any data you
are going to dedup.  I don't care if you think you have the greatest hashing
algorithm known to man, you are still going to have collisions somewhere at some
point, so in order to make sure you don't lose data, you have to manually memcmp
the data.  So if you are doing this online, that means reading back the copy you
want to dedup in the write path so you can do the memcmp before you write.  That
is going to make your write performance _suck_.


IIRC, this is configurable in ZFS so that you can switch off the 
physical block comparison. If you use SHA256, the probability of a 
collision (unless SHA is broken, in which case we have much bigger 
problems) is on the order of 1 in 2^128 - you would need around 2^128 
blocks before expecting to see one. Times 4KB blocks, that is one 
collision in roughly 10^24 Exabytes. That's one trillion trillion (that's 
double trillion) Exabytes. That is considerably more storage space than 
there is likely to be available on the planet for some time. And just for 
good measure, you could always up the hash to SHA512 or use two different 
hashes (e.g. a combination of SHA256 and MD5).

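(The arithmetic behind those numbers, for anyone who wants to check it - by 
the birthday bound a collision in a 256-bit hash only becomes likely after 
around 2^128 blocks:)

    # rough check of the figure above
    blocks = 2 ** 128                    # blocks before a SHA256 collision is likely
    exabytes = blocks * 4096 / 1e18      # at 4 KiB per block, in exabytes
    print("%.1e exabytes" % exabytes)    # ~1.4e+24, i.e. on the order of 10^24 EB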


Do I think offline dedup is awesome?  Hell no, but I got distracted doing it as
a side project so I figured I'd finish it, and I did it in under 1400 lines.  I
dare you to do the same with an online implementation.  Offline is simpler to
implement and simpler to debug if something goes wrong, and has an overall
easier to control impact on the system.


It is also better done outside the FS if you're not going to do it 
properly using FL-COW or fuse based lessfs.


Gordan
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Offline Deduplication for Btrfs

2011-01-05 Thread Lars Wirzenius
On ke, 2011-01-05 at 19:58 +, Lars Wirzenius wrote:
 (For my script, see find-duplicate-chunks in
 http://code.liw.fi/debian/pool/main/o/obnam/obnam_0.14.tar.gz or get the
 current code using bzr get http://code.liw.fi/obnam/bzr/trunk/;.
 http://braawi.org/obnam/ is the home page of the backup app.)

If I may add: it would perhaps be good to collect numbers on the amount
of duplication (for various block sizes) there is on different kinds of
systems: random laptops, file servers for small companies and large
companies, mail servers, backup servers, VM servers, etc. Would anyone
be interested in collecting such numbers? A script like mine would be a
bit heavy to run, but not too much so, I bet.

It would be good to have hard numbers as a basis of discussion rather
than guesses and assumptions.

Or perhaps someone's already collected the data?

-- 
Blog/wiki/website hosting with ikiwiki (free for free software):
http://www.branchable.com/

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Offline Deduplication for Btrfs

2011-01-05 Thread Gordan Bobic

On 01/05/2011 09:14 PM, Diego Calleja wrote:

In fact, there are cases where online dedup is clearly much worse. For
example, cases where people suffer duplication, but it takes a lot of
time (several months) to hit it. With online dedup, you need to enable
it all the time to get deduplication, and the useless resource waste
offsets the other advantages. With offline dedup, you only deduplicate
when the system really needs it.


My point is that on a file server you don't need to worry about the CPU 
cost of deduplication because you'll run out of I/O long before you run 
out of CPU.



And I can also imagine some unrealistic but theorically valid cases,
like for example an embedded device that for some weird reason needs
deduplication but doesn't want online dedup because it needs to save
as much power as possible. But it can run an offline dedup when the
batteries are charging.


That's _very_ theoretical.


It's clear to me that if you really want a perfect deduplication
solution you need both systems.


I'm not against having both. :)

Gordan
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Offline Deduplication for Btrfs

2011-01-05 Thread Gordan Bobic

On 01/06/2011 12:22 AM, Spelic wrote:

On 01/05/2011 09:46 PM, Gordan Bobic wrote:

On 01/05/2011 07:46 PM, Josef Bacik wrote:

Offline dedup is more expensive - so why are you of the opinion that
it is less silly? And comparison by silliness quotiend still sounds
like an argument over which is better.



If I can say my opinion, I wouldn't want dedup to be enabled online for
the whole filesystem.

Three reasons:

1- Virtual machine disk images should not get deduplicated imho, if you
care about performances, because fragmentation is more important in that
case.


I disagree. You'll gain much, much more from improved caching and 
reduced page cache usage than you'll lose from fragmentation.



So offline dedup is preferable IMHO. Or at least online dedup should
happen only on configured paths.


Definitely agree that it should be a per-directory option, rather than 
per mount.



2- I don't want performances to drop all the time. I would run dedup
periodically on less active hours, hence, offline. A rate limiter should
also be implemented so not to trash the drives too much. Also a stop and
continue should be implemented, so that dedup which couldn't finish
within a certain time-frame (e.g. one night) can be made continue the
night after without restarting from the beginning.


This is the point I was making - you end up paying double the cost in 
disk I/O and the same cost in CPU terms if you do it offline. And I am 
not convinced the overhead of calculating checksums is that great. There 
are already similar overheads in checksums being calculated to enable 
smart data recovery in case of silent disk corruption.


Now that I mention it, that's an interesting point. Could these be 
unified? If we crank up the checksums on files a bit, to something 
suitably useful for deduping, it could make the deduping feature almost 
free.


As for restarting deduping (e.g. you chattr -R a directory to specify it 
for deduping): since the contents aren't already deduped (the files' 
entries aren't in the hash index), it'd be obvious what still needs to be 
deduped and what doesn't.



3- Only some directories should be deduped, for performance reasons. You
can foresee where duplicate blocks can exist and where not. Backup
directories typically, or mailservers directories. The rest is probably
a waste of time.


Indeed, see above. I think it should be a per file setting/attribute, 
hereditary from the parent directory.



Also, the OS is small even if identical on multiple virtual images, how
much is going to occupy anyway? Less than 5GB per disk image usually.
And that's the only thing that would be deduped because data likely to
be different on each instance. How many VMs running you have? 20? That's
at most 100GB saved one-time at the cost of a lot of fragmentation.


That's also 100GB fewer disk blocks in contention for page cache. If 
you're hitting the disks, you're already going to slow down by several 
orders of magnitude. Better to make the caching more effective.



So if you are doing this online, that means reading back the copy you
want to dedup in the write path so you can do the memcmp before you
write. That
is going to make your write performance _suck_.


IIRC, this is configurable in ZFS so that you can switch off the
physical block comparison. If you use SHA256, the probability of a
collission (unless SHA is broken, in which case we have much bigger
problems) is 1^128. Times 4KB blocks, that is one collission in 10^24
Exabytes. That's one trillion trillion (that's double trillion) Exabytes.


I like mathematics, but I don't care this time. I would never enable
dedup without full blocks compare. I think most users and most companies
would do the same.


I understand where you are coming from, but by that reasoning you could 
also argue that AES256 isn't good enough to keep your data confidential. 
It is a virtual certainty that you will lose several times that much 
data through catastrophic disk+RAID+backup failures before you ever find 
a hash collision.



If there is full blocks compare, a simpler/faster algorithm could be
chosen, like md5. Or even a md-64bits which I don't think it exists, but
you can take MD4 and then xor the first 8 bytes with the second 8 bytes
so to reduce it to 8 bytes only. This is just because it saves 60% of
the RAM occupation during dedup, which is expected to be large, and the
collisions are still insignificant at 64bits. Clearly you need to do
full blocks compare after that.


I really don't think the cost in terms of a few bytes per file for the 
hashes is that significant.



Note that deduplication IS a cryptographically sensitive matter because
if sha-1 is cracked, people can nuke (or maybe even alter, and with
this, hack privileges) other users' files by providing blocks with the
same SHA and waiting for dedup to pass.
Same thing for AES btw, it is showing weaknesses: use blowfish or twofish.
SHA1 and AES are two wrong standards...


That's just alarmist. AES 

Re: Offline Deduplication for Btrfs

2011-01-05 Thread Spelic

On 01/05/2011 09:46 PM, Gordan Bobic wrote:

On 01/05/2011 07:46 PM, Josef Bacik wrote:

Offline dedup is more expensive - so why are you of the opinion that 
it is less silly? And comparison by silliness quotiend still sounds 
like an argument over which is better.




If I can say my opinion, I wouldn't want dedup to be enabled online for 
the whole filesystem.


Three reasons:

1- Virtual machine disk images should not get deduplicated imho, if you 
care about performance, because fragmentation is more important in that 
case.
So offline dedup is preferable IMHO. Or at least online dedup should 
happen only on configured paths.


2- I don't want performance to drop all the time. I would run dedup 
periodically during less active hours, hence, offline. A rate limiter should 
also be implemented so as not to thrash the drives too much. Also a stop and 
continue should be implemented, so that a dedup run which couldn't finish 
within a certain time-frame (e.g. one night) can be made to continue the 
night after without restarting from the beginning.


3- Only some directories should be deduped, for performance reasons. You 
can foresee where duplicate blocks can exist and where not. Backup 
directories typically, or mailserver directories. The rest is probably 
a waste of time.


Dedup isn't for an average desktop user. Dedup is for backup storage 
and virtual images


Not virtual images imho, for the reason above.

Also, the OS is small even if identical on multiple virtual images; how 
much is it going to occupy anyway? Less than 5GB per disk image usually. 
And that's the only thing that would be deduped, because the data is 
likely to be different on each instance. How many VMs do you have running? 
20? That's at most 100GB saved one time, at the cost of a lot of 
fragmentation.


Now if you backup those images periodically, that's a place where I 
would run dedup.


I'd just make it always use the fs block size. No point in making it 
variable.


Agreed. What is the reason for variable block size?

And then lets bring up the fact that you _have_ to manually compare 
any data you
are going to dedup.  I don't care if you think you have the greatest 
hashing
algorithm known to man, you are still going to have collisions 
somewhere at some
point, so in order to make sure you don't lose data, you have to 
manually memcmp

the data.


Totally agreed


So if you are doing this online, that means reading back the copy you
want to dedup in the write path so you can do the memcmp before you 
write.  That

is going to make your write performance _suck_.


IIRC, this is configurable in ZFS so that you can switch off the 
physical block comparison. If you use SHA256, the probability of a 
collission (unless SHA is broken, in which case we have much bigger 
problems) is 1^128. Times 4KB blocks, that is one collission in 10^24 
Exabytes. That's one trillion trillion (that's double trillion) Exabytes. 


I like mathematics, but I don't care this time. I would never enable 
dedup without full blocks compare. I think most users and most companies 
would do the same.


If there is full blocks compare, a simpler/faster algorithm could be 
chosen, like md5. Or even an md-64bits, which I don't think exists, but 
you can take MD4 and then XOR the first 8 bytes with the second 8 bytes 
so as to reduce it to 8 bytes only. This is just because it saves 60% of 
the RAM occupation during dedup, which is expected to be large, and the 
collisions are still insignificant at 64 bits. Clearly you need to do 
full blocks compare after that.

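(A sketch of the folding trick described above - MD5 is used here simply 
because MD4 support in Python's hashlib depends on the OpenSSL build; any 
128-bit digest folds the same way. Every match found with such a short key 
would of course still get the full block compare afterwards.)

    import hashlib

    def folded64(block):
        """Fold a 128-bit digest down to 8 bytes by XORing its two halves,
        so the in-RAM index costs 8 bytes per block instead of 16 or 20."""
        d = hashlib.md5(block).digest()                   # 16 bytes
        return bytes(a ^ b for a, b in zip(d[:8], d[8:]))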

BTW if you want to allow (as an option) dedup without full blocks 
compare, SHA1 is not so good: sha-0 already had problems, now sha-1 has 
problems, I almost wouldn't suggest it for cryptographically secure 
stuff foreseeing the future. Use ripemd160 or even better ripemd256 
which is even faster according to 
http://www.cryptopp.com/benchmarks.html  ripemds are much better 
algorithms than shas: they have no known weaknesses.
Note that deduplication IS a cryptographically sensitive matter because 
if sha-1 is cracked, people can nuke (or maybe even alter, and with 
this, hack privileges) other users' files by providing blocks with the 
same SHA and waiting for dedup to pass.

Same thing for AES btw, it is showing weaknesses: use blowfish or twofish.
SHA1 and AES are two wrong standards...

Dedup without full blocks compare seems indeed suited for online dedup 
(which I wouldn't enable, now for one more reason) because with full 
block compares performance would really suck. But please leave full 
blocks compare for the offline dedup.


Also I could suggest a third type of deduplication, but this is 
harder... it's a file-level deduplication which works like xdelta, that 
is, it is capable of recognizing pieces of identical data in two files 
which are not at the same offset and which are not even aligned at a block 
boundary. For this, a rolling hash like the one used by rsync, or the xdelta 
3.0 algorithm, could be used. For this to 

Re: Offline Deduplication for Btrfs

2011-01-05 Thread Chris Mason
Excerpts from Gordan Bobic's message of 2011-01-05 12:42:42 -0500:
 Josef Bacik wrote:
 
  Basically I think online dedup is huge waste of time and completely useless.
 
 I couldn't disagree more. First, let's consider what is the 
 general-purpose use-case of data deduplication. What are the resource 
 requirements to perform it? How do these resource requirements differ 
 between online and offline?

I don't really agree with Josef that dedup is dumb, but I do think his
current approach is the most reasonable.  Dedup has a few very valid use
cases, which I think break down to:

1) backups
2) VM images.

The backup farm use case is the best candidate for dedup in general
because they are generally write once and hopefully read never.
Fragmentation for reading doesn't matter at all and we're really very
sure we're going to backup the same files over and over again.

But, it's also something that will be dramatically more efficient when
the backup server helps out.  The backup server knows two files have the
same name, same size and can guess with very high accuracy that they
will be the same.  So it is a very good candidate for Josef's offline
dedup because it can just do the dedup right after writing the file.

In the backup farm, whole files are very likely to be identical, which
again is very easy to optimize with Josef's approach.
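
(As a sketch of that heuristic - hypothetical helper names, not Josef's tool - 
a backup server could group files by name and size first and only hash or 
byte-compare the groups that collide:)

    import os
    from collections import defaultdict

    def whole_file_candidates(root):
        """Group files under a backup tree by (basename, size); groups with
        more than one member are cheap first-pass dedup candidates, to be
        verified byte for byte before anything is actually shared."""
        groups = defaultdict(list)
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    size = os.path.getsize(path)
                except OSError:
                    continue
                groups[(name, size)].append(path)
        return {key: paths for key, paths in groups.items() if len(paths) > 1}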

Next is the VM images.  This is actually a much better workload for
online dedup, except for the part where our poor storage server would be
spending massive amounts of CPU deduping blocks for all the VMs on the
machine.  In this case the storage server doesn't know the
filenames, it just has bunches of blocks that are likely to be the same
across VMs.

So, it seems a bit silly to do this out of band, where we wander through
the FS and read a bunch of blocks in hopes of finding ones with the same
hash.

But, one of the things on our features-to-implement page is to wander
through the FS and read all the blocks from time to time.  We want to do
this in the background to make sure the bits haven't rotted on disk.  By
scrubbing from time to time we are able to make sure that when a disk
does die, other disks in the FS are likely to have a good copy.

So again, Josef's approach actually works very well.  His dedup util
could be the scrubbing util and we'll get two projects for the price of
one.

As for the security of hashes, we're unlikely to find a collision on a
sha256 that wasn't made maliciously.  If the system's data is
controlled and you're not worried about evil people putting files on
there, extra reads really aren't required.

But then again, extra reads are a good thing (see above about
scrubbing).  The complexity of the whole operation goes down
dramatically when we do the verifications because hash index corruptions
(this extent has this hash) will be found instead of blindly trusted.

None of this means that online dedup is out of the question, I just
think the offline stuff is a great way to start.

-chris
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Offline Deduplication for Btrfs

2011-01-05 Thread Spelic

On 01/06/2011 02:03 AM, Gordan Bobic wrote:


That's just alarmist. AES is being cryptanalyzed because everything 
uses it. And the news of it's insecurity are somewhat exaggerated (for 
now at least). 


Who cares... the fact that they are not much used is a benefit for RIPEMD / 
Blowfish-Twofish then.

Nobody makes viruses for Linux because they target Windows. Same thing...
RIPEMD still has an advantage over SHA imho, and Blowfish over AES.





If there is full blocks compare, a simpler/faster algorithm could be
chosen, like md5. Or even a md-64bits which I don't think it exists, but
you can take MD4 and then xor the first 8 bytes with the second 8 bytes
so to reduce it to 8 bytes only. This is just because it saves 60% of
the RAM occupation during dedup, which is expected to be large, and the
collisions are still insignificant at 64bits. Clearly you need to do
full blocks compare after that.


I really don't think the cost in terms of a few bytes per file for the 
hashes is that significant.


20 to 8 = 12 bytes per *filesystem block* saved, I think.
Aren't we talking about block-level deduplication?
For every TB of filesystem you occupy 2GB of RAM with hashes instead of 
5.3GB (I am assuming 4K blocks; I don't remember how big btrfs blocks are).
For a 24 * 2TB storage array you occupy 96GB instead of 254GB of RAM. It 
might be the difference between feasible and not feasible.
Actually it might not be feasible anyway... an option to store the hashes 
on an SSD should be provided then...

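(The per-TB figures above follow directly from the block count; a quick 
check, assuming 4 KiB blocks and a binary TiB:)

    blocks_per_tib = 2 ** 40 // 4096           # 268,435,456 blocks per TiB
    for hash_len in (20, 8):                   # 160-bit digest vs folded 64-bit key
        gb = blocks_per_tib * hash_len / 1e9
        print("%d bytes/block -> %.1f GB of hashes per TiB" % (hash_len, gb))
    # prints roughly 5.4 GB and 2.1 GB, in line with the figures above
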
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Offline Deduplication for Btrfs

2011-01-05 Thread Freddie Cash
On Wed, Jan 5, 2011 at 5:03 PM, Gordan Bobic gor...@bobich.net wrote:
 On 01/06/2011 12:22 AM, Spelic wrote:
 Definitely agree that it should be a per-directory option, rather than per
 mount.

JOOC, would the dedupe table be done per directory, per mount, per
sub-volume, or per volume?  The larger the pool of data to check
against, the better your dedupe ratios will be.

I'm not up-to-date on all the terminology that btrfs uses, and how it
compares to ZFS (disks - vdevs - pool - filesystem/volume), so the
terms above may be incorrect.  :)

In the ZFS world, dedupe is done pool-wide in that any block in the
pool is a candidate for dedupe, but the dedupe property can be
enabled/disabled on a per-filesystem basis.  Thus, only blocks in
filesystems with the dedupe property enabled will be deduped.  But
blocks from any filesystem can be compared against.

 This is the point I was making - you end up paying double the cost in disk
 I/O and the same cost in CPU terms if you do it offline. And I am not
 convniced the overhead of calculating checksums is that great. There are
 already similar overheads in checksums being calculated to enable smart data
 recovery in case of silent disk corruption.

 Now that I mentioned, that, it's an interesting point. Could these be
 unified? If we crank up the checksums on files a bit, to something suitably
 useful for deduping, it could make the deduping feature almost free.

This is what ZFS does.  Every block in the pool has a checksum
attached to it.  Originally, the default algorithm was fletcher2, with
fletcher4 and sha256 as alternates.  When dedupe was enabled, the
default was changed to fletcher4.  Dedupe also came with the option to
enable/disable a byte-for-byte verify when the hashes match.

By switching the checksum algorithm for the pool to sha256 ahead of
time, you can enable dedupe, and get the dedupe checksumming for free.
 :)

 Also, the OS is small even if identical on multiple virtual images, how
 much is going to occupy anyway? Less than 5GB per disk image usually.
 And that's the only thing that would be deduped because data likely to
 be different on each instance. How many VMs running you have? 20? That's
 at most 100GB saved one-time at the cost of a lot of fragmentation.

 That's also 100GB fewer disk blocks in contention for page cache. If you're
 hitting the disks, you're already going to slow down by several orders of
 magnitude. Better to make the caching more effective.

If you set up your VMs as diskless images, using NFS off a storage
server running whatever FS with dedupe, you can get a lot more out
of it than using disk image files (where you have all the block sizes
and alignment to worry about).  And then you can use all the fancy
snapshotting, cloning, etc. features of whatever FS as well.

-- 
Freddie Cash
fjwc...@gmail.com
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Offline Deduplication for Btrfs

2011-01-05 Thread Peter A
On Wednesday, January 05, 2011 08:19:04 pm Spelic wrote:
  I'd just make it always use the fs block size. No point in making it 
  variable.
 
 Agreed. What is the reason for variable block size?

First post on this list - I mostly was just reading so far to learn more on fs 
design but this is one topic I (unfortunately) have experience with... 

You wouldn't believe the difference variable block size dedupe makes. For a 
pure fileserver, it's ok to dedupe on block level, but for most other uses, 
variable is king. One big example is backups. NetBackup and most others 
produce one stream with all the data even when backing up to disk. Imagine you 
move a whole lot of data from one dir to another. Think a directory with huge 
video files. As a filesystem it would be de-duped nicely. The backup stream, 
however, may or may not have matching fs blocks. If the directory names before 
and after have the same length and such - then yeah, dedupe works. Directory 
name is a byte shorter? Everything in the stream will be offset by one byte - 
and no dedupe will occur at all on the whole dataset. In the real world, just 
compare the dedupe performance of an Oracle 7000 (ZFS and therefore fs-block 
based) to a DataDomain (variable length) in this usage scenario. Among our 
customers we see something like 3 to 17x dedupe ratio on the DD, 1.02 - 1.05 
on the 7000.

There are many other examples, and in the end it comes down to whether you 
want general purpose de-dupe (e.g. something useful when you serve iSCSI 
LUNs, serve as a backup target, ...) or only care about a pure file store; 
in the latter case you're probably going to be ok with fixed block lengths... 

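(For anyone curious what "variable length" means mechanically, here is a toy 
content-defined chunking sketch - illustrative only, not how DataDomain or 
any real product does it. Cut points are chosen by a checksum of recent 
bytes, so inserting or deleting a byte only disturbs the chunks around the 
edit instead of shifting every block boundary after it, which is exactly the 
failure mode described above for fixed-size blocks.)

    def chunk_boundaries(data, mask=0x0FFF, min_chunk=2048):
        """Toy content-defined chunking: keep a crude shift-and-add checksum
        (only roughly the last 32 bytes influence it, since it lives in 32
        bits) and cut whenever its low bits hit a fixed pattern, giving
        chunks of a few KiB on average whose boundaries follow the content."""
        chunks, start, rolling = [], 0, 0
        for i, byte in enumerate(data):
            rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
            if i - start + 1 >= min_chunk and (rolling & mask) == mask:
                chunks.append(data[start:i + 1])
                start = i + 1
        if start < len(data):
            chunks.append(data[start:])
        return chunks
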
Hope that helps, 

Peter.

-- 
Censorship: noun, circa 1591. a: Relief of the burden of independent thinking.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html