Re: Data De-duplication

2008-12-14 Thread Chris Samuel
On Thu, 11 Dec 2008 8:19:03 am Ray Van Dolson wrote:

 I'm not sure why this hasn't caught on, but as soon as a solid and fast
 implementation of it exists in the Linux world I really think it can
 catch on for VM datastores I know we've hollered at Sun as to why
 they haven't rolled it out for ZFS yet!

Might NetApp have patented the technique?

It would be a similar situation to the issues [1] that the KSM work [2] ran
into when trying to prevent duplication of pages.

cheers,
Chris

[1] - http://lwn.net/Articles/309155/
[2] - http://lwn.net/Articles/306704/

-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic.
For more info see: http://en.wikipedia.org/wiki/OpenPGP



Re: Data De-duplication

2008-12-14 Thread Omen Wild
Quoting Oliver Mattos oliver.matto...@imperial.ac.uk on Thu, Dec 11 00:18:

  It would be interesting to see how many duplicate *blocks* there are
  across the filesystem, agnostic to files...

Here is my contribution.  It's a Perl script that goes through every
block (at various block sizes) on a device and prints out summaries.
It uses BerkeleyDB to store the hashes as it goes, since there are an
enormous number of them at smaller block sizes.

It checks every block, not just the used blocks, so on a file system
less than 100% full it really isn't a proper test.

It requires root privileges in order to read the raw block device.

My root filesystem on my Debian box, 20G, 9.2G used:
root:   512 byte blocks 27.07% duplicate.
root:  1024 byte blocks 26.20% duplicate.
root:  4096 byte blocks 23.17% duplicate.
root:  8192 byte blocks 15.31% duplicate.
root: 16384 byte blocks 9.57% duplicate.

For my home partition, 20G, 29% used:
home:   512 byte blocks 32.11% duplicate.
home:  1024 byte blocks 31.16% duplicate.
home:  4096 byte blocks 29.85% duplicate.
home:  8192 byte blocks 21.31% duplicate.
home: 16384 byte blocks 10.84% duplicate.

There is, of course, a fair amount of overhead for smaller blocks:
512b 3.91% overhead
1k   1.95% overhead
4k   0.49% overhead
8k   0.24% overhead
16k  0.12% overhead
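
Those overhead figures are just the 20-byte SHA-1 digest the script stores
per block, i.e. 20/blocksize; a quick one-liner to reproduce them:

perl -e 'printf("%5d: %.2f%%\n", $_, 2000/$_) for 512, 1024, 4096, 8192, 16384'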

Still, 4% overhead for 27-32% savings seems pretty impressive.  Again,
it checksums all blocks, not just used blocks, so the whole test might
be bogus.

-- 
Life is wasted on the living.
#! /usr/bin/perl

use strict;
use warnings;

use BerkeleyDB;
use DBI;
use Data::Dumper;
use Digest::SHA1;
use Fcntl;

$| = 1;

my $env = new BerkeleyDB::Env({
    -Home => $ENV{'HOME'} . "/work/bulk",
});

my @blocksizes = (qw/16384 8192 4096 1024 512/);
#my @blocksizes = (qw/16384/);

while (my $device = shift) {
  # POSIXLY_CORRECT=1 makes df report sizes in 512-byte blocks.
  my $device_size = `env POSIXLY_CORRECT=1 df $device | tail -1 | awk '{printf "%d", \$1}'`;
  if ($device_size <= 0) {
    die "Unable to find device block count through df, make sure you pass /dev/mapper/devicename for logical volumes.\n";
  }

  foreach my $blocksize (sort {$a <=> $b} @blocksizes) {
    my $device_blocks = ($device_size * 512) / $blocksize;

    print "Testing `$device': $blocksize bytes, $device_blocks blocks\n";

    # Turn the device path into something safe to use as a filename.
    my $device_sanitized = $device;
    $device_sanitized =~ s@/@_@g;
    $device_sanitized =~ s/[^-a-zA-Z0-9_]//g;

    my $data;
    my $percent_done = 0;
    my $blocks = 0;

    my %hashes;
    unlink($device_sanitized . ".db");
    my $db = tie %hashes, 'BerkeleyDB::Hash',
      -Filename  => $device_sanitized . ".db",
      -Cachesize => 2**26,
      -Flags     => DB_CREATE;

    my $time_start = time();
    sysopen(DEVICE, $device, O_RDONLY) or die "Unable to open $device: $!\n";
    # Read 1M at a time and hash each $blocksize-sized chunk of it.
    while ((my $bytes_read = sysread(DEVICE, $data, 2**20)) >= $blocksize) {
      for (my $i = 0; $i < $bytes_read / $blocksize; $i++) {
        my $substr = substr($data, $i * $blocksize, $blocksize);
        #printf("Block %d, i = %d, offset = %d\n",
        #       $blocks, $i, $i * $blocksize);
        my $digest = Digest::SHA1::sha1($substr);
        $hashes{$digest}++;
        $blocks++;
        my $pdone = int(100 * ($blocks / $device_blocks));
        if ($pdone != $percent_done) {
          printf STDERR ("Done: %d%% %d/%d, %d seconds\n",
                         $pdone, $blocks, $device_blocks,
                         time() - $time_start);
          $percent_done = $pdone;
          $time_start = time();
        }
      }
    }
    close(DEVICE);

    # Every digest seen more than once represents duplicated blocks.
    my $blocks_saved = 0.0;
    while (my ($key, $value) = each(%hashes)) {
      if ($value > 1) {
        $blocks_saved += $value;
      }
    }
    my $saved = $blocks_saved * $blocksize;
    my $total_size = $blocks * $blocksize;

    printf("%s: %5d byte blocks, %d blocks\n",
           $device, $blocksize, $blocks);
    printf("\t%.0f duplicate blocks, %.0f duplicate bytes.\n",
           $blocks_saved, $saved);
    # Overhead assumes one 20-byte SHA-1 digest stored per block.
    printf("\t%.2f%% overhead, %.2f%% duplicate.\n",
           (20.0 / $blocksize) * 100, ($saved / ($total_size + $saved)) * 100);
  }
}


Re: Data De-duplication

2008-12-11 Thread Oliver Mattos
 Neat.  Thanks much.  It'd be cool to output the results of each of your
 hashes to a database so you can get a feel for how many duplicate
 blocks there are cross-files as well.
 
 I'd like to run this in a similar setup on all my VMware VMDK files and
 get an idea of how much space savings there would be across 20+ Windows
 2003 VMDK files... probably *lots* of common blocks.
 
 Ray
Hi,

Currently it DOES do cross-file block matching - that's why it takes sooo
long to run :-)

If you remove everything after the word sort, it will produce verbose
output, which you could then stick into some SQL database if you wanted.
You could relatively easily format the output for direct input to an SQL
database if you modified the line with the dd in it within the first
while loop.  You can also remove the sort and the pipe before it to get
unsorted output - the advantage of this is that it takes less time.

I'm guessing that if you had the time to run this on multi-gigabyte disk
images you'd find that as much as 80% of the blocks are duplicated
between any two virtual machines of the same operating system.

That means if you have 20 Win 2k3 VMs and the first VM image of Windows
+ software is 2GB (after nulls are removed), the total size for 20 VMs
could be ~6GB (remembering there will be extra redundancy the more VMs
you add) - not a bad saving.

Thanks
Oliver



Re: Data De-duplication

2008-12-11 Thread Chris Mason
On Wed, 2008-12-10 at 17:53 +, Oliver Mattos wrote:
   2)  Keep a tree of checksums for data blocks, so that a bit of data can
   be located by it's checksum.  Whenever a data block is about to be
   written check if the block matches any known block, and if it does then
   don't bother duplicating the data on disk.  I suspect this option may
   not be realistic for performance reasons.
   
  
  When compression was added, the writeback path was changed to make
  option #2 viable, at least in the case where the admin is willing to
  risk hash collisions on strong hashes.  When the a direct read
  comparison is required before sharing blocks, it is probably best done
  by a stand alone utility, since we don't want wait for a read of a full
  extent every time we want to write on.
  
 
 Can we assume hash collisions won't occur?  I mean if it's a 256 bit
 hash then even with 256TB of data, and one hash per block, the chances
 of collision are still too small to calculate on gnome calculator.

It depends on the use case.  We can assume that if someone really wants
collisions to occur they will eventually be able to trigger them.  So,
the code should have the option to verify the bytes are identical via a
read.
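
To make that knob concrete, here is a tiny user-space sketch of the two
policies (trust the strong hash vs. byte-compare before sharing); it is only
an illustration of the decision logic, not anything that exists in btrfs:

#!/usr/bin/perl
# Illustration only: the decision a dedup writer would make per block.
use strict;
use warnings;
use Digest::SHA1 qw(sha1);

my $verify = 1;    # admin knob: 0 = trust the hash, 1 = read-verify first

# digest -> block contents already "on disk" (a real implementation would
# store a disk location and read it back, not keep the bytes in memory).
my %index;

sub store_block {
    my ($block) = @_;
    my $digest = sha1($block);
    if (exists $index{$digest}) {
        # Hash matches an existing block; optionally verify byte-for-byte.
        if (!$verify || $index{$digest} eq $block) {
            return "shared";              # no new data written
        }
        # Genuine collision: fall through and store the block normally.
    }
    $index{$digest} = $block;
    return "written";
}

print store_block("A" x 4096), "\n";      # written
print store_block("A" x 4096), "\n";      # shared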

For a backup farm or virtualization dataset, I personally wouldn't use
the read-verify stage.  But others will.

-chris




Re: Data De-duplication

2008-12-10 Thread seth huang
De-duplication is useful in data backup systems because of the high level of
data redundancy, but I'm not sure whether it is necessary for a
general-purpose fs. If you really want to do it, I would suggest the latter
option; file-level de-dup can be implemented in a user-level application.

Googling around, I found a previous discussion of this topic in the mail
archive; I hope it helps.
http://markmail.org/message/sdxos4s2bnckven5#query:+page:1+mid:sdxos4s2bnckven5+state:results
 

On Tue, Dec 9, 2008 at 10:48 PM, Oliver Mattos
[EMAIL PROTECTED] wrote:
 Hi,

 Say I download a large file from the net to /mnt/a.iso.  I then download
 the same file again to /mnt/b.iso.  These files now have the same
 content, but are stored twice since the copies weren't made with the bcp
 utility.

 The same occurs if a directory tree with duplicate files (created with
 bcp) is put through a non-aware program - for example tarred and then
 untarred again.

 This could be improved in two ways:

 1)  Make a utility which checks the checksums for all the data extents,
 and if the checksums of data match for two files then check the file
 data, and if the file data matches then keep only one copy.  It could be
 run as a cron job to free up disk space on systems where duplicate data
 is common (eg. virtual machine images)

 2)  Keep a tree of checksums for data blocks, so that a bit of data can
 be located by it's checksum.  Whenever a data block is about to be
 written check if the block matches any known block, and if it does then
 don't bother duplicating the data on disk.  I suspect this option may
 not be realistic for performance reasons.

 If either is possible then thought needs to be put into if it's worth
 doing on a file level, or a partial-file level (ie. if I have two
 similar files, can the space used by the identical parts of the files be
 saved)

 Has any thought been put into either 1) or 2) - are either possible or
 desired?

 Thanks
 Oliver



Re: Data De-duplication

2008-12-10 Thread Miguel Figueiredo Mascarenhas Sousa Filipe
Hi all,

On Tue, Dec 9, 2008 at 10:48 PM, Oliver Mattos
[EMAIL PROTECTED] wrote:
 Hi,

 Say I download a large file from the net to /mnt/a.iso.  I then download
 the same file again to /mnt/b.iso.  These files now have the same
 content, but are stored twice since the copies weren't made with the bcp
 utility.

 The same occurs if a directory tree with duplicate files (created with
 bcp) is put through a non-aware program - for example tarred and then
 untarred again.

 This could be improved in two ways:

 1)  Make a utility which checks the checksums for all the data extents,
 and if the checksums of data match for two files then check the file
 data, and if the file data matches then keep only one copy.  It could be
 run as a cron job to free up disk space on systems where duplicate data
 is common (eg. virtual machine images)


I have a Perl script that does this, made by a friend of mine.
First it sweeps a dir/ and stat()s every file, putting all files with the
same size X in a linked list on a hashtable entry for size X.
Then it md5sums all files with the same byte size to confirm whether they
really are copies of each other.

Because it first only stats, and only md5sums files with potential
duplicates, it's faster than regular scripts I've seen.

Do you want this?
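
For reference, a rough sketch of that stat-then-hash approach (this is not
Miguel's actual script, just an illustration of the idea; paths and output
format here are arbitrary):

#!/usr/bin/perl
# Group files by size first, then md5sum only the same-size candidates.
use strict;
use warnings;
use File::Find;
use Digest::MD5;

my %by_size;
find(sub {
    return unless -f $_;
    push @{ $by_size{ -s _ } }, $File::Find::name;
}, shift || '.');

for my $size (keys %by_size) {
    my @files = @{ $by_size{$size} };
    next if @files < 2;                  # a unique size cannot be a duplicate
    my %by_md5;
    for my $file (@files) {
        open my $fh, '<', $file or next;
        binmode $fh;
        push @{ $by_md5{ Digest::MD5->new->addfile($fh)->hexdigest } }, $file;
    }
    for my $dups (grep { @$_ > 1 } values %by_md5) {
        print join("\n\t", "Duplicates ($size bytes):", @$dups), "\n";
    }
}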


 2)  Keep a tree of checksums for data blocks, so that a bit of data can
 be located by it's checksum.  Whenever a data block is about to be
 written check if the block matches any known block, and if it does then
 don't bother duplicating the data on disk.  I suspect this option may
 not be realistic for performance reasons.

 If either is possible then thought needs to be put into if it's worth
 doing on a file level, or a partial-file level (ie. if I have two
 similar files, can the space used by the identical parts of the files be
 saved)

 Has any thought been put into either 1) or 2) - are either possible or
 desired?

 Thanks
 Oliver





-- 
Miguel Sousa Filipe


Re: Data De-duplication

2008-12-10 Thread Chris Mason
On Tue, 2008-12-09 at 22:48 +, Oliver Mattos wrote:
 Hi,
 
 Say I download a large file from the net to /mnt/a.iso.  I then download
 the same file again to /mnt/b.iso.  These files now have the same
 content, but are stored twice since the copies weren't made with the bcp
 utility.
 
 The same occurs if a directory tree with duplicate files (created with
 bcp) is put through a non-aware program - for example tarred and then
 untarred again.
 
 This could be improved in two ways:
 
 1)  Make a utility which checks the checksums for all the data extents,
 and if the checksums of data match for two files then check the file
 data, and if the file data matches then keep only one copy.  It could be
 run as a cron job to free up disk space on systems where duplicate data
 is common (eg. virtual machine images)
 

Sage did extend the ioctl used by bcp to be able to deal with ranges of
file bytes.  So, it could be used to do the actual COW step for this
utility.
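
For anyone wanting to experiment, a very rough sketch of calling a
range-clone ioctl from Perl follows.  The ioctl number and argument layout
are my assumptions about the btrfs interface (an _IOW(0x94, 13, ...) taking
s64 src_fd plus u64 src_offset/src_length/dest_offset); check them against
your kernel's btrfs headers before trusting it:

#!/usr/bin/perl
# Sketch only: clone the whole source file into the destination at offset 0.
use strict;
use warnings;

my $BTRFS_IOC_CLONE_RANGE = 0x4020940d;   # assumed value; verify in ioctl.h

sub clone_range {
    my ($src_fh, $src_off, $len, $dst_fh, $dst_off) = @_;
    # struct: s64 src_fd, u64 src_offset, u64 src_length, u64 dest_offset
    my $args = pack("qQQQ", fileno($src_fh), $src_off, $len, $dst_off);
    return ioctl($dst_fh, $BTRFS_IOC_CLONE_RANGE, $args);
}

my ($src_name, $dst_name) = @ARGV;
open my $src, '<',  $src_name or die "open $src_name: $!\n";
open my $dst, '+<', $dst_name or die "open $dst_name: $!\n";
# Length 0 is taken here to mean "to the end of the source file".
clone_range($src, 0, 0, $dst, 0) or die "clone range ioctl failed: $!\n";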

 2)  Keep a tree of checksums for data blocks, so that a bit of data can
 be located by it's checksum.  Whenever a data block is about to be
 written check if the block matches any known block, and if it does then
 don't bother duplicating the data on disk.  I suspect this option may
 not be realistic for performance reasons.
 

When compression was added, the writeback path was changed to make
option #2 viable, at least in the case where the admin is willing to
risk hash collisions on strong hashes.  When a direct read comparison is
required before sharing blocks, it is probably best done by a stand-alone
utility, since we don't want to wait for a read of a full extent every
time we want to write one.

 If either is possible then thought needs to be put into if it's worth
 doing on a file level, or a partial-file level (ie. if I have two
 similar files, can the space used by the identical parts of the files be
 saved)

From the kernel, the easiest granularity is the block level.

 
 Has any thought been put into either 1) or 2) - are either possible or
 desired?

These are definitely on the long-term feature list.  I don't think we'll
get them before 1.0, but it's an important feature.

-chris




Re: Data De-duplication

2008-12-10 Thread Oliver Mattos
  2)  Keep a tree of checksums for data blocks, so that a bit of data can
  be located by it's checksum.  Whenever a data block is about to be
  written check if the block matches any known block, and if it does then
  don't bother duplicating the data on disk.  I suspect this option may
  not be realistic for performance reasons.
  
 
 When compression was added, the writeback path was changed to make
 option #2 viable, at least in the case where the admin is willing to
 risk hash collisions on strong hashes.  When the a direct read
 comparison is required before sharing blocks, it is probably best done
 by a stand alone utility, since we don't want wait for a read of a full
 extent every time we want to write on.
 

Can we assume hash collisions won't occur?  I mean, if it's a 256-bit
hash then even with 256TB of data, and one hash per block, the chances
of collision are still too small to calculate on gnome calculator.
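
For the record, a back-of-envelope version of that claim (my arithmetic,
assuming 4 KiB blocks and the usual birthday bound):

#!/usr/bin/perl
# 256 TB in 4 KiB blocks is 2^36 blocks; for a 256-bit hash the birthday
# bound puts the collision probability at roughly n^2 / 2^257.
use strict;
use warnings;

my $log2_blocks = 36;                      # 256 TB / 4 KiB
my $log2_p      = 2 * $log2_blocks - 257;  # log2(n^2 / 2^257)
printf "collision probability is about 2^%d\n", $log2_p;   # ~ 2^-185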

The only issue is that if the hash algorithm is later found to be flawed, a
malicious bit of data could be stored on the disk whose hash would
collide with some more important data, potentially allowing the contents
of one file to be replaced with another.

Even if we don't assume hash collisions won't occur (e.g. for CRCs), the
write performance when writing duplicate files is equal to the read
performance of the disk, since for every block written by a program one
block will need to be read, and no blocks written.  This is still better
than the plain write case (as most devices read faster than they write),
and has the added advantage of saving lots of space.



Re: Data De-duplication

2008-12-10 Thread Oliver Mattos
On Wed, 2008-12-10 at 13:07 -0700, Anthony Roberts wrote:
  When the a direct read
  comparison is required before sharing blocks, it is probably best done
  by a stand alone utility, since we don't want wait for a read of a full
  extent every time we want to write on.
 
 Can a stand-alone utility do this without a race? Eg, if a portion of one
 of the files has already been read by the util, but is changed before the
 util has a chance to actually do the ioctl to duplicate the range.
 
 If we assume someone is keeping a hash table of likely matches, might it
 make sense to have a verify-and-dup ioctl that does this safely?
 

This sounds good because then a standard user could safely do this to
any file as long as he had read access to both files, but wouldn't need
write access (since the operation doesn't actually modify the file
data).



Re: Data De-duplication

2008-12-10 Thread Ray Van Dolson
I lost the original post so I'm jumping in at the wrong thread-point :)
Someone mentioned that the primary usage of de-dup is in the backup
realm.  True perhaps currently, but de-dup IMO is *the* killer app in
the world of virtualization and is a huge reason why we're picking
NetApp at work to back our NFS VMware DataStores.  We easily see 50%
savings in space.

I know of only one other production filesystem implementation of
data-dedup -- GreenBytes has it in their ZFS-based storage product.

I'm not sure why this hasn't caught on, but as soon as a solid and fast
implementation of it exists in the Linux world I really think it can
catch on for VM datastores.  I know we've hollered at Sun as to why
they haven't rolled it out for ZFS yet!

Anyway, I know it's on the roadmap; I just like throwing in my $0.02 once
in a while on how big a feature I think this could be.

Great job all :)
Ray


Re: Data De-duplication

2008-12-10 Thread Oliver Mattos
I see quite a few uses for this, and while it looks like the kernel-mode
automatic de-dup-on-write code might be costly in performance, require disk
format changes, and be controversial, it sounds like the user-mode
utility could be implemented today.

It looks like a simple script could do the job - just iterate through
every file in the filesystem, run md5sum on every block of every file,
and whenever a duplicate is found call an ioctl to remove the duplicate
data (a rough sketch of the scanning half follows below).  By md5summing
each block it can also effectively compress disk images.

While not very efficient, it should work, and having something like this
in the toolkit would mean that as soon as btrfs gets stable enough for
everyday use it would immediately outdo every other Linux filesystem in
terms of space efficiency for some workloads.
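
A rough sketch of the scanning half of that idea (digest, block size and
output format are arbitrary, and the space-saving step is only a comment,
since no ready-made ioctl existed for it at the time):

#!/usr/bin/perl
# Walk a tree, hash every fixed-size block of every file, and report blocks
# that have been seen before.  Reporting only; nothing is modified.
use strict;
use warnings;
use File::Find;
use Digest::MD5 qw(md5);

my $blocksize = 4096;
my %seen;                    # digest -> "file:offset" of first occurrence

find(sub {
    return unless -f $_;
    open my $fh, '<', $_ or return;
    binmode $fh;
    my ($buf, $off) = ('', 0);
    while ((my $n = sysread($fh, $buf, $blocksize)) > 0) {
        my $digest = md5($buf);
        if (my $first = $seen{$digest}) {
            print "dup: $File::Find::name:$off matches $first\n";
            # a real utility would ask the filesystem to share the extent here
        } else {
            $seen{$digest} = "$File::Find::name:$off";
        }
        $off += $n;
    }
}, shift || '.');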

In the long term, kernel-mode de-duplication would probably be good.  I'm
willing to bet even the average user has, say, 1-2% of data duplicated
somewhere on the HD due to accidental copies instead of moves, the same
application installed to two different paths, two users who happen to
have the same file saved in their home folders, etc., so even the
average user will benefit slightly.

I'm considering writing that script to test on my ext3 disk just to see
how much duplicate wasted data I really have.

Thanks
Oliver




Re: Data De-duplication

2008-12-10 Thread Tracy Reed
On Wed, Dec 10, 2008 at 09:42:16PM +, Oliver Mattos spake thusly:
 I'm considering writing that script to test on my ext3 disk just to see
 how much duplicate wasted data I really have.

Check out the fdupes command. In Fedora 8 it is in the yum repo as
fdupes-1.40-10.fc8

-- 
Tracy Reed
http://cuttedge.com


Re: Data De-duplication

2008-12-10 Thread Ray Van Dolson
On Wed, Dec 10, 2008 at 01:57:54PM -0800, Tracy Reed wrote:
 On Wed, Dec 10, 2008 at 09:42:16PM +, Oliver Mattos spake thusly:
  I'm considering writing that script to test on my ext3 disk just to see
  how much duplicate wasted data I really have.
 
 Check out the fdupes command. In Fedora 8 it is in the yum repo as
 fdupes-1.40-10.fc8
 

Neat tool.  I guess this just checks for duplicate files though.  It
would be interesting to see how many duplicate *blocks* there are
across the filesystem, agnostic to files...

Is this something your script does, Oliver?

Ray


Re: Data De-duplication

2008-12-10 Thread Oliver Mattos

 It would be interesting to see how many duplicate *blocks* there are
 across the filesystem, agnostic to files...
 
 Is this somthing your script does Oliver?

My script doesn't yet exist, although when created it would, yes.  I was
thinking of just making a BASH script that uses dd to extract 512-byte
chunks of files, pipes them through md5sum, and saves the results in a
large index file.  Then just iterate through the index file looking for
duplicate hashes.

In fact that sounds so easy I might do it right now...  (only to
proof-of-concept stage - a real utility would probably want to be written
in a compiled language and use proper trees for faster searching)



Data De-duplication

2008-12-09 Thread Oliver Mattos
Hi,

Say I download a large file from the net to /mnt/a.iso.  I then download
the same file again to /mnt/b.iso.  These files now have the same
content, but are stored twice since the copies weren't made with the bcp
utility.

The same occurs if a directory tree with duplicate files (created with
bcp) is put through a non-aware program - for example tarred and then
untarred again.

This could be improved in two ways:

1)  Make a utility which checks the checksums for all the data extents;
if the checksums match for two files, check the file data, and if the
file data matches then keep only one copy.  It could be run as a cron
job to free up disk space on systems where duplicate data is common
(e.g. virtual machine images).

2)  Keep a tree of checksums for data blocks, so that a piece of data can
be located by its checksum.  Whenever a data block is about to be
written, check whether the block matches any known block, and if it does
then don't bother duplicating the data on disk.  I suspect this option
may not be realistic for performance reasons.

If either is possible then thought needs to be put into whether it's
worth doing at the file level or at a partial-file level (i.e. if I have
two similar files, can the space used by the identical parts of the
files be saved?).

Has any thought been put into either 1) or 2) - is either possible or
desired?

Thanks
Oliver
