Re: Data Deduplication with the help of an online filesystem check

2009-06-05 Thread Tomasz Chmielewski

Chris Mason wrote:

On Thu, Jun 04, 2009 at 10:49:19AM +0200, Thomas Glanzmann wrote:

Hello Chris,


 My question now is: how often can a block in btrfs be referenced?

The exact answer depends on if we are referencing it from a single
file or from multiple files.  But either way it is roughly 2^32.

could you please explain to me what underlying data structure is used to
track whether a block is still referenced or already free? Is a counter
used, a bitmap (but that can't be the case if the limit is 2^32), or some
sort of list? I assume that a counter is used. If that is the case, I assume
that when a snapshot, for example, is deleted, the reference counter of every
block that was referenced by the snapshot is decremented by one. Is this
correct, or am I missing something here?


It is a counter and a back reference.  With Yan Zheng's new format work,
the limit is now 2^64.

When a snapshot is deleted, the btree is walked to efficiently drop the
references on the blocks it referenced.

From a dedup point of view, we'll want the dedup file to hold a
reference on the file extents.  The kernel ioctl side of things will
take care of that part.


I wonder how well deduplication would work with defragmentation? One
excludes the other to some extent.



--
Tomasz Chmielewski
http://wpkg.org


Re: Data Deduplication with the help of an online filesystem check

2009-06-05 Thread Chris Mason
On Fri, Jun 05, 2009 at 02:20:48PM +0200, Tomasz Chmielewski wrote:
 Chris Mason wrote:
 On Thu, Jun 04, 2009 at 10:49:19AM +0200, Thomas Glanzmann wrote:
 Hello Chris,

  My question now is: how often can a block in btrfs be referenced?
 The exact answer depends on if we are referencing it from a single
 file or from multiple files.  But either way it is roughly 2^32.
 could you please explain to me what underlying data structure is used to
 track whether a block is still referenced or already free? Is a counter
 used, a bitmap (but that can't be the case if the limit is 2^32), or some
 sort of list? I assume that a counter is used. If that is the case, I assume
 that when a snapshot, for example, is deleted, the reference counter of every
 block that was referenced by the snapshot is decremented by one. Is this
 correct, or am I missing something here?

 It is a counter and a back reference.  With Yan Zheng's new format work,
 the limit is now 2^64.

 When a snapshot is deleted, the btree is walked to efficiently drop the
 references on the blocks it referenced.

 From a dedup point of view, we'll want the dedup file to hold a
 reference on the file extents.  The kernel ioctl side of things will
 take care of that part.

 I wonder how well deduplication would work with defragmentation? One
 excludes the other to some extent.

Very much so ;)  Ideally we end up doing dedup in large extents, but it
will definitely increase the overall fragmentation of the FS.

-chris



Re: Data Deduplication with the help of an online filesystem check

2009-06-05 Thread Tomasz Chmielewski

Chris Mason wrote:

I wonder how well deduplication would work with defragmentation? One
excludes the other to some extent.


Very much so ;)  Ideally we end up doing dedup in large extents, but it
will definitely increase the overall fragmentation of the FS.


Defragmentation could lead to interesting problems if it's not aware of
deduplication.



I can imagine freeing up space (i.e., as seen by userspace) where
duplicated blocks are found, but keeping track of the duplicated blocks
internally (and not allowing any block which is duplicated to be overwritten).

Only when free space is really needed would blocks which have multiple
copies actually be deduplicated, freeing them to be overwritten with new data.


Perhaps complicated though.


--
Tomasz Chmielewski


Re: Data Deduplication with the help of an online filesystem check

2009-06-04 Thread Thomas Glanzmann
Hello Chris,

   My question now is: how often can a block in btrfs be referenced?

 The exact answer depends on if we are referencing it from a single
 file or from multiple files.  But either way it is roughly 2^32.

could you please explain to me what underlying data structure is used to
track whether a block is still referenced or already free? Is a counter
used, a bitmap (but that can't be the case if the limit is 2^32), or some
sort of list? I assume that a counter is used. If that is the case, I assume
that when a snapshot, for example, is deleted, the reference counter of every
block that was referenced by the snapshot is decremented by one. Is this
correct, or am I missing something here?

Thomas


Re: Data Deduplication with the help of an online filesystem check

2009-05-24 Thread Thomas Glanzmann
Hello Heinz,

 Hi, during the last half year I thought a little bit about doing dedup
 for my backup program: not only with fixed blocks (which is
 implemented), but with moving blocks (with all offsets in a file: 1
 byte, 2 bytes, ...). That means I have to do *lots* of comparisons
 (size of file - blocksize).  Even though it's not the same, it must be very
 fast, and that's the same problem as the one discussed here.

Because I just stumbled across it, I wanted to let you know about an
interesting approach that NetApp is using for its Virtual Tape Library:

http://www.netapp.com/us/communities/tech-ontap/vtl-dedupe.html

Thomas


Re: Data Deduplication with the help of an online filesystem check

2009-05-06 Thread Sander
Heinz-Josef Claes wrote (ao):
 Am Dienstag, 28. April 2009 19:38:24 schrieb Chris Mason:
  On Tue, 2009-04-28 at 19:34 +0200, Thomas Glanzmann wrote:
   Hello,
  
I wouldn't rely on crc32: it is not a strong hash.
Such deduplication can lead to various problems,
including security ones.
  
   sure thing, did you think of replacing crc32 with sha1 or md5, is this
   even possible (is there enough space reserved so that the change can be
   done without changing the filesystem layout) at the moment with btrfs?
 
  It is possible, there's room in the metadata for about 4k of
  checksum for each 4k of data.  The initial btrfs code used sha256, but
  the real limiting factor is the CPU time used.
 
  -chris
 
 It's not only CPU time, it's also memory. You need 32 bytes for each 4k block.
 It needs to be in RAM for performance reasons.

Less so with SSD I would assume.

-- 
Humilis IT Services and Solutions
http://www.humilis.net


Re: Data Deduplication with the help of an online filesystem check

2009-05-05 Thread Heinz-Josef Claes
On Tue, 5 May 2009 07:29:45 +1000
Dmitri Nikulin dniku...@gmail.com wrote:

 On Tue, May 5, 2009 at 5:11 AM, Heinz-Josef Claes hjcl...@web.de wrote:
  Hi, during the last half year I thought a little bit about doing dedup for
  my backup program: not only with fixed blocks (which is implemented), but
  with moving blocks (with all offsets in a file: 1 byte, 2 bytes, ...). That
  means I have to do *lots* of comparisons (size of file - blocksize).
  Even though it's not the same, it must be very fast, and that's the same
  problem as the one discussed here.
 
  My solution (not yet implemented) is as follows (hopefully I remember well):
 
  I calculate a checksum of 24 bit. (there can be another size)
 
  This means, I can have 2^24 different checksums.
 
  Therefore, I hold a bit vector of 0.5 GB in memory (I hope I remember well,
  I'm just in a hotel and have no calculator): one bit for each possibility.
  This vector is initialized with zeros.
 
  For each calculated checksum of a block, I set the according bit in the bit
  vector.
 
  It's very fast, to check if a block with a special checksum exists in the
  filesystem (backup for me) by checking the appropriate bit in the bit
  vector.
 
  If it doesn't exist, it's a new block
 
  If it exists, there needs to be a separate 'real' check whether it's really the
  same block (which is slow, but that's happening 1% of the time).
 
 Which means you have to refer to each block in some unique way from
 the bit vector, making it a block pointer vector instead. That's only
 64 times more expensive for a 64 bit offset...
 
It was not the idea to have a pointer vector, only a bit vector. A pointer
vector would be too big to hold in RAM. Therefore, I need to go to the disk
after checking with the more exact md5sum (which I wanted to use). The bit
vector is only needed to get a very quick decision in most of the cases (a
speedup). But I have no idea whether it fits this use case. I'm not a
filesystem developer ;-)
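
A minimal C sketch of the bit-vector screening idea described above,
assuming a 24-bit checksum; the checksum function is a placeholder and the
slow byte-by-byte fallback is left out:

#include <stdint.h>
#include <stdlib.h>

#define CSUM_BITS  24
#define VECTOR_LEN ((1u << CSUM_BITS) / 8)   /* 2^24 bits = 2 MiB */

static uint8_t *seen;                        /* one bit per possible checksum */

/* Placeholder 24-bit checksum; a real tool would use e.g. a truncated CRC. */
static uint32_t csum24(const void *block, size_t len)
{
    const uint8_t *p = block;
    uint32_t h = 0;

    while (len--)
        h = (h * 33 + *p++) & 0xffffff;
    return h;
}

static int init_screen(void)
{
    seen = calloc(1, VECTOR_LEN);            /* initialized with zeros */
    return seen ? 0 : -1;
}

/*
 * Returns 0 if the block is certainly new (its bit was clear), 1 if a block
 * with the same checksum was seen before and the slow 'real' comparison
 * against the stored data is required.
 */
static int screen_block(const void *block, size_t len)
{
    uint32_t c = csum24(block, len);
    int candidate = (seen[c >> 3] >> (c & 7)) & 1;

    seen[c >> 3] |= 1 << (c & 7);            /* remember this checksum */
    return candidate;
}

(Note: 2^24 bits is only 2 MiB; the 0.5 GB figure quoted above would
correspond to a 32-bit checksum, i.e. 2^32 bits.)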

 Since the overwhelming majority of combinations will never appear in
 practice, you are much better served with a self-sizing data structure
 like a hash map or even a binary tree, or a hash map with each bucket
 being a binary tree, etc... You can use any sized hash and it won't
 affect the number of nodes you have to store. You can trade off to CPU
 or RAM easily as required, just by selecting an appropriate data
 structure. A bit vector and especially a pointer vector have extremely
 bad any case RAM requirements because even if you're deduping a mere
 10 blocks you're still allocating and initialising 2^24 offsets. The
 least you could do is adaptively switch to a more efficient data
 structure if you see the number of blocks is low enough.
 
 -- 
 Dmitri Nikulin
 
 Centre for Synchrotron Science
 Monash University
 Victoria 3800, Australia


-- 
Heinz-Josef Claes hjcl...@web.de


Re: Data Deduplication with the help of an online filesystem check

2009-05-05 Thread Thomas Glanzmann
Hello Jan,

* Jan-Frode Myklebust janfr...@tanso.net [090504 20:20]:
 thin or shallow clones sounds more like sparse images. I believe
 linked clones is the word for running multiple virtual machines off
 a single gold image. Ref, the VMware View Composer section of:

Not exactly. VMware has one golden image and then tracks the blocks that
differ from the golden image by accounting for the differences in a
snapshot file on a block-level basis. So yes, it is a ,,sparse file'',
but implemented in userland.

 All desktops that are linked to a master image can be patched or updated
  simply by updating the master image, without affecting users’ settings,
  data or applications.

True for their desktops, because each desktop consists of a virtual
machine which is created when a user connects and destroyed when he
disconnects. Bottom line: the VM for the thin client only exists while
the user is connected. That of course makes it easy to update the master
image. But for VMs that are not desktops that get created/destroyed
at least once a day, this doesn't scale.

Thomas


Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Tomasz Chmielewski

Ric Wheeler schrieb:

One thing in the above scheme that would be really interesting for all 
possible hash functions is maintaining good stats on hash collisions, 
effectiveness of the hash, etc. There has been a lot of press about MD5 
hash collisions for example - it would be really neat to be able to 
track real world data on those,


See here (The hashing function):

http://backuppc.sourceforge.net/faq/BackupPC.html#some_design_issues

It's not real world data, but it gives some overview which applies here.


--
Tomasz Chmielewski
http://wpkg.org


Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Thomas Glanzmann
Hello Ric,

 (1) Block level or file level dedup?

what is the difference between the two?

 (2) Inband dedup (during a write) or background dedup?

I think inband dedup is way too intensive on resources (memory) and also
would kill every performance benchmark. So I think offline dedup is
the right way to go.

 (3) How reliably can you protect the pool of blocks? How reliably can
 you protect the database that maps hashes to blocks?

You have to lock down the i/o requests for the blocks in question and
compare them byte by byte anyway, just to make sure.

 (4) Can you give users who are somewhat jaded confidence in your
 solution (this is where stats come in very handy!)

For virtual machines you can reduce the used data by 1/3. Of course it
can blow up in your face if you don't watch your physical resources
closely.

Thomas


Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Ric Wheeler

Thomas Glanzmann wrote:

Hello Ric,


(1) Block level or file level dedup?


what is the difference between the two?


(2) Inband dedup (during a write) or background dedup?


I think inband dedup is way too intensive on resources (memory) and also
would kill every performance benchmark. So I think offline dedup is
the right way to go.


I would not categorize it as offline, but just not as inband (i.e., you can run 
a low priority background process to handle dedup).  Offline windows are 
extremely rare in production sites these days and it could take a very long time 
to do dedup at the block level over a large file system :-)





(3) How reliably can you protect the pool of blocks? How reliably can
you protect the database that maps hashes to blocks?


You have to lock down the i/o requests for the blocks in question and
compare them byte by byte anyway, just to make sure.


Yes, one advantage we had in centera was that the objects were read-only and 
whole files, so this was not an issue for us.





(4) Can you give users who are somewhat jaded confidence in your
solution (this is where stats come in very handy!)


For virtual machines you can reduce the used data by 1/3. Of course it
can blow up in your face if you don't watch your physical resources
closely.

Thomas


1/3 is not sufficient for dedup in my opinion - you can get that with normal 
compression at the block level.


Put another way, if the baseline is bzip2 levels of compression for the block 
device, when is the complexity of dedup going to pay off?


I think that Chris already mentioned that you can (for virt OS images) also 
imagine using copy on write snapshots (of something that is mostly read-only 
like the OS system partitions).  Make 50 copy-on-write snapshots of / 
(excluding /home) and you have an effective compression of 98%, at least
until someone rudely starts writing :-)


ric




Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Ric Wheeler

On 05/04/2009 10:39 AM, Tomasz Chmielewski wrote:

Ric Wheeler schrieb:

One thing in the above scheme that would be really interesting for 
all possible hash functions is maintaining good stats on hash 
collisions, effectiveness of the hash, etc. There has been a lot of 
press about MD5 hash collisions for example - it would be really neat 
to be able to track real world data on those,


See here (The hashing function):

http://backuppc.sourceforge.net/faq/BackupPC.html#some_design_issues

It's not real world data, but it gives some overview which applies 
here.





I have lots of first hand real world data from 5 years at EMC in the 
Centera group where we used various types of hashes to do single 
instancing at the file level, but those are unfortunately not public :-) 
Note that other parts of EMC do dedup at the block level (Avamar most 
notably).


The key to doing dedup at scale is answering a bunch of questions:

(1) Block level or file level dedup?
(2) Inband dedup (during a write) or background dedup?
(3) How reliably can you protect the pool of blocks? How reliably 
can you protect the database that maps hashes to blocks?
(4) Can you give users who are somewhat jaded confidence in your 
solution (this is where stats come in very handy!)
(5) In the end, dedup is basically a data compression trick - how 
effective is it (including the costs of the metadata, etc) compared to 
less complex schemes? What does it cost in CPU and DRAM, and what is the
impact on the foreground workload?


Dedup is a very active area in the storage world, but you need to be 
extremely robust in the face of failure since a single block failure 
could (worst case!) lose all stored data in the pool :-)


Regards,

Ric



Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Ric Wheeler

On 04/28/2009 01:41 PM, Michael Tharp wrote:

Thomas Glanzmann wrote:

no, I just used the md5 checksum. And even if I have a hash collision,
which is highly unlikely, it still gives a good ballpark figure.


I'd start with a crc32 and/or MD5 to find candidate blocks, then do a 
bytewise comparison before actually merging them. Even the risk of an 
accidental collision is too high, and considering there are plenty of 
birthday-style MD5 attacks it would not be extraordinarily difficult 
to construct a block that collides with e.g. a system library.


Keep in mind that although digests do a fairly good job of making 
unique identifiers for larger chunks of data, they can only hold so 
many unique combinations. Considering you're comparing blocks of a few 
kibibytes in size it's best to just do a foolproof comparison. There's 
nothing wrong with using a checksum/digest as a screening mechanism 
though.


-- m. tharp


One thing in the above scheme that would be really interesting for all 
possible hash functions is maintaining good stats on hash collisions, 
effectiveness of the hash, etc. There has been a lot of press about MD5 
hash collisions for example - it would be really neat to be able to 
track real world data on those,


Ric



Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Thomas Glanzmann
Hello Andrey,

 As far as I understand, VMware already ships this gold image feature
 (as they call it) for Windows environments and claims it to be very
 efficient.

they call it ,,thin or shallow clones'' and ship it with desktop
virtualization (one VM per thin-client user) and for VMware Lab Manager.
Looking at the website content also revealed that VMware will have a
similar feature for their workhorse ,,esx server'' in the upcoming
release; however, my point still stands. Ship out a service pack for
Windows and you get 1.5 Gbyte of modified data that is not deduped, because
cloning is a once-in-a-VM's-life opportunity.

Thomas


Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Thomas Glanzmann
Ric,

 I would not categorize it as offline, but just not as inband (i.e., you can 
 run a low priority background process to handle dedup).

 Offline windows are extremely rare in production sites these days and
 it could take a very long time to do dedup at the block level over a
 large file system :-)

let me rephrase, by offline I meant asynchronous during off hours.

 1/3 is not sufficient for dedup in my opinion - you can get that with 
 normal compression at the block level.

1/3 is what real data from a production environment with a
mixed VM setup gives me, without compression.

Thomas


Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Jan-Frode Myklebust
On 2009-05-04, Thomas Glanzmann tho...@glanzmann.de wrote:

 As far as I understand, VMware already ships this gold image feature
 (as they call it) for Windows environments and claims it to be very
 efficient.

 they call it ,,thin or shallow clones''

thin or shallow clones sounds more like sparse images. I believe linked 
clones
is the word for running multiple virtual machines off a single gold image. Ref,
the VMware View Composer section of:

http://www.vmware.com/products/view/whatsincluded.html


 Looking at the website content also revealed that VMware will have a
 similar feature for their workhorse ,,esx server'' in the upcoming
 release; however, my point still stands. Ship out a service pack for
 Windows and you get 1.5 Gbyte of modified data that is not deduped.

All desktops that are linked to a master image can be patched or updated
 simply by updating the master image, without affecting users’ settings,
 data or applications.


  -jf



Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Heinz-Josef Claes

Thomas Glanzmann schrieb:

Ric,

  
I would not categorize it as offline, but just not as inband (i.e., you can 
run a low priority background process to handle dedup).



  

Offline windows are extremely rare in production sites these days and
it could take a very long time to do dedup at the block level over a
large file system :-)



let me rephrase, by offline I meant asynchronous during off hours.

  
Hi, during the last half year I thought a little bit about doing dedup 
for my backup program: not only with fixed blocks (which is 
implemented), but with moving blocks (with all offsets in a file: 1 
byte, 2 bytes, ...). That means I have to do *lots* of comparisons
(size of file - blocksize). Even though it's not the same, it must be very
fast, and that's the same problem as the one discussed here.


My solution (not yet implemented) is as follows (hopefully I remember well):

I calculate a checksum of 24 bit. (there can be another size)

This means, I can have 2^24 different checksums.

Therefore, I hold a bit vector of 0.5 GB in memory (I hope I remember
well, I'm just in a hotel and have no calculator): one bit for each
possibility. This vector is initialized with zeros.


For each calculated checksum of a block, I set the according bit in the 
bit vector.


It's very fast, to check if a block with a special checksum exists in 
the filesystem (backup for me) by checking the appropriate bit in the 
bit vector.


If it doesn't exist, it's a new block

If it exists, there needs to be a separate 'real' check whether it's really
the same block (which is slow, but that's happening 1% of the time).


I hope it is possible to understand my thoughts. I'm in a hotel and I 
possibly cannot track the emails in this list in the next hours or days.


Regards, HJC
1/3 is not sufficient for dedup in my opinion - you can get that with 
normal compression at the block level.



1/3 is what real data from a production environment with a
mixed VM setup gives me, without compression.

Thomas




Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Andrey Kuzmin
On Mon, May 4, 2009 at 10:06 PM, Jan-Frode Myklebust janfr...@tanso.net wrote:
 Looking at the website content also revealed that VMware will have a
 similar feature for their workhorse ,,esx server'' in the upcoming
 release; however, my point still stands. Ship out a service pack for
 Windows and you get 1.5 Gbyte of modified data that is not deduped.

 All desktops that are linked to a master image can be patched or updated
  simply by updating the master image, without affecting users’ settings,
  data or applications.

Writable clone support in the file-system coupled with hierarchical
settings/data/apps  layout, and you have what's described above.

Regards,
Andrey



  -jf




Re: Data Deduplication with the help of an online filesystem check

2009-05-04 Thread Dmitri Nikulin
On Tue, May 5, 2009 at 5:11 AM, Heinz-Josef Claes hjcl...@web.de wrote:
 Hi, during the last half year I thought a little bit about doing dedup for
 my backup program: not only with fixed blocks (which is implemented), but
 with moving blocks (with all offsets in a file: 1 byte, 2 bytes, ...). That
 means I have to do *lots* of comparisons (size of file - blocksize).
 Even though it's not the same, it must be very fast, and that's the same
 problem as the one discussed here.

 My solution (not yet implemented) is as follows (hopefully I remember well):

 I calculate a checksum of 24 bit. (there can be another size)

 This means, I can have 2^24 different checksums.

 Therefore, I hold a bit vector of 0.5 GB in memory (I hope I remember well,
 I'm just in a hotel and have no calculator): one bit for each possibility.
 This vector is initialized with zeros.

 For each calculated checksum of a block, I set the according bit in the bit
 vector.

 It's very fast, to check if a block with a special checksum exists in the
 filesystem (backup for me) by checking the appropriate bit in the bit
 vector.

 If it doesn't exist, it's a new block

 If it exists, there needs to be a separate 'real' check whether it's really the
 same block (which is slow, but that's happening 1% of the time).

Which means you have to refer to each block in some unique way from
the bit vector, making it a block pointer vector instead. That's only
64 times more expensive for a 64 bit offset...

Since the overwhelming majority of combinations will never appear in
practice, you are much better served with a self-sizing data structure
like a hash map or even a binary tree, or a hash map with each bucket
being a binary tree, etc... You can use any sized hash and it won't
affect the number of nodes you have to store. You can trade off to CPU
or RAM easily as required, just by selecting an appropriate data
structure. A bit vector and especially a pointer vector have extremely
bad any case RAM requirements because even if you're deduping a mere
10 blocks you're still allocating and initialising 2^24 offsets. The
least you could do is adaptively switch to a more efficient data
structure if you see the number of blocks is low enough.
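
A sketch of the kind of self-sizing structure suggested here, using POSIX
tsearch() as the binary tree so that nodes are only allocated for checksums
that actually occur; the 64-bit checksum type and the block reference field
are illustrative assumptions, not part of any existing dedup tool:

#include <search.h>
#include <stdint.h>
#include <stdlib.h>

struct csum_node {
    uint64_t csum;      /* whatever hash the scanner computes         */
    uint64_t blockref;  /* where the first block with this hash lives */
};

static void *csum_tree;     /* root handle managed by tsearch()       */

static int cmp_csum(const void *a, const void *b)
{
    const struct csum_node *x = a, *y = b;

    return (x->csum > y->csum) - (x->csum < y->csum);
}

/*
 * Record a block's checksum.  Returns the existing node if this checksum
 * was already seen (a dedup candidate that still needs a byte-by-byte
 * verification), or NULL if the checksum is new.
 */
static struct csum_node *index_block(uint64_t csum, uint64_t blockref)
{
    struct csum_node key = { .csum = csum };
    struct csum_node **slot, *node;

    slot = (struct csum_node **)tfind(&key, &csum_tree, cmp_csum);
    if (slot)
        return *slot;

    node = malloc(sizeof(*node));
    if (!node)
        return NULL;
    node->csum = csum;
    node->blockref = blockref;
    tsearch(node, &csum_tree, cmp_csum);    /* memory grows only with use */
    return NULL;
}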

-- 
Dmitri Nikulin

Centre for Synchrotron Science
Monash University
Victoria 3800, Australia


Re: Data Deduplication with the help of an online filesystem check

2009-04-29 Thread Michael Tharp

Thomas Glanzmann wrote:

Looking at this picture, when I'm going to implement the dedup code, do I also
have to take care to spread the blocks over the different devices or is
there already infrastructure in place that automates that process?


If you somehow had blocks duplicated exactly across two devices such 
that the deduplicator discarded all the blocks from one disk and kept 
all the blocks from the other disk, then there would be a problem. The 
best way to solve this would be to always keep the block on the 
least-full device and discard the rest, which shouldn't be too difficult 
(especially since this would be happening in kernelspace), but the 
cheapest solution is to randomize the choice which would be sufficient 
to prevent any further imbalance from developing.


-- m. tharp


Re: Data Deduplication with the help of an online filesystem check

2009-04-29 Thread Chris Mason
On Wed, 2009-04-29 at 14:03 +0200, Thomas Glanzmann wrote:
 Hello Chris,
 
  You can start with the code documentation section on
  http://btrfs.wiki.kernel.org
 
 I read through this and at the moment one questions come in my mind:
 
 http://btrfs.wiki.kernel.org/images-btrfs/7/72/Chunks-overview.png
 
 Looking at this picture, when I'm going to implement the dedup code, do I also
 have to take care to spread the blocks over the different devices or is
 there already infrastructure in place that automates that process?

The layering inside of btrfs means that you don't need to worry about
chunks or multiple devices.

But, in your ioctls you want to deal with [file, offset, len], not
directly with block numbers.  COW means that blocks can move around
without you knowing, and some of the btrfs internals will COW files in
order to relocate storage.

So, what you want is a dedup file (or files) where your DB knows a given
offset in the file has a given csum.

-chris
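
One possible shape for a record in the userland database described here,
keyed by checksum and pointing back at (file, offset, len) rather than at
raw block numbers; the field sizes and the choice of sha256 are assumptions,
not a defined format:

#include <stdint.h>

/*
 * One entry in the userland dedup index: "this csum was seen at this
 * (file, offset, len)".  Block numbers are deliberately not stored,
 * because COW can relocate them underneath us.
 */
struct dedup_record {
    uint8_t  csum[32];   /* strong checksum, e.g. sha256 of the range */
    uint64_t file_id;    /* inode number or an id private to the DB   */
    uint64_t offset;     /* byte offset within that file              */
    uint64_t len;        /* length of the checksummed range           */
};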




Re: Data Deduplication with the help of an online filesystem check

2009-04-29 Thread Thomas Glanzmann
Hello Chris,

 But, in your ioctls you want to deal with [file, offset, len], not
 directly with block numbers.  COW means that blocks can move around
 without you knowing, and some of the btrfs internals will COW files in
 order to relocate storage.

 So, what you want is a dedup file (or files) where your DB knows a given
 offset in the file has a given csum.

how do I track if a certain block has already been deduplicated?

So, if I only have file, offset, len and not the block number, is there
a way from userland to tell if two blocks already point to the same
block?

Thomas


Re: Data Deduplication with the help of an online filesystem check

2009-04-29 Thread Chris Mason
On Wed, 2009-04-29 at 15:58 +0200, Thomas Glanzmann wrote:
 Hello Chris,
 
  But, in your ioctls you want to deal with [file, offset, len], not
  directly with block numbers.  COW means that blocks can move around
  without you knowing, and some of the btrfs internals will COW files in
  order to relocate storage.
 
  So, what you want is a dedup file (or files) where your DB knows a given
  offset in the file has a given csum.
 
 how do I track if a certain block has already been deduplicated?
 

Your database should know, and the ioctl could check to see if the
source and destination already point to the same thing before doing
anything expensive.

 So, if I only have file, offset, len and not the block number, is there
 a way from userland to tell if two blocks already point to the same
 block?

You can use the fiemap ioctl.

-chris
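
For reference, a minimal sketch of asking that question with the fiemap
ioctl (FS_IOC_FIEMAP): if the first mapped extent of two ranges reports the
same fe_physical address, they already share storage.  Error handling is
trimmed and the 4 KiB probe size is an arbitrary choice:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>       /* FS_IOC_FIEMAP */
#include <linux/fiemap.h>   /* struct fiemap, struct fiemap_extent */

/* Physical byte address backing 'offset' in fd, or 0 if it cannot be mapped. */
static unsigned long long phys_of(int fd, unsigned long long offset)
{
    struct fiemap *fm = calloc(1, sizeof(*fm) + sizeof(struct fiemap_extent));
    unsigned long long phys = 0;

    if (!fm)
        return 0;
    fm->fm_start = offset;
    fm->fm_length = 4096;
    fm->fm_flags = FIEMAP_FLAG_SYNC;        /* flush delalloc first */
    fm->fm_extent_count = 1;

    if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0 && fm->fm_mapped_extents == 1)
        phys = fm->fm_extents[0].fe_physical;
    free(fm);
    return phys;
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <file-a> <file-b>\n", argv[0]);
        return 1;
    }

    int a = open(argv[1], O_RDONLY), b = open(argv[2], O_RDONLY);
    if (a < 0 || b < 0)
        return 1;

    unsigned long long pa = phys_of(a, 0), pb = phys_of(b, 0);
    printf("first blocks %s\n",
           pa && pa == pb ? "share an extent" : "are separate (or unmapped)");
    close(a);
    close(b);
    return 0;
}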




Re: Data Deduplication with the help of an online filesystem check

2009-04-29 Thread Thomas Glanzmann
Hello Chris,

 Your database should know, and the ioctl could check to see if the
 source and destination already point to the same thing before doing
 anything expensive.

I see.

  So, if I only have file, offset, len and not the block number, is there
  a way from userland to tell if two blocks already point to the same
  block?

 You can use the fiemap ioctl.

I see.

One more question: in my dedup test VMs, I had blocks that were often
referenced; these are the top ones. The last one alone saves 3 Gbyte of
data. My question now is: how often can a block in btrfs be referenced?

top 4 block counts for 8 Kbyte:

32055
59514
63606
415252

Thomas


Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread jim owens

Andrey Kuzmin wrote:

On Tue, Apr 28, 2009 at 2:02 PM, Chris Mason chris.ma...@oracle.com wrote:

On Tue, 2009-04-28 at 07:22 +0200, Thomas Glanzmann wrote:

Hello Chris,


There is a btrfs ioctl to clone individual files, and this could be used
to implement an online dedup.  But, since it is happening from userland,
you can't lock out all of the other users of a given file.
So, the dedup application would be responsible for making sure a given
file was not being changed while the dedup scan was running.

I see, does that mean that I can not do ,,dedup'' for files that are
currently opened by a userland program?

No, but it does mean the dedup done from userland is racey.  Picture


Race disappears if (background) dedupe is run against snapshot(s).


only if the snapshots are made read-only.

btrfs has writable snapshots.

jim



Regards,
Andrey


this:

process A:
   create some_file # some_file matches the contents of another file

dedup proc:
   check some_file
   decide to start block dedup

process A:
   modify some_file

dedup proc:
   progress through block dedup

So, this will happily replace blocks in some_file with the dedup blocks.
But there's no way to atomically swap them.

We could create new ioctls for this, basically a variant of the clone
file ioctl that makes sure a given set of pages has a given sum (or
strict memory contents) before doing the swap.

But they don't exist yet.

-chris
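
For context, a sketch of calling the whole-file clone ioctl mentioned above
from userland, assuming the BTRFS_IOC_CLONE definition from a current
<linux/btrfs.h> (at the time of this thread the ioctl number had to be
copied into the program by hand), and without any of the compare-and-swap
safety being proposed here:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>    /* BTRFS_IOC_CLONE */

/*
 * Make 'dst_path' share all of 'src_path's extents (a reflink).  Nothing
 * here locks out other writers, which is exactly the race being discussed.
 */
static int clone_whole_file(const char *src_path, const char *dst_path)
{
    int src = open(src_path, O_RDONLY);
    int dst = open(dst_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    int ret = -1;

    if (src >= 0 && dst >= 0) {
        ret = ioctl(dst, BTRFS_IOC_CLONE, src);
        if (ret)
            perror("BTRFS_IOC_CLONE");
    }
    if (src >= 0)
        close(src);
    if (dst >= 0)
        close(dst);
    return ret;
}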










Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Chris,
what blocksizes can I choose with btrfs? Do you think that it is
possible for an outsider like me to submit patches to btrfs that enable
dedup within three full-time days?

Thomas


Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Tomasz Chmielewski

Thomas Glanzmann schrieb:


300 Gbyte of used storage of several productive VMs with the following
operating systems running:
\begin{itemize}
\item Red Hat Linux 32 and 64 Bit (Release 3, 4 and 5)
\item SuSE Linux 32 and 64 Bit (SLES 9 and 10)
\item Windows 2003 Std. Edition 32 Bit
\item Windows 2003 Enterprise Edition 64 Bit
\end{itemize}
\begin{tabular}{r|r}
blocksize & Deduplicated Data \\
\hline
128k &  29.9 G \\
 64k &  41.3 G \\
 32k &  59.2 G \\
 16k &  82   G \\
  8k & 112   G \\
\end{tabular}

Bottom line: with an 8k blocksize you can get more than 33% of deduped data
running a productive set of VMs.


Did you just compare checksums, or did you also compare the data bit 
after bit if the checksums matched?



--
Tomasz Chmielewski


Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Edward Shishkin

Tomasz Chmielewski wrote:

Thomas Glanzmann schrieb:


300 Gbyte of used storage of several productive VMs with the following
operating systems running:
\begin{itemize}
\item Red Hat Linux 32 and 64 Bit (Release 3, 4 and 5)
\item SuSE Linux 32 and 64 Bit (SLES 9 and 10)
\item Windows 2003 Std. Edition 32 Bit
\item Windows 2003 Enterprise Edition 64 Bit
\end{itemize}
\begin{tabular}{r|r}
blocksize & Deduplicated Data \\
\hline
128k &  29.9 G \\
 64k &  41.3 G \\
 32k &  59.2 G \\
 16k &  82   G \\
  8k & 112   G \\
\end{tabular}

Bottom line: with an 8k blocksize you can get more than 33% of deduped data
running a productive set of VMs.


Did you just compare checksums, 


I wouldn't rely on crc32: it is not a strong hash.
Such deduplication can lead to various problems,
including security ones.

or did you also compare the data bit after bit if the checksums 
matched?




Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello,

 I wouldn't rely on crc32: it is not a strong hash.
 Such deduplication can lead to various problems,
 including security ones.

sure thing, did you think of replacing crc32 with sha1 or md5, is this
even possible (is there enough space reserved so that the change can be
done without changing the filesystem layout) at the moment with btrfs?

Thomas


Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello Chris,

  Is there a checksum for every block in btrfs?

 Yes, but they are only crc32c.

I see, is it easily possible to exchange that with sha-1 or md5?

  Is it possible to retrieve these checksums from userland?

 Not today.  The sage developers sent a patch to make an ioctl for
 this, but since it was hard coded to crc32c I haven't taken it yet.

I see.

 Yes, btrfs uses extents but for the purposes of dedup, 4k blocksizes
 are fine.

Does that mean that I can dedup 4k blocks even if you use extents?

 Virtual machines are the ideal dedup workload.  But, you do get a big
 portion of the dedup benefits by just starting with a common image and
 cloning it instead of doing copies of each vm.

True, the operating system can be almost completely deduped but as soon
as you start patching you lose the benefit.

Thomas


Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Chris Mason
On Tue, 2009-04-28 at 19:34 +0200, Thomas Glanzmann wrote:
 Hello,
 
  I wouldn't rely on crc32: it is not a strong hash.
  Such deduplication can lead to various problems,
  including security ones.
 
 sure thing, did you think of replacing crc32 with sha1 or md5, is this
 even possible (is there enough space reserved so that the change can be
 done without changing the filesystem layout) at the moment with btrfs?

It is possible, there's room in the metadata for about 4k of
checksum for each 4k of data.  The initial btrfs code used sha256, but
the real limiting factor is the CPU time used.

-chris




Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Chris Mason
On Tue, 2009-04-28 at 19:37 +0200, Thomas Glanzmann wrote:
 Hello Chris,
 
   Is there a checksum for every block in btrfs?
 
  Yes, but they are only crc32c.
 
 I see, is it easily possible to exchange that with sha-1 or md5?
 

Yes, but for the purposes of dedup, it's not exactly what you want.  You
want an index by checksum, and the current btrfs code indexes by logical
byte number in the disk.

So you need an extra index either way.  It makes sense to keep the
crc32c csums for fast verification of the data read from disk and only
use the expensive csums for dedup. 

   Is it possible to retrieve these checksums from userland?
 
  Not today.  The sage developers sent a patch to make an ioctl for
  this, but since it was hard coded to crc32c I haven't taken it yet.
 
 I see.
 
  Yes, btrfs uses extents but for the purposes of dedup, 4k blocksizes
  are fine.
 
 Does that mean that I can dedup 4k blocks even if you use extents?

Yes.

 
  Virtual machines are the ideal dedup workload.  But, you do get a big
  portion of the dedup benefits by just starting with a common image and
  cloning it instead of doing copies of each vm.
 
 True, the operating system can be almost completely deduped but as soon
 as you start patching you lose the benefit.

-chris




Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello,

 It is possible, there's room in the metadata for about 4k of
 checksum for each 4k of data.  The initial btrfs code used sha256, but
 the real limiting factor is the CPU time used.

I see. There are very efficient md5 implementations out there, for example,
especially if the code is written in assembler.

Thomas


Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Heinz-Josef Claes
Am Dienstag, 28. April 2009 19:38:24 schrieb Chris Mason:
 On Tue, 2009-04-28 at 19:34 +0200, Thomas Glanzmann wrote:
  Hello,
 
   I wouldn't rely on crc32: it is not a strong hash.
   Such deduplication can lead to various problems,
   including security ones.
 
  sure thing, did you think of replacing crc32 with sha1 or md5, is this
  even possible (is there enough space reserved so that the change can be
  done without changing the filesystem layout) at the moment with btrfs?

 It is possible, there's room in the metadata for about 4k of
 checksum for each 4k of data.  The initial btrfs code used sha256, but
 the real limiting factor is the CPU time used.

 -chris

It's not only CPU time, it's also memory. You need 32 bytes for each 4k block.
It needs to be in RAM for performance reasons.

hjc





Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Michael Tharp

Thomas Glanzmann wrote:

no, I just used the md5 checksum. And even if I have a hash collision,
which is highly unlikely, it still gives a good ballpark figure.


I'd start with a crc32 and/or MD5 to find candidate blocks, then do a 
bytewise comparison before actually merging them. Even the risk of an 
accidental collision is too high, and considering there are plenty of 
birthday-style MD5 attacks it would not be extraordinarily difficult to 
construct a block that collides with e.g. a system library.


Keep in mind that although digests do a fairly good job of making unique 
identifiers for larger chunks of data, they can only hold so many unique 
combinations. Considering you're comparing blocks of a few kibibytes in 
size it's best to just do a foolproof comparison. There's nothing wrong 
with using a checksum/digest as a screening mechanism though.


-- m. tharp
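
A sketch of the screen-then-verify pattern described above, using OpenSSL's
MD5() as the screening digest and memcmp() as the foolproof comparison; the
4 KiB block size is an assumption:

#include <string.h>
#include <openssl/md5.h>   /* MD5(), MD5_DIGEST_LENGTH */

#define BLOCK_SIZE 4096

/* Step 1: cheap screening.  Equal digests only make the blocks candidates. */
static int same_digest(const unsigned char *a, const unsigned char *b)
{
    unsigned char da[MD5_DIGEST_LENGTH], db[MD5_DIGEST_LENGTH];

    MD5(a, BLOCK_SIZE, da);
    MD5(b, BLOCK_SIZE, db);
    return memcmp(da, db, MD5_DIGEST_LENGTH) == 0;
}

/* Step 2: foolproof byte-wise comparison before anything is merged. */
static int blocks_identical(const unsigned char *a, const unsigned char *b)
{
    return same_digest(a, b) && memcmp(a, b, BLOCK_SIZE) == 0;
}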


Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello Chris,

 Right now the blocksize can only be the same as the page size.  For
 this external dedup program you have in mind, you could use any
 multiple of the page size.

perfect. Exactly what I need.

 Three days is probably not quite enough ;)  I'd honestly prefer the
 dedup happen entirely in the kernel in a setup similar to the
 compression code.

I see. I think that it wouldn't scale, because then all the checksums
need to be stored in memory, or at least in an efficient b*tree. For a
1 Tbyte filesystem with 4 kbyte blocks that would mean more than 5 G (!)
of memory (assuming a 16 byte checksum and a 4 byte block identifier, and
that leaves out the b*tree overhead for fast searching).
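
The arithmetic behind that estimate, assuming a 16 byte digest such as md5
plus a 4 byte block identifier per 4 kbyte block:

\[
\frac{1\,\mathrm{Tbyte}}{4\,\mathrm{kbyte}} \times (16 + 4)\,\mathrm{bytes}
  = 2^{28} \times 20\,\mathrm{bytes} \approx 5.4\,\mathrm{Gbyte}
\]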

 But, that would use _lots_ of CPU, so an offline dedup is probably a
 good feature even if we have transparent dedup.

I think that is the right way to go.

 Wire up a userland database that stores checksums and points to
 file, offset tuples

exactly. And if there is a way to retrieve the already calculated
checksums from kernel land, then it would be possible to implement a
,,systemcall'' that gives the kernel a hint about possibly duplicated
blocks (like providing a list of lists of blocks to the kernel that might
be duplicated because they have the same checksum). Then the kernel code
could dedup the blocks after comparing them byte by byte.

 Make the ioctl to replace a given file extent if and only if the file
 contents match a given checksum over a range of bytes.  The ioctl should
 be able to optionally do a byte compare of the src and destination pages
 to make 100% sure the data is really the same.

Exactly.
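
Purely as an illustration of the interface being sketched in this exchange
(no such ioctl existed at the time; the name and every field below are
hypothetical), the argument block might look something like:

#include <stdint.h>

/*
 * Hypothetical argument structure for a "replace this extent only if it
 * still matches this checksum" ioctl, as proposed above.  Nothing like
 * this was in the kernel when this thread was written.
 */
struct dedup_replace_args {
    int64_t  src_fd;            /* file holding the surviving copy          */
    uint64_t src_offset;        /* where that copy starts                   */
    uint64_t dest_offset;       /* range in the target file to be replaced  */
    uint64_t length;            /* bytes to dedup                           */
    uint8_t  expected_csum[32]; /* target range must still hash to this     */
    uint32_t flags;             /* e.g. "also do a full byte-by-byte check" */
};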

 Make another ioctl to report on which parts of a file have changed
 since a given transaction. This will greatly reduce the time spent
 scanning for new blocks.

That would be perfect. Even better would be a systemcall that reports
all the blocks that have been touched since a specific transaction. Like
a bitmap that sets a ,,1'' for every block that has been touched.

 It isn't painfully hard, but you're looking at about 3 weeks total
 time.

I see, so no quick hack to get it going.

Thomas


Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello Heinz,

 It's not only CPU time, it's also memory. You need 32 bytes for each 4k
 block.  It needs to be in RAM for performance reasons.

exactly and that is not going to scale.

Thomas


Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello,

* Thomas Glanzmann tho...@glanzmann.de [090428 22:10]:
 exactly. And if there is a way to retrieve the already calculated
 checksums from kernel land, then it would be possible to implement a
 ,,systemcall'' that gives the kernel a hint about possibly duplicated
 blocks (like providing a list of lists of blocks to the kernel that might
 be duplicated because they have the same checksum). Then the kernel code
 could dedup the blocks after comparing them byte by byte.

how hard would it be to implement a systemcall that gets a list of blocks
and checks whether it is possible to dedup them (especially with the
implemented RAID code, etc.)?

Probably it would be smarter to report one set of blocks at a time
instead of a list of sets of blocks, because otherwise the systemcall would
take forever to finish. That is not really an issue in itself, but I assume
that the dedup systemcall needs to take a global lock that most probably
blocks all I/O operations?

Thomas


Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Heinz-Josef Claes
Am Dienstag, 28. April 2009 22:16:19 schrieb Thomas Glanzmann:
 Hello Heinz,

  It's not only cpu time, it's also memory. You need 32 byte for each 4k
  block.  It needs to be in RAM for performance reason.

 exactly and that is not going to scale.

 Thomas


Hi Thomas,

I wrote a backup tool which uses dedup, so I know a little bit about the 
problem and the performance impact if the checksums are not in memory 
(optionally in that tool).
http://savannah.gnu.org/projects/storebackup

Dedup really helps a lot - I think more than I could imagine before I was
engaged in this kind of backup. You will not believe how many identical files
are in a filesystem, to give a simple example.

EMC has very big boxes for this with lots of RAM in them.
I think the first problem which has to be solved is the memory problem.
Perhaps something asynchronous that finds identical blocks and stores the
checksums on disk?

Heinz


Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Chris Mason
On Tue, 2009-04-28 at 22:52 +0200, Thomas Glanzmann wrote:
 Hello Heinz,
 
  I wrote a backup tool which uses dedup, so I know a little bit about
  the problem and the performance impact if the checksums are not in
  memory (optionally in that tool).
  http://savannah.gnu.org/projects/storebackup
 
  Dedup really helps a lot - I think more than I could imagine before I
  was engaged in this kind of backup. You will not believe how many
  identical files are in a filesystem, to give a simple example.
 
 I saw it with my own eyes (see my previous e-mail).
 
  EMC has very big boxes for this with lots of RAM in it.  I think the
  first problem which has to be solved is the memory problem.  Perhaps
  something asynchronous to find identical blocks and storing the
  checksums on disk?
 
 I think we already have a very nice solution in order to solve that
 issue:
 
 - Implement a system call that reports all checksums and unique
   block identifiers for all stored blocks.
 

This would require storing the larger checksums in the filesystem.  It
is much better done in the dedup program.

 - Implement another system call that reports all checksums and
   unique identifiers for all stored blocks since the last
   report. This can be easily implemented:

This is racey because there's no way to prevent new changes.

 
   Use a block bitmap for every block on the filesystem use one
   bit. If the block is modified set the bit to one, when a
   bitmap is retrieved simply zero it out:

 Assuming a 4 kbyte block size that would mean for a 1 Tbyte
 filesystem:
 
 1Tbyte / 4096 / 8 = 32 Mbyte of memory (this should of course
 be saved to disk from time to time and be restored on startup).
 

Sorry, a 1TB drive is teeny, I don't think a bitmap is practical across
the whole FS.  Btrfs has metadata that can quickly and easily tell you
which files and which blocks in which files have changed since a given
transaction id.  This is how you want to find new things.

But, the ioctl to actually do the dedup needs to be able to verify a
given block has the contents you expect it to.  The only place you can
lock down the pages in the file and prevent new changes is inside the
kernel.

-chris




Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Thomas Glanzmann
Hello,

  - Implement a system call that reports all checksums and unique
block identifiers for all stored blocks.

 This would require storing the larger checksums in the filesystem.  It
 is much better done in the dedup program.

I think I misunderstood something here. I thought the checksums per
block would already be stored somewhere in btrfs?

  - Implement another system call that reports all checksums and
unique identifiers for all stored blocks since the last
report. This can be easily implemented:

 This is racey because there's no way to prevent new changes.

I got the point.

Use a block bitmap for every block on the filesystem use one
bit. If the block is modified set the bit to one, when a
bitmap is retrieved simply zero it out:

  Assuming a 4 kbyte block size that would mean for a 1 Tbyte
  filesystem:

  1Tbyte / 4096 / 8 = 32 Mbyte of memory (this should of course
  be saved to disk from time to time and be restored on startup).

 Sorry, a 1TB drive is teeny, I don't think a bitmap is practical
 across the whole FS.  Btrfs has metadata that can quickly and easily
 tell you which files and which blocks in which files have changed
 since a given transaction id.  This is how you want to find new
 things.

You're right the bitmap would not scale. So what is missing is a
systemcall to report the changes to userland? (Is this feature used to
generate off-site backups as well?)

 But, the ioctl to actually do the dedup needs to be able to verify a
 given block has the contents you expect it to.  The only place you can
 lock down the pages in the file and prevent new changes is inside the
 kernel.

I totally agree to that. How much time would it consume to implement
such a systemcall?

Thomas


Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Dmitri Nikulin
On Wed, Apr 29, 2009 at 3:43 AM, Chris Mason chris.ma...@oracle.com wrote:
 So you need an extra index either way.  It makes sense to keep the
 crc32c csums for fast verification of the data read from disk and only
 use the expensive csums for dedup.

What about self-healing? With only a CRC32 to distinguish a good block
from a bad one, statistically you're likely to get an incorrectly
healed block once every few billion blocks. That may not be your
machine, but it'll be somebody's, since the probability is way too high
for it not to happen to somebody. Even just a 64-bit checksum would
drop the probability plenty, but I'd really only start with 128 bits.
NetApp uses 64 bits per 4k of data, ZFS uses 256 bits per block, and
those checksums chain back to the root like a highly dynamic Merkle
tree.

In the CRC case the only safe redundancy is one that has 3+ copies of
the block, to compare the raw data itself, at which point you may as
well have just been using healing RAID1 without checksums.
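
To put rough numbers on this, under the simplifying assumption that a
corrupted block looks like a random bit pattern to the checksum, the
chance that an n-bit checksum still matches the bad data is about 2^-n:

  \[
  P_{\text{false match}} \approx 2^{-n}:\qquad
  2^{-32} \approx 2.3\times10^{-10},\quad
  2^{-64} \approx 5.4\times10^{-20},\quad
  2^{-256} \approx 8.6\times10^{-78}
  \]

So with CRC32 alone you expect roughly one silently mis-healed block per
~4 billion corrupted blocks repaired, which is the "every few billion
blocks" figure above; at 64 or 256 bits the expected rate drops far below
anything a real deployment will ever see.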

-- 
Dmitri Nikulin

Centre for Synchrotron Science
Monash University
Victoria 3800, Australia


Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Chris Mason
On Tue, 2009-04-28 at 23:12 +0200, Thomas Glanzmann wrote:
 Hello,
 
   - Implement a system call that reports all checksums and unique
 block identifiers for all stored blocks.
 
  This would require storing the larger checksums in the filesystem.  It
  is much better done in the dedup program.
 
 I think I misunderstood something here. I thought the checksums per
 block would already be stored somewhere in btrfs?

They are, but only the crc32c are stored today.

 
   - Implement another system call that reports all checksums and
 unique identifiers for all stored blocks since the last
 report. This can be easily implemented:
 
  This is racy because there's no way to prevent new changes.
 
 I got the point.
 
 Use a block bitmap: for every block on the filesystem use one
 bit. If the block is modified, set the bit to one; when the
 bitmap is retrieved, simply zero it out:
 
   Assuming a 4 kbyte block size that would mean for a 1 Tbyte
   filesystem:
 
   1Tbyte / 4096 / 8 = 32 Mbyte of memory (this should of course
   be saved to disk from time to time and be restored on startup).
 
  Sorry, a 1TB drive is teeny, I don't think a bitmap is practical
  across the whole FS.  Btrfs has metadata that can quickly and easily
  tell you which files and which blocks in which files have changed
  since a given transaction id.  This is how you want to find new
  things.
 
 You're right, the bitmap would not scale. So what is missing is a
 system call to report the changes to userland? (Could this feature also
 be used to generate off-site backups?)

Yes, that's the idea.  An ioctl to walk the tree and report on changes,
but this doesn't have to be done with version 1 of the dedup code; you
can just scan files based on mtime/ctime.
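
A minimal sketch of that mtime/ctime scan (plain POSIX, nothing
btrfs-specific; the command-line cutoff and the print-the-path output are
assumptions about how a first-cut dedup tool might be wired up):

  /* Print regular files under <dir> whose mtime or ctime is newer than
   * <cutoff>, i.e. the files a dedup pass should look at again. */
  #define _XOPEN_SOURCE 500
  #include <ftw.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/stat.h>
  #include <time.h>

  static time_t cutoff;

  static int visit(const char *path, const struct stat *sb,
                   int typeflag, struct FTW *ftwbuf)
  {
      (void)ftwbuf;
      if (typeflag == FTW_F &&
          (sb->st_mtime > cutoff || sb->st_ctime > cutoff))
          printf("%s\n", path);   /* candidate for re-scanning */
      return 0;                   /* keep walking */
  }

  int main(int argc, char **argv)
  {
      if (argc != 3) {
          fprintf(stderr, "usage: %s <dir> <cutoff-unix-time>\n", argv[0]);
          return 1;
      }
      cutoff = (time_t)strtoll(argv[2], NULL, 0);
      /* FTW_PHYS: don't follow symlinks; 32 fds is an arbitrary limit. */
      if (nftw(argv[1], visit, 32, FTW_PHYS) < 0) {
          perror("nftw");
          return 1;
      }
      return 0;
  }

A real tool would persist the cutoff (or later, the btrfs transaction id)
between runs instead of taking it on the command line.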

 
  But, the ioctl to actually do the dedup needs to be able to verify a
  given block has the contents you expect it to.  The only place you can
  lock down the pages in the file and prevent new changes is inside the
  kernel.
 
 I totally agree with that. How much time would it take to implement
 such a system call?

It is probably a 3 week to one month effort.

-chris





Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Chris Mason
On Wed, 2009-04-29 at 00:14 +0200, Thomas Glanzmann wrote:
 Hello Chris,
 
  They are, but only the crc32c are stored today.
 
 maybe crc32c is good enough to identify duplicated blocks; I mean we
 only need a hint, and the dedup ioctl does the double checking. Tomorrow
 I will write a Perl script, compare the results to the one that uses
 md5, and report back.

It's a start at least.
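
The "crc32c as a hint" idea boils down to: a checksum match only nominates
a pair of blocks, and a full byte compare (or the kernel-side check in the
dedup ioctl) makes the final call. A small sketch, using zlib's crc32() as
a stand-in for the crc32c btrfs actually stores, with the 4 KiB block size
and explicit offsets as example values (link with -lz):

  #define _XOPEN_SOURCE 700
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <zlib.h>

  #define BLOCK_SIZE 4096

  static int read_block(const char *path, off_t off, unsigned char *buf)
  {
      int fd = open(path, O_RDONLY);
      if (fd < 0)
          return -1;
      ssize_t n = pread(fd, buf, BLOCK_SIZE, off);
      close(fd);
      return n == BLOCK_SIZE ? 0 : -1;
  }

  int main(int argc, char **argv)
  {
      if (argc != 5) {
          fprintf(stderr, "usage: %s <file1> <off1> <file2> <off2>\n",
                  argv[0]);
          return 1;
      }
      unsigned char a[BLOCK_SIZE], b[BLOCK_SIZE];
      if (read_block(argv[1], strtoll(argv[2], NULL, 0), a) ||
          read_block(argv[3], strtoll(argv[4], NULL, 0), b)) {
          fprintf(stderr, "failed to read block\n");
          return 1;
      }
      uLong ca = crc32(crc32(0L, Z_NULL, 0), a, BLOCK_SIZE);
      uLong cb = crc32(crc32(0L, Z_NULL, 0), b, BLOCK_SIZE);
      if (ca != cb)
          puts("checksums differ: not duplicates");
      else if (memcmp(a, b, BLOCK_SIZE) == 0)
          puts("checksums and bytes match: true duplicate");
      else
          puts("checksum collision: same crc32, different data");
      return 0;
  }

The experiment proposed above, comparing crc32 groupings against md5
groupings on real data, measures how often that third branch would fire.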

 
  Yes, that's the idea.  An ioctl to walk the tree and report on
  changes, but this doesn't have to be done with version 1 of the dedup
  code, you can just scan the file based on mtime/ctime.
 
 Good point.
 
But, the ioctl to actually do the dedup needs to be able to verify a
given block has the contents you expect it to.  The only place you can
lock down the pages in the file and prevent new changes is inside the
kernel.
 
   I totally agree with that. How much time would it take to implement
   such a system call?
 
  It is probably a 3 week to one month effort.
 
 I'm taking the challenge. Is there a document I can read that
 introduces me to the structures used in btrfs, or can someone walk me
 through them on the phone to get a quick start?
 

Great to hear.  It's an ambitious project, but I'll definitely help
explain things.

You can start with the code documentation section on
http://btrfs.wiki.kernel.org

I'll write up my ideas on how userspace controlled dedup should work.

 I also would like to retrieve the checksums and identify the potential
 duplicate blocks, and once that work is done (even in a very preliminary
 state) in a way that someone can work with, I would like to move on to
 the dedup ioctl.

Sounds fair, I'll forward the original patch.

-chris




Re: Data Deduplication with the help of an online filesystem check

2009-04-28 Thread Bron Gondwana
On Tue, Apr 28, 2009 at 04:58:15PM -0400, Chris Mason wrote:
  Assuming a 4 kbyte block size that would mean for a 1 Tbyte
  filesystem:
  
  1Tbyte / 4096 / 8 = 32 Mbyte of memory (this should of course
  be saved to disk from time to time and be restored on startup).
  
 
 Sorry, a 1TB drive is teeny, I don't think a bitmap is practical across
 the whole FS.

32Mbyte of memory is teeny too.  You can pick up a server with 32Gb for
about $5k these days.  Using 10% of that gives you room for 100Tb of
storage, which is getting to be a reasonable size.  More than that and
you would be thinking of bigger machines anyway to have enough memory
for a decent cache.
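
Spelled out, the arithmetic behind that estimate (same assumptions as
above: 4 KiB blocks, one bit per block, and a tenth of RAM given to the
bitmap):

  \[
  \frac{1\,\text{TB}}{4096\ \text{B/block}} \times
  \frac{1\ \text{bit/block}}{8\ \text{bits/B}} \approx 32\ \text{MB per TB},
  \qquad
  \frac{0.1 \times 32\ \text{GB}}{32\ \text{MB/TB}} = 100\ \text{TB}
  \]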

(32Gb seems to be the nice price point, because 4Gb dimms are cheap as
chips, and 8Gb are way expensive.  We're buying 32Gb machines at the
moment)

Bron ( ok, so I don't work with giant enterprise systems.  I can see
   that there are cases where this might suck, but memory seems
   to be growing as fast as disk in the low end )


Re: Data Deduplication with the help of an online filesystem check

2009-04-27 Thread Chris Mason
On Mon, 2009-04-27 at 05:33 +0200, Thomas Glanzmann wrote:
 Hello,
 I would like to know if it would be possible to implement the following
 feature in btrfs:
 
 Have an online filesystem check which accounts for possibly duplicated
 data blocks (maybe with the help of the already implemented checksums:
 are these checksums for the whole file or per block?) and deduplicates
 these blocks?

There is a btrfs ioctl to clone individual files, and this could be used
to implement an online dedup.  But, since it is happening from userland,
you can't lock out all of the other users of a given file.

So, the dedup application would be responsible for making sure a given
file was not being changed while the dedup scan was running.
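
The primitive being referred to is the btrfs clone ioctl (later
generalized in the VFS as FICLONE): after the call the destination file
shares the source's extents copy-on-write, so byte-identical files cost
the space of one. A minimal whole-file sketch, with the caveat Chris
gives above built in, i.e. nothing here stops another process from
writing to either file while it runs; both paths are assumed to live on
the same btrfs filesystem:

  #include <stdio.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/fs.h>       /* FICLONE */

  int main(int argc, char **argv)
  {
      if (argc != 3) {
          fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
          return 1;
      }
      int src = open(argv[1], O_RDONLY);
      int dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
      if (src < 0 || dst < 0) {
          perror("open");
          return 1;
      }
      /* No data is copied; the destination just references the source's
       * extents, and later writes to either side are COWed apart. */
      if (ioctl(dst, FICLONE, src) < 0) {
          perror("FICLONE");
          return 1;
      }
      close(src);
      close(dst);
      return 0;
  }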

-chris




Re: Data-deduplication?

2008-10-17 Thread Valerie Aurora Henson
On Thu, Oct 16, 2008 at 03:30:49PM -0400, Chris Mason wrote:
 On Thu, 2008-10-16 at 15:25 -0400, Valerie Aurora Henson wrote:
  
  Both deduplication and compression have an interesting side effect in
  which a write to a previously allocated block can return ENOSPC.
  This is even more exciting when you factor in mmap.  Any thoughts on
  how to handle this?
 
 Unfortunately we'll have a number of places where ENOSPC will jump in
 where people don't expect it, and this includes any COW overwrite of an
 existing extent.  The old extent isn't freed until snapshot deletion
 time, which won't happen until after the current transaction commits.
 
 Another example is fallocate.  The extent will have a little flag that
 says I'm a preallocated extent, which is how we'll know we're allowed to
 overwrite it directly instead of doing COW.
 
 But, to write to the fallocated extent, we'll have to clear the flag.
 So, we'll have to cow the block that holds the file extent pointer,
 which means we can enospc.

I'm sure you know this, but for the peanut gallery: You can avoid some
of these purely copy-on-write ENOSPC cases.  Any operation
where the space used afterwards is less than or equal to the space
used before - like in your fallocate case - can avoid ENOSPC as long
as you reserve a certain amount of space on the fs and break down the
changes into small enough groups.  Most file systems don't let you
fill up beyond 90-95% anyway because performance goes to hell.  You
also need to do this so you can delete when your file system is full.

In general, it'd be nice to say that if your app can't handle surprise
ENOSPC, then if you run without snapshots, compression, or data dedup,
we guarantee you'll only get ENOSPC in the normal cases.  What do
you think?

-VAL


Re: Data-deduplication?

2008-10-17 Thread Christoph Hellwig
On Thu, Oct 16, 2008 at 03:25:01PM -0400, Valerie Aurora Henson wrote:
 Both deduplication and compression have an interesting side effect in
 which a write to a previously allocated block can return ENOSPC.
 This is even more exciting when you factor in mmap.  Any thoughts on
 how to handle this?

Note that this can already happen in today's filesystems.  Writing into
some preallocated space can always cause splits of the allocation or
bmap btrees, as the previous big preallocated extent is now split into
one allocated and at least one (or two, if writing into the middle)
preallocated extents.



Re: Data-deduplication?

2008-10-15 Thread Andi Kleen
On Wed, Oct 15, 2008 at 03:39:16PM +0200, Avi Kivity wrote:
 Andi Kleen wrote:
 Ray Van Dolson [EMAIL PROTECTED] writes:
 
   
 I recall there being a thread here a number of months back regarding
 data-deduplication support for btrfs.
 
 Did anyone end up picking that up and giving a go at it?  Block level
 data dedup would be *awesome* in a Linux filesystem.  It does wonders
 for storing virtual machines w/ NetApp and WAFL, and even ZFS doesn't
 have this feature yet (although I've read discussions on them looking
 to add it).
 
 
 There are some patches to do in QEMU's cow format for KVM. That's
 user level only.
   
 
 And thus, doesn't work for sharing between different images, especially 
 at runtime. 

It would work if the images are all based once on a reference image, wouldn't it?
I would imagine that's the common situation for installing lots of VMs.

 I'd really, really [any number of reallies], really like to 
 see btrfs deduplication.

Sure it would be useful for a couple of things.

-Andi

-- 
[EMAIL PROTECTED]


Re: Data-deduplication?

2008-10-15 Thread Miguel Sousa Filipe
On Wed, Oct 15, 2008 at 3:15 PM, Andi Kleen [EMAIL PROTECTED] wrote:
 On Wed, Oct 15, 2008 at 03:39:16PM +0200, Avi Kivity wrote:
 Andi Kleen wrote:
 Ray Van Dolson [EMAIL PROTECTED] writes:
 
 
 I recall there being a thread here a number of months back regarding
 data-deduplication support for btrfs.
 
 Did anyone end up picking that up and giving a go at it?  Block level
 data dedup would be *awesome* in a Linux filesystem.  It does wonders
 for storing virtual machines w/ NetApp and WAFL, and even ZFS doesn't
 have this feature yet (although I've read discussions on them looking
 to add it).
 
 
 There are some patches to do in QEMU's cow format for KVM. That's
 user level only.
 

 And thus, doesn't work for sharing between different images, especially
 at runtime.

 It would work if the images are all based once on a reference image, wouldn't it?
 I would imagine that's the common situation for installing lots of VMs.

Like using bcp (the btrfs-specific cp) to create new images from a base one?
Would that suffice?
With modifications after that being COW, that could be a simple,
stupid/hacky way of avoiding duplication.



 I'd really, really [any number of reallies], really like to
 see btrfs deduplication.

 Sure it would be useful for a couple of things.

 -Andi





-- 
Miguel Sousa Filipe


Re: Data-deduplication?

2008-10-15 Thread Avi Kivity

Andi Kleen wrote:
 There are some patches to do in QEMU's cow format for KVM. That's
 user level only.

 And thus, doesn't work for sharing between different images, especially
 at runtime.

 It would work if the images are all based once on a reference image,
 wouldn't it?


Yes and no.  It's difficult to do it at runtime, and it allows one qemu 
to access another guest's data (for read-only).


Also, it's almost impossible to do at runtime.


--
error compiling committee.c: too many arguments to function



Re: Data-deduplication?

2008-10-13 Thread Chris Mason
On Sat, 2008-10-11 at 19:06 -0700, Ray Van Dolson wrote:
 I recall there being a thread here a number of months back regarding
 data-deduplication support for btrfs.
 
 Did anyone end up picking that up and giving a go at it?  Block level
 data dedup would be *awesome* in a Linux filesystem.  It does wonders
 for storing virtual machines w/ NetApp and WAFL, and even ZFS doesn't
 have this feature yet (although I've read discussions on them looking
 to add it).
 

So far nobody has grabbed this one, but I've had more requests (no
shocker there, the kvm people are interested in it too).  It probably
won't make 1.0 but the disk format will be able to support it.

-chris




Re: Data-deduplication?

2008-10-13 Thread Andi Kleen
Ray Van Dolson [EMAIL PROTECTED] writes:

 I recall there being a thread here a number of months back regarding
 data-deduplication support for btrfs.

 Did anyone end up picking that up and giving a go at it?  Block level
 data dedup would be *awesome* in a Linux filesystem.  It does wonders
 for storing virtual machines w/ NetApp and WAFL, and even ZFS doesn't
 have this feature yet (although I've read discussions on them looking
 to add it).

There are some patches to do in QEMU's cow format for KVM. That's
user level only.

-Andi
-- 
[EMAIL PROTECTED]