On 05/07/14 00:29, Arthur Chance wrote:
> On 07/05/2014 01:41, The Farmer wrote:
>> Is there any documentation on how de-duplication works?
>>
>> From what I understand, each archive is split into chunks, which are hashed
>> then encrypted (or encrypted then hashed, perhaps) and uploaded to the
>> server, and two chunks with the same hash are considered duplicates, saving
>> an upload.
>>
>> I'm wondering how the boundaries between chunks are established. If my
>> first upload is chunked as (BC)(DEFG)(HIJ) and my 2nd (after inserting an A
>> at the start) as (AB)(CDEF)(GHIJ) then none of the chunks will be the same,
>> and the whole file will need to be uploaded again even though the change to
>> the file was tiny.
>>
>> ... so I guess that's not how it works, and I'm left wondering what kind of
>> cleverness is in play here.

Tarsnap uses context-dependent block boundaries to avoid exactly this
problem. I talked about this as part of my EuroBSDCon 2013 talk -- see
slides 7-12 of
http://www.daemonology.net/papers/EuroBSDCon13.pdf
for some details.

> I'm sure Colin will give you a more detailed answer but I'm fairly certain
> he's said it's a variant of the rsync algorithm. Take a look at
>
> http://rsync.samba.org/tech_report/
>
> for more details of the original.

No, not at all. It is however related to the *rsyncable* option to gzip.

--
Colin Percival
Security Officer Emeritus, FreeBSD | The power to serve
Founder, Tarsnap | www.tarsnap.com | Online backups for the truly paranoid
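
Illustrative note: the general idea of context-dependent block boundaries can be sketched in a few lines of Python. This is a toy under stated assumptions, not Tarsnap's actual algorithm or parameters. The names fixed_chunks, content_defined_chunks and shared, and the values WINDOW and MASK, are made up for the example. It compares fixed-offset chunking (the scheme the question worries about) with chunking where a boundary is placed wherever a hash of the preceding few bytes matches a pattern.

```python
import hashlib
import random

WINDOW = 16   # bytes of context hashed at each position (assumed value)
MASK = 0x3F   # cut when the low 6 bits of the hash are zero (assumed value)


def fixed_chunks(data, size=64):
    """Split at fixed offsets; an insertion shifts every later boundary."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data):
    """Split wherever the hash of the preceding WINDOW bytes matches a pattern.

    A boundary depends only on nearby bytes, not on the absolute offset, so
    boundaries move along with the data when bytes are inserted earlier on.
    """
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        digest = hashlib.sha256(data[end - WINDOW:end]).digest()
        if int.from_bytes(digest[:4], "big") & MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


def shared(old_chunks, new_chunks):
    """Count chunks of the new version that already exist in the old one."""
    seen = set(old_chunks)
    return sum(1 for chunk in new_chunks if chunk in seen)


if __name__ == "__main__":
    rng = random.Random(0)
    old = bytes(rng.getrandbits(8) for _ in range(4096))
    new = b"A" + old  # the one-byte insertion at the start from the question

    for name, split in (("fixed", fixed_chunks),
                        ("content-defined", content_defined_chunks)):
        old_chunks, new_chunks = split(old), split(new)
        print(name, shared(old_chunks, new_chunks),
              "of", len(new_chunks), "chunks reused")
```

With fixed offsets, the inserted byte shifts every chunk, so nothing is reused. With content-defined boundaries, only the chunk containing the insertion changes and every later chunk is byte-for-byte identical. Hashing each window with SHA-256 keeps the sketch short; a practical implementation would more likely use a rolling hash that can be updated in constant time per byte.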
