I think we are talking about different things.

You are talking about the DataStore saving space by deduplicating data and
reducing redundancy in a general way. The DataStore can do this without
additional information; I totally agree with you here. The hash it returns is,
for example, also a way to reduce duplicates.
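
To illustrate the hash-based part, here is a minimal sketch in Python (for
brevity). The class and method names are hypothetical and only mimic the idea
of addRecord; this is not the Jackrabbit DataStore API, and the choice of
SHA-1 is an assumption made for the example.

```python
import hashlib
import io


class ContentAddressedStore:
    """Toy in-memory store (hypothetical, not the Jackrabbit DataStore)."""

    def __init__(self):
        self._records = {}  # hex digest -> stored bytes

    def add_record(self, stream):
        """Read the stream and return the content hash as the identifier."""
        data = stream.read()
        identifier = hashlib.sha1(data).hexdigest()
        # Identical content maps to the same key, so it is kept only once.
        self._records.setdefault(identifier, data)
        return identifier

    def record_count(self):
        return len(self._records)


store = ContentAddressedStore()
first = store.add_record(io.BytesIO(b"same content"))
second = store.add_record(io.BytesIO(b"same content"))
# Both calls yield the same identifier and only one record is stored.
```

Note that this only removes exact duplicates; the store needs nothing beyond
the stream itself to do it.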

What I'm talking about is storing concrete diffs instead of whole files. This
is not possible. Jackrabbit calls the DataStore (addRecord) with some stream,
and the DataStore has absolutely no idea what the original file was, so it
can't determine the concrete diff. Sure, it could look for similar files
(performance?) and make a diff (binaries?) against the most similar file
found, but that is not necessarily the diff to the previous version from the
application's point of view.
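
A small sketch of that last point, again in Python with hypothetical names
(nothing here is Jackrabbit code, and difflib merely stands in for whatever
similarity measure a store might use): the store sees only bytes, so the best
it can do is pick the most similar record, which need not be the predecessor
the application has in mind.

```python
import difflib


def most_similar(records, new_data):
    """Return the id of the stored record most similar to new_data, or None."""
    best_id, best_ratio = None, 0.0
    # Every stored record is compared, which is the performance concern.
    for record_id, data in records.items():
        ratio = difflib.SequenceMatcher(None, data, new_data).ratio()
        if ratio > best_ratio:
            best_id, best_ratio = record_id, ratio
    return best_id


# Hypothetical contents: the application rewrites file "a" completely,
# and the new version happens to resemble the unrelated file "b".
records = {
    "a-v1": b"AAAAAAAAAA",  # previous version from the application's view
    "b": b"BBBBBBBBBA",     # unrelated file
}
new_version_of_a = b"BBBBBBBBBB"

base = most_similar(records, new_version_of_a)
# base is "b", not "a-v1": a diff against it is not the diff between versions.
```

Only the application knows that "a-v1" is the previous version; the stream
alone does not carry that information.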

Regards, Robert

-----Original Message-----
From: Thomas Mueller [mailto:[email protected]] 
Sent: Wednesday, 13 July 2011 11:37
To: [email protected]
Subject: Re: AW: AW: AW: Incremental/deduplicating versioning

Hi,

>The only possible way is to add the information (the DataStore does not
>have like the identifier/content of the previous version) to the stream
>(addRecord) from the application side.

I suggest reading http://en.wikipedia.org/wiki/Data_deduplication -
especially "Depending on the type of deduplication, redundant files may be
reduced, or even portions of files or other data that are similar can also be
removed." and http://en.wikipedia.org/wiki/Rsync#Algorithm

Regards,
Thomas
