On Fri, 9 Jan 2009 08:15:56 -0700 Shawn Willden <[email protected]> wrote:
> The problem with using reverse deltas is they're even more > bandwidth-intensive than full versions. If you have plaintext on both ends Yeah, this notion of "both ends" is tricky: you have to download the file to get the plaintext, which requires N/k (e.g. 1/3rd) the bandwidth of uploading the whole thing. Or you can keep a copy locally (doubling your local storage needs) so that you have something to diff against. The 'revlog' format that I'd like to use for our eventual "LDMF" (Large Distributed Mutable Files) design uses forward deltas and an append-only data structure, but writes out a full copy of the file whenever the amount of data you'd need to read to construct the latest version gets bigger than twice the size of the file. I think the Mercurial folks picked this 2x thing as a reasonable tradeoff between speed and storage. For our mutable files, this also gives us versioning: we create a read-cap that contains both a pointer to the file overall (i.e. the revlog) and a secure identifier for the particular revision that we care about. The question of how far back to keep the deltas becomes a matter of retention policy and GC, with the obvious limitation that the oldest thing in the revlog should be a full copy. The format I've been sketching out would allow for efficient insertions and deletions of arbitrary byte spans, which would be pretty wild. The standard POSIX file io API doesn't give any way to express this, but an application that's speaking directly to Tahoe could take advantage of it. Imagine being able to insert a byte in the beginning of your file in constant time. > But if the remote store is encrypted, the delta can't be applied. All you > can do is store it as a forward delta. To get reverse deltas, you have to > upload a full copy of the new version, plus the delta. > > That's the primary reason why I suggested maybe Tahoe should optionally > allow not encrypting the files, to facilitate reverse delta versioning. > Zooko doesn't want to go there, and I respect his reasoning. The limitation is actually a combination of encryption and erasure-coding (which, unless you're using the degenerate k=1 replication case, involves splitting the data up among multiple servers). Both make it hard or impossible to apply rsync-like algorithms on the server side. > If librsync deltas are reversible (need to check) Incidentally, for deltas to be reversible, they have to contain roughly twice the information as an irreversible delta. Darcs does this (because the patches may need to be commuted arbitrarily), the result is that deleting a file takes as much patch space as adding one. cheers, -Brian _______________________________________________ tahoe-dev mailing list [email protected] http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev
