On Tue, 3 Feb 2009 18:11:24 -0700 Shawn Willden <[email protected]> wrote:
> Take a look in the rdiff-backup source code. It handles extended
> attributes and ACLs.

Cool, thanks.

> I plan to handle resource forks as a separate file, associated with
> the data fork by referencing both of them from the single backuplog
> entry.

Yeah, if we had node-metadata then I'd be tempted to store the
resource forks in separate files and bind them via the metadata.
Since we don't, when I get around to handling those forks, I'll
probably bind them via the edge-metadata, which is safe because this
tool creates read-only directories exclusively (so the file+fork pair
is still effectively immutable).

> I'm probably being excessively picky about this area, but it really
> bugs me that a backup may span days or weeks (or perhaps never
> finish!).

Good point, it's nice to have a coherent short-exposure snapshot of
the filesystem. Oh, how I wish ZFS were easier for me to use..

> My solution is to "scan fast, upload slow". The snapshot then spans
> the time required to scan the file system, including hashing the
> files that need to be hashed, which isn't fast but is unavoidable.

It's not fast, no.. in my experiments, hashing the whole disk takes at
least several hours, and sometimes most of the day. But I think we're
both planning to use a cheap path+timestamp+size(+inode?) lookup table
and give the user an option of skipping the hash when the timestamps
are still the same.

> If I understand immutable read caps correctly (I need to go read
> that code), I should be able to upload the log which contains all of
> the read caps before uploading the files those read caps reference.

Not really, unfortunately. The immutable file read-cap contains two
crypto-length strings. The first is an encryption key, which can be
generated at random but is usually generated by hashing the contents
of the file (along with a per-user "convergence secret"). The
advantage of hashing the file is that uploading the same file twice
results in the same readcap, so you don't use extra storage. You can
also think of this as a limited form of the "backupdb" which remembers
what's been uploaded before, indexed by a hash of the file. On the
other hand, using this technique means that you have to make an extra
read pass over the file. Also note that we don't yet have an API for
providing a random or pre-generated encryption key: there's a ticket
#320 which touches on this, but we haven't implemented it yet.

The second string is a SHA-256 hash of the "URI Extension Block",
which itself contains hashes of the generated shares, which of course
depend upon the encrypted file, which of course depends upon both the
original file and the encryption key. This is the fundamental value
which provides integrity checking, and to compute it you have to do
the entire encrypt+encode process (everything but actually upload the
generated shares to some set of storage servers). It is normally
computed at the very end of the upload process, using values that were
stashed during the encode+upload phases.

So, given a file on disk, you have to do almost the entire Tahoe
upload process to find out what the eventual Tahoe readcap is going to
be. This sounds like it's at odds with your plan to upload the
"backuplog" before you finish uploading some of the actual data files.
I'm not sure how to rectify this.

cheers,
 -Brian

_______________________________________________
tahoe-dev mailing list
[email protected]
http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev
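
To picture the edge-metadata binding described in the message above: a
minimal sketch in Python, assuming the directory's children are held as a
plain {childname: (readcap, metadata)} mapping until the read-only
directory is created in one shot. The "resource_fork" metadata key and the
helper name are illustrative assumptions, not an existing Tahoe convention
or API.

    def add_file_with_fork(children, name, data_readcap, fork_readcap=None):
        # Record the data fork's cap as the edge target, and stash the
        # resource fork's cap in the per-edge metadata dict.  Because the
        # directory will be created read-only, the data+fork pair stays
        # effectively immutable.
        metadata = {}
        if fork_readcap is not None:
            metadata["resource_fork"] = fork_readcap   # assumed key name
        children[name] = (data_readcap, metadata)

    children = {}
    add_file_with_fork(children, "report.doc",
                       "URI:CHK:(data readcap)", "URI:CHK:(fork readcap)")
    # ...then create the read-only directory from `children` in one step.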
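
The path+timestamp+size(+inode) shortcut for skipping the hash could look
roughly like this: a minimal sketch assuming a local sqlite cache, where
the table name, schema, and function names are invented for illustration
and are not part of Tahoe's backupdb or of Shawn's tool.

    import hashlib, os, sqlite3

    def open_cache(dbpath):
        db = sqlite3.connect(dbpath)
        db.execute("CREATE TABLE IF NOT EXISTS hashcache "
                   "(path TEXT PRIMARY KEY, mtime REAL, size INTEGER, "
                   " inode INTEGER, sha256 TEXT)")
        return db

    def file_hash(db, path, force=False):
        st = os.stat(path)
        row = db.execute("SELECT mtime, size, inode, sha256 FROM hashcache "
                         "WHERE path=?", (path,)).fetchone()
        if (row and not force
                and row[0] == st.st_mtime and row[1] == st.st_size
                and row[2] == st.st_ino):
            return row[3]   # unchanged since the last scan: skip the read pass
        # otherwise pay for the full read pass and remember the result
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 16), b""):
                h.update(chunk)
        digest = h.hexdigest()
        db.execute("INSERT OR REPLACE INTO hashcache VALUES (?,?,?,?,?)",
                   (path, st.st_mtime, st.st_size, st.st_ino, digest))
        db.commit()
        return digest

The "force" flag corresponds to the user option mentioned above: trust the
cheap stat comparison by default, re-hash on demand.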
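
The dependency chain that keeps the readcap unknowable until nearly the end
of the upload can be mirrored in a toy model. This is deliberately not
Tahoe's real CHK derivation (the real thing involves tagged hashes, AES
encryption, and zfec erasure coding); the stand-in functions below only
show that both halves of the cap, the convergence-derived key and the URI
Extension Block hash, require reading, encrypting, and encoding the entire
file first.

    import hashlib

    def convergence_key(convergence_secret, plaintext):
        # first read pass: derive the key from the file contents plus a
        # per-user secret, so identical files converge on identical keys
        return hashlib.sha256(convergence_secret +
                              hashlib.sha256(plaintext).digest()).digest()[:16]

    def fake_encrypt(key, plaintext):
        # stand-in for the real cipher: any deterministic function of
        # (key, data) is enough to show the dependency
        return hashlib.sha256(key + plaintext).digest()

    def fake_encode(ciphertext, n=10):
        # stand-in for erasure coding: pretend each "share" is derived
        # from the ciphertext and its share number
        return [hashlib.sha256(ciphertext + bytes([i])).digest()
                for i in range(n)]

    def ueb_hash(shares):
        # the URI Extension Block contains (among other things) hashes of
        # the shares; its own SHA-256 hash is the integrity root of the cap
        ueb = b"".join(hashlib.sha256(s).digest() for s in shares)
        return hashlib.sha256(ueb).digest()

    def toy_readcap(convergence_secret, plaintext):
        key = convergence_key(convergence_secret, plaintext)
        shares = fake_encode(fake_encrypt(key, plaintext))
        # both halves depend on having processed the whole file
        return (key, ueb_hash(shares))

A pre-generated random key (the ticket #320 idea) would remove the first
read pass, but the UEB hash would still force the encrypt+encode work
before the cap is known.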
