This really does seem to be reinventing rdiff-backup. Sure you can't save yourself a lot of effort by modifying that?
-Ian

On Jan 9, 2009, at 1:49 AM, Shawn Willden wrote:

> On Thursday 08 January 2009 09:17:40 am zooko wrote:
>> Have you seen this thread? It might be a good project for you, as it
>> is self-contained, requires minimal changes to the tahoe core itself,
>> and is closely related to your idea about good backup:
>>
>> http://allmydata.org/pipermail/tahoe-dev/2008-September/000809.html
>
> I think it's exactly what I want to do. Not exactly the way I want to
> do it, but close.
>
> Here's an overview of how I think I'll approach the problem. All
> suggestions welcome.
>
> First, I'll run the backup as two separate processes: a scanner, which
> determines what to back up, and an uploader, which does the job of
> backing things up.
>
> There are several reasons for separating the two tasks, but the
> biggest ones are (a) I expect there will ultimately be several
> different scanners and (b) because a large backup will take a long
> time (especially an initial backup), the upload process needs to be
> resumable anyway, so there will have to be some way of tracking where
> it is in the process. Having an upload queue generated by the scanner
> and written to disk is a good way. An interrupted scan can simply be
> restarted from the beginning, since it's relatively cheap.
>
> Change detection will of course be based on size, mtime and ctime.
> ctime needs to be examined to catch metadata changes. The scanner
> will check these values against a backupdb (more on that below). If a
> full "hash check" is desired, the scanner can simply queue every file
> for the uploader.
>
> The uploader will read from the queue and compute an rdiff signature
> for each file. If the file is in the backupdb, the uploader will
> check the computed rdiff signature against the stored signature. If
> they differ, it will use the stored signature to compute an rdiff
> delta and upload that. It will also store the current metadata in the
> backupdb.
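[A minimal Python sketch of the scanner/uploader split described above. The dict-shaped backupdb, and the `rdiff_signature`, `rdiff_delta`, and `store` helpers, are assumptions for illustration; the thread doesn't specify any of their interfaces.]

```python
import os

def scan(paths, backupdb):
    """Scanner: queue every path whose size, mtime, or ctime differs
    from its backupdb record.  (Hypothetical dict-based backupdb.)"""
    queue = []
    for path in paths:
        st = os.stat(path)
        entry = backupdb.get(path)
        if (entry is None
                or st.st_size != entry["size"]
                or st.st_mtime != entry["mtime"]
                or st.st_ctime != entry["ctime"]):
            queue.append(path)
    return queue

def upload(queue, backupdb, rdiff_signature, rdiff_delta, store):
    """Uploader: compute an rdiff signature for each queued file; if it
    differs from the stored one, build a delta against the *stored*
    (previous) signature and upload that.  rdiff_signature, rdiff_delta
    and store are hypothetical wrappers, e.g. around librsync and the
    Tahoe grid."""
    for path in queue:
        sig = rdiff_signature(path)
        entry = backupdb.get(path)
        if entry is not None and sig == entry["sig"]:
            pass  # contents unchanged; just refresh the metadata below
        elif entry is not None:
            store(path, rdiff_delta(entry["sig"], path))   # incremental
        else:
            with open(path, "rb") as fh:
                store(path, fh.read())                     # first upload
        st = os.stat(path)
        backupdb[path] = {"size": st.st_size, "mtime": st.st_mtime,
                          "ctime": st.st_ctime, "sig": sig}
```

[Because the queue and backupdb both live on disk in Shawn's design, an interrupted uploader can resume from where it stopped, while a killed scanner just reruns from scratch.]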
> To avoid missing changes due to limited mtime granularity, the stored
> metadata will have mtime set to min(mtime, start-1), where "start" is
> the time that the uploader started reading the file to compute the
> rdiff signature.
>
> There is another possible issue, a serious one. If the file changes
> between the calculation of the rdiff signature and generation of the
> delta, the chain of forward deltas will be broken. To avoid that, the
> uploader will check the mtime after generating the delta, and if it
> is greater than or equal to "start", the uploader will calculate a
> new rdiff signature while copying the file to a temporary location,
> compute the delta from the temporary copy, and use those signature
> and delta values. The copy is only really necessary if the file is
> going to continue changing, but since it changed once while the
> uploader was working, it's reasonable to assume that it might be a
> log file or something else that's changing frequently.
>
> When the uploader has exhausted the queue, it then uploads the
> backupdb. In order to make that efficient, for incremental backups it
> uploads a delta of the backupdb.
>
> I'm still thinking about how to structure the directories into which
> all this stuff gets uploaded. I think it'll probably be a single tree
> structure that mirrors the backed-up filesystem structure, but each
> file will become a subdirectory, into which the original plus deltas
> will be placed. The backupdb and its deltas will go just above the
> tree. Oh, and all of the various uploads mentioned are immutable.
>
> The downside of using forward deltas is that you have to have all of
> them, in a complete unbroken chain, in order to get from the original
> to any other version, including the most recent.
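[The mtime clamp and the race check above can be sketched as follows. The helper names are hypothetical, and `rdiff_signature`/`rdiff_delta` stand in for whatever librsync wrapper the uploader ends up using.]

```python
import os
import shutil
import tempfile
import time

def clamp_stored_mtime(mtime, start):
    """Record min(mtime, start - 1) in the backupdb, so a write that
    lands in the same clock second as the backup still looks newer on
    the next scan."""
    return min(mtime, start - 1)

def make_delta_safely(path, stored_sig, rdiff_signature, rdiff_delta):
    """Race check: if the file's mtime moved to or past `start` while
    we worked, the signature and delta may describe different contents,
    so freeze a private copy and recompute both from it.
    Returns (new_signature, delta) for one consistent snapshot."""
    start = time.time()
    new_sig = rdiff_signature(path)         # signature of current contents
    delta = rdiff_delta(stored_sig, path)   # delta from the previous version
    if os.stat(path).st_mtime >= start:
        fd, tmp = tempfile.mkstemp()
        os.close(fd)
        try:
            shutil.copyfile(path, tmp)
            new_sig = rdiff_signature(tmp)
            delta = rdiff_delta(stored_sig, tmp)
        finally:
            os.unlink(tmp)
    return new_sig, delta
```

[The temporary copy costs an extra read and write, but only for files that were observed changing mid-backup, which matches Shawn's "probably a log file" heuristic.]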
> This means that the longer the chain of deltas, the less reliable the
> most recent version is, and that means that we'll also need another
> maintenance process, one that has to have the ability to retrieve the
> files and the deltas and reconstruct them all so it can generate the
> most recent version and then generate reverse deltas, probably
> discarding some of the backup intervals.
>
> There are lots of details to fill in, but that's the outline.
> Comments?
>
> Shawn.
> _______________________________________________
> tahoe-dev mailing list
> [email protected]
> http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev
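[The maintenance pass Shawn describes amounts to walking the forward-delta chain to rebuild the latest version, then emitting reverse deltas so only that latest version must be kept whole. A toy sketch, where `apply_delta` and `compute_delta` are hypothetical stand-ins for `rdiff patch` and `rdiff delta`:]

```python
def reconstruct_latest(original, deltas, apply_delta):
    """Walk the unbroken forward-delta chain original -> v1 -> ... -> vN.
    A missing delta (None) breaks the chain: every later version is
    unrecoverable, which is exactly the reliability weakness above."""
    current = original
    for i, delta in enumerate(deltas):
        if delta is None:
            raise ValueError("chain broken at delta %d; later versions lost" % i)
        current = apply_delta(current, delta)
    return current

def make_reverse_deltas(versions, compute_delta):
    """Once every version is reconstructed, store reverse deltas: each
    one rebuilds version i from version i+1, so the newest (most
    valuable) version no longer depends on the whole history."""
    return [compute_delta(versions[i + 1], versions[i])
            for i in range(len(versions) - 1)]
```

[With reverse deltas, pruning old backup intervals just drops deltas from the far end of the chain instead of breaking it.]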
