I'd like to provoke a discussion about file repair, to see if we can make some progress on improving the process. The main tickets on this issue are #643 "(automatic) Tahoe Repair process", in which newcomers sensibly think that files are repaired automatically (and are disappointed to learn that they are not), and #483 "repairer service". #543 (rebalancing manager) and #777 (CLI command to deep-renew all aliases, not just one) are peripherally related.
There are actually three periodic operations that need to be done:

* lease-renewal: locating all shares and updating their lease timers. This may or may not be necessary in the future, depending upon how we do Accounting, but for now, if your servers are configured to expire shares at all, then clients are obligated to update a lease on every file on a regular basis.
* file-checking: locating and counting all shares.
* file-repair: if the file-checker says there are too many missing shares, make new ones.

Lease-renewal currently uses a distinct backend storage-server call. The file-checker code has an option to additionally renew leases at the same time (the "do you have a share" and the "please renew my lease on this share" calls are pipelined, so it does not add roundtrips). I felt that it made more sense to add it to the file-checker code than, say, the download code, because the Downloader can stop after finding "k" shares, whereas the Checker is obligated to find all of them, and of course lease-renewal needs to hit all shares too.

Repair is currently implemented as a Downloader and an Uploader glued together, bypassing the decryption step. I think this is the best design, although I'm also looking forward to the Uploader being taught to handle existing shares better. The current Repairer is pretty inefficient when the server list has changed slightly: it will readily put multiple shares on the same server, and you'll easily wind up with multiple copies of any given share. Some day that will be better (#610), but for now the process works well enough.

Now, I've gone back and forth on where I've thought the higher-level "Repair Process" functionality ought to be. There are two main use cases. One is the normal end-user, who has one client node, and nobody else who will do work on their behalf. This user needs to do any checking/repairing/renewing all by themselves. The other is the allmydata.com case, where end-users are paying a central service to handle maintenance tasks like this, so repair/etc will be done by a different machine (which preferably does not have access to the plaintext). Smaller grids might also use a central service for this sort of thing: some of the storage-server operators might also be willing to run repair services for their friends.

I've tried to build the Tahoe code and its interfaces from the bottom up: when we weren't sure how to build the high-level things, I tried to at least provide users/developers with the low-level tools to assemble things like a Repair process on their own. Adding --repair to the deep-check webapi operation and CLI command is an example of this. A normal end-user can nominally get all of their renew/repair needs taken care of with a cron job that runs "tahoe deep-check --repair --add-lease ALIAS:".

But that's not very convenient. You don't know how long a run will take, and you don't want subsequent runs to overlap, so you don't know how frequently to schedule the cronjob. Very large directory structures could take days (one allmydata customer's tree took a few weeks to traverse). If a run gets interrupted, you lose all the progress you've made in that time. There's no way to prioritize file-checking over repair, or repair of some objects (like directories) over files, or the most damaged objects over less-damaged objects, or to defer repair until later. And transient unavailability of a server or two will look just like damage, so if you repair right away, you'll be doing a lot more work than necessary (you might want to defer repair until you've seen the same "damage" persist for a couple of days, or weeks).

There are ways to address some of these with additional "userspace" tools (ones which live above the webapi or CLI layer), but it's difficult and touchy.
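Even the simplest of those requirements, keeping successive cron runs from overlapping, already needs a little "userspace" glue. A minimal sketch of what that wrapper might look like (the lockfile path and "ALIAS:" are placeholders; none of this is something Tahoe provides today):

    #!/usr/bin/env python
    # Hypothetical cron wrapper: skip this run if the previous deep-check is
    # still going, so long traversals never pile up on each other.
    import fcntl, subprocess, sys

    LOCKFILE = "/home/user/.tahoe/deep-check.lock"  # placeholder path

    def main():
        lock = open(LOCKFILE, "w")
        try:
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except IOError:
            return 0  # previous run still active; wait for the next cron slot
        return subprocess.call(["tahoe", "deep-check", "--repair",
                                "--add-lease", "ALIAS:"])

    if __name__ == "__main__":
        sys.exit(main())

That handles overlap, but nothing else on the list: no saved progress, no prioritization, no deferral.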
Many of those goals require a persistence layer which can remember what needs to be done, and a long-running process to manage scheduling, and then the code would spend most of its life speaking HTTP and JSON to a nearby client node, telling it what to do and interpreting the results. Since the Tahoe node already has these things, it probably makes sense to perform these tasks in "nodespace" (below the webapi layer, in the tahoe client process itself), with perhaps some kind of webapi/CLI layer to manage or monitor it. The code that runs these tasks would then run faster (direct access to IClient and IFileSystemNode objects, no HTTP or JSON parsing in the way), and there would only be one service to start.

The second use case (allmydata.com, central services) *really* wants a prioritizing scheduler, since it is combining jobs from thousands of users, examining tens or hundreds of millions of files. Also, if those users have files in common, the central process can save time by only checking each file once.

We've gone back and forth over the design of these services, as we alternately try to emphasize the "run your own node, let it manage things for you" use case or the "central services will manage things for you" case, as well as the "tahoe will do this for you" vs "here are the tools, go write it yourself" question. There aren't really too many differences between the goals for a node-local Repairer service and a centrally-managed one:

* a local repair service would be allowed to look at real filecaps, whereas a central one should be limited to repaircaps/verifycaps
* a local repair service may run few enough jobs to be satisfied with a single worker client node. A central service, providing for thousands of users, may require dozens of worker nodes running in parallel to make reasonable progress
* a local repair service would be a part of the tahoe client node, displaying status through the webapi, and configured through tahoe.cfg and CLI commands. Its presence should not increase the install-time load of Tahoe (i.e. no additional dependencies or GUI libraries, etc). A central service, living outside the context of a client node, may have other UI avenues (Gtk?), and can justify additional dependencies (MySQL or something).

We cannot yet repair read-only mutable files (#625, #746), and we require readcaps for directory traversal (#308), so deep-repair currently requires directory writecaps. This may be marginally acceptable for the allmydata.com central server, but not for a friendnet's repair services. One plan we've discussed would be to have client nodes build a "manifest" of repaircaps and submit it to a central service. The service would maintain those files until the client replaced the manifest with a new version. (This would bypass the #308 problem by performing traversal on the client, but would still hit the can-only-repair-mutable-writecaps problem.) So, some of these goals must be deferred.
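To make the manifest plan concrete, the client-side half might look something like this sketch. It shells out to "tahoe manifest", strips each line down to its cap, and posts the result to a central service. The service URL, the to_repair_cap() helper, and the exact output parsing are assumptions for illustration only; in particular, a real tool must convert everything to repaircaps/verifycaps before it leaves the client, so the service never sees anything it could read:

    #!/usr/bin/env python
    # Hypothetical manifest submitter: gather the caps under an alias and hand
    # them to a central repair service.  Submitting a new manifest would
    # replace the previous one.
    import subprocess, urllib2

    REPAIR_SERVICE = "http://repair.example.com/submit-manifest"  # hypothetical

    def to_repair_cap(cap):
        # Placeholder: convert readcaps to verify/repair caps here, so the
        # central service is never handed a cap that reveals plaintext.
        return cap

    def main():
        p = subprocess.Popen(["tahoe", "manifest", "ALIAS:"],
                             stdout=subprocess.PIPE)
        output = p.communicate()[0]
        caps = [to_repair_cap(line.split(None, 1)[0])
                for line in output.splitlines() if line.strip()]
        urllib2.urlopen(REPAIR_SERVICE, data="\n".join(caps))

    if __name__ == "__main__":
        main()

The service side is where the real work is: it faces the same scheduling problem described below, multiplied across thousands of users.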
Now, the "repair service" design that we've considered, the kind that would live inside a tahoe client node and work on the behalf of a single local user, would probably look like this: * use a persistent data structure (probably SQLite), to track history and manage event scheduling across node reboots * periodically (perhaps weekly), do a "deep-check --add-lease" on certain rootcaps, as named by tahoe.cfg . Keep track of which dirnodes have been visited, to avoid losing too much progress when a node bounce occurs during the scan. * occasionally do a "check --verify" to run the Verifier on each file, probably as a random sampling, to confirm that share data is still retrievable and intact. This is significantly more bandwidth intensive than merely sending "do you have share" queries (even more intensive than regular download, since it usually downloads all N shares, not just k). So it must be rate-limited and balanced against other needs. * record damaged files in the DB. Maybe record a deep-size value for quick queries. Maybe record information about all files. * a separate process would examine the records of damaged files, sort the weakest ones to the top, apply repair policy to decide which should be repaired, and begin repair work * bandwidth/CPU used by the checker loop and the repairer loop should be limited, to prioritize other Tahoe requests and other non-Tahoe uses of the same host and network * provide status on the repair process, how much work is left to go, ETA, etc * maybe, if we cache where-are-the-shares information about all files, then we can provide an interface that says "server X is going away / has gone away, please repair everything that used it". This could provide faster response to server loss than a full Checker pass of all files. The Downloader might also take advantage of this cache to speed up peer-selection. One big challenge with the Checker/Repairer is that, for the most part, its job will be very very bursty. It will spend months looking at files to determine that, yes, that share is still there. Then, a server will go away, and boom, there are thousands of files that need repair. Or, the server will go away for a few hours, and any files that happen to get checked during that time will appear damaged, but if the Checker is run again later, they'll be back to normal. When a server really has gone away, there will be a lot of repair work to do. The distribution gets better as you have more servers, but even so, it will probably help to make the "R" repair threshold be fuzzy instead of a strict cutoff. The idea would be: * be willing to spend lots of bandwidth on repairing the really weak files (those closest to the edge of unrecoverability). If you have a 3-of-10 encoded file with only 3 shares left, drop everything and repair it quick * then spend less bandwidth on repairing the less-damaged files * once you're down to all files having >R shares, still spend some bandwidth randomly repairing some of them, slowly You want to slowly gnaw away at the lightly-damaged files, making them a little bit healthier, so that when a whole server disappears, you'll have less work to do. The Repairer should be able to make some predictions/plans about how much bandwidth is needed to do repair: if it's losing ground, it should tell you about it, and/or raise the bandwidth cap to catch up again. So.. does this seem reasonable? Can people imagine what the schema of this persistent store would look like? 
What sort of statistics or trends might we want to extract from this database, and how would that influence the data we put into it? In allmydata.com's pre-Tahoe "MV" system, I really wanted to track some files (specifically excluded from repair) and graph how they degraded over time, to learn more about what the repair policy should be. It might be useful to get similar graphs out of this scheme. Should we / can we use this DB to track server availability too?

How should the process be managed? Should there be a "pause" button? A "go faster" button? Where should bandwidth limits be imposed? Can we do all of this through the webapi, and how can we make that safe? (Does the status page need to be on an unguessable URL? How about the control page and its POST buttons?)

And what's the best way to manage a loop-avoiding depth-first directed-graph traversal such that it can be interrupted and resumed with minimal loss of progress? (This might be a reason to store information about every node in the DB, and use that as a "been here already, move along" reminder.)
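To ground that last question, here is one shape it could take, reusing the strawman DB from above (again just a sketch; list_children() and process() are stand-ins for whatever the real node interfaces turn out to be):

    # Resumable, loop-avoiding traversal: the work queue and the visited set
    # both live in the DB, so a node bounce costs at most one object's worth
    # of progress.
    def traverse(db, rootcap, pass_number, list_children, process):
        db.execute("CREATE TABLE IF NOT EXISTS pending (cap TEXT, pass INTEGER)")
        db.execute("CREATE TABLE IF NOT EXISTS visited (cap TEXT, pass INTEGER)")
        started = (db.execute("SELECT COUNT(*) FROM pending WHERE pass=?",
                              (pass_number,)).fetchone()[0] +
                   db.execute("SELECT COUNT(*) FROM visited WHERE pass=?",
                              (pass_number,)).fetchone()[0])
        if not started:  # first time this pass has run: seed the queue
            db.execute("INSERT INTO pending VALUES (?,?)", (rootcap, pass_number))
            db.commit()
        while True:
            row = db.execute("SELECT cap FROM pending WHERE pass=?"
                             " ORDER BY rowid DESC LIMIT 1",  # LIFO: depth-first
                             (pass_number,)).fetchone()
            if row is None:
                break  # this pass is complete
            cap = row[0]
            seen = db.execute("SELECT 1 FROM visited WHERE cap=? AND pass=?",
                              (cap, pass_number)).fetchone()
            if seen is None:  # the "been here already, move along" check
                process(cap)  # check / add-lease / record results
                for childcap in list_children(cap):  # empty for non-directories
                    db.execute("INSERT INTO pending VALUES (?,?)",
                               (childcap, pass_number))
                db.execute("INSERT INTO visited VALUES (?,?)", (cap, pass_number))
            db.execute("DELETE FROM pending WHERE cap=? AND pass=?",
                       (cap, pass_number))
            db.commit()  # an interruption here loses at most the current object

The per-pass visited table is what keeps the traversal loop-proof even when directories link to each other, and committing after every object is what keeps the lost progress small.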
cheers,
 -Brian