Having read Zooko's original message more carefully, I think I was responding to the wrong concern (which is probably why I deferred sending that response for long enough to forget about it). Here's a better response.
On 9/24/10 12:36 AM, Zooko O'Whielacronx wrote:
> However, I'm also sure that Tahoe-LAFS *failed* to scale up in a
> different way, and that other failure is why I jealously guard the
> secret entrance to the Volunteer Grid from passersby.
>
> The way that it failed to scale up was like this: suppose you use K=3,
> H=7, M=10 erasure-coding. Then the more nodes in the system the more
> likely it is to incur a simultaneous outage of 5 different nodes
> (H-K+1), which *might* render some files and directories unavailable.
> (That's because some files or directories might be on only H=7
> different nodes. The failure of 8 nodes (M-K+1) will *definitely*
> render some files or directories unavailable.) ...
>
> Okay, that doesn't sound too good, but it isn't that bad. You could
> say to yourself that at least the rate of unavailable or even
> destroyed files, expressed as a fraction of the total number of files
> that your grid is serving, should be low. *But* there is another
> design decision that mixes with this one to make things really bad.
> That is: a lot of maintenance operations like renewing leases and
> checking-and-repairing files, not to mention retrieving your files for
> download, work by traversing through directories stored in the
> Tahoe-LAFS filesystem. Each Tahoe-LAFS directory (which is stored in a
> Tahoe-LAFS file) is independently randomly assigned to servers.
>
> See the problem? If you scale up the size of your grid in terms of
> servers *and* the size of your filesystem in terms of how many
> directories you have to traverse through in order to find something
> you want then you will eventually reach a scale where all or most of
> the things that you want are unreachable all or most of the time

Ah, ok, so the concern is about the interaction between two trends:

 1: while the percentage of unavailable file objects remains constant,
    the absolute number of them is growing along with the rest of the
    grid.

 2: the "diameter" (in the graph-theory sense) or typical directory
    tree depth is growing over time, as an individual user stores more
    and more files. Therefore the number of successful downloads needed
    to read a leaf node is growing (you must be able to retrieve the
    root directory object, plus the subdirectories, etc., down to the
    actual file you care about), and therefore the probability of
    successfully reading an average leaf node is dropping over time.

I think it's sufficient to simply pay attention to the second trend,
ignoring the first, and express it like this:

 3: given a constant probability of file-object unavailability, the
    availability of a given file is a function of its depth in the
    directory tree, and this depth is likely to grow over time.

Note that this affects a single user who stores more and more files
over time (and only retains a single out-of-band rootcap). Adding more
users to the system doesn't cause this problem (because they're each
holding their own rootcap). You might portray the concern as scaling
with the ratio of (number of files we care about) / (number of
out-of-band rootcaps we retain), or better yet, (average depth of the
directory tree) / (number of out-of-band rootcaps we retain). Shawn's
backup system, which flattens the directory tree into a single table,
drops the average-depth numerator to a nice constant "1". A tahoe-like
system that forsakes directories completely by retaining an out-of-band
table of filecaps would drop the numerator to 0.
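To put a toy number on trend 3: if you hold the per-object retrieval
probability fixed, the chance of successfully walking a whole path down
to a leaf falls off exponentially with depth. A minimal sketch (in
Python; the 0.999 figure is an arbitrary stand-in, not a measured grid
number):

  # probability that any single file/dirnode object is retrievable
  p_obj = 0.999
  for depth in (1, 2, 5, 10, 20, 50):
      # reading a leaf at this depth means retrieving the root dirnode,
      # each intermediate dirnode, and the leaf itself: 'depth'
      # successes in a row
      print(depth, p_obj ** depth)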
I'd guess that, for most users, the average depth of their filesystems
is likely to grow logarithmically with the number of files, not
linearly, slowing the advancement of the problem somewhat. In fact, I
suspect that the average depth is closer to constant: I tend to use the
same directory structure on each new machine I build, and very rarely
introduce a new level of subdirectories (when something gets too big),
maybe once every other year.

So to model that, I'd pick a maximum target directory depth (say 10),
calculate Prob(success-of-file-retrieval) as requiring 10 simultaneous
successful object retrievals (i.e. Pobj**10), then choose k-of-N to get
Prob(success-of-file-retrieval) above my goal. (I'll sketch that
arithmetic below.) It means you need more margin, certainly, but not
anything too difficult. And the rule would be that you get enough
reliability as long as you don't go crazy with your directory
structures. (There are other problems with very deep directory trees on
local filesystems too: OS limits on pathnames, usability limits of
tools and cut-and-paste of pathnames, reliability losses as you touch
more and more local disk blocks for the dnodes, etc, so it's not an
entirely foreign problem.)

> This is what happened to allmydata.com, when their load grew and their
> financial and operational capacity shrank so that they couldn't
> replace dead hard drives, add capacity, and run deep-check-and-repair
> and deep-add-lease.

That's the claim my previous message responded to, because I don't
think it's entirely accurate. It implies that dead hard drives were a
problem (when in fact we never lost enough to threaten anything), that
adding capacity would have helped, and that repair would have helped.
The actual problem was that the money ran out and the whole grid was
shut down: the revenue (i.e. customers being able to get their files)
and the cost (running those servers) were both all-or-nothing.

It *would* have been marginally helpful (but probably not practical or
feasible) to design a system that could save costs by scaling *down*
reliability while still retaining availability. Having all 100+ drives
from day one (or implementing continuous rebalancing to make it look
like you'd had them all along), and setting N much higher from day one
(so data was spread over all drives, not just a relatively small
subset) would accomplish that. If the money had gradually slowed down,
we could have powered down half the grid, cut the operational costs in
half, and continued to provide full service despite reduced redundancy
(thus still bringing in full revenue).

From the business point of view, I don't think that would have worked:
either we'd have needed to spend a lot more money up front (to buy two
years' worth of hard drives and servers ahead of time), or spend the
engineering time and bandwidth and CPU time to do continuous
rebalancing, and the only benefit to the company would have been the
ability to cut our burn rate by a little bit without needing to shut
off all the revenue-generating accounts at the same time. But I think
the servers were not the majority of the business costs (compared to
salaries and office space, etc), and colo space is pretty quantized, so
I'm doubtful that the ability to scale downwards would have helped
much. (And what dot-com-era investor wants to hear that you're going to
spend even *more* of their money to create a company that can fail
gracefully? Work hard, grow fast, reach for the stars or die trying, by
golly :).
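To make the margin calculation sketched above concrete, here's a rough
model (in Python, with made-up numbers: 70%-available servers, a 99.99%
per-file goal, and the depth-10 target; none of these are Tahoe-LAFS
defaults, and it assumes independent server failures with one share per
server):

  from math import comb

  def object_availability(p_server, k, n):
      # P(at least k of the n share-holding servers are reachable)
      return sum(comb(n, i) * p_server**i * (1 - p_server)**(n - i)
                 for i in range(k, n + 1))

  depth, goal, p_server = 10, 0.9999, 0.7
  needed = goal ** (1.0 / depth)  # per-object availability the goal demands
  print("each object must be available with p >= %.6f" % needed)
  for k, n in [(3, 10), (3, 16), (3, 20), (5, 20)]:
      p_obj = object_availability(p_server, k, n)
      print("%d-of-%d: p_obj=%.6f, meets goal: %s"
            % (k, n, p_obj, p_obj >= needed))

Widening N relative to k is what buys back the margin that the extra
directory depth costs you.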
From the customer's point of view, of course, graceful-scale-down is
better than all-or-nothing. But again, from a marketing point of view,
how much confidence would your average consumer have in a backup
company which was visibly planning for failure from the beginning?

Anyways, yeah, the does-it-scale-down question is a good one to ask,
and evaluating its benefits against its costs (rebalancing bandwidth,
basically) is a good comparison to make. I think the take-home message
may be for us to put more effort into a rebalancing and/or reencoding
strategy. I remember talking somewhere recently about experimenting
with fixing the encoding at something like 3-of-256, always generating
all 256 shares regardless of how many servers were available, but only
uploading a reasonable subset of them (one per server, up to some
maximum). Then, later, when more servers become available, you can
spread the file out further.

Another thing to consider is a form of immutable file which has mutable
encoding parameters. There would be a special "uploader" or
"rebalancer" cap, which has a signing key, and the root of the share
merkle tree would be signed (not hashed) by the UEB. The
plaintext/ciphertext merkle tree *would* be hashed, so the actual
contents of the file are immutable and fixed by the readcap. But
someone with the rebalance-cap could generate new shares with different
encoding parameters, so a file could start out at 3-of-10 when there
are only 10 servers, then get changed to 30-of-100 when there are 100
servers, while retaining the same filecap. Of course, the question of
who holds on to the rebalance-cap is a troublesome one: anyone who
holds it can create invalid shares that are inefficient to detect and
recover from, so you don't want to spread it around, but they're a
hassle to store. (Hm, what if we stored them in dirnodes like mutable
writecaps? Read-only users wouldn't be able to help re-encode the file,
but the original uploader might. It might interact weirdly with the
immutable directories created by 'tahoe backup', but we could probably
make it work.)

cheers,
 -Brian
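P.S.: for concreteness, here's a toy sketch (in Python) of the
fix-the-encoding-at-3-of-256, upload-a-subset, spread-later idea. None
of these names are real Tahoe-LAFS APIs: place_shares, MAX_PLACED, and
the server strings are invented for illustration, and a real
implementation would track placement through the storage protocol
rather than a local dict.

  TOTAL_SHARES = 256   # always generated at upload time (k=3 fixed forever)
  MAX_PLACED = 20      # "a reasonable subset": one share per server, capped

  def place_shares(servers, already_placed=None):
      """Assign share numbers 0..255 to servers, at most one per server
      and at most MAX_PLACED in total, leaving the rest for later."""
      placed = dict(already_placed or {})   # sharenum -> server name
      unplaced = (i for i in range(TOTAL_SHARES) if i not in placed)
      for server in servers:
          if len(placed) >= MAX_PLACED:
              break
          if server in placed.values():     # server already holds a share
              continue
          sharenum = next(unplaced, None)
          if sharenum is None:              # all 256 shares already placed
              break
          placed[sharenum] = server
      return placed

  # initial upload onto a five-server grid: shares 0-4 go out, 5-255 stay home
  placement = place_shares(["s1", "s2", "s3", "s4", "s5"])

  # later, when the grid has grown, untouched share numbers get pushed to
  # the new servers; k-of-256 never changes, so the filecap never changes
  placement = place_shares(["s%d" % i for i in range(1, 21)], placement)

The point is that spreading out later is purely a placement decision:
the encoding parameters, and therefore the filecap, never change, which
is what distinguishes this from the mutable-encoding-parameters idea
above.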