Shawn Willden wrote:
> On Wednesday 12 August 2009 08:02:13 am Francois Deppierraz wrote:
>> Hi Shawn,
>>
>> Shawn Willden wrote:
>>> The first observation is that when uploading small files it is very
>>> hard to keep the upstream pipe full, regardless of my configuration.
>>> My upstream connection is only about 400 kbps, but I can't sustain
>>> any more than about 200 kbps usage regardless of whether I'm using a
>>> local node that goes through a helper, direct to the helper, or a
>>> local node without the helper. Of course, not using the helper
>>> produces the worst effective throughput, because the uploaded data
>>> volume stays about the same, but the data is FEC-expanded.
>>
>> This is probably because Tahoe doesn't use a windowing protocol which
>> makes it especially sensitive to the latency between nodes.
>>
>> You should have a look at Brian's description in bug #397.
>
> Very interesting, but I don't think #397 explains what I'm seeing. I
> can pretty well saturate my upstream connection with Tahoe when
> uploading a large file.
First off, I must admit that I'm somewhat embarrassed by Tahoe's relatively poor throughput+latency performance. Its inability to saturate a good upstream pipe is probably the biggest reason that I sometimes hesitate to recommend it as a serious backup solution to my friends. In the allmydata.com world, we deferred performance analysis and improvement for two reasons: most consumer upload speeds were really bad (so we could hide behind that), and we didn't have any bandwidth-management tools to avoid saturating users' upstream links and causing problems for them (so e.g. *not* having a windowing protocol could actually be considered a feature, since it left some bandwidth available for the HTTP requests to get out).

That said, I can think of a couple of likely slowdowns.

1: The process of uploading an immutable file involves a *lot* of roundtrips, which will hurt small files a lot more than large ones. Peer selection is done serially: we ask server #1, wait for a response, then ask server #2, wait, etc. This could be done in parallel, to a certain extent (it requires being somewhat optimistic about the responses you're going to get, and code to recover when you guess wrong). We then send share data and hashes in a bunch of separate messages (there is some overhead per message, but 1.5.0 has code to pipeline them [see #392], so I don't think this introduces a significant number of roundtrips). I think that a small file (1kB) can be uploaded in peer-selection plus two RTTs (one for a big batch of write() messages, the second for the final close() message), but the peer-selection phase will take a minimum of 10 RTTs (one per server contacted).

I've been contemplating a rewrite of the uploader code for a while now. The main goals would be:

* improve behavior in the repair case, where there are already shares present. Currently repair will blindly place multiple shares on the same server and wind up with multiple copies of the same share. Instead, once upload notices any evidence of existing shares, it should query lots of servers to find all the existing shares, then generate the missing ones and put them on servers that don't already have any (#610, #362, #699)
* improve tolerance to servers which disappear silently during upload, eventually giving up on the server instead of stalling for 10-30 minutes
* improve parallelism during peer-selection

2: Foolscap is doing an awful lot of work to serialize and deserialize the data down at the wire level. This part of the work is roughly proportional to the number of objects being transmitted (so a list of 100 32-byte hashes is a lot slower than a single 3200-byte string). http://foolscap.lothar.com/trac/ticket/117 has some notes on how we might speed up Foolscap. And the time spent in Foolscap gets added to your round-trip times (we can't parallelize over that time), so it gets multiplied by the 12-ish RTTs per upload. I suspect the receiving side is slower than the sending side, which means the uploader will spend a lot of time twiddling its thumbs while the servers think hard about the bytes they've just received.

We're thinking about switching away from Foolscap for share-transfer and instead using something closer to HTTP (#510). This would be an opportunity to improve the RTT behavior as well: we don't really need to wait for an ACK before we send the next block, we just need confirmation that the whole share was received correctly, and we need to avoid buffering too much data in the outbound socket buffer.
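Just to make the shape of that concrete, here is a rough sketch of what a single-round-trip share push over an HTTP-ish transport could look like. None of this exists today: the /storage/<storage-index>/<shnum> URL, the 201 reply, and the server behind it are all invented for illustration, and it relies on the stdlib http.client chunked-body support. The point is just the exchange pattern: stream every block in one PUT with no per-block ACK, let the blocking socket writes provide the backpressure, and wait for exactly one response telling us whether the whole share arrived intact.

  import hashlib
  import http.client

  def push_share(host, port, storage_index, shnum, blocks):
      """Stream one share to a (hypothetical) HTTP storage server.

      blocks: an iterable of byte strings making up the share, in order.
      """
      conn = http.client.HTTPConnection(host, port)
      digest = hashlib.sha256()

      def body():
          # Each block is handed to the socket as soon as the previous one
          # has been written, so we never hold more than one block in
          # userspace; the blocking socket write provides the backpressure.
          for block in blocks:
              digest.update(block)
              yield block

      # One chunk-encoded PUT for the whole share: no per-block ACKs.
      conn.request("PUT", "/storage/%s/%d" % (storage_index, shnum),
                   body=body(), encode_chunked=True)

      # The only round trip we wait for: the server's verdict on the
      # complete share.
      resp = conn.getresponse()
      resp.read()
      conn.close()
      if resp.status != 201:
          raise RuntimeError("share upload failed: %d %s"
                             % (resp.status, resp.reason))
      return digest.hexdigest()  # what we sent, for local bookkeeping

A real protocol would also need to carry the lease/accounting information discussed below, and some way for the server to say "I already have this share", which is part of why this is a bigger project than it sounds.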
In addition, we could probably trim off an RTT by changing the semantics of the initial message, to combine a do-you-have-share query with a please-prepare-to-upload query. Or we might decide to give up on grid-side convergence and stop doing the do-you-have-share query first, speeding up the must-upload case at the expense of the might-not-need-to-upload case.

This involves a lot of other projects, though:

* change the immutable share-transfer protocol to be less object-ish (allocate_buckets, bucket.write) and more send/more/more/done-ish
* change the mutable share design to make shares fully self-validating, with the storage-index as the mutable share pubkey (or its hash)
* make the server responsible for validating shares as they arrive: replace the "write-enabler" with a rule that the server accepts a mutable delta if it results in a new share that validates against the pubkey and has a higher seqnum (this will increase the server load a bit, and it is the biggest barrier to giving up Foolscap's link-level encryption)
* replace the renew-lease/cancel-lease shared secrets with DSA pubkeys, again to tolerate the lack of link-level encryption
* interact with Accounting: share uploads will need to be signed by a who-is-responsible-for-this-space key, and doing this over HTTP will be significantly different from doing it over Foolscap

So this project is waiting for DSA-based lease management, new DSA-based mutable files, and some Accounting design work. All of these are waiting on getting ECDSA into pycryptopp.

3: Using a nearby Helper might help, and it might hurt. You spend a lot more RTTs by using the helper (it effectively performs peer-selection twice, plus an extra couple of RTTs to manage the client-to-helper negotiation), but you're now running encryption and FEC in separate processes, so if you have multiple cores or hyperthreading or whatever, you can utilize the hardware better. The Helper (which is doing more Foolscap messages per file than the client) is doing slightly less CPU work, so those message RTTs might be smaller, which might help.

It would be an interesting exercise to graph throughput/latency against filesize using a local helper (probably on a separate machine, but connected via LAN) and see where the break-even point is; I've appended a rough timing script below. We've got a ticket (#398) about having the client *not* use the Helper sometimes; maybe a tahoe.cfg option to set a threshold filesize could be useful.

> For example, if the local node supplies the SID to the helper, the
> helper can do peer selection before the upload is completed.

Yeah, currently the Helper basically tries to locate shares, reports back the results of that attempt, then forgets everything it's learned and starts a new upload. We could probably preserve some information across that boundary to speed things up. Maybe the first query should be allocate_buckets(): if the file was already in place, we abort the upload, and if it wasn't, we use the returned bucket references. That could shave off N*RTT. We should keep this in mind during the Uploader overhaul; it'll probably influence the design.
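Here is the timing sketch promised above, for the helper break-even question. It assumes nothing beyond a running client node that the stock 'tahoe put' command can reach; the sizes, the use of random data (so convergence can't turn a repeat run into a no-op), and the kbps arithmetic are just illustrative choices. Run it once with a helper configured (helper.furl in tahoe.cfg) and once without, and compare the two curves.

  import os
  import subprocess
  import tempfile
  import time

  SIZES = [1000, 10000, 100000, 1000000, 10000000]  # bytes

  def time_upload(size):
      # Random data, so convergent encryption can't dedup a repeat run and
      # every invocation really pushes bytes upstream.
      fd, path = tempfile.mkstemp(prefix="tahoe-bench-")
      try:
          os.write(fd, os.urandom(size))
          os.close(fd)
          start = time.time()
          with open(os.devnull, "wb") as devnull:
              # 'tahoe put FILE' uploads the file and prints its cap; we
              # only care about how long it takes.
              subprocess.check_call(["tahoe", "put", path], stdout=devnull)
          return time.time() - start
      finally:
          os.remove(path)

  for size in SIZES:
      elapsed = time_upload(size)
      print("%10d bytes  %7.2f s  %8.1f kbps"
            % (size, elapsed, size * 8.0 / (elapsed * 1000)))

Each run also pays the CLI startup cost, so for the smallest sizes it is worth timing a near-empty put first and subtracting that baseline before reading too much into the numbers.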
cheers,
 -Brian

_______________________________________________
tahoe-dev mailing list
[email protected]
http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev