Hello, I was very happy to see this elephant in the room finally discussed.

> Its inability to saturate a good upstream pipe is probably the biggest reason that I
> sometimes hesitate to recommend it as a serious backup solution to my friends.

This is exactly what I was forced to admit, except that in my case the
recommendations go to my consulting clients.

I am still here, as I see enormous potential in Tahoe, but for that
potential to be realized, this project will need more users in real-world
scenarios. And for that to happen, this performance issue should
absolutely be given the highest priority.

Thanks,
Andrej Falout

On Thu, Aug 20, 2009 at 6:20 AM, Brian Warner <[email protected]> wrote:
> Shawn Willden wrote:
> > On Wednesday 12 August 2009 08:02:13 am Francois Deppierraz wrote:
> >> Hi Shawn,
> >>
> >> Shawn Willden wrote:
> >>> The first observation is that when uploading small files it is very
> >>> hard to keep the upstream pipe full, regardless of my configuration.
> >>> My upstream connection is only about 400 kbps, but I can't sustain
> >>> any more than about 200 kbps usage regardless of whether I'm using a
> >>> local node that goes through a helper, direct to the helper, or a
> >>> local node without the helper. Of course, not using the helper
> >>> produces the worst effective throughput, because the uploaded data
> >>> volume stays about the same, but the data is FEC-expanded.
> >> This is probably because Tahoe doesn't use a windowing protocol which
> >> makes it especially sensitive to the latency between nodes.
> >>
> >> You should have a look at Brian's description in bug #397.
> >
> > Very interesting, but I don't think #397 explains what I'm seeing. I
> > can pretty well saturate my upstream connection with Tahoe when
> > uploading a large file.
>
> First off, I must admit that I'm somewhat embarrassed by Tahoe's
> relatively poor throughput+latency performance. Its inability to
> saturate a good upstream pipe is probably the biggest reason that I
> sometimes hesitate to recommend it as a serious backup solution to my
> friends. In the allmydata.com world, we deferred performance analysis
> and improvement for two reasons: most consumer upload speeds were really
> bad (so we could hide behind that), and we don't have any
> bandwidth-management tools to avoid saturating their upstream and thus
> causing problems to other users (so e.g. *not* having a windowing
> protocol could actually be considered a feature, since it left some
> bandwidth available for the HTTP requests to get out).
>
> That said, I can think of a couple of likely slowdowns.
>
> 1: The process of uploading an immutable file involves a *lot* of
> roundtrips, which will hurt small files a lot more than large ones. Peer
> selection is done serially: we ask server #1, wait for a response, then
> ask server #2, wait, etc. This could be done in parallel, to a certain
> extent (it requires being somewhat optimistic about the responses you're
> going to get, and code to recover when you guess wrong). We then send
> share data and hashes in a bunch of separate messages (there is some
> overhead per message, but 1.5.0 has code to pipeline them [see #392], so
> I don't think this will introduce a significant number of roundtrips). I
> think that a small file (1kb) can be uploaded in peer-selection plus two
> RTT (one for a big batch of write() messages, the second for the final
> close() message), but the peer-selection phase will take a minimum of 10
> RTT (one per server contacted).
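
To make that serial-vs-parallel peer-selection point concrete, here is a
rough sketch (in Twisted style, since that is what Tahoe uses) of querying
a batch of servers at once. It is not Tahoe's actual uploader code:
ask_server() is a hypothetical stand-in for whatever per-server query the
real uploader sends, and the batch size of 10 just mirrors the "minimum of
10 RTT" figure above.

# Illustrative sketch only: parallel peer selection with Twisted.
# ask_server() is a hypothetical stand-in for the real per-server query.
from twisted.internet import defer

def ask_server(server, storage_index):
    # Pretend the server immediately agrees to hold a share; the real
    # query would be a remote call returning a Deferred.
    return defer.succeed(True)

@defer.inlineCallbacks
def select_peers_in_parallel(servers, storage_index, shares_needed, window=10):
    """Ask up to `window` servers per batch instead of one at a time, so
    each batch costs roughly one round trip instead of one per server."""
    accepted = []
    remaining = list(servers)
    while remaining and len(accepted) < shares_needed:
        batch, remaining = remaining[:window], remaining[window:]
        queries = [ask_server(s, storage_index) for s in batch]
        results = yield defer.DeferredList(queries, consumeErrors=True)
        for server, (ok, willing) in zip(batch, results):
            if ok and willing:
                accepted.append(server)
    # Optimistic batching can over-allocate; a real implementation would
    # release the extra reservations here before returning.
    defer.returnValue(accepted[:shares_needed])
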
> I've been contemplating a rewrite of the uploader code for a while now.
> The main goals would be:
>
>  * improve behavior in the repair case, where there are already shares
>    present. Currently repair will blindly place multiple shares on the
>    same server, and wind up with multiple copies of the same share.
>    Instead, once upload notices any evidence of existing shares, it
>    should query lots of servers to find all the existing shares, and
>    then generate the missing ones and put them on servers that don't
>    already have any. (#610, #362, #699)
>  * improve tolerance to servers which disappear silently during upload,
>    eventually giving up on the server instead of stalling for 10-30
>    minutes
>  * improve parallelism during peer-selection
>
> 2: Foolscap is doing an awful lot of work to serialize and deserialize
> the data down at the wire level. This part of the work is roughly
> proportional to the number of objects being transmitted (so a list of
> 100 32-byte hashes is a lot slower than a single 3200-byte string).
> http://foolscap.lothar.com/trac/ticket/117 has some notes on how we
> might speed up Foolscap. And the time spent in foolscap gets added to
> your round-trip times (we can't parallelize over that time), so it gets
> multiplied by the 12ish RTTs per upload. I suspect the receiving side is
> slower than the sending side, which means the uploader will spend a lot
> of time twiddling its thumbs while the servers think hard about the
> bytes they've just received.
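
As a small aside on point 2: the cost difference Brian describes (a list
of 100 32-byte hashes versus a single 3200-byte string) suggests an
obvious workaround, namely packing the hashes into one string before
handing them to the wire protocol. The sketch below only illustrates the
packing trick itself, independent of Foolscap.

# Sketch of packing fixed-size hashes into one string, so the wire layer
# serializes a single object instead of one object per hash.
import struct

HASH_SIZE = 32

def pack_hashes(hashes):
    assert all(len(h) == HASH_SIZE for h in hashes)
    return b"".join(hashes)

def unpack_hashes(blob):
    assert len(blob) % HASH_SIZE == 0
    return [blob[i:i + HASH_SIZE] for i in range(0, len(blob), HASH_SIZE)]

# Round trip: 100 dummy 32-byte hashes become one 3200-byte string.
hashes = [struct.pack("B", n) * HASH_SIZE for n in range(100)]
blob = pack_hashes(hashes)
assert len(blob) == 3200
assert unpack_hashes(blob) == hashes
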
> We're thinking about switching away from Foolscap for share-transfer and
> instead using something closer to HTTP (#510). This would be an
> opportunity to improve the RTT behavior as well: we don't really need to
> wait for an ACK before we send the next block, we just need confirmation
> that the whole share was received correctly, and we need to avoid
> buffering too much data in the outbound socket buffer. In addition, we
> could probably trim off an RTT by changing the semantics of the initial
> message, to combine a do-you-have-share query with a
> please-prepare-to-upload query. Or, we might decide to give up on
> grid-side convergence and stop doing the do-you-have-share query first,
> to speed up the must-upload case at the expense of the
> might-not-need-to-upload case. This involves a lot of other projects,
> though:
>
>  * change the immutable share-transfer protocol to be less object-ish
>    (allocate_buckets, bucket.write) and more send/more/more/done-ish.
>  * change the mutable share design to make shares fully self-validating,
>    with the storage-index as the mutable share pubkey (or its hash)
>  * make the server responsible for validating shares as they arrive,
>    replace the "write-enabler" with a rule that the server accepts a
>    mutable delta if it results in a new share that validates against the
>    pubkey and has a higher seqnum (this will also increase the server
>    load a bit) (this is the biggest barrier to giving up Foolscap's
>    link-level encryption)
>  * replace the renew-lease/cancel-lease shared secrets with DSA pubkeys,
>    again to tolerate the lack of link-level encryption
>  * interact with Accounting: share uploads will need to be signed by a
>    who-is-responsible-for-this-space key, and doing this over HTTP will
>    be significantly different than doing it over foolscap.
>
> So this project is waiting for DSA-based lease management, new DSA-based
> mutable files, and some Accounting design work. All of these are waiting
> on getting ECDSA into pycryptopp.
>
> 3: Using a nearby Helper might help and might hurt. You spend a lot more
> RTTs by using the helper (it effectively performs peer-selection twice,
> plus an extra couple RTTs to manage the client-to-helper negotiation),
> but now you're running encryption and FEC in separate processes, so if
> you have multiple cores or hyperthreading or whatever you can utilize
> the hardware better. The Helper (which is doing more foolscap messages
> per file than the client) is doing slightly less CPU work, so those
> message RTTs might be smaller, which might help. It would be an
> interesting exercise to graph throughput/latency against filesize using
> a local helper (probably on a separate machine but connected via LAN)
> and see where the break-even point is. We've got a ticket (#398) about
> having the client *not* use the Helper sometimes.. maybe a tahoe.cfg
> option to set a threshold filesize could be useful.
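
The throughput-versus-filesize graph Brian suggests is easy to start
collecting numbers for with nothing fancier than `tahoe put` and a timer.
A rough sketch, assuming the `tahoe` command is on the PATH and uses
whatever node it finds in its default node directory; run it once against
a client configured to use the Helper and once against one without, then
compare the two curves to find the break-even filesize.

# Rough measurement sketch: time `tahoe put` over a range of file sizes.
import os
import subprocess
import tempfile
import time

def time_upload(size):
    """Upload `size` random bytes with `tahoe put`; return (seconds, bytes/sec)."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(os.urandom(size))
        path = f.name
    try:
        start = time.time()
        subprocess.check_call(["tahoe", "put", path])
        elapsed = time.time() - start
    finally:
        os.unlink(path)
    return elapsed, size / elapsed

for size in (1024, 10 * 1024, 100 * 1024, 1024 * 1024):
    seconds, rate = time_upload(size)
    print("%8d bytes: %6.2f s, %8.0f bytes/s" % (size, seconds, rate))
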
> > For example, if the local node supplies the SID to the helper, the
> > helper can do peer selection before the upload is completed.
>
> Yeah, currently the Helper basically tries to locate shares, reports
> back the results of that attempt, then forgets everything it's learned
> and starts a new upload. We could probably preserve some information
> across that boundary to speed things up. Maybe the first query should be
> allocate_buckets(), and if the file was already in place, we abort the
> upload, but if it wasn't, we use the returned bucket references. That
> could shave off N*RTT. We should keep this in mind during the Uploader
> overhaul; it'll probably influence the design.
>
> cheers,
>  -Brian
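
To close with one more illustration: the allocate_buckets-first idea above
is mostly a control-flow change on the client side. The sketch below uses
a simplified, hypothetical allocate() that returns (already_have, writers);
the real remote API takes more arguments than this, but the shape of the
decision is the point.

# Simplified sketch of "make allocate_buckets() the first query": one
# round trip both detects existing shares and reserves space for the
# missing ones. allocate() is a hypothetical stand-in for the remote call.

def allocate(server, storage_index, share_numbers, share_size):
    # Placeholder: pretend the server already holds share 0 and hands
    # back a writer for each other requested share.
    already_have = set([0])
    writers = dict((n, "writer-for-share-%d" % n)
                   for n in share_numbers if n not in already_have)
    return already_have, writers

def start_upload(server, storage_index, share_numbers, share_size):
    already_have, writers = allocate(server, storage_index,
                                     share_numbers, share_size)
    if set(share_numbers) <= set(already_have):
        return None  # every share is already in place: abort the upload
    # Keep the returned bucket references and push only the missing shares.
    return dict((shnum, writers[shnum])
                for shnum in share_numbers if shnum not in already_have)

print(start_upload("server-1", "some-storage-index", list(range(10)), 65536))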
