Shawn Willden wrote:
> Can't you pass a disk-backed buffer-like object to transport.write()?
> Perhaps an mmap object? If there's a reason that doesn't work, then
> transport.write() needs to either accept a file-like object or
> implement disk-based buffering itself.
The short answer is no: transport.write() accepts a bytestring and nothing
else. However, it's pretty easy to build a disk-backed "Producer" object and
attach it to the transport (which implements the "Consumer" interface); the
transport will then pull from the producer whenever its buffer has enough
space. It'd take maybe 10 lines of code (this is what the Downloader does
when invoked by an HTTP GET; a rough sketch is further down).

But now think about where the data goes: you have to write N/k * 1GB to
disk, and then read it all back out again. Sure, you get to retire the input
stream quickly (well, as quickly as you're able to process and write
gigabytes to your disk, which could still take an hour). But now you add the
overhead of doing all of that disk IO, which will probably result in a
slower overall upload as it stalls everything else you're trying to do (disk
IO always blocks the process, since nobody has a usable async filesystem
API), as well as slamming the host and slowing down other processes.

The "Mountain View" system which predated Tahoe (derived from Mnet and the
original MojoNation codebase) did this, and performance for large files was
horrible. One problem was that it encoded everything before starting any
part of the upload (so it could retire the input file faster, and to
simplify the "Pusher" uploading code). This hurt small files: lots of extra
steps and disk IO for something that could otherwise be finished in RAM and
retired before it fell out of the CPU cache.

But the biggest problem was that this approach performs the share
transposition on disk. As you know, each share contains one block per
segment (otherwise we'd hit the "chunking problem", where reliability drops
quickly with the size of the file). If you draw the shares as short+wide
horizontal boxes and stack them on top of each other, then the encoding
process is filling in this rectangle from the left (segment[0], then
segment[1], etc). However, the upload process (depending upon how you do it)
is *reading* this rectangle from the top: sending out share[0], then
share[1], etc. The matrix is being transposed in between the encoding step
and the upload step. This kills the disk: you write a huge amount of data in
one order, then (from the disk's point of view) do a bazillion seeks and
read it out in a completely different order. All of the disk's and local
filesystem's readahead optimizations fail. You could literally hear the disk
crunching as the heads flew around like mad. The computer sounded angry.
People didn't like this :).

So one of the big early decisions in Tahoe was to move this transposition
into the network. By uploading all the shares at the same time (instead of
storing them to disk first and uploading them one at a time, as peers became
available), we can hide the transposition in the wire, and allow both sides
to do nice clean linear reads/writes to the source and share files
respectively. This was dramatically less painful. Disk seeks are the most
expensive thing you can do (even worse than TCP setup time).
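To put some flesh on "maybe 10 lines of code", here's a rough sketch of
such a disk-backed pull producer (hypothetical names, not actual Tahoe
code; the transport plays the Consumer role and calls resumeProducing()
whenever its outbound buffer has room):

from zope.interface import implementer
from twisted.internet.interfaces import IPullProducer

@implementer(IPullProducer)
class DiskBackedProducer:
    """Feed a Twisted transport (an IConsumer) from a file on disk."""

    def __init__(self, filename, consumer, chunk_size=65536):
        self.f = open(filename, "rb")
        self.consumer = consumer
        self.chunk_size = chunk_size
        # streaming=False means "pull": the transport will call
        # resumeProducing() each time it has room for another chunk
        consumer.registerProducer(self, streaming=False)

    def resumeProducing(self):
        data = self.f.read(self.chunk_size)
        if data:
            self.consumer.write(data)
        else:
            self.f.close()
            self.consumer.unregisterProducer()

    def stopProducing(self):
        # the connection went away before we finished
        self.f.close()

Something like DiskBackedProducer("shares.tmp", transport) would feed the
wire from disk while never holding more than one chunk in RAM.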
> Expecting the data to be small enough to be queueable in RAM isn't a
> good idea, even with ubiquitous virtualized memory and gigabytes of
> physical RAM.

Well, this touches on a deeper issue: the "leaky abstraction" of where the
buffering lives. The layers of the network stack like to pretend that
they're independent, and that socket.write() is fire-and-forget (i.e. once
socket.write() returns, the kernel has taken responsibility and the
application doesn't need to think about it anymore).

But it's a big multi-party system, each layer with its own limitations, and
the layers can't accept unconditional responsibility without incurring
unbounded costs. There's "backpressure" from one layer to another, like when
a twisted Consumer stalls its Producer, or when socket.write() doesn't
accept all of your data, or when the remote TCP stack doesn't ACK your
segment and open up the window until *their* userspace has socket.read()
some of the inbound data. Naive programs, with blocking IO calls, usually
ignore this backpressure: things work fine at low volumes, but stall in
unpredictable ways when usually-quick remote calls start taking a long time.

To avoid unbounded storage usage, you always want to convey this
backpressure as far back as possible (ideally by making the human not type
until there's room for it). But that's both complicated and annoying: many
layers don't like to be told to wait, and are unwilling to accept an API in
which all of their data is not instantly accepted. To accommodate those
layers, the layer below must buffer. If the insistent layer is not using
some sort of backpressure of its own (e.g. the Encoder waiting for all
servers to ACK block receipt before encoding the next segment), then that
buffering can involve an unbounded amount of storage.

Usually what happens is that somebody imposes a limit on the buffer size and
starts throwing errors when it gets full, in the hope that this will make
the sender slow down. However, without careful design, it is easy to wind up
with runaway buffering problems: the system gets so busy handling these
errors that it can't make any progress on the existing queue. Network driver
and TCP stack design encourages a "discard early" policy: if you're going to
have to drop that inbound packet (because you don't have anywhere to put
it), make the decision with as little work as possible, so you can conserve
your CPU (or memory bandwidth, or whatever) for working through the data
that you *don't* discard.

Regardless of how it's limited, the buffer storage can be hosted in memory,
on disk, or on some clever combination of both. On modern systems, it
frequently doesn't matter which an application uses, because the kernel will
effectively combine the two (use enough memory and it'll get swapped out to
disk; or write to disk but really your data gets held in a RAM-based write
cache until it's convenient to flush, reads will come from RAM, and if you
delete it quickly enough it'll never actually touch the spindle). On
Windows, this didn't work so well (the kernel write cache wasn't too
bright). And when the data size gets large, it also starts to not work out
so well (the kernel wants to avoid unbounded disk-cache storage, so it will
stall your file.write() and flush to disk when the cache gets too big, so
all of that N/k * 1GB will actually touch the disk). But it has to go
somewhere, and the larger it gets, the more trouble it will cause.

So the one-segment-at-a-time design is there to provide that continuous
chain of backpressure: keep the pipe full, but don't encode (too much) more
data than we can currently use. Don't encrypt more data than we can encode.
Don't read more data than we can encrypt. Keep the buffering requirements
bounded. We currently use something like 3x or 4x the segment size (so
probably 0.5MB max) per active upload or download. To improve our streaming
story, we need to push it back one step further.
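For concreteness, here's a hypothetical sketch of that per-segment loop and
where the backpressure lives (the names are invented; the real
Encoder/Uploader code is considerably more elaborate):

from twisted.internet import defer

@defer.inlineCallbacks
def upload_segments(reader, encryptor, encoder, servers, segment_size):
    # reader/encryptor/encoder/servers are stand-ins, and reader.read()
    # and server.put_block() are assumed to return Deferreds
    segnum = 0
    while True:
        # don't read more data than we can encrypt...
        plaintext = yield reader.read(segment_size)
        if not plaintext:
            break
        # ...don't encrypt more than we can encode...
        crypttext = encryptor.encrypt(plaintext)
        # ...and FEC turns one segment of crypttext into one block per share
        blocks = encoder.encode(crypttext)
        # send this segment's blocks and wait for every server to ack
        # before touching the next segment: this one yield is the
        # backpressure that keeps buffering bounded to a few segments
        yield defer.gatherResults(
            [server.put_block(segnum, block)
             for (server, block) in zip(servers, blocks)])
        segnum += 1

The step this message is really about is the first one: today that reader
is the tempfile where the PUT request body was stashed, not the original
source.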
Using a randomly-generated key (or changes to the way we encode files, which
we can discuss on NewImmutableEncodingDesign) would get us down to one IO
pass. Then "don't read more data than we can encrypt" would mean reading
from the user (i.e. from the HTTP PUT pipe), rather than from the tempfile
where the PUT stashed the request body so we could do the CHK pass before
starting the encoding and upload. This would finally apply backpressure to
the webapi client, which would probably then avoid reading (from local disk)
until needed.

This would complete the job: the plaintext sits nice and quiet on the
original uploader's disk until the last possible moment, the intermediate
layers buffer just enough to keep the pipeline full and avoid delays due to
non-zero round-trip times, everything happens "Just In Time", and storage
requirements (including the tempfile) are dropped to a minimum.
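In code terms, the difference between today's two-pass CHK flow and the
proposed one-pass flow is roughly this (a hedged sketch with invented names;
pipeline.run() stands in for the whole encrypt/encode/upload chain, and the
key derivation shown is only a stand-in for the real CHK computation):

import os
from hashlib import sha256

def upload_with_chk(tempfile_path, pipeline):
    # today: one full pass over the stashed request body just to derive
    # the convergent (CHK) key...
    h = sha256()
    with open(tempfile_path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    key = h.digest()[:16]  # stand-in for the real key derivation
    # ...then a second pass to encrypt/encode/upload it
    with open(tempfile_path, "rb") as f:
        pipeline.run(f, key)

def upload_with_random_key(request_body_stream, pipeline):
    # proposed: no preliminary pass, so the HTTP PUT body can be consumed
    # directly, and backpressure reaches all the way back to the client
    key = os.urandom(16)
    pipeline.run(request_body_stream, key)

The second version never needs the tempfile, which is exactly where the
backpressure on the webapi client comes from.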
cheers,
 -Brian