On Wed, Jan 31, 2007, Alex Rousskov wrote:

> I agree that read-parse-dump-write-read-parse-dump-write sequence is
> inefficient for message headers and other metadata. IIRC, we talked
> about optimizing this at least since 1998. Stored binary metadata has
> its drawbacks, but overall it is probably a win for a
> performance-sensitive proxy.
:) There's one less parse pass in Squid-2 HEAD now - the client side will wait for the first storeClientCopy() to complete and then just snaffle the headers from the store client -> mem object -> reply. If the headers aren't there then they won't ever be. It's still a straight copy of all the reply information, but it's better than the re-parse path, and the semantics lend themselves to ref-counting stuff later.

The next optimisation will be "pass a pre-built reply struct in before the first storeAppend", which will clone the reply data and stuff it into the store. This step, however, presupposes the reply status/header data isn't stored away in the data stream, so I'll have to do quite a bit of work at the same time to make this eventuate.

> For many other problems and changes you are talking about, I would try
> reworking MemBuf or providing a similar object that would allow
> higher-level code to "copy" and "concatenate" chunks of memory without
> the actual copy taking place. Visualize a MemBuf with an offset and
> subsize fields. Now add support for a chain of such buffers that looks
> like a single buffer for higher-level code.

Henrik and I already have this. I wrote a replacement HTTP request and header parser which works on refcounted buffers. The buffers aren't fixed size (so I call realloc where required) but are refcounted. The strings are an 'extent' on top of a refcounted buffer. The refcounted buffer pretty much looks like a non-refcounted MemBuf. What's missing is the concept of a buffer chain, so concatenation is cheap. There are a few libraries which implement this stuff (eg vstr) which I'm using for inspiration. We don't need anything -that- complicated to get much better performance. The store will just contain a chain of 0 or more extents with whatever backing buffers are required.
This way you can pull nifty tricks such as reading chunked-TE'ed data from the server connection and storing the non-chunked extents in the store without having to do the copy tricks Henrik does. It remains to be seen how optimal that is - modern OSes -really like- page-aligned buffers for things, so I'll benchmark how things perform once it's written to see whether there's a substantial difference. (My gut feeling is "yes, there will be", but to be honest, doing the above will probably give us a huge performance and code cleanliness boost over squid-2 and squid-3 that'll be worth it as an intermediary step.)

> As an added bonus, it may be possible to avoid a lot of the copying when
> parsing headers because string-based headers would be able to refer to
> portions of the original I/O buffer.

Already done. :) And yes, it's pretty damned fast.

> Proper reference counting of true/allocated buffers would be required to
> keep overall memory consumption comparable to current Squid, of course.
>
> Finally, I am not sure I agree regarding storage decision making time.
> An optimized storage system (the interesting case) would probably
> buffer/merge small chunks and would probably not store object chunks
> sequentially on disk, so the issue of the total object size becomes
> unimportant.

It's only important when deciding how to write it all to disk. If you delay the layout decision you can do interesting packing tricks. You want to pack with some leaning towards temporal locality, for example, so your reads can pull back more than one object at a time. You can interleave small and large objects on disk so your disk access algorithm has a chance of serving both at a reasonable clip rather than suffering from starvation issues. (The last hasn't been tested, but I got a feeling it'd be an issue from my benchmarking.)
I did some unofficial benchmarking when fiddling with COSS about a year ago and found that the upper limit for random IO was transactions-per-second based, long before throughput became a limiting factor. I was seeing 200-300 tps on these test SATA NCQ disks (and probably more with SCSI), up to a couple of hundred kilobytes per transaction. (At, say, 250 tps that means 8 KB objects cap the disk at around 2 MB/s, while 200 KB transactions pull roughly 50 MB/s from the same spindle.) This isn't new (there are plenty of papers which reference this, which I hope to find again and put on the new squid website) but it shows there's huge room for improvement wrt small object sizes.

> Needless to say, I believe this work should be done in Squid3.1 code
> base or later :-/.

People are running squid-2 and want to keep running it for now. I think the best thing for my work is to get it into squid-2 so people stay interested in Squid and so I have a stable platform to do my development with. Once it's done and tested we can sit back, look at the pluses and minuses, then extract it all out and shoehorn it into squid-3. This requires Squid-3 to be stable by then. :) If it's stable and ready for production then I'm really all for it.

I really do want to take advantage of C++ constructs here (where appropriate!) to enforce data type semantics. Heck, refcounting buffers would be a cinch in C++. Unfortunately, Squid-3 is worse than Squid-2 on the "does a hell of a lot of a hell of a little" problem - try profiling Squid-2 or Squid-3 and see if you can find a single area or two that would give a big performance boost. I've fixed most of them.. :/

Adrian
