On 22/11/2013 4:16 a.m., Alex Rousskov wrote:
> On 11/21/2013 02:27 AM, Amos Jeffries wrote:
>> While writing parser updates I have encountered the small problem that
>> SBuf always *copies* data from non-SBuf sources. There is absolutely no
>> way provided to use a pre-allocated I/O buffer as the backing store for
>> SBuf objects. This includes pointing SBuf at an already allocated global
>> char*. It always copies.
>
> I recommend keeping it that way. While adding support for alternative
> backing blobs is doable, we should really focus on reducing the number
> of non-SBuf sources instead, at least for now. Our chances of properly
> optimizing something (other than by pure luck) decrease exponentially as
> the complexity increases, and we are just starting to use the new string
> code with "simple" single-type backing.
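For illustration, the copy-vs-borrow distinction being discussed can be sketched with simplified stand-ins. `CopyBuf` and `BufView` here are hypothetical types, not the real SBuf API; they only show the difference between copying into a private backing store and pointing at a pre-allocated I/O buffer:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical: a buffer type that always copies on construction,
// the behaviour SBuf currently has for non-SBuf sources.
class CopyBuf {
public:
    explicit CopyBuf(const char *src) : store_(src) {} // copies the bytes
    const char *data() const { return store_.data(); }
private:
    std::string store_; // private backing store, independent of src
};

// Hypothetical: a non-owning view that borrows the caller's buffer
// without copying -- the facility the original mail says is missing.
struct BufView {
    const char *ptr; // borrowed pointer into someone else's buffer
    size_t len;
};

inline BufView viewOf(const char *src) {
    return BufView{src, std::strlen(src)};
}
```

The trade-off the thread debates: a view is cheap but ties the object's lifetime to the I/O buffer, while a copy is safe when that buffer is cycled by the I/O system.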
I am trying to start that by working straight from the I/O buffers into
SBuf. If we are happy to wear the data copy until the buffer itself is
made SBuf-friendly, I will just keep going on the parser; otherwise I
will prioritize my client-side cleanup patch which upgrades the I/O
buffer (see the connection-manager launchpad branch).

>> The way it is done makes sense in parsing, where the input buffer is
>> constantly being cycled/shifted by the I/O system and possibly has 500KB
>> of area with a small sub-string needing to be pointed at by an SBuf for
>> long periods.
>> However, it also prevents us from doing two things:
>> 1) having a global array of char* header names and field values, which
>> the parser points an SBuf at before emitting (avoiding a lock on the I/O
>> buffer memory).
>
> I do not see a global array of char* header names as valuable. We should
> have a global set of SBuf header names instead, to optimize search and
> comparison.

Good point.

>> Primarily because parsing happens in small pieces and the end of a block
>> of input may not even be present when we have to scan the start of it.
>
> The "sliding window parser" is a different problem, actually. We have
> tried to discuss it several times already, without strong consensus.
> IMO, we need a good tokenizer to solve this problem for "small" content
> (like request headers) AND a buffer list (with an even better tokenizer)
> to solve this problem for "large" content (like chunked encoding). The
> tokenizer and list APIs proposed earlier had too many problems IMO, but
> that is a relatively minor detail we can fix.
>
> I hope to be able to propose a tokenizer soon.

Great. Future thinking in me is working along the lines that MemBuf
becomes backed by MemBlob store, and Tokeniser can take either MemBuf or
SBuf to spawn SBuf objects sharing the same MemBlob.
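To make the "sliding window" problem concrete, here is a rough sketch of the kind of incremental tokenizer being discussed: it extracts complete tokens from whatever part of the input is currently buffered, and signals "need more data" when the end of a token has not arrived yet. All names are hypothetical; the real proposal may look quite different:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical incremental line tokenizer over a sliding input window.
// token() extracts the next LF/CRLF-delimited line, or returns false
// when the window does not yet hold a complete token, so the caller
// knows to wait for more I/O before parsing further.
class LineTokenizer {
public:
    explicit LineTokenizer(const std::string &window) : buf_(window), pos_(0) {}

    // Extract the next complete line into `out`; false means "need more data".
    bool token(std::string &out) {
        const size_t eol = buf_.find('\n', pos_);
        if (eol == std::string::npos)
            return false; // end of this token is not in the window yet
        size_t end = eol;
        if (end > pos_ && buf_[end - 1] == '\r')
            --end; // tolerate CRLF as well as bare LF
        out = buf_.substr(pos_, end - pos_);
        pos_ = eol + 1; // advance past the consumed token
        return true;
    }

private:
    const std::string &buf_;
    size_t pos_; // parse offset within the window
};
```

The key property is that a partially received header never produces a token: the parser simply stops and resumes when the I/O layer appends more bytes.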
> BTW, is HTTP/2 parsing based primarily on offsets ("header #5 is at
> offset 100") rather than string patterns ("the new header starts after
> CRLF sequence")?

HTTP/2 has binary Frame blocks with type codes, size and various fields,
much like a TCP/IP packet header. Inside the HEADERS frame type the
payload consists of the HTTP headers in compressed format.

* HTTP/1 request-line fields are split into headers along the lines of
  Host: (e.g. the method becomes a :method header; the URI becomes
  :scheme, :host, :path headers) in a HEADERS frame.
* The HTTP/1 response status also becomes a :status header in a HEADERS
  frame. 1xx status codes are obsoleted.
* HTTP/1 MIME headers become generic headers listed after those
  "special" ones in the HEADERS frames.
* HTTP/1 entities/payload become DATA frames.
* Request and response are paired with a shared "stream ID".

I have not yet looked closely at the header compression draft. From what
I can see of the comments they are still concentrating on the SPDY
mindset of taking in plain-text HTTP/1 style headers as-is and just
compressing the bytes on a per-line basis. Possibly with a line length
prefix in binary - meaning we would still need to walk headers in
sequence like ASN.1, but with the step size given, so we don't have to
search for delimiters between lines (only ',' or '\0' delimiters within
lines, if the talk this week goes ahead).

As relates to SBuf/strings:

* There is expected to be a per-TCP-connection state
  array/stack/map/fifo/lifo/whatever structure, with an entry ID numbered
  0-N assigned to each header in decompressed form, which persists for
  the lifetime of the TCP connection with constant churn.
* There is expected to be a static global binary->text mapping between
  RFC-registered header names/values, method names, etc. We have this
  already in char* / enum arrays; it will just mean renumbering those
  entries at some point.

Amos
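As a footnote on the "binary Frame blocks" point: the frame header layout was still in flux in the drafts contemporary with this discussion, but for illustration here is a decoding sketch using the 9-octet layout HTTP/2 eventually settled on (24-bit payload length, 8-bit type, 8-bit flags, 1 reserved bit, 31-bit stream ID, per RFC 7540). Treat the field widths as an assumption of this sketch, not of the drafts discussed above:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative decode of an HTTP/2 frame header (RFC 7540 layout:
// 3-octet length, 1-octet type, 1-octet flags, 4 octets of
// reserved bit + 31-bit stream ID). This is what makes HTTP/2
// parsing offset-driven rather than delimiter-searching.
struct FrameHeader {
    uint32_t length;   // payload size in octets (24 bits on the wire)
    uint8_t type;      // frame type code, e.g. DATA, HEADERS
    uint8_t flags;     // per-type flag bits
    uint32_t streamId; // 31 bits; pairs a request with its response
};

inline FrameHeader decodeFrameHeader(const uint8_t (&b)[9]) {
    FrameHeader h;
    h.length = (uint32_t(b[0]) << 16) | (uint32_t(b[1]) << 8) | b[2];
    h.type = b[3];
    h.flags = b[4];
    // the high bit of b[5] is reserved and ignored on receipt
    h.streamId = (uint32_t(b[5] & 0x7f) << 24) | (uint32_t(b[6]) << 16) |
                 (uint32_t(b[7]) << 8) | b[8];
    return h;
}
```

Once the header is decoded, the parser knows exactly how many payload octets to consume before the next frame begins - no CRLF scanning at the framing layer.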