On 09 Dec 2013, at 20:00, Alex Rousskov <rouss...@measurement-factory.com> wrote:
> Hello,
>
> The promised Tokenizer API proposal is attached. Compared to earlier
> proposals, this interface is simplified by focusing on finding tokens
> (requires knowledge of allowed and prohibited character sets), not
> parsing (requires knowledge of input syntax) and by acknowledging that
> real parsing rules are often too complex to be [efficiently] supported
> with a single set of delimiters. The parser (rather than a tokenizer) is
> a better place to deal with those complexities.
>
> The API supports checkpoints and backtracking by ... copying Tokenizers.
>
> I believe the interface allows for an efficient implementation,
> especially if the CharacterSet type is eventually redefined as a boolean
> array, providing us a constant time lookup complexity.

Hi,
  SBuf supplies a few find() variants that could help. They are not constant time, but they rely on lower-level primitives and the related optimizations. My suggestion is to have CharacterSet be an SBuf and rely on them, at least for now. In any case, having it be an SBuf promotes better interface decoupling and abstraction.

> Here is a sketch on how a Tokenizer "tk" might be used to build a
> primitive HTTP Request-Line parser (a part of the incremental HTTP
> header parser):

SBuf was not really designed to be passed by non-const reference. But this sketch is very compelling, so it's worth trying it to see how it works out.

>> // Looking at something like GET /index.html HTTP/1.0\r\n
>>
>> SBuf method, uri, proto, vMajor, vMinor;
>> if (tk.prefix(method, Http::MethodChars) &&
>>     tk.token(uri, Http::HeaderWhitespace) &&
>>     tk.prefix(proto, Http::ProtoChars) &&
>>     tk.skip('/') &&
>>     tk.prefix(vMajor, DecimalDigits) &&
>>     tk.skip('.') &&
>>     tk.prefix(vMinor, DecimalDigits) &&
>>     (tk.skip(Http::Crs) || true) && // optional CRs
>>     tk.skip('\n')) {
>>     ... validate after successfully parsing the request line
>> } else ...
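
To make the quoted Request-Line sketch concrete, here is a minimal, self-contained approximation. Everything in it is an assumption rather than the attached proposal: std::string_view stands in for SBuf, CharacterSet is a hypothetical 256-entry bit table, the semantics of prefix(), token() and skip() are guessed from the sketch, and the character sets are simplified placeholders for Http::MethodChars and friends.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical CharacterSet: a 256-entry bit table gives constant-time lookup.
class CharacterSet {
public:
    CharacterSet() = default;
    explicit CharacterSet(std::string_view chars) {
        for (unsigned char c : chars) member_.set(c);
    }
    // Adds every character in the inclusive range [first, last].
    CharacterSet &addRange(unsigned char first, unsigned char last) {
        assert(first <= last);
        for (unsigned c = first; c <= last; ++c) member_.set(c);
        return *this;
    }
    bool contains(char c) const { return member_.test(static_cast<unsigned char>(c)); }
private:
    std::bitset<256> member_;
};

// Assumed semantics, guessed from the quoted sketch (not the attached proposal).
class Tokenizer {
public:
    explicit Tokenizer(std::string_view input) : buf_(input) {}

    // Extracts the longest non-empty run of `allowed` characters at the front.
    bool prefix(std::string_view &out, const CharacterSet &allowed) {
        const std::size_t n = run(buf_, allowed, true);
        if (n == 0) return false;
        out = buf_.substr(0, n);
        buf_.remove_prefix(n);
        return true;
    }
    // strtok-like: skips leading delimiters, extracts up to the next
    // delimiter, then skips trailing delimiters. Input is untouched on failure.
    bool token(std::string_view &out, const CharacterSet &delims) {
        std::string_view rest = buf_;
        rest.remove_prefix(run(rest, delims, true));
        const std::size_t n = run(rest, delims, false);
        if (n == 0) return false;
        out = rest.substr(0, n);
        rest.remove_prefix(n);
        rest.remove_prefix(run(rest, delims, true));
        buf_ = rest;
        return true;
    }
    // Skips one or more characters from `chars`.
    bool skip(const CharacterSet &chars) {
        const std::size_t n = run(buf_, chars, true);
        buf_.remove_prefix(n);
        return n > 0;
    }
    // Skips exactly one occurrence of `c`.
    bool skip(char c) {
        if (buf_.empty() || buf_.front() != c) return false;
        buf_.remove_prefix(1);
        return true;
    }
    std::string_view remaining() const { return buf_; }
private:
    // Length of the leading run whose membership in `set` equals `member`.
    static std::size_t run(std::string_view s, const CharacterSet &set, bool member) {
        std::size_t n = 0;
        while (n < s.size() && set.contains(s[n]) == member) ++n;
        return n;
    }
    std::string_view buf_;
};

struct RequestLine { std::string_view method, uri, proto, vMajor, vMinor; };

bool parseRequestLine(Tokenizer &tk, RequestLine &r) {
    // Simplified placeholder sets, not the real Http:: definitions.
    static const CharacterSet upper = CharacterSet().addRange('A', 'Z');
    static const CharacterSet digits = CharacterSet().addRange('0', '9');
    static const CharacterSet whitespace(" \t");
    static const CharacterSet crs("\r");

    const Tokenizer checkpoint = tk; // a checkpoint is just a copy
    if (tk.prefix(r.method, upper) &&
        tk.token(r.uri, whitespace) &&
        tk.prefix(r.proto, upper) &&
        tk.skip('/') &&
        tk.prefix(r.vMajor, digits) &&
        tk.skip('.') &&
        tk.prefix(r.vMinor, digits) &&
        (tk.skip(crs) || true) && // optional CRs
        tk.skip('\n'))
        return true;
    tk = checkpoint; // backtrack by restoring the copy
    return false;
}
```

Calling parseRequestLine() on a Tokenizer built over "GET /index.html HTTP/1.0\r\nHost: x\r\n" should succeed and leave remaining() at the Host line; on failure the tokenizer is restored from the checkpoint copy, which is the backtracking-by-copy idea from the proposal as I read it.
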
>
> And this sketch illustrates the part of squid.conf parser dealing with
> quoted strings:
>
>> if (tk.skip('\\')) ...
>> else if (tk.skip('"')) ...
>> else if (tk.token(word, SquidConfWhitespace)) ...

About the interface itself:

  const SBuf &remaining() const

I'd change the signature to

  SBuf remaining() const

Copying an SBuf is easy, and returning one by value places fewer requirements on the caller and is less constraining.

I'd also add to the interface a few constants describing common character sets such as ALPHA, ALNUM, LOWERALPHA, UPPERALPHA etc. (I'd use the predefined character classes from grep(1) as a reference for common patterns).

  Kinkie
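
For illustration only, the character-set constants suggested above could look roughly like the sketch below. The CharacterSet class, its range constructor, operator+ and the CharClass namespace are all hypothetical names, not part of the attached proposal; DIGIT is added here only so that ALNUM can be composed. The boolean table is the constant-time lookup idea from the quoted text.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical CharacterSet backed by a 256-entry boolean table.
class CharacterSet {
public:
    CharacterSet() = default;
    // Builds the set of all characters in the inclusive range [first, last].
    CharacterSet(unsigned char first, unsigned char last) {
        assert(first <= last);
        for (unsigned c = first; c <= last; ++c) table_[c] = true;
    }
    // Set union, so larger classes can be composed from smaller ones.
    CharacterSet operator+(const CharacterSet &other) const {
        CharacterSet result = *this;
        for (std::size_t i = 0; i < table_.size(); ++i)
            result.table_[i] = table_[i] || other.table_[i];
        return result;
    }
    // Constant-time membership test.
    bool operator[](char c) const { return table_[static_cast<unsigned char>(c)]; }
private:
    std::array<bool, 256> table_{}; // value-initialized: all false
};

// Named after the grep(1) classes [:upper:], [:lower:], [:digit:],
// [:alpha:] and [:alnum:], restricted to ASCII in this sketch.
namespace CharClass {
const CharacterSet UPPERALPHA('A', 'Z');
const CharacterSet LOWERALPHA('a', 'z');
const CharacterSet DIGIT('0', '9');
const CharacterSet ALPHA = UPPERALPHA + LOWERALPHA;
const CharacterSet ALNUM = ALPHA + DIGIT;
}
```

A caller would then write something like CharClass::ALNUM['x'] for a membership test, or pass CharClass::ALPHA wherever a tokenizer method expects a character set.
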