Re: [RFC] Tokenizer API

Francesco Chemolli Mon, 09 Dec 2013 21:47:15 -0800

On 09 Dec 2013, at 20:00, Alex Rousskov <rouss...@measurement-factory.com> 
wrote:


> Hello,
> 
>    The promised Tokenizer API proposal is attached. Compared to earlier
> proposals, this interface is simplified by focusing on finding tokens
> (requires knowledge of allowed and prohibited character sets), not
> parsing (requires knowledge of input syntax) and by acknowledging that
> real parsing rules are often too complex to be [efficiently] supported
> with a single set of delimiters. The parser (rather than a tokenizer) is
> a better place to deal with those complexities.
> 
> The API supports checkpoints and backtracking by ... copying Tokenizers.
> 
> I believe the interface allows for an efficient implementation,
> especially if the CharacterSet type is eventually redefined as a boolean
> array, providing us a constant time lookup complexity.

Hi,
   SBuf supplies a few find() variants that could help here. They are not 
constant time, but they rely on lower-level primitives and the optimizations 
that come with them. My suggestion is to make CharacterSet an SBuf and rely on 
those methods, at least for now. In any case, making it an SBuf promotes better 
interface decoupling and abstraction.
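
To make the trade-off concrete, here is a minimal sketch of both options. It is 
an illustration only, not Squid code: the CharacterSet class below is a 
hypothetical boolean-array version, containsByScan() is a hypothetical helper, 
and std::string stands in for SBuf (std::string::find() plays the role of the 
SBuf find() variants).

```cpp
#include <array>
#include <cassert>
#include <string>

// Hypothetical boolean-array CharacterSet (assumes 8-bit chars).
class CharacterSet
{
public:
    // Builds the lookup table from a list of member characters.
    explicit CharacterSet(const std::string &members) {
        table_.fill(false);
        for (const char c : members)
            table_[static_cast<unsigned char>(c)] = true;
    }

    // Constant-time membership test: a single array index, no scan.
    bool contains(const char c) const {
        return table_[static_cast<unsigned char>(c)];
    }

private:
    std::array<bool, 256> table_;
};

// String-backed alternative: membership costs a linear scan of the set,
// but needs no extra type beyond the string class itself.
inline bool containsByScan(const std::string &members, const char c)
{
    return members.find(c) != std::string::npos;
}

// Both flavors agree on the answer; only the lookup cost differs.
inline void demoCharacterSet()
{
    const std::string digits = "0123456789";
    const CharacterSet digitSet(digits);
    assert(digitSet.contains('7') && containsByScan(digits, '7'));
    assert(!digitSet.contains('a') && !containsByScan(digits, 'a'));
}
```

For small sets such as digits or whitespace the scan is short, which is why 
starting with the string-backed flavor seems reasonable to me.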

> Here is a sketch on how a Tokenizer "tk" might be used to build a
> primitive HTTP Request-Line parser (a part of the incremental HTTP
> header parser):

SBuf was not really designed to be passed by non-const reference. But this 
sketch is very compelling, so it's worth trying it to see how it works out.

>> // Looking at something like GET /index.html HTTP/1.0\r\n
>> 
>> SBuf method, uri, proto, vMajor, vMinor;
>> if (tk.prefix(method, Http::MethodChars) &&
>>    tk.token(uri, Http::HeaderWhitespace) &&
>>    tk.prefix(proto, Http::ProtoChars) &&
>>    tk.skip('/') &&
>>    tk.prefix(vMajor, DecimalDigits) &&
>>    tk.skip('.') &&
>>    tk.prefix(vMinor, DecimalDigits) &&
>>    (tk.skip(Http::Crs) || true) && // optional CRs
>>    tk.skip('\n')) {
>>    ... validate after successfully parsing the request line
>> } else ...
> 
> 
> And this sketch illustrates the part of squid.conf parser dealing with
> quoted strings:
> 
>> if (tk.skip('\\')) ...
>> else if (tk.skip('"')) ...
>> else if (tk.token(word, SquidConfWhitespace)) ...
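
To check that the sketches above hang together, here is a toy version of the 
idea. Everything in it is an assumption made for illustration: std::string 
stands in for SBuf, a plain string of member characters stands in for 
CharacterSet, and the semantics of prefix(), token() and skip() are my reading 
of the proposal, not its definition. parseRequestLine() is a hypothetical 
helper showing checkpoint and backtracking by copying the Tokenizer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the proposed Tokenizer, not Squid code.
class Tokenizer
{
public:
    explicit Tokenizer(const std::string &input): buf_(input) {}

    // Input that has not been consumed yet.
    std::string remaining() const { return buf_; }

    // Extracts the longest non-empty leading run of characters from chars.
    bool prefix(std::string &returned, const std::string &chars) {
        const std::size_t len = runLength(chars);
        if (len == 0)
            return false;
        returned = buf_.substr(0, len);
        buf_.erase(0, len);
        return true;
    }

    // Assumed semantics: skip leading delimiters, extract everything up to
    // the next delimiter, then skip the delimiters that follow the token.
    bool token(std::string &returned, const std::string &delimiters) {
        const std::size_t start = buf_.find_first_not_of(delimiters);
        if (start == std::string::npos)
            return false; // empty input or nothing but delimiters
        std::size_t end = buf_.find_first_of(delimiters, start);
        if (end == std::string::npos)
            end = buf_.size();
        assert(end > start); // buf_[start] is not a delimiter
        returned = buf_.substr(start, end - start);
        const std::size_t next = buf_.find_first_not_of(delimiters, end);
        buf_.erase(0, next == std::string::npos ? buf_.size() : next);
        return true;
    }

    // Consumes exactly one character if it matches c.
    bool skip(const char c) {
        if (buf_.empty() || buf_[0] != c)
            return false;
        buf_.erase(0, 1);
        return true;
    }

    // Consumes one or more leading characters from chars.
    bool skip(const std::string &chars) {
        const std::size_t len = runLength(chars);
        buf_.erase(0, len);
        return len > 0;
    }

private:
    // Length of the leading run made only of characters from chars.
    std::size_t runLength(const std::string &chars) const {
        const std::size_t end = buf_.find_first_not_of(chars);
        return end == std::string::npos ? buf_.size() : end;
    }

    std::string buf_;
};

// Parses on a copy (the checkpoint) and commits only on success, so a
// failed or incomplete parse leaves tk and the output strings untouched.
inline bool parseRequestLine(Tokenizer &tk, std::string &method,
                             std::string &uri, std::string &version)
{
    const std::string upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    const std::string digits = "0123456789";
    Tokenizer attempt = tk; // checkpoint
    std::string m, u, proto, vMajor, vMinor;
    if (attempt.prefix(m, upper) &&
        attempt.token(u, " \t") &&
        attempt.prefix(proto, upper) &&
        attempt.skip('/') &&
        attempt.prefix(vMajor, digits) &&
        attempt.skip('.') &&
        attempt.prefix(vMinor, digits) &&
        (attempt.skip(std::string("\r")) || true) && // optional CRs
        attempt.skip('\n')) {
        method = m;
        uri = u;
        version = vMajor + "." + vMinor;
        tk = attempt; // commit
        return true;
    }
    return false; // backtrack: the caller may retry with more input
}
```

The character sets used here (upper-case letters for the method and protocol 
name, space and tab for whitespace) are simplifications chosen for the example, 
not what Http::MethodChars and friends would really contain.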

About the interface itself:

const SBuf &remaining() const

I'd change the signature to
SBuf remaining() const

Copying an SBuf is easy, and returning one by value puts a lower requirement on 
the caller and is less constraining.
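
A small sketch of the difference, under stated assumptions: Cursor is a 
hypothetical class invented for this example, std::string stands in for SBuf, 
and advance() stands in for whatever consumes input. Note that std::string 
copies are deep, so this shows only the semantics, not the cost of copying an 
SBuf.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical cursor over an input buffer, not Squid code.
class Cursor
{
public:
    explicit Cursor(const std::string &input): buf_(input) {}

    // Reference flavor: the caller sees a live view of internal state, and
    // the class is forced to keep the remainder stored as a member.
    const std::string &remainingRef() const { return buf_; }

    // Value flavor: the caller gets an independent snapshot, and the class
    // stays free to compute the remainder on demand.
    std::string remaining() const { return buf_; }

    // Consumes up to n characters.
    void advance(const std::size_t n) { buf_.erase(0, n); }

private:
    std::string buf_;
};

inline void demoRemaining()
{
    Cursor c("GET /");
    const std::string &view = c.remainingRef();
    const std::string snapshot = c.remaining();
    c.advance(4);
    assert(view == "/");         // the view followed the cursor
    assert(snapshot == "GET /"); // the snapshot did not
}
```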

I'd also add to the interface a few constants describing common character sets, 
such as ALPHA, ALNUM, LOWERALPHA, UPPERALPHA, etc. (I'd use the predefined 
character classes from grep(1) as a reference for common patterns).
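
Something along these lines, as a rough sketch. The CharSets namespace, the 
DIGIT constant and the isMember() helper are hypothetical names used for the 
example, and std::string again stands in for whatever type CharacterSet ends 
up being. The sets mirror the ASCII meaning of the POSIX classes [:lower:], 
[:upper:], [:digit:], [:alpha:] and [:alnum:].

```cpp
#include <cassert>
#include <string>

namespace CharSets { // hypothetical namespace, not part of the proposal

const std::string LOWERALPHA = "abcdefghijklmnopqrstuvwxyz"; // [:lower:]
const std::string UPPERALPHA = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"; // [:upper:]
const std::string DIGIT = "0123456789";                      // [:digit:]
const std::string ALPHA = LOWERALPHA + UPPERALPHA;           // [:alpha:]
const std::string ALNUM = ALPHA + DIGIT;                     // [:alnum:]

// Linear-scan membership test, good enough for sets this small.
inline bool isMember(const std::string &set, const char c)
{
    return set.find(c) != std::string::npos;
}

} // namespace CharSets

inline void demoCharSets()
{
    assert(CharSets::isMember(CharSets::ALNUM, 'q'));
    assert(CharSets::isMember(CharSets::ALNUM, '7'));
    assert(!CharSets::isMember(CharSets::ALPHA, '7'));
    assert(!CharSets::isMember(CharSets::ALNUM, '/'));
}
```

Building the larger sets from the smaller ones keeps the definitions in one 
place, so callers do not spell out the alphabet by hand.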


   Kinkie
