On 3/06/2014 6:07 a.m., Alex Rousskov wrote:
> On 06/02/2014 11:27 AM, Kinkie wrote:
>>>> +    //fixme: account for buf_.size()
>>>> +    bool neg = false;
>>>> +    const char *s = buf_.rawContent();
>>>> +    const char *end = buf_.rawContent() + buf_.length();
>>>
>>> Could you please explain what the above comment means? There is no
>>> SBuf::size() AFAICT.
>>
>> length() is an alias of size(). Changed to size() to avoid confusion
>
> The above code already accounts for .length() or .size(), right? What is
> there to fix? Is it a bug (XXX) or just a future improvement note (TODO)?
>
>
>>>> + * At least one terminating delimiter is required. \0 may be passed
>>>> + * as a delimiter to treat end of buffer content as the end of token.
>>>
>>>> +    if (tokenLen == SBuf::npos && !delimiters['\0']) {
>>>
>>> The above is from the actually committed code (not the code posted for
>>> review). Please remove \0 as a special case. As far as I can tell, the
>>> reason it was added is already covered well enough by the existing API.
>>> The addition of this special case is likely to just confuse and break
>>> things. For example, adding \0 to delimiters for the purpose of
>>> disabling this special case will also tell Tokenizer to skip \0
>>> characters in input, which is often wrong.
But it is not wrong that often. On direct network input such as we are
dealing with, it is more often a "right" delimiter. Consider
\0-terminated string values inside NTLM, Kerberos/GSSAPI, ICP, and HTCP
packets, our own SSL helper 'eom' line endings, HTTP/2 using the \0
delimiter for Cookies, etc.

For HTTP/1.1 and ICAP it's a rarity, but I am finding that for those
parsers it is simpler to use a deterministic sequence of
skip/prefix/remainder/startsWith methods than to use token() with both
pre- and post-validation of the bytes, regardless of this special-case
handling.

That said, we have not yet got into MIME header parsing, where token()
was expected to be used most, with ";, " delimiters and end-of-mime-SBuf
possible at any point.

>>
>> I'll let Amos argue for this as he's the one who has the clearest
>> picture of the use case.
>> Given your objection, my 2c suggestion to address the need is to
>> either have two token() methods, one which requires a separator at the
>> end of a token and one which doesn't, or to have a bool flag passed to
>> the method.
>
> I hope neither is needed, but I will wait for the "use case" discussion.

The use case for token() is finding a { D* T+ D+ } sequence.

NOTE: The characters for T+ are !D and default to include 0x00-0xFF
anyway, so regardless of this change token() will produce output
containing embedded \0 unless \0 is explicitly a delimiter.

I made this adjustment to treat SBuf as a string with implicit \0
termination, in line with how we are treating the strcmp() and
SBuf::compare() equivalence.

This matters when we reach the end of SBuf via the prefix(T) call. All
cases of reaching npos before the final skip(D) are failures to find the
final D+ sequence (and/or T+ missing) inside the SBuf (producing false).
BUT with the \0 delimiter explicitly stated and implicit \0 termination,
we should consider npos == \0 and thus produce true (i.e. we
found/reached the implicit \0).
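To make the npos == \0 reasoning concrete, here is a rough standalone sketch of the { D* T+ D+ } matching with the implicit-\0 special case. This is NOT the actual Squid Tokenizer code: std::string stands in for SBuf, the delimiter set is modelled as a string of characters, and the helper name is made up for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical model of the token() semantics under discussion:
// match a { D* T+ D+ } sequence at the start of buf, where D is the
// delimiter set and T is its complement. When '\0' is in the delimiter
// set, reaching the end of the buffer (npos) counts as the implicit
// final delimiter, so the token is still accepted.
static bool token(const std::string &buf, const std::string &delims,
                  std::string &out)
{
    const bool nulIsDelim = delims.find('\0') != std::string::npos;
    std::size_t i = 0;

    // skip leading delimiters: D*
    while (i < buf.size() && delims.find(buf[i]) != std::string::npos)
        ++i;

    // collect token characters: T+
    const std::size_t tokenStart = i;
    while (i < buf.size() && delims.find(buf[i]) == std::string::npos)
        ++i;
    if (i == tokenStart)
        return false; // T+ missing

    // require the terminating D+ sequence; running off the end of the
    // buffer is a failure unless the implicit-\0 special case applies
    if (i == buf.size() && !nulIsDelim)
        return false; // reached npos before the final skip(D)

    out = buf.substr(tokenStart, i - tokenStart);
    return true;
}
```

With this model, token("abc", " ") fails (no terminating delimiter), but adding \0 to the delimiter set makes the same call succeed, because end-of-buffer is treated as the implicit \0 terminator.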
So we need to check for the \0 delimiter case and allow that past the
"return false" action.

Amos