On 3/06/2014 6:07 a.m., Alex Rousskov wrote:
> On 06/02/2014 11:27 AM, Kinkie wrote:
>>>> +    //fixme: account for buf_.size()
>>>> +    bool neg = false;
>>>> +    const char *s = buf_.rawContent();
>>>> +    const char *end = buf_.rawContent() + buf_.length();
>>>
>>> Could you please explain what the above comment means? There is no
>>> SBuf::size() AFAICT.
>>
>> length() is an alias of size(). Changed to size() to avoid confusion
>
> The above code already accounts for .length() or .size(), right? What is
> there to fix? Is it a bug (XXX) or just a future improvement note (TODO)?
>
>
>>>> + * At least one terminating delimiter is required. \0 may be passed
>>>> + * as a delimiter to treat end of buffer content as the end of token.
>>>
>>>> +    if (tokenLen == SBuf::npos && !delimiters['\0']) {
>>>
>>> The above is from the actually committed code (not the code posted for
>>> review). Please remove \0 as a special case. As far as I can tell, the
>>> reason it was added is already covered well enough by the existing API.
>>> The addition of this special case is likely to just confuse and break
>>> things. For example, adding \0 to delimiters for the purpose of
>>> disabling this special case will also tell Tokenizer to skip \0
>>> characters in input, which is often wrong.
But it is not wrong that often. On direct network input such as we are
dealing with, it is more often a "right" delimiter. Consider
\0-terminated string values inside NTLM, Kerberos/GSSAPI, ICP, and HTCP
packets, our own SSL helper 'eom' line endings, HTTP/2 using the \0
delimiter for Cookies, etc.

For HTTP/1.1 and ICAP it's a rarity, but I am finding that for those
parsers it is simpler to use a deterministic sequence of
skip/prefix/remainder/startsWith methods than to use token() with both
pre- and post-validation of the bytes, regardless of this special-case
handling.

That said, we have not yet got into MIME header parsing, where token()
was expected to be used most, with ";, " delimiters and end-of-mime-SBuf
possible at any point.

>>
>> I'll let Amos argue for this as he's the one who has the clearest
>> picture of the use case.
>> Given your objection, my 2c suggestion to address the need is to
>> either have two token() methods, one which requires a separator at the
>> end of a token and one which doesn't, or to have a bool flag passed to
>> the method.
>
> I hope neither is needed, but I will wait for the "use case" discussion.

The use case for token() is finding a { D* T+ D+ } sequence.

NOTE: The characters for T+ are !D and default to include 0x00-0xFF
anyway, so regardless of this change token() will produce output
containing embedded \0 unless \0 is explicitly a delimiter.

I made this adjustment to treat SBuf as a string with implicit \0
termination, in line with how we are treating the strcmp() and
SBuf::compare() equivalence.

This matters when we reach the end of SBuf via the prefix(T) call. All
cases of reaching npos before the final skip(D) are failures to find the
final D+ sequence (and/or T+ missing) inside the SBuf (producing false).
BUT with the \0 delimiter explicitly stated and implicit \0 termination,
we should consider npos == \0 and thus produce true (i.e. we
found/reached the implicit \0).
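To make the npos == \0 reasoning concrete, here is a rough standalone sketch of the { D* T+ D+ } matching with the implicit-\0 special case. This is NOT the actual Squid Tokenizer code: std::string stands in for SBuf, the delimiter set is modelled as a string of characters, and the helper name is made up for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical model of the token() semantics under discussion:
// match a { D* T+ D+ } sequence at the start of buf, where D is the
// delimiter set and T is its complement. When '\0' is in the delimiter
// set, reaching the end of the buffer (npos) counts as the implicit
// final delimiter, so the token is still accepted.
static bool token(const std::string &buf, const std::string &delims,
                  std::string &out)
{
    const bool nulIsDelim = delims.find('\0') != std::string::npos;
    std::size_t i = 0;

    // skip leading delimiters: D*
    while (i < buf.size() && delims.find(buf[i]) != std::string::npos)
        ++i;

    // collect token characters: T+
    const std::size_t tokenStart = i;
    while (i < buf.size() && delims.find(buf[i]) == std::string::npos)
        ++i;
    if (i == tokenStart)
        return false; // T+ missing

    // require the terminating D+ sequence; running off the end of the
    // buffer is a failure unless the implicit-\0 special case applies
    if (i == buf.size() && !nulIsDelim)
        return false; // reached npos before the final skip(D)

    out = buf.substr(tokenStart, i - tokenStart);
    return true;
}
```

With this model, token("abc", " ") fails (no terminating delimiter), but adding \0 to the delimiter set makes the same call succeed, because end-of-buffer is treated as the implicit \0 terminator.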
So we need to check for the \0 delimiter case and allow that past the
"return false" action.

Amos