On 06/02/2014 10:02 PM, Amos Jeffries wrote:
> On 3/06/2014 6:07 a.m., Alex Rousskov wrote:
>> On 06/02/2014 11:27 AM, Kinkie wrote:
>>>>> + * At least one terminating delimiter is required. \0 may be passed
>>>>> + * as a delimiter to treat end of buffer content as the end of token.
>>>>
>>>>> +    if (tokenLen == SBuf::npos && !delimiters['\0']) {
>>>> The above is from the actually committed code (not the code posted
>>>> for review). Please remove \0 as a special case. As far as I can
>>>> tell, the reason it was added is already covered well enough by the
>>>> existing API. The addition of this special case is likely to just
>>>> confuse and break things. For example, adding \0 to delimiters for
>>>> the purpose of disabling this special case will also tell Tokenizer
>>>> to skip \0 characters in input, which is often wrong.

> But it is not wrong that often. On direct network input such as we are
> dealing with it is more often a "right" delimiter. Consider
> \0-terminated string values inside NTLM, Kerberos/GSSAPI, ICP, HTCP
> packets, our own SSL helper 'eom' line endings, HTTP/2 using \0
> delimiter for Cookies, etc.

Grammars that use 0-terminated strings will add \0 to delimiter sets as
needed. Grammars that do not use 0-terminated strings will not add \0 to
delimiter sets. In either case, assigning a secondary meaning to \0
breaks parsing and spoils the API. There is no need for these problems,
as Kinkie's alternative API demonstrates. The only argument perhaps
worth having is whether that more complex API is needed at all.

> The use case for token() is finding a { D* T+ D+ } sequence.

Sure, the above is the definition of the new token() API. When it comes
to actual use cases, I suspect any correct parsing code calling token()
will not need a special Tokenizer API to exclude "{T+} followed by
{T+}" cases. In other words, all the necessary checks can and should be
done without that special API. If you have a use case where the new API
is required, please share it.


Cheers,

Alex.
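P.S. To make the point concrete, here is a minimal sketch of the
{ D* T+ D+ } semantics with no \0 special case, written against
std::string rather than the real SBuf/Tokenizer classes (the token()
function below is illustrative only, not Squid code). Note how a caller
parsing \0-terminated fields simply puts '\0' into the delimiter set
like any other character, so token() itself needs no special branch:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Illustrative stand-in for Tokenizer::token(). Consumes { D* T+ D+ }:
// optional leading delimiters, the token, and at least one terminating
// delimiter. End-of-buffer is NOT implicitly treated as a delimiter;
// grammars that use \0-terminated strings add '\0' to `delims`.
static bool token(std::string &buf, const std::string &delims, std::string &out)
{
    const std::size_t start = buf.find_first_not_of(delims); // skip D*
    if (start == std::string::npos)
        return false; // nothing but delimiters; no token
    const std::size_t end = buf.find_first_of(delims, start); // require D+
    if (end == std::string::npos)
        return false; // no terminating delimiter: incomplete token
    out = buf.substr(start, end - start); // the T+ part
    buf.erase(0, end + 1); // consume token plus one delimiter
    return true;
}
```

Usage then looks the same for both kinds of grammars; only the delimiter
set differs:

```cpp
std::string t;
std::string line(" abc def");
token(line, " ", t);                 // t == "abc"
token(line, " ", t);                 // false: "def" has no terminator

std::string ntlmField("user\0host", 9);
token(ntlmField, std::string("\0", 1), t); // t == "user"
```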