Re: [RFC] Tokenizer API

Amos Jeffries Mon, 09 Dec 2013 22:14:15 -0800

On 10/12/2013 6:51 p.m., Francesco Chemolli wrote:
>> Hi,
>>   SBuf supplies a few find() variants which could help which are not 
>> constant time but rely on lower-level primitives and related optimizations. 
>> My suggestion is to have CharacterSet be a SBuf and rely on them, at least 
>> for now. In any case having them be a SBuf promotes better interface 
>> decoupling and abstraction.
> 
> Oh, one more argument for having the low-level matching primitives in SBuf: 
> it's a pet peeve of mine to use some form of compact tries and/or FSM to do 
> single-pass low-level string matching in SBuf, possibly by lifting code from 
> GNU grep (it's very efficient but complex). Redoing find_first_of() and 
> startsWith() here would duplicate code and undermine that possibility and 
> qualifies as premature optimisation IMO :)
> 
>       Kinkie
>


The problem with comparing input strings to an SBuf of characters is that
parsing an input of length N against a charset of size M takes O(N*M) time.

Making the charset a boolean array like Alex mentioned cuts that down to
O(N) parsing time.
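
A minimal sketch of the boolean-array idea, assuming 8-bit characters. The
class name CharacterSet comes from this thread, but the members and the
prefixLength() helper are hypothetical illustrations, not a proposed API:

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for the CharacterSet discussed in this thread.
// One flag per possible byte value, so a membership test is a single
// array lookup regardless of how many characters are in the set.
class CharacterSet {
public:
    explicit CharacterSet(const char *chars) {
        for (; *chars; ++chars)
            member_[static_cast<unsigned char>(*chars)] = true;
    }

    bool contains(char c) const {
        return member_[static_cast<unsigned char>(c)];
    }

private:
    std::array<bool, 256> member_{}; // value-initialised to all false
};

// Length of the leading run of buf whose bytes are all in the set.
// One lookup per input byte: O(N) for the scan, independent of M.
std::size_t prefixLength(const char *buf, std::size_t len,
                         const CharacterSet &set) {
    std::size_t i = 0;
    while (i < len && set.contains(buf[i]))
        ++i;
    return i;
}
```

Building the table costs O(M) once; after that every scan is O(N), whereas
searching a character list for each input byte costs up to M comparisons
per byte.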

This Tokenizer is also for the cases where the input is a MemBuf or
similar non-SBuf array. The data copy to get it into SBuf is the output
of the Tokenizer. If we copy the entire buffer into an SBuf first and then
parse, we face either growing the SBuf more than otherwise necessary if the
input was not all received, or cropping it down after a useless data copy.
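
A rough sketch of that flow, under stated assumptions: RawTokenizer and its
methods are hypothetical names, std::string stands in for SBuf, and the
character set is passed as a plain 256-entry boolean table. The only data
copy happens when a token is handed out:

```cpp
#include <cstddef>
#include <string>

// Hypothetical tokenizer over a raw, non-owned byte array (for example
// the contents of a MemBuf). It is not the API proposed in the RFC.
class RawTokenizer {
public:
    RawTokenizer(const char *buf, std::size_t len) : buf_(buf), len_(len) {}

    // Consume the leading run of bytes for which inSet[byte] is true and
    // copy only that run into token. Returns false, leaving token and the
    // parse position untouched, if the run is empty.
    bool prefix(std::string &token, const bool (&inSet)[256]) {
        std::size_t n = 0;
        while (pos_ + n < len_ &&
               inSet[static_cast<unsigned char>(buf_[pos_ + n])])
            ++n;
        if (n == 0)
            return false;
        token.assign(buf_ + pos_, n); // the only copy: token bytes only
        pos_ += n;
        return true;
    }

    // Bytes not yet consumed.
    std::size_t remaining() const { return len_ - pos_; }

private:
    const char *buf_;
    std::size_t len_;
    std::size_t pos_ = 0;
};
```

The unparsed remainder stays in the caller's buffer, so nothing has to be
grown or cropped when more input arrives later.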

Amos
