On 10/12/2013 6:51 p.m., Francesco Chemolli wrote: >> Hi, >> SBuf supplies a few find() variants which could help which are not >> constant time but rely on lower-level primitives and related optimizations. >> My suggestion is to have CharacterSet be a SBuf and rely on them, at least >> for now. In any case having them be a SBuf promotes better interface >> decoupling and abstraction. > > Oh, one more argument for having the low-level matching primitives in SBuf: > it's a pet peeve of mine to use some form of compact tries and/or FSM to do > single-pass low-level string matching in SBuf, possibly by lifting code from > GNU grep (it's very efficient but complex). Redoing find_first_of() and > startsWith() here would duplicate code and undermine that possibility and > qualifies as premature optimisation IMO :) > > Kinkie >
The problem with comparing input strings to a SBuf of characters is that parsing a input of length N againt charset of size M takes O(N*M) time. Making the charset a boolean array like Alex mentioned cuts that down to O(N) parsing time. This Tokeniser is also for the cases where the input is a MemBuf or similar non-SBuf array. The data copy to get it into SBuf is the output of the Tokenizer. If we copy the entire buffer into an SBuf first then pare we face either growing SBuf more than otherwise necessary if it was not all received, or cropping them down after a useless data copy. Amos