Re: [RFC] Tokenizer API

Alex Rousskov Tue, 10 Dec 2013 07:38:11 -0800

On 12/09/2013 10:46 PM, Francesco Chemolli wrote:

> My suggestion is to have CharacterSet be a SBuf and
> rely on them, at least for now. In any case having them be a SBuf
> promotes better interface decoupling and abstraction.


CharacterSet is a "set of characters", which is semantically very
different from a "string". It would be overall wrong to abuse SBuf for
that purpose. It should either start simple (as I suggested) or
immediately optimized into a dedicated array-based class (as Amos proposed).

> Oh, one more argument for having the low-level matching primitives in
> SBuf: it's a pet peeve of mine to use some form of compact tries
> and/or FSM to do single-pass low-level string matching in SBuf

The purpose of Tokenizer is to help transform input into tokens using
simple rules. There is absolutely nothing in that design that prevents
someone from adding more matching methods to SBuf or optimizing existing
ones. Tokenizer can call those new/optimized methods as needed. That
does not lead to code duplication or even performance overheads.

On the other hand, it would be very wrong for the caller to use raw SBuf
methods to tokenize input. Tokenization is Tokenizer job, not SBuf job.
Please let me know if I should give more arguments explaining the
difference and supporting this design.


> SBuf was not really designed to be passed by nonconst reference. 

I do not think that is a valid claim. Nearly any C++ object can be
passed by reference. There is nothing wrong with that as such, it is
often a good idea, and no general-purpose design can (or should)
prohibit that. The "const" part of the API (or lack of thereof) is just
an indication that the method argument cannot (or can) be modified.


>> const SBuf &remaining() const

> I'd change the signature to
> SBuf remaining() const

> copying a SBuf is easy, returning one puts a lower requirement on the
> caller and is less constrained

Copying SBuf is easy but relatively expensive compared to not copying. I
am not sure GCC would optimize that copy away when it can be. Here, I am
talking about calling various refcounting methods, not copying of data
buffers, of course.

The reference-based method would not allow the caller to modify the
return value directly, but can you think of a common use pattern or two
where such on-the-fly modification would actually be needed or desired?

Please note that we can change this aspect of the interface later if
needed without any need to change any of the callers (AFAICT).


> I'd also add to the interface a few constants to describe common
> character sets such as ALPHA, ALNUM, LOWERALPHA, UPPERALPHA etc. (I'd
> use the predefined character classes from grep(1) as a refetence for
> common patterns).

Yes, the proposed design will work well only if we predefine virtually
all character sets. There is no need for a new interface for that
though. Just declare global constant CharacterSet objects (or functions
creating/returning them).

However, I recommend against adding a lot of such constants until they
are needed by a specific caller. Why waste 256 bytes? It is easy to add
more as needed.

I would recommend adding a macro to create a CharacterSet from one of
the ISALPHA(3) isfoo() functions/macros, but I am worried that those are
locale-dependent while almost everything in Squid should not be. We may
be better off hard-coding any protocol-defined character sets instead.


Thank you,

Alex.

Re: [RFC] Tokenizer API

Reply via email to