On 12/09/2013 10:46 PM, Francesco Chemolli wrote: > My suggestion is to have CharacterSet be a SBuf and > rely on them, at least for now. In any case having them be a SBuf > promotes better interface decoupling and abstraction.
CharacterSet is a "set of characters", which is semantically very different from a "string". It would be overall wrong to abuse SBuf for that purpose. It should either start simple (as I suggested) or immediately optimized into a dedicated array-based class (as Amos proposed). > Oh, one more argument for having the low-level matching primitives in > SBuf: it's a pet peeve of mine to use some form of compact tries > and/or FSM to do single-pass low-level string matching in SBuf The purpose of Tokenizer is to help transform input into tokens using simple rules. There is absolutely nothing in that design that prevents someone from adding more matching methods to SBuf or optimizing existing ones. Tokenizer can call those new/optimized methods as needed. That does not lead to code duplication or even performance overheads. On the other hand, it would be very wrong for the caller to use raw SBuf methods to tokenize input. Tokenization is Tokenizer job, not SBuf job. Please let me know if I should give more arguments explaining the difference and supporting this design. > SBuf was not really designed to be passed by nonconst reference. I do not think that is a valid claim. Nearly any C++ object can be passed by reference. There is nothing wrong with that as such, it is often a good idea, and no general-purpose design can (or should) prohibit that. The "const" part of the API (or lack of thereof) is just an indication that the method argument cannot (or can) be modified. >> const SBuf &remaining() const > I'd change the signature to > SBuf remaining() const > copying a SBuf is easy, returning one puts a lower requirement on the > caller and is less constrained Copying SBuf is easy but relatively expensive compared to not copying. I am not sure GCC would optimize that copy away when it can be. Here, I am talking about calling various refcounting methods, not copying of data buffers, of course. The reference-based method would not allow the caller to modify the return value directly, but can you think of a common use pattern or two where such on-the-fly modification would actually be needed or desired? Please note that we can change this aspect of the interface later if needed without any need to change any of the callers (AFAICT). > I'd also add to the interface a few constants to describe common > character sets such as ALPHA, ALNUM, LOWERALPHA, UPPERALPHA etc. (I'd > use the predefined character classes from grep(1) as a refetence for > common patterns). Yes, the proposed design will work well only if we predefine virtually all character sets. There is no need for a new interface for that though. Just declare global constant CharacterSet objects (or functions creating/returning them). However, I recommend against adding a lot of such constants until they are needed by a specific caller. Why waste 256 bytes? It is easy to add more as needed. I would recommend adding a macro to create a CharacterSet from one of the ISALPHA(3) isfoo() functions/macros, but I am worried that those are locale-dependent while almost everything in Squid should not be. We may be better off hard-coding any protocol-defined character sets instead. Thank you, Alex.