On 12/15/2013 04:54 AM, Kinkie wrote: > attached patch is from branch lp:~squid/squid/characterset/
> + std::vector<bool> chars_; //std::vector<bool> is optimized According to STL documentation for std::vector<bool> specialization: > These changes provide a quirky interface to this specialization and > favor memory optimization over processing (which may or may not suit > your needs). This is kind of the opposite of what we want. We want access speed, not RAM or copying speed savings. I suspect we do not want std::vector<bool> here! The same docs suggest using a different element type (char, unsigned char, or custom) if a different optimization trade-off is needed. I think we should follow that advice if we want to optimize this (which was the primary point of doing a custom CharacterSet). > It extracts some of the work done in parser-ng to implement Alex' > Tokenizer API, bringing O(N) find-any{,-not}-of matching to SBuf. > Note: in IRC discussions, Amos mentioned find_first_of and > find_first_not_of not being naming convention compliant (that's right, > they are modeled after std::string). I'm fine with changing them as a > follow-up commit if this is the collegial decision; I'd avoid do to it > as part of this merge. I agree that the two should be renamed to use camelCase [as a follow up commit]. > + * \return length() if all characters in the SBuf are from set and the implementation matches that documentation: > +SBuf::size_type > +SBuf::find_first_not_of(const CharacterSet &set, size_type startPos) const > +{ ... > + debugs(24, 7, "not found"); > + return length(); > +} but STL docs say: > Return Value > The position of the first character that does not match. > If no such characters are found, the function returns string::npos. which makes more sense than returning length() IMO. > + CharacterSet(const char *label, const char * const c) : name(label), > chars_(std::vector<bool>(256,false)) { > + size_t clen = strlen(c); > + for (size_t i = 0; i < clen; ++i) > + chars_[static_cast<uint8_t>(c[i])] = true; > + } Since we are using set.name [for debugging] without checking, please set name to "anonymous" or similar if the label parameter is NULL. > + size_t clen = strlen(c); This variable can be const. > + size_t clen = strlen(c); > + for (size_t i = 0; i < clen; ++i) > + chars_[static_cast<uint8_t>(c[i])] = true; Please use add() to add characters to a set. > + /// add a given char to the character set. c must be >= 0. > + CharacterSet & add(const unsigned char c) > {chars_[static_cast<uint8_t>(c)] = true; return *this; } The "c must be >= 0" comment is redundant given that "c" is unsigned. If you are worried about folks adding negative signed chars, try adding a corresponding add(const signed char) method with a check. > + /// name of this character set > + const char * name; The comment does not really document the data member (the data member name itself says exactly the same thing already). Consider: /// optional set label for debugging ("anonymous" by default) > +#include <vector> > + > +class CharacterSet Missing class description. Consider: /// An optimized set of C chars, /// with a quick membership test and merge support. > + /// characters defined in this set > + std::vector<bool> chars_; //std::vector<bool> is optimized s/defined/present/ or s/defined/included/ * CharacterSet creation and merging (+=) are heavy operations that do not benefit from being inlined. I know it is a pain to deal with Makefile dependencies, and I do not insist on this change, but please do seriously consider placing them in .cc file. FWIW, the following problems can be solved easier or better in a .cc file. With a little bit of additional work, you can even remove #include <vector> from CharacterSet users that way! > + //precondition: src.chars_.size() == chars_.size() There is no such precondition. I should be able to merge ALPHA and NUMERIC sets, for example. Rewrite the += operator loop to simply this->add() every character in the src set. Use std::for_each or another <algorithm> for that if possible. > + CharacterSet(const char *label, const char * const c) : name(label), > chars_(std::vector<bool>(256,false)) { > + std::vector<bool>::const_iterator s = src.chars_.begin(); > + const std::vector<bool>::const_iterator e = src.chars_.end(); > + std::vector<bool> chars_; //std::vector<bool> is optimized Please avoid code duplication by typedef-ing std::vector<bool> as Chars or similar and using the corresponding type name everywhere. HTH, Alex.