On 2013-12-10 08:00, Alex Rousskov wrote:
Hello,

The promised Tokenizer API proposal is attached. Compared to earlier
proposals, this interface is simplified by focusing on finding tokens
(requires knowledge of allowed and prohibited character sets), not
parsing (requires knowledge of input syntax) and by acknowledging that
real parsing rules are often too complex to be [efficiently] supported
with a single set of delimiters. The parser (rather than a tokenizer) is
a better place to deal with those complexities.

The API supports checkpoints and backtracking by ... copying Tokenizers.

I believe the interface allows for an efficient implementation,
especially if the CharacterSet type is eventually redefined as a boolean
array, providing us a constant time lookup complexity.

Agreed. +1 for going with this design.


Two requests for additional scope:
* Can we place this in a separate src/parse/ library please?
- We have other generic parse code that deserves to be bundled up together instead of spread out. Might as well start that collection process now.

* Let's do the charset boolean array earlier rather than later. The existing ones are rather nasty, but they do "work" right now. That would make this project an optimization from start to finish.

CharacterSet.h:

#include <cstddef>  // size_t
#include <cstdint>  // uint8_t
#include <cstring>  // memset

namespace Parser {

class CharacterSet
{
public:
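  /// build a set from the first len characters of c
  CharacterSet(const char * const c, size_t len) {
    memset(match_, 0, sizeof(match_));
    for (size_t i = 0; i < len; ++i) {
      // index by the i-th character, not by the pointer itself
      match_[static_cast<uint8_t>(c[i])] = true;
    }
  }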

  /// whether a given character exists in the set
  bool operator[](char c) const {return match_[static_cast<uint8_t>(c)];}

  /// add all characters from the given CharacterSet to this one
  void merge(const CharacterSet &src) {
    for (size_t i = 0; i < 256; ++i) {
      if(src.match_[i])
        match_[i] = true;
    }
  }

private:
  bool match_[256];
};

} // namespace Parser


NP: most of the time we will want to define these CharacterSet objects as global one-off objects. So I'm not sure the merge() method is useful, but it is shown here for completeness in case we want it for generating composite character sets.
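
To make the "global one-off" idea concrete, here is a rough sketch of how such sets might be declared and how merge() could build a composite one. The names Digits, HexLetters, HexDigits and MakeHexDigits() are made up for illustration and are not part of the proposal; the class is repeated in condensed form only so the sketch compiles on its own.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

namespace Parser {

// Condensed copy of the CharacterSet sketch above (merge() is non-const here).
class CharacterSet
{
public:
    CharacterSet(const char * const c, std::size_t len) {
        std::memset(match_, 0, sizeof(match_));
        for (std::size_t i = 0; i < len; ++i)
            match_[static_cast<std::uint8_t>(c[i])] = true;
    }

    /// whether a given character exists in the set (constant-time lookup)
    bool operator[](char c) const { return match_[static_cast<std::uint8_t>(c)]; }

    /// add all characters from the given CharacterSet to this one
    void merge(const CharacterSet &src) {
        for (std::size_t i = 0; i < 256; ++i) {
            if (src.match_[i])
                match_[i] = true;
        }
    }

private:
    bool match_[256];
};

} // namespace Parser

// Hypothetical global one-off sets; the names are assumptions for this example.
static const Parser::CharacterSet Digits("0123456789", 10);
static const Parser::CharacterSet HexLetters("abcdefABCDEF", 12);

// Hypothetical helper: builds a composite set once, via copy + merge().
static Parser::CharacterSet MakeHexDigits()
{
    Parser::CharacterSet result = Digits; // copy, so Digits itself stays untouched
    result.merge(HexLetters);
    return result;
}

// Defined after Digits and HexLetters in the same file, so both are already built.
static const Parser::CharacterSet HexDigits = MakeHexDigits();
```

If composites are only ever built once like this, merge() costs nothing at parse time, and the sets handed to the Tokenizer can stay const.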

Amos
