Re: [RFC] Tokenizer API

Amos Jeffries Mon, 09 Dec 2013 15:14:16 -0800

On 2013-12-10 08:00, Alex Rousskov wrote:

Hello,
The promised Tokenizer API proposal is attached. Compared to earlier
proposals, this interface is simplified by focusing on finding tokens
(requires knowledge of allowed and prohibited character sets), not
parsing (requires knowledge of input syntax) and by acknowledging that
real parsing rules are often too complex to be [efficiently] supported
with a single set of delimiters. The parser (rather than a tokenizer) is
a better place to deal with those complexities.
The API supports checkpoints and backtracking by ... copying Tokenizers.
I believe the interface allows for an efficient implementation,
especially if the CharacterSet type is eventually redefined as a boolean
array, providing us a constant time lookup complexity.


Agreed. +1 for going with this design.
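
To illustrate the checkpoint-by-copy idea, here is a minimal sketch. It is not the attached API: SketchTokenizer, prefix(), skip(), consumed() and parsePair() are all hypothetical names, and plain std::string stands in for both the input buffer and the character sets.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the proposed Tokenizer; NOT the attached API.
class SketchTokenizer
{
public:
    explicit SketchTokenizer(const std::string &input) : buf_(input), pos_(0) {}

    /// consume the longest run of characters found in `allowed`;
    /// returns false (consuming nothing) if that run is empty
    bool prefix(std::string &token, const std::string &allowed) {
        const std::size_t end = buf_.find_first_not_of(allowed, pos_);
        const std::size_t stop = (end == std::string::npos) ? buf_.size() : end;
        if (stop == pos_)
            return false;
        token = buf_.substr(pos_, stop - pos_);
        pos_ = stop;
        return true;
    }

    /// consume one specific character, if it is next in the input
    bool skip(char c) {
        if (pos_ < buf_.size() && buf_[pos_] == c) {
            ++pos_;
            return true;
        }
        return false;
    }

    /// number of input bytes consumed so far
    std::size_t consumed() const { return pos_; }

private:
    std::string buf_; // assumption: a real implementation would share, not copy, the buffer
    std::size_t pos_;
};

/// parse "name=value"; on failure, `tok` is put back where it started
bool parsePair(SketchTokenizer &tok, std::string &name, std::string &value)
{
    const SketchTokenizer checkpoint = tok; // checkpoint: just a copy
    const std::string alpha = "abcdefghijklmnopqrstuvwxyz";
    const std::string digits = "0123456789";
    if (tok.prefix(name, alpha) && tok.skip('=') && tok.prefix(value, digits))
        return true;
    tok = checkpoint; // backtrack: restore the copy
    return false;
}
```

Calling parsePair() on "ttl=300" consumes the whole input, while on "ttl=abc" it fails and leaves the tokenizer at offset 0. The copy is only cheap if the tokenizer holds a reference-counted or borrowed buffer plus an offset, which this sketch does not attempt to show.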


Two requests for additional scope:
* can we place this in a separate src/parse/ library please?

- we have other generic parse code that deserves to all be bundled up together instead of spread out. Might as well start that collection process now.

* Let's do the charset boolean array earlier rather than later. The existing ones are rather nasty but they do "work" right now. Making this project an optimization start to finish.


CharacterSet.h:

#include <cstddef>
#include <cstdint>
#include <cstring>

namespace Parser {

class CharacterSet
{
public:
  CharacterSet(const char * const c, size_t len) {
    memset(match_, 0, sizeof(match_));
    for (size_t i = 0; i < len; ++i) {
      match_[static_cast<uint8_t>(c[i])] = true;
    }
  }

  /// whether a given character exists in the set
  bool operator[](char c) const { return match_[static_cast<uint8_t>(c)]; }


  /// add all characters from the given CharacterSet to this one
  void merge(const CharacterSet &src) {
    for (size_t i = 0; i < 256; ++i) {
      if(src.match_[i])
        match_[i] = true;
    }
  }

private:
  bool match_[256];
};

} // namespace Parser

NP: most of the time we will be wanting to define these CharacterSet as global once-off objects. So I'm not sure if the merge() method is useful, but shown here for completeness in case we want it for generating composite character sets.
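
If composite sets are wanted as const globals, a mutating merge() cannot be called on them after construction. One possible way around that, sketched below, is a non-member operator+ that returns a new set. The operator+, the Sketch namespace and the Alpha/Digit/AlphaNum names are assumptions made for illustration only; the class is a compact restatement of the one above with merge() made non-const and returning *this.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

namespace Sketch {

// Compact restatement of the CharacterSet idea, for a self-contained example.
class CharacterSet
{
public:
    CharacterSet(const char * const c, std::size_t len) {
        std::memset(match_, 0, sizeof(match_));
        for (std::size_t i = 0; i < len; ++i)
            match_[static_cast<std::uint8_t>(c[i])] = true;
    }

    /// whether a given character exists in the set (constant-time lookup)
    bool operator[](char c) const { return match_[static_cast<std::uint8_t>(c)]; }

    /// add all characters from the given CharacterSet to this one
    CharacterSet &merge(const CharacterSet &src) {
        for (std::size_t i = 0; i < 256; ++i) {
            if (src.match_[i])
                match_[i] = true;
        }
        return *this;
    }

private:
    bool match_[256];
};

// Hypothetical helper: build a composite without modifying either operand.
// `lhs` is taken by value, so merge() only changes the local copy.
inline CharacterSet operator+(CharacterSet lhs, const CharacterSet &rhs)
{
    return lhs.merge(rhs);
}

// Once-off globals. Within one translation unit they are initialized in
// definition order, so AlphaNum can safely use Alpha and Digit.
const CharacterSet Alpha("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ", 52);
const CharacterSet Digit("0123456789", 10);
const CharacterSet AlphaNum = Alpha + Digit;

} // namespace Sketch
```

With these definitions, Sketch::AlphaNum['a'] and Sketch::AlphaNum['7'] are true and Sketch::AlphaNum['-'] is false. Initialization order across different translation units is not guaranteed, so composites built this way would need to live in the same file as the sets they combine.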


Amos
