On 2013-12-10 08:00, Alex Rousskov wrote:
Hello,
The promised Tokenizer API proposal is attached. Compared to earlier
proposals, this interface is simplified by focusing on finding tokens
(requires knowledge of allowed and prohibited character sets), not
parsing (requires knowledge of input syntax) and by acknowledging that
real parsing rules are often too complex to be [efficiently] supported
with a single set of delimiters. The parser (rather than a tokenizer) is
a better place to deal with those complexities.
The API supports checkpoints and backtracking by ... copying Tokenizers.
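To illustrate the checkpoint-by-copy idea, here is a minimal sketch. The Tokenizer below is a hypothetical stand-in, not the attached proposal: its members (prefix(), skip(), remaining()), the use of std::string_view (C++17) in place of a real buffer and CharacterSet, and the parsePair() helper are all assumptions made only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical stand-in for the proposed Tokenizer; names are illustrative only.
class Tokenizer
{
public:
    explicit Tokenizer(std::string_view input) : buf_(input) {}

    // Consume the longest non-empty prefix made only of characters in `allowed`.
    // Returns false and consumes nothing if no such prefix exists.
    bool prefix(std::string_view &token, std::string_view allowed) {
        const std::size_t n = buf_.find_first_not_of(allowed);
        const std::size_t len = (n == std::string_view::npos) ? buf_.size() : n;
        if (len == 0)
            return false;
        token = buf_.substr(0, len);
        buf_.remove_prefix(len);
        return true;
    }

    // Consume a single expected character, if it is next in the input.
    bool skip(char c) {
        if (buf_.empty() || buf_.front() != c)
            return false;
        buf_.remove_prefix(1);
        return true;
    }

    std::string_view remaining() const { return buf_; }

private:
    std::string_view buf_; // unparsed input; copying a Tokenizer copies only this view
};

// Try to parse "name=value"; on failure, restore the checkpoint so the
// caller sees the input exactly as it was before the attempt.
bool parsePair(Tokenizer &tok, std::string &name, std::string &value)
{
    const Tokenizer checkpoint = tok; // the checkpoint is just a copy
    std::string_view n, v;
    if (tok.prefix(n, "abcdefghijklmnopqrstuvwxyz") && tok.skip('=') &&
        tok.prefix(v, "0123456789")) {
        name = std::string(n);
        value = std::string(v);
        return true;
    }
    tok = checkpoint; // backtrack by assigning the copy back
    return false;
}
```

In this sketch a failed parsePair() leaves the caller's Tokenizer untouched, so a parser can try one rule and then fall through to another without any explicit rewind API.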
I believe the interface allows for an efficient implementation,
especially if the CharacterSet type is eventually redefined as a boolean
array, providing us a constant time lookup complexity.
Agreed. +1 for going with this design.
Two requests for additional scope:
* Can we place this in a separate src/parse/ library please?
- we have other generic parse code that deserves to be bundled up
together instead of spread out. Might as well start that collection
process now.
* Let's do the charset boolean array earlier rather than later. The
existing ones are rather nasty but they do "work" right now. Doing it
now would make this project an optimization from start to finish.
CharacterSet.h:

#include <cstddef>
#include <cstdint>
#include <cstring>

namespace Parser {

class CharacterSet
{
public:
    /// build a set from the first len characters of c
    CharacterSet(const char * const c, std::size_t len) {
        std::memset(match_, 0, sizeof(match_));
        for (std::size_t i = 0; i < len; ++i) {
            match_[static_cast<std::uint8_t>(c[i])] = true;
        }
    }

    /// whether a given character exists in the set
    bool operator[](char c) const {
        return match_[static_cast<std::uint8_t>(c)];
    }

    /// add all characters from the given CharacterSet to this one
    void merge(const CharacterSet &src) {
        for (std::size_t i = 0; i < 256; ++i) {
            if (src.match_[i])
                match_[i] = true;
        }
    }

private:
    bool match_[256]; ///< one flag per possible octet value
};

} // namespace Parser

NP: most of the time we will want to define these CharacterSets as
global once-off objects, so I'm not sure the merge() method is useful.
It is shown here for completeness in case we want it for generating
composite character sets.
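A rough sketch of how merge() might be used to build a composite set once at startup. It repeats a corrected copy of the class so that it compiles on its own; the set names (Alpha, Digit, Alnum) and the makeAlnum() helper are made up for illustration and are not part of any existing code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

namespace Parser {

class CharacterSet
{
public:
    CharacterSet(const char * const c, std::size_t len) {
        std::memset(match_, 0, sizeof(match_));
        for (std::size_t i = 0; i < len; ++i)
            match_[static_cast<std::uint8_t>(c[i])] = true;
    }

    bool operator[](char c) const {
        return match_[static_cast<std::uint8_t>(c)];
    }

    // Non-const: merging modifies this set.
    void merge(const CharacterSet &src) {
        for (std::size_t i = 0; i < 256; ++i) {
            if (src.match_[i])
                match_[i] = true;
        }
    }

private:
    bool match_[256];
};

} // namespace Parser

// Hypothetical once-off global sets (52 letters, 10 digits).
static const Parser::CharacterSet Alpha(
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ", 52);
static const Parser::CharacterSet Digit("0123456789", 10);

// Hypothetical helper: build the composite in a mutable temporary,
// then publish it as a const global.
static Parser::CharacterSet makeAlnum()
{
    Parser::CharacterSet result = Alpha;
    result.merge(Digit);
    return result;
}

// Defined after Alpha and Digit in the same translation unit, so both
// are already initialized when makeAlnum() runs.
static const Parser::CharacterSet Alnum = makeAlnum();
```

Building the composite in a helper keeps the globals themselves const, which is the main case where merge() seems to earn its place.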
Amos