Hello,

    The promised Tokenizer API proposal is attached. Compared to earlier
proposals, this interface is simplified by focusing on finding tokens
(which requires knowledge of allowed and prohibited character sets)
rather than on parsing (which requires knowledge of input syntax), and
by acknowledging that real parsing rules are often too complex to be
[efficiently] supported with a single set of delimiters. The parser
(rather than the tokenizer) is a better place to deal with those
complexities.

The API supports checkpoints and backtracking by ... copying Tokenizers.

I believe the interface allows for an efficient implementation,
especially if the CharacterSet type is eventually redefined as a boolean
array, giving us constant-time lookups.


Here is a sketch of how a Tokenizer "tk" might be used to build a
primitive HTTP Request-Line parser (a part of the incremental HTTP
header parser):

> // Looking at something like GET /index.html HTTP/1.0\r\n
> 
> SBuf method, uri, proto, vMajor, vMinor;
> if (tk.prefix(method, Http::MethodChars) &&
>     tk.token(uri, Http::HeaderWhitespace) &&
>     tk.prefix(proto, Http::ProtoChars) &&
>     tk.skip('/') &&
>     tk.prefix(vMajor, DecimalDigits) &&
>     tk.skip('.') &&
>     tk.prefix(vMinor, DecimalDigits) &&
>     (tk.skip(Http::Crs) || true) && // optional CRs
>     tk.skip('\n')) {
>     ... validate after successfully parsing the request line
> } else ...


And this sketch illustrates the part of squid.conf parser dealing with
quoted strings:

> if (tk.skip('\\')) ...
> else if (tk.skip('"')) ...
> else if (tk.token(word, SquidConfWhitespace)) ...


HTH,

Alex.
#include <set>
#include "SBuf.h"

/** 
 * Efficiently converts raw input into a stream of basic tokens.
 * Custom token boundary/separation rules are supported via caller-provided,
 * pre-computed character sets. The caller (a parser of some kind) defines
 * the input grammar by using an appropriate sequence of token(), prefix(),
 * and skip() calls, with the right parameters restricting token composition.
 */
class Tokenizer {
public:
    /// a collection of unique characters; TODO: support negation, merging
    typedef std::set<char> CharacterSet; // TODO: optimize using a bool array

    explicit Tokenizer(const SBuf &inBuf);

    bool atEnd() const { return !buf_.length(); }
    const SBuf &remaining() const { return buf_; }
    void reset(const SBuf &newBuf) { buf_ = newBuf; }

    /* The following methods start from the beginning of the input buffer.
     * They return true and consume parsed chars if a non-empty token is found.
     * Otherwise, they return false without any side-effects. */

    /** Basic strtok(3):
     *  Skips all leading delimiters (if any),
     *  accumulates all characters up to the first delimiter (a token), and
     *  skips all trailing delimiters (if any).
     *  Want to extract delimiters? Use three prefix() calls instead.
     */
    bool token(SBuf &token, const CharacterSet &whitespace);

    /// Accumulates all sequential permitted characters (a token).
    bool prefix(SBuf &token, const CharacterSet &tokenChars);

    /// Skips all sequential permitted characters (a token).
    bool skip(const CharacterSet &tokenChars);

    /// Skips a given token.
    bool skip(const SBuf &token);

    /// Skips a given character (a token).
    bool skip(const char token);

private:
    SBuf buf_; ///< yet unparsed input
};
