Hello,

    The attached patch adds SBuf and Tokenizer methods to simplify
suffix parsing and skipping (and to make suffix/reverse APIs more
similar to prefix/forward ones). The new methods are used in another
patch (to be posted shortly) that makes the request line parser simpler
and more robust, but they are useful in other contexts as well IMO.

For consistency's sake, I tried to mimic the existing Tokenizer/SBuf
documentation, debugging, and code duplication styles.


One old problem was detected and documented during this project: The
Tokenizer::parsedSize() method does not account for the successfully
parsed and removed suffix/trailing characters in most cases. That method
should be revised or removed. Solving this problem is outside this
project's scope.


This and the following patch are based on trunk r14083. I will update to
the current trunk if the patches are approved. Please let me know if you
want to review the updated versions as well.


HTH,

Alex.
New SBuf and Tokenizer methods to simplify suffix parsing and skipping
(and to make suffix/reverse APIs more similar to prefix/forward ones).

XXX: One old problem detected and documented: The Tokenizer::parsedSize()
method does not account for the successfully parsed and removed suffix/trailing
characters in most cases. It should be revised or removed.

=== modified file 'src/SBuf.cc'
--- src/SBuf.cc	2015-03-17 02:51:04 +0000
+++ src/SBuf.cc	2015-07-20 22:46:31 +0000
@@ -778,40 +778,82 @@ SBuf::findFirstNotOf(const CharacterSet
     ++stats.find;
 
     if (startPos == npos)
         return npos;
 
     if (startPos >= length())
         return npos;
 
     debugs(24, 7, "first not of characterset " << set.name << " in id " << id);
     char *cur = buf()+startPos;
     const char *bufend = bufEnd();
     while (cur < bufend) {
         if (!set[*cur])
             return cur-buf();
         ++cur;
     }
     debugs(24, 7, "not found");
     return npos;
 }
 
+SBuf::size_type
+SBuf::findLastOf(const CharacterSet &set, size_type endPos) const
+{
+    ++stats.find;
+
+    if (isEmpty())
+        return npos;
+
+    if (endPos == npos || endPos >= length())
+        endPos = length() - 1;
+
+    debugs(24, 7, "last of characterset " << set.name << " in id " << id);
+    const char *start = buf();
+    for (const char *cur = start + endPos; cur >= start; --cur) {
+        if (set[*cur])
+            return cur - start;
+    }
+    debugs(24, 7, "not found");
+    return npos;
+}
+
+SBuf::size_type
+SBuf::findLastNotOf(const CharacterSet &set, size_type endPos) const
+{
+    ++stats.find;
+
+    if (isEmpty())
+        return npos;
+
+    if (endPos == npos || endPos >= length())
+        endPos = length() - 1;
+
+    debugs(24, 7, "last not of characterset " << set.name << " in id " << id);
+    const char *start = buf();
+    for (const char *cur = start + endPos; cur >= start; --cur) {
+        if (!set[*cur])
+            return cur - start;
+    }
+    debugs(24, 7, "not found");
+    return npos;
+}
+
 /*
  * TODO: borrow a sscanf implementation from Linux or similar?
  * we'd really need a vsnscanf(3)... ? As an alternative, a
  * light-regexp-like domain-specific syntax might be an idea.
  */
 int
 SBuf::scanf(const char *format, ...)
 {
     va_list arg;
     int rv;
     ++stats.scanf;
     va_start(arg, format);
     rv = vsscanf(c_str(), format, arg);
     va_end(arg);
     return rv;
 }
 
 std::ostream &
 SBufStats::dump(std::ostream& os) const
 {

=== modified file 'src/SBuf.h'
--- src/SBuf.h	2015-03-14 04:42:10 +0000
+++ src/SBuf.h	2015-07-20 22:31:25 +0000
@@ -578,50 +578,70 @@ public:
      * Returns the index in the SBuf of the last occurrence of the
      * sequence contained in the str argument.
      * \return npos if the sequence  was not found
      * \param endPos if specified, ignore any occurrences after that position
      *   if npos or greater than length(), the whole SBuf is considered
      */
     size_type rfind(const SBuf &str, size_type endPos = npos) const;
 
     /** Find first occurrence of character of set in SBuf
      *
      * Finds the first occurrence of ANY of the characters in the supplied set in
      * the SBuf.
      * \return npos if no character in the set could be found
      * \param startPos if specified, ignore any occurrences before that position
      *   if npos, then npos is always returned
      *
      * TODO: rename to camelCase
      */
     size_type findFirstOf(const CharacterSet &set, size_type startPos = 0) const;
 
+    /** Find last occurrence of character of set in SBuf
+     *
+     * Finds the last occurrence of ANY of the characters in the supplied set in
+     * the SBuf.
+     * \return npos if no character in the set could be found
+     * \param endPos if specified, ignore any occurrences after that position
+     *   if npos, the entire SBuf is searched
+     */
+    size_type findLastOf(const CharacterSet &set, size_type endPos = npos) const;
+
     /** Find first occurrence character NOT in character set
      *
      * \return npos if all characters in the SBuf are from set
      * \param startPos if specified, ignore any occurrences before that position
      *   if npos, then npos is always returned
      *
      * TODO: rename to camelCase
      */
     size_type findFirstNotOf(const CharacterSet &set, size_type startPos = 0) const;
 
+    /** Find last occurrence of a character NOT in the character set
+     *
+     * \return npos if all characters in the SBuf are from set
+     * \param endPos if specified, ignore any occurrences after that position
+     *   if npos, then the entire SBuf is searched
+     *
+     * TODO: rename to camelCase
+     */
+    size_type findLastNotOf(const CharacterSet &set, size_type endPos = npos) const;
+
     /** sscanf-alike
      *
      * sscanf re-implementation. Non-const, and not \0-clean.
      * \return same as sscanf
      * \see man sscanf(3)
      */
     int scanf(const char *format, ...);
 
     /// converts all characters to lower case; \see man tolower(3)
     void toLower();
 
     /// converts all characters to upper case; \see man toupper(3)
     void toUpper();
 
     /** String export function
      * converts the SBuf to a legacy String, by copy.
      * \deprecated
      */
     String toString() const;
 

=== modified file 'src/parser/Tokenizer.cc'
--- src/parser/Tokenizer.cc	2015-04-22 13:32:56 +0000
+++ src/parser/Tokenizer.cc	2015-07-21 00:01:22 +0000
@@ -18,40 +18,64 @@
 #endif
 
 /// convenience method: consumes up to n bytes, counts, and returns them
 SBuf
 Parser::Tokenizer::consume(const SBuf::size_type n)
 {
     // careful: n may be npos!
     debugs(24, 5, "consuming " << n << " bytes");
     const SBuf result = buf_.consume(n);
     parsed_ += result.length();
     return result;
 }
 
 /// convenience method: consume()s up to n bytes and returns their count
 SBuf::size_type
 Parser::Tokenizer::success(const SBuf::size_type n)
 {
     return consume(n).length();
 }
 
+/// convenience method: consumes up to n last bytes and returns them
+SBuf
+Parser::Tokenizer::consumeTrailing(const SBuf::size_type n)
+{
+    debugs(24, 5, "consuming " << n << " bytes");
+
+    // If n is npos, we consume everything.
+    // If we consume everything, use consume() to advance parsed_.
+    if (n == SBuf::npos || n == buf_.length())
+        return consume(buf_.length());
+
+    SBuf result = buf_;
+    buf_ = result.consume(buf_.length() - n);
+    // parsed_ stays the same because we did not consume the first char
+    return result;
+}
+
+/// convenience method: consumes up to n last bytes and returns their count
+SBuf::size_type
+Parser::Tokenizer::successTrailing(const SBuf::size_type n)
+{
+    return consumeTrailing(n).length();
+}
+
 bool
 Parser::Tokenizer::token(SBuf &returnedToken, const CharacterSet &delimiters)
 {
     const Tokenizer saved(*this);
     skipAll(delimiters);
     const SBuf::size_type tokenLen = buf_.findFirstOf(delimiters); // not found = npos => consume to end
     if (tokenLen == SBuf::npos) {
         debugs(24, 8, "no token found for delimiters " << delimiters.name);
         *this = saved;
         return false;
     }
     returnedToken = consume(tokenLen); // cannot be empty
     skipAll(delimiters);
     debugs(24, DBG_DATA, "token found for delimiters " << delimiters.name << ": '" <<
            returnedToken << '\'');
     return true;
 }
 
 bool
 Parser::Tokenizer::prefix(SBuf &returnedToken, const CharacterSet &tokenChars, const SBuf::size_type limit)
@@ -73,42 +97,41 @@ Parser::Tokenizer::prefix(SBuf &returned
     returnedToken = consume(prefixLen); // cannot be empty after the npos check
     return true;
 }
 
 bool
 Parser::Tokenizer::suffix(SBuf &returnedToken, const CharacterSet &tokenChars, const SBuf::size_type limit)
 {
     SBuf span = buf_;
 
     if (limit < buf_.length())
         span.consume(buf_.length() - limit); // ignore the N prefix characters
 
     auto i = span.rbegin();
     SBuf::size_type found = 0;
     while (i != span.rend() && tokenChars[*i]) {
         ++i;
         ++found;
     }
     if (!found)
         return false;
-    returnedToken = buf_;
-    buf_ = returnedToken.consume(buf_.length() - found);
+    returnedToken = consumeTrailing(found);
     return true;
 }
 
 SBuf::size_type
 Parser::Tokenizer::skipAll(const CharacterSet &tokenChars)
 {
     const SBuf::size_type prefixLen = buf_.findFirstNotOf(tokenChars);
     if (prefixLen == 0) {
         debugs(24, 8, "no match when trying to skipAll " << tokenChars.name);
         return 0;
     }
     debugs(24, 8, "skipping all in " << tokenChars.name << " len " << prefixLen);
     return success(prefixLen);
 }
 
 bool
 Parser::Tokenizer::skipOne(const CharacterSet &chars)
 {
     if (!buf_.isEmpty() && chars[buf_[0]]) {
         debugs(24, 8, "skipping one-of " << chars.name);
@@ -140,40 +163,66 @@ Parser::Tokenizer::skip(const SBuf &toke
 {
     if (buf_.startsWith(tokenToSkip)) {
         debugs(24, 8, "skipping " << tokenToSkip.length());
         return success(tokenToSkip.length());
     }
     debugs(24, 8, "no match, not skipping '" << tokenToSkip << '\'');
     return false;
 }
 
 bool
 Parser::Tokenizer::skip(const char tokenChar)
 {
     if (!buf_.isEmpty() && buf_[0] == tokenChar) {
         debugs(24, 8, "skipping char '" << tokenChar << '\'');
         return success(1);
     }
     debugs(24, 8, "no match, not skipping char '" << tokenChar << '\'');
     return false;
 }
 
+bool
+Parser::Tokenizer::skipOneTrailing(const CharacterSet &skippable)
+{
+    if (!buf_.isEmpty() && skippable[buf_[buf_.length()-1]]) {
+        debugs(24, 8, "skipping one-of " << skippable.name);
+        return successTrailing(1);
+    }
+    debugs(24, 8, "no match while skipping one-of " << skippable.name);
+    return false;
+}
+
+SBuf::size_type
+Parser::Tokenizer::skipAllTrailing(const CharacterSet &skippable)
+{
+    const SBuf::size_type prefixEnd = buf_.findLastNotOf(skippable);
+    const SBuf::size_type prefixLen = prefixEnd == SBuf::npos ?
+        0 : (prefixEnd + 1);
+    const SBuf::size_type suffixLen = buf_.length() - prefixLen;
+    if (suffixLen == 0) {
+        debugs(24, 8, "no match when trying to skip " << skippable.name);
+        return 0;
+    }
+    debugs(24, 8, "skipping in " << skippable.name << " len " << suffixLen);
+    return successTrailing(suffixLen);
+}
+
 /* reworked from compat/strtoll.c */
 bool
 Parser::Tokenizer::int64(int64_t & result, int base, bool allowSign, const SBuf::size_type limit)
 {
     if (atEnd() || limit == 0)
         return false;
 
     const SBuf range(buf_.substr(0,limit));
 
     //fixme: account for buf_.size()
     bool neg = false;
     const char *s = range.rawContent();
     const char *end = range.rawContent() + range.length();
 
     if (allowSign) {
         if (*s == '-') {
             neg = true;
             ++s;
         } else if (*s == '+') {
             ++s;

=== modified file 'src/parser/Tokenizer.h'
--- src/parser/Tokenizer.h	2015-02-20 03:25:12 +0000
+++ src/parser/Tokenizer.h	2015-07-20 23:37:52 +0000
@@ -18,40 +18,41 @@ namespace Parser
 
 /**
  * Lexical processor to tokenize a buffer.
  *
  * Allows arbitrary delimiters and token character sets to
  * be provided by callers.
  *
  * All methods start from the beginning of the input buffer.
  * Methods returning true consume bytes from the buffer.
  * Methods returning false have no side-effects.
  */
 class Tokenizer
 {
 public:
     explicit Tokenizer(const SBuf &inBuf) : buf_(inBuf), parsed_(0) {}
 
     /// yet unparsed data
     SBuf buf() const { return buf_; }
 
     /// number of parsed bytes, including skipped ones
+    /// XXX: avoid when parsing suffixes/trailing bytes
     SBuf::size_type parsedSize() const { return parsed_; }
 
     /// whether the end of the buffer has been reached
     bool atEnd() const { return buf_.isEmpty(); }
 
     /// the remaining unprocessed section of buffer
     const SBuf& remaining() const { return buf_; }
 
     /// reinitialize processing for a new buffer
     void reset(const SBuf &newBuf) { buf_ = newBuf; parsed_ = 0; }
 
     /** Basic strtok(3):
      *  Skips all leading delimiters (if any),
      *  extracts all characters up to the next delimiter (a token), and
      *  skips all trailing delimiters (at least one must be present).
      *
      *  Want to extract delimiters? Use prefix() instead.
      *
      *  Note that Tokenizer cannot tell whether the trailing delimiters will
      *  continue when/if more input data becomes available later.
@@ -98,49 +99,63 @@ public:
     bool skip(const SBuf &tokenToSkip);
 
     /** skips a given single character
      *
      * \return whether the character was skipped
      */
     bool skip(const char tokenChar);
 
     /** Skips a single character from the set.
      *
      * \return whether a character was skipped
      */
     bool skipOne(const CharacterSet &discardables);
 
     /** Skips all sequential characters from the set, in any order.
      *
      * \returns the number of skipped characters
      */
     SBuf::size_type skipAll(const CharacterSet &discardables);
 
+    /** Removes a single trailing character from the set.
+     *
+     * \return whether a character was removed
+     */
+    bool skipOneTrailing(const CharacterSet &discardables);
+
+    /** Removes all sequential trailing characters from the set, in any order.
+     *
+     * \returns the number of characters removed
+     */
+    SBuf::size_type skipAllTrailing(const CharacterSet &discardables);
+
     /** Extracts an unsigned int64_t at the beginning of the buffer.
      *
      * strtoll(3)-alike function: tries to parse unsigned 64-bit integer
      * at the beginning of the parse buffer, in the base specified by the user
      * or guesstimated; consumes the parsed characters.
      *
      * \param result Output value. Not touched if parsing is unsuccessful.
      * \param base   Specify base to do the parsing in, with the same restrictions
      *               as strtoll. Defaults to 0 (meaning guess)
      * \param allowSign Whether to accept a '+' or '-' sign prefix.
      * \param limit  Maximum count of characters to convert.
      *
      * \return whether the parsing was successful
      */
     bool int64(int64_t &result, int base = 0, bool allowSign = true, SBuf::size_type limit = SBuf::npos);
 
 protected:
     SBuf consume(const SBuf::size_type n);
     SBuf::size_type success(const SBuf::size_type n);
+    SBuf consumeTrailing(const SBuf::size_type n);
+    SBuf::size_type successTrailing(const SBuf::size_type n);
 
 private:
     SBuf buf_; ///< yet unparsed input
     SBuf::size_type parsed_; ///< bytes successfully parsed, including skipped
 };
 
 } /* namespace Parser */
 
 #endif /* SQUID_PARSER_TOKENIZER_H_ */
 

_______________________________________________
squid-dev mailing list
[email protected]
http://lists.squid-cache.org/listinfo/squid-dev
