On 27/09/2013 6:14 a.m., Alex Rousskov wrote:
On 08/30/2013 11:39 AM, Alex Rousskov wrote:
One of the key ideas behind "strict syntax" is that the token "type" can
be and is determined by the parser (simple token, quoted string, regular
expression, number, etc.) and not the caller. This is not the only way
to support better syntax rules, but it is overall simpler and more
flexible than the caller-determines-token-type alternatives (because the
code can manipulate tokens with respect to comments, macros, line
continuation, whitespace, and such without relying on callers to parse
each character).
Hello,
While I still believe that the above is correct, after many email
exchanges with Christos and several patches, I suspect we should not
implement the above. My concern is that if we do, the fixed syntax will
never become the default because there will be just too much work for
admins to rewrite their configurations, parts of which they may not even
control. Initially, I thought we could remain mostly backwards
compatible with the fixed syntax, but I think I lost that important goal
along the way to the strict syntax.
To avoid any misunderstanding: these are my thoughts. Christos has
supplied valuable insight and very useful code that brought me here, but
he has not reviewed and may not share my conclusions.
Hopefully, this discussion will help us find the right balance between
the needed syntax improvements and the need to remain backwards
compatible in the vast majority of cases.
I think the key problem here is our (or my) desire to use a simple
grammar definition for the strict syntax. I think we have to relax that
by using Squid context-aware parsers more aggressively. Also, we have to
make a couple of new syntax constructs less beautiful to minimize
collision chances. Initial deployment of v3.4 code exposed a few
unexpected (but unfortunately common) collisions.
Specifically, I propose the following:
A) Define Squid syntax as described in the grammar quoted at the end of
this email. Some adjustments would be necessary, I am sure, but I think
it is important to have a grammar definition as a starting point. The
key in that grammar is the ambiguous rule after the "The caller must ask"
comment. If it works, it will allow us to support most of the old
configurations, especially when dealing with old options.
B) Adjust the low-level parsing code to provide the following methods,
each of which extracts the requested thing and returns true (on success)
or false (on failure). If they fail to extract what was requested, they
leave the parsing stream unchanged:
NextChars(const char *); // the exact sequence of characters given
NextToken(String &t); // next name or token as defined in the grammar
NextValue(String &v, kind); // next value as defined in the grammar
NextSpace(String *v); // Optional White Space (OWS) as in grammar
NextLineEnd(); // end of line; may be \0 if preprocessor strips \n
Mark CurrentPosition(); // current parsing position
void RevertTo(Mark mark); // reverts low-level parsing state to mark
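As a rough illustration of the intended semantics (the Buf and Pos
globals and the size_t-based Mark are hypothetical stand-ins for the
real parsing stream, not the proposed implementation), NextChars() and
the Mark/RevertTo pair could look like this:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical single-buffer parser state; the real implementation
// would wrap Squid's configuration input stream instead.
static std::string Buf;  // characters being parsed
static size_t Pos = 0;   // current parsing position

typedef size_t Mark;

static Mark CurrentPosition() { return Pos; }
static void RevertTo(Mark m) { Pos = m; }

// Consume the exact sequence of characters given, or leave the
// parsing stream unchanged and return false.
static bool NextChars(const char *s)
{
    const size_t n = strlen(s);
    if (Buf.compare(Pos, n, s) == 0) {
        Pos += n;
        return true;
    }
    return false;
}
```

Note that a failed NextChars() leaves Pos untouched, which is what lets
callers probe for several alternatives without explicit bookkeeping.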
Also add the following class to properly manage "undo". Note that we
cannot just put parsed elements on stack any more because the next
caller may use a different syntax to parse the same sequence of characters.
This implies that we require StringNG's terminator-agnostic ability for
the String& in all the above functions, OR that we copy each potential
line into a temporary String which can be dropped on unwind. That
terminator-agnostic behavior is provided by StringArea, but IIRC kinkie
warned against relying on StringArea for parser usage due to its lack
of cow() and pointer safety.
// remembers CurrentPosition() upon construction and undoes damages
// using RevertTo upon destruction unless .success() has been called
class ParseProgress {
public:
    ParseProgress(): start(CurrentPosition()), succeeded(false) {}
    ~ParseProgress() { if (!succeeded) RevertTo(start); }
    bool success() { return succeeded = true; }
    bool failure() const { return false; } // for clarity
private:
    Mark start; ///< where the parsing started
    bool succeeded; ///< whether our owner successfully parsed
};
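A self-contained sketch showing the RAII undo in action; the size_t
position and parseSomething() below are hypothetical stand-ins for the
real parser state and a real parsing function:

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for the low-level parser state from section B.
static size_t Pos = 0;
typedef size_t Mark;
static Mark CurrentPosition() { return Pos; }
static void RevertTo(Mark m) { Pos = m; }

// The class from the proposal: reverts parsing state on destruction
// unless success() was called.
class ParseProgress {
public:
    ParseProgress(): start(CurrentPosition()), succeeded(false) {}
    ~ParseProgress() { if (!succeeded) RevertTo(start); }
    bool success() { return succeeded = true; }
    bool failure() const { return false; } // for clarity
private:
    Mark start; ///< where the parsing started
    bool succeeded; ///< whether our owner successfully parsed
};

// Hypothetical parser: consumes five characters, then either commits
// or lets the ParseProgress destructor rewind the stream.
static bool parseSomething(bool willSucceed)
{
    ParseProgress progress;
    Pos += 5; // pretend we consumed five characters
    if (willSucceed)
        return progress.success(); // keep the consumed characters
    return progress.failure(); // destructor rewinds Pos to start
}
```

The point of the destructor-driven undo is that every early return,
including ones added later, gets the rewind for free.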
C) Add parameter parsing helpers sketched below, using the above API:
bool ParamWithValue(const char *name, String &value, bool required)
{
    ParseProgress progress;
    if (!NextChars(name))
        return progress.failure();
    if (!NextChars("=")) {
        if (required)
            ... failure: missing value for name
        return false;
    }
    // we may need to make strictOrLegacy a fourth ParamWithValue()
    // parameter so that callers can control what kind of values
    // are allowed
    if (NextValue(value, strictOrLegacy))
        return progress.success();
    if (required)
        ... failure: missing value for name
    return false;
}
// directive parameter without a value
bool ParamWoutValue(const char *name)
{
if (!NextChars(name))
return false;
if (NextChars("="))
failure: unexpected value for name
return true;
}
bool MoreParams()
{
    // the line end will be checked for and absorbed
    // by the directive parser, not the directive parameters parser
    return NextSpace(NULL);
}
// called when all supported switches and parameter names
// have been exhausted; we do not expect a name=value thing next.
// kills Squid on errors; otherwise returns false
bool UnsupportedParamName()
{
    ParseProgress progress;
    String name;
    if (NextToken(name) && NextChars("="))
        ... failure: unknown directive parameter name or switch: name
    return progress.failure(); // undo name extraction
}
Here is what a simple directive parameters parser would look like:
do {
    String v;
    if (ParamWithValue("opt1", v, true)) ...
    else if (ParamWithValue("opt2", v, false)) ...
    else if (ParamWoutValue("opt3")) ...
    else if (!UnsupportedParamName()) {
        NextValue(v, ...); // anonymous parameter
        ...
    }
} while (MoreParams());
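To make the shape of this loop concrete, here is a self-contained toy
model of the helpers parsing from a plain string. The internals (a
global buffer, inline undo instead of ParseProgress, no strictOrLegacy
value kinds) are simplified stand-ins for the proposed API, not its
implementation:

```cpp
#include <cassert>
#include <cctype>
#include <cstring>
#include <string>

static std::string Buf; // configuration text being parsed
static size_t Pos = 0;  // current parsing position

// consume the exact character sequence, or leave the stream unchanged
static bool NextChars(const char *s)
{
    const size_t n = strlen(s);
    if (Buf.compare(Pos, n, s) == 0) { Pos += n; return true; }
    return false;
}

// next run of non-whitespace characters (stands in for NextValue())
static bool NextValue(std::string &v)
{
    const size_t start = Pos;
    while (Pos < Buf.size() && !isspace((unsigned char)Buf[Pos]))
        ++Pos;
    v = Buf.substr(start, Pos - start);
    return Pos > start;
}

// name=value parameter; undoes all consumption on failure
static bool ParamWithValue(const char *name, std::string &value)
{
    const size_t start = Pos;
    if (NextChars(name) && NextChars("=") && NextValue(value))
        return true;
    Pos = start; // undo, as ParseProgress would
    return false;
}

// valueless parameter; a real version would also reject "opt3x"
static bool ParamWoutValue(const char *name)
{
    const size_t start = Pos;
    if (NextChars(name) && !NextChars("="))
        return true;
    Pos = start;
    return false;
}

// consume required whitespace between parameters
static bool MoreParams()
{
    const size_t start = Pos;
    while (Pos < Buf.size() && isspace((unsigned char)Buf[Pos]))
        ++Pos;
    return Pos > start;
}
```

With these, a do/while loop like the one above can walk through a
parameter list such as "opt1=x opt3 anon", matching named parameters
first and falling back to anonymous values.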
The above can be rearranged so that the directive parameters parsing
code first extracts the parameter name and then compares it to supported
parameters (like most of the current code does). However, I suspect it
would create more problems than it would solve in some cases: the
extracted thing may not be a parameter name but an [anonymous] value,
and when you extract a value, you may want to tell the parser whether
relaxed syntax is allowed, but you cannot do that until you are sure
that the next thing is not a parameter name (because the relaxed value
parser is going to eat the '=' sign). Thus, you should not extract a
value until you know it is not a parameter name.
Agreed +1. This is no less efficient than the current code doing its
series of if-conditions on a strtok() output string, but it gains us
the undo functionality far more cheaply.
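The '='-eating behavior that makes the ordering matter can be seen in a
toy relaxed-value extractor (hypothetical code, following the grammar's
relaxed_value rule of "any non-whitespace sequence"):

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Toy relaxed-value extractor per the relaxed_value grammar rule:
// it consumes any non-whitespace run, including any '=' inside it,
// so a named_parameter like "opt1=foo" becomes indistinguishable
// from an anonymous value once extracted this way.
static std::string relaxedValue(const std::string &in)
{
    size_t p = 0;
    while (p < in.size() && !isspace((unsigned char)in[p]))
        ++p;
    return in.substr(0, p);
}
```

Once relaxedValue() has run, the name/value boundary is gone, which is
why the parser must rule out a parameter name before extracting a
relaxed value.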
I am omitting some details, of course, but do you think this approach
with more explicit whitespace and relaxed value handling will cover
nearly all current use cases(*), allowing the new parsing mode to be
enabled by default soon? Or am I missing something again?
Comment handling (see below for the fix), and the detail that StringNG
will allow us to implement the undo functionality far better than
String - which implies that the large amount of conversion work will be
followed by an almost as large amount of conversion to SBuf later in 3.5.
Since this design again prioritizes backward compatibility, I think we
can schedule it for 3.5+ and work on some bridging polish so that the
3.4 parser has no user-noticeable differences from the final strict
parser. For starters, ignore the maintenance nightmare of the code
internals being vastly different - that is not a user-visible aspect
and we will cope with that pain ourselves.
What I am thinking is that if we agree on this grammar being suitable
(you and I mostly agree; let's see what the others think), then we can
massage the 3.4 parser to cope with it. No matter how nasty the hacks
may have to be in that version, we can drop them in 3.5 going forward.
I do not see re-implementing the entire parser starting today as a
quick enough project to fix things before 3.4 needs to be stable.
(*) except quoted strings with unsupported %macros in old configs, of
course, but we are willing to sacrifice those, I guess. The syntax also
allows comments where spaces are allowed, but we can disable that if
there are too many clashes.
If we take a "#" token surrounded by whitespace as the comment start,
or a "#" in a position where EOL is expected, this should not be an
issue for most configurations.
Since comments run to EOL, they will collide with the preprocessor
un-wrapping wrapped lines if a comment is added anywhere whitespace is
possible. I am inclined to make comment a special case, with the
following alterations to the grammar:
config = *( OWS directive / comment <new line> )
directive = name *( RWS parameter ) comment <new line>
whitespace = *ascii_whitespace_char
comment = OWS ( <#> RWS <followed by all characters until the end of
the line (excluding that EOL)> )
Line wrapping grammar is this:
fold = <\> <new line>
whitespace = *( ascii_whitespace_char / fold )
with no need or possibility to fold inside a comment.
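A minimal sketch of the fold rule as a preprocessing step (unfold() is
a hypothetical name; the real preprocessor does more than this, e.g.
conditionals and ${macros}):

```cpp
#include <cassert>
#include <string>

// Implements the fold rule: a backslash immediately before a newline
// joins the next physical line to the current one, and both the
// backslash and the newline disappear from the unfolded text.
static std::string unfold(const std::string &in)
{
    std::string out;
    for (size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\\' && i + 1 < in.size() && in[i + 1] == '\n')
            ++i; // skip the backslash here; the loop skips the newline
        else
            out += in[i];
    }
    return out;
}
```

Because unfolding happens before comment handling, a comment on a
folded line would otherwise swallow the continuation - hence the "no
fold inside a comment" restriction above.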
Thank you,
Alex.
Squid configuration syntax:
config = *( OWS directive OWS )
directive = name *( RWS parameter ) OWS <new line>
name = token
parameter = anonymous_parameter / named_parameter
anonymous_parameter = value
named_parameter = name <=> value ; No spaces around = allowed!
; This is where the grammar is ambiguous!
; The caller must ask for either strict_value, relaxed_value, or
; "strict_value if possible and relaxed_value otherwise"
value = strict_value / relaxed_value
strict_value = token / percent / RE / fcall /
single_quoted_string / double_quoted_string
; any non-whitespace sequence that does not start with one of
; the two reserved prefixes: "squid::" and "regex::"
relaxed_value = xchar *xchar ; with prefix restrictions
; The "squid::" prefix avoids fcall clashes with relaxed_values
; and allows us to add more functions later.
; We may require that fcall is the only or last directive parameter.
fcall = "squid::" name <(> OWS farguments OWS <)>
; we can support just one required argument for now
farguments = fargument *( OWS <,> OWS <fargument> )
fargument = strict_value
; TBD, but reserve "regex::" now to avoid clashes with relaxed_values
RE = seven characters "regex::" followed by a self-delimiting
sequence of characters
token = tchar *tchar
percent = alphanumeric *alphanumeric <%> ; add <.>?
single_quoted_string = <'> *[sqchar / escaped-pair] <'>
double_quoted_string = <"> *[dqchar / escaped-pair] <">
tchar = alphanumeric / <_> / <-> / <.>
xchar = any char except ascii_whitespace_char and new line
sqchar = any char except single quote and backslash
dqchar = any char except double quote and backslash
escaped-pair = backslash followed by any char except new line
OWS = *[whitespace] ; optional whitespace
RWS = whitespace OWS ; required whitespace
whitespace = *ascii_whitespace_char / comment
comment = <#> followed by all characters until the end of the line
Comments should be a # _after_ some whitespace, not instead of it;
otherwise we cannot tell foo#bar from "foo #bar".
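A toy comment stripper implementing the "# only after whitespace" rule
(stripComment() is a hypothetical name; it also treats a # at the very
start of a line as a comment, which may or may not be desired):

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Drops a trailing comment, but only when the '#' follows whitespace
// (or starts the line): "foo #bar" loses the comment, while the
// single token "foo#bar" stays intact.
static std::string stripComment(const std::string &line)
{
    for (size_t i = 0; i < line.size(); ++i) {
        if (line[i] == '#' &&
            (i == 0 || isspace((unsigned char)line[i - 1])))
            return line.substr(0, i);
    }
    return line;
}
```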
Please note that the above grammar is applied after the preprocessing
stage, which handles line continuations using backslashes, conditionals,
and ${macros}. No changes there. Unfortunately, our preprocessor cannot
tokenize because tokenization must remain context-dependent for
backward compatibility reasons.
Thank you. It is good to have this written down finally.
Amos