On 27/09/2013 6:14 a.m., Alex Rousskov wrote:
On 08/30/2013 11:39 AM, Alex Rousskov wrote:
One of the key ideas behind "strict syntax" is that the token "type" can
be and is determined by the parser (simple token, quoted string, regular
expression, number, etc.) and not the caller. This is not the only way
to support better syntax rules, but it is overall simpler and more
flexible than the caller-determines-token-type alternatives (because the
code can manipulate tokens with respect to comments, macros, line
continuation, whitespace, and such without relying on callers to parse
each character).
Hello,
While I still believe that the above is correct, after many email
exchanges with Christos and several patches, I suspect we should not
implement the above. My concern is that if we do, the fixed syntax will
never become the default because there will be just too much work for
admins to rewrite their configurations, parts of which they may not even
control. Initially, I thought we could remain mostly backwards
compatible with the fixed syntax, but I think I lost that important goal
along the way to the strict syntax.
To avoid any misunderstanding: these are my thoughts. Christos has
supplied valuable insight and very useful code that brought me here, but
he has not reviewed and may not share my conclusions.
Hopefully, this discussion will help us find the right balance between
the needed syntax improvements and the need to remain backwards
compatible in the vast majority of cases.
I think the key problem here is our (or my) desire to use a simple
grammar definition for the strict syntax. I think we have to relax that
by using Squid context-aware parsers more aggressively. Also, we have to
make a couple of new syntax constructs less beautiful to minimize
collision chances. Initial deployment of v3.4 code exposed a few
unexpected (but unfortunately common) collisions.
Specifically, I propose the following:
A) Define Squid syntax as described in the grammar quoted at the end of
this email. Some adjustments would be necessary, I am sure, but I think
it is important to have a grammar definition as a starting point. The
key in that grammar is the ambiguous rule after the "The caller must ask"
comment. If it works, it will allow us to support most of the old
configurations, especially when dealing with old options.
B) Adjust the low-level parsing code to provide the following methods,
each of which extracts the requested thing and returns true (on success)
or false (on failure). If they fail to extract what was requested, they
leave the parsing stream unchanged:
NextChars(const char *); // the exact sequence of characters given
NextToken(String &t); // next name or token as defined in the grammar
NextValue(String &v, kind); // next value as defined in the grammar
NextSpace(String *v); // Optional White Space (OWS) as in grammar
NextLineEnd(); // end of line; may be \0 if preprocessor strips \n
Mark CurrentPosition(); // current parsing position
void RevertTo(Mark mark); // reverts low-level parsing state to mark
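As a rough illustration of the intended semantics (the Buf and Pos
globals and the size_t-based Mark are hypothetical stand-ins for the
real parsing stream, not the proposed implementation), NextChars() and
the Mark/RevertTo pair could look like this:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical single-buffer parser state; the real implementation
// would wrap Squid's configuration input stream instead.
static std::string Buf;  // characters being parsed
static size_t Pos = 0;   // current parsing position

typedef size_t Mark;

static Mark CurrentPosition() { return Pos; }
static void RevertTo(Mark m) { Pos = m; }

// Consume the exact sequence of characters given, or leave the
// parsing stream unchanged and return false.
static bool NextChars(const char *s)
{
    const size_t n = strlen(s);
    if (Buf.compare(Pos, n, s) == 0) {
        Pos += n;
        return true;
    }
    return false;
}
```

Note that a failed NextChars() leaves Pos untouched, which is what lets
callers probe for several alternatives without explicit bookkeeping.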
Also add the following class to properly manage "undo". Note that we
cannot just put parsed elements on stack any more because the next
caller may use a different syntax to parse the same sequence of characters.
This implies that we require StringNG's terminator-agnostic ability for
the String& in all the above functions, OR that we copy each potential
line into a temporary String which can be dropped on unwind. That
terminator-agnostic behavior is provided by StringArea, but IIRC kinkie
warned against relying on StringArea for parser usage due to its lack
of cow() and pointer safety.
// remembers CurrentPosition() upon construction and undoes damages
// using RevertTo upon destruction unless .success() has been called
class ParseProgress {
public:
    ParseProgress(): start(CurrentPosition()), succeeded(false) {}
    ~ParseProgress() { if (!succeeded) RevertTo(start); }
    bool success() { return succeeded = true; }
    bool failure() const { return false; } // for clarity
private:
    Mark start; ///< where the parsing started
    bool succeeded; ///< whether our owner successfully parsed
};
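A self-contained sketch showing the RAII undo in action; the size_t
position and parseSomething() below are hypothetical stand-ins for the
real parser state and a real parsing function:

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for the low-level parser state from section B.
static size_t Pos = 0;
typedef size_t Mark;
static Mark CurrentPosition() { return Pos; }
static void RevertTo(Mark m) { Pos = m; }

// The class from the proposal: reverts parsing state on destruction
// unless success() was called.
class ParseProgress {
public:
    ParseProgress(): start(CurrentPosition()), succeeded(false) {}
    ~ParseProgress() { if (!succeeded) RevertTo(start); }
    bool success() { return succeeded = true; }
    bool failure() const { return false; } // for clarity
private:
    Mark start; ///< where the parsing started
    bool succeeded; ///< whether our owner successfully parsed
};

// Hypothetical parser: consumes five characters, then either commits
// or lets the ParseProgress destructor rewind the stream.
static bool parseSomething(bool willSucceed)
{
    ParseProgress progress;
    Pos += 5; // pretend we consumed five characters
    if (willSucceed)
        return progress.success(); // keep the consumed characters
    return progress.failure(); // destructor rewinds Pos to start
}
```

The point of the destructor-driven undo is that every early return,
including ones added later, gets the rewind for free.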
C) Add parameter parsing helpers sketched below, using the above API:
bool ParamWithValue(const char *name, String &value, bool required)
{
    ParseProgress progress;
    if (!NextChars(name))
        return progress.failure();
    if (!NextChars("=")) {
        if (required)
            ... failure: missing value for name
        return false;
    }
    // we may need to make strictOrLegacy a fourth ParamWithValue()
    // parameter so that callers can control what kind of values
    // are allowed
    if (NextValue(value, strictOrLegacy))
        return progress.success();
    if (required)
        ... failure: missing value for name
    return false;
}
// directive parameter without a value
bool ParamWoutValue(const char *name)
{
if (!NextChars(name))
return false;
if (NextChars("="))
failure: unexpected value for name
return true;
}
bool MoreParams()
{
    // the line end will be checked for and absorbed
    // by the directive parser, not the directive parameters parser
    return NextSpace(NULL);
}
// called when all supported switches and parameter names
// have been exhausted; we do not expect a name=value thing next.
// kills Squid on errors; otherwise returns false
bool UnsupportedParamName()
{
    ParseProgress progress;
    String name;
    if (NextToken(name) && NextChars("="))
        ... failure: unknown directive parameter name or switch: name
    return progress.failure(); // undo name extraction
}
Here is what a simple directive parameters parser would look like:
do {
    String v;
    if (ParamWithValue("opt1", v, true)) ...
    else if (ParamWithValue("opt2", v, false)) ...
    else if (ParamWoutValue("opt3")) ...
    else if (!UnsupportedParamName()) {
        NextValue(v, ...); // anonymous parameter
        ...
    }
} while (MoreParams());
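To make the shape of this loop concrete, here is a self-contained toy
model of the helpers parsing from a plain string. The internals (a
global buffer, inline undo instead of ParseProgress, no strictOrLegacy
value kinds) are simplified stand-ins for the proposed API, not its
implementation:

```cpp
#include <cassert>
#include <cctype>
#include <cstring>
#include <string>

static std::string Buf; // configuration text being parsed
static size_t Pos = 0;  // current parsing position

// consume the exact character sequence, or leave the stream unchanged
static bool NextChars(const char *s)
{
    const size_t n = strlen(s);
    if (Buf.compare(Pos, n, s) == 0) { Pos += n; return true; }
    return false;
}

// next run of non-whitespace characters (stands in for NextValue())
static bool NextValue(std::string &v)
{
    const size_t start = Pos;
    while (Pos < Buf.size() && !isspace((unsigned char)Buf[Pos]))
        ++Pos;
    v = Buf.substr(start, Pos - start);
    return Pos > start;
}

// name=value parameter; undoes all consumption on failure
static bool ParamWithValue(const char *name, std::string &value)
{
    const size_t start = Pos;
    if (NextChars(name) && NextChars("=") && NextValue(value))
        return true;
    Pos = start; // undo, as ParseProgress would
    return false;
}

// valueless parameter; a real version would also reject "opt3x"
static bool ParamWoutValue(const char *name)
{
    const size_t start = Pos;
    if (NextChars(name) && !NextChars("="))
        return true;
    Pos = start;
    return false;
}

// consume required whitespace between parameters
static bool MoreParams()
{
    const size_t start = Pos;
    while (Pos < Buf.size() && isspace((unsigned char)Buf[Pos]))
        ++Pos;
    return Pos > start;
}
```

With these, a do/while loop like the one above can walk through a
parameter list such as "opt1=x opt3 anon", matching named parameters
first and falling back to anonymous values.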
The above can be rearranged so that the directive parameters parsing
code first extracts the parameter name and then compares it to supported
parameters (like most of the current code does). However, I suspect it
would create more problems than it would solve in some cases: the
extracted thing may not be a parameter name but an [anonymous] value,
and when you extract a value, you may want to tell the parser whether
relaxed syntax is allowed, but you cannot do that until you are sure
that the next thing is not a parameter name (because the relaxed value
parser is going to eat the '=' sign). Thus, you should not extract a
value until you know it is not a parameter name.
Agreed +1. This is no less efficient than the current code doing its
series of if-conditions on a strtok() output string, but it gains us
the undo functionality far more cheaply.
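The '='-eating behavior that makes the ordering matter can be seen in a
toy relaxed-value extractor (hypothetical code, following the grammar's
relaxed_value rule of "any non-whitespace sequence"):

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Toy relaxed-value extractor per the relaxed_value grammar rule:
// it consumes any non-whitespace run, including any '=' inside it,
// so a named_parameter like "opt1=foo" becomes indistinguishable
// from an anonymous value once extracted this way.
static std::string relaxedValue(const std::string &in)
{
    size_t p = 0;
    while (p < in.size() && !isspace((unsigned char)in[p]))
        ++p;
    return in.substr(0, p);
}
```

Once relaxedValue() has run, the name/value boundary is gone, which is
why the parser must rule out a parameter name before extracting a
relaxed value.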
I am omitting some details, of course, but do you think this approach
with more explicit whitespace and relaxed value handling will cover
nearly all current use cases(*), allowing the new parsing mode to be
enabled by default soon? Or am I missing something again?
Comment handling (see below for the fix), and the detail that StringNG
will allow us to implement the undo functionality far better than
String - which implies that the large amount of conversion work will be
followed by an almost as large amount of conversion to SBuf later in 3.5.
Since this design again prioritizes backward compatibility, I think we
can schedule it for 3.5+ and work on some bridging polish so that the
3.4 parser has no user-noticeable differences from the final strict
parser. For starters, ignore the maintenance nightmare of the code
internals being vastly different - that is not a user-visible aspect
and we will cope with that pain ourselves.
What I am thinking is that if we agree on this grammar being suitable
(you and I mostly agree; let's see what the others think), then we can
massage the 3.4 parser to cope with it. No matter how nasty the hacks
may have to be in that version, we can drop them in 3.5 going forward.
I do not see re-implementing the entire parser starting today as a
quick enough project to fix things before 3.4 needs to be stable.
(*) except quoted strings with unsupported %macros in old configs, of
course, but we are willing to sacrifice those, I guess. The syntax also
allows comments where spaces are allowed, but we can disable that if
there are too many clashes.
If we take a "#" token surrounded by whitespace as the comment start,
or a "#" in a position where EOL is expected, this should not be an
issue for most configurations.
Since comments run to EOL, they will collide with the preprocessor
un-wrapping wrapped lines if a comment is added anywhere whitespace is
possible. I am inclined to make comment a special case, with the
following alterations to the grammar:
config = *( OWS directive / comment <new line> )
directive = name *( RWS parameter ) comment <new line>
whitespace = *ascii_whitespace_char
comment = OWS ( <#> RWS <followed by all characters until the end of
the line (excluding that EOL)> )
Line wrapping grammar is this:
fold = <\> <new line>
whitespace = *( ascii_whitespace_char / fold )
with no need or possibility to fold inside a comment.
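A minimal sketch of the fold rule as a preprocessing step (unfold() is
a hypothetical name; the real preprocessor does more than this, e.g.
conditionals and ${macros}):

```cpp
#include <cassert>
#include <string>

// Implements the fold rule: a backslash immediately before a newline
// joins the next physical line to the current one, and both the
// backslash and the newline disappear from the unfolded text.
static std::string unfold(const std::string &in)
{
    std::string out;
    for (size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\\' && i + 1 < in.size() && in[i + 1] == '\n')
            ++i; // skip the backslash here; the loop skips the newline
        else
            out += in[i];
    }
    return out;
}
```

Because unfolding happens before comment handling, a comment on a
folded line would otherwise swallow the continuation - hence the "no
fold inside a comment" restriction above.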
Thank you,
Alex.
Squid configuration syntax:
config = *( OWS directive OWS )
directive = name *( RWS parameter ) OWS <new line>
name = token
parameter = anonymous_parameter / named_parameter
anonymous_parameter = value
named_parameter = name <=> value ; No spaces around = allowed!
; This is where the grammar is ambiguous!
; The caller must ask for either strict_value, relaxed_value, or
; "strict_value if possible and relaxed_value otherwise"
value = strict_value / relaxed_value
strict_value = token / percent / RE / fcall /
single_quoted_string / double_quoted_string
; any non-whitespace sequence that does not start with one of
; the two reserved prefixes: "squid::" and "regex::"
relaxed_value = xchar *xchar ; with prefix restrictions
; The "squid::" prefix avoids fcall clashes with relaxed_values
; and allows us to add more functions later.
; We may require that fcall is the only or last directive parameter.
fcall = "squid::" name <(> OWS farguments OWS <)>
; we can support just one required argument for now
farguments = fargument *( OWS <,> OWS <fargument> )
fargument = strict_value
; TBD, but reserve "regex::" now to avoid clashes with relaxed_values
RE = seven characters "regex::" followed by a self-delimiting
sequence of characters
token = tchar *tchar
percent = alphanumeric *alphanumeric <%> ; add <.>?
single_quoted_string = <'> *[sqchar / escaped-pair] <'>
double_quoted_string = <"> *[dqchar / escaped-pair] <">
tchar = alphanumeric / <_> / <-> / <.>
xchar = any char except ascii_whitespace_char and new line
sqchar = any char except single quote and backslash
dqchar = any char except double quote and backslash
escaped-pair = backslash followed by any char except new line
OWS = *[whitespace] ; optional whitespace
RWS = whitespace OWS ; required whitespace
whitespace = *ascii_whitespace_char / comment
comment = <#> followed by all characters until the end of the line
Comments should be a # _after_ some whitespace, not instead of it;
otherwise we cannot tell foo#bar from "foo #bar".
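A toy comment stripper implementing the "# only after whitespace" rule
(stripComment() is a hypothetical name; it also treats a # at the very
start of a line as a comment, which may or may not be desired):

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Drops a trailing comment, but only when the '#' follows whitespace
// (or starts the line): "foo #bar" loses the comment, while the
// single token "foo#bar" stays intact.
static std::string stripComment(const std::string &line)
{
    for (size_t i = 0; i < line.size(); ++i) {
        if (line[i] == '#' &&
            (i == 0 || isspace((unsigned char)line[i - 1])))
            return line.substr(0, i);
    }
    return line;
}
```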
Please note that the above grammar is applied after the preprocessing
stage, which handles line continuations using backslashes, conditionals,
and ${macros}. No changes there. Unfortunately, our preprocessor cannot
tokenize because tokenization must remain context-dependent for
backward compatibility reasons.
Thank you. It is good to have this written down finally.
Amos