On 20/01/22 10:32, Alex Rousskov wrote:
Hello,

     We have a use case where a regex in squid.conf should contain/match
a new line (i.e. ASCII LF). I do not know whether there are similar use
cases with the existing squid.conf regex directives, but that is not
important because we are adding a _new_ directive that will need such
support. This email discusses the problem and proposes how to add a new
line (and other special characters) to regexes found in squid.conf and such.


With the current mix of squid.conf parsers this RFC seems irrelevant to me.

The developer designing a new directive also writes the parse_*() function that processes the config file line. All they have to do is avoid the parser functions that implicitly exhibit the problematic behaviour. The fact that there is logic imposing this problem at all is a bug to be resolved, but that is something for a different RFC.



Programming languages usually have standard mechanisms for adding
special characters to strings from which regexes are compiled. We all
know that "a\nb" uses LF byte in the C++ string literal. Other bytes can
be added as well: https://en.cppreference.com/w/cpp/language/escape


There was a plan from 2014 (re-attempted by Christos in 2016) to migrate Squid from the GNURegex dependency to the more flexible C++11 regex library, which supports many regex languages. With that plan the UI would only need an option flag or pattern prefix to specify which language a pattern uses.
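To make the "option flag or pattern prefix" idea concrete, here is a minimal sketch, assuming a hypothetical prefix syntax such as "ecmascript:" in front of the pattern. The prefix names and the compilePattern() helper are invented for illustration and are not existing Squid code; only the std::regex grammar constants come from the C++11 standard library.

```cpp
#include <regex>
#include <string>
#include <utility>

// Hypothetical squid.conf pattern syntax: "<grammar>:<pattern>".
// A pattern without a recognized prefix is compiled as POSIX extended.
std::regex compilePattern(const std::string &spec)
{
    static const std::pair<const char *, std::regex_constants::syntax_option_type> grammars[] = {
        {"ecmascript:", std::regex_constants::ECMAScript},
        {"basic:", std::regex_constants::basic},
        {"extended:", std::regex_constants::extended},
        {"awk:", std::regex_constants::awk},
        {"grep:", std::regex_constants::grep},
        {"egrep:", std::regex_constants::egrep},
    };

    for (const auto &g : grammars) {
        const std::string prefix(g.first);
        // Does spec start with this prefix?
        if (spec.compare(0, prefix.size(), prefix) == 0)
            return std::regex(spec.substr(prefix.size()), g.second);
    }
    return std::regex(spec, std::regex_constants::extended);
}
```

With the ECMAScript grammar selected, a pattern typed as ecmascript:a\nb matches an 'a', an LF byte and a 'b', because that grammar defines the \n escape itself and the config parser would not need to decode anything.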

That plan was put on hold due to feature-incomplete GCC 4.8 versions being distributed by CentOS 7 and RHEL needing to build Squid.

One Core Developer (you, Alex) has repeatedly expressed a strong opinion vetoing the addition/removal of features to Squid-6 while they are still needed by a small set of "officially supported" Vendors, RHEL and CentOS being in that set.


When combined, those two design limitations mean the C++11 regex library cannot be implemented in a Squid released prior to June 2024.



IMO that plan is still a good one for the long term. However you design your new directive UI, please make it compatible with that plan.


Unfortunately, squid.conf syntax lacks a similar general mechanism.

This is not a property of squid.conf design choices. It is an artifact of the GNURegex language.

Until Squid gets a major upgrade to support other regex languages, we are stuck with these pattern limitations.

 In
most cases, it is possible to work around that limitation by entering
special symbols directly. However, that trick causes various headaches
and does not work for new lines at all because squid.conf preprocessor
and parameter parser use/strip all new lines; the code compiling the
regular expression will simply not see any.

In POSIX regex(7), the two-character \n escape sequence is referring to
the ASCII character 'n', not the new line/LF character, so entering \n
(two characters) into a squid.conf regex value will not work if one
wants to match ASCII LF.
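The difference between the LF byte and the two-character sequence can be seen outside Squid with the POSIX regcomp()/regexec() API. This is only a sketch: posixMatches() is a made-up helper, and how a given regex library treats a backslash followed by 'n' may differ from what is described above, so the code only relies on the portable case of a real LF byte inside the pattern.

```cpp
#include <regex.h>
#include <string>

// Returns true if the POSIX extended regex `pattern` matches somewhere in `text`.
// A pattern that fails to compile is treated as "no match" in this sketch.
bool posixMatches(const std::string &pattern, const std::string &text)
{
    regex_t re;
    if (regcomp(&re, pattern.c_str(), REG_EXTENDED | REG_NOSUB) != 0)
        return false;
    const bool matched = regexec(&re, text.c_str(), 0, nullptr, 0) == 0;
    regfree(&re);
    return matched;
}

// Three bytes: 'a', LF, 'b'. The C++ compiler decodes \n before regcomp() sees it.
const std::string patternWithLfByte = "a\nb";

// Four bytes: 'a', backslash, 'n', 'b'. This is what an admin can type in squid.conf.
const std::string patternWithTwoChars = "a\\nb";
```

posixMatches(patternWithLfByte, "a\nb") succeeds because the regex compiler receives the LF byte as an ordinary character (REG_NEWLINE is not set). The squid.conf problem is that no such byte can survive the preprocessor and parameter parser.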

There are many options for adding this functionality to regexes used in
_new_ squid.conf contexts (i.e. contexts aware of this enhancement).
Here is a fairly representative sample:

1a. Recognize just \n escape sequence in squid.conf regexes
    Pros: Simple.
    Cons: Converting old regexes[1] requires careful checking[2].
    Cons: Cannot detect typos in escape sequences. \r is accepted.
    Cons: Cannot address other, similar use cases (e.g., ASCII CR).

1b. Recognize all C escape sequences in squid.conf regexes
    Pros: Can detect typos -- unsupported escape sequences.
    Cons: Poor readability: Double-escaping of all for-regex backslashes!
    Cons: Converting old regexes requires non-trivial automation.
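A minimal sketch of what the decoding step in 1b might look like, assuming only a small subset of C escapes (\n, \r, \t and \\) is supported. decodeEscapes() is a hypothetical name, not an existing Squid function. It shows both the typo detection and the double-escaping cost listed above.

```cpp
#include <stdexcept>
#include <string>

// Decodes \n, \r, \t and \\ in a raw squid.conf value before regex compilation.
// Any other character after a backslash is rejected, which is how typos are caught.
std::string decodeEscapes(const std::string &raw)
{
    std::string out;
    for (std::string::size_type i = 0; i < raw.size(); ++i) {
        if (raw[i] != '\\') {
            out += raw[i];
            continue;
        }
        if (++i == raw.size())
            throw std::invalid_argument("dangling backslash at end of value");
        switch (raw[i]) {
        case 'n': out += '\n'; break;
        case 'r': out += '\r'; break;
        case 't': out += '\t'; break;
        case '\\': out += '\\'; break;
        default:
            throw std::invalid_argument(std::string("unsupported escape: \\") + raw[i]);
        }
    }
    return out;
}
```

Under this scheme an old regex such as a\.b is rejected and has to be rewritten as a\\.b, which is the readability and conversion cost noted in the Cons.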


As you mention, these \-escapes are a feature of the POSIX Regular Expression language.

Taking this step, we will no longer be able to honestly say that Squid only supports GNU "regex" patterns. Open the floodgates and you will find a mountain of admins wanting the other POSIX features for one reason or another.

We would be better off accepting the long-ago planned migration to C++11 regex than taking more half-measures like implementing \-escape patterns ourselves.



2a. Recognize %byte{n} logformat-like sequence in squid.conf regexes
    Pros: Simple.
    Cons: Converting old regexes[1] requires careful checking[3].
    Cons: Cannot detect typos in logformat-like sequences.
    Cons: Does not support other advanced use cases (e.g., %tr).

2b. Recognize %byte{n} and logformat sequences in squid.conf regexes
    Pros: Can detect typos -- unsupported logformat sequences.
    Cons: The need to escape % in regexes will surprise admins.
    Cons: Converting old regexes requires (simple) automation.
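The %byte{n} part of 2a/2b could be sketched as follows. This covers only %byte{N} with a decimal N and %% for a literal percent sign; real logformat codes such as %tr are not implemented and are rejected here. expandByteCodes() is a hypothetical helper name.

```cpp
#include <stdexcept>
#include <string>

// Expands %byte{N} (decimal N, 0..255) and %% in a raw squid.conf value.
// Any other %-code is rejected in this sketch.
std::string expandByteCodes(const std::string &raw)
{
    static const std::string opener = "byte{";
    std::string out;
    std::string::size_type i = 0;
    while (i < raw.size()) {
        if (raw[i] != '%') {
            out += raw[i++];
            continue;
        }
        ++i; // skip '%'
        if (i < raw.size() && raw[i] == '%') {
            out += '%'; // %% stands for a literal percent sign
            ++i;
            continue;
        }
        if (raw.compare(i, opener.size(), opener) != 0)
            throw std::invalid_argument("unsupported %-code");
        i += opener.size();
        const std::string::size_type close = raw.find('}', i);
        if (close == std::string::npos || close == i || close - i > 3)
            throw std::invalid_argument("malformed %byte{...}");
        int value = 0;
        for (; i < close; ++i) {
            if (raw[i] < '0' || raw[i] > '9')
                throw std::invalid_argument("non-decimal %byte{...} value");
            value = value * 10 + (raw[i] - '0');
        }
        if (value > 255)
            throw std::invalid_argument("%byte{...} value out of range");
        out += static_cast<char>(value);
        i = close + 1; // continue after '}'
    }
    return out;
}
```

For example, a%byte{10}b expands to 'a', LF, 'b', while a regex containing a bare % now has to be written with %%, which is the surprise for admins mentioned in 2b.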


3. Use composition to combine regexes and some special strings:
    regex1 + "\n" + regex2
    or
    regex1 + %byte{10} + regex2
    Pros: Old regexes can be safely used without any conversions.
    Cons: Requires new, complex composition expressions/syntax.
    Cons: A bit difficult to read.
    Cons: Requires a lot of development.


Please no. There are enough regex languages confusing people. Let's not be responsible for creating yet another.

That is my clear "no" vote on all variants of ideas (2) and (3).


4. Use 2b but only when regex is given to a special function:
    substitute_logformat_codes(regex)
    Pros: Old regexes can be safely used without any conversions.
    Pros: New regexes do not need to escape % (by default).
    Pros: Extendable to old regex configuration contexts.
    Pros: Extendable to non-regex configuration contexts.
    Pros: Reusing the existing parameters(...)-like call syntax.
    Cons: A bit more difficult to read than 1a or 2a.
    Cons: Duplicates "quoted string" approach in some directives[4].
    Cons: Requires arguing about the new function name :-).


Or (5) Alex puts aside his objection blocking the plan to convert Squid to C++11 regex library.


Cheers
Amos