Hello,

We have a use case where a regex in squid.conf should contain/match a
new line (i.e. ASCII LF). I do not know whether there are similar use
cases with the existing squid.conf regex directives, but that is not
important because we are adding a _new_ directive that will need such
support. This email discusses the problem and proposes how to add a new
line (and other special characters) to regexes found in squid.conf and
such.

Programming languages usually have standard mechanisms for adding
special characters to strings from which regexes are compiled. We all
know that the C++ string literal "a\nb" contains an LF byte. Other bytes
can be added as well:
https://en.cppreference.com/w/cpp/language/escape

Unfortunately, squid.conf syntax lacks a similar general mechanism. In
most cases, it is possible to work around that limitation by entering
special symbols directly. However, that trick causes various headaches
and does not work for new lines at all because the squid.conf
preprocessor and parameter parser use/strip all new lines; the code
compiling the regular expression will simply not see any.

In POSIX regex(7), the two-character \n escape sequence refers to the
ASCII character 'n', not the new line/LF character, so entering \n (two
characters) into a squid.conf regex value will not work if one wants to
match ASCII LF.

There are many options for adding this functionality to regexes used in
_new_ squid.conf contexts (i.e. contexts aware of this enhancement).
Here is a fairly representative sample:

1a. Recognize just \n escape sequence in squid.conf regexes
    Pros: Simple.
    Cons: Converting old regexes[1] requires careful checking[2].
    Cons: Cannot detect typos in escape sequences (e.g., \r is
          silently accepted).
    Cons: Cannot address other, similar use cases (e.g., ASCII CR).

1b. Recognize all C escape sequences in squid.conf regexes
    Pros: Can detect typos -- unsupported escape sequences.
    Cons: Poor readability: Double-escaping of all for-regex backslashes!
    Cons: Converting old regexes requires non-trivial automation.

2a. Recognize %byte{n} logformat-like sequence in squid.conf regexes
    Pros: Simple.
    Cons: Converting old regexes[1] requires careful checking[3].
    Cons: Cannot detect typos in logformat-like sequences.
    Cons: Does not support other advanced use cases (e.g., %tr).

2b. Recognize %byte{n} and logformat sequences in squid.conf regexes
    Pros: Can detect typos -- unsupported logformat sequences.
    Cons: The need to escape % in regexes will surprise admins.
    Cons: Converting old regexes requires (simple) automation.

3. Use composition to combine regexes and some special strings:
       regex1 + "\n" + regex2
   or
       regex1 + %byte{10} + regex2
   Pros: Old regexes can be safely used without any conversions.
   Cons: Requires new, complex composition expressions/syntax.
   Cons: A bit difficult to read.
   Cons: Requires a lot of development.

4. Use 2b but only when regex is given to a special function:
       substitute_logformat_codes(regex)
   Pros: Old regexes can be safely used without any conversions.
   Pros: New regexes do not need to escape % (by default).
   Pros: Extendable to old regex configuration contexts.
   Pros: Extendable to non-regex configuration contexts.
   Pros: Reusing the existing parameters(...)-like call syntax.
   Cons: A bit more difficult to read than 1a or 2a.
   Cons: Duplicates "quoted string" approach in some directives[4].
   Cons: Requires arguing about the new function name :-).

Given all the pros and cons, I think we should use option 4 above. Do
you see any better options?

Thank you,

Alex.

[1] We are still talking about new configuration contexts here, but we
should still be thinking about bringing old regexes into new contexts
and, eventually, possibly even about upgrading old contexts to support
the new feature. Neither is required or urgent, but it would be nice to
eventually end up with a uniform regex approach, of course.

[2] In most cases, no old regexes will contain the \n sequence because
it means nothing to the regex compiler. A few exceptional regexes can be
edited manually. Automated conversion will be required only in some rare
cases.

[3] Essentially the same as [2] above: Old regexes are unlikely to
contain the %byte 5-character sequence (or whatever we end up calling
that special sequence -- we are polishing a PR that adds %byte support
to logformat).

[4] Several existing squid.conf directives interpret "quoted values"
specially, substituting embedded logformat %codes. Arguably, the
explicit function call mechanism is better because there is less
confusion regarding which context supports it and which does not. And we
probably should not "quote regexes" because many old regexes contain
double quotes already.

_______________________________________________
squid-dev mailing list
squid-dev@lists.squid-cache.org
http://lists.squid-cache.org/listinfo/squid-dev
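
To make option 4 a bit more concrete, here is a minimal C++ sketch of the
substitution step it implies. Everything in it is an assumption made for
illustration, not existing Squid code: the names substituteByteCodes()
and matches() are hypothetical, only %byte{N} (decimal N) and %% are
handled rather than the full set of logformat %codes, and byte 0 is
rejected because regcomp() takes a NUL-terminated pattern. The point is
only to show that the regex compiler sees a real LF byte once the
substitution has run, and that unsupported %-sequences can be reported
as typos.

```cpp
#include <regex.h>   // POSIX regcomp(), regexec(), regfree()

#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: expands %byte{N} and %% inside a raw regex value.
// Any other %-sequence is treated as a typo and rejected.
std::string substituteByteCodes(const std::string &raw)
{
    static const std::string opener = "%byte{";
    std::string out;
    for (std::size_t i = 0; i < raw.size();) {
        if (raw[i] != '%') {            // ordinary character: copy as-is
            out += raw[i++];
            continue;
        }
        if (raw.compare(i, 2, "%%") == 0) {   // %% stands for a literal %
            out += '%';
            i += 2;
            continue;
        }
        if (raw.compare(i, opener.size(), opener) != 0)
            throw std::invalid_argument("unsupported %-sequence in: " + raw);

        const std::size_t start = i + opener.size();
        const std::size_t close = raw.find('}', start);
        if (close == std::string::npos || close == start || close - start > 3)
            throw std::invalid_argument("malformed %byte{} in: " + raw);

        int value = 0;
        for (std::size_t j = start; j < close; ++j) {
            if (raw[j] < '0' || raw[j] > '9')
                throw std::invalid_argument("non-decimal %byte{} in: " + raw);
            value = value * 10 + (raw[j] - '0');
        }
        // Assumption: NUL is excluded because regcomp() takes a C string.
        if (value < 1 || value > 255)
            throw std::invalid_argument("%byte{} out of range in: " + raw);

        out += static_cast<char>(value);
        i = close + 1;
    }
    return out;
}

// Hypothetical helper: compiles an extended POSIX regex and tests one string.
bool matches(const std::string &pattern, const std::string &text)
{
    regex_t re;
    if (regcomp(&re, pattern.c_str(), REG_EXTENDED | REG_NOSUB) != 0)
        throw std::invalid_argument("cannot compile regex: " + pattern);
    const bool found = regexec(&re, text.c_str(), 0, nullptr, 0) == 0;
    regfree(&re);
    return found;
}
```

With this sketch, a configured value such as ^a%byte{10}b$ would be
turned into a pattern containing a raw LF before compilation, while a
mistyped %bite{10} would be rejected instead of being silently passed to
the regex compiler.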