On 12/07/11 12:30, Alex Rousskov wrote:
On 07/05/2011 01:14 PM, Marcus Kool wrote:
Attached is a patch for optimisation of REs.
This is the second submission of the patch and the comments from
Amos' review are addressed.

This patch is inspired by the work that I did for ufdbGuard and a few
emails with Amos.

The new code optimises lists of regular expressions.
The optimisations are:
* initial .* is stripped

Initial '.'s are stripped as well. If that code survives the audit, the
change should be documented. More on that change below.

* RE-1 RE-2 ... RE-n are joined into one large RE: (RE-1)|(RE-2)|...|(RE-n)
* -i ... -i options are optimised: the second one is ignored, same for +i


+    while (*t == '.') {
+       t++;

In the above code, you remove leading "."s which may change the regular
expression. For example, the following RE will match "01234" but will
not match "1234" or "234":

     ..234

After your patch, that RE will be converted to "234" and match all three
strings. If you agree that this is a bug, please enhance your test cases
before you fix the bug.


Yes. The first version of that loop was correct:
  while (*t =='.' && *(t+1) == '*') t+=2;

+    if (*t == '\0') {
+       debugs(28, DBG_IMPORTANT, ""<<  cfg_filename<<  " line "<<  config_lineno<<  ": 
"<<  config_input_line);
+       debugs(28, DBG_IMPORTANT, "WARNING: regular expression '"<<  orig<<  "' has 
only wildcards and matches all strings." );
+        return orig;
+    }

The above feels a little inconsistent. The code adjusts RE in all other
cases but returns orig in this specific case. Should we return "^.*" or
a similar simplified RE instead?

Right. ".*" should probably be returned instead.

The catenation assembly could then detect this case and drop all other OR'ed patterns from the compound. Though I'm not asking for that in this patch, it can be an extra tuning optimization later.


+    if (t != orig) {
+       debugs(28, DBG_IMPORTANT, ""<<  cfg_filename<<  " line "<<  config_lineno<<  ": 
"<<  config_input_line);
+       debugs(28, DBG_IMPORTANT, "WARNING: regular expression '"<<  orig<<  "' has 
unnecessary wildcard(s)" );
+    }

Please report what the simplified RE will look like so that the admin
can see their mistake (and can check our simplification logic).

The same comment may apply to compileOptimisedREs(): I am not sure, bu
perhaps we should report (at level 1) any changes we apply to
user-supplied REs?


FYI: I'm using the syntax "WARNING: blah. Using XYZ instead." when adding these type of things to the config parser. With the word "Use" if Squid is not making a change but instructing the admin. The word "Using" if Squid has auto-dropped the config and taken our alternative.

Needless to say, any mistakes here may lead to hard-to-find access
control bugs in real configurations. Let's inform the user that we are
fiddling with their configurations. It is probably better to come back
and remove some of the warnings than have to come back and remove bugs
that went undiscovered for a long time.


+        debugs(28, DBG_IMPORTANT, "WARNING: optimisation of regular expressions 
failed; using fallback method without optimisation" );

Please report the optimized RE that failed to compile if possible.


Please try to post a patch that includes RegexData.cc changes and new
test cases. To do that, you would need to write test cases to call your
new (and possibly some old) functions directly, without running Squid.
There are many such test examples in Squid sources.


Thank you,

Alex.



Amos
--
Please be using
  Current Stable Squid 2.7.STABLE9 or 3.1.14
  Beta testers wanted for 3.2.0.9

Reply via email to