On Mon, Jul 6, 2009 at 7:43 AM, Andrew Garrett<[email protected]> wrote:
> Yes.
>
> We wouldn't allow direct searching from the web interface with regexes
> for two related reasons:
>
> 1/ A single search with the most basic of regexes would take several
> minutes, if not hours. It isn't computationally trivial to search for
> a small string in a very large string of over 10 GB, let alone a
> regex. Words can be indexed, regexes cannot.
>
> 2/ Even if we could find a way to make the former performant, a
> malicious regex could significantly expand this time taken, leading to
> a denial of service.

I seem to recall Gregory Maxwell describing a setup that made this
feasible, given the appropriate amount of dedicated hardware.  It was
run with the entire database in memory; it only permitted "real"
regular expressions (compilable to finite-state machines, no
backreferences etc.); and it limited the length of the finite-state
machine generated.  Running a regex took several minutes, but he'd run
a lot of them in parallel, since it was mostly memory-bound, so he got
fairly good throughout.  Something like that.

But probably not practical without an undue amount of effort and
hardware, yeah.  :)

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Reply via email to