On 3/18/07, Asiri Rathnayake <[EMAIL PROTECTED]> wrote:
On Sun, 2007-03-18 at 14:07 +0100, Nikolai Weibull wrote:
> Then there's the parallel one that keeps track of all the states that
> we're currently in that accepts as soon as an accepting/final state is
> reached.
I think by "parallel" you are referring to the usual mechanism for simulating the NFA (on-the-fly subset construction?). Simply, we have a set of states (initially S0), and we keep consuming one symbol at a time, computing the next possible set of states (including the epsilon-closure); whenever the current set of states contains an accepting state of the original NFA, voila!
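To make sure I've understood it, here is a tiny sketch of that simulation in Python. The NFA representation (plain dicts of transitions) and the example pattern are just my own illustration for the discussion, not anything from Vim's code:

```python
# On-the-fly subset simulation of an NFA: track the set of states we
# could be in, and accept as soon as that set contains a final state.

def epsilon_closure(states, eps):
    """All states reachable from `states` via epsilon transitions."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def nfa_match(start, accept, trans, eps, text):
    """Return True as soon as the current state set hits an accepting state."""
    current = epsilon_closure({start}, eps)
    if current & accept:
        return True
    for ch in text:
        nxt = set()
        for s in current:
            nxt |= trans.get((s, ch), set())
        current = epsilon_closure(nxt, eps)
        if current & accept:
            return True
    return False

# Hand-built NFA for the pattern "ab*c": 0 -a-> 1, 1 -b-> 1, 1 -c-> 2
trans = {(0, 'a'): {1}, (1, 'b'): {1}, (1, 'c'): {2}}
print(nfa_match(0, {2}, trans, {}, "abbbc"))  # True
print(nfa_match(0, {2}, trans, {}, "abb"))    # False
```

Each input symbol costs at most O(|states|) work, so there is no backtracking blow-up.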
Yes. Quite elegant, wouldn't you say? ;-)
> And then there's the cached DFA implementation that builds the DFA
> representation of the NFA as needed while traversing the input. This
> is the most difficult one to implement, but also the fastest for most
> inputs.
Sorry, but I'm a bit confused here. Are you telling me that the current regexp engine has a cached DFA implementation? (I thought such an implementation was the aim of this project :( )
No. The current implementation is purely backtracking, with a bunch of tricks to speed it up.
> Russ Cox's article is good, but doesn't show anything new, even though
> it has been received as if it did. And he really skims over some of
> the important issues that one has to consider when using regular
> expressions for doing anything other than just accepting or rejecting
> input as being part of a regular language. Perhaps most important is
> submatch reporting, which causes all kinds of difficulties that he
> simply doesn't mention.

I agree. Just out of curiosity, I referred to the book "Compilers: Principles, Techniques and Tools" by Alfred V. Aho, Ravi Sethi and Jeffrey D. Ullman. It has a very nice explanation of NFAs, NFA->DFA, Regexp->NFA and even Regexp->DFA (I'm still trying to understand the last one). I used the article by Russ Cox to verify what I had learnt from the above book. But I believe that the article is a good starter.
Yes, it's a very good and short (which is a good thing!) summary. The book you mention above is also very good, and covers a bit more theory. I can highly recommend the following books and articles:

John E. Hopcroft and Jeffrey D. Ullman, Introduction to Automata Theory, Languages, and Computation (Addison-Wesley, Reading, Massachusetts, 1979). This is really the reference work if you care about the theory. Contains a lot of information on formal languages, of which only the regular ones are really of interest for this project.

Gonzalo Navarro, NR-grep: A fast and flexible pattern matching tool, Software--Practice and Experience 31(13), pp. 1265-1312 (2001). A very fast grep implementation. The algorithm uses bitfields to keep track of the finite automaton's state, making it very fast for many common patterns.

Alfred V. Aho, Algorithms for Finding Patterns in Strings. In Algorithms and Complexity, vol. A of Handbook of Theoretical Computer Science, chapter 5, pp. 255-300 (Elsevier Science Publishers B.V., Cambridge, Massachusetts, 1990).

This last one covers the theory especially well for this project, I'd say, and contains a very complete bibliography. I hope you won't drown in information ;-). More pointers can be found in my thesis, but those are probably the most interesting ones.
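To give a feel for the bit-parallel idea behind Navarro's work: the classic Shift-And algorithm encodes the automaton's active states as bits of a single machine word, so advancing every state on one input character is just a shift, an OR, and an AND. NR-grep builds far more on top of this; the sketch below only shows the core trick, for a plain string pattern:

```python
def shift_and(pattern, text):
    """Shift-And search: bit i of `state` is set iff pattern[:i+1]
    matches a suffix of the text read so far. Returns the start index
    of the first match, or -1 if there is none."""
    m = len(pattern)
    # Precompute a bitmask per character: bit i set iff pattern[i] == c.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    state = 0
    for pos, c in enumerate(text):
        # Shift all partial matches forward, restart a match at bit 0,
        # and keep only those consistent with the current character.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & (1 << (m - 1)):   # highest bit set: full pattern matched
            return pos - m + 1
    return -1
```

For patterns up to the word size, every text character is processed in a constant number of bitwise operations, which is why this family of matchers is so fast in practice.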
Thank you so much for the feedback. :)
No problem. I'm very interested to see where this project will go and I hope I can be of help, should you need it.

nikolai