Re: Improving regexp performance

Asiri Rathnayake Sun, 18 Mar 2007 06:37:22 -0800

On Sun, 2007-03-18 at 14:07 +0100, Nikolai Weibull wrote:
> On 3/18/07, Asiri Rathnayake <[EMAIL PROTECTED]> wrote:
> 
> > I read several articles on possible NFA implementations ( took some time
> > to understand ). But I think implementing an NFA based approach is quite
> > possible ( Russ Cox has provided a sample implementation in his
> > article ). But I'm having a hard time understanding the existing
> > approach taken :'( .
> 
> The current implementation is also an NFA-based one.  What's more
> interesting is how the NFA is simulated.  The current implementation
> uses backtracking, that is, it tries a solution and checks if it
> works.  If so, then accept.  Otherwise, try the next one.
> 
> Then there's the parallel one that keeps track of all the states that
> we're currently in that accepts as soon as an accepting/final state is
> reached.


I think by parallel you are referring to the usual mechanism for
simulating the NFA ( on the fly subset construction ? ). Simply, we have
a set of states ( initially S0 ) and the we keep eating a symbol at a
time, calculate the next possible set of states ( including E-closure )
and whenever the current set of states contain an accepting state of the
original NFA, Voila!

> And then there's the cached DFA implementation that builds the DFA
> representation of the NFA as needed while traversing the input.  This
> is the most difficult one to implement, but also the fastest for most
> inputs.

Sorry, but I'm in confusion here. Are you telling me that the current
regxp engine has a cached DFA implementation ? ( I thought such an
implementation was the aim of this project :( )


> Russ Cox's article is good, but doesn't show anyting new, even though
> it has been received like as it would.  And he really skims over some
> of the important issues that one has to consider when using regular
> expressions for doing anything other than just accepting or rejecting
> input as being part of a regular language.  Perhaps most important is
> submatch reporting, which causes all kinds of difficulties that he
> simply doesn't mention.

I agree. Just out of curiosity, I referred to the book "Compilers -
Principles, Techniques and Tools by Alfred V. Aho, Ravi Sethi, Jeffrey
D. Ullman". It has a very nice explanation of NFA, NFA->DFA, Regxp->NFA
and even Regxp->DFA ( I'm still trying to understand the last one ).
I used the article by Russ Cox to verify what I have learnt from the
above book. But I believe that above article is a good starter.

> 
>   nikolai

Thank you so much for the feed back. :)

- Asiri

Re: Improving regexp performance

Reply via email to