Re: Improving regexp performance

Asiri Rathnayake Sun, 18 Mar 2007 08:46:51 -0800

On Sun, 2007-03-18 at 17:32 +0100, Nikolai Weibull wrote:
> On 3/18/07, Asiri Rathnayake <[EMAIL PROTECTED]> wrote:
> 
> > On Sun, 2007-03-18 at 14:07 +0100, Nikolai Weibull wrote:
> 
> > > Then there's the parallel one that keeps track of all the states that
> > > we're currently in that accepts as soon as an accepting/final state is
> > > reached.
> 
> > I think by parallel you are referring to the usual mechanism for
> > simulating the NFA ( on the fly subset construction ? ). Simply, we have
> > a set of states ( initially S0 ) and the we keep eating a symbol at a
> > time, calculate the next possible set of states ( including E-closure )
> > and whenever the current set of states contain an accepting state of the
> > original NFA, Voila!
> 
> Yes.  Quite elegant, wouldn't you say?  ;-)
> 
> > > And then there's the cached DFA implementation that builds the DFA
> > > representation of the NFA as needed while traversing the input.  This
> > > is the most difficult one to implement, but also the fastest for most
> > > inputs.
> 
> > Sorry, but I'm in confusion here. Are you telling me that the current
> > regxp engine has a cached DFA implementation ? ( I thought such an
> > implementation was the aim of this project :( )
> 
> No.  The current implementation is purely backtracking, with a bunch
> of tricks to speed it up.
> 
> > > Russ Cox's article is good, but doesn't show anyting new, even though
> > > it has been received like as it would.  And he really skims over some
> > > of the important issues that one has to consider when using regular
> > > expressions for doing anything other than just accepting or rejecting
> > > input as being part of a regular language.  Perhaps most important is
> > > submatch reporting, which causes all kinds of difficulties that he
> > > simply doesn't mention.
> >
> > I agree. Just out of curiosity, I referred to the book "Compilers -
> > Principles, Techniques and Tools by Alfred V. Aho, Ravi Sethi, Jeffrey
> > D. Ullman". It has a very nice explanation of NFA, NFA->DFA, Regxp->NFA
> > and even Regxp->DFA ( I'm still trying to understand the last one ).
> > I used the article by Russ Cox to verify what I have learnt from the
> > above book. But I believe that above article is a good starter.
> 
> Yes, it's a very good and short (which is a good thing!) summary.  The
> book you mention above is also very good, and covers a bit more
> theory.  I can highly recommend the following books and articles:
> 
> John E. Hopcroft and Jeﬀrey D. Ullman, Introduction to Automata Theo-
> ry, Languages, and Computation. (Addison–Wesley, Reading, Massachusetts,
> 1979).
> 
> This is really the reference work if you care about the theory.
> Contains a lot of information on formal languages, of which only the
> regular ones are really of interest for this project.
> 
> Gonzalo Navarro, NR-grep: A fast and ﬂexible pattern matching tool ,
> Software—Practice and Experience 31(13), pp. 1265–1312 (2001).
> 
> A very fast grep implementation.  The algorithm used uses bitfields to
> keep track of the finite automatons state, making it very fast for
> many common patterns.
> 
> Alfred V. Aho, Algorithms for Finding Patterns in Strings. In Algorithms
> and Complexiy, vol. A of Handbook of Theoretical Computer Science, chapter
> 5, pp. 255–300 (Elsevier Science Publishers B.V., Cambridge, Massachusetts,
> 1990).
> 
> This last one covers the theory especially well for this project, I'd
> say, and contains a very complete bibliography.
> 
> I hope you won't drown in information ;-).  More pointers can be found
> in my thesis, but those are probably the most interesting ones.
> 
> > Thank you so much for the feed back. :)
> 
> No problem.  I'm very interested to see where this project will go and
> I hope I can be of help, should you need it.


I definitely need your help! :)

- Asiri

>   nikolai

Re: Improving regexp performance

Reply via email to