On 3/18/07, Asiri Rathnayake <[EMAIL PROTECTED]> wrote:
On Sun, 2007-03-18 at 14:07 +0100, Nikolai Weibull wrote:
> Then there's the parallel one that keeps track of all the states that
> we're currently in that accepts as soon as an accepting/final state is
> reached.
I think by "parallel" you are referring to the usual mechanism for simulating the NFA (on-the-fly subset construction?). Simply, we have a set of states (initially S0), and we keep consuming one symbol at a time, computing the next possible set of states (including the epsilon-closure); whenever the current set of states contains an accepting state of the original NFA, voila!
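To make sure I've understood it, here is a tiny sketch of that simulation in Python. The NFA representation (plain dicts of transitions) and the example pattern are just my own illustration for the discussion, not anything from Vim's code:

```python
# On-the-fly subset simulation of an NFA: track the set of states we
# could be in, and accept as soon as that set contains a final state.

def epsilon_closure(states, eps):
    """All states reachable from `states` via epsilon transitions."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def nfa_match(start, accept, trans, eps, text):
    """Return True as soon as the current state set hits an accepting state."""
    current = epsilon_closure({start}, eps)
    if current & accept:
        return True
    for ch in text:
        nxt = set()
        for s in current:
            nxt |= trans.get((s, ch), set())
        current = epsilon_closure(nxt, eps)
        if current & accept:
            return True
    return False

# Hand-built NFA for the pattern "ab*c": 0 -a-> 1, 1 -b-> 1, 1 -c-> 2
trans = {(0, 'a'): {1}, (1, 'b'): {1}, (1, 'c'): {2}}
print(nfa_match(0, {2}, trans, {}, "abbbc"))  # True
print(nfa_match(0, {2}, trans, {}, "abb"))    # False
```

Each input symbol costs at most O(|states|) work, so there is no backtracking blow-up.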
Yes. Quite elegant, wouldn't you say? ;-)
> And then there's the cached DFA implementation that builds the DFA
> representation of the NFA as needed while traversing the input. This
> is the most difficult one to implement, but also the fastest for most
> inputs.
Sorry, but I'm a bit confused here. Are you telling me that the current regexp engine has a cached DFA implementation? (I thought such an implementation was the aim of this project :( )
No. The current implementation is purely backtracking, with a bunch of tricks to speed it up.
> Russ Cox's article is good, but doesn't show anything new, even though
> it has been received as if it did. And he really skims over some of
> the important issues that one has to consider when using regular
> expressions for doing anything other than just accepting or rejecting
> input as being part of a regular language. Perhaps most important is
> submatch reporting, which causes all kinds of difficulties that he
> simply doesn't mention.

I agree. Just out of curiosity, I referred to the book "Compilers: Principles, Techniques and Tools" by Alfred V. Aho, Ravi Sethi and Jeffrey D. Ullman. It has a very nice explanation of NFAs, NFA->DFA, Regexp->NFA and even Regexp->DFA (I'm still trying to understand the last one). I used the article by Russ Cox to verify what I had learnt from the above book. But I believe that the article is a good starter.
Yes, it's a very good and short (which is a good thing!) summary. The book you mention above is also very good, and covers a bit more theory. I can highly recommend the following books and articles:

John E. Hopcroft and Jeffrey D. Ullman, Introduction to Automata Theory, Languages, and Computation (Addison-Wesley, Reading, Massachusetts, 1979). This is really the reference work if you care about the theory. Contains a lot of information on formal languages, of which only the regular ones are really of interest for this project.

Gonzalo Navarro, NR-grep: A fast and flexible pattern matching tool, Software--Practice and Experience 31(13), pp. 1265-1312 (2001). A very fast grep implementation. The algorithm uses bitfields to keep track of the finite automaton's state, making it very fast for many common patterns.

Alfred V. Aho, Algorithms for Finding Patterns in Strings. In Algorithms and Complexity, vol. A of Handbook of Theoretical Computer Science, chapter 5, pp. 255-300 (Elsevier Science Publishers B.V., Cambridge, Massachusetts, 1990).

This last one covers the theory especially well for this project, I'd say, and contains a very complete bibliography. I hope you won't drown in information ;-). More pointers can be found in my thesis, but those are probably the most interesting ones.
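To give a feel for the bit-parallel idea behind Navarro's work: the classic Shift-And algorithm encodes the automaton's active states as bits of a single machine word, so advancing every state on one input character is just a shift, an OR, and an AND. NR-grep builds far more on top of this; the sketch below only shows the core trick, for a plain string pattern:

```python
def shift_and(pattern, text):
    """Shift-And search: bit i of `state` is set iff pattern[:i+1]
    matches a suffix of the text read so far. Returns the start index
    of the first match, or -1 if there is none."""
    m = len(pattern)
    # Precompute a bitmask per character: bit i set iff pattern[i] == c.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    state = 0
    for pos, c in enumerate(text):
        # Shift all partial matches forward, restart a match at bit 0,
        # and keep only those consistent with the current character.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & (1 << (m - 1)):   # highest bit set: full pattern matched
            return pos - m + 1
    return -1
```

For patterns up to the word size, every text character is processed in a constant number of bitwise operations, which is why this family of matchers is so fast in practice.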
Thank you so much for the feedback. :)
No problem. I'm very interested to see where this project will go and I hope I can be of help, should you need it.

nikolai