Re: Google Summer of Code 2007 : Improve regexp performance

Bram Moolenaar Fri, 16 Mar 2007 11:47:41 -0800

Nikolai Weibull wrote:

> On 3/15/07, Asiri Rathnayake <[EMAIL PROTECTED]> wrote:
> > Hi all,
> >
> > I went through the above idea presented for google summer of code 2007
> > and found it interesting. In my opinion, incorporating Thompson NFA into
> > regexp from the ground up is pretty cool. I also went through the
> > alternate TRE (http://laurikari.net/tre/index.html) but couldn't verify
> > whether its an implementation of Thompson NFA ( I will have to dig into
> > it's source and find it out ). I would like to know further details (if
> > any) regarding this project and the possible mentor(s).
> 
> It has its own finite automata, which is based on a Thompson NFA,
> adding so called /tags/ in the mix for doing subgroup matching and
> fuzzy matching.
> 
> It comes with three matchers/drivers: A backtracking one for when the
> pattern requires it (backrefences), a multithreaded/parallel one, that
> is, the standard two tables of active states one, and one for fuzzy
> matching (approximate matching).
> 
> There's a nice paper about it as well (too actually, if you count the
> first one he wrote about his first implementation for .
> 
> I actually wrote a simplification of his library, removing the
> approximate matching stuff, as part of my master's, which is well
> documented.  I still haven't had time to put up the PDF, though.


Interesting.  Although when the documentation is not available I would
call it undocumented :-).

> Sadly, both his and my implementation suffer from a bug in the
> subgroup matching of "opposing" repetitions, which would have to be
> fixed.  Neither of us have been able to do so yet.
> 
> Anyway, it would take an immense amount of work to turn Vim onto a new
> regex implementation.  Vim has a whole range of its own stuff, like
> matching cursor positions and so on, and is tightly bound to the
> buffer implementation with its memlines and whatnot.  Not to
> dishearten you, but I don't think this is a project that can be
> completed over a summer (not that it has to be, but you may want to
> keep that in mind).

The idea is that, when compiling the regexp, you check for special items
that are not supported by the new/fast code.  If any is found, fall back
to the old/slow code.  This way you can start with something simple and
add more features over time.  Most patterns don't use back references or
match with the cursor position, thus the new/fast code would be used in
most cases.

Matching with text in a Vim buffer is required, can't do much without
it.

-- 
Witches prefer brooms: vacuum-cleaners need extension cords!

 /// Bram Moolenaar -- [EMAIL PROTECTED] -- http://www.Moolenaar.net   \\\
///        sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\        download, build and distribute -- http://www.A-A-P.org        ///
 \\\            help me help AIDS victims -- http://ICCF-Holland.org    ///

Re: Google Summer of Code 2007 : Improve regexp performance

Reply via email to