Re: Google Summer of Code 2007 : Improve regexp performance

2007-03-16 Thread Asiri Rathnayake
On Fri, 2007-03-16 at 14:08 +0100, Martin Stubenschrott wrote:
> On Thu, Mar 15, 2007 at 11:28:45PM +0530, Asiri Rathnayake wrote:
> > Hi all,
> > 
> > I went through the above idea presented for google summer of code 2007
> > and found it interesting. In my opinion, incorporating Thompson NFA into
> > regexp from the ground up is pretty cool. I also went through the
> > alternate TRE (http://laurikari.net/tre/index.html) but couldn't verify
> > whether its an implementation of Thompson NFA ( I will have to dig into
> > it's source and find it out ). I would like to know further details (if
> > any) regarding this project and the possible mentor(s).
> 
> Wow, yesterday I read the announcement of vim's suggested SoC ideas, and
> I thought, it would be that cool, if the regexp thing would be taken,
> and here you are :) If you are accepted by Bram/Google I really wish you
> luck, and hope, that this one could really speed up things like syntax
> highlighting which can be quite slow for special files even on fast PCs.

Thanks, but i must get familiar with the subject ( which is very broad )
and Vim regxp implementation before applying to the project (  working
on it :) ). I won't apply if I find things too much for me ( but I
believe I can handle, so far it looks good ). I'll try my best.

- Asiri

> regrads,
> 
> Martin



Re: Google Summer of Code 2007 : Improve regexp performance

2007-03-16 Thread Nikolai Weibull

On 3/16/07, Nikolai Weibull <[EMAIL PROTECTED]> wrote:


On 3/16/07, Bram Moolenaar <[EMAIL PROTECTED]> wrote:



> Nikolai Weibull wrote:



> > I actually wrote a simplification of his library, removing the
> > approximate matching stuff, as part of my master's, which is well
> > documented.  I still haven't had time to put up the PDF, though.

> Interesting.  Although when the documentation is not available I would
> call it undocumented :-).


Argh!  Forgot (temporary) link.

http://bitwi.se/masters-project.pdf

It's also about text editors for anyone who's interested.

I was going to cannibalize some smaller articles from it, but never
got around to it.  I also like it like it is, although my supervisors
didn't.  Let's hope some of you do ;-).  I'm especially pleased with
the cover.

 nikolai


Re: Google Summer of Code 2007 : Improve regexp performance

2007-03-16 Thread Nikolai Weibull

On 3/16/07, Bram Moolenaar <[EMAIL PROTECTED]> wrote:


Nikolai Weibull wrote:



> I actually wrote a simplification of his library, removing the
> approximate matching stuff, as part of my master's, which is well
> documented.  I still haven't had time to put up the PDF, though.



Interesting.  Although when the documentation is not available I would
call it undocumented :-).



> Anyway, it would take an immense amount of work to turn Vim onto a new
> regex implementation.  Vim has a whole range of its own stuff, like
> matching cursor positions and so on, and is tightly bound to the
> buffer implementation with its memlines and whatnot.  Not to
> dishearten you, but I don't think this is a project that can be
> completed over a summer (not that it has to be, but you may want to
> keep that in mind).

The idea is that, when compiling the regexp, you check for special items
that are not supported by the new/fast code.  If any is found, fall back
to the old/slow code.  This way you can start with something simple and
add more features over time.  Most patterns don't use back references or
match with the cursor position, thus the new/fast code would be used in
most cases.


Yes, that's a good course of action, and is what most regex libraries
should be doing in my opinion.  It's what Tcl's does, which is,
coincidentally, also written by Henry Spencer (the same guy who wrote
the initial regex implementation that Vim uses).


Matching with text in a Vim buffer is required, can't do much without
it.


No, but what I meant was that most regex libraries expect a char * as
input.  As you yourself has stated, adding support for multiline
matching in the current implementation was nontrivial.  However, that
may mostly be due to backtracking issues, and perhaps it may be easier
to interface with some other implementation.

 nikolai


Re: Google Summer of Code 2007 : Improve regexp performance

2007-03-16 Thread Bram Moolenaar

Nikolai Weibull wrote:

> On 3/15/07, Asiri Rathnayake <[EMAIL PROTECTED]> wrote:
> > Hi all,
> >
> > I went through the above idea presented for google summer of code 2007
> > and found it interesting. In my opinion, incorporating Thompson NFA into
> > regexp from the ground up is pretty cool. I also went through the
> > alternate TRE (http://laurikari.net/tre/index.html) but couldn't verify
> > whether its an implementation of Thompson NFA ( I will have to dig into
> > it's source and find it out ). I would like to know further details (if
> > any) regarding this project and the possible mentor(s).
> 
> It has its own finite automata, which is based on a Thompson NFA,
> adding so called /tags/ in the mix for doing subgroup matching and
> fuzzy matching.
> 
> It comes with three matchers/drivers: A backtracking one for when the
> pattern requires it (backrefences), a multithreaded/parallel one, that
> is, the standard two tables of active states one, and one for fuzzy
> matching (approximate matching).
> 
> There's a nice paper about it as well (too actually, if you count the
> first one he wrote about his first implementation for .
> 
> I actually wrote a simplification of his library, removing the
> approximate matching stuff, as part of my master's, which is well
> documented.  I still haven't had time to put up the PDF, though.

Interesting.  Although when the documentation is not available I would
call it undocumented :-).

> Sadly, both his and my implementation suffer from a bug in the
> subgroup matching of "opposing" repetitions, which would have to be
> fixed.  Neither of us have been able to do so yet.
> 
> Anyway, it would take an immense amount of work to turn Vim onto a new
> regex implementation.  Vim has a whole range of its own stuff, like
> matching cursor positions and so on, and is tightly bound to the
> buffer implementation with its memlines and whatnot.  Not to
> dishearten you, but I don't think this is a project that can be
> completed over a summer (not that it has to be, but you may want to
> keep that in mind).

The idea is that, when compiling the regexp, you check for special items
that are not supported by the new/fast code.  If any is found, fall back
to the old/slow code.  This way you can start with something simple and
add more features over time.  Most patterns don't use back references or
match with the cursor position, thus the new/fast code would be used in
most cases.

Matching with text in a Vim buffer is required, can't do much without
it.

-- 
Witches prefer brooms: vacuum-cleaners need extension cords!

 /// Bram Moolenaar -- [EMAIL PROTECTED] -- http://www.Moolenaar.net   \\\
///sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\download, build and distribute -- http://www.A-A-P.org///
 \\\help me help AIDS victims -- http://ICCF-Holland.org///


Re: Google Summer of Code 2007 : Improve regexp performance

2007-03-16 Thread Martin Stubenschrott
On Thu, Mar 15, 2007 at 11:28:45PM +0530, Asiri Rathnayake wrote:
> Hi all,
> 
> I went through the above idea presented for google summer of code 2007
> and found it interesting. In my opinion, incorporating Thompson NFA into
> regexp from the ground up is pretty cool. I also went through the
> alternate TRE (http://laurikari.net/tre/index.html) but couldn't verify
> whether its an implementation of Thompson NFA ( I will have to dig into
> it's source and find it out ). I would like to know further details (if
> any) regarding this project and the possible mentor(s).

Wow, yesterday I read the announcement of vim's suggested SoC ideas, and
I thought, it would be that cool, if the regexp thing would be taken,
and here you are :) If you are accepted by Bram/Google I really wish you
luck, and hope, that this one could really speed up things like syntax
highlighting which can be quite slow for special files even on fast PCs.

regrads,

Martin


Re: Google Summer of Code 2007 : Improve regexp performance

2007-03-16 Thread Asiri Rathnayake
On Fri, 2007-03-16 at 12:03 +0100, Nikolai Weibull wrote:
> On 3/16/07, Asiri Rathnayake <[EMAIL PROTECTED]> wrote:
> > On Fri, 2007-03-16 at 10:44 +0100, Nicolas Weber wrote:
> 
> > > > A multithreaded matcher might be useful to vim but do we have a
> > > > need for
> > > > fuzzy matching ?
> 
> > > vim doesn't use threads at the moment, and Bram seems to be reluctant
> > > to include threads for various reasons.
> 
> > I too agree, threading might complicate everything ( and a rather
> > difficult programing model )...
> 
> Again, multithreading as in parallel, as in being able to be in
> multiple states at once.  Not pthreads or anything like that.  Sorry
> about confusion.  But this may also indicate that more knowledge about
> what you're actually about to undertake may be necessary before
> setting out.  No offense intended, but the area of formal languages
> (that includes regular expressions, regular languages, pattern
> matching, and finite automata) isn't exactly the easiest of topics in
> computer science, and having a good understanding of this topic really
> is necessary to work on things like that discussed in this thread; so
> I'd recommend seeing to it that you have the theory nailed down before
> writing code or anything like that.

Nicolai,

Yes i do understand your point. I understood what you said about being
in several states at once, but i was referring to nicolas's comments.
Nevertheless, I must accept that i'm still a newbie to this subject
area. Anyway, that's the challenge i want to take, i've been into OOP
and java for sometime but now my interests are in compiler theory and
operating systems ( I think they are more challenging ).

As you said, I have to cover a lot of grounds. My intention was to
gather information on the depth of the subject and determine whether
it's doable before applying to it.

I will try my best to gain a foothold on the subject before 24th march
( final date for applications ).

BTW, your comments are very much appreciated, no offense taken. :)

- Asiri

>   nikolai



Re: Google Summer of Code 2007 : Improve regexp performance

2007-03-16 Thread Nikolai Weibull

On 3/16/07, Asiri Rathnayake <[EMAIL PROTECTED]> wrote:

On Fri, 2007-03-16 at 10:44 +0100, Nicolas Weber wrote:



> > A multithreaded matcher might be useful to vim but do we have a
> > need for
> > fuzzy matching ?



> vim doesn't use threads at the moment, and Bram seems to be reluctant
> to include threads for various reasons.



I too agree, threading might complicate everything ( and a rather
difficult programing model )...


Again, multithreading as in parallel, as in being able to be in
multiple states at once.  Not pthreads or anything like that.  Sorry
about confusion.  But this may also indicate that more knowledge about
what you're actually about to undertake may be necessary before
setting out.  No offense intended, but the area of formal languages
(that includes regular expressions, regular languages, pattern
matching, and finite automata) isn't exactly the easiest of topics in
computer science, and having a good understanding of this topic really
is necessary to work on things like that discussed in this thread; so
I'd recommend seeing to it that you have the theory nailed down before
writing code or anything like that.

 nikolai


Re: Google Summer of Code 2007 : Improve regexp performance

2007-03-16 Thread Asiri Rathnayake
On Fri, 2007-03-16 at 10:44 +0100, Nicolas Weber wrote:
> Hi,
> 
> > A multithreaded matcher might be useful to vim but do we have a  
> > need for
> > fuzzy matching ?
> 
> vim doesn't use threads at the moment, and Bram seems to be reluctant  
> to include threads for various reasons.

I too agree, threading might complicate everything ( and a rather
difficult programing model )...

- Asiri

> Bye,
> Nico




Re: Google Summer of Code 2007 : Improve regexp performance

2007-03-16 Thread Nicolas Weber

Hi,

A multithreaded matcher might be useful to vim but do we have a  
need for

fuzzy matching ?


vim doesn't use threads at the moment, and Bram seems to be reluctant  
to include threads for various reasons.


Bye,
Nico



Re: Google Summer of Code 2007 : Improve regexp performance

2007-03-16 Thread Nikolai Weibull

On 3/16/07, Mikolaj Machowski <[EMAIL PROTECTED]> wrote:


Dnia piątek 16 marzec 2007, Asiri Rathnayake napisał:



> A multithreaded matcher might be useful to vim but do we have a need for
> fuzzy matching ?



For main regexp engine I think not. But it could be useful for command
line completions (think zsh).


I did write multithreading/parallel matcher.  The matcher is
multithreaded in the sense that it can be in many states at once.  It
does not mean that matching is run in separate application threads.
Being able to be in many states at once is what makes it
non-deterministic, by the way.  It also means that it will run faster
than backtracking matchers in most cases, as it never has to
second-guess itself.

 nikolai


Re: Google Summer of Code 2007 : Improve regexp performance

2007-03-15 Thread Mikolaj Machowski
Dnia piątek 16 marzec 2007, Asiri Rathnayake napisał:
>
> A multithreaded matcher might be useful to vim but do we have a need for
> fuzzy matching ?
>
For main regexp engine I think not. But it could be useful for command
line completions (think zsh).

m.



Re: Google Summer of Code 2007 : Improve regexp performance

2007-03-15 Thread Asiri Rathnayake
On Thu, 2007-03-15 at 20:44 +0100, Bram Moolenaar wrote:

Hi Bram,

I'm going through the regxp code now. I'll first have to determine a way
to incorporate the fast matching code into the current code base without
breaking existing functionality.

Thanks.

- Asiri

> Asiri Rathnayake wrote:
> 
> > I went through the above idea presented for google summer of code 2007
> > and found it interesting. In my opinion, incorporating Thompson NFA into
> > regexp from the ground up is pretty cool. I also went through the
> > alternate TRE (http://laurikari.net/tre/index.html) but couldn't verify
> > whether its an implementation of Thompson NFA ( I will have to dig into
> > it's source and find it out ). I would like to know further details (if
> > any) regarding this project and the possible mentor(s).
> 
> I'll be the mentor.  Russ Cox may provide background info for the regexp
> theories.
> 
> There are no more details to be mentioned.  You can also look at the
> current regexp code in Vim, of course.
> 
> I'll do an announcement soon.  The shell server is down again, can't
> change the web page right now :-(.
> 



Re: Google Summer of Code 2007 : Improve regexp performance

2007-03-15 Thread Asiri Rathnayake
On Thu, 2007-03-15 at 23:35 +0100, Nikolai Weibull wrote:

nikolai,

> On 3/15/07, Asiri Rathnayake <[EMAIL PROTECTED]> wrote:
> > Hi all,
> >
> > I went through the above idea presented for google summer of code 2007
> > and found it interesting. In my opinion, incorporating Thompson NFA into
> > regexp from the ground up is pretty cool. I also went through the
> > alternate TRE (http://laurikari.net/tre/index.html) but couldn't verify
> > whether its an implementation of Thompson NFA ( I will have to dig into
> > it's source and find it out ). I would like to know further details (if
> > any) regarding this project and the possible mentor(s).
> 
> It has its own finite automata, which is based on a Thompson NFA,
> adding so called /tags/ in the mix for doing subgroup matching and
> fuzzy matching.
> 
> It comes with three matchers/drivers: A backtracking one for when the
> pattern requires it (backrefences), a multithreaded/parallel one, that
> is, the standard two tables of active states one, and one for fuzzy
> matching (approximate matching).

A multithreaded matcher might be useful to vim but do we have a need for
fuzzy matching ?

> There's a nice paper about it as well (too actually, if you count the
> first one he wrote about his first implementation for .
> 
> I actually wrote a simplification of his library, removing the
> approximate matching stuff, as part of my master's, which is well
> documented.  I still haven't had time to put up the PDF, though.

Also, it would be great if you can publish your simplified version of
TRE and the relevant documentation. :)

> Sadly, both his and my implementation suffer from a bug in the
> subgroup matching of "opposing" repetitions, which would have to be
> fixed.  Neither of us have been able to do so yet.
> 
> Anyway, it would take an immense amount of work to turn Vim onto a new
> regex implementation.  Vim has a whole range of its own stuff, like
> matching cursor positions and so on, and is tightly bound to the
> buffer implementation with its memlines and whatnot.  Not to
> dishearten you, but I don't think this is a project that can be
> completed over a summer (not that it has to be, but you may want to
> keep that in mind).
> 
> Also, there's a TDFA (tagged, deterministic finite automaton)
> implementation written in Haskell
> 
> http://www.mail-archive.com/glasgow-haskell-users@haskell.org/msg11442.html
> 
> Quite cool.
> 
>   nikolai

Thanks a bunch for the information and the advices, I'll give my best
attempts to finish the project in the summer, if it takes longer, so be
it.

- Asiri



Re: Google Summer of Code 2007 : Improve regexp performance

2007-03-15 Thread Nikolai Weibull

On 3/15/07, Asiri Rathnayake <[EMAIL PROTECTED]> wrote:

Hi all,

I went through the above idea presented for google summer of code 2007
and found it interesting. In my opinion, incorporating Thompson NFA into
regexp from the ground up is pretty cool. I also went through the
alternate TRE (http://laurikari.net/tre/index.html) but couldn't verify
whether its an implementation of Thompson NFA ( I will have to dig into
it's source and find it out ). I would like to know further details (if
any) regarding this project and the possible mentor(s).


It has its own finite automata, which is based on a Thompson NFA,
adding so called /tags/ in the mix for doing subgroup matching and
fuzzy matching.

It comes with three matchers/drivers: A backtracking one for when the
pattern requires it (backrefences), a multithreaded/parallel one, that
is, the standard two tables of active states one, and one for fuzzy
matching (approximate matching).

There's a nice paper about it as well (too actually, if you count the
first one he wrote about his first implementation for .

I actually wrote a simplification of his library, removing the
approximate matching stuff, as part of my master's, which is well
documented.  I still haven't had time to put up the PDF, though.

Sadly, both his and my implementation suffer from a bug in the
subgroup matching of "opposing" repetitions, which would have to be
fixed.  Neither of us have been able to do so yet.

Anyway, it would take an immense amount of work to turn Vim onto a new
regex implementation.  Vim has a whole range of its own stuff, like
matching cursor positions and so on, and is tightly bound to the
buffer implementation with its memlines and whatnot.  Not to
dishearten you, but I don't think this is a project that can be
completed over a summer (not that it has to be, but you may want to
keep that in mind).

Also, there's a TDFA (tagged, deterministic finite automaton)
implementation written in Haskell

http://www.mail-archive.com/glasgow-haskell-users@haskell.org/msg11442.html

Quite cool.

 nikolai


Re: Google Summer of Code 2007 : Improve regexp performance

2007-03-15 Thread Bram Moolenaar

Asiri Rathnayake wrote:

> I went through the above idea presented for google summer of code 2007
> and found it interesting. In my opinion, incorporating Thompson NFA into
> regexp from the ground up is pretty cool. I also went through the
> alternate TRE (http://laurikari.net/tre/index.html) but couldn't verify
> whether its an implementation of Thompson NFA ( I will have to dig into
> it's source and find it out ). I would like to know further details (if
> any) regarding this project and the possible mentor(s).

I'll be the mentor.  Russ Cox may provide background info for the regexp
theories.

There are no more details to be mentioned.  You can also look at the
current regexp code in Vim, of course.

I'll do an announcement soon.  The shell server is down again, can't
change the web page right now :-(.

-- 
ARTHUR:  Well, I can't just call you `Man'.
DENNIS:  Well, you could say `Dennis'.
ARTHUR:  Well, I didn't know you were called `Dennis.'
DENNIS:  Well, you didn't bother to find out, did you?
  The Quest for the Holy Grail (Monty Python)

 /// Bram Moolenaar -- [EMAIL PROTECTED] -- http://www.Moolenaar.net   \\\
///sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\download, build and distribute -- http://www.A-A-P.org///
 \\\help me help AIDS victims -- http://ICCF-Holland.org///