Re: Google Summer of Code 2007 : Improve regexp performance
On Fri, 2007-03-16 at 14:08 +0100, Martin Stubenschrott wrote: > On Thu, Mar 15, 2007 at 11:28:45PM +0530, Asiri Rathnayake wrote: > > Hi all, > > > > I went through the above idea presented for google summer of code 2007 > > and found it interesting. In my opinion, incorporating Thompson NFA into > > regexp from the ground up is pretty cool. I also went through the > > alternate TRE (http://laurikari.net/tre/index.html) but couldn't verify > > whether its an implementation of Thompson NFA ( I will have to dig into > > it's source and find it out ). I would like to know further details (if > > any) regarding this project and the possible mentor(s). > > Wow, yesterday I read the announcement of vim's suggested SoC ideas, and > I thought, it would be that cool, if the regexp thing would be taken, > and here you are :) If you are accepted by Bram/Google I really wish you > luck, and hope, that this one could really speed up things like syntax > highlighting which can be quite slow for special files even on fast PCs. Thanks, but i must get familiar with the subject ( which is very broad ) and Vim regxp implementation before applying to the project ( working on it :) ). I won't apply if I find things too much for me ( but I believe I can handle, so far it looks good ). I'll try my best. - Asiri > regrads, > > Martin
Re: Google Summer of Code 2007 : Improve regexp performance
On 3/16/07, Nikolai Weibull <[EMAIL PROTECTED]> wrote: On 3/16/07, Bram Moolenaar <[EMAIL PROTECTED]> wrote: > Nikolai Weibull wrote: > > I actually wrote a simplification of his library, removing the > > approximate matching stuff, as part of my master's, which is well > > documented. I still haven't had time to put up the PDF, though. > Interesting. Although when the documentation is not available I would > call it undocumented :-). Argh! Forgot (temporary) link. http://bitwi.se/masters-project.pdf It's also about text editors for anyone who's interested. I was going to cannibalize some smaller articles from it, but never got around to it. I also like it like it is, although my supervisors didn't. Let's hope some of you do ;-). I'm especially pleased with the cover. nikolai
Re: Google Summer of Code 2007 : Improve regexp performance
On 3/16/07, Bram Moolenaar <[EMAIL PROTECTED]> wrote: Nikolai Weibull wrote: > I actually wrote a simplification of his library, removing the > approximate matching stuff, as part of my master's, which is well > documented. I still haven't had time to put up the PDF, though. Interesting. Although when the documentation is not available I would call it undocumented :-). > Anyway, it would take an immense amount of work to turn Vim onto a new > regex implementation. Vim has a whole range of its own stuff, like > matching cursor positions and so on, and is tightly bound to the > buffer implementation with its memlines and whatnot. Not to > dishearten you, but I don't think this is a project that can be > completed over a summer (not that it has to be, but you may want to > keep that in mind). The idea is that, when compiling the regexp, you check for special items that are not supported by the new/fast code. If any is found, fall back to the old/slow code. This way you can start with something simple and add more features over time. Most patterns don't use back references or match with the cursor position, thus the new/fast code would be used in most cases. Yes, that's a good course of action, and is what most regex libraries should be doing in my opinion. It's what Tcl's does, which is, coincidentally, also written by Henry Spencer (the same guy who wrote the initial regex implementation that Vim uses). Matching with text in a Vim buffer is required, can't do much without it. No, but what I meant was that most regex libraries expect a char * as input. As you yourself has stated, adding support for multiline matching in the current implementation was nontrivial. However, that may mostly be due to backtracking issues, and perhaps it may be easier to interface with some other implementation. nikolai
Re: Google Summer of Code 2007 : Improve regexp performance
Nikolai Weibull wrote: > On 3/15/07, Asiri Rathnayake <[EMAIL PROTECTED]> wrote: > > Hi all, > > > > I went through the above idea presented for google summer of code 2007 > > and found it interesting. In my opinion, incorporating Thompson NFA into > > regexp from the ground up is pretty cool. I also went through the > > alternate TRE (http://laurikari.net/tre/index.html) but couldn't verify > > whether its an implementation of Thompson NFA ( I will have to dig into > > it's source and find it out ). I would like to know further details (if > > any) regarding this project and the possible mentor(s). > > It has its own finite automata, which is based on a Thompson NFA, > adding so called /tags/ in the mix for doing subgroup matching and > fuzzy matching. > > It comes with three matchers/drivers: A backtracking one for when the > pattern requires it (backrefences), a multithreaded/parallel one, that > is, the standard two tables of active states one, and one for fuzzy > matching (approximate matching). > > There's a nice paper about it as well (too actually, if you count the > first one he wrote about his first implementation for . > > I actually wrote a simplification of his library, removing the > approximate matching stuff, as part of my master's, which is well > documented. I still haven't had time to put up the PDF, though. Interesting. Although when the documentation is not available I would call it undocumented :-). > Sadly, both his and my implementation suffer from a bug in the > subgroup matching of "opposing" repetitions, which would have to be > fixed. Neither of us have been able to do so yet. > > Anyway, it would take an immense amount of work to turn Vim onto a new > regex implementation. Vim has a whole range of its own stuff, like > matching cursor positions and so on, and is tightly bound to the > buffer implementation with its memlines and whatnot. Not to > dishearten you, but I don't think this is a project that can be > completed over a summer (not that it has to be, but you may want to > keep that in mind). The idea is that, when compiling the regexp, you check for special items that are not supported by the new/fast code. If any is found, fall back to the old/slow code. This way you can start with something simple and add more features over time. Most patterns don't use back references or match with the cursor position, thus the new/fast code would be used in most cases. Matching with text in a Vim buffer is required, can't do much without it. -- Witches prefer brooms: vacuum-cleaners need extension cords! /// Bram Moolenaar -- [EMAIL PROTECTED] -- http://www.Moolenaar.net \\\ ///sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\ \\\download, build and distribute -- http://www.A-A-P.org/// \\\help me help AIDS victims -- http://ICCF-Holland.org///
Re: Google Summer of Code 2007 : Improve regexp performance
On Thu, Mar 15, 2007 at 11:28:45PM +0530, Asiri Rathnayake wrote: > Hi all, > > I went through the above idea presented for google summer of code 2007 > and found it interesting. In my opinion, incorporating Thompson NFA into > regexp from the ground up is pretty cool. I also went through the > alternate TRE (http://laurikari.net/tre/index.html) but couldn't verify > whether its an implementation of Thompson NFA ( I will have to dig into > it's source and find it out ). I would like to know further details (if > any) regarding this project and the possible mentor(s). Wow, yesterday I read the announcement of vim's suggested SoC ideas, and I thought, it would be that cool, if the regexp thing would be taken, and here you are :) If you are accepted by Bram/Google I really wish you luck, and hope, that this one could really speed up things like syntax highlighting which can be quite slow for special files even on fast PCs. regrads, Martin
Re: Google Summer of Code 2007 : Improve regexp performance
On Fri, 2007-03-16 at 12:03 +0100, Nikolai Weibull wrote: > On 3/16/07, Asiri Rathnayake <[EMAIL PROTECTED]> wrote: > > On Fri, 2007-03-16 at 10:44 +0100, Nicolas Weber wrote: > > > > > A multithreaded matcher might be useful to vim but do we have a > > > > need for > > > > fuzzy matching ? > > > > vim doesn't use threads at the moment, and Bram seems to be reluctant > > > to include threads for various reasons. > > > I too agree, threading might complicate everything ( and a rather > > difficult programing model )... > > Again, multithreading as in parallel, as in being able to be in > multiple states at once. Not pthreads or anything like that. Sorry > about confusion. But this may also indicate that more knowledge about > what you're actually about to undertake may be necessary before > setting out. No offense intended, but the area of formal languages > (that includes regular expressions, regular languages, pattern > matching, and finite automata) isn't exactly the easiest of topics in > computer science, and having a good understanding of this topic really > is necessary to work on things like that discussed in this thread; so > I'd recommend seeing to it that you have the theory nailed down before > writing code or anything like that. Nicolai, Yes i do understand your point. I understood what you said about being in several states at once, but i was referring to nicolas's comments. Nevertheless, I must accept that i'm still a newbie to this subject area. Anyway, that's the challenge i want to take, i've been into OOP and java for sometime but now my interests are in compiler theory and operating systems ( I think they are more challenging ). As you said, I have to cover a lot of grounds. My intention was to gather information on the depth of the subject and determine whether it's doable before applying to it. I will try my best to gain a foothold on the subject before 24th march ( final date for applications ). BTW, your comments are very much appreciated, no offense taken. :) - Asiri > nikolai
Re: Google Summer of Code 2007 : Improve regexp performance
On 3/16/07, Asiri Rathnayake <[EMAIL PROTECTED]> wrote: On Fri, 2007-03-16 at 10:44 +0100, Nicolas Weber wrote: > > A multithreaded matcher might be useful to vim but do we have a > > need for > > fuzzy matching ? > vim doesn't use threads at the moment, and Bram seems to be reluctant > to include threads for various reasons. I too agree, threading might complicate everything ( and a rather difficult programing model )... Again, multithreading as in parallel, as in being able to be in multiple states at once. Not pthreads or anything like that. Sorry about confusion. But this may also indicate that more knowledge about what you're actually about to undertake may be necessary before setting out. No offense intended, but the area of formal languages (that includes regular expressions, regular languages, pattern matching, and finite automata) isn't exactly the easiest of topics in computer science, and having a good understanding of this topic really is necessary to work on things like that discussed in this thread; so I'd recommend seeing to it that you have the theory nailed down before writing code or anything like that. nikolai
Re: Google Summer of Code 2007 : Improve regexp performance
On Fri, 2007-03-16 at 10:44 +0100, Nicolas Weber wrote: > Hi, > > > A multithreaded matcher might be useful to vim but do we have a > > need for > > fuzzy matching ? > > vim doesn't use threads at the moment, and Bram seems to be reluctant > to include threads for various reasons. I too agree, threading might complicate everything ( and a rather difficult programing model )... - Asiri > Bye, > Nico
Re: Google Summer of Code 2007 : Improve regexp performance
Hi, A multithreaded matcher might be useful to vim but do we have a need for fuzzy matching ? vim doesn't use threads at the moment, and Bram seems to be reluctant to include threads for various reasons. Bye, Nico
Re: Google Summer of Code 2007 : Improve regexp performance
On 3/16/07, Mikolaj Machowski <[EMAIL PROTECTED]> wrote: Dnia piątek 16 marzec 2007, Asiri Rathnayake napisał: > A multithreaded matcher might be useful to vim but do we have a need for > fuzzy matching ? For main regexp engine I think not. But it could be useful for command line completions (think zsh). I did write multithreading/parallel matcher. The matcher is multithreaded in the sense that it can be in many states at once. It does not mean that matching is run in separate application threads. Being able to be in many states at once is what makes it non-deterministic, by the way. It also means that it will run faster than backtracking matchers in most cases, as it never has to second-guess itself. nikolai
Re: Google Summer of Code 2007 : Improve regexp performance
Dnia piątek 16 marzec 2007, Asiri Rathnayake napisał: > > A multithreaded matcher might be useful to vim but do we have a need for > fuzzy matching ? > For main regexp engine I think not. But it could be useful for command line completions (think zsh). m.
Re: Google Summer of Code 2007 : Improve regexp performance
On Thu, 2007-03-15 at 20:44 +0100, Bram Moolenaar wrote: Hi Bram, I'm going through the regxp code now. I'll first have to determine a way to incorporate the fast matching code into the current code base without breaking existing functionality. Thanks. - Asiri > Asiri Rathnayake wrote: > > > I went through the above idea presented for google summer of code 2007 > > and found it interesting. In my opinion, incorporating Thompson NFA into > > regexp from the ground up is pretty cool. I also went through the > > alternate TRE (http://laurikari.net/tre/index.html) but couldn't verify > > whether its an implementation of Thompson NFA ( I will have to dig into > > it's source and find it out ). I would like to know further details (if > > any) regarding this project and the possible mentor(s). > > I'll be the mentor. Russ Cox may provide background info for the regexp > theories. > > There are no more details to be mentioned. You can also look at the > current regexp code in Vim, of course. > > I'll do an announcement soon. The shell server is down again, can't > change the web page right now :-(. >
Re: Google Summer of Code 2007 : Improve regexp performance
On Thu, 2007-03-15 at 23:35 +0100, Nikolai Weibull wrote: nikolai, > On 3/15/07, Asiri Rathnayake <[EMAIL PROTECTED]> wrote: > > Hi all, > > > > I went through the above idea presented for google summer of code 2007 > > and found it interesting. In my opinion, incorporating Thompson NFA into > > regexp from the ground up is pretty cool. I also went through the > > alternate TRE (http://laurikari.net/tre/index.html) but couldn't verify > > whether its an implementation of Thompson NFA ( I will have to dig into > > it's source and find it out ). I would like to know further details (if > > any) regarding this project and the possible mentor(s). > > It has its own finite automata, which is based on a Thompson NFA, > adding so called /tags/ in the mix for doing subgroup matching and > fuzzy matching. > > It comes with three matchers/drivers: A backtracking one for when the > pattern requires it (backrefences), a multithreaded/parallel one, that > is, the standard two tables of active states one, and one for fuzzy > matching (approximate matching). A multithreaded matcher might be useful to vim but do we have a need for fuzzy matching ? > There's a nice paper about it as well (too actually, if you count the > first one he wrote about his first implementation for . > > I actually wrote a simplification of his library, removing the > approximate matching stuff, as part of my master's, which is well > documented. I still haven't had time to put up the PDF, though. Also, it would be great if you can publish your simplified version of TRE and the relevant documentation. :) > Sadly, both his and my implementation suffer from a bug in the > subgroup matching of "opposing" repetitions, which would have to be > fixed. Neither of us have been able to do so yet. > > Anyway, it would take an immense amount of work to turn Vim onto a new > regex implementation. Vim has a whole range of its own stuff, like > matching cursor positions and so on, and is tightly bound to the > buffer implementation with its memlines and whatnot. Not to > dishearten you, but I don't think this is a project that can be > completed over a summer (not that it has to be, but you may want to > keep that in mind). > > Also, there's a TDFA (tagged, deterministic finite automaton) > implementation written in Haskell > > http://www.mail-archive.com/glasgow-haskell-users@haskell.org/msg11442.html > > Quite cool. > > nikolai Thanks a bunch for the information and the advices, I'll give my best attempts to finish the project in the summer, if it takes longer, so be it. - Asiri
Re: Google Summer of Code 2007 : Improve regexp performance
On 3/15/07, Asiri Rathnayake <[EMAIL PROTECTED]> wrote: Hi all, I went through the above idea presented for google summer of code 2007 and found it interesting. In my opinion, incorporating Thompson NFA into regexp from the ground up is pretty cool. I also went through the alternate TRE (http://laurikari.net/tre/index.html) but couldn't verify whether its an implementation of Thompson NFA ( I will have to dig into it's source and find it out ). I would like to know further details (if any) regarding this project and the possible mentor(s). It has its own finite automata, which is based on a Thompson NFA, adding so called /tags/ in the mix for doing subgroup matching and fuzzy matching. It comes with three matchers/drivers: A backtracking one for when the pattern requires it (backrefences), a multithreaded/parallel one, that is, the standard two tables of active states one, and one for fuzzy matching (approximate matching). There's a nice paper about it as well (too actually, if you count the first one he wrote about his first implementation for . I actually wrote a simplification of his library, removing the approximate matching stuff, as part of my master's, which is well documented. I still haven't had time to put up the PDF, though. Sadly, both his and my implementation suffer from a bug in the subgroup matching of "opposing" repetitions, which would have to be fixed. Neither of us have been able to do so yet. Anyway, it would take an immense amount of work to turn Vim onto a new regex implementation. Vim has a whole range of its own stuff, like matching cursor positions and so on, and is tightly bound to the buffer implementation with its memlines and whatnot. Not to dishearten you, but I don't think this is a project that can be completed over a summer (not that it has to be, but you may want to keep that in mind). Also, there's a TDFA (tagged, deterministic finite automaton) implementation written in Haskell http://www.mail-archive.com/glasgow-haskell-users@haskell.org/msg11442.html Quite cool. nikolai
Re: Google Summer of Code 2007 : Improve regexp performance
Asiri Rathnayake wrote: > I went through the above idea presented for google summer of code 2007 > and found it interesting. In my opinion, incorporating Thompson NFA into > regexp from the ground up is pretty cool. I also went through the > alternate TRE (http://laurikari.net/tre/index.html) but couldn't verify > whether its an implementation of Thompson NFA ( I will have to dig into > it's source and find it out ). I would like to know further details (if > any) regarding this project and the possible mentor(s). I'll be the mentor. Russ Cox may provide background info for the regexp theories. There are no more details to be mentioned. You can also look at the current regexp code in Vim, of course. I'll do an announcement soon. The shell server is down again, can't change the web page right now :-(. -- ARTHUR: Well, I can't just call you `Man'. DENNIS: Well, you could say `Dennis'. ARTHUR: Well, I didn't know you were called `Dennis.' DENNIS: Well, you didn't bother to find out, did you? The Quest for the Holy Grail (Monty Python) /// Bram Moolenaar -- [EMAIL PROTECTED] -- http://www.Moolenaar.net \\\ ///sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\ \\\download, build and distribute -- http://www.A-A-P.org/// \\\help me help AIDS victims -- http://ICCF-Holland.org///