Re: Understanding regxp implementation
Nikolai, As you might know, the reg_comp() method is called twice when compiling a r.e; first to determine the size of the compiled expression and then to actually compile it. I was thinking if this can be used to our advantage, while on the first pass, we look for occurrences of special characters and set a flag in regprog_T appropriately. If such thing was not found, we branch off the second pass into one of our own routines to compile the expression into our own structures (say, a state diagram). And we have to change other functions a bit to look for the above flag and call new routines appropriately. What do you think ? - Asiri
Re: Understanding regxp implementation
On 3/22/07, Asiri Rathnayake [EMAIL PROTECTED] wrote: As you might know, the reg_comp() method is called twice when compiling a r.e; first to determine the size of the compiled expression and then to actually compile it. I was thinking if this can be used to our advantage, while on the first pass, we look for occurrences of special characters and set a flag in regprog_T appropriately. If such thing was not found, we branch off the second pass into one of our own routines to compile the expression into our own structures (say, a state diagram). And we have to change other functions a bit to look for the above flag and call new routines appropriately. What do you think ? That sounds like a good way of determining whether the old engine will be required or if a new one (with more limited functionality) should be used. One way of keeping this information as local as possible would be to keep a set of function pointers with the compiled regex that point to the appropriate functions to execute them on some input. For example, you could have something like this: typedef struct { int (*exec)(); int regstart; char_u reganch; char_u *regmust; int regmlen; unsigned regflags; char_u reghasz; char_u program[1]; /* actually longer.. */ } regprog_T; and change vim_regexec() to call the exec() function of the regprog_T in the regmatch_T that it gets passed. You'd then set exec() to point to either vim_old_regexec() or vim_new_regexec() (or similarly named functions) in vim_regcomp() depending on the type of regex we have. I guess it could be some flag field as well, but this makes it possible to add a third matcher, should we so desire...like a Boyer-Moore-Horspool matcher for fixed strings. nikolai
Re: Understanding regxp implementation
On Thu, 2007-03-22 at 09:26 +0100, Nikolai Weibull wrote: On 3/22/07, Asiri Rathnayake [EMAIL PROTECTED] wrote: As you might know, the reg_comp() method is called twice when compiling a r.e; first to determine the size of the compiled expression and then to actually compile it. I was thinking if this can be used to our advantage, while on the first pass, we look for occurrences of special characters and set a flag in regprog_T appropriately. If such thing was not found, we branch off the second pass into one of our own routines to compile the expression into our own structures (say, a state diagram). And we have to change other functions a bit to look for the above flag and call new routines appropriately. What do you think ? That sounds like a good way of determining whether the old engine will be required or if a new one (with more limited functionality) should be used. One way of keeping this information as local as possible would be to keep a set of function pointers with the compiled regex that point to the appropriate functions to execute them on some input. For example, you could have something like this: typedef struct { int (*exec)(); int regstart; char_ureganch; char_u*regmust; int regmlen; unsigned regflags; char_ureghasz; char_uprogram[1]; /* actually longer.. */ } regprog_T; and change vim_regexec() to call the exec() function of the regprog_T in the regmatch_T that it gets passed. You'd then set exec() to point to either vim_old_regexec() or vim_new_regexec() (or similarly named functions) in vim_regcomp() depending on the type of regex we have. I guess it could be some flag field as well, but this makes it possible to add a third matcher, should we so desire...like a Boyer-Moore-Horspool matcher for fixed strings. Yes, this is more flexible. thanks. - Asiri nikolai
Re: Understanding regxp implementation
Nikolai Weibull wrote: On 3/22/07, Asiri Rathnayake [EMAIL PROTECTED] wrote: As you might know, the reg_comp() method is called twice when compiling a r.e; first to determine the size of the compiled expression and then to actually compile it. I was thinking if this can be used to our advantage, while on the first pass, we look for occurrences of special characters and set a flag in regprog_T appropriately. If such thing was not found, we branch off the second pass into one of our own routines to compile the expression into our own structures (say, a state diagram). And we have to change other functions a bit to look for the above flag and call new routines appropriately. What do you think ? That sounds like a good way of determining whether the old engine will be required or if a new one (with more limited functionality) should be used. One way of keeping this information as local as possible would be to keep a set of function pointers with the compiled regex that point to the appropriate functions to execute them on some input. For example, you could have something like this: typedef struct { int (*exec)(); int regstart; char_ureganch; char_u*regmust; int regmlen; unsigned regflags; char_ureghasz; char_uprogram[1]; /* actually longer.. */ } regprog_T; and change vim_regexec() to call the exec() function of the regprog_T in the regmatch_T that it gets passed. You'd then set exec() to point to either vim_old_regexec() or vim_new_regexec() (or similarly named functions) in vim_regcomp() depending on the type of regex we have. I guess it could be some flag field as well, but this makes it possible to add a third matcher, should we so desire...like a Boyer-Moore-Horspool matcher for fixed strings. Adding a third matcher won't happen soon, and is a big change. It's not really needed to prepare for that. The disadvantage of using a function pointer is that in the place where it's used you only see: myprog-exec(args); You can't see which function is being called, and finding out is not that simple. So when you browse the code this is like a dead end. Using this keeps navigating much simpler: if (myprog-difficultregexp) regmatch_old(args); else regmatch_new(args); One reason why inheritance in object oriented programming makes our life more difficult is that you quite often don't know for sure which method is invoked. -- ROBIN: The what? ARTHUR: The Holy Hand Grenade of Antioch. 'Tis one of the sacred relics Brother Maynard always carries with him. ALL:Yes. Of course. ARTHUR: (shouting) Bring up the Holy Hand Grenade! Monty Python and the Holy Grail PYTHON (MONTY) PICTURES LTD /// Bram Moolenaar -- [EMAIL PROTECTED] -- http://www.Moolenaar.net \\\ ///sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\ \\\download, build and distribute -- http://www.A-A-P.org/// \\\help me help AIDS victims -- http://ICCF-Holland.org///
Re: Understanding regxp implementation
On Thu, 2007-03-22 at 22:21 +0100, Bram Moolenaar wrote: Adding a third matcher won't happen soon, and is a big change. It's not really needed to prepare for that. The disadvantage of using a function pointer is that in the place where it's used you only see: myprog-exec(args); You can't see which function is being called, and finding out is not that simple. So when you browse the code this is like a dead end. Using this keeps navigating much simpler: if (myprog-difficultregexp) regmatch_old(args); else regmatch_new(args); One reason why inheritance in object oriented programming makes our life more difficult is that you quite often don't know for sure which method is invoked. Actually this is something I loved about OOP, you don't have to worry about which method is called (assuming the correct one gets called). But given the complexity of the existing code, adding more complications would be a disaster as you say. - Asiri
Re: Understanding regxp implementation
This mail bounced off for some reason, I'm repeating it. Sorry if you've already got this. - Hi Bram, Nikolai, All, I think the best way to understand current implementation of regxp is to first go through Henry Spencer's original regxp implementation ( thanks nikolai ). It's very compact and easy to mess with. After that, I think I would be able to come up with a hack to include new code into vim's regxp code base ( either TRE or our own implementation ). By the way, I have decided to apply for the GSoc ( I beleive I can crack this with the help of all of you ). I'm really greateful for the help provided by Nikolai and Bram, Thank you very much! ps : even if I'm not selected, I would still like to contribute to this project... :-) - Asiri
Re: Understanding regxp implementation
Hi Bram,Nikolai, I went through the regxp code and have a few questions... First, Why use this kind of a coding scheme and encode patterns rather than using a state diagram ? ( Performance/Memory ? ). Secondly, is it a requirement that the new implementation has to follow the same method ? I mean, can't I use a state diagram ( which is easy to implement in my opinion ) to simulate the NFA ? PS: My questions might be incomplete, My understanding of the current implementation is limited; I'll be asking few dumb questions here and there, please excuse me for that.. - Asiri
Understanding regxp implementation
Hi Bram, Nicolai, I'm unable to grasp the description ( attachment ) given in the regxp.c file. For a moment they seem like NFA fragments for operators |,+,* and so on, but then again I'm in doubt ( specially i don't understand what a node in this context is ). A little help is greatly appreciated ( perhaps a pointer to some other information ). I believe this is a very simple thing, sorry for my incompetence... - Asiri /* * Structure for regexp program. This is essentially a linear encoding * of a nondeterministic finite-state machine (aka syntax charts or * railroad normal form in parsing technology). Each node is an opcode * plus a next pointer, possibly plus an operand. Next pointers of * all nodes except BRANCH and BRACES_COMPLEX implement concatenation; a next * pointer with a BRANCH on both ends of it is connecting two alternatives. * (Here we have one of the subtle syntax dependencies: an individual BRANCH * (as opposed to a collection of them) is never concatenated with anything * because of operator precedence). The next pointer of a BRACES_COMPLEX * node points to the node after the stuff to be repeated. * The operand of some types of node is a literal string; for others, it is a * node leading into a sub-FSM. In particular, the operand of a BRANCH node * is the first node of the branch. * (NB this is *not* a tree structure: the tail of the branch connects to the * thing following the set of BRANCHes.) * * pattern is coded like: * *+-+ *| V * aa\|bb BRANCH aa BRANCH bb -- END * | ^| ^ * +--++--+ * * * +--+ * V | * aa*BRANCH BRANCH aa -- BACK BRANCH -- NOTHING -- END * | | ^ ^ * | +---+ | * +-+ * * * +--+ * V | * aa\+ BRANCH aa -- BRANCH -- BACK BRANCH -- NOTHING -- END * | | ^ ^ * | +---+ | * +--+ * * * +-+ * V | * aa\{} BRANCH BRACE_LIMITS -- BRACE_COMPLEX aa -- BACK END * | |^ * | ++ * +---+ * * * aa[EMAIL PROTECTED]bbBRANCH NOMATCH aa -- END bb -- END * | |^ ^ * | ++ | * ++ * *+-+ *| V * \z[abc] BRANCH BRANCH a BRANCH b BRANCH c BRANCH NOTHING -- END * | | | | ^ ^ * | | | +-+ | * | | ++ | * | +---+ | * +--+ * * They all start with a BRANCH for \| alternaties, even when there is only * one alternative. */
Re: Understanding regxp implementation
On Mon, 2007-03-19 at 11:55 +0100, Nikolai Weibull wrote: On 3/19/07, Asiri Rathnayake [EMAIL PROTECTED] wrote: Hi Bram, Nicolai, A 'k' would be greatly appreciated. I'm really really sorry, won't happen again... I'm unable to grasp the description ( attachment ) given in the regxp.c file. For a moment they seem like NFA fragments for operators |,+,* Well, yes, that's what they are. The diagrams show you how the different operators are represented. I think the representation can be a bit difficult to grasp at first, because the representation isn't like a state diagram/tree/DAG, but more of a list of assembler instructions.These assembler-like instructions are then executed by an interpreter that executes the appropriate C code for doing the matching. This is (mainly) done in regmatch(). Thanks a bunch, I was completely lost... - Asiri nikolai
Re: Understanding regxp implementation
On 1/1/07, Asiri Rathnayake [EMAIL PROTECTED] wrote: On Mon, 2007-03-19 at 11:55 +0100, Nikolai Weibull wrote: On 3/19/07, Asiri Rathnayake [EMAIL PROTECTED] wrote: Hi Bram, Nicolai, A 'k' would be greatly appreciated. I'm really really sorry, won't happen again... Hehe, don't take it too hard, it happens to me all the time. If I had a dime for every misspelling of my name, I'd have...more money than I do now. nikolai
Re: Understanding regxp implementation
On Mon, Mar 19, 2007 at 01:49:53PM +0100, Nikolai Weibull wrote: On 1/1/07, Asiri Rathnayake [EMAIL PROTECTED] wrote: On Mon, 2007-03-19 at 11:55 +0100, Nikolai Weibull wrote: On 3/19/07, Asiri Rathnayake [EMAIL PROTECTED] wrote: Hi Bram, Nicolai, A 'k' would be greatly appreciated. I'm really really sorry, won't happen again... Hehe, don't take it too hard, it happens to me all the time. If I had a dime for every misspelling of my name, I'd have...more money than I do now. Now I know, why Bram said in his Google Talk that he even uses ctrl-n completion for names, as it helps not misspelling names :) -- Martin