Re: Understanding regxp implementation

2007-03-22 Thread Asiri Rathnayake
Nikolai,

As you might know, the reg_comp() method is called twice when compiling
a r.e; first to determine the size of the compiled expression and then
to actually compile it. I was thinking if this can be used to our
advantage, while on the first pass, we look for occurrences of special
characters and set a flag in regprog_T appropriately. If such thing was
not found, we branch off the second pass into one of our own routines to
compile the expression into our own structures (say, a state diagram).
And we have to change other functions a bit to look for the above flag
and call new routines appropriately. What do you think ?

- Asiri



Re: Understanding regxp implementation

2007-03-22 Thread Nikolai Weibull

On 3/22/07, Asiri Rathnayake [EMAIL PROTECTED] wrote:


As you might know, the reg_comp() method is called twice when compiling
a r.e; first to determine the size of the compiled expression and then
to actually compile it. I was thinking if this can be used to our
advantage, while on the first pass, we look for occurrences of special
characters and set a flag in regprog_T appropriately. If such thing was
not found, we branch off the second pass into one of our own routines to
compile the expression into our own structures (say, a state diagram).
And we have to change other functions a bit to look for the above flag
and call new routines appropriately. What do you think ?


That sounds like a good way of determining whether the old engine will
be required or if a new one (with more limited functionality) should
be used.  One way of keeping this information as local as possible
would be to keep a set of function pointers with the compiled regex
that point to the appropriate functions to execute them on some input.

For example, you could have something like this:

typedef struct
{
   int (*exec)();
   int  regstart;
   char_u   reganch;
   char_u   *regmust;
   int  regmlen;
   unsigned regflags;
   char_u   reghasz;
   char_u   program[1]; /* actually longer.. */
} regprog_T;

and change vim_regexec() to call the exec() function of the regprog_T
in the regmatch_T that it gets passed.  You'd then set exec() to point
to either vim_old_regexec() or vim_new_regexec() (or similarly named
functions) in vim_regcomp() depending on the type of regex we have.  I
guess it could be some flag field as well, but this makes it possible
to add a third matcher, should we so desire...like a
Boyer-Moore-Horspool matcher for fixed strings.

 nikolai


Re: Understanding regxp implementation

2007-03-22 Thread Asiri Rathnayake
On Thu, 2007-03-22 at 09:26 +0100, Nikolai Weibull wrote:
 On 3/22/07, Asiri Rathnayake [EMAIL PROTECTED] wrote:
 
  As you might know, the reg_comp() method is called twice when compiling
  a r.e; first to determine the size of the compiled expression and then
  to actually compile it. I was thinking if this can be used to our
  advantage, while on the first pass, we look for occurrences of special
  characters and set a flag in regprog_T appropriately. If such thing was
  not found, we branch off the second pass into one of our own routines to
  compile the expression into our own structures (say, a state diagram).
  And we have to change other functions a bit to look for the above flag
  and call new routines appropriately. What do you think ?
 
 That sounds like a good way of determining whether the old engine will
 be required or if a new one (with more limited functionality) should
 be used.  One way of keeping this information as local as possible
 would be to keep a set of function pointers with the compiled regex
 that point to the appropriate functions to execute them on some input.
 
 For example, you could have something like this:
 
 typedef struct
 {
 int (*exec)();
 int   regstart;
 char_ureganch;
 char_u*regmust;
 int   regmlen;
 unsigned  regflags;
 char_ureghasz;
 char_uprogram[1]; /* actually longer.. */
 } regprog_T;
 
 and change vim_regexec() to call the exec() function of the regprog_T
 in the regmatch_T that it gets passed.  You'd then set exec() to point
 to either vim_old_regexec() or vim_new_regexec() (or similarly named
 functions) in vim_regcomp() depending on the type of regex we have.  I
 guess it could be some flag field as well, but this makes it possible
 to add a third matcher, should we so desire...like a
 Boyer-Moore-Horspool matcher for fixed strings.

Yes, this is more flexible. thanks.

- Asiri

   nikolai



Re: Understanding regxp implementation

2007-03-22 Thread Bram Moolenaar

Nikolai Weibull wrote:

 On 3/22/07, Asiri Rathnayake [EMAIL PROTECTED] wrote:
 
  As you might know, the reg_comp() method is called twice when compiling
  a r.e; first to determine the size of the compiled expression and then
  to actually compile it. I was thinking if this can be used to our
  advantage, while on the first pass, we look for occurrences of special
  characters and set a flag in regprog_T appropriately. If such thing was
  not found, we branch off the second pass into one of our own routines to
  compile the expression into our own structures (say, a state diagram).
  And we have to change other functions a bit to look for the above flag
  and call new routines appropriately. What do you think ?
 
 That sounds like a good way of determining whether the old engine will
 be required or if a new one (with more limited functionality) should
 be used.  One way of keeping this information as local as possible
 would be to keep a set of function pointers with the compiled regex
 that point to the appropriate functions to execute them on some input.
 
 For example, you could have something like this:
 
 typedef struct
 {
 int (*exec)();
 int   regstart;
 char_ureganch;
 char_u*regmust;
 int   regmlen;
 unsigned  regflags;
 char_ureghasz;
 char_uprogram[1]; /* actually longer.. */
 } regprog_T;
 
 and change vim_regexec() to call the exec() function of the regprog_T
 in the regmatch_T that it gets passed.  You'd then set exec() to point
 to either vim_old_regexec() or vim_new_regexec() (or similarly named
 functions) in vim_regcomp() depending on the type of regex we have.  I
 guess it could be some flag field as well, but this makes it possible
 to add a third matcher, should we so desire...like a
 Boyer-Moore-Horspool matcher for fixed strings.

Adding a third matcher won't happen soon, and is a big change.  It's not
really needed to prepare for that.

The disadvantage of using a function pointer is that in the place where
it's used you only see:

 myprog-exec(args);

You can't see which function is being called, and finding out is not
that simple.  So when you browse the code this is like a dead end.

Using this keeps navigating much simpler:

if (myprog-difficultregexp)
regmatch_old(args);
else
regmatch_new(args);

One reason why inheritance in object oriented programming makes our life
more difficult is that you quite often don't know for sure which method
is invoked.

-- 
ROBIN:  The what?
ARTHUR: The Holy Hand Grenade of Antioch.  'Tis one of the sacred relics
Brother Maynard always carries with him.
ALL:Yes. Of course.
ARTHUR: (shouting) Bring up the Holy Hand Grenade!
 Monty Python and the Holy Grail PYTHON (MONTY) PICTURES LTD

 /// Bram Moolenaar -- [EMAIL PROTECTED] -- http://www.Moolenaar.net   \\\
///sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\download, build and distribute -- http://www.A-A-P.org///
 \\\help me help AIDS victims -- http://ICCF-Holland.org///


Re: Understanding regxp implementation

2007-03-22 Thread Asiri Rathnayake
On Thu, 2007-03-22 at 22:21 +0100, Bram Moolenaar wrote:

 
 Adding a third matcher won't happen soon, and is a big change.  It's not
 really needed to prepare for that.
 
 The disadvantage of using a function pointer is that in the place where
 it's used you only see:
 
  myprog-exec(args);
 
 You can't see which function is being called, and finding out is not
 that simple.  So when you browse the code this is like a dead end.
 
 Using this keeps navigating much simpler:
 
   if (myprog-difficultregexp)
   regmatch_old(args);
   else
   regmatch_new(args);
 
 One reason why inheritance in object oriented programming makes our life
 more difficult is that you quite often don't know for sure which method
 is invoked.

Actually this is something I loved about OOP, you don't have to worry
about which method is called (assuming the correct one gets called). But
given the complexity of the existing code, adding more complications
would be a disaster as you say.

- Asiri



Re: Understanding regxp implementation

2007-03-21 Thread Asiri Rathnayake
This mail bounced off for some reason, I'm repeating it. Sorry if you've
already got this.

-

Hi Bram, Nikolai, All,

I think the best way to understand current implementation of regxp is to
first go through Henry Spencer's original regxp implementation ( thanks
nikolai ). It's very compact and easy to mess with. After that, I think
I
would be able to come up with a hack to include new code into vim's
regxp
code base (  either TRE or our own implementation ). By the way, I have
decided to apply for the GSoc ( I beleive I can crack this with the help
of
all of you  ). I'm really greateful for the help provided by Nikolai and
Bram, Thank you very much!

ps : even if I'm not selected, I would still like to contribute to this
project... :-)

- Asiri



Re: Understanding regxp implementation

2007-03-20 Thread Asiri Rathnayake
Hi Bram,Nikolai,

I went through the regxp code and have a few questions...

First, Why use this kind of a coding scheme and encode patterns rather
than using a state diagram ? ( Performance/Memory ? ). Secondly, is it a
requirement that the new implementation has to follow the same method ?
I mean, can't I use a state diagram ( which is easy to implement in my
opinion ) to simulate the NFA ?

PS: My questions might be incomplete, My understanding of the current
implementation is limited; I'll be asking few dumb questions here and
there, please excuse me for that.. 

- Asiri



Understanding regxp implementation

2007-03-19 Thread Asiri Rathnayake
Hi Bram, Nicolai,

I'm unable to grasp the description ( attachment ) given in the regxp.c
file. For a moment they seem like NFA fragments for operators |,+,* and
so on, but then again I'm in doubt ( specially i don't understand what a
node in this context is ). A little help is greatly appreciated
( perhaps a pointer to some other information ). I believe this is a
very simple thing, sorry for my incompetence...

- Asiri
/*
 * Structure for regexp program.  This is essentially a linear encoding
 * of a nondeterministic finite-state machine (aka syntax charts or
 * railroad normal form in parsing technology).  Each node is an opcode
 * plus a next pointer, possibly plus an operand.  Next pointers of
 * all nodes except BRANCH and BRACES_COMPLEX implement concatenation; a next
 * pointer with a BRANCH on both ends of it is connecting two alternatives.
 * (Here we have one of the subtle syntax dependencies: an individual BRANCH
 * (as opposed to a collection of them) is never concatenated with anything
 * because of operator precedence).  The next pointer of a BRACES_COMPLEX
 * node points to the node after the stuff to be repeated.
 * The operand of some types of node is a literal string; for others, it is a
 * node leading into a sub-FSM.  In particular, the operand of a BRANCH node
 * is the first node of the branch.
 * (NB this is *not* a tree structure: the tail of the branch connects to the
 * thing following the set of BRANCHes.)
 *
 * pattern  is coded like:
 *
 *+-+
 *| V
 * aa\|bb   BRANCH aa BRANCH bb -- END
 *   |  ^|  ^
 *   +--++--+
 *
 *
 * +--+
 * V  |
 * aa*BRANCH BRANCH aa -- BACK BRANCH -- NOTHING -- END
 *   |  |   ^  ^
 *   |  +---+  |
 *   +-+
 *
 *
 * +--+
 * V  |
 * aa\+   BRANCH aa -- BRANCH -- BACK  BRANCH -- NOTHING -- END
 *   |   |   ^  ^
 *   |   +---+  |
 *   +--+
 *
 *
 *  +-+
 *  V |
 * aa\{}  BRANCH BRACE_LIMITS -- BRACE_COMPLEX aa -- BACK  END
 *   |  |^
 *   |  ++
 *   +---+
 *
 *
 * aa[EMAIL PROTECTED]bbBRANCH NOMATCH aa -- END  bb -- END
 *   |   |^   ^
 *   |   ++   |
 *   ++
 *
 *+-+
 *| V
 * \z[abc]  BRANCH BRANCH  a  BRANCH  b  BRANCH  c  BRANCH  NOTHING -- END
 *   |  |  |  | ^   ^
 *   |  |  |  +-+   |
 *   |  |  ++   |
 *   |  +---+   |
 *   +--+
 *
 * They all start with a BRANCH for \| alternaties, even when there is only
 * one alternative.
 */


Re: Understanding regxp implementation

2007-03-19 Thread Asiri Rathnayake
On Mon, 2007-03-19 at 11:55 +0100, Nikolai Weibull wrote:
 On 3/19/07, Asiri Rathnayake [EMAIL PROTECTED] wrote:
  Hi Bram, Nicolai,
 
 A 'k' would be greatly appreciated.

I'm really really sorry, won't happen again...

  I'm unable to grasp the description ( attachment ) given in the regxp.c
  file. For a moment they seem like NFA fragments for operators |,+,*
 
 Well, yes, that's what they are.  The diagrams show you how the
 different operators are represented.  I think the representation can
 be a bit difficult to grasp at first, because the representation isn't
 like a state diagram/tree/DAG, but more of a list of assembler
 instructions.These assembler-like instructions are then executed
 by an interpreter that executes the appropriate C code for doing the
 matching.  This is (mainly) done in regmatch().

Thanks a bunch, I was completely lost...

- Asiri

   nikolai



Re: Understanding regxp implementation

2007-03-19 Thread Nikolai Weibull

On 1/1/07, Asiri Rathnayake [EMAIL PROTECTED] wrote:

On Mon, 2007-03-19 at 11:55 +0100, Nikolai Weibull wrote:
 On 3/19/07, Asiri Rathnayake [EMAIL PROTECTED] wrote:
  Hi Bram, Nicolai,

 A 'k' would be greatly appreciated.

I'm really really sorry, won't happen again...


Hehe, don't take it too hard, it happens to me all the time.  If I had
a dime for every misspelling of my name, I'd have...more money than I
do now.

 nikolai


Re: Understanding regxp implementation

2007-03-19 Thread Martin Stubenschrott
On Mon, Mar 19, 2007 at 01:49:53PM +0100, Nikolai Weibull wrote:
 On 1/1/07, Asiri Rathnayake [EMAIL PROTECTED] wrote:
 On Mon, 2007-03-19 at 11:55 +0100, Nikolai Weibull wrote:
  On 3/19/07, Asiri Rathnayake [EMAIL PROTECTED] wrote:
   Hi Bram, Nicolai,
 
  A 'k' would be greatly appreciated.
 
 I'm really really sorry, won't happen again...
 
 Hehe, don't take it too hard, it happens to me all the time.  If I had
 a dime for every misspelling of my name, I'd have...more money than I
 do now.

Now I know, why Bram said in his Google Talk that he even uses ctrl-n
completion for names, as it helps not misspelling names :)

--
Martin