Hi Bram, Nicolai,

I'm unable to grasp the description ( attachment ) given in the regxp.c
file. For a moment they seem like NFA fragments for operators |,+,* and
so on, but then again I'm in doubt ( specially i don't understand what a
node in this context is ). A little help is greatly appreciated
( perhaps a pointer to some other information ). I believe this is a
very simple thing, sorry for my incompetence...

- Asiri
/*
 * Structure for regexp "program".  This is essentially a linear encoding
 * of a nondeterministic finite-state machine (aka syntax charts or
 * "railroad normal form" in parsing technology).  Each node is an opcode
 * plus a "next" pointer, possibly plus an operand.  "Next" pointers of
 * all nodes except BRANCH and BRACES_COMPLEX implement concatenation; a "next"
 * pointer with a BRANCH on both ends of it is connecting two alternatives.
 * (Here we have one of the subtle syntax dependencies: an individual BRANCH
 * (as opposed to a collection of them) is never concatenated with anything
 * because of operator precedence).  The "next" pointer of a BRACES_COMPLEX
 * node points to the node after the stuff to be repeated.
 * The operand of some types of node is a literal string; for others, it is a
 * node leading into a sub-FSM.  In particular, the operand of a BRANCH node
 * is the first node of the branch.
 * (NB this is *not* a tree structure: the tail of the branch connects to the
 * thing following the set of BRANCHes.)
 *
 * pattern      is coded like:
 *
 *                        +-----------------+
 *                        |                 V
 * <aa>\|<bb>   BRANCH <aa> BRANCH <bb> --> END
 *                   |      ^    |          ^
 *                   +------+    +----------+
 *
 *
 *                     +------------------+
 *                     V                  |
 * <aa>*        BRANCH BRANCH <aa> --> BACK BRANCH --> NOTHING --> END
 *                   |      |               ^                      ^
 *                   |      +---------------+                      |
 *                   +---------------------------------------------+
 *
 *
 *                     +----------------------+
 *                     V                      |
 * <aa>\+       BRANCH <aa> --> BRANCH --> BACK  BRANCH --> NOTHING --> END
 *                   |               |           ^                      ^
 *                   |               +-----------+                      |
 *                   +--------------------------------------------------+
 *
 *
 *                                      +-------------------------+
 *                                      V                         |
 * <aa>\{}      BRANCH BRACE_LIMITS --> BRACE_COMPLEX <aa> --> BACK  END
 *                   |                              |                ^
 *                   |                              +----------------+
 *                   +-----------------------------------------------+
 *
 *
 * <aa>[EMAIL PROTECTED]<bb>    BRANCH NOMATCH <aa> --> END  <bb> --> END
 *                   |       |                ^       ^
 *                   |       +----------------+       |
 *                   +--------------------------------+
 *
 *                                                    +---------+
 *                                                    |         V
 * \z[abc]      BRANCH BRANCH  a  BRANCH  b  BRANCH  c  BRANCH  NOTHING --> END
 *                   |      |          |          |     ^                   ^
 *                   |      |          |          +-----+                   |
 *                   |      |          +----------------+                   |
 *                   |      +---------------------------+                   |
 *                   +------------------------------------------------------+
 *
 * They all start with a BRANCH for "\|" alternaties, even when there is only
 * one alternative.
 */

Reply via email to