Hi Bram, Nicolai,
I'm unable to grasp the description ( attachment ) given in the regxp.c
file. For a moment they seem like NFA fragments for operators |,+,* and
so on, but then again I'm in doubt ( specially i don't understand what a
node in this context is ). A little help is greatly appreciated
( perhaps a pointer to some other information ). I believe this is a
very simple thing, sorry for my incompetence...
- Asiri
/*
* Structure for regexp "program". This is essentially a linear encoding
* of a nondeterministic finite-state machine (aka syntax charts or
* "railroad normal form" in parsing technology). Each node is an opcode
* plus a "next" pointer, possibly plus an operand. "Next" pointers of
* all nodes except BRANCH and BRACES_COMPLEX implement concatenation; a "next"
* pointer with a BRANCH on both ends of it is connecting two alternatives.
* (Here we have one of the subtle syntax dependencies: an individual BRANCH
* (as opposed to a collection of them) is never concatenated with anything
* because of operator precedence). The "next" pointer of a BRACES_COMPLEX
* node points to the node after the stuff to be repeated.
* The operand of some types of node is a literal string; for others, it is a
* node leading into a sub-FSM. In particular, the operand of a BRANCH node
* is the first node of the branch.
* (NB this is *not* a tree structure: the tail of the branch connects to the
* thing following the set of BRANCHes.)
*
* pattern is coded like:
*
* +-----------------+
* | V
* <aa>\|<bb> BRANCH <aa> BRANCH <bb> --> END
* | ^ | ^
* +------+ +----------+
*
*
* +------------------+
* V |
* <aa>* BRANCH BRANCH <aa> --> BACK BRANCH --> NOTHING --> END
* | | ^ ^
* | +---------------+ |
* +---------------------------------------------+
*
*
* +----------------------+
* V |
* <aa>\+ BRANCH <aa> --> BRANCH --> BACK BRANCH --> NOTHING --> END
* | | ^ ^
* | +-----------+ |
* +--------------------------------------------------+
*
*
* +-------------------------+
* V |
* <aa>\{} BRANCH BRACE_LIMITS --> BRACE_COMPLEX <aa> --> BACK END
* | | ^
* | +----------------+
* +-----------------------------------------------+
*
*
* <aa>[EMAIL PROTECTED]<bb> BRANCH NOMATCH <aa> --> END <bb> --> END
* | | ^ ^
* | +----------------+ |
* +--------------------------------+
*
* +---------+
* | V
* \z[abc] BRANCH BRANCH a BRANCH b BRANCH c BRANCH NOTHING --> END
* | | | | ^ ^
* | | | +-----+ |
* | | +----------------+ |
* | +---------------------------+ |
* +------------------------------------------------------+
*
* They all start with a BRANCH for "\|" alternaties, even when there is only
* one alternative.
*/