On Sun, 17 May 2015 16:45:18 +0200 Philippe Verdy <[email protected]> wrote:
> 2015-05-16 22:33 GMT+02:00 Richard Wordingham < > [email protected]>: > > > I'm not at all sure what your example string is. I ran my program > > to watch its progression with input \u0323\u0323\u0302\u0302, which > > does not match the pattern, and attach the outputs for your scorn. > > I have added comments started by #. > > > > Sorry for not commenting it, this is the internal tricks and outputs > of your program, and your added comments does not allow me to > interpret what all this means, i.e. the exact role of the notations > with sequences or "L" or "R" or "N", and what the "=>" notation means > (I suppose this is noting an advance rule and that the left-hand side > is the state before, the right-hand-side is the state after, but I > don't see where is the condition (the character or character class to > match, or an error condition). You've only "explained" partly the NDE > and ODE comments and the "!" when it is appended. 'ODE' and 'NDE' mean the transitions should not occur when I finish my current set of edits. The exclamation mark means the optimisation I first though of wouldn't eliminate it. > Is that really what your regexp engine outputs as its internally > generated parser tables (only "friendly" serialized as a "readable" > text) ? When running the regex, I really do hold the states in forms like LLLL2 and LLLN001220:2:L2. (The colons are unnecessary; I included them for readability.) It's designed for proof of principle, rather than high speed. There is also a tree corresponding to the analysis of the regex; the nodes record how the lower level regexes are combined. The branching nodes in the example are for sequences. In the simplest case, a matching expression will, in some canonically equivalent form, be the concatenation of a string matching the left hand node and a string matching the right mode. For iterations ('*' and '+', though I treat '+' as basic), the tree does not need a corresponding right branch, as all the information about the regex is held in the left branch. An 'L' means that the input sequence is proceeding through the left branch. An 'R' means that it has completed its passage through the left branch, and is now proceeding through the right branch. All this would be applicable if I were ignoring canonical equivalence. An 'N' (for 'normalisation') means that parsing is passing through the region where the normalisation has interleaved the left and right hand component strings. As I consider each fresh character, I have to consider its canonical combining class. The string for the state records what ccc is blocked from the left hand string. As I take the characters from the input string in NFD order, I only need to remember the highest blocked ccc. The first character I receive with a lower ccc will be a starter, at which point I will only be progressing the right hand component string. For the state in the parent regex, I record the 'N' (as opposed to an 'L' or 'R'), the highest blocked ccc, the state in the left-hand regex and the state in the right-hand regex. The input characters are recorded in the form =<character scalar value>=<this character start location>:<next character start location>:= The character location is recorded in the from <part><byte offset>. The part is a single digit. 0 means whole character, 1 means first character in character decomposing to multiple characters, 2 means second and so on. Thus, as the first U+0302 is stored as 6-character escape code '\u0302', I record the position as: =0302=06:012:= I then record the consequential transition from each state to another state. As the basic structure is that of a non-deterministic finite automaton (as at https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton ), there may be no or many transitions from a particular state. There are no error conditions as such. As I record each transition, I record whether there is now a match to the whole regex and whether the state is a duplicate. Detecting duplicates is part of the key to the classical NFA's better resistance to 'pathological inputs' compared to back-tracking algorithms. There are two main state numberings for the bottom level regexes. The main bottom level regex is a simple regex with no alternates or groupings. The engine propagates the simple regex as a string and records the state as the byte offset of the next character to compare against. The regex is stored in Latin-1 or UTF-8. (Latin-1 is not suitable for precomposed characters.) Thus when the first character input is U+0323 and is compared against the regex \u0323\u0302\u0302, the state for the regex changes from 0 to 2, as U+0323 occupies 2 bytes in UTF-8. This is recorded as an overall state transition 'LLLL0 => LLLL2'. When all characters in the string have been matched, the state becomes 'M'. The simple regexes have one-to-many state progressions to handle iterations and optionality ('*', '+' and '?'). The second system is for Unicode properties. The state records the composition of precomposed characters by using the accumulated codepoint as the state. However, the state also includes a success flag for ease of composing the acceptance or otherwise of the overall state and to determine transitions from one regex to the next. My program does not calculate what the characters are for a state transition to occur. Instead, it calculates what transitions occur in response to an input character. Richard.

