On Thu, 14 May 2015 01:31:29 +0100 Richard Wordingham <[email protected]> wrote:
> I believe this corresponds to the > intent of requirement RL2.1 that was in UTS#18 Unicode Regular > Expression until the towel was thrown in and the paragraph survived > but the requirement vanished. I apologise if I am telling those interested what they already know. I couldn't find it written down in terms of NFD strings. I believe the core of the problem is that Thompson's construction algorithm has to be significantly elaborated for concatenation. When running the non-deterministic finite state machine for the regular expression st, if the string is amnb with ccc(m) != ccc(n), one has to consider the possibility that subsequence an matches expression s and subsequence mb matches expression t. To handle a run of decomposed characters with non-zero canonical combining class, one method adds states of the form (x,y,n) where x is a state of for expression s, y is a state for expression t, and n is the non-zero canonical combining class of the last character received. The additional problem with (algebraic) Kleene star is that for s* one has to simultaneously consider s, ss, sss and so on, which makes the state machine non-finite. This is probably just a formal problem; once one adds capture groups to the FSM, the memory requirement depends on the size of the string being examined. A solution is to effectively add a loop to the parse structure of the regular expression and add checks to the matching function to avoid unnecessary recursion. An elegant formal solution to the Kleene star problem interprets (\u0323\u0302)* as (\u0323|\u0302)*. However, that is counter-intuitive, and simply rejecting such expressions would probably be better. Going non-finite is probably better. My *finite* state machine bodge for these cases is to simply match s+ to something uncharacterised between s|ss and s+. Richard.

