I've been puzzling over how a pure regular expression engine that works via a non-deterministic finite automaton can be bent to accommodate 'literal clusters' as in Requirement RL2.2 'Extended Grapheme Clusters' of UTS#18 'Unicode Regular Expressions' - "To meet this requirement, an implementation shall provide a mechanism for matching against an arbitrary extended grapheme cluster, a literal cluster, and matching extended grapheme cluster boundaries." It works from a regular expression by stitching together the FSMs corresponding to its elements.
An example UTS#18 gives for matching a literal cluster can be simplified to, in its notation: [c \q{ch}] This is interpreted as 'match against "ch" if possible, otherwise against "c". Thus the strings "ca" and "cha" would both match the expression [c \q{ch}]a while "chh" but not "ch" would match against [c \q{ch}]h Or have I got this wrong? Thus, while "[c \q{ch}]" may be a regex, it is clearly not any notation for a regular expression in the mathematical sense. It seems to me that this expression requires backtracking, which is totally alien to the design of the regular expression engine. One problem then is that the engine supports both the union and intersection of regular languages. While algebraic manipulation might raise union to the highest level, eliminating intersection is an expensive operation which I have deliberately avoided. While backtracking is feasible if state progression has been restricted to the FSM for a literal cluster, it is far more difficult if multiple FSMs have been running in parallel. As the engine fully respects canonical equivalence (with the result that it can find an accented letter of the Vietnamese alphabet even if it bears a subscript tone mark), concatenated subexpressions can divide the input streams between them. Consequently, the backtracking mechanism gets complicated. May I correctly argue instead that matching against literal clusters would be satisfied by instead supporting, for this example, the regular subexpression "(c|ch)" or the UnicodeSet expression "[c{ch}]"? Richard.