What is the current state of play on regular expression engines that acknowledge canonical equivalence? By 'acknowledge', I mean engines that will deem a string to have a match for a pattern if any string canonically equivalent to that string does. I believe this corresponds to the intent of requirement RL2.1, which was in UTS #18 'Unicode Regular Expressions' until the towel was thrown in: the paragraph survived, but the requirement vanished.
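To make that definition concrete, here is a brute-force sketch in Python. It is only an illustration of the definition, not how any real engine works, and it is exponential in the length of a run of combining marks. It assumes the pattern is written in terms of fully decomposed code points, and the names `canonical_variants` and `has_match` are made up for this example.

```python
import itertools
import re
import unicodedata


def _equivalent_orderings(run):
    """All reorderings of a canonically ordered run of non-starters that
    canonical reordering maps back to `run` (marks with the same combining
    class therefore keep their relative order)."""
    candidates = {"".join(p) for p in itertools.permutations(run)}
    # sorted() is stable, so sorting by combining class is exactly the
    # canonical reordering step of normalization.
    return sorted(
        c for c in candidates
        if sorted(c, key=unicodedata.combining) == list(run)
    )


def canonical_variants(text):
    """Yield every fully decomposed string canonically equivalent to text."""
    nfd = unicodedata.normalize("NFD", text)
    pieces = []
    for is_mark, group in itertools.groupby(
        nfd, key=lambda ch: unicodedata.combining(ch) != 0
    ):
        run = "".join(group)
        pieces.append(_equivalent_orderings(run) if is_mark else [run])
    for combo in itertools.product(*pieces):
        yield "".join(combo)


def has_match(pattern, text):
    """True if some decomposed string equivalent to text matches pattern.

    Assumption: pattern is an ordinary `re` pattern over decomposed text.
    """
    compiled = re.compile(pattern)
    return any(compiled.search(v) for v in canonical_variants(text))


# 'o' + circumflex is found inside U+1ED9 (o with circumflex and dot below).
print(has_match("o\u0302", "bu\u1ed9c"))   # True
# Marks of equal combining class do not commute, so this one is not found.
print(has_match("a\u0302", "a\u0301\u0302"))  # False
```

Note that this only answers whether a match exists; it says nothing about which span of the original string to report, which is the hard part.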
I have been putting my own together, but my efforts have bogged down over how to select the match and the subexpression matches to report. The relevant theory is not that of regular languages of strings, but that of regular languages of 'traces'. I currently leave the results undefined if an algebraic Kleene star is not a regular expression, e.g. (\u0323\u0301)*. (U+0323 and U+0301 have different non-zero canonical combining classes and so commute, which means the star of that trace contains every string with equal numbers of the two marks.) It is particularly relevant to using regular expressions for text rendering, e.g. for something like an imitation of Microsoft’s Universal Shaping Engine.
I note that ICU is having another attempt at supporting canonical equivalence: http://bugs.icu-project.org/trac/ticket/9111 'Support UREGEX_CANON_EQ'. At least, they are if the User Guide (http://userguide.icu-project.org/strings/regexp) is to be believed. Perhaps not, though, if the old comments in the ticket are taken seriously.
For example, I believe that one should be able to find the Lanna script subscript nga <U+1A60 TAI THAM SIGN SAKOT, U+1A26 TAI THAM LETTER NGA> in the word ᨠᩮᩥ᩠᩵ᨦ <koeng> /kɤŋ/ 'half' <U+1A20 TAI THAM LETTER HIGH KA, U+1A6E TAI THAM VOWEL SIGN E, U+1A65 TAI THAM VOWEL SIGN I, U+1A75 TAI THAM SIGN TONE-1, U+1A60 TAI THAM SIGN SAKOT, U+1A26 TAI THAM LETTER NGA>, or the Vietnamese letter ô U+00F4 LATIN SMALL LETTER O WITH CIRCUMFLEX in the word _buộc_ 'to bind' <U+0062, U+0075, U+1ED9 LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW, U+0063>. As far as I can tell, U+1ED9 is not a letter of the Vietnamese alphabet; it is the combination <U+00F4 LATIN SMALL LETTER O WITH CIRCUMFLEX, U+0323 COMBINING DOT BELOW> of Vietnamese letter and tone mark.
One will not find them if one simply applies the string theory of regular expressions to NFD equivalents, as the initial bug report in the ticket suggests doing. A later comment in the ticket suggests that the alphabet for the string theory should be 'the combining sequences'. (I hope there is no theoretical problem from there being an infinite number of them.)
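The two failures can be checked with Python's standard `unicodedata` module. This is just a sketch of the 'normalize both sides to NFD and do a plain string search' approach; the helper names `nfd`, `codepoints` and `naive_nfd_find` are my own for the example and are not taken from ICU or the ticket.

```python
import unicodedata


def nfd(s):
    """Return the NFD form of s."""
    return unicodedata.normalize("NFD", s)


def codepoints(s):
    """Render a string as a list of U+XXXX code points for inspection."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


def naive_nfd_find(needle, haystack):
    """Plain code-point substring search after converting both to NFD.

    Returns the index in the NFD haystack, or -1 if not found.
    """
    return nfd(haystack).find(nfd(needle))


buoc = "bu\u1ed9c"                                   # Vietnamese 'to bind'
o_circumflex = "\u00f4"                              # the letter sought
koeng = "\u1a20\u1a6e\u1a65\u1a75\u1a60\u1a26"       # Tai Tham 'half'
subscript_nga = "\u1a60\u1a26"                       # SAKOT + NGA

# NFD puts the dot below (ccc 220) before the circumflex (ccc 230),
# so <o, U+0302> is no longer a contiguous substring.
print(codepoints(nfd(buoc)))
print(naive_nfd_find(o_circumflex, buoc))            # -1

# NFD moves SAKOT (ccc 9) in front of TONE-1 (ccc 230), which separates
# SAKOT from NGA even though they were adjacent in the original string.
print(codepoints(nfd(koeng)))
print(naive_nfd_find(subscript_nga, koeng))          # -1
```

In the Tai Tham case the subscript nga is a contiguous substring of the string as typed and only stops being one after normalization.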
The Vietnamese search would work if the alphabet in the string theory were *Vietnamese* collation elements. In the text rendering domain, HarfBuzz makes regular expressions work with conversion to NFD by permuting the canonical combining classes on a script-by-script basis. This requires care.
Richard.

