On Sat, 16 May 2015 02:41:33 +0200 Philippe Verdy <[email protected]> wrote:
> With a NFA, the representation is completely different, The regexp > > (\u0302\u0302\0323)*(\u0302\0303)*\u0302 > > is just transformed into: > > (˕\u0302˕\u0302˕\0323|˕\u0302˕\0323˕\u0302|˕\u0302˕\u0302˕\0323)*(˕\u0302˕\ > 0303|˕\0303˕\u0302)*˕\u0302˕ > > where I noted with the "tack" the 15 relative positions **in this new > regex** where there's a need to check if the input matches a > character or character class. The old regex is a pattern for use with the trace monoid of Unicode strings under canonical equivalence. >From its appearance, I presume the new regex is intended for use with strings, and that the third run of codepoints is meant to be ˕\u0323˕\u0302˕\u0302 rather than a repeat of ˕\u0302˕\u0302˕\0323. There is an annoying error. You appear to assume that U+0302 COMBINING CIRCUMFLEX ACCENT and U+0303 COMBINING TILDE commute, but they don't; they have the same combining class, namely 230. I'm going to assume that 0303 is a typo for 0323. \u0323\u0323\u0302\u0302\u0302\u0302 is canonically equivalent to \u0302\u0302\u0323\u0302\u0323\u0302, which clearly matches the corrected old regex (\u0302\u0302\u0323)*(\u0302\u0323)*\u0302. However, \u0323\u0323\u0302\u0302\u0302\u0302 does not match the corrected new regex (˕\u0302˕\u0302˕\u0323|˕\u0302˕\0323˕\u0302|˕\u0323˕\u0302˕\u0302)*(˕\u0302˕\u0323|˕\u0323˕\u0302)*˕\u0302˕ This example goes straight to the problem with the recommended way of using string-based regular expression engines. Using NFD throughout works fine if one is working with whole words. If fails if one is working with sequences of combining marks and there is any complexity. > But it's true that the allowed reorderings implied by canonical > equivalences (and those that are NOT allowed because they are > blocked) are really challenging ! They are not challenging at all. Once you have eliminated the precomposed characters and characters with singleton decompositions, you are left with the trace monoid of Unicode strings under canonical equivalence. All you have to remember is that two characters commute if and only if they have different positive canonical combining classes. Richard.

