2018-01-28 5:12 GMT+01:00 Richard Wordingham via Unicode < unicode@unicode.org>:
> On Sat, 27 Jan 2018 14:13:40 -0800The theory > of regular expressions (though you may not think that mathematical > regular expressions matter) extends to trace monoids, with the > disturbing exception that the Kleene star of a regular language is not > necessarily regular. (The prototypical example is sequences (xy)^n > where x and y are distinct and commute, i.e. xy and yx are canonically > equivalent in Unicode terms. A Unicode example is the set of strings > composed only of U+0F73 TIBETAN VOWEL SIGN II - there is no FSM that > will recognise canonically equivalent strings). > I don't see why you can't write this as the regular expression: (x | y)* For the Unicode canonical equivalences, this applies to distinct characters that have distinct non-zero combining classes. But of course searching for <LATIN SMALL LETTER A WITH CIRCUMFLEX, COMBINING DOT BELOW> or <LATIN SMALL LETTER A WITH DOT BELOW, COMBINING CIRCUMFLEX > requires transforming it to NFD first as: <LATIN SMALL LETTER A, COMBINING DOT BELOW, COMBINING CIRCUMFLEX> so thet the regexp transforms to: <LATIN SMALL LETTER A> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * ( <COMBINING CIRCUMFLEX> * [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * <COMBINING DOT BELOW> | <COMBINING DOT BELOW> * [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * < COMBINING CIRCUMFLEX> Note that the "complex" set of characters used three times above is finite, it contains all combining characters of Unicode that have a non-zero combining class except above and below, i.e. all Unicode characters whose combining class is not 0, 220 (below) or 230 (above). However, It is too simplified, because the allowed combining classes must occur at most once in each possible non-zero combining class and not arbitrary numbers of them: these allowed combining classs currently are in {1, 7..36, 84, 91, 103, 107, 118, 122, 129, 130, 132, 202, 214, 216, 218, 222, 224, 226, 228, 232..234, 240} whose most member elements are used for very few combining characters (the above and below combining classes are the most populated ones but we exclude them, all the others have 1 to 9 combining characters assigned to them, or 22 characters with cc=7 (nukta), or 32 characters with cc=1 (overlay), or 47 characters with cc=9 (virama). Once again we can refine them also as a regexp, but this is combinatorial because we have 52 combining classes (so we would need to consider the 52! (factorial) alternates). But given the maximum length of what this can match (0 to 52 combining characters: yes it is finite), this is best done by not rewriting this as a regexp but by replacing the final "*" by {1,52}, and then just check each returned match of [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]]{0,52} with a simple scan of these short strings to see that they all have distinct combining classes (this just requires 52 booleans, easily stored in a single 64 bit integer initialized to 0 prior to scan the scan of these small strings). But the theory does not prevent writing it as a regexp (even if it would be extremely long). So a Kleene Star closure is possible and can be efficiently implemented (all depends on the way you represent the "current state" in the FSM: a single integer representing a single node number in the traversal graph is not the best way to do that. This is a valid regexp, the finite state machine DOES have a finite lookahead (the full regexp above will match AT MOST 107 characters including the combining marks, where 107=3+2*52), but this is general to regexps that generally cannot be transformed directly into a FSM with finite lookahead, but a FSM is possible: the regexp first transforms into a simple graph of transitions with a finite number of node (this number is bound to the length of the regexp itself) where there can be multiple states active simultaneously; then a basic transform converts this small graph by transforming nodes into new nodes representing the finite set of the combinations of active states in the first graph : There will be many more nodes, and generally this explodes in size because the transform is combinatorial, and such size explosion has worst perfomance (explosion of the memory needed to representing the new graph with a single state active). So regexp engines use the alternative by representing the current state of traversal of the first simple graph using a stack of active states and transiting them all separately (this can be implemented with a "bitset" whose size in bits is the number of states in the first simple graph, or by using an associative array (dictionnary of boolean properties whose keys are state numbers in the first graph, which can be set or removed: this requires much less memory and it is relatively fast, even if the full current state is not just a single small integer.