I made an error for the character class notation: "{?optionalquantifier[class]}" should be just "{optionalquantifier[class]}"...
So "{?[abc]}" contains 1 item "[abc]" to choose from in any order, it is not quantified explicitly so it matches by default 1 or more, but as there's only one item, it will match just one "[abc]" But "{[abc]}" contains 3 items from the class "[abc]" to choose from in any order, so it will match "a", "b", "c", "ab", "ba", "ac", "ca", "abc", "acb", "bac", "bca", "cab" or "cba". And "{{1}[abc]}" is quantified to match one and only one item, and is equivalent to "[abc]" and matches only "a", "b", or "c" And "{{0}[abc]}" is quantified to match zero and only zero item (the items are not relevant) and will never match anything, just like "{{0}a|b|c}" or "{{0}}". And "{{2}[abc]}" or "{{2,2}[abc]}" is quantified to match two and only two items from the character class, and matches only "ab", "ba", "ac", or "ca", it is equivalent to "{{2,2}a|b|c}" or "{{2}a|b|c}". With that extension you can build the necessary regexps to match canonical equivalent strings with a finite regexp. 2018-01-29 7:16 GMT+01:00 Philippe Verdy <verd...@wanadoo.fr>: > > > 2018-01-28 23:44 GMT+01:00 Richard Wordingham via Unicode < > unicode@unicode.org>: > >> On Sun, 28 Jan 2018 20:29:28 +0100 >> Philippe Verdy via Unicode <unicode@unicode.org> wrote: >> >> > 2018-01-28 5:12 GMT+01:00 Richard Wordingham via Unicode < >> > unicode@unicode.org>: >> > >> > > On Sat, 27 Jan 2018 14:13:40 -0800The theory >> > > of regular expressions (though you may not think that mathematical >> > > regular expressions matter) extends to trace monoids, with the >> > > disturbing exception that the Kleene star of a regular language is >> > > not necessarily regular. (The prototypical example is sequences >> > > (xy)^n where x and y are distinct and commute, i.e. xy and yx are >> > > canonically equivalent in Unicode terms. A Unicode example is the >> > > set of strings composed only of U+0F73 TIBETAN VOWEL SIGN II - >> > > there is no FSM that will recognise canonically equivalent strings). >> > > >> > >> > I don't see why you can't write this as the regular expression: >> > (x | y)* >> >> Because xx does not match. >> >> In principle, it can be done iteratively thus: >> >> 1) Look for sequences of x's and y's - your (x | y) * >> 2) Discard matches from (1) where the number of x's and y's are equal. >> >> However, the second step cannot be implemented by a *finite* state >> machine. >> >> > For the Unicode canonical equivalences, this applies to distinct >> > characters that have distinct non-zero combining classes. >> >> Those of us who've looked at optimising collation by reducing >> normalisation will recognise U+0F73 TIBETAN VOWEL SIGN II as, in >> theory, a source of many problems. >> >> > But of course searching for >> > >> > <LATIN SMALL LETTER A WITH CIRCUMFLEX, COMBINING DOT BELOW> or >> > <LATIN SMALL LETTER A WITH DOT BELOW, COMBINING CIRCUMFLEX > >> > >> > requires transforming it to NFD first as: >> >> That wasn't I had in mind. What I had in mind was accepting the >> propositions that the string <LATIN SMALL LETTER A, COMBINING DOT >> BELOW, COMBINING CIRCUMFLEX> contains both LATIN SMALL LETTER A WITH >> CIRCUMFLEX and LATIN SMALL LETTER A WITH DOT BELOW. >> >> > <LATIN SMALL LETTER A, COMBINING DOT BELOW, COMBINING CIRCUMFLEX> >> > >> > so thet the regexp transforms to: >> > >> > <LATIN SMALL LETTER A> [[ [^[[:cc=0:]]] - >> > [[:cc=above:][:cc=below:]] ]] * ( <COMBINING CIRCUMFLEX> * >> > [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * <COMBINING DOT >> > BELOW> | <COMBINING DOT BELOW> * [[ [^[[:cc=0:]]] - >> > BELOW> [[:cc=above:][:cc=below:]] >> > ]] * < COMBINING CIRCUMFLEX> >> >> If everything is converted to NFD, the regular expressions using traces >> can be converted to frequently unintelligible regexes on strings; the >> behaviour of the converted regex when faced with an unnormalised string >> is of course irrelevant. >> >> In the search you have in mind, the converted regex for use with NFD >> strings is actually intelligible and simple: >> >> <LATIN SMALL LETTER A> >> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * >> <COMBINING DOT BELOW> >> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * >> <COMBINING CIRCUMFLEX> >> >> Informal notation can simplify the regex still further. >> >> There is no upper bound to the length of a string matching that regex, >> > > Wrong, you've not read what followed immediately that commented it > already: it IS bound exactly because you cannot duplicate the same > combining class, and there's a known finite number of them for acceptable > cases: if there's any repetition, it will always be within that bound. But > it not necessay to expand all the combinations of combining classes to all > their possible ordering of occurence (something that a classic regexp > normally requires by expecting a specific order). > > One way to solve it would have to have (generic) regexp extension allowing > to specify a combination of one or more of several items in a choice list > in any order, but never more than one occurence of each of item. This is > possible using a rule with boolean flags of presence, one boolean for each > item in the choice list. > > Something like {?a|b|c|d} matching zero or more (or all of them) of > a,b,c,d (these can be subregexps) in any order, and {?+a|b|c|d} matching > one or more, and {?{m,n}a|b|c|d} matching betwen m and n of them (in any > order in all cases) > So that {?a|b|c|d}{1,1} is the same as (a|b|c|d) but without the capture, > i.e. (?:a|b|c|d), and {?{m,n}a} is the same as a{m,n}, and {?+a} is the > same as a, and {?*a} is the same as a? > > Which can also be written respectrively as {?*[abcd]}, {?+[abcd]} and > {?{m,n}[abcd]) > if the items of the choice list are characters that can be compacted with > the classic "character class" notation [abcd]. > > In all these the "{?quantifier list}" notation is always bound by the > number of items in the list (independantly of the quantifier, and if > individual items in the list are bound in length, the whole will be bound > by the sum of their lengths. So even if the quantifier is higher than than > the number of items in the list, it will be capped: "{?{1000}a}" will only > match "a", and "{?{1000}}" will never match anything (because the list > is empty: the specified higher bound 1000 is capped to 0 but the specified > lower bound 1000 is capped to 1 and this is impossible) and is also > equivalent to {?} where the min-max specified bounds are 1 by default, but > capped to 1,0 > > > > > >