On Thu, 14 May 2015 12:25:06 -0500 Stephen E Slevinski Jr <[email protected]> wrote:
> On 5/14/15 5:58 AM, Philippe Verdy wrote: > > Yes it is problematic: (ab)* is not the same as (a|b)* as this > > requires matching pairs of letters "ab" in that order in the first > > expression, but random strings of "a" and "b" i nthe second one (so > > the second matches *more* input samples. > > > > Even if you consider canonical equivalences (where the relative > > order of "ab" does not matter for example because they have > > distinct non-zero canonical) this does not mean that "a" alone will > > match in the first expression "(ab*)", even though it MUST match in > > "(a|b)*". > > > > So the solution is just elegant to simplify the first level of > > analysis of "(ab)*" by using "(a|b)*" instead. But then you need to > > perform a second pass on the match to make sure it is containing > > only complete sequences "ab" in that order (or any other order if > > they are all combining with a non-zero combining class) and no > > unpaired "a" or "b". > > If you always want to find "a" and "b" in a pair without regard to > the order, how about the regex: > ((ab)|(ba))* In NFD, the language (\u0323\u0302)* consists of ε (empty string) \u0323\u0302 \u0323\u0323\u0302\u0302 \u0323\u0323\u0323\u0302\u0302\u0302 \u0323\u0323\u0323\u0323\u0302\u0302\u0302\u0302 and so on. Therefore the finite automaton implied by your regex won't work. No regular expression will work. That is mathematically proven. What I have listed above is the standard example of a 'non-regular language', a set of strings that cannot be defined by a finite site of regular expressions. Richard.

