On Fri, 15 May 2015 02:10:36 +0200 Philippe Verdy <[email protected]> wrote:
> 2015-05-14 20:13 GMT+02:00 Richard Wordingham < > [email protected]>: > > > On Thu, 14 May 2015 12:58:29 +0200 > > Philippe Verdy <[email protected]> wrote: > > > > > 2015-05-14 9:59 GMT+02:00 Richard Wordingham < > > > [email protected]>: > > > > > > > An elegant formal solution to the Kleene star problem interprets > > > > (\u0323\u0302)* as (\u0323|\u0302)*. However, that is > > > > counter-intuitive > > > > The technical term for this is the 'concurrent iteration' - or at > > least, that's the term used in the 'Book of Traces'. > > > > > For your example "(\u0323\u0302)*" the characters in the > > > alternatives (COMBINING DOT BELOW and COMBINING ACUTE ACCENT), > > > once converted to NFD (which is the same here) are just using at > > > most two distinct non-zero combining classes and no blocker; so > > > it is safe to trasform it to (\u0323|\u0302)* for a first pass > > > matching that will then only check candidate matchs in the second > > > pass. or more efficiently, a second finite state automata (FSA) > > > running in parallel with its own state: > > > > You've forgotten the basic problem. A *finite* state automaton > > cannot count very far; with only n states, it cannot count as far > > as n. > > > > I did not forget it, this is why there's a second pass (or a second > FSA running in parallel to indicate its own accept state). You have > to combine the two states variables to get the final combined state > to determine if it is a final accept state. Your description makes no sense to me as a description of a finite state automaton. Now, a program to check whether a trace matching {\u0323|\u0302)* matches (\u0323\u0302)* is very simple. It just counts the number of times \u0323 occurs and the number of times \u0302 occurs, and returns whether they are equal. The two counters are the key variables (and one could just keep the difference in the counts). However, this is not a finite state automaton. Now, to some extent I am cheating by assuming that the characters are delivered in NFD order. If I did not do this, to construct the non-deterministic finite automaton (NDFA) for the concatenation of two sets / regular expressions, the triples (x, y, n) of (left NDFA state, right NDFA state, highest non-zero ccc assigned to righthand component) would need to be expanded. The third component would become a list of non-zero ccc's - in principle 2^254 values, but in fact rather fewer as not all 255 ccc values are used by Unicode. It is still finite. I prefer to keep the complexity out of the regular expression engine proper. Given a NDFA recognising a set of NFD strings, one can convert it to a deterministic finite automaton (DFA), say X, provided one does not run out of memory or time. One can then 'easily' construct a DFA Y recognising the canonical equivalents of the strings. The state in DFA Y reached by string x is defined to be the state reached by the string to_NFD(x) in DFA X. This method relies on the identity to_NFD(to_NFD(x)z) = to_NFD(xz). This handles the recognition of a string canonically equivalent to \u0323\u0302. (The constructions above are sledge hammers; the NDFAs have many unreachable states.) However, recognising canonical equivalents of (\u0323\u0302)* via an FSM is rather more difficult; to be precise, it cannot be done by an FSM. Richard.

