On Sun, 17 May 2015 16:33:15 +0200 Philippe Verdy <[email protected]> wrote:
> 2015-05-16 22:33 GMT+02:00 Richard Wordingham < > [email protected]>: > > > On Sat, 16 May 2015 18:29:18 +0200 > > Philippe Verdy <[email protected]> wrote: > > > > > 2015-05-16 17:02 GMT+02:00 Richard Wordingham < > > > [email protected]>: > > > > > > > There is an annoying error. You appear to assume that U+0302 > > > > COMBINING CIRCUMFLEX ACCENT and U+0303 COMBINING TILDE commute, > > > > but they don't; they have the same combining class, namely > > > > 230. I'm going to assume that 0303 is a typo for 0323. > > > > > > > > > Not a typo, and I did not made the assumption you suppose because > > > I chose then so that they were effectively using the **same** > > > combining class, so that they do not commute. > > > > In that case you have an even worse problem. Neither the trace nor > > the string \u0303\u0302\u0302 matches the pattern > > (\u0302\u0302\0323)*(\u0302\0303)*\u0302, but the string does match > > the regular expression > > (˕\u0302˕\u0302˕\0323|˕\u0302˕\0323˕\u0302|˕\u0302˕\u0302˕\0323)*(˕\u0302˕\ > > 0303|˕\0303˕\u0302)*˕\u0302˕ > > > > You've transformed (\u0302\u0303) into > > (˕\u0302˕\0303|˕\0303˕\u0302), but that is unnecessary and wrong, > > because U+0302 and U+0303 do not commute. > > > Oh right! Thanks for pointing, it was intended you can read it as. > > (˕\u0302˕\u0302˕\0323|˕\u0302˕\0323˕\u0302|˕\u0302˕\u0302˕\0323)*(˕\u0302˕\0303)*˕\u0302˕ > > But my argument remains because of the presence of \0302 in the second > subregexp (which additionally is a separate capture, but here I'm not > concentrating on the impact in numbered captures, but only on the > global capture aka $0) > > > > > It was the key fact of my argument that destroys your > > > argumentation. > > > > However, \u0323\u0323\u0302\u0302\u0302\u0302 does not match the > > corrected new regex > > > > (˕\u0302˕\u0302˕\u0323|˕\u0302˕\0323˕\u0302|˕\u0323˕\u0302˕\u0302)*(˕\u0302˕\u0303)*˕\u0302˕ > > > > Do you claim that this argument is destroyed? If it is irrelevant, > > why is it irrelevant? It shows that your transform does not solve > > the original problem of missed matches. > > > > Why doesn't it solve it? Sorry, my example wasn't quite right. It should have two combining dots below and five circumflexes, not four as I wrote it. I will first explain how my NDnear-FA handles it - I have now removed the generation of the dead end states. Initial states: 0) LLLL0 # Starting the \u0302\u0302\u0323 factor, # implemented as \u0323\u032\u0320 1) LLRM # Completed the zero trip alternative to (\u0302\u0302\u0323)+ # Not actually useful. 2) LRLL0 # Starting the \u0302\u0303 factor 3) LRRM # Completed the zero trip alternative to (\u0302\u0303)+ 4) R0 # Starting the \u0302 factor =0323=00:06:= LLLL0 => LLLL2 # \u0323\u0302\u0302 factor progressed as far as \u0323 =0323=06:012:= LLLL2 => LLLN001220:2:L2 # Progressing 2 successive repeats of factor. # Both have progressed as far as \u0323. # Finiteness would restrict me to, say, 3 repeats # in progress. # The states of the finite DFA are a cross product of 3 copies of # the DFAs for \u0323\u0302\u0302 and 2 copies of the set of relevant # ccc values. By no means all of these states are used. # In the Kleene stars of the regular expression guaranteed by # recognisability, 3 copies caters for the worst case, xyz, where x # has a starter and ends in a non-starter, y consists of non-starters # with the same canonical combining class, and z starts with # non-starter and contains a starter, e.g. # x = \u0f40\u0f74, y = \u0f7a\u0f7a\u0f7a, z = \u0f71\u0f42 # to_NFD(xyz) = \u0f40\u0f71\u0f7a\u0f7a\u0f7a\u0f74\u0f42 =0302=012:018:= LLLN001220:2:L2 => LLLN001220:4:L2 # Still progressing two factors # First has progressed to \u0323\u0302 and second to # \u0323. The other way round has been pruned by the # automated observation that if \u0302 is blocked from # first factor, the factor cannot be completed. =0302=018:024:= LLLN001220:4:L2 => LLLN001220:M:L2 # Completed the first factor LLLN001220:4:L2 => LLLL2 # As first factor is complete, remove it from # consideration and relabel second factor as # first. =0302=024:030:= LLLL2 => LLLL4 # \u0323\0302\u0302 completed as far as \u0323\u0302 =0302=030:036:= LLLL4 => LLLLM # \u0323\u0302\u0302 is complete. LLLL4 => LRLL0 # So start \u0302\u0303 factor. LLLL4 => LRRM # Alternatively, completed the zero trip option of # (\u0302\u0303)* LLLL4 => R0 # Or, we have progressed as far as the final \u0302 LLLL4 => LLLL0 # Or, start another \u0323\u0302\u0302 =0302=036:042:= LRLL0 => LRLL2 # Got as far as \u0302 in \u0302\u0303 R0 => RM (match) # Or completed the final \u0302. End marker is at 042:OVF Could you please talk me through how your system recognises the string \u0323\u0323\u0302\u0302\u0302\u0302\u0302 as matching the regex. I can't work out how it is supposed to work from your description. Richard.

