On Sat, 16 May 2015 18:29:18 +0200 Philippe Verdy <[email protected]> wrote:
> 2015-05-16 17:02 GMT+02:00 Richard Wordingham < > [email protected]>: > > > There is an annoying error. You appear to assume that U+0302 > > COMBINING CIRCUMFLEX ACCENT and U+0303 COMBINING TILDE commute, but > > they don't; they have the same combining class, namely 230. I'm > > going to assume that 0303 is a typo for 0323. > > > Not a typo, and I did not made the assumption you suppose because I > chose then so that they were effectively using the **same** combining > class, so that they do not commute. In that case you have an even worse problem. Neither the trace nor the string \u0303\u0302\u0302 matches the pattern (\u0302\u0302\0323)*(\u0302\0303)*\u0302, but the string does match the regular expression (˕\u0302˕\u0302˕\0323|˕\u0302˕\0323˕\u0302|˕\u0302˕\u0302˕\0323)*(˕\u0302˕\ 0303|˕\0303˕\u0302)*˕\u0302˕ You've transformed (\u0302\u0303) into (˕\u0302˕\0303|˕\0303˕\u0302), but that is unnecessary and wrong, because U+0302 and U+0303 do not commute. > It was the key fact of my argument that destroys your argumentation. Which argument? Restoring the \u303, the fact that remains that \u0323\u0323\u0302\u0302\u0302\u0302 is canonically equivalent to \u0302\u0302\u0323\u0302\u0323\u0302, which clearly matches the corrected old regex (\u0302\u0302\u0323)*(\u0302\u0303)*\u0302. However, \u0323\u0323\u0302\u0302\u0302\u0302 does not match the corrected new regex (˕\u0302˕\u0302˕\u0323|˕\u0302˕\0323˕\u0302|˕\u0323˕\u0302˕\u0302)*(˕\u0302˕\u0303|˕\u0303˕\u0302)*˕\u0302˕ Do you claim that this argument is destroyed? If it is irrelevant, why is it irrelevant? It shows that your transform does not solve the original problem of missed matches. > Reread carefully and use the example string I gave and don't assume I > wanted to write u0323 instead of u0303. I'm not at all sure what your example string is. I ran my program to watch its progression with input \u0323\u0323\u0302\u0302, which does not match the pattern, and attach the outputs for your scorn. I have added comments started by #. # NDE = new dead end - I could tweak the program so this state is not entered. # NDE! = new dead end that might not be easy to avoid. # ODE = old dead end - derived from a state already labelled ODE or NDE. # ODE! = old dead end - derived from a state already labelled ODE! or NDE!. Here are the run outputs, with blank lines added to assist formatting. $ ./regex -b '(\u0302\u0302\u0323)*(\u0302\u0303)*\u0302' '\u0323\u0302\u0323\u0302' # ignore line wrapping above. Examining /home/richard/unicode/UCD/7.00/PropertyAliases.txt. Examining /home/richard/unicode/UCD/7.00/PropertyValueAliases.txt. Examining /home/richard/unicode/UCD/7.00/SpecialCasing.txt. Examining /home/richard/unicode/UCD/7.00/Scripts.txt. Examining /home/richard/unicode/UCD/7.00/PropList.txt. Simple Unicode regex "\u0323\u0302\u0302" Simple ASCII regex "" # I construct A* = (|A+) Simple Unicode regex "\u0302\u0303" Simple Unicode regex "\u0302" Initial states: 0) LLLL0 # The states are named according to a hierarchy of regexes. # LL = regex (\u0302\u0302\u0323)* # LLL = regex (\u0302\u0302\u0323)+ # LLLL = regex \u0302\u0302\u0323. # This is implemented as 'Simple Unicode regex "\u0323\u0302\u0302"'. # 0 means about to compare with character at offset 0, i.e. 0 1) LLRM # LLR = Empty string regex. # M = matched 2) LRLL0 # LR = regex (\u0302\u0303)* # LRL = regex (\u0302\u0303)+ # LRLL = regex \u0302\u0303 3) LRRM # LRR = Empty string regex. 4) R0 # R = regex \u0302 =0323=00:06:= # Get U+0323 from whole (=0) at byte 0 of argument LLLL0 => LLLL2 LLLL0 => LLLN001220:0:L2 # NDE! =0323=012:018:= # Note that string is input in NFD order. LLLL2 => LLLN001220:2:L2 # Now running LLLL and LLLR, whose states have relative names 2 and L2. # LLLR is a clone of LLL. # This recursion enables the recognition of unrecognisable Kleene # stars. It makes the automaton non-finite. # 001 is length in hex of name of relative state of LLLL # 220 means non-starters of ccc <= 220 will not be fed to LLLL LLLN001220:0:L2 => LLLN001220:0:N001220:2:L2 # ODE! =0302=06:012:= LLLN001220:2:L2 => LLLN001220:4:L2 LLLN001220:2:L2 => LLLN001230:2:L4 # NDE LLLN001220:2:L2 => LN00D230:LN001220:2:L2:LL2 # NDE # L = regex (\u0302\u0302\u0323)*(\u0302\u0303)* # NDE LLLN001220:2:L2 => LN00D230:LN001220:2:L2:LN001230:0:L2 # NDE LLLN001220:2:L2 => N00E230:LLN001220:2:L2:M # NDE LLLN001220:0:N001220:2:L2 => LLLN001230:0:N001220:4:L2 # ODE! LLLN001220:0:N001220:2:L2 => LLLN001230:0:N001230:2:L4 # ODE! LLLN001220:0:N001220:2:L2 => LN017230:LN001220:0:N001220:2:L2:LL2 # ODE! LLLN001220:0:N001220:2:L2 => # Line-break is email artefact. LN017230:LN001220:0:N001220:2:L2:LN001230:0:L2 # ODE! LLLN001220:0:N001220:2:L2 => N018230:LLN001220:0:N001220:2:L2:M # ODE! =0302=018:024:= LLLN001220:4:L2 => LLLN001220:M:L2 # Redundant - should purge somehow. LLLN001220:4:L2 => LLLL2 # Regex LLLL 'recognised' - rename LLLRL as LLLL. LLLN001220:4:L2 => LLLN001230:4:L4 # NDE LLLN001220:4:L2 => LN00D230:LN001220:4:L2:LL2 # NDE LLLN001220:4:L2 => LN00D230:LN001220:4:L2:LN001230:0:L2 # NDE LLLN001220:4:L2 => N00E230:LLN001220:4:L2:M # NDE LLLN001230:2:L4 => LLLN001230:2:LM # ODE LLLN001230:2:L4 => LLLN001230:2:L0 # ODE LLLN001230:2:L4 => LN00D230:LN001230:2:L4:LL2 # ODE LLLN001230:2:L4 => LN00D230:LN001230:2:L4:LN001230:0:L2 # ODE LLLN001230:2:L4 => N00E230:LLN001230:2:L4:M # ODE LN00D230:LN001220:2:L2:LL2 => LN00D230:LN001220:2:L2:LN001230:2:L2 # ODE LN00D230:LN001220:2:L2:LL2 => N019230:N00D230:LN001220:2:L2:LL2:M # ODE LN00D230:LN001220:2:L2:LN001230:0:L2 => # Line-break is e-mail artefact LN00D230:LN001220:2:L2:LN001230:0:N001230:2:L2 # ODE LN00D230:LN001220:2:L2:LN001230:0:L2 => # Line-break is email artefact N023230:N00D230:LN001220:2:L2:LN001230:0:L2:M # ODE LLLN001230:0:N001220:4:L2 => LLLN001230:0:N001220:M:L2 # ODE! LLLN001230:0:N001220:4:L2 => LLLN001230:0:L2 # ODE! LLLN001230:0:N001220:4:L2 => LLLN001230:0:N001230:4:L4 # ODE! LLLN001230:0:N001220:4:L2 => LN017230:LN001230:0:N001220:4:L2:LL2 # ODE! LLLN001230:0:N001220:4:L2 => # Line-break is e-mail artefact. LN017230:LN001230:0:N001220:4:L2:LN001230:0:L2 # ODE! LLLN001230:0:N001220:4:L2 => N018230:LLN001230:0:N001220:4:L2:M # ODE! LLLN001230:0:N001230:2:L4 => LLLN001230:0:N001230:2:LM # ODE! LLLN001230:0:N001230:2:L4 => LLLN001230:0:N001230:2:L0 # ODE! LLLN001230:0:N001230:2:L4 => LN017230:LN001230:0:N001230:2:L4:LL2 # ODE! LLLN001230:0:N001230:2:L4 => # Line-break is e-mail artefact LN017230:LN001230:0:N001230:2:L4:LN001230:0:L2 # ODE! LLLN001230:0:N001230:2:L4 => N018230:LLN001230:0:N001230:2:L4:M # ODE! LN017230:LN001220:0:N001220:2:L2:LL2 => LN017230:LN001220:0:N001220:2:L2:LN001230:2:L2 # ODE! LN017230:LN001220:0:N001220:2:L2:LL2 => N023230:N017230:LN001220:0:N001220:2:L2:LL2:M # ODE! LN017230:LN001220:0:N001220:2:L2:LN001230:0:L2 => LN017230:LN001220:0:N001220:2:L2:LN001230:0:N001230:2:L2 # ODE! LN017230:LN001220:0:N001220:2:L2:LN001230:0:L2 => N02D230:N017230:LN001220:0:N001220:2:L2:LN001230:0:L2:M # ODE! End marker is at 024:OVF > And you'll see that backtracing is necessary for this case (EVEN if > you don't care about capture groups but you are only interested in > the global capture $0). What I see is the desirability of some optimisation, but no problem in principle. Now I might see something different with your intended example - but until I see it I think my examination would be overwhelmed by dead-end state propagations. If you are making the point that a backtracking automaton might need to backtrack, then I won't dispute that point. Richard.

