2015-05-16 22:33 GMT+02:00 Richard Wordingham <[email protected]>:
> On Sat, 16 May 2015 18:29:18 +0200
> Philippe Verdy <[email protected]> wrote:
>
> > 2015-05-16 17:02 GMT+02:00 Richard Wordingham
> > <[email protected]>:
> >
> > > There is an annoying error. You appear to assume that U+0302
> > > COMBINING CIRCUMFLEX ACCENT and U+0303 COMBINING TILDE commute, but
> > > they don't; they have the same combining class, namely 230. I'm
> > > going to assume that 0303 is a typo for 0323.
> >
> > Not a typo, and I did not make the assumption you suppose, because I
> > chose them so that they were effectively using the **same** combining
> > class, so that they do not commute.
>
> In that case you have an even worse problem. Neither the trace nor the
> string \u0303\u0302\u0302 matches the pattern
> (\u0302\u0302\u0323)*(\u0302\u0303)*\u0302, but the string does match the
> regular expression
> (˕\u0302˕\u0302˕\u0323|˕\u0302˕\u0323˕\u0302|˕\u0302˕\u0302˕\u0323)*(˕\u0302˕\u0303|˕\u0303˕\u0302)*˕\u0302˕
>
> You've transformed (\u0302\u0303) into (˕\u0302˕\u0303|˕\u0303˕\u0302),
> but that is unnecessary and wrong, because U+0302 and U+0303 do not
> commute.

Oh right! Thanks for pointing that out; it was intended to be read as:

(˕\u0302˕\u0302˕\u0323|˕\u0302˕\u0323˕\u0302|˕\u0302˕\u0302˕\u0323)*(˕\u0302˕\u0303)*˕\u0302˕

But my argument remains, because of the presence of \u0302 in the second
subregexp (which is additionally a separate capture, but here I am not
concentrating on the impact on numbered captures, only on the global
capture aka $0).

> > It was the key fact of my argument that destroys your argumentation.
>
> However, \u0323\u0323\u0302\u0302\u0302\u0302 does not match the
> corrected new regex
>
> (˕\u0302˕\u0302˕\u0323|˕\u0302˕\u0323˕\u0302|˕\u0323˕\u0302˕\u0302)*(˕\u0302˕\u0303)*˕\u0302˕
>
> Do you claim that this argument is destroyed? If it is irrelevant, why
> is it irrelevant? It shows that your transform does not solve the
> original problem of missed matches.

Why doesn't it solve it? Note that the notation with tacks is just the
first transform. Of course you can optimize it by factorizing the common
prefixes in each alternative. In the following, the 1st and 4th tacks have
some common followers in the lists of characters or character classes they
expect (for advancing to the next tack), but the 2nd and 5th tacks expect
different followers:

(˕\u0302˕\u0302˕\u0323|˕\u0302˕\u0323˕\u0302|˕\u0323˕\u0302˕\u0302)*(˕\u0302˕\u0303)*˕\u0302˕

OK, I understand the need for "counting" the characters present in regexps
when they share the same combining classes, but a single count does not
work correctly: in fact you have to keep a counter for each distinct
combining character with a non-zero combining class, recording how it
contributes to the total length of the "star" group. These characters do
not necessarily contribute to the same total when the regexp specifies them
multiple times. A simple measurement of the total length of the group is
evidently not enough: each counter (counter[c]) must be an exact multiple
of the number of occurrences of that combining character (c) in the
original untransformed content of each alternative of the star group, and
the second factor n of this multiple must be identical for all counters.

The total length is, in pseudo-code:

    { sum = 0; for (c: v in counter) sum += v; return sum; }

but it has no use by itself.
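As an aside, since everything here depends on which of these marks actually
commute under canonical equivalence: this can be checked mechanically. The
following is only an illustrative sketch of my own, relying on nothing more
than Python's standard unicodedata module; it confirms that U+0323 (class
220) commutes with U+0302 (class 230), while U+0302 and U+0303 (both class
230) keep their relative order:

    import unicodedata

    # Canonical combining classes of the three marks discussed above.
    print(unicodedata.combining('\u0302'))  # 230, COMBINING CIRCUMFLEX ACCENT
    print(unicodedata.combining('\u0303'))  # 230, COMBINING TILDE
    print(unicodedata.combining('\u0323'))  # 220, COMBINING DOT BELOW

    # Marks with different classes commute: canonical ordering moves the
    # lower class (U+0323) before the higher one (U+0302).
    print(unicodedata.normalize('NFD', 'a\u0302\u0323') == 'a\u0323\u0302')  # True

    # Marks with the same class do not commute: the two orders of U+0303
    # and U+0302 are not canonically equivalent.
    print(unicodedata.normalize('NFD', 'a\u0303\u0302') == 'a\u0302\u0303')  # False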
Now, if the numbers of occurrences of each combining character in the
(non-repeated) original untransformed alternative are in mustoccur[], the
check to perform is this pseudo-code (a runnable sketch of it is appended
at the end of this message):

    var n = null;
    foreach (c: m in mustoccur) {
      checkthat(counter[c] % m == 0);
      if (n == null) n = counter[c] / m;
      else checkthat(counter[c] / m == n);
    }

> > Reread carefully and use the example string I gave and don't assume I
> > wanted to write u0323 instead of u0303.
>
> I'm not at all sure what your example string is.

My example was the original regexp, without the notation tacks:

(\u0302\u0302\u0323)*(\u0302\u0303)*\u0302

It exposes some of the critical difficulties, first for returning correct
global matches, but then also for captures, and for the effect of the
"ungreedy" option of Perl (and of PCRE working in Perl-compatible mode or
in extended mode) and of most regexp engines (whose default behavior is
"greedy"): the "ungreedy" option causes significant slowdowns, with
additional rollbacks or more work to maintain an efficient backtracking
state directly in the current state of the automaton (if you attempt to
use deterministic rules everywhere it is possible). But we know that this
is not possible for all regexps in general, otherwise regexp engines would
just be simple LR(n) parsers with bounded n, or even simpler LALR parsers
like Bison/Yacc but without their backtracking support for "shift"/"reduce"
ambiguities. These LALR parsers are also greedy by default and resolve
ambiguities by "shifting" first, leaving the code to determine what to do
when, after shifting, there is a parse error caused by unexpected values;
but LALR parsers have no clean way to roll back to the last ambiguous
shift/reduce state with a special match rule, and they do not support
trying "reduce" first to get the "ungreedy" behavior, as they cannot
return to that state to choose the "shift" alternative.

That is why, for a long time, lexers have been written with regexps, and
syntactic scanners preferably with LALR parsers, which cannot work alone
without a separate lexer. But using LALR parsers does not work with common
languages like Fortran; it works for parsing languages like C/C++ because
they are specified so that shift/reduce ambiguities are always resolved by
"shift" (i.e. the greedy behavior). Very few parser generators support
both working modes (except the excellent PCCTS, which I have used since
the early 1990's, when it was still not rewritten in Java and was a
student project at Purdue University; it combines all the advantages of
regexps and LR parsers, with very clean control of backtracking, and it
also supports layered parsing with multiple local parsers if needed; even
without writing any piece of output code, you can describe the full
syntactic and lexical rules of almost all languages in a single
specification).
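P.S. As promised above, here is a small runnable version of the
per-character counter check, as a Python sketch. The names count_marks and
check_counters are mine and purely illustrative, following the pseudo-code
given earlier; this is not how any actual regexp engine implements it:

    import unicodedata
    from collections import Counter

    def count_marks(text):
        # Per-character counters, restricted to combining characters
        # (non-zero canonical combining class), as discussed above.
        return Counter(c for c in text if unicodedata.combining(c) != 0)

    def check_counters(counter, mustoccur):
        # counter[c]   : how many times combining character c was consumed
        #                while matching the "star" group
        # mustoccur[c] : how many times c occurs in one (non-repeated) copy
        #                of the untransformed content of the group
        # Accept only if every counter[c] is the same exact multiple n of
        # its mustoccur[c].
        n = None
        for c, m in mustoccur.items():
            v = counter.get(c, 0)
            if v % m != 0:
                return False
            if n is None:
                n = v // m
            elif v // m != n:
                return False
        return True

    # One iteration of the group (\u0302\u0302\u0323) uses two U+0302
    # and one U+0323:
    mustoccur = {'\u0302': 2, '\u0323': 1}

    # Marks from two iterations of the group, reordered: counters are
    # {U+0302: 4, U+0323: 2}, both exact multiples (n = 2) of mustoccur.
    print(check_counters(count_marks('\u0323\u0302\u0302\u0323\u0302\u0302'),
                         mustoccur))  # True

    # Three U+0302 but only one U+0323: 3 is not a multiple of 2,
    # so the check fails.
    print(check_counters(count_marks('\u0302\u0302\u0302\u0323'),
                         mustoccur))  # False

Of course this arithmetic check is only a necessary condition: it does not
tell you, by itself, whether the consumed span is really a valid trace of
the star group, which is exactly the difficulty discussed above.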

