On Mon, 18 May 2015 21:05:49 +0200 Philippe Verdy <[email protected]> wrote:
> 2015-05-18 20:35 GMT+02:00 Richard Wordingham < > [email protected]>: > > > The algorithm itself should be tractable - Mark Davis has published > > an algorithm to generate all strings canonically equivalent to a > > Unicode string, and what we need might not be so complex. > > > Even this algorithm from Mark Davis will fail in this case: How so? The regexp is \u0F73*, which is converted to a non-capturing (\u0F71\u0F72)*. Given a string 0F40 0F71 0F73 0F42 representing the trace, matching will fail at 0F40 and an attempt will be made starting at the 0F71. The input string handling part will then present a run of three non-starters: \u0F71 \u0F71 \u0F72 I think the process is even simpler than I first thought. The engine will look for a match for \u0F71, and take it from this list, leaving \u0F71 \u0F72. It will then look for a match for \u0F72, and take it form the list, leaving \u0F71. It will then look for a match for \u0F71, and take it from the list. It will then look for a match for \u0F72. It will fail, and then back track, disgorging the \0F71. The input 'stream' now looks like \u0F71 \u0F42. This will match nothing; it is after the matching substream. The matching substring is: None of 0F40, all of 0F71, the second part of 0F72 and none of 0F42. Its value, as a trace, is 0F71 0F72. > - You can use it easily to transform a regexp containing (\u0F73) > into a regexp containing (\u0F73|\u0F71\u0F72|\u0F71\u0F72) That is *not* what I am suggesting. The regex needs decomposing, but no other transformations. It is the string representing the input trace that is expanded. > - But this leaves the same problem for unbounded repetititions with > the "+" or "*" or "{m,}" operators. Not at all - that is the beauty of the scheme. On the regex side, \u0F73* is as straight forward as non-capturing (\u0061\u0062)*. Putting back the unused fragments of the run of non-starters in the input trace is the most difficult part. > Now all the problem is how to do the backtracking, Yes, that may be more difficult than I thought. Comparing against literal characters is simple, but it may be more complicated when matching against a list of alternative characters. Back-tracking schemes may not be set up to try the next character on a list of alternatives, e.g. so that pattern (\u0f72|\u0f71)\u0f72 matches input string 0F71 0F72. The alternative (\u0f72|\u0f71) would first take the 0F72, and only on backtracking would it take the 0F71 instead. This is an issue with traces that does not appear with strings. Richard.

