2015-05-14 9:59 GMT+02:00 Richard Wordingham < [email protected]>:
> An elegant formal solution to the Kleene star problem interprets > (\u0323\u0302)* as (\u0323|\u0302)*. However, that is > counter-intuitive Yes it is problematic: (ab)* is not the same as (a|b)* as this requires matching pairs of letters "ab" in that order in the first expression, but random strings of "a" and "b" i nthe second one (so the second matches *more* input samples. Even if you consider canonical equivalences (where the relative order of "ab" does not matter for example because they have distinct non-zero canonical) this does not mean that "a" alone will match in the first expression "(ab*)", even though it MUST match in "(a|b)*". So the solution is just elegant to simplify the first level of analysis of "(ab)*" by using "(a|b)*" instead. But then you need to perform a second pass on the match to make sure it is containing only complete sequences "ab" in that order (or any other order if they are all combining with a non-zero combining class) and no unpaired "a" or "b". Such transform using two passes should only be made when subregexps within a "(...)*" contain only alternatives (converted to NFD) such then each of them contains ONLY combining characters with distinct non-zero combining classes. If one of the alternatives "ab" contains any character with combining class 0 or if they have blockers with identical non-zero combining classes, we cannot use this transform. But this transform using two passes is stil elegant: the alternatives where we can use it and that requires a second pass have a bounded length (it's impossible for them to be longer than 255 codepoints given there cannot be more than 255 *distinct* non-zero combining classes. But even in this case, the current UCD currently uses a much lower number of non-zero combining classes, so this limit is even lower: the substrings where this transform is possible will be extremely short and a second pass on them will be extremely fast (using very small string buffers that can stay in memory). For your example "(\u0323\u0302)*" the characters in the alternatives (COMBINING DOT BELOW and COMBINING ACUTE ACCENT), once converted to NFD (which is the same here) are just using at most two distinct non-zero combining classes and no blocker; so it is safe to trasform it to (\u0323|\u0302)* for a first pass matching that will then only check candidate matchs in the second pass. or more efficiently, a second finite state automata (FSA) running in parallel with its own state: in your example this second FSA just has 2 states: the initial state 0 which is also the final/accept state, and state 1 after matching one character of the pair. When you reach the point where matching (\u0323 | \u0302)* with the first level of analysis would terminate, you just need to check the state of the 2nd FSA to see if it is also in the initial/final/accept state 0 (otherwise this is not an valid accept state for the untransformed (\u0323\u0302)* regexp. However, the most difficult part for regexps supporting canonical equivalence is about what we can do to return submatches! they are not necessarily contiguous in the input stream. You can still return a matching substring but if you use it for performing search/replace operations, it becomes difficult to know where to place the replacement, when that replacement string (even if it was converted first to NFD) may also contain combining characters. Or even worse if the replacement contains some blockers that will be inserted in the middle of the non-replaced text (wnad where can we safely place the remaining characters in the middle of the match but that are not part of the match itself ??? One solution is to not exclude these characters in the middle of a match and return them too. It's up to the replacement function to check their existence: The regexp engine can just provide an additional index of characters in the returned matched substring, that are in fact not part of the actual match but present in the middle, instead of just the substring for the match. Or it can return just the exact matching substring, but also an index array containing their relative position in the actual input string (in standard matches those indexes would be the sequence of intergers 0 to N-1 where N is the length of the matched substring, but if the sequence is discontinusous in the input, the sequence will still be growing with some steps higher than 1, leaving some holes (the last index in that sequence will be equal to or higher than N).

