I don't work with strings, but with what you seem to call "traces", but
that I call sets of states (they are in fact bitsets, which may be
compacted or just stored as arrays of bytes containing just 1 usefull bit,
but which may be a bit faster; byte arrays are just simpler to program).,
in a stack (I'll use bitsets later to make the structure more compact, if
needed, but for now this is fast enough and not memory intensive even for
large regexps with many repetitions with "+/*/{m,n}" or variable parts).
The internal matcher uses NFD, but needs to track the positions in the
original buffered input for returning captured matches.There's some optiomization to reduce the size of the bitsets, by defining classes. The representation of classes in Unicode is more challenging than with plain ASCII or ISO8859-*, for this reason I limit their length (differences between the smallest and highest code point), and over this size the classes are just defined as a sorted string of pairs of codepoints: I can perform a binary search in that string and look at the position: with an even position the character is part of the class, with an odd position, the character is not part of it). Thanks to a previous message you posted, I noted that my code deos not work correctly with Hangul precomposed syllables (I perform the decompoisition to NFD of the input on the fly in the input buffer, but the buffer is incorrectly advanced when there's a match to the next character, and it can skip one or two characters of the original input instead of code points in the NFD transformed input. (I don't have extensive cases for testing Hangul, I have much more for Latin, Greek, Cyrillic and Arabic, but also too few for Hebrew where "pathological" cases of regexps are certainly more likely to occur than in Latin, even with Vietnamese and its frequent double diacritics). For now with the complex cases of replacements, I have no precise syntax defined for specifiying replacements as as simple string with placeholders I just allow these matches to be passed as objects (rather than just strings) to a callback that performs the substitutions itself using the array of captures given by the engine to the callback; I have no idea for now about how to handle the special cases occuring when computing the actual replacements: The callback can insert/delete subsequences everywhere in the input buffer which is limited in size by the extent of $0, plus any intermediate characters when there's a discontinuity, plus their left and right contexts when the match still does not include the full combining sequences (for most uses cases, the left context is empty, but the right context is frequently non-empty and contains all combining characters on over the last base which is part of the match; the callback also does not necessarily have to modify the input buffer it it does not want to perform replacements in it, but in that case the input buffer is readonly and I don't need to feed the contexts which remain empty. There are also left and right context variables for *each* capture group (some of them may be partly or fully in another returned capture group). Finally a question: I suppose that like many programmers you have read the famous "green dragon" book of Sethi/Aho/Ullman books about compilers. I can understand the terminology they use when spoeaking about automatas (and that is found in many other places), but apparently you are using some terms that I have to guess from their context. Good books on the subjext are now becoming difficutlt to find (or they are more expensive now), and too difficult to use on the web (for such very technical topics, it really helps to have a printed copy, that you an annotate, explore, or have beside you instead of on a screen (and printing ebooks is not an option if they are voluminous). May be you have other books to recommend. But finding these books in libraries is now becoming difficult when many are closing or reducing their proposed collections (and I don't like buying books on the Internet). For the rest, I tend to just describe what I've made or used or experimented, even if the terms are not the best ones (some of my references are in French, and dificutl to translate). On difficult topics like this one, I'm not paid to perform research and I can only do that in my spare time from time to time, until I can make something stable enough for a limited use (without experimental features) In the past I could work on such research topic, but now we are pressed to use extisting libraries and not pass lot of time, we sell smaller incremental but limtied improvements and we know what is volutarily limited and left unimplemented. 2015-05-18 23:14 GMT+02:00 Richard Wordingham < [email protected]>: > On Mon, 18 May 2015 22:56:47 +0200 > Philippe Verdy <[email protected]> wrote: > > > Isn't it possible for your basic substitution to transform \uf073 > > into a character class [\uf071\uf072\uf073] that the regexp considers > > as a single entity to check ? > > In that case, backtracking for matching \u0F73*\u0F72 is simpler: > > [\uF071\uF072\uF073]*\u0F72, as it just requires backtracking only > > one character class (instead of one character). > > I'm still waiting for your explanation of how your scheme for European > diacritics (as used in SE Asia) would work. This thread is intended for > the idea of using the regex to decide which character to take as the > next character from the input trace. In the other thread, I'm still not > sure whether you're working with traces or strings. > > Richard. >

