On 2018-11-13 11:37, Geoff Canyon via use-livecode wrote:
> I understand (generally) the complexity of comparison, but that's not the
> speed issue causing this discussion. Most of the proposed solutions are
> using nearly the same operators/functions for comparison, or at least the
> same comparison is being done. Instead, the problem is a Foolish Line
> Painter problem: with single-byte characters, finding all occurrences of
> one string in another by repeatedly using offset() with charsToSkip
> scales well; but with multi-byte characters, the penalty for repeatedly
> calculating out longer and longer skips is exponential.

The actual reason it's not linear when you have Unicode strings (the behaviour isn't exponential, by the way - cubic, I think) is that working out the character index in a Unicode string is an O(n) operation; in native strings it's O(1).

So, to revise what I said before: you actually need to use codeunitOffset() with formSensitive and caseSensitive set to true, and with your input strings appropriately processed.
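
Something along these lines is what I have in mind - just a sketch, with the handler and variable names of my own choosing, and assuming both strings have already been put through whatever processing (e.g. the same normalization form) you decide is appropriate before it is called:

function allCodeunitOffsets pNeedle, pHaystack
   -- compare exactly - no case folding, no on-the-fly normalization
   set the caseSensitive to true
   set the formSensitive to true

   local tSkip, tFound, tOffsets
   put 0 into tSkip
   repeat forever
      -- codeunitOffset returns a position relative to the skipped codeunits
      put codeunitOffset(pNeedle, pHaystack, tSkip) into tFound
      if tFound = 0 then exit repeat
      put tSkip + tFound into tSkip   -- absolute codeunit index of this match
      put tSkip & comma after tOffsets
   end repeat
   if tOffsets is not empty then delete the last char of tOffsets
   return tOffsets   -- comma-separated codeunit indices, not character indices
end allCodeunitOffsets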

The only wrinkle with that is that you might get 'false-positive' matches - i.e.
where a match is not on character boundaries:

e.g. [e, combining-acute] will be found in [e, combining-acute, combining-grave], even though the latter does *not* contain the former as characters (it does as codeunits).
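
To make that concrete (a quick illustrative snippet - the numToCodepoint literals are just one way to build such strings):

on mouseUp
   local tNeedle, tHaystack
   put "e" & numToCodepoint(0x0301) into tNeedle   -- e + combining acute
   put "e" & numToCodepoint(0x0301) & numToCodepoint(0x0300) into tHaystack   -- ... + combining grave

   set the caseSensitive to true
   set the formSensitive to true

   -- codeunitOffset reports a hit at 1; character-level offset should report 0,
   -- because the haystack is a single character which is not equal to the needle
   put codeunitOffset(tNeedle, tHaystack) && offset(tNeedle, tHaystack)
end mouseUp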

Using textEncode("UTF-32") / binary offset will cause more false positives than that (as you've mentioned before), as it could find matches which are not on a 4-byte boundary.
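
For completeness, the UTF-32 route would look something like this (again just a sketch, names of my own choosing) - the alignment check is what filters out those extra false positives, although the character-boundary wrinkle above still applies:

function utf32Offsets pNeedle, pHaystack
   local tNeedleBytes, tHaystackBytes, tSkip, tFound, tAbsolute, tOffsets
   put textEncode(pNeedle, "UTF-32") into tNeedleBytes
   put textEncode(pHaystack, "UTF-32") into tHaystackBytes

   put 0 into tSkip
   repeat forever
      put byteOffset(tNeedleBytes, tHaystackBytes, tSkip) into tFound
      if tFound = 0 then exit repeat
      put tSkip + tFound into tAbsolute   -- absolute byte index of the match
      if (tAbsolute - 1) mod 4 = 0 then
         -- only matches starting on a 4-byte boundary line up with codepoints
         put ((tAbsolute - 1) div 4 + 1) & comma after tOffsets
      end if
      put tAbsolute into tSkip
   end repeat
   if tOffsets is not empty then delete the last char of tOffsets
   return tOffsets   -- comma-separated codepoint indices
end utf32Offsets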

Warmest Regards,

Mark.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps

