On 2018-11-13 11:37, Geoff Canyon via use-livecode wrote:
> I understand (generally) the complexity of comparison, but that's not the
> speed issue causing this discussion. Most of the proposed solutions are
> using nearly the same operators/functions for comparison, or at least the
> same comparison is being done. Instead, the problem is a Foolish Line
> Painter problem: with single-byte characters, finding all occurrences of
> one string in another by repeatedly using offset() with charsToSkip
> scales well; but with multi-byte characters, the penalty for repeatedly
> calculating out longer and longer skips is exponential.

The actual reason it's not linear when you have Unicode strings (the behaviour isn't exponential, by the way - cubic, I think) is that working out the character index in a Unicode string is an O(n) operation; in native strings it's O(1).

So, to revise what I said before: you actually need to use codeunitOffset() with formSensitive and caseSensitive set to true, and with your input strings appropriately processed.
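
Something along these lines is what I have in mind - just a sketch, with the handler and variable names of my own choosing, and assuming both strings have already been put through whatever processing (e.g. the same normalization form) you decide is appropriate before it is called:

function allCodeunitOffsets pNeedle, pHaystack
   -- compare exactly - no case folding, no on-the-fly normalization
   set the caseSensitive to true
   set the formSensitive to true

   local tSkip, tFound, tOffsets
   put 0 into tSkip
   repeat forever
      -- codeunitOffset returns a position relative to the skipped codeunits
      put codeunitOffset(pNeedle, pHaystack, tSkip) into tFound
      if tFound = 0 then exit repeat
      put tSkip + tFound into tSkip   -- absolute codeunit index of this match
      put tSkip & comma after tOffsets
   end repeat
   if tOffsets is not empty then delete the last char of tOffsets
   return tOffsets   -- comma-separated codeunit indices, not character indices
end allCodeunitOffsets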

The only wrinkle with that is that you might get 'false-positive' matches - i.e.
where a match is not on character boundaries:

e.g. [e, combining-acute] will be found in [e, combining-acute, combining-grave], even though the latter does *not* contain the former as characters (it does as codeunits).
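
To make that concrete (a quick illustrative snippet - the numToCodepoint literals are just one way to build such strings):

on mouseUp
   local tNeedle, tHaystack
   put "e" & numToCodepoint(0x0301) into tNeedle   -- e + combining acute
   put "e" & numToCodepoint(0x0301) & numToCodepoint(0x0300) into tHaystack   -- ... + combining grave

   set the caseSensitive to true
   set the formSensitive to true

   -- codeunitOffset reports a hit at 1; character-level offset should report 0,
   -- because the haystack is a single character which is not equal to the needle
   put codeunitOffset(tNeedle, tHaystack) && offset(tNeedle, tHaystack)
end mouseUp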

Using textEncode("UTF-32") / binary offset will cause more false positives than that (as you've mentioned before), as it could find matches which are not on a 4-byte boundary.
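
For completeness, the UTF-32 route would look something like this (again just a sketch, names of my own choosing) - the alignment check is what filters out those extra false positives, although the character-boundary wrinkle above still applies:

function utf32Offsets pNeedle, pHaystack
   local tNeedleBytes, tHaystackBytes, tSkip, tFound, tAbsolute, tOffsets
   put textEncode(pNeedle, "UTF-32") into tNeedleBytes
   put textEncode(pHaystack, "UTF-32") into tHaystackBytes

   put 0 into tSkip
   repeat forever
      put byteOffset(tNeedleBytes, tHaystackBytes, tSkip) into tFound
      if tFound = 0 then exit repeat
      put tSkip + tFound into tAbsolute   -- absolute byte index of the match
      if (tAbsolute - 1) mod 4 = 0 then
         -- only matches starting on a 4-byte boundary line up with codepoints
         put ((tAbsolute - 1) div 4 + 1) & comma after tOffsets
      end if
      put tAbsolute into tSkip
   end repeat
   if tOffsets is not empty then delete the last char of tOffsets
   return tOffsets   -- comma-separated codepoint indices
end utf32Offsets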

Warmest Regards,

Mark.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps

