RE: OMG text processing performance 6.7 - 9.5

Neville via use-livecode Tue, 04 Feb 2020 16:14:10 -0800

I think the recent testing of the Parse 1 and Parse 2 algorithms must have been 
on ASCII text, not UTF-8 text.


I tested on the English translation of Les Miserables, to ensure at least a 
sprinkling of multi-byte characters in the text and a longish file: 3.4 MB. I 
searched for the string ‘Valjean’, which obviously occurs very frequently.

The searches were first applied to the raw binary text as read from the UTF-8 
encoded file, without decoding; then to the same text after UTF-8 decoding.

Parse 0: using itemdelimiter ‘Valjean’ (case insensitive)

Parse 1: using offset with skips

Parse 2: using offset, truncating the text and using a skip of 0
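
For readers who have not seen the handlers, the three strategies can be sketched roughly as follows. This is a Python analog, not the LiveCode code that was timed: the function names are made up for illustration, and Python's own string search does not share the performance behaviour reported below. It only shows the shape of each loop.

```python
def count_by_split(text: str, needle: str) -> int:
    """Parse 0 analog: use the needle as a delimiter and count the pieces.

    Lower-casing both sides mimics the case-insensitive delimiter.
    """
    return len(text.lower().split(needle.lower())) - 1


def count_with_skip(text: str, needle: str) -> int:
    """Parse 1 analog: search the same text again and again,
    each time skipping past the previous hit."""
    count, start = 0, 0
    while True:
        pos = text.find(needle, start)
        if pos == -1:
            return count
        count += 1
        start = pos + len(needle)


def count_by_truncating(text: str, needle: str) -> int:
    """Parse 2 analog: cut off everything up to and including each hit,
    then search the remainder from its beginning (skip of 0)."""
    count = 0
    while True:
        pos = text.find(needle)
        if pos == -1:
            return count
        count += 1
        text = text[pos + len(needle):]
```

Note that only the first function is case insensitive here, so it can report more hits than the other two on mixed-case input.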

Results:

searches on raw text
parse0 10 ms
parse1 9 ms
parse2 708 ms

searches on UTF-8 decoded text
parse0 4402 ms
parse1 225469 ms
parse2 3453 ms
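
To put the two sets of timings side by side, the slowdown of each approach on decoded text can be worked out directly from the figures above. The snippet below is just that arithmetic in Python; the dictionary names are mine.

```python
# Timings in milliseconds, copied from the results above.
raw_ms = {"parse0": 10, "parse1": 9, "parse2": 708}
decoded_ms = {"parse0": 4402, "parse1": 225469, "parse2": 3453}

# Slowdown factor: time on decoded text divided by time on raw bytes.
slowdown = {name: decoded_ms[name] / raw_ms[name] for name in raw_ms}

for name, factor in slowdown.items():
    print(f"{name}: {factor:,.1f}x slower on decoded text")
```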


The winner for long UTF-8 text is Parse 2; for raw text, Parse 1 and Parse 0 
are equivalent. The results dramatically demonstrate how steeply performance 
degrades on long UTF-8 text: Parse 1 takes about 25,000 times longer on the 
decoded text than on the raw bytes.

For most searches I would think one could use the raw text, as long as one was 
searching for an ASCII string. False positives, where the string of single 
bytes occurs inside multi-byte characters, should not arise at all in valid 
UTF-8, because every byte of a multi-byte character has its high bit set.
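
The raw-text shortcut rests on a property of the encoding: in UTF-8, every byte of a multi-byte character is 0x80 or above, so an ASCII search string cannot match part of one. A small Python check of that, on a made-up sample string rather than the actual file, might look like this (the helper name is hypothetical):

```python
def ascii_bytes_inside_multibyte(text: str) -> int:
    """Count bytes below 0x80 that belong to a multi-byte UTF-8 character."""
    hits = 0
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1:
            hits += sum(1 for b in encoded if b < 0x80)
    return hits


# Illustrative sample only; it is not taken from the test file.
sample = "Misérables: Jean Valjean, Valjean, naïve café"
raw = sample.encode("utf-8")

# Number of ASCII-range bytes hiding inside multi-byte characters.
stray = ascii_bytes_inside_multibyte(sample)

# The same ASCII needle, counted in the raw bytes and in the decoded text.
raw_hits = raw.count(b"Valjean")
text_hits = sample.count("Valjean")

# Positions differ: the raw search reports a byte offset, not a character offset.
raw_pos = raw.find(b"Valjean")
text_pos = sample.find("Valjean")

print(stray, raw_hits, text_hits, raw_pos, text_pos)
```

One caveat with the shortcut: positions found in the raw text are byte offsets, so they need converting if they are later used to index the decoded text.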

Neville



