RE: OMG text processing performance 6.7 - 9.5

Neville via use-livecode Wed, 05 Feb 2020 15:43:57 -0800

Richard, here is a link to my test stack

https://www.dropbox.com/sh/bbpe12p8bf56ofe/AADbhV2LavLP4Y3CZ8Ab8NGia?dl=0 


The LesMiserables.txt file is included for convenience; it should be placed in 
your Documents directory. The algorithms are all in the script for the `Run` 
button.

I am still mystified that the character offset searches give the same number 
for each hit in the UTF-8 text as in the raw text. Surely `char x of 
theUTF8Text` returns the Unicode character at offset x, while `char x of 
theRawText` returns the 8-bit ASCII character of the raw text? How then can x 
be the same for the corresponding hit, when I know there are some multibyte 
Unicode characters in the text (e.g. e-acute in Miserables)? Indeed, just what 
does textDecode(theRawText,"UTF-8") do: does it modify the actual text at all, 
or just set a property flag?
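
As a rough illustration of the expectation described above, here is a small sketch in Python (not LiveCode, and not taken from the test stack). The sample sentence and all variable names are made up for the example. It shows that once an accented character precedes the hit, an offset counted in characters differs from one counted in UTF-8 bytes, and that reading the raw bytes as one 8-bit character per byte gives the byte offset.

```python
# Illustration in Python, not LiveCode; the sample sentence is made up.
text = "Les Misérables: Jean Valjean"
raw = text.encode("utf-8")           # the bytes as they would sit in the file

char_pos = text.find("Valjean")      # 0-based offset counted in characters
byte_pos = raw.find(b"Valjean")      # 0-based offset counted in bytes

# Reading the raw bytes as one 8-bit character per byte (Latin-1 here,
# which is an assumption about what "raw text" means).
as_8bit = raw.decode("latin-1")
eight_bit_pos = as_8bit.find("Valjean")

# The e-acute takes two bytes in UTF-8, so the byte-based offsets are
# one larger than the character-based offset.
print(char_pos, byte_pos, eight_bit_pos)
```

Whether `char x of theRawText` in LiveCode behaves like the 8-bit reading here is an assumption; the sketch only shows why identical offsets for both searches look surprising.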

Another mystery: I decided to extend the search algorithms by adding 
matchChunk. In this case I use the regular expression `(?m)(?i)(Valjean)` to 
get the start and end offsets of the first match, and then truncate the initial 
section as per Parse2. As expected, this search is much slower than any of the 
others on the raw text, since it has a lot more to do. I then expected to get 
around the same time for the search on UTF-8 text, rather than a dramatically 
worse time, since matchChunk is presumably encoding-blind. But it is actually 
15% faster than on the raw text; in fact it is the fastest of all the 
algorithms for finding offsets if you must* search UTF-8 text! How can this 
be? I don’t believe the UTF-8 text is 15% smaller than the raw text!
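
The match-then-truncate loop described above can be sketched as follows. This is Python rather than LiveCode, it is not the script from the `Run` button, and the function name `match_offsets` and the sample text are hypothetical. It only shows the general technique: find the first match, record its offsets relative to the whole text, drop everything up to the end of the match, and repeat.

```python
import re

def match_offsets(text, pattern=r"(?m)(?i)(Valjean)"):
    """Collect 1-based (start, end) offsets by matching, then truncating."""
    regex = re.compile(pattern)
    hits = []
    consumed = 0          # number of characters already cut off the front
    rest = text
    while True:
        m = regex.search(rest)
        if m is None:
            break
        start, end = m.span(1)            # 0-based, end exclusive
        # Convert to 1-based inclusive offsets in the original text.
        hits.append((consumed + start + 1, consumed + end))
        consumed += end
        rest = rest[end:]                 # truncate the initial section
    return hits

sample = "Jean Valjean met valjean.\nVALJEAN"   # made-up sample text
print(match_offsets(sample))
```

Each pass copies the remainder of the text, so the cost of the truncation step is a plausible place for timing differences to come from, though that is an assumption and not something measured here.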

searches on raw text
matchChunk        3059 ms
filter              16 ms
parse0              10 ms
parse1               8 ms
parse3            2244 ms
parse2             671 ms
parse4             682 ms

searches on utf-8 text
matchChunk utf8     2492 ms
filter utf8         1954 ms
parse0 utf8         3788 ms
parse1 utf8       223254 ms
parse3 utf8       634423 ms
parse2 utf8         3409 ms
parse4 utf8         7166 ms

*As I mentioned, in most cases character offset searching of the raw text 
should be OK if you are searching for 7-bit ASCII strings of length, say, > 2. 
But I think the lineOffset and filter operations could give false positives if 
a multibyte character in the text contains 0D as a component byte.

Neville


_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode
