On 2017-10-12 18:38, Peter Reid via use-livecode wrote:
I agree that the redundant indexing is an expensive approach; however,
I have found that this abnormal structure of a repeat for each loop
can be a lot faster than the other loop forms in some circumstances.

I can't find the reference that first highlighted to me the lack of
guaranteed sequence of chunks, but I've assumed this was a restriction
for some time. It would be great if this is incorrect as I've got some
heavyweight looping in several of my apps that would benefit from
this!

Can anyone from LiveCode give a definitive answer to this please?

Yes - the previous posts are correct.

If you are using 'repeat for each' to iterate over something which has an order, then the iteration will follow that order.

Specifically:

- string chunks (char, item, line, word, trueWord, sentence, paragraph, token, codepoint, codeunit) are all ordered

- byte chunks are ordered

- key chunks are unordered (they use the hash-order of the array, which can change arbitrarily even within a session)

- element chunks are ordered for arrays which are sequences*, unordered for all other arrays

* An array is a sequence if it has only integer keys and those keys run from 1 up to the number of elements in the array, with no gaps. e.g. keys 1, 2, 3, 4 form a sequence; 0, 1, 2, 3 do not; nor do 1, 3, 4 (a small sketch of this check follows).
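
That check, written out as a script function, might look like this - arrayIsSequence is just an illustrative name of my own, not something the engine exposes (it simply restates the rule above; the engine tracks this internally):

  function arrayIsSequence pArray
    local tCount, tKey
    put the number of elements in pArray into tCount
    repeat with tKey = 1 to tCount
      -- every integer key from 1 up to the element count must be present
      if tKey is not among the keys of pArray then
        return false
      end if
    end repeat
    return true
  end arrayIsSequence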

Just to reiterate something Richard mentioned - the performance advantage of a 'repeat for each' loop is completely lost if you use a normal chunk within the loop to fetch another part (or the same part!) of the container:

  put 0 into tIndex
  repeat for each word tWord in tString
    add 1 to tIndex
    -- the chunk expression below re-scans tString from the start every time
    put word somefunc(tIndex) of tString into tSameWord
    ...
  end repeat

This is because the chunk evaluation must step through somefunc(tIndex) words of the string *again*. It takes N steps to access the N'th string/byte chunk in a container; repeat for each gets its performance advantage because it remembers where it found the previous chunk, so fetching the next one takes just 1 step each iteration.
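
If you genuinely need indexed access to the words as well, one possible restructuring (just a sketch, reusing the same hypothetical somefunc) is to copy the words into an array in a single 'repeat for each' pass, since array lookups take one step:

  put 0 into tIndex
  repeat for each word tWord in tString
    add 1 to tIndex
    -- one forward pass; each word is stored under its numeric index
    put tWord into tWords[tIndex]
  end repeat
  -- later, tWords[somefunc(tIndex)] is a single array lookup rather than
  -- another N-step scan of tString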

The only types of access which are guaranteed not to have this N-step behavior are codeunit, byte and array accesses (a short sketch follows this list):

- a LiveCode string is (internally) an array of codeunits, which means it takes only one step to get any codeunit in it

- a LiveCode binary string is (internally) an array of bytes, so it only takes one step to get any byte from it

- LiveCode arrays are hash-tables, so you can look up any key in one step
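
For illustration - tText, tData and tArray here are just placeholder variables of mine - each of the following is a single-step access:

  put codeunit 1000 of tText into tUnit   -- string: direct codeunit indexing
  put byte 1000 of tData into tByte       -- binary string: direct byte indexing
  put tArray["someKey"] into tValue       -- array: one hash-table lookup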

Note: If you are using the 'byte' chunk, make sure that the thing you are accessing *is* actually a binary string before you do so. The implicit conversion from string -> binary string (which exists for compatibility with pre-7 scripts) will cause a performance hit.
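
So for byte-wise work on text, a reasonable pattern (a sketch - the choice of UTF-8 here is just an example) is to do the conversion once, explicitly, with textEncode, and then loop over the resulting binary string:

  -- convert once, so the byte loop operates on a real binary string rather
  -- than triggering the implicit string -> binary string conversion
  put textEncode(tText, "UTF-8") into tBytes
  repeat for each byte tByte in tBytes
    -- ... process tByte ...
  end repeat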

In terms of the codeunit/codepoint/character chunks - the engine *tries* to ensure it does as little work as possible when accessing these. Internally, the engine sets flags on a string so that it can optimize these lookups. In particular:

1) If a string contains only codepoints from the first 64k Unicode codes, *without* surrogates (surrogates are what give access to everything above 64k), then codepoint access on that string will be 1 step.

2) If a string contains only characters which are composed of single codepoints in the first 64k Unicode codes, *without* surrogates, then char access on that string will be 1 step.

In particular, if the string you are processing actually came from a native string, then (1) and (2) are guaranteed to hold. However, if you have arbitrary Unicode strings, then they won't generally be the case:

a) any character which has a Unicode code > 64k must be represented by two codes < 64k (a surrogate pair); this means that the engine has to step through the string codeunit by codeunit to count the codepoints so it can find the one you asked for (N-step again)

b) any characters which are composed of multiple codepoints are in the same boat

Case (b) can arise quite often even in Roman-script languages. For example, in Unicode you can write e-acute EITHER as e, combining-acute (two codepoints) OR as e-with-acute (one codepoint). For some languages most characters will be multiple codepoints - e.g. the Indic languages. There are also a lot of esoteric scripts (used for scholarly / academic purposes) which are defined entirely with Unicode codepoints > 64k.
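
You can see the e-acute case from the message box with numToCodepoint; the values in the comments are what I would expect:

  put numToCodepoint(0xE9) into tPrecomposed                         -- e-with-acute, one codepoint
  put numToCodepoint(0x65) & numToCodepoint(0x301) into tDecomposed  -- e + combining acute
  put the number of characters in tPrecomposed into tA   -- 1
  put the number of characters in tDecomposed into tB    -- 1 (still one char to the user)
  put the number of codepoints in tDecomposed into tC    -- 2 (so char access may need a scan)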

The upshot is that if you are doing char-by-char processing of general Unicode strings, then try to use repeat for each char; and if you *can't* do that for some reason, then try to work out how you can do the processing using the codeunit chunk.
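
For instance, a char-by-char loop in the preferred shape might look like this (countLetterE is just a made-up example task):

  function countLetterE pText
    local tCount, tChar
    put 0 into tCount
    -- the engine walks the string once; it never has to re-locate a char by index
    repeat for each char tChar in pText
      -- note: 'is' compares case-insensitively unless caseSensitive is set
      if tChar is "e" then
        add 1 to tCount
      end if
    end repeat
    return tCount
  end countLetterE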

However, as Knuth is famed for saying - "Premature optimization is the root of all evil" - don't spend time trying to optimize an algorithm until you have a working version and can analyse where the slow parts actually are. Trying to optimize ahead of time will likely just mean a lot of time spent needlessly, and potential hair loss when you are figuring out why the (probably much more complicated) code doesn't work quite right, or why it didn't produce the speed improvement you were expecting!

Warmest Regards,

Mark.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps
