On 2017-10-12 18:38, Peter Reid via use-livecode wrote:
I agree that the redundant indexing is an expensive approach; however,
I have found that this abnormal structure of a repeat for each loop
can be a lot faster than the other loop forms in some circumstances.

I can't find the reference that first highlighted to me the lack of
guaranteed sequence of chunks, but I've assumed this was a restriction
for some time. It would be great if this is incorrect as I've got some
heavyweight looping in several of my apps that would benefit from
this!

Can anyone from LiveCode give a definitive answer to this please?

Yes - the previous posts are correct.

If you are using 'repeat for each' to iterate over something which has an order, then the iteration will follow that order.

Specifically:

- string chunks (char, item, line, word, trueWord, sentence, paragraph, token, codepoint, codeunit) are all ordered

- byte chunks are ordered

- key chunks are unordered (they use the hash-order of the array, which can change arbitrarily even within a session)

- element chunks are ordered for arrays which are sequences*, unordered for all other arrays

* An array is a sequence if it has only integer keys and those keys run from 1 up to the number of elements in the array, with no gaps. e.g. keys 1, 2, 3, 4 form a sequence; 0, 1, 2, 3 do not; nor do 1, 3, 4 (a small sketch of this check follows).
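
That check, written out as a script function, might look like this - arrayIsSequence is just an illustrative name of my own, not something the engine exposes (it simply restates the rule above; the engine tracks this internally):

  function arrayIsSequence pArray
    local tCount, tKey
    put the number of elements in pArray into tCount
    repeat with tKey = 1 to tCount
      -- every integer key from 1 up to the element count must be present
      if tKey is not among the keys of pArray then
        return false
      end if
    end repeat
    return true
  end arrayIsSequence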

Just to reiterate something Richard mentioned - the performance advantage of a 'repeat for each' loop is completely lost if you use a normal chunk within the loop to fetch another part (or the same part!) of the container:

  put 0 into tIndex
  repeat for each word tWord in tString
    add 1 to tIndex
    -- the chunk expression below re-scans tString from the start every time
    put word somefunc(tIndex) of tString into tSameWord
    ...
  end repeat

This is because the chunk evaluation must step through somefunc(tIndex) words of the string *again*. It takes N steps to access the N'th string/byte chunk in a container; repeat for each gets its performance advantage because it remembers where it found the previous chunk, so fetching the next one takes just 1 step each iteration.
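
If you genuinely need indexed access to the words as well, one possible restructuring (just a sketch, reusing the same hypothetical somefunc) is to copy the words into an array in a single 'repeat for each' pass, since array lookups take one step:

  put 0 into tIndex
  repeat for each word tWord in tString
    add 1 to tIndex
    -- one forward pass; each word is stored under its numeric index
    put tWord into tWords[tIndex]
  end repeat
  -- later, tWords[somefunc(tIndex)] is a single array lookup rather than
  -- another N-step scan of tString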

The only types of access which are guaranteed not to have this N-step behavior are codeunit, byte and array accesses (a short sketch follows this list):

- a LiveCode string is (internally) an array of codeunits, which means it takes only one step to get any codeunit in it

- a LiveCode binary string is (internally) an array of bytes, so it only takes one step to get any byte from it

- LiveCode arrays are hash-tables, so you can look up any key in one step
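
For illustration - tText, tData and tArray here are just placeholder variables of mine - each of the following is a single-step access:

  put codeunit 1000 of tText into tUnit   -- string: direct codeunit indexing
  put byte 1000 of tData into tByte       -- binary string: direct byte indexing
  put tArray["someKey"] into tValue       -- array: one hash-table lookup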

Note: If you are using the 'byte' chunk, make sure that the thing you are accessing *is* actually a binary string before you do so. The implicit conversion from string -> binary string (which exists for compatibility with pre-7 scripts) will cause a performance hit.
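
So for byte-wise work on text, a reasonable pattern (a sketch - the choice of UTF-8 here is just an example) is to do the conversion once, explicitly, with textEncode, and then loop over the resulting binary string:

  -- convert once, so the byte loop operates on a real binary string rather
  -- than triggering the implicit string -> binary string conversion
  put textEncode(tText, "UTF-8") into tBytes
  repeat for each byte tByte in tBytes
    -- ... process tByte ...
  end repeat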

In terms of the codeunit/codepoint/character chunks - the engine *tries* to ensure it does as little work as possible when accessing these. Internally, the engine sets flags on a string so that it can optimize these lookups. In particular:

1) If a string contains only codepoints from the first 64k Unicode codes, *without* surrogates (surrogates are what give access to everything above 64k), then codepoint access on that string will be 1 step.

2) If a string contains only characters which are composed of single codepoints in the first 64k Unicode codes, *without* surrogates, then char access on that string will be 1 step.

In particular, if the string you are processing actually came from a native string, then (1) and (2) are guaranteed to hold. However, if you have arbitrary Unicode strings, then they won't generally be the case:

a) any character which has a Unicode code > 64k must be represented by two codes < 64k (a surrogate pair); this means that the engine has to step through the string codeunit by codeunit to count the codepoints so it can find the one you asked for (N-step again)

b) any characters which are composed of multiple codepoints are in the same boat

Case (b) can arise quite often even in Roman-script languages. For example, in Unicode you can write e-acute EITHER as e, combining-acute (two codepoints) OR as e-with-acute (one codepoint). For some languages most characters will be multiple codepoints - e.g. the Indic languages. There are also a lot of esoteric scripts (used for scholarly / academic purposes) which are defined entirely with Unicode codepoints > 64k.
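
You can see the e-acute case from the message box with numToCodepoint; the values in the comments are what I would expect:

  put numToCodepoint(0xE9) into tPrecomposed                         -- e-with-acute, one codepoint
  put numToCodepoint(0x65) & numToCodepoint(0x301) into tDecomposed  -- e + combining acute
  put the number of characters in tPrecomposed into tA   -- 1
  put the number of characters in tDecomposed into tB    -- 1 (still one char to the user)
  put the number of codepoints in tDecomposed into tC    -- 2 (so char access may need a scan)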

The upshot is that if you are doing char-by-char processing of general Unicode strings, then try to use repeat for each char; and if you *can't* do that for some reason, then try to work out how you can do the processing using the codeunit chunk.
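
For instance, a char-by-char loop in the preferred shape might look like this (countLetterE is just a made-up example task):

  function countLetterE pText
    local tCount, tChar
    put 0 into tCount
    -- the engine walks the string once; it never has to re-locate a char by index
    repeat for each char tChar in pText
      -- note: 'is' compares case-insensitively unless caseSensitive is set
      if tChar is "e" then
        add 1 to tCount
      end if
    end repeat
    return tCount
  end countLetterE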

However, as Knuth is famed for saying - "Premature optimization is the root of all evil" - don't spend time trying to optimize an algorithm until you have a working version and can analyse where the slow parts actually are. Trying to optimize ahead of time will likely just mean a lot of time spent needlessly, and potential hair loss when you are figuring out why the (probably much more complicated) code doesn't work quite right, or why it didn't produce the speed improvement you were expecting!

Warmest Regards,

Mark.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps
