On 2020-02-21 00:29, J. Landman Gay via use-livecode wrote:
So glad you chimed in, Mark. This is pretty impressive. I'll need to
use the "for each element" structure because my tags are not unique,
but it still is much faster. When clicking a tag at the top of the
document that links to the last anchor at the bottom of the text, I
get a timing of about 25ms. If I omit the timing for loading the
htmltext and the selection of the text at the end of the handler it
brings the timing to almost 0. The test text is long, but not nearly
as long as Bernd's sample.

Glad I could help - although to be fair, all I did was optimize what
Bernd (and Richard) had already proposed.

One thing I did notice through testing was that the actual styled content
makes a great deal of difference to performance. I also tried against the
DataGrid behavior (replicated several times), and then also against some
styled 'Lorem Ipsum' (https://loripsum.net/) of about the same length (around
8Mb of htmlText, with the anchor being searched for on the last word). The
difference is that the DG has many more style runs (unsurprisingly) and
almost all are single words. So timings need to be taken against a
representative sample of the data you are actually working with.

I need to select the entire range of text covered by the metadata
span, not just a single word. I've got that working, but since we're
on a roll here, I wonder if there's a more optimal way to do it.

I did wonder if that would be the case...

I'm using chars instead of codepoints because when I tried it, they
both gave the same number. Should I change that?

Both characters and codepoints run the risk of requiring a linear scan of
the string to calculate the length - strictly speaking this will occur if
the engine is not sure whether characters / codepoints have a 1-1 map to
codeunits (for example if your string has Unicode chars and the engine
hasn't analysed it). Therefore you should definitely use codeunits.
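
To illustrate the difference, here is a minimal sketch (the handler and the
variable tText are made up for illustration). The emoji sits outside the
Basic Multilingual Plane, so it occupies two UTF-16 codeunits even though it
is a single codepoint:

```livecode
on mouseUp
   local tText
   /* 128512 is U+1F600, which is stored as a surrogate pair,
   * i.e. two codeunits for one codepoint. */
   put "a" & numToCodepoint(128512) & "b" into tText

   /* Counting codeunits never needs to analyse the string; counting
   * codepoints or characters may need a linear scan. */
   put (the number of codeunits in tText) & comma & \
         (the number of codepoints in tText) & comma & \
         (the number of characters in tText)
end mouseUp
```

For plain ASCII text all three counts agree, which is presumably why chars
and codepoints gave you the same number.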

Also, I had to add 3 to tStartChar to get the right starting point but
I can't figure out why. Otherwise it selects the last character before
the metadata span as the starting point.

Was the anchor in the third paragraph by any chance?

The styledText representation makes the paragraph separator (return char)
implicit (as it is in the field object internally) - so you need to bump
tTotalChars by one before the final end repeat to account for that (as the
codeunit ranges the field uses *include* the implicit return char).
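
As a rough sketch of where that bump goes (this is not your actual handler;
FindFirstAnchor, tTotal and tRunLength are names I've made up, and it
assumes a field-wide running count using ordered 'repeat with' loops):

```livecode
function FindFirstAnchor pStyledText, pAnchor
   /* Returns "start,finish" as a codeunit range of the whole field,
   * or empty if no run carries the anchor. */
   local tTotal, tRunLength
   put 0 into tTotal

   repeat with tLine = 1 to the number of elements in pStyledText
      repeat with tRun = 1 to the number of elements in pStyledText[tLine]["runs"]
         put the number of codeunits in pStyledText[tLine]["runs"][tRun]["text"] \
               into tRunLength
         if pStyledText[tLine]["runs"][tRun]["metadata"] is pAnchor then
            return (tTotal + 1) & comma & (tTotal + tRunLength)
         end if
         add tRunLength to tTotal
      end repeat

      /* The paragraph's return char is implicit in the styledText array
      * but is counted in the field's codeunit ranges, so add it here. */
      add 1 to tTotal
   end repeat

   return empty
end FindFirstAnchor
```

Without the 'add 1' the result drifts by one codeunit for every paragraph
that precedes the match.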

So I couldn't help but fettle with this a little more. You mention that your
'anchors' are not unique in a document. This raises the question of what
happens if there is more than one match...

This handler finds all occurrences of a given anchor in the text. As we are
searching for all of them, it can use 'repeat for each key' iteration in both
loops:

function FindAllAnchors pStyledText, pAnchor
   /* Return-delimited list of results, each line is of the form:
   *     start,finish,line
   * Each of these corresponds to a chunk of the form:
   *      CODEUNIT start TO finish OF LINE line OF field
   */
   local tResults

   /* Iterate over the lines of the text in arbitrary order - the order doesn't
   * matter as we keep the reference to the line any match is in. */
   repeat for each key tLineIndex in pStyledText
      /* Fetch the runs in the line, so we don't have to keep looking it up */
      local tRuns
      put pStyledText[tLineIndex]["runs"] into tRuns

      /* Iterate over the runs in arbitrary order - assuming that the number
      * of potentially matching runs is minuscule compared to the number of
      * non-matching runs, it is faster to iterate in hash-order. */
      repeat for each key tRunIndex in tRuns
         /* If we find a match, work out its offset in the line */
         if tRuns[tRunIndex]["metadata"] is pAnchor then
            /* Calculate the number of codeunits before this run */
            local tCodeunitCount
            put 0 into tCodeunitCount
            repeat with tPreviousRunIndex = 1 to tRunIndex - 1
               add the number of codeunits in tRuns[tPreviousRunIndex]["text"] to tCodeunitCount
            end repeat

            /* Append the result to the results list. */
            put tCodeunitCount + 1, \
                  tCodeunitCount + the number of codeunits in tRuns[tRunIndex]["text"], \
                  tLineIndex & \
                  return after tResults
         end if
      end repeat
   end repeat

   /* We want the results sorted first by line index, then by starting codeunit
   * within the line (so we get a top-to-bottom, left-to-right order). As the
   * 'sort' command is stable, we can do this by first sorting by the secondary
   * factor (codeunit start), then sorting again by the primary factor (line
   * index). */
   sort lines of tResults ascending numeric by item 1 of each
   sort lines of tResults ascending numeric by item 3 of each

   /* Return the set of results. */
   return tResults
end FindAllAnchors
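
For what it's worth, here is one way the result might be consumed. This is
only a sketch: the field name "Content" and the handler name are
placeholders, and it simply selects the first (top-most) match.

```livecode
on SelectFirstAnchor pAnchor
   local tMatches, tMatch
   put FindAllAnchors(the styledText of field "Content", pAnchor) into tMatches
   if tMatches is empty then exit SelectFirstAnchor

   /* Each line is start,finish,line - relative to the line, so no
   * adjustment for the implicit return chars is needed here. */
   put line 1 of tMatches into tMatch
   select codeunit (item 1 of tMatch) to (item 2 of tMatch) \
         of line (item 3 of tMatch) of field "Content"
end SelectFirstAnchor
```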

I tested this on 8Mb of styled Lorem Ipsum text, with the same anchor at:
  word 1
  word 1000
  the middle word
  word -1000
  word -1

In that test, this handler takes slightly less time than searching for a
single anchor at word -1 of the field using 'repeat with' loops.

Whether this is helpful or not depends on whether you need to 'do something'
when there is more than one matching anchor in a document :)

Warmest Regards,

Mark.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps
