Re: Translate metadata to field content

Mark Waddingham via use-livecode Fri, 21 Feb 2020 00:53:22 -0800

On 2020-02-21 00:29, J. Landman Gay via use-livecode wrote:

> So glad you chimed in, Mark. This is pretty impressive. I'll need to
> use the "for each element" structure because my tags are not unique,
> but it still is much faster. When clicking a tag at the top of the
> document that links to the last anchor at the bottom of the text, I
> get a timing of about 25ms. If I omit the timing for loading the
> htmltext and the selection of the text at the end of the handler it
> brings the timing to almost 0. The test text is long, but not nearly
> as long as Bernd's sample.


Glad I could help - although to be fair, all I did was optimize what
Bernd (and Richard) had already proposed.

One thing I did notice through testing was that the actual styled content
makes a great deal of difference to performance. I also tried against the
DataGrid behavior (replicated several times), and then also against some
styled 'Lorem Ipsum' (https://loripsum.net/) of about the same length (around
8Mb of htmlText, with the anchor being searched for on the last word). The
difference is that the DG has many more style runs (unsurprisingly) and
almost all are single words. So timings need to be taken against a
representative sample of the data you are actually working with.

> I need to select the entire range of text covered by the metadata
> span, not just a single word. I've got that working, but since we're
> on a roll here, I wonder if there's a more optimal way to do it.


I did wonder if that would be the case...

> I'm using chars instead of codepoints because when I tried it, they
> both gave the same number. Should I change that?

Both characters and codepoints run the risk of requiring a linear scan of
the string to calculate the length - strictly speaking this will occur if
the engine is not sure whether characters / codepoints have a 1-1 map to
codeunits (for example if your string has Unicode chars and it hasn't
analysed it). Therefore you should definitely use codeunits.
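
To illustrate why the three counts can differ, here is a minimal sketch. The
handler name CompareChunkCounts is hypothetical; it builds a string holding a
codepoint outside the Basic Multilingual Plane and reports the three chunk
counts side by side.

```livecode
-- Sketch only: CompareChunkCounts is a hypothetical name, not from the thread.
function CompareChunkCounts
   local tText, tCounts
   -- U+1F600 (decimal 128512) is a single codepoint that UTF-16 stores
   -- as two codeunits (a surrogate pair)
   put "ab" & numToCodepoint(128512) into tText
   put the number of chars in tText into tCounts
   put comma & the number of codepoints in tText after tCounts
   put comma & the number of codeunits in tText after tCounts
   -- chars,codepoints,codeunits: 3,3,4 for this string
   return tCounts
end CompareChunkCounts
```

For plain native text all three numbers agree, which is presumably why chars
and codepoints gave the same result in your test.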

> Also, I had to add 3 to tStartChar to get the right starting point but
> I can't figure out why. Otherwise it selects the last character before
> the metadata span as the starting point.


Was the anchor in the third paragraph by any chance?

The styledText representation makes the paragraph separator (return char)
implicit (as it is in the field object internally) - so you need to bump
the tTotalChars by one before the final end repeat to account for that (as the
codeunit ranges the field uses *include* the implicit return char).
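
As a rough sketch of that adjustment (FieldOffsetOfLine is a hypothetical
helper, and it assumes the array comes straight from the styledText of the
field), the count of codeunits preceding a given line could be accumulated
like this:

```livecode
-- Sketch only: FieldOffsetOfLine is a hypothetical helper, not from the thread.
-- Returns the number of field codeunits that precede line pLineIndex.
function FieldOffsetOfLine pStyledText, pLineIndex
   local tTotal, tLine, tRun
   put 0 into tTotal
   repeat with tLine = 1 to pLineIndex - 1
      repeat for each element tRun in pStyledText[tLine]["runs"]
         add the number of codeunits in tRun["text"] to tTotal
      end repeat
      -- account for the implicit return char that ends each paragraph
      add 1 to tTotal
   end repeat
   return tTotal
end FieldOffsetOfLine
```

Without the 'add 1' step the result falls short by one codeunit per preceding
paragraph, which would explain an offset of 3 for an anchor a few paragraphs
down.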

So I couldn't help but fettle with this a little more. You mention that your
'anchors' are not unique in a document. This raises the question of what
happens if there is more than one match...

This handler finds all occurrences of a given anchor in the text. As we are
searching for all of them, it can use repeat for each key iteration in both
loops:

function FindAllAnchors pStyledText, pAnchor
   /* Return-delimited list of results, each line is of the form:
   *     start,finish,line
   * Each of these corresponds to a chunk of the form:
   *      CODEUNIT start TO finish OF LINE line OF field
   */
   local tResults

   /* Iterate over the lines of the text in arbitrary order - the order doesn't
   * matter as we keep the reference to the line any match is in. */
   repeat for each key tLineIndex in pStyledText

      /* Fetch the runs in the line, so we don't have to keep looking it up */
      local tRuns
      put pStyledText[tLineIndex]["runs"] into tRuns

      /* Iterate over the runs in arbitrary order - assuming that the number
      * of potentially matching runs is minuscule compared to the number of
      * non-matching runs, it is faster to iterate in hash-order. */
      repeat for each key tRunIndex in tRuns
         /* If we find a match, work out its offset in the line */
         if tRuns[tRunIndex]["metadata"] is pAnchor then
            /* Calculate the number of codeunits before this run */
            local tCodeunitCount
            put 0 into tCodeunitCount
            repeat with tPreviousRunIndex = 1 to tRunIndex - 1
               add the number of codeunits in tRuns[tPreviousRunIndex]["text"] to tCodeunitCount
            end repeat

            /* Append the result to the results list. */
            put tCodeunitCount + 1, \
                  tCodeunitCount + the number of codeunits in tRuns[tRunIndex]["text"], \
                  tLineIndex & \
                  return after tResults
         end if
      end repeat
   end repeat

   /* We want the results sorted first by line index, then by starting codeunit
   * within the line (so we get a top-to-bottom, left-to-right order). As the
   * 'sort' command is stable, we can do this by first sorting by the secondary
   * factor (codeunit start), then sorting again by the primary factor (line
   * index). */
   sort lines of tResults ascending numeric by item 1 of each
   sort lines of tResults ascending numeric by item 3 of each

   /* Return the set of results. */
   return tResults
end FindAllAnchors

Testing this on 8Mb of styled Lorem Ipsum text, with the same anchor at:
  word 1
  word 1000
  the middle word
  word -1000
  word -1

Then this handler takes slightly less time than searching for a single anchor
at word -1 of the field using 'repeat with' loops.

Whether this is helpful or not depends on whether you need to 'do something'
when there is more than one matching anchor in a document :)
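
For what it's worth, a caller might consume the result list roughly as
follows. This is only a sketch: SelectFirstAnchor and the field name "Content"
are made up for illustration, and it simply selects the topmost match using
the chunk form described in the handler's opening comment.

```livecode
-- Sketch only: SelectFirstAnchor and field "Content" are hypothetical names.
command SelectFirstAnchor pAnchor
   local tMatches, tMatch
   put FindAllAnchors(the styledText of field "Content", pAnchor) into tMatches
   if tMatches is empty then
      exit SelectFirstAnchor
   end if
   -- Each line has the form start,finish,line; after the two sorts,
   -- line 1 of the list is the topmost match in the field.
   put line 1 of tMatches into tMatch
   select codeunit (item 1 of tMatch) to (item 2 of tMatch) \
         of line (item 3 of tMatch) of field "Content"
end SelectFirstAnchor
```

Note that each result line covers a single run, so a metadata span that is
split across several style runs would show up as several adjacent results.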

Warmest Regards,

Mark.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps
