Re: How to get word offset all instances of a string in a chunk of text?

2018-08-31 Thread Richard Gaskin via use-livecode

Mike Kerner wrote:

> Since the topic of processes came up a few weeks ago I've been
> thinking about what it would take to build a process/threading
> framework.  I wonder if a text processing subprocessor, written
> and compiled...

I haven't yet come across good use cases for the desktop, but will have 
a need for multiprocessing on Linux servers later this year.



> ... in 6 would be worth everyone's time.

That would be a non-starter for me.  I use LSON data a lot and the 
format changed with v7, a lot of things with text have changed, and 
given the hundreds of bug fixes between then and now I prefer to work 
with the current version.


--
 Richard Gaskin
 Fourth World Systems
 Software Design and Development for the Desktop, Mobile, and the Web
 
 ambassa...@fourthworld.com
 http://www.FourthWorld.com

___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: How to get word offset all instances of a string in a chunk of text?

2018-08-31 Thread Mike Kerner via use-livecode
Since the topic of processes came up a few weeks ago I've been thinking
about what it would take to build a process/threading framework.  I wonder
if a text processing subprocessor, written and compiled in 6, would be worth
everyone's time.  The main app would hand off the data and the command to
the subprocessor and be handed the results back.  I wonder how large the
dataset would have to be to make the overhead worthwhile.

On Fri, Aug 31, 2018 at 10:43 AM Keith Clarke via use-livecode <
use-livecode@lists.runrev.com> wrote:

> Thanks Alex, HH & Jim for all the help & ideas.
>
> Just to close out the thread with a solution for future reference, the
> code below now extracts from a text source a list of unique words, cleaned
> up against a noise-word list, with the word frequency, the word, and a
> comma-delimited string of its word numbers within the original source.
>
>
> # Build unique words array
> repeat for each trueWord W in tSource
>    add 1 to tWordNum
>    if tANoise[W] then next repeat
>    put comma & tWordNum after tAWords[W]
> end repeat
>
> # Convert unique words array to list
> repeat for each key K in tAWords
>    put K && tAWords[K] & CR after tTemp
> end repeat
>
> repeat for each line tLine in tTemp
>    # item 1 of tLine is the word itself, so subtract 1 to get the frequency
>    put (the number of items in tLine) - 1 & comma & tLine & cr after tWords
> end repeat
>
> sort lines of tWords descending numeric by item 1 of each
> put tWords into field "Words"
>
>
> Thanks & regards,
> Keith


-- 
On the first day, God created the heavens and the Earth
On the second day, God created the oceans.
On the third day, God put the animals on hold for a few hours,
   and did a little diving.
And God said, "This is good."


Re: How to get word offset all instances of a string in a chunk of text?

2018-08-31 Thread Keith Clarke via use-livecode
Thanks Alex, HH & Jim for all the help & ideas.

Just to close out the thread with a solution for future reference, the code 
below now extracts from a text source a list of unique words, cleaned up 
against a noise-word list, with the word frequency, the word, and a comma-delimited 
string of its word numbers within the original source.


# Build unique words array
repeat for each trueWord W in tSource
   add 1 to tWordNum
   if tANoise[W] then next repeat
   put comma & tWordNum after tAWords[W]
end repeat

# Convert unique words array to list
repeat for each key K in tAWords
   put K && tAWords[K] & CR after tTemp
end repeat

repeat for each line tLine in tTemp
   # item 1 of tLine is the word itself, so subtract 1 to get the frequency
   put (the number of items in tLine) - 1 & comma & tLine & cr after tWords
end repeat

sort lines of tWords descending numeric by item 1 of each
put tWords into field "Words"


Thanks & regards,
Keith




 


Re: How to get word offset all instances of a string in a chunk of text?

2018-08-30 Thread Curry Kenworthy via use-livecode



Jim:

> This just doesn’t work in all cases

That's the key though, don't repeat when it's not necessary! A day with 
no repeats is an efficient day. ;)


Best wishes,

Curry Kenworthy

Custom Software Development
LiveCode Training and Consulting
http://livecodeconsulting.com/



Re: How to get word offset all instances of a string in a chunk of text?

2018-08-30 Thread Jim Lambert via use-livecode
> I wrote:
> 
> Then there is also this repeat-less approach using arrays and filter:

> function findWordOffsets pText, pSearchTerm
>   put replaceText(pText,"\W+"," ") into pText
>   split pText by space
>   combine pText with cr and tab
>   filter pText with "*" & tab & pSearchTerm
>   sort numeric pText
>   return pText
> end findWordOffsets

This just doesn’t work in all cases because splitting by space does not assure 
one is splitting by true words.
:(
Sorry about that.
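
A minimal sketch of one way around this, assuming LiveCode 7 or later where the trueWord chunk is available: walk the text by trueWord and emit the same number, tab, word lines as the function above. The handler name findTrueWordOffsets is hypothetical, and this version does use a repeat loop, so it gives up the repeat-less property.

```livecode
function findTrueWordOffsets pText, pSearchTerm
   local tWordNum, tWord, tResult
   put 0 into tWordNum
   repeat for each trueWord tWord in pText
      add 1 to tWordNum
      -- "is" compares case-insensitively unless the caseSensitive is true
      if tWord is pSearchTerm then
         put tWordNum & tab & tWord & cr after tResult
      end if
   end repeat
   -- char 1 to -2 drops the trailing CR
   return char 1 to -2 of tResult
end findTrueWordOffsets
```

Because the counting is done on trueWord chunks, punctuation, repeated spaces and line breaks need no clean-up with replaceText first.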

Jim Lambert


Re: How to get word offset all instances of a string in a chunk of text?

2018-08-30 Thread Jim Lambert via use-livecode
> On 30/08/2018 10:24, Keith Clarke via use-livecode wrote:
>> Folks,
>> Is there a single-pass mechanism or more efficient way of returning the 
>> wordOffset of each instance of ‘the’ in ‘the quick brown fox jumped over the 
>> lazy dog’ than to use two passes through the text?

Then there is also this repeat-less approach using arrays and filter:

function findWordOffsets pText, pSearchTerm
  put replaceText(pText,"\W+"," ") into pText
  split pText by space
  combine pText with cr and tab
  filter pText with "*" & tab & pSearchTerm
  sort numeric pText
  return pText
end findWordOffsets

put "Then the  quick brown fox jumped over" && quote & "The" & quote && "very," & cr & cr & \
      "very lazy" & cr & cr & "red dog on the sofa." into temp   -- note the extra spaces and line breaks

put findWordOffsets(temp, "the")
returns:
2   the
8   The
15  the

Jim Lambert




Re: How to get word offset all instances of a string in a chunk of text?

2018-08-30 Thread Curry Kenworthy via use-livecode



hh:

> Sadly LC 9 is about 10 times slower
> than LC 6 with such fast scripts.

Yes, I've been doing some benchmarks and LC 9 usually takes anywhere 
from 2x to 8x as long to perform a job. With or without text being 
involved. It is a serious problem that should not be neglected across 
multiple major versions of LC. I'll share a test stack and video with 
some examples when I have a little time. (Including one test where LC 9 
held its own.)


Meanwhile, optimize scripts! Then hopefully a serious boost once the 
engine itself is optimized.


Best wishes,

Curry Kenworthy

Custom Software Development
LiveCode Training and Consulting
http://livecodeconsulting.com/




Re: How to get word offset all instances of a string in a chunk of text?

2018-08-30 Thread hh via use-livecode
> Alex T. wrote:
> 
> put 0 into tOffset
> repeat for each trueWord W in tSource
>   add 1 to tOffset
>   if W = myWord then
>     put tOffset & comma after tOffsetList
>   end if
> end repeat

Whether you use trueWord or word chunks, this is probably the fastest
way to collect the offsets of a single (true) word.

For a large tSource (say 4 MByte) it may be better to use CR instead of
comma as the delimiter for the list. Otherwise, when you put tOffsetList
into a field, LC may cut off the result or even hang (LC 9) because the
maximum pixel size of a line is exceeded.
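
A sketch of that change, wrapped in a function so it can be called from anywhere. The handler name trueWordOffsetLines and its parameter names are hypothetical; the loop itself is the one quoted above with CR swapped in for comma.

```livecode
function trueWordOffsetLines pSource, pWord
   local tOffset, tWord, tOffsetList
   put 0 into tOffset
   repeat for each trueWord tWord in pSource
      add 1 to tOffset
      -- one offset per line, so no single line of the result grows very long
      if tWord = pWord then put tOffset & cr after tOffsetList
   end repeat
   -- char 1 to -2 drops the trailing CR
   return char 1 to -2 of tOffsetList
end trueWordOffsetLines
```

The result can then be put into a field directly, for example: put trueWordOffsetLines(tSource, "the") into field "Words".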



Re: How to get word offset all instances of a string in a chunk of text?

2018-08-30 Thread hh via use-livecode
For a more general context see

http://www.runrev.com/pipermail/use-livecode//2004-February/032280.html

Sadly, LC 9 is about 10 times slower than LC 6 with such fast scripts.
For example, LC 6.7.11 needs about 500 ms to evaluate a 1 MByte string,
while LC 9.0.0 needs about 5 seconds.
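
For anyone who wants to compare versions themselves, a rough timing harness might look like the sketch below. The test data and the loop being timed are assumptions for illustration, not the script behind the figures above, and trueWord requires LC 7 or later (substitute "word" to run it in LC 6).

```livecode
on mouseUp
   local tSource, tStart, tCount, tWord
   -- build roughly 1 MByte of test text (assumed data)
   repeat 25000 times
      put "the quick brown fox jumped over the lazy dog" & cr after tSource
   end repeat
   put 0 into tCount
   put the milliseconds into tStart
   repeat for each trueWord tWord in tSource
      add 1 to tCount
   end repeat
   -- with no destination, put writes to the message box
   put tCount && "true words in" && (the milliseconds - tStart) && "ms"
end mouseUp
```

Running the same stack in each engine version gives a like-for-like number for this particular loop only; other scripts may behave differently.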



Re: How to get word offset all instances of a string in a chunk of text?

2018-08-30 Thread Alex Tweedly via use-livecode
OK, this time I'm just typing into email - haven't tested these 
suggestions :-)



On 30/08/2018 10:24, Keith Clarke via use-livecode wrote:

> Folks,
> Is there a single-pass mechanism or more efficient way of returning the
> wordOffset of each instance of ‘the’ in ‘the quick brown fox jumped over the
> lazy dog’ than to use two passes through the text?

Yes. For a single word myWord

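put 0 into tOffset
repeat forever
  -- the offset returned is relative to the words skipped, not to the start of tSource
  put trueWordOffset(myWord, tSource, tOffset) into tmp
  if tmp = 0 then exit repeat  -- no further matches
  add tmp to tOffset
  put tOffset & comma after tOffsetList
end repeat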

BUT there's a chance that this performs poorly, because of the repeated 
skipping of words already searched, so I would also benchmark the simpler

put 0 into tOffset
repeat for each trueWord W in tSource
  add 1 to tOffset
  if W = myWord then
    put tOffset & comma after tOffsetList
  end if
end repeat

> Pass-1. Count the instances of ‘the’ into an array and then
> Pass-2. Repeat for the count of instances using wordOffset, with a wordsToSkip
> variable derived from the previous loop’s offset
>
> I’m wondering if there’s something I’ve not yet learned about (nested?) arrays
> that might extend the unique word counter code that Alex, Paul & others helped
> me to fix a few days ago, to add a sub-array of wordOffset alongside word count?

I'm not entirely sure what you want here, or what the 'N' below are.
Do you want a count and an offsetList for each word ? If so, no need for 
nested arrays.


Then I'd change your second loop below to:

repeat for each trueWord W in tSource
   add 1 to tOffset
   if tANoise[W] then next repeat
   add 1 to tAWordCount[W]
   put tOffset & comma after tAWordOffsets[W]
end repeat

and of course the third loop to

repeat for each key K in tAWordCount
   put k && tAWordCount[K] & CR after tmp
end repeat
sort lines of tmp descending numeric by word 2 of each
put tmp into fld "Words"
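
For comparison, the nested-array variant asked about could be sketched as below. It is not needed for this task, as noted above; it only shows how a count and an offset list can sit under one key per word. The handler name wordStats and the sub-keys "count" and "offsets" are hypothetical.

```livecode
function wordStats pSource, pNoiseWords
   local tANoise, tAWords, tOffset, tWord, tKey, tList
   -- mark each noise word so it can be skipped with one array lookup
   repeat for each trueWord tWord in pNoiseWords
      put true into tANoise[tWord]
   end repeat
   put 0 into tOffset
   repeat for each trueWord tWord in pSource
      add 1 to tOffset
      if tANoise[tWord] then next repeat
      add 1 to tAWords[tWord]["count"]
      put tOffset & comma after tAWords[tWord]["offsets"]
   end repeat
   -- one line per word: count, word, offsets (char 1 to -2 drops the trailing comma)
   repeat for each key tKey in tAWords
      put tAWords[tKey]["count"] & tab & tKey & tab & \
            char 1 to -2 of tAWords[tKey]["offsets"] & cr after tList
   end repeat
   sort lines of tList descending numeric by word 1 of each
   return tList
end wordStats
```

The two flat arrays above do the same job with less indirection; the nested form mainly helps if more per-word fields are added later.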
 


If I've misunderstood what you want, please say so and I'll try again :-)

Alex.



> # Prepare noisewords array
> repeat for each trueWord W in tNoiseWords
> put true into tANoise[W]
> end repeat
>
> # Build unique words array
> repeat for each trueWord W in tSource
> if tANoise[W] then next repeat
> add 1 to tAWords[W][N]
> end repeat
>
> # Convert unique words array to list
> repeat for each key K in tAWords
> put K && tAWords[K][N] & CR after fld "Words"
> end repeat
>
> sort lines of field "Words" descending numeric by word 2 of each
>
> end repeat
>
> Any ideas or steer towards a lesson / worked example greatly appreciated.
> Best,
> Keith
 


