Re: Searching for a word when it's more than one word

2018-09-03 Thread David V Glasgow via use-livecode
My family was stranded for a while during a transfer at Frankfurt airport, 
while a computer system refused to accept that ‘Glasgow’ was not a 
destination. (At least, in that instance.)

Having said that, the same error is much more commonly made by taxi drivers, 
who can’t hide their disappointment when I am just going to the local 
station.

Cheers,

David Glasgow

> On 1 Sep 2018, at 5:57 pm, Richmond Mathewson via use-livecode 
>  wrote:
> 
> That sounds remarkably like two women who are friends of my parents:
> 
> One is called "Gay" and the other one is called "Loveday". They were friends 
> at school 60 years ago
> and when they were both widowed they moved in together; although the son of 
> one of them fell out
> with his wife and now lives with them as well.
> 
> Assumptions are sometimes difficult to avoid.
> 
> Although my younger son did actually dislocate his knee jumping to 
> conclusions . . .
> 
> This was mainly because he was trying to skip a difficult bit . . .
> 
> But I digress.
> 
> Richmond.
> 
> On 1/9/2018 6:39 pm, J. Landman Gay via use-livecode wrote:
>> There is a town in Texas called West, made infamous a few years ago by a 
>> giant explosion. I don't think you can make assumptions about names of 
>> places.
>> 
>> Mark's suggestion to check for words ending in "s" will fail on many towns, 
>> though apostrophe-s may be safe.
>> -- 
>> Jacqueline Landman Gay | jac...@hyperactivesw.com
>> HyperActive Software | http://www.hyperactivesw.com
>> On September 1, 2018 5:49:30 AM Richmond Mathewson via use-livecode 
>>  wrote:
>> 
>>> I can see that the "problem", which my stack does not address, is with 2
>>> or 3 part place names:
>>> 
>>> The Rochester/Chester problem is easily dealt with.
>>> 
>>> While it should be relatively easy to have a subroutine to deal with
>>> words such as "West" (after all, there are no places just called "West"),
>>> places like a town my parents once lived in called "Haselbury Plucknett"
>>> would cause problems.
>>> 
>>> AND, places such as "Ruyton of the Eleven Towns"
>>> (https://en.wikipedia.org/wiki/Ruyton-XI-Towns)
>>> would really throw a spanner in the works.
>>> 
>>> Come to think of things . . .
>>> 
>>> Unless anyone's code can cope with "Ruyton of the Eleven Towns" it won't
>>> stand up: we could even go further and call
>>> this the "Ruyton of the Eleven Towns Test".
>>> 
>>> More muffled background noises.
>>> 
>>> Richmond.
>>> 
>>> On 1/9/2018 1:29 pm, Mark Waddingham via use-livecode wrote:
>>>> On 2018-09-01 12:05, Richmond Mathewson via use-livecode wrote:
>>>>> Obviously, when considering names of places such as Colchester,
>>>>> Rochester and Chester one has
>>>>> to search for the longer names first and exclude them from later
>>>>> searches.
>>>> 
>>>> The 'substring' problem (i.e. Chester being 'in' Rochester) isn't
>>>> relevant in the above algorithm because we are 'tokenising' input and
>>>> phrases - essentially changing the alphabet.
>>>> 
>>>> i.e. "Rochester Chester Colchester" is turned into ABC, and we match
>>>> A, B or C as atomic units.
>>>> 
>>>> I should perhaps point out that the 'processText' operation probably
>>>> needs to be a little better in practice - to at least include a 'stop'
>>>> token for punctuation. For example:
>>>> 
>>>> "The man walked starting from East Hartford, West Hartford could be
>>>> seen in the distance."
>>>> 
>>>> In the case where 'Hartford West' and 'Hartford' are the 'known' towns
>>>> (and not 'East Hartford') - the proposed tokenization would result in:
>>>> 
>>>> The,man,walked,starting,from,East,Hartford,West,Hartford,could,be,seen,in,the,distance
>>>> 
>>>> Which means you'd get "Hartford West" and "Hartford" - when you should
>>>> only get "Hartford" (assuming you care about the linguistic structure
>>>> of the text, at least).
>>>> 
>>>> Indeed, this means that when preprocessing the text you can vastly
>>>> reduce the number of words to search - any sequence of words which
>>>> isn't in any phrase (or important punctuation) can be replaced by
>>>> "*", say. So the above would become:
>>>> 
>>>> *,East,Hartford,*,West,Hartford,*
>>>> 
>>>> The "*" tokens block matching multi-word phrases.
>>>> 
>>>> Warmest Regards,
>>>> 
>>>> Mark.
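
The stop-token idea described above can be sketched roughly as follows. This is only an illustration, not the 'processText' handler from the thread (which is not shown here): the handler name tokeniseWithStops and its parameters are made up for this sketch, and it treats only a handful of sentence punctuation marks as stops.

```livecode
-- Hypothetical helper (not from the thread): turn pText into a comma-separated
-- token list, keeping only words that occur in some known phrase and
-- collapsing everything else (unknown words, punctuation) into a single "*".
function tokeniseWithStops pText, pPhrases
   local tKnown, tTokens, tPhrase, tWord, tChar, tLastWasStop

   -- Record every word that appears in any known phrase
   -- (array keys are case-insensitive unless the caseSensitive is true)
   repeat for each line tPhrase in pPhrases
      repeat for each word tWord in tPhrase
         put true into tKnown[tWord]
      end repeat
   end repeat

   -- Surround sentence punctuation with spaces so each mark is its own word
   repeat for each char tChar in ",.;:!?"
      replace tChar with space & tChar & space in pText
   end repeat

   put false into tLastWasStop
   repeat for each word tWord in pText
      if tKnown[tWord] is true then
         put tWord & comma after tTokens
         put false into tLastWasStop
      else if not tLastWasStop then
         -- First unknown word or punctuation mark in a run: emit one stop token
         put "*" & comma after tTokens
         put true into tLastWasStop
      end if
   end repeat

   delete the last char of tTokens -- drop the trailing comma
   return tTokens
end tokeniseWithStops
```

Called with the example sentence (without its surrounding quote marks, since LiveCode treats a quoted string as a single word) and the two-line phrase list "Hartford West" / "Hartford", this should return *,Hartford,*,West,Hartford,*. Here 'East' is folded into the first "*" because it is not a word of any known phrase, and the "*" produced by the comma is what stops "Hartford West" from matching across the clause boundary.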
>>> 
>>> ___
>>> use-livecode mailing list
>>> use-livecode@lists.runrev.com
>>> Please visit this url to subscribe, unsubscribe and manage your 
>>> subscription preferences:
>>> http://lists.runrev.com/mailman/listinfo/use-livecode

Re: Searching for a word when it's more than one word

2018-09-02 Thread Tom Glod via use-livecode
I had this same problem a few weeks ago... luckily it wasn't critical to the
feature set, so I didn't find a solution. I will swing back around with the
help of this thread. Thanks for entertaining the problem.


Re: Searching for a word when it's more than one word

2018-09-02 Thread Quentin Long via use-livecode
Have pondered the question, and come up with some code which may or may not 
solve the problem at hand, but which may at least prove helpful in looking for 
a real solution:

==

Assumption: You’ve got a text document (not HTML, not RTF, just plain TXT) 
which contains, among other things, however-many place names.
Assumption: You have a return-list of place names, which may or may not be 
single words
Assumption: The text document is in the variable SourceDoc
Assumption: The list of place names is in the variable NamesList

Assumption: You want a document which contains a complete census of exactly 
which of the place-names in NamesList occur in SourceDoc
Assumption: For each place-name which does occur within SourceDoc, you want a 
list of the word-numbers at which each such occurrence begins

put empty into PlaceNamesCensus
repeat for each line DisName in NamesList
  put the number of words in DisName into DisNameWords
  -- SearchOffset is the number of chars of SourceDoc to skip before searching
  put 0 into SearchOffset
  put empty into FoundLocs
  repeat
    -- offset() counts from the end of the skipped chars, and returns 0 if not found
    put offset(DisName, SourceDoc, SearchOffset) into DisLoc
    if DisLoc = 0 then
      -- there is no (further) character string which matches the place name in question
      exit repeat
    end if
    -- there is a character string which matches the place name in question;
    -- SearchOffset now becomes the absolute char position where that match starts
    add DisLoc to SearchOffset
    -- is it the actual placename, and not finding "chester" in "colchester"?
    put the number of words in (char 1 to SearchOffset of SourceDoc) into StartWord
    if DisName = (word StartWord to (StartWord + DisNameWords - 1) of SourceDoc) then
      -- it's a match, yay!
      put StartWord into item (1 + the number of items in FoundLocs) of FoundLocs
    end if
  end repeat
  if FoundLocs is empty then
    -- nope, DisName wasn't in SourceDoc
    put "[nil]" into DeseLocs
  else
    -- yay! DisName *was* in SourceDoc! at least once!
    put FoundLocs into DeseLocs
  end if
  put DisName & comma & DeseLocs into line (1 + the number of lines in PlaceNamesCensus) of PlaceNamesCensus
end repeat

==

Known issue: The above code does not pretend to locate possessive instances of 
place names (e.g., California's, the United Kingdom's, etc.). Am thinking that 
pre-processing of SourceDoc will be helpful-to-necessary. This pre-processing 
may need to accommodate more issues than just possessives.
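
One possible shape for that pre-processing, as a rough sketch only: the handler name stripPossessives is invented here, and it deals with nothing but a trailing 's. It will also shorten contractions such as "it's" to "it", which may or may not matter for a place-name search.

```livecode
-- Hypothetical helper (not from the thread): remove possessive 's endings
-- so that "California's" reads as "California" before searching.
function stripPossessives pText
   local tNext

   -- Treat the typographic apostrophe like the plain one
   replace "’" with "'" in pText

   -- Drop 's only where it ends a word, i.e. is followed by whitespace
   -- or by one of a few common punctuation marks
   repeat for each char tNext in " ,.;:!?)" & tab & return
      replace "'s" & tNext with tNext in pText
   end repeat

   -- Handle a possessive at the very end of the text
   if pText ends with "'s" then
      delete char -2 to -1 of pText
   end if

   return pText
end stripPossessives
```

The result could be put back into SourceDoc before the census loop runs. Note that removing characters shifts nothing as far as word numbers are concerned, since each possessive stays a single word.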
 

"Bewitched" + "Charlie's Angels" - Charlie = "At Arm's Length"
Read the webcomic at [ http://www.atarmslength.net ]!
If you like "At Arm's Length", support it at [ 
http://www.patreon.com/DarkwingDude ].

[OT] Up is down (was: Searching for a word when it's more than one word)

2018-09-01 Thread J. Landman Gay via use-livecode
On September 1, 2018 6:34:17 PM Mark Wieder via use-livecode wrote:

> On 09/01/2018 02:48 PM, J. Landman Gay via use-livecode wrote:
>> No, it's a little north-east of center.
>
> Wait. What? West is north-east of center?

Of course. When you're that far south, everything is north. I assume their 
center must be somewhat dynamic, perhaps based on where the most cattle are 
at the moment.


--
Jacqueline Landman Gay | jac...@hyperactivesw.com
HyperActive Software | http://www.hyperactivesw.com





Re: Searching for a word when it's more than one word

2018-09-01 Thread Mark Wieder via use-livecode

On 09/01/2018 02:48 PM, J. Landman Gay via use-livecode wrote:
> No, it's a little north-east of center.

Wait. What? West is north-east of center?

--
 Mark Wieder
 ahsoftw...@gmail.com



Re: Searching for a word when it's more than one word

2018-09-01 Thread J. Landman Gay via use-livecode

No, it's a little north-east of center.

On 9/1/18 12:02 PM, Richmond Mathewson via use-livecode wrote:
> Is West, Texas in West Texas?
>
> Richmond.
>
> On 1/9/2018 6:55 pm, Mark Wieder via use-livecode wrote:
>> On 09/01/2018 08:39 AM, J. Landman Gay via use-livecode wrote:
>>> There is a town in Texas called West, made infamous a few years ago
>>> by a giant explosion. I don't think you can make assumptions about
>>> names of places.
>>
>> And thus the distinction between West Texas and West, Texas.



--
Jacqueline Landman Gay | jac...@hyperactivesw.com
HyperActive Software   | http://www.hyperactivesw.com



Re: Searching for a word when it's more than one word

2018-09-01 Thread J. Landman Gay via use-livecode

On 9/1/18 10:55 AM, Mark Wieder via use-livecode wrote:
> On 09/01/2018 08:39 AM, J. Landman Gay via use-livecode wrote:
>> There is a town in Texas called West, made infamous a few years ago by
>> a giant explosion. I don't think you can make assumptions about names
>> of places.
>
> And thus the distinction between West Texas and West, Texas.



When I first heard it on the news, I thought half of Texas had 
disappeared. I had mixed feelings when I found out it didn't.


--
Jacqueline Landman Gay | jac...@hyperactivesw.com
HyperActive Software   | http://www.hyperactivesw.com



Re: Searching for a word when it's more than one word

2018-09-01 Thread Richmond Mathewson via use-livecode

East or West, home is a comfy LiveCode stack . . .

Well; here's my third version, which does better than the first 2:

https://www.dropbox.com/s/r3yocmqzwhwu4ta/Text%20analyzer%20X.livecode.zip?dl=0

Richmond.



Re: Searching for a word when it's more than one word

2018-09-01 Thread Richmond Mathewson via use-livecode

We're all in a state at the moment with this one.

Richmond.

On 1/9/2018 7:24 pm, Stephen MacLean via use-livecode wrote:
> Thankfully, in my case, I do know what at least the state is:)
>
>> On Sep 1, 2018, at 11:55 AM, Mark Wieder via use-livecode wrote:
>>
>>> On 09/01/2018 08:39 AM, J. Landman Gay via use-livecode wrote:
>>> There is a town in Texas called West, made infamous a few years ago by a
>>> giant explosion. I don't think you can make assumptions about names of places.
>>
>> And thus the distinction between West Texas and West, Texas.
>>
>> --
>> Mark Wieder
>> ahsoftw...@gmail.com



Re: Searching for a word when it's more than one word

2018-09-01 Thread Richmond Mathewson via use-livecode

Is West, Texas in West Texas?

Richmond.

On 1/9/2018 6:55 pm, Mark Wieder via use-livecode wrote:
> On 09/01/2018 08:39 AM, J. Landman Gay via use-livecode wrote:
>> There is a town in Texas called West, made infamous a few years ago
>> by a giant explosion. I don't think you can make assumptions about
>> names of places.
>
> And thus the distinction between West Texas and West, Texas.





Re: Searching for a word when it's more than one word

2018-09-01 Thread Richmond Mathewson via use-livecode

That sounds remarkably like two women who are friends of my parents:

One is called "Gay" and the other one is called "Loveday". They were 
friends at school 60 years ago, and when they were both widowed they moved 
in together; although the son of one of them fell out with his wife and 
now lives with them as well.

Assumptions are sometimes difficult to avoid.

Although my younger son did actually dislocate his knee jumping to 
conclusions . . .


This was mainly because he was trying to skip a difficult bit . . .

But I digress.

Richmond.



Re: Searching for a word when it's more than one word

2018-09-01 Thread Stephen MacLean via use-livecode
Thankfully, in my case, I do at least know what the state is :)

> On Sep 1, 2018, at 11:55 AM, Mark Wieder via use-livecode 
>  wrote:
> 
>> On 09/01/2018 08:39 AM, J. Landman Gay via use-livecode wrote:
>> There is a town in Texas called West, made infamous a few years ago by a 
>> giant explosion. I don't think you can make assumptions about names of 
>> places.
> 
> And thus the distinction between West Texas and West, Texas.
> 
> -- 
> Mark Wieder
> ahsoftw...@gmail.com
> 


Re: Searching for a word when it's more than one word

2018-09-01 Thread Mark Wieder via use-livecode

On 09/01/2018 08:39 AM, J. Landman Gay via use-livecode wrote:
> There is a town in Texas called West, made infamous a few years ago by a
> giant explosion. I don't think you can make assumptions about names of
> places.

And thus the distinction between West Texas and West, Texas.

--
 Mark Wieder
 ahsoftw...@gmail.com



Re: Searching for a word when it's more than one word

2018-09-01 Thread J. Landman Gay via use-livecode
There is a town in Texas called West, made infamous a few years ago by a 
giant explosion. I don't think you can make assumptions about names of places.


Mark's suggestion to check for words ending in "s" will fail on many towns, 
though apostrophe-s may be safe.

--
Jacqueline Landman Gay | jac...@hyperactivesw.com
HyperActive Software | http://www.hyperactivesw.com


Re: Searching for a word when it's more than one word

2018-09-01 Thread Stephen MacLean via use-livecode
Wow, this is awesome, thank you all!!

Sorry, on the road taking my daughter to college, would love to try some of 
this out. 

One thing to keep in mind is that as I’m checking for names against the 
town list, I may not know what town I’m actually looking for. Usually I do, but 
not always. 

Therefore I’ve been counting how many of each name I’ve come across and doing 
some calculations at the end to make a best guess. 

Really appreciate the responses!!

Thank you,

Steve
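
For what it's worth, the counting step described above might look something like the sketch below. The "calculations" Steve actually uses are not given in the thread, so this simply assumes the most frequently found name wins; the handler name bestGuessTown and its input format (one found name per line, one line per occurrence) are invented for illustration.

```livecode
-- Hypothetical helper (not from the thread): given a return-delimited list
-- with one line per town name found in the text, return the name that
-- occurs most often. On a tie, the name that reached the top count first wins.
function bestGuessTown pFoundNames
   local tCounts, tName, tBest, tBestCount

   put 0 into tBestCount
   repeat for each line tName in pFoundNames
      if tName is empty then next repeat
      -- An empty array element counts as 0, so this also creates new entries
      add 1 to tCounts[tName]
      if tCounts[tName] > tBestCount then
         put tCounts[tName] into tBestCount
         put tName into tBest
      end if
   end repeat

   return tBest
end bestGuessTown
```

A real best guess would probably weigh more than raw frequency (for instance, where in the document a name appears), so treat this as a starting point only.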

> On Sep 1, 2018, at 7:53 AM, Richmond Mathewson via use-livecode 
>  wrote:
> 
> 
> 
>> On 1/9/2018 2:50 pm, Mark Waddingham via use-livecode wrote:
>>> On 2018-09-01 13:15, Richmond Mathewson via use-livecode wrote:
>>> I've already shovelled Ruyton of the Eleven Towns quite effectively:
>>> 
>>> https://www.dropbox.com/s/n7r7u0c2m9ny3eb/Text%20analyzer%20X.livecode.zip?dl=0
>>>  
>>> 
>>> No tokenising, in fact very basic stuff indeed.
>>> 
>>> Not wishing to bang on about over-complicating things . . . . .
>> 
>> There is actually a 'correct' more shovelistic approach (at least I *think* 
>> this is correct):
>> 
>> -- Ensure all punctuation is surrounded by space
>> repeat for each char tPuncChar in ",.';:()[]{}<>!@£$%^&*-_+=~`?/\|#€" & quote
>>  replace tPuncChar with space & tPuncChar & space in tText
>> end repeat
> 
> That's a "point" (pun intended) as I just fell foul of a full stop.
>> 
>> -- Ensure all whitespace is space
>> replace return with space in tText
>> replace tab with space in tText
>> 
>> -- Ensure there are never two spaces next to each other in tText
>> repeat while tText contains "  "
>>  replace "  " with " " in tText
>> end repeat
>> 
>> -- Ensure there is only ever one space between words in phrases
>> repeat while tPhrases contains "  "
>>  replace "  " with " " in tPhrases
>> end repeat
>> 
>> -- We can now use an itemDelimiter of space
>> set the itemDelimiter to space
>> 
>> -- Sort the phrases by descending word length.
>> sort lines of tPhrases descending numeric by the number of items in each
>> 
>> -- Now check for, and remove each phrase from the source text in turn
>> set the wholeMatches to true
>> repeat for each line tPhrase in tPhrases
>>  -- If the phrase is not present then skip to the next
>>  if itemOffset(tPhrase, tText) is 0 then
>>    next repeat
>>  end if
>> 
>>  -- Accumulate the phrase on the output list
>>  put tPhrase & return after tFoundPhrases
>> 
>>  -- Remove the phrase from the input text (we assume here that * does not
>>  -- appear in any phrase)
>>  replace tPhrase with "*" in tText
>> end repeat
>> 
>> Warmest Regards,
>> 
>> Mark.
>> 
>> P.S. The above will be reasonably quick for small sets of phrases / small 
>> source texts - but I think as the size of either increases it will get very 
>> slow, very quickly!
>> 
> 
> 



___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode

Re: Searching for a word when it's more than one word

2018-09-01 Thread Richmond Mathewson via use-livecode



On 1/9/2018 2:50 pm, Mark Waddingham via use-livecode wrote:

On 2018-09-01 13:15, Richmond Mathewson via use-livecode wrote:

I've already shovelled Ruyton of the Eleven Towns quite effectively:

https://www.dropbox.com/s/n7r7u0c2m9ny3eb/Text%20analyzer%20X.livecode.zip?dl=0 



No tokenising, in fact very basic stuff indeed.

Not wishing to bang on about over-complicating things . . . . .


There is actually a 'correct' more shovelistic approach (at least I 
*think* this is correct):


-- Ensure all punctuation is surrounded by space
repeat for each char tPuncChar in ",.';:()[]{}<>!@£$%^&*-_+=~`?/\|#€" & quote
  replace tPuncChar with space & tPuncChar & space in tText
end repeat


That's a "point" (pun intended) as I just fell foul of a full stop.



Re: Searching for a word when it's more than one word

2018-09-01 Thread Richmond Mathewson via use-livecode

It didn't like this:

on mouseDown
   put empty into fld "zText"
   if fld "xText" contains "Ruyton of the Eleven Towns." then
      put fld "xText" into fld "zText"
      put "Ruyton of the Eleven Towns." into CHUNNK
      put empty into CHUNNK of fld "zText"
   end if
end mouseDown

Richmond.



Re: Searching for a word when it's more than one word

2018-09-01 Thread Mark Waddingham via use-livecode

On 2018-09-01 13:15, Richmond Mathewson via use-livecode wrote:

I've already shovelled Ruyton of the Eleven Towns quite effectively:

https://www.dropbox.com/s/n7r7u0c2m9ny3eb/Text%20analyzer%20X.livecode.zip?dl=0

No tokenising, in fact very basic stuff indeed.

Not wishing to bang on about over-complicating things . . . . .


There is actually a 'correct' more shovelistic approach (at least I 
*think* this is correct):


-- Ensure all punctuation is surrounded by space
repeat for each char tPuncChar in ",.';:()[]{}<>!@£$%^&*-_+=~`?/\|#€" & quote
  replace tPuncChar with space & tPuncChar & space in tText
end repeat

-- Ensure all whitespace is space
replace return with space in tText
replace tab with space in tText

-- Ensure there are never two spaces next to each other in tText
repeat while tText contains "  "
  replace "  " with " " in tText
end repeat

-- Ensure there is only ever one space between words in phrases
repeat while tPhrases contains "  "
  replace "  " with " " in tPhrases
end repeat

-- We can now use an itemDelimiter of space
set the itemDelimiter to space

-- Sort the phrases by descending word length.
sort lines of tPhrases descending numeric by the number of items in each

-- Now check for, and remove each phrase from the source text in turn
set the wholeMatches to true
repeat for each line tPhrase in tPhrases
  -- If the phrase is not present then skip to the next
  if itemOffset(tPhrase, tText) is 0 then
    next repeat
  end if

  -- Accumulate the phrase on the output list
  put tPhrase & return after tFoundPhrases

  -- Remove the phrase from the input text (we assume here that * does
  -- not appear in any phrase)
  replace tPhrase with "*" in tText
end repeat
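
For anyone who wants to try the idea outside LiveCode, the steps above can be sketched in Python. This is only an illustration under a few assumptions: the name find_phrases is made up, matching is case-sensitive (unlike LiveCode's default), and the text is split into word and punctuation tokens instead of being padded with spaces, so a phrase can only match whole words.

```python
import re

# A token is either a run of word characters or one punctuation character.
TOKEN = re.compile(r"\w+|[^\w\s]")

def find_phrases(text, phrases):
    """Return the phrases present in text, trying the longest ones first."""
    tokens = TOKEN.findall(text)
    # Longest phrase (by word count) first, so 'Palm Beach West' is
    # consumed before 'Palm Beach' gets a chance to match inside it.
    ordered = sorted(phrases, key=lambda p: len(TOKEN.findall(p)), reverse=True)
    found = []
    for phrase in ordered:
        words = TOKEN.findall(phrase)
        if not words:
            continue  # ignore empty lines in the phrase list
        hit = False
        i = 0
        while i + len(words) <= len(tokens):
            if tokens[i:i + len(words)] == words:
                # Replace the match with '*' so shorter phrases cannot reuse it
                # (as above, we assume '*' never appears in a phrase).
                tokens[i:i + len(words)] = ["*"]
                hit = True
            i += 1
        if hit:
            found.append(phrase)
    return found

print(find_phrases("We drove to Palm Beach West.", ["Palm Beach", "Palm Beach West"]))
```

Like the handler above, this rescans the whole text once per phrase, so it will slow down in the same way as the lists grow.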

Warmest Regards,

Mark.

P.S. The above will be reasonably quick for small sets of phrases / 
small source texts - but I think as the size of either increases it will 
get very slow, very quickly!


--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps


Re: Searching for a word when it's more than one word

2018-09-01 Thread Richmond Mathewson via use-livecode




On 1/9/2018 2:25 pm, Mark Waddingham via use-livecode wrote:

On 2018-09-01 13:15, Richmond Mathewson via use-livecode wrote:

I've already shovelled Ruyton of the Eleven Towns quite effectively:

https://www.dropbox.com/s/n7r7u0c2m9ny3eb/Text%20analyzer%20X.livecode.zip?dl=0 



No tokenising, in fact very basic stuff indeed.

Not wishing to bang on about over-complicating things . . . . .


Your revised approach is fine - as long as the names of all the towns 
are distinct in terms of no one town's name is contained within another.


Blast!

Of course "my next trick" is to work out how to delete multi-word names 
(i.e. phrases) from a textField.


Richmond.




Re: Searching for a word when it's more than one word

2018-09-01 Thread Mark Waddingham via use-livecode

On 2018-09-01 13:15, Richmond Mathewson via use-livecode wrote:

I've already shovelled Ruyton of the Eleven Towns quite effectively:

https://www.dropbox.com/s/n7r7u0c2m9ny3eb/Text%20analyzer%20X.livecode.zip?dl=0

No tokenising, in fact very basic stuff indeed.

Not wishing to bang on about over-complicating things . . . . .


Your revised approach is fine - as long as the names of all the towns 
are distinct in terms of no one town's name is contained within another.


Add 'Palm Beach West' and 'Palm Beach' to your placeNames list; then 
modify your source text to end 'or Palm Beach West' - and your algorithm 
does not perform the requested operation.


It reports Palm Beach West *and* Palm Beach as being present - whereas, 
only 'Palm Beach West' is present :D


Warmest Regards,

Mark.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps



Re: Searching for a word when it's more than one word

2018-09-01 Thread Richmond Mathewson via use-livecode

I've already shovelled Ruyton of the Eleven Towns quite effectively:

https://www.dropbox.com/s/n7r7u0c2m9ny3eb/Text%20analyzer%20X.livecode.zip?dl=0

No tokenising, in fact very basic stuff indeed.

Not wishing to bang on about over-complicating things . . . . .

Probably time for both Thee and Me to get out and get some fresh air 
before we ruin our weekends.


Richmond.



Re: Searching for a word when it's more than one word

2018-09-01 Thread Mark Waddingham via use-livecode

On 2018-09-01 12:50, Richmond Mathewson via use-livecode wrote:

Yup: indeed: fairly coarse.

However, see my next posting re "Ruyton of the Eleven Towns"

that should make some folk feel that they need a set of sewing needles
rather than "just" a silver teaspoon.


I think you'll find my 'silver teaspoon' approach (as you put it) deals 
with all those cases :D


Interestingly, as I said, the multi-word match problem can be reduced to 
your 'shovel' - with pre and post processing.


Let's say that the phrase list is:

  Ruyton of the Eleven Towns
  East Hartfordshire
  Colchester
  Chester

First create a mapping from phrase words to individual characters (the 
choice of character is arbitrary):


  Ruyton <-> A
  of <-> B
  the <-> C
  Eleven <-> D
  Towns <-> E
  East <-> F
  Hartfordshire <-> G
  Colchester <-> H
  Chester <-> I

Now iterate through the source text, generating an output source text 
consisting of words from the new alphabet, and an 'unknown' letter '*'. 
For example:


  The man from Ruyton of the Eleven Towns, who is of the order of 
shovels, travelled from Chester to Colchester via the towns in East 
Hartfordshire


Would become:

  C**ABCDE**BC*B***I*H*CE*FG

The original phrase list is processed similarly to give:

  ABCDE
  FG
  H
  I

Searching the transformed source text using your algorithm with the list 
of transformed phrases would give the correct set of found phrases as 
required by the original problem.
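
As a rough illustration of the mapping step (in Python rather than LiveCode, purely so it can be run anywhere), the sketch below builds the word-to-letter table and encodes both the phrases and the source text. The names build_alphabet and encode are invented for this example, words are compared case-insensitively, punctuation is simply dropped, and only 26 distinct phrase words are supported because the symbols are the letters A to Z. Note that every occurrence of "the", including the one in "via the towns", encodes as C.

```python
import re
import string

def build_alphabet(phrases):
    """Assign one capital letter to each distinct word used in any phrase."""
    alphabet = {}
    for phrase in phrases:
        for word in re.findall(r"\w+", phrase.lower()):
            if word not in alphabet:
                # Assumption: at most 26 distinct words; a real version
                # would need a larger symbol set (e.g. integers).
                alphabet[word] = string.ascii_uppercase[len(alphabet)]
    return alphabet

def encode(text, alphabet):
    """Replace each word by its letter, or by '*' if it is in no phrase."""
    words = re.findall(r"\w+", text.lower())
    return "".join(alphabet.get(word, "*") for word in words)

phrases = ["Ruyton of the Eleven Towns", "East Hartfordshire",
           "Colchester", "Chester"]
text = ("The man from Ruyton of the Eleven Towns, who is of the order of "
        "shovels, travelled from Chester to Colchester via the towns in "
        "East Hartfordshire")

alphabet = build_alphabet(phrases)
print(encode(text, alphabet))                 # C**ABCDE**BC*B***I*H*CE*FG
print([encode(p, alphabet) for p in phrases]) # ['ABCDE', 'FG', 'H', 'I']
```

Once both sides are encoded, a plain substring search (longest encoded phrase first) on the encoded text is enough, since each letter now stands for a whole word.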


Warmest Regards,

Mark.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps



Re: Searching for a word when it's more than one word

2018-09-01 Thread Richmond Mathewson via use-livecode

Yup: indeed: fairly coarse.

However, see my next posting re "Ruyton of the Eleven Towns"

that should make some folk feel that they need a set of sewing needles 
rather than "just" a silver teaspoon.


Richmond.



Re: Searching for a word when it's more than one word

2018-09-01 Thread Richmond Mathewson via use-livecode
I can see that the "problem", which my stack does not address, is with 2 
or 3 part place names:


The Rochester/Chester problem is easily dealt with.

While it should be relatively easy to have a subroutine to deal with 
words such as "West" (after all, there are no places just called "West"),
places like a town my parents once lived in called "Haselbury Plucknett" 
would cause problems.


AND, places such as "Ruyton of the Eleven Towns" 
(https://en.wikipedia.org/wiki/Ruyton-XI-Towns)

would really throw a spanner in the works.

Come to think of things . . .

Unless anyone's code can cope with "Ruyton of the Eleven Towns" it won't 
stand up: we could even go further and call

this the "Ruyton of the Eleven Towns Test".

More muffled background noises.

Richmond.



Re: Searching for a word when it's more than one word

2018-09-01 Thread Mark Waddingham via use-livecode

On 2018-09-01 12:35, Richmond Mathewson via use-livecode wrote:

That's because you lot tend to use a silver teaspoon while I tend to
use a great big shovel:

https://www.dropbox.com/s/00t8oftb1ydm8ni/Text%20analyzer%20X.livecode.zip?dl=0


Heh, great big shovels are great for coarse work - e.g. for the problem 
of finding occurrences of SINGLE WORD towns in the source text - as you 
are in your stack.


However, in this case, that wasn't what was asked for - the problem was 
to find multi-word town names with the constraints that first and 
longest match always wins with no overlap (i.e. as a human would read 
them):


i.e. East Hartford West Palm Beach Colchester Newchester West Chester

With a town list of

   East Hartford
   Hartford West
   West Palm Beach
   Palm Beach
   Chester
   West Chester

Should return:

   East Hartford
   West Palm Beach
   West Chester
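
To make the 'first and longest match wins, no overlap' rule concrete, here is a small Python sketch (not LiveCode, and not the handler from my earlier message). The function name match_phrases is made up, the text is split on whitespace only, and comparison is case-sensitive; those are simplifying assumptions for the example.

```python
def match_phrases(text, phrases):
    """Scan left to right; at each word take the longest phrase that starts there."""
    words = text.split()
    # Longest candidates first, so the first hit at a position is the longest.
    candidates = sorted((p.split() for p in phrases), key=len, reverse=True)
    matches = []
    i = 0
    while i < len(words):
        for cand in candidates:
            if cand and words[i:i + len(cand)] == cand:
                matches.append(" ".join(cand))
                i += len(cand)  # skip the matched words: no overlap allowed
                break
        else:
            i += 1  # no phrase starts at this word
    return matches

towns = ["East Hartford", "Hartford West", "West Palm Beach",
         "Palm Beach", "Chester", "West Chester"]
text = "East Hartford West Palm Beach Colchester Newchester West Chester"
print(match_phrases(text, towns))
# ['East Hartford', 'West Palm Beach', 'West Chester']
```

Because whole words are compared, Colchester and Newchester never match Chester, and 'Hartford West' is never reported since 'Hartford' has already been consumed by 'East Hartford'.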

Warmest Regards,

Mark.

P.S. The problem is actually exactly the same - in the single-word case 
your alphabet is the set of characters in the language. In the multi-word 
case, your alphabet is the set of words in all phrases, with a 'stop' 
word.


--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps



Re: Searching for a word when it's more than one word

2018-09-01 Thread Richmond Mathewson via use-livecode
That's because you lot tend to use a silver teaspoon while I tend to use 
a great big shovel:


https://www.dropbox.com/s/00t8oftb1ydm8ni/Text%20analyzer%20X.livecode.zip?dl=0

Richmond.



Re: Searching for a word when it's more than one word

2018-09-01 Thread Mark Waddingham via use-livecode

On 2018-09-01 12:05, Richmond Mathewson via use-livecode wrote:

Obviously, when considering names of places such as Colchester,
Rochester and Chester one has
to search for the longer names first and exclude them from later 
searches.


The 'substring' problem (i.e. Chester being 'in' Rochester) isn't 
relevant in the above algorithm because we are 'tokenising' input and 
phrases - essentially changing the alphabet.


i.e. "Rochester Chester Colchester" is turned into ABC, and we match A, 
B or C as atomic units.


I should perhaps point out that the 'processText' operation probably 
needs to be a little better in practice - to at least include a 'stop' 
token for punctuation. For example:


  "The man walked starting from East Hartford, West Hartford could be 
seen in the distance."


In the case where 'Hartford West' and 'Hartford' are the 'known' towns 
(and not 'East Hartford') - the proposed tokenization would result in:


   
The,man,walked,starting,from,East,Hartford,West,Hartford,could,be,seen,in,the,distance


Which means you'd get "Hartford West" and "Hartford" - when you should 
only get "Hartford" (assuming you care about the linguistic structure of 
the text, at least).


Indeed, the above means that in preprocessing the text, you can 
vastly reduce the number of words to search - any sequences of 
words which aren't in any phrase (or important punctuation) can be 
replaced by "*" say. So the above would become:


  *,East,Hartford,*,West,Hartford,*

The "*" tokens block matching multi-word phrases.
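
A quick way to see the effect is the Python sketch below (again just an illustration, not LiveCode). The name tokenize_with_stops is invented, and for the example I assume the phrase list also contains 'East Hartford', so that 'East' survives as a token; any word or punctuation mark that is not part of some phrase becomes '*', and runs of '*' are collapsed into one.

```python
import re

def tokenize_with_stops(text, phrases):
    """Keep words that occur in some phrase; turn everything else into '*'."""
    known = {word.lower() for phrase in phrases for word in phrase.split()}
    tokens = []
    # A token is a run of word characters or a single punctuation character.
    for tok in re.findall(r"\w+|[^\w\s]", text):
        out = tok if tok.lower() in known else "*"
        # Collapse consecutive stop tokens into a single '*'.
        if out == "*" and tokens and tokens[-1] == "*":
            continue
        tokens.append(out)
    return tokens

towns = ["East Hartford", "Hartford West", "Hartford"]
text = ("The man walked starting from East Hartford, West Hartford could be "
        "seen in the distance.")
print(",".join(tokenize_with_stops(text, towns)))
# *,East,Hartford,*,West,Hartford,*
```

The comma after the first 'Hartford' becomes a '*', which is what stops 'Hartford West' from matching across the clause boundary.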

Warmest Regards,

Mark.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps



Re: Searching for a word when it's more than one word

2018-09-01 Thread Richmond Mathewson via use-livecode
Obviously, when considering names of places such as Colchester, 
Rochester and Chester one has

to search for the longer names first and exclude them from later searches.

Richmond.


Re: Searching for a word when it's more than one word

2018-09-01 Thread Mark Waddingham via use-livecode

On 2018-09-01 06:48, Stephen MacLean via use-livecode wrote:

Hi All,

First, followed Keith Clarke’s thread and got a lot out of it, thank
you all. That’s gone into my code snippets!

Now I know, the title is not technically true, if it’s 2 words, they
are distinct and different. Maybe it’s because I’ve been banging my
head against this and some other things too long and need to step
back, but I’m having issues getting this all to work reliably.

I’m searching for town names in various text from a list of towns .
Most names are one word, easy to find and count. Some names are 2 or 3
words, like East Hartford or West Palm Beach. Those go against
distinct towns like Hartford and Palm Beach. Others have their names
inside of other town names like Colchester and Chester.


So the problem you are trying to solve sounds like this:

Given a source text TEXT, and a list of multi-word phrases PHRASES, find 
the longest elements of PHRASES which occur in TEXT when reading from 
left to right.


One way to do this is to preprocess the source TEXT and PHRASES, and 
then iterate over it with back-tracking attempting to match each phrase 
in the list.


Preprocessing can be done like this:

  // pText is arbitrary language text, where it is presumed 'trueWord' will
  // extract the words we can match against those in PHRASES
  command preprocessText pText, @rWords
    local tWords
    repeat for each trueWord tWord in pText
      -- normalize word variants - e.g. turn Chester's into Chester
      if tWord ends with "'s" then
        put char 1 to -3 of tWord into tWord
      else if ... then
        ...
      else if ... then
        ...
      end if
      put tWord into tWords[the number of elements in tWords + 1]
    end repeat
    put tWords into rWords
  end preprocessText

This gives a sequence of words, in order - where word variants have been 
normalized to the 'root' word (the general operation here is called 
'stemming' - in your case as you are dealing with fragments of proper 
nouns - 's / s suffixes are probably good enough).


The processing for PHRASES is needed to ensure that they all follow a 
consistent form:


  // pPhrases is presumed to be a return-delimited list of phrases
  command preprocessPhrases pPhrases, @rPhrases
    -- We accumulate phrases as the keys of tPhrasesA to eliminate
    -- duplicates
    local tPhrasesA
    put empty into tPhrasesA

    repeat for each line tPhrase in pPhrases
      -- Rebuild the phrase from its trueWords, separated by single spaces
      -- (a separate variable is used so the loop variable is not clobbered)
      local tCleanPhrase
      put empty into tCleanPhrase
      repeat for each trueWord tWord in tPhrase
        put tWord & space after tCleanPhrase
      end repeat
      delete the last char of tCleanPhrase
      put true into tPhrasesA[tCleanPhrase]
    end repeat

    put the keys of tPhrasesA into rPhrases
  end preprocessPhrases

This produces a return-delimited list of phrases, where the individual 
words in each phrase are separated by a *single* space with all 
punctuation stripped, and no phrase appears twice.


With this pre-processing (note the PHRASES pre-processing only needs to 
be done once for any set of PHRASES to match), a naive search algorithm 
would be:


  // pText should be a sequence array of words to search (we use an
  // array here because we need fast random access)
  // pPhrases should be a line delimited string-list of multi-word
  // phrases to find
  // rMatches will be a string-list of phrases which have been found
  command searchTextForPhrases pText, pPhrases, @rMatches
    local tMatchesA
    put empty into tMatchesA

    -- Our phrases are single-space delimited, so set the item delimiter
    set the itemDelimiter to space

    -- Loop through pText, by default we bump tIndex by one each time
    -- however, if a match is found, then we can skip the words
    -- constituting the matched phrase.
    local tIndex
    put 1 into tIndex
    repeat until pText[tIndex] is empty
      -- Store the current longest match we have found starting at tIndex
      local tCurrentMatch
      put empty into tCurrentMatch

      -- Check each phrase in turn for a match.
      repeat for each line tPhrase in pPhrases
        -- Assume a match succeeds until it doesn't
        local tPhraseMatched
        put true into tPhraseMatched

        -- Iterate through the items (words) in each phrase, if the
        -- sequence of words in the phrase is not the same as the
        -- sequence of words in the text starting at tIndex, then
        -- tPhraseMatched will be false on exit of the loop.
        local tSubIndex
        put tIndex into tSubIndex
        repeat for each item tWord in tPhrase
          -- Failure to match the word at tSubIndex is failure to
          -- match the phrase
          if pText[tSubIndex] is not tWord then
            put false into tPhraseMatched
            exit repeat
          end if

          -- The current word of the phrase matches, so move to the next
          add 1 to tSubIndex
        end repeat

        -- We are only interested in the longest match at any point, so
        -- only

Re: Searching for a word when it's more than one word

2018-09-01 Thread Keith Clarke via use-livecode
Very interesting, Steve. Your use case is actually very close to what I’m trying 
to achieve, which is to identify keywords and phrases within a corpus of text - 
think prioritised ‘tag cloud’ metadata.

My original plan (as a non-programmer) was to identify the most popular unique 
words within the corpus and then go back in to find the words either side and 
check their popularity, etc.

However, from what I’ve learned here, my current pseudo-logic is:

1. Parse the whole source into 1, 2, 3 and 4 trueWord chunks (ideally in one 
pass but I’m still struggling with my array learning curve, so probably via 
lists & fields so I can see my workings)  
2. Remove lines containing noise words and any punctuation that would, by 
definition terminate the keyword/phrase
3. Count & deduplicate the remaining lines
4. Sense-check against a ‘current keywords’ list (which appears to resonate 
with your town names problem?) 
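
Steps 1 and 3 above could be sketched in one pass along the following 
lines. This is only a rough sketch under assumptions: the function name 
countWordSequences and its parameters are made up for illustration, it does 
no noise-word filtering (step 2), and it lets sequences run across sentence 
boundaries.

```livecode
-- Hypothetical sketch: tally every 1- to pMaxLength-word sequence in pSource
function countWordSequences pSource, pMaxLength
   -- Collect the trueWords into a numerically keyed array for random access
   local tWords, tCount
   put 0 into tCount
   repeat for each trueWord tWord in pSource
      add 1 to tCount
      put tWord into tWords[tCount]
   end repeat

   -- From each starting word, grow the sequence one word at a time and
   -- count each sequence; the array keys de-duplicate automatically
   local tCounts, tStart, tLength, tSequence
   repeat with tStart = 1 to tCount
      put empty into tSequence
      repeat with tLength = 1 to pMaxLength
         if tStart + tLength - 1 > tCount then exit repeat
         if tSequence is not empty then put space after tSequence
         put tWords[tStart + tLength - 1] after tSequence
         add 1 to tCounts[tSequence]
      end repeat
   end repeat
   return tCounts
end countWordSequences
```

For example, with "West Palm Beach is south of Palm Beach Gardens" as the 
source and 4 as the maximum length, the element of the returned array 
keyed "Palm Beach" would be 2. Array keys are not case-sensitive unless 
the caseSensitive property is set, so "The" and "the" are counted together 
by default.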

From the unique-word results I’ve found, I also note issues around 
singular/plural, synonyms, alternative spelling, etc. - which speak to ‘fuzzy 
logic’ or, dare one mention, NLP (as in Natural Language Processing) capabilities. 

I wonder if anyone has experimented with LiveCode accessing / using any 
libraries for this kind of language processing - probably another Pandora’s box 
containing infinity + 1 cans of worms! :-)  

Back to basics, I’ll share my workings as I blunder forward and would welcome 
any insights the community experts have to offer.
Best,
Keith 

> On 1 Sep 2018, at 05:48, Stephen MacLean via use-livecode 
>  wrote:
> 
> Hi All,
> 
> First, followed Keith Clarke’s thread and got a lot out of it, thank you all. 
> That’s gone into my code snippets!
> 
> Now I know, the title is not technically true, if it’s 2 words, they are 
> distinct and different. Maybe it’s because I’ve been banging my head against 
> this and some other things too long and need to step back, but I’m having 
> issues getting this all to work reliably.
> 
> I’m searching for town names in various text from a list of towns. Most 
> names are one word, easy to find and count. Some names are 2 or 3 words, like 
> East Hartford or West Palm Beach. Those go against distinct towns like 
> Hartford and Palm Beach. Others have their names inside of other town names 
> like Colchester and Chester.
> 
> “is among the words of” or “is among the trueWords of” works great to find 
> single words, but only works on single words and doesn’t consider “Chester’s” 
> to be “Chester” (strictly speaking, it isn’t).
> 
> “is in” works great for finding multiple words like “East Hartford” and “West 
> Palm Beach”, finds “Chester” in “Chester’s” but also finds “chester” in 
> “Colchester”.
> 
> At this point, I’ve been using different methods for single word towns vs 
> multi-word towns and while generally effective, trying to accommodate for 
> these and other oddities has made it a complete mess of code.
> 
> If someone has done something similar, or can point me in the right 
> direction, it would be greatly appreciated.
> 
> TIA,
> 
> Steve MacLean
> 


___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode

Searching for a word when it's more than one word

2018-08-31 Thread Stephen MacLean via use-livecode
Hi All,

First, followed Keith Clarke’s thread and got a lot out of it, thank you all. 
That’s gone into my code snippets!

Now I know, the title is not technically true, if it’s 2 words, they are 
distinct and different. Maybe it’s because I’ve been banging my head against 
this and some other things too long and need to step back, but I’m having 
issues getting this all to work reliably.

I’m searching for town names in various text from a list of towns. Most names 
are one word, easy to find and count. Some names are 2 or 3 words, like East 
Hartford or West Palm Beach. Those go against distinct towns like Hartford and 
Palm Beach. Others have their names inside of other town names like Colchester 
and Chester.

“is among the words of” or “is among the trueWords of” works great to find 
single words, but only works on single words and doesn’t consider “Chester’s” 
to be “Chester” (strictly speaking, it isn’t).

“is in” works great for finding multiple words like “East Hartford” and “West 
Palm Beach”, finds “Chester” in “Chester’s” but also finds “chester” in 
“Colchester”.
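
The difference between the two tests can be seen side by side with a 
short sketch like the one below. The handler name compareMatching and the 
sample sentence are made up for illustration; the results noted in the 
comments are the ones to expect from the behaviour described above.

```livecode
-- Hypothetical demonstration handler; the sample sentence is invented
on compareMatching
   local tText, tReport
   put "Colchester's fair drew crowds from East Hartford." into tText

   -- Whole-word test: the possessive form is a different word (expect false)
   put "Colchester as a word:" && ("Colchester" is among the trueWords of tText) \
         & return after tReport

   -- Whole-word test cannot match a two-word name (expect false)
   put "East Hartford as a word:" && ("East Hartford" is among the trueWords of tText) \
         & return after tReport

   -- Substring test does find the two-word name (expect true)
   put "East Hartford as a substring:" && ("East Hartford" is in tText) \
         & return after tReport

   -- ...but it also finds "Chester" inside "Colchester's" (expect true)
   put "Chester as a substring:" && ("Chester" is in tText) \
         & return after tReport

   put tReport
end compareMatching
```

The parentheses matter here: & and && bind more tightly than "is in" and 
"is among", so without them the concatenation would be evaluated first.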

At this point, I’ve been using different methods for single word towns vs 
multi-word towns and while generally effective, trying to accommodate for these 
and other oddities has made it a complete mess of code.

If someone has done something similar, or can point me in the right direction, 
it would be greatly appreciated.

TIA,

Steve MacLean




