Re: Searching for a word when it's more than one word
My family was stranded for a while during a transfer at Frankfurt airport, while a computer system refused to accept that 'Glasgow' was not a destination (at least, in that instance).

Having said that, the same error is much more commonly made by taxi drivers, who can't avoid showing great disappointment when I am just going to the local station.

Cheers,

David Glasgow

> On 1 Sep 2018, at 5:57 pm, Richmond Mathewson via use-livecode wrote:
>
> Assumptions are sometimes difficult to avoid.

___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode
Re: Searching for a word when it's more than one word
i had this same problem a few weeks ago...luckily it wasn't critical to the featureset, so i didn't find a solution. I will swing back around with the help of this thread. thanks for entertaining the problem.

On Sun, Sep 2, 2018 at 5:09 AM Quentin Long via use-livecode <use-livecode@lists.runrev.com> wrote:

> Have pondered the question, and come up with some code which may or may not solve the problem at hand, but which may at least prove helpful in looking for a real solution:
Re: Searching for a word when it's more than one word
Have pondered the question, and come up with some code which may or may not solve the problem at hand, but which may at least prove helpful in looking for a real solution:

==

Assumption: You've got a text document (not HTML, not RTF, just plain TXT) which contains, among other things, however-many place names.
Assumption: You have a return-list of place names, which may or may not be single words
Assumption: The text document is in the variable SourceDoc
Assumption: The list of place names is in the variable NamesList

Assumption: You want a document which contains a complete census of exactly which of the place-names in NamesList occur in SourceDoc
Assumption: For each place-name which does occur within SourceDoc, you want a list of which word-numbers each such occurrence begins at

   put empty into PlaceNamesCensus
   repeat for each line DisName in NamesList
      put the number of words in DisName into DisNameWords
      put 0 into SearchOffset
      put empty into FoundLocs
      repeat
         -- offset() reports a position relative to the skipped characters
         put offset(DisName, SourceDoc, SearchOffset) into DisLoc
         if DisLoc = 0 then
            -- there is no (further) character string which matches the place name in question
            exit repeat
         end if
         -- there is a character string which matches the place name in question
         -- is it the actual placename, and not finding "chester" in "colchester"?
         put SearchOffset + DisLoc into AbsLoc
         put the number of words in (char 1 to AbsLoc of SourceDoc) into StartWord
         if DisName = (word StartWord to (StartWord + DisNameWords - 1) of SourceDoc) then
            -- it's a match, yay!
            put StartWord into item (1 + the number of items in FoundLocs) of FoundLocs
         end if
         -- carry on searching just past the start of this hit
         put AbsLoc into SearchOffset
      end repeat
      if FoundLocs is empty then
         -- nope, DisName wasn't in SourceDoc
         put "[nil]" into DeseLocs
      else
         -- yay! DisName *was* in SourceDoc! at least once!
         put FoundLocs into DeseLocs
      end if
      put DisName & comma & DeseLocs into line (1 + the number of lines in PlaceNamesCensus) of PlaceNamesCensus
   end repeat

==

Known issue: The above code does not pretend to locate possessive instances of place names (e.g., California's, the United Kingdom's, etc). Am thinking that pre-processing of SourceDoc will be helpful-to-necessary. This pre-processing may need to accommodate more issues than just possessives.

"Bewitched" + "Charlie's Angels" - Charlie = "At Arm's Length"
Read the webcomic at [ http://www.atarmslength.net ]!
If you like "At Arm's Length", support it at [ http://www.patreon.com/DarkwingDude ].
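
One way the pre-processing mentioned under "Known issue" might look is sketched below. This is only an illustration, not something from the thread: the handler name stripPossessives and the choice of terminator characters are assumptions, and numToCodepoint assumes LiveCode 7 or later.

```livecode
-- Hypothetical helper (not from the thread): remove possessive "'s" endings
-- so that "California's" can be compared against the place name "California".
function stripPossessives pDoc
   local tEnd
   -- Treat the curly apostrophe (U+2019) like a straight one
   replace numToCodepoint(8217) with "'" in pDoc
   -- Pad the end so a possessive in the final word is also followed by a terminator
   put space after pDoc
   -- The set of terminators is an assumption; extend it as needed
   repeat for each char tEnd in (space & tab & return & ",.;:!?)")
      replace ("'s" & tEnd) with tEnd in pDoc
   end repeat
   -- Remove the padding space again
   delete the last char of pDoc
   return pDoc
end stripPossessives
```

It could be called as `put stripPossessives(SourceDoc) into SourceDoc` before the census loop. Because only characters inside a word are removed, the word numbers reported by the census should still line up with the original document. The sketch also shortens contractions such as "it's" to "it", which is harmless for matching place names but would matter if the cleaned text were reused for anything else.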
[OT] Up is down (was: Searching for a word when it's more than one word)
On September 1, 2018 6:34:17 PM Mark Wieder via use-livecode wrote:

> On 09/01/2018 02:48 PM, J. Landman Gay via use-livecode wrote:
>> No, it's a little north-east of center.
>
> Wait. What?
> West is north-east of center?

Of course. When you're that far south, everything is north. I assume their center must be somewhat dynamic, perhaps based on where the most cattle are at the moment.

--
Jacqueline Landman Gay | jac...@hyperactivesw.com
HyperActive Software | http://www.hyperactivesw.com
Re: Searching for a word when it's more than one word
On 09/01/2018 02:48 PM, J. Landman Gay via use-livecode wrote:
> No, it's a little north-east of center.

Wait. What?
West is north-east of center?

--
Mark Wieder
ahsoftw...@gmail.com
Re: Searching for a word when it's more than one word
No, it's a little north-east of center.

On 9/1/18 12:02 PM, Richmond Mathewson via use-livecode wrote:
> Is West, Texas in West Texas?
>
> Richmond.

--
Jacqueline Landman Gay | jac...@hyperactivesw.com
HyperActive Software | http://www.hyperactivesw.com
Re: Searching for a word when it's more than one word
On 9/1/18 10:55 AM, Mark Wieder via use-livecode wrote:
> On 09/01/2018 08:39 AM, J. Landman Gay via use-livecode wrote:
>> There is a town in Texas called West, made infamous a few years ago by a giant explosion. I don't think you can make assumptions about names of places.
>
> And thus the distinction between West Texas and West, Texas.

When I first heard it on the news, I thought half of Texas had disappeared. I had mixed feelings when I found out it didn't.

--
Jacqueline Landman Gay | jac...@hyperactivesw.com
HyperActive Software | http://www.hyperactivesw.com
Re: Searching for a word when it's more than one word
East or West, home is a comfy LiveCode stack . . .

Well; here's my third version, which does better than the first 2:

https://www.dropbox.com/s/r3yocmqzwhwu4ta/Text%20analyzer%20X.livecode.zip?dl=0

Richmond.

On 1/9/2018 6:39 pm, J. Landman Gay via use-livecode wrote:
> There is a town in Texas called West, made infamous a few years ago by a giant explosion. I don't think you can make assumptions about names of places.
>
> Mark's suggestion to check for words ending in "s" will fail on many towns, though apostrophe-s may be safe.
Re: Searching for a word when it's more than one word
We're all in a state at the moment with this one.

Richmond.

On 1/9/2018 7:24 pm, Stephen MacLean via use-livecode wrote:
> Thankfully, in my case, I do know what at least the state is:)
>
>> On Sep 1, 2018, at 11:55 AM, Mark Wieder via use-livecode wrote:
>>
>> And thus the distinction between West Texas and West, Texas.
Re: Searching for a word when it's more than one word
Is West, Texas in West Texas?

Richmond.

On 1/9/2018 6:55 pm, Mark Wieder via use-livecode wrote:
> On 09/01/2018 08:39 AM, J. Landman Gay via use-livecode wrote:
>> There is a town in Texas called West, made infamous a few years ago by a giant explosion. I don't think you can make assumptions about names of places.
>
> And thus the distinction between West Texas and West, Texas.
Re: Searching for a word when it's more than one word
That sounds remarkably like two women who are friends of my parents:

One is called "Gay" and the other one is called "Loveday". They were friends at school 60 years ago and when they were both widowed they moved in together; although the son of one of them fell out with his wife and now lives with them as well.

Assumptions are sometimes difficult to avoid.

Although my younger son did actually dislocate his knee jumping to conclusions . . .

This was mainly because he was trying to skip a difficult bit . . .

But I digress.

Richmond.

On 1/9/2018 6:39 pm, J. Landman Gay via use-livecode wrote:
> There is a town in Texas called West, made infamous a few years ago by a giant explosion. I don't think you can make assumptions about names of places.
>
> Mark's suggestion to check for words ending in "s" will fail on many towns, though apostrophe-s may be safe.
Re: Searching for a word when it's more than one word
Thankfully, in my case, I do know what at least the state is :)

> On Sep 1, 2018, at 11:55 AM, Mark Wieder via use-livecode wrote:
>
>> On 09/01/2018 08:39 AM, J. Landman Gay via use-livecode wrote:
>> There is a town in Texas called West, made infamous a few years ago by a giant explosion. I don't think you can make assumptions about names of places.
>
> And thus the distinction between West Texas and West, Texas.
>
> --
> Mark Wieder
> ahsoftw...@gmail.com
Re: Searching for a word when it's more than one word
On 09/01/2018 08:39 AM, J. Landman Gay via use-livecode wrote:
> There is a town in Texas called West, made infamous a few years ago by a giant explosion. I don't think you can make assumptions about names of places.

And thus the distinction between West Texas and West, Texas.

--
Mark Wieder
ahsoftw...@gmail.com
Re: Searching for a word when it's more than one word
There is a town in Texas called West, made infamous a few years ago by a giant explosion. I don't think you can make assumptions about names of places.

Mark's suggestion to check for words ending in "s" will fail on many towns, though apostrophe-s may be safe.

--
Jacqueline Landman Gay | jac...@hyperactivesw.com
HyperActive Software | http://www.hyperactivesw.com

On September 1, 2018 5:49:30 AM Richmond Mathewson via use-livecode wrote:

> I can see that the "problem", which my stack does not address, is with 2 or 3 part place names:
>
> The Rochester/Chester problem is easily dealt with.
>
> While it should be relatively easy to have a subroutine to deal with words such as "West" (after all, there are no places just called "West"), places like a town my parents once lived in called "Haselbury Plucknett" would cause problems.
>
> AND, places such as "Ruyton of the Eleven Towns" (https://en.wikipedia.org/wiki/Ruyton-XI-Towns) would really throw a spanner in the works.
>
> Come to think of things . . .
>
> Unless anyone's code can cope with "Ruyton of the Eleven Towns" it won't stand up: we could even go further and call this the "Ruyton of the Eleven Towns Test".
>
> More muffled background noises.
>
> Richmond.
>
> On 1/9/2018 1:29 pm, Mark Waddingham via use-livecode wrote:
>> On 2018-09-01 12:05, Richmond Mathewson via use-livecode wrote:
>>> Obviously, when considering names of places such as Colchester, Rochester and Chester one has to search for the longer names first and exclude them from later searches.
>>
>> The 'substring' problem (i.e. Chester being 'in' Rochester) isn't relevant in the above algorithm because we are 'tokenising' input and phrases - essentially changing the alphabet. i.e. "Rochester Chester Colchester" is turned into ABC, and we match A, B or C as atomic units.
>>
>> I should perhaps point out that the 'processText' operation probably needs to be a little better in practice - to at least include a 'stop' token for punctuation. For example:
>>
>> "The man walked starting from East Hartford, West Hartford could be seen in the distance."
>>
>> In the case where 'Hartford West' and 'Hartford' are the 'known' towns (and not 'East Hartford') - the proposed tokenization would result in:
>>
>> The,man,walked,starting,from,East,Hartford,West,Hartford,could,be,seen,in,the,distance
>>
>> Which means you'd get "Hartford West" and "Hartford" - when you should only get "Hartford" (assuming you care about the linguistic structure of the text, at least).
>>
>> Indeed, the above actually means in preprocessing the text, you can actually vastly reduce the number of words to search - any sequences of words which aren't in any phrase (or important punctuation) can be replaced by "*" say. So the above would become:
>>
>> *,East,Hartford,*,West,Hartford,*
>>
>> The "*" tokens block matching multi-word phrases.
>>
>> Warmest Regards,
>>
>> Mark.
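
The stop-token idea quoted above can be sketched roughly as follows. This is not Mark's 'processText' (which is not shown in this part of the thread); the handler names tokeniseText and findPhrases, the punctuation set, and the assumption that no place name contains a comma, punctuation or "*" are all made up for illustration.

```livecode
-- Hypothetical helper: reduce pText to a comma-separated token list.
-- Punctuation and words that occur in no phrase collapse into a single "*".
function tokeniseText pText, pPhrases
   local tKnown, tWord, tChar, tToken, tLast, tTokens
   -- Remember every word that appears in at least one phrase
   repeat for each word tWord in pPhrases
      put true into tKnown[tWord]
   end repeat
   -- Punctuation becomes a stop token (this character set is an assumption)
   repeat for each char tChar in ",.;:!?()"
      replace tChar with " * " in pText
   end repeat
   repeat for each word tWord in pText
      if tKnown[tWord] is true then
         put tWord into tToken
      else
         put "*" into tToken
      end if
      -- Collapse runs of "*" into a single stop token
      if tToken is "*" and tLast is "*" then next repeat
      put tToken & comma after tTokens
      put tToken into tLast
   end repeat
   delete the last char of tTokens
   return tTokens
end tokeniseText

-- Hypothetical helper: return the phrases (one per line) found in pText.
function findPhrases pText, pPhrases
   local tTokens, tPhrase, tWord, tNeedle, tFound
   put comma & tokeniseText(pText, pPhrases) & comma into tTokens
   -- Longest phrases first, so a longer name is consumed before a shorter one
   sort lines of pPhrases descending numeric by the number of words in each
   repeat for each line tPhrase in pPhrases
      if the number of words in tPhrase is 0 then next repeat
      -- Build ",word1,word2," so that only whole-token sequences can match
      put empty into tNeedle
      repeat for each word tWord in tPhrase
         put comma & tWord after tNeedle
      end repeat
      put comma after tNeedle
      if tTokens contains tNeedle then
         put tPhrase & return after tFound
         -- Blank out every occurrence so shorter phrases cannot reuse its words
         repeat while tTokens contains tNeedle
            replace tNeedle with ",*," in tTokens
         end repeat
      end if
   end repeat
   delete the last char of tFound
   return tFound
end findPhrases
```

With the phrase list "Hartford West" and "Hartford" and the example sentence, the token list in this sketch comes out as *,Hartford,*,West,Hartford,* ("East" is dropped too, because it appears in no known phrase), so only "Hartford" should be reported. The comma after "East Hartford" turns into a stop token, which keeps "Hartford" and "West" from being read as one name.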
Re: Searching for a word when it's more than one word
Wow, this is awesome, thank you all!!

Sorry, on the road taking my daughter to college, would love to try some of this out.

One thing to keep in mind is that, as I'm checking for names against the town list, I may not know what town I'm actually looking for. Usually I do, but not always. Therefore I've been counting how many of each name I've come across and doing some calculations at the end to make a best guess.

Really appreciate the responses!!

Thank you,

Steve

> On Sep 1, 2018, at 7:53 AM, Richmond Mathewson via use-livecode wrote:
>
>> On 1/9/2018 2:50 pm, Mark Waddingham via use-livecode wrote:
>> There is actually a 'correct' more shovelistic approach (at least I *think* this is correct):
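
The counting step Steve describes might be sketched like this. The "calculations at the end" are not spelled out in the message, so simply picking the most frequent name is an assumption, as are the handler name bestGuessTown and the input format (one matched name per line, repeated once per occurrence).

```livecode
-- Hypothetical helper: pFoundNames holds one matched town name per line,
-- repeated once for every time that name was found in the text.
-- Returns "name,count" for the most frequent name (ties go to whichever
-- name reaches the top count first), or ",0" if nothing was found.
function bestGuessTown pFoundNames
   local tCounts, tName, tBest, tBestCount
   put 0 into tBestCount
   repeat for each line tName in pFoundNames
      if tName is empty then next repeat
      -- An unset array element is treated as 0 by "add"
      add 1 to tCounts[tName]
      if tCounts[tName] > tBestCount then
         put tCounts[tName] into tBestCount
         put tName into tBest
      end if
   end repeat
   return tBest & comma & tBestCount
end bestGuessTown
```

A caller that knows the state, as mentioned elsewhere in the thread, could restrict the town list to that state before matching, so the tally only ever sees plausible candidates.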
Re: Searching for a word when it's more than one word
On 1/9/2018 2:50 pm, Mark Waddingham via use-livecode wrote:
> -- Ensure all punctuation is surrounded by space
> repeat for each char tPuncChar in ",.';:()[]{}<>!@£$%^&*-_+=~`?/\|#€" & quote
>    replace tPuncChar with space & tPuncChar & space in tText
> end repeat

That's a "point" (pun intended), as I just fell foul of a full stop.
___ use-livecode mailing list use-livecode@lists.runrev.com Please visit this url to subscribe, unsubscribe and manage your subscription preferences: http://lists.runrev.com/mailman/listinfo/use-livecode
Re: Searching for a word when it's more than one word
It didn't like this:

on mouseDown
   put empty into fld "zText"
   if fld "xText" contains "Ruyton of the Eleven Towns." then
      put fld "xText" into fld "zText"
      put "Ruyton of the Eleven Towns." into CHUNNK
      put empty into CHUNNK of fld "zText"
   end if
end mouseDown

Richmond.

On 1/9/2018 2:25 pm, Mark Waddingham via use-livecode wrote:
> Your revised approach is fine - as long as the names of all the towns are distinct in terms of no one town's name is contained within another.
> [...]
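A likely reason the handler at the top of this message fails is that `CHUNNK of fld "zText"` is not a chunk expression: CHUNNK is just a variable holding a string. A minimal sketch of one way to get the intended effect, assuming the same field names "xText" and "zText", is to use the `replace` command on a copy of the text:

```livecode
on mouseDown
   local tText
   put empty into fld "zText"
   if fld "xText" contains "Ruyton of the Eleven Towns." then
      -- Work on a copy of the source text held in a variable
      put fld "xText" into tText
      -- replace acts on the container's text, so no chunk expression is needed
      replace "Ruyton of the Eleven Towns." with empty in tText
      put tText into fld "zText"
   end if
end mouseDown
```

This removes every occurrence of the phrase, not only the first, and it leaves behind whatever spaces surrounded the phrase.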
Re: Searching for a word when it's more than one word
On 2018-09-01 13:15, Richmond Mathewson via use-livecode wrote:
> I've already shovelled Ruyton of the Eleven Towns quite effectively:
>
> https://www.dropbox.com/s/n7r7u0c2m9ny3eb/Text%20analyzer%20X.livecode.zip?dl=0
>
> No tokenising, in fact very basic stuff indeed.
>
> Not wishing to bang on about over-complicating things . . . . .

There is actually a 'correct' more shovelistic approach (at least I *think* this is correct):

  -- Ensure all punctuation is surrounded by space
  repeat for each char tPuncChar in ",.';:()[]{}<>!@£$%^&*-_+=~`?/\|#€" & quote
    replace tPuncChar with space & tPuncChar & space in tText
  end repeat

  -- Ensure all whitespace is space
  replace return with space in tText
  replace tab with space in tText

  -- Ensure there are never two spaces next to each other in tText
  repeat while tText contains "  "
    replace "  " with " " in tText
  end repeat

  -- Ensure there is only ever one space between words in phrases
  repeat while tPhrases contains "  "
    replace "  " with " " in tPhrases
  end repeat

  -- We can now use an itemDelimiter of space
  set the itemDelimiter to space

  -- Sort the phrases by descending number of words.
  sort lines of tPhrases descending numeric by the number of items in each

  -- Now check for, and remove each phrase from the source text in turn
  set the wholeMatches to true
  repeat for each line tPhrase in tPhrases
    -- If the phrase is not present then skip to the next
    if itemOffset(tPhrase, tText) is 0 then
      next repeat
    end if

    -- Accumulate the phrase on the output list
    put tPhrase & return after tFoundPhrases

    -- Remove the phrase from the input text (we assume here that * does not appear in any phrase)
    replace tPhrase with "*" in tText
  end repeat

Warmest Regards,

Mark.

P.S. The above will be reasonably quick for small sets of phrases / small source texts - but I think as the size of either increases it will get very slow, very quickly!
--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps
Re: Searching for a word when it's more than one word
On 1/9/2018 2:25 pm, Mark Waddingham via use-livecode wrote:
> Your revised approach is fine - as long as the names of all the towns are distinct in terms of no one town's name is contained within another.

Blast!

Of course "my next trick" is to work out how to delete multi-word names (i.e. phrases) from a textField.

Richmond.

> Add 'Palm Beach West' and 'Palm Beach' to your placeNames list; then modify your source text to end 'or Palm Beach West' - and your algorithm does not perform the requested operation. It reports Palm Beach West *and* Palm Beach as being present - whereas, only 'Palm Beach West' is present :D
Re: Searching for a word when it's more than one word
On 2018-09-01 13:15, Richmond Mathewson via use-livecode wrote:
> I've already shovelled Ruyton of the Eleven Towns quite effectively:
>
> https://www.dropbox.com/s/n7r7u0c2m9ny3eb/Text%20analyzer%20X.livecode.zip?dl=0
>
> No tokenising, in fact very basic stuff indeed.
>
> Not wishing to bang on about over-complicating things . . . . .

Your revised approach is fine - as long as the names of all the towns are distinct, in the sense that no one town's name is contained within another.

Add 'Palm Beach West' and 'Palm Beach' to your placeNames list; then modify your source text to end 'or Palm Beach West' - and your algorithm does not perform the requested operation. It reports Palm Beach West *and* Palm Beach as being present - whereas only 'Palm Beach West' is present :D

Warmest Regards,

Mark.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps
Re: Searching for a word when it's more than one word
I've already shovelled Ruyton of the Eleven Towns quite effectively:

https://www.dropbox.com/s/n7r7u0c2m9ny3eb/Text%20analyzer%20X.livecode.zip?dl=0

No tokenising, in fact very basic stuff indeed.

Not wishing to bang on about over-complicating things . . . . .

Probably time for both Thee and Me to get out and get some fresh air before we ruin our weekends.

Richmond.

On 1/9/2018 2:05 pm, Mark Waddingham via use-livecode wrote:
> I think you'll find my 'silver teaspoon' approach (as you put it) deals with all those cases :D
>
> Interestingly, as I said, the multi-word match problem can be reduced to your 'shovel' - with pre and post processing.
> [...]
Re: Searching for a word when it's more than one word
On 2018-09-01 12:50, Richmond Mathewson via use-livecode wrote:
> Yup: indeed: fairly coarse.
>
> However, see my next posting re "Ruyton of the Eleven Towns" that should make some folk feel that they need a set of sewing needles rather than "just" a silver teaspoon.

I think you'll find my 'silver teaspoon' approach (as you put it) deals with all those cases :D

Interestingly, as I said, the multi-word match problem can be reduced to your 'shovel' - with pre and post processing.

Let's say that the phrase list is:

  Ruyton of the Eleven Towns
  East Hartfordshire
  Colchester
  Chester

First create a mapping from phrase words to individual characters (the choice of character is arbitrary):

  Ruyton        <-> A
  of            <-> B
  the           <-> C
  Eleven        <-> D
  Towns         <-> E
  East          <-> F
  Hartfordshire <-> G
  Colchester    <-> H
  Chester       <-> I

Now iterate through the source text, generating an output source text consisting of letters from the new alphabet, and an 'unknown' letter '*'. For example:

  The man from Ruyton of the Eleven Towns, who is of the order of shovels, travelled from Chester to Colchester via the towns in East Hartfordshire

Would become:

  C**ABCDE**BC*B***I*H*CE*FG

The original phrase list is processed similarly to give:

  ABCDE
  FG
  H
  I

Searching the transformed source text using your algorithm with the list of transformed phrases would give the correct set of found phrases as required by the original problem.

Warmest Regards,

Mark.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps
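The word-to-letter mapping described in this message could be sketched roughly as follows. This is an illustration only, not code from the thread: the handler names buildAlphabet and encodeText are hypothetical, words are lower-cased so that "The" and "the" share a letter, and the sketch assumes few enough distinct phrase words that the assigned characters stay within "A" to "Z".

```livecode
-- Assign one character to each distinct word that occurs in any phrase.
-- pPhrases is assumed to be a return-delimited list of phrases.
function buildAlphabet pPhrases
   local tMapA, tCode, tKey
   put 65 into tCode -- codepoint of "A"
   repeat for each line tPhrase in pPhrases
      repeat for each trueWord tWord in tPhrase
         put toLower(tWord) into tKey
         if tMapA[tKey] is empty then
            put numToCodepoint(tCode) into tMapA[tKey]
            add 1 to tCode
         end if
      end repeat
   end repeat
   return tMapA
end buildAlphabet

-- Rewrite any text (source text or a single phrase) in the new alphabet,
-- using "*" for every word that has no letter assigned.
function encodeText pText, pMapA
   local tEncoded, tKey
   repeat for each trueWord tWord in pText
      put toLower(tWord) into tKey
      if pMapA[tKey] is empty then
         put "*" after tEncoded
      else
         put pMapA[tKey] after tEncoded
      end if
   end repeat
   return tEncoded
end encodeText
```

Encoding each phrase with the same encodeText call gives the transformed phrase list. Because trueWord skips punctuation, this sketch does not emit a separate stop letter for commas or full stops.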
Re: Searching for a word when it's more than one word
Yup: indeed: fairly coarse.

However, see my next posting re "Ruyton of the Eleven Towns" that should make some folk feel that they need a set of sewing needles rather than "just" a silver teaspoon.

Richmond.

On 1/9/2018 1:45 pm, Mark Waddingham via use-livecode wrote:
> Heh, great big shovels are great for coarse work - e.g. for the problem of finding occurrences of SINGLE WORD towns in the source text - as you are in your stack.
> [...]
Re: Searching for a word when it's more than one word
I can see that the "problem", which my stack does not address, is with 2 or 3 part place names:

The Rochester/Chester problem is easily dealt with.

While it should be relatively easy to have a subroutine to deal with words such as "West" (after all, there are no places just called "West"), places like a town my parents once lived in called "Haselbury Plucknett" would cause problems.

AND, places such as "Ruyton of the Eleven Towns" (https://en.wikipedia.org/wiki/Ruyton-XI-Towns) would really throw a spanner in the works.

Come to think of things . . .

Unless anyone's code can cope with "Ruyton of the Eleven Towns" it won't stand up: we could even go further and call this the "Ruyton of the Eleven Towns Test".

More muffled background noises.

Richmond.

On 1/9/2018 1:29 pm, Mark Waddingham via use-livecode wrote:
> The 'substring' problem (i.e. Chester being 'in' Rochester) isn't relevant in the above algorithm because we are 'tokenising' input and phrases - essentially changing the alphabet.
> [...]
Re: Searching for a word when it's more than one word
On 2018-09-01 12:35, Richmond Mathewson via use-livecode wrote:
> That's because you lot tend to use a silver teaspoon while I tend to use a great big shovel:
>
> https://www.dropbox.com/s/00t8oftb1ydm8ni/Text%20analyzer%20X.livecode.zip?dl=0

Heh, great big shovels are great for coarse work - e.g. for the problem of finding occurrences of SINGLE WORD towns in the source text - as you are in your stack.

However, in this case, that wasn't what was asked for - the problem was to find multi-word town names with the constraints that first and longest match always wins with no overlap (i.e. as a human would read them): i.e.

  East Hartford West Palm Beach Colchester Newchester West Chester

With a town list of

  East Hartford
  Hartford West
  West Palm Beach
  Palm Beach
  Chester
  West Chester

Should return:

  East Hartford
  West Palm Beach
  West Chester

Warmest Regards,

Mark.

P.S. The problem is actually exactly the same - in the single-word case your alphabet is the set of characters in the language. In the multi-word case, your alphabet is the set of words in all phrases, with a 'stop' word.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps
Re: Searching for a word when it's more than one word
That's because you lot tend to use a silver teaspoon while I tend to use a great big shovel:

https://www.dropbox.com/s/00t8oftb1ydm8ni/Text%20analyzer%20X.livecode.zip?dl=0

Richmond.

On 1/9/2018 1:29 pm, Mark Waddingham via use-livecode wrote:
> The 'substring' problem (i.e. Chester being 'in' Rochester) isn't relevant in the above algorithm because we are 'tokenising' input and phrases - essentially changing the alphabet.
> [...]
Re: Searching for a word when it's more than one word
On 2018-09-01 12:05, Richmond Mathewson via use-livecode wrote:
> Obviously, when considering names of places such as Colchester, Rochester and Chester one has to search for the longer names first and exclude them from later searches.

The 'substring' problem (i.e. Chester being 'in' Rochester) isn't relevant in the above algorithm because we are 'tokenising' input and phrases - essentially changing the alphabet.

i.e. "Rochester Chester Colchester" is turned into ABC, and we match A, B or C as atomic units.

I should perhaps point out that the 'processText' operation probably needs to be a little better in practice - to at least include a 'stop' token for punctuation. For example:

  "The man walked starting from East Hartford, West Hartford could be seen in the distance."

In the case where 'Hartford West' and 'Hartford' are the 'known' towns (and not 'East Hartford') - the proposed tokenization would result in:

  The,man,walked,starting,from,East,Hartford,West,Hartford,could,be,seen,in,the,distance

Which means you'd get "Hartford West" and "Hartford" - when you should only get "Hartford" (assuming you care about the linguistic structure of the text, at least).

Indeed, this means that in preprocessing the text you can vastly reduce the number of words to search - any sequence of words which aren't in any phrase (or important punctuation) can be replaced by "*", say. So the above would become:

  *,East,Hartford,*,West,Hartford,*

The "*" tokens block matching multi-word phrases.

Warmest Regards,

Mark.

--
Mark Waddingham ~ m...@livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps
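The stop-token idea in this message might be sketched as below. This is a hedged illustration rather than Mark's 'processText': the handler name tokeniseWithStops is hypothetical, the set of punctuation treated as phrase-ending is an assumption, and a word counts as 'known' only if it occurs in some phrase of the list passed in.

```livecode
-- Returns a comma-delimited token list in which every run of unknown words,
-- and every piece of trailing punctuation, is collapsed to a single "*".
function tokeniseWithStops pText, pPhrases
   local tKnownA, tTokens, tWord, tLastWasStop
   -- Collect every word that occurs in at least one phrase
   repeat for each line tPhrase in pPhrases
      repeat for each trueWord tPhraseWord in tPhrase
         put true into tKnownA[toLower(tPhraseWord)]
      end repeat
   end repeat
   put false into tLastWasStop
   repeat for each word tChunk in pText
      -- Strip surrounding punctuation from the whitespace-delimited chunk
      put trueWord 1 of tChunk into tWord
      if (tWord is not empty) and (tKnownA[toLower(tWord)] is true) then
         put tWord & comma after tTokens
         put false into tLastWasStop
      else if not tLastWasStop then
         put "*" & comma after tTokens
         put true into tLastWasStop
      end if
      -- Assumed rule: trailing punctuation also ends any phrase in progress
      if (the last char of tChunk is in ",.;:!?") and (not tLastWasStop) then
         put "*" & comma after tTokens
         put true into tLastWasStop
      end if
   end repeat
   delete the last char of tTokens
   return tTokens
end tokeniseWithStops
```

With the comma after "Hartford" turned into a "*", a phrase such as 'Hartford West' can no longer match across the clause boundary.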
Re: Searching for a word when it's more than one word
Obviously, when considering names of places such as Colchester, Rochester and Chester one has to search for the longer names first and exclude them from later searches.

Richmond.

On 1/9/2018 12:59 pm, Mark Waddingham via use-livecode wrote:
> So the problem you are trying to solve sounds like this:
>
> Given a source text TEXT, and a list of multi-word phrases PHRASES, find the longest elements of PHRASES which occur in TEXT when reading from left to right.
> [...]
Re: Searching for a word when it's more than one word
On 2018-09-01 06:48, Stephen MacLean via use-livecode wrote:
> Hi All,
>
> First, followed Keith Clarke’s thread and got a lot out of it, thank you all. That’s gone into my code snippets!
>
> Now I know, the title is not technically true, if it’s 2 words, they are distinct and different. Maybe it’s because I’ve been banging my head against this and some other things too long and need to step back, but I’m having issues getting this all to work reliably.
>
> I’m searching for town names in various text from a list of towns. Most names are one word, easy to find and count. Some names are 2 or 3 words, like East Hartford or West Palm Beach. Those go against distinct towns like Hartford and Palm Beach. Others have their names inside of other town names like Colchester and Chester.

So the problem you are trying to solve sounds like this:

Given a source text TEXT, and a list of multi-word phrases PHRASES, find the longest elements of PHRASES which occur in TEXT when reading from left to right.

One way to do this is to preprocess the source TEXT and PHRASES, and then iterate over it with back-tracking, attempting to match each phrase in the list.

Preprocessing can be done like this:

  // pText is arbitrary language text, where it is presumed 'trueWord' will extract
  // the words we can match against those in PHRASES
  command preprocessText pText, @rWords
    local tWords
    repeat for each trueWord tWord in pText
      -- normalize word variants - e.g. turn Chester's into Chester
      if tWord ends with "'s" then
        put char 1 to -3 of tWord into tWord
      else if ... then
        ...
      else if ... then
        ...
      end if
      put tWord into tWords[the number of elements in tWords + 1]
    end repeat
    put tWords into rWords
  end preprocessText

This gives a sequence of words, in order - where word variants have been normalized to the 'root' word (the general operation here is called 'stemming' - in your case as you are dealing with fragments of proper nouns - 's / s suffixes are probably good enough).

The processing for PHRASES is needed to ensure that they all follow a consistent form:

  // pPhrases is presumed to be a return-delimited list of phrases
  command preprocessPhrases pPhrases, @rPhrases
    -- We accumulate phrases as the keys of tPhrasesA to eliminate duplicates
    local tPhrasesA
    put empty into tPhrasesA
    repeat for each line tLine in pPhrases
      -- Rebuild the phrase from the trueWords of the line, single-space separated
      local tPhrase
      put empty into tPhrase
      repeat for each trueWord tWord in tLine
        put tWord & space after tPhrase
      end repeat
      delete the last char of tPhrase
      put true into tPhrasesA[tPhrase]
    end repeat
    put the keys of tPhrasesA into rPhrases
  end preprocessPhrases

This produces a return-delimited list of phrases, where the individual words in each phrase are separated by a *single* space with all punctuation stripped, and no phrase appears twice.

With this pre-processing (note the PHRASES pre-processing only needs to be done once for any set of PHRASES to match), a naive search algorithm would be:

  // pText should be a sequence array of words to search (we use an array here because we need fast random access)
  // pPhrases should be a line delimited string-list of multi-word phrases to find
  // rMatches will be a string-list of phrases which have been found
  command searchTextForPhrases pText, pPhrases, @rMatches
    local tMatchesA
    put empty into tMatchesA

    -- Our phrases are single-space delimited, so set the item delimiter
    set the itemDelimiter to space

    -- Loop through pText, by default we bump tIndex by one each time
    -- however, if a match is found, then we can skip the words constituting
    -- the matched phrase.
    local tIndex
    put 1 into tIndex
    repeat until pText[tIndex] is empty
      -- Store the current longest match we have found starting at tIndex
      local tCurrentMatch
      put empty into tCurrentMatch

      -- Check each phrase in turn for a match.
      repeat for each line tPhrase in pPhrases
        -- Assume a match succeeds until it doesn't
        local tPhraseMatched
        put true into tPhraseMatched

        -- Iterate through the items (words) in each phrase, if the sequence of
        -- words in the phrase is not the same as the sequence of words in the text
        -- starting at tIndex, then tPhraseMatched will be false on exit of the loop.
        local tSubIndex
        put tIndex into tSubIndex
        repeat for each item tWord in tPhrase
          -- Failure to match the word at tSubIndex is failure to match the phrase
          if pText[tSubIndex] is not tWord then
            put false into tPhraseMatched
            exit repeat
          end if

          -- The current word of the phrase matches, so move to the next
          add 1 to tSubIndex
        end repeat

        -- We are only interested in the longest match at any point, so only
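The search handler is cut off at this point in the archive. Purely as an illustration of the "first and longest match wins, no overlap" rule it describes, and not as a reconstruction of Mark's code, a self-contained sketch could look like this. The function name longestMatches is hypothetical; pWordsA is assumed to be a 1-based sequence array of words and pPhrases a return-delimited list of single-space-separated phrases.

```livecode
function longestMatches pWordsA, pPhrases
   local tIndex, tFound, tBest, tBestLen, tLen, tMatched, tSub
   set the itemDelimiter to space
   put 1 into tIndex
   repeat until pWordsA[tIndex] is empty
      -- Longest phrase found so far that starts at tIndex
      put empty into tBest
      put 0 into tBestLen
      repeat for each line tPhrase in pPhrases
         put the number of items in tPhrase into tLen
         -- A phrase no longer than the current best cannot improve on it
         if tLen <= tBestLen then next repeat
         put true into tMatched
         put tIndex into tSub
         repeat for each item tWord in tPhrase
            if pWordsA[tSub] is not tWord then
               put false into tMatched
               exit repeat
            end if
            add 1 to tSub
         end repeat
         if tMatched then
            put tPhrase into tBest
            put tLen into tBestLen
         end if
      end repeat
      if tBest is empty then
         -- No phrase starts here, so move on by one word
         add 1 to tIndex
      else
         -- Record the match and skip the words it consumed
         put tBest & return after tFound
         add tBestLen to tIndex
      end if
   end repeat
   delete the last char of tFound
   return tFound
end longestMatches
```

Unlike the handler above, which collects matches in the tMatchesA array, this sketch returns one line per occurrence, so a town mentioned twice is listed twice. It also rescans the whole phrase list at every word position, so it is only suitable for small inputs.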
Re: Searching for a word when it's more than one word
Very interesting Steve, your use case is actually very close to what I’m trying to achieve, which is to identify keywords and phrases within a corpus of text - think prioritised ‘tag cloud’ metadata.

My original plan (as a non-programmer) was to identify the most popular unique words within the corpus and then go back in to find the words either side and check their popularity, etc. However, from what I’ve learned here, my current pseudo-logic is:

1. Parse the whole source into 1, 2, 3 and 4 trueWord chunks (ideally in one pass but I’m still struggling with my array learning curve, so probably via lists & fields so I can see my workings)
2. Remove lines containing noise words and any punctuation that would, by definition, terminate the keyword/phrase
3. Count & deduplicate the remaining lines
4. Sense-check against a ‘current keywords’ list (which appears to resonate with your town names problem?)

From the unique words results I’ve found, I also note issues around singular/plural, synonyms, alternative spelling, etc. - which speak to ‘fuzzy logic’ or dare one mention NLP (as in Natural Language Processing) capabilities.

I wonder if anyone has experimented with LiveCode accessing / using any libraries for this kind of language processing - probably another Pandora’s box containing infinity + 1 cans of worms! :-)

Back to basics, I’ll share my workings as I blunder forward and would welcome any insights the community experts have to offer.

Best,
Keith

> On 1 Sep 2018, at 05:48, Stephen MacLean via use-livecode wrote:
>
> I’m searching for town names in various text from a list of towns. Most names are one word, easy to find and count. Some names are 2 or 3 words, like East Hartford or West Palm Beach. Those go against distinct towns like Hartford and Palm Beach. Others have their names inside of other town names like Colchester and Chester.
> [...]
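Steps 1 and 3 of Keith's pseudo-logic (building 1 to 4 trueWord chunks and counting them) could be sketched in one pass with an array, roughly as follows. The function name countChunks and its parameters are hypothetical, chunks are lower-cased before counting, and steps 2 and 4 (noise words, punctuation boundaries, the keyword list) are deliberately left out.

```livecode
-- Returns an array whose keys are word chunks of 1..pMaxLen words
-- and whose values are the number of times each chunk occurs in pText.
function countChunks pText, pMaxLen
   local tWordsA, tCount, tCountsA, tStart, tLen, tChunk
   -- First collect the words into a sequence array for random access
   put 0 into tCount
   repeat for each trueWord tWord in pText
      add 1 to tCount
      put toLower(tWord) into tWordsA[tCount]
   end repeat
   -- Then grow a chunk one word at a time from every starting position
   repeat with tStart = 1 to tCount
      put empty into tChunk
      repeat with tLen = 1 to pMaxLen
         if tStart + tLen - 1 > tCount then exit repeat
         if tLen > 1 then put space after tChunk
         put tWordsA[tStart + tLen - 1] after tChunk
         add 1 to tCountsA[tChunk]
      end repeat
   end repeat
   return tCountsA
end countChunks
```

Calling `countChunks(tSource, 4)` and then looking at `the keys of` the result gives the deduplicated chunk list. Because trueWord drops punctuation, chunks in this sketch can span a comma or full stop, which step 2 would need to filter out.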
Searching for a word when it's more than one word
Hi All,

First, followed Keith Clarke’s thread and got a lot out of it, thank you all. That’s gone into my code snippets!

Now I know, the title is not technically true, if it’s 2 words, they are distinct and different. Maybe it’s because I’ve been banging my head against this and some other things too long and need to step back, but I’m having issues getting this all to work reliably.

I’m searching for town names in various text from a list of towns. Most names are one word, easy to find and count. Some names are 2 or 3 words, like East Hartford or West Palm Beach. Those go against distinct towns like Hartford and Palm Beach. Others have their names inside of other town names like Colchester and Chester.

“is among the words of” or “is among the trueWords of” works great to find single words, but only works on single words and doesn’t consider “Chester’s” to be “Chester”, which it isn’t.

“is in” works great for finding multiple words like “East Hartford” and “West Palm Beach”, finds “Chester” in “Chester’s” but also finds “chester” in “Colchester”.

At this point, I’ve been using different methods for single word towns vs multi-word towns and while generally effective, trying to accommodate for these and other oddities has made it a complete mess of code.

If someone has done something similar, or can point me in the right direction, it would be greatly appreciated.

TIA,

Steve MacLean
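The two behaviours Steve describes can be seen side by side in a small test handler. This is only a sketch with a made-up sample sentence, using a straight apostrophe and the default case-insensitive comparison:

```livecode
on mouseUp
   local tText
   put "Colchester is a long way from Chester's walls" into tText
   -- Whole-word test: the whitespace-delimited words include "Colchester"
   -- and "Chester's" but never a bare "Chester", so this gives false
   put ("Chester" is among the words of tText) & return into msg
   -- Substring test: "chester" occurs inside "Colchester" (and "Chester's"),
   -- so this gives true
   put ("Chester" is in tText) after msg
end mouseUp
```

The first test is too strict for possessives and the second too loose for embedded names, which is the gap the word-level tokenising in the replies is meant to close.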