Very interesting Steve, your use case is actually very close to what I’m trying 
to achieve, which is to identify keywords and phrases within a corpus of text - 
think prioritised ’tag cloud’ metadata.

My original plan (as a non-programmer) was to identify the most popular unique 
words within the corpus and then go back in to find the words either side and 
check their popularity, etc.

However, from what I’ve learned here, my current pseudo-logic is:

1. Parse the whole source into 1, 2, 3 and 4 trueWord chunks (ideally in one 
pass but I’m still struggling with my array learning curve, so probably via 
lists & fields so I can see my workings)  
2. Remove lines containing noise words and any punctuation that would, by 
definition, terminate the keyword/phrase
3. Count & deduplicate the remaining lines
4. Sense-check against a ‘current keywords’ list (which appears to resonate 
with your town names problem?) 
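
The four steps above can be sketched roughly as follows. This is Python rather than LiveCode, purely to show the logic, and it is a minimal sketch under assumptions: the NOISE_WORDS list, the function names candidate_phrases and rank, and the choice of which punctuation ends a phrase are all hypothetical placeholders, not anything from this thread.

```python
import re
from collections import Counter

# Hypothetical noise-word list; a real one would be much longer.
NOISE_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}

def candidate_phrases(text, max_len=4):
    """Count 1..max_len word phrases that contain no noise words and
    do not run across phrase-ending punctuation."""
    counts = Counter()
    # Step 2 (punctuation): split first, so no chunk can span a break.
    for segment in re.split(r"[.,;:!?()\"\n]+", text.lower()):
        words = re.findall(r"[a-z0-9'’-]+", segment)
        # Step 1: slide windows of 1 to max_len words over the segment.
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                chunk = words[i:i + n]
                # Step 2 (noise words): drop any chunk containing one.
                if any(w in NOISE_WORDS for w in chunk):
                    continue
                # Step 3: the Counter counts and deduplicates in one go.
                counts[" ".join(chunk)] += 1
    return counts

def rank(counts, current_keywords=()):
    """Step 4: list known keywords first, then by descending count."""
    known = {k.lower() for k in current_keywords}
    return sorted(counts.items(),
                  key=lambda kv: (kv[0] not in known, -kv[1], kv[0]))
```

Splitting on punctuation before chunking has the same effect as building every chunk and then deleting the ones that contain punctuation, but it avoids generating lines only to throw them away.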

From the unique words results I’ve found, I also note issues around 
singular/plural, synonyms, alternative spelling, etc. - which speak to ‘fuzzy 
logic’ or, dare one mention it, NLP (as in Natural Language Processing) capabilities. 

I wonder if anyone has experimented with LiveCode accessing / using any 
libraries for this kind of language processing - probably another Pandora’s box 
containing infinity + 1 cans of worms! :-)      

Back to basics, I’ll share my workings as I blunder forward and would welcome 
any insights the community experts have to offer.
Best,
Keith     

> On 1 Sep 2018, at 05:48, Stephen MacLean via use-livecode 
> <use-livecode@lists.runrev.com> wrote:
> 
> Hi All,
> 
> First, followed Keith Clarke’s thread and got a lot out of it, thank you all. 
> That’s gone into my code snippets!
> 
> Now I know, the title is not technically true, if it’s 2 words, they are 
> distinct and different. Maybe it’s because I’ve been banging my head against 
> this and some other things too long and need to step back, but I’m having 
> issues getting this all to work reliably.
> 
> I’m searching for town names in various texts from a list of towns. Most 
> names are one word, easy to find and count. Some names are 2 or 3 words, like 
> East Hartford or West Palm Beach. Those go against distinct towns like 
> Hartford and Palm Beach. Others have their names inside of other town names 
> like Colchester and Chester.
> 
> “is among the words of” or “is among the trueWords of” works great to find 
> single words, but only works on single words and doesn’t consider “Chester’s” 
> to be “Chester”; strictly, it isn't.
> 
> “is in” works great for finding multiple words like “East Hartford” and “West 
> Palm Beach”, finds “Chester” in “Chester’s” but also finds “chester” in 
> “Colchester”.
> 
> At this point, I’ve been using different methods for single word towns vs 
> multi-word towns and while generally effective, trying to accommodate for 
> these and other oddities has made it a complete mess of code.
> 
> If someone has done something similar, or can point me in the right 
> direction, it would be greatly appreciated.
> 
> TIA,
> 
> Steve MacLean
> 
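
On the town-name matching described in the quoted message, one possible approach is a single regular expression that tries longer names first and insists on word boundaries at both ends. The sketch below is Python, not LiveCode, and count_towns is a hypothetical name; it assumes town names in the text are separated by single spaces and not split across line breaks. LiveCode's matchText also accepts regular expressions, so the pattern may carry over, though that is untested here.

```python
import re

def count_towns(text, towns):
    """Count whole-name occurrences of each town, preferring longer names."""
    # Longest names first so "East Hartford" wins over "Hartford".
    ordered = sorted(towns, key=len, reverse=True)
    alternatives = "|".join(re.escape(t) for t in ordered)
    # \b stops "Chester" matching inside "Colchester"; the apostrophe in
    # "Chester's" is a non-word character, so the possessive still matches.
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    canonical = {t.lower(): t for t in towns}
    counts = {t: 0 for t in towns}
    for match in pattern.finditer(text):
        counts[canonical[match.group(0).lower()]] += 1
    return counts
```

Because matches do not overlap, the "Hartford" inside "East Hartford" is consumed by the longer name and is not counted a second time.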

