Jim,

Sounds like a good project. In case you haven't discovered it yourself, I'll 
just mention one of my all-time favorite scripts. It outputs a frequency list 
of the words in a text file. Something like:

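on mouseUp
   -- assumption: the text to analyse sits in a field named "source"
   put field "source" into fileContent

   -- count the occurrences of each word
   repeat for each word w in fileContent
      add 1 to wordCount[w]
   end repeat

   -- sorted list of the distinct words
   put keys(wordCount) into keyWords
   sort lines of keyWords

   -- build one "word <tab> count" line per word
   repeat for each line l in keyWords
      put l & tab & wordCount[l] & return after displayResult
   end repeat

   put displayResult into field "result"
end mouseUp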

I think it came from Richard Gaskin, or perhaps the Almighty Himself (Scott 
Raney).

Good luck, 
Mike

--- On Tue, 8/14/12, James Hale <ja...@thehales.id.au> wrote:

From: James Hale <ja...@thehales.id.au>
Subject: Re: word counts - what is going on?
To: use-livecode@lists.runrev.com
Date: Tuesday, August 14, 2012, 10:39 PM

Well,

lots of suggestions and attempts at humour. Nice.

The problem with the word chunk boils down to this: quoted text is treated as 
a single word, so I cannot use the word chunk to pick out an individual word 
inside a quoted block, say to hilite it.
Certainly I could replace the quotes with curly quotes etc., but as the source 
text is open (i.e. not within my control) I have no idea whether that would 
cause some unforeseen problem with the text presentation itself.

As mentioned, I decided to process the text by character and fully control 
word boundaries myself. Doing this resulted in the following timings.

0.022096+7.497033 secs for 488872 words

The downside is that I actually ended up with some 500326 'words' in my 
array.
The extra words are the components of quoted strings, plus a number of dotted 
strings that were broken up (e.g. web addresses).
The upshot is that the time penalty was only about 5 secs extra.
A good result, all things considered (this process only needs to take place 
once).
The script provides an array entry with the word itself, its line number within 
the text, the character position from the start of the text, the character 
position from the start of the line as well as the length of the word itself.

On 15/08/2012, at 2:37 AM, Michael Kann <mikek...@yahoo.com> wrote:

> Can you give us the skinny on what you are trying to do? What do you want 
> your output to look like?
> Mike

The purpose: this is an application that will read an ebook (currently epub), 
display it, and allow searching and annotation (with hierarchical tagging) for 
the purpose of studying texts.

This current issue was concerned with enabling boolean and proximity searches 
on the text.

Boolean searches can be done with straight LiveCode scripting without much 
trouble, although once there are 3 or 4 terms the search can slow down a bit. 
However, apart from speed issues, I wanted to display the number of hits for 
each term, as well as the number of hits for the boolean combination, as the 
terms are entered into the search block.
For example:

Search Term      Hits      Hits Boolean
"text"             45
                                     27
"book"            123
 
So this tells me there were 45 hits for "text", 123 hits for "book" and 27 hits 
where "text" and "book" appear within the same paragraph (line).
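
One way those three numbers could be derived in plain LiveCode is sketched 
below. It assumes a word index held as an array whose elements each carry a 
"word" and a "line" key; the handler name countHits and that array shape are 
hypothetical, and treating a boolean hit as "one line containing both terms" is 
only one reading of the figures.

```livecode
-- Hypothetical sketch: per-term hits plus the number of lines holding both.
-- Assumption: each element of pIndex has a "word" key and a "line" key.
function countHits pIndex, pTermA, pTermB
   local tHitsA, tHitsB, tBoth, tLinesA, tLinesB, tEntry, tLine
   put 0 into tHitsA
   put 0 into tHitsB
   put 0 into tBoth
   repeat for each element tEntry in pIndex
      if tEntry["word"] is pTermA then
         add 1 to tHitsA
         put true into tLinesA[tEntry["line"]]
      end if
      if tEntry["word"] is pTermB then
         add 1 to tHitsB
         put true into tLinesB[tEntry["line"]]
      end if
   end repeat
   -- a line is a boolean hit when both terms occur on it
   repeat for each key tLine in tLinesA
      if tLinesB[tLine] is true then add 1 to tBoth
   end repeat
   return tHitsA & comma & tHitsB & comma & tBoth
end countHits
```

For the table above, such a handler would be expected to return three 
comma-separated counts in the order "text", "book", both.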
I am thinking the best way to do this is to use SQL to do joins and counts, 
which I am assuming should be fairly quick (I could be wrong here but I hope 
not).
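
A rough idea of what such a join could look like from LiveCode follows. The 
table name words, its columns word and line_num, and the handler name 
sqlBooleanHits are all assumptions, and pConnID is taken to be a connection id 
already obtained from revOpenDatabase. Whether it is actually quick would need 
to be measured.

```livecode
-- Hypothetical sketch: count the lines on which both terms occur,
-- using a self-join on an assumed table words(word, line_num).
function sqlBooleanHits pConnID, pTermA, pTermB
   local tSQL
   put "SELECT COUNT(DISTINCT a.line_num) FROM words a " & \
         "JOIN words b ON a.line_num = b.line_num " & \
         "WHERE a.word = :1 AND b.word = :2" into tSQL
   -- :1 and :2 are bound to the variables named in the trailing arguments
   return revDataFromQuery(tab, return, pConnID, tSQL, "pTermA", "pTermB")
end sqlBooleanHits
```

The per-term hit counts would be simpler queries of the form SELECT COUNT(*) 
FROM words WHERE word = :1 against the same assumed table.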
The character positions provide both the proximity detail as well as easily 
showing the hits in context, for example:

     …the text was later supplied in book form to anyone th…..
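
Given a stored character position and word length, pulling out that kind of 
snippet is a short chunk expression. The sketch below is only an illustration: 
the handler name hitInContext and the pPad parameter (the number of characters 
of context wanted on each side) are hypothetical.

```livecode
-- Hypothetical sketch: return a hit with pPad characters of context
-- on each side, clipped to the bounds of the text.
function hitInContext pText, pCharPos, pLength, pPad
   local tFrom, tTo
   put max(1, pCharPos - pPad) into tFrom
   put min(the length of pText, pCharPos + pLength - 1 + pPad) into tTo
   return "..." & char tFrom to tTo of pText & "..."
end hitInContext
```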

I also realised that the FTS (full-text search) module in SQLite could do all 
this, but I couldn't guarantee that it would be available, as not all 
installations of SQLite have this module compiled in, and I didn't want to go 
down the road of compiling and supplying it myself. My app is initially for Mac 
but will also be compiled for Windows once I get a working beta. I also plan to 
accept other text formats such as .txt, .html, .rtf and perhaps markdown, but 
it is early days yet.

Thanks again to everyone who has made suggestions.

James

ja...@thehales.id.au

Tel: +61 3 9386 2516    
Fax: +61 3 9386 1387




_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode