Hi Daniel,

> My first post to the users list.

Welcome, glad to have you on the list!

> I'm looking for a clearer definition of the wslink and wclink criterion
> functions. The best description I've found is in the CLUTO manual but I
> still am not sure exactly what these functions are doing or how they are
> calculated.

As you know, wslink refers to weighted single link, and wclink to
weighted complete link. I am not sure how the weights are computed,
but I will check and see if I can determine anything beyond what is
described in the CLUTO manual.
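In the meantime, a sketch of the plain (unweighted) single-link and
complete-link ideas that these criteria build on may help. This is only an
illustration under my own assumptions, not CLUTO's actual wslink/wclink
computation, and it does not show how the weights enter:

```python
# Sketch only: unweighted single-link vs. complete-link cluster
# similarity. CLUTO's weighted variants (wslink/wclink) add a
# weighting scheme that is NOT reproduced here.

def single_link(sim, A, B):
    # Cluster similarity = the MOST similar pair across the two clusters.
    return max(sim[a][b] for a in A for b in B)

def complete_link(sim, A, B):
    # Cluster similarity = the LEAST similar pair across the two clusters.
    return min(sim[a][b] for a in A for b in B)

# Toy pairwise similarity matrix for four items (made up for illustration).
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]

A, B = [0, 1], [2, 3]
print(single_link(sim, A, B))    # 0.3
print(complete_link(sim, A, B))  # 0.1
```

An agglomerative clusterer would repeatedly merge the pair of clusters
that scores best under whichever criterion is chosen.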

> The second comment I have is that although I can't remember clearly, and
> I'm not sure my notes will be detailed enough to remind me, I did find I
> got different results from version 0.69 to 0.71. I seem to think the
> problem was something to do with the word counts. In one version, 0.69,
> it seemed to be including punctuation in its word counts, so a word, say
> "wubble" would be counted as different if it appeared at the end of a
> sentence. That is, it was counting "wubble." as a different word to
> "wubble". This might be because of something stupid that I did, but this
> seemed to be what was happening. I then installed 0.71 and that problem
> wasn't occurring anymore. I was testing these results manually on a very
> simple example, because I wanted to understand what was happening under
> the hood of SenseClusters. 0.71 got the results that I calculated.

Your observations and recollection are excellent. In fact, as of 0.71
we included a program called preprocess.pl (Anagha mentioned this in her
note as well) that does tokenization automatically for the user (using
a regex file that we provide, and which allows the user to change or
replace that file if they wish). In earlier versions we were implicitly
assuming that the user would provide the text to be clustered in an
already tokenized form, but we then realized/decided it might be more
convenient if we embedded that functionality within SenseClusters. So,
as of 0.71 your text is tokenized such that alphanumerics are split
on punctuation and white space, whereas before it was all just white space
separated. You have the option to redefine the tokenization any way you
like via Perl regular expressions (and the --token option), which is quite
powerful. I have posted some other notes to this list about tokenization,
so if you search the archive for tokenization you should find them (one
has a subject "tokenization fun" I recall :).
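The "wubble." versus "wubble" effect you saw is exactly what the change
fixes. Here is a small Python sketch of the difference (the regex is my
own illustration, not the actual default regex file SenseClusters ships):

```python
import re

text = "I like wubble. A wubble is nice."

# Pre-0.71 behavior: plain whitespace splitting leaves trailing
# punctuation attached, so "wubble." and "wubble" count as different words.
whitespace_tokens = text.split()

# 0.71-style behavior (illustrative regex, not SenseClusters' actual one):
# alphanumeric runs are split away from punctuation.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens.count("wubble"))  # 1  ("wubble." is a separate token)
print(regex_tokens.count("wubble"))       # 2
```

With whitespace-only splitting the two occurrences of "wubble" fall into
different types, which is the word-count discrepancy you observed in 0.69.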

So yes, my earlier comments were wrong. There was indeed a change going
from 0.69 to 0.71 that would account for some differences in results,
due to the inclusion of automatic preprocessing/tokenization as of 0.71.

Cordially,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users