Hi Daniel,

> My first post to the users list.
Welcome, glad to have you on the list!

> I'm looking for a clearer definition of the wslink and wclink criterion
> functions. The best description I've found is in the CLUTO manual but I
> still am not sure exactly what these functions are doing or how they are
> calculated.

As you know, wslink refers to weighted single link, and wclink refers to weighted complete link. I am not sure how the weights are computed, but I will check and see if I can determine anything beyond what is described in the CLUTO manual.

> The second comment I have is that although I can't remember clearly, and
> I'm not sure my notes will be detailed enough to remind me, I did find I
> got different results from version 0.69 to 0.71. I seem to think the
> problem was something to do with the word counts. In one version, 0.69,
> it seemed to be including punctuation in its word counts, so a word, say
> "wubble", would be counted as different if it appeared at the end of a
> sentence. That is, it was counting "wubble." as a different word to
> "wubble". This might be because of something stupid that I did, but this
> seemed to be what was happening. I then installed 0.71 and that problem
> wasn't occurring anymore. I was testing these results manually on a very
> simple example, because I wanted to understand what was happening under
> the hood of SenseClusters. 0.71 got the results that I calculated.

Your observations and recollection are excellent. In fact, as of 0.71 we include a program called preprocess.pl (Anagha mentioned this in her note as well) that does tokenization automatically for the user, using a regex file that we provide, and which the user can change or replace if they wish. In earlier versions we were implicitly assuming that the user would provide the text to be clustered in an already tokenized form, but we then decided it might be more convenient to embed that functionality within SenseClusters.
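To illustrate the difference Daniel observed, here is a minimal Python sketch (not the actual preprocess.pl code, which is Perl, and not the actual regex file we ship) contrasting white-space-only splitting with alphanumeric tokenization:

```python
import re
from collections import Counter

def whitespace_tokens(text):
    """Pre-0.71 behavior as described: split on white space only,
    so trailing punctuation sticks to a word."""
    return text.split()

def alnum_tokens(text):
    """Sketch of the 0.71-style default: keep runs of alphanumerics and
    treat punctuation and white space as separators. (The real
    preprocess.pl regex file may define tokens differently.)"""
    return re.findall(r"\w+", text)

text = "I saw a wubble. The wubble saw me."

# White-space splitting counts "wubble." and "wubble" as two distinct words:
counts_ws = Counter(whitespace_tokens(text))
print(counts_ws["wubble."], counts_ws["wubble"])  # 1 1

# Alphanumeric tokenization counts both occurrences as "wubble":
counts_al = Counter(alnum_tokens(text))
print(counts_al["wubble"])  # 2
```

This is exactly the kind of discrepancy that would shift word counts, and hence clustering results, between a tokenized and an untokenized input.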
So, as of 0.71 your text is tokenized such that alphanumeric strings are split on punctuation and white space, whereas before it was all just white-space separated. You have the option to redefine the tokenization any way you like via Perl regular expressions (and the --token option), which is quite powerful. I have posted some other notes to this list about tokenization, so if you search the archive for "tokenization" you should find them (one has the subject "tokenization fun", I recall :).

So yes, my earlier comments were wrong. In fact there was a change going from 0.69 to 0.71 that would account for some differences in results, and those are due to the inclusion of automatic preprocessing/tokenization as of 0.71.

Cordially,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
