As I mentioned in an earlier posting, I attended the EuroLAN 2005 summer school in Cluj-Napoca, Romania, where I presented a 3 hour tutorial on SenseClusters and also conducted a 3 hour practical session. The practical session was great fun and consisted of the "First Transylvanian Bake Off", a friendly competition between about 20 groups of students using SenseClusters on a particular set of data. I will provide a more detailed summary of that in the coming days, since it was quite exciting and interesting. BTW, Cluj-Napoca is the capital of Transylvania, hence the name of the event...
I learned a number of things during the tutorial, perhaps the most important being that we need to give users more information about the various criterion functions that are included in SenseClusters. These are not documented in SenseClusters, with only a reference to the Cluto manual given. The Cluto manual in turn refers to another paper for more detailed information. Sort of a second order relation there I guess. :) In any case, I spent some time at EuroLAN looking at the various criterion functions, and decided that it was time to document those in SenseClusters. I will start by sending some summarizing information to this list to work out any bugs or glitches in the discussion.
In clustering there are two crucial scores to consider. The first is the similarity measure, which is used to score the pairwise similarity or difference between any two contexts. The choices here include the cosine, the Jaccard coefficient, etc. I don't think this is where the problem lies: generally speaking, when using real valued feature vectors you must use the cosine, and that is often the case for our data. When using binary data it is possible to use Jaccard, etc., but these measures are fairly standard and not too difficult to understand. However, we will provide a bit more description and information regarding the similarity measures that you can choose.
The second score, and the big point of confusion, is the criterion function. Criterion functions measure the actual quality of the clustering. Some do this on a local level (how 'tight' is each cluster, without regard to its separation from any other cluster), while others are more global and try to consider both the tightness of clusters and their overall separation from each other. These issues will be discussed in more detail as we go along... In any case, the criterion functions are known as I1, I2, I3, H1, H2, G1, and G2, and I have observed that they appear rather mysterious to the user.
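To make the two kinds of scores a bit more concrete, here is a minimal Python sketch. The cosine and Jaccard functions follow their standard definitions. The tightness and separation functions are only my illustration of the local versus global idea described above; they are hypothetical helpers, not the Cluto definitions of I1, I2, I3, H1, H2, G1, or G2, and none of these function names come from SenseClusters or Cluto.

```python
import math


def cosine(u, v):
    """Cosine similarity between two real valued feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def jaccard(u, v):
    """Jaccard coefficient between two binary feature vectors."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return both / either if either else 0.0


def tightness(cluster, sim=cosine):
    """Illustrative 'local' score: average pairwise similarity in one cluster."""
    pairs = [(i, j)
             for i in range(len(cluster))
             for j in range(i + 1, len(cluster))]
    if not pairs:
        return 1.0  # a single context is trivially tight
    return sum(sim(cluster[i], cluster[j]) for i, j in pairs) / len(pairs)


def separation(cluster_a, cluster_b, sim=cosine):
    """Illustrative cross-cluster score: lower means better separated."""
    total = sum(sim(u, v) for u in cluster_a for v in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))
```

A purely local criterion would look only at something like tightness for each cluster, while a more global one would also reward low values of something like separation between clusters.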
We have recommended I2 as a default, which is reasonable but perhaps not the only or even the best choice. So what I hope to do in the coming days is to summarize what each of these criterion functions offers, and how or when you might like to use them. This information will eventually find its way into the SenseClusters documentation, so your comments are of course welcome.
There were some other interesting comments that I will share as well, but the above seemed to be the most important point and the issue that generated the most curiosity, so I'll pursue that first.
Cordially,
Ted
--
Ted Pedersen
http://www.d.umn.edu/~tpederse
_______________________________________________
senseclusters-users mailing list
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
