Greetings all, I recently participated in the Sense Induction task of SemEval-2 (http://www.cs.york.ac.uk/semeval2010_WSI/index.html), and found it to be a very interesting and worthwhile experience.
The final camera-ready version of the paper that describes that experience is now available here: http://www.d.umn.edu/~tpederse/Pubs/pedersen-semeval2-2010.pdf

Duluth-WSI: SenseClusters Applied to the Sense Induction Task of SemEval-2 (Pedersen) - To appear in the Proceedings of the SemEval 2010 Workshop: the 5th International Workshop on Semantic Evaluations, July 15-16, 2010, Uppsala, Sweden

In the end, it turns out that much of this paper is really more about the evaluation methods of the task than about my participating system, although I do give some details of what I attempted in my systems (all of which is available fairly directly from SenseClusters: http://senseclusters.sourceforge.net).

In any case, I do have some concerns about how we do unsupervised evaluations, which I've tried to lay out in this paper, and I continue to think (although it's not explicitly stated in this paper) that the F-score we have been using for evaluation in SenseClusters is pretty reliable.

I think it is necessary (but not sufficient) that an evaluation measure for unsupervised sense induction (or discrimination, as we tend to call it) do the following:

1) Not be fooled by random baselines. A random system should get a painfully low score. :)

2) Reward systems that predict the correct number of senses (relative to the gold standard), and penalize those that get the number of clusters wrong, with increasing severity as the actual and predicted number of senses differ.

Interestingly enough, some of the evaluation measures in this task failed one or both of these conditions, which is part of what prompted the focus of this particular paper. The paired F-score that was used in the SemEval-2 task is fairly similar to the SenseClusters F-score, and I think both of these meet the above conditions reasonably well.
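For anyone who hasn't seen a paired F-score before, the general idea is to compare the set of instance pairs that the system clusters together with the set of pairs the gold standard puts in the same sense. Here is a minimal sketch of that idea in Python (my own illustration, not the official SemEval-2 scorer, and the dictionary representation of labelings is just an assumption for the example):

```python
from itertools import combinations

def same_label_pairs(labeling):
    """All unordered pairs of instances that share a label."""
    return {frozenset(pair)
            for pair in combinations(sorted(labeling), 2)
            if labeling[pair[0]] == labeling[pair[1]]}

def paired_fscore(clusters, gold):
    """Paired F-score: harmonic mean of pairwise precision and recall.

    clusters, gold: dicts mapping instance id -> cluster / sense label.
    """
    sys_pairs = same_label_pairs(clusters)
    gold_pairs = same_label_pairs(gold)
    if not sys_pairs or not gold_pairs:
        return 0.0
    hits = len(sys_pairs & gold_pairs)
    if hits == 0:
        return 0.0
    precision = hits / len(sys_pairs)   # fraction of system pairs that are correct
    recall = hits / len(gold_pairs)     # fraction of gold pairs that were found
    return 2 * precision * recall / (precision + recall)
```

Note how this behaves with respect to the number of clusters: a one-cluster baseline gets perfect recall but poor precision (so the score stays modest), while an all-singletons clustering makes no pairs at all and scores zero, and the penalty grows as the predicted and actual numbers of senses drift apart.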
But I'll be doing a more formal and comprehensive comparison between them and other possible evaluation methods in the near future, to try to establish just how well, and maybe to formulate a set of necessary and sufficient conditions that we should try to meet.

Any other thoughts and ideas about how to evaluate unsupervised sense induction systems are of course very welcome.

Enjoy,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
