There are two files of Senseval-2 data in the Demos directory. This data can look a bit complex, so I thought I would say a few things about it in the hopes of making it less mysterious.
This data comes to us compliments of the Senseval exercise, which compares supervised word sense disambiguation systems. As such it represents a nice source of data for us to evaluate with: since this data has manually assigned sense tags, we can see how closely our discovered clusters correspond with those tags. More about Senseval is at http://www.senseval.org
The first file, eng-lex-samp.evaluation.xml, is test data, and there are no answer tags (i.e., manually assigned senses) in this data. Instead, we have the answers in a file called SenseClusters.key. This follows the usual practice in supervised learning of having a set of data where you know the answers but withhold them from the data itself.
Note that there are multiple instances for multiple words (e.g., 100 instances of word "xyz", 50 instances of word "abc", etc.). The boundaries between the different words are marked by tags called "lexelt" tags. There is one lexelt tag for each target word. The instances associated with a particular target word are ultimately treated separately from those of all other target words, so if we have data that contains multiple lexelts, SenseClusters will split those apart into separate pieces and deal with each piece one by one.
The second file, eng-lex-samp.training.xml, is training data, and each instance has one or more answer tags associated with it. These instances have been manually assigned senses, and in some cases there might be more than one correct answer. However, normally during SenseClusters processing we will remove all but the answer that is most frequent in the training data. (This is done by the program setup.pl.)
In any case, the demo scripts use both of these data sources and give a very good idea of how SenseClusters can utilize both of these kinds of data. Finally, always remember that we use these manually assigned sense tags for evaluation purposes only; we never actually use them in clustering.
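To make the lexelt splitting and the answer filtering described above a bit more concrete, here is a minimal Python sketch. It is only an illustration, not how SenseClusters or setup.pl actually implements these steps. The sample data is made up, and the element and attribute names (corpus, instance, item, id, senseid) are assumptions about a Senseval-2-like layout, so check them against the real files before relying on this.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny made-up sample in a Senseval-2-like layout. The element and attribute
# names are assumptions for illustration, not copied from the Demos files.
SAMPLE = """
<corpus lang="english">
  <lexelt item="xyz.n">
    <instance id="xyz.1">
      <answer instance="xyz.1" senseid="xyz_s1"/>
      <answer instance="xyz.1" senseid="xyz_s2"/>
      <context>some text with <head>xyz</head> in it</context>
    </instance>
    <instance id="xyz.2">
      <answer instance="xyz.2" senseid="xyz_s1"/>
      <context>more text about <head>xyz</head></context>
    </instance>
  </lexelt>
  <lexelt item="abc.v">
    <instance id="abc.1">
      <answer instance="abc.1" senseid="abc_s1"/>
      <context>they <head>abc</head> often</context>
    </instance>
  </lexelt>
</corpus>
"""


def split_by_lexelt(xml_text):
    """Return {lexelt item: list of <instance> elements}, one entry per target word."""
    root = ET.fromstring(xml_text)
    return {lex.get("item"): lex.findall("instance") for lex in root.iter("lexelt")}


def keep_most_frequent_sense(instances):
    """Return {instance id: one sense}, keeping for each instance the answer
    whose sense is most frequent across all instances of the same word.
    Ties go to the answer listed first (an arbitrary choice in this sketch)."""
    freq = Counter(
        ans.get("senseid") for inst in instances for ans in inst.findall("answer")
    )
    chosen = {}
    for inst in instances:
        senses = [ans.get("senseid") for ans in inst.findall("answer")]
        if senses:  # evaluation-style data with no answer tags is skipped
            chosen[inst.get("id")] = max(senses, key=lambda s: freq[s])
    return chosen


groups = split_by_lexelt(SAMPLE)
for item, instances in groups.items():
    print(item, len(instances), keep_most_frequent_sense(instances))
```

Each target word is handled on its own here, which mirrors the idea that the instances of one lexelt are treated separately from all the others.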
SenseClusters can and does deal with data that has no manually assigned categories. It does exactly the same feature selection and clustering; we just can't evaluate the results relative to a manually created gold standard.
Let us know if there are any additional questions or puzzles about this data, or anything else!
--
Ted Pedersen
http://www.d.umn.edu/~tpederse
