[Senseclusters-users] The Senseval-2 data in Demos

ted pedersen Fri, 29 Oct 2004 11:55:03 -0700

There are two files of Senseval-2 data in the Demos directory. This data
can look a bit complex, so I thought I would say a few things about it in
the hopes of making it less mysterious.


This data comes to us compliments of the Senseval exercise, which compares
supervised word sense disambiguation systems. As such it represents a nice
source of data for us to evaluate with: since we have manually assigned
sense tags in this data, we can see how closely our discovered clusters
correspond with those tags. More about Senseval at http://www.senseval.org

The first file, eng-lex-samp.evaluation.xml, is test data, and there are
no answer tags (i.e., manually assigned senses) in this data. Instead, we
have the answers in a file called SenseClusters.key. This corresponds
with the usual practice in supervised learning of having a set of data
where you know the answers but you withhold them from the data. Note
that there are multiple instances for multiple words (e.g., 100
instances of word "xyz", 50 instances of word "abc", etc.). The
boundaries between the different words are marked by tags called "lexelt"
tags.

There is one lexelt tag for each target word. The instances associated
with a particular target word are ultimately treated separately from all
other target words, so if we have data that contains multiple lexelts,
SenseClusters will split those apart into separate pieces, and deal with
each piece one by one.
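
To make the lexelt grouping concrete, here is a minimal Python sketch of
splitting a file into per-word pieces. It is only an illustration, not how
SenseClusters itself does it. The tiny sample data is made up, and the
attribute names (item, id) and the <head> marker are my assumption about
the Senseval-2 layout. It also assumes the input parses as well-formed XML.

```python
import xml.etree.ElementTree as ET

# Made-up miniature sample; the attribute names "item" and "id" are
# assumptions about the Senseval-2 lexical sample layout.
SAMPLE = """<corpus lang="english">
<lexelt item="art.n">
<instance id="art.1"><context>She studied <head>art</head> in Paris.</context></instance>
<instance id="art.2"><context>The <head>art</head> of war.</context></instance>
</lexelt>
<lexelt item="bar.n">
<instance id="bar.1"><context>We met at the <head>bar</head>.</context></instance>
</lexelt>
</corpus>"""


def split_by_lexelt(xml_text):
    """Return a dict mapping each target word to its list of instance ids."""
    root = ET.fromstring(xml_text)
    groups = {}
    for lexelt in root.iter("lexelt"):
        # Each lexelt element holds all instances of one target word.
        groups[lexelt.get("item")] = [
            inst.get("id") for inst in lexelt.iter("instance")
        ]
    return groups


groups = split_by_lexelt(SAMPLE)
for word, ids in groups.items():
    print(word, len(ids), ids)
```

Each key in the resulting dictionary corresponds to one piece that would
then be processed on its own.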

The second file, eng-lex-samp.training.xml, is training data, and each
instance has one or more answer tags associated with it. These instances
have been manually assigned senses, and in some cases there might be more
than one correct answer. However, normally during SenseClusters processing
we will remove all but the answer that is most frequent in the training
data. (This is done by the program setup.pl.)
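
As a rough illustration of that filtering step, here is a short Python
sketch. It is not the code in setup.pl. The instance ids and sense names
are made up, and the choices to count frequencies over all instances of
one word and to break ties by keeping the tag listed first are my
assumptions.

```python
from collections import Counter


def keep_most_frequent(instances):
    """Keep one sense tag per instance: the most frequent one overall.

    instances maps an instance id to its list of sense tags.
    """
    # Count how often each sense tag occurs across all instances.
    counts = Counter(tag for tags in instances.values() for tag in tags)
    # max() returns the first maximal item, so ties keep the tag listed first.
    return {
        iid: max(tags, key=lambda t: counts[t])
        for iid, tags in instances.items()
    }


# Hypothetical training instances for a single target word.
training = {
    "art.1": ["sense_a"],
    "art.2": ["sense_a", "sense_b"],
    "art.3": ["sense_b", "sense_c"],
    "art.4": ["sense_b"],
}
filtered = keep_most_frequent(training)
print(filtered)
```

Here sense_b occurs three times, sense_a twice, and sense_c once, so the
two instances with multiple tags both end up with sense_b.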

In any case, the demo scripts use both of these data sources and give a
very good idea of how SenseClusters can utilize both of these kinds of
data.

Finally, always remember that the only way we use these manually assigned
sense tags is for evaluation purposes; we never actually use them in
clustering. SenseClusters can and does deal with data where there are no
manually assigned categories. It can do exactly the same feature selection
and clustering; we just can't do the evaluation relative to a manually
created gold standard.

Let us know if there are any additional questions or puzzles about this
data, or anything else!


--
Ted Pedersen
http://www.d.umn.edu/~tpederse


