SenseClusters participated in the recent sense induction task that was
held as a part of Semeval-1/Senseval-4. A few more details on the task
can be found at the URL below, but the basic idea is to take
contexts/instances where a target word has been designated, and
cluster instances of that word to discover senses.

http://nlp.cs.swarthmore.edu/semeval/tasks/task02/description.shtml

I am in the process of preparing a paper that will describe how we
used SenseClusters in this task, but perhaps the most important point
to make is that we used relatively common settings without any
knowledge of the data we were clustering, and came back with
reasonable results. The data used in the task was from the English
lexical sample task of Semeval-1, which consists of 100 words and
27,132 instances. The short summary of our system is that we used
second order context vectors where the bigram features were selected
using PMI (pointwise mutual information). A large window size of 12
was used to identify bigrams, given the relatively small amount of
data available for each word. I did not use SVD, and the number of
clusters was automatically determined via the adapted gap statistic.
The clustering method was direct (k-means). More details will be in
the task paper, which I'll make available when it's finished.
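
To make the feature selection step a little more concrete, here is a
minimal Python sketch of scoring windowed bigrams with PMI. It is an
illustration only, not the code SenseClusters runs. The function
names (windowed_bigrams, pmi_scores) are hypothetical, and the sketch
assumes that a window of size w pairs each token with the w-1 tokens
that follow it, and that PMI is computed as log2 of the observed pair
count over the count expected under independence.

```python
import math
from collections import Counter


def windowed_bigrams(tokens, window=12):
    """Yield ordered pairs (tokens[i], tokens[j]) with i < j < i + window.

    Assumption: a window of size w lets up to w - 2 tokens intervene
    between the two words of a bigram.
    """
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + window]:
            yield (left, right)


def pmi_scores(tokens, window=12):
    """Return {(left, right): PMI} for every windowed bigram in tokens."""
    pair_counts = Counter(windowed_bigrams(tokens, window))
    total = sum(pair_counts.values())

    # Marginal counts: how often each word fills the left or right slot.
    left_counts = Counter()
    right_counts = Counter()
    for (left, right), n in pair_counts.items():
        left_counts[left] += n
        right_counts[right] += n

    # PMI = log2( observed / expected ), expected = left * right / total.
    return {
        (left, right): math.log2(
            n * total / (left_counts[left] * right_counts[right])
        )
        for (left, right), n in pair_counts.items()
    }


if __name__ == "__main__":
    text = "the bank raised the interest rate at the bank".split()
    scores = pmi_scores(text, window=3)
    for pair, score in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
        print(pair, round(score, 3))
```

Bigrams that score above some cutoff would then be kept as features.
The cutoff is not stated here, so the sketch leaves it out.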

One issue that has been very interesting is reflecting upon how
evaluation of unsupervised clustering systems can and should be done.
There were two evaluation methods used in the sense induction task,
and they are different from the built-in evaluation method supported
in SenseClusters. I've been discussing those issues a bit on the
sense induction task mailing list, and will start to relay some of
that information here to this mailing list, since you may well wonder
what your various options for evaluation are, and why one sees such
different results reported for unsupervised clustering of word senses
and related problems.

As you know, SenseClusters provides its own method of evaluation, and
also supports Cluto's built-in evaluations, which include purity and
entropy (these were used as one of the evaluation methods in the sense
induction task). Note that all of these evaluations are based on
comparing to a "gold standard" clustering of the data, as is available
when one is clustering text that has been manually sense tagged (where
the sense tags are ignored during clustering but then used for
evaluation to compare to the discovered clusters).
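
For readers who have not worked with purity and entropy before, the
following Python sketch shows one common way to compute them from a
clustering and a set of gold standard sense tags. It is my own
illustration and does not call Cluto. The function name
purity_and_entropy is hypothetical, and the normalization of entropy
by the log of the number of gold senses is an assumption about the
usual definition, so check the Cluto manual before comparing numbers.

```python
import math
from collections import Counter, defaultdict


def purity_and_entropy(cluster_ids, gold_senses):
    """Return (purity, entropy) of a clustering against gold sense tags.

    purity  = (1/n) * sum over clusters of the largest sense count
    entropy = size-weighted average of each cluster's sense entropy,
              normalized by log(number of gold senses) to lie in [0, 1]
    """
    if len(cluster_ids) != len(gold_senses):
        raise ValueError("cluster_ids and gold_senses must align")

    n = len(cluster_ids)
    num_senses = len(set(gold_senses))

    # Count how many instances of each gold sense land in each cluster.
    by_cluster = defaultdict(Counter)
    for cluster, sense in zip(cluster_ids, gold_senses):
        by_cluster[cluster][sense] += 1

    purity = 0.0
    entropy = 0.0
    for sense_counts in by_cluster.values():
        size = sum(sense_counts.values())
        purity += max(sense_counts.values()) / n
        if num_senses > 1:
            h = -sum(
                (k / size) * math.log(k / size)
                for k in sense_counts.values()
            )
            entropy += (size / n) * h / math.log(num_senses)
    return purity, entropy


if __name__ == "__main__":
    clusters = [0, 0, 0, 1, 1, 1]
    gold = ["money", "money", "river", "river", "river", "river"]
    print(purity_and_entropy(clusters, gold))
```

Higher purity and lower entropy are better. Note that putting every
instance in its own cluster yields perfect purity and zero entropy
under these definitions, which is one reason scores from different
measures should not be compared directly.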

In any case, I will start to forward some of that correspondence and
also add to it just to explain a bit more about how we do evaluation
in SenseClusters, and what other alternatives might exist.

The most important point though is that there really doesn't seem to
be a standardized method for evaluation of unsupervised clustering of
word senses, so before making any comparisons to other results it's
quite important to understand which evaluation measures were used
and how they were defined. That's part of the motivation for
discussing those issues here. One important observation is that the
SenseClusters evaluation method is pretty harsh, and tends to provide
lower scores than most of the other methods I've seen out there. I
don't think that's a problem, unless one starts to compare
SenseClusters scores with results reported using those other
measures, in which case SenseClusters usually fares worse, when in
fact the difference is simply the product of different evaluation
techniques.

Well, enough prelude. :) BTW, the discussion group for the sense
induction task is found here:
http://groups.google.com/group/senseinduction

Ted

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
