We recently responded (privately) to a query on the Corpora list about
clustering tokens. Looking over Amruta's note, we thought it might be
of general interest. Please let us know if you have any comments or
questions!

Cordially,
Ted and Amruta

===> Query to Corpora list:

> >Does anyone know of a tool (or algorithm), preferably available freely
> >for research purposes, that takes as its input a corpus only and
> >produces as its output clusters of tokens that occur close to each other
> >relatively often?

===> Amruta responded as follows:

Yes. SenseClusters will do exactly what you describe.

1. The N-gram Statistics Package (http://www.d.umn.edu/~tpederse/nsp.html)
creates a list of the word pairs that co-occur within some window of
each other, along with their association scores. Run the programs
count.pl, combig.pl and statistics.pl, in that order. The output of
statistics.pl will be the list of word pairs that co-occur within the
window and their association scores, as computed by tests such as
log-likelihood, mutual information and chi-squared.

2. Give the output of step 1 to wordvec.pl in the SenseClusters package
(http://senseclusters.sourceforge.net/). This program will create
a word-by-word association matrix that shows the co-occurrence
vector of each word.

3. Cluster these word vectors by giving the output of step 2 to the
vcluster program in Cluto (http://www-users.cs.umn.edu/~karypis/cluto/).
The result is a set of clusters of words.
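
To make the three steps above more concrete, here is a minimal sketch
in Python of the same idea: count word pairs within a window, turn the
counts into one co-occurrence vector per word, and group words whose
vectors are similar. It is not the NSP, SenseClusters or Cluto code.
The function names, the use of raw counts in place of an association
score such as log-likelihood, the cosine threshold and the simple
single-link grouping are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
import math


def cooccurrence_counts(tokens, window=2):
    """Count unordered token pairs that fall within `window` positions.

    Assumption: window=2 means adjacent tokens only, window=3 allows one
    intervening token, and so on.
    """
    pair_counts = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + window]:
            if left != right:
                pair_counts[frozenset((left, right))] += 1
    return pair_counts


def build_vectors(pair_counts):
    """Turn pair counts into a sparse word-by-word matrix (dict of dicts)."""
    vectors = defaultdict(dict)
    for pair, count in pair_counts.items():
        a, b = tuple(pair)
        vectors[a][b] = count
        vectors[b][a] = count
    return dict(vectors)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0) for word, weight in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def cluster_words(vectors, threshold=0.8):
    """Group words whose vectors have cosine similarity >= threshold.

    Hypothetical stand-in for a real clustering program: a single-link
    grouping implemented with a small union-find structure.
    """
    words = sorted(vectors)
    parent = {w: w for w in words}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for i, a in enumerate(words):
        for b in words[i + 1 :]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = defaultdict(list)
    for w in words:
        groups[find(w)].append(w)
    return sorted(groups.values())


# Toy corpus, invented for this example.
corpus = (
    "red apple red cherry green apple green cherry "
    "big truck big car small truck small car"
).split()
vectors = build_vectors(cooccurrence_counts(corpus, window=2))
clusters = cluster_words(vectors, threshold=0.8)
print(clusters)
```

On this toy corpus the words are grouped by the company they keep
rather than by direct adjacency: 'red' and 'green' land together
because both occur next to 'apple' and 'cherry'. A real run would
replace the raw counts with association scores and the threshold
grouping with a proper clustering method.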

This should be what you want. Note that I am using the term
'word' where you use 'token'. NSP and SenseClusters have options
that let you specify the definition of a token; by default,
tokens are words.
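
As a rough illustration of what changing the token definition means,
the Python sketch below tokenizes the same text under two definitions
written as regular expressions. The patterns and the tokenize function
are hypothetical and chosen for this example only; they are not the
actual option names or defaults of NSP or SenseClusters.

```python
import re

# Hypothetical token definitions, for illustration only.
WORD_TOKEN = re.compile(r"\w+")                    # tokens are words
WORD_OR_PUNCT_TOKEN = re.compile(r"\w+|[.,;:?!]")  # words or single punctuation marks


def tokenize(text, token_pattern=WORD_TOKEN):
    """Return the non-overlapping matches of token_pattern, left to right."""
    return token_pattern.findall(text)


sentence = "Oil prices rose, then fell."
print(tokenize(sentence))
print(tokenize(sentence, WORD_OR_PUNCT_TOKEN))
```

With the second definition, the comma and the full stop become tokens
in their own right, so they would also take part in the co-occurrence
counts.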

All the above packages are freely available for research
purposes.

Let us know if you have any further questions.

Thanks,
Amruta


_______________________________________________
senseclusters-users mailing list
https://lists.sourceforge.net/lists/listinfo/senseclusters-users