We recently responded (privately) to a query on the Corpora list about clustering tokens. Looking over Amruta's note, we thought it might be a topic of general interest. Please let us know if you have any comments or questions!
Cordially,
Ted and Amruta

===> Query to Corpora list:

> >Does anyone know of a tool (or algorithm), preferably available freely
> >for research purposes, that takes as its input a corpus only and
> >produces as its output clusters of tokens that occur close to each other
> >relatively often?

===> Amruta responded as follows:

Yes. SenseClusters will do exactly what you describe.

1. The N-gram Statistics Package (http://www.d.umn.edu/~tpederse/nsp.html) creates the list of word pairs that co-occur within some window of each other, along with their association scores. Run the programs count.pl, combig.pl and statistics.pl, in that order. The output of statistics.pl will be the list of word pairs that co-occur within the window and their association scores, as computed by tests such as log-likelihood, mutual information, and chi-squared.

2. Give the output of step 1 to wordvec.pl in the SenseClusters package (http://senseclusters.sourceforge.net/). This program creates a word-by-word association matrix that gives the co-occurrence vector of each word.

3. Cluster these word vectors by giving the output of step 2 to the vcluster program in Cluto (http://www-users.cs.umn.edu/~karypis/cluto/). The result is a set of word clusters, which is what you are after.

Note that I am using the term 'word' where you use 'token'. NSP and SenseClusters have options that let you specify the definition of a token; by default, tokens are words.

All of the above packages are freely available for research purposes. Let us know if you have any further questions.

Thanks,
Amruta
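As a rough illustration of what the three steps above compute, here is a minimal Python sketch. It is not how NSP, SenseClusters or Cluto are implemented, and it does not call them. The function names and the max_distance and threshold parameters are hypothetical. Raw co-occurrence counts stand in for the association scores from statistics.pl, and a simple thresholded single-link grouping on cosine similarity stands in for vcluster.

```python
from collections import Counter
from itertools import combinations
import math


def cooccurrence_counts(tokens, max_distance=2):
    """Step 1 (simplified): count unordered token pairs that occur
    at most `max_distance` positions apart."""
    pairs = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + max_distance]:
            if left != right:
                pairs[tuple(sorted((left, right)))] += 1
    return pairs


def word_vectors(pair_counts):
    """Step 2 (simplified): turn pair counts into one sparse
    co-occurrence vector per word (a word-by-word matrix)."""
    vectors = {}
    for (a, b), n in pair_counts.items():
        vectors.setdefault(a, Counter())[b] += n
        vectors.setdefault(b, Counter())[a] += n
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def cluster(vectors, threshold=0.5):
    """Step 3 (simplified): put two words in the same cluster whenever
    their vectors have cosine similarity >= threshold (single link)."""
    words = sorted(vectors)
    parent = {w: w for w in words}

    def find(w):
        # Follow parent links to the cluster representative.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for a, b in combinations(words, 2):
        if cosine(vectors[a], vectors[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for w in words:
        groups.setdefault(find(w), []).append(w)
    return sorted(groups.values())


if __name__ == "__main__":
    corpus = "the cat sat on the mat and the dog sat on the rug".split()
    vectors = word_vectors(cooccurrence_counts(corpus, max_distance=2))
    for group in cluster(vectors, threshold=0.7):
        print(group)
```

On a real corpus you would want a proper association measure and a proper clustering method, which is what the packages above provide; the sketch only shows how the data flows from pair counts to word vectors to clusters.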
_______________________________________________
senseclusters-users mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users
