Have you had a look at 'Taming Text' (by Grant S. Ingersoll, Thomas S. Morton, and Andrew L. Farris)? Some sections in it might be relevant to your issue.
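
On the "tiny documents" idea from your message: here is a rough sketch of what that could look like, assuming "syntactic similarity" means similarity of surface form. It is plain Python rather than Mahout code, and the names `char_ngrams` and `cosine` and the sample phrases are made up for illustration. Each word or phrase becomes a bag of character trigrams, and two entries are compared by cosine similarity.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Turn a word or phrase into a bag of character n-grams (a 'tiny document')."""
    # Pad with spaces so that word beginnings and endings form their own n-grams.
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical sample data, not taken from any real collection.
phrases = ["clustering", "clusters", "word cluster", "distributed", "distribution"]
vectors = {p: char_ngrams(p) for p in phrases}

if __name__ == "__main__":
    for p in phrases[1:]:
        print(f"clustering vs {p}: {cosine(vectors['clustering'], vectors[p]):.2f}")
```

Sparse vectors of this kind are the sort of input a vector-based clusterer expects, so the "tiny document" view may be less forced than it feels. Whether character n-grams are the right features depends on what kind of similarity you are after.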
R

________________________________
From: Neil Chaudhuri <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Friday, 2 December 2011, 3:08
Subject: Word and Phrase Clustering

I have a need to cluster a collection of words and phrases by syntactic similarity over a distributed environment, and I came upon Mahout as a possible solution. After studying the documentation though, I am finding all of it tailored to working with entire documents rather than words and phrases.

I simply want to know if you believe that Mahout is the right tool for this job. I suppose I could try to view each word and phrase as individual tiny documents, but that feels like I am forcing it.

Any insight is appreciated.

Thanks.
