Could you elaborate a bit on what you mean by "cluster a collection of
words and phrases by syntactic similarity over a distributed
environment"? If you can describe your collection as a set of (sparse or
dense) term vectors, then you should be able to use Mahout clustering
directly. The vectors do not need to be huge (as "document" might
imply); indeed, lower-dimensional clusterings work better than
higher-dimensional ones. The open questions are how you plan to encode
these vectors and how large your collection is.
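
To make the encoding question concrete, here is a minimal sketch of one
possible encoding. It assumes that "syntactic similarity" means similarity
of surface form, which may not be what you intend. Each word or phrase is
broken into character trigrams, and the trigram counts are hashed into a
small sparse vector. The function names, the trigram choice, and the
dimension of 1024 are all illustrative assumptions, not anything Mahout
prescribes, and the resulting vectors would still need to be converted
into Mahout's vector format before clustering.

```python
import zlib
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of text, padded with '#' at both ends."""
    padded = "#" * (n - 1) + text.lower() + "#" * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def encode(text, dim=1024, n=3):
    """Hash n-gram counts into a sparse vector {index: count} with dim slots.

    zlib.crc32 is used because it is deterministic across runs and machines,
    which matters if the encoding is computed on several nodes.
    """
    vec = Counter()
    for gram in char_ngrams(text, n):
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1
    return dict(vec)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(index, 0) for index, value in a.items())
    norm_a = sum(value * value for value in a.values()) ** 0.5
    norm_b = sum(value * value for value in b.values()) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    base = encode("clustering")
    for other in ("clusterings", "cluster", "mahout"):
        print(other, round(cosine(base, encode(other)), 3))
```

With an encoding like this, a phrase is not treated as a tiny document at
all; it is simply a short, low-dimensional vector, which is the kind of
input the clustering works well on.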

On 12/1/11 8:08 PM, Neil Chaudhuri wrote:
I have a need to cluster a collection of words and phrases by syntactic
similarity over a distributed environment, and I came upon Mahout as a possible
solution. After studying the documentation though, I am finding all of it
tailored to working with entire documents rather than words and phrases. I
simply want to know if you believe that Mahout is the right tool for this job.
I suppose I could try to view each word and phrase as individual tiny
documents, but that feels like I am forcing it.
Any insight is appreciated.
Thanks.