Hi Neil,

I suggest you start by clustering on lexical affinities (based on how the
words look); from your examples, that seems to be what you are after. To
cluster terms this way you don't really need the full data set: you can
remove all duplicates first and hopefully end up with a much smaller set of
distinct terms.

A good way to represent terms for this purpose is character n-grams. You
can also use phonetic transcriptions of the terms. One trick that works
well with the n-gram method is to add a special character at the beginning
of each word; this boosts similarity at the beginning of words, which is
usually what you want. A rough sketch of the idea follows below.
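Here is a minimal, self-contained illustration of that trick, in plain Java
rather than anything Mahout-specific; the class name, the '^' marker, and
the choice of trigrams with Jaccard similarity are just my own assumptions
for the sketch:

    import java.util.HashSet;
    import java.util.Set;

    public class NGramSimilarity {

        // Collect character trigrams, prepending '^' to each word so that
        // n-grams anchored at word beginnings ("^al", ...) also have to
        // match, boosting similarity on shared prefixes.
        static Set<String> trigrams(String term) {
            Set<String> grams = new HashSet<>();
            for (String word : term.toLowerCase().split("\\s+")) {
                String marked = "^" + word;
                for (int i = 0; i + 3 <= marked.length(); i++) {
                    grams.add(marked.substring(i, i + 3));
                }
            }
            return grams;
        }

        // Jaccard similarity |A ∩ B| / |A ∪ B| over the two trigram sets.
        static double similarity(String a, String b) {
            Set<String> ga = trigrams(a);
            Set<String> gb = trigrams(b);
            Set<String> inter = new HashSet<>(ga);
            inter.retainAll(gb);
            Set<String> union = new HashSet<>(ga);
            union.addAll(gb);
            return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
        }

        public static void main(String[] args) {
            System.out.println(similarity("Alabama", "University of Alabama"));
            System.out.println(similarity("Alabama", "Obama"));
            System.out.println(similarity("Potomac", "Potomac River"));
        }
    }

A pairwise score like this, computed over the deduplicated terms, is
already enough to merge every pair above a chosen threshold into a cluster.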
I also suggest you have a look at Google Refine<http://code.google.com/p/google-refine/>
and watch the first video; it demonstrates nice term-clustering
capabilities using several methods (n-grams among them). If it does what
you are looking for, you can try it on the most frequent terms in your
dataset, quickly get interesting results, and then implement whichever
approach looks best to you.

Best,
Pascal

2011/12/2 Neil Chaudhuri <[email protected]>

> Glad to fill in more detail. Imagine I have a list of words and phrases
> in a data store like this:
>
> Alabama
> Obama
> University of Alabama
> Bama
> Potomac
> Texas
> Potomac River
>
> I would like to cluster the ones that look similar enough to be the
> same, like "Alabama", "University of Alabama", and "Bama" (but ideally
> not "Obama"), or "Potomac" and "Potomac River."
>
> Now this list of words could be in the terabytes range, which is why I
> need distributed computing capability.
>
> How would I assemble a Vector from an individual entry in this list?
> With a bit more understanding of my situation, do you think Mahout can
> work for me?
>
> Please let me know if I can provide more information.
>
> Thanks.
>
> On Dec 1, 2011, at 11:29 PM, Jeff Eastman wrote:
>
> > Could you elaborate a bit on what you mean by "cluster a collection
> > of words and phrases by syntactic similarity over a distributed
> > environment"? If you can describe your collection in terms of a set
> > of (sparse or dense) term vectors, then you should be able to use
> > Mahout clustering directly. The vectors do not need to be huge (as
> > "document" might imply); indeed, smaller-dimensionality clusterings
> > work better than large ones. One question is how you plan to encode
> > these vectors; another is how large a collection you have.
> >
> > On 12/1/11 8:08 PM, Neil Chaudhuri wrote:
> >> I have a need to cluster a collection of words and phrases by
> >> syntactic similarity over a distributed environment, and I came upon
> >> Mahout as a possible solution. After studying the documentation,
> >> though, I find all of it tailored to working with entire documents
> >> rather than words and phrases. I simply want to know whether you
> >> believe Mahout is the right tool for this job. I suppose I could
> >> treat each word and phrase as an individual tiny document, but that
> >> feels like forcing it.
> >>
> >> Any insight is appreciated.
> >>
> >> Thanks.
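P.S. On the "how would I assemble a Vector" question in the quoted thread:
one common answer is the hashing trick, mapping each term's character
n-grams into a fixed-width vector. The sketch below is plain Java with
made-up names and a made-up dimension; with Mahout you would fill a
RandomAccessSparseVector the same way:

    public class TermVectorizer {

        static final int DIM = 10_000; // hypothetical fixed vector width

        // Encode a term by hashing each boundary-marked character trigram
        // into one of DIM buckets (the "hashing trick").
        static double[] encode(String term) {
            double[] v = new double[DIM];
            for (String word : term.toLowerCase().split("\\s+")) {
                String marked = "^" + word;
                for (int i = 0; i + 3 <= marked.length(); i++) {
                    int bucket = Math.floorMod(
                            marked.substring(i, i + 3).hashCode(), DIM);
                    v[bucket] += 1.0; // trigram count, modulo hash collisions
                }
            }
            return v;
        }

        // Cosine similarity between two encoded terms.
        static double cosine(double[] a, double[] b) {
            double dot = 0, na = 0, nb = 0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                na += a[i] * a[i];
                nb += b[i] * b[i];
            }
            return dot / Math.sqrt(na * nb);
        }

        public static void main(String[] args) {
            System.out.println(cosine(encode("Potomac"),
                                      encode("Potomac River")));
        }
    }

Vectors built this way can be fed to Mahout's distance-based clustering:
two terms that share many trigrams end up with a high cosine similarity.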
