Here is an ancient article on the subject:
http://www.aclweb.org/anthology-new/J/J92/J92-3004.pdf

You don't need fancy computing capabilities to cluster words based on spelling.
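Pascal's n-gram trick below takes only a few lines to prototype, no distributed machinery required. Here is a minimal sketch in plain Java (the "$" boundary marker, n = 3, and the Jaccard measure are illustrative choices on my part, not anything Mahout prescribes):

    import java.util.HashSet;
    import java.util.Set;

    public class NgramSimilarity {

        // Extract character n-grams, prepending a boundary marker so that
        // word-initial n-grams become distinct features; this effectively
        // gives extra weight to matches at the start of a term.
        static Set<String> ngrams(String term, int n) {
            String s = "$" + term.toLowerCase();
            Set<String> grams = new HashSet<String>();
            for (int i = 0; i + n <= s.length(); i++) {
                grams.add(s.substring(i, i + n));
            }
            return grams;
        }

        // Jaccard similarity: |A intersect B| / |A union B|
        static double jaccard(Set<String> a, Set<String> b) {
            Set<String> inter = new HashSet<String>(a);
            inter.retainAll(b);
            Set<String> union = new HashSet<String>(a);
            union.addAll(b);
            return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
        }

        public static void main(String[] args) {
            // A few of Neil's examples:
            System.out.println(jaccard(ngrams("Alabama", 3), ngrams("Bama", 3)));
            System.out.println(jaccard(ngrams("Alabama", 3), ngrams("Obama", 3)));
            System.out.println(jaccard(ngrams("Potomac", 3), ngrams("Potomac River", 3)));
        }
    }

Run it over the deduplicated term list, link pairs above some similarity threshold, and you already have a workable single-machine baseline to compare any distributed approach against.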
On Fri, Dec 2, 2011 at 3:36 AM, Pascal Coupet <[email protected]> wrote:

> Hi Neil,
>
> I suggest you start by clustering on lexical affinities (based on how
> words look); from your examples, that seems to be what you are after. To
> cluster terms this way you don't really need the full data: you can
> remove all duplicates and hopefully end up with a much smaller set.
>
> A good way to describe terms for this purpose is n-grams. You can also
> use phonetic transcriptions of the terms. An interesting trick that
> works well (in the n-gram method) is to add a special character at the
> beginning of each word. This boosts similarity at the beginnings of
> words, which is usually what you want.
>
> I suggest you have a look at Google Refine
> <http://code.google.com/p/google-refine/>. Watch the first video; it
> demonstrates nice term-clustering capabilities using different methods
> (n-grams, ...). If that is what you are looking for, you can try it on
> the most frequent terms in your dataset to get interesting results
> quickly, and then implement whichever approach looks best for you.
>
> Best,
>
> Pascal
>
> 2011/12/2 Neil Chaudhuri <[email protected]>
>
> > Glad to fill in more detail. Imagine I have a list of words and
> > phrases in a data store like this:
> >
> > Alabama
> > Obama
> > University of Alabama
> > Bama
> > Potomac
> > Texas
> > Potomac River
> >
> > I would like to cluster the ones that look similar enough to be the
> > same, like "Alabama," "University of Alabama," and "Bama" (but ideally
> > not "Obama"), or "Potomac" and "Potomac River."
> >
> > Now, this list of words could be in the terabytes range, which is why
> > I need distributed computing capability.
> >
> > How would I assemble a Vector from an individual entry in this list?
> > With a bit more understanding of my situation, do you think Mahout can
> > work for me?
> >
> > Please let me know if I can provide more information.
> >
> > Thanks.
> >
> > On Dec 1, 2011, at 11:29 PM, Jeff Eastman wrote:
> >
> > > Could you elaborate a bit on what you mean by "cluster a collection
> > > of words and phrases by syntactic similarity over a distributed
> > > environment"? If you can describe your collection in terms of a set
> > > of (sparse or dense) term vectors, then you should be able to use
> > > Mahout clustering directly. The vectors do not need to be huge (as
> > > "document" might imply); indeed, clusterings of smaller
> > > dimensionality work better than large ones. One question would be
> > > how you plan to encode these vectors; another, how large a
> > > collection you have.
> > >
> > > On 12/1/11 8:08 PM, Neil Chaudhuri wrote:
> > >> I have a need to cluster a collection of words and phrases by
> > >> syntactic similarity over a distributed environment, and I came
> > >> upon Mahout as a possible solution. After studying the
> > >> documentation, though, I find all of it tailored to working with
> > >> entire documents rather than words and phrases. I simply want to
> > >> know whether you believe Mahout is the right tool for this job. I
> > >> suppose I could try to view each word and phrase as an individual
> > >> tiny document, but that feels like forcing it.
> > >>
> > >> Any insight is appreciated.
> > >>
> > >> Thanks.
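Re Jeff's encoding question, in case it helps: if you do end up feeding this to Mahout's clustering drivers, one possible encoding (a rough sketch, assuming Mahout's math module is on the classpath; the 2^16 cardinality and the hashCode-based bucketing are arbitrary choices of mine, and hash collisions will conflate some n-grams) is to hash each term's character trigrams into a fixed-width sparse vector:

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class TermVectorizer {

        // Fixed vector width: 65,536 hashed trigram buckets.
        private static final int CARDINALITY = 1 << 16;

        // Map a term to a sparse vector of trigram counts, using the same
        // "$" boundary marker as the sketch above.
        static Vector encode(String term) {
            Vector v = new RandomAccessSparseVector(CARDINALITY);
            String s = "$" + term.toLowerCase();
            for (int i = 0; i + 3 <= s.length(); i++) {
                // Mask off the sign bit so the bucket index is non-negative.
                int bucket = (s.substring(i, i + 3).hashCode() & 0x7fffffff)
                        % CARDINALITY;
                v.set(bucket, v.get(bucket) + 1.0);
            }
            return v;
        }
    }

Vectors built this way can go through the usual Mahout clustering jobs with a Tanimoto or cosine distance measure, which should behave much like the Jaccard comparison above. For the phonetic route Pascal mentions, an encoder such as Commons Codec's DoubleMetaphone could supply the features instead of (or alongside) the n-grams.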
