Thanks for the pointer to Google Refine. That looks like exactly what I want. However, it isn't clear to me how to actually program against it, since I don't see a published API. Even more troubling is its distributability: memory footprint is clearly going to be an issue, and I don't know how easily Refine's capabilities can be spread across machines. I would also have to be careful to choose a distributable algorithm rather than one like Levenshtein, whose whole-string, pairwise comparisons run counter to a distributed model. Any comments on these matters are appreciated.
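To make the Levenshtein point concrete, here is a rough, purely illustrative sketch (plain Java, my own naming, not taken from Refine or Mahout) of the kind of algorithm I mean by "distributable": each term's character n-gram signature can be computed independently in a map step, and terms can be grouped by shared n-grams before any pairwise comparison ever happens, whereas Levenshtein only exists as a whole-string, per-pair computation.

    import java.util.HashSet;
    import java.util.Set;

    public class NgramSimilarity {

        /** Character trigrams of a term; computable independently per record. */
        public static Set<String> trigrams(String term) {
            Set<String> grams = new HashSet<String>();
            String s = term.toLowerCase();
            for (int i = 0; i + 3 <= s.length(); i++) {
                grams.add(s.substring(i, i + 3));
            }
            return grams;
        }

        /** Jaccard similarity of two precomputed trigram sets. */
        public static double jaccard(Set<String> a, Set<String> b) {
            if (a.isEmpty() && b.isEmpty()) {
                return 0.0;
            }
            Set<String> intersection = new HashSet<String>(a);
            intersection.retainAll(b);
            Set<String> union = new HashSet<String>(a);
            union.addAll(b);
            return (double) intersection.size() / union.size();
        }
    }

In a MapReduce setting, emitting (trigram, term) pairs from the first step would let a reducer group candidate matches by shared trigrams, so the expensive comparison is never run over every pair of terms.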
If I were to pursue the Mahout approach, is it possible to create Vectors from the words and phrases? (A rough sketch of what I have in mind is below the quoted thread.)

Thanks.

On 12/2/11 6:36 AM, "Pascal Coupet" <[email protected]> wrote:

>Hi Neil,
>
>I suggest you start by clustering on lexical affinities (based on how
>words look). From your examples, that seems to be what you are after. To
>cluster terms this way you don't really need the full data: you can
>remove all duplicates and hopefully end up with a much smaller set.
>
>A good way to describe terms for this purpose is n-grams. You can also
>use phonetic transcriptions of the terms. An interesting trick that works
>well is to add a special character at the beginning of each word (in the
>n-gram method). This boosts similarity on the beginnings of words, which
>is usually what you want.
>
>I suggest you have a look at Google
>Refine <http://code.google.com/p/google-refine/>.
>Watch the first video. It demonstrates nice term-clustering capabilities
>using different methods (n-grams, ...). If it is what you are looking
>for, you can try it on the most frequent terms in your dataset, get
>interesting results quickly, and then implement whichever approach looks
>best for you.
>
>Best,
>
>Pascal
>
>
>2011/12/2 Neil Chaudhuri <[email protected]>
>
>> Glad to fill in more detail. Imagine I have a list of words and phrases
>> in a data store like this:
>>
>> Alabama
>> Obama
>> University of Alabama
>> Bama
>> Potomac
>> Texas
>> Potomac River
>>
>> I would like to cluster the ones that look similar enough to be the
>> same, like "Alabama," "University of Alabama," and "Bama" (but ideally
>> not "Obama"), or "Potomac" and "Potomac River."
>>
>> Now this list of words could be in the terabytes range, which is why I
>> need distributed computing capability.
>>
>> How would I assemble a Vector from an individual entry in this list?
>> With a bit more understanding of my situation, do you think Mahout can
>> work for me?
>>
>> Please let me know if I can provide more information.
>>
>> Thanks.
>>
>>
>> On Dec 1, 2011, at 11:29 PM, Jeff Eastman wrote:
>>
>> > Could you elaborate a bit on what you mean by "cluster a collection of
>> > words and phrases by syntactic similarity over a distributed
>> > environment"? If you can describe your collection in terms of a set of
>> > (sparse or dense) term vectors, then you should be able to use Mahout
>> > clustering directly. The vectors do not need to be huge (as "document"
>> > might imply); indeed, smaller-dimensionality clusterings work better
>> > than large ones. One question would be how you plan to encode these
>> > vectors. Another would be how large a collection you have.
>> >
>> > On 12/1/11 8:08 PM, Neil Chaudhuri wrote:
>> >> I have a need to cluster a collection of words and phrases by
>> >> syntactic similarity over a distributed environment, and I came upon
>> >> Mahout as a possible solution. After studying the documentation,
>> >> though, I find all of it tailored to working with entire documents
>> >> rather than words and phrases. I simply want to know whether you
>> >> believe Mahout is the right tool for this job. I suppose I could try
>> >> to view each word and phrase as an individual tiny document, but that
>> >> feels like I am forcing it.
>> >>
>> >> Any insight is appreciated.
>> >>
>> >> Thanks.
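As promised above, here is the rough sketch of what I mean by creating Vectors from words and phrases. It is only a guess at an approach, not tested against any particular Mahout release: hashed character trigrams (with Pascal's leading-marker trick) dropped into a RandomAccessSparseVector. The class name, cardinality, and trigram size are all mine.

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class TermVectorizer {

        // Size of the hashed feature space; a guess, not tuned.
        private static final int CARDINALITY = 1 << 16;

        /**
         * Encode a single word or phrase as a sparse vector of hashed
         * character trigrams. Each word is prefixed with '^' so trigrams
         * anchored at word beginnings get extra weight, per Pascal's trick.
         */
        public static Vector encode(String term) {
            Vector v = new RandomAccessSparseVector(CARDINALITY);
            String marked = "^" + term.toLowerCase().replaceAll("\\s+", " ^");
            for (int i = 0; i + 3 <= marked.length(); i++) {
                String gram = marked.substring(i, i + 3);
                int index = (gram.hashCode() & Integer.MAX_VALUE) % CARDINALITY;
                v.set(index, v.get(index) + 1.0); // simple term-frequency weight
            }
            return v;
        }
    }

If something like this is sound, I assume I would wrap each vector in a NamedVector keyed by the original term and write the lot to a SequenceFile as input to canopy or k-means clustering. Is that the right way to feed words and phrases, rather than documents, into Mahout?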
