Thanks for the pointer to Google Refine. That looks like exactly what I want. However, it isn't clear to me how to actually program against it, since I don't see a published API. Even more troubling is its distributability: memory footprint is clearly going to be an issue, and I don't know how easily Refine's capabilities can be spread across machines. I would also have to be careful to choose a distributable algorithm rather than one like Levenshtein, whose whole-string, pairwise comparisons run counter to a distributed model. Any comments on these matters are appreciated.
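To make the Levenshtein point concrete, here is a rough, purely illustrative sketch (plain Java, my own naming, not taken from Refine or Mahout) of the kind of algorithm I mean by "distributable": each term's character n-gram signature can be computed independently in a map step, and terms can be grouped by shared n-grams before any pairwise comparison ever happens, whereas Levenshtein only exists as a whole-string, per-pair computation.

    import java.util.HashSet;
    import java.util.Set;

    public class NgramSimilarity {

        /** Character trigrams of a term; computable independently per record. */
        public static Set<String> trigrams(String term) {
            Set<String> grams = new HashSet<String>();
            String s = term.toLowerCase();
            for (int i = 0; i + 3 <= s.length(); i++) {
                grams.add(s.substring(i, i + 3));
            }
            return grams;
        }

        /** Jaccard similarity of two precomputed trigram sets. */
        public static double jaccard(Set<String> a, Set<String> b) {
            if (a.isEmpty() && b.isEmpty()) {
                return 0.0;
            }
            Set<String> intersection = new HashSet<String>(a);
            intersection.retainAll(b);
            Set<String> union = new HashSet<String>(a);
            union.addAll(b);
            return (double) intersection.size() / union.size();
        }
    }

In a MapReduce setting, emitting (trigram, term) pairs from the first step would let a reducer group candidate matches by shared trigrams, so the expensive comparison is never run over every pair of terms.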
If I were to pursue the Mahout approach, is it possible to create Vectors from the words and phrases? (A rough sketch of what I have in mind is below the quoted thread.)

Thanks.

On 12/2/11 6:36 AM, "Pascal Coupet" <[email protected]> wrote:

>Hi Neil,
>
>I suggest you start by clustering on lexical affinities (based on how
>words look). From your examples, that seems to be what you are after. To
>cluster terms this way you don't really need the full data: you can
>remove all duplicates and hopefully end up with a much smaller set.
>
>A good way to describe terms for this purpose is n-grams. You can also
>use phonetic transcriptions of the terms. An interesting trick that works
>well is to add a special character at the beginning of each word (in the
>n-gram method). This boosts similarity on the beginnings of words, which
>is usually what you want.
>
>I suggest you have a look at Google
>Refine <http://code.google.com/p/google-refine/>.
>Watch the first video. It demonstrates nice term-clustering capabilities
>using different methods (n-grams, ...). If it is what you are looking
>for, you can try it on the most frequent terms in your dataset, get
>interesting results quickly, and then implement whichever approach looks
>best for you.
>
>Best,
>
>Pascal
>
>
>2011/12/2 Neil Chaudhuri <[email protected]>
>
>> Glad to fill in more detail. Imagine I have a list of words and phrases
>> in a data store like this:
>>
>> Alabama
>> Obama
>> University of Alabama
>> Bama
>> Potomac
>> Texas
>> Potomac River
>>
>> I would like to cluster the ones that look similar enough to be the
>> same, like "Alabama," "University of Alabama," and "Bama" (but ideally
>> not "Obama"), or "Potomac" and "Potomac River."
>>
>> Now this list of words could be in the terabytes range, which is why I
>> need distributed computing capability.
>>
>> How would I assemble a Vector from an individual entry in this list?
>> With a bit more understanding of my situation, do you think Mahout can
>> work for me?
>>
>> Please let me know if I can provide more information.
>>
>> Thanks.
>>
>>
>> On Dec 1, 2011, at 11:29 PM, Jeff Eastman wrote:
>>
>> > Could you elaborate a bit on what you mean by "cluster a collection of
>> > words and phrases by syntactic similarity over a distributed
>> > environment"? If you can describe your collection in terms of a set of
>> > (sparse or dense) term vectors, then you should be able to use Mahout
>> > clustering directly. The vectors do not need to be huge (as "document"
>> > might imply); indeed, smaller-dimensionality clusterings work better
>> > than large ones. One question would be how you plan to encode these
>> > vectors. Another would be how large a collection you have.
>> >
>> > On 12/1/11 8:08 PM, Neil Chaudhuri wrote:
>> >> I have a need to cluster a collection of words and phrases by
>> >> syntactic similarity over a distributed environment, and I came upon
>> >> Mahout as a possible solution. After studying the documentation,
>> >> though, I find all of it tailored to working with entire documents
>> >> rather than words and phrases. I simply want to know whether you
>> >> believe Mahout is the right tool for this job. I suppose I could try
>> >> to view each word and phrase as an individual tiny document, but that
>> >> feels like I am forcing it.
>> >>
>> >> Any insight is appreciated.
>> >>
>> >> Thanks.
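As promised above, here is the rough sketch of what I mean by creating Vectors from words and phrases. It is only a guess at an approach, not tested against any particular Mahout release: hashed character trigrams (with Pascal's leading-marker trick) dropped into a RandomAccessSparseVector. The class name, cardinality, and trigram size are all mine.

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class TermVectorizer {

        // Size of the hashed feature space; a guess, not tuned.
        private static final int CARDINALITY = 1 << 16;

        /**
         * Encode a single word or phrase as a sparse vector of hashed
         * character trigrams. Each word is prefixed with '^' so trigrams
         * anchored at word beginnings get extra weight, per Pascal's trick.
         */
        public static Vector encode(String term) {
            Vector v = new RandomAccessSparseVector(CARDINALITY);
            String marked = "^" + term.toLowerCase().replaceAll("\\s+", " ^");
            for (int i = 0; i + 3 <= marked.length(); i++) {
                String gram = marked.substring(i, i + 3);
                int index = (gram.hashCode() & Integer.MAX_VALUE) % CARDINALITY;
                v.set(index, v.get(index) + 1.0); // simple term-frequency weight
            }
            return v;
        }
    }

If something like this is sound, I assume I would wrap each vector in a NamedVector keyed by the original term and write the lot to a SequenceFile as input to canopy or k-means clustering. Is that the right way to feed words and phrases, rather than documents, into Mahout?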
