I suppose you could do so if you use a sequence similarity measure. 
I know that it can be integrated into hierarchical clustering, but it seems 
that hierarchical clustering has not become part of Mahout.
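
To make "sequence similarity" a bit more concrete, here is a rough sketch in 
plain Java. It assumes Levenshtein edit distance as the measure, which is only 
one possible choice, and the class and method names (SequenceSimilarity, 
levenshtein, similarity) are made up for illustration. Nothing here is Mahout 
API.

```java
import java.util.Locale;

// Hypothetical helper, not part of Mahout: edit-distance based similarity.
final class SequenceSimilarity {
    private SequenceSimilarity() {}

    // Levenshtein distance: minimum number of single-character inserts,
    // deletes and substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1], prev[j]) + 1;
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Case-insensitive similarity in [0, 1]; 1.0 means identical strings.
    static double similarity(String a, String b) {
        String x = a.toLowerCase(Locale.ROOT);
        String y = b.toLowerCase(Locale.ROOT);
        int longest = Math.max(x.length(), y.length());
        if (longest == 0) {
            return 1.0; // two empty strings are treated as identical
        }
        return 1.0 - (double) levenshtein(x, y) / longest;
    }

    public static void main(String[] args) {
        String[] entries = {"Alabama", "Obama", "Bama", "Potomac", "Potomac River"};
        for (String s : entries) {
            for (String t : entries) {
                System.out.printf("%s / %s: %.3f%n", s, t, similarity(s, t));
            }
        }
    }
}
```

Note that plain edit distance puts "Obama" and "Bama" at the same distance 
from "Alabama" (three edits each), so on its own it will not keep "Obama" out 
of that cluster. A different measure or some extra rule would be needed for 
that.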



----- Original Message -----
From: Neil Chaudhuri [mailto:[email protected]]
Sent: Friday, December 02, 2011 05:48 AM
To: [email protected] <[email protected]>
Subject: Re: Word and Phrase Clustering

Glad to fill in more detail. Imagine I have a list of words and phrases in a 
data store like this:

Alabama
Obama
University of Alabama
Bama
Potomac
Texas
Potomac River

I would like to cluster the ones that look similar enough to be the same. Like 
"Alabama" and "University of Alabama" and "Bama" (but not Obama ideally) or 
"Potomac" and "Potomac River." 

Now this list of words could be in the terabytes range, which is why I need 
distributed computing capability.

How would I assemble a Vector from an individual entry in this list? With a bit 
more understanding of my situation, do you think Mahout can work for me?

Please let me know if I can provide more information.

Thanks.



On Dec 1, 2011, at 11:29 PM, Jeff Eastman wrote:

> Could you elaborate a bit on what you mean by "cluster a collection of 
> words and phrases by syntactic similarity over a distributed 
> environment"? If you can describe your collection in terms of a set of (sparse or 
> dense) term vectors then you should be able to use Mahout clustering 
> directly. The vectors do not need to be huge (as "document" might 
> imply), indeed smaller dimensionality clusterings work better than large 
> ones. The question would be how do you plan to encode these vectors? 
> Another would be how large a collection you have?
> 
> On 12/1/11 8:08 PM, Neil Chaudhuri wrote:
>> I have a need to cluster a collection of words and phrases by syntactic 
>> similarity over a distributed environment, and I came upon Mahout as a 
>> possible solution. After studying the documentation though, I am finding all 
>> of it tailored to working with entire documents rather than words and 
>> phrases. I simply want to know if you believe that Mahout is the right tool 
>> for this job. I suppose I could try to view each word and phrase as 
>> individual tiny documents, but that feels like I am forcing it.
>> 
>> Any insight is appreciated.
>> 
>> Thanks.
>> 
> 
