Hi Neil,

I suggest you start with clustering on lexical affinities (based on how
words look); from your examples, that seems to be what you are looking
for. To cluster terms this way you don't really need the full data:
remove all duplicates first and you will hopefully get a much smaller
set (see the sketch below).

A good way to describe terms for this purpose is character n-grams. You
can also use phonetic transcriptions of the terms. An interesting trick
that works well with the n-gram method is to add a special character at
the beginning of each word. This boosts similarity at the beginning of
words, which is usually what you want; a sketch follows.
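To make the trick concrete, here is a small self-contained sketch. It
builds character trigram sets with a '^' marker prepended and compares
them with Jaccard overlap (the marker character, class name, and choice
of Jaccard are mine, not a Mahout API):

    import java.util.HashSet;
    import java.util.Set;

    // Character trigrams with a start-of-word marker. The '^' creates
    // trigrams anchored at the first letter, so pairs of terms that
    // begin the same way get an extra boost.
    public class NgramSimilarity {
        static Set<String> trigrams(String term) {
            String marked = "^" + term.toLowerCase();
            Set<String> grams = new HashSet<>();
            for (int i = 0; i + 3 <= marked.length(); i++) {
                grams.add(marked.substring(i, i + 3));
            }
            return grams;
        }

        static double jaccard(Set<String> a, Set<String> b) {
            Set<String> inter = new HashSet<>(a);
            inter.retainAll(b);
            Set<String> union = new HashSet<>(a);
            union.addAll(b);
            return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
        }

        public static void main(String[] args) {
            // "Bama" scores higher against "Alabama" than "Obama" does
            // (~0.29 vs 0.25), which is the behavior you want.
            System.out.println(jaccard(trigrams("Alabama"), trigrams("Bama")));
            System.out.println(jaccard(trigrams("Alabama"), trigrams("Obama")));
        }
    }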

I also suggest you have a look at Google Refine
(http://code.google.com/p/google-refine/) and watch the first video. It
demonstrates nice term clustering capabilities using several methods
(n-grams among them). If that is what you are looking for, you can try
it on the most frequent terms in your dataset, get interesting results
quickly, and then implement the approach that looks best for you.

Best,

Pascal

2011/12/2 Neil Chaudhuri <[email protected]>

> Glad to fill in more detail. Imagine I have a list of words and phrases in
> a data store like this:
>
> Alabama
> Obama
> University of Alabama
> Bama
> Potomac
> Texas
> Potomac River
>
> I would like to cluster the ones that look similar enough to be the same.
> Like "Alabama" and "University of Alabama" and "Bama" (but not Obama
> ideally) or "Potomac" and "Potomac River."
>
> Now this list of words could be in the terabytes range, which is why I
> need distributed computing capability.
>
> How would I assemble a Vector from an individual entry in this list? With
> a bit more understanding of my situation, do you think Mahout can work for
> me?
>
> Please let me know if I can provide more information.
>
> Thanks.
>
>
>
> On Dec 1, 2011, at 11:29 PM, Jeff Eastman wrote:
>
> > Could you elaborate a bit on what you mean by "cluster a collection of
> > words and phrases by syntactic similarity over a distributed environment
> > "? If you can describe your collection in terms of a set of (sparse or
> > dense) term vectors then you should be able to use Mahout clustering
> > directly. The vectors do not need to be huge (as "document" might
> > imply); indeed, smaller-dimensionality clusterings work better than large
> > ones. One question would be how you plan to encode these vectors;
> > another is how large a collection you have.
> >
> > On 12/1/11 8:08 PM, Neil Chaudhuri wrote:
> >> I have a need to cluster a collection of words and phrases by syntactic
> similarity over a distributed environment, and I came upon Mahout as a
> possible solution. After studying the documentation though, I am finding
> all of it tailored to working with entire documents rather than words and
> phrases. I simply want to know if you believe that Mahout is the right tool
> for this job. I suppose I could try to view each word and phrase as
> individual tiny documents, but that feels like I am forcing it.
> >>
> >> Any insight is appreciated.
> >>
> >> Thanks.
> >>
> >
>
>
