Re: Clustering on 3D in Mahout

Fernando Fernández Mon, 05 Dec 2011 07:13:17 -0800

Sounds weird,

Have you been asked to "cluster documents" or something like that? What is
the goal of that clustering? Using a sequential id and a file path as
"variables" makes no sense at all. If you have been asked to "cluster
documents", then what you need is a mathematical representation of the
content. In that case, each possible word in the documents becomes a
dimension (variable), as Christoph said, so this is much more than 3
dimensions. You can ask Lucene to output TF-IDF vectors from the content of
each document (you should search for some info about what TF-IDF is) in a
format that Mahout can read.
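
To make the "one dimension per word" idea concrete, here is a minimal sketch
in plain Python. It is not what Lucene or Mahout run internally: the
whitespace tokenizer, the helper name tfidf_vectors and the weighting
(raw term count times log(N / document frequency)) are assumptions chosen
for illustration only.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn raw texts into TF-IDF vectors, one dimension per distinct word."""
    # Assumption: naive lowercase + whitespace tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    # The sorted vocabulary defines the dimensions of the vector space.
    vocab = sorted({word for tokens in tokenized for word in tokens})
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # One common weighting: raw term count times log(N / df).
        vectors.append([tf[w] * math.log(n_docs / df[w]) for w in vocab])
    return vocab, vectors


docs = [
    "mahout clusters documents",
    "lucene indexes documents",
    "solr wraps lucene",
]
vocab, vectors = tfidf_vectors(docs)
print(len(vocab), "dimensions for", len(docs), "documents")
```

Even these three tiny documents already need more than three dimensions,
which is the point made above.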


If you have literally been asked to "cluster with 3 dimensions: seqid,
body, and filepath", then maybe "clustering" is not what he/she actually
means...

2011/12/5 Christoph Brücke <[email protected]>

> Hi Syed,
>
> I have never used Lucene or Solr myself, so I can only rephrase what's in
> the Mahout wiki. Take a look at how to convert a Lucene (== Solr)
> index to a Mahout-compatible vector format [1]. Also have a look at the
> JavaDocs [2], especially KMeansDriver and CosineDistanceMeasure. In order to
> use your Solr index for clustering (I assume KMeans clustering; other
> clustering algorithms should work the same way), you create a SequenceFile
> from your index as described in the wiki [1]. After that, set up a new job
> with KMeansDriver, passing the input directory, output directory, ... and,
> most importantly, CosineDistanceMeasure as the distance measure. That's it,
> at least the hard part; for the actual clustering you just run your job and
> sit back and relax.
>
> To make it clear, I assumed you are using KMeans clustering and the body
> text of your Solr index. The described process should be applicable to the
> other clustering algorithms as well.
>
> Basically, you just need your input data (the data that should be used for
> clustering) as an iterable collection of vectors, and a distance metric
> that can be applied to your input data. In your case this is the body text
> of the index and cosine distance. And again, for your stated use case you
> don't really want just three dimensions: every word in your body text
> represents a new dimension, so you most likely have many more than
> three dimensions.
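
The two ingredients named above (a collection of vectors and a distance
metric) can be sketched in a few lines of plain Python. This is only an
illustration of the idea, not Mahout's KMeansDriver: the function names, the
toy vectors, the hand-picked starting centroids and the choice to use the
arithmetic mean as the new centroid are all assumptions made for this sketch.

```python
import math


def cosine_distance(a, b):
    """1 - cos(angle between a and b); close to 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # convention chosen here for all-zero vectors
    return 1.0 - dot / (norm_a * norm_b)


def kmeans(vectors, centroids, distance, max_iterations=10):
    """Plain Lloyd-style k-means over a collection of vectors."""
    vectors = [list(v) for v in vectors]
    centroids = [list(c) for c in centroids]
    assignment = []
    for _ in range(max_iterations):
        # Assignment step: each vector goes to its closest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: distance(v, centroids[k]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # converged: no vector changed cluster
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Toy term-weight vectors over a hypothetical 4-word vocabulary.
points = [
    [2.0, 1.0, 0.0, 0.0],
    [4.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 2.0, 5.0],
]
assignment, centroids = kmeans(points, [points[0], points[2]], cosine_distance)
print(assignment)
```

Note that cosine distance ignores vector length: the first two toy vectors
point in the same direction, so they count as identical even though one is
twice as long. That is usually what you want for documents of different
sizes.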
>
> If I totally missed your point, please speak up.
>
>
> So long,
> Christoph
>
> [1]
> https://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text#CreatingVectorsfromText-FromLucene
> [2] https://builds.apache.org/job/Mahout-Quality/javadoc/
>
>
> On 05.12.2011 at 15:15, syed kather wrote:
>
> > Thanks, Christoph.
> > Can you give me a sample of clustering in 3D that I can understand?
> >
> > Please help me so that I can learn new things. Is it possible using
> > Solr? If so, how can I do that?
> >
> >
> >            Thanks and Regards,
> >        S SYED ABDUL KATHER
> >                9731841519
> >
> >
> > On Mon, Dec 5, 2011 at 7:12 PM, Christoph Brücke <
> > [email protected]> wrote:
> >
> >> Hi Syed,
> >>
> >> To answer your first question: YES, Mahout is totally capable of
> >> clustering in three dimensions. However, as far as my knowledge of
> >> KMeans clustering goes, each feature (dimension) has to be of the same
> >> type, meaning there has to be one distance metric which is capable of
> >> expressing the distance between every two points. That said, I don't
> >> think that you can define a metric which uses seqid, text and
> >> text (filepath) as coordinates. But I think you could just use the body
> >> of your index and calculate something like cosine distance to cluster
> >> your index entries, as seqid is probably unique to every entry and the
> >> file path is not really relevant (at least I can't come up with any
> >> suitable use case).
> >>
> >> TL;DR: Yes, you can cluster in multiple dimensions as long as you can
> >> define a distance between every pair of points. You are probably better
> >> off using just the body text of your Solr index.
> >>
> >> Regards,
> >> Christoph
> >>
> >>
> >> On 05.12.2011 at 14:09, syed kather wrote:
> >>
> >>> Team,
> >>>
> >>>    Is it possible to cluster in 3D?
> >>>
> >>>
> >>> I am trying a case like the one given below.
> >>>
> >>> 1.  I have a Solr index with three fields (SEQID, BODY (content of
> >>> the text file), FILEPATH).
> >>>
> >>> Now I need to cluster this. Please help me: how can I do this? Is
> >>> there a way?
> >>>
> >>>           Thanks and Regards,
> >>>       S SYED ABDUL KATHER
> >>
> >>
> >>
>
> Christoph Brücke
> [email protected]
>
>
>
>
