Sounds weird. Have you been asked to "cluster documents" or something like that? What is the goal of the clustering? Using a sequential id and a file path as "variables" makes no sense at all. If you have been asked to "cluster documents", then what you need is a mathematical representation of the content. In that case each distinct word in the documents becomes a dimension (variable), as Christoph said, so this is much more than 3 dimensions. You can ask Lucene to output TF-IDF vectors from the content of each document (you should search for some info about what TF-IDF is) in a format that Mahout can read.
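To make the "each word is a dimension" idea concrete, here is a minimal sketch in plain Python. It is not Mahout or Lucene code: the function names, the whitespace tokenizer, the sample documents, and the smoothed IDF formula (log(N/df) + 1) are all assumptions chosen for illustration, and real tools use their own weighting variants. It only shows how body text turns into vectors with one dimension per word, and how cosine distance (in Mahout the matching class is, as far as I know, CosineDistanceMeasure) compares them.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocab, vectors): one TF-IDF vector per document.

    Illustrative only: whitespace tokenization and a smoothed IDF.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # Every distinct word across all documents is one dimension.
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts for this document
        vectors.append(
            [tf[word] * (math.log(n_docs / df[word]) + 1.0) for word in vocab]
        )
    return vocab, vectors


def cosine_distance(a, b):
    """1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical stand-ins for the BODY field of three index entries.
docs = [
    "apache mahout clusters documents",
    "mahout clusters text documents with kmeans",
    "the weather is sunny today",
]
vocab, vectors = tfidf_vectors(docs)
print("dimensions:", len(vocab))
print("doc0 vs doc1:", cosine_distance(vectors[0], vectors[1]))
print("doc0 vs doc2:", cosine_distance(vectors[0], vectors[2]))
```

Even these three short sentences already need a dozen dimensions. The two Mahout-related documents end up closer to each other than either is to the weather one, which is the property a clustering algorithm such as k-means relies on.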
If you have been literally asked to "cluster with 3 dimensions: seqid, body, and filepath", then maybe "clustering" is not what he/she actually means...

2011/12/5 Christoph Brücke <[email protected]>

> Hi Syed,
>
> I have never used Lucene or Solr myself, so I can only rephrase what's
> in the Mahout wiki. So just take a look at how to convert a Lucene
> (== Solr) index to a Mahout-compatible vector format [1]. Also have a
> look at the JavaDocs [2], especially KMeansDriver and
> CosineDistanceMetric. In order to use your Solr index for clustering
> (I assume KMeans clustering; other clustering algorithms should work
> the same way) you create a Sequence File from your index as described
> in the wiki [1]. After that, create a new job using KMeansDriver by
> calling the constructor with the input directory, output directory,
> .... and, most important, the CosineDistanceMetric as parameter.
> That's it, at least the hard part. For the actual clustering you just
> call run on your job and sit back and relax.
>
> To make it clear, I assumed you are using KMeans clustering and the
> body text of your Solr index. The described process should be
> applicable to the other clustering algorithms as well.
>
> Basically you just need your input data (the data that should be used
> for clustering) as an iterable collection of vectors, and a distance
> metric that can be used for your input data. In your case this is the
> body text of the index and cosine distance. And again, for your stated
> use case you don't really want just three dimensions, since every word
> in your body text represents a new dimension, so you most likely have
> many more than just three dimensions.
>
> If I totally missed your point, please speak out.
>
> So long,
> Christoph
>
> [1] https://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text#CreatingVectorsfromText-FromLucene
> [2] https://builds.apache.org/job/Mahout-Quality/javadoc/
>
> On 05.12.2011 at 15:15, syed kather wrote:
>
> > Thanks Christoph
> > Can you give sample info on clustering in 3D which I can understand?
> >
> > Please help me, so that I can learn new things. Is it possible using
> > Solr? If so, how can I do that?
> >
> > Thanks and Regards,
> > S SYED ABDUL KATHER
> > 9731841519
> >
> > On Mon, Dec 5, 2011 at 7:12 PM, Christoph Brücke
> > <[email protected]> wrote:
> >
> >> Hi Syed,
> >>
> >> to answer your first question: YES, Mahout is totally capable of
> >> clustering in three dimensions. However, as far as my knowledge
> >> goes with KMeansClustering, each feature (dimension) has to be the
> >> same type. Meaning there has to be one distance metric which is
> >> capable of expressing the distance between every two points. That
> >> said, I don't think that you can define a metric which uses seqid,
> >> text and text (filepath) as coordinates. But I think you could just
> >> use the body of your index and calculate something like cosine
> >> distance to cluster your index entries, as seqid is probably unique
> >> to every entry and the file path is not really relevant (at least I
> >> can't come up with any suitable use case).
> >>
> >> TL;DR: Yes, you can cluster in multiple dimensions as long as you
> >> can define a distance between every pair. You are probably better
> >> off using just the body text of your Solr index.
> >>
> >> Regards,
> >> Christoph
> >>
> >> On 05.12.2011 at 14:09, syed kather wrote:
> >>
> >>> Team,
> >>>
> >>> Is it possible to cluster in 3D?
> >>>
> >>> I am trying the case given below.
> >>>
> >>> 1. I have a Solr index with three fields (SEQID, BODY (content
> >>>    of text file), FILEPATH);
> >>>
> >>> Now I need to cluster this. Please help me with how to do this. Is
> >>> there a way?
> >>>
> >>> Thanks and Regards,
> >>> S SYED ABDUL KATHER
>
> Christoph Brücke
> [email protected]
