Hello,
I'm writing a Java app to cluster my data with k-means. These are the steps:
1)
LuceneDemo: creates the index and the vectors using the lucene.vector lib. Input: the path of my .txt file. Output: the index files (segments_1, segments.gen, .fdt, .fdx, .fnm, .frq, .nrm, .prx, .tii, .tis, .tvd, .tvx and, most importantly, the .tvf file, which will be used by Mahout) and vectors that look like this
(SEQ__org.apache.hadoop.io.Text_org.apache.hadoop.io.Text______t€ðàó^æVG²RŸ˜Õ_________Ž__P(0):{15:1.4650986194610596,14:0.9997141361236572,11:0.9997141361236572,10:0.9997141361236572,9:0.9997141361236572,8:1.4650986194610596,7:1.4650986194610596,6:1.4650986194610596,5:0.9997141361236572,4:1.4650986194610596,2:3.1613736152648926,1:1.4650986194610596,0:0.9997141361236572}_________Ž__P(1):{15:1.4650986194610596,14:0.9997141361236572,11:0.9997141361236572,10:0.9997141361236572,9:0.9997141361236572,8:1.4650986194610596,7:1.4650986194610596,6:1.4650986194610596,5:0.9997141361236572,4:1.4650986194610596,2:3.1613736152648926,1:1.4650986194610596,0:0.9997141361236572}_________Ž__P(2):{
[… and others])
Can anyone please confirm that this output format looks right? If not, what should the vectors generated by lucene.vector look like?
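In case it helps, this is how I inspect the generated file myself. It is a minimal sketch, assuming the Hadoop 1.x API that ships with Mahout; the class name and the path argument are placeholders of mine:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SeqFileInspector {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(args[0]); // e.g. the vectors file from step 1
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        // Mahout's clustering jobs expect VectorWritable values (keys are typically Text)
        System.out.println("key class:   " + reader.getKeyClassName());
        System.out.println("value class: " + reader.getValueClassName());
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
        // dump the first few records
        for (int i = 0; i < 3 && reader.next(key, value); i++) {
            System.out.println(key + " => " + value);
        }
        reader.close();
    }
}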
Here is the relevant part of LuceneDemo:
/* Create the vectors: collect the term-frequency vector of the
   "content" field for every document in the index (Lucene 3.x API). */
Map vectorMap = new TreeMap();
IndexReader reader = IndexReader.open(index);
int numDocs = reader.maxDoc();
for (int i = 0; i < numDocs; i++) {
    // null if the document has no stored term vector for "content"
    TermFreqVector termFreqVector = reader.getTermFreqVector(i, "content");
    if (termFreqVector != null) {
        addTermFreqToMap(vectorMap, termFreqVector);
    }
}
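For comparison, here is my understanding of what the k-means input should look like on disk: a small sketch that writes one Text/VectorWritable pair per document. The class name, path, dictionary size and weights are placeholders; only the Text/VectorWritable pairing is the point.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class VectorWriterSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("vectors/part-00000"); // placeholder path
        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, out, Text.class, VectorWritable.class);
        try {
            int dictionarySize = 16; // placeholder: number of distinct terms
            Vector v = new RandomAccessSparseVector(dictionarySize);
            v.set(2, 3.1613736152648926); // termIndex -> weight, as in my dump above
            v.set(4, 1.4650986194610596);
            writer.append(new Text("P(0)"), new VectorWritable(new NamedVector(v, "P(0)")));
        } finally {
            writer.close();
        }
    }
}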
2)
MainClass: creates the clusters with Mahout. Input: the path of the vectors generated in step 1 (see above). Output: the clusters. For the moment it does not create any clusters because of this error:
Exception in thread "main" java.io.FileNotFoundException: File file:/F:/MAHOUT/TesMahout/clusters/tf-vectors/wordcount/data does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
    at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
    at org.apache.mahout.vectorizer.tfidf.TFIDFConverter.startDFCounting(TFIDFConverter.java:368)
    at org.apache.mahout.vectorizer.tfidf.TFIDFConverter.calculateDF(TFIDFConverter.java:198)
    at main.MainClass.main(MainClass.java:144)
Can anyone please help me solve this exception? I can't understand why the data could not be created, given that I'm using the Hadoop and Mahout libs on Windows (and I'm an admin, so it should not be a rights problem).
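To narrow it down, I tried checking the path from the exception right before the failing call. A minimal sketch, reusing the same conf that the Mahout jobs get (needs org.apache.hadoop.fs.FileSystem, FileStatus and Path):

// Does the path that startDFCounting is looking for actually exist,
// and what is under its parent directory?
FileSystem fs = FileSystem.get(conf);
Path data = new Path("F:/MAHOUT/TesMahout/clusters/tf-vectors/wordcount/data");
System.out.println(data + " exists: " + fs.exists(data));
Path parent = data.getParent();
if (fs.exists(parent)) {
    for (FileStatus status : fs.listStatus(parent)) {
        System.out.println(status.getPath());
    }
}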
Here is the relevant part of MainClass:
// 1) Compute document frequencies from the TF vectors produced in step 1
Pair<Long[], List<Path>> calculate = TFIDFConverter.calculateDF(
        new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
        new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
        conf, chuckSize);

// 2) Turn the TF vectors into TF-IDF vectors
TFIDFConverter.processTfIdf(
        new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
        new Path(outputDir), conf, calculate, minDf, maxDFPercent, norm, true,
        sequentialAccessOutput, false, reduceTasks);

Path vectorFolder = new Path("output");
Path canopyCentroids = new Path(outputDir, "canopy-centroids");
Path clusterOutput = new Path(outputDir, "clusters");

// 3) Seed the initial centroids with canopy clustering
CanopyDriver.run(vectorFolder, canopyCentroids,
        new EuclideanDistanceMeasure(), 250, 120, false, 3, false);

// 4) Run k-means, starting from the canopy centroids
KMeansDriver.run(conf, vectorFolder, new Path(canopyCentroids, "clusters-0"),
        clusterOutput, new TanimotoDistanceMeasure(), 0.01, 20, true, 3, false);
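Once the clustering runs through, I plan to read the assignments back roughly like this. This is a sketch built on an assumption of mine: I believe clusteredPoints holds IntWritable cluster ids with WeightedVectorWritable points, but the classes and the part-file name can differ between Mahout versions.

// Sketch: iterate over the k-means point assignments. The key/value
// classes and the part-file name are assumptions; the inspector above
// can confirm what the file really contains.
FileSystem fs = FileSystem.get(conf);
Path points = new Path(clusterOutput, "clusteredPoints/part-m-00000");
SequenceFile.Reader pointsReader = new SequenceFile.Reader(fs, points, conf);
IntWritable clusterId = new IntWritable();
WeightedVectorWritable point = new WeightedVectorWritable();
while (pointsReader.next(clusterId, point)) {
    System.out.println("cluster " + clusterId.get() + " <- " + point.getVector());
}
pointsReader.close();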
Thank you for your time
Regards