Hello,
I'm writing a Java app to cluster my data with k-means. These are the steps:
1)
LuceneDemo: creates the index and the vectors using the lucene.vector lib. Input: the path of my .txt file. Output: the index files (segments_1, segments.gen, .fdt, .fdx, .fnm, .frq, .nrm, .prx, .tii, .tis, .tvd, .tvx and, most importantly, the .tvf file, which will be used by Mahout) and vectors that look like this
(SEQ__org.apache.hadoop.io.Text_org.apache.hadoop.io.Text______t€ðàó^æVG²RŸ˜Õ_________Ž__P(0):{15:1.4650986194610596,14:0.9997141361236572,11:0.9997141361236572,10:0.9997141361236572,9:0.9997141361236572,8:1.4650986194610596,7:1.4650986194610596,6:1.4650986194610596,5:0.9997141361236572,4:1.4650986194610596,2:3.1613736152648926,1:1.4650986194610596,0:0.9997141361236572}_________Ž__P(1):{15:1.4650986194610596,14:0.9997141361236572,11:0.9997141361236572,10:0.9997141361236572,9:0.9997141361236572,8:1.4650986194610596,7:1.4650986194610596,6:1.4650986194610596,5:0.9997141361236572,4:1.4650986194610596,2:3.1613736152648926,1:1.4650986194610596,0:0.9997141361236572}_________Ž__P(2):{
[… and others])
Can anyone please confirm that this output format looks right? If not, what should the vectors generated by lucene.vector look like?
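In case it helps, this is how I inspect the generated file myself. It is a minimal sketch, assuming the Hadoop 1.x API that ships with Mahout; the class name and the path argument are placeholders of mine:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SeqFileInspector {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(args[0]); // e.g. the vectors file from step 1
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        // Mahout's clustering jobs expect VectorWritable values (keys are typically Text)
        System.out.println("key class:   " + reader.getKeyClassName());
        System.out.println("value class: " + reader.getValueClassName());
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
        // dump the first few records
        for (int i = 0; i < 3 && reader.next(key, value); i++) {
            System.out.println(key + " => " + value);
        }
        reader.close();
    }
}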
Here is the relevant part of LuceneDemo:
/* Create the vectors: collect the term-frequency vector of the
   "content" field for every document in the index (Lucene 3.x API). */
Map vectorMap = new TreeMap();
IndexReader reader = IndexReader.open(index);
int numDocs = reader.maxDoc();
for (int i = 0; i < numDocs; i++) {
    // null if the document has no stored term vector for "content"
    TermFreqVector termFreqVector = reader.getTermFreqVector(i, "content");
    if (termFreqVector != null) {
        addTermFreqToMap(vectorMap, termFreqVector);
    }
}
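For comparison, here is my understanding of what the k-means input should look like on disk: a small sketch that writes one Text/VectorWritable pair per document. The class name, path, dictionary size and weights are placeholders; only the Text/VectorWritable pairing is the point.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class VectorWriterSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("vectors/part-00000"); // placeholder path
        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, out, Text.class, VectorWritable.class);
        try {
            int dictionarySize = 16; // placeholder: number of distinct terms
            Vector v = new RandomAccessSparseVector(dictionarySize);
            v.set(2, 3.1613736152648926); // termIndex -> weight, as in my dump above
            v.set(4, 1.4650986194610596);
            writer.append(new Text("P(0)"), new VectorWritable(new NamedVector(v, "P(0)")));
        } finally {
            writer.close();
        }
    }
}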
2)
MainClass: creates the clusters with Mahout. Input: the path of the vectors generated in step 1 (see above). Output: the clusters. For the moment it does not create any clusters because of this error:
Exception in thread "main" java.io.FileNotFoundException: File file:/F:/MAHOUT/TesMahout/clusters/tf-vectors/wordcount/data does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
    at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
    at org.apache.mahout.vectorizer.tfidf.TFIDFConverter.startDFCounting(TFIDFConverter.java:368)
    at org.apache.mahout.vectorizer.tfidf.TFIDFConverter.calculateDF(TFIDFConverter.java:198)
    at main.MainClass.main(MainClass.java:144)
Can anyone please help me solve this exception? I can't understand why the data could not be created, given that I'm using the Hadoop and Mahout libs on Windows (and I'm an admin, so it should not be a rights problem).
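To narrow it down, I tried checking the path from the exception right before the failing call. A minimal sketch, reusing the same conf that the Mahout jobs get (needs org.apache.hadoop.fs.FileSystem, FileStatus and Path):

// Does the path that startDFCounting is looking for actually exist,
// and what is under its parent directory?
FileSystem fs = FileSystem.get(conf);
Path data = new Path("F:/MAHOUT/TesMahout/clusters/tf-vectors/wordcount/data");
System.out.println(data + " exists: " + fs.exists(data));
Path parent = data.getParent();
if (fs.exists(parent)) {
    for (FileStatus status : fs.listStatus(parent)) {
        System.out.println(status.getPath());
    }
}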
Here is the relevant part of MainClass:
// 1) Compute document frequencies from the TF vectors produced in step 1
Pair<Long[], List<Path>> calculate = TFIDFConverter.calculateDF(
        new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
        new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
        conf, chuckSize);

// 2) Turn the TF vectors into TF-IDF vectors
TFIDFConverter.processTfIdf(
        new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
        new Path(outputDir), conf, calculate, minDf, maxDFPercent, norm, true,
        sequentialAccessOutput, false, reduceTasks);

Path vectorFolder = new Path("output");
Path canopyCentroids = new Path(outputDir, "canopy-centroids");
Path clusterOutput = new Path(outputDir, "clusters");

// 3) Seed the initial centroids with canopy clustering
CanopyDriver.run(vectorFolder, canopyCentroids,
        new EuclideanDistanceMeasure(), 250, 120, false, 3, false);

// 4) Run k-means, starting from the canopy centroids
KMeansDriver.run(conf, vectorFolder, new Path(canopyCentroids, "clusters-0"),
        clusterOutput, new TanimotoDistanceMeasure(), 0.01, 20, true, 3, false);
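Once the clustering runs through, I plan to read the assignments back roughly like this. This is a sketch built on an assumption of mine: I believe clusteredPoints holds IntWritable cluster ids with WeightedVectorWritable points, but the classes and the part-file name can differ between Mahout versions.

// Sketch: iterate over the k-means point assignments. The key/value
// classes and the part-file name are assumptions; the inspector above
// can confirm what the file really contains.
FileSystem fs = FileSystem.get(conf);
Path points = new Path(clusterOutput, "clusteredPoints/part-m-00000");
SequenceFile.Reader pointsReader = new SequenceFile.Reader(fs, points, conf);
IntWritable clusterId = new IntWritable();
WeightedVectorWritable point = new WeightedVectorWritable();
while (pointsReader.next(clusterId, point)) {
    System.out.println("cluster " + clusterId.get() + " <- " + point.getVector());
}
pointsReader.close();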
Thank you for your time
Regards