Hi,

Yes, I'm using the Mahout and Hadoop libraries on Windows. My cluster output is not written to HDFS but to the local filesystem. Thanks to Cygwin I am able to run the Unix commands needed to run Mahout on Windows. I changed the PATH on Windows as well.
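A minimal sketch of what pinning Hadoop to the local filesystem can look like from Java (this assumes Hadoop 0.20/1.x-era property names; the class name is made up for illustration and the path is the one from the error later in this thread):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LocalFsCheck { // hypothetical helper, not part of the thread's code
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // With no core-site.xml on the classpath this already defaults to
            // file:///, but setting it explicitly avoids HDFS/local confusion.
            conf.set("fs.default.name", "file:///");

            FileSystem fs = FileSystem.get(conf);
            // Expect a local filesystem implementation here, not DistributedFileSystem.
            System.out.println(fs.getClass().getName());
            System.out.println(fs.exists(new Path("F:/MAHOUT/TesMahout")));
        }
    }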
I didn't test whether wordcount works, because I am using only the Mahout libraries and did not try to run the examples. I was not following any tutorial, but I found this, which may help you: http://blogs.msdn.com/b/avkashchauhan/archive/2012/03/06/running-apache-mahout-at-hadoop-on-windows-azure-www-hadooponazure-com.aspx

Cheers

-----Original Message-----
From: Yuval Feinstein [mailto:[email protected]]
Sent: Tuesday, 7 August 2012 08:16
To: [email protected]
Subject: Re: clustering with kmeans, java app

I spent a week trying to get Hadoop to work on Windows 7, and then gave up. Have you managed to run Hadoop on Windows? Do the Hadoop tests (e.g. wordcount) work? http://en.wikisource.org/wiki/User:Fkorning/Code/Hadoop-on-Cygwin has lots of details about this. Some of the possible problems are Cygwin paths (!= Linux paths), HDFS/local filesystem confusion, your Hadoop user (!= your own user, permissions-wise), or other things listed at the link above.
Good luck,
Yuval

On Thu, Aug 2, 2012 at 11:57 AM, Videnova, Svetlana <[email protected]> wrote:
>
> Hello,
>
> I'm writing a Java app to cluster my data with k-means. These are the steps:
>
> 1)
>
> LuceneDemo: create the index and the vectors using the lucene.vector library.
> Input: the path of my .txt file.
> Output: the index files (segments_1, segments.gen, .fdt, .fdx, .fnm, .frq,
> .nrm, .prx, .tii, .tis, .tvd, .tvx and, most importantly, the .tvf file
> that Mahout will use) and vectors that look like this (non-printable
> bytes omitted):
>
> SEQ org.apache.hadoop.io.Text org.apache.hadoop.io.Text [binary header]
> (0):{15:1.4650986194610596,14:0.9997141361236572,11:0.9997141361236572,10:0.9997141361236572,9:0.9997141361236572,8:1.4650986194610596,7:1.4650986194610596,6:1.4650986194610596,5:0.9997141361236572,4:1.4650986194610596,2:3.1613736152648926,1:1.4650986194610596,0:0.9997141361236572}
> (1):{15:1.4650986194610596,14:0.9997141361236572,11:0.9997141361236572,10:0.9997141361236572,9:0.9997141361236572,8:1.4650986194610596,7:1.4650986194610596,6:1.4650986194610596,5:0.9997141361236572,4:1.4650986194610596,2:3.1613736152648926,1:1.4650986194610596,0:0.9997141361236572}
> (2):{ [... and others]
>
> Can anyone please confirm that this output format looks right? If not,
> what should the vectors generated by lucene.vector look like?
>
> This is part of the code:
>
> /* Creating vectors: aggregate term frequencies over all documents. */
> Map vectorMap = new TreeMap(); // addTermFreqToMap is a helper defined elsewhere
> IndexReader reader = IndexReader.open(index);
> int numDoc = reader.maxDoc();
> for (int i = 0; i < numDoc; i++) {
>     // Term-frequency vector of the "content" field for document i.
>     TermFreqVector termFreqVector = reader.getTermFreqVector(i, "content");
>     addTermFreqToMap(vectorMap, termFreqVector);
> }
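One thing worth noting about the dump above: its header shows both the key class and the value class as org.apache.hadoop.io.Text, while Mahout's clustering jobs read SequenceFiles whose values are org.apache.mahout.math.VectorWritable. For comparison, here is a minimal, hypothetical sketch of writing vectors in that form (the output path, key text, and vector contents are made up; the API calls are standard Hadoop/Mahout ones from that era):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    public class WriteVectorsSketch { // illustrative only
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path out = new Path("vectors/part-00000"); // hypothetical path

            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, out, Text.class, VectorWritable.class);
            try {
                // One sparse vector per document; 16 dimensions, as in the dump above.
                Vector v = new RandomAccessSparseVector(16);
                v.set(0, 0.9997141361236572);
                v.set(2, 3.1613736152648926);
                writer.append(new Text("doc-0"), new VectorWritable(v));
            } finally {
                writer.close();
            }
        }
    }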
>
> 2)
>
> MainClass: create the clusters with Mahout.
> Input: the path of the vectors generated by step 1 (see above).
> Output: the clusters. For the moment it does not create any clusters,
> because of this error:
>
> Exception in thread "main" java.io.FileNotFoundException: File file:/F:/MAHOUT/TesMahout/clusters/tf-vectors/wordcount/data does not exist.
>     at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
>     at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
>     at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
>     at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>     at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>     at org.apache.mahout.vectorizer.tfidf.TFIDFConverter.startDFCounting(TFIDFConverter.java:368)
>     at org.apache.mahout.vectorizer.tfidf.TFIDFConverter.calculateDF(TFIDFConverter.java:198)
>     at main.MainClass.main(MainClass.java:144)
>
> Can anyone please help me solve this exception? I can't understand why the
> data could not be created, given that I'm using the Hadoop and Mahout
> libraries on Windows (and I'm an administrator, so it should not be a
> permissions problem).
>
> This is part of the code:
>
> Pair<Long[], List<Path>> calculate = TFIDFConverter.calculateDF(
>         new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
>         new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
>         conf, chunkSize);
>
> TFIDFConverter.processTfIdf(
>         new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
>         new Path(outputDir), conf, calculate, minDf, maxDFPercent, norm,
>         true, sequentialAccessOutput, false, reduceTasks);
>
> Path vectorFolder = new Path("output");
> Path canopyCentroids = new Path(outputDir, "canopy-centroids");
> Path clusterOutput = new Path(outputDir, "clusters");
>
> // Canopy pass to seed the k-means centroids.
> CanopyDriver.run(vectorFolder, canopyCentroids,
>         new EuclideanDistanceMeasure(), 250, 120, false, 3, false);
>
> KMeansDriver.run(conf, vectorFolder, new Path(canopyCentroids, "clusters-0"),
>         clusterOutput, new TanimotoDistanceMeasure(), 0.01, 20, true, 3, false);
>
> Thank you for your time.
>
> Regards
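A quick way to debug this kind of FileNotFoundException is to list what the vectorizer actually wrote before calling TFIDFConverter.calculateDF. A minimal, hypothetical sketch (the directory is taken from the stack trace above; FileSystem.exists and FileSystem.listStatus are standard Hadoop calls):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class InspectVectorDirs { // illustrative only
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Path from the exception above; adjust to your own outputDir.
            Path tfVectors = new Path("F:/MAHOUT/TesMahout/clusters/tf-vectors");

            System.out.println("exists: " + fs.exists(tfVectors));
            if (fs.exists(tfVectors)) {
                // List what the previous step actually produced; the DF-counting
                // job later expects a SequenceFile under this tree.
                for (FileStatus status : fs.listStatus(tfVectors)) {
                    System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
                }
            }
        }
    }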
