Hi,

Yes, I'm using the Mahout and Hadoop libraries on Windows. My cluster output is not written to HDFS but to the local filesystem. Thanks to Cygwin I am able to run the Unix commands needed to run Mahout on Windows. I changed the PATH on Windows as well.
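A minimal sketch of what pinning Hadoop to the local filesystem can look like from Java (this assumes Hadoop 0.20/1.x-era property names; the class name is made up for illustration and the path is the one from the error later in this thread):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LocalFsCheck { // hypothetical helper, not part of the thread's code
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // With no core-site.xml on the classpath this already defaults to
            // file:///, but setting it explicitly avoids HDFS/local confusion.
            conf.set("fs.default.name", "file:///");

            FileSystem fs = FileSystem.get(conf);
            // Expect a local filesystem implementation here, not DistributedFileSystem.
            System.out.println(fs.getClass().getName());
            System.out.println(fs.exists(new Path("F:/MAHOUT/TesMahout")));
        }
    }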
I didn't test whether wordcount works, because I am using only the Mahout libraries and did not try to run the examples. I was not following any tutorial, but I found this, which may help you: http://blogs.msdn.com/b/avkashchauhan/archive/2012/03/06/running-apache-mahout-at-hadoop-on-windows-azure-www-hadooponazure-com.aspx

Cheers

-----Original Message-----
From: Yuval Feinstein [mailto:[email protected]]
Sent: Tuesday, 7 August 2012 08:16
To: [email protected]
Subject: Re: clustering with kmeans, java app

I spent a week trying to get Hadoop to work on Windows 7, and then gave up. Have you managed to run Hadoop on Windows? Do the Hadoop tests (e.g. wordcount) work? http://en.wikisource.org/wiki/User:Fkorning/Code/Hadoop-on-Cygwin has lots of details about this. Some of the possible problems are Cygwin paths (!= Linux paths), HDFS/local filesystem confusion, your Hadoop user (!= your own user, permissions-wise), or other things listed at the link above.
Good luck,
Yuval

On Thu, Aug 2, 2012 at 11:57 AM, Videnova, Svetlana <[email protected]> wrote:
>
> Hello,
>
> I'm writing a Java app to cluster my data with k-means. These are the steps:
>
> 1)
>
> LuceneDemo: create the index and the vectors using the lucene.vector library.
> Input: the path of my .txt file.
> Output: the index files (segments_1, segments.gen, .fdt, .fdx, .fnm, .frq,
> .nrm, .prx, .tii, .tis, .tvd, .tvx and, most importantly, the .tvf file
> that Mahout will use) and vectors that look like this (non-printable
> bytes omitted):
>
> SEQ org.apache.hadoop.io.Text org.apache.hadoop.io.Text [binary header]
> (0):{15:1.4650986194610596,14:0.9997141361236572,11:0.9997141361236572,10:0.9997141361236572,9:0.9997141361236572,8:1.4650986194610596,7:1.4650986194610596,6:1.4650986194610596,5:0.9997141361236572,4:1.4650986194610596,2:3.1613736152648926,1:1.4650986194610596,0:0.9997141361236572}
> (1):{15:1.4650986194610596,14:0.9997141361236572,11:0.9997141361236572,10:0.9997141361236572,9:0.9997141361236572,8:1.4650986194610596,7:1.4650986194610596,6:1.4650986194610596,5:0.9997141361236572,4:1.4650986194610596,2:3.1613736152648926,1:1.4650986194610596,0:0.9997141361236572}
> (2):{ [... and others]
>
> Can anyone please confirm that this output format looks right? If not,
> what should the vectors generated by lucene.vector look like?
>
> This is part of the code:
>
> /* Creating vectors: aggregate term frequencies over all documents. */
> Map vectorMap = new TreeMap(); // addTermFreqToMap is a helper defined elsewhere
> IndexReader reader = IndexReader.open(index);
> int numDoc = reader.maxDoc();
> for (int i = 0; i < numDoc; i++) {
>     // Term-frequency vector of the "content" field for document i.
>     TermFreqVector termFreqVector = reader.getTermFreqVector(i, "content");
>     addTermFreqToMap(vectorMap, termFreqVector);
> }
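One thing worth noting about the dump above: its header shows both the key class and the value class as org.apache.hadoop.io.Text, while Mahout's clustering jobs read SequenceFiles whose values are org.apache.mahout.math.VectorWritable. For comparison, here is a minimal, hypothetical sketch of writing vectors in that form (the output path, key text, and vector contents are made up; the API calls are standard Hadoop/Mahout ones from that era):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    public class WriteVectorsSketch { // illustrative only
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path out = new Path("vectors/part-00000"); // hypothetical path

            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, out, Text.class, VectorWritable.class);
            try {
                // One sparse vector per document; 16 dimensions, as in the dump above.
                Vector v = new RandomAccessSparseVector(16);
                v.set(0, 0.9997141361236572);
                v.set(2, 3.1613736152648926);
                writer.append(new Text("doc-0"), new VectorWritable(v));
            } finally {
                writer.close();
            }
        }
    }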
>
> 2)
>
> MainClass: create the clusters with Mahout.
> Input: the path of the vectors generated by step 1 (see above).
> Output: the clusters. For the moment it does not create any clusters,
> because of this error:
>
> Exception in thread "main" java.io.FileNotFoundException: File file:/F:/MAHOUT/TesMahout/clusters/tf-vectors/wordcount/data does not exist.
>     at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
>     at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
>     at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
>     at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>     at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>     at org.apache.mahout.vectorizer.tfidf.TFIDFConverter.startDFCounting(TFIDFConverter.java:368)
>     at org.apache.mahout.vectorizer.tfidf.TFIDFConverter.calculateDF(TFIDFConverter.java:198)
>     at main.MainClass.main(MainClass.java:144)
>
> Can anyone please help me solve this exception? I can't understand why the
> data could not be created, given that I'm using the Hadoop and Mahout
> libraries on Windows (and I'm an administrator, so it should not be a
> permissions problem).
>
> This is part of the code:
>
> Pair<Long[], List<Path>> calculate = TFIDFConverter.calculateDF(
>         new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
>         new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
>         conf, chunkSize);
>
> TFIDFConverter.processTfIdf(
>         new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
>         new Path(outputDir), conf, calculate, minDf, maxDFPercent, norm,
>         true, sequentialAccessOutput, false, reduceTasks);
>
> Path vectorFolder = new Path("output");
> Path canopyCentroids = new Path(outputDir, "canopy-centroids");
> Path clusterOutput = new Path(outputDir, "clusters");
>
> // Canopy pass to seed the k-means centroids.
> CanopyDriver.run(vectorFolder, canopyCentroids,
>         new EuclideanDistanceMeasure(), 250, 120, false, 3, false);
>
> KMeansDriver.run(conf, vectorFolder, new Path(canopyCentroids, "clusters-0"),
>         clusterOutput, new TanimotoDistanceMeasure(), 0.01, 20, true, 3, false);
>
> Thank you for your time.
>
> Regards
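A quick way to debug this kind of FileNotFoundException is to list what the vectorizer actually wrote before calling TFIDFConverter.calculateDF. A minimal, hypothetical sketch (the directory is taken from the stack trace above; FileSystem.exists and FileSystem.listStatus are standard Hadoop calls):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class InspectVectorDirs { // illustrative only
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Path from the exception above; adjust to your own outputDir.
            Path tfVectors = new Path("F:/MAHOUT/TesMahout/clusters/tf-vectors");

            System.out.println("exists: " + fs.exists(tfVectors));
            if (fs.exists(tfVectors)) {
                // List what the previous step actually produced; the DF-counting
                // job later expects a SequenceFile under this tree.
                for (FileStatus status : fs.listStatus(tfVectors)) {
                    System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
                }
            }
        }
    }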
