What does the input folder vectordata contain? I am guessing you gave the top-level directory as input instead of the tfidf-vectors folder.
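If that is the case, something along these lines might be worth trying. This is only a sketch: it assumes the vectors were generated with seq2sparse and that a tfidf-vectors folder sits directly under vectordata, which I cannot confirm from the message. All other options are copied unchanged from the original command.

```shell
# List the input directory first to see what it actually contains
# (a seq2sparse output directory typically holds df-count, tfidf-vectors, etc.).
hadoop fs -ls vectordata

# Assumed layout: point -i at the tfidf-vectors subfolder rather than the
# top-level directory. The subfolder path is a guess, not taken from the thread.
bin/mahout canopy -i vectordata/tfidf-vectors -o output1 \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -ow -t1 6 -t2 2
```

If the listing shows a different layout, adjust the -i path to whichever folder holds the vector sequence files.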

Robin

On Fri, Jul 22, 2011 at 8:33 PM, Liliana Mamani Sanchez <[email protected]> wrote:

> Hello all,
>
> I was trying to run a basic canopy clustering command:
>
> bin/mahout canopy -i vectordata -o output1 -dm org.apache.mahout.common.distance.CosineDistanceMeasure -ow -t1 6 -t2 2
>
> and I get the exception:
>
> Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://localhost:9000/user/hadoop/vectorforum1/df-count/data
>
> Has anyone found something similar? I found a similar exception reported
> by somebody using k-means, but that was caused by a missing -k parameter;
> in this case, I don't know what the mistake is.
>
> cheers
>
> Liliana
>
> On Fri, Jul 22, 2011 at 12:23 PM, Niall Riddell <[email protected]> wrote:
>
> > Hi,
> >
> > I would like to sense-check an approach to near-duplicate detection of
> > documents using Mahout. After some basic research I've implemented a
> > basic proof which works effectively on a small corpus.
> >
> > I have taken the following pre-processing steps:
> >
> > 1) Parse the document
> > 2) Remove unnecessary tokens
> > 3) Split by sentence
> > 4) Create w-shingles from sentence tokens
> > 5) Hash shingles
> > 6) Minhash hashes
> > 7) Jaccard similarity, adjusting for the number of hash functions used
> >    in minhash
> >
> > In order to scale this I will be doing the following:
> >
> > 1) Use M/R for all steps
> > 2) Avoid adding exact duplicate documents to the similarity matrix
> > 3) Construct an (additional) LSH matrix (threshold >= 0.2), splitting
> >    into buckets
> > 4) Split the similarity job by blocks of document keys for each mapper
> > 5) Every document in the minhash matrix gets submitted to every mapper
> > 6) Each mapper queries the LSH matrix to look for candidates for
> >    matching against
> > 7) Each mapper matches against candidates in its block and writes out
> >    a key (docid) and a vector of all similar documents ({docid, score})
> > 8) The reducer then combines the results from each mapper into the
> >    final similarity matrix
> >
> > I've only really used Mahout so far for doing the minhash stuff, but
> > would like an LSH implementation and can't find one. To avoid
> > re-inventing the wheel I was looking for general pointers as to the
> > efficacy of my approach in the first instance, and then any guidance on
> > how best to implement it using the rest of Mahout.
> >
> > I've gone through MIA and felt that the RowSimilarityJob was a
> > possibility; however, I understand that a JIRA has been raised to make
> > this potentially less general, and in its current form it may not
> > match my performance/cost criteria (i.e. high/low).
> >
> > Any help is greatly appreciated.
> >
> > Thanks in advance.
> >
> > Niall
>
> --
> Liliana Paola Mamani Sanchez
