Every time I get this error, it turns out that I forgot to point the input at the tfidf-vectors folder underneath the actual folder (in your case, vectorforum1). So I think your input should be vectordata/tfidf-vectors (or some other subpath).
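As a rough sketch of what I mean, here is your original command with only the input path changed. The tfidf-vectors location is an assumption on my part: the command uses vectordata while the exception mentions vectorforum1, so check which directory actually holds the vectors before running it. INPUT_DIR and CANOPY_CMD are just illustrative variable names.

```shell
# Assumption: the vectors were written to a tfidf-vectors subfolder of the
# directory passed as input; adjust INPUT_DIR if your layout differs.
INPUT_DIR="vectordata/tfidf-vectors"

# Same options as in the original command; only the -i argument is changed.
CANOPY_CMD="bin/mahout canopy -i $INPUT_DIR -o output1 -dm org.apache.mahout.common.distance.CosineDistanceMeasure -ow -t1 6 -t2 2"

# Print the command for review; once the path looks right, run it with:
#   eval "$CANOPY_CMD"
echo "$CANOPY_CMD"
```

Listing the input directory first (for example with hadoop fs -ls) should show whether tfidf-vectors is really there.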

On Jul 22, 2011, at 11:03 AM, Liliana Mamani Sanchez wrote:

> Hello all,
>
> I was trying to run a basic canopy clustering command:
>
> bin/mahout canopy -i vectordata -o output1 -dm
> org.apache.mahout.common.distance.CosineDistanceMeasure -ow -t1 6 -t2 2
>
> and I get the exception:
>
> Exception in thread "main" java.io.FileNotFoundException: File does not
> exist: hdfs://localhost:9000/user/hadoop/vectorforum1/df-count/data
>
> Has anyone found something similar? I found a similar exception reported by
> somebody using k-means, but that was caused by a missing -k parameter; in
> this case, I don't know what the mistake is.
>
> cheers
>
> Liliana
>
> On Fri, Jul 22, 2011 at 12:23 PM, Niall Riddell
> <[email protected]> wrote:
>
>> Hi,
>>
>> I would like to sense-check an approach to near-duplicate detection of
>> documents using Mahout. After some basic research I've implemented a
>> basic proof which works effectively on a small corpus.
>>
>> I have taken the following pre-processing steps:
>>
>> 1) Parse the document
>> 2) Remove unnecessary tokens
>> 3) Split by sentence
>> 4) Create w-shingles from sentence tokens
>> 5) Hash shingles
>> 6) Minhash hashes
>> 7) Jaccard similarity, adjusting for the number of hash functions used
>> in minhash
>>
>> In order to scale this I will be doing the following:
>>
>> 1) Use M/R for all steps
>> 2) Avoid adding exact duplicate documents to the similarity matrix
>> 3) Construct an (additional) LSH matrix (threshold >=0.2), splitting
>> into buckets
>> 4) Split the similarity job by blocks of document keys for each mapper
>> 5) Every document in the minhash matrix gets submitted to every mapper
>> 6) Each mapper queries the LSH matrix to look for candidates to match
>> against
>> 7) Each mapper matches against candidates in its block and writes out
>> a key (docid) and a vector of all similar documents ({docid, score})
>> 8) The reducer then combines the results from each mapper into the
>> final similarity matrix
>>
>> I've only really used Mahout so far for doing the minhash stuff, but
>> would like an LSH implementation and can't find one. To avoid
>> re-inventing the wheel I was looking for general pointers as to the
>> efficacy of my approach in the first instance, and then any guidance on
>> how best to implement it using the rest of Mahout.
>>
>> I've gone through MIA and felt that the rowsimilarityjob was a
>> possibility; however, I understand that a JIRA has been raised to make
>> this potentially less general, and in its current form it may not
>> match my performance/cost criteria (i.e. high/low).
>>
>> Any help is greatly appreciated.
>>
>> Thanks in advance.
>>
>> Niall
>>
>
> --
> Liliana Paola Mamani Sanchez

--------------------------------------------
Grant Ingersoll
