Every time I get this error, it turns out that I forgot to point at the 
tfidf-vectors folder underneath the actual folder (in your case, vectorforum1).  
So I think your input should be vectordata/tfidf-vectors (or some other subpath).
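
As a rough sketch of what I mean, here is your command with only the input path changed. I am assuming the vectors really do sit in a tfidf-vectors subfolder of vectordata, so list the folder first and adjust the path if your layout differs:

```shell
# Assumption: vectordata contains subfolders such as df-count and tfidf-vectors.
# List it to confirm which subfolder actually holds the vectors.
hadoop fs -ls vectordata

# Same canopy command as in the original message, with -i pointed at the
# vectors subfolder instead of the top-level folder.
bin/mahout canopy -i vectordata/tfidf-vectors -o output1 \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure -ow -t1 6 -t2 2
```

Everything apart from the -i value is copied unchanged from your command.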


On Jul 22, 2011, at 11:03 AM, Liliana Mamani Sanchez wrote:

> Hello all,
> 
> I was trying to run a basic canopy clustering command:
> 
> 
> bin/mahout canopy -i vectordata -o output1 -dm
> org.apache.mahout.common.distance.CosineDistanceMeasure -ow -t1 6 -t2 2
> 
> and I get the exception:
> 
> Exception in thread "main" java.io.FileNotFoundException: File does not
> exist: hdfs://localhost:9000/user/hadoop/vectorforum1/df-count/data
> 
> Has anyone run into something similar? I found a similar exception reported
> by somebody using k-means, but that was caused by a missing -k parameter; in
> this case, I don't know what the mistake is.
> 
> cheers
> 
> Liliana
> 
> 
> 
> On Fri, Jul 22, 2011 at 12:23 PM, Niall Riddell 
> <[email protected]> wrote:
> 
>> Hi,
>> 
>> I would like to sense check an approach to near-duplicate detection of
>> documents using Mahout.  After some basic research I've implemented a
>> basic proof which works effectively on a small corpus.
>> 
>> I have taken the following pre-processing steps:
>> 
>> 1) Parse the document
>> 2) Remove unnecessary tokens
>> 3) Split by sentence
>> 4) Create w-shingles from sentence tokens
>> 5) Hash shingles
>> 6) Minhash hashes
>> 7) Jaccard Similarity adjusting for number of hash functions used in
>> minhash
>> 
>> In order to scale this I will be doing the following:
>> 
>> 1) Use M/R for all steps
>> 2) Avoid adding exact duplicate documents to similarity matrix
>> 3) Construct an (additional) LSH matrix (threshold >= 0.2), splitting
>> into buckets
>> 4) Split the similarity job by blocks of document keys for each mapper
>> 5) Every document in the minhash matrix gets submitted to every mapper
>> 6) Each mapper queries the LSH matrix to look for candidates for
>> matching against
>> 7) Each mapper matches against candidates in its block and writes out
>> a key (docid) and a vector of all similar documents ({docid, score})
>> 8) The reducer then combines the results from each mapper into the
>> final similarity matrix
>> 
>> I've only really used Mahout so far for doing the minhash stuff. I
>> would like to use LSH as well but can't find an implementation.  To avoid
>> re-inventing the wheel I was looking for general pointers as to the
>> efficacy of my approach in the first instance and then any guidance on
>> how best to implement it using the rest of Mahout.
>> 
>> I've gone through MIA and felt that the RowSimilarityJob was a
>> possibility; however, I understand that a JIRA has been raised to make
>> this potentially less general, and in its current form it may not
>> match my performance/cost criteria (i.e. high/low).
>> 
>> Any help is greatly appreciated.
>> 
>> Thanks in advance.
>> 
>> Niall
>> 
> 
> 
> 
> -- 
> Liliana Paola Mamani Sanchez

--------------------------------------------
Grant Ingersoll

