Re: df-count/data does not exist

Robin Anil Fri, 22 Jul 2011 08:13:08 -0700

What does the folder input vectordata contain? I am guessing you gave the
top level directory instead of giving the tfidf-vectors folder as input


Robin

On Fri, Jul 22, 2011 at 8:33 PM, Liliana Mamani Sanchez
<[email protected]>wrote:

> Hello all,
>
> I was trying to run a basic canopy clustering command:
>
>
> bin/mahout canopy -i vectordata -o output1 -dm
> org.apache.mahout.common.distance.CosineDistanceMeasure -ow -t1 6 -t2 2
>
> and I get the exception:
>
> Exception in thread "main" java.io.FileNotFoundException: File does not
> exist: hdfs://localhost:9000/user/hadoop/vectorforum1/df-count/data
>
> Has anyone found something similar? I found a similar exception case of
> somebody using k-means but that was lack of a parameter -k , but in this
> case, I don't know what is the mistake.
>
> cheers
>
> Liliana
>
>
>
> On Fri, Jul 22, 2011 at 12:23 PM, Niall Riddell <[email protected]
> >wrote:
>
> > Hi,
> >
> > I would like to sense check an approach to near-duplicate detection of
> > documents using Mahout.  After some basic research I've implemented a
> > basic proof which works effectively on a small corpus.
> >
> > I have taken the following pre-processing steps:
> >
> > 1) Parse the document
> > 2) Remove unnecessary tokens
> > 3) Split by sentence
> > 4) Create w-shingles from sentence tokens
> > 5) Hash shingles
> > 6) Minhash hashes
> > 7) Jaccard Similarity adjusting for number of hash functions used in
> > minhash
> >
> > In order to scale this I will be doing the following:
> >
> > 1) Use M/R for all steps
> > 2) Avoid adding exact duplicate documents to similarity matrix
> > 3) Constructing an (additional) LSH matrix (threshold >=0.2) splitting
> > into buckets
> > 4) Split the similarity job by blocks of document keys for each mapper
> > 5) Every document in the minhash matrix gets submitted to every mapper
> > 6) Each mapper queries the LSH matrix to look for candidates for
> > matching against
> > 7) Each mapper matches against candidates in it's block and writes out
> > a key (docid) and a vector of all similar documents ({docid, score})
> > 8) The reducer then combines the results from each mapper into the
> > final similarity matrix
> >
> > I've only really used Mahout so far for doing the minhash stuff but
> > would like and can't find an LSH implementation.  To avoid
> > re-inventing the wheel I was looking for general pointers as to the
> > efficacy of my approach in the first instance and then any guidance on
> > how best to implement using the rest of mahout.
> >
> > I've gone through MIA and felt the the rowsimilarityjob was a
> > possibility, however I understand that a JIRA has been raised to make
> > this potentially less general and in it's current form it may not
> > match my performance/cost criteria (i.e. high/low).
> >
> > Any help is greatly appreciated.
> >
> > Thanks in advance.
> >
> > Niall
> >
>
>
>
> --
> Liliana Paola Mamani Sanchez
>

Re: df-count/data does not exist

Reply via email to