Re: df-count/data does not exist

Jeff Hansen Tue, 16 Aug 2011 10:34:20 -0700

I've been getting this exception a lot as well.  I've been going through
some of the examples in the Mahout in Action book, and I often get errors
when I follow the instructions word for word.  Some are due to typos in the
book (there seem to be a few sections where a script was updated after code
changes but the paragraph describing it wasn't, and vice versa); others are
because I'm using HEAD (Mahout 0.6 instead of 0.4) and the API has changed
since the 0.4 version that the book is based on.

A few culprits I've noticed:
1. dictionary-file-* vs dictionary.file-0 -- page 138 of my copy references
reuters-vectors/dictionary-file-* (the other four scripts in the book that
reference this file correctly point to dictionary.file-0 -- notice the dot
versus the hyphen in the file name).
2. Most of the classes that accept a -d dictionary file option will throw a
"FileNotFoundException: no such file or directory" for the dictionary file
if you fail to also specify -dt sequencefile.
3. File vs folder -- the clusterdump utility seems to be OK with being given
a file OR a folder, but vectordump and seqdumper seem to want an actual file.
I'm not sure what you're supposed to do in the case where multiple reducers
ran and your output is spread across part-r-00000, part-r-00001,
part-r-00002...
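
For the multi-reducer case in point 3, one possible workaround is to run the
dump tool once per part file.  The following is only a sketch of that idea,
not something confirmed against Mahout: the helper names are hypothetical, it
only looks at a local directory (on HDFS you would need to list the files some
other way, e.g. with hadoop fs -ls), and the input flag is passed in by the
caller because it is an assumption that may differ between Mahout versions.

```python
from pathlib import Path


def list_part_files(output_dir):
    """Return the reducer output files (part-*) in a local directory, sorted by name."""
    return sorted(p for p in Path(output_dir).glob("part-*") if p.is_file())


def per_file_commands(tool, input_flag, output_dir):
    """Build one bin/mahout argument list per part file.

    tool is the utility name (e.g. "seqdumper"); input_flag is whatever
    option your Mahout version uses for the input path (an assumption
    supplied by the caller, not verified here).
    """
    return [
        ["bin/mahout", tool, input_flag, str(part)]
        for part in list_part_files(output_dir)
    ]
```

Calling per_file_commands("seqdumper", "-s", "output1") would give one
argument list per part file, which could then be printed or handed to
subprocess.run; the "-s" flag there is a guess to be checked against the
tool's own usage output.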

I've just recreated all of the issues above to make sure I had the details
correct, but I know there were a number of other times over the last few
days that I hit this error.  At first I gave up and moved on, but now that
I've seen it so much I know not to pay too much attention to the details of
the error message, because it's usually a red herring that's just a sign I
missed something else.  It would be nice if the error messages, short of
giving you helpful information, at least didn't give you misleading
information...

On Tue, Jul 26, 2011 at 6:17 AM, Grant Ingersoll <[email protected]> wrote:

> Every time I get this it tells me that I forgot to use the tfidf-vectors
> folder underneath the actual folder, in your case vectorforum1.  Thus, I
> think your input should be vectordata/tfidf-vectors (or some other subpath)
>
>
> On Jul 22, 2011, at 11:03 AM, Liliana Mamani Sanchez wrote:
>
> > Hello all,
> >
> > I was trying to run a basic canopy clustering command:
> >
> >
> > bin/mahout canopy -i vectordata -o output1 -dm
> > org.apache.mahout.common.distance.CosineDistanceMeasure -ow -t1 6 -t2 2
> >
> > and I get the exception:
> >
> > Exception in thread "main" java.io.FileNotFoundException: File does not
> > exist: hdfs://localhost:9000/user/hadoop/vectorforum1/df-count/data
> >
> > Has anyone found something similar? I found a similar exception reported
> > by somebody using k-means, but that was caused by a missing -k parameter;
> > in this case, I don't know what the mistake is.
> >
> > cheers
> >
> > Liliana
> >
> >
> >
> > On Fri, Jul 22, 2011 at 12:23 PM, Niall Riddell <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> I would like to sense check an approach to near-duplicate detection of
> >> documents using Mahout.  After some basic research I've implemented a
> >> basic proof which works effectively on a small corpus.
> >>
> >> I have taken the following pre-processing steps:
> >>
> >> 1) Parse the document
> >> 2) Remove unnecessary tokens
> >> 3) Split by sentence
> >> 4) Create w-shingles from sentence tokens
> >> 5) Hash shingles
> >> 6) Minhash hashes
> >> 7) Jaccard Similarity adjusting for number of hash functions used in
> >> minhash
> >>
> >> In order to scale this I will be doing the following:
> >>
> >> 1) Use M/R for all steps
> >> 2) Avoid adding exact duplicate documents to similarity matrix
> >> 3) Constructing an (additional) LSH matrix (threshold >=0.2) splitting
> >> into buckets
> >> 4) Split the similarity job by blocks of document keys for each mapper
> >> 5) Every document in the minhash matrix gets submitted to every mapper
> >> 6) Each mapper queries the LSH matrix to look for candidates for
> >> matching against
> >> 7) Each mapper matches against candidates in its block and writes out
> >> a key (docid) and a vector of all similar documents ({docid, score})
> >> 8) The reducer then combines the results from each mapper into the
> >> final similarity matrix
> >>
> >> I've only really used Mahout so far for doing the minhash stuff.  I
> >> would like an LSH implementation but can't find one.  To avoid
> >> re-inventing the wheel I was looking for general pointers as to the
> >> efficacy of my approach in the first instance and then any guidance on
> >> how best to implement it using the rest of Mahout.
> >>
> >> I've gone through MIA and felt that the rowsimilarityjob was a
> >> possibility, however I understand that a JIRA has been raised to make
> >> this potentially less general and in its current form it may not
> >> match my performance/cost criteria (i.e. high/low).
> >>
> >> Any help is greatly appreciated.
> >>
> >> Thanks in advance.
> >>
> >> Niall
> >>
> >
> >
> >
> > --
> > Liliana Paola Mamani Sanchez
>
> --------------------------------------------
> Grant Ingersoll
>
>
>
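
For reference, the pre-processing steps 4) to 7) in Niall's quoted message
(w-shingles, hashing, minhash, Jaccard estimate), plus the bucketing idea from
his LSH step, can be sketched roughly as below.  This is a minimal single-machine
illustration, not Mahout's implementation and not necessarily what Niall built:
all function names are hypothetical, md5 and the linear hash family are
arbitrary choices, and "adjusting for the number of hash functions" is taken
to mean dividing the number of matching signature positions by the signature
length.

```python
import hashlib
import random

PRIME = 4294967311  # a prime just above 2**32, used as the hash modulus


def shingles(tokens, w):
    """Return the set of w-shingles (contiguous runs of w tokens)."""
    return {" ".join(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def hash_shingle(shingle):
    """Map a shingle to a stable 32-bit integer (md5 is an arbitrary choice)."""
    digest = hashlib.md5(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for the hash family h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(hashed, params):
    """One minimum per hash function; raises ValueError if hashed is empty."""
    return [min((a * h + b) % PRIME for h in hashed) for a, b in params]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two documents agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def lsh_band_keys(signature, rows_per_band):
    """Split a signature into bands; each (band index, values) pair is a bucket key."""
    return [(i // rows_per_band, tuple(signature[i:i + rows_per_band]))
            for i in range(0, len(signature), rows_per_band)]
```

Two documents would be treated as candidates when they share at least one
bucket key, so the exact comparison only has to run on candidate pairs rather
than on every pair.  The number of bands and rows per band controls roughly
where the similarity threshold falls, and would need tuning for a target such
as the 0.2 mentioned above.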
