(Since it's specifically about the book, might be better to post in the
Manning forums.)

The final version, which is a fair bit more up-to-date than the MEAP
version, is synced with 0.5. It was re-read by a technical proofreader to
make sure it all works, so I imagine most of this has been addressed.

I don't know about the file-vs-folder question though. Those might just be
niggles specific to those classes, which could be patched.
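
For the multiple-reducer case in point 3 below, one workaround until those
tools accept a folder is to run them once per part file. A minimal sketch in
Python, assuming the output directory is on the local filesystem (for HDFS
you would list it with hadoop fs -ls instead). The names part_files and
per_file_commands and the command template are made up for illustration;
they are not part of Mahout.

```python
import re
from pathlib import Path

# Matches Hadoop-style output names such as part-r-00000, part-m-00003, part-00000.
PART_FILE = re.compile(r"^part-(?:[mr]-)?\d{5}$")


def part_files(output_dir):
    """Return the part-* files directly under output_dir, sorted by name."""
    return sorted(
        p for p in Path(output_dir).iterdir()
        if p.is_file() and PART_FILE.match(p.name)
    )


def per_file_commands(output_dir, command_template):
    """Fill a caller-supplied template once per part file.

    command_template is hypothetical: it must contain "{path}" and should
    carry whatever input flag your Mahout version's tool expects.
    """
    return [command_template.format(path=p) for p in part_files(output_dir)]
```

Calling per_file_commands(some_dir, "your-dump-command {path}") gives one
command line per part file, which you can then run in a loop. Files such as
_SUCCESS or _logs are skipped because they do not match the pattern.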

On Tue, Aug 16, 2011 at 6:33 PM, Jeff Hansen <[email protected]> wrote:

> I've been getting this exception a lot as well.  I've been going through
> some of the examples in the Mahout in Action book, and I get errors a lot
> when I follow the instructions word for word -- either due to typos in the
> book (it seems like there were a few sections where a script was updated
> due to code changes, but the paragraph describing the script wasn't, and
> vice versa) or due to the fact that I'm using HEAD (Mahout 0.6 instead of
> 0.4) and the API has changed since the 0.4 version that the book is based
> on.
>
> A few culprits I've noticed:
> 1. dictionary-file-* vs dictionary.file-0 -- page 138 of my copy references
> reuters-vectors/dictionary-file-* (the other four scripts in the book that
> reference this file correctly point to dictionary.file-0 -- notice the dot
> versus the hyphen in the file name).
> 2. Most of the classes that accept a -d dictionary file option will return
> a "FileNotFoundException: no such file or directory" for the dictionary
> file if you fail to also specify -dt sequencefile.
> 3. File vs Folder -- the clusterdump utility seems to be OK with being given
> a file OR a folder, but vectordump and seqdumper seem to want an actual
> file.  I'm not sure what you're supposed to do in the case where multiple
> reducers ran and you have your output spread across part-r-00000,
> part-r-00001, part-r-00002...
>
> I've just recreated all of the issues above to make sure I had the details
> correct, but I know there were a number of other times over the last few
> days that I hit this error.  At first I gave up and moved on, but now that
> I've seen it so much I know not to pay too much attention to the details of
> the error message because it's usually a red herring that's just a sign I
> missed something else -- it would be nice if, short of giving you helpful
> information, the error messages at least didn't give you misleading
> information...
>
>
>
> On Tue, Jul 26, 2011 at 6:17 AM, Grant Ingersoll <[email protected]> wrote:
>
> > Every time I get this it tells me that I forgot to use the tfidf-vectors
> > folder underneath the actual folder, in your case vectorforum1.  Thus, I
> > think your input should be vectordata/tfidf-vectors (or some other
> > subpath).
> >
> >
> > On Jul 22, 2011, at 11:03 AM, Liliana Mamani Sanchez wrote:
> >
> > > Hello all,
> > >
> > > I was trying to run a basic canopy clustering command:
> > >
> > >
> > > bin/mahout canopy -i vectordata -o output1 -dm
> > > org.apache.mahout.common.distance.CosineDistanceMeasure -ow -t1 6 -t2 2
> > >
> > > and I get the exception:
> > >
> > > Exception in thread "main" java.io.FileNotFoundException: File does not
> > > exist: hdfs://localhost:9000/user/hadoop/vectorforum1/df-count/data
> > >
> > > Has anyone found something similar? I found a similar exception reported
> > > by somebody using k-means, but that was caused by a missing -k parameter;
> > > in this case, I don't know what the mistake is.
> > >
> > > cheers
> > >
> > > Liliana
> > >
> > >
> > >
> > > On Fri, Jul 22, 2011 at 12:23 PM, Niall Riddell <[email protected]> wrote:
> > >
> > >> Hi,
> > >>
> > >> I would like to sense check an approach to near-duplicate detection of
> > >> documents using Mahout.  After some basic research I've implemented a
> > >> basic proof which works effectively on a small corpus.
> > >>
> > >> I have taken the following pre-processing steps:
> > >>
> > >> 1) Parse the document
> > >> 2) Remove unnecessary tokens
> > >> 3) Split by sentence
> > >> 4) Create w-shingles from sentence tokens
> > >> 5) Hash shingles
> > >> 6) Minhash hashes
> > >> 7) Jaccard Similarity adjusting for number of hash functions used in
> > >> minhash
> > >>
> > >> In order to scale this I will be doing the following:
> > >>
> > >> 1) Use M/R for all steps
> > >> 2) Avoid adding exact duplicate documents to similarity matrix
> > >> 3) Construct an (additional) LSH matrix (threshold >= 0.2), splitting
> > >> into buckets
> > >> 4) Split the similarity job by blocks of document keys for each mapper
> > >> 5) Every document in the minhash matrix gets submitted to every mapper
> > >> 6) Each mapper queries the LSH matrix to look for candidates for
> > >> matching against
> > >> 7) Each mapper matches against candidates in its block and writes out
> > >> a key (docid) and a vector of all similar documents ({docid, score})
> > >> 8) The reducer then combines the results from each mapper into the
> > >> final similarity matrix
> > >>
> > >> I've only really used Mahout so far for doing the minhash stuff.  I
> > >> would like an LSH implementation but can't find one.  To avoid
> > >> re-inventing the wheel I was looking for general pointers as to the
> > >> efficacy of my approach in the first instance and then any guidance on
> > >> how best to implement it using the rest of Mahout.
> > >>
> > >> I've gone through MIA and felt that the RowSimilarityJob was a
> > >> possibility, however I understand that a JIRA has been raised to make
> > >> this potentially less general and in its current form it may not
> > >> match my performance/cost criteria (i.e. high/low).
> > >>
> > >> Any help is greatly appreciated.
> > >>
> > >> Thanks in advance.
> > >>
> > >> Niall
> > >>
> > >
> > >
> > >
> > > --
> > > Liliana Paola Mamani Sanchez
> >
> > --------------------------------------------
> > Grant Ingersoll
> >
> >
> >
>
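
For reference, the shingle / hash / minhash / Jaccard steps in Niall's
original message can be sketched in a few lines of plain Python. This is a
single-machine illustration, not Mahout's minhash implementation: the
MD5-based shingle hash, the random linear hash family, and all function
names are assumptions chosen for the example.

```python
import hashlib
import random

PRIME = 4294967311  # smallest prime above 2**32, used as the hash modulus


def shingles(tokens, w=3):
    """Return the set of w-shingles (contiguous windows of w tokens)."""
    return {" ".join(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def hash_shingle(shingle):
    """Map a shingle to a stable 32-bit integer (first 4 bytes of its MD5)."""
    return int.from_bytes(hashlib.md5(shingle.encode("utf-8")).digest()[:4], "big")


def make_hash_params(num_hashes, seed=42):
    """Draw (a, b) pairs for the hash family h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(hashed_shingles, params):
    """For each hash function, keep the minimum value over all shingles."""
    if not hashed_shingles:
        raise ValueError("document has no shingles (fewer than w tokens?)")
    return [min((a * h + b) % PRIME for h in hashed_shingles) for a, b in params]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions that agree, i.e. matches / number of hash functions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

To compare two documents, build one parameter list with make_hash_params,
compute a signature for each document from {hash_shingle(s) for s in
shingles(tokens)}, and pass both signatures to estimated_jaccard. Both
documents must use the same parameter list, and more hash functions give a
tighter estimate at the cost of longer signatures.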
