Hello Jeff,

The final version of the book should contain correct command-line examples; they were all checked for correctness. The actual source code for the book is at https://github.com/tdunning/MiA. It was checked against 0.5, but should also work with HEAD (if it doesn't work with HEAD, please write to me and I'll try to fix it).

On Tue, Aug 16, 2011 at 7:33 PM, Jeff Hansen <[email protected]> wrote:

> I've been getting this exception a lot as well. I've been going through
> some of the examples in the Mahout in Action book, and I get errors a lot when I
> follow the instructions word for word -- either due to typos in the book (it
> seems like there were a few sections where a script was updated due to code
> changes, but the paragraph describing the script wasn't, and vice versa) or
> due to the fact that I'm using HEAD (Mahout 0.6 instead of 0.4) and the API
> has changed since the 0.4 version that the book is based on.
>
> A few culprits I've noticed:
> 1. dictionary-file-* vs dictionary.file-0 -- page 138 of my copy references
> reuters-vectors/dictionary-file-* (the other four scripts in the book that
> reference this file correctly point to dictionary.file-0 -- notice the dot
> versus the hyphen in the file name).
> 2. Most of the classes that accept a -d dictionary file option will return a
> "FileNotFoundException: no such file or directory" for the dictionary file
> if you fail to also specify -dt sequencefile
> 3. File vs Folder -- the clusterdump utility seems to be ok with providing a
> file OR a folder, but vectordump and seqdumper seem to want an actual file.
> I'm not sure what you're supposed to do in the case where multiple reducers
> ran and you have your output spread across part-r-00000, part-r-00001,
> part-r-00002...
>
> I've just recreated all of the issues above to make sure I had the details
> correct, but I know there were a number of other times over the last few
> days that I hit this error. At first I gave up and moved on, but now that
> I've seen it so much I know not to pay too much attention to the details of
> the error message, because it's usually a red herring that's just a sign I
> missed something else -- it would be nice if, short of giving you helpful
> information, the error messages at least didn't send you misleading
> information...
>
>
>
> On Tue, Jul 26, 2011 at 6:17 AM, Grant Ingersoll <[email protected]> wrote:
>
>> Every time I get this it tells me that I forgot to use the tfidf-vectors
>> folder underneath the actual folder, in your case vectorforum1. Thus, I
>> think your input should be vectordata/tfidf-vectors (or some other subpath)
>>
>>
>> On Jul 22, 2011, at 11:03 AM, Liliana Mamani Sanchez wrote:
>>
>> > Hello all,
>> >
>> > I was trying to run a basic canopy clustering command:
>> >
>> >
>> > bin/mahout canopy -i vectordata -o output1 -dm
>> > org.apache.mahout.common.distance.CosineDistanceMeasure -ow -t1 6 -t2 2
>> >
>> > and I get the exception:
>> >
>> > Exception in thread "main" java.io.FileNotFoundException: File does not
>> > exist: hdfs://localhost:9000/user/hadoop/vectorforum1/df-count/data
>> >
>> > Has anyone found something similar? I found a similar exception case of
>> > somebody using k-means but that was lack of a parameter -k , but in this
>> > case, I don't know what is the mistake.
>> >
>> > cheers
>> >
>> > Liliana
>> >
>> >
>> >
>> > On Fri, Jul 22, 2011 at 12:23 PM, Niall Riddell <[email protected]> wrote:
>> >
>> >> Hi,
>> >>
>> >> I would like to sense check an approach to near-duplicate detection of
>> >> documents using Mahout. After some basic research I've implemented a
>> >> basic proof which works effectively on a small corpus.
>> >>
>> >> I have taken the following pre-processing steps:
>> >>
>> >> 1) Parse the document
>> >> 2) Remove unnecessary tokens
>> >> 3) Split by sentence
>> >> 4) Create w-shingles from sentence tokens
>> >> 5) Hash shingles
>> >> 6) Minhash hashes
>> >> 7) Jaccard Similarity adjusting for number of hash functions used in
>> >> minhash
>> >>
>> >> In order to scale this I will be doing the following:
>> >>
>> >> 1) Use M/R for all steps
>> >> 2) Avoid adding exact duplicate documents to similarity matrix
>> >> 3) Constructing an (additional) LSH matrix (threshold >=0.2) splitting
>> >> into buckets
>> >> 4) Split the similarity job by blocks of document keys for each mapper
>> >> 5) Every document in the minhash matrix gets submitted to every mapper
>> >> 6) Each mapper queries the LSH matrix to look for candidates for
>> >> matching against
>> >> 7) Each mapper matches against candidates in its block and writes out
>> >> a key (docid) and a vector of all similar documents ({docid, score})
>> >> 8) The reducer then combines the results from each mapper into the
>> >> final similarity matrix
>> >>
>> >> I've only really used Mahout so far for doing the minhash stuff but
>> >> would like, and can't find, an LSH implementation. To avoid
>> >> re-inventing the wheel I was looking for general pointers as to the
>> >> efficacy of my approach in the first instance and then any guidance on
>> >> how best to implement using the rest of Mahout.
>> >>
>> >> I've gone through MIA and felt that the rowsimilarityjob was a
>> >> possibility, however I understand that a JIRA has been raised to make
>> >> this potentially less general and in its current form it may not
>> >> match my performance/cost criteria (i.e. high/low).
>> >>
>> >> Any help is greatly appreciated.
>> >>
>> >> Thanks in advance.
>> >>
>> >> Niall
>> >>
>> >
>> >
>> >
>> > --
>> > Liliana Paola Mamani Sanchez
>>
>> --------------------------------------------
>> Grant Ingersoll
>>
>>
>

--
With best wishes, Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
Skype: alex.ott
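
For reference, the pre-processing steps Niall lists (w-shingles from sentence tokens, hashing, minhash, and a Jaccard estimate taken over the number of hash functions) plus a banded LSH bucketing step can be sketched roughly as below. This is plain single-machine Python for illustration only, not Mahout code and not Niall's implementation. Every function name, the 3-token shingle width, the 64-hash signature length, and the 16-band split are assumptions that do not come from the thread.

```python
import hashlib
import re

NUM_HASHES = 64  # assumed signature length; the thread does not give one


def shingles(sentence, w=3):
    """Return the set of w-shingles (runs of w consecutive tokens)."""
    tokens = re.findall(r"\w+", sentence.lower())
    if len(tokens) < w:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def hash_shingle(shingle, seed):
    """Seeded 64-bit hash of one shingle (blake2b salted with the seed)."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set, num_hashes=NUM_HASHES):
    """For each seeded hash function, keep the minimum hash over the set."""
    if not shingle_set:
        raise ValueError("cannot minhash an empty shingle set")
    return [
        min(hash_shingle(s, seed) for s in shingle_set)
        for seed in range(num_hashes)
    ]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree (both must be same length)."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def lsh_buckets(signature, bands=16):
    """Split a signature into bands; each (band index, band values) is a bucket key."""
    rows = len(signature) // bands
    return [
        (b, tuple(signature[b * rows:(b + 1) * rows]))
        for b in range(bands)
    ]


def is_candidate(sig_a, sig_b, bands=16):
    """Two documents are candidates if they share at least one bucket key."""
    return bool(set(lsh_buckets(sig_a, bands)) & set(lsh_buckets(sig_b, bands)))


if __name__ == "__main__":
    a = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"))
    b = minhash_signature(shingles("the quick brown fox jumped over the lazy dog"))
    print(estimated_jaccard(a, b), is_candidate(a, b))
```

With 64 hashes in 16 bands, each band has 4 rows. Fewer rows per band makes the bucketing more permissive, which is the direction a low threshold such as the 0.2 mentioned above would need, at the cost of more candidate pairs to verify. The right split depends on the corpus and would have to be tuned.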
