Hi Jake,

It's a great idea indeed. However, I'm new to Mahout; could you give me some pointers as to where to publish this guide, and maybe point me to a well-formed existing guide that I could use as an example?

Thank you!
Jeremie

2012/11/16 Jake Mannix <[email protected]>

> I'm glad to hear it's working better now! We should take the results of
> getting this working and turn it into a step-by-step guide for new users;
> I'm sure others could find it useful!
>
>
> On Fri, Nov 16, 2012 at 9:55 AM, Jérémie Gomez <[email protected]> wrote:
>
> > Hello Jake,
> >
> > Thank you very much for these interesting pointers: the problem is fixed!
> >
> > The problem was indeed that the -sort argument for cvb is broken in 0.7.
> > I built from the trunk, and cvb works well. As you suggested, I have run
> > cvb with 20 and 30 iterations, and the result is quite interesting.
> >
> > Thanks again for your suggestions, they helped a lot!
> > Jeremie
> >
> > 2012/11/15 Jake Mannix <[email protected]>
> >
> > > On Thu, Nov 15, 2012 at 3:20 AM, Jérémie Gomez <[email protected]> wrote:
> > >
> > > > Thanks a lot Jake,
> > > >
> > > > I have tried using the vectordump job to retrieve the topics in text
> > > > format, and obtained a text document stating all the terms in the
> > > > dictionary file and numerical values, which I could not successfully
> > > > interpret. My commands were the following:
> > > >
> > > > 1. bin/mahout cvb -inputdir/matrix -o cvboutput -k 20 -x 10 -dict
> > > > seq2sparseoutput/dictionary.file-0 -dt topicdistrib -mt temp/model-1
> > > >
> > > > 2. bin/mahout vectordump -i cvboutput -o termtopics -d
> > > > seq2sparseoutput/dictionary.file-0 --dictionaryType sequencefile
> > > > --vectorSize 5
> > > >
> > > >
> > > > I'm guessing this might be due to the lack of "-sort" command,
> > > >
> > >
> > > Yeah, you won't be able to interpret *at all* without sort - you'll just
> > > get the first few terms for the topic, in no order at all (i.e. maybe ones
> > > which are not likely in that topic at all, but have probability > 0).
> > >
> > > Another thing: you're using temp/model-1 - sounds like you're looking
> > > at your *first* iteration of the output? That's nowhere near convergence,
> > > and your topics will look like garbage - you need to take at least
> > > iteration 10 or 20 to see some good topics.
> > >
> > > > but I can't
> > > > use the -sort command because of a heap memory problem that I can't fix
> > > > by changing the MAHOUT_HEAPSIZE variable, and I get that heap memory
> > > > problem even though I am running the cvb test on a 1.3 MB dataset...
> > > >
> > >
> > > So are you running on trunk? I think -sort was broken in the last release,
> > > but has been fixed for a few months now on subversion trunk.
> > >
> > > >
> > > > Thank you!
> > > >
> > > >
> > > > 2012/11/14 Jake Mannix <[email protected]>
> > > >
> > > > > Clusterdump doesn't work on LDA output, as LDA doesn't produce "cluster"
> > > > > objects.
> > > > >
> > > > > If you want to look at the topics for CVB, use vectordump:
> > > > >
> > > > > mahout vectordump -s <path to topics sequence file> --dictionary <path to
> > > > > dictionary.file-0> --dictionaryType seqfile --vectorSize <num entries
> > > > > per topic you want to see> -sort
> > > > >
> > > > > On Wed, Nov 14, 2012 at 10:22 AM, Jérémie Gomez <[email protected]> wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > I have tried several of the clustering algorithms in mahout and they
> > > > > > worked great, but I have a problem with the cvb implementation of Latent
> > > > > > Dirichlet Allocation.
> > > > > > The cvb command works fine but then using clusterdump gives me
> > > > > > the following error:
> > > > > >
> > > > > > Exception in thread "main" java.lang.ClassCastException:
> > > > > > org.apache.mahout.math.VectorWritable cannot be cast to
> > > > > > org.apache.mahout.clustering.iterator.ClusterWritable
> > > > > >
> > > > > > What I do in detail:
> > > > > > 1) mahout seqdirectory -c UTF-8 -i inputdir -o sequencefiles
> > > > > > 2) mahout seq2sparse -i sequencefiles -o sparsevectors -ow -a
> > > > > > org.apache.lucene.analysis.WhitespaceAnalyzer -x 99 -wt tfidf -s 5 -md 1 -x
> > > > > > 90 -ng 2 -ml 50 -seq -n 2
> > > > > > 3) mahout rowid -i sparsevectors/tf-vectors -o rowidresult
> > > > > > 4) mahout cvb -i rowresult/matrix -dict
> > > > > > sparsevectors/dictionary.file-0 -o topics -dt documents -mt states -ow -k
> > > > > > 10
> > > > > > 5) mahout clusterdump -i topics -o clusters -of TEXT -n 10 -d
> > > > > > marcelproust/dictionary.file-0 -dt sequencefile
> > > > > >
> > > > > > When I run command 5, I get the error above. Unfortunately, I could not
> > > > > > find any working solution after searching the archives, so I thought I'd
> > > > > > ask the community!
> > > > > >
> > > > > > Thanks a lot in advance.
> > > > > > Jeremie
> > > > > >
> > > > >
> > > > > --
> > > > >
> > > > > -jake
> > > >
> > >
> > > --
> > >
> > > -jake
> >
>
> --
>
> -jake
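
As a possible starting point for the step-by-step guide, here is a small sketch for the "don't look at model-1" advice in the thread. It assumes that the model temp directory passed to cvb with -mt ends up holding one subdirectory per iteration named model-N (this is what the temp/model-1 remark suggests, but it has not been checked here), and that the directory is on the local filesystem. The function name latest_model is made up for this example.

```shell
#!/bin/sh
# latest_model DIR
# Print the model-N subdirectory of DIR with the largest N.
# Assumption: iterations are stored as DIR/model-1, DIR/model-2, ...
# The comparison is numeric, so model-10 ranks above model-2
# (a plain alphabetical sort would get this wrong).
latest_model() {
    dir=$1
    best=""
    best_n=-1
    for path in "$dir"/model-*; do
        # Skip anything that is not a directory (including an unmatched glob).
        [ -d "$path" ] || continue
        n=${path##*model-}
        # Skip names whose suffix is empty or not purely numeric.
        case $n in
            ''|*[!0-9]*) continue ;;
        esac
        if [ "$n" -gt "$best_n" ]; then
            best_n=$n
            best=$path
        fi
    done
    # Print the winner; return non-zero if no model-N directory was found.
    [ -n "$best" ] && printf '%s\n' "$best"
}
```

Calling latest_model temp would print the highest-numbered iteration directory under temp, which could then be given to vectordump instead of temp/model-1. If the models live on HDFS, the listing would have to come from hadoop fs -ls rather than a shell glob.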
