Thanks a lot Jake,

I tried using the vectordump job to retrieve the topics in text format, and obtained a text document listing all the terms in the dictionary file along with numerical values, which I could not interpret. My commands were the following:

1. bin/mahout cvb -inputdir/matrix -o cvboutput -k 20 -x 10 -dict seq2sparseoutput/dictionary.file-0 -dt topicdistrib -mt temp/model-1
2. bin/mahout bin/mahout vectordump -i cvboutput -o termtopics -d seq2sparseoutput/dictionary.file-0 --dictionaryType sequencefile --vectorSize 5

I'm guessing this might be due to the missing "-sort" option, but I can't use -sort because of a heap memory problem that I can't fix by changing the MAHOUT_HEAPSIZE variable. I get that heap memory problem even though I am running the cvb test on a 1.3 MB dataset...

Thank you!

2012/11/14 Jake Mannix <[email protected]>

> Clusterdump doesn't work on LDA output, as LDA doesn't produce "cluster"
> objects.
>
> If you want to look at the topics for CVB, use vectordump:
>
>   mahout vectordump -s <path to topics sequence file> --dictionary <path to dictionary.file-0> --dictionaryType seqfile --vectorSize <num entries per topic you want to see> -sort
>
> On Wed, Nov 14, 2012 at 10:22 AM, Jérémie Gomez <[email protected]> wrote:
>
> > Hi everyone,
> >
> > I have tried several of the clustering algorithms in mahout and they
> > worked great, but I have a problem with the cvb implementation of
> > Latent Dirichlet Allocation. The cvb command works fine, but then
> > using clusterdump gives me the following error:
> >
> > Exception in thread "main" java.lang.ClassCastException:
> > org.apache.mahout.math.VectorWritable cannot be cast to
> > org.apache.mahout.clustering.iterator.ClusterWritable
> >
> > What I do in detail:
> > 1) mahout seqdirectory -c UTF-8 -i inputdir -o sequencefiles
> > 2) mahout seq2sparse -i sequencefiles -o sparsevectors -ow -a org.apache.lucene.analysis.WhitespaceAnalyzer -x 99 -wt tfidf -s 5 -md 1 -x 90 -ng 2 -ml 50 -seq -n 2
> > 3) mahout rowid -i sparsevectors/tf-vectors -o rowidresult
> > 4) mahout mahout cvb -i rowresult/matrix -dict sparsevectors/dictionary.file-0 -o topics -dt documents -mt states -ow -k 10
> > 5) mahout clusterdump -i topics -o clusters -of TEXT -n 10 -d marcelproust/dictionary.file-0 -dt sequencefile
> >
> > When I run command 5, I get the error above. Unfortunately, I could not
> > find any working solution after searching the archives, so I thought
> > I'd ask the community!
> >
> > Thanks a lot in advance.
> > Jeremie
>
> --
>
>   -jake
