Re: Command line : Error using clusterdump after cvb (0.7)

Jérémie Gomez Thu, 15 Nov 2012 03:21:15 -0800

Thanks a lot Jake,

I have tried using the vectordump job to retrieve the topics in text
format, and obtained a text document stating all the terms in the
dictionary file and numerical values, which I could not successfully
interpret. My commands were the following:


1. bin/mahout cvb -i inputdir/matrix -o cvboutput -k 20 -x 10 -dict
seq2sparseoutput/dictionary.file-0 -dt topicdistrib -mt temp/model-1

2. bin/mahout vectordump -i cvboutput -o termtopics -d
seq2sparseoutput/dictionary.file-0 --dictionaryType sequencefile
--vectorSize 5


I'm guessing this is because I left out the "-sort" option, but I can't
use -sort because of a heap memory problem that changing the
MAHOUT_HEAPSIZE variable does not fix, and I get that heap memory problem
even though I am running the cvb test on a 1.3 MB dataset...

Thank you!
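
One possible workaround for the -sort heap problem is to sort the dumped
text outside Mahout. The sketch below is only an illustration and rests on
assumptions not confirmed in this thread: that each line of the vectordump
text output holds one topic in the form {term:weight,term:weight,...},
optionally preceded by a key and a tab, and that a larger weight means the
term matters more in that topic. The function names are hypothetical.

```python
import heapq


def parse_topic_line(line):
    """Parse one dumped vector into a list of (term, weight) pairs.

    ASSUMPTION: the line looks like `{term:weight,term:weight,...}`,
    optionally preceded by a key and a tab. Entries that do not fit
    this shape are skipped rather than raising an error.
    """
    body = line.strip()
    if "\t" in body:
        # Drop an optional "key<TAB>" prefix.
        body = body.split("\t", 1)[1]
    body = body.strip().lstrip("{").rstrip("}")
    pairs = []
    for entry in body.split(","):
        # Split on the LAST colon so terms containing ':' survive.
        term, sep, weight = entry.rpartition(":")
        if not sep:
            continue
        try:
            pairs.append((term.strip(), float(weight)))
        except ValueError:
            continue
    return pairs


def top_terms(line, n=10):
    """Return the n highest-weighted (term, weight) pairs of one topic."""
    return heapq.nlargest(n, parse_topic_line(line), key=lambda p: p[1])


def print_top_terms(path, n=10):
    """Print the top n terms for every non-empty line (topic) in a file."""
    with open(path, encoding="utf-8") as dump:
        for topic_id, line in enumerate(dump):
            if line.strip():
                terms = ", ".join(t for t, _ in top_terms(line, n))
                print(f"topic {topic_id}: {terms}")
```

With the commands above, calling print_top_terms("termtopics", 5) would
list five terms per topic, provided the dump was written without
--vectorSize so that every term is present and the format assumption holds.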


2012/11/14 Jake Mannix <[email protected]>

> Clusterdump doesn't work on LDA output, as LDA doesn't produce "cluster"
> objects.
>
> If you want to look at the topics for CVB, use vectordump:
>
>
> mahout vectordump -s <path to topics sequence file> --dictionary <path to
> dictionary.file-0> --dictionaryType seqfile --vectorSize <num entries per
> topic you want to see> -sort
>
>
>
> On Wed, Nov 14, 2012 at 10:22 AM, Jérémie Gomez <[email protected]>
> wrote:
>
> > Hi everyone,
> >
> > I have tried several of the clustering algorithms in mahout and they
> > worked great, but I have a problem with the cvb implementation of Latent
> > Dirichlet Allocation. The cvb command works fine but then using
> > clusterdump gives me the following error:
> >
> > Exception in thread "main" java.lang.ClassCastException:
> > org.apache.mahout.math.VectorWritable cannot be cast to
> > org.apache.mahout.clustering.iterator.ClusterWritable
> >
> > What I do in detail:
> > 1) mahout seqdirectory -c UTF-8 -i inputdir -o sequencefiles
> > 2) mahout seq2sparse -i sequencefiles -o sparsevectors -ow -a
> > org.apache.lucene.analysis.WhitespaceAnalyzer -x 99 -wt tfidf -s 5 -md 1
> > -x 90 -ng 2 -ml 50 -seq -n 2
> > 3) mahout rowid -i sparsevectors/tf-vectors -o rowidresult
> > 4) mahout cvb -i rowresult/matrix -dict
> > sparsevectors/dictionary.file-0 -o topics -dt documents -mt states -ow -k
> > 10
> > 5) mahout clusterdump -i topics -o clusters -of TEXT -n 10 -d
> > marcelproust/dictionary.file-0 -dt sequencefile
> >
> > When I run command 5, I get the error above. Unfortunately, I could not
> > find any working solution after searching the archives, so I thought I'd
> > ask the community!
> >
> > Thanks a lot in advance.
> > Jeremie
> >
>
>
>
> --
>
>   -jake
>
