I figured out that you need to move the docIndex file out of the matrix
directory before running cvb; otherwise it gets picked up as part of cvb's
input and gets in the way. The commands I'm running so far are:

<convert input to sequence files>
bin/mahout seq2sparse -i reuters-seqfiles -o reuters-seqfiles-sparse -wt tf
-seq -nr 3 --namedVector
bin/mahout rowid -i reuters-seqfiles-sparse/tf-vectors -o reuters-matrix
hadoop fs -mv reuters-matrix/docIndex reuters-matrix-docIndex
bin/mahout cvb -i reuters-matrix -o reuters-cvb -k 15 -ow -x 10 -dict
reuters-seqfiles-sparse/dictionary.file-* -mt reuters-cvb-tm -dt
reuters-cvb-dt
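For reference, here is the same sequence collected into one script. This is
a sketch based on the commands above, not a tested recipe: the relative
paths (reuters-seqfiles, reuters-matrix, etc.) and the bin/mahout location
are assumptions from this thread, and it presumes the input has already
been converted to sequence files.

```shell
#!/usr/bin/env bash
# Sketch of the pipeline discussed in this thread (paths are assumptions).
set -euo pipefail

MAHOUT=${MAHOUT:-bin/mahout}   # adjust to your Mahout install

run_cvb_pipeline() {
  # 1. Build tf vectors from the sequence files.
  "$MAHOUT" seq2sparse -i reuters-seqfiles -o reuters-seqfiles-sparse \
      -wt tf -seq -nr 3 --namedVector

  # 2. Remap the Text document keys to IntWritable row ids, as cvb expects.
  "$MAHOUT" rowid -i reuters-seqfiles-sparse/tf-vectors -o reuters-matrix

  # 3. Move docIndex out of the matrix dir so cvb only reads the matrix.
  hadoop fs -mv reuters-matrix/docIndex reuters-matrix-docIndex

  # 4. Run CVB LDA with 15 topics and 10 iterations.
  "$MAHOUT" cvb -i reuters-matrix -o reuters-cvb -k 15 -ow -x 10 \
      -dict reuters-seqfiles-sparse/dictionary.file-* \
      -mt reuters-cvb-tm -dt reuters-cvb-dt
}

# Only execute when a Mahout launcher is actually present.
if command -v "$MAHOUT" >/dev/null 2>&1; then
  run_cvb_pipeline
fi
```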


--
Tom


On 5 May 2012 04:54, DAN HELM <[email protected]> wrote:

> I am attempting to run the new LDA algorithm, cvb (Mahout version 0.6),
> against the Reuters data. I just added another entry to the
> cluster-reuters.sh example script as follows:
>
> ******************************************************************************
> elif [ "x$clustertype" == "xcvb" ]; then
>   $MAHOUT seq2sparse \
>     -i ${WORK_DIR}/reuters-out-seqdir/ \
>     -o ${WORK_DIR}/reuters-out-seqdir-sparse-cvb \
>     -wt tf -seq -nr 3 --namedVector \
>   && \
>   $MAHOUT cvb \
>     -i ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors \
>     -o ${WORK_DIR}/reuters-cvb -k 20 -ow -x 2 \
>     -dict ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
>     -mt ${WORK_DIR}/topic-model-cvb -dt ${WORK_DIR}/doc-topic-cvb \
>   && \
>   $MAHOUT ldatopics \
>     -i ${WORK_DIR}/reuters-cvb/state-2 \
>     -d ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
>     -dt sequencefile
>
> ******************************************************************************
> I successfully ran the previous LDA algorithm against Reuters but I am
> most interested in this new implementation of LDA because I want the new
> feature that generates document-to-cluster mappings (e.g., the -dt parameter).
>
> When I run the above code in Hadoop pseudo-distributed mode, as well as on
> a small cluster, I receive the same error from the "mahout cvb" command.
> All the pre-clustering logic, including sequence file and sparse vector
> generation, works fine, but when the cvb clustering is attempted the
> mappers fail with the following error in the Hadoop map task log:
>
> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
> org.apache.hadoop.io.IntWritable
>  at
> org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
>  at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>  at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> Any help with resolving the problem would be appreciated.
>
> Dan
