I'm about to head to bed (long day: flew to SF and back in one day, need
sleep), but the short answer is that the new LDA requires
SequenceFile<IntWritable, VectorWritable> as input (the same on-disk
format as DistributedRowMatrix). seq2sparse writes
SequenceFile<Text, VectorWritable>, which is why the mapper throws the
ClassCastException from Text to IntWritable. You can convert one to the
other by running the RowIdJob ("$MAHOUT_HOME/bin/mahout rowid -h" for
more details) before running CVB.
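
A rough, untested sketch of how that could slot into the script, between
seq2sparse and cvb. Assumptions here, not things to take as given: that
rowid writes its IntWritable-keyed vectors to a "matrix" file (next to a
"docIndex" file) under its output directory, so check "rowid -h" and the
output listing on your version; ROWID_DIR and the function name are made
up for illustration; and the MAHOUT fallback to "echo mahout" is only
there so the commands can be dry-run without a Mahout install.

```shell
# Sketch only: re-key the tf-vectors with rowid, then run cvb on the result.
WORK_DIR="${WORK_DIR:-/tmp/mahout-work}"   # hypothetical default
MAHOUT="${MAHOUT:-echo mahout}"            # dry-run: just print the commands

SPARSE_DIR="${WORK_DIR}/reuters-out-seqdir-sparse-cvb"
ROWID_DIR="${WORK_DIR}/reuters-out-rowid"  # hypothetical directory name

run_cvb_with_rowid() {
  # Text keys -> IntWritable keys.
  # Assumed output layout: ${ROWID_DIR}/matrix and ${ROWID_DIR}/docIndex.
  $MAHOUT rowid \
    -i "${SPARSE_DIR}/tf-vectors" \
    -o "${ROWID_DIR}" \
  && \
  # Same cvb options as in the original script; only the input path changes.
  $MAHOUT cvb \
    -i "${ROWID_DIR}/matrix" \
    -o "${WORK_DIR}/reuters-cvb" -k 20 -ow -x 2 \
    -dict "${SPARSE_DIR}/dictionary.file-0" \
    -mt "${WORK_DIR}/topic-model-cvb" -dt "${WORK_DIR}/doc-topic-cvb"
}

run_cvb_with_rowid
```

If the layout assumption holds, docIndex should hold the mapping from the
new integer row ids back to the original document names, which you will
want when reading the -dt output.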

Let us know if that doesn't help!

On Fri, May 4, 2012 at 8:54 PM, DAN HELM <[email protected]> wrote:

> I am attempting to run the new LDA algorithm cvb (Mahout version 0.6)
> against the Reuters data. I just added another entry to the
> cluster-reuters.sh example script as follows:
>
> ******************************************************************************
> elif [ "x$clustertype" == "xcvb" ]; then
>   $MAHOUT seq2sparse \
>     -i ${WORK_DIR}/reuters-out-seqdir/ \
>     -o ${WORK_DIR}/reuters-out-seqdir-sparse-cvb \
>     -wt tf -seq -nr 3 --namedVector \
>   && \
>   $MAHOUT cvb \
>     -i ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors \
>     -o ${WORK_DIR}/reuters-cvb -k 20 -ow -x 2 \
>     -dict ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
>     -mt ${WORK_DIR}/topic-model-cvb -dt ${WORK_DIR}/doc-topic-cvb \
>   && \
>   $MAHOUT ldatopics \
>     -i ${WORK_DIR}/reuters-cvb/state-2 \
>     -d ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
>     -dt sequencefile
>
> ******************************************************************************
> I successfully ran the previous LDA algorithm against Reuters but I am
> most interested in this new implementation of LDA because I want the new
> feature that generates document-to-cluster mappings (e.g., parameter -dt).
>
> When I run the above code via Hadoop pseudo distributed mode as well as on
> a small cluster I receive the same error from the "mahout cvb" command.
> All the pre-clustering logic including sequence file and sparse vector
> generation works fine but when the cvb clustering is attempted the mappers
> fail with the following error in the Hadoop map task log:
>
> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
>  at org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
>  at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>  at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> Any help with resolving the problem would be appreciated.
>
> Dan




-- 

  -jake
