I have a question about using rowid to convert sparse vectors (generated via
seq2sparse) into the form needed for cvb clustering (i.e., to change the Text
key to an Integer). Prior to running this step I had three "part" files in my
tf-vectors folder. Running rowid on the tf-vectors folder generates one
"Matrix" file and one "docIndex" file. As a result, when I run cvb clustering
on the folder containing "Matrix", only a single mapper runs on one node. For
a large collection this takes an excessive amount of time.
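For reference, listing the rowid output directory (paths as in the script
below) shows just those two entries:

$HADOOP fs -ls ${WORK_DIR}/sparse-vectors-cvb
# -> the single "Matrix" sequence file and the "docIndex" file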
I assume cvb should be able to run in a distributed fashion on multiple
nodes using many mappers/tasktrackers? If so, am I running rowid incorrectly
by pointing it at the entire tf-vectors folder rather than running it
separately on each "part" file in tf-vectors? Of course, rowid names its
output "Matrix", which suggests it is meant to produce a single file.
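For what it's worth, the per-part alternative I have in mind would look
roughly like the sketch below (untested; the sparse-vectors-cvb-split output
path is just for illustration). My worry is that each rowid run would
presumably restart its Integer keys at 0 and write its own docIndex, so I am
not sure the per-part outputs could be combined for cvb:

for part in $($HADOOP fs -ls ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors | awk '{print $NF}' | grep 'part-'); do
  # run rowid on one "part" file at a time, keeping each result separate
  $MAHOUT rowid \
    -i $part \
    -o ${WORK_DIR}/sparse-vectors-cvb-split/$(basename $part)
done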
Any advice on running cvb using multiple mappers would be appreciated. The
following are some pertinent lines from my test shell script to process Reuters
data:
*******************************************
$MAHOUT2 seq2sparse \
-i ${WORK_DIR}/reuters-out-seqdir/ \
-o ${WORK_DIR}/reuters-out-seqdir-sparse-cvb \
-wt tf -seq -nr 3 --namedVector \
&& \
$MAHOUT rowid \
-i ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors \
-o ${WORK_DIR}/sparse-vectors-cvb \
&& \
$HADOOP fs -mv ${WORK_DIR}/sparse-vectors-cvb/docIndex \
${WORK_DIR}/sparse-vectors-index-cvb \
&& \
$MAHOUT cvb \
-i ${WORK_DIR}/sparse-vectors-cvb \
-o ${WORK_DIR}/reuters-cvb -k 150 -ow -x 10 \
-dict ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
-mt ${WORK_DIR}/topic-model-cvb -dt ${WORK_DIR}/doc-topic-cvb
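One other idea I considered, instead of splitting the input, is forcing
Hadoop to create more input splits for the single "Matrix" sequence file by
lowering the maximum split size. This is only a sketch, assuming the cvb
driver passes generic -D options through to Hadoop; mapred.max.split.size is
the pre-YARN property name, and the 64 MB value here is arbitrary:

$MAHOUT cvb \
-Dmapred.max.split.size=67108864 \
-i ${WORK_DIR}/sparse-vectors-cvb \
-o ${WORK_DIR}/reuters-cvb -k 150 -ow -x 10 \
-dict ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
-mt ${WORK_DIR}/topic-model-cvb -dt ${WORK_DIR}/doc-topic-cvb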