I have a question about using rowid to convert sparse vectors (generated via 
seq2sparse) to the form needed for cvb clustering (i.e., to change the Text key 
to an IntWritable).  Before running this step I had 3 "part" files in my 
tf-vectors folder.  Running rowid on the tf-vectors folder produces a single 
"Matrix" file and a "docIndex" file.  As a result, when cvb clustering runs on 
the folder containing "Matrix", only a single mapper runs on one node.  For a 
large collection this takes an excessive amount of time to run.
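
For reference, this is roughly how I confirmed what rowid produced (I'm not 
sure of the exact casing of the matrix file name, so the seqdumper path may 
need adjusting to whatever rowid actually wrote):

  # list the rowid output: I see only the single matrix file plus docIndex,
  # with no part-* files
  $HADOOP fs -ls ${WORK_DIR}/sparse-vectors-cvb
  # dump a few records: the key class should now be IntWritable rather than Text
  $MAHOUT seqdumper -i ${WORK_DIR}/sparse-vectors-cvb/matrix | head -n 20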

I assume cvb should be able to run in a distributed fashion across multiple 
nodes, using many mappers/tasktrackers?  If so, am I using rowid incorrectly by 
running it on the entire tf-vectors folder rather than separately on each 
"part" file in tf-vectors?  Of course, the fact that the output is named 
"Matrix" suggests the job is meant to produce a single file.
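
To be concrete about the alternative I'm contemplating, it would be something 
like the loop below.  I'm not sure it's even valid, though: I haven't checked 
whether rowid accepts a single part file as input, and each run would 
presumably number documents from 0 again, so the ids from different parts 
would collide.

  # run rowid once per part file, each into its own output directory
  for part in $($HADOOP fs -ls ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors/part-* \
                | awk '/part-/ {print $NF}'); do
    $MAHOUT rowid -i $part -o ${WORK_DIR}/sparse-vectors-cvb/$(basename $part)
  done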

Any advice on running cvb using multiple mappers would be appreciated.  The 
following are the pertinent lines from my test shell script for processing the 
Reuters data:
 
*******************************************
  $MAHOUT2 seq2sparse \
    -i ${WORK_DIR}/reuters-out-seqdir/ \
    -o ${WORK_DIR}/reuters-out-seqdir-sparse-cvb \
    -wt tf -seq -nr 3 --namedVector \
  && \
  $MAHOUT rowid \
    -i ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/tf-vectors \
    -o ${WORK_DIR}/sparse-vectors-cvb \
  && \
  $HADOOP fs -mv ${WORK_DIR}/sparse-vectors-cvb/docIndex \
    ${WORK_DIR}/sparse-vectors-index-cvb \
  && \
  $MAHOUT cvb \
    -i ${WORK_DIR}/sparse-vectors-cvb \
    -o ${WORK_DIR}/reuters-cvb -k 150 -ow -x 10 \
    -dict ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
    -mt ${WORK_DIR}/topic-model-cvb -dt ${WORK_DIR}/doc-topic-cvb
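
One other option I've considered (but not yet tried) is leaving the single 
"Matrix" file alone and instead lowering the input split size for the cvb job 
so that more map tasks are created.  This assumes the cvb driver passes the 
generic Hadoop -D options through, and the 32 MB figure is just an arbitrary 
example:

  $MAHOUT cvb \
    -Dmapred.max.split.size=33554432 \
    -i ${WORK_DIR}/sparse-vectors-cvb \
    -o ${WORK_DIR}/reuters-cvb -k 150 -ow -x 10 \
    -dict ${WORK_DIR}/reuters-out-seqdir-sparse-cvb/dictionary.file-0 \
    -mt ${WORK_DIR}/topic-model-cvb -dt ${WORK_DIR}/doc-topic-cvb
  # SequenceFiles can be split at sync points, so a smaller max split size
  # should yield more mappers even for a single large input file

Is something along those lines the intended way to get cvb onto multiple 
mappers, or should the rowid output itself be split up?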
