I assume you mean an input *matrix* of 600,000 doc-term *vectors*. You need to ensure these vectors are split evenly across many part files. The number of part files determines the number of input splits and, in turn, the map-side parallelism.
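
As a rough sanity check on the numbers quoted below, here is a minimal Python sketch of the arithmetic, assuming one map task per split-sized chunk of the file and a perfectly even spread of vectors. The helper names `estimate_map_tasks` and `vectors_per_mapper` are made up for illustration and are not part of Mahout or Hadoop; the real split count also depends on the input format and the job's split-size settings.

```python
def estimate_map_tasks(file_size_bytes, split_size_bytes):
    """Approximate split count for one file: ceiling of size / split size."""
    return (file_size_bytes + split_size_bytes - 1) // split_size_bytes


def vectors_per_mapper(num_vectors, num_mappers):
    """Ideal, perfectly balanced number of vectors handled by each mapper."""
    base, extra = divmod(num_vectors, num_mappers)
    # The first `extra` mappers take one additional vector.
    return [base + 1 if i < extra else base for i in range(num_mappers)]


MB = 1024 * 1024

# Figures taken from the quoted message; both are approximate ("around").
mappers = estimate_map_tasks(70 * MB, 1 * MB)
counts = vectors_per_mapper(600_000, mappers)

print(mappers, min(counts), max(counts))  # prints: 70 8571 8572
```

Under those assumptions each mapper should see roughly 8,500 vectors, so per-mapper input counts that differ widely from that figure would point to an imbalance.
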
Could you let us know how much input each of your 70 mappers is processing? Is there an imbalance?

Andy

On Wed, Jan 30, 2013 at 6:06 PM, David LaBarbera <[email protected]> wrote:

> I ran cvb on AWS (mahout 0.7 and amazon's hadoop 1.0.3).
>
> I'm running it with
> hadoop jar mahout-fat.jar org.apache.mahout.driver.MahoutDriver \
> cvb \
> -i /lda/matrix-converted/matrix \
> -o 's3n://.../lda/results \
> -dict /lda/dictionary.file-0 \
> -dt s3n://.../lda/doc-topics \
> -k 10 -x 10
>
> The dictionary has around 1,000,000 terms
> The input vector has around 600,000 documents (It's a 70MB file) with
> 10-100 terms in them.
> I created with the matrix file with a block size of 1MB. Each iteration of
> CVB is using 70 mappers and takes close to an hour for each mapper to run.
>
> Is this expected performance under these conditions? Are there any
> parameters I can tune?
>
> David
