On a related note, I believe I have found a bug in the cvb implementation. How do I go about getting it fixed?
On 31 Jan 2013, at 02:50, "Andy Schlaikjer" <[email protected]> wrote:

> I assume you mean input *matrix* with 600,000 doc-term *vectors*.
>
> You need to ensure these vectors are split evenly across many part files.
> The number of part files will determine input splits and in turn map-side
> parallelism.
>
> Could you let us know how much input each of your 70 mappers is processing?
> Is there an imbalance?
>
> Andy
>
>
> On Wed, Jan 30, 2013 at 6:06 PM, David LaBarbera <
> [email protected]> wrote:
>
>> I ran cvb on AWS (mahout 0.7 and amazon's hadoop 1.0.3).
>>
>> I'm running it with
>> hadoop jar mahout-fat.jar org.apache.mahout.driver.MahoutDriver \
>> cvb \
>> -i /lda/matrix-converted/matrix \
>> -o 's3n://.../lda/results \
>> -dict /lda/dictionary.file-0 \
>> -dt s3n://.../lda/doc-topics \
>> -k 10 -x 10
>>
>> The dictionary has around 1,000,000 terms
>> The input vector has around 600,000 documents (It's a 70MB file) with
>> 10-100 terms in them.
>> I created with the matrix file with a block size of 1MB. Each iteration of
>> CVB is using 70 mappers and takes close to an hour for each mapper to run.
>>
>> Is this expected performance under these conditions? Are there any
>> parameters I can tune?
>>
>> David
