Yes, the matrix has 600,000 vectors. I generated the input matrix with seq2sparse, then rowid. I don't think rowid is an M/R job; it merges the part files from the term frequency matrix into one file. So I set the block size for the rowid output to 1MB, which gave me 70 blocks and therefore 70 mappers. I think the data is evenly split for two reasons: the vectors are small, and every mapper took 45-55 minutes.
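As a rough check on the block-size reasoning above, here is a minimal sketch of the arithmetic. It assumes one input split per block and ignores Hadoop's minimum/maximum split-size settings and the slack it allows on the final split, so it is only an approximation; estimate_splits is a hypothetical helper, not a Hadoop or Mahout function.

```python
import math

MB = 1024 * 1024


def estimate_splits(file_size_bytes, block_size_bytes):
    """Approximate the number of input splits as one split per block.

    Simplification: real split computation also depends on configured
    min/max split sizes, which are not modeled here.
    """
    return math.ceil(file_size_bytes / block_size_bytes)


# Figures taken from the thread: a ~70MB matrix file, 1MB blocks,
# and roughly 600,000 document vectors.
file_size = 70 * MB
block_size = 1 * MB
num_docs = 600_000

splits = estimate_splits(file_size, block_size)

# Average documents per mapper, assuming the vectors are spread evenly.
docs_per_mapper = num_docs / splits

print(splits, round(docs_per_mapper))
```

Under these assumptions each mapper handles on the order of 8,500 small vectors, which is why an hour per mapper looks suspicious rather than like an input-imbalance problem.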
Jack,

You can probably open a JIRA ticket at
https://issues.apache.org/jira/browse/MAHOUT
and attach your solution to it.

David

On Jan 31, 2013, at 3:19 AM, Jack Pay <[email protected]> wrote:

> On a related note, I believe I have found a bug in the cvb implementation
> and wish to know how to go about getting it fixed. How do I go about doing
> this?
>
> Sent from my iPad
>
> On 31 Jan 2013, at 02:50, "Andy Schlaikjer" <[email protected]>
> wrote:
>
>> I assume you mean input *matrix* with 600,000 doc-term *vectors*.
>>
>> You need to ensure these vectors are split evenly across many part files.
>> The number of part files will determine input splits and in turn map-side
>> parallelism.
>>
>> Could you let us know how much input each of your 70 mappers is processing?
>> Is there an imbalance?
>>
>> Andy
>>
>>
>> On Wed, Jan 30, 2013 at 6:06 PM, David LaBarbera <
>> [email protected]> wrote:
>>
>>> I ran cvb on AWS (mahout 0.7 and amazon's hadoop 1.0.3).
>>>
>>> I'm running it with
>>> hadoop jar mahout-fat.jar org.apache.mahout.driver.MahoutDriver \
>>> cvb \
>>> -i /lda/matrix-converted/matrix \
>>> -o s3n://.../lda/results \
>>> -dict /lda/dictionary.file-0 \
>>> -dt s3n://.../lda/doc-topics \
>>> -k 10 -x 10
>>>
>>> The dictionary has around 1,000,000 terms.
>>> The input matrix has around 600,000 documents (it's a 70MB file) with
>>> 10-100 terms in each.
>>> I created the matrix file with a block size of 1MB. Each iteration of
>>> CVB is using 70 mappers and takes close to an hour for each mapper to run.
>>>
>>> Is this expected performance under these conditions? Are there any
>>> parameters I can tune?
>>>
>>> David
