Re: cvb/lda run time

Jack Pay Thu, 31 Jan 2013 00:19:48 -0800

On a related note, I believe I have found a bug in the cvb implementation. 
How do I go about getting it fixed?



On 31 Jan 2013, at 02:50, "Andy Schlaikjer" <[email protected]> wrote:

> I assume you mean input *matrix* with 600,000 doc-term *vectors*.
> 
> You need to ensure these vectors are split evenly across many part files.
> The number of part files will determine input splits and in turn map-side
> parallelism.
> 
> Could you let us know how much input each of your 70 mappers is processing?
> Is there an imbalance?
> 
> Andy
> 
> 
> On Wed, Jan 30, 2013 at 6:06 PM, David LaBarbera <
> [email protected]> wrote:
> 
>> I ran cvb on AWS (mahout 0.7 and amazon's hadoop 1.0.3).
>> 
>> I'm running it with
>> hadoop jar mahout-fat.jar org.apache.mahout.driver.MahoutDriver \
>> cvb \
>> -i /lda/matrix-converted/matrix \
>> -o s3n://.../lda/results \
>> -dict /lda/dictionary.file-0 \
>> -dt s3n://.../lda/doc-topics \
>> -k 10 -x 10
>> 
>> The dictionary has around 1,000,000 terms.
>> The input vector has around 600,000 documents (it's a 70MB file) with
>> 10-100 terms each.
>> I created the matrix file with a block size of 1MB. Each iteration of
>> CVB uses 70 mappers, and each mapper takes close to an hour to run.
>> 
>> Is this expected performance under these conditions? Are there any
>> parameters I can tune?
>> 
>> David
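
As a rough check on the numbers quoted above, here is a minimal Python sketch of
why a 70MB file written with a 1MB block size ends up with 70 mappers. It
assumes one input split per block and an even spread of documents; the function
names are hypothetical, and real Hadoop split sizing also depends on the
configured min/max split size, which this ignores.

```python
import math

MB = 1024 * 1024

def estimate_map_tasks(file_size_bytes, block_size_bytes):
    """Approximate number of input splits, assuming one split per block."""
    return math.ceil(file_size_bytes / block_size_bytes)

def docs_per_mapper(num_docs, num_mappers):
    """Average documents per mapper if the vectors are spread evenly."""
    return num_docs / num_mappers

# Figures taken from the quoted message: 70MB file, 1MB blocks, 600,000 docs.
mappers = estimate_map_tasks(70 * MB, 1 * MB)
per_mapper = docs_per_mapper(600_000, mappers)

print(mappers)            # 70
print(round(per_mapper))  # 8571
```

With an even split, each mapper would see roughly 8,571 documents. If the part
files are uneven, some mappers get far more than that, which is the imbalance
Andy asks about.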
