Re: cvb/lda run time

Andy Schlaikjer Wed, 30 Jan 2013 18:50:13 -0800

I assume you mean input *matrix* with 600,000 doc-term *vectors*.

You need to ensure these vectors are split evenly across many part files.
The number of part files will determine input splits and in turn map-side
parallelism.
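
As a rough illustration of that relationship, here is a minimal sketch using the numbers from the quoted message below (a 70MB file, 1MB blocks, about 600,000 documents). The helper name `estimate_splits` and the one-split-per-block rule are illustrative assumptions; the real split count also depends on the input format and the configured split sizes.

```python
import math

MB = 1024 * 1024

def estimate_splits(file_size_bytes, block_size_bytes):
    """Approximate split count for one file, assuming one split per block."""
    return math.ceil(file_size_bytes / block_size_bytes)

# Figures taken from the quoted message: 70MB input, 1MB block size.
splits = estimate_splits(70 * MB, 1 * MB)

# If the 600,000 vectors were spread evenly, each mapper would see about
# this many documents per iteration.
docs_per_mapper = 600_000 / splits

print(splits, round(docs_per_mapper))
```

Under these assumptions the estimate matches the 70 mappers reported, so the open question is whether the vectors are spread evenly across them.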


Could you let us know how much input each of your 70 mappers is processing?
Is there an imbalance?
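
One simple way to quantify an imbalance, once you have the per-mapper input sizes (for example from the job's task counters), is to compare the largest input against the mean. This is only a sketch: the function name `imbalance_ratio` is hypothetical and the sample sizes are made-up placeholders, not measurements from this job.

```python
def imbalance_ratio(sizes):
    """Return max/mean of per-mapper input sizes; 1.0 means perfectly even."""
    if not sizes:
        raise ValueError("need at least one mapper input size")
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean

# Placeholder values only (any consistent unit: bytes, records, ...).
even = [3, 3, 3]
skewed = [1, 1, 1, 5]

print(imbalance_ratio(even))    # every mapper gets the same amount
print(imbalance_ratio(skewed))  # one mapper gets far more than the rest
```

A ratio well above 1.0 would mean one mapper gets far more input than average, and the slowest mapper bounds the time per iteration.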

Andy


On Wed, Jan 30, 2013 at 6:06 PM, David LaBarbera <
[email protected]> wrote:

> I ran cvb on AWS (mahout 0.7 and amazon's hadoop 1.0.3).
>
> I'm running it with
> hadoop jar mahout-fat.jar org.apache.mahout.driver.MahoutDriver \
> cvb \
> -i /lda/matrix-converted/matrix \
> -o s3n://.../lda/results \
> -dict /lda/dictionary.file-0 \
> -dt s3n://.../lda/doc-topics \
> -k 10 -x 10
>
> The dictionary has around 1,000,000 terms
> The input vector has around 600,000 documents (It's a 70MB file) with
> 10-100 terms in them.
> I created the matrix file with a block size of 1MB. Each iteration of
> CVB uses 70 mappers, and each mapper takes close to an hour to run.
>
> Is this expected performance under these conditions? Are there any
> parameters I can tune?
>
> David
