Yes, the matrix has 600,000 vectors.

I generated the input matrix with seq2sparse, then rowid. I don't think rowid 
is an M/R job; it merges the part files from the term frequency matrix into 
one file. So I set the block size for the rowid output to 1MB, which gave me 
70 blocks and therefore 70 mappers.
I think the data is evenly split for two reasons: the vectors are small, and 
every mapper took 45-55 minutes.
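
For what it's worth, the arithmetic behind the 70 mappers can be sketched as 
below. This is only a rough estimate, assuming one input split per block of a 
single splittable file; the function name is made up for illustration, and 
Hadoop's real split logic lets the last split run slightly over a block, so 
the actual count can be off by one.

```python
import math

MB = 1024 * 1024


def estimated_map_tasks(file_size_bytes: int, block_size_bytes: int) -> int:
    """Rough number of input splits for one splittable file.

    Assumption: one split (and so one mapper) per block. This is a
    hypothetical helper, not part of Hadoop or Mahout.
    """
    if block_size_bytes <= 0:
        raise ValueError("block size must be positive")
    return math.ceil(file_size_bytes / block_size_bytes)


if __name__ == "__main__":
    # The 70MB matrix file written with a 1MB block size.
    mappers = estimated_map_tasks(70 * MB, 1 * MB)
    print("mappers:", mappers)
    # With about 600,000 vectors, each mapper sees roughly this many.
    print("vectors per mapper:", 600_000 // mappers)
    # The same file with a 64MB block size would give far fewer mappers.
    print("mappers at 64MB blocks:", estimated_map_tasks(70 * MB, 64 * MB))
```

So each mapper is handling somewhere around 8,500 small vectors, which is why 
45-55 minutes per mapper looks slow to me.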


Jack,
You can probably open a JIRA ticket at
https://issues.apache.org/jira/browse/MAHOUT
and attach your solution to it.

David

On Jan 31, 2013, at 3:19 AM, Jack Pay <[email protected]> wrote:

> On a related note, I believe I have found a bug in the cvb implementation 
> and would like to know how to go about getting it fixed.
> 
> Sent from my iPad
> 
> On 31 Jan 2013, at 02:50, "Andy Schlaikjer" <[email protected]> 
> wrote:
> 
>> I assume you mean input *matrix* with 600,000 doc-term *vectors*.
>> 
>> You need to ensure these vectors are split evenly across many part files.
>> The number of part files will determine input splits and in turn map-side
>> parallelism.
>> 
>> Could you let us know how much input each of your 70 mappers is processing?
>> Is there an imbalance?
>> 
>> Andy
>> 
>> 
>> On Wed, Jan 30, 2013 at 6:06 PM, David LaBarbera <
>> [email protected]> wrote:
>> 
>>> I ran cvb on AWS (mahout 0.7 and amazon's hadoop 1.0.3).
>>> 
>>> I'm running it with
>>> hadoop jar mahout-fat.jar org.apache.mahout.driver.MahoutDriver \
>>> cvb \
>>> -i /lda/matrix-converted/matrix \
>>> -o s3n://.../lda/results \
>>> -dict /lda/dictionary.file-0 \
>>> -dt s3n://.../lda/doc-topics \
>>> -k 10 -x 10
>>> 
>>> The dictionary has around 1,000,000 terms.
>>> The input vector has around 600,000 documents (It's a 70MB file) with
>>> 10-100 terms in them.
>>> I created the matrix file with a block size of 1MB. Each iteration of
>>> CVB is using 70 mappers and takes close to an hour for each mapper to run.
>>> 
>>> Is this expected performance under these conditions? Are there any
>>> parameters I can tune?
>>> 
>>> David
