Dan,

On Fri, Jul 13, 2012 at 11:37 AM, DAN HELM <[email protected]> wrote:
> As far as how documents were distributed across input splits: the
> derivative rowid program I developed uses a "number of records" parameter
> to chunk the sparse vector input records into multiple files. It just
> loops through the input sparse vector "part" files sequentially and
> writes "n" vectors out to different matrix output files, so each matrix
> output file holds a subset of the vectors with no replication. The Mahout
> rowid program (which I initially used just to change the key in my sparse
> vectors from Text to Int to make CVB happy) wrote all output to one
> matrix file. So, in the above run, the 250K sparse vectors were split
> into around 125 files, since I used 2K records as the splitting
> criterion. I assume what I'm doing is reasonable for creating the splits,
> or should I be doing this differently?

This strategy sounds reasonable. For larger numbers of input docs, you
might try a more aggressive partitioning of the docs into part-* files to
further increase the number of map tasks.

Cheers,
Andy
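The chunking strategy described above (loop through the input records sequentially, writing "n" vectors per output part-* file, with no replication) can be sketched roughly as follows. This is a hypothetical standalone illustration, not the actual derivative rowid program: the function name, plain-text record format, and file naming are assumptions, whereas the real program reads and writes Hadoop SequenceFiles of (Int key, sparse vector) pairs.

```python
import os

def split_records(records, chunk_size, out_dir):
    """Write successive chunks of `chunk_size` records to separate
    part-* files, so each output file holds a disjoint subset and
    Hadoop can schedule one map task per file."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i in range(0, len(records), chunk_size):
        path = os.path.join(out_dir, f"part-{i // chunk_size:05d}")
        with open(path, "w") as f:
            for key, vec in records[i:i + chunk_size]:
                # In the real program these would be (Int key, sparse
                # vector) pairs in SequenceFile format; plain text here.
                f.write(f"{key}\t{vec}\n")
        paths.append(path)
    return paths
```

With 250K records and a chunk size of 2K, this yields the roughly 125 part files mentioned above.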
