Dan,

On Fri, Jul 13, 2012 at 11:37 AM, DAN HELM <[email protected]> wrote:
> As far as how documents were distributed across input splits: the
> derivative rowid program I developed uses a "number of records" parameter
> to chunk the sparse vector input records into multiple files. It just
> loops through the input sparse vector "part" files sequentially and
> writes "n" vectors out to different matrix output files, so each matrix
> output file holds a subset of the vectors with no replication. The Mahout
> rowid program (which I initially used just to change the key in my sparse
> vectors from Text to Int to make CVB happy) wrote all output to one
> matrix file. So, in the above run, the 250K sparse vectors were split
> into around 125 files, since I used 2K records as the splitting
> criterion. I assume what I'm doing is reasonable for creating the splits,
> or should I be doing this differently?

This strategy sounds reasonable. For larger numbers of input docs, you
might try a more aggressive partitioning of the docs into part-* files to
further increase the number of map tasks.

Cheers,
Andy
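The chunking strategy described above (loop through the input records sequentially, writing "n" vectors per output part-* file, with no replication) can be sketched roughly as follows. This is a hypothetical standalone illustration, not the actual derivative rowid program: the function name, plain-text record format, and file naming are assumptions, whereas the real program reads and writes Hadoop SequenceFiles of (Int key, sparse vector) pairs.

```python
import os

def split_records(records, chunk_size, out_dir):
    """Write successive chunks of `chunk_size` records to separate
    part-* files, so each output file holds a disjoint subset and
    Hadoop can schedule one map task per file."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i in range(0, len(records), chunk_size):
        path = os.path.join(out_dir, f"part-{i // chunk_size:05d}")
        with open(path, "w") as f:
            for key, vec in records[i:i + chunk_size]:
                # In the real program these would be (Int key, sparse
                # vector) pairs in SequenceFile format; plain text here.
                f.write(f"{key}\t{vec}\n")
        paths.append(path)
    return paths
```

With 250K records and a chunk size of 2K, this yields the roughly 125 part files mentioned above.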
