My docterm_matrix is only one file of about 200 MB:

> hadoop fs -ls user_model/vectors
196572321 2013-03-01 17:09 user_model/vectors

To increase map-task parallelism I added "-Dmapreduce.input.fileinputformat.split.maxsize=2097152" to the command line. This way, the map phase is split into 94 tasks.
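The full invocation is never shown in this thread, so as a point of reference, here is a sketch of the rough shape such a command line would take. Only the input path, -k 200, and the -D option come from the thread; the dictionary path, output paths, and iteration count are illustrative assumptions:

  mahout cvb \
    -Dmapreduce.input.fileinputformat.split.maxsize=2097152 \
    -i user_model/vectors \
    -dict user_model/dictionary \
    -o user_model/topic_term \
    -dt user_model/doc_topic \
    -k 200 \
    -x 20

Note that the generic -D option must come before the job-specific flags so that Hadoop's ToolRunner picks it up.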
2013/3/4 Andy Schlaikjer <[email protected]>

> Benoit, could you also paste us the output of `hdfs -ls
> /path/to/your/docterm_matrix/part-*`? CVB map-side parallelism benefits
> from an even distribution of doc-term vectors across your input part files.
>
> On Mon, Mar 4, 2013 at 8:34 AM, Jake Mannix <[email protected]> wrote:
>
> > Can you send us your command line args? Is that for 1 iteration? That
> > would be very, very slow.
> >
> > On Monday, March 4, 2013, Benoit Mathieu wrote:
> >
> > > Hi mahout users,
> > >
> > > I'd like to run the Mahout Latent Dirichlet Allocation algorithm
> > > (mahout cvb) on my own data. I have about 1M "documents" and a
> > > vocabulary of 30k "terms". Documents are very sparse: each of them
> > > contains only 100 terms. I'd like to extract "topics" from that.
> > >
> > > I have generated Mahout vectors from my data using a simple Java
> > > program and RandomAccessSparseVector. [a sketch of such a program
> > > follows the thread]
> > >
> > > I successfully launched the "mahout cvb" job with num_topics=200, but
> > > the job seems very slow: 70 running map tasks took 10 minutes to
> > > process about 25,000 documents on my cluster.
> > >
> > > So my questions are:
> > > - Does this job require a specific Vector class for good performance?
> > > - Is the LDA algorithm suitable for processing 1M docs with a
> > > dictionary of 30k terms?
> > >
> > > Thanks for any insights.
> > >
> > > ++
> > > benoit
> >
> > --
> >   -jake
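For context on the vector-generation step mentioned in Benoit's original message above, here is a minimal sketch of such a Java program. The essential point is that cvb expects a SequenceFile mapping IntWritable document ids to VectorWritable sparse vectors; the class and output names and the toy term counts here are assumptions:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.mahout.math.RandomAccessSparseVector;
  import org.apache.mahout.math.Vector;
  import org.apache.mahout.math.VectorWritable;

  public class VectorWriter {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      // Single output file, as in the listing above; writing several
      // part files instead would improve default map-side parallelism.
      Path out = new Path("user_model/vectors");
      int numTerms = 30000; // vocabulary size from the thread

      SequenceFile.Writer writer = SequenceFile.createWriter(
          fs, conf, out, IntWritable.class, VectorWritable.class);
      try {
        for (int docId = 0; docId < 1000000; docId++) {
          // One sparse vector per document: ~100 non-zeros out of 30k.
          Vector doc = new RandomAccessSparseVector(numTerms);
          // Toy value; real code would set the count of each term
          // actually occurring in the document.
          doc.setQuick(docId % numTerms, 1.0);
          writer.append(new IntWritable(docId), new VectorWritable(doc));
        }
      } finally {
        writer.close();
      }
    }
  }

Writing the vectors as several part-NNNNN files under a directory (for example, one per reducer of an upstream job) would give the even distribution across part files that Andy mentions, without needing the split.maxsize override at all.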
