Hi Mahout users,

I'd like to run Mahout's Latent Dirichlet Allocation algorithm (mahout cvb) on my own data. I have about 1M "documents" and a vocabulary of 30k "terms". The documents are very sparse: each of them contains only 100 terms. I'd like to extract "topics" from this data.
I generated Mahout vectors from my data with a simple Java program, using RandomAccessSparseVector. I successfully launched the "mahout cvb" job with num_topics=200, but it seems very slow: 70 running map tasks took 10 minutes to process about 25000 documents on my cluster.

So my questions are:
- Does this job require a specific Vector class for good performance?
- Is the LDA algorithm suitable for processing 1M docs with a dictionary of 30k terms?

Thanks for any insights.
++ benoit
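
P.S. To put the timing in perspective, here is a rough extrapolation from the figures above. It is only a sketch: it assumes throughput stays constant at the observed rate and scales linearly with the number of documents, which may not hold on a real cluster. The class and method names are made up for this illustration and are not part of Mahout.

```java
import java.util.Locale;

// Hypothetical helper, not part of Mahout: extrapolates the observed
// throughput to the full corpus, assuming it scales linearly.
public class CvbThroughputEstimate {

    // Observed rate in documents per minute.
    static double docsPerMinute(long docsProcessed, double minutes) {
        return docsProcessed / minutes;
    }

    // Minutes needed to process the whole corpus once at a given rate.
    static double minutesForCorpus(long corpusSize, double docsPerMinute) {
        return corpusSize / docsPerMinute;
    }

    public static void main(String[] args) {
        // Figures from the post: about 25000 documents in 10 minutes, 1M documents in total.
        double rate = docsPerMinute(25_000L, 10.0);
        double minutes = minutesForCorpus(1_000_000L, rate);

        System.out.println(String.format(Locale.ROOT,
                "rate: %.0f docs/min, estimated time for 1M docs: %.0f min (%.1f h)",
                rate, minutes, minutes / 60.0));
    }
}
```

At the observed rate this works out to 2500 documents per minute, so processing all 1M documents once would take about 400 minutes under those assumptions.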
