Hi Dan,

On Thu, Jul 12, 2012 at 5:45 PM, DAN HELM <[email protected]> wrote:
> Initially I started clustering the Reuters 21K dataset, generating up to
> 100 topics. I then moved to some other test collections, clustering as
> many as 250K documents into 400 topics. The dictionary sizes ranged from
> around 40K terms for Reuters to significantly more terms for other
> collections. I think the larger dictionaries contained a lot of junk I
> will work on removing in the vector generation phase in the future.

Vocabulary definition is an important step at this point, due to the topics x terms memory constraint in map tasks. Jake has spent some time working on a variant of the code that enables "sparse" training to alleviate this, but it's not ready for prime time. I use a custom Pig pipeline to compute tf-idf weighted term vectors from input data. In the process, I filter the vocabulary in a number of ways to obtain a target number of terms (generally < 1M).

> So, I know the answers to my questions below are most likely “it depends”,
> but any general guidance would be appreciated. In theory I would like to
> cluster millions of documents, but I’m concerned the time could be
> prohibitive, in particular if lots of iterations are required.
>
> 1. What has the biggest impact on the time to cluster documents (e.g.,
> dictionary size, number of topics, number of input splits, etc.)?

As mentioned above, this implementation is memory-bound by num topics times num terms. You can push these two variables out until you hit the heap capacity of your map tasks. To ensure you're seeing gains from map-side parallelism, organize your input data to evenly distribute your documents / non-zero term entries across as many input splits as you can afford. There are various tricks one can play in Pig to accomplish this, though the same tricks may be a bit harder to achieve with existing Mahout tools.

> 2. For a large collection (say 500K documents), is there a typical number
> of iterations required for results to converge?
This depends on the input, but I generally don't run more than ~50 iterations. You can try inspecting the output model after each iteration, looking for significant changes in the rank of terms within each topic, to get a better feel for how fast the model is converging on your data.

> 3. How long should it take to cluster large collections (say 500K
> documents) with a large number of nodes? For a 250K document collection I
> processed, it took 2 hours to execute one iteration; and I used around 125
> mappers, where each mapper processed 2K vectors (based on how I split the
> data via rowid).

How many non-zero entries does each of your documents have? How have you distributed these docs across input splits? Are you running 400 topics, 40K terms?

> 4. Any general advice on how to best tweak CVB via parameters and hadoop
> would be appreciated. My concern, based on my own testing, is that
> clustering 1 million documents (assuming 10 iterations) could take days,
> even if using hundreds of mappers.

This is about what I'd expect for 1M documents. You might want to consider generating a representative sample of your 1M docs to speed up training during model development. Also, as mentioned above, make sure your input splits are organized to take advantage of available mappers.

Andy
@sagemintblue
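P.S. To make the topics x terms memory bound concrete, here's a rough back-of-the-envelope sketch (plain Python; the 8-bytes-per-entry dense-matrix assumption and the function name are my own simplification, not the actual Mahout internals):

```python
# Back-of-the-envelope heap estimate for a dense topics x terms model.
# Assumes 8-byte double entries; the real in-memory layout in Mahout's
# CVB mappers differs, and a mapper may hold more than one copy of the
# model (e.g. read + write), so treat this as a lower bound.

def model_heap_bytes(num_topics, num_terms, bytes_per_entry=8):
    """Rough size in bytes of one dense num_topics x num_terms matrix."""
    return num_topics * num_terms * bytes_per_entry

# Example: 400 topics x 40K terms, as discussed above.
size = model_heap_bytes(400, 40_000)
print(size / (1024 ** 2), "MiB")  # ~122 MiB per copy of the model
```

If your map task heap is, say, 1 GiB, this tells you roughly how far you can push topics and vocabulary before hitting the wall.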

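A hypothetical sketch of the rank-change check from question 2: compare the top-N terms of each topic between two consecutive iterations and report the average overlap. The `{topic: {term: weight}}` input format and the helper names are my own; loading the models from Mahout's output is left out.

```python
# Compare top-N terms per topic across two model snapshots. Overlap
# near 1.0 for all topics across iterations suggests the term ranks
# have stabilized and the model is close to converged.

def top_terms(topic_weights, n=20):
    """Terms with the n largest weights, highest first."""
    ranked = sorted(topic_weights.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:n]]

def mean_top_term_overlap(model_a, model_b, n=20):
    """Average fraction of top-n terms shared per topic between models."""
    overlaps = []
    for topic in model_a:
        a = set(top_terms(model_a[topic], n))
        b = set(top_terms(model_b[topic], n))
        overlaps.append(len(a & b) / float(n))
    return sum(overlaps) / len(overlaps)
```

Running this on each pair of consecutive iterations gives a cheap convergence signal without scoring held-out data.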