I have been experimenting with CVB clustering lately. I started on a small 
cluster and have recently been running tests on a larger one. I made a 
derivative of the rowid utility so I can vary the number of input splits used 
in the CVB run.
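
For reference, the splitter is roughly the sketch below. It is the same idea as 
rowid, but writing N matrix files instead of one so that CVB gets N input 
splits; the class name, paths, and round-robin scheme are just illustrative, 
and error handling is omitted:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.VectorWritable;

public class SplitRowIdSketch {
  public static void run(Path tfVectors, Path outDir, int numSplits) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // One output matrix file per desired split.
    SequenceFile.Writer[] writers = new SequenceFile.Writer[numSplits];
    for (int i = 0; i < numSplits; i++) {
      writers[i] = SequenceFile.createWriter(fs, conf,
          new Path(outDir, "matrix-" + i),
          IntWritable.class, VectorWritable.class);
    }

    // Read the <Text docId, VectorWritable> tf vectors, assign dense integer
    // row ids, and spread the rows round-robin across the output files.
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, tfVectors, conf);
    Text docId = new Text();
    VectorWritable vector = new VectorWritable();
    int row = 0;
    while (reader.next(docId, vector)) {
      writers[row % numSplits].append(new IntWritable(row), vector);
      row++;
    }
    reader.close();
    for (SequenceFile.Writer w : writers) {
      w.close();
    }
  }
}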
 
I have some general questions.
 
I started by clustering the Reuters 21K dataset, generating up to 100 topics.  
I then moved on to other test collections, clustering as many as 250K 
documents into 400 topics.  The dictionary sizes ranged from around 40K terms 
for Reuters to significantly more for the other collections.  I suspect the 
larger dictionaries contain a lot of junk terms, which I plan to remove in the 
vector generation phase in the future.
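
As a first pass at that cleanup, I have something along these lines in mind 
(a minimal sketch, not what I actually ran; the paths, the thresholds, and the 
second filtering pass are all assumed):

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class DfPruneSketch {
  /** Count, for each term index, how many documents it appears in. */
  public static Map<Integer, Integer> documentFrequencies(Path tfVectors) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Map<Integer, Integer> df = new HashMap<Integer, Integer>();

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, tfVectors, conf);
    Text docId = new Text();
    VectorWritable row = new VectorWritable();
    while (reader.next(docId, row)) {
      Iterator<Vector.Element> it = row.get().iterateNonZero();
      while (it.hasNext()) {
        int term = it.next().index();
        Integer count = df.get(term);
        df.put(term, count == null ? 1 : count + 1);
      }
    }
    reader.close();

    // A second pass would drop terms with very low df (typos, ids) and very
    // high df (boilerplate) from both the dictionary and the vectors before
    // handing the matrix to CVB.
    return df;
  }
}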
 
I know the answers to my questions below are most likely “it depends”, but any 
general guidance would be appreciated.  Ideally I would like to cluster 
millions of documents, but I’m concerned the time required could be 
prohibitive, particularly if many iterations are needed.
 
1. What has the biggest impact on the time to cluster documents (e.g., 
dictionary size, number of topics, number of input splits, etc.)?

2. For a large collection (say 500K documents), is there a typical number of 
iterations required for results to converge?

3. How long should it take to cluster a large collection (say 500K documents) 
with a large number of nodes?  For the 250K-document collection I processed, 
one iteration took 2 hours using around 125 mappers, with each mapper 
processing 2K vectors (based on how I split the data via rowid).

4. Any general advice on how best to tweak CVB via its parameters and Hadoop 
would be appreciated.  My concern is that, based on my own testing, clustering 
1 million documents (assuming 10 iterations) could take days, even with 
hundreds of mappers.  For context, the way I'm currently invoking CVB is 
sketched below.
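
This is roughly the shape of it; the flag names are from memory and may differ 
across Mahout versions (check mahout cvb --help), and the paths and values are 
placeholders rather than my actual command:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.clustering.lda.cvb.CVB0Driver;

public class RunCvbSketch {
  public static void main(String[] args) throws Exception {
    // CVB0Driver is a Hadoop Tool, so it can be driven programmatically the
    // same way "mahout cvb" drives it from the command line.
    ToolRunner.run(new Configuration(), new CVB0Driver(), new String[] {
        "--input", "reuters-matrix",                        // <IntWritable, VectorWritable> rows
        "--dictionary", "reuters-vectors/dictionary.file-0",
        "--output", "reuters-cvb-topics",                   // term-topic model
        "--doc_topic_output", "reuters-cvb-doc-topics",
        "--topic_model_temp_dir", "reuters-cvb-tmp",
        "--num_topics", "100",
        "--num_terms", "40000",   // roughly the dictionary size
        "--maxIter", "10",
    });
  }
}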
 
Thanks, Dan
