Hi Dan,

On Thu, Jul 12, 2012 at 5:45 PM, DAN HELM <[email protected]> wrote:

> Initially I started clustering the Reuters 21K dataset, generating up to
> 100 topics.  I then moved to some other test collections, clustering as
> many as 250K documents into 400 topics.  The dictionary sizes ranged from
> around 40K terms for Reuters to significantly more terms for other
> collections.  I think the larger dictionaries contained a lot of junk I
> will work on removing in future in the vector generation phase.
>

Vocabulary definition is an important step at this point, due to the topics
x terms memory constraint in map tasks. Jake has spent some time working on
a variant of the code that enables "sparse" training to alleviate this,
but it's not ready for prime time. I use a custom Pig pipeline to compute
TF-IDF weighted term vectors from the input data. In the process, I filter
the vocabulary in a number of ways to reach a target number of terms
(generally < 1M).
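For illustration, here's a minimal sketch of that kind of filtering (my
actual pipeline is in Pig; the thresholds, scoring, and toy corpus below
are made up for the example): drop very rare and very common terms by
document frequency, then keep the top terms by a corpus-level TF-IDF score.

```python
import math
from collections import Counter

def build_vocabulary(docs, min_df=2, max_df_ratio=0.5, target_size=1_000_000):
    """Prune a vocabulary to at most `target_size` terms.

    Drops rare terms (df < min_df) and near-stopwords
    (df / num_docs > max_df_ratio), then keeps the highest-scoring
    remaining terms by total term frequency * IDF.
    """
    num_docs = len(docs)
    df = Counter()  # document frequency per term
    tf = Counter()  # total term frequency across the corpus
    for doc in docs:
        tf.update(doc)
        df.update(set(doc))

    scored = []
    for term, d in df.items():
        if d < min_df or d / num_docs > max_df_ratio:
            continue
        idf = math.log(num_docs / d)
        scored.append((tf[term] * idf, term))

    scored.sort(reverse=True)
    return [term for _, term in scored[:target_size]]

# Toy corpus: "the" is dropped as a near-stopword, "inference" as too rare.
docs = [
    ["topic", "model", "lda", "the"],
    ["topic", "vector", "the"],
    ["model", "vector", "the"],
    ["lda", "inference", "the"],
]
vocab = build_vocabulary(docs, min_df=2, max_df_ratio=0.9, target_size=3)
```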


> So, I know the answers to my questions below are most likely “it depends”,
> but any general guidance would be appreciated.    In theory I would like to
> cluster millions of documents but I’m concerned the time could be
> prohibitive to do this, in particular if lots of iterations are required.
>
> 1. What has the biggest impact on the time to cluster documents (e.g.,
> dictionary size, number of topics, number of input splits, etc.)?
>

As mentioned above, this implementation is memory bound by num topics times
num terms. You can push these two variables out until you hit the heap
capacity of your map tasks. To ensure you're seeing gains from map-side
parallelism, organize your input data to evenly distribute your documents /
non-zero term entries across as many input splits as you can afford. There
are various tricks one can play in Pig to accomplish this, though the same
tricks may be a bit harder to achieve with the existing Mahout tools.
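A back-of-envelope check of that topics x terms bound (the 8 bytes per
double and the 2x overhead factor are assumptions for the sketch, not
measured numbers):

```python
def model_memory_bytes(num_topics, num_terms, bytes_per_entry=8, overhead=2.0):
    """Rough lower bound on map-task heap for a dense topics x terms
    model: one double per (topic, term) entry, times a fudge factor
    for JVM object overhead and working copies during an update."""
    return int(num_topics * num_terms * bytes_per_entry * overhead)

# e.g. 400 topics x 40k terms, roughly the scale discussed above:
needed = model_memory_bytes(400, 40_000)
heap = 1 << 30  # a 1 GB map-task heap
fits = needed < heap  # ~256 MB, so this configuration fits comfortably
```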


> 2. For a large collection (say 500K documents), are there a typical number
> of iterations required for results to converge?
>

This depends on your input, but I generally don't run more than ~50
iterations. You can inspect the output model after each iteration, looking
for significant changes in the rank of terms within each topic, to get a
better feel for how quickly the model is converging on your data.
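One cheap way to quantify that, sketched here with made-up weights (you'd
dump the real per-topic term weights from the model output first): average
Jaccard overlap of each topic's top-k terms between consecutive iterations.
Overlap near 1.0 suggests the rankings have settled.

```python
def topic_stability(prev_model, curr_model, k=10):
    """Average Jaccard overlap of the top-k terms per topic between two
    iterations. Each model is a list of {term: weight} dicts, one per
    topic, with topics in the same order in both models."""
    overlaps = []
    for prev, curr in zip(prev_model, curr_model):
        top_prev = set(sorted(prev, key=prev.get, reverse=True)[:k])
        top_curr = set(sorted(curr, key=curr.get, reverse=True)[:k])
        overlaps.append(len(top_prev & top_curr) / len(top_prev | top_curr))
    return sum(overlaps) / len(overlaps)

# Hypothetical weights from two consecutive iterations, two topics:
iter_5 = [{"cat": 3.0, "dog": 2.0, "fish": 0.1},
          {"bond": 2.5, "stock": 2.0, "cat": 0.2}]
iter_6 = [{"cat": 3.2, "dog": 2.1, "fish": 0.1},
          {"bond": 2.6, "stock": 2.2, "cat": 0.1}]
stability = topic_stability(iter_5, iter_6, k=2)  # 1.0: rankings unchanged
```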


> 3. How long should it take to cluster large collections (say 500K
> documents) with a large number of nodes?  For a 250K document collection I
> processed, it took 2 hours to execute one iteration; and I used around 125
> mappers, where each mapper processed 2K vectors (based on how I split the
> data via rowid).
>

How many non-zero entries does each of your documents have? How have you
distributed these docs across input splits? Are you running 400 topics, 40k
terms?


> 4. Any general advice on how to best tweak CVB via parameters and hadoop
> would be appreciated.  My concern is based on my own testing  to cluster 1
> million documents (assuming 10 iterations) could take days, even if using
> hundreds of mappers.
>

This is about what I'd expect for 1M documents. You might want to consider
generating a representative sample of your 1M docs to speed up training
during model development. Also, as mentioned above, make sure your input
splits are organized to take advantage of the available mappers.
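For example, a uniform sample can be drawn in a single pass with reservoir
sampling (Algorithm R), which is handy when the input is a stream of
unknown length; this sketch samples IDs, but the same works on documents:

```python
import random

def reservoir_sample(items, n, seed=42):
    """Uniform random sample of n items from an iterable, in one pass.
    Each item past the first n replaces a reservoir slot with
    probability n / (i + 1), which yields a uniform sample overall."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(items):
        if i < n:
            sample.append(item)
        else:
            j = rng.randrange(i + 1)
            if j < n:
                sample[j] = item
    return sample

# Sample 50k doc IDs out of 1M for a quicker development corpus:
sample = reservoir_sample(range(1_000_000), 50_000)
```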

Andy
@sagemintblue
