I really can't read your results here; the formatting of your columns got pretty badly mangled. It looks like you've got results for 20 topics as well as for 10, with different-sized corpora?
You can't compare convergence across corpus sizes: the perplexity will vary by an order of magnitude between them. The only comparison that makes sense is, for a single fixed corpus, as you run for 5, 10, 15, 20, ... iterations, what does the (held-out) perplexity look like after each of these? Does it start to level off? At some point you may start overfitting and the perplexity will go back up; your convergence happened before that. I don't think I've ever needed to run more than 50 iterations, and usually I stop after 20-30. The bigger the corpus, the more this holds true.

On Thu, Feb 21, 2013 at 6:45 AM, David LaBarbera <[email protected]> wrote:

> I've been running some performance tests with the LDA algorithm and I'm
> unsure how to gauge them. I ran 10 iterations each time and collected the
> perplexity value every 2 iterations with the test fraction set to 0.1.
> These were all run on an AWS cluster with 10 nodes (70 mappers, 30
> reducers). I'm not sure about the memory or CPU specs. I also stored the
> documents on HDFS in 1MB blocks to get some parallelization. The documents
> I have were very short, 10-100 words each. Hopefully these results are
> clear.
>
> Document Count  Corpus (MB)  Topics  Perplexity (every 2 iterations)          Dict. Size  Runtime (min/iter)
> 40,044          3.2          10      16.326, 15.418, 15.191, 15.088, 15.028   14,097      1.5
> 40,044          3.2          20      26.461, 24.517, 23.996, 23.805, 23.882   14,097      6
> 40,044          3.2          40      19.722, 18.185, 17.823, 17.680, 17.608   14,097      11.5
> 40,046          3.7          10      19.286, 18.373, 18.092, 17.958, 17.865   98,283      5.5
> 40,046          3.7          20      18.574, 17.448, 17.143, 17.018, 16.940   98,283      10.5
> 44,767          4            10      19.928, 18.815, 18.521, 18.350, 18.225   31,727      2.5
> 44,767          4            20      21.838, 20.421, 20.087, 19.963, 19.903   31,727      4.5
> 616,957         58.5         10      14.467, 13.830, 13.583, 13.435, 13.381   151,807     8.5
> 616,957         58.5         20      13.590, 12.787, 12.605, 12.522, 12.476   151,807     16
> 616,972         58.4         10      14.646, 13.904, 13.646, 13.573, 13.543   54,280      4
> 616,967         54.1         10      13.363, 12.634, 12.432, 12.345, 12.283   32,101      2.5
> 616,967         54.1         20      13.195, 12.307, 12.065, 11.764, 11.732   32,101      4.5
>
> The question is how to interpret the results. In particular, is there
> anything telling me when to stop running LDA? I've tried running until
> convergence, but I've never had the patience to see it finish. Does the
> perplexity give some hint to the quality of the results? In attempting to
> reach convergence, I saw runs go to 200 iterations. If an iteration takes
> around 5.5 minutes, that's 18 hours of processing - and that doesn't
> include overhead between iterations.
>
> David

-- 
-jake
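P.S. The stopping rule described above (train a few iterations at a time, stop when held-out perplexity levels off or turns back up) can be sketched roughly like this. This is just an illustration: `train_iterations` and `heldout_perplexity` are hypothetical stand-ins for whatever your LDA driver exposes, not Mahout's actual API.

```python
def run_until_converged(train_iterations, heldout_perplexity,
                        step=5, max_iters=50, tol=1e-3):
    """Run LDA `step` iterations at a time; stop when held-out perplexity
    rises (likely overfitting) or its relative improvement drops below
    `tol` (leveled off). Returns (iterations_run, last_perplexity)."""
    prev = None
    done = 0
    p = None
    while done < max_iters:
        train_iterations(step)       # e.g. resume/continue the training job
        done += step
        p = heldout_perplexity()     # perplexity on the held-out test split
        if prev is not None:
            if p >= prev:            # got worse: past the convergence point
                break
            if (prev - p) / prev < tol:   # improvement negligible: converged
                break
        prev = p
    return done, p
```

With the first row of the table above fed in as the perplexity sequence, this would stop around iteration 30, since the last two values (15.03, 15.028) differ by well under 0.1%.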
