I've been running some performance tests with the LDA algorithm and I'm unsure how to gauge them. I ran 10 iterations each time and collected the perplexity value every 2 iterations, with the test fraction set to 0.1. These were all run on an AWS cluster with 10 nodes (70 mappers, 30 reducers); I'm not sure about the memory or CPU specs. I also stored the documents on HDFS in 1MB blocks to get some parallelization. The documents were very short, 10-100 words each. Hopefully these results are clear.
Document Count | Corpus Size (MB) | Topic Count | Perplexity (every 2 iterations) | Dictionary Size | Runtime (min/iteration)
40,044  |  3.2 | 10 | 16.326, 15.418, 15.191, 15.088, 15.028 | 14,097  |  1.5
40,044  |  3.2 | 20 | 26.461, 24.517, 23.996, 23.805, 23.882 | 14,097  |  6
40,044  |  3.2 | 40 | 19.722, 18.185, 17.823, 17.680, 17.608 | 14,097  |  11.5
40,046  |  3.7 | 10 | 19.286, 18.373, 18.092, 17.958, 17.865 | 98,283  |  5.5
40,046  |  3.7 | 20 | 18.574, 17.448, 17.143, 17.018, 16.940 | 98,283  |  10.5
44,767  |  4   | 10 | 19.928, 18.815, 18.521, 18.350, 18.225 | 31,727  |  2.5
44,767  |  4   | 20 | 21.838, 20.421, 20.087, 19.963, 19.903 | 31,727  |  4.5
616,957 | 58.5 | 10 | 14.467, 13.830, 13.583, 13.435, 13.381 | 151,807 |  8.5
616,957 | 58.5 | 20 | 13.590, 12.787, 12.605, 12.522, 12.476 | 151,807 |  16
616,972 | 58.4 | 10 | 14.646, 13.904, 13.646, 13.573, 13.543 | 54,280  |  4
616,967 | 54.1 | 10 | 13.363, 12.634, 12.432, 12.345, 12.283 | 32,101  |  2.5
616,967 | 54.1 | 20 | 13.195, 12.307, 12.065, 11.764, 11.732 | 32,101  |  4.5

The question is how to interpret these results. In particular:

- Is there anything that tells me when to stop running LDA? I've tried running until convergence, but I've never had the patience to see it finish.
- Does the perplexity give some hint about the quality of the results?

In attempting to reach convergence, I saw runs going to 200 iterations. If an iteration takes around 5.5 minutes, that's roughly 18 hours of processing, and that doesn't include the overhead between iterations.

David
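P.S. To make the "when to stop" question concrete, below is a rough sketch (Python, purely illustrative; the tolerance value is made up) of the kind of relative-change stopping rule I have in mind. Is something along these lines a sensible way to decide when to quit iterating, or is there a better criterion?

def has_converged(perplexities, rel_tol=1e-3):
    # perplexities: perplexity values in evaluation order (here, one every 2 iterations).
    # rel_tol: hypothetical threshold; I have no idea what a sensible value is.
    if len(perplexities) < 2:
        return False
    prev, curr = perplexities[-2], perplexities[-1]
    return (prev - curr) / prev < rel_tol

# Using the 10-topic, 40,044-document run from the table above:
history = [16.326, 15.418, 15.191, 15.088, 15.028]
print(has_converged(history))  # last drop is ~0.4%, so still False at a 0.1% tolerance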
