I've been running some performance tests with the LDA algorithm and I'm unsure how to gauge them. I ran 10 iterations each time and collected the perplexity value every 2 iterations, with the test fraction set to 0.1. These were all run on an AWS cluster with 10 nodes (70 mappers, 30 reducers); I'm not sure about the memory or CPU specs. I also stored the documents on HDFS in 1MB blocks to get some parallelization. The documents were very short, 10-100 words each. Hopefully these results are clear.
Document Count | Corpus Size (MB) | Topic Count | Perplexity (every 2 iterations) | Dictionary Size | Runtime (min/iteration)
40,044  |  3.2 | 10 | 16.326, 15.418, 15.191, 15.088, 15.028 | 14,097  |  1.5
40,044  |  3.2 | 20 | 26.461, 24.517, 23.996, 23.805, 23.882 | 14,097  |  6
40,044  |  3.2 | 40 | 19.722, 18.185, 17.823, 17.680, 17.608 | 14,097  |  11.5
40,046  |  3.7 | 10 | 19.286, 18.373, 18.092, 17.958, 17.865 | 98,283  |  5.5
40,046  |  3.7 | 20 | 18.574, 17.448, 17.143, 17.018, 16.940 | 98,283  |  10.5
44,767  |  4   | 10 | 19.928, 18.815, 18.521, 18.350, 18.225 | 31,727  |  2.5
44,767  |  4   | 20 | 21.838, 20.421, 20.087, 19.963, 19.903 | 31,727  |  4.5
616,957 | 58.5 | 10 | 14.467, 13.830, 13.583, 13.435, 13.381 | 151,807 |  8.5
616,957 | 58.5 | 20 | 13.590, 12.787, 12.605, 12.522, 12.476 | 151,807 |  16
616,972 | 58.4 | 10 | 14.646, 13.904, 13.646, 13.573, 13.543 | 54,280  |  4
616,967 | 54.1 | 10 | 13.363, 12.634, 12.432, 12.345, 12.283 | 32,101  |  2.5
616,967 | 54.1 | 20 | 13.195, 12.307, 12.065, 11.764, 11.732 | 32,101  |  4.5

The question is how to interpret these results. In particular:

- Is there anything that tells me when to stop running LDA? I've tried running until convergence, but I've never had the patience to see it finish.
- Does the perplexity give some hint about the quality of the results?

In attempting to reach convergence, I saw runs going to 200 iterations. If an iteration takes around 5.5 minutes, that's roughly 18 hours of processing, and that doesn't include the overhead between iterations.

David
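P.S. To make the "when to stop" question concrete, below is a rough sketch (Python, purely illustrative; the tolerance value is made up) of the kind of relative-change stopping rule I have in mind. Is something along these lines a sensible way to decide when to quit iterating, or is there a better criterion?

def has_converged(perplexities, rel_tol=1e-3):
    # perplexities: perplexity values in evaluation order (here, one every 2 iterations).
    # rel_tol: hypothetical threshold; I have no idea what a sensible value is.
    if len(perplexities) < 2:
        return False
    prev, curr = perplexities[-2], perplexities[-1]
    return (prev - curr) / prev < rel_tol

# Using the 10-topic, 40,044-document run from the table above:
history = [16.326, 15.418, 15.191, 15.088, 15.028]
print(has_converged(history))  # last drop is ~0.4%, so still False at a 0.1% tolerance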
