Is there a rule of thumb for determining "leveling off" of perplexity? Is this 
value controlled by the convergence delta? (I've put a sketch of the kind of 
check I mean after the table below.)

Sorry for the table view earlier. I've reformatted it below with plain spaces.

Document Count  Corpus Size (MB)  Topic Count  Perplexity (every 2 iterations)          Dictionary Size  Runtime (min/iteration)
40,044          3.2               10           16.326, 15.418, 15.191, 15.088, 15.028   14,097           1.5
40,044          3.2               20           26.461, 24.517, 23.996, 23.805, 23.882   14,097           6
40,044          3.2               40           19.722, 18.185, 17.823, 17.680, 17.608   14,097           11.5

40,046          3.7               10           19.286, 18.373, 18.092, 17.958, 17.865   98,283           5.5
40,046          3.7               20           18.574, 17.448, 17.143, 17.018, 16.940   98,283           10.5

44,767          4                 10           19.928, 18.815, 18.521, 18.350, 18.225   31,727           2.5
44,767          4                 20           21.838, 20.421, 20.087, 19.963, 19.903   31,727           4.5

616,957         58.5              10           14.467, 13.830, 13.583, 13.435, 13.381   151,807          8.5
616,957         58.5              20           13.590, 12.787, 12.605, 12.522, 12.476   151,807          16

616,972         58.4              10           14.646, 13.904, 13.646, 13.573, 13.543   54,280           4

616,967         54.1              10           13.363, 12.634, 12.432, 12.345, 12.283   32,101           2.5
616,967         54.1              20           13.195, 12.307, 12.065, 11.764, 11.732   32,101           4.5
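
For the first question, the rule of thumb I was imagining is a simple
relative-change check on the held-out perplexity, combined with watching for
it to go back up (Jake's overfitting point below). Here's a rough sketch in
plain Python - the series is the 10-topic run on the 3.2 MB corpus from the
table, and the 1% cutoff is just a guess on my part, not a Mahout default:

    # Rough sketch: call it "leveled off" when the relative drop in held-out
    # perplexity between successive measurements falls below a cutoff, and
    # treat any increase as a possible sign of overfitting.

    # 10-topic run on the 3.2 MB corpus, sampled at iterations 2, 4, 6, 8, 10
    perplexities = [16.326, 15.418, 15.191, 15.088, 15.028]
    cutoff = 0.01  # 1% relative improvement; an assumed value, not a Mahout default

    for i in range(1, len(perplexities)):
        prev, curr = perplexities[i - 1], perplexities[i]
        iteration = 2 * (i + 1)
        if curr > prev:
            print("iteration %d: perplexity went up -- possible overfitting" % iteration)
            break
        rel_drop = (prev - curr) / prev
        if rel_drop < cutoff:
            print("iteration %d: relative drop %.4f < cutoff -- leveled off" % (iteration, rel_drop))
            break

On that series the check fires at iteration 8 (the 15.191 -> 15.088 drop is
only about 0.7%). Interestingly, the 20-topic run on the same corpus actually
ticks back up at the last measurement (23.805 -> 23.882), which looks like the
overfitting symptom Jake describes.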



On Feb 21, 2013, at 12:00 PM, Jake Mannix <[email protected]> wrote:

> I really can't read your results here; the formatting of your columns is
> pretty destroyed...  It looks like you've got results for 20 topics, as
> well as for 10, with different sized corpora?
> 
> You can't compare convergence between corpus sizes - the perplexity will
> vary by an order of magnitude between them.  The only thing you should be
> comparing is that for a single fixed corpus, as you run it for 5, 10, 15,
> 20,... iterations, what does the (held-out) perplexity look like after each
> of these?  Does it start to level off?  At some point you may start
> overfitting and having the perplexity go back up.  Your convergence
> happened before that.
> 
> I don't think I've ever needed to run more than 50 iterations, and usually
> I stop after 20-30.  The bigger the corpus, the more this becomes true.
> 
> 
> On Thu, Feb 21, 2013 at 6:45 AM, David LaBarbera <
> [email protected]> wrote:
> 
>> I've been running some performance tests with the LDA algorithm and I'm
>> unsure how to gauge them. I ran 10 iterations each time and collected the
>> perplexity value every 2 iterations with the test fraction set to 0.1. These
>> were all run on an AWS cluster with 10 nodes (70 mappers, 30 reducers). I'm
>> not sure about the memory or CPU specs. I also stored the documents on HDFS
>> in 1MB blocks to get some parallelization. The documents I have were very
>> short - 10-100 words each.  Hopefully these results are clear.
>> 
>> Document Count  Corpus Size (MB)  Topic Count  Perplexity                               Dictionary Size  Runtime (min/iteration)
>> 40,044          3.2               10           16.326, 15.418, 15.191, 15.088, 15.028   14,097           1.5
>> 40,044          3.2               20           26.461, 24.517, 23.996, 23.805, 23.882   14,097           6
>> 40,044          3.2               40           19.722, 18.185, 17.823, 17.680, 17.608   14,097           11.5
>>
>> 40,046          3.7               10           19.286, 18.373, 18.092, 17.958, 17.865   98,283           5.5
>> 40,046          3.7               20           18.574, 17.448, 17.143, 17.018, 16.940   98,283           10.5
>>
>> 44,767          4                 10           19.928, 18.815, 18.521, 18.350, 18.225   31,727           2.5
>> 44,767          4                 20           21.838, 20.421, 20.087, 19.963, 19.903   31,727           4.5
>>
>> 616,957         58.5              10           14.467, 13.830, 13.583, 13.435, 13.381   151,807          8.5
>> 616,957         58.5              20           13.590, 12.787, 12.605, 12.522, 12.476   151,807          16
>>
>> 616,972         58.4              10           14.646, 13.904, 13.646, 13.573, 13.543   54,280           4
>>
>> 616,967         54.1              10           13.363, 12.634, 12.432, 12.345, 12.283   32,101           2.5
>> 616,967         54.1              20           13.195, 12.307, 12.065, 11.764, 11.732   32,101           4.5
>> 
>> The question is how to interpret the results. In particular, is there
>> anything telling me when to stop running LDA? I've tried running until
>> convergence, but I've never had the patience to see it finish. Does the
>> perplexity give some hint about the quality of the results? In attempting to
>> reach convergence, I saw runs going to 200 iterations. If an iteration
>> takes around 5.5 minutes, that's 18 hours of processing - and that doesn't
>> include overhead between iterations.
>> 
>> David
> 
> 
> 
> 
> -- 
> 
>  -jake
