Is there a rule of thumb for determining "leveling off" of perplexity? Is this
value controlled by the convergence delta?
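
To make the question concrete, below is a rough sketch of the kind of "leveling off" check I have in mind. The class name, the 1% relative-improvement threshold, and the two-checkpoint patience are all just illustration - this is not how Mahout's convergence delta works, as far as I know.

import java.util.Arrays;
import java.util.List;

/**
 * Rough "leveling off" check: treat a run as converged once the relative
 * drop in held-out perplexity stays below a small threshold for a few
 * consecutive checkpoints. Threshold and patience are arbitrary choices.
 */
public class PerplexityPlateau {

  static boolean hasLeveledOff(List<Double> perplexities, double relTol, int patience) {
    int calmStretch = 0;
    for (int i = 1; i < perplexities.size(); i++) {
      double prev = perplexities.get(i - 1);
      double curr = perplexities.get(i);
      double relDrop = (prev - curr) / prev;        // fractional improvement at this checkpoint
      calmStretch = (relDrop < relTol) ? calmStretch + 1 : 0;
      if (calmStretch >= patience) {
        return true;                                // improvement has stalled long enough
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Perplexity sampled every 2 iterations for the 40,044-doc / 10-topic run below.
    List<Double> run = Arrays.asList(16.326, 15.418, 15.191, 15.088, 15.028);
    System.out.println(hasLeveledOff(run, 0.01, 2)); // true: the last two drops are under 1%
  }
}

The idea is to look for a sustained flat stretch rather than a single small drop, since one noisy checkpoint can look flat before the run has actually settled.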
Sorry about the table view - I've reformatted it below using plain spaces.
Document Count  Corpus Size (MB)  Topic Count  Perplexity (every 2 iterations)          Dictionary Size  Runtime (min/iteration)
40,044          3.2               10           16.326, 15.418, 15.191, 15.088, 15.028   14,097           1.5
40,044          3.2               20           26.461, 24.517, 23.996, 23.805, 23.882   14,097           6
40,044          3.2               40           19.722, 18.185, 17.823, 17.680, 17.608   14,097           11.5
40,046          3.7               10           19.286, 18.373, 18.092, 17.958, 17.865   98,283           5.5
40,046          3.7               20           18.574, 17.448, 17.143, 17.018, 16.940   98,283           10.5
44,767          4                 10           19.928, 18.815, 18.521, 18.350, 18.225   31,727           2.5
44,767          4                 20           21.838, 20.421, 20.087, 19.963, 19.903   31,727           4.5
616,957         58.5              10           14.467, 13.830, 13.583, 13.435, 13.381   151,807          8.5
616,957         58.5              20           13.590, 12.787, 12.605, 12.522, 12.476   151,807          16
616,972         58.4              10           14.646, 13.904, 13.646, 13.573, 13.543   54,280           4
616,967         54.1              10           13.363, 12.634, 12.432, 12.345, 12.283   32,101           2.5
616,967         54.1              20           13.195, 12.307, 12.065, 11.764, 11.732   32,101           4.5
On Feb 21, 2013, at 12:00 PM, Jake Mannix <[email protected]> wrote:
> I really can't read your results here - the formatting of your columns is
> pretty destroyed... it looks like you've got results for 20 topics, as
> well as for 10, with different-sized corpora?
>
> You can't compare convergence across corpus sizes - the perplexity will
> vary by orders of magnitude between them. The only thing worth comparing
> is, for a single fixed corpus, what the (held-out) perplexity looks like
> after 5, 10, 15, 20, ... iterations. Does it start to level off? At some
> point you may start overfitting and the perplexity will go back up. Your
> convergence happened before that.
>
> I don't think I've ever needed to run more than 50 iterations, and usually
> I stop after 20-30. The bigger the corpus, the more this becomes true.
>
>
> On Thu, Feb 21, 2013 at 6:45 AM, David LaBarbera <
> [email protected]> wrote:
>
>> I've been running some performance tests with the LDA algorithm and I'm
>> unsure how to gauge them. I ran 10 iterations each time and collected the
>> perplexity value every 2 iterations with the test fraction set to 0.1. These
>> were all run on an AWS cluster with 10 nodes (70 mappers, 30 reducers). I'm
>> not sure about the memory or CPU specs. I also stored the documents on HDFS
>> in 1MB blocks to get some parallelization. The documents I have are very
>> short - 10-100 words each. Hopefully these results are clear.
>>
>> Document Count  Corpus Size (MB)  Topic Count  Perplexity (every 2 iterations)          Dictionary Size  Runtime (min/iteration)
>> 40,044          3.2               10           16.326, 15.418, 15.191, 15.088, 15.028   14,097           1.5
>> 40,044          3.2               20           26.461, 24.517, 23.996, 23.805, 23.882   14,097           6
>> 40,044          3.2               40           19.722, 18.185, 17.823, 17.680, 17.608   14,097           11.5
>> 40,046          3.7               10           19.286, 18.373, 18.092, 17.958, 17.865   98,283           5.5
>> 40,046          3.7               20           18.574, 17.448, 17.143, 17.018, 16.940   98,283           10.5
>> 44,767          4                 10           19.928, 18.815, 18.521, 18.350, 18.225   31,727           2.5
>> 44,767          4                 20           21.838, 20.421, 20.087, 19.963, 19.903   31,727           4.5
>> 616,957         58.5              10           14.467, 13.830, 13.583, 13.435, 13.381   151,807          8.5
>> 616,957         58.5              20           13.590, 12.787, 12.605, 12.522, 12.476   151,807          16
>> 616,972         58.4              10           14.646, 13.904, 13.646, 13.573, 13.543   54,280           4
>> 616,967         54.1              10           13.363, 12.634, 12.432, 12.345, 12.283   32,101           2.5
>> 616,967         54.1              20           13.195, 12.307, 12.065, 11.764, 11.732   32,101           4.5
>>
>> The question is how to interpret the results. In particular, is there
>> anything telling me when to stop running LDA? I've tried running until
>> convergence, but I've never had the patience to see it finish. Does the
>> perplexity give some hint about the quality of the results? In attempting to
>> reach convergence, I saw runs going to 200 iterations. If an iteration
>> takes around 5.5 minutes, that's about 18 hours of processing - and that
>> doesn't include overhead between iterations.
>>
>> David
>
> --
>
> -jake