Hi there,
I'm trying out sklearn's latent Dirichlet allocation implementation for topic
modeling. The code from the official example [1] works just fine and the
extracted topics look reasonable. However, when I try other corpora, for
example the Gutenberg corpus from NLTK, most of the extracted
> [ … 21 27 28]
> [ 2 14 15 17 21 22 27 28]
> [15 22]
> [ 8 11]
> [8]
> [ 8 24]
> [ 2 14 15 22]
>
> and my full test scripts are here:
> https://gist.github.com/chyikwei/1707b59e009d381e1ce1e7a38f9c7826
>
> Best,
> Chyi-Kwei
>
>
> On Thu, Sep 14, 2017 at 7:26 AM Markus Konrad wrote:
> parameter choices might need to be different for Gibbs
> sampling vs variational inference.
>
> On 09/18/2017 12:26 PM, Markus Konrad wrote:
>> Hi Chyi-Kwei,
>>
>> thanks for digging into this. I made similar observations with Gensim
>> when using only a small number of documents.
> I'm actually surprised the Gibbs sampling gave useful results with so
> little data.
> And splitting the documents results in very different data. It has a lot
> more information.
> How many topics did you use?
>
> Also: PR for docs welcome!
>
> On 09/19/2017 04:26 AM
Sorry, I meant of course "the number that *maximized* the log
likelihood" in the first sentence...
On 09/20/2017 09:18 AM, Markus Konrad wrote:
> I tried it with 12 topics (that's the number that minimized the log
> likelihood) and there were also some very general topics.
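That selection rule (fit several topic counts, keep the one that maximizes the log likelihood) can be sketched with sklearn's `LatentDirichletAllocation` and its `score()` method, which returns an approximate log likelihood. The toy corpus and candidate range below are mine, purely for illustration; in practice you'd score held-out data rather than the training matrix.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative toy corpus, two obvious themes (pets vs. finance).
docs = ["cat dog pet", "dog pet walk", "stock market price",
        "market price trade", "cat pet fur", "stock trade fund"]
X = CountVectorizer().fit_transform(docs)

# Approximate log likelihood for each candidate topic count.
scores = {k: LatentDirichletAllocation(n_components=k, random_state=0)
             .fit(X).score(X)
          for k in (2, 3, 4)}

# Keep the count that maximized (not minimized) the log likelihood.
best_k = max(scores, key=scores.get)
```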
Hi there,
I'm trying to find the optimal number of topics for Topic Modeling with
Latent Dirichlet Allocation. I implemented a 5-fold cross-validation
method similar to the one described and implemented in R here [1]. I
basically split the full data into 5 equal-sized chunks. Then for each
fold (`
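A minimal sketch of that 5-fold scheme with sklearn: train `LatentDirichletAllocation` on four chunks and score the held-out chunk with `perplexity()`, averaging over folds. The toy corpus, the candidate topic counts, and the helper name `mean_heldout_perplexity` are all mine, not from the original scripts.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold

# Illustrative toy corpus with two rough themes.
docs = [
    "cat dog pet animal", "dog pet leash walk", "cat pet fur purr",
    "stock market price trade", "market trade price index",
    "stock price rise fall", "pet cat dog play", "trade market stock fund",
    "dog walk park pet", "price index market fall",
]
X = CountVectorizer().fit_transform(docs)

def mean_heldout_perplexity(X, n_topics, n_splits=5, seed=0):
    """Average held-out perplexity across CV folds for one topic count."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    perplexities = []
    for train_idx, test_idx in kf.split(X):
        lda = LatentDirichletAllocation(n_components=n_topics,
                                        random_state=seed)
        lda.fit(X[train_idx])
        perplexities.append(lda.perplexity(X[test_idx]))
    return float(np.mean(perplexities))

# Lower held-out perplexity = better fit under this criterion.
candidates = [2, 3, 4]
best_k = min(candidates, key=lambda k: mean_heldout_perplexity(X, k))
```

Note that perplexity-based selection has the caveats raised later in this thread; it often favors more topics than a human would find interpretable.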
Hi again,
> Just a note that if you're using this for topic modelling, perplexity might
> not be a good choice of objective function. Others have been proposed; see
> the diagnostic functions for MALLET topic modelling, for instance.
Unfortunately, I don't find any of these methods implemented in Python.
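One of the simpler diagnostics, UMass topic coherence, is easy to sketch by hand even without a library: for a topic's top words, sum log((D(w_i, w_j) + 1) / D(w_j)) over ordered word pairs, where D counts documents containing the word(s). The function and the toy corpus below are my own illustration, not an API from sklearn, Gensim, or MALLET.

```python
import math

def umass_coherence(top_words, docs_tokens):
    """UMass coherence of one topic: sum over ordered pairs (w_i, w_j),
    j < i, of log((D(w_i, w_j) + 1) / D(w_j)), where D is the number of
    documents containing the given word(s). Higher means more coherent."""
    doc_sets = [set(d) for d in docs_tokens]
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            d_j = sum(w_j in s for s in doc_sets)            # D(w_j)
            d_ij = sum(w_i in s and w_j in s for s in doc_sets)  # D(w_i, w_j)
            if d_j > 0:
                score += math.log((d_ij + 1) / d_j)
    return score

docs = [["cat", "dog", "pet"], ["cat", "dog"], ["dog", "pet"],
        ["stock", "market"], ["market", "price"]]

# Word pairs that co-occur in documents score higher than pairs that never do.
coherent = umass_coherence(["cat", "dog"], docs)
mixed = umass_coherence(["cat", "market"], docs)
```

Ranking candidate models by the average coherence of their topics tends to track human judgments better than perplexity, which is presumably the point of the MALLET diagnostics mentioned above.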