[scikit-learn] LatentDirichletAllocation failing to find topics in NLTK Gutenberg corpus?

2017-09-14 Thread Markus Konrad
Hi there, I'm trying out sklearn's latent Dirichlet allocation implementation for topic modeling. The code from the official example [1] works just fine and the extracted topics look reasonable. However, when I try other corpora, for example the Gutenberg corpus from NLTK, most of the extracted

Re: [scikit-learn] LatentDirichletAllocation failing to find topics in NLTK Gutenberg corpus?

2017-09-18 Thread Markus Konrad
21 27 28]
> [ 2 14 15 17 21 22 27 28]
> [15 22]
> [ 8 11]
> [8]
> [ 8 24]
> [ 2 14 15 22]
>
> and my full test scripts are here:
> https://gist.github.com/chyikwei/1707b59e009d381e1ce1e7a38f9c7826
>
> Best,
> Chyi-Kwei
>
>
> On Thu, Sep 14, 2017 at 7:26 AM Markus Konra

Re: [scikit-learn] LatentDirichletAllocation failing to find topics in NLTK Gutenberg corpus?

2017-09-19 Thread Markus Konrad
ameter choices might need to be different for Gibbs
> sampling vs variational inference.
>
> On 09/18/2017 12:26 PM, Markus Konrad wrote:
>> Hi Chyi-Kwei,
>>
>> thanks for digging into this. I made similar observations with Gensim
>> when using only a small number o

Re: [scikit-learn] LatentDirichletAllocation failing to find topics in NLTK Gutenberg corpus?

2017-09-20 Thread Markus Konrad
I'm actually surprised the Gibbs sampling gave useful results with so
> little data.
> And splitting the documents results in very different data. It has a lot
> more information.
> How many topics did you use?
>
> Also: PR for docs welcome!
>
> On 09/19/2017 04:26 AM
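
The point about splitting documents can be illustrated with a minimal sketch. It assumes the corpus is already tokenized and simply cuts each long text into consecutive fixed-size chunks, so that the model sees many short documents instead of a few very long ones. The function names, the `#` label format, the toy corpus, and the chunk size of 8 are all hypothetical and not taken from the thread.

```python
from typing import Dict, List


def split_into_chunks(tokens: List[str], chunk_size: int) -> List[List[str]]:
    """Split one tokenized document into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def split_corpus(docs: Dict[str, List[str]], chunk_size: int) -> Dict[str, List[str]]:
    """Return a new corpus whose entries are chunks labeled '<doc id>#<chunk index>'."""
    chunks: Dict[str, List[str]] = {}
    for doc_id, tokens in docs.items():
        for n, chunk in enumerate(split_into_chunks(tokens, chunk_size)):
            chunks[f"{doc_id}#{n}"] = chunk
    return chunks


# Toy stand-in for a corpus made of a few long texts (hypothetical data).
corpus = {
    "book_a": ("whale sea ship captain " * 5).split(),       # 20 tokens
    "book_b": ("garden house sister letter " * 3).split(),   # 12 tokens
}
chunked = split_corpus(corpus, chunk_size=8)
print(len(corpus), "documents ->", len(chunked), "chunks")
```

In practice one might split on paragraph boundaries rather than fixed token counts; the fixed-size split is used here only because it needs no assumptions about the text layout.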

Re: [scikit-learn] LatentDirichletAllocation failing to find topics in NLTK Gutenberg corpus?

2017-09-20 Thread Markus Konrad
Sorry, I meant of course "the number that *maximized* the log likelihood" in the first sentence...

On 09/20/2017 09:18 AM, Markus Konrad wrote:
> I tried it with 12 topics (that's the number that minimized the log
> likelihood) and there were also some very general

[scikit-learn] Using perplexity from LatentDirichletAllocation for cross validation of Topic Models

2017-10-04 Thread Markus Konrad
Hi there, I'm trying to find the optimal number of topics for topic modeling with Latent Dirichlet Allocation. I implemented a 5-fold cross validation method similar to the one described and implemented in R here [1]. I basically split the full data into 5 equal-sized chunks. Then for each fold (`
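
A cross-validation scheme of this kind might look roughly like the sketch below. It is not the poster's script: it assumes scikit-learn's `KFold` for the splits and `LatentDirichletAllocation.perplexity` for scoring the held-out fold, and the toy documents, the candidate topic counts, `max_iter=10`, and the helper name `cv_perplexity` are all hypothetical. Lower mean held-out perplexity is treated as better.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold


def cv_perplexity(X, n_topics, n_folds=5, seed=0):
    """Mean held-out perplexity of an LDA model with n_topics over n_folds folds."""
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in folds.split(X):
        lda = LatentDirichletAllocation(
            n_components=n_topics,
            learning_method="batch",
            max_iter=10,          # kept small for the toy data (assumption)
            random_state=seed,
        )
        lda.fit(X[train_idx])                        # fit on the training folds
        scores.append(lda.perplexity(X[test_idx]))   # score the held-out fold
    return float(np.mean(scores))


# Hypothetical toy corpus; a real run would use the full document-term matrix.
docs = [
    "whale sea ship captain harpoon",
    "ship sea voyage captain crew",
    "sea whale hunt ship ocean",
    "captain crew ship storm sea",
    "ocean whale harpoon boat sea",
    "garden house sister letter marriage",
    "sister marriage ball house family",
    "letter family house garden visit",
    "marriage family sister visit ball",
    "house garden family letter sister",
]
X = CountVectorizer().fit_transform(docs)  # sparse document-term counts

results = {k: cv_perplexity(X, k) for k in (2, 3, 4)}
best_k = min(results, key=results.get)     # lowest mean perplexity
print(results, "best:", best_k)
```

The vocabulary is built on all documents before splitting so that train and test folds share the same columns. As a later reply in this thread points out, perplexity might not be a good objective for choosing topic models, so the selected value is best read as one diagnostic among several.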

Re: [scikit-learn] Using perplexity from LatentDirichletAllocation for cross validation of Topic Models

2017-10-11 Thread Markus Konrad
Hi again,

> just a note that if you're using this for topic modelling, perplexity might
> not be a good choice of objective function. others have been proposed. see
> the diagnostic functions for MALLET topic modelling for instance.

unfortunately I don't find any of these methods implemented in P