Re: [Scikit-learn-general] Latent Dirichlet Allocation

2016-02-09 Thread Manjush Vundemodalu
I think most of your words are being filtered out of the tf matrix by the
condition min_df=0.05.

I faced a similar problem while working with chat data; using min_df=2 (an
absolute document count) instead of a float proportion fixed it. A rough
sketch is below.
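Something like the following, assuming a CountVectorizer pipeline (the toy
corpus and variable names are mine, not from the thread):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["a short chat message", "another short message", "one more chat line"]

    # min_df=2 keeps terms that appear in at least 2 documents (absolute count);
    # a proportion like min_df=0.05 can wipe out almost the whole vocabulary
    # when documents are short and the vocabulary is sparse.
    vectorizer = CountVectorizer(min_df=2, max_df=0.95)
    X = vectorizer.fit_transform(docs)
    print(len(vectorizer.vocabulary_), "terms kept")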

Regards,
Manjush



On Wed, Jan 27, 2016 at 4:31 AM, Rockenkamm, Christian <
c.rockenk...@stud.uni-goettingen.de> wrote:

> I used several datasets with between 2200 and 3500 distinct words in the tf
> matrix for training the LDA. The data are lemmatized before applying the
> CountVectorizer.
> --
> *From:* Joel Nothman [joel.noth...@gmail.com]
> *Sent:* Tuesday, 26 January 2016 23:35
> *To:* scikit-learn-general
> *Subject:* Re: [Scikit-learn-general] Latent Dirichlet Allocation
>
> How many distinct words are in your dataset?
>
> On 27 January 2016 at 00:21, Rockenkamm, Christian <
> c.rockenk...@stud.uni-goettingen.de> wrote:
>
>> Hello,
>>
>>
>> I have a question concerning Latent Dirichlet Allocation. The results I
>> get from using it are a bit confusing.
>>
>> Initially, I use about 3000 documents. For the preparation with the
>> CountVectorizer I use the following parameters: max_df=0.95 and
>> min_df=0.05.
>>
>> For the LDA fit I use the batch learning method. For the other parameters
>> I have tried many different values. However, regardless of the
>> configuration, I face one common problem: I get topics that are never used
>> in any of the documents, and those topics all show the same structure
>> (topic-word distribution). I even tried gensim with the same configuration
>> as scikit-learn, yet I still encountered this problem. I also tried
>> lowering the number of topics in the model, but this did not lead to the
>> expected results either: with 100 topics, 20-27 were still affected by
>> this problem; with 50 topics, 2-8 were still affected, depending on the
>> parameter setting.
>>
>> Does anybody have an idea as to what might be causing this problem and
>> how to resolve it?
>>
>>
>> Best regards,
>>
>> Christian Rockenkamm
>>
>>


Re: [Scikit-learn-general] Latent Dirichlet Allocation

2016-02-09 Thread Vlad Niculae
I usually use an absolute threshold for min_df and a relative one for
max_df. I find it very useful to look at the histogram of word document
frequencies (dfs) when choosing the latter; it varies a lot from dataset to
dataset. For short texts like tweets, words such as "the" can have a df of
only 0.1.

It's very easy to look at the dfs: just get a transformed X out of your
vectorizer and do:

>>> import numpy as np
>>> df = (X > 0).sum(axis=0)            # number of documents containing each term
>>> df = df.A.ravel().astype(np.double)
>>> df /= X.shape[0]                    # fraction of documents containing each term
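
To eyeball that histogram of dfs, a rough sketch following on from the
snippet above (the plotting choices are arbitrary):

>>> import matplotlib.pyplot as plt
>>> plt.hist(df, bins=50)               # df as computed above
>>> plt.xlabel("document frequency")
>>> plt.ylabel("number of terms")
>>> plt.show()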


My 2c,
Vlad

On Tue, Feb 9, 2016 at 3:05 AM, Manjush Vundemodalu <manjus...@gmail.com> wrote:
> I think most of your words are being filtered out of the tf matrix by the
> condition min_df=0.05.
>
> I faced a similar problem while working with chat data; using min_df=2 (an
> absolute document count) instead of a float proportion fixed it.
>
> Regards,
> Manjush
>
>
>
> On Wed, Jan 27, 2016 at 4:31 AM, Rockenkamm, Christian
> <c.rockenk...@stud.uni-goettingen.de> wrote:
>>
>> I used several datasets with between 2200 and 3500 distinct words in the
>> tf matrix for training the LDA. The data are lemmatized before applying
>> the CountVectorizer.
>> 
>> From: Joel Nothman [joel.noth...@gmail.com]
>> Sent: Tuesday, 26 January 2016 23:35
>> To: scikit-learn-general
>> Subject: Re: [Scikit-learn-general] Latent Dirichlet Allocation
>>
>> How many distinct words are in your dataset?
>>
>> On 27 January 2016 at 00:21, Rockenkamm, Christian
>> <c.rockenk...@stud.uni-goettingen.de> wrote:
>>>
>>> Hello,
>>>
>>>
>>> I have a question concerning Latent Dirichlet Allocation. The results I
>>> get from using it are a bit confusing.
>>>
>>> Initially, I use about 3000 documents. For the preparation with the
>>> CountVectorizer I use the following parameters: max_df=0.95 and
>>> min_df=0.05.
>>>
>>> For the LDA fit I use the batch learning method. For the other parameters
>>> I have tried many different values. However, regardless of the
>>> configuration, I face one common problem: I get topics that are never
>>> used in any of the documents, and those topics all show the same structure
>>> (topic-word distribution). I even tried gensim with the same configuration
>>> as scikit-learn, yet I still encountered this problem. I also tried
>>> lowering the number of topics in the model, but this did not lead to the
>>> expected results either: with 100 topics, 20-27 were still affected by
>>> this problem; with 50 topics, 2-8 were still affected, depending on the
>>> parameter setting.
>>>
>>> Does anybody have an idea as to what might be causing this problem and
>>> how to resolve it?
>>>
>>>
>>> Best regards,
>>>
>>> Christian Rockenkamm
>>>
>>>
>>>

Re: [Scikit-learn-general] Latent Dirichlet Allocation

2016-01-26 Thread Andreas Mueller

Hi Christian.
Can you provide the data and code to reproduce?
Best,
Andy

On 01/26/2016 08:21 AM, Rockenkamm, Christian wrote:


Hello,


I have a question concerning Latent Dirichlet Allocation. The 
results I get from using it are a bit confusing.


Initially, I use about 3000 documents. For the preparation with the 
CountVectorizer I use the following parameters: max_df=0.95 and 
min_df=0.05.


For the LDA fit I use the batch learning method. For the other 
parameters I have tried many different values. However, regardless of 
the configuration, I face one common problem: I get topics that are 
never used in any of the documents, and those topics all show the same 
structure (topic-word distribution). I even tried gensim with the same 
configuration as scikit-learn, yet I still encountered this problem. I 
also tried lowering the number of topics in the model, but this did 
not lead to the expected results either: with 100 topics, 20-27 were 
still affected by this problem; with 50 topics, 2-8 were still 
affected, depending on the parameter setting.


Does anybody have an idea as to what might be causing this problem and 
how to resolve it?



Best regards,

Christian Rockenkamm





Re: [Scikit-learn-general] Latent Dirichlet Allocation

2016-01-26 Thread Joel Nothman
How many distinct words are in your dataset?
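A quick way to check, assuming X is the matrix returned by
CountVectorizer.fit_transform (the variable names are placeholders):

    print("distinct words in the tf matrix:", X.shape[1])
    # equivalently: print(len(vectorizer.vocabulary_))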

On 27 January 2016 at 00:21, Rockenkamm, Christian <
c.rockenk...@stud.uni-goettingen.de> wrote:

> Hello,
>
>
> I have a question concerning Latent Dirichlet Allocation. The results I
> get from using it are a bit confusing.
>
> Initially, I use about 3000 documents. For the preparation with the
> CountVectorizer I use the following parameters: max_df=0.95 and
> min_df=0.05.
>
> For the LDA fit I use the batch learning method. For the other parameters I
> have tried many different values. However, regardless of the configuration,
> I face one common problem: I get topics that are never used in any of the
> documents, and those topics all show the same structure
> (topic-word distribution). I even tried gensim with the same configuration
> as scikit-learn, yet I still encountered this problem. I also tried lowering
> the number of topics in the model, but this did not lead to the expected
> results either: with 100 topics, 20-27 were still affected by this problem;
> with 50 topics, 2-8 were still affected, depending on the parameter setting.
>
> Does anybody have an idea as to what might be causing this problem and how
> to resolve it?
>
>
> Best regards,
>
> Christian Rockenkamm
>
>


Re: [Scikit-learn-general] Latent Dirichlet Allocation

2016-01-26 Thread Rockenkamm, Christian
I used several datasets with between 2200 and 3500 distinct words in the tf 
matrix for training the LDA. The data are lemmatized before applying the 
CountVectorizer, roughly as in the sketch below.
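A minimal sketch of that preprocessing step, using NLTK's WordNetLemmatizer
as a stand-in (the actual lemmatizer and tokenization used here may differ):

    import re
    from nltk.stem import WordNetLemmatizer  # requires nltk.download("wordnet")
    from sklearn.feature_extraction.text import CountVectorizer

    lemmatizer = WordNetLemmatizer()

    def lemma_tokenizer(text):
        # crude word tokenization, then lemmatize each token
        return [lemmatizer.lemmatize(tok) for tok in re.findall(r"\b\w+\b", text)]

    vectorizer = CountVectorizer(tokenizer=lemma_tokenizer,
                                 max_df=0.95, min_df=0.05)
    tf = vectorizer.fit_transform(documents)  # documents: list of raw strings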

From: Joel Nothman [joel.noth...@gmail.com]
Sent: Tuesday, 26 January 2016 23:35
To: scikit-learn-general
Subject: Re: [Scikit-learn-general] Latent Dirichlet Allocation

How many distinct words are in your dataset?

On 27 January 2016 at 00:21, Rockenkamm, Christian 
<c.rockenk...@stud.uni-goettingen.de> wrote:
Hello,

I have a question concerning Latent Dirichlet Allocation. The results I get 
from using it are a bit confusing.
Initially, I use about 3000 documents. For the preparation with the 
CountVectorizer I use the following parameters: max_df=0.95 and min_df=0.05.
For the LDA fit I use the batch learning method. For the other parameters I 
have tried many different values. However, regardless of the configuration, I 
face one common problem: I get topics that are never used in any of the 
documents, and those topics all show the same structure 
(topic-word distribution). I even tried gensim with the same configuration as 
scikit-learn, yet I still encountered this problem. I also tried lowering the 
number of topics in the model, but this did not lead to the expected results 
either: with 100 topics, 20-27 were still affected by this problem; with 50 
topics, 2-8 were still affected, depending on the parameter setting.
Does anybody have an idea as to what might be causing this problem and how to 
resolve it?

Best regards,
Christian Rockenkamm



Re: [Scikit-learn-general] Latent Dirichlet Allocation topic-word-matrix and the document-topic-matrix

2015-12-08 Thread Andreas Mueller

Hi Christian.
The document-topic matrix is lda.transform(X); the topic-word matrix is 
lda.components_.
See 
http://scikit-learn.org/dev/modules/decomposition.html#latent-dirichlet-allocation-lda


"When LatentDirichletAllocation 
 
is applied on a “document-term” matrix, the matrix will be decomposed 
into a “topic-term” matrix and a “document-topic” matrix. While 
“topic-term” matrix is stored as components_ in the model, 
“document-topic” matrix can be calculated from transform method."
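
A minimal end-to-end sketch (the toy corpus and parameter values are mine;
note that n_topics was renamed n_components in later scikit-learn versions):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["the cat sat on the mat",
            "dogs and cats are common pets",
            "stock markets fell sharply today"]

    X = CountVectorizer().fit_transform(docs)
    lda = LatentDirichletAllocation(n_topics=2, learning_method="batch",
                                    random_state=0)

    doc_topic = lda.fit_transform(X)  # document-topic matrix, shape (n_docs, n_topics)
    topic_word = lda.components_      # topic-word matrix, shape (n_topics, n_words)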


On 12/08/2015 11:04 AM, Rockenkamm, Christian wrote:

Hello,

I have a short question concerning the Latent Dirichlet Allocation in 
scikit-learn. Is it possible to obtain the topic-word matrix and the 
document-topic matrix? If so, could someone please explain to me how 
to do that?


Best regards,
Christian Rockenkamm

