Re: [Scikit-learn-general] Combining TFIDF and LDA features

2012-09-14 Thread Peter Prettenhofer
2012/9/14 Philipp Singer :
> Hey!
>
> On 14.09.2012 15:10, Peter Prettenhofer wrote:
>>
>> I totally agree - I had such an issue in my research as well
>> (combining word presence features with SVD embeddings).
>> I followed Blitzer et al. 2006 and normalized** both feature groups
>> separately - e.g. you could normalize word presence features such that
>> L1 norm equals 1 and do the same for the SVD embeddings.
>
> Isn't the normalization already part of the tfidf transformation?
> So basically the word presence tfidf features are already L2 normalized,
> but maybe I misunderstand this completely.

I forgot that your LDA embedding is already L1 normalized (i.e. sums to 1).
So both of your feature groups are already normalized; tf/idf is L2
and LDA is L1.
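Both defaults are easy to verify. A minimal sketch with hypothetical stand-in matrices (`TfidfVectorizer` L2-normalizes rows by default, and LDA rows are topic proportions that sum to 1):

```python
import numpy as np
from sklearn.preprocessing import normalize

rng = np.random.RandomState(0)
# Hypothetical stand-ins for the two feature groups discussed in the thread.
X_tfidf = normalize(rng.rand(5, 20), norm="l2")  # tf/idf-like: unit L2 rows
X_lda = normalize(rng.rand(5, 10), norm="l1")    # LDA-like: rows sum to 1

print(np.allclose(np.linalg.norm(X_tfidf, axis=1), 1.0))  # True
print(np.allclose(X_lda.sum(axis=1), 1.0))                # True
```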

>
>> In my work I had the impression, though, that L1/L2 normalization was
>> inferior to simply scaling the embeddings by a constant alpha such that
>> the average L2 norm is 1 [1].
>
> Ah, I see. How exactly would I do that? Isn't that the same thing the
> normalization technique in scikit-learn is doing?

It's as simple as computing the mean L2 norm and dividing the feature
matrix by that number.
Scaler does this per feature and Normalizer per sample - this instead
computes one normalization constant for the whole feature block.
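A minimal sketch of that constant-alpha rescaling, assuming a hypothetical dense embedding block `X_emb` (the same idea works for sparse matrices):

```python
import numpy as np

rng = np.random.RandomState(0)
X_emb = rng.rand(100, 50)  # hypothetical embedding block (e.g. SVD/LDA features)

# One constant for the whole block: divide by the mean row L2 norm so the
# *average* norm becomes 1 (Normalizer would force every row to norm 1).
alpha = np.linalg.norm(X_emb, axis=1).mean()
X_scaled = X_emb / alpha

print(np.isclose(np.linalg.norm(X_scaled, axis=1).mean(), 1.0))  # True
```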

Since the LDA embedding has intrinsic semantics (each document is
generated from a topic distribution), I don't think you should rescale
it that way - please disregard my comment.

>>
>> ** normalization here means row-level normalization - similar to
>> document length normalization in TF/IDF.
>>
>> HTH,
>>   Peter
>
> Regards,
> Philipp
>>
>> Blitzer et al. 2006, Domain Adaptation using Structural Correspondence
>> Learning, http://john.blitzer.com/papers/emnlp06.pdf
>>
>> [1] This is also described here:
>> http://scikit-learn.org/dev/modules/sgd.html#tips-on-practical-use
>
>
> --
> Got visibility?
> Most devs has no idea what their production app looks like.
> Find out how fast your code is with AppDynamics Lite.
> http://ad.doubleclick.net/clk;262219671;13503038;y?
> http://info.appdynamics.com/FreeJavaPerformanceDownload.html
> ___
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general



-- 
Peter Prettenhofer



Re: [Scikit-learn-general] Combining TFIDF and LDA features

2012-09-14 Thread Philipp Singer
On 14.09.2012 15:28, Philipp Singer wrote:
> Okay, so I did a fast chi2 check and it seems like some LDA features
> have high p-values, so they should be helpful at least.

Oh, sorry - we want the lowest p-values, right? But the conclusion is 
the same: there are many features with low p-values.
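For reference, a small sketch of such a check with `sklearn.feature_selection.chi2` on hypothetical data (in the thread this would be the combined tfidf/LDA matrix); lower p-values mean stronger dependence between a feature and the class:

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.RandomState(0)
# chi2 requires non-negative features; tf/idf values and LDA proportions qualify.
X = rng.rand(200, 30)            # hypothetical feature matrix
y = rng.randint(0, 2, size=200)  # hypothetical class labels

scores, pvalues = chi2(X, y)
informative = np.argsort(pvalues)[:10]  # indices of the 10 lowest p-values
print(informative.shape)  # (10,)
```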
>
> On 14.09.2012 15:06, Andreas Müller wrote:
>> I'd be interested in the outcome.
>> Let us know when you get it to work :)
>>
>>
>> ----- Original Message -----
>> From: "Philipp Singer" 
>> To: scikit-learn-general@lists.sourceforge.net
>> Sent: Friday, 14 September 2012 14:00:48
>> Subject: Re: [Scikit-learn-general] Combining TFIDF and LDA features
>>
>> On 14.09.2012 14:53, Andreas Müller wrote:
>>> Hi Philipp.
>>
>> Hey Andreas!
>>> First, you should ensure that the features all have approximately the
>>> same scale.
>>> For example they should all be between zero and one - if the LDA
>>> features
>>> are much smaller than the other ones, then they will probably not be
>>> weighted much.
>>
>> LDA features sum up to 1 for one sample, because they describe the
>> probability of one sample to belong to the different topics (in this
>> case 500). So basically, they are between 0 and 1.
>>>
>>> Which LDA package did you use?
>>
>> We used Mallet's LDA implementation, because from experience they have
>> the most established smoothing processes. http://mallet.cs.umass.edu/
>>
>> If we just train on the LDA features we btw get reasonable results, a
>> bit worse than pure TFIDF.
>>>
>>> I am not very experienced with this kind of model, but maybe it would
>>> be helpful
>>> to look at some univariate statistics, like
>>> ``feature_selection.chi2``, to see
>>> if the LDA features are actually helpful.
>>
>> Yeah, this would be something I could look into. I have already tried to
>> do feature selection with chi2 but not actually looked at the specific
>> statistics.
>>>
>>> Cheers,
>>> Andy
>>
>> Regards,
>> Philipp
>>>
>>>
>>> ----- Original Message -----
>>> From: "Philipp Singer" 
>>> To: scikit-learn-general@lists.sourceforge.net
>>> Sent: Friday, 14 September 2012 13:47:30
>>> Subject: [Scikit-learn-general] Combining TFIDF and LDA features
>>>
>>> Hey there!
>>>
>>> I have seen in the past some few research papers that combined tfidf
>>> based features with LDA topic model features and they could increase
>>> their accuracy by some useful extent.
>>>
>>> I now wanted to do the same. As a simple step I just appended the topic
>>> features to each train and test sample with the existing tfidf features
>>> and performed my standard LinearSVC - oh btw thanks that the confusion
>>> with dense and sparse is now resolved in 0.12 ;) - on it.
>>>
>>> The problem now is that the results are overall essentially the same. Some
>>> classes perform better and some worse.
>>>
>>> I am not exactly sure if this is a data problem, or comes from my lack
>>> of understanding of such feature extension techniques.
>>>
>>> Is it possible that the huge amount of tfidf features somehow overrules
>>> the rather small number of topic features? Do I maybe have to do some
>>> feature modification - because tfidf and LDA features are of different
>>> nature?
>>>
>>> Maybe it is also due to the classifier and I need something else?
>>>
>>> Would be happy if someone could shed a little light on my problems ;)
>>>
>>> Regards,
>>> Philipp
>>>

Re: [Scikit-learn-general] Combining TFIDF and LDA features

2012-09-14 Thread Philipp Singer
Hey!

On 14.09.2012 15:10, Peter Prettenhofer wrote:
>
> I totally agree - I had such an issue in my research as well
> (combining word presence features with SVD embeddings).
> I followed Blitzer et al. 2006 and normalized** both feature groups
> separately - e.g. you could normalize word presence features such that
> L1 norm equals 1 and do the same for the SVD embeddings.

Isn't the normalization already part of the tfidf transformation?
So basically the word presence tfidf features are already L2 normalized, 
but maybe I misunderstand this completely.

> In my work I had the impression, though, that L1/L2 normalization was
> inferior to simply scaling the embeddings by a constant alpha such that
> the average L2 norm is 1 [1].

Ah, I see. How exactly would I do that? Isn't that the same thing the 
normalization technique in scikit-learn is doing?
>
> ** normalization here means row-level normalization - similar to
> document length normalization in TF/IDF.
>
> HTH,
>   Peter

Regards,
Philipp
>
> Blitzer et al. 2006, Domain Adaptation using Structural Correspondence
> Learning, http://john.blitzer.com/papers/emnlp06.pdf
>
> [1] This is also described here:
> http://scikit-learn.org/dev/modules/sgd.html#tips-on-practical-use




Re: [Scikit-learn-general] Combining TFIDF and LDA features

2012-09-14 Thread Philipp Singer
Okay, so I did a fast chi2 check and it seems like some LDA features 
have high p-values, so they should be helpful at least.

On 14.09.2012 15:06, Andreas Müller wrote:
> I'd be interested in the outcome.
> Let us know when you get it to work :)
>
>
> ----- Original Message -----
> From: "Philipp Singer" 
> To: scikit-learn-general@lists.sourceforge.net
> Sent: Friday, 14 September 2012 14:00:48
> Subject: Re: [Scikit-learn-general] Combining TFIDF and LDA features
>
> On 14.09.2012 14:53, Andreas Müller wrote:
>> Hi Philipp.
>
> Hey Andreas!
>> First, you should ensure that the features all have approximately the same 
>> scale.
>> For example they should all be between zero and one - if the LDA features
>> are much smaller than the other ones, then they will probably not be 
>> weighted much.
>
> LDA features sum up to 1 for one sample, because they describe the
> probability of one sample to belong to the different topics (in this
> case 500). So basically, they are between 0 and 1.
>>
>> Which LDA package did you use?
>
> We used Mallet's LDA implementation, because from experience they have
> the most established smoothing processes. http://mallet.cs.umass.edu/
>
> If we just train on the LDA features we btw get reasonable results, a
> bit worse than pure TFIDF.
>>
>> I am not very experienced with this kind of model, but maybe it would be 
>> helpful
>> to look at some univariate statistics, like ``feature_selection.chi2``, to 
>> see
>> if the LDA features are actually helpful.
>
> Yeah, this would be something I could look into. I have already tried to
> do feature selection with chi2 but not actually looked at the specific
> statistics.
>>
>> Cheers,
>> Andy
>
> Regards,
> Philipp
>>
>>
>> ----- Original Message -----
>> From: "Philipp Singer" 
>> To: scikit-learn-general@lists.sourceforge.net
>> Sent: Friday, 14 September 2012 13:47:30
>> Subject: [Scikit-learn-general] Combining TFIDF and LDA features
>>
>> Hey there!
>>
>> I have seen in the past some few research papers that combined tfidf
>> based features with LDA topic model features and they could increase
>> their accuracy by some useful extent.
>>
>> I now wanted to do the same. As a simple step I just appended the topic
>> features to each train and test sample with the existing tfidf features
>> and performed my standard LinearSVC - oh btw thanks that the confusion
>> with dense and sparse is now resolved in 0.12 ;) - on it.
>>
>> The problem now is that the results are overall essentially the same. Some
>> classes perform better and some worse.
>>
>> I am not exactly sure if this is a data problem, or comes from my lack
>> of understanding of such feature extension techniques.
>>
>> Is it possible that the huge amount of tfidf features somehow overrules
>> the rather small number of topic features? Do I maybe have to do some
>> feature modification - because tfidf and LDA features are of different
>> nature?
>>
>> Maybe it is also due to the classifier and I need something else?
>>
>> Would be happy if someone could shed a little light on my problems ;)
>>
>> Regards,
>> Philipp
>>

Re: [Scikit-learn-general] Combining TFIDF and LDA features

2012-09-14 Thread Philipp Singer
On 14.09.2012 15:10, amir rahimi wrote:
> Have you done tests using some other classifiers such as gradient
> boosting which has a kind of internal feature selection?

Not yet, actually - but I want to try that out if the runtime allows it.
>
> On Fri, Sep 14, 2012 at 5:36 PM, Andreas Müller wrote:
>
> I'd be interested in the outcome.
> Let us know when you get it to work :)
>
>
> ----- Original Message -----
> From: "Philipp Singer" 
> To: scikit-learn-general@lists.sourceforge.net
> Sent: Friday, 14 September 2012 14:00:48
> Subject: Re: [Scikit-learn-general] Combining TFIDF and LDA features
>
> On 14.09.2012 14:53, Andreas Müller wrote:
>  > Hi Philipp.
>
> Hey Andreas!
>  > First, you should ensure that the features all have approximately
> the same scale.
>  > For example they should all be between zero and one - if the LDA
> features
>  > are much smaller than the other ones, then they will probably not
> be weighted much.
>
> LDA features sum up to 1 for one sample, because they describe the
> probability of one sample to belong to the different topics (in this
> case 500). So basically, they are between 0 and 1.
>  >
>  > Which LDA package did you use?
>
> We used Mallet's LDA implementation, because from experience they have
> the most established smoothing processes. http://mallet.cs.umass.edu/
>
> If we just train on the LDA features we btw get reasonable results, a
> bit worse than pure TFIDF.
>  >
>  > I am not very experienced with this kind of model, but maybe it
> would be helpful
>  > to look at some univariate statistics, like
> ``feature_selection.chi2``, to see
>  > if the LDA features are actually helpful.
>
> Yeah, this would be something I could look into. I have already tried to
> do feature selection with chi2 but not actually looked at the specific
> statistics.
>  >
>  > Cheers,
>  > Andy
>
> Regards,
> Philipp
>  >
>  >
>  > ----- Original Message -----
>  > From: "Philipp Singer" 
>  > To: scikit-learn-general@lists.sourceforge.net
>  > Sent: Friday, 14 September 2012 13:47:30
>  > Subject: [Scikit-learn-general] Combining TFIDF and LDA features
>  >
>  > Hey there!
>  >
>  > I have seen in the past some few research papers that combined tfidf
>  > based features with LDA topic model features and they could increase
>  > their accuracy by some useful extent.
>  >
>  > I now wanted to do the same. As a simple step I just appended the
> topic
>  > features to each train and test sample with the existing tfidf
> features
>  > and performed my standard LinearSVC - oh btw thanks that the
> confusion
>  > with dense and sparse is now resolved in 0.12 ;) - on it.
>  >
>  > The problem now is that the results are overall essentially the same.
> Some
>  > classes perform better and some worse.
>  >
>  > I am not exactly sure if this is a data problem, or comes from my
> lack
>  > of understanding of such feature extension techniques.
>  >
>  > Is it possible that the huge amount of tfidf features somehow
> overrules
>  > the rather small number of topic features? Do I maybe have to do some
>  > feature modification - because tfidf and LDA features are of
> different
>  > nature?
>  >
>  > Maybe it is also due to the classifier and I need something else?
>  >
>  > Would be happy if someone could shed a little light on my problems ;)
>  >
>  > Regards,
>  > Philipp
>  >
>  >
> 

Re: [Scikit-learn-general] Combining TFIDF and LDA features

2012-09-14 Thread amir rahimi
Have you done tests using some other classifiers such as gradient boosting
which has a kind of internal feature selection?

On Fri, Sep 14, 2012 at 5:36 PM, Andreas Müller wrote:

> I'd be interested in the outcome.
> Let us know when you get it to work :)
>
>
> ----- Original Message -----
> From: "Philipp Singer" 
> To: scikit-learn-general@lists.sourceforge.net
> Sent: Friday, 14 September 2012 14:00:48
> Subject: Re: [Scikit-learn-general] Combining TFIDF and LDA features
>
> On 14.09.2012 14:53, Andreas Müller wrote:
> > Hi Philipp.
>
> Hey Andreas!
> > First, you should ensure that the features all have approximately the
> same scale.
> > For example they should all be between zero and one - if the LDA features
> > are much smaller than the other ones, then they will probably not be
> weighted much.
>
> LDA features sum up to 1 for one sample, because they describe the
> probability of one sample to belong to the different topics (in this
> case 500). So basically, they are between 0 and 1.
> >
> > Which LDA package did you use?
>
> We used Mallet's LDA implementation, because from experience they have
> the most established smoothing processes. http://mallet.cs.umass.edu/
>
> If we just train on the LDA features we btw get reasonable results, a
> bit worse than pure TFIDF.
> >
> > I am not very experienced with this kind of model, but maybe it would be
> helpful
> > to look at some univariate statistics, like ``feature_selection.chi2``,
> to see
> > if the LDA features are actually helpful.
>
> Yeah, this would be something I could look into. I have already tried to
> do feature selection with chi2 but not actually looked at the specific
> statistics.
> >
> > Cheers,
> > Andy
>
> Regards,
> Philipp
> >
> >
> > ----- Original Message -----
> > From: "Philipp Singer" 
> > To: scikit-learn-general@lists.sourceforge.net
> > Sent: Friday, 14 September 2012 13:47:30
> > Subject: [Scikit-learn-general] Combining TFIDF and LDA features
> >
> > Hey there!
> >
> > I have seen in the past some few research papers that combined tfidf
> > based features with LDA topic model features and they could increase
> > their accuracy by some useful extent.
> >
> > I now wanted to do the same. As a simple step I just appended the topic
> > features to each train and test sample with the existing tfidf features
> > and performed my standard LinearSVC - oh btw thanks that the confusion
> > with dense and sparse is now resolved in 0.12 ;) - on it.
> >
> > The problem now is that the results are overall essentially the same. Some
> > classes perform better and some worse.
> >
> > I am not exactly sure if this is a data problem, or comes from my lack
> > of understanding of such feature extension techniques.
> >
> > Is it possible that the huge amount of tfidf features somehow overrules
> > the rather small number of topic features? Do I maybe have to do some
> > feature modification - because tfidf and LDA features are of different
> > nature?
> >
> > Maybe it is also due to the classifier and I need something else?
> >
> > Would be happy if someone could shed a little light on my problems ;)
> >
> > Regards,
> > Philipp
> >
> >

Re: [Scikit-learn-general] Combining TFIDF and LDA features

2012-09-14 Thread Peter Prettenhofer
2012/9/14 Andreas Müller :
> Hi Philipp.
> First, you should ensure that the features all have approximately the same 
> scale.
> For example they should all be between zero and one - if the LDA features
> are much smaller than the other ones, then they will probably not be weighted 
> much.

I totally agree - I had such an issue in my research as well
(combining word presence features with SVD embeddings).
I followed Blitzer et al. 2006 and normalized** both feature groups
separately - e.g. you could normalize the word presence features such that
the L1 norm equals 1 and do the same for the SVD embeddings.
In my work I had the impression, though, that L1/L2 normalization was
inferior to simply scaling the embeddings by a constant alpha such that
the average L2 norm is 1 [1].

** normalization here means row-level normalization - similar to
document length normalization in TF/IDF.

HTH,
 Peter

Blitzer et al. 2006, Domain Adaptation using Structural Correspondence
Learning, http://john.blitzer.com/papers/emnlp06.pdf

[1] This is also described here:
http://scikit-learn.org/dev/modules/sgd.html#tips-on-practical-use
>
> Which LDA package did you use?
>
> I am not very experienced with this kind of model, but maybe it would be 
> helpful
> to look at some univariate statistics, like ``feature_selection.chi2``, to see
> if the LDA features are actually helpful.
>
> Cheers,
> Andy
>
>
> ----- Original Message -----
> From: "Philipp Singer" 
> To: scikit-learn-general@lists.sourceforge.net
> Sent: Friday, 14 September 2012 13:47:30
> Subject: [Scikit-learn-general] Combining TFIDF and LDA features
>
> Hey there!
>
> I have seen in the past some few research papers that combined tfidf
> based features with LDA topic model features and they could increase
> their accuracy by some useful extent.
>
> I now wanted to do the same. As a simple step I just appended the topic
> features to each train and test sample with the existing tfidf features
> and performed my standard LinearSVC - oh btw thanks that the confusion
> with dense and sparse is now resolved in 0.12 ;) - on it.
>
> The problem now is that the results are overall essentially the same. Some
> classes perform better and some worse.
>
> I am not exactly sure if this is a data problem, or comes from my lack
> of understanding of such feature extension techniques.
>
> Is it possible that the huge amount of tfidf features somehow overrules
> the rather small number of topic features? Do I maybe have to do some
> feature modification - because tfidf and LDA features are of different
> nature?
>
> Maybe it is also due to the classifier and I need something else?
>
> Would be happy if someone could shed a little light on my problems ;)
>
> Regards,
> Philipp
>



-- 
Peter Prettenhofer



Re: [Scikit-learn-general] Combining TFIDF and LDA features

2012-09-14 Thread Andreas Müller
I'd be interested in the outcome.
Let us know when you get it to work :)


----- Original Message -----
From: "Philipp Singer" 
To: scikit-learn-general@lists.sourceforge.net
Sent: Friday, 14 September 2012 14:00:48
Subject: Re: [Scikit-learn-general] Combining TFIDF and LDA features

On 14.09.2012 14:53, Andreas Müller wrote:
> Hi Philipp.

Hey Andreas!
> First, you should ensure that the features all have approximately the same 
> scale.
> For example they should all be between zero and one - if the LDA features
> are much smaller than the other ones, then they will probably not be weighted 
> much.

LDA features sum to 1 for each sample, because they describe the 
probability that the sample belongs to each of the different topics (in 
this case 500). So basically, they are between 0 and 1.
>
> Which LDA package did you use?

We used Mallet's LDA implementation because, from experience, it has 
the most established smoothing processes. http://mallet.cs.umass.edu/

If we just train on the LDA features we btw get reasonable results, a 
bit worse than pure TFIDF.
>
> I am not very experienced with this kind of model, but maybe it would be 
> helpful
> to look at some univariate statistics, like ``feature_selection.chi2``, to see
> if the LDA features are actually helpful.

Yeah, this would be something I could look into. I have already tried to 
do feature selection with chi2 but not actually looked at the specific 
statistics.
>
> Cheers,
> Andy

Regards,
Philipp
>
>
> ----- Original Message -----
> From: "Philipp Singer" 
> To: scikit-learn-general@lists.sourceforge.net
> Sent: Friday, 14 September 2012 13:47:30
> Subject: [Scikit-learn-general] Combining TFIDF and LDA features
>
> Hey there!
>
> I have seen in the past some few research papers that combined tfidf
> based features with LDA topic model features and they could increase
> their accuracy by some useful extent.
>
> I now wanted to do the same. As a simple step I just appended the topic
> features to each train and test sample with the existing tfidf features
> and performed my standard LinearSVC - oh btw thanks that the confusion
> with dense and sparse is now resolved in 0.12 ;) - on it.
>
> The problem now is that the results are overall essentially the same. Some
> classes perform better and some worse.
>
> I am not exactly sure if this is a data problem, or comes from my lack
> of understanding of such feature extension techniques.
>
> Is it possible that the huge amount of tfidf features somehow overrules
> the rather small number of topic features? Do I maybe have to do some
> feature modification - because tfidf and LDA features are of different
> nature?
>
> Maybe it is also due to the classifier and I need something else?
>
> Would be happy if someone could shed a little light on my problems ;)
>
> Regards,
> Philipp
>




Re: [Scikit-learn-general] Combining TFIDF and LDA features

2012-09-14 Thread Philipp Singer
On 14.09.2012 14:53, Andreas Müller wrote:
> Hi Philipp.

Hey Andreas!
> First, you should ensure that the features all have approximately the same 
> scale.
> For example they should all be between zero and one - if the LDA features
> are much smaller than the other ones, then they will probably not be weighted 
> much.

LDA features sum to 1 for each sample, because they describe the 
probability that the sample belongs to each of the different topics (in 
this case 500). So basically, they are between 0 and 1.
>
> Which LDA package did you use?

We used Mallet's LDA implementation because, from experience, it has 
the most established smoothing processes. http://mallet.cs.umass.edu/

If we just train on the LDA features we btw get reasonable results, a 
bit worse than pure TFIDF.
>
> I am not very experienced with this kind of model, but maybe it would be 
> helpful
> to look at some univariate statistics, like ``feature_selection.chi2``, to see
> if the LDA features are actually helpful.

Yeah, this would be something I could look into. I have already tried to 
do feature selection with chi2 but not actually looked at the specific 
statistics.
>
> Cheers,
> Andy

Regards,
Philipp
>
>
> ----- Original Message -----
> From: "Philipp Singer" 
> To: scikit-learn-general@lists.sourceforge.net
> Sent: Friday, 14 September 2012 13:47:30
> Subject: [Scikit-learn-general] Combining TFIDF and LDA features
>
> Hey there!
>
> I have seen in the past some few research papers that combined tfidf
> based features with LDA topic model features and they could increase
> their accuracy by some useful extent.
>
> I now wanted to do the same. As a simple step I just appended the topic
> features to each train and test sample with the existing tfidf features
> and performed my standard LinearSVC - oh btw thanks that the confusion
> with dense and sparse is now resolved in 0.12 ;) - on it.
>
> The problem now is that the results are overall essentially the same. Some
> classes perform better and some worse.
>
> I am not exactly sure if this is a data problem, or comes from my lack
> of understanding of such feature extension techniques.
>
> Is it possible that the huge amount of tfidf features somehow overrules
> the rather small number of topic features? Do I maybe have to do some
> feature modification - because tfidf and LDA features are of different
> nature?
>
> Maybe it is also due to the classifier and I need something else?
>
> Would be happy if someone could shed a little light on my problems ;)
>
> Regards,
> Philipp
>




Re: [Scikit-learn-general] Combining TFIDF and LDA features

2012-09-14 Thread Andreas Müller
Hi Philipp.
First, you should ensure that the features all have approximately the same 
scale.
For example they should all be between zero and one - if the LDA features
are much smaller than the other ones, then they will probably not be weighted 
much.

Which LDA package did you use?

I am not very experienced with this kind of model, but maybe it would be helpful
to look at some univariate statistics, like ``feature_selection.chi2``, to see
if the LDA features are actually helpful.

Cheers,
Andy


----- Original Message -----
From: "Philipp Singer" 
To: scikit-learn-general@lists.sourceforge.net
Sent: Friday, 14 September 2012 13:47:30
Subject: [Scikit-learn-general] Combining TFIDF and LDA features

Hey there!

I have seen a few research papers in the past that combined tfidf-based 
features with LDA topic model features and could thereby increase their 
accuracy by a useful margin.

I now wanted to do the same. As a simple step I just appended the topic 
features to the existing tfidf features for each train and test sample 
and performed my standard LinearSVC - oh, btw, thanks for resolving the 
confusion with dense and sparse in 0.12 ;) - on it.
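The appending step described above can be sketched like this, with hypothetical stand-in matrices; `scipy.sparse.hstack` keeps the combined matrix sparse, so `LinearSVC` can train on it directly:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X_tfidf = csr_matrix(rng.rand(100, 1000))  # hypothetical sparse tf/idf block
X_lda = rng.rand(100, 500)                 # hypothetical dense topic proportions
y = rng.randint(0, 2, size=100)

# Append the topic features column-wise to the tf/idf features.
X = hstack([X_tfidf, csr_matrix(X_lda)]).tocsr()
clf = LinearSVC().fit(X, y)
print(X.shape)  # (100, 1500)
```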

The problem now is that the results are overall essentially the same. Some 
classes perform better and some worse.

I am not exactly sure if this is a data problem, or comes from my lack 
of understanding of such feature extension techniques.

Is it possible that the huge amount of tfidf features somehow overrules 
the rather small number of topic features? Do I maybe have to do some 
feature modification - because tfidf and LDA features are of a different 
nature?

Maybe it is also due to the classifier and I need something else?

Would be happy if someone could shed a little light on my problems ;)

Regards,
Philipp
