Re: [Scikit-learn-general] Combining TFIDF and LDA features
2012/9/14 Philipp Singer:
> Hey!
>
> On 14.09.2012 15:10, Peter Prettenhofer wrote:
>> I totally agree - I had such an issue in my research as well
>> (combining word-presence features with SVD embeddings).
>> I followed Blitzer et al. 2006 and normalized** both feature groups
>> separately - e.g. you could normalize the word-presence features so that
>> their L1 norm equals 1, and do the same for the SVD embeddings.
>
> Isn't the normalization already part of the tfidf transformation?
> So basically the word-presence tfidf features are already L2-normalized -
> but maybe I misunderstand this completely.

I forgot that your LDA embedding is already L1-normalized (i.e. it sums to 1).
So both of your feature groups are already normalized: tf/idf is L2 and LDA is L1.

>> In my work I had the impression, though, that L1/L2 normalization was
>> inferior to simply scaling the embeddings by a constant alpha such that
>> the average L2 norm is 1. [1]
>
> Ah, I see. How exactly would I do that? Isn't that the same thing as the
> normalization technique in scikit-learn?

It's as simple as computing the mean L2 norm and dividing the feature matrix
by that number. Scaler does this per feature and Normalizer per sample - this
computes one normalization constant for all features. But since the LDA
embedding has an intrinsic semantics (a document is generated from a topic
distribution), I don't think you should do this - please forget my comment.

>> ** Normalization here means row-level normalization - similar to
>> document-length normalization in TF/IDF.
>>
>> HTH,
>> Peter
>
> Regards,
> Philipp

>> Blitzer et al. 2006, Domain Adaptation with Structural Correspondence
>> Learning, http://john.blitzer.com/papers/emnlp06.pdf
>>
>> [1] This is also described here:
>> http://scikit-learn.org/dev/modules/sgd.html#tips-on-practical-use

--
Peter Prettenhofer

___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
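Peter's "constant alpha" scaling can be sketched in a few lines. This is only an illustration of the idea (the function name is made up, not a scikit-learn API): unlike Normalizer (per sample) or Scaler (per feature), one scalar is computed for the whole matrix so that the average row L2 norm becomes 1.

```python
import numpy as np

def scale_to_unit_mean_norm(X):
    """Divide X by one constant alpha so the mean row L2 norm is 1."""
    norms = np.sqrt((X ** 2).sum(axis=1))
    alpha = norms.mean()
    return X / alpha

# Toy matrix: row norms are 5 and 10, so the mean norm is 7.5.
X = np.array([[3.0, 4.0], [6.0, 8.0]])
X_scaled = scale_to_unit_mean_norm(X)
mean_norm = np.sqrt((X_scaled ** 2).sum(axis=1)).mean()
# mean_norm is now 1.0
```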
Re: [Scikit-learn-general] Combining TFIDF and LDA features
On 14.09.2012 15:28, Philipp Singer wrote:
> Okay, so I did a quick chi2 check, and it seems like some LDA features
> have high p-values, so they should be helpful at least.

Oh, sorry - we want the lowest p-values, right? But the conclusion is the
same: there are many LDA features with low p-values.

> On 14.09.2012 15:06, Andreas Müller wrote:
>> I'd be interested in the outcome.
>> Let us know when you get it to work :)
> [...]
Re: [Scikit-learn-general] Combining TFIDF and LDA features
Hey!

On 14.09.2012 15:10, Peter Prettenhofer wrote:
> I totally agree - I had such an issue in my research as well
> (combining word-presence features with SVD embeddings).
> I followed Blitzer et al. 2006 and normalized** both feature groups
> separately - e.g. you could normalize the word-presence features so that
> their L1 norm equals 1, and do the same for the SVD embeddings.

Isn't the normalization already part of the tfidf transformation?
So basically the word-presence tfidf features are already L2-normalized -
but maybe I misunderstand this completely.

> In my work I had the impression, though, that L1/L2 normalization was
> inferior to simply scaling the embeddings by a constant alpha such that
> the average L2 norm is 1. [1]

Ah, I see. How exactly would I do that? Isn't that the same thing as the
normalization technique in scikit-learn?

> ** Normalization here means row-level normalization - similar to
> document-length normalization in TF/IDF.
>
> HTH,
> Peter

Regards,
Philipp

> Blitzer et al. 2006, Domain Adaptation with Structural Correspondence
> Learning, http://john.blitzer.com/papers/emnlp06.pdf
>
> [1] This is also described here:
> http://scikit-learn.org/dev/modules/sgd.html#tips-on-practical-use
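Philipp's point can be verified directly: TfidfVectorizer applies L2 row normalization by default (its `norm` parameter defaults to `'l2'`), so every document vector already has unit length. The toy documents below are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["a quick brown fox", "lazy dogs sleep all day"]

# Default norm='l2': each row of the tf-idf matrix has unit L2 norm.
X = TfidfVectorizer().fit_transform(docs)
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
# row_norms is [1.0, 1.0]
```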
Re: [Scikit-learn-general] Combining TFIDF and LDA features
Okay, so I did a quick chi2 check, and it seems like some LDA features
have high p-values, so they should be helpful at least.

On 14.09.2012 15:06, Andreas Müller wrote:
> I'd be interested in the outcome.
> Let us know when you get it to work :)
> [...]
Re: [Scikit-learn-general] Combining TFIDF and LDA features
On 14.09.2012 15:10, amir rahimi wrote:
> Have you done any tests with other classifiers, such as gradient
> boosting, which has a kind of internal feature selection built in?

Actually no, but I wanted to try that out, if the runtime allows it.

> On Fri, Sep 14, 2012 at 5:36 PM, Andreas Müller
> <amuel...@ais.uni-bonn.de> wrote:
>> I'd be interested in the outcome.
>> Let us know when you get it to work :)
>> [...]
Re: [Scikit-learn-general] Combining TFIDF and LDA features
Have you done any tests with other classifiers, such as gradient boosting,
which has a kind of internal feature selection built in?

On Fri, Sep 14, 2012 at 5:36 PM, Andreas Müller wrote:
> I'd be interested in the outcome.
> Let us know when you get it to work :)
> [...]
Re: [Scikit-learn-general] Combining TFIDF and LDA features
2012/9/14 Andreas Müller:
> Hi Philipp.
> First, you should ensure that the features all have approximately the same
> scale. For example, they should all be between zero and one - if the LDA
> features are much smaller than the other ones, then they will probably not
> be weighted much.

I totally agree - I had such an issue in my research as well (combining
word-presence features with SVD embeddings). I followed Blitzer et al. 2006
and normalized** both feature groups separately - e.g. you could normalize
the word-presence features so that their L1 norm equals 1, and do the same
for the SVD embeddings.

In my work I had the impression, though, that L1/L2 normalization was
inferior to simply scaling the embeddings by a constant alpha such that the
average L2 norm is 1. [1]

** Normalization here means row-level normalization - similar to
document-length normalization in TF/IDF.

HTH,
Peter

Blitzer et al. 2006, Domain Adaptation with Structural Correspondence
Learning, http://john.blitzer.com/papers/emnlp06.pdf

[1] This is also described here:
http://scikit-learn.org/dev/modules/sgd.html#tips-on-practical-use

> [...]

--
Peter Prettenhofer
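The Blitzer-style recipe Peter describes - row-normalize each feature group separately, then concatenate - might look like the following sketch. The random matrices are stand-ins for real word-presence and embedding features; only the normalize-then-stack pattern is the point.

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.preprocessing import normalize

rng = np.random.RandomState(0)

# Stand-ins: non-negative word-presence counts and a dense embedding.
X_words = csr_matrix(np.abs(rng.randn(4, 10)))
X_embed = rng.randn(4, 3)

# Normalize each group row-wise on its own, then concatenate.
X_words_n = normalize(X_words, norm='l1')   # each row now sums to 1
X_embed_n = normalize(X_embed, norm='l1')
X = hstack([X_words_n, csr_matrix(X_embed_n)])
# X has 4 rows and 10 + 3 = 13 columns
```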
Re: [Scikit-learn-general] Combining TFIDF and LDA features
I'd be interested in the outcome. Let us know when you get it to work :)

----- Original Message -----
From: "Philipp Singer"
To: scikit-learn-general@lists.sourceforge.net
Sent: Friday, September 14, 2012 14:00:48
Subject: Re: [Scikit-learn-general] Combining TFIDF and LDA features

[...]
Re: [Scikit-learn-general] Combining TFIDF and LDA features
On 14.09.2012 14:53, Andreas Müller wrote:
> Hi Philipp.

Hey Andreas!

> First, you should ensure that the features all have approximately the same
> scale. For example, they should all be between zero and one - if the LDA
> features are much smaller than the other ones, then they will probably not
> be weighted much.

The LDA features sum to 1 for each sample, because they describe the
probability of a sample belonging to the different topics (in this case
500). So basically, they are between 0 and 1.

> Which LDA package did you use?

We used Mallet's LDA implementation, because in our experience it has the
most established smoothing procedures: http://mallet.cs.umass.edu/

If we train on the LDA features alone, by the way, we get reasonable
results - a bit worse than pure TFIDF.

> I am not very experienced with this kind of model, but maybe it would be
> helpful to look at some univariate statistics, like
> ``feature_selection.chi2``, to see if the LDA features are actually helpful.

Yeah, this is something I could look into. I have already tried to do
feature selection with chi2, but I haven't actually looked at the specific
statistics.

> Cheers,
> Andy

Regards,
Philipp

> [...]
Re: [Scikit-learn-general] Combining TFIDF and LDA features
Hi Philipp.

First, you should ensure that the features all have approximately the same
scale. For example, they should all be between zero and one - if the LDA
features are much smaller than the other ones, then they will probably not
be weighted much.

Which LDA package did you use?

I am not very experienced with this kind of model, but maybe it would be
helpful to look at some univariate statistics, like
``feature_selection.chi2``, to see if the LDA features are actually helpful.

Cheers,
Andy

----- Original Message -----
From: "Philipp Singer"
To: scikit-learn-general@lists.sourceforge.net
Sent: Friday, September 14, 2012 13:47:30
Subject: [Scikit-learn-general] Combining TFIDF and LDA features

Hey there!

I have seen a few research papers in the past that combined tfidf-based
features with LDA topic-model features and could thereby increase their
accuracy by a useful extent.

I now wanted to do the same. As a simple first step, I just appended the
topic features to each train and test sample alongside the existing tfidf
features and ran my standard LinearSVC on it - oh, and by the way, thanks
for resolving the dense/sparse confusion in 0.12 ;).

The problem is that the results are overall almost identical: some classes
perform better and some worse.

I am not exactly sure whether this is a data problem or comes from my lack
of understanding of such feature-extension techniques.

Is it possible that the huge number of tfidf features somehow overrules the
rather small number of topic features? Do I maybe have to do some feature
modification, because tfidf and LDA features are of a different nature?

Maybe it is also due to the classifier and I need something else?

I'd be happy if someone could shed a little light on my problem ;)

Regards,
Philipp
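The feature combination Philipp describes - tf-idf features horizontally stacked with LDA topic proportions, fed to LinearSVC - can be sketched as follows. The toy documents and the topic matrix are invented for illustration; in practice the topic proportions would come from an external tool such as Mallet.

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["the cat sat on the mat", "dogs chase cats",
        "stocks fell sharply today", "the market rallied"]
y = [0, 0, 1, 1]

tfidf = TfidfVectorizer().fit_transform(docs)   # L2-normalized rows

# Stand-in topic proportions: each row sums to 1, like an LDA embedding.
lda = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])

# Append the topic features to the tf-idf features, keeping everything sparse.
X = hstack([tfidf, csr_matrix(lda)])
clf = LinearSVC().fit(X, y)
pred = clf.predict(X)
```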