Re: [Scikit-learn-general] Best classification for very sparse and skewed feature matrix
On 15.01.2012 19:45, Gael Varoquaux wrote:
> On Sun, Jan 15, 2012 at 07:39:00PM +0100, Philipp Singer wrote:
> > The problem is that my representation is very sparse, so I have a huge
> > number of zeros.
> That's actually good: some of our estimators are able to use a sparse
> representation to speed up computation.
> > Furthermore, the dataset is skewed: one class accounts for a huge share
> > of the labels, and another is also quite large. I have successfully used
> > logistic regression and achieved a recall of about 65% (on the
> > best-case dataset). I am pretty happy with that result, but the
> > confusion matrix shows that many examples get mapped to the large class.
> Use class_weight='auto' in the logistic regression to counter the effect
> of unbalanced classes. For SVMs, the following example shows the trick:
> http://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane_unbalanced.html
> HTH, Gael

Thanks a lot for the help! This helped out quite a bit. But I am still not
entirely happy with the results. Maybe some further ideas?

Thanks a lot,
Philipp

___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
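A minimal sketch of the class-weighting advice above, on a synthetic imbalanced dataset (the data here is illustrative, standing in for the poster's matrix). Note that the 'auto' spelling from this thread was later renamed: current scikit-learn uses class_weight='balanced', which reweights each class inversely proportionally to its frequency.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Synthetic imbalanced problem: roughly 90% of samples in class 0.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=0)

# class_weight='auto' at the time of this thread; spelled 'balanced'
# in current scikit-learn releases.
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X, y)

# The confusion matrix shows how many minority-class samples are
# recovered instead of being absorbed by the large class.
cm = confusion_matrix(y, clf.predict(X))
print(cm)
```

Without the weighting, a classifier on such data tends to predict the majority class almost everywhere; the weighting trades some majority-class accuracy for minority-class recall.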
Re: [Scikit-learn-general] Best classification for very sparse and skewed feature matrix
Which classifiers have you tried? Are you sure you selected the best hyper-parameters with GridSearchCV? Have you tried normalizing the dataset? For instance, have a look at: http://scikit-learn.org/dev/modules/preprocessing.html

For very sparse data with large variance in the features, you should try sklearn.feature_extraction.text.TfidfTransformer: http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer

Also, have you tried fitting ExtraTrees?

Is this a multiclass problem or a multilabel problem (i.e. is the target for each example unique and exclusive, or can a single sample be assigned several labels)? For the latter case, multilabel support was recently introduced and is documented here: http://scikit-learn.org/stable/modules/multiclass.html

If you can cheaply collect unsupervised data that looks similar to your training set (albeit without the labels, and in much larger amounts), it might be interesting to compute cluster centers using MiniBatchKMeans, project your data onto that space using a non-linear transform (e.g. an RBF kernel), add these additional features to the original ones (horizontal concatenation of the two datasets), and then fit the classifier on this with the labels. You can also do the same with a linear feature extractor such as sklearn.decomposition.RandomizedPCA, or a non-linear dictionary learner such as sklearn.decomposition.MiniBatchDictionaryLearning: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.MiniBatchDictionaryLearning.html#sklearn.decomposition.MiniBatchDictionaryLearning

Whatever feature extractor you use, the goal is to extract latent components, each with wide feature coverage, so as to cast a large net that can activate those components even if the raw input has very few active features.
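The TF-IDF and grid-search suggestions above can be combined in a single pipeline. A sketch on a hypothetical sparse count matrix (the data and the C grid below are illustrative assumptions, not the poster's setup; GridSearchCV lives in sklearn.model_selection in current releases):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical very sparse count matrix standing in for the real data.
rng = np.random.RandomState(0)
X = sp.random(200, 500, density=0.02, random_state=rng, format='csr')
X.data = np.ceil(X.data * 5)          # integer-like counts in 1..5
y = rng.randint(0, 2, size=200)

# TF-IDF reweighting, then a weighted logistic regression; the grid
# over C is searched with 3-fold cross-validation.
pipe = Pipeline([
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression(class_weight='balanced', max_iter=1000)),
])
search = GridSearchCV(pipe, {'clf__C': [0.1, 1.0, 10.0]}, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Both steps accept scipy.sparse input, so the matrix stays sparse end to end.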
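The feature-augmentation idea (cluster centers, RBF projection, horizontal concatenation) can be sketched as follows. All data here is synthetic, and in practice the cluster centers would be fit on the larger unlabeled pool rather than on the labeled matrix itself; the gamma value is an arbitrary illustration.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.cluster import MiniBatchKMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X = sp.random(300, 100, density=0.05, random_state=rng, format='csr')
y = rng.randint(0, 2, size=300)

# Learn cluster centers; ideally this fit would use the cheap,
# unlabeled dataset mentioned above.
km = MiniBatchKMeans(n_clusters=20, n_init=3, random_state=0).fit(X)

# Non-linear projection of each sample onto the centers via an RBF
# kernel: one "activation" feature per latent component.
Z = rbf_kernel(X, km.cluster_centers_, gamma=0.1)

# Horizontal concatenation of the original sparse features with the
# new component activations.
X_aug = sp.hstack([X, sp.csr_matrix(Z)]).tocsr()

clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X_aug, y)
```

Even when a raw sample has only a handful of active features, its RBF similarity to every center is non-zero, which is exactly the "large net" effect described above.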
There are also two incoming pull requests adding real support for semi-supervised learning algorithms to the scikit, not yet merged into master. With those, you could implement a kind of active learning, enriching your training set by manually annotating the most promising examples from your unsupervised dataset.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
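Semi-supervised support did eventually land in scikit-learn as the sklearn.semi_supervised module. A minimal sketch of the idea on synthetic data (the dataset and the 70% unlabeled fraction are assumptions for illustration): unlabeled samples are marked with -1, and the model propagates labels to them.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Synthetic problem: the true label is the sign of the first feature.
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = (X[:, 0] > 0).astype(int)

# Hide about 70% of the labels; -1 marks unlabeled samples.
y_partial = y.copy()
y_partial[rng.rand(100) < 0.7] = -1

model = LabelSpreading(kernel='rbf', gamma=1.0)
model.fit(X, y_partial)

# transduction_ holds the inferred label for every sample,
# including the ones that were unlabeled during fitting.
print(model.transduction_[:10])
```

For the active-learning loop suggested above, one could rank the unlabeled samples by prediction uncertainty (e.g. via model.label_distributions_) and hand-annotate the least certain ones first.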