Re: [Scikit-learn-general] Best classification for very sparse and skewed feature matrix

2012-01-24 Thread Philipp Singer
Am 15.01.2012 19:45, schrieb Gael Varoquaux:
> On Sun, Jan 15, 2012 at 07:39:00PM +0100, Philipp Singer wrote:
>> The problem is that my representation is very sparse so I have a huge
>> amount of zeros.
> That's actually good: some of our estimators are able to use a sparse
> representation to speed up computation.
>
>> Furthermore, the dataset is skewed: one class takes a huge share of the
>> labels, and another one is also pretty large.
>> I have successfully used logistic regression and could achieve a recall
>> of about 65% (on the best-case dataset). I am pretty happy with that
>> result. But looking at the confusion matrix, the problem is that many
>> examples get mapped to the large class.
> Use class_weight='auto' in the logistic regression to counter the
> effect of un-balanced classes.
>
> For SVMs, the following example shows the trick:
> http://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane_unbalanced.html
>
> HTH,
>
> Gael
>
> --
> RSA(R) Conference 2012
> Feb 27 - Mar 2
> Save $400 by Jan. 27
> Register now!
> http://p.sf.net/sfu/rsa-sfdev2dev2
> ___
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Thanks a lot for the help! This helped quite a bit, but I am still
not entirely happy with the results. Do you have any further ideas?

Thanks a lot
Philipp

--
Keep Your Developer Skills Current with LearnDevNow!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-d2d
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


Re: [Scikit-learn-general] Best classification for very sparse and skewed feature matrix

2012-01-24 Thread Olivier Grisel
Which classifiers have you tried? Are you sure you selected the best
hyper-parameters with GridSearchCV? Have you tried to normalize the
dataset? For instance, have a look at:

  http://scikit-learn.org/dev/modules/preprocessing.html
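To make the advice concrete, here is a minimal sketch of a scaled model tuned with a grid search. The data is synthetic, and the module paths follow the current scikit-learn layout (the 2012 release used sklearn.grid_search); MaxAbsScaler is used because it preserves sparsity:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

# Synthetic stand-in for a sparse, skewed dataset (90/10 imbalance)
X, y = make_classification(n_samples=300, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

# MaxAbsScaler does not center the data, so it keeps sparse input sparse
pipe = make_pipeline(MaxAbsScaler(), LogisticRegression(max_iter=1000))
grid = GridSearchCV(pipe, {"logisticregression__C": [0.1, 1.0, 10.0]}, cv=3)
grid.fit(X, y)
```

grid.best_params_ then tells you which regularization strength won the cross-validation.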

For very sparse data with large variance in the features, you should
try sklearn.feature_extraction.text.TfidfTransformer:

  http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer
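For instance, on a toy term-count matrix (rows are documents, columns are terms), the transformer reweights counts by inverse document frequency and, with the default norm='l2', normalizes every row to unit length:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfTransformer

# Toy term-count matrix: rows = documents, columns = terms
counts = sp.csr_matrix([[3, 0, 1],
                        [2, 0, 0],
                        [3, 0, 0],
                        [4, 0, 0],
                        [3, 2, 0],
                        [3, 0, 2]])

tfidf = TfidfTransformer().fit_transform(counts)

# With the default norm='l2', every row of the result has unit length
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1))).ravel()
```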

Also, have you tried to fit ExtraTrees?
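ExtraTrees lives in sklearn.ensemble; a quick sketch on the same kind of synthetic imbalanced data (the hyper-parameters here are illustrative, not tuned):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic imbalanced binary problem
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

# Extremely randomized trees: like a random forest, but split thresholds
# are drawn at random, which often reduces variance further
clf = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
```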

Is this a multiclass problem or a multilabel problem (e.g. is the
target for each example unique and exclusive, or can a single sample
be assigned several labels)?

For the latter case, multilabel support was recently introduced and is
documented here:

  http://scikit-learn.org/stable/modules/multiclass.html
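With the current API, a multilabel problem can be sketched by fitting a one-vs-rest wrapper on a binary indicator target (the dataset here is synthetic):

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multilabel target: each row of Y is a binary indicator
# vector, so a sample can carry several labels at once
X, Y = make_multilabel_classification(n_samples=200, n_classes=3,
                                      random_state=0)

# One binary classifier per label column
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
pred = clf.predict(X)
```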

If you can cheaply collect unlabeled data that looks similar to your
training set (albeit without the labels and in much larger amounts),
it might be interesting to compute cluster centers using
MiniBatchKMeans, project your data onto that space using a non-linear
transform (e.g. an RBF kernel), and append these additional features
to the original ones (horizontal concatenation of the two datasets)
before fitting the classifier with the labels on the result.
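The steps above can be sketched as follows; the data is synthetic and the gamma value is an arbitrary illustration, not a recommendation:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel

# Labeled training set plus a larger unlabeled set drawn the same way
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_unlabeled, _ = make_classification(n_samples=1000, n_features=20,
                                     random_state=1)

# Cluster the cheap unlabeled data to find structure
km = MiniBatchKMeans(n_clusters=8, random_state=0, n_init=3).fit(X_unlabeled)

# RBF similarity of each labeled sample to every cluster center,
# appended to the original features (horizontal concatenation)
rbf_features = rbf_kernel(X, km.cluster_centers_, gamma=0.1)
X_augmented = np.hstack([X, rbf_features])
```

A classifier is then fit on X_augmented with the original labels y.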

You can also do the same with a linear feature extractor such as
sklearn.decomposition.RandomizedPCA, or a non-linear dictionary
learner such as sklearn.decomposition.MiniBatchDictionaryLearning:

http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.MiniBatchDictionaryLearning.html#sklearn.decomposition.MiniBatchDictionaryLearning
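For sparse input, a randomized linear extractor can be sketched like this. Note this is an assumption-laden modern translation: RandomizedPCA has since been folded into PCA(svd_solver='randomized'), and TruncatedSVD plays the equivalent role for sparse matrices because it needs no centering:

```python
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

# Random sparse matrix standing in for a very sparse feature matrix
X = sp.random(200, 50, density=0.05, format="csr", random_state=0)

# TruncatedSVD works directly on sparse input (no centering needed),
# filling the role RandomizedPCA played in 2012-era scikit-learn
svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X)
```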

Whatever feature extractor you use, the goal is to extract latent
components, each with wide feature coverage, so as to cast a large net
that can activate those components even when the raw input has very
few active features.

There are also two incoming pull requests to add real support for
semi-supervised learning algorithms in the scikit, but they are not
yet merged into master. You could use those kinds of algorithms to
implement some kind of active learning, enriching your training set by
manually annotating the most promising examples from your unlabeled
dataset.
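Independently of those pull requests, a simple uncertainty-based selection loop can be sketched with the existing API: train, score the unlabeled pool by the margin between the two most probable classes, and hand the least confident samples to a human annotator (all data here is synthetic):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Labeled training set and a large unlabeled pool
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
X_pool, _ = make_classification(n_samples=1000, weights=[0.9, 0.1],
                                random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Margin between the two most probable classes: a small margin
# means the model is uncertain about that sample
proba = clf.predict_proba(X_pool)
sorted_proba = np.sort(proba, axis=1)
margin = sorted_proba[:, -1] - sorted_proba[:, -2]

# Indices of the 10 most uncertain pool samples, to annotate by hand
to_annotate = np.argsort(margin)[:10]
```

After annotating those samples, you add them to the training set and repeat.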

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel



Re: [Scikit-learn-general] Best classification for very sparse and skewed feature matrix

2012-01-15 Thread Gael Varoquaux
On Sun, Jan 15, 2012 at 07:39:00PM +0100, Philipp Singer wrote:
 The problem is that my representation is very sparse so I have a huge
 amount of zeros.

That's actually good: some of our estimators are able to use a sparse
representation to speed up computation.

 Furthermore, the dataset is skewed: one class takes a huge share of the
 labels, and another one is also pretty large.

 I have successfully used logistic regression and could achieve a recall
 of about 65% (on the best-case dataset). I am pretty happy with that
 result. But looking at the confusion matrix, the problem is that many
 examples get mapped to the large class.

Use class_weight='auto' in the logistic regression to counter the
effect of un-balanced classes. 

For SVMs, the following example shows the trick:
http://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane_unbalanced.html
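A minimal sketch of that advice on synthetic imbalanced data; note that later scikit-learn releases renamed class_weight='auto' to class_weight='balanced', which is what this sketch uses:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 90/10 imbalanced binary problem
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# class_weight='balanced' reweights samples inversely to class frequency
# (2012-era scikit-learn spelled this class_weight='auto')
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
plain = LogisticRegression(max_iter=1000).fit(X, y)

# The reweighted model predicts the minority class more often, trading
# some overall accuracy for better minority-class recall
n_weighted = int((weighted.predict(X) == 1).sum())
n_plain = int((plain.predict(X) == 1).sum())
```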

HTH,

Gael
