Re: [Scikit-learn-general] Value error when using KNeighboursClassifier with GridSearch

2015-04-28 Thread Jitesh Khandelwal
@Joel: That's exactly the mistake I made. Actually, I already had the transforms implemented in another package; none of them requires any fitting on the data. While wrapping them in sklearn transformer classes, I called the transform functions in the fit() method rather than the transform() method.
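A minimal sketch of the correct wrapping for such a stateless transform (np.log1p is a hypothetical stand-in for the external package's function): fit() just returns self, and the work happens in transform().

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class StatelessWrapper(BaseEstimator, TransformerMixin):
    # Wraps a transform that needs no fitting on the data.
    def fit(self, X, y=None):
        # Nothing to learn here; just return self.
        return self

    def transform(self, X):
        # The actual work belongs here, not in fit().
        return np.log1p(X)  # stand-in for the external package's transform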

Re: [Scikit-learn-general] bias in svm.LinearSVC classification accuracy in very small data sample? (Andreas Mueller)

2015-04-28 Thread Fabrizio Fasano
Thanks a lot. Based on your suggestion I performed the following two tests (code below): 1) on the true labels, instead of defining train/test sets by StratifiedShuffleSplit, I performed repeated permutations of the train/test sets via cross_validation.train_test_split, and the accuracy came out as Accuracy: 9…
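A sketch of that first test; the repetition count and split size are assumptions (the digest cuts them off), and sklearn.cross_validation has since become sklearn.model_selection:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=60, n_features=10, random_state=0)
scores = []
for seed in range(100):  # repeated random train/test permutations
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                              random_state=seed)
    scores.append(LinearSVC().fit(X_tr, y_tr).score(X_te, y_te))
print('Accuracy: %.2f' % np.mean(scores))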

Re: [Scikit-learn-general] Use of the 'learn' font in third party packages

2015-04-28 Thread Olivier Grisel
Note that Trevor already tries to automate checks for semantic compatibility by leveraging Andy's estimator checks utility when possible: https://github.com/trevorstephens/gplearn/blob/master/gplearn/tests/test_common.py This can probably be improved (on both sides) but it's a great start! As for …
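The utility in question is the common estimator check suite; a minimal sketch of running it on one's own estimator (LogisticRegression stands in here for a third-party class):

from sklearn.linear_model import LogisticRegression
from sklearn.utils.estimator_checks import check_estimator

# Runs scikit-learn's common API/semantics checks against the estimator;
# raises on the first failed check.
check_estimator(LogisticRegression())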

[Scikit-learn-general] SVM for feature selection

2015-04-28 Thread Pagliari, Roberto
From the documentation: "Feature selection is usually used as a pre-processing step before doing the actual learning. The recommended way to do this in scikit-learn is to use a sklearn.pipeline.Pipeline …"
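A minimal sketch of such a pipeline, here with an L1-penalized LinearSVC doing the selection; SelectFromModel is the idiom in current scikit-learn, and the downstream LogisticRegression is an arbitrary choice:

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipe = Pipeline([
    # The L1 penalty zeroes out coefficients; SelectFromModel keeps the rest.
    ('select', SelectFromModel(LinearSVC(penalty='l1', dual=False))),
    ('clf', LogisticRegression()),
])
# pipe.fit(X, y) then behaves like a single estimator.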

[Scikit-learn-general] Topic extraction

2015-04-28 Thread C K Kashyap
Hi everyone, I am new to scikit. I only feel sad for not knowing it earlier; it's awesome. I am trying to do the following: extract topics from a bunch of tweets. I tried NMF (from the sample here: http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf.html) but I w…

[Scikit-learn-general] error with RFE and gridsearchCV

2015-04-28 Thread Pagliari, Roberto
I'm trying to use recursive feature elimination with gradient boosting and grid search as shown below:

gbr = GradientBoostingClassifier()
parameters = {'learning_rate': [0.1, 0.01, 0.001],
              'max_depth': [1, 4, 6],
              'min_samples_leaf': [3, 5, 9, 17], …

Re: [Scikit-learn-general] SVM for feature selection

2015-04-28 Thread Sebastian Raschka
With L1 regularization you can't "control" the exact number of features that will be selected; it depends on the data (which features are irrelevant) and on the regularization strength. What it basically does is zero out coefficients. If you want to experiment with the number of features …
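A small sketch of that effect on synthetic data: smaller C means stronger regularization and fewer surviving coefficients (the C values are illustrative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)
for C in (0.01, 0.1, 1.0):
    svm = LinearSVC(penalty='l1', dual=False, C=C).fit(X, y)
    # Count coefficients the L1 penalty did not zero out.
    print('C=%g -> %d nonzero coefficients'
          % (C, np.sum(np.abs(svm.coef_) > 1e-6)))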

Re: [Scikit-learn-general] Use of the 'learn' font in third party packages

2015-04-28 Thread Andreas Mueller
On 04/28/2015 09:49 AM, Olivier Grisel wrote: > Note that Trevor already tries to automate checks for semantic compatibility by leveraging Andy's estimator checks utility when possible […]

Re: [Scikit-learn-general] error with RFE and gridsearchCV

2015-04-28 Thread Artem
GridSearchCV is not an estimator but a utility to find one. So you should fit the grid search first, in order to find the classifier that performs well on the CV splits, and then use it. Like this:

gbr = GradientBoostingClassifier()
parameters = {'learning_rate': [0.1, 0.01, 0.001], …
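A sketch of how that presumably continues (the digest cuts the reply off); the grid values are the ones from Roberto's message, and sklearn.grid_search has since moved to sklearn.model_selection:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
gbr = GradientBoostingClassifier()
parameters = {'learning_rate': [0.1, 0.01, 0.001],
              'max_depth': [1, 4, 6],
              'min_samples_leaf': [3, 5, 9, 17]}
search = GridSearchCV(gbr, parameters, cv=3)
search.fit(X, y)                   # run the search first
best_clf = search.best_estimator_  # a fitted classifier, usable downstream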

Re: [Scikit-learn-general] SVM for feature selection

2015-04-28 Thread Pagliari, Roberto
Hi Sebastian, thanks for the hint. I think another way of doing it could be to use PCA in the pipeline and set the number of components in 'parameters'? Thanks,

Re: [Scikit-learn-general] SVM for feature selection

2015-04-28 Thread Sebastian Raschka
Yes, PCA would work too, but then you'll get feature extraction instead of feature selection :)

Re: [Scikit-learn-general] error with RFE and gridsearchCV

2015-04-28 Thread Pagliari, Roberto
Thank you! One more question: when it comes to pipelining with grid search, which estimators can I use for feature selection, apart from SVC and PCA? Thank you,

Re: [Scikit-learn-general] error with RFE and gridsearchCV

2015-04-28 Thread Andreas Mueller
GradientBoostingClassifier has feature_importances_, so at least the RFE in master will work. You can make grid-search work inside RFECV, but I wouldn't recommend it. Why don't you grid-search over the RFECV? Regarding your other question, have you looked at the feature selection documentation: …
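One reading of that suggestion, as a sketch: wrap the recursive eliminator in the grid search (not the other way around), so the booster's parameters and the feature count are tuned together. RFECV picks the feature count itself; plain RFE with an explicit grid over n_features_to_select is used here to keep the example small, and all grid values are illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
rfe = RFE(GradientBoostingClassifier(), step=5)
parameters = {'n_features_to_select': [5, 10, 15],
              'estimator__learning_rate': [0.1, 0.01]}
search = GridSearchCV(rfe, parameters, cv=3).fit(X, y)
print(search.best_params_)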

Re: [Scikit-learn-general] error with RFE and gridsearchCV

2015-04-28 Thread Sebastian Raschka
First, I think it's important to think about whether the combination makes sense. E.g., I think it wouldn't make much sense to combine PCA and kernel SVM, since PCA is a linear transformation technique (scikit-learn implements some non-linear dimensionality reduction techniques, too). Also, if the size of the dat…

Re: [Scikit-learn-general] Topic extraction

2015-04-28 Thread Andreas Mueller
Clusters are one per data point, while topics are not, so the model is slightly different. You can get the topic weights for each sample using NMF().fit_transform(X).
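A sketch of what that returns, on toy documents standing in for the tweets:

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "dogs and cats", "stock market crash",
        "market prices fell"]
tfidf = TfidfVectorizer().fit_transform(docs)
# One row per document, one column per topic: continuous weights,
# not a single cluster label.
doc_topic = NMF(n_components=2, random_state=0).fit_transform(tfidf)
print(doc_topic.shape)  # (4, 2)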

Re: [Scikit-learn-general] bias in svm.LinearSVC classification accuracy in very small data sample? (Andreas Mueller)

2015-04-28 Thread Andreas Mueller
For 1), the two methods should give the same result, except that currently there is no stratification in train_test_split, so the StratifiedShuffleSplit should be better. For 2), 51.66% for 100 permutations seems more reasonable than 60%.

Re: [Scikit-learn-general] SVM for feature selection

2015-04-28 Thread Pagliari, Roberto
Hi Sebastian, correct. However, if you set the number of components, you should get feature selection as well. Thank you,

Re: [Scikit-learn-general] SVM for feature selection

2015-04-28 Thread Eraldo Pomponi
Dear Roberto, just in case you want to better understand what Sebastian suggested, let me point you to two short videos from Hastie and Tibshirani's ML course about shrinkage methods: https://www.youtube.com/watch?v=cSKzqb0EKS0 https://www.youtube.com/watch?v=A5I1G1MfUmA They help …

Re: [Scikit-learn-general] SVM for feature selection

2015-04-28 Thread Andreas Mueller
No, because each component will use all features (PCA coefficients are dense).

Re: [Scikit-learn-general] SVM for feature selection

2015-04-28 Thread Ndjido Ardo Bar
Hi folks, when it comes to performing feature selection, I often suggest using ElasticNet, which combines an L1 and an L2 penalty. When using penalty-based feature selection, one must make sure the features are standardized; otherwise the selection can end up being misleading. Cheers,
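A sketch of that advice; the alpha and l1_ratio values are illustrative, and note that ElasticNet is a regressor, so for classification one would reach for, e.g., SGDClassifier(penalty='elasticnet') instead:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       random_state=0)
pipe = Pipeline([('scale', StandardScaler()),   # standardize first
                 ('enet', ElasticNet(alpha=1.0, l1_ratio=0.5))]).fit(X, y)
# Features with nonzero coefficients survive the selection.
print(np.flatnonzero(pipe.named_steps['enet'].coef_))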

Re: [Scikit-learn-general] SVM for feature selection

2015-04-28 Thread Pagliari, Roberto
Thanks for the info. I did not explain myself clearly. I just meant to say that once PCA is done, you could choose a smaller number of features, starting from the most relevant. To do that, I would still need to implement a custom transformer. Thank you,

Re: [Scikit-learn-general] Topic extraction

2015-04-28 Thread Joel Nothman
This shows the newsgroup name and highest-scoring topic for each doc:

zip(np.take(dataset.target_names, dataset.target),
    np.argmax(nmf.transform(tfidf), axis=1))

I think something based on this should be added to the example.
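In context, a sketch following the setup of the linked NMF example, where dataset is the 20-newsgroups bunch (the vectorizer settings here are simplified):

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

dataset = fetch_20newsgroups(shuffle=True, random_state=1)
tfidf = TfidfVectorizer(max_features=1000,
                        stop_words='english').fit_transform(dataset.data)
nmf = NMF(n_components=10, random_state=1).fit(tfidf)
# Pair each document's newsgroup name with its highest-scoring topic.
pairs = list(zip(np.take(dataset.target_names, dataset.target),
                 np.argmax(nmf.transform(tfidf), axis=1)))
print(pairs[:5])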

Re: [Scikit-learn-general] Use of the 'learn' font in third party packages

2015-04-28 Thread Trevor Stephens
Hi All, thanks a lot for your responses. Gaël: Certainly not looking to break any openness or trust, that's why I asked :-) After a bit more thought on the font issue, I think the danger of implying my package is reviewed/endorsed by scikit-learn is too great with the graphic similarities that you …

Re: [Scikit-learn-general] Topic extraction

2015-04-28 Thread C K Kashyap
Thanks Joel and Andreas. Joel, I think "highest ranking topic for each doc" is exactly what I am looking for. Could you elaborate on the code please? What would dataset.target_names and dataset.target be in my case (http://lpaste.net/131649)? Regards, Kashyap

[Scikit-learn-general] K-Fold-Cross-validation in Scikit-Learn

2015-04-28 Thread nmura...@masonlive.gmu.edu
Hello, I am very new to scikit-learn and am trying to run cross-validation on a data frame consisting of text features and a classification class. I am trying to perform text data classification. It is a 2-class classification problem where the distribution between positive and negative instances is …

Re: [Scikit-learn-general] K-Fold-Cross-validation in Scikit-Learn

2015-04-28 Thread Sebastian Raschka
Hi, Nikhil, you could use stratified k-fold cross validation, which preserves the "original" class proportions. An example can be found here: http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedKFold.html
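A minimal sketch; the linked page documents the old sklearn.cross_validation module, while in current releases the class lives in sklearn.model_selection and splits via a .split() method:

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.rand(10, 3)
y = np.array([0] * 8 + [1] * 2)   # imbalanced toy labels
for train_idx, test_idx in StratifiedKFold(n_splits=2).split(X, y):
    # Each fold keeps roughly the original 8:2 class proportion.
    print(np.bincount(y[train_idx]), np.bincount(y[test_idx]))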

Re: [Scikit-learn-general] Topic extraction

2015-04-28 Thread Joel Nothman
Highest ranking topic for each doc is just np.argmax(nmf.transform(tfidf), axis=1). This is because nmf.transform(tfidf) returns a matrix of shape (num samples, num components / topics) …

Re: [Scikit-learn-general] Topic extraction

2015-04-28 Thread C K Kashyap
Thank you so much Joel, I understood. Just one more thing, please: how can I include a document against its highest ranking topic only if it crosses a threshold? Regards, Kashyap

Re: [Scikit-learn-general] Topic extraction

2015-04-28 Thread Joel Nothman
Mask with np.max(..., axis=1) > threshold.
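Spelled out as a sketch, continuing from the 20-newsgroups example above; the 0.1 threshold is illustrative and, since NMF weights are not probabilities, would need tuning per dataset:

import numpy as np

doc_topic = nmf.transform(tfidf)           # as in the earlier sketch
best = np.argmax(doc_topic, axis=1)        # top topic per document
strong = np.max(doc_topic, axis=1) > 0.1   # mask of confident assignments
assigned = best[strong]                    # topics for documents that pass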

Re: [Scikit-learn-general] Topic extraction

2015-04-28 Thread C K Kashyap
Works like a charm. Just noticed, though, that the max value is sometimes more than 1.0; is that okay? Regards, Kashyap

Re: [Scikit-learn-general] Topic extraction

2015-04-28 Thread Joel Nothman
Yes, this is not a probabilistic method.