Re: [Scikit-learn-general] sample weights for RandomForestClassifier to compute cross_val_score with roc_auc metric

2015-04-27 Thread Arnaud Joly
If you set sample_weight[i] = 2 for the i-th sample, that sample is counted
twice in the tree-growing procedure (impurity computation, leaf labelling, …).
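
A minimal sketch of what that means in code (illustrative data and seed; with
the default bootstrapping the two resampling streams differ, so the comparison
is only approximate):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, random_state=0)

# weight the first sample twice ...
w = np.ones(len(y))
w[0] = 2.0
clf_w = RandomForestClassifier(n_estimators=100, random_state=0)
clf_w.fit(X, y, sample_weight=w)

# ... versus literally duplicating that row before fitting
X_dup = np.vstack([X, X[:1]])
y_dup = np.concatenate([y, y[:1]])
clf_d = RandomForestClassifier(n_estimators=100, random_state=0)
clf_d.fit(X_dup, y_dup)

# the two forests should agree on nearly all training points
print(np.mean(clf_w.predict(X) == clf_d.predict(X)))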

Best regards,
Arnaud



 On 26 Apr 2015, at 16:00, Luca Puggini lucapug...@gmail.com wrote:
 
 OK, thanks a lot; one last question.  
 
 What is the role of sample_weight if I use ExtraTreesClassifier with 
 bootstrap=False (this is the default)?  
 Are the weights used during the splitting process?  
 
 On Sat, Apr 25, 2015 at 10:04 PM, Andy t3k...@gmail.com wrote:
 On 04/25/2015 09:18 AM, Luca Puggini wrote:
 I think it depends on the role of sample weight during the construction of 
 the forest. 
 If I set sample_weight = 2 for one of my samples, is this equivalent to 
 duplicating the row in the data?
 
 During fitting, yes; during evaluation, currently not.
 
 
 On Fri, Apr 24, 2015 at 10:25 PM, Andreas Mueller t3k...@gmail.com wrote:
 The roc_auc will not take sample_weights into account if using 
 cross_val_score.
 Thinking about it, I'm not sure if this is a bug or a feature.
 Not sure if that was discussed before; I opened an issue:
 https://github.com/scikit-learn/scikit-learn/issues/4632
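
 A minimal sketch of the distinction (stand-in imbalanced data; the 0.16-era
 imports match this thread). The weights affect only the fit; to weight the
 metric as well one could pass sample_weight=w[test] to roc_auc_score:

 import numpy as np
 from sklearn.cross_validation import StratifiedKFold
 from sklearn.datasets import make_classification
 from sklearn.ensemble import RandomForestClassifier
 from sklearn.metrics import roc_auc_score

 X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
 w = np.where(y == 1, 9.0, 1.0)  # up-weight the minority class

 scores = []
 for train, test in StratifiedKFold(y, n_folds=5):
     clf = RandomForestClassifier(random_state=0)
     clf.fit(X[train], y[train], sample_weight=w[train])  # weights used here
     proba = clf.predict_proba(X[test])[:, 1]
     scores.append(roc_auc_score(y[test], proba))  # scoring is unweighted
 print(np.mean(scores))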
 
 
 On 04/24/2015 12:29 PM, Luca Puggini wrote:
 Dear all,
 
 I am quite new to {0,1} classification problems. 
 I have an unbalanced dataset and I am using a random forest on it. 
 
 To evaluate the performance of my estimator I am using the cross_val_score 
 function with the roc_auc metric. 
 
 My understanding is that to deal with the class imbalance I can pass the 
 argument sample_weight to the random forest estimator.
 
 I do not understand whether I should also pass the sample_weight parameter in 
 this case, or whether this will bias the result obtained with roc_auc.
 
 Is there any common way to do that? Have you any advice?
 
 Thanks a lot! 
 
 

[Scikit-learn-general] Value error when using KNeighboursClassifier with GridSearch

2015-04-27 Thread Jitesh Khandelwal
Hi all,

I am trying to use grid search to evaluate some decomposition techniques of
my own. I have implemented some custom transformers such as PAA, DFT, DWT
as shown in the code below.

I am getting a strange ValueError when I run the code below and I am unable
to figure out the origin of the problem.

I have pasted the code below and attached the error log file.

Any suggestions on how I can move forward from here would be helpful.

Thanks.

Code:
===
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

from time_series.decomposition import PAA, DFT, DWT, ShapeX
from prepare_data import combine_train_test_dataset

knn = KNeighborsClassifier()
paa = PAA()

pipe = Pipeline([
('paa', paa),
('knn', knn)
])

n_components = [1,2,4,5,10,20,40]
n_neighbors = range(1,11)
metrics = ['euclidean']

datadir = "../keogh_datasets/Coffee"
X,y = combine_train_test_dataset(datadir)

model_tunning = GridSearchCV(pipe, {
'paa__n_components': n_components,
'knn__n_neighbors': n_neighbors,
'knn__metric': metrics,
},
n_jobs=-1)

model_tunning.fit(X,y)

print model_tunning.best_score_
print model_tunning.best_params_
===


error_log
Description: Binary data


Re: [Scikit-learn-general] bias in svm.LinearSVC classification accuracy in very small data sample? (Andreas Mueller)

2015-04-27 Thread Andreas Mueller
You changed the labels only once, and have a test-set size of 4? I would 
imagine that is where that comes from.
If you repeat over different assignments, you will get 50/50.
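
Repeating the shuffle over many random label assignments is what
permutation_test_score automates. A rough sketch with stand-in data of the
shape described (16 samples, 8 per class; with random features the score
should hover around 0.5):

import numpy as np
from sklearn import svm
from sklearn.cross_validation import (StratifiedShuffleSplit,
                                      permutation_test_score)

rng = np.random.RandomState(0)
X = rng.rand(16, 10)
y = np.array([0] * 8 + [1] * 8)

clf = svm.LinearSVC(penalty='l1', dual=False, C=1, random_state=1)
cv = StratifiedShuffleSplit(y, 10, test_size=0.25, random_state=0)
score, perm_scores, pvalue = permutation_test_score(clf, X, y, cv=cv,
                                                    n_permutations=100)
print("score %.3f, p-value %.3f" % (score, pvalue))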

On 04/27/2015 11:33 AM, Fabrizio Fasano wrote:
 Dear Andy,

 Yes, the classes have the same size, 8 and 8

 this is one example of the code I used to cross-validate the classification 
 (I used here StratifiedShuffleSplit, but I also used other methods such as 
 leave-one-out or simple 4-fold cross-validation, and the result didn't change much)

 import numpy as np
 from sklearn import svm
 from sklearn.cross_validation import StratifiedShuffleSplit

 # X_scaled and y are my scaled data and labels
 sss = StratifiedShuffleSplit(y, 100, test_size=0.25, random_state=0)
 clf = svm.LinearSVC(penalty='l1', dual=False, C=1, random_state=1)

 cv_scores = []
 for train_index, test_index in sss:
     X_train, X_test = X_scaled[train_index], X_scaled[test_index]
     y_train, y_test = y[train_index], y[test_index]
     clf.fit(X_train, y_train)
     y_pred = clf.predict(X_test)
     cv_scores.append(np.sum(y_pred == y_test) / float(np.size(y_test)))

 print "Accuracy", np.ceil(100*np.mean(cv_scores)), "+/-", np.ceil(200*np.std(cv_scores))




 On Apr 26, 2015, at 7:50 PM, Andy wrote:

 Your expectation is right, if you randomly assign labels, you shouldn't
 get more than 50% correct with a large enough dataset.
 I imagine there is some issue in how you shuffled the labels. Without
 the code, it is hard to tell.
 Are you sure the classes have the same size?

 On 04/26/2015 11:22 AM, Fabrizio Fasano wrote:
 Dear Andreas,

 Thanks a lot for your help,

 about the random assignment of values to my labels y: what I mean is that, 
 being suspicious about the too-good performance, I changed the labels 
 manually, retaining the 50% 1/0 split but in different orders, and the labels 
 were always predicted very well, with accuracy no lower than 60%. I mean, 
 by chance I expected values lower than 50% as well as values higher than 
 50%. I didn't perform an exhaustive test (I only did it manually for a few 
 combinations)...

 Fabrizio


Re: [Scikit-learn-general] Value error when using KNeighboursClassifier with GridSearch

2015-04-27 Thread Joel Nothman
I assume you have checked that combine_train_test_dataset produces data of
the correct dimensions in both X and y.

I would be very surprised if the problem were not in PAA, so check it
again: make sure that you test that PAA().fit(X1).transform(X2) gives the
transformation of X2. The error seems to suggest it is returning an array
of X1's size.
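
A quick sanity check along those lines (PAA and its n_components parameter
come from the original post; the shapes mirror the reported error, and this
is only a hypothetical test, not scikit-learn code):

import numpy as np
from time_series.decomposition import PAA

X1 = np.random.rand(37, 256)  # fit on one array ...
X2 = np.random.rand(19, 256)  # ... then transform a different-sized one
Xt = PAA(n_components=10).fit(X1).transform(X2)
assert Xt.shape[0] == X2.shape[0], "transform must keep its input's n_samples"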

On 28 April 2015 at 05:11, Jitesh Khandelwal jk231...@gmail.com wrote:

 Hi Andreas,

 Thanks for your response.

 No, PAA does not change the number of samples. It just reduces the number
 of features.

 For example, if the input matrix X has X.shape = (100, 100) and
 n_components = 10 in PAA, then the transformed X has shape (100, 10).

 Yes, I did try using PAA in the ipython shell (without the grid search) on
 the same dataset and it does the transformation as expected.

 Another interesting observation is that the dataset that I have used in
 the code has dimensions (56, 256) and also 37 + 19 = 56. Does this provide
 any insight about the error?


 Jitesh Khandelwal


 On Tue, Apr 28, 2015 at 12:26 AM, Andreas Mueller t3k...@gmail.com
 wrote:

  Does PAA by any chance change the number of samples?
 The error is:
 ValueError: Found array with dim 37. Expected 19

 Interestingly that happens only in the scoring.

 Does it work without the grid-search?



 On 04/27/2015 07:14 AM, Jitesh Khandelwal wrote:

  Hi all,

  I am trying to use grid search to evaluate some decomposition
 techniques of my own. I have implemented some custom transformers such as
 PAA, DFT, DWT as shown in the code below.

  I am getting a strange ValueError when I run the code below and I am
 unable to figure out the origin of the problem.

  I have pasted the code below and attached the error log file.

  Any suggestions on how I can move forward from here would be helpful.

  Thanks.

  Code:
 ===
  from sklearn.pipeline import Pipeline
 from sklearn.grid_search import GridSearchCV
 from sklearn.neighbors import KNeighborsClassifier

  from time_series.decomposition import PAA, DFT, DWT, ShapeX
 from prepare_data import combine_train_test_dataset

  knn = KNeighborsClassifier()
 paa = PAA()

  pipe = Pipeline([
 ('paa', paa),
 ('knn', knn)
 ])

  n_components = [1,2,4,5,10,20,40]
 n_neighbors = range(1,11)
 metrics = ['euclidean']

  datadir = "../keogh_datasets/Coffee"
 X,y = combine_train_test_dataset(datadir)

  model_tunning = GridSearchCV(pipe, {
 'paa__n_components': n_components,
 'knn__n_neighbors': n_neighbors,
 'knn__metric': metrics,
 },
 n_jobs=-1)

  model_tunning.fit(X,y)

  print model_tunning.best_score_
 print model_tunning.best_params_
 ===




Re: [Scikit-learn-general] Random forest with correlated features?

2015-04-27 Thread Luca Puggini
Hey,
I spent quite some time on this problem.

1) If you are interested only in prediction, this is not a big problem. You
can preprocess the data with PCA.

2) If you want to understand which variables are important,
I suggest you read the paper Understanding variable importances in
forests of randomized trees.
In general I suggest using ExtraTreesClassifier with max_depth=3 or 5.
There is some discussion about whether it is better to use max_features=1 or
max_features=n_features (I would go for the latter); a rough sketch follows below.
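
A sketch of both suggestions (stand-in data; the PCA dimensionality and
dataset parameters are made up for illustration):

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)

# 1) prediction only: decorrelate the features with PCA first
pred_model = Pipeline([('pca', PCA(n_components=10)),
                       ('rf', RandomForestClassifier(random_state=0))])
pred_model.fit(X, y)

# 2) variable importance: shallow trees, all features considered at each split
imp_model = ExtraTreesClassifier(max_depth=3, max_features=X.shape[1],
                                 random_state=0).fit(X, y)
print(imp_model.feature_importances_)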

I ran into some problems with the R package you mention, so I would not
use it.

I hope this can help.
Best,
Luca

On Mon, Apr 27, 2015 at 4:48 PM, Daniel Homola 
daniel.homol...@imperial.ac.uk wrote:

 Dear all,

 I've found several articles expressing concerns about using Random
 Forest with highly correlated features (e.g.
 http://www.biomedcentral.com/1471-2105/9/307).

 I was wondering if this drawback of the RF algorithm could be somehow
 remedied using scikit-learn methods? The above linked paper has an R
 package but it's known to offer a super-slow solution to the problem.
  When I thought about this problem (quite naively, as I'm at best an
 enthusiastic beginner in ML) I thought maybe further randomisation in
 the tree building might help with this. So would using
 ExtraTreesClassifier provide some protection against this issue?

 Thanks a lot for any suggestions in advance!

 Cheers,
 Daniel




[Scikit-learn-general] Value error when using KNeighboursClassifier with GridSearch

2015-04-27 Thread Jitesh Khandelwal
On Mon, Apr 27, 2015 at 4:44 PM, Jitesh Khandelwal jk231...@gmail.com
wrote:

 Hi all,

 I am trying to use grid search to evaluate some decomposition techniques
 of my own. I have implemented some custom transformers such as PAA, DFT,
 DWT as shown in the code below.

  I am getting a strange ValueError when I run the code below and I am
  unable to figure out the origin of the problem.

  I have pasted the code below and attached the error log file.

  Any suggestions on how I can move forward from here would be helpful.

 Thanks.

 Code:
 ===
 from sklearn.pipeline import Pipeline
 from sklearn.grid_search import GridSearchCV
 from sklearn.neighbors import KNeighborsClassifier

 from time_series.decomposition import PAA, DFT, DWT, ShapeX
 from prepare_data import combine_train_test_dataset

 knn = KNeighborsClassifier()
 paa = PAA()

 pipe = Pipeline([
 ('paa', paa),
 ('knn', knn)
 ])

 n_components = [1,2,4,5,10,20,40]
 n_neighbors = range(1,11)
 metrics = ['euclidean']

 datadir = "../keogh_datasets/Coffee"
 X,y = combine_train_test_dataset(datadir)

 model_tunning = GridSearchCV(pipe, {
 'paa__n_components': n_components,
 'knn__n_neighbors': n_neighbors,
 'knn__metric': metrics,
 },
 n_jobs=-1)

 model_tunning.fit(X,y)

 print model_tunning.best_score_
 print model_tunning.best_params_
 ===




Re: [Scikit-learn-general] sequential feature selection algorithms

2015-04-27 Thread Sebastian Raschka
 I guess that could be done, but has a much higher complexity than RFE.

Oh yes, I agree, the sequential feature algorithms are definitely 
computationally more costly. 

 It seems interesting. Is that really used in practice and is there any 
 literature evaluating it?


I am not sure how often it is used in practice nowadays, but I think it is one 
of the classic approaches for feature selection -- I learned about it a couple 
of years ago in a pattern classification class, and there is a relatively 
detailed article in 

Ferri, F., et al. Comparative study of techniques for large-scale feature 
selection. Pattern Recognition in Practice IV (1994): 403-413.

The optimal solution to feature selection would be to evaluate the performance 
of all possible feature combinations, which is a little bit too costly in 
practice. The sequential forward or backward selection (SFS and SBS) algorithms 
are just a suboptimal solution, and there are some minor improvements, e.g., 
Sequential Floating Forward Selection (SFFS), which allows for the removal of 
added features in later stages, etc.

I have an implementation of SBS that uses k-fold cross_val_score, and it is 
actually not a bad idea to use it for KNN to reduce overfitting as an alternative 
to dimensionality reduction; for example, KNN cross-val mean accuracy on the 
wine dataset where the features are selected by SBS: 
http://i.imgur.com/ywDTHom.png?1

But for scikit-learn, it may be better to implement SFBS or SFFS, which are 
slightly more sophisticated.
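
A bare-bones sketch of plain SBS on top of cross_val_score, in the spirit
described above (iris stands in for the wine data; this is not the
implementation mentioned):

from sklearn.cross_validation import cross_val_score
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target
clf = KNeighborsClassifier()

features = list(range(X.shape[1]))
while len(features) > 2:  # stop once two features remain
    # score every subset with one feature dropped; keep the best drop
    scores = [(cross_val_score(clf, X[:, [f for f in features if f != drop]],
                               y, cv=5).mean(), drop)
              for drop in features]
    best_score, drop = max(scores)
    features.remove(drop)
    print("kept %s, score %.3f" % (features, best_score))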


 On Apr 27, 2015, at 6:00 PM, Andreas Mueller t3k...@gmail.com wrote:
 
 That is like a one-step look-ahead feature selection?
 I guess that could be done, but has a much higher complexity than RFE.
 RFE works for anything that returns importances, not just linear models.
 It doesn't really work for KNN, as you say. [I wouldn't say 
 non-parametric models. Trees are pretty non-parametric].
 
 It seems interesting. Is that really used in practice and is there any 
 literature evaluating it?
 There is some discussion here 
 http://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf in 4.2
 but there is no empirical comparison or theoretical analysis.
 
 To be added to sklearn, you'd need to show that it is widely used and / 
 or widely useful.
 
 
 On 04/27/2015 02:47 PM, Sebastian Raschka wrote:
 Hi, I was wondering if sequential feature selection algorithms are currently 
 implemented in scikit-learn. The closest that I could find was recursive 
 feature elimination (RFE); 
 http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html.
 However, unless the application requires a fixed number of features, I am 
 not sure if it is necessarily worthwhile using it over regularized models. 
 If I understand correctly, it works like this:
 
 {x1, x2, x3} -- eliminate xi with smallest corresponding weight
 
 {x1, x3} -- eliminate xi with smallest corresponding weight
 
 {x1}
 
 However, this would only work with linear, discriminative models, right?
 
 Wouldn't a classic sequential feature selection algorithm be useful for 
 non-regularized, nonparametric models, e.g., K-nearest neighbors, as an 
 alternative to dimensionality reduction for applications where the original 
 features may need to be maintained? RFE, for example, wouldn't work with 
 KNN, and maybe the data is non-linearly separable, so that RFE with a linear 
 model doesn't make sense.
 
 In a nutshell, SFS algorithms simply add or remove one feature at a time 
 based on the classifier performance.
 
 e.g., Sequential backward selection:
 
 {x1, x2, x3} --- estimate performance on {x1, x2}, {x2, x3} and {x1, x3}, 
 and pick the subset with the best performance
 {x1, x3} --- estimate performance on {x1}, {x3} and pick the subset with 
 the best performance
 {x1}
 
 where performance could be, e.g., cross-val accuracy.
 
 What do you think?
 
 Best,
 Sebastian

Re: [Scikit-learn-general] Random forest with correlated features?

2015-04-27 Thread Luca Puggini
I think you can find something more rigorous here:

http://orbi.ulg.ac.be/handle/2268/170309



On Mon, Apr 27, 2015 at 11:20 PM, Daniel Homola 
daniel.homol...@imperial.ac.uk wrote:

  Hi Luca,

 The reason I asked is that I'm interested in the second problem. Thanks
 a lot for the paper and the suggested params; I'll read it and try them!

 Has anyone tested these assumptions/parameters rigorously on simulated
 data, or is this more of a feeling?

 Thanks again for the quick and informative response!
 Best,
 Daniel


 On 27/04/15 20:43, Luca Puggini wrote:

  Hey,
  I spent quite some time on this problem.
 
  1) If you are interested only in prediction, this is not a big problem.
 You can preprocess the data with PCA.
 
  2) If you want to understand which variables are important,
 I suggest you read the paper Understanding variable importances in
 forests of randomized trees.
  In general I suggest using ExtraTreesClassifier with max_depth=3 or
 5. There is some discussion about whether it is better to use max_features=1 or
 max_features=n_features (I would go for the latter).
 
  I ran into some problems with the R package you mention, so I would not
 use it.

  I hope this can help.
  Best,
  Luca

 On Mon, Apr 27, 2015 at 4:48 PM, Daniel Homola 
 daniel.homol...@imperial.ac.uk wrote:

 Dear all,

 I've found several articles expressing concerns about using Random
 Forest with highly correlated features (e.g.
 http://www.biomedcentral.com/1471-2105/9/307).

 I was wondering if this drawback of the RF algorithm could be somehow
 remedied using scikit-learn methods? The above linked paper has an R
 package but it's known to offer a super-slow solution to the problem.
 When I thought about this problem (quite naively, as I'm at best an
 enthusiastic beginner in ML) I thought maybe further randomisation in
 the tree building might help with this. So would using
 ExtraTreesClassifier provide some protection against this issue?

 Thanks a lot for any suggestions in advance!

 Cheers,
 Daniel




Re: [Scikit-learn-general] sequential feature selection algorithms

2015-04-27 Thread Joel Nothman
I suspect this method is underreported by any particular name, as it's a
straightforward greedy search. It is also very close to what I think many
researchers do in system development or report in system analysis, albeit
with more automation.

In the case of KNN, I would think metric learning could subsume or
outperform this.

On 28 April 2015 at 08:50, Andreas Mueller t3k...@gmail.com wrote:

 Maybe we would want mrmr first?

 http://penglab.janelia.org/proj/mRMR/


 On 04/27/2015 06:46 PM, Sebastian Raschka wrote:
  I guess that could be done, but has a much higher complexity than RFE.
  Oh yes, I agree, the sequential feature algorithms are definitely
 computationally more costly.
 
  It seems interesting. Is that really used in practice and is there any
  literature evaluating it?
 
  I am not sure how often it is used in practice nowadays, but I think it
 is one of the classic approaches for feature selection -- I learned about
 it a couple of years ago in a pattern classification class, and there is a
 relatively detailed article in
 
  Ferri, F., et al. Comparative study of techniques for large-scale
 feature selection. Pattern Recognition in Practice IV (1994): 403-413.
 
  The optimal solution to feature selection would be to evaluate the
 performance of all possible feature combinations, which is a little bit too
 costly in practice. The sequential forward or backward selection (SFS and
 SBS) algorithms are just a suboptimal solution, and there are some minor
 improvements, e.g., Sequential Floating Forward Selection (SFFS), which
 allows for the removal of added features in later stages, etc.
 
  I have an implementation of SBS that uses k-fold cross_val_score, and it
 is actually not a bad idea to use it for KNN to reduce overfitting as an
 alternative to dimensionality reduction; for example, KNN cross-val mean
 accuracy on the wine dataset where the features are selected by SBS:
 http://i.imgur.com/ywDTHom.png?1
 
  But for scikit-learn, it may be better to implement SFBS or SFFS, which
 are slightly more sophisticated.
 
 
  On Apr 27, 2015, at 6:00 PM, Andreas Mueller t3k...@gmail.com wrote:
 
  That is like a one-step look-ahead feature selection?
  I guess that could be done, but has a much higher complexity than RFE.
  RFE works for anything that returns importances, not just linear
 models.
  It doesn't really work for KNN, as you say. [I wouldn't say
  non-parametric models. Trees are pretty non-parametric].
 
  It seems interesting. Is that really used in practice and is there any
  literature evaluating it?
  There is some discussion here
  http://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf in 4.2
  but there is no empirical comparison or theoretical analysis.
 
  To be added to sklearn, you'd need to show that it is widely used and /
  or widely useful.
 
 
  On 04/27/2015 02:47 PM, Sebastian Raschka wrote:
  Hi, I was wondering if sequential feature selection algorithms are
 currently implemented in scikit-learn. The closest that I could find was
 recursive feature elimination (RFE);
 http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html.
 However, unless the application requires a fixed number of features, I am
 not sure if it is necessarily worthwhile using it over regularized models.
 If I understand correctly, it works like this:
 
  {x1, x2, x3} -- eliminate xi with smallest corresponding weight
 
  {x1, x3} -- eliminate xi with smallest corresponding weight
 
  {x1}
 
  However, this would only work with linear, discriminative models, right?
 
  Wouldn't a classic sequential feature selection algorithm be useful
 for non-regularized, nonparametric models, e.g., K-nearest neighbors, as an
 alternative to dimensionality reduction for applications where the original
 features may need to be maintained? RFE, for example, wouldn't work
 with KNN, and maybe the data is non-linearly separable, so that RFE with a
 linear model doesn't make sense.
 
  In a nutshell, SFS algorithms simply add or remove one feature at a
 time based on the classifier performance.
 
  e.g., Sequential backward selection:
 
  {x1, x2, x3} --- estimate performance on {x1, x2}, {x2, x3} and {x1,
 x3}, and pick the subset with the best performance
  {x1, x3} --- estimate performance on {x1}, {x3} and pick the subset
 with the best performance
  {x1}
 
  where performance could be, e.g., cross-val accuracy.
 
  What do you think?
 
  Best,
  Sebastian
 

Re: [Scikit-learn-general] Random forest with correlated features?

2015-04-27 Thread Daniel Homola

Hi Luca,

The reason I asked is that I'm interested in the second problem. 
Thanks a lot for the paper and the suggested params; I'll read it and 
try them!


Has anyone tested these assumptions/parameters rigorously on simulated 
data, or is this more of a feeling?


Thanks again for the quick and informative response!
Best,
Daniel

On 27/04/15 20:43, Luca Puggini wrote:

Hey,
I spent quite some time on this problem.

1) If you are interested only in prediction, this is not a big problem. 
You can preprocess the data with PCA.

2) If you want to understand which variables are important,
I suggest you read the paper Understanding variable importances in 
forests of randomized trees.
In general I suggest using ExtraTreesClassifier with max_depth=3 
or 5. There is some discussion about whether it is better to use max_features=1 
or max_features=n_features (I would go for the latter).

I ran into some problems with the R package you mention, so I would 
not use it.


I hope this can help.
Best,
Luca

On Mon, Apr 27, 2015 at 4:48 PM, Daniel Homola 
daniel.homol...@imperial.ac.uk wrote:


Dear all,

I've found several articles expressing concerns about using Random
Forest with highly correlated features (e.g.
http://www.biomedcentral.com/1471-2105/9/307).

I was wondering if this drawback of the RF algorithm could be somehow
remedied using scikit-learn methods? The above linked paper has an R
package but it's known to offer a super-slow solution to the problem.
When I thought about this problem (quite naively, as I'm at best an
enthusiastic beginner in ML) I thought maybe further randomisation in
the tree building might help with this. So would using
ExtraTreesClassifier provide some protection against this issue?

Thanks a lot for any suggestions in advance!

Cheers,
Daniel




[Scikit-learn-general] Use of the 'learn' font in third party packages

2015-04-27 Thread Trevor Stephens
Hi All,

I've been working for the past month or so on a third-party add-on/plug-in
package `gplearn` that uses the scikit-learn API to implement genetic
programming for symbolic regression tasks in Python and maintains
compatibility with the sklearn pipeline and gridsearch modules, etc. The
reason it is not being pushed as a PR is due to unproven usefulness in the
scikit-learn ecosystem, which comes up a lot on the GitHubs for major
additions.

I am edging my way towards a release now with docs and examples in process
and thus have a general question about the use of parts of the scikit-learn
logo found here:
https://github.com/scikit-learn/scikit-learn/blob/master/doc/logos/identity.pdf

I would like to incorporate the 'learn' font into my own package's logo,
here's the current draft:
https://files.gitter.im/trevorstephens/lqYX/gp-learn.png

I noticed that `nilearn` shares the 'learn' font from sklearn's logo,
though I understand a lot of the same core devs work on it. I see a few
pros and cons to allowing, or encouraging this:

Pros:
- encourages contributors to try out their algorithms in the wild to gauge
usefulness while still feeling like they are a part of an extended
scikit-learn ecosystem.
- a lot of PRs fall flat after a lot of effort on the developer's part. As
above, this gives them more of a chance to have something to show for
significant work done, if it is not ready for a prime-time merge.
- encourages a more common naming convention for scikit-learn compatible
estimators for easier PyPI discovery, kind of like the implied link back to
scipy toolkits with the various scikits.

Cons:
- may carry an implication that the code is reviewed and +1'd by the core
devs, which it clearly is not.
- that's all I can think of, open to hear other objections.

Anyhow, interested in what the core team thinks about this and am excited
to release my package, with or without the script MT bold fanciness.


Cheers,

- Trev


Re: [Scikit-learn-general] bias in svm.LinearSVC classification accuracy in very small data sample? (Andreas Mueller)

2015-04-27 Thread Fabrizio Fasano
Dear Andy,

Yes, the classes have the same size, 8 and 8

this is one example of the code I used to cross-validate the classification (I used 
here StratifiedShuffleSplit, but I also used other methods such as leave-one-out or 
simple 4-fold cross-validation, and the result didn't change much)

import numpy as np
from sklearn import svm
from sklearn.cross_validation import StratifiedShuffleSplit

# X_scaled and y are my scaled data and labels
sss = StratifiedShuffleSplit(y, 100, test_size=0.25, random_state=0)
clf = svm.LinearSVC(penalty='l1', dual=False, C=1, random_state=1)

cv_scores = []
for train_index, test_index in sss:
    X_train, X_test = X_scaled[train_index], X_scaled[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    cv_scores.append(np.sum(y_pred == y_test) / float(np.size(y_test)))

print "Accuracy", np.ceil(100*np.mean(cv_scores)), "+/-", np.ceil(200*np.std(cv_scores))




On Apr 26, 2015, at 7:50 PM, Andy wrote:

 Your expectation is right, if you randomly assign labels, you shouldn't 
 get more than 50% correct with a large enough dataset.
 I imagine there is some issue in how you shuffled the labels. Without 
 the code, it is hard to tell.
 Are you sure the classes have the same size?
 
 On 04/26/2015 11:22 AM, Fabrizio Fasano wrote:
 Dear Andreas,
 
 Thanks a lot for your help,
 
 about the random assignment of values to my labels y: what I mean is that, 
 being suspicious about the too-good performance, I changed the labels 
 manually, retaining the 50% 1/0 split but in different orders, and the labels were 
 always predicted very well, with accuracy no lower than 60%. I mean, by 
 chance I expected values lower than 50% as well as values higher than 50%. I 
 didn't perform an exhaustive test (I only did it manually for a few 
 combinations)...
 
 Fabrizio


[Scikit-learn-general] sequential feature selection algorithms

2015-04-27 Thread Sebastian Raschka
Hi, I was wondering if sequential feature selection algorithms are currently 
implemented in scikit-learn. The closest that I could find was recursive 
feature elimination (RFE); 
http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html.
However, unless the application requires a fixed number of features, I am not 
sure if it is necessarily worthwhile using it over regularized models. If I 
understand correctly, it works like this:

{x1, x2, x3} -- eliminate xi with smallest corresponding weight

{x1, x3} -- eliminate xi with smallest corresponding weight

{x1}

However, this would only work with linear, discriminative models, right? 

Wouldn't a classic sequential feature selection algorithm be useful for 
non-regularized, nonparametric models, e.g., K-nearest neighbors, as an 
alternative to dimensionality reduction for applications where the original 
features may need to be maintained? RFE, for example, wouldn't work with 
KNN, and maybe the data is non-linearly separable, so that RFE with a linear 
model doesn't make sense.

In a nutshell, SFS algorithms simply add or remove one feature at a time 
based on the classifier performance.

e.g., Sequential backward selection:

{x1, x2, x3} --- estimate performance on {x1, x2}, {x2, x3} and {x1, x3}, and 
pick the subset with the best performance
{x1, x3} --- estimate performance on {x1}, {x3} and pick the subset with the 
best performance
{x1}

where performance could be, e.g., cross-val accuracy.

What do you think?

Best,
Sebastian


Re: [Scikit-learn-general] Value error when using KNeighboursClassifier with GridSearch

2015-04-27 Thread Andreas Mueller

Does PAA by any chance change the number of samples?
The error is:
ValueError: Found array with dim 37. Expected 19

Interestingly that happens only in the scoring.

Does it work without the grid-search?


On 04/27/2015 07:14 AM, Jitesh Khandelwal wrote:

Hi all,

I am trying to use grid search to evaluate some decomposition 
techniques of my own. I have implemented some custom transformers such 
as PAA, DFT, DWT as shown in the code below.


I am getting a strange ValueError when I run the code below and I am 
unable to figure out the origin of the problem.


I have pasted the code below and attached the error log file.

Any suggestions on how I can move forward from here would be helpful.

Thanks.

Code:
===
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

from time_series.decomposition import PAA, DFT, DWT, ShapeX
from prepare_data import combine_train_test_dataset

knn = KNeighborsClassifier()
paa = PAA()

pipe = Pipeline([
('paa', paa),
('knn', knn)
])

n_components = [1,2,4,5,10,20,40]
n_neighbors = range(1,11)
metrics = ['euclidean']

datadir = "../keogh_datasets/Coffee"
X,y = combine_train_test_dataset(datadir)

model_tunning = GridSearchCV(pipe, {
'paa__n_components': n_components,
'knn__n_neighbors': n_neighbors,
'knn__metric': metrics,
},
n_jobs=-1)

model_tunning.fit(X,y)

print model_tunning.best_score_
print model_tunning.best_params_
===





Re: [Scikit-learn-general] sequential feature selection algorithms

2015-04-27 Thread Andreas Mueller
That is like a one-step look-ahead feature selection?
I guess that could be done, but has a much higher complexity than RFE.
RFE works for anything that returns importances, not just linear models.
It doesn't really work for KNN, as you say. [I wouldn't say 
non-parametric models. Trees are pretty non-parametric].

It seems interesting. Is that really used in practice and is there any 
literature evaluating it?
There is some discussion here 
http://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf in 4.2
but there is no empirical comparison or theoretical analysis.

To be added to sklearn, you'd need to show that it is widely used and / 
or widely useful.
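
For what it's worth, here is RFE driven by feature_importances_ rather than
linear coefficients -- a small sketch with stand-in data:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=200, n_features=20, n_informative=3,
                           random_state=0)
rfe = RFE(RandomForestClassifier(random_state=0), n_features_to_select=3)
rfe.fit(X, y)
print(rfe.support_)  # boolean mask of the selected features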


On 04/27/2015 02:47 PM, Sebastian Raschka wrote:
 Hi, I was wondering if sequential feature selection algorithms are currently 
 implemented in scikit-learn. The closest that I could find was recursive 
 feature elimination (RFE); 
 http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html.
 However, unless the application requires a fixed number of features, I am 
 not sure if it is necessarily worthwhile using it over regularized models. 
 If I understand correctly, it works like this:

 {x1, x2, x3} -- eliminate xi with smallest corresponding weight

 {x1, x3} -- eliminate xi with smallest corresponding weight

 {x1}

 However, this would only work with linear, discriminative models, right?

 Wouldn't a classic sequential feature selection algorithm be useful for 
 non-regularized, nonparametric models, e.g., K-nearest neighbors, as an 
 alternative to dimensionality reduction for applications where the original 
 features may need to be maintained? RFE, for example, wouldn't work with 
 KNN, and maybe the data is non-linearly separable, so that RFE with a linear 
 model doesn't make sense.

 In a nutshell, SFS algorithms simply add or remove one feature at a time 
 based on the classifier performance.

 e.g., Sequential backward selection:

 {x1, x2, x3} --- estimate performance on {x1, x2}, {x2, x3} and {x1, x3}, 
 and pick the subset with the best performance
 {x1, x3} --- estimate performance on {x1}, {x3} and pick the subset with the 
 best performance
 {x1}

 where performance could be, e.g., cross-val accuracy.

 What do you think?

 Best,
 Sebastian


Re: [Scikit-learn-general] sequential feature selection algorithms

2015-04-27 Thread Andreas Mueller
Maybe we would want mrmr first?

http://penglab.janelia.org/proj/mRMR/


On 04/27/2015 06:46 PM, Sebastian Raschka wrote:
 I guess that could be done, but has a much higher complexity than RFE.
 Oh yes, I agree, the sequential feature algorithms are definitely 
 computationally more costly.

 It seems interesting. Is that really used in practice and is there any
 literature evaluating it?

 I am not sure how often it is used in practice nowadays, but I think it is 
 one of the classic approaches for feature selection -- I learned about it a 
 couple of years ago in a pattern classification class, and there is a 
 relatively detailed article in

 Ferri, F., et al. Comparative study of techniques for large-scale feature 
 selection. Pattern Recognition in Practice IV (1994): 403-413.

 The optimal solution to feature selection would be to evaluate the 
 performance of all possible feature combinations, which is a little bit too 
 costly in practice. The sequential forward or backward selection (SFS and 
 SBS) algorithms are just a suboptimal solution, and there are some minor 
 improvements, e.g., Sequential Floating Forward Selection (SFFS), which allows 
 for the removal of added features in later stages, etc.

 I have an implementation of SBS that uses k-fold cross_val_score, and it is 
 actually not a bad idea to use it for KNN to reduce overfitting as an 
 alternative to dimensionality reduction; for example, KNN cross-val mean 
 accuracy on the wine dataset where the features are selected by SBS: 
 http://i.imgur.com/ywDTHom.png?1

 But for scikit-learn, it may be better to implement SFBS or SFFS, which are 
 slightly more sophisticated.


 On Apr 27, 2015, at 6:00 PM, Andreas Mueller t3k...@gmail.com wrote:

 That is like a one-step look-ahead feature selection?
 I guess that could be done, but has a much higher complexity than RFE.
 RFE works for anything that returns importances, not just linear models.
 It doesn't really work for KNN, as you say. [I wouldn't say
 non-parametric models. Trees are pretty non-parametric].

 It seems interesting. Is that really used in practice and is there any
 literature evaluating it?
 There is some discussion here
 http://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf in 4.2
 but there is no empirical comparison or theoretical analysis.

 To be added to sklearn, you'd need to show that it is widely used and /
 or widely useful.


 On 04/27/2015 02:47 PM, Sebastian Raschka wrote:
 Hi, I was wondering if sequential feature selection algorithms are 
 currently implemented in scikit-learn. The closest that I could find was 
 recursive feature elimination (RFE); 
 http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html.
 However, unless the application requires a fixed number of features, I am 
 not sure if it is necessarily worthwhile using it over regularized models. 
 If I understand correctly, it works like this:

 {x1, x2, x3} -- eliminate xi with smallest corresponding weight

 {x1, x3} -- eliminate xi with smallest corresponding weight

 {x1}

 However, this would only work with linear, discriminative models, right?

 Wouldn't a classic sequential feature selection algorithm be useful for 
 non-regularized, nonparametric models, e.g., K-nearest neighbors, as an 
 alternative to dimensionality reduction for applications where the original 
 features may need to be maintained? RFE, for example, wouldn't work 
 with KNN, and maybe the data is non-linearly separable, so that RFE with a 
 linear model doesn't make sense.

 In a nutshell, SFS algorithms simply add or remove one feature at a time 
 based on the classifier performance.

 e.g., Sequential backward selection:

 {x1, x2, x3} --- estimate performance on {x1, x2}, {x2, x3} and {x1, x3}, 
 and pick the subset with the best performance
 {x1, x3} --- estimate performance on {x1}, {x3} and pick the subset with 
 the best performance
 {x1}

 where performance could be, e.g., cross-val accuracy.

 What do you think?

 Best,
 Sebastian
Re: [Scikit-learn-general] Value error when using KNeighboursClassifier with GridSearch

2015-04-27 Thread Jitesh Khandelwal
Hi Andreas,

Thanks for your response.

No, PAA does not change the number of samples. It just reduces the number
of features.

For example, if the input matrix X has X.shape = (100, 100) and
n_components = 10 in PAA, then the transformed X has shape (100, 10).

Yes, I did try using PAA in the ipython shell (without the grid search) on
the same dataset and it does the transformation as expected.

Another interesting observation is that the dataset that I have used in the
code has dimensions (56, 256) and also 37 + 19 = 56. Does this provide any
insight about the error?


Jitesh Khandelwal


On Tue, Apr 28, 2015 at 12:26 AM, Andreas Mueller t3k...@gmail.com wrote:

  Does PAA by any chance change the number of samples?
 The error is:
 ValueError: Found array with dim 37. Expected 19

 Interestingly that happens only in the scoring.

 Does it work without the grid-search?



 On 04/27/2015 07:14 AM, Jitesh Khandelwal wrote:

  Hi all,

  I am trying to use grid search to evaluate some decomposition techniques
 of my own. I have implemented some custom transformers such as PAA, DFT,
 DWT as shown in the code below.

  I am getting a strange ValueError when I run the code below and I am
 unable to figure out the origin of the problem.

  I have pasted the code below and attached the error log file.

  Any suggestions on how I can move forward from here would be helpful.

  Thanks.

  Code:
 ===
  from sklearn.pipeline import Pipeline
 from sklearn.grid_search import GridSearchCV
 from sklearn.neighbors import KNeighborsClassifier

  from time_series.decomposition import PAA, DFT, DWT, ShapeX
 from prepare_data import combine_train_test_dataset

  knn = KNeighborsClassifier()
 paa = PAA()

  pipe = Pipeline([
 ('paa', paa),
 ('knn', knn)
 ])

  n_components = [1,2,4,5,10,20,40]
 n_neighbors = range(1,11)
 metrics = ['euclidean']

  datadir = "../keogh_datasets/Coffee"
 X,y = combine_train_test_dataset(datadir)

  model_tunning = GridSearchCV(pipe, {
 'paa__n_components': n_components,
 'knn__n_neighbors': n_neighbors,
 'knn__metric': metrics,
 },
 n_jobs=-1)

  model_tunning.fit(X,y)

  print model_tunning.best_score_
 print model_tunning.best_params_
 ===


