Re: [Scikit-learn-general] sample weights for RandomForestClassifier to compute cross_val_score with roc_auc metric
If you set sample_weight[i] = 2 for the i-th sample, that sample is counted twice in the tree-growing procedure (impurity computation, leaf labelling, …).

Best regards,
Arnaud

On 26 Apr 2015, at 16:00, Luca Puggini lucapug...@gmail.com wrote:

Ok thanks a lot, a last question. What is the role of sample_weight if I use ExtraTreesClassifier with bootstrap=False (this is the default)? Are the weights used during the splitting process?

On Sat, Apr 25, 2015 at 10:04 PM, Andy t3k...@gmail.com wrote:

On 04/25/2015 09:18 AM, Luca Puggini wrote:

I think it depends on the role of the sample weights during the construction of the forest. If I set sample_weight = 2 for one of my samples, is this equivalent to duplicating the row in the data?

During fitting, yes; during evaluation, currently not.

On Fri, Apr 24, 2015 at 10:25 PM, Andreas Mueller t3k...@gmail.com wrote:

The roc_auc will not take sample_weights into account if using cross_val_score. Thinking about it, I'm not sure if this is a bug or a feature. Not sure if that was discussed before, so I opened an issue: https://github.com/scikit-learn/scikit-learn/issues/4632

On 04/24/2015 12:29 PM, Luca Puggini wrote:

Dear all,

I am quite new to {0,1} classification problems. I have an unbalanced dataset and I am using a RandomForestClassifier on it. To evaluate the performance of my estimator I am using the cross_val_score function with the roc_auc metric. My understanding is that to deal with an unbalanced problem I can pass the sample_weight argument to the random forest estimator. I do not understand whether I should also pass the sample_weight parameter in this case, or whether this will bias the result obtained with roc_auc. Is there any common way to do that? Do you have any advice?

Thanks a lot!
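To make the "counted twice" statement concrete, here is a minimal plain-Python sketch. It is not scikit-learn's implementation, and the helper names weighted_gini and weighted_auc are made up for illustration. It shows that a weight of 2 gives the same Gini impurity as a duplicated row, and what a weight-aware ROC AUC would compute if the weights were also used at evaluation time.

```python
from collections import defaultdict


def weighted_gini(labels, weights):
    """Gini impurity where each sample contributes its weight, not a count of 1."""
    totals = defaultdict(float)
    for label, weight in zip(labels, weights):
        totals[label] += weight
    total_weight = sum(totals.values())
    return 1.0 - sum((w / total_weight) ** 2 for w in totals.values())


def weighted_auc(y_true, scores, weights):
    """ROC AUC as the weighted fraction of correctly ordered (positive, negative) pairs."""
    correct = 0.0
    total = 0.0
    for yi, si, wi in zip(y_true, scores, weights):
        if yi != 1:
            continue
        for yj, sj, wj in zip(y_true, scores, weights):
            if yj != 0:
                continue
            pair_weight = wi * wj
            total += pair_weight
            if si > sj:
                correct += pair_weight
            elif si == sj:
                correct += 0.5 * pair_weight  # ties count as half
    return correct / total


# Hypothetical toy data: the third sample gets weight 2 ...
y = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
weights = [1.0, 1.0, 2.0, 1.0]

# ... which should behave like the same data with that row duplicated.
y_dup = [0, 0, 1, 1, 1]
scores_dup = [0.1, 0.4, 0.35, 0.35, 0.8]
ones = [1.0] * len(y_dup)

print(weighted_gini(y, weights), weighted_gini(y_dup, ones))
print(weighted_auc(y, scores, weights), weighted_auc(y_dup, scores_dup, ones))
print(weighted_auc(y, scores, [1.0] * len(y)))  # unweighted AUC differs
```

In scikit-learn itself, roc_auc_score accepts a sample_weight argument, so one way to get a weighted score is a manual cross-validation loop that passes the per-fold weights both to fit and to the metric.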
--
One dashboard for servers and applications across Physical-Virtual-Cloud
Widest out-of-the-box monitoring support with 50+ applications
Performance metrics, stats and reports that give you Actionable Insights
Deep dive visibility with transaction tracing using APM Insight.
http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
[Scikit-learn-general] Value error when using KNeighboursClassifier with GridSearch
Hi all,

I am trying to use grid search to evaluate some decomposition techniques of my own. I have implemented some custom transformers such as PAA, DFT and DWT, as shown in the code below. I am getting a strange ValueError when I run the code below and I am unable to figure out the origin of the problem. I have pasted the code below and attached the error log file. Any suggestions on how I can move forward from here would be helpful.

Thanks.

Code:
===
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Custom modules written by the poster (not part of scikit-learn)
from time_series.decomposition import PAA, DFT, DWT, ShapeX
from prepare_data import combine_train_test_dataset

knn = KNeighborsClassifier()
paa = PAA()
pipe = Pipeline([
    ('paa', paa),
    ('knn', knn)
])

n_components = [1, 2, 4, 5, 10, 20, 40]
n_neighbors = range(1, 11)
metrics = ['euclidean']

datadir = '../keogh_datasets/Coffee'
X, y = combine_train_test_dataset(datadir)

model_tunning = GridSearchCV(pipe, {
    'paa__n_components': n_components,
    'knn__n_neighbors': n_neighbors,
    'knn__metric': metrics,
}, n_jobs=-1)
model_tunning.fit(X, y)

print model_tunning.best_score_
print model_tunning.best_params_
===

Attachment: error_log (binary data)
Re: [Scikit-learn-general] bias in svm.LinearSVC classification accuracy in very small data sample? (Andreas Mueller)
You changed the labels only once, and have a test-set size of 4? I would imagine that is where that comes from. If you repeat over different assignments, you will get 50/50.

On 04/27/2015 11:33 AM, Fabrizio Fasano wrote:

Dear Andy,

Yes, the classes have the same size, 8 and 8. This is one example of the code I used to cross-validate the classification (I used StratifiedShuffleSplit here, but I also used other methods such as leave-one-out or simple 4-fold cross-validation, and the result didn't change much):

import numpy as np
from sklearn import svm
from sklearn.cross_validation import StratifiedShuffleSplit

# X_scaled (features) and y (labels) are assumed to be defined earlier
sss = StratifiedShuffleSplit(y, 100, test_size=0.25, random_state=0)
clf = svm.LinearSVC(penalty='l1', dual=False, C=1, random_state=1)
cv_scores = []
for train_index, test_index in sss:
    X_train, X_test = X_scaled[train_index], X_scaled[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    cv_scores.append(np.sum(y_pred == y_test) / float(np.size(y_test)))
print "Accuracy", np.ceil(100*np.mean(cv_scores)), "+/-", np.ceil(200*np.std(cv_scores))

On Apr 26, 2015, at 7:50 PM, Andy wrote:

Your expectation is right: if you randomly assign labels, you shouldn't get more than 50% correct with a large enough dataset. I imagine there is some issue in how you shuffled the labels. Without the code, it is hard to tell. Are you sure the classes have the same size?

On 04/26/2015 11:22 AM, Fabrizio Fasano wrote:

Dear Andreas,

Thanks a lot for your help. About the random assignment of values to my labels y: what I mean is that, being suspicious about the too-good performance, I changed the labels manually, retaining the 50% split of 1s and 0s but in different orders, and the labels were always predicted very well, with accuracy no lower than 60%. By chance, I expected values lower than 50% as well as values higher than 50%. I didn't perform an exhaustive test (I only did it manually for a few combinations)...
Fabrizio
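The point about repeating over many label assignments can be illustrated with a small plain-Python simulation. This is a simplified sketch under stated assumptions, not the code from the thread: the "classifier" is replaced by fixed predictions that cannot depend on the shuffled labels, the test set is four fixed positions out of 16 samples, and the function name permuted_label_accuracies is made up for illustration.

```python
import random
from statistics import mean


def accuracy(y_true, y_pred):
    """Fraction of matching labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def permuted_label_accuracies(n_repeats=2000, seed=0):
    """Score fixed predictions against many random relabelings of 16 samples."""
    rng = random.Random(seed)
    labels = [0] * 8 + [1] * 8      # balanced classes, 8 and 8 as in the thread
    test_idx = [0, 1, 8, 9]         # 4 held-out samples (25% of 16)
    # Stand-in for a trained model: predictions that carry no label information.
    predictions = [rng.randint(0, 1) for _ in test_idx]
    scores = []
    for _ in range(n_repeats):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        scores.append(accuracy([shuffled[i] for i in test_idx], predictions))
    return scores


scores = permuted_label_accuracies()
print("accuracies seen for single relabelings:", sorted(set(scores)))
print("mean over all relabelings: %.3f" % mean(scores))
```

With only four test samples, a single relabeling can only score 0, 25, 50, 75 or 100%, so a handful of manual trials can easily land above 60%; only the average over many relabelings settles near 50%. scikit-learn's permutation_test_score automates this kind of check with a real estimator.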
Re: [Scikit-learn-general] Value error when using KNeighboursClassifier with GridSearch
I assume you have checked that combine_train_test_dataset produces data of the correct dimensions in both X and y. I would be very surprised if the problem were not in PAA, so check it again: make sure that PAA().fit(X1).transform(X2) gives the transformation of X2. The error seems to suggest it is returning an array of X1's size.

On 28 April 2015 at 05:11, Jitesh Khandelwal jk231...@gmail.com wrote:

Hi Andreas,

Thanks for your response. No, PAA does not change the number of samples. It just reduces the number of features. For example, if the input matrix is X with X.shape = (100, 100) and n_components = 10 in PAA, then the resulting X.shape = (100, 10). Yes, I did try using PAA in the IPython shell (without the grid search) on the same dataset and it does the transformation as expected. Another interesting observation is that the dataset I used in the code has dimensions (56, 256), and 37 + 19 = 56. Does this provide any insight about the error?

Jitesh Khandelwal
http://about.me/jitesh.khandelwal?promo=email_sig

On Tue, Apr 28, 2015 at 12:26 AM, Andreas Mueller t3k...@gmail.com wrote:

Does PAA by any chance change the number of samples? The error is:

ValueError: Found array with dim 37. Expected 19

Interestingly, that happens only in the scoring. Does it work without the grid search?

On 04/27/2015 07:14 AM, Jitesh Khandelwal wrote:

Hi all,

I am trying to use grid search to evaluate some decomposition techniques of my own. I have implemented some custom transformers such as PAA, DFT and DWT, as shown in the code below. I am getting a strange ValueError when I run the code below and I am unable to figure out the origin of the problem. I have pasted the code below and attached the error log file. Any suggestions on how I can move forward from here would be helpful.

Thanks.
Code:
===
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Custom modules written by the poster (not part of scikit-learn)
from time_series.decomposition import PAA, DFT, DWT, ShapeX
from prepare_data import combine_train_test_dataset

knn = KNeighborsClassifier()
paa = PAA()
pipe = Pipeline([
    ('paa', paa),
    ('knn', knn)
])

n_components = [1, 2, 4, 5, 10, 20, 40]
n_neighbors = range(1, 11)
metrics = ['euclidean']

datadir = '../keogh_datasets/Coffee'
X, y = combine_train_test_dataset(datadir)

model_tunning = GridSearchCV(pipe, {
    'paa__n_components': n_components,
    'knn__n_neighbors': n_neighbors,
    'knn__metric': metrics,
}, n_jobs=-1)
model_tunning.fit(X, y)

print model_tunning.best_score_
print model_tunning.best_params_
===
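The check suggested in this thread, that PAA().fit(X1).transform(X2) returns as many rows as X2, can be sketched as follows. The poster's PAA class was never shown, so the class below is a hypothetical stand-in written in plain Python on lists of lists; for use inside a Pipeline, a real transformer would typically also subclass scikit-learn's BaseEstimator and TransformerMixin. The key property is that transform works only on the data it is given and stores nothing from the training rows.

```python
class PAA:
    """Piecewise aggregate approximation: the mean of each of n_components segments.

    Hypothetical sketch, not the poster's implementation.
    """

    def __init__(self, n_components=4):
        self.n_components = n_components

    def fit(self, X, y=None):
        # Stateless: nothing from the training rows is stored, so transform()
        # cannot accidentally return an array sized like the training set.
        return self

    def transform(self, X):
        k = self.n_components
        transformed = []
        for row in X:
            n = len(row)
            # Segment i covers row[bounds[i]:bounds[i + 1]]; requires k <= n.
            bounds = [i * n // k for i in range(k + 1)]
            transformed.append(
                [sum(row[a:b]) / (b - a) for a, b in zip(bounds, bounds[1:])]
            )
        return transformed


# Shapes mirror the thread: 56 series of length 256, split into 37 and 19.
X_train = [[float(i + j) for j in range(256)] for i in range(37)]
X_test = [[float(i - j) for j in range(256)] for i in range(19)]

Z_test = PAA(n_components=10).fit(X_train).transform(X_test)
# The row count follows X_test and the column count follows n_components.
print(len(Z_test), len(Z_test[0]))
```

If a transformer instead cached the rows seen in fit and returned them from transform, the row count would be 37 rather than 19 in the scoring step, which would be consistent with the error message quoted in the thread.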
Re: [Scikit-learn-general] Random forest with correlated features?
Hey,

I spent quite some time with this problem.

1) If you are interested only in prediction, this is not a big problem. You can preprocess the data with PCA.
2) If you want to understand which variables are important, I suggest you read the paper "Understanding variable importances in forests of randomized trees".

In general I suggest you use ExtraTreesClassifier with max_depth=3 or 5. There is a discussion about whether it is better to use max_features=1 or max_features=n_features (I would go for the latter). I ran into some problems with the R package that you are suggesting, so I would not use it.

I hope this helps.

Best,
Luca

On Mon, Apr 27, 2015 at 4:48 PM, Daniel Homola daniel.homol...@imperial.ac.uk wrote:

Dear all,

I've found several articles expressing concerns about using Random Forest with highly correlated features (e.g. http://www.biomedcentral.com/1471-2105/9/307). I was wondering if this drawback of the RF algorithm could somehow be remedied using scikit-learn methods? The paper linked above has an R package, but it's known to offer a super-slow solution to the problem.

When I thought about this problem (quite naively, as I'm at best an enthusiastic beginner in ML), I thought maybe further randomisation in the tree building might help with this. So would using ExtraTreesClassifier provide some protection against this issue?

Thanks a lot for any suggestions in advance!

Cheers,
Daniel
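As a rough illustration of the suggested settings, the sketch below fits an ExtraTreesClassifier with max_depth=3 and all features considered at each split on synthetic data. The data are an assumption made up for this example (one informative column, a near-copy of it, and one noise column); nothing here reproduces results from the thread or the cited papers. It only shows how to inspect how importance is distributed across correlated columns.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.RandomState(0)
n_samples = 300
signal = rng.normal(size=n_samples)
noise = rng.normal(size=n_samples)

# Column 1 is a near-copy of column 0 (highly correlated); column 2 is irrelevant.
X = np.column_stack([signal, signal + 0.01 * rng.normal(size=n_samples), noise])
y = (signal > 0).astype(int)

forest = ExtraTreesClassifier(
    n_estimators=200,
    max_depth=3,
    max_features=None,  # None means all features are considered at each split
    random_state=0,
)
forest.fit(X, y)

importances = forest.feature_importances_
for name, value in zip(["signal", "signal_copy", "noise"], importances):
    print("%-12s %.3f" % (name, value))
```

Changing max_features to 1 and comparing the printed importances is one way to explore the trade-off mentioned in the thread on your own data.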
Re: [Scikit-learn-general] sequential feature selection algorithms
I guess that could be done, but has a much higher complexity than RFE.

Oh yes, I agree, the sequential feature algorithms are definitely computationally more costly.

It seems interesting. Is that really used in practice and is there any literature evaluating it?

I am not sure how often it is used in practice nowadays, but I think it is one of the classic approaches for feature selection. I learned about it a couple of years ago in a pattern classification class, and there is a relatively detailed article in

Ferri, F., et al. "Comparative study of techniques for large-scale feature selection." Pattern Recognition in Practice IV (1994): 403-413.

The optimal solution to feature selection would be to evaluate the performance of all possible feature combinations, which is a little bit too costly in practice. The sequential forward or backward selection (SFS and SBS) algorithms are just a suboptimal solution, and there are some minor improvements, e.g., Sequential Floating Forward Selection (SFFS), which allows for the removal of added features in later stages.

I have an implementation of SBS that uses k-fold cross_val_score, and it is actually not a bad idea to use it for KNN to reduce overfitting as an alternative to dimensionality reduction. For example, here is the KNN cross-val mean accuracy on the wine dataset where the features are selected by SBS: http://i.imgur.com/ywDTHom.png?1

But for scikit-learn, it may be better to implement SBBS or SFFS, which are slightly more sophisticated.

On Apr 27, 2015, at 6:00 PM, Andreas Mueller t3k...@gmail.com wrote:

That is like a one-step look-ahead feature selection? I guess that could be done, but has a much higher complexity than RFE. RFE works for anything that returns importances, not just linear models. It doesn't really work for KNN, as you say. [I wouldn't say non-parametric models. Trees are pretty non-parametric].

It seems interesting. Is that really used in practice and is there any literature evaluating it? There is some discussion here http://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf in 4.2 but there is no empirical comparison or theoretical analysis.

To be added to sklearn, you'd need to show that it is widely used and / or widely useful.

On 04/27/2015 02:47 PM, Sebastian Raschka wrote:

Hi,

I was wondering if sequential feature selection algorithms are currently implemented in scikit-learn. The closest that I could find was recursive feature elimination (RFE); http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html. However, unless the application requires a fixed number of features, I am not sure if it is necessarily worthwhile using it over regularized models. If I understand correctly, it works like this:

{x1, x2, x3} --> eliminate xi with smallest corresponding weight
{x1, x3} --> eliminate xi with smallest corresponding weight
{x1}

However, this would only work with linear, discriminative models, right? Wouldn't a classic sequential feature selection algorithm be useful for non-regularized, nonparametric models, e.g., K-nearest neighbors, as an alternative to dimensionality reduction for applications where the original features may need to be maintained? RFE, for example, wouldn't work with KNN, and maybe the data is not linearly separable, so that RFE with a linear model doesn't make sense.

In a nutshell, SFS algorithms simply add or remove one feature at a time based on the classifier performance, e.g., sequential backward selection:

{x1, x2, x3} --> estimate performance on {x1, x2}, {x2, x3} and {x1, x3}, and pick the subset with the best performance
{x1, x3} --> estimate performance on {x1}, {x3} and pick the subset with the best performance
{x1}

where performance could be, e.g., cross-val accuracy.

What do you think?

Best,
Sebastian
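The sequential backward selection steps described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation mentioned in the thread: the function name, the score_fn callback and the toy scores are all assumptions. In practice score_fn would be something like the mean cross_val_score of a KNN classifier restricted to the given feature subset.

```python
def sequential_backward_selection(features, score_fn, n_keep):
    """Greedily drop one feature at a time, keeping the best-scoring subset."""
    selected = tuple(features)
    history = [(selected, score_fn(selected))]
    while len(selected) > n_keep:
        # Every subset obtained by leaving out exactly one feature.
        candidates = [
            selected[:i] + selected[i + 1:] for i in range(len(selected))
        ]
        selected = max(candidates, key=score_fn)
        history.append((selected, score_fn(selected)))
    return selected, history


# Toy score for illustration only: x1 and x3 help, x2 slightly hurts.
USEFULNESS = {"x1": 0.5, "x2": -0.1, "x3": 0.3}


def toy_score(subset):
    return sum(USEFULNESS[name] for name in subset)


subset, history = sequential_backward_selection(
    ["x1", "x2", "x3"], toy_score, n_keep=1
)
for step_subset, step_score in history:
    print(step_subset, round(step_score, 2))
```

Each step scores every one-feature-removed subset, so a run from d features down to one needs on the order of d squared score evaluations, which is the extra cost relative to RFE raised in the thread.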
Re: [Scikit-learn-general] Random forest with correlated features?
I think you can find something more rigorous here: http://orbi.ulg.ac.be/handle/2268/170309

On Mon, Apr 27, 2015 at 11:20 PM, Daniel Homola daniel.homol...@imperial.ac.uk wrote:

Hi Luca,

The reason I asked is because I'm interested in the second problem. Thanks a lot for the paper and the suggested params, I'll read it and try them! Has anyone tested these assumptions/parameters rigorously on simulated data, or is this more of a feeling?

Thanks again for the quick and informative response!

Best,
Daniel

On 27/04/15 20:43, Luca Puggini wrote:

Hey,

I spent quite some time with this problem.

1) If you are interested only in prediction, this is not a big problem. You can preprocess the data with PCA.
2) If you want to understand which variables are important, I suggest you read the paper "Understanding variable importances in forests of randomized trees".

In general I suggest you use ExtraTreesClassifier with max_depth=3 or 5. There is a discussion about whether it is better to use max_features=1 or max_features=n_features (I would go for the latter). I ran into some problems with the R package that you are suggesting, so I would not use it.

I hope this helps.

Best,
Luca

On Mon, Apr 27, 2015 at 4:48 PM, Daniel Homola daniel.homol...@imperial.ac.uk wrote:

Dear all,

I've found several articles expressing concerns about using Random Forest with highly correlated features (e.g. http://www.biomedcentral.com/1471-2105/9/307). I was wondering if this drawback of the RF algorithm could somehow be remedied using scikit-learn methods? The paper linked above has an R package, but it's known to offer a super-slow solution to the problem.

When I thought about this problem (quite naively, as I'm at best an enthusiastic beginner in ML), I thought maybe further randomisation in the tree building might help with this. So would using ExtraTreesClassifier provide some protection against this issue?

Thanks a lot for any suggestions in advance!

Cheers,
Daniel
Re: [Scikit-learn-general] sequential feature selection algorithms
I suspect this method is underreported by any particular name, as it's a straightforward greedy search. It is also very close to what I think many researchers do in system development or report in system analysis, albeit with more automation.

In the case of KNN, I would think metric learning could subsume or outperform this.

On 28 April 2015 at 08:50, Andreas Mueller t3k...@gmail.com wrote:

Maybe we would want mRMR first? http://penglab.janelia.org/proj/mRMR/

On 04/27/2015 06:46 PM, Sebastian Raschka wrote:

I guess that could be done, but has a much higher complexity than RFE.

Oh yes, I agree, the sequential feature algorithms are definitely computationally more costly.

It seems interesting. Is that really used in practice and is there any literature evaluating it?

I am not sure how often it is used in practice nowadays, but I think it is one of the classic approaches for feature selection. I learned about it a couple of years ago in a pattern classification class, and there is a relatively detailed article in

Ferri, F., et al. "Comparative study of techniques for large-scale feature selection." Pattern Recognition in Practice IV (1994): 403-413.

The optimal solution to feature selection would be to evaluate the performance of all possible feature combinations, which is a little bit too costly in practice. The sequential forward or backward selection (SFS and SBS) algorithms are just a suboptimal solution, and there are some minor improvements, e.g., Sequential Floating Forward Selection (SFFS), which allows for the removal of added features in later stages.

I have an implementation of SBS that uses k-fold cross_val_score, and it is actually not a bad idea to use it for KNN to reduce overfitting as an alternative to dimensionality reduction. For example, here is the KNN cross-val mean accuracy on the wine dataset where the features are selected by SBS: http://i.imgur.com/ywDTHom.png?1

But for scikit-learn, it may be better to implement SBBS or SFFS, which are slightly more sophisticated.

On Apr 27, 2015, at 6:00 PM, Andreas Mueller t3k...@gmail.com wrote:

That is like a one-step look-ahead feature selection? I guess that could be done, but has a much higher complexity than RFE. RFE works for anything that returns importances, not just linear models. It doesn't really work for KNN, as you say. [I wouldn't say non-parametric models. Trees are pretty non-parametric].

It seems interesting. Is that really used in practice and is there any literature evaluating it? There is some discussion here http://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf in 4.2 but there is no empirical comparison or theoretical analysis.

To be added to sklearn, you'd need to show that it is widely used and / or widely useful.

On 04/27/2015 02:47 PM, Sebastian Raschka wrote:

Hi,

I was wondering if sequential feature selection algorithms are currently implemented in scikit-learn. The closest that I could find was recursive feature elimination (RFE); http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html. However, unless the application requires a fixed number of features, I am not sure if it is necessarily worthwhile using it over regularized models. If I understand correctly, it works like this:

{x1, x2, x3} --> eliminate xi with smallest corresponding weight
{x1, x3} --> eliminate xi with smallest corresponding weight
{x1}

However, this would only work with linear, discriminative models, right? Wouldn't a classic sequential feature selection algorithm be useful for non-regularized, nonparametric models, e.g., K-nearest neighbors, as an alternative to dimensionality reduction for applications where the original features may need to be maintained? RFE, for example, wouldn't work with KNN, and maybe the data is not linearly separable, so that RFE with a linear model doesn't make sense.

In a nutshell, SFS algorithms simply add or remove one feature at a time based on the classifier performance, e.g., sequential backward selection:

{x1, x2, x3} --> estimate performance on {x1, x2}, {x2, x3} and {x1, x3}, and pick the subset with the best performance
{x1, x3} --> estimate performance on {x1}, {x3} and pick the subset with the best performance
{x1}

where performance could be, e.g., cross-val accuracy.

What do you think?

Best,
Sebastian
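For comparison with the performance-based search discussed in this thread, the RFE-style elimination quoted above (drop the feature with the smallest weight, refit, repeat) can be sketched as below. This is a simplified plain-Python illustration, not scikit-learn's RFE: the function name and importance_fn callback are hypothetical, and ranking by absolute weight is an assumption made for the sketch. With a real model, importance_fn would fit the estimator on the remaining features and read an attribute such as coef_ or feature_importances_.

```python
def recursive_feature_elimination(features, importance_fn, n_keep):
    """Repeatedly refit and drop the feature with the smallest importance."""
    remaining = list(features)
    eliminated = []
    while len(remaining) > n_keep:
        # One model fit per step; importance_fn returns {feature: importance}.
        importances = importance_fn(remaining)
        weakest = min(remaining, key=lambda name: abs(importances[name]))
        remaining.remove(weakest)
        eliminated.append(weakest)
    return remaining, eliminated


def toy_importances(subset):
    """Hypothetical stand-in for fitting a model and reading its weights."""
    weights = {"x1": 0.9, "x2": 0.1, "x3": -0.4}
    return {name: weights[name] for name in subset}


kept, dropped = recursive_feature_elimination(
    ["x1", "x2", "x3"], toy_importances, n_keep=1
)
print("kept:", kept)
print("dropped, in order:", dropped)
```

This needs only one fit per elimination step, but it relies on the model exposing per-feature weights, which is why it does not apply directly to KNN, whereas a performance-based search only needs a score for each candidate subset.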
Re: [Scikit-learn-general] Random forest with correlated features?
Hi Luca,

The reason I asked is that I'm interested in the second problem. Thanks a lot for the paper and the suggested params, I'll read it and try them! Has anyone tested these assumptions/parameters rigorously on simulated data, or is this more of a feeling? Thanks again for the quick and informative response!

Best, Daniel

On 27/04/15 20:43, Luca Puggini wrote:

Hey, I spent quite some time with this problem.

1) If you are interested only in prediction, this is not a big problem. You can preprocess the data with PCA.
2) If you want to understand which variables are important, I suggest you read the paper "Understanding variable importances in forests of randomized trees".

In general I suggest you use ExtraTreesClassifier with max_depth=3 or 5. There is a discussion about whether it is better to use max_features=1 or max_features=n_features (I would go for the latter). I ran into some problems with the R package that you mention, so I would not use that. I hope this can help.

Best, Luca

On Mon, Apr 27, 2015 at 4:48 PM, Daniel Homola daniel.homol...@imperial.ac.uk wrote:

Dear all,

I've found several articles expressing concerns about using Random Forest with highly correlated features (e.g. http://www.biomedcentral.com/1471-2105/9/307). I was wondering if this drawback of the RF algorithm could be somehow remedied using scikit-learn methods? The above linked paper has an R package, but it's known to offer a super-slow solution to the problem. When I thought about this problem (quite naively, as I'm at best an enthusiastic beginner in ML) I thought maybe further randomisation in the tree building might help with this. So would using ExtraTreesClassifier provide some protection against this issue? Thanks a lot for any suggestions in advance!

Cheers, Daniel
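Luca's suggestion above (shallow ExtraTreesClassifier, max_features set to all features) can be tried out roughly as follows. This is only a sketch under stated assumptions: the data is synthetic, generated with make_classification so that half of the columns are linear combinations of the others, and n_estimators and the sample size are arbitrary illustrative choices, not values from the thread.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data (an assumption): columns 0-4 are informative,
# columns 5-9 are redundant linear combinations of them, i.e. correlated.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_redundant=5, n_repeated=0, shuffle=False,
                           random_state=0)

# Shallow, extremely randomized trees as suggested in the thread.
# max_features=None means every feature is considered at each split,
# which corresponds to max_features=n_features.
forest = ExtraTreesClassifier(n_estimators=200, max_depth=3,
                              max_features=None, random_state=0)
forest.fit(X, y)

importances = forest.feature_importances_
for index, value in enumerate(importances):
    print("feature %d: importance %.3f" % (index, value))
```

Comparing the printed importances for max_features=None against max_features=1 on the same data is one way to get a feeling for the trade-off Luca mentions; it does not replace the rigorous simulation study Daniel asks about.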
[Scikit-learn-general] Use of the 'learn' font in third party packages
Hi All,

I've been working for the past month or so on a third-party add-on/plug-in package `gplearn` that uses the scikit-learn API to implement genetic programming for symbolic regression tasks in Python and maintains compatibility with the sklearn pipeline and gridsearch modules, etc. The reason it is not being pushed as a PR is due to unproven usefulness in the scikit-learn ecosystem, which comes up a lot on the GitHubs for major additions.

I am edging my way towards a release now with docs and examples in process and thus have a general question about the use of parts of the scikit-learn logo found here: https://github.com/scikit-learn/scikit-learn/blob/master/doc/logos/identity.pdf

I would like to incorporate the 'learn' font into my own package's logo, here's the current draft: https://files.gitter.im/trevorstephens/lqYX/gp-learn.png

I noticed that `nilearn` shares the 'learn' font from sklearn's logo, though I understand a lot of the same core devs work on it. I see a few pros and cons to allowing, or encouraging this:

Pros:
- encourages contributors to try out their algorithms in the wild to gauge usefulness while still feeling like they are a part of an extended scikit-learn ecosystem.
- a lot of PRs fall flat after a lot of effort on the developer's part. As above, this gives them more of a chance to have something to show for significant work done, if it is not ready for a prime-time merge.
- encourages a more common naming convention for scikit-learn compatible estimators for easier PyPI discovery, kind of like the implied link back to scipy toolkits with the various scikits.

Cons:
- may carry an implication that the code is reviewed and +1'd by the core devs, which it clearly is not.
- that's all I can think of, open to hear other objections.

Anyhow, interested in what the core team thinks about this and am excited to release my package, with or without the script MT bold fanciness.

Cheers,
- Trev
Re: [Scikit-learn-general] bias in svm.LinearSVC classification accuracy in very small data sample? (Andreas Mueller)
Dear Andy,

Yes, the classes have the same size, 8 and 8. This is one example of the code I used to cross-validate the classification (I used StratifiedShuffleSplit here, but I also tried other methods such as leave-one-out or simple 4-fold cross-validation, and the result didn't change much):

    # imports implied by the snippet
    import numpy as np
    from sklearn import svm
    from sklearn.cross_validation import StratifiedShuffleSplit

    sss = StratifiedShuffleSplit(y, 100, test_size=0.25, random_state=0)
    clf = svm.LinearSVC(penalty='l1', dual=False, C=1, random_state=1)
    cv_scores = []
    for train_index, test_index in sss:
        X_train, X_test = X_scaled[train_index], X_scaled[test_index]
        y_train, y_test = y[train_index], y[test_index]
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        # fraction of correctly predicted test labels in this split
        cv_scores.append(np.sum(y_pred == y_test) / float(np.size(y_test)))
    print "Accuracy ", np.ceil(100*np.mean(cv_scores)), "+/-", np.ceil(200*np.std(cv_scores))

On Apr 26, 2015, at 7:50 PM, Andy wrote:

Your expectation is right, if you randomly assign labels, you shouldn't get more than 50% correct with a large enough dataset. I imagine there is some issue in how you shuffled the labels. Without the code, it is hard to tell. Are you sure the classes have the same size?

On 04/26/2015 11:22 AM, Fabrizio Fasano wrote:

Dear Andreas,

Thanks a lot for your help. About the random assignment of values to my labels y: what I mean is that, being suspicious about the too-good performance, I changed the labels manually, retaining the 50% split of 1s and 0s but in different orders, and the labels were always predicted very well, with accuracy no lower than 60%. By chance, I expected values lower than 50% as well as values higher than 50%. I didn't perform an exhaustive test (I only did it manually for a few combinations)...

Fabrizio
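The manual label shuffling described in this thread can be done systematically with a permutation test. The sketch below uses the current scikit-learn API (sklearn.model_selection, which postdates this thread) and pure-noise stand-in data with 8 samples per class; the feature count, the number of splits and the number of permutations are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, permutation_test_score
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
# Stand-in data (an assumption): 16 samples, 8 per class, noise-only features.
X = rng.randn(16, 50)
y = np.array([0] * 8 + [1] * 8)

cv = StratifiedShuffleSplit(n_splits=20, test_size=0.25, random_state=0)
clf = LinearSVC(penalty="l1", dual=False, C=1, random_state=1)

# Refit and rescore the classifier on many random relabelings of y to
# estimate what accuracies arise by chance with this data and CV scheme.
score, permutation_scores, p_value = permutation_test_score(
    clf, X, y, cv=cv, n_permutations=50, scoring="accuracy", random_state=0)

print("accuracy on the given labels: %.2f" % score)
print("accuracy range on shuffled labels: %.2f to %.2f"
      % (permutation_scores.min(), permutation_scores.max()))
print("p-value: %.3f" % p_value)
```

With only 16 samples the spread of the shuffled-label scores can be wide, so the distribution of permutation scores is usually more informative than a handful of hand-made relabelings.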
[Scikit-learn-general] sequential feature selection algorithms
Hi,

I was wondering if sequential feature selection algorithms are currently implemented in scikit-learn. The closest that I could find was recursive feature elimination (RFE); http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html. However, unless the application requires a fixed number of features, I am not sure if it is necessarily worthwhile using it over regularized models. If I understand correctly, it works like this:

{x1, x2, x3} -- eliminate xi with smallest corresponding weight
{x1, x3} -- eliminate xi with smallest corresponding weight
{x1}

However, this would only work with linear, discriminative models, right? Wouldn't a classic sequential feature selection algorithm be useful for non-regularized, nonparametric models, e.g., K-nearest neighbors, as an alternative to dimensionality reduction for applications where the original features may need to be maintained? RFE, for example, wouldn't work with KNN, and maybe the data is not linearly separable, so that RFE with a linear model doesn't make sense.

In a nutshell, SFS algorithms simply add or remove one feature at a time based on the classifier performance, e.g., sequential backward selection:

{x1, x2, x3} --- estimate performance on {x1, x2}, {x2, x3} and {x1, x3}, and pick the subset with the best performance
{x1, x3} --- estimate performance on {x1}, {x3} and pick the subset with the best performance
{x1}

where performance could be, e.g., cross-val accuracy. What do you think?

Best, Sebastian
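The sequential backward selection procedure described above can be sketched in a few lines. This is not Sebastian's implementation; the function name sequential_backward_selection, the scaling step, n_neighbors=5 and the choice to stop at 5 features are all assumptions made for illustration, and the imports use the current sklearn.model_selection module.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def sequential_backward_selection(estimator, X, y, n_features_to_keep, cv=5):
    """Greedy SBS: at each step drop the feature whose removal
    leaves the subset with the best mean cross-validation score."""
    selected = list(range(X.shape[1]))
    history = []
    while len(selected) > n_features_to_keep:
        candidates = []
        for feature in selected:
            subset = [f for f in selected if f != feature]
            score = cross_val_score(estimator, X[:, subset], y, cv=cv).mean()
            candidates.append((score, subset))
        # Keep the best-scoring subset that has one feature fewer.
        best_score, selected = max(candidates, key=lambda item: item[0])
        history.append((list(selected), best_score))
    return selected, history


X, y = load_wine(return_X_y=True)
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
selected, history = sequential_backward_selection(knn, X, y, n_features_to_keep=5)

for subset, score in history:
    print("%2d features, mean CV accuracy %.3f" % (len(subset), score))
```

Because the estimator is only used through cross_val_score, the same loop works for any classifier, including KNN, which exposes neither coefficients nor feature importances.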
Re: [Scikit-learn-general] Value error when using KNeighboursClassifier with GridSearch
Does PAA by any chance change the number of samples? The error is:

ValueError: Found array with dim 37. Expected 19

Interestingly that happens only in the scoring. Does it work without the grid-search?

On 04/27/2015 07:14 AM, Jitesh Khandelwal wrote:

Hi all,

I am trying to use grid search to evaluate some decomposition techniques of my own. I have implemented some custom transformers such as PAA, DFT, DWT as shown in the code below. I am getting a strange ValueError when I run the code below and I am unable to figure out the origin of the problem. I have pasted the code below and attached the error log file. Any suggestions on how I can move forward from here would be helpful. Thanks.

Code:
===
    from sklearn.pipeline import Pipeline
    from sklearn.grid_search import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier
    from time_series.decomposition import PAA, DFT, DWT, ShapeX
    from prepare_data import combine_train_test_dataset

    knn = KNeighborsClassifier()
    paa = PAA()
    pipe = Pipeline([
        ('paa', paa),
        ('knn', knn)
    ])

    n_components = [1,2,4,5,10,20,40]
    n_neighbors = range(1,11)
    metrics = ['euclidean']

    datadir = '../keogh_datasets/Coffee'
    X, y = combine_train_test_dataset(datadir)

    model_tunning = GridSearchCV(pipe, {
        'paa__n_components': n_components,
        'knn__n_neighbors': n_neighbors,
        'knn__metric': metrics,
    }, n_jobs=-1)
    model_tunning.fit(X, y)

    print model_tunning.best_score_
    print model_tunning.best_params_
===
Re: [Scikit-learn-general] sequential feature selection algorithms
That is like a one-step look-ahead feature selection? I guess that could be done, but it has a much higher complexity than RFE. RFE works for anything that returns importances, not just linear models. It doesn't really work for KNN, as you say. [I wouldn't say non-parametric models. Trees are pretty non-parametric]. It seems interesting. Is that really used in practice and is there any literature evaluating it? There is some discussion here http://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf in 4.2 but there is no empirical comparison or theoretical analysis. To be added to sklearn, you'd need to show that it is widely used and / or widely useful.

On 04/27/2015 02:47 PM, Sebastian Raschka wrote:

Hi, I was wondering if sequential feature selection algorithms are currently implemented in scikit-learn. The closest that I could find was recursive feature elimination (RFE); http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html. However, unless the application requires a fixed number of features, I am not sure if it is necessarily worthwhile using it over regularized models. If I understand correctly, it works like this:

{x1, x2, x3} -- eliminate xi with smallest corresponding weight
{x1, x3} -- eliminate xi with smallest corresponding weight
{x1}

However, this would only work with linear, discriminative models, right? Wouldn't a classic sequential feature selection algorithm be useful for non-regularized, nonparametric models, e.g., K-nearest neighbors, as an alternative to dimensionality reduction for applications where the original features may need to be maintained? RFE, for example, wouldn't work with KNN, and maybe the data is not linearly separable, so that RFE with a linear model doesn't make sense. In a nutshell, SFS algorithms simply add or remove one feature at a time based on the classifier performance, e.g., sequential backward selection:

{x1, x2, x3} --- estimate performance on {x1, x2}, {x2, x3} and {x1, x3}, and pick the subset with the best performance
{x1, x3} --- estimate performance on {x1}, {x3} and pick the subset with the best performance
{x1}

where performance could be, e.g., cross-val accuracy. What do you think?

Best, Sebastian
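Andreas's point that RFE works for anything that returns importances, not just linear models, can be illustrated with a tree ensemble. A minimal sketch on synthetic data; the dataset shape, the forest size and n_features_to_select=3 are assumptions, not values from the thread.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in data (an assumption): 8 features, 3 of them informative.
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

# RFE refits the forest repeatedly and removes the feature with the
# smallest feature_importances_ value each round (step=1).
forest = RandomForestClassifier(n_estimators=50, random_state=0)
selector = RFE(forest, n_features_to_select=3, step=1)
selector.fit(X, y)

print("kept features:", selector.support_)
print("elimination ranking (1 = kept):", selector.ranking_)
```

KNeighborsClassifier exposes neither coef_ nor feature_importances_, which is why the same call does not work for KNN, as noted in the thread.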
Re: [Scikit-learn-general] sequential feature selection algorithms
Maybe we would want mrmr first? http://penglab.janelia.org/proj/mRMR/

On 04/27/2015 06:46 PM, Sebastian Raschka wrote:

I guess that could be done, but has a much higher complexity than RFE.

Oh yes, I agree, the sequential feature algorithms are definitely computationally more costly.

It seems interesting. Is that really used in practice and is there any literature evaluating it?

I am not sure how often it is used in practice nowadays, but I think it is one of the classic approaches for feature selection -- I learned about it a couple of years ago in a pattern classification class, and there is a relatively detailed article in Ferri, F., et al. Comparative study of techniques for large-scale feature selection. Pattern Recognition in Practice IV (1994): 403-413.

The optimal solution to feature selection would be to evaluate the performance of all possible feature combinations, which is a little bit too costly in practice. The sequential forward or backward selection (SFS and SBS) algorithms are just a suboptimal solution, and there are some minor improvements, e.g., Sequential Floating Forward Selection (SFFS), which allows for the removal of added features in later stages, etc.

I have an implementation of SBS that uses k-fold cross_val_score, and it is actually not a bad idea to use it for KNN to reduce overfitting as an alternative to dimensionality reduction. For example, here is the KNN cross-val mean accuracy on the wine dataset where the features are selected by SBS: http://i.imgur.com/ywDTHom.png?1 But for scikit-learn, it may be better to implement SBBS or SFFS, which are slightly more sophisticated.

On Apr 27, 2015, at 6:00 PM, Andreas Mueller t3k...@gmail.com wrote:

That is like a one-step look-ahead feature selection? I guess that could be done, but it has a much higher complexity than RFE. RFE works for anything that returns importances, not just linear models. It doesn't really work for KNN, as you say. [I wouldn't say non-parametric models. Trees are pretty non-parametric]. It seems interesting. Is that really used in practice and is there any literature evaluating it? There is some discussion here http://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf in 4.2 but there is no empirical comparison or theoretical analysis. To be added to sklearn, you'd need to show that it is widely used and / or widely useful.

On 04/27/2015 02:47 PM, Sebastian Raschka wrote:

Hi, I was wondering if sequential feature selection algorithms are currently implemented in scikit-learn. The closest that I could find was recursive feature elimination (RFE); http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html. However, unless the application requires a fixed number of features, I am not sure if it is necessarily worthwhile using it over regularized models. If I understand correctly, it works like this:

{x1, x2, x3} -- eliminate xi with smallest corresponding weight
{x1, x3} -- eliminate xi with smallest corresponding weight
{x1}

However, this would only work with linear, discriminative models, right? Wouldn't a classic sequential feature selection algorithm be useful for non-regularized, nonparametric models, e.g., K-nearest neighbors, as an alternative to dimensionality reduction for applications where the original features may need to be maintained? RFE, for example, wouldn't work with KNN, and maybe the data is not linearly separable, so that RFE with a linear model doesn't make sense. In a nutshell, SFS algorithms simply add or remove one feature at a time based on the classifier performance, e.g., sequential backward selection:

{x1, x2, x3} --- estimate performance on {x1, x2}, {x2, x3} and {x1, x3}, and pick the subset with the best performance
{x1, x3} --- estimate performance on {x1}, {x3} and pick the subset with the best performance
{x1}

where performance could be, e.g., cross-val accuracy. What do you think?

Best, Sebastian
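To put rough numbers on the cost comparison in this thread, the snippet below counts how many feature subsets have to be scored by exhaustive search versus greedy sequential backward selection. The helper names are made up for illustration, and the count assumes one performance estimate per candidate subset (each estimate may itself be a k-fold cross-validation); 13 is the number of features in the wine dataset mentioned above.

```python
def exhaustive_subsets(n_features):
    """Number of non-empty feature subsets an exhaustive search must score."""
    return 2 ** n_features - 1


def sbs_evaluations(n_features, n_features_to_keep=1):
    """Number of subsets scored by sequential backward selection:
    with k features left, k candidate subsets of size k - 1 are scored."""
    return sum(range(n_features_to_keep + 1, n_features + 1))


n = 13  # number of features in the wine dataset
print(exhaustive_subsets(n))  # 8191
print(sbs_evaluations(n))     # 90
```

RFE is cheaper still, since it needs only one model fit per elimination step, which is the complexity gap Andreas refers to.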
Re: [Scikit-learn-general] Value error when using KNeighboursClassifier with GridSearch
Hi Andreas,

Thanks for your response. No, PAA does not change the number of samples. It just reduces the number of features. For example, if the input matrix is X with X.shape = (100, 100) and n_components = 10 in PAA, then the resulting X.shape = (100, 10).

Yes, I did try using PAA in the ipython shell (without the grid search) on the same dataset and it does the transformation as expected.

Another interesting observation is that the dataset that I have used in the code has dimensions (56, 256), and 37 + 19 = 56. Does this provide any insight about the error?

Jitesh Khandelwal

On Tue, Apr 28, 2015 at 12:26 AM, Andreas Mueller t3k...@gmail.com wrote:

Does PAA by any chance change the number of samples? The error is:

ValueError: Found array with dim 37. Expected 19

Interestingly that happens only in the scoring. Does it work without the grid-search?

On 04/27/2015 07:14 AM, Jitesh Khandelwal wrote:

Hi all,

I am trying to use grid search to evaluate some decomposition techniques of my own. I have implemented some custom transformers such as PAA, DFT, DWT as shown in the code below. I am getting a strange ValueError when I run the code below and I am unable to figure out the origin of the problem. I have pasted the code below and attached the error log file. Any suggestions on how I can move forward from here would be helpful. Thanks.

Code:
===
    from sklearn.pipeline import Pipeline
    from sklearn.grid_search import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier
    from time_series.decomposition import PAA, DFT, DWT, ShapeX
    from prepare_data import combine_train_test_dataset

    knn = KNeighborsClassifier()
    paa = PAA()
    pipe = Pipeline([
        ('paa', paa),
        ('knn', knn)
    ])

    n_components = [1,2,4,5,10,20,40]
    n_neighbors = range(1,11)
    metrics = ['euclidean']

    datadir = '../keogh_datasets/Coffee'
    X, y = combine_train_test_dataset(datadir)

    model_tunning = GridSearchCV(pipe, {
        'paa__n_components': n_components,
        'knn__n_neighbors': n_neighbors,
        'knn__metric': metrics,
    }, n_jobs=-1)
    model_tunning.fit(X, y)

    print model_tunning.best_score_
    print model_tunning.best_params_
===
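The numbers in the error fit a 3-fold split of 56 samples (37 for training, 19 for testing), so one thing worth checking is whether the custom transformer's transform() depends only on the array it is given rather than on anything stored during fit(). The thread does not confirm this as the cause. The class below, SimplePAA, is a hypothetical stand-in and not the poster's PAA; it is a sketch of a piecewise aggregate approximation that is safe to use inside a Pipeline, shown with random data of the same shape and the current sklearn.model_selection API.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline


class SimplePAA(BaseEstimator, TransformerMixin):
    """Hypothetical piecewise aggregate approximation (not the poster's class):
    each row is cut into n_components segments and each segment is averaged."""

    def __init__(self, n_components=10):
        self.n_components = n_components

    def fit(self, X, y=None):
        # Nothing is learned from the data, so no samples are stored here.
        return self

    def transform(self, X):
        # Operates only on the array passed in, so the number of rows
        # in the output always matches the number of rows in the input.
        X = np.asarray(X, dtype=float)
        segments = np.array_split(X, self.n_components, axis=1)
        return np.column_stack([segment.mean(axis=1) for segment in segments])


rng = np.random.RandomState(0)
X = rng.randn(56, 256)            # same shape as the dataset in the thread
y = np.array([0] * 28 + [1] * 28)  # made-up balanced labels

pipe = Pipeline([("paa", SimplePAA()), ("knn", KNeighborsClassifier())])
grid = GridSearchCV(pipe, {"paa__n_components": [4, 10, 20],
                           "knn__n_neighbors": [1, 3, 5]}, cv=3)
grid.fit(X, y)
print(grid.best_params_)
```

If a transformer written this way runs cleanly through the grid search while the original does not, the difference between the two transform() methods is a reasonable place to look next.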