Re: [scikit-learn] Inquiry on Genetic Algorithm
Hi,

I am not aware of any *official* scikit-learn implementation of a genetic algorithm. I program my own with DEAP, which is quite versatile: https://deap.readthedocs.io/en/master/

~Thomas

On Sun, 30 Oct 2022 at 12:19, Ellarizza Fredeluces via scikit-learn <scikit-learn@python.org> wrote:

> Dear Scikit-Learn developers,
>
> First of all, thank you for your brilliant work.
> I would like to ask if a genetic algorithm is available in scikit-learn. I tried to search, but I only found this one <https://pypi.org/project/sklearn-genetic/>. I also checked your website but there seems to be no genetic algorithm yet.
>
> Your reply will be highly appreciated. Thank you again.
>
> Sincerely,
> Ella

--
==========
Dr. Thomas Evangelidis
Research Scientist
IOCB - Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences <https://www.uochb.cz/web/structure/31.html?lang=en>, Prague, Czech Republic
& CEITEC - Central European Institute of Technology <https://www.ceitec.eu/>, Brno, Czech Republic
email: teva...@gmail.com, Twitter: tevangelidis <https://twitter.com/tevangelidis>, LinkedIn: Thomas Evangelidis <https://www.linkedin.com/in/thomas-evangelidis-495b45125/>
website: https://sites.google.com/site/thomasevangelidishomepage/

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
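[Editor's note: the core idea behind the DEAP recommendation above is simple enough to sketch in plain Python. The snippet below is a minimal, self-contained genetic algorithm (not DEAP code) that evolves a bit-string toward the all-ones optimum with tournament selection, one-point crossover, and bit-flip mutation.]

```python
import random

random.seed(0)

N_BITS, POP_SIZE, N_GEN = 20, 40, 60
P_CROSS, P_MUT = 0.9, 1.0 / N_BITS

def fitness(ind):            # "one-max" toy objective: count of 1-bits, maximized
    return sum(ind)

def tournament(pop, k=3):    # selection: best of k randomly chosen individuals
    return max(random.sample(pop, k), key=fitness)

def crossover(a, b):         # one-point crossover
    cut = random.randrange(1, N_BITS)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(ind):             # independent bit-flips with probability P_MUT
    return [bit ^ (random.random() < P_MUT) for bit in ind]

pop = [[random.randint(0, 1) for _ in range(N_BITS)] for _ in range(POP_SIZE)]
for _ in range(N_GEN):
    nxt = []
    while len(nxt) < POP_SIZE:
        a, b = tournament(pop), tournament(pop)
        if random.random() < P_CROSS:
            a, b = crossover(a, b)
        nxt += [mutate(a), mutate(b)]
    pop = nxt

best = max(pop, key=fitness)
print(fitness(best))
```

DEAP packages exactly these pieces (creator, toolbox, selection/crossover/mutation operators) behind a reusable API, which is why it is the usual recommendation instead of rolling your own.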
[scikit-learn] Maximum Mutual Information value for continuous variables
Greetings,

I am thinking of alternative ways to remove the invariant scalar features from my feature vectors before training MLPs. So far I have tried removing columns with zero variance and columns with Pearson's R=1.0 or R=-1.0. If I remove columns with |R|<1.0 the performance drops. However, R measures only linear correlation. Now I am thinking of removing columns with high Mutual Information instead, but first I need to normalize it. I found in the documentation under "Univariate Feature Selection" the function "mutual_info_regression":

https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection

I used this function to measure the correlation between columns (features), but it sometimes returns values >1.0. On the other hand, there is also this function:

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_mutual_info_score.html#sklearn.metrics.adjusted_mutual_info_score

which is bounded above by 1.0, but it is meant for categorical data (cluster labels). So my question is: is there a way to compute normalized Mutual Information for continuous variables, too?

Thanks in advance for any advice.

Thomas
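[Editor's note: one common workaround, sketched below, is to estimate MI from a 2-D histogram of the two continuous variables and normalize by the marginal entropies, e.g. NMI = I(X;Y) / sqrt(H(X)·H(Y)), which lies in [0, 1]. This is an illustrative estimator, not a scikit-learn API, and the histogram binning introduces some estimation bias.]

```python
import numpy as np

def normalized_mi(x, y, bins=16):
    """Histogram-based normalized mutual information of two continuous
    1-D arrays: I(X;Y) / sqrt(H(X) * H(Y)), which lies in [0, 1]."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()          # joint distribution estimate
    px = pxy.sum(axis=1)               # marginal of x
    py = pxy.sum(axis=0)               # marginal of y

    def H(p):                          # Shannon entropy, 0*log(0) treated as 0
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    hx, hy, hxy = H(px), H(py), H(pxy.ravel())
    mi = hx + hy - hxy                 # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return mi / np.sqrt(hx * hy)

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y_indep = rng.normal(size=5000)

print(round(normalized_mi(x, x), 2))        # identical variables -> 1.0
print(round(normalized_mi(x, y_indep), 2))  # independent variables -> close to 0
```

Other normalizations (dividing by min(H(X), H(Y)) or by the mean of the two entropies) are also in use; any of them bounds the score by 1.0, which is what the raw mutual_info_regression values lack.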
[scikit-learn] sample_weights in RandomForestRegressor
Hello,

I am somewhat confused about the use of the sample_weight parameter in the fit() method of RandomForestRegressor. Here is my problem:

I am trying to predict the binding affinity of small molecules to a protein. I have a training set of 709 molecules and a blind test set of 180 molecules. I want to find those features that are most important for the correct prediction of the binding affinity of the 180 molecules in my blind test set. My rationale is that if I give more emphasis to the similar molecules in the training set, then I will get higher importances for those features that have higher predictive ability for this specific blind test set of 180 molecules. To this end, I weighted the 709 training set molecules by their maximum similarity to the 180 molecules, selected only those features with high importance, and trained a new RF with all 709 molecules. I got some results, but I am not satisfied. Is this the right way to use sample_weight in RF?

I would appreciate any advice or a suggested workflow.

--
==
Dr Thomas Evangelidis
Post-doctoral Researcher
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/2S049, 62500 Brno, Czech Republic
email: tev...@pharm.uoa.gr, teva...@gmail.com
website: https://sites.google.com/site/thomasevangelidishomepage/
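[Editor's note: mechanically, sample_weight is passed straight to fit() and biases both the bootstrap resampling and the impurity computations toward heavily weighted rows. The toy sketch below uses synthetic data (illustrative only, not the molecules from the thread) to show how up-weighting the samples where one feature is informative shifts the reported feature importances toward that feature.]

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
# feature 0 drives the target for the first half of the rows,
# feature 1 for the second half
y = np.where(np.arange(400) < 200, X[:, 0], X[:, 1])

rf = RandomForestRegressor(n_estimators=100, random_state=0)

rf.fit(X, y)                                   # unweighted baseline
base = rf.feature_importances_

w = np.where(np.arange(400) < 200, 10.0, 0.1)  # emphasize the first half
rf.fit(X, y, sample_weight=w)
weighted = rf.feature_importances_

print(base, weighted)
# the importance of feature 0 grows once its samples dominate the weights
```

This matches the rationale in the question; the caveat is that importances computed this way describe the weighted training distribution, not the test set itself, so they can overfit to whatever similarity measure produced the weights.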
Re: [scikit-learn] custom loss function in RandomForestRegressor
Does this generalize to any loss function? For example, I also want to implement Kendall's tau correlation coefficient, and a combination of R, tau and RMSE. :)

On Mar 1, 2018 15:49, "Sebastian Raschka" <se.rasc...@gmail.com> wrote:

> Hi, Thomas,
>
> as far as I know, it's all the same and doesn't matter, and you would get the same splits, since R^2 is just a rescaled MSE.
>
> Best,
> Sebastian
>
> > On Mar 1, 2018, at 9:39 AM, Thomas Evangelidis <teva...@gmail.com> wrote:
> >
> > Hi Sebastian,
> >
> > Going back to Pearson's R loss function, does this imply that I must add an abstract "init2" method to RegressionCriterion (that's where the MSE class inherits from) where I will add the target values X as an extra argument? And then the node impurity will be 1-R (the lower the better)? What about the impurities of the left and right split? In the MSE class they are (sum_i^n y_i)**2 / n, where n is the number of samples in the respective split. It is not clear how this is related to variance in order to adapt it for my purpose.
> >
> > Best,
> > Thomas
> >
> > On Mar 1, 2018 14:56, "Sebastian Raschka" <se.rasc...@gmail.com> wrote:
> > Hi, Thomas,
> >
> > in regression trees, minimizing the variance among the target values is equivalent to minimizing the MSE between targets and predicted values. This is also called variance reduction: https://en.wikipedia.org/wiki/Decision_tree_learning#Variance_reduction
> >
> > Best,
> > Sebastian
> >
> > > On Mar 1, 2018, at 8:27 AM, Thomas Evangelidis <teva...@gmail.com> wrote:
> > >
> > > Hi again,
> > >
> > > I am currently revisiting this problem after familiarizing myself with Cython and Scikit-Learn's code, and I have a very important query:
> > >
> > > Looking at the class MSE(RegressionCriterion), the node impurity is defined as the variance of the target values Y on that node. The predictions X are nowhere involved in the computations. This contradicts my notion of "loss function", which quantifies the discrepancy between predicted and target values. Am I looking at the wrong class, or is what I want to do just not feasible with Random Forests? For example, I would like to modify the RandomForestRegressor code to maximize the Pearson's R between predicted and target values.
> > >
> > > I thank you in advance for any clarification.
> > > Thomas
> > >
> > > On 02/15/2018 01:28 PM, Guillaume Lemaitre wrote:
> > >> Yes you are right, pxd are the headers and pyx the definitions. You need to write a class like MSE. Criterion is an abstract class or base class (I don't have it in front of me).
> > >>
> > >> @Andy: if I recall the PR, we made the classes public to enable such custom criteria. However, it is not documented since we were not officially supporting it. So this is a hidden feature. We could always discuss making this feature more visible and documenting it.
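[Editor's note: Sebastian's point — that minimizing within-node variance is the same as minimizing the MSE against the node's prediction — follows because a regression tree predicts the node mean, and mean((y - y.mean())**2) is, by definition, the variance. A quick numerical check (illustrative only, not the Cython code):]

```python
import numpy as np

y = np.array([1.0, 2.0, 4.0, 7.0])   # targets falling into one tree node

pred = y.mean()                       # a regression tree predicts the node mean
mse = np.mean((y - pred) ** 2)        # MSE between targets and the prediction
var = y.var()                         # variance of the targets in the node

print(mse, var)                       # identical: 5.25 5.25
```

This is also why the predictions never appear explicitly in the MSE criterion: given the node membership, the prediction is fully determined (the mean), so the loss collapses to a statistic of the targets alone. A criterion like 1 - Pearson's R does not collapse this way, which is what makes it awkward to fit into the Criterion interface.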
Re: [scikit-learn] custom loss function in RandomForestRegressor
Hi Sebastian,

Going back to Pearson's R loss function, does this imply that I must add an abstract "init2" method to RegressionCriterion (that's where the MSE class inherits from) where I will add the target values X as an extra argument? And then the node impurity will be 1-R (the lower the better)? What about the impurities of the left and right split? In the MSE class they are (sum_i^n y_i)**2 / n, where n is the number of samples in the respective split. It is not clear how this is related to variance in order to adapt it for my purpose.

Best,
Thomas

On Mar 1, 2018 14:56, "Sebastian Raschka" <se.rasc...@gmail.com> wrote:

Hi, Thomas,

in regression trees, minimizing the variance among the target values is equivalent to minimizing the MSE between targets and predicted values. This is also called variance reduction: https://en.wikipedia.org/wiki/Decision_tree_learning#Variance_reduction

Best,
Sebastian

> On Mar 1, 2018, at 8:27 AM, Thomas Evangelidis <teva...@gmail.com> wrote:
>
> Hi again,
>
> I am currently revisiting this problem after familiarizing myself with Cython and Scikit-Learn's code, and I have a very important query:
>
> Looking at the class MSE(RegressionCriterion), the node impurity is defined as the variance of the target values Y on that node. The predictions X are nowhere involved in the computations. This contradicts my notion of "loss function", which quantifies the discrepancy between predicted and target values. Am I looking at the wrong class, or is what I want to do just not feasible with Random Forests? For example, I would like to modify the RandomForestRegressor code to maximize the Pearson's R between predicted and target values.
>
> I thank you in advance for any clarification.
> Thomas
>
> On 02/15/2018 01:28 PM, Guillaume Lemaitre wrote:
>> Yes you are right, pxd are the headers and pyx the definitions. You need to write a class like MSE. Criterion is an abstract class or base class (I don't have it in front of me).
>>
>> @Andy: if I recall the PR, we made the classes public to enable such custom criteria. However, it is not documented since we were not officially supporting it. So this is a hidden feature. We could always discuss making this feature more visible and documenting it.
Re: [scikit-learn] custom loss function in RandomForestRegressor
Hi again,

I am currently revisiting this problem after familiarizing myself with Cython and Scikit-Learn's code, and I have a very important query:

Looking at the class MSE(RegressionCriterion), the node impurity is defined as the variance of the target values Y on that node. The predictions X are nowhere involved in the computations. This contradicts my notion of "loss function", which quantifies the discrepancy between predicted and target values. Am I looking at the wrong class, or is what I want to do just not feasible with Random Forests? For example, I would like to modify the RandomForestRegressor code to maximize the Pearson's R between predicted and target values.

I thank you in advance for any clarification.
Thomas

> On 02/15/2018 01:28 PM, Guillaume Lemaitre wrote:
>> Yes you are right, pxd are the headers and pyx the definitions. You need to write a class like MSE. Criterion is an abstract class or base class (I don't have it in front of me).
>>
>> @Andy: if I recall the PR, we made the classes public to enable such custom criteria. However, it is not documented since we were not officially supporting it. So this is a hidden feature. We could always discuss making this feature more visible and documenting it.
Re: [scikit-learn] custom loss function in RandomForestRegressor
Is it possible to compile just the _criterion.pyx and _criterion.pxd files, e.g. with "pyximport", or by some alternative means, instead of compiling the whole sklearn library every time I introduce a change?

On 15 Feb 2018 19:29, "Guillaume Lemaitre" <g.lemaitr...@gmail.com> wrote:

Yes you are right, pxd are the headers and pyx the definitions. You need to write a class like MSE. Criterion is an abstract class or base class (I don't have it in front of me).

@Andy: if I recall the PR, we made the classes public to enable such custom criteria. However, it is not documented since we were not officially supporting it. So this is a hidden feature. We could always discuss making this feature more visible and documenting it.

Guillaume Lemaitre
INRIA Saclay Ile-de-France / Equipe PARIETAL
guillaume.lemai...@inria.fr - https://glemaitre.github.io/

From: Thomas Evangelidis
Sent: Thursday, 15 February 2018 19:15
To: Scikit-learn mailing list
Subject: Re: [scikit-learn] custom loss function in RandomForestRegressor

Sorry, I don't know Cython at all. Is _criterion.pxd like a header file in C++? I see that it contains class, function and variable definitions and their descriptions in comments. The class Criterion is an interface and doesn't have function definitions. By "writing your own criterion with a given loss" do you mean writing a class like MSE(RegressionCriterion)?

On 15 February 2018 at 18:50, Guillaume Lemaître <g.lemaitr...@gmail.com> wrote:

> The ClassificationCriterion and RegressionCriterion are now exposed in the _criterion.pxd. This will allow you to create your own criterion.
> So you can write your own Criterion with a given loss by implementing the methods which are required in the trees.
> Then you can pass an instance of this criterion to the tree and it should work.
>
> --
> Guillaume Lemaitre
> INRIA Saclay - Parietal team
> Center for Data Science Paris-Saclay
> https://glemaitre.github.io/
Re: [scikit-learn] custom loss function in RandomForestRegressor
Sorry, I don't know Cython at all. Is _criterion.pxd like a header file in C++? I see that it contains class, function and variable definitions and their descriptions in comments. The class Criterion is an interface and doesn't have function definitions. By "writing your own criterion with a given loss" do you mean writing a class like MSE(RegressionCriterion)?

On 15 February 2018 at 18:50, Guillaume Lemaître <g.lemaitr...@gmail.com> wrote:

> The ClassificationCriterion and RegressionCriterion are now exposed in the _criterion.pxd. This will allow you to create your own criterion.
> So you can write your own Criterion with a given loss by implementing the methods which are required in the trees.
> Then you can pass an instance of this criterion to the tree and it should work.
>
> --
> Guillaume Lemaitre
> INRIA Saclay - Parietal team
> Center for Data Science Paris-Saclay
> https://glemaitre.github.io/
[scikit-learn] custom loss function in RandomForestRegressor
Greetings,

The feature importance calculated by the RandomForest implementation is a very useful feature. I personally use it to select the best features, because it is simple and fast, and then I train MLPRegressors. The limitation of this approach is that although I can control the loss function of the MLPRegressor (I have modified scikit-learn's implementation to accept an arbitrary loss function), I cannot do the same with RandomForestRegressor, and hence I have to rely on 'mse', which is not in accordance with the loss functions I use in the MLPs. Today I was looking at the _criterion.pyx file:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx

However, the code is in Cython and I find it hard to follow. I know that for regression the relevant classes are Criterion(), RegressionCriterion(Criterion), and MSE(RegressionCriterion). My question is: is it possible to write a class that takes an arbitrary function "loss(predictions, targets)" to calculate the loss and impurity of the nodes?

Thanks,
Thomas
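[Editor's note: conceptually, what the question asks for is a split search driven by an arbitrary loss(predictions, targets). Outside the Cython Criterion machinery, that search is easy to write down; the sketch below (pure NumPy, one feature, exhaustive thresholds, each child predicting its own mean) is an illustration of the idea, not scikit-learn's implementation, and shows where a pluggable loss would slot in:]

```python
import numpy as np

def best_split(x, y, loss):
    """Exhaustive search for the threshold on a single feature x that
    minimizes the sample-weighted loss of the two children, where each
    child predicts its own mean (illustrative sketch only)."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_score, best_thr = np.inf, None
    for i in range(1, len(x)):
        left, right = y[:i], y[i:]
        score = (len(left) * loss(np.full(len(left), left.mean()), left)
                 + len(right) * loss(np.full(len(right), right.mean()), right)
                 ) / len(x)
        if score < best_score:
            best_score, best_thr = score, (x[i - 1] + x[i]) / 2.0
    return best_score, best_thr

mse = lambda pred, target: np.mean((pred - target) ** 2)

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
score, thr = best_split(x, y, mse)
print(score, thr)   # the obvious split falls between 3.0 and 10.0 -> thr = 6.5
```

The catch, discussed later in the thread, is that losses like Pearson's R are not decomposable into per-node statistics of the targets alone, which is why they do not fit the Criterion interface as cleanly as MSE does.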
Re: [scikit-learn] MLPClassifier as a feature selector
Alright, with these attributes I can get the weights and biases, but what about the values on the nodes of the last hidden layer? Do I have to work them out myself, or is there a straightforward way to get them?

On 7 December 2017 at 04:25, Manoj Kumar <manojkumarsivaraj...@gmail.com> wrote:

> Hi,
>
> The weights and intercepts are available in the coefs_ and intercepts_ attributes respectively.
>
> See https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/neural_network/multilayer_perceptron.py#L835
>
> On Wed, Dec 6, 2017 at 4:56 PM, Brown J.B. via scikit-learn <scikit-learn@python.org> wrote:
>
>> I am also very interested in knowing if there is a sklearn cookbook solution for getting the weights of a one-hidden-layer MLPClassifier.
>> J.B.
>>
>> 2017-12-07 8:49 GMT+09:00 Thomas Evangelidis <teva...@gmail.com>:
>>
>>> Greetings,
>>>
>>> I want to train an MLPClassifier with one hidden layer and use it as a feature selector for an MLPRegressor.
>>> Is it possible to get the values of the neurons from the last hidden layer of the MLPClassifier to pass them as input to the MLPRegressor?
>>>
>>> If it is not possible with scikit-learn, is anyone aware of any scikit-compatible NN library that offers this functionality? For example this one:
>>>
>>> http://scikit-neuralnetwork.readthedocs.io/en/latest/index.html
>>>
>>> I wouldn't like to do this in Tensorflow, because the MLP there is much slower than scikit-learn's implementation.
>>>
>>> Thomas
>
> --
> Manoj,
> http://github.com/MechCoder
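[Editor's note: the hidden activations do have to be computed manually, but it is one matrix product per layer: apply each weight matrix from coefs_ plus the matching intercept, then the activation function (relu by default). The sketch below assumes a fitted model and the default relu activation; it is not an official scikit-learn API.]

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(8,), activation="relu",
                    max_iter=2000, random_state=0)
clf.fit(X, y)

def hidden_activations(model, X):
    """Forward-pass X through all hidden layers of a fitted MLP,
    stopping before the output layer, and return the activations."""
    a = X
    for W, b in zip(model.coefs_[:-1], model.intercepts_[:-1]):
        a = np.maximum(a @ W + b, 0.0)   # relu; adapt if activation differs
    return a

H = hidden_activations(clf, X)
print(H.shape)   # (200, 8): one row per sample, one column per hidden neuron
```

H can then be fed directly as the input matrix to an MLPRegressor, which is the feature-selector workflow asked about in the thread.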
[scikit-learn] data augmentation following the underlying feature values distributions and correlations
Greetings,

I want to augment my training set but at the same time preserve the correlations between feature values. More specifically, my features are NMR resonances of the nuclei of a single amino acid. For example, for glutamic acid I have for each observation the following feature values: [CA, HA, CB, HB, CG, HG], where CA is the resonance of the alpha carbon, HA the resonance of the alpha proton, and so forth. The complication here is that these feature values are not independent: HA is covalently bonded to CA, CB to CA, and so on. Therefore, if I sample a random CA value from the distribution of experimental values of CA, I cannot pick ANY HA value from the respective experimental distribution, simply because CA and HA are correlated. The same applies to CA and CB, CB and HB, CB and CG, CG and HG.

Is there any algorithm that can generate [CA, HA, CB, HB, CG, HG] feature vectors that comply with the per-atom distributions and their correlations? I saw that Gaussian Mixture Models have a function to generate random samples from the fitted Gaussian distribution (sklearn.mixture.GaussianMixture.sample), but it is not clear whether these samples will retain the correlations between the features (the nuclei in this case). If there is no such algorithm in scikit-learn, could you please point me to any other Python library which does that?

Thanks in advance.
Thomas
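[Editor's note: GaussianMixture fits a full covariance matrix per component by default (covariance_type='full'), so samples drawn with .sample() do reproduce the between-feature correlations. A quick check on synthetic two-feature data (illustrative stand-ins for a correlated CA/HA pair, not real NMR data):]

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# synthetic data: two strongly correlated features (r = 0.8)
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])
X = rng.multivariate_normal(mean=[55.0, 4.3], cov=cov, size=2000)

gmm = GaussianMixture(n_components=1, covariance_type="full",
                      random_state=0).fit(X)
X_new, _ = gmm.sample(1000)     # returns (samples, component labels)

print(np.corrcoef(X.T)[0, 1], np.corrcoef(X_new.T)[0, 1])
# both correlations come out close to 0.8
```

With several components the same holds per component, so multimodal resonance distributions can also be captured; the usual caveats apply (choose n_components by BIC, and correlations are modeled as linear within each component).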
[scikit-learn] MLPClassifier as a feature selector
Greetings,

I want to train an MLPClassifier with one hidden layer and use it as a feature selector for an MLPRegressor. Is it possible to get the values of the neurons from the last hidden layer of the MLPClassifier to pass them as input to the MLPRegressor?

If it is not possible with scikit-learn, is anyone aware of any scikit-compatible NN library that offers this functionality? For example this one:

http://scikit-neuralnetwork.readthedocs.io/en/latest/index.html

I wouldn't like to do this in Tensorflow, because the MLP there is much slower than scikit-learn's implementation.

Thomas
Re: [scikit-learn] anti-correlated predictions by SVR
I have very small training sets (10-50 observations). Currently, I am working with 16 observations for training and 25 for validation (external test set). And I am doing regression, not classification (hence the SVR instead of SVC).

On 26 September 2017 at 18:21, Gael Varoquaux <gael.varoqu...@normalesup.org> wrote:

> Hypothesis: you have a very small dataset, and when you leave out data you create a distribution shift between the train and the test. A simplified example: 20 samples, 10 of class a, 10 of class b. A leave-one-out cross-validation will create a training set of 10 samples of one class and 9 of the other, and the test set is composed of the class that is in the minority in the train set.
>
> G
>
> On Tue, Sep 26, 2017 at 06:10:39PM +0200, Thomas Evangelidis wrote:
> > Greetings,
> >
> > I don't know if anyone has encountered this before, but sometimes I get anti-correlated predictions from the SVR that I am training. Namely, the Pearson's R and Kendall's tau are negative when I compare the predictions on the external test set with the true values. However, the SVR predictions on the training set have positive correlations with the experimental values, and hence I can't think of a way to know in advance whether the trained SVR will produce anti-correlated predictions, in order to change their sign and avoid the disaster. Here is an example of what I mean:
> >
> > Training set predictions: R=0.452422, tau=0.33
> > External test set predictions: R=-0.537420, tau=-0.30
> >
> > Obviously, in a real case scenario where I wouldn't have the external test set, I would have used the worst observation instead of the best ones. Has anybody any idea about how I could prevent this?
> >
> > thanks in advance
> > Thomas
>
> --
> Gael Varoquaux
> Researcher, INRIA Parietal
> NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette France
> Phone: ++ 33-1-69-08-79-68
> http://gael-varoquaux.info  http://twitter.com/GaelVaroquaux
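[Editor's note: Gael's distribution-shift hypothesis has a stark regression analogue that is easy to demonstrate. Under leave-one-out on a tiny dataset, each held-out sample sits systematically on the opposite side of its own training set's mean, so even a perfectly reasonable "predict the training mean" estimator produces LOO predictions that are exactly anti-correlated with the targets. A minimal sketch (a toy mean predictor, not an SVR):]

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=16)              # tiny dataset, as in the thread

# leave-one-out with the simplest estimator: predict the training mean
preds = np.array([np.delete(y, i).mean() for i in range(len(y))])

r = np.corrcoef(y, preds)[0, 1]
print(r)
# preds_i = (sum(y) - y_i) / (n - 1) is an affine, decreasing function
# of y_i, so the correlation is exactly -1 here
```

Real SVRs are not pure mean predictors, but with 16 training points they are heavily regularized toward this regime, which is one plausible mechanism behind the negative test-set correlations reported above.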
[scikit-learn] anti-correlated predictions by SVR
Greetings,

I don't know if anyone has encountered this before, but sometimes I get anti-correlated predictions from the SVR that I am training. Namely, the Pearson's R and Kendall's tau are negative when I compare the predictions on the external test set with the true values. However, the SVR predictions on the training set have positive correlations with the experimental values, and hence I can't think of a way to know in advance whether the trained SVR will produce anti-correlated predictions, in order to change their sign and avoid the disaster. Here is an example of what I mean:

Training set predictions: R=0.452422, tau=0.33
External test set predictions: R=-0.537420, tau=-0.30

Obviously, in a real case scenario where I wouldn't have the external test set, I would have used the worst observation instead of the best ones. Has anybody any idea about how I could prevent this?

thanks in advance
Thomas
Re: [scikit-learn] custom loss function
What about the SVM? I use an SVR at the end to combine multiple MLPRegressor predictions using the rbf kernel (the linear kernel is not good for this problem). Can I also implement an SVR with an rbf kernel in Tensorflow using my own loss function? So far I have found an example of an SVC with a linear kernel in Tensorflow, and nothing in Keras. My alternative option would be to train multiple SVRs and find through cross-validation the one that minimizes my custom loss function, but as I said in a previous message, that would be a suboptimal solution because in scikit-learn the SVR minimizes the default loss function.

On 13 Sep 2017 20:48, "Andreas Mueller" <t3k...@gmail.com> wrote:

> On 09/13/2017 01:18 PM, Thomas Evangelidis wrote:
>
> Thanks again for the clarifications Sebastian!
>
> Keras has a scikit-learn API with the KerasRegressor, which implements the scikit-learn MLPRegressor interface:
>
> https://keras.io/scikit-learn-api/
>
> Is it possible to change the loss function in KerasRegressor? I don't have time right now to experiment with hyperparameters of new ANN architectures. I am in urgent need to reproduce in Keras the results obtained with MLPRegressor and the set of hyperparameters that I have optimized for my problem, and later change the loss function.
>
> I think using keras is probably the way to go for you.
Re: [scikit-learn] custom loss function
Thanks again for the clarifications Sebastian! Keras has a Scikit-learn API with the KeraRegressor which implements the Scikit-Learn MLPRegressor interface: https://keras.io/scikit-learn-api/ Is it possible to change the loss function in KerasRegressor? I don't have time right now to experiment with hyperparameters of new ANN architectures. I am in urgent need to reproduce in Keras the results obtained with MLPRegressor and the set of hyperparameters that I have optimized for my problem and later change the loss function. On 13 September 2017 at 18:14, Sebastian Raschka <se.rasc...@gmail.com> wrote: > > What about the SVR? Is it possible to change the loss function there? > > Here you would have the same problem; SVR is a constrained optimization > problem and you would have to change the calculation of the loss gradient > then. Since SVR is a "1-layer" neural net, if you change the cost function > to something else, it's not really a SVR anymore. > > > > Could you please clarify what the "x" and "x'" parameters in the default > Kernel functions mean? Is "x" a NxM array, where N is the number of > observations and M the number of features? > > Both x and x' should be denoting training examples. The kernel matrix is > symmetric (N x N). > > > > Best, > Sebastian > > > On Sep 13, 2017, at 5:25 AM, Thomas Evangelidis <teva...@gmail.com> > wrote: > > > > Thanks Sebastian. Exploring Tensorflow capabilities was in my TODO list, > but now it's in my immediate plans. > > What about the SVR? Is it possible to change the loss function there? > Could you please clarify what the "x" and "x'" parameters in the default > Kernel functions mean? Is "x" a NxM array, where N is the number of > observations and M the number of features? 
> > > > http://scikit-learn.org/stable/modules/svm.html#kernel-functions > > > > > > > > On 12 September 2017 at 00:37, Sebastian Raschka <se.rasc...@gmail.com> > wrote: > > Hi Thomas, > > > > > For the MLPRegressor case so far my conclusion was that it is not > possible unless you modify the source code. > > > > Also, I suspect that this would be non-trivial. I haven't looked too > closely at how the MLPClassifier/MLPRegressor are implemented but since you > perform the weight updates based on the gradient of the cost function wrt > the weights, the modification would be non-trivial if the partial > derivatives are not computed based on some autodiff implementation -- you > would have to edit all the partial d's along the backpropagation up to the > first hidden layer. While I think that scikit-learn is by far the best > library out there for machine learning, I think if you want an easy > solution, you probably won't get around TensorFlow or PyTorch or > equivalent, here, for your specific MLP problem unless you want to make > your life extra hard :P (seriously, you can pick up either of the two in about > an hour and have your MLPRegressor up and running so that you can then > experiment with your cost function). > > > > Best, > > Sebastian > > > > > On Sep 11, 2017, at 6:13 PM, Thomas Evangelidis <teva...@gmail.com> > wrote: > > > > > > Greetings, > > > > > > I know this is a recurrent question, but I would like to use my own > loss function either in an MLPRegressor or in an SVR. For the MLPRegressor > case so far my conclusion was that it is not possible unless you modify the > source code. On the other hand, for the SVR I was looking at setting custom > kernel functions. But I am not sure if this is the same thing. Could > someone please clarify this to me? 
Finally, I read about the "scoring" > parameter in cross-validation, but this is just to select a Regressor that > has been trained already with the default loss function, so it would be > harder to find one that minimizes my own loss function. > > > > > > For the record, my loss function is the centered root mean square > error. > > > > > > Thanks in advance for any advice. > > > > > > > > > > > > -- > > > == > > > Dr Thomas Evangelidis > > > Post-doctoral Researcher > > > CEITEC - Central European Institute of Technology > > > Masaryk University > > > Kamenice 5/A35/2S049, > > > 62500 Brno, Czech Republic > > > > > > email: tev...@pharm.uoa.gr > > > teva...@gmail.com > > > > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > &
[scikit-learn] custom loss function
Greetings, I know this is a recurrent question, but I would like to use my own loss function either in an MLPRegressor or in an SVR. For the MLPRegressor case so far my conclusion was that it is not possible unless you modify the source code. On the other hand, for the SVR I was looking at setting custom kernel functions. But I am not sure if this is the same thing. Could someone please clarify this to me? Finally, I read about the "scoring" parameter in cross-validation, but this is just to select a Regressor that has been trained already with the default loss function, so it would be harder to find one that minimizes my own loss function. For the record, my loss function is the centered root mean square error. Thanks in advance for any advice. -- == Dr Thomas Evangelidis Post-doctoral Researcher CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/2S049, 62500 Brno, Czech Republic email: tev...@pharm.uoa.gr teva...@gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/
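While the training loss of MLPRegressor/SVR cannot be swapped out, the centered RMSE mentioned above can at least be used for model selection as a custom cross-validation scorer; this selects among trained models rather than changing what the optimizer minimizes. A minimal sketch with synthetic data (Ridge is just a placeholder estimator):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

def centered_rmse(y_true, y_pred):
    """Centered RMSE: RMSE after removing each vector's own mean offset."""
    dt = y_true - np.mean(y_true)
    dp = y_pred - np.mean(y_pred)
    return np.sqrt(np.mean((dp - dt) ** 2))

# greater_is_better=False marks this as a loss (scores are negated internally)
scorer = make_scorer(centered_rmse, greater_is_better=False)

rng = np.random.RandomState(0)
X = rng.rand(50, 5)
y = X @ rng.rand(5)
scores = cross_val_score(Ridge(), X, y, scoring=scorer, cv=5)
```

Any estimator/grid search accepting a `scoring` argument can then rank candidates by this criterion.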
Re: [scikit-learn] control value range of MLPRegressor predictions
On 10 September 2017 at 22:03, Sebastian Raschka <se.rasc...@gmail.com> wrote: > You could normalize the outputs (e.g., via min-max scaling). However, I > think the more intuitive way would be to clip the predictions. E.g., say > you are predicting house prices, it probably makes no sense to have a > negative prediction, so you would clip the output at some value >$0 > > By clipping you mean discarding the predictors that give values below/above the threshold? > PS: -820 and -800 sound a bit extreme if your training data is in a -5 to > -9 range. Is your training data from a different population than the one > you use for testing/making predictions? Or maybe it's just an extreme case > of overfitting. > > It is from the same population, but the training sets I use are very small (6-32 observations), so it must be over-fitting. We had that discussion in the past here, yet in practice I get good correlations with the experimental values using MLPRegressors. > Best, > Sebastian > > > > On Sep 10, 2017, at 3:13 PM, Thomas Evangelidis <teva...@gmail.com> > wrote: > > > > Greetings, > > > > Is there any way to force the MLPRegressor to make predictions in the > same value range as the training data? For example, if the training data > range between -5 and -9, I don't want the predictions to range between -820 > and -800. In fact, sometimes I get anti-correlated predictions, for > example between 800 and 820 and I have to change the sign in order to > calculate correlations with experimental values. Is there a way to control > the value range explicitly or implicitly (by post-processing the > predictions)? 
> > > > thanks > > Thomas > > > > > > -- > > == > > Dr Thomas Evangelidis > > Post-doctoral Researcher > > CEITEC - Central European Institute of Technology > > Masaryk University > > Kamenice 5/A35/2S049, > > 62500 Brno, Czech Republic > > > > email: tev...@pharm.uoa.gr > > teva...@gmail.com > > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > > > ___ > > scikit-learn mailing list > > scikit-learn@python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > ___ > scikit-learn mailing list > scikit-learn@python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- == Dr Thomas Evangelidis Post-doctoral Researcher CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/2S049, 62500 Brno, Czech Republic email: tev...@pharm.uoa.gr teva...@gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/
[scikit-learn] control value range of MLPRegressor predictions
Greetings, Is there any way to force the MLPRegressor to make predictions in the same value range as the training data? For example, if the training data range between -5 and -9, I don't want the predictions to range between -820 and -800. In fact, sometimes I get anti-correlated predictions, for example between 800 and 820 and I have to change the sign in order to calculate correlations with experimental values. Is there a way to control the value range explicitly or implicitly (by post-processing the predictions)? thanks Thomas -- == Dr Thomas Evangelidis Post-doctoral Researcher CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/2S049, 62500 Brno, Czech Republic email: tev...@pharm.uoa.gr teva...@gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/
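Sebastian's clipping suggestion from the reply above is a one-line post-processing step, not a change to the estimator itself; the values here are hypothetical:

```python
import numpy as np

# Hypothetical training targets and out-of-range predictions
y_train = np.array([-5.2, -6.1, -7.8, -9.0])
y_pred = np.array([-4.0, -820.0, -6.5, -800.0])

# Clip predictions to the range observed in the training data
clipped = np.clip(y_pred, y_train.min(), y_train.max())
```

Clipping keeps every prediction (it does not discard any predictor); it only saturates values at the chosen bounds, so anti-correlated predictions would still need a separate sign check.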
Re: [scikit-learn] combining datasets from different sources
On 7 September 2017 at 15:29, Maciek Wójcikowski <mac...@wojcikowski.pl> wrote: > I think StandardScaler is what you want. For each assay you will get mean > and var. Average mean would be the "optimal" shift and average variance the > spread. But would this value make any physical sense? > > I think you missed my point. The problem was scaling with restraints: the RMSD between the binding affinities of the common ligands must be minimized upon scaling. Anyway, I managed to work it out using scipy.optimize. > Considering the RF-Score-VS: In fact it's a regressor and it predicts a > real value, not a class. Although it is validated mostly using Enrichment > Factor, the last figure shows top results for regression vs Vina. > > To my understanding you trained the RF using class information (active, inactive) and the prediction was a probability value. If the probability is above 0.5 then the compound is an active, otherwise it is an inactive. This is how sklearn.ensemble.RandomForestClassifier works. In contrast, I train MLPRegressors using binding affinities (scalar values) and the predictions are binding affinities (scalar values). > > Pozdrawiam, | Best regards, > Maciek Wójcikowski > mac...@wojcikowski.pl > > 2017-09-06 20:48 GMT+02:00 Thomas Evangelidis <teva...@gmail.com>: > >> >> After some thought about this problem today, I think it is an objective >> function minimization problem, where the objective function can be the root >> mean square deviation (RMSD) between the affinities of the common molecules >> in the two data sets. I could work iteratively, first rescale and fit assay >> B to match A, then proceed to assay C and so forth. Or alternatively, for >> each Assay I need to find two missing variables, the optimum shift Sh and >> the scale Sc. So if I have 3 Assays A, B, C let's say, I am looking for the >> optimum values of Sh_A, Sc_A, Sh_B, Sc_B, Sh_C, Sc_C that minimize the RMSD >> between the binding affinities of the overlapping molecules. 
Any idea how I >> can do that with scikit-learn? >> >> >> On 6 September 2017 at 00:29, Thomas Evangelidis <teva...@gmail.com> >> wrote: >> >>> Thanks Jason, Sebastian and Maciek! >>> >>> I believe from all the suggestions, the most feasible solution is to >>> look for experimental assays which overlap by at least two compounds, and then >>> adjust the binding affinities of one of them by looking at their difference >>> in both assays. Sebastian mentioned the simplest scenario, where the shift >>> for both compounds is 2 kcal/mol. However, he neglected to mention that the >>> ratio between the affinities of the two compounds in each assay also >>> matters. Specifically, the ratio Ka/Kb=-7/-9=0.78 in assay A but >>> -10/-12=0.83 in assay B. Ideally that should also be taken into account to >>> select the right transformation function for the values from Assay B. Is >>> anybody aware of any clever algorithm to select the right transformation >>> function for such a case? I am sure one exists. >>> >>> The other approach would be to train different predictors from each >>> assay and then apply a data fusion technique (e.g. min rank). But that >>> wouldn't be that elegant. >>> >>> @Maciek To my understanding, the paper you cited addresses a >>> classification problem (actives, inactives) by implementing Random Forest >>> Classifiers. My case is a Regression problem. >>> >>> >>> best, >>> Thomas >>> >>> >>> On 5 September 2017 at 20:33, Maciek Wójcikowski <mac...@wojcikowski.pl> >>> wrote: >>> >>>> Hi Thomas and others, >>>> >>>> It also really depends on how many data points you have on each >>>> compound. If you had more than a few then there are a few options. If you get >>>> two very distinct activities for one ligand, I'd discard such samples as >>>> ambiguous or decide on one of the assays/experiments (the one with lower >>>> error). The exact problem was faced by the PDBbind creators; I'd also look >>>> there for details on what they did with their activities. 
>>>> >>>> To follow up Sebastian's suggestion: have you checked how different >>>> ranks/Z-scores you get? Check out the Kendall Tau. >>>> >>>> Anyhow, you could build local models for specific experimental >>>> methods. In our recent publication on a slightly different area >>>> (protein-ligand scoring function), we show that the RF built on one target >>
Re: [scikit-learn] combining datasets from different sources
After some thought about this problem today, I think it is an objective function minimization problem, where the objective function can be the root mean square deviation (RMSD) between the affinities of the common molecules in the two data sets. I could work iteratively, first rescale and fit assay B to match A, then proceed to assay C and so forth. Or alternatively, for each Assay I need to find two missing variables, the optimum shift Sh and the scale Sc. So if I have 3 Assays A, B, C let's say, I am looking for the optimum values of Sh_A, Sc_A, Sh_B, Sc_B, Sh_C, Sc_C that minimize the RMSD between the binding affinities of the overlapping molecules. Any idea how I can do that with scikit-learn? On 6 September 2017 at 00:29, Thomas Evangelidis <teva...@gmail.com> wrote: > Thanks Jason, Sebastian and Maciek! > > I believe from all the suggestions, the most feasible solution is to look > for experimental assays which overlap by at least two compounds, and then > adjust the binding affinities of one of them by looking at their difference > in both assays. Sebastian mentioned the simplest scenario, where the shift > for both compounds is 2 kcal/mol. However, he neglected to mention that the > ratio between the affinities of the two compounds in each assay also > matters. Specifically, the ratio Ka/Kb=-7/-9=0.78 in assay A but > -10/-12=0.83 in assay B. Ideally that should also be taken into account to > select the right transformation function for the values from Assay B. Is > anybody aware of any clever algorithm to select the right transformation > function for such a case? I am sure one exists. > > The other approach would be to train different predictors from each assay > and then apply a data fusion technique (e.g. min rank). But that wouldn't > be that elegant. > > @Maciek To my understanding, the paper you cited addresses a > classification problem (actives, inactives) by implementing Random Forest > Classifiers. My case is a Regression problem. 
> > > best, > Thomas > > > On 5 September 2017 at 20:33, Maciek Wójcikowski <mac...@wojcikowski.pl> > wrote: > >> Hi Thomas and others, >> >> It also really depends on how many data points you have on each compound. >> If you had more than a few then there are a few options. If you get two very >> distinct activities for one ligand, I'd discard such samples as ambiguous >> or decide on one of the assays/experiments (the one with lower error). The >> exact problem was faced by the PDBbind creators; I'd also look there for >> details on what they did with their activities. >> >> To follow up Sebastian's suggestion: have you checked how different >> ranks/Z-scores you get? Check out the Kendall Tau. >> >> Anyhow, you could build local models for specific experimental methods. >> In our recent publication on a slightly different area (protein-ligand >> scoring function), we show that the RF built on one target is just slightly >> better than the RF built on many targets (we've used DUD-E database); >> Check out the "horizontal" and "per-target" splits >> https://www.nature.com/articles/srep46710. Unfortunately, this may >> change for different models. Plus the molecular descriptors used, which we >> know nothing about. >> >> I hope that helped a bit. >> >> >> Pozdrawiam, | Best regards, >> Maciek Wójcikowski >> mac...@wojcikowski.pl >> >> 2017-09-05 19:35 GMT+02:00 Sebastian Raschka <se.rasc...@gmail.com>: >> >>> Another approach would be to pose this as a "ranking" problem to predict >>> relative affinities rather than absolute affinities. E.g., if you have data >>> from one (or more) molecules that has/have been tested under 2 or more >>> experimental conditions, you can rank the other molecules accordingly or >>> normalize. E.g. 
if you observe that the binding affinity of molecule A is >>> -7 kcal/mol in assay A and -9 kcal/mol in assay B, and say the binding >>> affinities of molecule B are -10 and -12 kcal/mol, respectively, that >>> should give you some information for normalizing the values from assay 2 >>> (e.g., by adding 2 kcal/mol). Of course this is not a perfect solution and >>> might be error prone, but so are experimental assays ... (when I sometimes >>> look at the std error/CI of the data I get from collaborators ... well, it >>> seems that absolute binding affinities have always been taken with a grain of >>> salt anyway) >>> >>> Best, >>> Sebastian >>> >>> > On Sep 5, 2017, at 1:02 PM, Jason Rudy <jcr...@gmail.com> wrote: >>> > >>> > Thomas, &
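The shift-and-scale fit proposed at the top of this message can indeed be done with scipy.optimize (as mentioned, outside scikit-learn). A sketch with made-up affinities of molecules shared between two assays, fitting one scale/shift pair (Sc_B, Sh_B) that maps assay B onto assay A:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up affinities of three molecules common to assay A and assay B
assay_a = np.array([-7.0, -9.0, -8.5])
assay_b = np.array([-10.0, -12.0, -11.4])

def mse(params):
    """Mean squared deviation after mapping assay B onto assay A."""
    scale, shift = params
    return np.mean((scale * assay_b + shift - assay_a) ** 2)

res = minimize(mse, x0=[1.0, 0.0])  # find the optimum Sc_B, Sh_B
scale_b, shift_b = res.x
rmsd = np.sqrt(res.fun)             # the RMSD actually being minimized
```

Minimizing the mean squared error avoids the non-smooth square root at a perfect fit; the RMSD is recovered afterwards. For several assays, the parameter vector simply grows to (Sc, Sh) per assay.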
Re: [scikit-learn] combining datasets from different sources
Thanks Jason, Sebastian and Maciek! I believe from all the suggestions, the most feasible solution is to look for experimental assays which overlap by at least two compounds, and then adjust the binding affinities of one of them by looking at their difference in both assays. Sebastian mentioned the simplest scenario, where the shift for both compounds is 2 kcal/mol. However, he neglected to mention that the ratio between the affinities of the two compounds in each assay also matters. Specifically, the ratio Ka/Kb=-7/-9=0.78 in assay A but -10/-12=0.83 in assay B. Ideally that should also be taken into account to select the right transformation function for the values from Assay B. Is anybody aware of any clever algorithm to select the right transformation function for such a case? I am sure one exists. The other approach would be to train different predictors from each assay and then apply a data fusion technique (e.g. min rank). But that wouldn't be that elegant. @Maciek To my understanding, the paper you cited addresses a classification problem (actives, inactives) by implementing Random Forest Classifiers. My case is a Regression problem. best, Thomas On 5 September 2017 at 20:33, Maciek Wójcikowski <mac...@wojcikowski.pl> wrote: > Hi Thomas and others, > > It also really depends on how many data points you have on each compound. > If you had more than a few then there are a few options. If you get two very > distinct activities for one ligand, I'd discard such samples as ambiguous > or decide on one of the assays/experiments (the one with lower error). The > exact problem was faced by the PDBbind creators; I'd also look there for > details on what they did with their activities. > > To follow up Sebastian's suggestion: have you checked how different > ranks/Z-scores you get? Check out the Kendall Tau. > > Anyhow, you could build local models for specific experimental methods. 
> In our recent publication on a slightly different area (protein-ligand > scoring function), we show that the RF built on one target is just slightly > better than the RF built on many targets (we've used DUD-E database); > Check out the "horizontal" and "per-target" splits https://www.nature.com/articles/srep46710. > Unfortunately, this may change for different models. > Plus the molecular descriptors used, which we know nothing about. > > I hope that helped a bit. > > > Pozdrawiam, | Best regards, > Maciek Wójcikowski > mac...@wojcikowski.pl > > 2017-09-05 19:35 GMT+02:00 Sebastian Raschka <se.rasc...@gmail.com>: > >> Another approach would be to pose this as a "ranking" problem to predict >> relative affinities rather than absolute affinities. E.g., if you have data >> from one (or more) molecules that has/have been tested under 2 or more >> experimental conditions, you can rank the other molecules accordingly or >> normalize. E.g. if you observe that the binding affinity of molecule A is >> -7 kcal/mol in assay A and -9 kcal/mol in assay B, and say the binding >> affinities of molecule B are -10 and -12 kcal/mol, respectively, that >> should give you some information for normalizing the values from assay 2 >> (e.g., by adding 2 kcal/mol). Of course this is not a perfect solution and >> might be error prone, but so are experimental assays ... (when I sometimes >> look at the std error/CI of the data I get from collaborators ... well, it >> seems that absolute binding affinities have always been taken with a grain of >> salt anyway) >> >> Best, >> Sebastian >> >> > On Sep 5, 2017, at 1:02 PM, Jason Rudy <jcr...@gmail.com> wrote: >> > >> > Thomas, >> > >> > This is sort of related to the problem I did my M.S. thesis on years >> ago: cross-platform normalization of gene expression data. If you google >> that term you'll find some papers. 
The situation is somewhat different, >> though, because with microarrays or RNA-seq you get thousands of data >> points for each experiment, which makes it easier to estimate the batch >> effect. The principle is similar, however. >> > >> > If I were in your situation, I would consider whether I have any of the >> following advantages: >> > >> > 1. Some molecules that appear in multiple data sets >> > 2. Detailed information about the different experimental conditions >> > 3. Physical/chemical models of how experimental conditions influence >> binding affinity >> > >> > If you have any of the above, you can potentially use them to improve >> your estimates. You could also consider using experiment ID as a >> categorical predictor in a sufficiently general regression method. >> &
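Maciek's Kendall Tau suggestion from the quoted message is a one-liner with SciPy; the affinity values below are invented for illustration:

```python
from scipy.stats import kendalltau

# Invented affinities of the same four molecules measured in two assays
assay_a = [-7.0, -9.0, -8.5, -6.2]
assay_b = [-10.0, -12.0, -11.4, -9.1]

# tau close to 1 means the two assays rank the molecules almost identically,
# so a monotonic (e.g. shift/scale) mapping between them is plausible
tau, p_value = kendalltau(assay_a, assay_b)
```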
[scikit-learn] combining datasets from different sources
Greetings, I am working on a problem that involves predicting the binding affinity of small molecules on a receptor structure (a regression problem, not classification). I have multiple small datasets of molecules with measured binding affinities on a receptor, but each dataset was measured in different experimental conditions and therefore I cannot use them all together as a training set. So, instead of using them individually, I was wondering whether there is a method to combine them all into a super training set. The first way I could think of is to convert the binding affinities to Z-scores and then combine all the small datasets of molecules. But this would be inaccurate because, firstly, the datasets are very small (10-50 molecules each), and secondly, the range of binding affinities differs in each experiment (some datasets contain really strong binders, while others do not, etc.). Is there any other approach to combine datasets with values coming from different sources? Maybe if someone points me to the right reference I could read and understand if it is applicable to my case. Thanks, Thomas -- == Dr Thomas Evangelidis Post-doctoral Researcher CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/2S049, 62500 Brno, Czech Republic email: tev...@pharm.uoa.gr teva...@gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/
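The Z-score idea mentioned above (with the stated caveats about tiny datasets and unequal affinity ranges) amounts to standardizing each assay separately before pooling; a sketch with invented numbers:

```python
import numpy as np

# Invented affinities from two assays with different ranges
assay_a = np.array([-5.0, -6.5, -7.2, -9.0])
assay_b = np.array([-10.0, -11.0, -12.5, -14.0])

def zscore(values):
    """Standardize one assay to zero mean and unit variance."""
    return (values - values.mean()) / values.std()

# Pool the per-assay standardized affinities into one training target
combined = np.concatenate([zscore(assay_a), zscore(assay_b)])
```

With 10-50 molecules per assay, the per-assay mean and standard deviation are themselves noisy estimates, which is exactly the inaccuracy the message above anticipates.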
[scikit-learn] recommended feature selection method to train an MLPRegressor
Which of the following methods would you recommend to select good features (<=50) from a set of 534 features in order to train an MLPRegressor? Please take into account that the datasets I use for training are small. http://scikit-learn.org/stable/modules/feature_selection.html And please don't tell me to use a neural network that supports dropout or any other algorithm for feature elimination. This is not applicable in my case because I want to know the best 50 features in order to append them to other types of features that I am confident are important. cheers Thomas -- == Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tev...@pharm.uoa.gr teva...@gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/
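Of the options on the linked page, univariate selection is the simplest fit for "give me the 50 best features"; a sketch with random stand-in data of the stated shape (with only tens of samples the univariate scores will be noisy, so treat the ranking with caution):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Random stand-in data: small sample count, 534 features as in the question
rng = np.random.RandomState(0)
X = rng.rand(30, 534)
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.05, size=30)

# Score each feature independently and keep the 50 highest-scoring ones
selector = SelectKBest(score_func=mutual_info_regression, k=50).fit(X, y)
top_idx = selector.get_support(indices=True)  # indices of the chosen features
X_reduced = selector.transform(X)
```

`f_regression` could be swapped in for `mutual_info_regression` to rank by linear correlation instead; the selected indices can then be appended to the hand-picked feature set.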
Re: [scikit-learn] meta-estimator for multiple MLPRegressor
Stuart, I didn't see LASSO performing well, especially with the second type of data. The alpha parameter probably needs adjustment with LassoCV. I don't know if you have read my previous messages on this thread, so I quote again my setting for MLPRegressor. MLPRegressor(random_state=random_state, max_iter=400, early_stopping=True, validation_fraction=0.2, alpha=10, hidden_layer_sizes=(10,)) So to sum up, I must select the lowest possible value for the following parameters: * max_iter * hidden_layer_sizes (lower than 10?) * number of features in my training data. I.e. the first type of data, which consists of 60 features, is preferable to the second, which consists of 456. Is this correct? On 10 January 2017 at 19:47, Stuart Reynolds <stu...@stuartreynolds.net> wrote: > Thomas, > Jacob's point is important -- it's not the number of features that's > important, it's the number of free parameters. As the number of free > parameters increases, the space of representable functions grows to the > point where the cost function is minimized by having a single parameter > explain each variable. This is true of many ML methods. > > In the case of decision trees, for example, you can allow each node (a > free parameter) to hold exactly 1 training example, and see perfect training > performance. In linear methods, you can perfectly fit training data by > adding additional polynomial features (for feature x_i, add x^2_i, x^3_i, > x^4_i, ...). Performance on unseen data will be terrible. > MLP is no different -- adding more free parameters (more flexibility to > precisely model the training data) may harm more than help when it comes to > unseen data performance, especially when the number of examples is small. > > Early stopping may help overfitting, as might dropout. > > The likely reason that LASSO and GBR performed well is that they're > methods that explicitly manage overfitting. > > Perform a grid search on: > - the number of hidden nodes in your MLP. 
> - the number of iterations > > for both, you may find that lowering values improves performance on unseen > data. > > > > > > > > > > On Tue, Jan 10, 2017 at 4:46 AM, Thomas Evangelidis <teva...@gmail.com> > wrote: > >> Jacob, >> >> The features are not 6000. I train 2 MLPRegressors from two types of >> data, both refer to the same dataset (35 molecules in total) but each >> one contains a different type of information. The first data consist of 60 >> features. I tried 100 different random states and measured the average |R| >> using the leave-20%-out cross-validation. Below are the results from the >> first data: >> >> RandomForestRegressor: |R|= 0.389018243545 +- 0.252891783658 >> LASSO: |R|= 0.247411754937 +- 0.232325286471 >> GradientBoostingRegressor: |R|= 0.324483769202 +- 0.211778410841 >> MLPRegressor: |R|= 0.540528696597 +- 0.255714448793 >> >> The second type of data consists of 456 features. Below are the results >> for these, too: >> >> RandomForestRegressor: |R|= 0.361562548904 +- 0.234872385318 >> LASSO: |R|= 3.27752711304e-16 +- 2.60800139195e-16 >> GradientBoostingRegressor: |R|= 0.328087138161 +- 0.229588427086 >> MLPRegressor: |R|= 0.455473342507 +- 0.24579081197 >> >> >> At the end I want to combine models created from these data types using a >> meta-estimator (that was my original question). The combination with the >> highest |R| (0.631851796403 +- 0.247911204514) was produced by an SVR >> that combined the best MLPRegressor from data type 1 and the best >> MLPRegressor from data type 2: >> >> >> >> >> >> On 10 January 2017 at 01:36, Jacob Schreiber <jmschreibe...@gmail.com> >> wrote: >> >>> Even with a single layer with 10 neurons you're still trying to train >>> over 6000 parameters using ~30 samples. Dropout is a concept common in >>> neural networks, but doesn't appear to be in sklearn's implementation of >>> MLPs. 
Early stopping based on validation performance isn't an "extra" step >>> for reducing overfitting, it's basically a required step for neural >>> networks. It seems like you have a validation sample of ~6 datapoints... I'm >>> still very skeptical of that giving you proper results for a complex model. >>> Will this larger dataset be of exactly the same data? Just taking another >>> unrelated dataset and showing that a MLP can learn it doesn't mean it will >>> work for your specific data. Can you post the actual results from using >>> LASSO, RandomForestRegressor, GradientBoostingRegressor, and MLP? >>> >>> On Mon, Jan 9, 2017 at 4:21 PM, Stuart Reynolds &l
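Stuart's suggestion of grid-searching the number of hidden nodes and iterations maps directly onto GridSearchCV; the data and grid values below are placeholders, not the actual experiment:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

# Placeholder data of roughly the discussed shape (~30 samples, 60 features)
rng = np.random.RandomState(0)
X = rng.rand(32, 60)
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=32)

# Search over model capacity (hidden nodes) and training length (iterations)
param_grid = {
    "hidden_layer_sizes": [(5,), (10,)],
    "max_iter": [100, 400],
}
search = GridSearchCV(MLPRegressor(alpha=10, random_state=0),
                      param_grid, cv=3).fit(X, y)
best = search.best_params_
```

With so few samples the cross-validation estimates themselves have high variance, so repeating the search over several shuffles (as done with the 100 random states above) is advisable.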
Re: [scikit-learn] meta-estimator for multiple MLPRegressor
Jacob & Sebastian, I think the best way to find out if my modeling approach works is to find a larger dataset and split it into two parts: the first will be used as a training/cross-validation set and the second as a test set, as in a real-case scenario. Regarding the MLPRegressor regularization, below is my optimum setup: MLPRegressor(random_state=random_state, max_iter=400, early_stopping=True, validation_fraction=0.2, alpha=10, hidden_layer_sizes=(10,)) This means only one hidden layer with at most 10 neurons, alpha=10 for L2 regularization, and early stopping to terminate training if the validation score is not improving. I think this is a quite simple model. My final predictor is an SVR that combines 2 MLPRegressors, each one trained with different types of input data. @Sebastian You have mentioned dropout again but I could not find it in the docs: http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor Maybe you are referring to another MLPRegressor implementation? I saw another implementation of yours on GitHub a while ago. Can you clarify which one you recommend and why? Thank you both for your hints! best Thomas -- == Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tev...@pharm.uoa.gr teva...@gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/
Re: [scikit-learn] meta-estimator for multiple MLPRegressor
Sebastian and Jacob, Regarding overfitting, Lasso, Ridge regression and ElasticNet have poor performance on my data. MLPRegressors are way superior. On another note, the MLPRegressor class has some parameters to control overfitting, like the alpha parameter for the L2 regularization (maybe setting it to a high value?) or the number of neurons in the hidden layers (lowering the hidden_layer_sizes?) or even "early_stopping=True". Wouldn't these be sufficient to be on the safe side? Once more I want to highlight something I wrote previously that might have been overlooked. The resulting MLPRegressors will be applied to new datasets that *ARE VERY SIMILAR TO THE TRAINING DATA*. In other words the application of the models will be strictly confined to their applicability domain. Wouldn't that be sufficient to not worry about model overfitting too much? On 8 January 2017 at 11:53, Sebastian Raschka <se.rasc...@gmail.com> wrote: > Like to train an SVR to combine the predictions of the top 10% > MLPRegressors using the same data that were used for training of the > MLPRegressors? Wouldn't that lead to overfitting? > > > It could, but you don't need to use the same data that you used for > training to fit the meta estimator. Like it is commonly done in stacking > with cross validation, you can train the mlps on training folds and pass > predictions from a test fold to the meta estimator but then you'd have to > retrain your mlps and it sounded like you wanted to avoid that. > > I am currently on mobile and only browsed through the thread briefly, but > I agree with others that it may sound like your model(s) may have too much > capacity for such a small dataset -- can be tricky to fit the parameters > without overfitting. In any case, if you do the stacking, I'd probably > insert a k-fold cv between the mlps and the meta estimator. However I'd > definitely also recommend simpler models as > alternatives. 
> > Best, > Sebastian > > On Jan 7, 2017, at 4:36 PM, Thomas Evangelidis <teva...@gmail.com> wrote: > > > > On 7 January 2017 at 21:20, Sebastian Raschka <se.rasc...@gmail.com> > wrote: > >> Hi, Thomas, >> sorry, I overread the regression part … >> This would be a bit trickier, I am not sure what a good strategy for >> averaging regression outputs would be. However, if you just want to compute >> the average, you could do sth like >> np.mean(np.asarray([r.predict(X) for r in list_of_your_mlps])) >> >> However, it may be better to use stacking, and use the output of >> r.predict(X) as meta features to train a model based on these? >> > > Like to train an SVR to combine the predictions of the top 10% > MLPRegressors using the same data that were used for training of the > MLPRegressors? Wouldn't that lead to overfitting? > > > >> >> Best, >> Sebastian >> >> > On Jan 7, 2017, at 1:49 PM, Thomas Evangelidis <teva...@gmail.com> >> wrote: >> > >> > Hi Sebastian, >> > >> > Thanks, I will try it in another classification problem I have. >> However, this time I am using regressors not classifiers. >> > >> > On Jan 7, 2017 19:28, "Sebastian Raschka" <se.rasc...@gmail.com> wrote: >> > Hi, Thomas, >> > >> > the VotingClassifier can combine different models per majority voting >> amongst their predictions. Unfortunately, it refits the classifiers though >> (after cloning them). I think we implemented it this way to make it >> compatible with GridSearch and so forth. However, I have a version of the >> estimator that you can initialize with “refit=False” to avoid refitting if >> it helps. 
http://rasbt.github.io/mlxtend/user_guide/classifier/Ensembl >> eVoteClassifier/#example-5-using-pre-fitted-classifiers >> > >> > Best, >> > Sebastian >> > >> > >> > >> > > On Jan 7, 2017, at 11:15 AM, Thomas Evangelidis <teva...@gmail.com> >> wrote: >> > > >> > > Greetings, >> > > >> > > I have trained many MLPRegressors using different random_state value >> and estimated the R^2 using cross-validation. Now I want to combine the top >> 10% of them in how to get more accurate predictions. Is there a >> meta-estimator that can get as input a few precomputed MLPRegressors and >> give consensus predictions? Can the BaggingRegressor do this job using >> MLPRegressors as input? >> > > >> > > Thanks in advance for any hint. >> > > Thomas >> > > >> > > >> > > -- >> > > ===
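[A minimal sketch of the stacking-with-CV scheme discussed above, on synthetic data; the sizes, hyperparameters, and number of base MLPs are illustrative assumptions, not the original setup. Out-of-fold predictions from several MLPRegressors become meta-features for an SVR, so the meta-estimator never sees predictions made on each MLP's own training samples.]

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

# Synthetic stand-in data; sizes and hyperparameters are illustrative
X, y = make_regression(n_samples=60, n_features=20, noise=0.5, random_state=0)

# Base learners: small, strongly regularized MLPs (high alpha = more L2)
mlps = [MLPRegressor(solver='lbfgs', hidden_layer_sizes=(10,), alpha=1.0,
                     max_iter=2000, random_state=s) for s in range(3)]

# Out-of-fold predictions become meta-features, so the SVR is never fit
# on predictions the MLPs made for their own training samples
meta_X = np.column_stack([cross_val_predict(m, X, y, cv=5) for m in mlps])
meta = SVR().fit(meta_X, y)

# For new data: refit the base MLPs on everything, stack their predictions
for m in mlps:
    m.fit(X, y)
consensus = meta.predict(np.column_stack([m.predict(X) for m in mlps]))
print(consensus.shape)  # (60,)
```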
Re: [scikit-learn] meta-estimator for multiple MLPRegressor
On 8 January 2017 at 00:04, Jacob Schreiber <jmschreibe...@gmail.com> wrote: > If you have such a small number of observations (with a much higher > feature space) then why do you think you can accurately train not just a > single MLP, but an ensemble of them without overfitting dramatically? > > > Because the observations in the data set don't differ much from one another. To be more specific, the data set consists of a congeneric series of organic molecules, and the observation is their binding strength to a target protein. The idea was to train predictors that can predict the binding strength of new molecules belonging to the same congeneric series. Therefore, special care is taken to apply the predictors to the right domain of applicability. According to the literature, the same strategy has been followed several times in the past. The novelty of my approach stems from other factors that are irrelevant to this thread. -- ====== Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tev...@pharm.uoa.gr teva...@gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ ___ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
Re: [scikit-learn] meta-estimator for multiple MLPRegressor
Hi Sebastian, Thanks, I will try it in another classification problem I have. However, this time I am using regressors not classifiers. On Jan 7, 2017 19:28, "Sebastian Raschka" <se.rasc...@gmail.com> wrote: > Hi, Thomas, > > the VotingClassifier can combine different models per majority voting > amongst their predictions. Unfortunately, it refits the classifiers though > (after cloning them). I think we implemented it this way to make it > compatible to GridSearch and so forth. However, I have a version of the > estimator that you can initialize with “refit=False” to avoid refitting if > it helps. http://rasbt.github.io/mlxtend/user_guide/classifier/ > EnsembleVoteClassifier/#example-5-using-pre-fitted-classifiers > > Best, > Sebastian > > > > > On Jan 7, 2017, at 11:15 AM, Thomas Evangelidis <teva...@gmail.com> > wrote: > > > > Greetings, > > > > I have trained many MLPRegressors using different random_state value and > estimated the R^2 using cross-validation. Now I want to combine the top 10% > of them in how to get more accurate predictions. Is there a meta-estimator > that can get as input a few precomputed MLPRegressors and give consensus > predictions? Can the BaggingRegressor do this job using MLPRegressors as > input? > > > > Thanks in advance for any hint. > > Thomas > > > > > > -- > > == > > Thomas Evangelidis > > Research Specialist > > CEITEC - Central European Institute of Technology > > Masaryk University > > Kamenice 5/A35/1S081, > > 62500 Brno, Czech Republic > > > > email: tev...@pharm.uoa.gr > > teva...@gmail.com > > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > > > ___ > > scikit-learn mailing list > > scikit-learn@python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > ___ > scikit-learn mailing list > scikit-learn@python.org > https://mail.python.org/mailman/listinfo/scikit-learn > ___ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
[scikit-learn] meta-estimator for multiple MLPRegressor
Greetings, I have trained many MLPRegressors using different random_state values and estimated the R^2 using cross-validation. Now I want to combine the top 10% of them to get more accurate predictions. Is there a meta-estimator that can take as input a few precomputed MLPRegressors and give consensus predictions? Can the BaggingRegressor do this job using MLPRegressors as input? Thanks in advance for any hint. Thomas -- == Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tev...@pharm.uoa.gr teva...@gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/
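[Without a dedicated meta-estimator, a plain consensus over pre-fitted regressors is only a few lines; the data and settings below are synthetic and illustrative. Note that axis=0 is essential so the mean is taken across estimators per sample, not across all values at once.]

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in for the cross-validated ensemble
X, y = make_regression(n_samples=50, n_features=10, noise=0.3, random_state=1)

# Stand-in for the "top 10%": several pre-fitted MLPs with different seeds
mlps = [MLPRegressor(solver='lbfgs', hidden_layer_sizes=(8,), alpha=0.5,
                     max_iter=2000, random_state=s).fit(X, y)
        for s in range(5)]

# Consensus = per-sample mean over estimators; without axis=0,
# np.mean would collapse everything to a single scalar
consensus = np.mean([m.predict(X) for m in mlps], axis=0)
print(consensus.shape)  # (50,)
```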
Re: [scikit-learn] combining arrays of features to train an MLP
Thank you, these articles discuss ML applications of the types of fingerprints I am working with! I will read them thoroughly to get some hints. In the meantime I tried to eliminate some features using RandomizedLasso, and the performance jumped from R=0.067 using all 615 features to R=0.524 using only the 15 top-ranked features. Naive question: does it make sense to use RandomizedLasso to select the good features in order to train an MLP? I had the impression that RandomizedLasso uses multivariate linear regression to fit the observed values to the experimental ones and rank the features. Another question: this dataset consists of 31 observations. The Pearson's R values that I reported above were calculated using cross-validation. Could someone claim that they are inaccurate because the number of features used for training the MLP is much larger than the number of observations? On 19 December 2016 at 23:42, Sebastian Raschka <se.rasc...@gmail.com> wrote: > Oh, sorry, I just noticed that I was in the wrong thread — meant to answer a > different Thomas :P. > > Regarding the fingerprints; scikit-learn’s estimators expect feature > vectors as samples, so you can’t have a 3D array … e.g., think of image > classification: here you also unroll the n_pixels times m_pixels array into > 1D arrays. > > The low performance can have multiple causes. In case dimensionality is an > issue, I’d maybe try stronger regularization first, or feature selection. > If you are working with molecular structures, and you have enough of them, > maybe also consider alternative feature representations, e.g., learning > from the graphs directly: > > http://papers.nips.cc/paper/5954-convolutional-networks-on-graphs-for-learning-molecular-fingerprints.pdf > http://pubs.acs.org/doi/abs/10.1021/ci400187y > > Best, > Sebastian > > > > On Dec 19, 2016, at 4:56 PM, Thomas Evangelidis <teva...@gmail.com> > wrote: > > > > this means that both are feasible? 
> > > > On 19 December 2016 at 18:17, Sebastian Raschka <se.rasc...@gmail.com> > wrote: > > Thanks, Thomas, that makes sense! Will submit a PR then to update the > docstring. > > > > Best, > > Sebastian > > > > > > > On Dec 19, 2016, at 11:06 AM, Thomas Evangelidis <teva...@gmail.com> > wrote: > > > > > > > > > Greetings, > > > > > > My dataset consists of objects which are characterised by their > structural features which are encoded into a so called "fingerprint" form. > There are several different types of fingerprints, each one encapsulating > different type of information. I want to combine two specific types of > fingerprints to train a MLP regressor. The first fingerprint consists of a > 2048 bit array of the form: > > > > > > FP1 = array([ 1., 1., 0., ..., 0., 0., 1.], dtype=float32) > > > > > > The second is a 60 float number array of the form: > > > > > > FP2 = array([ 2.77494618, 0.98973243, 0.34638652, 2.88303715, > 1.31473857, > > >-0.56627112, 4.78847547, 2.29587913, -0.6786228 , 4.63391109, > > >... > > > 0., 0., 5.89652792, 0., 0. > ]) > > > > > > At first I tried to fuse them into a single 1D array of 2048+60 > columns but the predictions of the MLP were worse than the 2 different MLP > models trained from one of the 2 fingerprint types individually. My > question: is there a more effective way to combine the 2 fingerprints in > order to indicate that they represent different type of information? 
> > > > > > To this end, I tried to create a 2-row array (1st row 2048 elements > and 2nd row 60 elements) but sklearn complained: > > > > > > mlp.fit(x_train,y_train) > > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_ > network/multilayer_perceptron.py", line 618, in fit > > > return self._fit(X, y, incremental=False) > > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_ > network/multilayer_perceptron.py", line 330, in _fit > > > X, y = self._validate_input(X, y, incremental) > > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_ > network/multilayer_perceptron.py", line 1264, in _validate_input > > > multi_output=True, y_numeric=True) > > > File > > > "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", > line 521, in check_X_y > > > ensure_min_features, warn_on_dtype, estimator) > > > File > > > "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validati
Re: [scikit-learn] combining arrays of features to train an MLP
this means that both are feasible? On 19 December 2016 at 18:17, Sebastian Raschka <se.rasc...@gmail.com> wrote: > Thanks, Thomas, that makes sense! Will submit a PR then to update the > docstring. > > Best, > Sebastian > > > > On Dec 19, 2016, at 11:06 AM, Thomas Evangelidis <teva...@gmail.com> > wrote: > > > > > > Greetings, > > > > My dataset consists of objects which are characterised by their > structural features which are encoded into a so called "fingerprint" form. > There are several different types of fingerprints, each one encapsulating > different type of information. I want to combine two specific types of > fingerprints to train a MLP regressor. The first fingerprint consists of a > 2048 bit array of the form: > > > > FP1 = array([ 1., 1., 0., ..., 0., 0., 1.], dtype=float32) > > > > The second is a 60 float number array of the form: > > > > FP2 = array([ 2.77494618, 0.98973243, 0.34638652, 2.88303715, > 1.31473857, > >-0.56627112, 4.78847547, 2.29587913, -0.6786228 , 4.63391109, > >... > > 0., 0., 5.89652792, 0., 0.]) > > > > At first I tried to fuse them into a single 1D array of 2048+60 columns > but the predictions of the MLP were worse than the 2 different MLP models > trained from one of the 2 fingerprint types individually. My question: is > there a more effective way to combine the 2 fingerprints in order to > indicate that they represent different type of information? 
> > > > To this end, I tried to create a 2-row array (1st row 2048 elements and > 2nd row 60 elements) but sklearn complained: > > > > mlp.fit(x_train,y_train) > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_ > network/multilayer_perceptron.py", line 618, in fit > > return self._fit(X, y, incremental=False) > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_ > network/multilayer_perceptron.py", line 330, in _fit > > X, y = self._validate_input(X, y, incremental) > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_ > network/multilayer_perceptron.py", line 1264, in _validate_input > > multi_output=True, y_numeric=True) > > File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", > line 521, in check_X_y > > ensure_min_features, warn_on_dtype, estimator) > > File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", > line 402, in check_array > > array = array.astype(np.float64) > > ValueError: setting an array element with a sequence. > > > > > > Then I tried to create for each object of the dataset a 2D array of > size 2x2048, by adding 1998 zeros in the second row in order both rows to > be of equal size. 
However sklearn complained again: > > > > > > mlp.fit(x_train,y_train) > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_ > network/multilayer_perceptron.py", line 618, in fit > > return self._fit(X, y, incremental=False) > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_ > network/multilayer_perceptron.py", line 330, in _fit > > X, y = self._validate_input(X, y, incremental) > > File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_ > network/multilayer_perceptron.py", line 1264, in _validate_input > > multi_output=True, y_numeric=True) > > File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", > line 521, in check_X_y > > ensure_min_features, warn_on_dtype, estimator) > > File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", > line 405, in check_array > > % (array.ndim, estimator_name)) > > ValueError: Found array with dim 3. Estimator expected <= 2. > > > > > > In another case of fingerprints, lets name them FP3 and FP4, I observed > that the MLP regressor created using FP3 yields better results when trained > and evaluated using logarithmically transformed experimental values (the > values in y_train and y_test 1D arrays), while the MLP regressor created > using FP4 yielded better results using the original experimental values. So > my second question is: when combining both FP3 and FP4 into a single array > is there any way to designate to the MLP that the features that correspond > to FP3 must reproduce the logarithmic transform of the experimental values > while the features of FP4 the original untransf
[scikit-learn] combining arrays of features to train an MLP
Greetings, My dataset consists of objects which are characterised by structural features encoded into a so-called "fingerprint" form. There are several different types of fingerprints, each one encapsulating a different type of information. I want to combine two specific types of fingerprints to train an MLP regressor. The first fingerprint is a 2048-bit array of the form: FP1 = array([ 1., 1., 0., ..., 0., 0., 1.], dtype=float32) The second is a 60-float-number array of the form: FP2 = array([ 2.77494618, 0.98973243, 0.34638652, 2.88303715, 1.31473857, -0.56627112, 4.78847547, 2.29587913, -0.6786228 , 4.63391109, ... 0., 0., 5.89652792, 0., 0.]) At first I tried to fuse them into a single 1D array of 2048+60 columns, but the predictions of the MLP were worse than those of the 2 different MLP models trained from either of the 2 fingerprint types individually. My question: is there a more effective way to combine the 2 fingerprints in order to indicate that they represent different types of information? 
To this end, I tried to create a 2-row array (1st row 2048 elements and 2nd row 60 elements) but sklearn complained: mlp.fit(x_train,y_train) > File > "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", > line 618, in fit > return self._fit(X, y, incremental=False) > File > "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", > line 330, in _fit > X, y = self._validate_input(X, y, incremental) > File > "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", > line 1264, in _validate_input > multi_output=True, y_numeric=True) > File > "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line > 521, in check_X_y > ensure_min_features, warn_on_dtype, estimator) > File > "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line > 402, in check_array > array = array.astype(np.float64) > ValueError: setting an array element with a sequence. > Then I tried to create for each object of the dataset a 2D array of size 2x2048, by adding 1998 zeros in the second row in order both rows to be of equal size. 
However sklearn complained again: mlp.fit(x_train,y_train) > File > "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", > line 618, in fit > return self._fit(X, y, incremental=False) > File > "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", > line 330, in _fit > X, y = self._validate_input(X, y, incremental) > File > "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", > line 1264, in _validate_input > multi_output=True, y_numeric=True) > File > "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line > 521, in check_X_y > ensure_min_features, warn_on_dtype, estimator) > File > "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line > 405, in check_array > % (array.ndim, estimator_name)) > ValueError: Found array with dim 3. Estimator expected <= 2. In another case of fingerprints, lets name them FP3 and FP4, I observed that the MLP regressor created using FP3 yields better results when trained and evaluated using logarithmically transformed experimental values (the values in y_train and y_test 1D arrays), while the MLP regressor created using FP4 yielded better results using the original experimental values. So my second question is: when combining both FP3 and FP4 into a single array is there any way to designate to the MLP that the features that correspond to FP3 must reproduce the logarithmic transform of the experimental values while the features of FP4 the original untransformed experimental values? I would greatly appreciate any advice on any of my 2 queries. 
Thomas -- == Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tev...@pharm.uoa.gr teva...@gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ ___ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
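[A sketch of the layout scikit-learn actually expects for the combined fingerprints, on hypothetical stand-in data (shapes match the thread, values do not). Estimators take a single 2D matrix of shape (n_samples, n_features), so the two fingerprints are concatenated column-wise per sample; the 2-row and 3D layouts tried above are what trigger those ValueErrors.]

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n_samples = 20  # illustrative; not the original dataset size

# Hypothetical stand-ins for the two fingerprint types
FP1 = rng.integers(0, 2, size=(n_samples, 2048)).astype(np.float32)  # bit fingerprint
FP2 = rng.normal(size=(n_samples, 60))                               # float descriptors

# One row per sample, fingerprints side by side as columns
X = np.hstack([FP1, FP2])
y = rng.normal(size=n_samples)

MLPRegressor(solver='lbfgs', hidden_layer_sizes=(10,), max_iter=500,
             random_state=0).fit(X, y)
print(X.shape)  # (20, 2108)
```

Since FP1 is 0/1 while FP2 values span a much larger range, scaling the columns (e.g. StandardScaler on the FP2 block) may help the MLP weigh the two blocks more evenly.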
Re: [scikit-learn] NuSVC and ValueError: specified nu is infeasible
It finally works with nu=0.01 or less and the predictions are good. Is there a problem with that? On 8 December 2016 at 12:57, Thomas Evangelidis <teva...@gmail.com> wrote: > > >> >> @Thomas >> I still think the optimization problem is not feasible due to your data. >> Have you tried balancing the dataset as I mentioned in your other >> question regarding the >> >> MLPClassifier? >> >> >> > Hi Piotr, > > I had tried all the balancing algorithms in the link that you stated, but > the only one that really offered some improvement was the SMOTE > over-sampling of positive observations. The original dataset contained 24 > positive and 1230 negative but after SMOTE I doubled the positive to 48. > Reduction of the negative observations led to poor predictions, at least > using random forests. I haven't tried it with > > MLPClassifier yet though. > > > > -- == Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tev...@pharm.uoa.gr teva...@gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ ___ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
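[A likely explanation for why nu=0.01 converges while larger values do not, assuming the standard libsvm feasibility condition for nu-SVC: nu cannot exceed 2 * min(n_pos, n_neg) / n. With the 48-positive / 1230-negative split reported earlier in the thread, the bound is about 0.075, so every tried value from 0.1 upward is infeasible while 0.01 is fine; nothing is wrong with using it.]

```python
from sklearn.datasets import make_classification
from sklearn.svm import NuSVC

# Feasibility condition for nu-SVC (libsvm): nu <= 2 * min(n_pos, n_neg) / n
n_pos, n_neg = 48, 1230
nu_max = 2 * min(n_pos, n_neg) / (n_pos + n_neg)
print(round(nu_max, 3))  # 0.075 -> nu=0.1 infeasible, nu=0.01 feasible

# Illustrative imbalanced dataset (synthetic, not the thread's data)
X, y = make_classification(n_samples=n_pos + n_neg,
                           weights=[n_neg / (n_pos + n_neg)],
                           random_state=0)
clf = NuSVC(nu=0.01).fit(X, y)   # converges: 0.01 < nu_max
# NuSVC(nu=0.1).fit(X, y)        # would raise "specified nu is infeasible"
```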
Re: [scikit-learn] no positive predictions by neural_network.MLPClassifier
Hello Sebastian, I normalized my training set and used the same mean and stdev values to normalize my test set, instead of calculating means and stdevs from the test set. I did that because my training set size is finite and the value of each feature is a descriptor that is characteristic of the 3D shape of the observation. The test set would definitely have different mean and stdev values from the training set, and if I had used them to normalize it then I believe I would have distorted the original descriptor values. Anyway, after this normalization I don't get 0 positive predictions anymore from the MLPClassifier. I still don't understand your second suggestion. I cannot find any parameter to control the epoch or measure the cost in sklearn.neural_network.MLPClassifier. Do you suggest using your own classes from GitHub instead? Besides that, my goal is not to make one MLPClassifier using a specific training set, but rather to write a program that can take various training sets as input and train a neural network that will classify a given test set. Therefore, unless I misunderstood your points, working with 3 arbitrary random_state values on my current training set in order to find one value that yields good predictions won't solve my problem. Best, Thomas On 8 December 2016 at 01:19, Sebastian Raschka <se.rasc...@gmail.com> wrote: > Hi, Thomas, > we had a related thread on the email list some time ago, let me post it > for reference further below. Regarding your question, I think you may want to > make sure that you standardized the features (which generally makes the learning > less sensitive to learning rate and random weight > initialization). However, even then, I would try at least 1-3 different > random seeds and look at the cost vs time — what can happen is that you > land in different minima depending on the weight initialization, as > demonstrated in the example below (in MLPs you have the problem of a > complex cost surface). 
> > Best, > Sebastian > > The default is 100 units in the hidden layer, but theoretically, it > should work with 2 hidden logistic units (I think that’s the typical > textbook/class example). I think what happens is that it gets stuck in > local minima depending on the random weight initialization. E.g., the > following works just fine: > > from sklearn.neural_network import MLPClassifier > X = [[0, 0], [0, 1], [1, 0], [1, 1]] > y = [0, 1, 1, 0] > clf = MLPClassifier(solver='lbfgs', > activation='logistic', > alpha=0.0, > hidden_layer_sizes=(2,), > learning_rate_init=0.1, > max_iter=1000, > random_state=20) > clf.fit(X, y) > res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]]) > print(res) > print(clf.loss_) > > > but changing the random seed to 1 leads to: > > [0 1 1 1] > 0.34660921283 > > For comparison, I used a more vanilla MLP (1 hidden layer with 2 units and > logistic activation as well; https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch12/ch12.ipynb), > essentially resulting in the same problem: > > [cost-vs-epoch plots from the original email omitted] > > On Dec 7, 2016, at 6:45 PM, Thomas Evangelidis <teva...@gmail.com> wrote: > > I tried the sklearn.neural_network.MLPClassifier with the default > parameters using the input data I quoted in my previous post about > Nu-Support Vector Classifier. The predictions are great but the problem is > that sometimes when I rerun the MLPClassifier it predicts no positive > observations (class 1). I have noticed that this can be controlled by the > random_state parameter, e.g. MLPClassifier(random_state=0) always gives no > positive predictions. My question is how can I choose the right random_state > value in a real blind test case? 
> > thanks in advance > Thomas > > > -- > == > Thomas Evangelidis > Research Specialist > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/1S081, > 62500 Brno, Czech Republic > > email: tev...@pharm.uoa.gr > teva...@gmail.com > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > ___ > scikit-learn mailing list > scikit-learn@python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > ___ > scikit-learn mailing list > scikit-learn@python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- == Thomas Evange
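[The train-statistics normalization described at the top of this message maps directly onto StandardScaler; the shapes below are synthetic and illustrative. The scaler is fit on the training set only, and the stored means and stdevs are reused for the test set, exactly as described.]

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 4))
X_test = rng.normal(loc=5.5, scale=2.0, size=(30, 4))  # shifted distribution

# Fit on the training set only; reuse its means/stdevs for the test set
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)   # no refitting on test data

# Training columns are centered; test columns keep their relative shift
print(np.allclose(X_train_std.mean(axis=0), 0.0))  # True
```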
Re: [scikit-learn] NuSVC and ValueError: specified nu is infeasible
> > > @Thomas > I still think the optimization problem is not feasible due to your data. > Have you tried balancing the dataset as I mentioned in your other question > regarding the > > MLPClassifier? > > > Hi Piotr, I had tried all the balancing algorithms in the link that you stated, but the only one that really offered some improvement was the SMOTE over-sampling of positive observations. The original dataset contained 24 positive and 1230 negative but after SMOTE I doubled the positive to 48. Reduction of the negative observations led to poor predictions, at least using random forests. I haven't tried it with MLPClassifier yet though. ___ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
Re: [scikit-learn] NuSVC and ValueError: specified nu is infeasible
Hi Piotr, the SVC performs quite well, slightly better than random forests on the same data. By training error do you mean this? clf = svm.SVC(probability=True) clf.fit(train_list_resampled3, train_activity_list_resampled3) print "training error=", clf.score(train_list_resampled3, train_activity_list_resampled3) If this is what you mean by "skip the sample_weights": clf = svm.NuSVC(probability=True) clf.fit(train_list_resampled3, train_activity_list_resampled3, sample_weight=None) then no, it does not converge. After all "sample_weight=None" is the default value. I am out of ideas about what may be the problem. Thomas On 8 December 2016 at 08:56, Piotr Bialecki <piotr.biale...@hotmail.de> wrote: > Hi Thomas, > > the doc says, that nu gives an upper bound on the fraction of training > errors and a lower bound of the fractions > of support vectors. > http://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html > > Therefore, it acts as a hard bound on the allowed misclassification on > your dataset. > > To me it seems as if the error bound is not feasible. > How well did the SVC perform? What was your training error there? > > Will the NuSVC converge when you skip the sample_weights? > > > Greets, > Piotr > > > On 08.12.2016 00:07, Thomas Evangelidis wrote: > > Greetings, > > I want to use the Nu-Support Vector Classifier with the following input > data: > > X= [ > array([ 3.90387012, 1.60732281, -0.33315799, 4.02770896, > 1.82337731, -0.74007214, 6.75989219, 3.68538903, > .. > 0., 11.64276776, 0., 0.]), > array([ 3.36856769e+00, 1.48705816e+00, 4.28566992e-01, > 3.35622071e+00, 1.64046508e+00, 5.66879661e-01, > . > 4.25335335e+00, 1.96508829e+00, 8.63453394e-06]), > array([ 3.74986249e+00, 1.69060713e+00, -5.09921270e-01, > 3.76320781e+00, 1.67664455e+00, -6.21126735e-01, > .. > 4.16700259e+00, 1.88688784e+00, 7.34729942e-06]), > ... 
> ] > > and > > Y= [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, > 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0, 0, 0, 0] > > >> Each array of X contains 60 numbers and the dataset consists of 48 >> positive and 1230 negative observations. When I train an svm.SVC() >> classifier I get quite good predictions, but wit the svm.NuSVC() I keep >> getting the following error no matter which value of nu in [0.1, ..., 0.9, >> 0.99, 0.999, 0.] I try: >> /usr/local/lib/python2.7/dist-packages/sklearn/svm/base.pyc in fit(self, >> X, y, sample_weight) >> 187 >> 188 seed = rnd.randint(np.iinfo('i').max) >> --> 189 fit(X, y, sample_weight, solver_type, kernel, >> random_seed=seed) >> 190 # see comment on the other call to np.iinfo in this file >> 191 >> /usr/local/lib/python2.7/dist-packages/sklearn/svm/base.pyc in >> _dense_fit(self, X, y, sample_weight, solver_type, kernel, random_seed) >> 254 cache_size=self.cache_size, coef0=self.coef0, >> 255 gamma=self._gamma, epsilon=self.epsilon, >> --> 256 max_iter=self.max_iter, random_seed=random_seed) >> 257 >> 258 self._warn_from_fit_status() >> /usr/local/lib/python2.7/dist-packages/sklearn/svm/libsvm.so in >> sklearn.svm.libsvm.fit (sklearn/svm/libsvm.c:2501)() >> ValueError: specified nu is infeasible > > > > Does anyone know what might be wrong? Could it be the input data? 
> > thanks in advance for any advice > Thomas > > > > -- > > == > > Thomas Evangelidis > > Research Specialist > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/1S081, > 62500 Brno, Czech Republic > > email: tev...@pharm.uoa.gr > > teva...@gmail.com > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > ___ > scikit-learn mailing > listscikit-learn@python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > ___ > scikit-learn mai
[scikit-learn] no positive predictions by neural_network.MLPClassifier
I tried the sklearn.neural_network.MLPClassifier with the default parameters using the input data I quoted in my previous post about the Nu-Support Vector Classifier. The predictions are great, but the problem is that sometimes when I rerun the MLPClassifier it predicts no positive observations (class 1). I have noticed that this can be controlled by the random_state parameter, e.g. MLPClassifier(random_state=0) always gives no positive predictions. My question is how can I choose the right random_state value in a real blind test case? Thanks in advance, Thomas -- == Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tev...@pharm.uoa.gr teva...@gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/
[scikit-learn] NuSVC and ValueError: specified nu is infeasible
Greetings, I want to use the Nu-Support Vector Classifier with the following input data: X= [ array([ 3.90387012, 1.60732281, -0.33315799, 4.02770896, 1.82337731, -0.74007214, 6.75989219, 3.68538903, .. 0., 11.64276776, 0., 0.]), array([ 3.36856769e+00, 1.48705816e+00, 4.28566992e-01, 3.35622071e+00, 1.64046508e+00, 5.66879661e-01, . 4.25335335e+00, 1.96508829e+00, 8.63453394e-06]), array([ 3.74986249e+00, 1.69060713e+00, -5.09921270e-01, 3.76320781e+00, 1.67664455e+00, -6.21126735e-01, .. 4.16700259e+00, 1.88688784e+00, 7.34729942e-06]), ... ] and Y= [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] > Each array of X contains 60 numbers and the dataset consists of 48 > positive and 1230 negative observations. When I train an svm.SVC() > classifier I get quite good predictions, but with the svm.NuSVC() I keep > getting the following error no matter which value of nu in [0.1, ..., 0.9, > 0.99, 0.999, 0.] 
I try: > /usr/local/lib/python2.7/dist-packages/sklearn/svm/base.pyc in fit(self, > X, y, sample_weight) > 187 > 188 seed = rnd.randint(np.iinfo('i').max) > --> 189 fit(X, y, sample_weight, solver_type, kernel, > random_seed=seed) > 190 # see comment on the other call to np.iinfo in this file > 191 > /usr/local/lib/python2.7/dist-packages/sklearn/svm/base.pyc in > _dense_fit(self, X, y, sample_weight, solver_type, kernel, random_seed) > 254 cache_size=self.cache_size, coef0=self.coef0, > 255 gamma=self._gamma, epsilon=self.epsilon, > --> 256 max_iter=self.max_iter, random_seed=random_seed) > 257 > 258 self._warn_from_fit_status() > /usr/local/lib/python2.7/dist-packages/sklearn/svm/libsvm.so in > sklearn.svm.libsvm.fit (sklearn/svm/libsvm.c:2501)() > ValueError: specified nu is infeasible Does anyone know what might be wrong? Could it be the input data? thanks in advance for any advice Thomas -- ========== Thomas Evangelidis Research Specialist CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/1S081, 62500 Brno, Czech Republic email: tev...@pharm.uoa.gr teva...@gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ ___ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
Re: [scikit-learn] random forests using grouped data
Sorry, the previous email was incomplete. Below is what the grouped data look like:

Group1:
score1 = [0.56, 0.34, 0.42, 0.12, 0.08, 0.21, ...]
score2 = [0.34, 0.27, 0.24, 0.05, 0.13, 0.14, ...]
y = [1, 1, 1, 0, 0, 0, ...]  # 1 indicates "active" and 0 "inactive"

Group2:
score1 = [0.34, 0.38, 0.48, 0.18, 0.12, 0.19, ...]
score2 = [0.28, 0.41, 0.34, 0.13, 0.09, 0.1, ...]
y = [1, 1, 1, 0, 0, 0, ...]  # 1 indicates "active" and 0 "inactive"

...

Group24:
score1 = [0.67, 0.54, 0.59, 0.23, 0.24, 0.08, ...]
score2 = [0.41, 0.31, 0.28, 0.23, 0.18, 0.22, ...]
y = [1, 1, 1, 0, 0, 0, ...]  # 1 indicates "active" and 0 "inactive"

On 1 December 2016 at 14:01, Thomas Evangelidis <teva...@gmail.com> wrote:

> Greetings,
>
> I have grouped data which are divided into actives and inactives. The
> features are two different types of normalized scores (0-1), where the
> higher the score, the more probable an observation is to be an "active".
> My data look like this:
>
> Group1:
> score1 = [0.56, 0.34, 0.42, 0.12, 0.08, 0.21, ...]
> score2 = [
> y = [1, 1, 1, 0, 0, 0, ...]
>
> Group2:
> score1 = [0
> score2 = [
> y = [1, 1, 1, 1, 1]
>
> ..
> Group24:
> score1 = [0
> score2 = [
> y = [1, 1, 1, 1, 1]
>
> I searched in the documentation about the treatment of grouped data, but
> the only thing I found was how to do cross-validation. My question is
> whether there is any special algorithm that creates random forests from
> this type of grouped data.
> Thanks in advance,
> Thomas
[scikit-learn] random forests using grouped data
Greetings,

I have grouped data which are divided into actives and inactives. The features are two different types of normalized scores (0-1), where the higher the score, the more probable an observation is to be an "active". My data look like this:

Group1:
score1 = [0.56, 0.34, 0.42, 0.12, 0.08, 0.21, ...]
score2 = [
y = [1, 1, 1, 0, 0, 0, ...]

Group2:
score1 = [0
score2 = [
y = [1, 1, 1, 1, 1]

..
Group24:
score1 = [0
score2 = [
y = [1, 1, 1, 1, 1]

I searched in the documentation about the treatment of grouped data, but the only thing I found was how to do cross-validation. My question is whether there is any special algorithm that creates random forests from this type of grouped data.

Thanks in advance,
Thomas
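[Editor's note] scikit-learn does not ship a random-forest variant that treats groups specially; the usual workaround is to pool all groups into one feature matrix, keep a parallel `groups` array, and use GroupKFold so no group is split between training and validation. A minimal sketch, reusing the Group1/Group2 numbers from the email (the y labels and group sizes are illustrative only):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

# Pool the per-group score vectors into one (n_samples, 2) matrix.
score1 = np.array([0.56, 0.34, 0.42, 0.12, 0.08, 0.21,   # Group1
                   0.34, 0.38, 0.48, 0.18, 0.12, 0.19])  # Group2
score2 = np.array([0.34, 0.27, 0.24, 0.05, 0.13, 0.14,
                   0.28, 0.41, 0.34, 0.13, 0.09, 0.10])
X = np.column_stack([score1, score2])
y = np.array([1, 1, 1, 0, 0, 0,    # 1 = active, 0 = inactive
              1, 1, 1, 0, 0, 0])
groups = np.array([1] * 6 + [2] * 6)   # which group each sample came from

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# GroupKFold keeps each group intact: here, train on one group,
# validate on the other.
scores = cross_val_score(clf, X, y, groups=groups, cv=GroupKFold(n_splits=2))
print(scores)
```

This does not make the forest itself group-aware, but it does make the validation respect the group boundaries, which is what the cross-validation docs mentioned in the email are about.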
[scikit-learn] suggested classification algorithm
Greetings,

I want to design a program that can deal with classification problems of the same type, where the number of positive observations is small but the number of negative observations is much larger. In concrete numbers, the number of positive observations would usually range between 2 and 20, and the number of negative observations would be at least 30 times larger. The number of features could also be between 2 and 20, but that could be reduced using feature selection and elimination algorithms.

I've read in the documentation that some algorithms, like the SVM, are still effective when the number of dimensions is greater than the number of samples, but I am not sure if they are suitable for my case. Moreover, according to this figure, Nearest Neighbors is the best and RBF SVM is second:

http://scikit-learn.org/stable/_images/sphx_glr_plot_classifier_comparison_001.png

However, I assume that Nearest Neighbors would not be effective in my case, where the number of positive observations is very low. For these reasons I would like to know your expert opinion about which classification algorithm I should try first.

Thanks in advance,
Thomas
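[Editor's note] For this kind of heavy imbalance (a handful of positives against ~30x as many negatives), one common first try in scikit-learn is an RBF SVC with class_weight='balanced', which scales each class's penalty inversely to its frequency, scored with balanced accuracy rather than plain accuracy. A minimal sketch on synthetic data (the sample counts, feature count, and Gaussian data below are made up for illustration):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.RandomState(0)
n_pos, n_neg, n_feat = 15, 450, 10   # ~30x more negatives, as in the email

# Positives and negatives drawn from two shifted Gaussians.
X = np.vstack([rng.normal(1.0, 1.0, size=(n_pos, n_feat)),
               rng.normal(0.0, 1.0, size=(n_neg, n_feat))])
y = np.array([1] * n_pos + [0] * n_neg)

# class_weight='balanced' re-weights C per class so the 15 positives
# are not drowned out by the 450 negatives.
clf = SVC(kernel="rbf", gamma="scale", class_weight="balanced")

# StratifiedKFold keeps a few positives in every fold; balanced_accuracy
# averages per-class recall, so predicting all-negative scores only 0.5.
scores = cross_val_score(
    clf, X, y, scoring="balanced_accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
print(scores.mean())
```

With only 2-20 positives, cross-validation folds get very few positive samples each, so the variance across folds is worth watching regardless of the classifier chosen.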