Re: [scikit-learn] Inquiry on Genetic Algorithm

2022-10-30 Thread Thomas Evangelidis
Hi,

I am not aware of any *official* scikit-learn implementation of a genetic
algorithm. I program my own with DEAP, which is quite versatile:

https://deap.readthedocs.io/en/master/
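For illustration, a minimal DEAP sketch (the classic OneMax toy problem, nothing scikit-learn specific; the fitness function and all settings here are just placeholders):

import random
from deap import algorithms, base, creator, tools

creator.create("FitnessMax", base.Fitness, weights=(1.0,))    # maximize a single objective
creator.create("Individual", list, fitness=creator.FitnessMax)

toolbox = base.Toolbox()
toolbox.register("attr_bool", random.randint, 0, 1)
toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.attr_bool, 20)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("evaluate", lambda ind: (sum(ind),))          # toy fitness: count the ones
toolbox.register("mate", tools.cxTwoPoint)
toolbox.register("mutate", tools.mutFlipBit, indpb=0.05)
toolbox.register("select", tools.selTournament, tournsize=3)

pop = toolbox.population(n=50)
pop, _ = algorithms.eaSimple(pop, toolbox, cxpb=0.5, mutpb=0.2, ngen=10, verbose=False)
best = tools.selBest(pop, 1)[0]                                # best individual found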

~Thomas

On Sun, 30 Oct 2022 at 12:19, Ellarizza Fredeluces via scikit-learn <
scikit-learn@python.org> wrote:

> Dear Scikit-Learn developers,
>
> First of all, thank you for your brilliant work.
> I would like to ask if a genetic algorithm is available in scikit-learn.
> I tried to search, but I only found this one
> <https://pypi.org/project/sklearn-genetic/#:~:text=sklearn-genetic%20is%20a%20genetic,optimal%20values%20of%20a%20function.>.
> I also checked your website but
> there seems to be no genetic algorithm yet.
>
> Your reply will be highly appreciated. Thank you again.
>
> Sincerely,
> Ella
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>


-- 

==========

Dr. Thomas Evangelidis

Research Scientist

IOCB - Institute of Organic Chemistry and Biochemistry of the Czech Academy
of Sciences <https://www.uochb.cz/web/structure/31.html?lang=en>, Prague,
Czech Republic
  &
CEITEC - Central European Institute of Technology
<https://www.ceitec.eu/>, Brno,
Czech Republic

email: teva...@gmail.com, Twitter: tevangelidis
<https://twitter.com/tevangelidis>, LinkedIn: Thomas Evangelidis
<https://www.linkedin.com/in/thomas-evangelidis-495b45125/>

website: https://sites.google.com/site/thomasevangelidishomepage/
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] Maximum Mutual Information value for continuous variables

2019-11-27 Thread Thomas Evangelidis
Greetings,

I am thinking of alternative ways of removing the invariant scalar features
from my feature vectors before training MLPs. So far I have tried removing
columns with zero variance and columns with Pearson's R=1.0 or R=-1.0. If I
lower the threshold and also remove columns with |R|<1.0, the performance
drops. However, R only measures linear correlation. Now I am thinking of
removing columns with high mutual information, but first I need to normalize
it. In the documentation, under "Univariate Feature Selection", I found the
function "mutual_info_regression".

https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection

I used this function to measure the correlation between columns (features),
but it sometimes returns values >1.0. On the other hand, there is also this
function

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_mutual_info_score.html#sklearn.metrics.adjusted_mutual_info_score

which is bounded above by 1.0, but it is meant for categorical data (cluster
labels). So my question is: is there a way to compute normalized mutual
information for continuous variables, too?
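(One possible workaround, sketched here as an approximation rather than an exact answer: discretize each pair of continuous columns and compute the normalized score on the bin labels. The quantile binning and the choice of n_bins=20 are arbitrary assumptions.)

import numpy as np
from sklearn.metrics import normalized_mutual_info_score
from sklearn.preprocessing import KBinsDiscretizer

def normalized_mi(x, y, n_bins=20):
    # bin two continuous columns, then score the bin labels; result lies in [0, 1]
    est = KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy="quantile")
    xy = est.fit_transform(np.column_stack([x, y]))
    return normalized_mutual_info_score(xy[:, 0].astype(int), xy[:, 1].astype(int))

# usage on two feature columns of a matrix X:
# score = normalized_mi(X[:, i], X[:, j])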

Thanks in advance for any advice.
Thomas


-- 

==========

Dr. Thomas Evangelidis

Research Scientist

IOCB - Institute of Organic Chemistry and Biochemistry of the Czech Academy
of Sciences <https://www.uochb.cz/web/structure/31.html?lang=en>, Prague,
Czech Republic
  &
CEITEC - Central European Institute of Technology
<https://www.ceitec.eu/>, Brno,
Czech Republic

email: teva...@gmail.com, Twitter: tevangelidis
<https://twitter.com/tevangelidis>, LinkedIn: Thomas Evangelidis
<https://www.linkedin.com/in/thomas-evangelidis-495b45125/>

website: https://sites.google.com/site/thomasevangelidishomepage/
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] sample_weights in RandomForestRegressor

2018-07-15 Thread Thomas Evangelidis
​​
Hello,

I am somewhat confused about the use of the sample_weight parameter in the
fit() method of RandomForestRegressor. Here is my problem:

I am trying to predict the binding affinity of small molecules to a
protein. I have a training set of 709 molecules and a blind test set of 180
molecules. I want to find those features that are most important for the
correct prediction of the binding affinity of the 180 molecules in my
blind test set. My rationale is that if I give more emphasis to the
similar molecules in the training set, then I will get higher importances
for those features that have higher predictive ability for this specific
blind test set of 180 molecules. To this end, I weighted the 709 training-set
molecules by their maximum similarity to the 180 molecules, selected
only those features with high importance, and trained a new RF with all 709
molecules. I got some results, but I am not satisfied. Is this the right way
to use sample_weight in a random forest? I would appreciate any advice or a
suggested workflow.
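(For reference, the mechanics of the workflow described above look roughly like this; the data and the max_sim weights below are synthetic stand-ins, not the actual 709/180 molecule sets.)

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# synthetic stand-ins for the training molecules and their similarity weights
X_train, y_train = make_regression(n_samples=709, n_features=100, noise=0.5, random_state=0)
rng = np.random.default_rng(0)
max_sim = rng.uniform(0.1, 1.0, size=len(y_train))   # hypothetical max similarity to the test set

rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(X_train, y_train, sample_weight=max_sim)       # the weights enter every split criterion
top_features = np.argsort(rf.feature_importances_)[::-1][:20]   # most important features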


-- 

==

Dr Thomas Evangelidis

Post-doctoral Researcher
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/2S049,
62500 Brno, Czech Republic

email: tev...@pharm.uoa.gr

  teva...@gmail.com


website: https://sites.google.com/site/thomasevangelidishomepage/
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] custom loss function in RandomForestRegressor

2018-03-01 Thread Thomas Evangelidis
Does this generalize to any loss function? For example I also want to
implement Kendall's tau correlation coefficient and a combination of R, tau
and RMSE. :)

On Mar 1, 2018 15:49, "Sebastian Raschka" <se.rasc...@gmail.com> wrote:

> Hi, Thomas,
>
> as far as I know, it's all the same and doesn't matter, and you would get
> the same splits, since R^2 is just a rescaled MSE.
>
> Best,
> Sebastian
>
> > On Mar 1, 2018, at 9:39 AM, Thomas Evangelidis <teva...@gmail.com>
> wrote:
> >
> > Hi Sebastian,
> >
> > Going back to Pearson's R loss function, does this imply that I must add
> an abstract "init2" method to RegressionCriterion (that's where MSE class
> inherits from) where I will add the target values X as extra argument? And
> then the node impurity will be 1-R (the lowest the best)? What about the
> impurities of the left and right split? In MSE class they are (sum_i^n
> y_i)**2 where n is the number of samples in the respective split. It is not
> clear how this is related to variance in order to adapt it for my purpose.
> >
> > Best,
> > Thomas
> >
> >
> > On Mar 1, 2018 14:56, "Sebastian Raschka" <se.rasc...@gmail.com> wrote:
> > Hi, Thomas,
> >
> > in regression trees, minimizing the variance among the target values is
> equivalent to minimizing the MSE between targets and predicted values. This
> is also called variance reduction: https://en.wikipedia.org/wiki/
> Decision_tree_learning#Variance_reduction
> >
> > Best,
> > Sebastian
> >
> > > On Mar 1, 2018, at 8:27 AM, Thomas Evangelidis <teva...@gmail.com>
> wrote:
> > >
> > >
> > > Hi again,
> > >
> > > I am currently revisiting this problem after familiarizing myself with
> Cython and Scikit-Learn's code and I have a very important query:
> > >
> > > Looking at the class MSE(RegressionCriterion), the node impurity is
> defined as the variance of the target values Y on that node. The
> predictions X are nowhere involved in the computations. This contradicts my
> notion of "loss function", which quantifies the discrepancy between
> predicted and target values. Am I looking at the wrong class or what I want
> to do is just not feasible with Random Forests? For example, I would like
> to modify the RandomForestRegressor code to minimize the Pearson's R
> between predicted and target values.
> > >
> > > I thank you in advance for any clarification.
> > > Thomas
> > >
> > >
> > >
> > >
> > > On 02/15/2018 01:28 PM, Guillaume Lemaitre wrote:
> > >> Yes you are right pxd are the header and pyx the definition. You need
> to write a class as MSE. Criterion is an abstract class or base class (I
> don't have it under the eye)
> > >>
> > >> @Andy: if I recall the PR, we made the classes public to enable such
> custom criterion. However, ‎it is not documented since we were not
> officially supporting it. So this is an hidden feature. We could always
> discuss to make this feature more visible and document it.
> > >
> > >
> > >
> > >
> > >
> > > --
> > > ==
> > > Dr Thomas Evangelidis
> > > Post-doctoral Researcher
> > > CEITEC - Central European Institute of Technology
> > > Masaryk University
> > > Kamenice 5/A35/2S049,
> > > 62500 Brno, Czech Republic
> > >
> > > email: tev...@pharm.uoa.gr
> > >   teva...@gmail.com
> > >
> > > website: https://sites.google.com/site/thomasevangelidishomepage/
> > >
> > >
> > > ___
> > > scikit-learn mailing list
> > > scikit-learn@python.org
> > > https://mail.python.org/mailman/listinfo/scikit-learn
> >
> > ___
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> >
> > ___
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] custom loss function in RandomForestRegressor

2018-03-01 Thread Thomas Evangelidis
Hi Sebastian,

Going back to the Pearson's R loss function, does this imply that I must add
an abstract "init2" method to RegressionCriterion (the class that MSE
inherits from), where I add the target values X as an extra argument? And
then the node impurity would be 1-R (the lower the better)? What about the
impurities of the left and right splits? In the MSE class they are of the
form (sum_i^n y_i)**2, where n is the number of samples in the respective
split. It is not clear how this is related to the variance, in order to
adapt it for my purpose.
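(On the last point, here is a small NumPy check, not scikit-learn's Cython code, of how a per-child quantity of the form (sum_i y_i)**2 / n relates to the child variances; this is only my reading of the MSE criterion, so treat it as an illustration.)

import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=50)
left, right = y[:20], y[20:]                      # an arbitrary split of a node

# weighted child impurity that the split search wants to minimize
weighted_mse = len(left) * left.var() + len(right) * right.var()

# proxy that can be maximized instead: sum(y**2) over the parent node is
# constant with respect to the split, so it can be dropped
proxy = left.sum() ** 2 / len(left) + right.sum() ** 2 / len(right)

# identity: n * var(c) == sum(c**2) - (sum(c))**2 / n, hence
assert np.isclose(weighted_mse, (y ** 2).sum() - proxy)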

Best,
Thomas


On Mar 1, 2018 14:56, "Sebastian Raschka" <se.rasc...@gmail.com> wrote:

Hi, Thomas,

in regression trees, minimizing the variance among the target values is
equivalent to minimizing the MSE between targets and predicted values. This
is also called variance reduction: https://en.wikipedia.org/wiki/
Decision_tree_learning#Variance_reduction

Best,
Sebastian

> On Mar 1, 2018, at 8:27 AM, Thomas Evangelidis <teva...@gmail.com> wrote:
>
>
> Hi again,
>
> I am currently revisiting this problem after familiarizing myself with
Cython and Scikit-Learn's code and I have a very important query:
>
> Looking at the class MSE(RegressionCriterion), the node impurity is
defined as the variance of the target values Y on that node. The
predictions X are nowhere involved in the computations. This contradicts my
notion of "loss function", which quantifies the discrepancy between
predicted and target values. Am I looking at the wrong class or what I want
to do is just not feasible with Random Forests? For example, I would like
to modify the RandomForestRegressor code to minimize the Pearson's R
between predicted and target values.
>
> I thank you in advance for any clarification.
> Thomas
>
>
>
>
> On 02/15/2018 01:28 PM, Guillaume Lemaitre wrote:
>> Yes you are right pxd are the header and pyx the definition. You need to
write a class as MSE. Criterion is an abstract class or base class (I don't
have it under the eye)
>>
>> @Andy: if I recall the PR, we made the classes public to enable such
custom criterion. However, ‎it is not documented since we were not
officially supporting it. So this is an hidden feature. We could always
discuss to make this feature more visible and document it.
>
>
>
>
>
> --
> ==
> Dr Thomas Evangelidis
> Post-doctoral Researcher
> CEITEC - Central European Institute of Technology
> Masaryk University
> Kamenice 5/A35/2S049,
> 62500 Brno, Czech Republic
>
> email: tev...@pharm.uoa.gr
>   teva...@gmail.com
>
> website: https://sites.google.com/site/thomasevangelidishomepage/
>
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] custom loss function in RandomForestRegressor

2018-03-01 Thread Thomas Evangelidis
Hi again,

I am currently revisiting this problem after familiarizing myself with
Cython and Scikit-Learn's code and I have a very important query:

Looking at the class MSE(RegressionCriterion), the node impurity is defined
as the variance of the target values Y in that node. The predictions X are
nowhere involved in the computations. This contradicts my notion of a "loss
function", which quantifies the discrepancy between predicted and target
values. Am I looking at the wrong class, or is what I want to do just not
feasible with random forests? For example, I would like to modify the
RandomForestRegressor code to optimize Pearson's R between the predicted
and target values.

I thank you in advance for any clarification.
Thomas



>
>> On 02/15/2018 01:28 PM, Guillaume Lemaitre wrote:
>>
>> Yes you are right pxd are the header and pyx the definition. You need to
>> write a class as MSE. Criterion is an abstract class or base class (I don't
>> have it under the eye)
>>
>> @Andy: if I recall the PR, we made the classes public to enable such
>> custom criterion. However, ‎it is not documented since we were not
>> officially supporting it. So this is an hidden feature. We could always
>> discuss to make this feature more visible and document it.
>>
>>
>>
>


-- 

==

Dr Thomas Evangelidis

Post-doctoral Researcher
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/2S049,
62500 Brno, Czech Republic

email: tev...@pharm.uoa.gr

  teva...@gmail.com


website: https://sites.google.com/site/thomasevangelidishomepage/
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] custom loss function in RandomForestRegressor

2018-02-15 Thread Thomas Evangelidis
Is it possible to compile just _criterion.pyx and _criterion.pxd files by
using "importpyx" or any alternative way instead of compiling the whole
sklearn library every time I introduce a change?

On 15 February 2018 at 19:29, "Guillaume Lemaitre" <
g.lemaitr...@gmail.com> wrote:

Yes, you are right: the pxd files are the headers and the pyx files the
definitions. You need to write a class like MSE. Criterion is an
abstract/base class (I don't have it in front of me).

@Andy: if I recall the PR correctly, we made the classes public to enable
such custom criteria. However, it is not documented, since we were not
officially supporting it. So this is a hidden feature. We could always
discuss making this feature more visible and documenting it.

Guillaume Lemaitre
INRIA Saclay Ile-de-France / Equipe PARIETAL
guillaume.lemai...@inria.fr - https://glemaitre.github.io/
*From: *Thomas Evangelidis
*Sent: *Thursday, 15 February 2018 19:15
*To: *Scikit-learn mailing list
*Reply To: *Scikit-learn mailing list
*Subject: *Re: [scikit-learn] custom loss function in RandomForestRegressor

Sorry I don't know Cython at all. _criterion.pxd is like the header file in
C++? I see that it contains class, function and variable definitions and
their description in comments.

class Criterion is an Interface, doesn't have function definitions. By
"writing your own criterion with a given loss" you mean writing a class
like MSE(RegressionCriterion)?


On 15 February 2018 at 18:50, Guillaume Lemaître <g.lemaitr...@gmail.com>
wrote:

> The ClassificationCriterion and RegressionCriterion are now exposed in the
> _criterion.pxd. It will allow you to create your own criterion.
> So you can write your own Criterion with a given loss by implementing the
> methods which are required in the trees.
> Then you can pass an instance of this criterion to the tree and it should
> work.
>
>
>
>
>
> --
> Guillaume Lemaitre
> INRIA Saclay - Parietal team
> Center for Data Science Paris-Saclay
> https://glemaitre.github.io/
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>


-- 

==

Dr Thomas Evangelidis

Post-doctoral Researcher
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/2S049,
62500 Brno, Czech Republic

email: tev...@pharm.uoa.gr

  teva...@gmail.com


website: https://sites.google.com/site/thomasevangelidishomepage/



___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] custom loss function in RandomForestRegressor

2018-02-15 Thread Thomas Evangelidis
Sorry, I don't know Cython at all. Is _criterion.pxd like a header file in
C++? I see that it contains class, function, and variable declarations and
their descriptions in comments.

The Criterion class is an interface; it doesn't have function definitions.
By "writing your own criterion with a given loss", do you mean writing a
class like MSE(RegressionCriterion)?


On 15 February 2018 at 18:50, Guillaume Lemaître <g.lemaitr...@gmail.com>
wrote:

> The ClassificationCriterion and RegressionCriterion are now exposed in the
> _criterion.pxd. It will allow you to create your own criterion.
> So you can write your own Criterion with a given loss by implementing the
> methods which are required in the trees.
> Then you can pass an instance of this criterion to the tree and it should
> work.
>
>
>
>
>
> --
> Guillaume Lemaitre
> INRIA Saclay - Parietal team
> Center for Data Science Paris-Saclay
> https://glemaitre.github.io/
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>


-- 

==

Dr Thomas Evangelidis

Post-doctoral Researcher
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/2S049,
62500 Brno, Czech Republic

email: tev...@pharm.uoa.gr

  teva...@gmail.com


website: https://sites.google.com/site/thomasevangelidishomepage/
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] custom loss function in RandomForestRegressor

2018-02-15 Thread Thomas Evangelidis
Greetings,

The feature importances calculated by the random forest implementation are
very useful. I personally use them to select the best features, because it
is simple and fast, and then I train MLPRegressors. The limitation of this
approach is that although I can control the loss function of the
MLPRegressor (I have modified scikit-learn's implementation to accept an
arbitrary loss function), I cannot do the same with RandomForestRegressor,
and hence I have to rely on 'mse', which is not consistent with the loss
functions I use in the MLPs. Today I was looking at the _criterion.pyx file:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx

However, the code is in Cython and I find it hard to follow. I know that
for regression the relevant classes are Criterion,
RegressionCriterion(Criterion), and MSE(RegressionCriterion). My question
is: is it possible to write a class that takes an arbitrary function
"loss(predictions, targets)" to calculate the loss and impurity of the
nodes?
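(The importance-based selection feeding an MLPRegressor, i.e. the first part of the workflow above, can be sketched with standard scikit-learn pieces as below; the custom impurity itself still requires the Cython route discussed in the replies. Data and sizes are synthetic placeholders.)

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline

X, y = make_regression(n_samples=100, n_features=50, noise=0.5, random_state=0)

# keep exactly the 20 highest-importance features, then feed them to the MLP
pipe = make_pipeline(
    SelectFromModel(RandomForestRegressor(n_estimators=200, random_state=0),
                    threshold=-np.inf, max_features=20),
    MLPRegressor(hidden_layer_sizes=(10,), max_iter=1000, random_state=0),
).fit(X, y)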

thanks,
Thomas


-- 

==

Dr Thomas Evangelidis

Post-doctoral Researcher
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/2S049,
62500 Brno, Czech Republic

email: tev...@pharm.uoa.gr

  teva...@gmail.com


website: https://sites.google.com/site/thomasevangelidishomepage/
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] MLPClassifier as a feature selector

2017-12-29 Thread Thomas Evangelidis
Alright, with these attributes I can get the weights and biases, but what
about the values on the nodes of the last hidden layer? Do I have to work
them out myself or there is a straightforward way to get them?
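(As far as I know there is no public method for this, so one option is a manual forward pass using only the coefs_ and intercepts_ attributes mentioned below; the sketch assumes the default 'relu' activation and is an illustration, not an official API.)

import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=0).fit(X, y)

def hidden_layer_output(mlp, X):
    # forward-pass through the hidden layers only (assumes activation='relu')
    a = X
    for W, b in zip(mlp.coefs_[:-1], mlp.intercepts_[:-1]):
        a = np.maximum(a @ W + b, 0.0)
    return a

H = hidden_layer_output(clf, X)   # shape (n_samples, 10): candidate input for an MLPRegressor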

On 7 December 2017 at 04:25, Manoj Kumar <manojkumarsivaraj...@gmail.com>
wrote:

> Hi,
>
> The weights and intercepts are available in the coefs_ and intercepts_
> attribute respectively.
>
> See https://github.com/scikit-learn/scikit-learn/blob/
> a24c8b46/sklearn/neural_network/multilayer_perceptron.py#L835
>
> On Wed, Dec 6, 2017 at 4:56 PM, Brown J.B. via scikit-learn <
> scikit-learn@python.org> wrote:
>
>> I am also very interested in knowing if there is a sklearn cookbook
>> solution for getting the weights of a one-hidden-layer MLPClassifier.
>> J.B.
>>
>> 2017-12-07 8:49 GMT+09:00 Thomas Evangelidis <teva...@gmail.com>:
>>
>>> Greetings,
>>>
>>> I want to train a MLPClassifier with one hidden layer and use it as a
>>> feature selector for an MLPRegressor.
>>> Is it possible to get the values of the neurons from the last hidden
>>> layer of the MLPClassifier to pass them as input to the MLPRegressor?
>>>
>>> If it is not possible with scikit-learn, is anyone aware of any
>>> scikit-compatible NN library that offers this functionality? For example
>>> this one:
>>>
>>> http://scikit-neuralnetwork.readthedocs.io/en/latest/index.html
>>>
>>> I wouldn't like to do this in Tensorflow because the MLP there is much
>>> slower than scikit-learn's implementation.
>>>
>>>
>>> Thomas
>>>
>>>
>>> --
>>>
>>> ==
>>>
>>> Dr Thomas Evangelidis
>>>
>>> Post-doctoral Researcher
>>> CEITEC - Central European Institute of Technology
>>> Masaryk University
>>> Kamenice 5/A35/2S049,
>>> 62500 Brno, Czech Republic
>>>
>>> email: tev...@pharm.uoa.gr
>>>
>>>   teva...@gmail.com
>>>
>>>
>>> website: https://sites.google.com/site/thomasevangelidishomepage/
>>>
>>>
>>> ___
>>> scikit-learn mailing list
>>> scikit-learn@python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>>
>>
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
>
> --
> Manoj,
> http://github.com/MechCoder
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>


-- 

==

Dr Thomas Evangelidis

Post-doctoral Researcher
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/2S049,
62500 Brno, Czech Republic

email: tev...@pharm.uoa.gr

  teva...@gmail.com


website: https://sites.google.com/site/thomasevangelidishomepage/
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] data augmentation following the underlying feature values distributions and correlations

2017-12-18 Thread Thomas Evangelidis
Greetings,

I want to augment my training set while preserving the correlations between
feature values. More specifically, my features are the NMR resonances of the
nuclei of a single amino acid. For example, for glutamic acid I have, for
each observation, the following feature values:

[CA, HA, CB, HB, CG, HG]

where CA is the resonance of the alpha carbon, HA the resonance of the
alpha proton, and so forth. The complication here is that these feature
values are not independent: HA is covalently bonded to CA, CB to CA, and so
on. Therefore, if I sample a random CA value from the distribution of
experimental CA values, I cannot pick ANY HA value from the respective
experimental distribution, simply because CA and HA are correlated. The
same applies to CA and CB, CB and HB, CB and CG, and CG and HG. Is there any
algorithm that can generate [CA, HA, CB, HB, CG, HG] feature vectors that
comply with the per-atom distributions and their correlations? I saw that
Gaussian mixture models have a function to generate random samples from the
fitted Gaussian distribution (sklearn.mixture.GaussianMixture.sample), but
it is not clear whether these samples will retain the correlations between
the features (nuclei in this case). If there is no such algorithm in
scikit-learn, could you please point me to another Python library that does
this?
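(Partly answering the GaussianMixture question: with covariance_type="full" each fitted component models the cross-feature covariance, and sample() draws from those fitted Gaussians, so pairwise correlations are approximately preserved. A self-contained check with purely synthetic numbers, not real NMR shifts:)

import numpy as np
from sklearn.mixture import GaussianMixture

# synthetic stand-in for a matrix whose rows are [CA, HA, CB, HB, CG, HG] vectors
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.6], [0.6, 1.0]])     # a correlated pair, for illustration only
X = np.hstack([rng.multivariate_normal([0, 0], cov, size=300) for _ in range(3)])

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)
X_new, _ = gmm.sample(1000)                  # new vectors drawn from the fitted components

# compare the correlation of the first feature pair in the data and in the samples
print(np.corrcoef(X[:, 0], X[:, 1])[0, 1], np.corrcoef(X_new[:, 0], X_new[:, 1])[0, 1])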

Thanks in advance.
Thomas


-- 

==

Dr Thomas Evangelidis

Post-doctoral Researcher
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/2S049,
62500 Brno, Czech Republic

email: tev...@pharm.uoa.gr

  teva...@gmail.com


website: https://sites.google.com/site/thomasevangelidishomepage/
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] MLPClassifier as a feature selector

2017-12-06 Thread Thomas Evangelidis
Greetings,

I want to train an MLPClassifier with one hidden layer and use it as a
feature selector for an MLPRegressor.
Is it possible to get the values of the neurons from the last hidden layer
of the MLPClassifier to pass them as input to the MLPRegressor?

If it is not possible with scikit-learn, is anyone aware of any
scikit-compatible NN library that offers this functionality? For example
this one:

http://scikit-neuralnetwork.readthedocs.io/en/latest/index.html

I wouldn't like to do this in Tensorflow because the MLP there is much
slower than scikit-learn's implementation.


Thomas


-- 

==

Dr Thomas Evangelidis

Post-doctoral Researcher
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/2S049,
62500 Brno, Czech Republic

email: tev...@pharm.uoa.gr

  teva...@gmail.com


website: https://sites.google.com/site/thomasevangelidishomepage/
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] anti-correlated predictions by SVR

2017-09-26 Thread Thomas Evangelidis
I have very small training sets (10-50 observations). Currently, I am
working with 16 observations for training and 25 for validation (external
test set). And I am doing regression, not classification (hence the SVR
instead of SVC).


On 26 September 2017 at 18:21, Gael Varoquaux <gael.varoqu...@normalesup.org
> wrote:

> Hypothesis: you have a very small dataset and when you leave out data,
> you create a distribution shift between the train and the test. A
> simplified example: 20 samples, 10 class a, 10 class b. A leave-one-out
> cross-validation will create a training set of 10 samples of one class, 9
> samples of the other, and the test set is composed of the class that is
> minority on the train set.
>
> G
>
> On Tue, Sep 26, 2017 at 06:10:39PM +0200, Thomas Evangelidis wrote:
> > Greetings,
>
> > I don't know if anyone encountered this before, but sometimes I get
> > anti-correlated predictions by the SVR I that am training. Namely, the
> > Pearson's R and Kendall's tau are negative when I compare the
> predictions on
> > the external test set with the true values. However, the SVR predictions
> on the
> > training set have positive correlations with the experimental values and
> hence
> > I can't think of a way to know in advance if the trained SVR will produce
> > anti-correlated predictions in order to change their sign and avoid the
> > disaster. Here is an example of what I mean:
>
> > Training set predictions: R=0.452422, tau=0.33
> > External test set predictions: R=-0.537420, tau-0.30
>
> > Obviously, in a real case scenario where I wouldn't have the external
> test set
> > I would have used the worst observation instead of the best ones. Has
> anybody
> > any idea about how I could prevent this?
>
> > thanks in advance
> > Thomas
> --
> Gael Varoquaux
> Researcher, INRIA Parietal
> NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
> Phone:  ++ 33-1-69-08-79-68
> http://gael-varoquaux.infohttp://twitter.com/GaelVaroquaux
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>



-- 

==

Dr Thomas Evangelidis

Post-doctoral Researcher
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/2S049,
62500 Brno, Czech Republic

email: tev...@pharm.uoa.gr

  teva...@gmail.com


website: https://sites.google.com/site/thomasevangelidishomepage/
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] anti-correlated predictions by SVR

2017-09-26 Thread Thomas Evangelidis
Greetings,

I don't know if anyone has encountered this before, but sometimes I get
anti-correlated predictions from the SVR that I am training. Namely, the
Pearson's R and Kendall's tau are negative when I compare the predictions
on the external test set with the true values. However, the SVR predictions
on the training set have positive correlations with the experimental values,
and hence I can't think of a way to know in advance whether the trained SVR
will produce anti-correlated predictions, in order to change their sign and
avoid the disaster. Here is an example of what I mean:

Training set predictions: R=0.452422, tau=0.33
External test set predictions: R=-0.537420, tau=-0.30

Obviously, in a real-case scenario, where I wouldn't have the external test
set, I would have ended up using the worst observations instead of the best
ones. Has anybody any idea how I could prevent this?
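(For reference, the reported numbers can be computed like this; the svr, X and y names are hypothetical.)

import numpy as np
from scipy.stats import kendalltau, pearsonr

def correlations(y_true, y_pred):
    # returns (Pearson's R, Kendall's tau)
    return pearsonr(y_true, y_pred)[0], kendalltau(y_true, y_pred)[0]

# hypothetical usage with a fitted SVR `svr`:
# r_train, tau_train = correlations(y_train, svr.predict(X_train))
# r_test,  tau_test  = correlations(y_test,  svr.predict(X_test))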

thanks in advance
Thomas



-- 

==

Dr Thomas Evangelidis

Post-doctoral Researcher
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/2S049,
62500 Brno, Czech Republic

email: tev...@pharm.uoa.gr

  teva...@gmail.com


website: https://sites.google.com/site/thomasevangelidishomepage/
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] custom loss function

2017-09-13 Thread Thomas Evangelidis
What about the SVM? I use an SVR at the end to combine multiple
MLPRegressor predictions, using the RBF kernel (the linear kernel is not
good for this problem). Can I also implement an SVR with an RBF kernel in
TensorFlow using my own loss function? So far I have found an example of an
SVC with a linear kernel in TensorFlow, and nothing in Keras. My alternative
option would be to train multiple SVRs and find, through cross-validation,
the one that minimizes my custom loss function; but, as I said in a previous
message, that would be a suboptimal solution because in scikit-learn the
SVR minimizes the default loss function.

On 13 September 2017 at 20:48, "Andreas Mueller" <t3k...@gmail.com> wrote:

>
>
> On 09/13/2017 01:18 PM, Thomas Evangelidis wrote:
>
> ​​
> Thanks again for the clarifications Sebastian!
>
> Keras has a Scikit-learn API with the KeraRegressor which implements the
> Scikit-Learn MLPRegressor interface:
>
> https://keras.io/scikit-learn-api/
>
> Is it possible to change the loss function in KerasRegressor? I don't have
> time right now to experiment with hyperparameters of new ANN architectures.
> I am in urgent need to reproduce in Keras the results obtained with
> MLPRegressor and the set of hyperparameters that I have optimized for my
> problem and later change the loss function.
>
> I think using keras is probably the way to go for you.
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] custom loss function

2017-09-13 Thread Thomas Evangelidis
​​
Thanks again for the clarifications, Sebastian!

Keras has a scikit-learn API with the KerasRegressor, which implements the
scikit-learn regressor interface:

https://keras.io/scikit-learn-api/

Is it possible to change the loss function in KerasRegressor? I don't have
time right now to experiment with the hyperparameters of new ANN
architectures. I urgently need to reproduce in Keras the results obtained
with MLPRegressor and the set of hyperparameters that I have optimized for
my problem, and later change the loss function.
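(A sketch of how this could look with the keras.wrappers.scikit_learn API referenced above; the architecture, the optimizer and the centered-RMSE definition below, i.e. the RMSE of the mean-removed values, are assumptions, and in current TensorFlow/Keras the wrapper has moved out of the core package, so treat this only as an outline.)

import keras.backend as K
from keras.layers import Dense
from keras.models import Sequential
from keras.wrappers.scikit_learn import KerasRegressor

def centered_rmse(y_true, y_pred):
    # assumed definition: RMSE after subtracting each series' mean
    d = (y_pred - K.mean(y_pred)) - (y_true - K.mean(y_true))
    return K.sqrt(K.mean(K.square(d)))

def build_model(n_features=60):
    model = Sequential([Dense(10, activation="relu", input_dim=n_features),
                        Dense(1)])
    model.compile(optimizer="adam", loss=centered_rmse)   # the custom loss goes here
    return model

reg = KerasRegressor(build_fn=build_model, epochs=400, verbose=0)   # sklearn-style estimator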



On 13 September 2017 at 18:14, Sebastian Raschka <se.rasc...@gmail.com>
wrote:

> > What about the SVR? Is it possible to change the loss function there?
>
> Here you would have the same problem; SVR is a constrained optimization
> problem and you would have to change the calculation of the loss gradient
> then. Since SVR is a "1-layer" neural net, if you change the cost function
> to something else, it's not really a SVR anymore.
>
>
> > Could you please clarify what the "x" and "x'" parameters in the default
> Kernel functions mean? Is "x" a NxM array, where N is the number of
> observations and M the number of features?
>
> Both x and x' should be denoting training examples. The kernel matrix is
> symmetric (N x N).
>
>
>
> Best,
> Sebastian
>
> > On Sep 13, 2017, at 5:25 AM, Thomas Evangelidis <teva...@gmail.com>
> wrote:
> >
> > Thanks Sebastian. Exploring Tensorflow capabilities was in my TODO list,
> but now it's in my immediate plans.
> > What about the SVR? Is it possible to change the loss function there?
> Could you please clarify what the "x" and "x'" parameters in the default
> Kernel functions mean? Is "x" a NxM array, where N is the number of
> observations and M the number of features?
> >
> > http://scikit-learn.org/stable/modules/svm.html#kernel-functions
> >
> >
> >
> > On 12 September 2017 at 00:37, Sebastian Raschka <se.rasc...@gmail.com>
> wrote:
> > Hi Thomas,
> >
> > > For the MLPRegressor case so far my conclusion was that it is not
> possible unless you modify the source code.
> >
> > Also, I suspect that this would be non-trivial. I haven't looked to
> closely at how the MLPClassifier/MLPRegressor are implemented but since you
> perform the weight updates based on the gradient of the cost function wrt
> the weights, the modification would be non-trivial if the partial
> derivatives are not computed based on some autodiff implementation -- you
> would have to edit all the partial d's along the backpropagation up to the
> first hidden layer. While I think that scikit-learn is by far the best
> library out there for machine learning, I think if you want an easy
> solution, you probably won't get around TensorFlow or PyTorch or
> equivalent, here, for your specific MLP problem unless you want to make
> your life extra hard :P (seriously, you can pick up any of the two in about
> an hour and have your MLPRegressor up and running so that you can then
> experiment with your cost function).
> >
> > Best,
> > Sebastian
> >
> > > On Sep 11, 2017, at 6:13 PM, Thomas Evangelidis <teva...@gmail.com>
> wrote:
> > >
> > > Greetings,
> > >
> > > I know this is a recurrent question, but I would like to use my own
> loss function either in a MLPRegressor or in an SVR. For the MLPRegressor
> case so far my conclusion was that it is not possible unless you modify the
> source code. On the other hand, for the SVR I was looking at setting custom
> kernel functions. But I am not sure if this is the same thing. Could
> someone please clarify this to me? Finally, I read about the "scoring"
> parameter is cross-validation, but this is just to select a Regressor that
> has been trained already with the default loss function, so it would be
> harder to find one that minimizes my own loss function.
> > >
> > > For the record, my loss function is the centered root mean square
> error.
> > >
> > > Thanks in advance for any advice.
> > >
> > >
> > >
> > > --
> > > ==
> > > Dr Thomas Evangelidis
> > > Post-doctoral Researcher
> > > CEITEC - Central European Institute of Technology
> > > Masaryk University
> > > Kamenice 5/A35/2S049,
> > > 62500 Brno, Czech Republic
> > >
> > > email: tev...@pharm.uoa.gr
> > >   teva...@gmail.com
> > >
> > > website: https://sites.google.com/site/thomasevangelidishomepage/
> > &

[scikit-learn] custom loss function

2017-09-11 Thread Thomas Evangelidis
Greetings,

I know this is a recurrent question, but I would like to use my own loss
function in either an MLPRegressor or an SVR. For the MLPRegressor case, my
conclusion so far is that it is not possible unless you modify the source
code. On the other hand, for the SVR I was looking at setting custom kernel
functions, but I am not sure if this is the same thing. Could someone please
clarify this for me? Finally, I read about the "scoring" parameter in
cross-validation, but this only selects among regressors that have already
been trained with the default loss function, so it would be harder to find
one that minimizes my own loss function.

For the record, my loss function is the centered root mean square error.
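(On the "scoring" point: a custom metric can at least drive model selection via make_scorer, even though it does not change the loss the SVR itself optimizes. Sketch with synthetic data; the centered-RMSE definition, i.e. the RMSE of the mean-removed values, is an assumption.)

import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

def centered_rmse(y_true, y_pred):
    d = (y_pred - np.mean(y_pred)) - (y_true - np.mean(y_true))
    return np.sqrt(np.mean(d ** 2))

crmse_scorer = make_scorer(centered_rmse, greater_is_better=False)   # lower is better

X, y = make_regression(n_samples=60, n_features=20, noise=0.5, random_state=0)
search = GridSearchCV(SVR(kernel="rbf"),
                      param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]},
                      scoring=crmse_scorer, cv=5).fit(X, y)
print(search.best_params_)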

Thanks in advance for any advice.



-- 

==

Dr Thomas Evangelidis

Post-doctoral Researcher
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/2S049,
62500 Brno, Czech Republic

email: tev...@pharm.uoa.gr

  teva...@gmail.com


website: https://sites.google.com/site/thomasevangelidishomepage/
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] control value range of MLPRegressor predictions

2017-09-10 Thread Thomas Evangelidis
On 10 September 2017 at 22:03, Sebastian Raschka <se.rasc...@gmail.com>
wrote:

> You could normalize the outputs (e.g., via min-max scaling). However, I
> think the more intuitive way would be to clip the predictions. E.g., say
> you are predicting house prices, it probably makes no sense to have a
> negative prediction, so you would clip the output at some value  >0$
>
>
​By clipping you mean discarding the predictors that give values
below/above the threshold?



> PS: -820 and -800 sounds a bit extreme if your training data is in a -5 to
> -9 range. Is your training data from a different population then the one
> you use for testing/making predictions? Or maybe it's just an extreme case
> of overfitting.
>
>
​It is from the same population, but the training sets I use are very small
(6-32 observations), so it must be over-fitting. We had that discussion in
the past here, yet in practice I get good correlations with the
experimental values using MLPRegressors.​



> Best,
> Sebastian
>
>
> > On Sep 10, 2017, at 3:13 PM, Thomas Evangelidis <teva...@gmail.com>
> wrote:
> >
> > Greetings,
> >
> > Is there any way to force the MLPRegressor to make predictions in the
> same value range as the training data? For example, if the training data
> range between -5 and -9, I don't want the predictions to range between -820
> and -800. In fact, some times I get anti-correlated predictions, for
> example between 800 and 820 and I have to change the sign in order to
> calculate correlations with experimental values. Is there a way to control
> the value range explicitly or implicitly (by post-processing the
> predictions)?
> >
> > thanks
> > Thomas
> >
> >
> > --
> > ==
> > Dr Thomas Evangelidis
> > Post-doctoral Researcher
> > CEITEC - Central European Institute of Technology
> > Masaryk University
> > Kamenice 5/A35/2S049,
> > 62500 Brno, Czech Republic
> >
> > email: tev...@pharm.uoa.gr
> >   teva...@gmail.com
> >
> > website: https://sites.google.com/site/thomasevangelidishomepage/
> >
> >
> > ___
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>



-- 

==

Dr Thomas Evangelidis

Post-doctoral Researcher
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/2S049,
62500 Brno, Czech Republic

email: tev...@pharm.uoa.gr

  teva...@gmail.com


website: https://sites.google.com/site/thomasevangelidishomepage/
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] control value range of MLPRegressor predictions

2017-09-10 Thread Thomas Evangelidis
Greetings,

Is there any way to force the MLPRegressor to make predictions in the same
value range as the training data? For example, if the training data range
between -5 and -9, I don't want the predictions to range between -820 and
-800. In fact, sometimes I get anti-correlated predictions, for example
between 800 and 820, and I have to change the sign in order to calculate
correlations with experimental values. Is there a way to control the value
range explicitly or implicitly (by post-processing the predictions)?
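(One simple post-processing option, along the lines of the clipping suggestion in the replies: constrain predictions to the training-target range. This only bounds the outputs; it does not fix the underlying overfitting. The mlp, X and y names are hypothetical.)

import numpy as np

def clip_to_training_range(y_pred, y_train):
    # predictions cannot leave the [min, max] range seen during training
    return np.clip(y_pred, np.min(y_train), np.max(y_train))

# hypothetical usage with a fitted MLPRegressor `mlp`:
# y_hat = clip_to_training_range(mlp.predict(X_test), y_train)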

thanks
Thomas


-- 

==

Dr Thomas Evangelidis

Post-doctoral Researcher
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/2S049,
62500 Brno, Czech Republic

email: tev...@pharm.uoa.gr

  teva...@gmail.com


website: https://sites.google.com/site/thomasevangelidishomepage/
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] combining datasets from different sources

2017-09-07 Thread Thomas Evangelidis
On 7 September 2017 at 15:29, Maciek Wójcikowski <mac...@wojcikowski.pl>
wrote:

> I think StandardScaller is what you want. For each assay you will get mean
> and var. Average mean would be the "optimal" shift and average variance the
> spread. But would this value make any physical sense?
>
I think you missed my point. The problem was scaling with restraints: the
RMSD between the binding affinities of the common ligands must be minimized
upon scaling. Anyway, I managed to work it out using scipy.optimize.




> Considering the RF-Score-VS: In fact it's a regressor and it predicts a
> real value, not a class. Although it is validated mostly using Enrichment
> Factor, the last figure shows top results for regression vs Vina.
>
To my understanding, you trained the RF using class information (active,
inactive), and the prediction was a probability value. If the probability
is above 0.5, the compound is predicted to be active; otherwise inactive.
This is how sklearn.ensemble.RandomForestClassifier works.

In contrast, I train MLPRegressors using binding affinities (scalar values),
and the predictions are binding affinities (scalar values).





> 
> Pozdrawiam,  |  Best regards,
> Maciek Wójcikowski
> mac...@wojcikowski.pl
>
> 2017-09-06 20:48 GMT+02:00 Thomas Evangelidis <teva...@gmail.com>:
>
>> ​​
>> After some though about this problem today, I think it is an objective
>> function minimization problem, when the objective function can be the root
>> mean square deviation (RMSD) between the affinities of the common molecules
>> in the two data sets. I could work iteratively, first rescale and fit assay
>> B to match A, then proceed to assay C and so forth. Or alternatively, for
>> each Assay I need to find two missing variables, the optimum shift Sh and
>> the scale Sc. So if I have 3 Assays A, B, C lets say, I am looking for the
>> optimum values of Sh_A, Sc_A, Sh_B, Sc_B, Sh_C, Sc_C that minimize the RMSD
>> between the binding affinities of the overlapping molecules. Any idea how I
>> can do that with scikit-learn?
>>
>>
>> On 6 September 2017 at 00:29, Thomas Evangelidis <teva...@gmail.com>
>> wrote:
>>
>>> Thanks Jason, Sebastian and Maciek!
>>>
>>> I believe from all the suggestions, the most feasible solutions is to
>>> look experimental assays which overlap by at least two compounds, and then
>>> adjust the binding affinities of one of them by looking in their difference
>>> in both assays. Sebastian mentioned the simplest scenario, where the shift
>>> for both compounds is 2 kcal/mol. However, he neglected to mention that the
>>> ratio between the affinities of the two compounds in each assay also
>>> matters. Specifically, the ratio Ka/Kb=-7/-9=0.78 in assay A but
>>> -10/-12=0.83 in assay B. Ideally that should also be taken into account to
>>> select the right transformation function for the values from Assay B. Is
>>> anybody away of any clever algorithm to select the right transformation
>>> function for such a case? I am sure there exists.
>>>
>>> The other approach would be to train different predictors from each
>>> assay and then apply a data fusion technique (e.g. min rank). But that
>>> wouldn't be that elegant.
>>>
>>> @Maciek To my understanding, the paper you cited addresses a
>>> classification problem (actives, inactives) by implementing Random Forrest
>>> Classfiers. My case is a Regression problem.
>>>
>>>
>>> best,
>>> Thomas
>>>
>>>
>>> On 5 September 2017 at 20:33, Maciek Wójcikowski <mac...@wojcikowski.pl>
>>> wrote:
>>>
>>>> Hi Thomas and others,
>>>>
>>>> It also really depend on how many data points you have on each
>>>> compound. If you had more than a few then there are few options. If you get
>>>> two very distinct activities for one ligand. I'd discard such samples as
>>>> ambiguous or decide on one of the assays/experiments (the one with lower
>>>> error). The exact problem was faced by PDBbind creators, I'd also look
>>>> there for details what they did with their activities.
>>>>
>>>> To follow up Sebastians suggestion: have you checked how different
>>>> ranks/Z-scores you get? Check out the Kendall Tau.
>>>>
>>>> Anyhow, you could build local models for a specific experimental
>>>> methods. In our recent publication on slightly different area
>>>> (protein-ligand scoring function), we show that the RF build on one target
>>

Re: [scikit-learn] combining datasets from different sources

2017-09-06 Thread Thomas Evangelidis
​​
After some thought about this problem today, I think it is an
objective-function minimization problem, where the objective function can be
the root mean square deviation (RMSD) between the affinities of the common
molecules in the two data sets. I could work iteratively: first rescale and
fit assay B to match A, then proceed to assay C, and so forth. Alternatively,
for each assay I need to find two missing variables, the optimum shift Sh
and scale Sc. So if I have 3 assays A, B, and C, let's say, I am looking for
the optimum values of Sh_A, Sc_A, Sh_B, Sc_B, Sh_C, Sc_C that minimize the
RMSD between the binding affinities of the overlapping molecules. Any idea
how I can do that with scikit-learn?
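(Not scikit-learn, but the kind of scipy.optimize formulation that was eventually used according to the later reply; here for a single assay B rescaled against a reference assay A, with hypothetical affinity values.)

import numpy as np
from scipy.optimize import minimize

# hypothetical affinities of the molecules shared between assay A (reference) and assay B
a = np.array([-7.0, -9.0, -8.2, -6.5])
b = np.array([-10.0, -12.0, -11.1, -9.4])

def rmsd(params):
    scale, shift = params
    return np.sqrt(np.mean((scale * b + shift - a) ** 2))

res = minimize(rmsd, x0=[1.0, 0.0])         # optimum Sc_B, Sh_B for assay B
scale_b, shift_b = res.x
b_rescaled = scale_b * b + shift_b          # assay B mapped onto the scale of assay A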


On 6 September 2017 at 00:29, Thomas Evangelidis <teva...@gmail.com> wrote:

> Thanks Jason, Sebastian and Maciek!
>
> I believe from all the suggestions, the most feasible solutions is to look
> experimental assays which overlap by at least two compounds, and then
> adjust the binding affinities of one of them by looking in their difference
> in both assays. Sebastian mentioned the simplest scenario, where the shift
> for both compounds is 2 kcal/mol. However, he neglected to mention that the
> ratio between the affinities of the two compounds in each assay also
> matters. Specifically, the ratio Ka/Kb=-7/-9=0.78 in assay A but
> -10/-12=0.83 in assay B. Ideally that should also be taken into account to
> select the right transformation function for the values from Assay B. Is
> anybody away of any clever algorithm to select the right transformation
> function for such a case? I am sure there exists.
>
> The other approach would be to train different predictors from each assay
> and then apply a data fusion technique (e.g. min rank). But that wouldn't
> be that elegant.
>
> @Maciek To my understanding, the paper you cited addresses a
> classification problem (actives, inactives) by implementing Random Forrest
> Classfiers. My case is a Regression problem.
>
>
> best,
> Thomas
>
>
> On 5 September 2017 at 20:33, Maciek Wójcikowski <mac...@wojcikowski.pl>
> wrote:
>
>> Hi Thomas and others,
>>
>> It also really depend on how many data points you have on each compound.
>> If you had more than a few then there are few options. If you get two very
>> distinct activities for one ligand. I'd discard such samples as ambiguous
>> or decide on one of the assays/experiments (the one with lower error). The
>> exact problem was faced by PDBbind creators, I'd also look there for
>> details what they did with their activities.
>>
>> To follow up Sebastians suggestion: have you checked how different
>> ranks/Z-scores you get? Check out the Kendall Tau.
>>
>> Anyhow, you could build local models for a specific experimental methods.
>> In our recent publication on slightly different area (protein-ligand
>> scoring function), we show that the RF build on one target is just slightly
>> better than the RF build on many targets (we've used DUD-E database);
>> Checkout the "horizontal" and "per-target" splits
>> https://www.nature.com/articles/srep46710. Unfortunately, this may
>> change for different models. Plus the molecular descriptors used, which we
>> know nothing about.
>>
>> I hope that helped a bit.
>>
>> 
>> Pozdrawiam,  |  Best regards,
>> Maciek Wójcikowski
>> mac...@wojcikowski.pl
>>
>> 2017-09-05 19:35 GMT+02:00 Sebastian Raschka <se.rasc...@gmail.com>:
>>
>>> Another approach would be to pose this as a "ranking" problem to predict
>>> relative affinities rather than absolute affinities. E.g., if you have data
>>> from one (or more) molecules that has/have been tested under 2 or more
>>> experimental conditions, you can rank the other molecules accordingly or
>>> normalize. E.g. if you observe that the binding affinity of molecule a is
>>> -7 kcal/mol in assay A and -9 kcal/mol in assay to, and say the binding
>>> affinities of molecule B are -10 and -12 kcal/mol, respectively, that
>>> should give you some information for normalizing the values from assay 2
>>> (e.g., by adding 2 kcal/mol). Of course this is not a perfect solution and
>>> might be error prone, but so are experimental assays ... (when I sometimes
>>> look at the std error/CI of the data I get from collaborators ... well, it
>>> seems that absolute binding affinities have always taken with a grain of
>>> salt anyway)
>>>
>>> Best,
>>> Sebastian
>>>
>>> > On Sep 5, 2017, at 1:02 PM, Jason Rudy <jcr...@gmail.com> wrote:
>>> >
>>> > Thomas,
&

Re: [scikit-learn] combining datasets from different sources

2017-09-05 Thread Thomas Evangelidis
Thanks Jason, Sebastian and Maciek!

I believe, from all the suggestions, that the most feasible solution is to
look for experimental assays which overlap by at least two compounds, and
then adjust the binding affinities of one of them based on their difference
in both assays. Sebastian mentioned the simplest scenario, where the shift
for both compounds is 2 kcal/mol. However, he neglected to mention that the
ratio between the affinities of the two compounds in each assay also
matters. Specifically, the ratio Ka/Kb = -7/-9 = 0.78 in assay A but
-10/-12 = 0.83 in assay B. Ideally, that should also be taken into account
to select the right transformation function for the values from assay B. Is
anybody aware of a clever algorithm to select the right transformation
function for such a case? I am sure one exists.

The other approach would be to train different predictors from each assay
and then apply a data fusion technique (e.g. min rank). But that wouldn't
be as elegant.

@Maciek: To my understanding, the paper you cited addresses a classification
problem (actives, inactives) by implementing Random Forest Classifiers. My
case is a regression problem.


best,
Thomas


On 5 September 2017 at 20:33, Maciek Wójcikowski <mac...@wojcikowski.pl>
wrote:

> Hi Thomas and others,
>
> It also really depend on how many data points you have on each compound.
> If you had more than a few then there are few options. If you get two very
> distinct activities for one ligand. I'd discard such samples as ambiguous
> or decide on one of the assays/experiments (the one with lower error). The
> exact problem was faced by PDBbind creators, I'd also look there for
> details what they did with their activities.
>
> To follow up Sebastians suggestion: have you checked how different
> ranks/Z-scores you get? Check out the Kendall Tau.
>
> Anyhow, you could build local models for a specific experimental methods.
> In our recent publication on slightly different area (protein-ligand
> scoring function), we show that the RF build on one target is just slightly
> better than the RF build on many targets (we've used DUD-E database);
> Checkout the "horizontal" and "per-target" splits https://www.nature.com/
> articles/srep46710. Unfortunately, this may change for different models.
> Plus the molecular descriptors used, which we know nothing about.
>
> I hope that helped a bit.
>
> 
> Pozdrawiam,  |  Best regards,
> Maciek Wójcikowski
> mac...@wojcikowski.pl
>
> 2017-09-05 19:35 GMT+02:00 Sebastian Raschka <se.rasc...@gmail.com>:
>
>> Another approach would be to pose this as a "ranking" problem to predict
>> relative affinities rather than absolute affinities. E.g., if you have data
>> from one (or more) molecules that has/have been tested under 2 or more
>> experimental conditions, you can rank the other molecules accordingly or
>> normalize. E.g. if you observe that the binding affinity of molecule a is
>> -7 kcal/mol in assay A and -9 kcal/mol in assay 2, and say the binding
>> affinities of molecule B are -10 and -12 kcal/mol, respectively, that
>> should give you some information for normalizing the values from assay 2
>> (e.g., by adding 2 kcal/mol). Of course this is not a perfect solution and
>> might be error prone, but so are experimental assays ... (when I sometimes
>> look at the std error/CI of the data I get from collaborators ... well, it
>> seems that absolute binding affinities have always taken with a grain of
>> salt anyway)
>>
>> Best,
>> Sebastian
>>
>> > On Sep 5, 2017, at 1:02 PM, Jason Rudy <jcr...@gmail.com> wrote:
>> >
>> > Thomas,
>> >
>> > This is sort of related to the problem I did my M.S. thesis on years
>> ago: cross-platform normalization of gene expression data.  If you google
>> that term you'll find some papers.  The situation is somewhat different,
>> though, because with microarrays or RNA-seq you get thousands of data
>> points for each experiment, which makes it easier to estimate the batch
>> effect.  The principle is the similar, however.
>> >
>> > If I were in your situation, I would consider whether I have any of the
>> following advantages:
>> >
>> > 1. Some molecules that appear in multiple data sets
>> > 2. Detailed information about the different experimental conditions
>> > 3. Physical/chemical models of how experimental conditions influence
>> binding affinity
>> >
>> > If you have any of the above, you can potentially use them to improve
>> your estimates.  You could also consider using experiment ID as a
>> categorical predictor in a sufficiently general regression method.
>> &

[scikit-learn] combining datasets from different sources

2017-09-05 Thread Thomas Evangelidis
Greetings,

I am working on a problem that involves predicting the binding affinity of
small molecules to a receptor structure (a regression problem, not
classification). I have multiple small datasets of molecules with measured
binding affinities for a receptor, but each dataset was measured under
different experimental conditions, and therefore I cannot use them all
together as a training set. So, instead of using them individually, I was
wondering whether there is a method to combine them all into a super
training set. The first way I could think of is to convert the binding
affinities to Z-scores and then combine all the small datasets of
molecules. But this would be inaccurate because, firstly, the datasets
are very small (10-50 molecules each), and secondly, the range of binding
affinities differs in each experiment (some datasets contain really strong
binders, while others do not, etc.). Is there any other approach to
combining datasets with values coming from different sources? Maybe if
someone points me to the right reference, I can read it and understand
whether it is applicable to my case.

Thanks,
Thomas

-- 

==

Dr Thomas Evangelidis

Post-doctoral Researcher
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/2S049,
62500 Brno, Czech Republic

email: tev...@pharm.uoa.gr

  teva...@gmail.com


website: https://sites.google.com/site/thomasevangelidishomepage/
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] recommended feature selection method to train an MLPRegressor

2017-03-19 Thread Thomas Evangelidis
Which of the following methods would you recommend for selecting good
features (<=50) from a set of 534 features in order to train an
MLPRegressor? Please take into account that the datasets I use for training
are small.

http://scikit-learn.org/stable/modules/feature_selection.html

And please don't tell me to use a neural network that supports dropout or
any other algorithm for feature elimination. This is not applicable in my
case, because I want to know the best 50 features in order to append them
to other types of features that I am confident are important.
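(From that page, univariate selection is one of the lighter-weight options for a small dataset and returns an explicit list of kept features; a sketch with synthetic data, where the choice of f_regression and k=50 is just one possibility.)

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# synthetic stand-in for a small training set with 534 features
X, y = make_regression(n_samples=40, n_features=534, noise=0.1, random_state=0)

selector = SelectKBest(score_func=f_regression, k=50).fit(X, y)
X_top50 = selector.transform(X)
top50_idx = selector.get_support(indices=True)   # indices of the 50 selected features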


​cheers
Thomas​


-- 

==

Thomas Evangelidis

Research Specialist
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/1S081,
62500 Brno, Czech Republic

email: tev...@pharm.uoa.gr

  teva...@gmail.com


website: https://sites.google.com/site/thomasevangelidishomepage/
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] meta-estimator for multiple MLPRegressor

2017-01-10 Thread Thomas Evangelidis
Stuart,

I didn't see LASSO performing well, especially with the second type of
data. The alpha parameter probably needs adjustment with LassoCV.
I don't know if you have read my previous messages in this thread, so I
quote my MLPRegressor settings again.


MLPRegressor(random_state=random_state, max_iter=400, early_stopping=True,
validation_fraction=0.2, alpha=10, hidden_layer_sizes=(10,))


So to sum up, I must select the lowest possible values for the following
parameters:

* max_iter
* hidden_layer_sizes (lower than 10?)
* the number of features in my training data, i.e. the first type of data,
which consists of 60 features, is preferable to the second, which consists
of 456.

Is this correct?
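
To make this concrete, a grid search over these settings could look roughly
like the sketch below (the parameter grids are illustrative guesses, not
values anyone suggested):

from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

param_grid = {
    "hidden_layer_sizes": [(2,), (5,), (10,)],
    "max_iter": [50, 100, 200, 400],
    "alpha": [1, 10, 100],
}
search = GridSearchCV(
    MLPRegressor(early_stopping=True, validation_fraction=0.2, random_state=0),
    param_grid,
    scoring="r2",
    cv=5,
)
# search.fit(X, y)   # X, y: the 60-feature training data (hypothetical names)
# print(search.best_params_, search.best_score_)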




On 10 January 2017 at 19:47, Stuart Reynolds <stu...@stuartreynolds.net>
wrote:

> Thomas,
> Jacob's point is important -- it's not the number of features that's
> important, it's the number of free parameters. As the number of free
> parameters increases, the space of representable functions grows to the
> point where the cost function is minimized by having a single parameter
> explain each variable. This is true of many ML methods.
>
> In the case of decision trees, for example, you can allow each node (a
> free parameter) to hold exactly 1 training example, and see perfect training
> performance. In linear methods, you can perfectly fit training data by
> adding additional polynomial features (for feature x_i, add x^2_i, x^3_i,
> x^4_i, ...). Performance on unseen data will be terrible.
> MLP is no different -- adding more free parameters (more flexibility to
> precisely model the training data) may harm more than help when it comes to
> unseen data performance, especially when the number of examples is small.
>
> Early stopping may help overfitting, as might dropout.
>
> The likely reasons that LASSO and GBR performed well is that they're
> methods that explicit manage overfitting.
>
> Perform a grid search on:
>  - the number of hidden nodes in your MLP.
>  - the number of iterations
>
> For both, you may find that lowering values will improve performance on unseen
> data.
>
>
>
>
>
>
>
>
>
> On Tue, Jan 10, 2017 at 4:46 AM, Thomas Evangelidis <teva...@gmail.com>
> wrote:
>
>> Jacob,
>>
>> The features are not 6000. I train 2 MLPRegressors from two types of
>> data; both refer to the same dataset (35 molecules in total) but each
>> one contains a different type of information. The first data consist of 60
>> features. I tried 100 different random states and measured the average |R|
>> using leave-20%-out cross-validation. Below are the results from the
>> first data:
>>
>> RandomForestRegressor: |R|= 0.389018243545 +- 0.252891783658
>> LASSO: |R|= 0.247411754937 +- 0.232325286471
>> GradientBoostingRegressor: |R|= 0.324483769202 +- 0.211778410841
>> MLPRegressor: |R|= 0.540528696597 +- 0.255714448793
>>
>> The second type of data consists of 456 features. Below are the results
>> for these, too:
>>
>> RandomForestRegressor: |R|= 0.361562548904 +- 0.234872385318
>> LASSO: |R|= 3.27752711304e-16 +- 2.60800139195e-16
>> GradientBoostingRegressor: |R|= 0.328087138161 +- 0.229588427086
>> MLPRegressor: |R|= 0.455473342507 +- 0.24579081197
>>
>>
>> In the end I want to combine models created from these data types using a
>> meta-estimator (that was my original question). The combination with the
>> highest |R| (0.631851796403 +- 0.247911204514) was produced by an SVR
>> that combined the best MLPRegressor from data type 1 and the best
>> MLPRegressor from data type 2:
>>
>> On 10 January 2017 at 01:36, Jacob Schreiber <jmschreibe...@gmail.com>
>> wrote:
>>
>>> Even with a single layer with 10 neurons you're still trying to train
>>> over 6000 parameters using ~30 samples. Dropout is a concept common in
>>> neural networks, but doesn't appear to be in sklearn's implementation of
>>> MLPs. Early stopping based on validation performance isn't an "extra" step
>>> for reducing overfitting, it's basically a required step for neural
>>> networks. It seems like you have a validation sample of ~6 datapoints.. I'm
>>> still very skeptical of that giving you proper results for a complex model.
>>> Will this larger dataset be of exactly the same data? Just taking another
>>> unrelated dataset and showing that a MLP can learn it doesn't mean it will
>>> work for your specific data. Can you post the actual results from using
>>> LASSO, RandomForestRegressor, GradientBoostingRegressor, and MLP?
>>>
>>> On Mon, Jan 9, 2017 at 4:21 PM, Stuart Reynolds &l

Re: [scikit-learn] meta-estimator for multiple MLPRegressor

2017-01-09 Thread Thomas Evangelidis
Jacob & Sebastian,

I think the best way to find out whether my modeling approach works is to find
a larger dataset and split it into two parts: the first will be used as a
training/cross-validation set and the second as a test set, like in a real
case scenario.

Regarding the MLPRegressor regularization, below is my optimum setup:

MLPRegressor(random_state=random_state, max_iter=400, early_stopping=True,
> validation_fraction=0.2, alpha=10, hidden_layer_sizes=(10,))


This means only one hidden layer with at most 10 neurons, alpha=10 for L2
regularization, and early stopping to terminate training if the validation
score is not improving. I think this is a quite simple model. My final
predictor is an SVR that combines 2 MLPRegressors, each one trained with a
different type of input data.
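
For clarity, a sketch of how such a stacked predictor can be assembled from
out-of-fold predictions (X1, X2 and y are hypothetical names for the two
feature representations and the affinities; this is only one possible
implementation of the k-fold stacking discussed in this thread):

import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

def make_mlp(seed):
    return MLPRegressor(random_state=seed, max_iter=400, early_stopping=True,
                        validation_fraction=0.2, alpha=10,
                        hidden_layer_sizes=(10,))

# out-of-fold predictions of each base MLP become the SVR's two meta-features
# p1 = cross_val_predict(make_mlp(0), X1, y, cv=5)
# p2 = cross_val_predict(make_mlp(0), X2, y, cv=5)
# svr = SVR().fit(np.column_stack([p1, p2]), y)
# base1, base2 = make_mlp(0).fit(X1, y), make_mlp(0).fit(X2, y)
# y_new = svr.predict(np.column_stack([base1.predict(X1_new),
#                                      base2.predict(X2_new)]))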

@Sebastian
You have mentioned dropout again but I could not find it in the docs:
http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor

Maybe you are referring to another MLPRegressor implementation? A while ago I
saw another implementation of yours on GitHub. Can you clarify which one you
recommend and why?


Thank you both of you for your hints!

best
Thomas



-- 

==

Thomas Evangelidis

Research Specialist
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/1S081,
62500 Brno, Czech Republic

email: tev...@pharm.uoa.gr

  teva...@gmail.com


website: https://sites.google.com/site/thomasevangelidishomepage/
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] meta-estimator for multiple MLPRegressor

2017-01-08 Thread Thomas Evangelidis
Sebastian and Jacob,

Regarding overfitting: Lasso, Ridge regression and ElasticNet all perform
poorly on my data, whereas MLPRegressors are way superior. On another note,
the MLPRegressor class has some means to control overfitting, like the alpha
parameter for L2 regularization (maybe setting it to a high value?), the
number of neurons in the hidden layers (lowering hidden_layer_sizes?), or
even early_stopping=True. Wouldn't these be sufficient to stay on the safe
side?

Once more I want to highlight something I wrote previously that might have
been overlooked. The resulting MLPRegressors will be applied to new
datasets that *ARE VERY SIMILAR TO THE TRAINING DATA*. In other words, the
application of the models will be strictly confined to their applicability
domain. Wouldn't that be sufficient not to worry too much about model
overfitting?





On 8 January 2017 at 11:53, Sebastian Raschka <se.rasc...@gmail.com> wrote:

> Like to train an SVR to combine the predictions of the top 10%
> MLPRegressors using the same data that were used for training of the
> MLPRegressors? Wouldn't that lead to overfitting?
>
>
> It could, but you don't need to use the same data that you used for
> training to fit the meta estimator. Like it is commonly done in stacking
> with cross validation, you can train the mlps on training folds and pass
> predictions from a test fold to the meta estimator but then you'd have to
> retrain your mlps and it sounded like you wanted to avoid that.
>
> I am currently on mobile and only browsed through the thread briefly, but
> I agree with others that it may sound like your model(s) may have too much
> capacity for such a small dataset -- can be tricky to fit the parameters
> without overfitting. In any case, if you do the stacking, I'd probably
> insert a k-fold CV between the MLPs and the meta estimator. However, I'd
> definitely also recommend simpler models as an alternative.
>
> Best,
> Sebastian
>
> On Jan 7, 2017, at 4:36 PM, Thomas Evangelidis <teva...@gmail.com> wrote:
>
>
>
> On 7 January 2017 at 21:20, Sebastian Raschka <se.rasc...@gmail.com>
> wrote:
>
>> Hi, Thomas,
>> sorry, I overread the regression part …
>> This would be a bit trickier; I am not sure what a good strategy for
>> averaging regression outputs would be. However, if you just want to compute
>> the average, you could do something like
>> np.mean(np.asarray([r.predict(X) for r in list_of_your_mlps]), axis=0)
>>
>> However, it may be better to use stacking, and use the output of
>> r.predict(X) as meta features to train a model based on these?
>>
>
> ​Like to train an SVR to combine the predictions of the top 10%
> MLPRegressors using the same data that were used for training of the
> MLPRegressors? Wouldn't that lead to overfitting?
> ​
>
>
>>
>> Best,
>> Sebastian
>>
>> > On Jan 7, 2017, at 1:49 PM, Thomas Evangelidis <teva...@gmail.com>
>> wrote:
>> >
>> > Hi Sebastian,
>> >
>> > Thanks, I will try it in another classification problem I have.
>> However, this time I am using regressors not classifiers.
>> >
>> > On Jan 7, 2017 19:28, "Sebastian Raschka" <se.rasc...@gmail.com> wrote:
>> > Hi, Thomas,
>> >
>> > the VotingClassifier can combine different models per majority voting
>> amongst their predictions. Unfortunately, it refits the classifiers though
>> (after cloning them). I think we implemented it this way to make it
>> compatible to GridSearch and so forth. However, I have a version of the
>> estimator that you can initialize with “refit=False” to avoid refitting if
>> it helps. http://rasbt.github.io/mlxtend/user_guide/classifier/Ensembl
>> eVoteClassifier/#example-5-using-pre-fitted-classifiers
>> >
>> > Best,
>> > Sebastian
>> >
>> >
>> >
>> > > On Jan 7, 2017, at 11:15 AM, Thomas Evangelidis <teva...@gmail.com>
>> wrote:
>> > >
>> > > Greetings,
>> > >
>> > > I have trained many MLPRegressors using different random_state value
>> and estimated the R^2 using cross-validation. Now I want to combine the top
>> 10% of them in how to get more accurate predictions. Is there a
>> meta-estimator that can get as input a few precomputed MLPRegressors and
>> give consensus predictions? Can the BaggingRegressor do this job using
>> MLPRegressors as input?
>> > >
>> > > Thanks in advance for any hint.
>> > > Thomas
>> > >
>> > >
>> > > --
>> > > ===

Re: [scikit-learn] meta-estimator for multiple MLPRegressor

2017-01-07 Thread Thomas Evangelidis
On 8 January 2017 at 00:04, Jacob Schreiber <jmschreibe...@gmail.com> wrote:

> If you have such a small number of observations (with a much higher
> feature space) then why do you think you can accurately train not just a
> single MLP, but an ensemble of them without overfitting dramatically?
>
>
>
​Because the observations in the data set don't differ much between them​.
To be more specific, the data set consists of a congeneric series of
organic molecules and the observation is their binding strength to a target
protein. The idea was to train predictors that can predict the binding
strength of new molecules that belong to the same congeneric series.
Therefore special care is taken to apply the predictors to the right domain
of applicability. According to the literature, the same strategy has been
followed several times in the past. The novelty of my approach stems from
other factors that are irrelevant to this thread.


-- 

======

Thomas Evangelidis

Research Specialist
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/1S081,
62500 Brno, Czech Republic

email: tev...@pharm.uoa.gr

  teva...@gmail.com


website: https://sites.google.com/site/thomasevangelidishomepage/
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] meta-estimator for multiple MLPRegressor

2017-01-07 Thread Thomas Evangelidis
Hi Sebastian,

Thanks, I will try it in another classification problem I have. However,
this time I am using regressors not classifiers.

On Jan 7, 2017 19:28, "Sebastian Raschka" <se.rasc...@gmail.com> wrote:

> Hi, Thomas,
>
> the VotingClassifier can combine different models per majority voting
> amongst their predictions. Unfortunately, it refits the classifiers though
> (after cloning them). I think we implemented it this way to make it
> compatible to GridSearch and so forth. However, I have a version of the
> estimator that you can initialize with “refit=False” to avoid refitting if
> it helps. http://rasbt.github.io/mlxtend/user_guide/classifier/
> EnsembleVoteClassifier/#example-5-using-pre-fitted-classifiers
>
> Best,
> Sebastian
>
>
>
> > On Jan 7, 2017, at 11:15 AM, Thomas Evangelidis <teva...@gmail.com>
> wrote:
> >
> > Greetings,
> >
> > I have trained many MLPRegressors using different random_state value and
> estimated the R^2 using cross-validation. Now I want to combine the top 10%
> of them in how to get more accurate predictions. Is there a meta-estimator
> that can get as input a few precomputed MLPRegressors and give consensus
> predictions? Can the BaggingRegressor do this job using MLPRegressors as
> input?
> >
> > Thanks in advance for any hint.
> > Thomas
> >
> >
> > --
> > ==
> > Thomas Evangelidis
> > Research Specialist
> > CEITEC - Central European Institute of Technology
> > Masaryk University
> > Kamenice 5/A35/1S081,
> > 62500 Brno, Czech Republic
> >
> > email: tev...@pharm.uoa.gr
> >   teva...@gmail.com
> >
> > website: https://sites.google.com/site/thomasevangelidishomepage/
> >
> >
> > ___
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] meta-estimator for multiple MLPRegressor

2017-01-07 Thread Thomas Evangelidis
Greetings,

I have trained many MLPRegressors using different random_state values and
estimated the R^2 using cross-validation. Now I want to combine the top 10%
of them in order to get more accurate predictions. Is there a meta-estimator
that can take as input a few precomputed MLPRegressors and give consensus
predictions? Can the BaggingRegressor do this job using MLPRegressors as
input?

Thanks in advance for any hint.
Thomas


-- 

==

Thomas Evangelidis

Research Specialist
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/1S081,
62500 Brno, Czech Republic

email: tev...@pharm.uoa.gr

  teva...@gmail.com


website: https://sites.google.com/site/thomasevangelidishomepage/
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] combining arrays of features to train an MLP

2016-12-19 Thread Thomas Evangelidis
Thank you, these articles discuss ML applications of the types of
fingerprints I am working with! I will read them thoroughly to get some hints.

In the meantime I tried to eliminate some features using RandomizedLasso
and the performance escalated from R=0.067 using all 615 features to
R=0.524 using only the 15 top-ranked features. Naive question: does it make
sense to use RandomizedLasso to select the good features in order to train
an MLP? I had the impression that RandomizedLasso uses multivariate linear
regression to fit the observed values to the experimental ones and rank the
features.

Another question: this dataset consists of 31 observations. The Pearson's R
values that I reported above were calculated using cross-validation. Could
someone claim that they are inaccurate because the number of features used
for training the MLP is much larger than the number of observations?
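
For reference, the cross-validated R values are computed roughly as in the
sketch below; the exact splitting protocol shown here (repeated
leave-20%-out) is an illustrative assumption, and X, y are hypothetical
numpy arrays:

import numpy as np
from scipy.stats import pearsonr
from sklearn.model_selection import ShuffleSplit

def cv_pearson_r(model, X, y, n_splits=100, seed=0):
    # repeatedly fit on 80% of the data and correlate predictions with the
    # experimental values of the held-out 20%
    rs = []
    for train, test in ShuffleSplit(n_splits=n_splits, test_size=0.2,
                                    random_state=seed).split(X):
        model.fit(X[train], y[train])
        rs.append(pearsonr(y[test], model.predict(X[test]))[0])
    return np.mean(rs), np.std(rs)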


On 19 December 2016 at 23:42, Sebastian Raschka <se.rasc...@gmail.com>
wrote:

> Oh, sorry, I just noticed that I was in the wrong thread — meant to answer a
> different Thomas :P.
>
> Regarding the fingerprints; scikit-learn’s estimators expect feature
> vectors as samples, so you can’t have a 3D array … e.g., think of image
> classification: here you also enroll the n_pixels times m_pixels array into
> 1D arrays.
>
> The low performance can have multiple causes. In case dimensionality is an
> issue, I’d maybe try stronger regularization first, or feature selection.
> If you are working with molecular structures, and you have enough of them,
> maybe also consider alternative feature representations, e.g,. learning
> from the graphs directly:
>
> http://papers.nips.cc/paper/5954-convolutional-networks-
> on-graphs-for-learning-molecular-fingerprints.pdf
> http://pubs.acs.org/doi/abs/10.1021/ci400187y
>
> Best,
> Sebastian
>
>
> > On Dec 19, 2016, at 4:56 PM, Thomas Evangelidis <teva...@gmail.com>
> wrote:
> >
> > this means that both are feasible?
> >
> > On 19 December 2016 at 18:17, Sebastian Raschka <se.rasc...@gmail.com>
> wrote:
> > Thanks, Thomas, that makes sense! Will submit a PR then to update the
> docstring.
> >
> > Best,
> > Sebastian
> >
> >
> > > On Dec 19, 2016, at 11:06 AM, Thomas Evangelidis <teva...@gmail.com>
> wrote:
> > >
> > > ​​
> > > Greetings,
> > >
> > > My dataset consists of objects which are characterised by their
> structural features which are encoded into a so called "fingerprint" form.
> There are several different types of fingerprints, each one encapsulating
> different type of information. I want to combine two specific types of
> fingerprints to train a MLP regressor. The first fingerprint consists of a
> 2048 bit array of the form:
> > >
> > >  ​FP​1 = array([ 1.,  1.,  0., ...,  0.,  0.,  1.], dtype=float32)
> > >
> > > The second is a 60 float number array of the form:
> > >
> > > FP2 = array([ 2.77494618,  0.98973243,  0.34638652,  2.88303715,
> 1.31473857,
> > >-0.56627112,  4.78847547,  2.29587913, -0.6786228 ,  4.63391109,
> > >...
> > > 0.,  0.,  5.89652792,  0.,  0.
> ])
> > >
> > > At first I tried to fuse them into a single 1D array of 2048+60
> columns but the predictions of the MLP were worse than the 2 different MLP
> models trained from one of the 2 fingerprint types individually. My
> question: is there a more effective way to combine the 2 fingerprints in
> order to indicate that they represent different type of information?
> > >
> > > To this end, I tried to create a 2-row array (1st row 2048 elements
> and 2nd row 60 elements) but sklearn complained:
> > >
> > > ​mlp.fit(x_train,y_train)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_
> network/multilayer_perceptron.py", line 618, in fit
> > > return self._fit(X, y, incremental=False)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_
> network/multilayer_perceptron.py", line 330, in _fit
> > > X, y = self._validate_input(X, y, incremental)
> > >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_
> network/multilayer_perceptron.py", line 1264, in _validate_input
> > > multi_output=True, y_numeric=True)
> > >   File 
> > > "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py",
> line 521, in check_X_y
> > > ensure_min_features, warn_on_dtype, estimator)
> > >   File 
> > > "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validati

Re: [scikit-learn] combining arrays of features to train an MLP

2016-12-19 Thread Thomas Evangelidis
this means that both are feasible?

On 19 December 2016 at 18:17, Sebastian Raschka <se.rasc...@gmail.com>
wrote:

> Thanks, Thomas, that makes sense! Will submit a PR then to update the
> docstring.
>
> Best,
> Sebastian
>
>
> > On Dec 19, 2016, at 11:06 AM, Thomas Evangelidis <teva...@gmail.com>
> wrote:
> >
> > ​​
> > Greetings,
> >
> > My dataset consists of objects which are characterised by their
> structural features which are encoded into a so called "fingerprint" form.
> There are several different types of fingerprints, each one encapsulating
> different type of information. I want to combine two specific types of
> fingerprints to train a MLP regressor. The first fingerprint consists of a
> 2048 bit array of the form:
> >
> >  ​FP​1 = array([ 1.,  1.,  0., ...,  0.,  0.,  1.], dtype=float32)
> >
> > The second is a 60 float number array of the form:
> >
> > FP2 = array([ 2.77494618,  0.98973243,  0.34638652,  2.88303715,
> 1.31473857,
> >-0.56627112,  4.78847547,  2.29587913, -0.6786228 ,  4.63391109,
> >...
> > 0.,  0.,  5.89652792,  0.,  0.])
> >
> > At first I tried to fuse them into a single 1D array of 2048+60 columns
> but the predictions of the MLP were worse than the 2 different MLP models
> trained from one of the 2 fingerprint types individually. My question: is
> there a more effective way to combine the 2 fingerprints in order to
> indicate that they represent different type of information?
> >
> > To this end, I tried to create a 2-row array (1st row 2048 elements and
> 2nd row 60 elements) but sklearn complained:
> >
> > ​mlp.fit(x_train,y_train)
> >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_
> network/multilayer_perceptron.py", line 618, in fit
> > return self._fit(X, y, incremental=False)
> >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_
> network/multilayer_perceptron.py", line 330, in _fit
> > X, y = self._validate_input(X, y, incremental)
> >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_
> network/multilayer_perceptron.py", line 1264, in _validate_input
> > multi_output=True, y_numeric=True)
> >   File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py",
> line 521, in check_X_y
> > ensure_min_features, warn_on_dtype, estimator)
> >   File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py",
> line 402, in check_array
> > array = array.astype(np.float64)
> > ValueError: setting an array element with a sequence.
> > ​
> >
> > ​Then I tried to ​create for each object of the dataset a 2D array of
> size 2x2048, by adding 1998 zeros in the second row in order both rows to
> be of equal size. However sklearn complained again:
> >
> >
> > mlp.fit(x_train,y_train)
> >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_
> network/multilayer_perceptron.py", line 618, in fit
> > return self._fit(X, y, incremental=False)
> >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_
> network/multilayer_perceptron.py", line 330, in _fit
> > X, y = self._validate_input(X, y, incremental)
> >   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_
> network/multilayer_perceptron.py", line 1264, in _validate_input
> > multi_output=True, y_numeric=True)
> >   File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py",
> line 521, in check_X_y
> > ensure_min_features, warn_on_dtype, estimator)
> >   File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py",
> line 405, in check_array
> > % (array.ndim, estimator_name))
> > ValueError: Found array with dim 3. Estimator expected <= 2.
> >
> >
> > In another case of fingerprints, lets name them FP3 and FP4, I observed
> that the MLP regressor created using FP3 yields better results when trained
> and evaluated using logarithmically transformed experimental values (the
> values in y_train and y_test 1D arrays), while the MLP regressor created
> using FP4 yielded better results using the original experimental values. So
> my second question is: when combining both FP3 and FP4 into a single array
> is there any way to designate to the MLP that the features that correspond
> to FP3 must reproduce the logarithmic transform of the experimental values
> while the features of FP4 the original untransf

[scikit-learn] combining arrays of features to train an MLP

2016-12-19 Thread Thomas Evangelidis
​​
Greetings,

My dataset consists of objects that are characterised by their structural
features, which are encoded into a so-called "fingerprint" form. There are
several different types of fingerprints, each one encapsulating a different
type of information. I want to combine two specific types of fingerprints
to train an MLP regressor. The first fingerprint consists of a 2048-bit
array of the form:


> FP1 = array([ 1.,  1.,  0., ...,  0.,  0.,  1.], dtype=float32)


The second is a 60 float number array of the form:

FP2 = array([ 2.77494618,  0.98973243,  0.34638652,  2.88303715,
>  1.31473857,
>-0.56627112,  4.78847547,  2.29587913, -0.6786228 ,  4.63391109,
>...
> 0.,  0.,  5.89652792,  0.,  0.])


At first I tried to fuse them into a single 1D array of 2048+60 columns, but
the predictions of the MLP were worse than those of the 2 separate MLP models
trained on each of the 2 fingerprint types individually. My question: is
there a more effective way to combine the 2 fingerprints in order to
indicate that they represent different types of information?

To this end, I tried to create a 2-row array (1st row 2048 elements and 2nd
row 60 elements) but sklearn complained:

​mlp.fit(x_train,y_train)
>   File
> "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py",
> line 618, in fit
> return self._fit(X, y, incremental=False)
>   File
> "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py",
> line 330, in _fit
> X, y = self._validate_input(X, y, incremental)
>   File
> "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py",
> line 1264, in _validate_input
> multi_output=True, y_numeric=True)
>   File
> "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line
> 521, in check_X_y
> ensure_min_features, warn_on_dtype, estimator)
>   File
> "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line
> 402, in check_array
> array = array.astype(np.float64)
> ValueError: setting an array element with a sequence.
> ​


​Then I tried to ​create for each object of the dataset a 2D array of size
2x2048, by adding 1988 zeros to the second row so that both rows are of
equal size. However sklearn complained again:


mlp.fit(x_train,y_train)
>   File
> "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py",
> line 618, in fit
> return self._fit(X, y, incremental=False)
>   File
> "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py",
> line 330, in _fit
> X, y = self._validate_input(X, y, incremental)
>   File
> "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py",
> line 1264, in _validate_input
> multi_output=True, y_numeric=True)
>   File
> "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line
> 521, in check_X_y
> ensure_min_features, warn_on_dtype, estimator)
>   File
> "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line
> 405, in check_array
> % (array.ndim, estimator_name))
> ValueError: Found array with dim 3. Estimator expected <= 2.



In another case of fingerprints, let's name them FP3 and FP4, I observed
that the MLP regressor created using FP3 yields better results when trained
and evaluated using logarithmically transformed experimental values (the
values in the y_train and y_test 1D arrays), while the MLP regressor created
using FP4 yields better results using the original experimental values. So
my second question is: when combining both FP3 and FP4 into a single
array, is there any way to indicate to the MLP that the features that
correspond to FP3 must reproduce the logarithmic transform of the
experimental values while the features of FP4 must reproduce the original
untransformed experimental values?


I would greatly appreciate any advice on any of my 2 queries.
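
Regarding the first question, a sketch of one way to keep the two blocks
distinguishable is to preprocess each block separately before a single
estimator. This assumes a newer scikit-learn that provides
ColumnTransformer, and the column indices and step names are illustrative:

from sklearn.compose import ColumnTransformer
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

bit_cols = list(range(0, 2048))       # FP1: 2048-bit fingerprint block
float_cols = list(range(2048, 2108))  # FP2: 60 float descriptors

pre = ColumnTransformer([
    ("fp1", "passthrough", bit_cols),       # leave the bit vector as-is
    ("fp2", StandardScaler(), float_cols),  # standardize the float block
])
model = make_pipeline(pre, MLPRegressor(hidden_layer_sizes=(10,), alpha=10))
# model.fit(X, y)   # X: (n_samples, 2108) = [FP1 | FP2], y: targets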
Thomas









-- 

==

Thomas Evangelidis

Research Specialist
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/1S081,
62500 Brno, Czech Republic

email: tev...@pharm.uoa.gr

  teva...@gmail.com


website: https://sites.google.com/site/thomasevangelidishomepage/
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] NuSVC and ValueError: specified nu is infeasible

2016-12-08 Thread Thomas Evangelidis
It finally works with nu=0.01 or less and the predictions are good. Is
there a problem with that?
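
For context, this is consistent with the standard nu-SVC feasibility bound
from the nu-SVM literature (an assumption, not something stated in this
thread): nu cannot exceed 2*min(n_pos, n_neg)/n_total. A minimal sketch:

# feasibility bound for nu in a binary nu-SVC (assumption from the
# nu-SVM literature, not from this thread)
n_pos, n_neg = 48, 1230
nu_max = 2.0 * min(n_pos, n_neg) / (n_pos + n_neg)
print(nu_max)   # ~0.075: nu=0.1 and above is infeasible, nu<=0.075 is allowed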

On 8 December 2016 at 12:57, Thomas Evangelidis <teva...@gmail.com> wrote:

>
>
>>
>> @Thomas
>> I still think the optimization problem is not feasible due to your data.
>> Have you tried balancing the dataset as I mentioned in your other
>> question regarding the
>> ​​
>> MLPClassifier?
>>
>>
>>
> ​Hi Piotr,
>
> I had tried all the balancing algorithms in the link that you stated, but
> the only one that really offered some improvement was the SMOTE
> over-sampling of positive observations. The original dataset contained ​24
> positive and 1230 negative but after SMOTE I doubled the positive to 48.
> Reduction of the negative observations led to poor predictions, at least
> using random forests. I haven't tried it with
> ​
> MLPClassifier yet though.
>
>
>
>


-- 

==

Thomas Evangelidis

Research Specialist
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/1S081,
62500 Brno, Czech Republic

email: tev...@pharm.uoa.gr

  teva...@gmail.com


website: https://sites.google.com/site/thomasevangelidishomepage/
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] no positive predictions by neural_network.MLPClassifier

2016-12-08 Thread Thomas Evangelidis
Hello Sebastian,

I did normalization of my training set and used the same mean and stdev
values to normalize my test set, instead of calculating the means and stdevs
from the test set. I did that because my training set size is finite and
the value of each feature is a descriptor that is characteristic of the 3D
shape of the observation. The test set would definitely have different mean
and stdev values from the training set, and if I had used them to normalize
it then I believe I would have distorted the original descriptor values.
Anyway, after this normalization I no longer get 0 positive predictions
from the MLPClassifier.
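
A minimal sketch of that scaling scheme (variable names are hypothetical):

from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

scaler = StandardScaler()
# X_train_s = scaler.fit_transform(X_train)  # mean/stdev learned on training set only
# X_test_s = scaler.transform(X_test)        # the same values reused on the test set
# clf = MLPClassifier(random_state=0).fit(X_train_s, y_train)
# y_pred = clf.predict(X_test_s)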

I still don't understand your second suggestion. I cannot find any
parameter to control the epoch or measure the cost in
sklearn.neural_network.MLPClassifier. Do you suggest using your own classes
from GitHub instead?
Besides that, my goal is not to make one MLPClassifier for a specific
training set, but rather to write a program that can take various training
sets as input each time and train a neural network that will classify a
given test set. Therefore, unless I misunderstood your points, working
with 3 arbitrary random_state values on my current training set in order to
find one value that yields good predictions won't solve my problem.

best
Thomas



On 8 December 2016 at 01:19, Sebastian Raschka <se.rasc...@gmail.com> wrote:

> Hi, Thomas,
> we had a related thread on the email list some time ago, let me post it
> for reference further below. Regarding your question, I think you may want
> make sure that you standardized the features (which makes the learning
> generally it less sensitive to learning rate and random weight
> initialization). However, even then, I would try at least 1-3 different
> random seeds and look at the cost vs time — what can happen is that you
> land in different minima depending on the weight initialization as
> demonstrated in the example below (in MLPs you have the problem of a
> complex cost surface).
>
> Best,
> Sebastian
>
> The default is set to 100 units in the hidden layer, but theoretically, it
> should work with 2 hidden logistic units (I think that’s the typical
> textbook/class example). I think what happens is that it gets stuck in
> local minima depending on the random weight initialization. E.g., the
> following works just fine:
>
> from sklearn.neural_network import MLPClassifier
> X = [[0, 0], [0, 1], [1, 0], [1, 1]]
> y = [0, 1, 1, 0]
> clf = MLPClassifier(solver='lbfgs',
> activation='logistic',
> alpha=0.0,
> hidden_layer_sizes=(2,),
> learning_rate_init=0.1,
> max_iter=1000,
> random_state=20)
> clf.fit(X, y)
> res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]])
> print(res)
> print(clf.loss_)
>
>
> but changing the random seed to 1 leads to:
>
> [0 1 1 1]
> 0.34660921283
>
> For comparison, I used a more vanilla MLP (1 hidden layer with 2 units and
> logistic activation as well; https://github.com/
> rasbt/python-machine-learning-book/blob/master/code/ch12/ch12.ipynb),
> essentially resulting in the same problem:
>
> On Dec 7, 2016, at 6:45 PM, Thomas Evangelidis <teva...@gmail.com> wrote:
>
> I tried the sklearn.neural_network.MLPClassifier with the default
> parameters using the input data I quoted in my previous post about
> Nu-Support Vector Classifier. The predictions are great but the problem is
> that sometimes when I rerun the MLPClassifier it predicts no positive
> observations (class 1). I have noticed that this can be controlled by the
> random_state parameter, e.g. MLPClassifier(random_state=0) gives always no
> positive predictions. My question is how can I chose the right random_state
> value in a real blind test case?
>
> thanks in advance
> Thomas
>
>
> --
> ==
> Thomas Evangelidis
> Research Specialist
> CEITEC - Central European Institute of Technology
> Masaryk University
> Kamenice 5/A35/1S081,
> 62500 Brno, Czech Republic
>
> email: tev...@pharm.uoa.gr
>   teva...@gmail.com
>
> website: https://sites.google.com/site/thomasevangelidishomepage/
>
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>


-- 

==

Thomas Evange

Re: [scikit-learn] NuSVC and ValueError: specified nu is infeasible

2016-12-08 Thread Thomas Evangelidis
>
>
> @Thomas
> I still think the optimization problem is not feasible due to your data.
> Have you tried balancing the dataset as I mentioned in your other question
> regarding the
> ​​
> MLPClassifier?
>
>
>
​Hi Piotr,

I had tried all the balancing algorithms in the link that you posted, but
the only one that really offered some improvement was SMOTE over-sampling
of the positive observations. The original dataset contained 24 positive
and 1230 negative observations, but after SMOTE I doubled the positives to
48. Reduction of the negative observations led to poor predictions, at least
using random forests. I haven't tried it with the MLPClassifier yet though.
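
For reference, the SMOTE step looks roughly like the sketch below; it uses
the third-party imbalanced-learn package (not scikit-learn itself), and the
sampling_strategy value shown is an illustrative assumption:

from imblearn.over_sampling import SMOTE

# double the minority class (24 -> 48 positives against 1230 negatives)
smote = SMOTE(sampling_strategy={1: 48}, random_state=0)
# X_res, y_res = smote.fit_resample(X, y)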
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] NuSVC and ValueError: specified nu is infeasible

2016-12-08 Thread Thomas Evangelidis
Hi Piotr,

the SVC performs quite well, slightly better than random forests on the
same data. By training error do you mean this?

clf = svm.SVC(probability=True)
clf.fit(train_list_resampled3, train_activity_list_resampled3)
# note: score() returns the mean accuracy, i.e. 1 - the training error rate
print "training error=", clf.score(train_list_resampled3,
train_activity_list_resampled3)

If this is what you mean by "skip the sample_weights":
clf = svm.NuSVC(probability=True)
clf.fit(train_list_resampled3, train_activity_list_resampled3,
sample_weight=None)

then no, it does not converge. After all "sample_weight=None" is the
default value.

I am out of ideas about what may be the problem.

Thomas


On 8 December 2016 at 08:56, Piotr Bialecki <piotr.biale...@hotmail.de>
wrote:

> Hi Thomas,
>
> the doc says that nu gives an upper bound on the fraction of training
> errors and a lower bound on the fraction of support vectors.
> http://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html
>
> Therefore, it acts as a hard bound on the allowed misclassification on
> your dataset.
>
> To me it seems as if the error bound is not feasible.
> How well did the SVC perform? What was your training error there?
>
> Will the NuSVC converge when you skip the sample_weights?
>
>
> Greets,
> Piotr
>
>
> On 08.12.2016 00:07, Thomas Evangelidis wrote:
>
> Greetings,
>
> I want  to  use the Nu-Support Vector Classifier with the following input
> data:
>
> X= [
> array([  3.90387012,   1.60732281,  -0.33315799,   4.02770896,
>  1.82337731,  -0.74007214,   6.75989219,   3.68538903,
>  ..
>  0.,  11.64276776,   0.,   0.]),
> array([  3.36856769e+00,   1.48705816e+00,   4.28566992e-01,
>  3.35622071e+00,   1.64046508e+00,   5.66879661e-01,
>  .
>  4.25335335e+00,   1.96508829e+00,   8.63453394e-06]),
>  array([  3.74986249e+00,   1.69060713e+00,  -5.09921270e-01,
>  3.76320781e+00,   1.67664455e+00,  -6.21126735e-01,
>  ..
>  4.16700259e+00,   1.88688784e+00,   7.34729942e-06]),
> ...
> ]
>
> and
>
> Y=  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0, 0, 0]
>
>
>> ​Each array of X contains 60 numbers and the dataset consists of 48
>> positive and 1230 negative observations. When I train an svm.SVC()
>> classifier I get quite good predictions, but wit the ​svm.NuSVC​() I keep
>> getting the following error no matter which value of nu in [0.1, ..., 0.9,
>> 0.99, 0.999, 0.] I try:
>> /usr/local/lib/python2.7/dist-packages/sklearn/svm/base.pyc in fit(self,
>> X, y, sample_weight)
>> 187
>> 188 seed = rnd.randint(np.iinfo('i').max)
>> --> 189 fit(X, y, sample_weight, solver_type, kernel,
>> random_seed=seed)
>> 190 # see comment on the other call to np.iinfo in this file
>> 191
>> /usr/local/lib/python2.7/dist-packages/sklearn/svm/base.pyc in
>> _dense_fit(self, X, y, sample_weight, solver_type, kernel, random_seed)
>> 254 cache_size=self.cache_size, coef0=self.coef0,
>> 255 gamma=self._gamma, epsilon=self.epsilon,
>> --> 256 max_iter=self.max_iter, random_seed=random_seed)
>> 257
>> 258 self._warn_from_fit_status()
>> /usr/local/lib/python2.7/dist-packages/sklearn/svm/libsvm.so in
>> sklearn.svm.libsvm.fit (sklearn/svm/libsvm.c:2501)()
>> ValueError: specified nu is infeasible
>
>
> ​
> ​Does anyone know what might be wrong? Could it be the input data?
>
> thanks in advance for any advice
> Thomas​
>
>
>
> --
>
> ==
>
> Thomas Evangelidis
>
> Research Specialist
> CEITEC - Central European Institute of Technology
> Masaryk University
> Kamenice 5/A35/1S081,
> 62500 Brno, Czech Republic
>
> email: tev...@pharm.uoa.gr
>
>   teva...@gmail.com
>
>
> website: https://sites.google.com/site/thomasevangelidishomepage/
>
>
>
> ___
> scikit-learn mailing 
> listscikit-learn@python.orghttps://mail.python.org/mailman/listinfo/scikit-learn
>
>
>
> ___
> scikit-learn mai

[scikit-learn] no positive predictions by neural_network.MLPClassifier

2016-12-07 Thread Thomas Evangelidis
I tried the sklearn.neural_network.MLPClassifier with the default
parameters using the input data I quoted in my previous post about the
Nu-Support Vector Classifier. The predictions are great, but the problem is
that sometimes when I rerun the MLPClassifier it predicts no positive
observations (class 1). I have noticed that this can be controlled by the
random_state parameter, e.g. MLPClassifier(random_state=0) always gives no
positive predictions. My question is: how can I choose the right random_state
value in a real blind test case?

thanks in advance
Thomas


-- 

==

Thomas Evangelidis

Research Specialist
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/1S081,
62500 Brno, Czech Republic

email: tev...@pharm.uoa.gr

  teva...@gmail.com


website: https://sites.google.com/site/thomasevangelidishomepage/
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] NuSVC and ValueError: specified nu is infeasible

2016-12-07 Thread Thomas Evangelidis
Greetings,

I want  to  use the Nu-Support Vector Classifier with the following input
data:

X= [
array([  3.90387012,   1.60732281,  -0.33315799,   4.02770896,
 1.82337731,  -0.74007214,   6.75989219,   3.68538903,
 ..
 0.,  11.64276776,   0.,   0.]),
array([  3.36856769e+00,   1.48705816e+00,   4.28566992e-01,
 3.35622071e+00,   1.64046508e+00,   5.66879661e-01,
 .
 4.25335335e+00,   1.96508829e+00,   8.63453394e-06]),
 array([  3.74986249e+00,   1.69060713e+00,  -5.09921270e-01,
 3.76320781e+00,   1.67664455e+00,  -6.21126735e-01,
 ..
 4.16700259e+00,   1.88688784e+00,   7.34729942e-06]),
...
]

and

Y=  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0]


> Each array of X contains 60 numbers and the dataset consists of 48
> positive and 1230 negative observations. When I train an svm.SVC()
> classifier I get quite good predictions, but with the svm.NuSVC() I keep
> getting the following error no matter which value of nu in [0.1, ..., 0.9,
> 0.99, 0.999, 0.] I try:
> /usr/local/lib/python2.7/dist-packages/sklearn/svm/base.pyc in fit(self,
> X, y, sample_weight)
> 187
> 188 seed = rnd.randint(np.iinfo('i').max)
> --> 189 fit(X, y, sample_weight, solver_type, kernel,
> random_seed=seed)
> 190 # see comment on the other call to np.iinfo in this file
> 191
> /usr/local/lib/python2.7/dist-packages/sklearn/svm/base.pyc in
> _dense_fit(self, X, y, sample_weight, solver_type, kernel, random_seed)
> 254 cache_size=self.cache_size, coef0=self.coef0,
> 255 gamma=self._gamma, epsilon=self.epsilon,
> --> 256 max_iter=self.max_iter, random_seed=random_seed)
> 257
> 258 self._warn_from_fit_status()
> /usr/local/lib/python2.7/dist-packages/sklearn/svm/libsvm.so in
> sklearn.svm.libsvm.fit (sklearn/svm/libsvm.c:2501)()
> ValueError: specified nu is infeasible


​
​Does anyone know what might be wrong? Could it be the input data?

thanks in advance for any advice
Thomas​



-- 

==========

Thomas Evangelidis

Research Specialist
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/1S081,
62500 Brno, Czech Republic

email: tev...@pharm.uoa.gr

  teva...@gmail.com


website: https://sites.google.com/site/thomasevangelidishomepage/
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] random forests using grouped data

2016-12-01 Thread Thomas Evangelidis
Sorry, the previous email was incomplete. Below is what the grouped data
look like:


Group1:
score1 = [0.56, 0.34, 0.42, 0.12, 0.08, 0.21, ...]
score2 = [0.34, 0.27, 0.24, 0.05, 0.13, 0,14, ...]
y=[1,1,1,0,0,0, ...]  # 1 indicates "active" and 0 "inactive"

Group2:
score1 = [0.34, 0.38, 0.48, 0.18, 0.12, 0.19, ...]
score2 = [0.28, 0.41, 0.34, 0.13, 0.09, 0,1, ...]
y=[1,1,1,0,0,0, ...]  # 1 indicates "active" and 0 "inactive"

​..
Group24​:
score1 = [0.67, 0.54, 0.59, 0.23, 0.24, 0.08, ...]
score2 = [0.41, 0.31, 0.28, 0.23, 0.18, 0,22, ...]
y=[1,1,1,0,0,0, ...]  # 1 indicates "active" and 0 "inactive"
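
For reference, the group-aware cross-validation that the documentation does
cover looks roughly like this sketch (GroupKFold keeps whole groups out of
the test folds; X, y and groups are hypothetical arrays built from the 24
groups above):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

# X = np.column_stack([score1_all, score2_all])   # the two score features
# y = np.array(labels_all)                        # 1 = active, 0 = inactive
# groups = np.array(group_ids_all)                # e.g. 1..24, one id per row
# scores = cross_val_score(RandomForestClassifier(n_estimators=500),
#                          X, y, groups=groups, cv=GroupKFold(n_splits=5))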


On 1 December 2016 at 14:01, Thomas Evangelidis <teva...@gmail.com> wrote:

> Greetings
>
> ​I have grouped data which are divided into actives and inactives. The
> features are two different types of normalized scores (0-1), where the
> higher the score, the more likely an observation is to be an "active". My
> data look like this:
>
>
> Group1:
> score1 = [0.56, 0.34, 0.42, 0.12, 0.08, 0.21, ...]
> score2 = [
> y=[1,1,1,0,0,0, ...]
>
> Group2:
> ​score1 = [0
> score2 = [
> y=[1,1,1,1,1]​
>
> ​..
> Group24​:
> ​score1 = [0
> score2 = [
> y=[1,1,1,1,1]​
>
>
> I searched in the documentation about the treatment of grouped data, but the
> only thing I found was how to do cross-validation. My question is whether
> there is any special algorithm that creates random forests from this type
> of grouped data.
>
> thanks in advance
> Thomas
>
>
>
> --
>
> ==
>
> Thomas Evangelidis
>
> Research Specialist
> CEITEC - Central European Institute of Technology
> Masaryk University
> Kamenice 5/A35/1S081,
> 62500 Brno, Czech Republic
>
> email: tev...@pharm.uoa.gr
>
>   teva...@gmail.com
>
>
> website: https://sites.google.com/site/thomasevangelidishomepage/
>
>


-- 

==

Thomas Evangelidis

Research Specialist
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/1S081,
62500 Brno, Czech Republic

email: tev...@pharm.uoa.gr

  teva...@gmail.com


website: https://sites.google.com/site/thomasevangelidishomepage/
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] random forests using grouped data

2016-12-01 Thread Thomas Evangelidis
Greetings

​I have grouped data which are divided into actives and inactives. The
features are two different types of normalized scores (0-1), where the
higher the score, the more likely an observation is to be an "active". My
data look like this:


Group1:
score1 = [0.56, 0.34, 0.42, 0.12, 0.08, 0.21, ...]
score2 = [
y=[1,1,1,0,0,0, ...]

Group2:
​score1 = [0
score2 = [
y=[1,1,1,1,1]​

​..
Group24​:
​score1 = [0
score2 = [
y=[1,1,1,1,1]​


I searched in the documentation about the treatment of grouped data, but the
only thing I found was how to do cross-validation. My question is whether
there is any special algorithm that creates random forests from this type
of grouped data.

thanks in advance
Thomas



-- 

======

Thomas Evangelidis

Research Specialist
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/1S081,
62500 Brno, Czech Republic

email: tev...@pharm.uoa.gr

  teva...@gmail.com


website: https://sites.google.com/site/thomasevangelidishomepage/
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] suggested classification algorithm

2016-11-14 Thread Thomas Evangelidis
Greetings,

I want to design a program that can deal with classification problems of
the same type, where the number of positive observations is small but the
number of negative observations is much larger. Speaking with numbers, the
number of positive observations could usually range between 2 and 20, and
the number of negative observations could be at least 30 times larger. The
number of features could be between 2 and 20 too, but that could be reduced
using feature selection and elimination algorithms. I've read in the
documentation that some algorithms like the SVM are still effective when the
number of dimensions is greater than the number of samples, but I am not
sure whether they are suitable for my case. Moreover, according to this
figure, Nearest Neighbors is the best and the second best is the RBF SVM:

http://scikit-learn.org/stable/_images/sphx_glr_plot_classifier_comparison_001.png

However, I assume that Nearest Neighbors would not be effective in my case
where the number of positive observations is very low. For these reasons I
would like to know your expert opinion about which classification algorithm
I should try first.
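
For concreteness, one commonly tried baseline for this kind of imbalance is
a class-weighted RBF SVM evaluated with stratified cross-validation; the
sketch below is only an illustration (variable names and parameters are
assumptions):

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

clf = make_pipeline(StandardScaler(),
                    SVC(kernel="rbf", class_weight="balanced"))
# cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
# scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")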

thanks in advance
Thomas


-- 

==

Thomas Evangelidis

Research Specialist
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/1S081,
62500 Brno, Czech Republic

email: tev...@pharm.uoa.gr

  teva...@gmail.com


website: https://sites.google.com/site/thomasevangelidishomepage/
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn