Re: [scikit-learn] sample_weights in RandomForestRegressor

2018-07-16 Thread Brown J.B. via scikit-learn
Dear Thomas,

Your strategy for model development is built on the assumption that the SAR
(structure-activity relationship) forms a continuous manifold in the space
of your compound descriptors.
However, SARs for many proteins in drug discovery or chemical biology are
not continuous (consider kinase inhibitors).

Therefore, you must make an assessment of the training data SAR to check
for the prevalence of activity cliffs.
There are at least two ways you can go about this:
  (1) Simply compute all pairwise similarities by your choice of
descriptor+metric, then identify where there are pairs (e.g.,
MACCS-Tanimoto > 0.7) with large activity differences (e.g., K_i or IC50
difference of more than 10/50/100-fold; again, the biology of your problem
determines the right values).
  (2) Perform many repetitions of train-test splitting on the 709 reference
molecules, look at the distribution of your evaluation metric, and see if
there is a limit in your ability to predict. If you are hitting a wall in
terms of predictability (metric performance), it's a likely sign there is
an activity cliff, and no amount of machine learning is going to be able to
overcome this. Further, trace the predictability of individual compounds to
identify those that are consistently predicted wrong.  If you combine this
with analysis (1), you can know exactly which of your chemistries are
unmodelable.
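
As a rough illustration of analysis (1), here is a minimal sketch that
assumes binary fingerprints (e.g., MACCS keys) are already available as a
0/1 NumPy array and that activities are given as -log10 values (pKi or
pIC50). The function names, the toy data, and the default cutoffs (0.7
similarity, 100-fold difference) are illustrative assumptions only; the
biology of the problem determines the right values.

```python
import numpy as np

def tanimoto_matrix(fps):
    """Pairwise Tanimoto similarity for a binary fingerprint matrix (n_mols x n_bits)."""
    fps = np.asarray(fps, dtype=float)
    common = fps @ fps.T                      # bits set in both molecules
    counts = fps.sum(axis=1)                  # bits set per molecule
    union = counts[:, None] + counts[None, :] - common
    # Two all-zero fingerprints have an empty union; define their similarity as 0.
    return np.divide(common, union, out=np.zeros_like(common), where=union > 0)

def find_activity_cliffs(fps, p_activity, sim_cutoff=0.7, fold_cutoff=100.0):
    """Return (i, j, similarity, log-unit difference) for similar pairs with large activity gaps.

    p_activity holds -log10 values, so a 100-fold difference equals 2 log units.
    """
    sim = tanimoto_matrix(fps)
    p_activity = np.asarray(p_activity, dtype=float)
    diff = np.abs(p_activity[:, None] - p_activity[None, :])
    i, j = np.triu_indices(len(p_activity), k=1)   # visit each unordered pair once
    mask = (sim[i, j] > sim_cutoff) & (diff[i, j] >= np.log10(fold_cutoff))
    return [(int(a), int(b), float(sim[a, b]), float(diff[a, b]))
            for a, b in zip(i[mask], j[mask])]

# Toy data (hypothetical): molecules 0 and 1 share most bits but differ 1000-fold.
fps = np.array([
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
])
p_activity = np.array([9.0, 6.0, 6.1])
cliffs = find_activity_cliffs(fps, p_activity)
print(cliffs)
```

On the toy data only the pair (0, 1) is flagged. For real compounds the
fingerprints would come from a cheminformatics toolkit rather than being
typed in by hand.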

If you find that there are no activity cliffs in your dataset, then your
application of the assumption that chemical similarity implies biological
endpoint similarity will hold, and your experimental design is validated
because of the presence of a continuous manifold.
However, if you do have activity cliffs, then as awesome as sklearn is, it
still cannot make the computational chemistry any better.
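
The repeated train-test splitting in (2) could be sketched roughly as
follows. This is only an illustration on synthetic data: the function name,
the choice of R^2 as the evaluation metric, the number of repetitions, and
the forest settings are all assumptions, not recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import ShuffleSplit

def repeated_split_profile(X, y, n_repeats=20, test_size=0.2, seed=0):
    """Return per-split R^2 scores and each sample's mean absolute error when held out."""
    splitter = ShuffleSplit(n_splits=n_repeats, test_size=test_size, random_state=seed)
    scores = []
    err_sum = np.zeros(len(y))
    n_tested = np.zeros(len(y))
    for train_idx, test_idx in splitter.split(X):
        model = RandomForestRegressor(n_estimators=50, random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        scores.append(r2_score(y[test_idx], pred))
        err_sum[test_idx] += np.abs(pred - y[test_idx])
        n_tested[test_idx] += 1
    # Samples that never landed in a test split keep NaN as their error.
    mean_err = np.full(len(y), np.nan)
    np.divide(err_sum, n_tested, out=mean_err, where=n_tested > 0)
    return np.array(scores), mean_err

# Synthetic stand-in for descriptors and activities (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=120)

scores, mean_err = repeated_split_profile(X, y, n_repeats=10)
print("R^2 over splits: mean=%.3f, sd=%.3f" % (scores.mean(), scores.std()))

# Compounds with the largest held-out error are candidates for cliff partners.
worst = np.argsort(-np.nan_to_num(mean_err, nan=0.0))[:5]
print("worst-predicted samples:", worst)
```

The spread of `scores` shows whether performance plateaus, and `worst` can
be cross-checked against the pairs found in analysis (1).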

Hope this helps you contextualize your work. Don't hesitate to contact me
if I can be of assistance.

Sincerely,
J.B. Brown
Kyoto University Graduate School of Medicine


2018-07-16 8:51 GMT+09:00 Thomas Evangelidis :

> Hello,
>
> I am kind of confused about the use of the sample_weight parameter in the
> fit() function of RandomForestRegressor. Here is my problem:
>
> I am trying to predict the binding affinity of small molecules to a
> protein. I have a training set of 709 molecules and a blind test set of 180
> molecules. I want to find the features that are most important for the
> correct prediction of the binding affinity of those 180 molecules of my
> blind test set.  My rationale is that if I give more emphasis to the
> training set molecules that are similar to the test set, then I will get
> higher importances for those features that have higher predictive ability
> for this specific blind test set of 180 molecules. To this end, I weighted
> the 709 training set molecules by their maximum similarity to the 180
> molecules, selected only those features with high importance, and trained a
> new RF with all 709 molecules. I got some results but I am not satisfied.
> Is this the right way to use sample_weight in RF? I would appreciate any
> advice or a suggested workflow.
>
>
> --
>
> ==
>
> Dr Thomas Evangelidis
>
> Post-doctoral Researcher
> CEITEC - Central European Institute of Technology
> Masaryk University
> Kamenice 5/A35/2S049,
> 62500 Brno, Czech Republic
>
> email: tev...@pharm.uoa.gr
>
>   teva...@gmail.com
>
>
> website: https://sites.google.com/site/thomasevangelidishomepage/
>
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>


[scikit-learn] sample_weights in RandomForestRegressor

2018-07-15 Thread Thomas Evangelidis
Hello,

I am kind of confused about the use of the sample_weight parameter in the
fit() function of RandomForestRegressor. Here is my problem:

I am trying to predict the binding affinity of small molecules to a
protein. I have a training set of 709 molecules and a blind test set of 180
molecules. I want to find the features that are most important for the
correct prediction of the binding affinity of those 180 molecules of my
blind test set.  My rationale is that if I give more emphasis to the
training set molecules that are similar to the test set, then I will get
higher importances for those features that have higher predictive ability
for this specific blind test set of 180 molecules. To this end, I weighted
the 709 training set molecules by their maximum similarity to the 180
molecules, selected only those features with high importance, and trained a
new RF with all 709 molecules. I got some results but I am not satisfied.
Is this the right way to use sample_weight in RF? I would appreciate any
advice or a suggested workflow.
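
For concreteness, the workflow described above could be sketched roughly as
below. Everything here is a stand-in: the descriptors and affinities are
synthetic, `max_sim` is a random placeholder for each training molecule's
maximum similarity to the test set, and the "above mean importance" cutoff
is an arbitrary assumption. Only the `sample_weight` argument of `fit()`
and the `feature_importances_` attribute are actual scikit-learn API.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-ins (hypothetical): 709 training molecules, 20 descriptors.
rng = np.random.default_rng(0)
X_train = rng.random((709, 20))
y_train = 2.0 * X_train[:, 0] + X_train[:, 1] + rng.normal(scale=0.1, size=709)

# Placeholder for each training molecule's maximum similarity to the
# 180 test molecules; real values would come from a similarity calculation.
max_sim = rng.random(709)

# Step 1: fit a forest in which similar molecules count more.
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train, sample_weight=max_sim)

# Step 2: keep features whose importance exceeds the mean (arbitrary cutoff).
importances = rf.feature_importances_
selected = np.flatnonzero(importances > importances.mean())

# Step 3: retrain on all 709 molecules using only the selected features.
rf_final = RandomForestRegressor(n_estimators=100, random_state=0)
rf_final.fit(X_train[:, selected], y_train)

print("selected feature indices:", selected)
```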


-- 

==

Dr Thomas Evangelidis

Post-doctoral Researcher
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/2S049,
62500 Brno, Czech Republic

email: tev...@pharm.uoa.gr

  teva...@gmail.com


website: https://sites.google.com/site/thomasevangelidishomepage/