[scikit-learn] Would love to contribute to this library that I fell in love with. I have a question! FIRST TIMER

2018-07-16 Thread Abhishek Babuji
TO WHOM IT MAY CONCERN,

I have just learned Python to a level that I can say I'm comfortable with
it. I have also picked up and learned Git and GitHub, and so now I'm ready
to make my contribution to this library.

I'm really enthusiastic but since this is my first time, I'd like to know a
few things!

*Must I know the underlying implementation of something to contribute code
to fix it?*

Explanation: Let's say someone tags an issue as 'first timers' and 'easy',
and I want to take a look at it and contribute code to fix it.

Should I already know the implementation of what the fixed code is supposed
to do, or will this be explained when the issue is brought up? I have gone
over the issues on your GitHub, but I don't think I've seen enough examples,
and I can't find this in the contributor guide.

If someone could help me understand how deeply I must know scikit-learn to
be able to contribute, I would then begin working towards it! I have used it
a lot in my Machine Learning projects, but I'm not sure where I stand.

Example: "The shovel doesn't work! Fix it! It is supposed to be able to dig
through mud"
My dilemma: I found an immovable rock in the mud that the shovel can't dig
through, so I'm stuck. Guess I shouldn't have volunteered to help.

Just on a side note, to all scikit-learn's contributors, you're doing
God's work.
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] sample_weights in RandomForestRegressor

2018-07-16 Thread Brown J.B. via scikit-learn
Dear Thomas,

Your strategy for model development is built on the assumption that the SAR
(structure-activity relationship) is a continuous manifold constructed for
your compound descriptors.
However, SARs for many proteins in drug discovery or chemical biology are
not continuous (consider kinase inhibitors).

Therefore, you must make an assessment of the training data SAR to check
for the prevalence of activity cliffs.
There are at least two ways you can go about this:
  (1) Simply compute all pairwise similarities by your choice of
descriptor+metric, then identify where there are pairs (e.g.,
MACCS-Tanimoto > 0.7) with large activity differences (e.g., K_i or IC50
difference of more than 10/50/100-fold; again, the biology of your problem
determines the right values).
  (2) Perform many repetitions of train-test splitting on the 709 reference
molecules, look at the distribution of your evaluation metric, and see if
there is a limit in your ability to predict. If you are hitting a wall in
terms of predictability (metric performance), it's a likely sign there is
an activity cliff, and no amount of machine learning is going to be able to
overcome this. Further, trace the predictability of individual compounds to
identify those which are consistently predicted wrong. If you combine this
with analysis (1), you can know exactly which of your chemistries are
unmodelable.
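
As a rough illustration of analysis (1), here is a minimal sketch that flags
candidate activity cliffs. It assumes the fingerprints are already available
as a binary matrix (for example MACCS keys computed elsewhere) and that the
activities are positive values such as K_i in nM. The function names, the toy
data, and the 0.7 / 10-fold thresholds are placeholders, not recommendations.

```python
import numpy as np

def tanimoto_matrix(fps):
    """Pairwise Tanimoto similarity for a binary fingerprint matrix (n_mols, n_bits)."""
    fps = np.asarray(fps, dtype=float)
    common = fps @ fps.T                       # bits set in both molecules
    counts = fps.sum(axis=1)                   # bits set per molecule
    union = counts[:, None] + counts[None, :] - common
    # Two all-zero fingerprints have an empty union; define their similarity as 0.
    return np.divide(common, union, out=np.zeros_like(common), where=union > 0)

def find_activity_cliffs(fps, activity, sim_threshold=0.7, fold_threshold=10.0):
    """Return (i, j, similarity, fold_difference) for similar pairs with large activity gaps."""
    activity = np.asarray(activity, dtype=float)   # must be strictly positive
    sim = tanimoto_matrix(fps)
    fold = np.maximum.outer(activity, activity) / np.minimum.outer(activity, activity)
    i_idx, j_idx = np.triu_indices(len(activity), k=1)   # visit each pair once
    mask = (sim[i_idx, j_idx] > sim_threshold) & (fold[i_idx, j_idx] >= fold_threshold)
    return [(int(i), int(j), float(sim[i, j]), float(fold[i, j]))
            for i, j in zip(i_idx[mask], j_idx[mask])]

# Toy data (hypothetical): molecules 0 and 1 are similar but differ 160-fold in activity.
toy_fps = [[1, 1, 1, 1, 0],
           [1, 1, 1, 1, 1],
           [0, 0, 0, 1, 1]]
toy_activity = [5.0, 800.0, 6.0]
cliffs = find_activity_cliffs(toy_fps, toy_activity)
for i, j, s, f in cliffs:
    print(f"pair ({i}, {j}): similarity={s:.2f}, fold difference={f:.0f}")
```

The similarity and fold thresholds should be tuned to the biology of the
target, as noted above.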

If you find that there are no activity cliffs in your dataset, then your
application of the assumption that chemical similarity implies biological
endpoint similarity will hold, and your experimental design is validated
because of the presence of a continuous manifold.
However, if you do have activity cliffs, then as awesome as sklearn is, it
still cannot make the computational chemistry any better.
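
For analysis (2), a minimal sketch of the repeated-splitting check might look
like the following. It uses synthetic data from make_regression as a stand-in
for the 709 reference molecules, and the helper name, the number of repeats,
and the forest size are illustrative assumptions only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import ShuffleSplit

def repeated_split_analysis(X, y, n_repeats=20, test_size=0.2, seed=0):
    """Collect the score distribution and the mean absolute error per compound."""
    splitter = ShuffleSplit(n_splits=n_repeats, test_size=test_size, random_state=seed)
    scores = []
    abs_err_sum = np.zeros(len(y))
    n_tested = np.zeros(len(y), dtype=int)
    for train_idx, test_idx in splitter.split(X):
        model = RandomForestRegressor(n_estimators=30, random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        scores.append(r2_score(y[test_idx], pred))
        abs_err_sum[test_idx] += np.abs(pred - y[test_idx])
        n_tested[test_idx] += 1
    # Compounds that never landed in a test split keep NaN as their error.
    mean_abs_err = np.full(len(y), np.nan)
    tested = n_tested > 0
    mean_abs_err[tested] = abs_err_sum[tested] / n_tested[tested]
    return np.array(scores), mean_abs_err

# Synthetic placeholder data; replace with the real descriptors and affinities.
X, y = make_regression(n_samples=120, n_features=10, noise=5.0, random_state=0)
scores, per_compound_err = repeated_split_analysis(X, y)
print("R^2 over splits: mean=%.3f, std=%.3f" % (scores.mean(), scores.std()))
# Compounds with the largest average error are candidates for closer inspection.
worst = np.argsort(-np.nan_to_num(per_compound_err))[:5]
print("hardest compounds (indices):", worst)
```

A score distribution that plateaus regardless of the split, together with the
same compounds repeatedly appearing among the worst predicted, would be
consistent with the activity-cliff picture described above.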

Hope this helps you contextualize your work. Don't hesitate to contact me
if I can be of assistance.

Sincerely,
J.B. Brown
Kyoto University Graduate School of Medicine


2018-07-16 8:51 GMT+09:00 Thomas Evangelidis :

> Hello,
>
> I am kind of confused about the use of the sample_weight parameter in the
> fit() function of RandomForestRegressor. Here is my problem:
>
> I am trying to predict the binding affinity of small molecules to a
> protein. I have a training set of 709 molecules and a blind test set of 180
> molecules. I want to find those features that are more important for the
> correct prediction of the binding affinity of those 180 molecules of my
> blind test set.  My rationale is that if I give more emphasis to the
> similar molecules in the training set, then I will get higher importances
> for those features that have higher predictive ability for this specific
> blind test set of 180 molecules. To this end, I weighted the 709 training
> set molecules by their maximum similarity to the 180 molecules, selected
> only those features with high importance and trained a new RF with all 709
> molecules. I got some results but I am not satisfied. Is this the right way
> to use sample_weight in RF? I would appreciate any advice or suggested
> workflow.
>
>
> --
>
> ==
>
> Dr Thomas Evangelidis
>
> Post-doctoral Researcher
> CEITEC - Central European Institute of Technology
> Masaryk University
> Kamenice 5/A35/2S049,
> 62500 Brno, Czech Republic
>
> email: tev...@pharm.uoa.gr
>
>   teva...@gmail.com
>
>
> website: https://sites.google.com/site/thomasevangelidishomepage/
>
>