Re: [Rdkit-discuss] Fingerprint collision and machine learning

2018-10-11 Thread Greg Landrum
I've been quiet on this one since I'm traveling this week, but I want to
briefly weigh in on the fingerprint aspects since I think some terms are
being used incorrectly and that's maybe making things even more confusing.

I believe that the term "collision" as applied to fingerprints normally
means two different molecular features setting the same bit in the final
fingerprint. In the case of the Morgan fingerprint, this means that two
different atom environments would set the same bit. To understand how
collisions come about, it's worth spending a bit of time describing how a
Morgan fingerprint is generated.

After finding a "circular" atom environment, the fingerprinting code uses a
hash function to convert the environment into a number. Let's call this the
hash value. You can see the hash values for the atom environments of a
molecule (along with how often the environments occur) using the
"GetMorganFingerprint()" function:

In [4]: m = Chem.MolFromSmiles('Cc1ccccc1')

In [5]: fp = rdMolDescriptors.GetMorganFingerprint(m,2)

In [6]: fp.GetNonzeroElements()
Out[6]:
{98513984: 3,
 422715066: 1,
 908339072: 1,
 951226070: 2,
 2246728737: 1,
 2763854213: 1,
 3207567135: 1,
 3217380708: 1,
 3218693969: 5,
 306991: 2,
 4244175903: 2}

When you ask for a fingerprint as a bit vector, those hash values are
truncated so that they fit into the size of the fingerprint you asked for:

In [7]: bv = rdMolDescriptors.GetMorganFingerprintAsBitVect(m,2,4096)

In [8]: bv.GetNumOnBits()
Out[8]: 11

In [9]: len(bv)
Out[9]: 4096

Notice here that we have the same number of bits set in the bit vector (11)
as we did in the original fingerprint.

A collision happens when two different atom environments hash to the same
value *or* when the truncation to the bit vector results in two different
hash values ending up in the same bit.

The first type of collision doesn't happen all that frequently (and isn't
100% trivial to detect),[1] but the second happens pretty regularly,
particularly when you make fingerprints short. Here's an example of that
for the simple molecule above:

In [12]: bv2 = rdMolDescriptors.GetMorganFingerprintAsBitVect(m,2,256)

In [13]: bv2.GetNumOnBits()
Out[13]: 10

Notice that now only 10 bits are set and remember that we previously had 11.
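
The two counts can be reproduced by hand, which makes the second kind of
collision concrete. This is a minimal sketch that assumes the truncation is
a plain modulo by the fingerprint length; the hash values are copied from
Out[6] as printed, and `fold` is a helper name made up for the illustration.

```python
from collections import defaultdict

# Hash values copied from Out[6] above (counts dropped; only the keys matter).
hash_values = [
    98513984, 422715066, 908339072, 951226070, 2246728737, 2763854213,
    3207567135, 3217380708, 3218693969, 306991, 4244175903,
]

def fold(hashes, n_bits):
    """Map each bit position to the hash values landing on it.

    Assumption: folding is a simple `hash % n_bits`.
    """
    bits = defaultdict(list)
    for h in hashes:
        bits[h % n_bits].append(h)
    return dict(bits)

for n_bits in (4096, 256):
    folded = fold(hash_values, n_bits)
    # A bit with more than one hash value behind it is a collision.
    collisions = {bit: hs for bit, hs in folded.items() if len(hs) > 1}
    print(n_bits, "bits:", len(folded), "on bits; collisions:", collisions)
```

With the values as printed, all 11 positions stay distinct at 4096 bits,
while at 256 bits 3207567135 and 4244175903 both land on bit 31, which
leaves 10 bits set.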

The two factors influencing the number of collisions of the second type are
the size of the fingerprint - smaller fingerprints = more likelihood of
collisions - and the radius of the features being used - higher radii end
up setting more bits, which purely statistically leads to a greater chance
of collisions.
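
The "purely statistical" part can be estimated with a back-of-the-envelope
model. The sketch below assumes that hash values are spread uniformly over
the bit positions, which is an idealization and not a statement about the
actual hash function; `expected_on_bits` and `expected_collisions` are names
invented for this illustration, and the feature counts are arbitrary.

```python
def expected_on_bits(n_features, n_bits):
    """Expected number of distinct bits set when n_features hash values
    fall uniformly at random into n_bits positions."""
    return n_bits * (1.0 - (1.0 - 1.0 / n_bits) ** n_features)

def expected_collisions(n_features, n_bits):
    """Expected number of features that land on an already-occupied bit."""
    return n_features - expected_on_bits(n_features, n_bits)

# Illustrative feature counts only; real molecules will differ.
for n_features in (11, 50, 100):
    for n_bits in (256, 1024, 4096):
        lost = expected_collisions(n_features, n_bits)
        print(f"{n_features:4d} features, {n_bits:5d} bits: ~{lost:.2f} lost")
```

Under this model the expected loss grows with the number of features (which
is what a larger radius produces) and shrinks as the fingerprint gets longer.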

Collisions by themselves are not necessarily a terrible thing. They do
result in some information loss, have a small impact on similarity, and a
somewhat larger (though still not enormous) impact on machine learning
performance. See the blog posts I mention below for the experiments I did
here to figure this out.
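
To get a feel for the similarity effect, one can compare Tanimoto similarity
on the unfolded feature sets with the same similarity after folding. The
sketch below assumes a modulo fold and uses two small made-up sets of hash
values, so the numbers say nothing about real molecules.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of features or bits."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def fold(hashes, n_bits):
    """Bit positions set after folding (assumed to be hash % n_bits)."""
    return {h % n_bits for h in hashes}

# Hypothetical hash values: the two "molecules" share 17 and 34 only.
mol_a = {17, 34, 51, 68}
mol_b = {17, 34, 83, 101}

print("unfolded:", round(tanimoto(mol_a, mol_b), 3))
for n_bits in (16, 1024):
    sim = tanimoto(fold(mol_a, n_bits), fold(mol_b, n_bits))
    print(n_bits, "bits:", round(sim, 3))
```

Here 51 and 83 fold onto the same position at 16 bits, so an unrelated pair
of features is counted as shared and the similarity rises; at 1024 bits
nothing collides and the value matches the unfolded one. Toy sets this small
exaggerate the size of the effect.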

Two different molecules producing the same fingerprint is a different
thing. This can be caused by collisions alone (though I would guess this
happens fairly rarely), but I think it's more likely that it's a
limitation of the nature or resolution of the fingerprint. You can test
the resolution question by checking to see if increasing the radius you use
allows the molecules to be distinguished from each other. The collision
question is probably most easily answered by generating the "full"
fingerprint by calling GetMorganFingerprint() as I show above and looking
to see how similar the molecules are at that level.
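
The two checks can be combined into one small decision helper. This is only
a sketch: it works on plain dicts of {hash value: count}, the shape that
GetNonzeroElements() returns in the session above, it assumes a modulo fold,
and `diagnose` plus the example values are hypothetical.

```python
def fold(elements, n_bits):
    """Bit positions set by the hash values in `elements` (modulo fold assumed)."""
    return {h % n_bits for h in elements}

def diagnose(elements_a, elements_b, n_bits):
    """Classify why two molecules might share a folded bit vector."""
    if fold(elements_a, n_bits) != fold(elements_b, n_bits):
        return "bit vectors differ"
    if elements_a == elements_b:
        # Same environments and counts: a resolution limit, not a collision.
        return "identical full fingerprints: try a larger radius"
    if set(elements_a) == set(elements_b):
        return "same environments, different counts: bit vector drops counts"
    return "different environments folded onto the same bits: collisions"

# Made-up full fingerprints; 287 % 256 == 31, so b collides with a at 256 bits.
a = {31: 1, 100: 2}
b = {287: 1, 100: 2}
print(diagnose(a, b, 256))
print(diagnose(a, {31: 3, 100: 2}, 256))
print(diagnose(a, dict(a), 256))
```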

There's a fair amount of information about the impact of bit vector length
and fingerprint radius on the number of collisions in these RDKit blog posts:
http://rdkit.blogspot.com/2014/02/colliding-bits.html
http://rdkit.blogspot.com/2014/03/colliding-bits-ii.html
http://rdkit.blogspot.com/2016/02/colliding-bits-iii.html

I hope this helps a bit,
-greg
[1] Here's a clear example where it has happened:
https://github.com/rdkit/rdkit/issues/814



On Wed, Oct 10, 2018 at 10:28 AM Michal Krompiec wrote:

> Dear All,
> Thank you all very much for your feedback! Actually, the number of
> collisions didn't decrease when I increased the bit length, though
> increasing radius to 3 did help a bit. Overall, it is good to know that
> great results are not to be expected.
> Best wishes,
> Michal

Re: [Rdkit-discuss] Fingerprint collision and machine learning

2018-10-10 Thread Peter S. Shenkin
It is very far from a solved problem, since it depends strongly on the
interactions within the crystal. And it’s not terribly uncommon for a
drug-like compound to exhibit different crystal forms, each with its own
melting point and solubility. This has been an issue for drug formulation,
where you usually want to stabilize and distribute the least stable
(most soluble) crystal form.

I think there was recently a blog posting on this from Nextstep.

-P.

On Wed, Oct 10, 2018 at 7:51 AM Michal Krompiec wrote:

> Hi all,
> I have a slightly off-topic question. I'm trying to train a neural network
> on a dataset of small molecules and their melting points. I did get a
> not-so-bad accuracy with Morgan fingerprints, but I've realised that
> regardless of FP radius and bitvector length, several dozen molecules have
> the same fingerprints but wildly different melting points. I am pretty sure
> this is a "solved problem" so I don't want to reinvent the wheel. What is
> the recommended/usual way of dealing with this?
> Thanks,
> Michal
>
>
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
-- 
-P.
Sent from a cell phone. Pls forgive brvty and m1$tea@ks.


Re: [Rdkit-discuss] Fingerprint collision and machine learning

2018-10-10 Thread Chris Earnshaw
Hi

It sounds to me like you're already getting better results than you could
reasonably expect.

Prediction of melting point is a phenomenally difficult thing to do; you're
trying to find the temperature at which a (generally undefined) solid
crystalline phase is in equilibrium with a (probably even less defined)
liquid phase. You also need to consider that the crystalline form of your
solid phase is not necessarily truly constant - what polymorph is involved?
Melting points of alternative polymorphs can be radically different and
this is one of the real bugbears of pharmaceutical and agrochemical
development. If you haven't found the most stable form early in the
development process there can be very nasty surprises downstream.

Expecting to handle all these challenges with a descriptor as simple as a
molecular fingerprint - regardless of bit length, collisions, etc. - is
probably over-optimistic.

Regards,
Chris Earnshaw

On Wed, 10 Oct 2018 at 13:16, Michal Krompiec wrote:

> Hi Thomas,
> Radius 2, 2048 bits, 5200 data points.


Re: [Rdkit-discuss] Fingerprint collision and machine learning

2018-10-10 Thread Thomas Evangelidis
Your radius and bitvector length are too small for such a big training
set. You probably have bit collisions, or the radius is not large enough to
capture the differences in substructures, which is why you see that
artifact. Try radius 3 and bitvector length 4096. I think that you have
enough training samples to go up to bitvector length 8192 without
overfitting the network, although that will make the training much slower.

On Wed, 10 Oct 2018 at 14:15, Michal Krompiec wrote:

> Hi Thomas,
> Radius 2, 2048 bits, 5200 data points.

-- 

==

Dr Thomas Evangelidis

Research Scientist

IOCB - Institute of Organic Chemistry and Biochemistry of the Czech Academy
of Sciences 
Prague, Czech Republic
  &
CEITEC - Central European Institute of Technology 
Brno, Czech Republic

email: teva...@gmail.com

website: https://sites.google.com/site/thomasevangelidishomepage/


Re: [Rdkit-discuss] Fingerprint collision and machine learning

2018-10-10 Thread Pavel

Hi Michal,

  If you can provide several examples of structures with identical
bitstrings, that would help a lot in understanding the issue.
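
One way to pull out such examples is sketched below, under the assumption
that each molecule is available as a (name, on-bit positions, melting point)
record. For an RDKit bit vector the on-bit positions could come from
something like tuple(bv.GetOnBits()). The function name, the record layout
and all values here are placeholders, not data from this thread.

```python
from collections import defaultdict

def duplicate_groups(records):
    """Return groups of (name, melting_point) that share a fingerprint.

    records: iterable of (name, on_bits, melting_point) tuples.
    """
    groups = defaultdict(list)
    for name, on_bits, melting_point in records:
        # frozenset makes the set of on bits usable as a dictionary key.
        groups[frozenset(on_bits)].append((name, melting_point))
    return [members for members in groups.values() if len(members) > 1]

# Placeholder records: mol_1 and mol_3 share the same on bits.
records = [
    ("mol_1", (3, 17, 250), 120.0),
    ("mol_2", (3, 18, 250), 95.0),
    ("mol_3", (3, 17, 250), 210.0),
]

for members in duplicate_groups(records):
    values = [mp for _, mp in members]
    names = [name for name, _ in members]
    print(names, "melting point spread:", max(values) - min(values))
```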


Pavel.

On 10/10/18 14:15, Michal Krompiec wrote:

Hi Thomas,
Radius 2, 2048 bits, 5200 data points.



--
Dr. Pavel Polishchuk
senior researcher
Institute of Molecular and Translational Medicine
Faculty of Medicine and Dentistry
Palacky University
Hněvotínská 1333/5
779 00 Olomouc
Czech Republic
+420 585632298



Re: [Rdkit-discuss] Fingerprint collision and machine learning

2018-10-10 Thread Michal Krompiec
Hi Thomas,
Radius 2, 2048 bits, 5200 data points.

On Wed, 10 Oct 2018 at 13:13, Thomas Evangelidis wrote:

> What's your bitvector length and radius? How many training samples do you
> have?