Re: [Rdkit-discuss] Fingerprint collision and machine learning

2018-10-10 Thread Peter S. Shenkin
It is very far from a solved problem, since it depends strongly on the
interactions within the crystal. And it’s not terribly uncommon for a
drug-like compound to exhibit different crystal forms, each with its own
melting point and solubility. This has been an issue for drug formulation,
where you usually want to stabilize and distribute the least stable
(most soluble) crystal form.

I think there was recently a blog posting on this from Nextstep.

-P.

On Wed, Oct 10, 2018 at 7:51 AM Michal Krompiec 
wrote:

> Hi all,
> I have a slightly off-topic question. I'm trying to train a neural network
> on a dataset of small molecules and their melting points. I did get a
> not-so-bad accuracy with Morgan fingerprints, but I've realised that
> regardless of FP radius and bitvector length, several dozen molecules have
> the same fingerprints but wildly different melting points. I am pretty sure
> this is a "solved problem" so I don't want to reinvent the wheel. What is
> the recommended/usual way of dealing with this?
> Thanks,
> Michal
>
>
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
-- 
-P.
Sent from a cell phone. Pls forgive brvty and m1$tea@ks.
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Fingerprint collision and machine learning

2018-10-10 Thread Chris Earnshaw
Hi

It sounds to me like you're already getting better results than you could
reasonably expect.

Prediction of melting point is a phenomenally difficult thing to do; you're
trying to find the temperature at which a (generally undefined) solid
crystalline phase is in equilibrium with a (probably even less defined)
liquid phase. You also need to consider that the crystalline form of your
solid phase is not necessarily truly constant - what polymorph is involved?
Melting points of alternative polymorphs can be radically different and
this is one of the real bugbears of pharmaceutical and agrochemical
development. If you haven't found the most stable form early in the
development process there can be very nasty surprises downstream.

Expecting to handle all these challenges with a descriptor as simple as a
molecular fingerprint, regardless of bit length, collisions, etc., is
probably over-optimistic...

Regards,
Chris Earnshaw

On Wed, 10 Oct 2018 at 13:16, Michal Krompiec 
wrote:

> Hi Thomas,
> Radius 2, 2048 bits, 5200 data points.


Re: [Rdkit-discuss] Fingerprint collision and machine learning

2018-10-10 Thread Thomas Evangelidis
Your radius and bitvector length are too small for such a big training
set. You probably have bit collisions, or the radius is not large enough to
capture the differences in substructures; that is why you see that artifact.
Try radius 3 and bitvector length 4096. I think you have enough training
samples to go up to bitvector length 8192 without overfitting the network,
although that will make the training much slower.
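
As a rough way to judge how much folding alone contributes, here is a minimal sketch that estimates how many features of a single molecule are expected to be lost to bit collisions at a given bitvector length. It assumes features are hashed uniformly and independently into the bitvector, and the feature count of 60 is a made-up illustrative value, not something measured on the dataset in this thread. The function names are hypothetical.

```python
def expected_set_bits(n_features: int, n_bits: int) -> float:
    """Expected number of distinct bits set when n_features are hashed
    uniformly at random into a bitvector of length n_bits."""
    return n_bits * (1.0 - (1.0 - 1.0 / n_bits) ** n_features)


def expected_lost_features(n_features: int, n_bits: int) -> float:
    """Expected number of features that share a bit with another feature."""
    return n_features - expected_set_bits(n_features, n_bits)


# Hypothetical number of distinct atom environments in one molecule.
N_FEATURES = 60

for n_bits in (2048, 4096, 8192):
    lost = expected_lost_features(N_FEATURES, n_bits)
    print(f"{n_bits:5d} bits: ~{lost:.2f} of {N_FEATURES} features lost to collisions")
```

Under these assumptions the expected loss per molecule is below one feature at all three lengths. If identical fingerprints persist regardless of length, that would point towards molecules that genuinely share the same atom environments rather than towards folding.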

On Wed, 10 Oct 2018 at 14:15, Michal Krompiec 
wrote:

> Hi Thomas,
> Radius 2, 2048 bits, 5200 data points.

-- 

==

Dr Thomas Evangelidis

Research Scientist

IOCB - Institute of Organic Chemistry and Biochemistry of the Czech Academy
of Sciences 
Prague, Czech Republic
  &
CEITEC - Central European Institute of Technology 
Brno, Czech Republic

email: teva...@gmail.com

website: https://sites.google.com/site/thomasevangelidishomepage/


Re: [Rdkit-discuss] Fingerprint collision and machine learning

2018-10-10 Thread Pavel

Hi Michal,

  If you can provide several examples of structures that have identical
bitstrings, it would help a lot in understanding the issue.
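
One possible way to collect such examples is sketched below. It groups records by fingerprint and lists the groups whose melting points differ. The record layout, the function name and the data are all hypothetical; the fingerprint key is assumed to be any hashable form of the fingerprint, for instance the bit string of an RDKit Morgan fingerprint.

```python
from collections import defaultdict


def find_fingerprint_clashes(records, min_spread=0.0):
    """Group (identifier, fingerprint_key, melting_point) records by
    fingerprint and return the groups with more than one member, as
    (melting_point_spread, members) tuples sorted by decreasing spread."""
    groups = defaultdict(list)
    for identifier, fp_key, melting_point in records:
        groups[fp_key].append((identifier, melting_point))

    clashes = []
    for members in groups.values():
        if len(members) < 2:
            continue  # unique fingerprint, nothing to report
        values = [mp for _, mp in members]
        spread = max(values) - min(values)
        if spread >= min_spread:
            clashes.append((spread, sorted(members)))

    clashes.sort(key=lambda item: item[0], reverse=True)
    return clashes


# Made-up records: identifier, fingerprint as a bit string, melting point.
records = [
    ("mol_a", "0101", 120.0),
    ("mol_b", "0101", 45.0),
    ("mol_c", "0011", 80.0),
]

clashes = find_fingerprint_clashes(records)
for spread, members in clashes:
    print(f"spread {spread:.1f}: {members}")
```

Posting the structures from the groups with the largest spread would make it easier to see whether they are, for example, isomers that the fingerprint cannot distinguish.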


Pavel.

On 10/10/18 14:15, Michal Krompiec wrote:

Hi Thomas,
Radius 2, 2048 bits, 5200 data points.



--
Dr. Pavel Polishchuk
senior researcher
Institute of Molecular and Translational Medicine
Faculty of Medicine and Dentistry
Palacky University
Hněvotínská 1333/5
779 00 Olomouc
Czech Republic
+420 585632298



Re: [Rdkit-discuss] Fingerprint collision and machine learning

2018-10-10 Thread Michal Krompiec
Hi Thomas,
Radius 2, 2048 bits, 5200 data points.

On Wed, 10 Oct 2018 at 13:13, Thomas Evangelidis  wrote:

> What's your bitvector length and radius? How many training samples do you
> have?


[Rdkit-discuss] Fingerprint collision and machine learning

2018-10-10 Thread Michal Krompiec
Hi all,
I have a slightly off-topic question. I'm trying to train a neural network
on a dataset of small molecules and their melting points. I did get a
not-so-bad accuracy with Morgan fingerprints, but I've realised that
regardless of FP radius and bitvector length, several dozen molecules have
the same fingerprints but wildly different melting points. I am pretty sure
this is a "solved problem" so I don't want to reinvent the wheel. What is
the recommended/usual way of dealing with this?
Thanks,
Michal


Re: [Rdkit-discuss] Coordgen library questions

2018-10-10 Thread MARIA BRANDL via Rdkit-discuss
Hi Lukas,

According to the source code
(https://github.com/rdkit/rdkit/blob/master/External/CoordGen/CoordGen.h):

double coordgenScaling = 50.0;

The default scaling factor, which acts more like a "shrinking factor", seems
to be 50, producing bond lengths of 1 Å.

For 1.5 Å bond lengths, you would need to enter a scaling factor of 50.0/1.5.
Maybe this should be changed internally to make it more intuitive?
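
The relationship described above can be written as a small helper. This is only a sketch based on the numbers reported in this thread (default scaling 50 gives bond length 1, scaling 1.5 gives 33.3); the function and constant names are hypothetical and are not part of RDKit.

```python
# Bond length that the scaling divides, as inferred from this thread:
# scaling 50.0 -> bond length 1.0, scaling 1.5 -> bond length 33.3.
COORDGEN_NATIVE_BOND_LENGTH = 50.0


def bond_length_for_scaling(scaling: float) -> float:
    """Bond length expected for a given coordgenScaling value."""
    if scaling <= 0:
        raise ValueError("scaling must be positive")
    return COORDGEN_NATIVE_BOND_LENGTH / scaling


def scaling_for_bond_length(target_bond_length: float) -> float:
    """coordgenScaling value expected to give the requested bond length."""
    if target_bond_length <= 0:
        raise ValueError("target bond length must be positive")
    return COORDGEN_NATIVE_BOND_LENGTH / target_bond_length


print(scaling_for_bond_length(1.5))   # value to assign to p.coordgenScaling
print(bond_length_for_scaling(1.5))   # what a scaling of 1.5 actually produces
```

The result of scaling_for_bond_length would then be assigned to the coordgenScaling attribute of the CoordGenParams object before calling AddCoords.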

The code snippet below may save you the trouble of going into PyMOL for
measurements. When I got the 50 Å bond lengths, the depiction showed all
bonds as single bonds and did not take bond orders into account; I wonder
whether that is intended.

Best wishes,
Maria

from rdkit import Chem
from rdkit.Chem import rdCoordGen
from rdkit.Chem import rdmolops

mol = Chem.MolFromSmiles('Cc1c1', sanitize=True)
mol1 = Chem.MolFromSmiles('Cc1c1', sanitize=True)

# default parameters (coordgenScaling = 50.0)
rdCoordGen.AddCoords(mol)

# custom scaling for 1.5 A bond lengths
p = rdCoordGen.CoordGenParams()
p.coordgenScaling = 50.0/1.5
rdCoordGen.AddCoords(mol1, p)

print(Chem.MolToMolBlock(mol))
print(Chem.MolToMolBlock(mol1))
# pairwise atom distances, so no need to measure in PyMOL
print(rdmolops.Get3DDistanceMatrix(mol1))

Chem.MolToMolFile(mol, 'default.sdf') # bond length 1
Chem.MolToMolFile(mol1, 'scale.mol') # bond length 1.5 (scaling 50.0/1.5)

# imports for depicting the molecules in a Jupyter notebook
from rdkit.Chem import rdDepictor
from rdkit.Chem.Draw import IPythonConsole




On Tuesday, 9 October 2018, 14:08:05 BST, Lukas Pravda  
wrote:  
 
 
Hi all,

 

I’m playing with the Coordgen library inside rdkit and I have a couple of 
questions I could not figure out by myself. Hopefully someone more experienced 
will know.

 
   
   - [comment] The way one can pass a scaling factor to the bond size is very 
unintuitive. If I don’t provide any parameter a single bond length is 1.0. If I 
pass 1.5 as a scaling factor, I’d expect to get single bond of a length 1.5. 
But instead I get 33.3. (measured in pymol) 

 

Snippet:

from rdkit import Chem
from rdkit.Chem import rdCoordGen

mol = Chem.MolFromSmiles('Cc1c1', sanitize=True)
mol1 = Chem.MolFromSmiles('Cc1c1', sanitize=True)

p = rdCoordGen.CoordGenParams()
p.coordgenScaling = 1.5

rdCoordGen.AddCoords(mol)
rdCoordGen.AddCoords(mol1, p)

Chem.MolToMolFile(mol, 'default.sdf') # bond length 1
Chem.MolToMolFile(mol1, '1.5_scale.sdf') # bond length 33.3

    Is that intended?

 
   
   - Is there any way to modify templates, which can be passed as the 
‘templateFileDir’ parameter to match general groups and bonds as described 
here: http://rdkit.blogspot.com/2016/07/tuning-substructure-queries-ii.html? 
   - By default, rdCoordGen module writes to stderr by putting ‘TEMPLATES: 
/path/to/templates’ line for each depiction generated. Is there any simple way 
of muting that piece of information without manually hijacking the stderr 
(rdkit.rdBase.DisableLog('rdApp.*') does not work)?
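
For reference, the manual stderr hijacking mentioned in the last point could look like the sketch below. It uses only the Python standard library and redirects file descriptor 2 to the null device for the duration of a block, which is generally what is needed when the output comes from native code rather than from Python-level logging. The context manager name is hypothetical, and whether this silences the 'TEMPLATES:' line in a given RDKit build is an assumption that has not been checked here.

```python
import contextlib
import os
import sys


@contextlib.contextmanager
def suppress_native_stderr():
    """Temporarily redirect file descriptor 2 (stderr) to the null device."""
    sys.stderr.flush()                      # do not lose pending Python output
    saved_fd = os.dup(2)                    # keep a copy of the real stderr
    devnull_fd = os.open(os.devnull, os.O_WRONLY)
    try:
        os.dup2(devnull_fd, 2)              # stderr now points at the null device
        yield
    finally:
        os.dup2(saved_fd, 2)                # restore the real stderr
        os.close(devnull_fd)
        os.close(saved_fd)


with suppress_native_stderr():
    # Stand-in for a call that writes to stderr from native code.
    os.write(2, b"TEMPLATES: /path/to/templates\n")
```

The call that generates the depiction would go inside the with block in place of the os.write stand-in.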

 

Thanks for possible suggestions.

  

  

Cheers, 

Lukas

  