Re: [Rdkit-discuss] Morgan Fingerprint one-to-one relation

2019-11-20 Thread Greg Landrum
Hi Paul,

On Wed, Nov 20, 2019 at 5:32 PM Paul Zierep via Rdkit-discuss <
rdkit-discuss@lists.sourceforge.net> wrote:

> Hi,
> in the original ECFP paper (Rogers, D.; Hahn, M.
> “Extended-Connectivity Fingerprints.” *J. Chem. Inf. and Model.* *50*:742-54
> (2010)), it says "that the relationship between fingerprint features and
> the substructures may not always be one-to-one" (especially for the FCFPs,
> but also for the ECFPs).
>
> I was wondering whether, in the RDKit implementation of Morgan fingerprints
> (speaking of the non-hashed, unfolded type of course), it is possible for
> one feature to encode different, non-identical substructures.
>

Yes. The function that takes the atom environments and hashes them to
produce an integer is not perfect and can produce collisions.
I believe this is fairly rare, but it certainly can happen.
Here's a concrete example:
https://github.com/rdkit/rdkit/issues/814

There's a longer discussion of this topic here:
https://sourceforge.net/p/rdkit/mailman/message/36438523/

-greg
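
A toy illustration of why a hashed feature id cannot be guaranteed to be one-to-one. This is only a sketch: `toy_feature_id` and the made-up environment labels are assumptions for illustration, and the hash used here (CRC32 truncated to 8 bits) is not the function RDKit uses. It only shows the general effect of mapping many distinct inputs into a fixed-size integer range.

```python
import zlib
from collections import defaultdict


def toy_feature_id(environment: str, n_bits: int = 8) -> int:
    """Hash an environment label to a small integer (toy stand-in, not RDKit's hash)."""
    return zlib.crc32(environment.encode("utf-8")) & ((1 << n_bits) - 1)


def find_collisions(environments, n_bits: int = 8):
    """Group distinct environments by feature id; keep ids shared by more than one."""
    buckets = defaultdict(set)
    for env in environments:
        buckets[toy_feature_id(env, n_bits)].add(env)
    return {fid: sorted(envs) for fid, envs in buckets.items() if len(envs) > 1}


# 300 distinct made-up labels but only 2**8 = 256 possible ids,
# so at least one id must be shared (pigeonhole principle).
environments = [f"environment-{i}" for i in range(300)]
collisions = find_collisions(environments)
print(len(collisions), "feature ids are shared by more than one environment")
```

With a 32-bit id space collisions are far rarer than in this 8-bit toy, but the same argument applies whenever the number of possible environments exceeds the number of possible ids.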
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] RMS matrix from GetConformerRMSMatrix

2019-11-20 Thread topgunhaides .
Hi guys,

I am trying to construct my own symmetrical RMS matrix (lower half) for
Butina clustering, by using GetBestRMS which considers symmetry.

I need to get the matrix with rms elements in correct order first. Here is
what I did for testing, by just using GetConformerRMSMatrix and
GetConformerRMS:


from rdkit import Chem
from rdkit.Chem import AllChem

mh = Chem.AddHs(Chem.MolFromSmiles('OCCCN'))
cids = AllChem.EmbedMultipleConfs(mh, numConfs=5, maxAttempts=1000,
                                  pruneRmsThresh=0.5, numThreads=0,
                                  randomSeed=1)
m = Chem.RemoveHs(mh)
mat_a = AllChem.GetConformerRMSMatrix(m, prealigned=False)
print(mat_a)

mat_b = []
count = len(cids)
for i in range(count - 1):
    for j in range(i + 1, count):
        mat_b.append(AllChem.GetConformerRMS(m, cids[i], cids[j]))
print(mat_b)


mat_a:
[0.660379357470512, 0.5803507133538487, 0.8111033830159597,
0.7063747192537949, 0.10437239857420268, 0.8858184043706921,
0.9292367217722529, 0.872233146598343, 0.6451929254710606,
0.9110647560331953]
mat_b:
[0.660379357470513, 0.5803507133538501, 0.7063747192537968,
0.929236721772254, 0.7045981421188982, 0.09521549761836234,
0.6494273777558387, 0.7667663565750649, 0.6265013024617176,
0.6467365004737882]

You see the two matrices do not match. Apparently, my mat_b gives me this
rms list: [01, 02, 03, 04, 12, 13, 14, 23, 24, 34] (numbers are id pairs)

According to the documentation, GetConformerRMSMatrix should give me the
following matrix, and so the list [a, b, c, d, e, f, g, h, i, j]:
rmsmatrix = [ a,
              b, c,
              d, e, f,
              g, h, i, j ]

After assigning the id numbers to rows and columns:
        0   1   2   3   4
    0
    1   a
    2   b   c
    3   d   e   f
    4   g   h   i   j

So the mat_a from GetConformerRMSMatrix should be:
[ a,  b,  c,  d,  e,  f,  g,  h,  i,  j  ] =
[ 01, 02, 12, 03, 13, 23, 04, 14, 24, 34 ]

This might tell the differences between mat_a and mat_b. But still, some
numbers are very different, even after reordering manually. I cannot figure
out why.
Did I miss anything important? I am new to RDKit. Can anyone help me with
this? Thanks a lot!

Best,
Leon


[Rdkit-discuss] Morgan Fingerprint one-to-one relation

2019-11-20 Thread Paul Zierep via Rdkit-discuss
Hi,
in the original ECFP paper (Rogers, D.; Hahn, M. “Extended-Connectivity
Fingerprints.” *J. Chem. Inf. and Model.* *50*:742-54 (2010)), it says
"that the relationship between fingerprint features and the substructures
may not always be one-to-one" (especially for the FCFPs, but also for the
ECFPs).

I was wondering whether, in the RDKit implementation of Morgan fingerprints
(speaking of the non-hashed, unfolded type of course), it is possible for
one feature to encode different, non-identical substructures.

Thank you very much,
Paul


Re: [Rdkit-discuss] Hydrogens involved in "stereochemistry" are not removed by RemoveHs()

2019-11-20 Thread Ivan Tubert-Brohman
Thank you, Greg and Andrew, for your replies, and I'm glad to hear that
this is something that can be fixed within RDKit. I had almost forgotten I
had sent this email... :-)

Best,
Ivan

On Wed, Nov 20, 2019 at 12:17 AM Greg Landrum wrote:

> Hi Ivan,
>
> I agree that there is a bug here, but I think the problem is actually that
> the double bond is being assigned stereochemistry at all in this case.
>
> In [2]: m = Chem.MolFromSmiles('[H]/C=C/F')
>
> In [3]: m.Debug()
> Atoms:
> 0 1 H chg: 0  deg: 1 exp: 1 imp: 0 hyb: 1 arom?: 0 chi: 0
> 1 6 C chg: 0  deg: 2 exp: 3 imp: 1 hyb: 3 arom?: 0 chi: 0
> 2 6 C chg: 0  deg: 2 exp: 3 imp: 1 hyb: 3 arom?: 0 chi: 0
> 3 9 F chg: 0  deg: 1 exp: 1 imp: 0 hyb: 4 arom?: 0 chi: 0
> Bonds:
> 0 0->1 order: 1 dir: 4 conj?: 0 aromatic?: 0
> 1 1->2 order: 2 stereo: 3 stereoAts: (0 3) conj?: 0 aromatic?: 0
> 2 2->3 order: 1 dir: 4 conj?: 0 aromatic?: 0
>
>
> Given that the two substituents on the first C are the same, the double
> bond shouldn't be marked as STEREOE at all.
>
> I'll get this fixed.
> -greg
>
>
>
> On Wed, Nov 6, 2019 at 4:34 PM Ivan Tubert-Brohman <
> ivan.tubert-broh...@schrodinger.com> wrote:
>
>> Hi,
>>
>> For reasons too complicated to get into here, I ended up with a molecule
>> containing a =CH2 in which one of the hydrogens was explicit and had E/Z
>> stereo info. For example, consider [H]/C=C/F.
>>
>> I was surprised that RemoveHs() refused to remove the hydrogen, although
>> later I found that that's the documented behavior, and generally it makes
>> sense as a way to prevent the loss of stereochemical information.
>>
>> For example, compare these two:
>>
>> In [7]: Chem.MolToSmiles(Chem.RemoveHs(Chem.MolFromSmiles('[H]/C=C/F')))
>> Out[7]: '[H]/C=C/F'
>>
>> In [8]: Chem.MolToSmiles(Chem.RemoveHs(Chem.MolFromSmiles('[H]C=C/F')))
>> Out[8]: 'C=CF'
>>
>> A chemist would say that these two are obviously the same molecule, and
>> arguably the second representation is better, because a double bond ending
>> in =CH2 can't have geometric isomers. Maybe it's unreasonable to expect
>> RDKit to make that kind of inference, but still I wonder, what would be a
>> good automated way to get from [H]/C=C/F to C=CF?
>>
>> One idea is to add a "=CH2 cleanup" step, perhaps implemented by applying
>> this reaction:
>>
>> [H][C:1]=[*:2]>>[CH2:1]=[*:2]
>>
>> but perhaps there's a better way?
>>
>> Best,
>> Ivan


Re: [Rdkit-discuss] [*External*] Re: Try to reproduce a code working in January

2019-11-20 Thread Guillaume GODIN
Hi Taka,

So this bug is fixed, thanks.

Now I have two other issues:

1. This cell also gives an error:

dataset = rg.GetRGroupsAsColumns()
core =  Chem.RemoveHs(dataset["Core"][0])

RDKit ERROR: [12:59:27] Explicit valence for atom # 6 N, 5, is greater than permitted
---
ValueError                                Traceback (most recent call last)
 in 
      1 dataset = rg.GetRGroupsAsColumns()
----> 2 core =  Chem.RemoveHs(dataset["Core"][0])

ValueError: Sanitization error: Explicit valence for atom # 6 N, 5, is greater than permitted



2. A second issue appears afterwards, when I run this cell:
res = enumeratemol(core,rg)


---
UnboundLocalError Traceback (most recent call last)
 in 
----> 1 res = enumeratemol(core,rg)

 in enumeratemol(core, rg, maxmol)
 11 mol = core
 12 for idx,j in enumerate(i):
---> 13 mol = makebond(mol, rgs[idx][j])
 14 AllChem.Compute2DCoords(mol)
 15 mol = Chem.RemoveHs(mol)

 in makebond(target, chain)
 14 nbr2 = [x.GetOtherAtom(newmol.GetAtomWithIdx(atm2)) for x in newmol.GetAtomWithIdx(atm2).GetBonds()][0]
 15 nbr2.SetAtomMapNum(idx)
---> 16 newmol.AddBond(nbr1.GetIdx(), nbr2.GetIdx(), order=Chem.rdchem.BondType.SINGLE)
 17 nbr1.SetAtomMapNum(0)
 18 nbr2.SetAtomMapNum(0)

UnboundLocalError: local variable 'nbr1' referenced before assignment
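
A note on the traceback above: an UnboundLocalError of this kind typically means a name is assigned only inside a branch or loop body that never ran. The sketch below reproduces that pattern in plain Python. The function names and the list of atom map numbers are hypothetical stand-ins, since the full body of makebond is not shown in this message, so this may not be exactly what happens there.

```python
def pick_neighbour_fragile(atom_map_numbers, idx):
    """nbr1 is only bound if some atom carries map number idx."""
    for position, map_number in enumerate(atom_map_numbers):
        if map_number == idx:
            nbr1 = position
    return nbr1  # UnboundLocalError if the condition above was never true


def pick_neighbour_guarded(atom_map_numbers, idx):
    """Same search, but the missing case is reported with an explicit error."""
    nbr1 = None  # bind the name up front so it always exists
    for position, map_number in enumerate(atom_map_numbers):
        if map_number == idx:
            nbr1 = position
    if nbr1 is None:
        raise ValueError(f"no atom carries map number {idx}")
    return nbr1


try:
    pick_neighbour_fragile([0, 0, 0], 1)
except UnboundLocalError as err:
    print("fragile version:", type(err).__name__)

print("guarded version:", pick_neighbour_guarded([0, 2, 1], 1))
```

If the same thing is happening in makebond, printing the atom map numbers of both fragments just before the loop would show whether the expected attachment point is present at all.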


Thanks for helping

Guillaume

From: Taka Seri 
Date: Wednesday, 20 November 2019 at 14:02
To: Guillaume GODIN 
Cc: "rdkit-discuss@lists.sourceforge.net" 
Subject: [*External*] Re: [Rdkit-discuss] Try to reproduce a code working in January

Hi Guillaume,

I confirmed the issue.
RemoveHs does not work on an invalid molecule.
So I updated my code example and uploaded a gist.
Could you please check the new version of my code?
https://gist.github.com/iwatobipen/77269b8a10eafe0e0cba8de5c1cae6ec

Any comments or suggestions will be appreciated.

Thanks,
Taka

On Wed, 20 Nov 2019 at 14:46, Guillaume GODIN <guillaume.go...@firmenich.com> wrote:
Dear community,

I am trying to reproduce this code

https://iwatobipen.wordpress.com/2019/01/18/generate-possible-molecules-from-a-dataset-chemoinformatics-rdkit/

but got an error in pandas / RDKit during generation:

frame = frame[["ROMol", "Smiles", "Core", "R1", "R2", "R3"]]
frame['Core']=frame['Core'].apply(Chem.RemoveHs)
frame.head(2)



RDKit ERROR: [05:02:02]
RDKit ERROR:
RDKit ERROR: 
RDKit ERROR: Pre-condition Violation
RDKit ERROR: getExplicitValence() called without call to calcExplicitValence()
RDKit ERROR: Violation occurred on line 161 in file /opt/conda/conda-bld/rdkit_1561471048963/work/Code/GraphMol/Atom.cpp
RDKit ERROR: Failed Expression: d_explicitValence > -1
RDKit ERROR: 
RDKit ERROR:
RDKit ERROR: [05:05:04] Explicit valence for atom # 6 N, 5, is greater than permitted

---
ValueError                                Traceback (most recent call last)
 in 
      1 frame = frame[["ROMol", "Smiles", "Core", "R1", "R2", "R3"]]
----> 2 frame['Core']=frame['Core'].apply(Chem.RemoveHs)
      3 frame.head(2)

~/miniconda/envs/py37/lib/python3.7/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
   3589 else:
   3590 values = self.astype(object).values
-> 3591 mapped = lib.map_infer(values, f, convert=convert_dtype)
   3592
   3593 if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

ValueError: Sanitization error: Explicit valence for atom # 6 N, 5, is greater than permitted



Any idea why?

BR

Guillaume


***
DISCLAIMER
This email and any files transmitted with it, including replies and forwarded 
copies (which may contain alterations) subsequently transmitted from Firmenich, 
are confidential and solely for the use of the intended recipient. The contents 
do not represent the opinion of Firmenich except to the extent that it relates 
to their official business.
***

Re: [Rdkit-discuss] Try to reproduce a code working in January

2019-11-20 Thread Taka Seri
Hi Guillaume,

I confirmed the issue.
RemoveHs does not work on an invalid molecule.
So I updated my code example and uploaded a gist.
Could you please check the new version of my code?
https://gist.github.com/iwatobipen/77269b8a10eafe0e0cba8de5c1cae6ec

Any comments or suggestions will be appreciated.

Thanks,
Taka



[Rdkit-discuss] 5th in silico drug design workshop in Olomouc

2019-11-20 Thread Pavel Polishchuk

Dear colleagues,

  we would like to invite you to the 5th Drug Design workshop, which 
will be held on 3-7 January 2020 in Olomouc (Czech Republic). It is focused 
on practical applications of different chemoinformatic tools and 
approaches for drug development. This might be interesting for bachelor, 
master and PhD students who want to broaden their experience and sharpen 
their skills. During the workshop, students will learn pharmacophore and 
QSAR modeling, molecular docking, and PDBe services. A competition will be 
organized on the last day of the workshop, where participants will be 
able to apply the acquired knowledge to solve a real chemoinformatic task 
and win prizes.

  This year we will organize a poster session for participants.

https://fch.upol.cz/en/5add/

  Please feel free to share this information with anyone who may be 
interested in participating in such an event. Thank you!


Kind regards,
Pavel.

--
Dr. Pavel Polishchuk
senior researcher
Institute of Molecular and Translational Medicine
Faculty of Medicine and Dentistry
Palacky University
Hněvotínská 1333/5
779 00 Olomouc
Czech Republic
+420 585632298





Re: [Rdkit-discuss] Folding count vectors

2019-11-20 Thread Francois Berenger

On 20/11/2019 02:00, Benjamin Datko wrote:

Hello Francois,

I am trying to replicate some of the functionality of
CreateDifferenceFingerprintForReaction [Ref 1] for my own
understanding of how the code works. The function
CreateDifferenceFingerprintForReaction allows for three different
fingerprint representations of the molecules: AtomPair, Morgan, and
TopologicalTorsion [Ref 2]. All three are count vectors [Ref 3], and
the function allows for variable fingerprint size output.


Personally, I wouldn't try to fold a count vector.
They are sparse vectors, so they don't take a lot of memory.
Also, they lose less information than binary fingerprints.

But maybe Greg knows a workaround, if you are really forced to do 
this.
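
If folding really is required, one simple possibility is to map each index to `index % folded_size` and sum the counts that land in the same bucket. The sketch below does this in plain Python on the unfolded difference fingerprint quoted later in this message. `fold_counts` is a made-up helper, and summing modulo the new size is an assumption about a reasonable scheme, not a description of what CreateDifferenceFingerprintForReaction does internally.

```python
def fold_counts(nonzero_elements, folded_size):
    """Fold a sparse count vector by summing counts whose indices agree modulo folded_size."""
    folded = {}
    for index, count in nonzero_elements.items():
        bucket = index % folded_size
        folded[bucket] = folded.get(bucket, 0) + count
    # Signed counts (as in difference fingerprints) can cancel; drop the zeros.
    return {bucket: count for bucket, count in folded.items() if count != 0}


# Nonzero elements of the unfolded reaction difference fingerprint from the post.
unfolded = {558114: -2, 574497: -1, 1066050: 2, 1066081: 1}
folded = fold_counts(unfolded, 2048)
print(folded)  # {1058: -2, 1057: -1, 1090: 2, 1121: 1}
```

The folded indices and values do not match the output of CreateDifferenceFingerprintForReaction shown in the post, so that function evidently builds its fingerprint with different settings; the sketch only illustrates the folding step itself.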



I was following this post [Ref 4] describing how to create reaction
difference fingerprints using different fingerprint representations.
Using the code from the post I can create reaction difference
fingerprints using either Morgan or AtomPair, but the output of the code
from the post [Ref 4] and that of CreateDifferenceFingerprintForReaction
differ in fingerprint size, in the values within the fingerprint, and in
density. I am assuming this is due to folding the count vector down to
the default fingerprint size of 2048.


Example code snippet:

# The below defs are from the post
# https://sourceforge.net/p/rdkit/mailman/message/35240736/

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import DataStructs
import copy

def _createFP(mol,maxSize,fpType='AP'):
    mol.UpdatePropertyCache(False)
    if fpType == 'AP':
        return AllChem.GetAtomPairFingerprint(mol, minLength=1,
                                              maxLength=maxSize)
    else:
        Chem.GetSSSR(mol)
        rinfo = mol.GetRingInfo()
        return AllChem.GetMorganFingerprint(mol, radius=maxSize)

def getSumFps(fps):
    summedFP = copy.deepcopy(fps[0])
    for fp in fps[1:]:
        summedFP += fp
    return summedFP

def buildReactionFP(rxn, maxSize=3, fpType='AP'):
    reactants = rxn.GetReactants()
    products = rxn.GetProducts()
    rFP = getSumFps([_createFP(mol,maxSize,fpType=fpType) for mol in
                     reactants])
    pFP = getSumFps([_createFP(mol,maxSize,fpType=fpType) for mol in
                     products])
    return pFP-rFP

rxn1 = AllChem.ReactionFromSmarts('[C:1]C1C1>>[N:1]C1C1', useSmiles=True)
rxfp1 = buildReactionFP(rxn1,maxSize=2)

rxfp1.GetNonzeroElements()
{558114: -2, 574497: -1, 1066050: 2, 1066081: 1}

rxfp1.GetLength()
8388608

# Same reaction now using CreateDifferenceFingerprintForReaction
rxn1_fp = AllChem.CreateDifferenceFingerprintForReaction(rxn1)

rxn1_fp.GetNonzeroElements()
{1048: 10,
 1310: -20,
 1325: 20,
 1372: -10,
 1390: 20,
 1692: -10,
 1757: -20,
 1772: 10}

print(rxn1_fp.GetLength(),rxfp1.GetLength())
2048 8388608

References
1. https://www.rdkit.org/docs/source/rdkit.Chem.rdChemReactions.html#rdkit.Chem.rdChemReactions.CreateDifferenceFingerprintForReaction
2. https://www.rdkit.org/docs/cppapi/structRDKit_1_1ReactionFingerprintParams.html
3. https://www.rdkit.org/docs/GettingStartedInPython.html#morgan-fingerprints-circular-fingerprints
4. https://sourceforge.net/p/rdkit/mailman/message/35240736/

v/r,

Ben

On Mon, Nov 18, 2019 at 10:13 PM Francois Berenger wrote:


On 19/11/2019 03:34, Benjamin Datko wrote:

Hello all,

I am curious about how to fold a count vector fingerprint. I understand
that when folding bit vectors the most common way is to split the vector in
half and apply a bitwise OR operation. I think this is how the
function rdkit.DataStructs.FoldFingerprint works in RDKit; correct me
if I am wrong.

How does RDKit fold count vectors such as AtomPair, Morgan, and
Topological Torsion, and/or what is the appropriate way to do so?
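
The halving scheme for bit vectors described above can be written out in a few lines of plain Python. This is a sketch on a list of 0/1 values, not the RDKit implementation; `fold_bits_by_or` is a made-up name. Splitting into equal parts and OR-ing them together is the same as setting bit `i % new_size` for every bit `i` that is on.

```python
def fold_bits_by_or(bits, fold_factor=2):
    """Fold a 0/1 list into len(bits) // fold_factor positions using bitwise OR."""
    if len(bits) % fold_factor != 0:
        raise ValueError("length must be divisible by fold_factor")
    new_size = len(bits) // fold_factor
    folded = [0] * new_size
    for i, bit in enumerate(bits):
        folded[i % new_size] |= bit
    return folded


bits = [1, 0, 0, 0, 0, 1, 0, 1]
# First half [1, 0, 0, 0] OR second half [0, 1, 0, 1]
print(fold_bits_by_or(bits))  # [1, 1, 0, 1]
```

For count vectors there is no single obvious replacement for OR: the colliding counts could be summed, or the maximum could be kept, and the choice changes what the folded values mean.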


Can you give us some context? Why do you want to do that?

Maybe, you can use the following in order to create
shorter "fingerprints" for which the Tanimoto distance is
still computable (despite becoming approximate then):

---
Shrivastava, A. (2016).
Simple and efficient weighted minwise hashing.
In Advances in Neural Information Processing Systems (pp. 1498-1506).
https://papers.nips.cc/paper/6472-simple-and-efficient-weighted-minwise-hashing.pdf

---

Regards,
F.


I thought about turning the fingerprint into a bit vector using their
respective "AsBitVect" method and then folding using
rdkit.DataStructs.FoldFingerprint, but topological torsion doesn't
have an "AsBitVect" method
[https://www.rdkit.org/docs/GettingStartedInPython.html].

For an explicit example using the AtomPair fingerprint, we can see the
fingerprint is extremely sparse. Could this AtomPair fingerprint be
folded to increase the density?


from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles('CC1C1')
ap_fp = AllChem.GetAtomPairFingerprint(mol, minLength=1, maxLength=3)
number_of_nonzero_elements = len(ap_fp.GetNonzeroElements().values())
print((ap_fp.GetLength(),number_of_nonzero_elements))

(8388608,9)

Very Respectfully,

Ben