Re: [Rdkit-discuss] One tautomer not included in list of enumerated tautomers

2024-02-05 Thread Lewis Martin
Good catch, thank you Diogo!

Recognising the difficulties of tautomer enumeration: For my own purposes,
the ideal behaviour would be to get the set of all three plausible
tautomers of 'mol1' no matter what the input SMILES. Looks like there's
already a Github Issue up (https://github.com/rdkit/rdkit/issues/5937) but
I can add this if it has a different cause.

thanks all
Lewis



On Tue, Feb 6, 2024 at 7:23 AM Diogo Martins  wrote:

> Hello,
>
> I think it's a bug because the tautomers depend on how the input SMILES is
> written. Both represent mol1:
>
> Sc1ncc2c(c1)2
> Sc1cc2c2cn1
>
> However the resulting tautomers differ depending on which is used as input.
>
> Best regards,
> Diogo
>
> On Mon, 5 Feb 2024 at 11:38, Lewis Martin 
> wrote:
>
>> Thank you very much for the detective work, Wim! This is helpful.
>>
>> It looks like the _reverse_ transition is possible, though. If I start by
>> generating tautomers of "mol2", then "mol1" is recovered, which indicates
>> this is an allowed transform. Is it possible that one direction is allowed
>> but not the reverse?
>>
>> Failing a solution there, does anyone know if it is possible to add
>> SMIRKS to the allowed tautomers through the python interface?
>> Thanks,
>> Lewis
>>
>> On Mon, Feb 5, 2024 at 9:52 PM Wim Dehaen  wrote:
>>
>>> hi lewis,
>>> if i am not mistaken this is because the tautomer transfor "1,3 aromatic
>>> heteroatom H shift" does not account for other chalcogens than oxygen, so
>>> no selenium, tellurium or sulfur.
>>> you can find the list of transforms here:
>>> https://github.com/rdkit/rdkit/blob/8dae48b7a17fd984c69d04549e6d9b53690f5c52/Code/GraphMol/MolStandardize/TautomerCatalog/tautomerTransforms.in#L46
>>> (poiting to the line with the relevant transform).
>>> best wishes
>>> wim
>>>
>>> On Mon, Feb 5, 2024 at 3:26 AM Lewis Martin 
>>> wrote:
>>>
>>>> Hi all,
>>>> I'm looking at scoring tautomers, and using the 'tautobase' dataset
>>>> used by Weider et al* at:
>>>>
>>>> https://github.com/choderalab/neutromeratio/blob/master/data/b3lyp_tautobase_subset.txt
>>>>
>>>> This dataset has pairs of tautomers with experimental logK values to
>>>> determine the preferred tautomer.
>>>>
>>>> In at least one case, depending on which tautomer you use as the
>>>> 'entry' point, the enumerated tautomers by RDKit either do or don't include
>>>> both of the pair of input molecules. *I'm hoping there's a way to
>>>> uniquely recover the full set of possible tautomers from using any input
>>>> tautomer. *
>>>>
>>>> Here's a code example:
>>>>
>>>> from rdkit import Chem
>>>>>
>>>> from rdkit.Chem import Draw
>>>>
>>>> from rdkit.Chem.Draw import IPythonConsole
>>>>> IPythonConsole.drawOptions.addStereoAnnotation = True
>>>>> from rdkit.Chem.MolStandardize import rdMolStandardize
>>>>>
>>>>> #same result if you don't do any of these params.
>>>>
>>>> tautomer_params =
>>>>> Chem.MolStandardize.rdMolStandardize.CleanupParameters()
>>>>> tautomer_params.tautomerRemoveSp3Stereo = False
>>>>> tautomer_params.tautomerRemoveBondStereo = False
>>>>> tautomer_params.tautomerRemoveIsotopicHs = False
>>>>> tautomer_params.tautomerReassignStereo = False
>>>>> tautomer_params.doCanonical = True
>>>>>
>>>>> enumerator = rdMolStandardize.TautomerEnumerator(tautomer_params)
>>>>>
>>>>> smi1 = 'Sc1cc2c2cn1'
>>>>> smi2 = 'S=c1cc2c2c[nH]1'
>>>>> mol1 = Chem.MolFromSmiles(smi1)
>>>>> mol2 = Chem.MolFromSmiles(smi2)
>>>>>
>>>>> #choose mol1 or mol2 to be source of tautomers:
>>>>> #choose mol1, and look at the tautomers. Note that mol2 isn't present!
>>>>> tauts = [Chem.MolFromSmiles(Chem.MolToSmiles(m)) for m in
>>>>> enumerator.Enumerate(mol1)]
>>>>>
>>>>> Draw.MolsToGridImage([mol1, mol2]+tauts, legends=['mol1', 'mol2 (not
>>>>> present in tauts!)'] + [f'taut{i}' for i in range(len(tauts))],
>>>>>  molsPerRow=4)
>>>>>
>>>>
>>>> And a picture of this in a notebook for an at-a-glance view:
>>>> https://gist.github.com/ljmartin/4a9d9eb684df3e11e59fc6502a4b7b03
>>>>
>>>> Does anyone know a way to recover "mol2" within tautomers of "mol1"?
>>>>
>>>> Thank you!
>>>> Lewis
>>>>
>>>>
>>>> ___
>>>> Rdkit-discuss mailing list
>>>> Rdkit-discuss@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>>
>>> ___
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] One tautomer not included in list of enumerated tautomers

2024-02-05 Thread Lewis Martin
Thank you very much for the detective work, Wim! This is helpful.

It looks like the _reverse_ transition is possible, though. If I start by
generating tautomers of "mol2", then "mol1" is recovered, which indicates
this is an allowed transform. Is it possible that one direction is allowed
but not the reverse?

Failing a solution there, does anyone know if it is possible to add SMIRKS
to the allowed tautomers through the python interface?
Thanks,
Lewis

On Mon, Feb 5, 2024 at 9:52 PM Wim Dehaen  wrote:

> hi lewis,
> if i am not mistaken this is because the tautomer transfor "1,3 aromatic
> heteroatom H shift" does not account for other chalcogens than oxygen, so
> no selenium, tellurium or sulfur.
> you can find the list of transforms here:
> https://github.com/rdkit/rdkit/blob/8dae48b7a17fd984c69d04549e6d9b53690f5c52/Code/GraphMol/MolStandardize/TautomerCatalog/tautomerTransforms.in#L46
> (poiting to the line with the relevant transform).
> best wishes
> wim
>
> On Mon, Feb 5, 2024 at 3:26 AM Lewis Martin 
> wrote:
>
>> Hi all,
>> I'm looking at scoring tautomers, and using the 'tautobase' dataset used
>> by Weider et al* at:
>>
>> https://github.com/choderalab/neutromeratio/blob/master/data/b3lyp_tautobase_subset.txt
>>
>> This dataset has pairs of tautomers with experimental logK values to
>> determine the preferred tautomer.
>>
>> In at least one case, depending on which tautomer you use as the 'entry'
>> point, the enumerated tautomers by RDKit either do or don't include both of
>> the pair of input molecules. *I'm hoping there's a way to uniquely
>> recover the full set of possible tautomers from using any input tautomer. *
>>
>> Here's a code example:
>>
>> from rdkit import Chem
>>>
>> from rdkit.Chem import Draw
>>
>> from rdkit.Chem.Draw import IPythonConsole
>>> IPythonConsole.drawOptions.addStereoAnnotation = True
>>> from rdkit.Chem.MolStandardize import rdMolStandardize
>>>
>>> #same result if you don't do any of these params.
>>
>> tautomer_params = Chem.MolStandardize.rdMolStandardize.CleanupParameters()
>>> tautomer_params.tautomerRemoveSp3Stereo = False
>>> tautomer_params.tautomerRemoveBondStereo = False
>>> tautomer_params.tautomerRemoveIsotopicHs = False
>>> tautomer_params.tautomerReassignStereo = False
>>> tautomer_params.doCanonical = True
>>>
>>> enumerator = rdMolStandardize.TautomerEnumerator(tautomer_params)
>>>
>>> smi1 = 'Sc1cc2c2cn1'
>>> smi2 = 'S=c1cc2c2c[nH]1'
>>> mol1 = Chem.MolFromSmiles(smi1)
>>> mol2 = Chem.MolFromSmiles(smi2)
>>>
>>> #choose mol1 or mol2 to be source of tautomers:
>>> #choose mol1, and look at the tautomers. Note that mol2 isn't present!
>>> tauts = [Chem.MolFromSmiles(Chem.MolToSmiles(m)) for m in
>>> enumerator.Enumerate(mol1)]
>>>
>>> Draw.MolsToGridImage([mol1, mol2]+tauts, legends=['mol1', 'mol2 (not
>>> present in tauts!)'] + [f'taut{i}' for i in range(len(tauts))],
>>>  molsPerRow=4)
>>>
>>
>> And a picture of this in a notebook for an at-a-glance view:
>> https://gist.github.com/ljmartin/4a9d9eb684df3e11e59fc6502a4b7b03
>>
>> Does anyone know a way to recover "mol2" within tautomers of "mol1"?
>>
>> Thank you!
>> Lewis
>>
>>
>> ___
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] One tautomer not included in list of enumerated tautomers

2024-02-04 Thread Lewis Martin
Hi all,
I'm looking at scoring tautomers, and using the 'tautobase' dataset used by
Weider et al* at:
https://github.com/choderalab/neutromeratio/blob/master/data/b3lyp_tautobase_subset.txt

This dataset has pairs of tautomers with experimental logK values to
determine the preferred tautomer.

In at least one case, depending on which tautomer you use as the 'entry'
point, the enumerated tautomers by RDKit either do or don't include both of
the pair of input molecules. *I'm hoping there's a way to uniquely recover
the full set of possible tautomers from using any input tautomer. *

Here's a code example:

from rdkit import Chem
>
from rdkit.Chem import Draw

from rdkit.Chem.Draw import IPythonConsole
> IPythonConsole.drawOptions.addStereoAnnotation = True
> from rdkit.Chem.MolStandardize import rdMolStandardize
>
> #same result if you don't do any of these params.

tautomer_params = Chem.MolStandardize.rdMolStandardize.CleanupParameters()
> tautomer_params.tautomerRemoveSp3Stereo = False
> tautomer_params.tautomerRemoveBondStereo = False
> tautomer_params.tautomerRemoveIsotopicHs = False
> tautomer_params.tautomerReassignStereo = False
> tautomer_params.doCanonical = True
>
> enumerator = rdMolStandardize.TautomerEnumerator(tautomer_params)
>
> smi1 = 'Sc1cc2c2cn1'
> smi2 = 'S=c1cc2c2c[nH]1'
> mol1 = Chem.MolFromSmiles(smi1)
> mol2 = Chem.MolFromSmiles(smi2)
>
> #choose mol1 or mol2 to be source of tautomers:
> #choose mol1, and look at the tautomers. Note that mol2 isn't present!
> tauts = [Chem.MolFromSmiles(Chem.MolToSmiles(m)) for m in
> enumerator.Enumerate(mol1)]
>
> Draw.MolsToGridImage([mol1, mol2]+tauts, legends=['mol1', 'mol2 (not
> present in tauts!)'] + [f'taut{i}' for i in range(len(tauts))],
>  molsPerRow=4)
>

And a picture of this in a notebook for an at-a-glance view:
https://gist.github.com/ljmartin/4a9d9eb684df3e11e59fc6502a4b7b03

Does anyone know a way to recover "mol2" within tautomers of "mol1"?

Thank you!
Lewis
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Query on a failed molecule from SureChEMBL

2021-12-15 Thread Lewis Martin
Thanks a lot Greg! That is indeed very helpful.

Just to know that the molecule is odd is helpful too. The mol blocks appear
to be V2000 format and have names like "Mrv0541 03021215572D" which says
ChemAxon Marvin to me, but I'm still unsure why SureChEMBL would use such a
representation (it doesn't look like a faithful transcription from the
source patent). Off-topic, but if anyone happens to have an insight or
connection with SureChEMBL, please do reach out!

Cheers
Lewis




On Wed, Dec 15, 2021 at 4:24 PM Greg Landrum  wrote:

> Hi Lewis,
>
> Dealing with all the strange chemical representations that show up "in the
> wild" is an ongoing struggle.
>
> Your first example is pretty clearly intended to be an azide and we can
> certainly add a rule to normalize that one to what the RDKit expects it to
> be (there already is a rule for C-N=N#N, but that doesn't help here.). That
> won't happen before the next feature release though.
>
> I'm not really sure what the intent was for the two
> four-coordinate neutral Ns in the second molecule, so I think it's unlikely
> that we'd add a standard cleanup for one.
>
> However! The good news is that there's a pretty easy (and efficient) way
> to fix this yourself. We added a new method to chemical reactions in the
> 2021.09 release which allows you to modify a molecule in place (subject to
> some constraints). This is ideal for doing cleanup transformations like
> these.
>
> This gist shows how to write reaction rules for your cases (I guessed for
> what the Ns are supposed to be) and then use them:
> https://gist.github.com/greglandrum/8fd229bc6bf6c734d1c21da7f2bebebb
>
> Hope this helps,
> -greg
>
>
> On Wed, Dec 15, 2021 at 12:21 AM Lewis Martin 
> wrote:
>
>> Hi All,
>> Reading molecules from a bulk download of SureChEMBL, I come across a
>> fair few molecules that fail to parse. Not sure whether they SHOULD parse
>> or not.
>>
>> Here is an example: https://www.surechembl.org/chemical/SCHEMBL386
>> with SMILES code: COC(=O)C1=C(C=CC=C1)C1=CC=C(C[N+]#[N]=[N-])C=C1
>>
>> Even reading the SMILES code one can see that there are too many bonds in
>> there - a nitrogen triply bonded and doubly bonded to other atoms.
>>
>> Another example: https://www.surechembl.org/chemical/SCHEMBL33957
>> smiles: NC(N)=[NH]C1=NC(CSCC[NH]=CNS(=O)(=O)C2=CC=C(Br)C=C2)=CS1
>>
>> Again, valence for a nitrogen is off.
>>
>> Should I expect to parse these with RDKit? Might there be some way around
>> this? It's a significant fraction of the molecules in SureChEMBL.
>>
>> Thanks team!
>> Lewis
>> ___
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] Programmatic access to MMFF torsion indices and parameters

2021-11-19 Thread Lewis Martin
Hi all,
Does anyone have a way to access the atom indices and the parameters for
all of the dihedrals (aka torsions) in an MMFF force field object? I
noticed that one can set mmffVerbosity=2 to print them to stdout, but I'd
like to access them programmatically. I believe Paolo Tosco must have done
this as part of an effort to port MMFF to OpenMM via rdkit (
https://github.com/ptosco/rdkit/tree/openmm ) , but I can't find the
portion of code that does it!

Thanks a lot
Lewis
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Parsing a PDB file with atoms that are too close, causing bad bond

2021-09-27 Thread Lewis Martin
Very interesting - thank you Francois! PDB re-do does the trick:









*import requestsfrom rdkit import Chemdef getPDB(code):out =
requests.get(f'https://pdb-redo.eu/db/{code}/{code}_final.pdb
<https://pdb-redo.eu/db/{code}/{code}_final.pdb>')return
out.contentpdb_string = getPDB('3udn')Chem.MolFromPDBBlock(pdb_string)*

I think this solves it for me, but if anyone knows how to infer correct
bonding information without relying on distances, I'd love to hear it too!
So far I've noticed that Parmed and PDBFixer infer correct bonds, but they
don't determine bond orders, so it's difficult to port the molecule into
RDKit.

Cheers
Lewis



On Mon, Sep 27, 2021 at 5:55 PM Francois Berenger  wrote:

> Hi Lewis,
>
> Just an idea: you might try to load your PDB in UCSF Chimera, then
> save it as a mol2 or sdf file.
> Then, try to read this sdf file from rdkit.
>
> Another idea: try to get your pdb file through the pdbredo service.
> https://pdb-redo.eu/
> They might have fixed a few things; maybe this PDB will read better in
> rdkit.
>
> Regards,
> F.
>
> On 26/09/2021 17:02, Lewis Martin wrote:
> > Hi RDKit,
> > While parsing proteins from the PBD with RDKit, I've come across
> > situations where the distance-based bond determination leads to
> > 'incorrect' bonds between atoms that are erroneously too close
> > together. PDB files have no bond information, so it's not really
> > 'incorrect' (rather the model coordinates are off), but the bonds are
> > nonphysical - and it means the Mol objects won't sanitize.
> >
> > Here's an example:
> >
> > import requests
> > from io import BytesIO
> > import gzip
> > from rdkit import Chem
> >
> > def getPDB(code):
> > out =
> > requests.get(f'https://files.rcsb.org/download/{code}.pdb1.gz [1]')
> > binary_stream =  BytesIO(out.content)
> > return gzip.open(binary_stream).read()
> >
> > pdb_string = getPDB('3udn')
> > Chem.MolFromPDBBlock(pdb_string)
> >
> > Error is:
> >
> > RDKit ERROR: [22:38:21] Explicit valence for atom # 573 O, 3, is
> > greater than permitted
> >
> > This is caused by the threonine 72 sidechain being too close to the
> > TYR71 backbone carbonyl oxygen (this can be visualized at
> > https://www.rcsb.org/3d-view/3UDN?preset=ligandInteraction=09B ,
> > TYR71 is near the ligand).
> >
> > Does anyone know how to avoid this to create a Chem.Mol? I've tried
> > using Parmed and PDBFixer, since they use residue templates to
> > generate the correct bonding topology, but they don't write CONECT
> > records or SDFs, so the bonds are still lost to RDKit.
> >
> > Thanks for your time!
> > Lewis
> > PS - why not just use PDBFixer? I'm trying to calculate atom
> > invariants using RDKit's morgan fingerprinter implementation, so
> > ultimately I want a sanitized Mol object
> >
> > Links:
> > --
> > [1] https://files.rcsb.org/download/%7Bcode%7D.pdb1.gz
> > ___
> > Rdkit-discuss mailing list
> > Rdkit-discuss@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] Parsing a PDB file with atoms that are too close, causing bad bond

2021-09-26 Thread Lewis Martin
Hi RDKit,
While parsing proteins from the PBD with RDKit, I've come across situations
where the distance-based bond determination leads to 'incorrect' bonds
between atoms that are erroneously too close together. PDB files have no
bond information, so it's not really 'incorrect' (rather the model
coordinates are off), but the bonds are nonphysical - and it means the Mol
objects won't sanitize.

Here's an example:

import requests
from io import BytesIO
import gzip
from rdkit import Chem

def getPDB(code):
out = requests.get(f'https://files.rcsb.org/download/{code}.pdb1.gz')
binary_stream =  BytesIO(out.content)
return gzip.open(binary_stream).read()

pdb_string = getPDB('3udn')
Chem.MolFromPDBBlock(pdb_string)

Error is:

RDKit ERROR: [22:38:21] Explicit valence for atom # 573 O, 3, is
greater than permitted

This is caused by the threonine 72 sidechain being too close to the TYR71
backbone carbonyl oxygen (this can be visualized at
https://www.rcsb.org/3d-view/3UDN?preset=ligandInteraction=09B , TYR71
is near the ligand).

Does anyone know how to avoid this to create a Chem.Mol? I've tried using
Parmed and PDBFixer, since they use residue templates to generate the
correct bonding topology, but they don't write CONECT records or SDFs, so
the bonds are still lost to RDKit.


Thanks for your time!
Lewis
PS - why not just use PDBFixer? I'm trying to calculate atom invariants
using RDKit's morgan fingerprinter implementation, so ultimately I want a
sanitized Mol object
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] Calculating strain energy of conformers

2021-07-11 Thread Lewis Martin
Hi RDKit,
I'm exploring strain energies in the context of virtual screening,
something that has been considered for a while[1] and is still being
explored today[2].

There may not be a canonical way, but is this a valid/good way to calculate
strain energy? I'm just not sure if I'm using the MMFF correctly.

from rdkit import Chem
from rdkit.Chem import AllChem
testmol =
Chem.MolFromSmiles('Cc1c(c2cc(ccc2n1C(=O)c3ccc(cc3)Cl)OC)CC(=O)O')
#indomethacin
testmol = Chem.AddHs(testmol)
AllChem.EmbedMultipleConfs(testmol, 50)

mp = AllChem.MMFFGetMoleculeProperties(testmol, mmffVariant='MMFF94s')

# keep angle-related terms
mp.SetMMFFOopTerm(True)
mp.SetMMFFAngleTerm(True)
mp.SetMMFFTorsionTerm(True)
# turn off anything unrelated to angles
mp.SetMMFFStretchBendTerm(False)
mp.SetMMFFBondTerm(False)
mp.SetMMFFVdWTerm(False)
mp.SetMMFFEleTerm(False)

for c in range(50):
print(AllChem.MMFFGetMoleculeForceField(testmol, mp,
confId=c).CalcEnergy())
MMFFOptimizeMolecule(testmol, confId=c, mmffVariant='MMFF94s')
print('\t'+str(AllChem.MMFFGetMoleculeForceField(testmol, mp,
confId=c).CalcEnergy()))

Thanks for any comments or advice!
Lewis


[1]Conformational Analysis of Drug-Like Molecules Bound to Proteins: An
Extensive Study of Ligand Reorganization upon Binding, 2003
[2]Ligand Strain Energy in Large Library Docking, 2021
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] MolsToGridImage drawing multiple conformers of single molecule

2021-06-07 Thread Lewis Martin
Hi all,
Is there a way to draw multiple conformers of a single molecule using
Draw.MolsToGridImage?

Here's a first attempt but, either all conformers are exactly the same or
the parent molecule falls back to conformer 0, since all molecules in the
grid appear the same:

```
from rdkit import Chem
from rdkit.Chem import AllChem, Draw

#indomethacin
mol = Chem.MolFromSmiles('Cc1c(c2cc(ccc2n1C(=O)c3ccc(cc3)Cl)OC)CC(=O)O')
mol = Chem.AddHs(mol)
AllChem.EmbedMultipleConfs(mol, numConfs=10)
mol = Chem.RemoveHs(mol)

#drawBonds(mol)
Draw.MolsToGridImage([mol.GetConformer(i).GetOwningMol() for i in
range(10)])
```

Perhaps a simpler question is, how do I access a single conformer from the
10 available in order to draw it?

Thanks!
Lewis
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] Using MolsToGridImage to draw multiple conformers of a single molecule

2021-06-06 Thread Lewis Martin
Apologies if this doubles up, I think sourceforge was having issues...

Hi all,
Is there a way to draw multiple conformers of a single molecule using
Draw.MolsToGridImage?

Here's a first attempt but, either all conformers are exactly the same or
the parent molecule falls back to conformer 0, since all molecules in the
grid appear the same:

```
from rdkit import Chem
from rdkit.Chem import AllChem, Draw

#indomethacin
mol = Chem.MolFromSmiles('Cc1c(c2cc(ccc2n1C(=O)c3ccc(cc3)Cl)OC)CC(=O)O')
mol = Chem.AddHs(mol)
AllChem.EmbedMultipleConfs(mol, numConfs=10)
mol = Chem.RemoveHs(mol)

#drawBonds(mol)
Draw.MolsToGridImage([mol.GetConformer(i).GetOwningMol() for i in
range(10)])
```

Perhaps a simpler question is, how do I access a molecule of a single
conformer in order to draw it?

Thanks!
Lewis
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] XYZ to mol ???

2021-06-06 Thread Lewis Martin
I know this doesn't address the question exactly - but you can also do this
(using RDKit) via the jan jensen and colleague's xyz2mol -->
https://github.com/jensengroup/xyz2mol
- lew

On Sat, Jun 5, 2021 at 10:56 AM Storer, Joey (J)  wrote:

> Dear all,
>
>
>
> For molecular modeling workflows and interoperability with QM/MM etc.,
>
>
>
> Can RDKit gain a Chem.XyzToMol(xyz) functionality?
>
>
>
> Thanks for considering this.
>
>
>
> Joey Storer
>
> Dow, Inc.
>
> General Business
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] GPU Implementation of shape-based 3D overlap on rdkit?

2020-11-03 Thread Lewis Martin
Ive had an initial go at something like this using JAX. I chose JAX since
it has a shallow learning curve, essentially being numpy on a GPU. This is
great for vectorized calculations, but less so for applications that
involve a lot of control flow (ie if/else statements), which as i
understand it most point cloud registration algorithms use, such as
iterative closest point or anything available in open3d.

No guarantee ill make any progress of course, but would someone mind
recommending a paper explaining a nice subshape alignment algorithm?

Thanks :)
Lewis

On Wed, 4 Nov 2020 at 3:52 am, Andy Jennings 
wrote:

> Hi Greg,
>
> Thanks for the response and background. Here's hoping someone is smart
> enough to code this up and generous enough to donate it back to the
> community.
>
> Best,
> Andy
>
> On Mon, Nov 2, 2020 at 8:52 PM Greg Landrum 
> wrote:
>
>> Hi Andy,
>>
>> At the moment the RDKit doesn't have either high-quality shape-based
>> alignment code[1] or GPU support.
>>
>> I think having good shape-based alignment available would be a really
>> useful complement to the Open3DAlign code that's already there, but it's
>> certainly not a small project.
>>
>> -greg
>> [1] The python implementation of the subshape alignment algorithm is
>> essentially just a proof-of-concept and not performant enough for real
>> usage.
>>
>> On Mon, Nov 2, 2020 at 7:16 PM Andy Jennings 
>> wrote:
>>
>>> Hi,
>>>
>>> I see that back in 2014 there was some discussion of using CUDA inside
>>> of RDKit and how it may be possible to produce a FastROCS-like open source
>>> alternative. I was curious if anyone had made such a breakthrough. Since
>>> GPU availability is now so common, and datasets are becoming so large, I
>>> figured that more and more people would be thinking RDKit + GPU = :-)
>>>
>>> Thanks in advance.
>>> Andy
>>> ___
>>> Rdkit-discuss mailing list
>>> Rdkit-discuss@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>
>> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
-- 
Sent from Gmail Mobile
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] Largest possible Morgan fp bit magnitude

2020-10-08 Thread Lewis Martin
Hi all,
Felt sure this would have been asked but I can't find it. What is the
'largest' possible bit in an unfolded Morgan fingerprint? Asked another
way, what type of number are the substructure identities hashed into?

The Rogers and Hahn ECFP paper says that they hash into a 32-bit integer,
and in the paper they use negative and positive values.

Since hashing generates bits with mostly uniform density, I tried sampling
some fingerprints. Testing a few hundred thousand molecules, the largest
bit I found was suspiciously close to 2X larger than the maximum
expressible number for a 32-bit integer. So I guess that, to be consistent
with Rogers and Hahn the bits are hashed into 32-bit integers, but then
they are shifted to be positive? Is that correct?

Thanks :) hope the UGM went well.
Lewis

PS context is I saw a weird result where prediction scores kept getting
higher when I used larger fingerprints. At size 8192 I ran out of memory,
so I'm moving to sparse representation (possibly unfolded) but I don't know
how big the sparse matrix should be.
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Smallest possible size of 100*1e6 morgan fingerprints for storage and memory

2020-09-08 Thread Lewis Martin
OK to sum it up, for me writing to binary is a neat, fast, and low-storage
solution for fingerprints. Example:
o = open('fingerprints.bin', 'wb')
gen_mo = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=64)
for smi in tqdm_notebook(df['smiles']):
mol = Chem.MolFromSmiles(smi)
fp = gen_mo.GetFingerprint(mol)
bs = bitstring.BitArray(bin=fp.ToBitString())
o.write(bs.bytes)

o.close()

Most of the time was being taken up in creating numpy arrays. For instance:
%%timeit
fp = np.array(gen_mo.GetFingerprint(mol))

351 µs ± 2.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

vs:

%%timeit
fp = gen_mo.GetFingerprint(mol)
bs = bitstring.BitArray(bin=fp.ToBitString())
42 µs ± 273 ns per loop (mean ± std. dev. of 7 runs, 1 loops each)


so when you remove that step, 100mm ligands takes about 4 1/2 hours in
serial, no need for parallelization. Cheers!
PS for newbs like me, read the binary fingerprints like this:
i = open('fingerprints.bin', 'rb')
bits = ''
for _ in range(num_fps):
fp_bytes = i.read(8)
bits += bitstring.BitArray(bytes=fp_bytes).bin
i.close()
fingerprints_concat = (np.fromstring(bits,'u1') -
ord('0')).reshape(num_fps, 64)

On Wed, Sep 9, 2020 at 1:28 PM Greg Landrum  wrote:

> The most efficient (easy) way to store the fingerprints is using
> DataStructs.BitVectToBinaryText(). That will return a 64byte binary string
> for a 512bit fingerprint.
>
> FWIW: if you haven't seen the recent blog post about similarity searching
> with short fingerprints:
> http://rdkit.blogspot.com/2020/08/doing-similarity-searches-with-highly.html
>
> -greg
>
>
> On Wed, Sep 9, 2020 at 2:37 AM Lewis Martin 
> wrote:
>
>> Hi RDKit,
>>
>> Looking for advice on an rdkit-adjacent problem please. Ultimately I'd
>> like to fit an approximate-nearest neighbors index on a dataset of 100
>> million ligands, featurized by morgan fingerprint. The text file of the
>> smiles is ~6gb but this blows out when loaded with pandas.read_csv() or
>> f.readlines() due to weird memory allocation issues.
>>
>>
>> It would take 45hrs to process the file in serial (i.e. read line, create
>> mol, fingerprint, convert to np.arr or sparse arrays) in a streaming manner
>> so now I'd like to parallelize the job with joblib, which would multiply
>> the memory requirements by the number of processes running at a time.
>>
>> So: what is the smallest possible representation for a binary
>> fingerprint? Using `sys.getsizeof` on a
>> rdkit.DataStructs.cDataStructs.ExplicitBitVect object tells me it is 96
>> bytes, but I'm not sure whether to believe that since, like csr_matrix, the
>> size depends on accurately returning the object's data. Here's an example
>> demonstrating this:
>>
>> from rdkit import Chem
>> from rdkit.Chem import rdFingerprintGenerator
>> smi = 'COCCN(CCO)C(=O)/C=C\\c1cccnc1'
>> mol = Chem.MolFromSmiles(smi)
>> gen_mo = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=512)
>> fp = gen_mo.GetFingerprint(mol)
>> sparse_fp = sparse.csr_matrix(fp)
>>
>> print('ExplicitBitVect object size:', getsizeof(fp))
>> print('Sparse matrix size (naive):', getsizeof(sparse_fp))
>> print('Sparse matrix size (real):',
>> sparse_fp.data.nbytes+sparse_fp.indices.nbytes+sparse_fp.indptr.nbytes)
>> print('fp.ToBinary size:', getsizeof(fp.ToBinary()))
>> print('fp.ToBinary size:', getsizeof(fp.ToBase64()))
>> >>>
>>
>> ExplicitBitVect object size: 96
>> Sparse matrix size (naive): 64
>> Sparse matrix size (real): 476
>> fp.ToBinary size: 85
>> fp.ToBinary size: 121
>>
>>
>>
>> Note that even the smallest of these multiplied by 100 million would be
>> about 8gb, still larger than the text file storing the smiles codes - not
>> sure if that is to be expected or not?
>>
>> Thank for your time!
>> Lewis
>>
>>
>> ___
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Smallest possible size of 100*1e6 morgan fingerprints for storage and memory

2020-09-08 Thread Lewis Martin
Cheers Francois - that might be the way to go actually. I'll try with
'bitstring' https://github.com/scott-griffiths/bitstring and I guess write
the data as concatenated bitarrays in chunked binary files.

I'd like to keep it FOSS since its for academic publication and hopefully
to be re-used. Chemfp is amazing but brute-forcing 100million by 100million
would surely still take a long time compared with an approximate nearest
neighbor approach.

Straying from RDKit so Ill leave it there - thanks!

On Wed, Sep 9, 2020 at 11:29 AM Francois Berenger  wrote:

> On 09/09/2020 09:35, Lewis Martin wrote:
> > Hi RDKit,
> >
> > Looking for advice on an rdkit-adjacent problem please. Ultimately I'd
> > like to fit an approximate-nearest neighbors index on a dataset of 100
> > million ligands, featurized by morgan fingerprint. The text file of
> > the smiles is ~6gb but this blows out when loaded with
> > pandas.read_csv() or f.readlines() due to weird memory allocation
> > issues.
> >
> > It would take 45hrs to process the file in serial (i.e. read line,
> > create mol, fingerprint, convert to np.arr or sparse arrays) in a
> > streaming manner so now I'd like to parallelize the job with joblib,
> > which would multiply the memory requirements by the number of
> > processes running at a time.
> >
> > So: what is the smallest possible representation for a binary
> > fingerprint? Using `sys.getsizeof` on a
> > rdkit.DataStructs.cDataStructs.ExplicitBitVect object tells me it is
> > 96 bytes, but I'm not sure whether to believe that since, like
> > csr_matrix, the size depends on accurately returning the object's
> > data. Here's an example demonstrating this:
> >
> > from rdkit import Chem
> > from rdkit.Chem import rdFingerprintGenerator
> > smi = 'COCCN(CCO)C(=O)/C=C\\c1cccnc1'
> > mol = Chem.MolFromSmiles(smi)
> > gen_mo = rdFingerprintGenerator.GetMorganGenerator(radius=2,
> > fpSize=512)
>
> Obviously, if you ask for fpSize = 512, the smallest uncompressed
> representation of the fingerprint will be 512 bits (64 bytes).
>
> 10M of such fingerprints, if there is not any overhead added by the
> programming language,
> would fit into 6GB of RAM.
>
> But, the really fun things will start when you want to search fast into
> so many molecules. :)
> There are many published methods, some open-source software (like
> Dalke's chemfp) and even some commercial ones
> which claim they are lightning fast (even reaching real-time search
> speed!).
>
> e.g.
> https://chemaxon.com/products/madfast
> https://www.nextmovesoftware.com/arthor.html
>
> Regards,
> F.
>
> > fp = gen_mo.GetFingerprint(mol)
> > sparse_fp = sparse.csr_matrix(fp)
> >
> > print('ExplicitBitVect object size:', getsizeof(fp))
> > print('Sparse matrix size (naive):', getsizeof(sparse_fp))
> > print('Sparse matrix size (real):',
> > sparse_fp.data.nbytes+sparse_fp.indices.nbytes+sparse_fp.indptr.nbytes)
> > print('fp.ToBinary size:', getsizeof(fp.ToBinary()))
> > print('fp.ToBinary size:', getsizeof(fp.ToBase64()))
> >>>>
> >
> > ExplicitBitVect object size: 96
> > Sparse matrix size (naive): 64
> > Sparse matrix size (real): 476
> > fp.ToBinary size: 85
> > fp.ToBinary size: 121
> >
> > Note that even the smallest of these multiplied by 100 million would
> > be about 8gb, still larger than the text file storing the smiles codes
> > - not sure if that is to be expected or not?
> >
> > Thank for your time!
> > Lewis
> > ___
> > Rdkit-discuss mailing list
> > Rdkit-discuss@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] Smallest possible size of 100*1e6 morgan fingerprints for storage and memory

2020-09-08 Thread Lewis Martin
Hi RDKit,

Looking for advice on an rdkit-adjacent problem please. Ultimately I'd like
to fit an approximate-nearest neighbors index on a dataset of 100 million
ligands, featurized by morgan fingerprint. The text file of the smiles is
~6gb but this blows out when loaded with pandas.read_csv() or f.readlines()
due to weird memory allocation issues.


It would take 45hrs to process the file in serial (i.e. read line, create
mol, fingerprint, convert to np.arr or sparse arrays) in a streaming manner
so now I'd like to parallelize the job with joblib, which would multiply
the memory requirements by the number of processes running at a time.

So: what is the smallest possible representation for a binary fingerprint?
Using `sys.getsizeof` on a rdkit.DataStructs.cDataStructs.ExplicitBitVect
object tells me it is 96 bytes, but I'm not sure whether to believe that
since, like csr_matrix, the size depends on accurately returning the
object's data. Here's an example demonstrating this:

from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
smi = 'COCCN(CCO)C(=O)/C=C\\c1cccnc1'
mol = Chem.MolFromSmiles(smi)
gen_mo = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=512)
fp = gen_mo.GetFingerprint(mol)
sparse_fp = sparse.csr_matrix(fp)

print('ExplicitBitVect object size:', getsizeof(fp))
print('Sparse matrix size (naive):', getsizeof(sparse_fp))
print('Sparse matrix size (real):',
sparse_fp.data.nbytes+sparse_fp.indices.nbytes+sparse_fp.indptr.nbytes)
print('fp.ToBinary size:', getsizeof(fp.ToBinary()))
print('fp.ToBinary size:', getsizeof(fp.ToBase64()))
>>>

ExplicitBitVect object size: 96
Sparse matrix size (naive): 64
Sparse matrix size (real): 476
fp.ToBinary size: 85
fp.ToBinary size: 121



Note that even the smallest of these multiplied by 100 million would be
about 8gb, still larger than the text file storing the smiles codes - not
sure if that is to be expected or not?

Thank for your time!
Lewis
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] Error parsing a MUTAG smiles

2020-03-04 Thread Lewis Martin
Hi all,
Im chasing up a small puzzle on parsing SMILES codes if anyone's
interested, but its not directly RDkit. I was looking at the molecules in
the MUTAG dataset, which is commonly used in graph learning research.
Mostly these are just shared as graphs (i.e. vertices and edges) rather
than SMILES codes, but I did find a list of SMILES at ChemDB:
http://cdb.ics.uci.edu/cgibin/LearningDatasetsWeb.py

RDKit can't parse two of the SMILES codes - indices 82 and 187 using zero
indexing - and youll note that they are the same:

smiles[82]

'c1ccc2=NC3=CC(=CC=C3=c2c1)[N+](=O)[O-]'


smiles[187]

'c1ccc2=NC3=CC=C(C=C3=c2c1)[N+](=O)[O-]'


I found these in the original paper* as '83 - 2-nitrocarbazoIe' and '188 -
3-nitrocarbazole' (1-indexed). So the smiles codes should be similar but
not exactly the same.

Can anyone please tell me if those smiles codes are legit or has there been
a transcription error? It's easy to replace these with the correct codes
from pubchem, but if you're familiar with the dataset is it safe to trust
the other codes? Why does the paper have 197 molecules but the dataset only
has 188?

Thanks for your time!


Pubchem for 2nitrocarbazole:
https://pubchem.ncbi.nlm.nih.gov/compound/99612#section=InChI
Pubchem for 3nitrocarbazole:
https://pubchem.ncbi.nlm.nih.gov/compound/3-Nitrocarbazole



*original paper is "Structure-activity relationship of mutagenic aromatic
and heteroaromatic nitro compounds. Correlation with molecular orbital
energies and hydrophobicity" https://www.ncbi.nlm.nih.gov/pubmed/1995902
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] How do rdFingerprintGenerator.GetMorganGenerator and AllChem.GetMorganFingerprintAsBitVect differ?

2019-07-10 Thread Lewis Martin
Thanks Greg!
Would you mind giving a blurb or a link to a paper on how count simulation
works? I looked through the GSOC pull request but unfortunately don't
understand it.

Agreed re. your comments. Usually 256bits is for playing around and then
larger FPs are for 'production' runs. Although in my use case, for example
logistic regression / naive bayes classifier on protein activity records in
chembl, I really don't see a big difference despite collisions! That was
prior to count simulation.

cheers
lewis


On Wed, Jul 10, 2019 at 2:39 PM Greg Landrum  wrote:

> Hi,
>
> By default the new fingerprint generators do "count simulation": adding
> extra bits to a bit vector fingerprint in order to get bit-vector
> similarities that are more similar to count-vector similarities.
> You can turn this off by passing the useCountSimulation=False argument to
> GetMorganGenerator().
>
> Two comments about your sample code:
> 1) 256 bits is really not very many for a Morgan fingerprint. Maybe you
> were just using the small number for this question, but if you are really
> using fingerprints that short you should be aware that you are going to
> have a lot of collisions (blog post on this here:
> http://rdkit.blogspot.com/2016/02/colliding-bits-iii.html)
> 2) In case you aren't aware of it: you can calculate similarities and do
> fingerprint stats a lot more simply with builtin code like the
> GetNumOnBits() method on bit vectors and the similarity calculation code
> in  rdkit.DataStructs. Take a look at DataStructs.DiceSimilarity()
>
> Hope this helps,
> -greg
>
>
>
> On Wed, Jul 10, 2019 at 3:53 AM Lewis Martin 
> wrote:
>
>> Hi all,
>> Quick question on truncated fingerprints, any help is really appreciated.
>>
>>
>> I think I've missed a trick on how the new fingerprint generator works. I
>> thought the below should produce equivalent fingerprints but they are
>> totally different. Has the implementation changed, or maybe I'm getting the
>> kwargs incorrect? See below code or this link for a quick visual:
>> https://github.com/ljmartin/snippets/blob/master/truncated_fingerprints.ipynb
>> Thanks !
>>
>> import rdkit
>> from rdkit import Chem
>> from rdkit.Chem import Draw, AllChem
>> from rdkit.Chem import rdFingerprintGenerator
>> from rdkit.Chem.Draw import IPythonConsole
>> import numpy as np
>> from scipy.spatial import distance
>>
>> mol = Chem.MolFromSmiles('CN1C(=O)CN=C(C2=C1C=CC(=C2)Cl)C3=CC=CC=C3')
>> #diazepam
>>
>> gen_mo = rdFingerprintGenerator.GetMorganGenerator(fpSize=256, radius=2)
>> a = gen_mo.GetFingerprint(mol)
>> b = AllChem.GetMorganFingerprintAsBitVect(mol,2,256,useFeatures=False)
>> a_f = [int(i) for i in a.ToBitString()]
>> b_f = [int(i) for i in b.ToBitString()]
>> print('NumBits a: %s, NumBits b: %s' % (np.sum(a_f), np.sum(b_f)))
>> print('Dice Distance %s' % distance.dice(a_f,b_f))
>>
>>
>> NumBits a: 47, NumBits b: 38
>> Dice Distance 0.9058823529411765
>>
>> ___
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] How do rdFingerprintGenerator.GetMorganGenerator and AllChem.GetMorganFingerprintAsBitVect differ?

2019-07-09 Thread Lewis Martin
Hi all,
Quick question on truncated fingerprints, any help is really appreciated.


I think I've missed a trick on how the new fingerprint generator works. I
thought the below should produce equivalent fingerprints but they are
totally different. Has the implementation changed, or maybe I'm getting the
kwargs incorrect? See below code or this link for a quick visual:
https://github.com/ljmartin/snippets/blob/master/truncated_fingerprints.ipynb
Thanks !

import rdkit
from rdkit import Chem
from rdkit.Chem import Draw, AllChem
from rdkit.Chem import rdFingerprintGenerator
from rdkit.Chem.Draw import IPythonConsole
import numpy as np
from scipy.spatial import distance

mol = Chem.MolFromSmiles('CN1C(=O)CN=C(C2=C1C=CC(=C2)Cl)C3=CC=CC=C3')
#diazepam

gen_mo = rdFingerprintGenerator.GetMorganGenerator(fpSize=256, radius=2)
a = gen_mo.GetFingerprint(mol)
b = AllChem.GetMorganFingerprintAsBitVect(mol,2,256,useFeatures=False)
a_f = [int(i) for i in a.ToBitString()]
b_f = [int(i) for i in b.ToBitString()]
print('NumBits a: %s, NumBits b: %s' % (np.sum(a_f), np.sum(b_f)))
print('Dice Distance %s' % distance.dice(a_f,b_f))


NumBits a: 47, NumBits b: 38
Dice Distance 0.9058823529411765
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] What is returned by GetMMFFVdWParams?

2019-07-02 Thread Lewis Martin
Hi all,
Can anyone please help explain what values are returned
by GetMMFFVdWParams? It takes two indices as input, so is it an interaction
term between the two? Or is it the well depth and minimum (i.e. epsilon and
R)?

Example:
In:
m = Chem.MolFromSmiles('C1CCC1OC')
m2=Chem.AddHs(m)
AllChem.EmbedMolecule(m2)
AllChem.MMFFOptimizeMolecule(m2)
pyMP = AllChem.MMFFGetMoleculeProperties(m2)
pyFF = AllChem.MMFFGetMoleculeForceField(m2, pyMP)
pyMP.GetMMFFVdWParams(0,1)

Out:
(3.9377389919289634,

 0.06779699304291371,
 3.9377389919289634,
 0.06779699304291371)


The code at
https://github.com/rdkit/rdkit/blob/master/Code/ForceField/Wrap/ForceField.cpp
says it returns R_ij_starUnscaled, epsilonUnscaled, R_ij_star, and epsilon,
but Im unclear on what the scaling is or why two indices are needed.

Thanks!

lewis
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] any paper on fingerprint pre-selection/feature selection?

2019-06-17 Thread Lewis Martin
There have been some comparisons between different fingerprints, which Im
sure can be found via google, but I haven't seen feature selection.

If you're looking for dimensionality reduction, anecdotally I've noticed
that hashing bit vectors down to size has no benefit over the traditional
folding, and tf-idf has no benefit either (see below for hyperparameters I
used, being compared with standard Morgan 512-size bit vectors). I believe
morgan fingerprint bits have already been hashed, so the number of
bit-collisions during folding should be evenly spread.

The 'performance' in this context is virtual screening for protein hits, so
this may not apply to other tasks. My intuition is that there are a large
number of possible targets so it's probably not beneficial to reduce the
dimensionality below 256 dimensions.


tf-idf parameters I used:
fps_array = list()
for smi in smileses:
mol = Chem.MolFromSmiles(smi)
fp = AllChem.GetMorganFingerprint(mol, 2)
fp_values = fp.GetNonzeroElements()
stringerprint = list()
for key, item in fp_values.items():
stringerprint.append(str(key))
fps_array.append(stringerprint)

corpus = list()
for fp in fps_array:
corpus.append(' '.join(fp))

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_df=0.4, min_df=0.03)
X = vectorizer.fit_transform(corpus)



On Tue, Jun 18, 2019 at 6:42 AM Mario Lovrić 
wrote:

> Dear all,
>
> Is there any paper discussing some sort of pre-selection/feature selection
> with fingerprints?
> Any rule of thumb? E.g. dont keep fingeprints if less than 5% hits?
>
>
> Thanks
>
> --
> Mario Lovrić
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Pharmacophore atom typing for torsion or atom pair FP

2019-01-31 Thread Lewis Martin
Hi Francois,
If it helps, this is what Ive been using while playing with atom types:
https://github.com/ljmartin/snippets/blob/master/snippet_atom_typing.ipynb

The SMARTS codes were taken from down the bottom of this  page
https://www.rdkit.org/docs/GettingStartedInPython.html#chemical-features-and-pharmacophores
I tried using rdkit.Chem.rdMolDescriptors.GetFeatureInvariants() but it
appears to give more than the 6 listed atom types from the Gobbi paper. For
ECFP, I don't know any other way to describe an atom other than the obscure
integer, but you can visualize the bits using:
http://rdkit.blogspot.com/2018/10/using-new-fingerprint-bit-rendering-code.html
cheers
lewis

On Fri, Feb 1, 2019 at 2:09 PM Francois Berenger  wrote:

> Hi,
>
> I have a related question:
> how to output the type of an atom in a molecule,
> if possible in a human-readable format; i.e. a human
> readable/understandable string rather than some (obscure) integer.
>
> I am interested to look at the atom types used by the ECFP
> and the FCFP fingerprints.
>
> Thanks a lot,
> Francois.
>
> On 31/01/2019 08:49, Lewis Martin wrote:
> > Thanks so much Greg!
> >
> > If I catch your drift, you are talking about the new fingerprint
> > generators from the google summer of code. I took a look myself since
> > I was curious.
> >
> > Here's a notebook demonstrating how I think it works:
> >
> https://github.com/ljmartin/snippets/blob/master/snippet_fp_with_invariants.ipynb
> > [3]
> > This downloads some bioactivity data from chembl and then compares
> > standard AP or TT fingerprints with same using the atom invariants
> > associated with the MorganFP "Feature" atom typing, which is actually
> > the feature types from the Gobbi/Poppinger paper.  As expected, the
> > invariant versions have higher similarity! It's not CATS but this
> > seems equivalent for my purposes - thanks!
> >
> > Hopefully it's close to the mark - looking forward to seeing other
> > examples too.
> > cheers
> > lewis
> >
> > On Thu, Jan 31, 2019 at 12:03 AM Greg Landrum 
> > wrote:
> >
> >> Hi Lewis,
> >>
> >> This is a great chance to demonstrate some of the things that can be
> >> done with the new fingerprint generation code. It's going to take me
> >> a bit to put this together (it's all new enough that I'm still not
> >> quite "fluent"), but I will try to get an example put together over
> >> the next couple of days.
> >>
> >> -greg
> >>
> >> On Wed, Jan 30, 2019 at 4:59 AM Lewis Martin
> >>  wrote:
> >>
> >>> Hi rdkitters,
> >>> I'd like to compare the similarity of torsion/atom pair FPs using
> >>> standard atomic numbering with those using pharmacophore types,
> >>> like the 'CATS' atom typing developed by Gisbert Schneider, and
> >>> hoped someone has some advice here. _CATS_ is a pharmacophore atom
> >>> typing system with these types: H-bond donor, H-bond acceptor,
> >>> positive, negative, lipophilic, and CATS2 has 'aromatic'. These
> >>> are described in: _“Scaffold‐Hopping” by Topological
> >>> Pharmacophore Search: A Contribution to Virtual Screening. _It
> >>> seems pretty close to the Gobbi 2D pharmacophore typing, or the
> >>> features used in FCFP.
> >>>
> >>> Ive no problem detecting the atom types - I borrowed code from the
> >>> open source PyBioMed - but I'm stuck at the next step. How to
> >>> change the atoms into their pharmacophore types to then make a
> >>> torsion or atom pair fingerprint using RDKit? What I've tried so
> >>> far is to just set the atomic number to some series of 5 atoms not
> >>> normally seen in drug like molecules, like 40-44. This is silly
> >>> but it seems to work. The only issue is trouble kekulizing the
> >>> molecules for display. Is there a better way?
> >>>
> >>> Here's a snippet to demonstrate what I mean, it's adapted from
> >>> PyBioMed and any errors are probably mine:
> >>>
> >>
> >
> https://github.com/ljmartin/snippets/blob/master/atom_typing_snippet.ipynb
> >>> [1]
> >>>
> >>> Thanks for your time!
> >>> lewis
> >>>
> >>> ___
> >>> Rdkit-discuss mailing list
> >>> Rdkit-discuss@lists.sourceforge.net
> >>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss [2]
> >
> >
> > Links:
> > --
> > [1]
> >
> https://github.com/ljmartin/snippets/blob/master/atom_typing_snippet.ipynb
> > [2] https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
> > [3]
> >
> https://github.com/ljmartin/snippets/blob/master/snippet_fp_with_invariants.ipynb
> >
> > ___
> > Rdkit-discuss mailing list
> > Rdkit-discuss@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Pharmacophore atom typing for torsion or atom pair FP

2019-01-30 Thread Lewis Martin
Thanks so much Greg!

If I catch your drift, you are talking about the new fingerprint generators
from the google summer of code. I took a look myself since I was curious.

Here's a notebook demonstrating how I think it works:
https://github.com/ljmartin/snippets/blob/master/snippet_fp_with_invariants.ipynb
This downloads some bioactivity data from chembl and then compares standard
AP or TT fingerprints with same using the atom invariants associated with
the MorganFP "Feature" atom typing, which is actually the feature types
from the Gobbi/Poppinger paper.  As expected, the invariant versions have
higher similarity! It's not CATS but this seems equivalent for my purposes
- thanks!

Hopefully it's close to the mark - looking forward to seeing other examples
too.
cheers
lewis



On Thu, Jan 31, 2019 at 12:03 AM Greg Landrum 
wrote:

> Hi Lewis,
>
> This is a great chance to demonstrate some of the things that can be done
> with the new fingerprint generation code. It's going to take me a bit to
> put this together (it's all new enough that I'm still not quite "fluent"),
> but I will try to get an example put together over the next couple of days.
>
> -greg
>
>
> On Wed, Jan 30, 2019 at 4:59 AM Lewis Martin 
> wrote:
>
>> Hi rdkitters,
>> I'd like to compare the similarity of torsion/atom pair FPs using
>> standard atomic numbering with those using pharmacophore types, like the
>> 'CATS' atom typing developed by Gisbert Schneider, and hoped someone has
>> some advice here. *CATS* is a pharmacophore atom typing system with
>> these types: H-bond donor, H-bond acceptor, positive, negative, lipophilic,
>> and CATS2 has 'aromatic'. These are described in: *“Scaffold‐Hopping” by
>> Topological Pharmacophore Search: A Contribution to Virtual Screening. *It
>> seems pretty close to the Gobbi 2D pharmacophore typing, or the features
>> used in FCFP.
>>
>> Ive no problem detecting the atom types - I borrowed code from the open
>> source PyBioMed - but I'm stuck at the next step. How to change the atoms
>> into their pharmacophore types to then make a torsion or atom pair
>> fingerprint using RDKit? What I've tried so far is to just set the atomic
>> number to some series of 5 atoms not normally seen in drug like molecules,
>> like 40-44. This is silly but it seems to work. The only issue is trouble
>> kekulizing the molecules for display. Is there a better way?
>>
>>
>> Here's a snippet to demonstrate what I mean, it's adapted from PyBioMed
>> and any errors are probably mine:
>> https://github.com/ljmartin/snippets/blob/master/atom_typing_snippet.ipynb
>>
>> Thanks for your time!
>> lewis
>>
>>
>> ___
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] Pharmacophore atom typing for torsion or atom pair FP

2019-01-29 Thread Lewis Martin
Hi rdkitters,
I'd like to compare the similarity of torsion/atom pair FPs using standard
atomic numbering with those using pharmacophore types, like the 'CATS' atom
typing developed by Gisbert Schneider, and hoped someone has some advice
here. *CATS* is a pharmacophore atom typing system with these types: H-bond
donor, H-bond acceptor, positive, negative, lipophilic, and CATS2 has
'aromatic'. These are described in: *“Scaffold‐Hopping” by Topological
Pharmacophore Search: A Contribution to Virtual Screening. *It seems pretty
close to the Gobbi 2D pharmacophore typing, or the features used in FCFP.

Ive no problem detecting the atom types - I borrowed code from the open
source PyBioMed - but I'm stuck at the next step. How to change the atoms
into their pharmacophore types to then make a torsion or atom pair
fingerprint using RDKit? What I've tried so far is to just set the atomic
number to some series of 5 atoms not normally seen in drug like molecules,
like 40-44. This is silly but it seems to work. The only issue is trouble
kekulizing the molecules for display. Is there a better way?


Here's a snippet to demonstrate what I mean, it's adapted from PyBioMed and
any errors are probably mine:
https://github.com/ljmartin/snippets/blob/master/atom_typing_snippet.ipynb

Thanks for your time!
lewis
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Problem getting valence

2018-07-26 Thread Lewis Martin
Thankyou Paolo and Chris! These hydrogens were added while editing the
molecule beforehand but I assumed sanitizing would remove them.

Cheers
Lewis

On Thu, 26 Jul 2018 at 7:59 pm, Chris Earnshaw  wrote:

> Hi
>
> It looks to me like N5 [nH:5] also has a problem. This has 3 connections
> to heavy atoms, is specified to have a hydrogen attached, but has no
> charge. This may not have triggered an error but it looks wrong, especially
> in this structure. Surely this atom should just be [n:5] ?
>
> Best regards,
> Chris
>
> On 26 July 2018 at 10:42, Paolo Tosco  wrote:
>
>> Dear Lewis,
>>
>> C #7 indeed has 5 valences:
>>
>>  *
>>  |
>> -C=O
>>  |
>>  H
>>
>>
>> If you change [CH:14] into [C:14] sanitization will succeed.
>>
>> Cheers,
>> p.
>>
>> On 07/26/18 09:42, Lewis J Martin wrote:
>>
>> Hi all,
>>
>> I have generated a molecule that I can both visualize and export as
>> SMILES, but I can't Sanitize it. Creating a new molecule from that SMILES
>> fails due to something to do with the valence. It seems to me that it has
>> the correct amount of bonds.
>>
>> Code to reproduce and 'debugging' that Ive tried:
>> --
>> #sanitize=False or it won't load this SMILES
>> mol =
>> Chem.MolFromSmiles('[CH3:0][CH2:1][CH2:2][CH2:3][CH2:4][nH:5]1[c:6]([CH:14](=[O:15])[*:16])[n:7][c:8]2[cH:9][cH:10][cH:11][cH:12][c:13]12',sanitize=False)
>>
>> #print atom indices and atomic number:
>> for i in mol.GetAtoms():
>> print(i, i.GetIdx(), i.GetAtomicNum())
>>
>> #display bonds relating to index 7. Seems correct for a carbon.
>> for i in mol.GetBonds():
>> if i.GetBeginAtomIdx()==7 or i.GetEndAtomIdx()==7:
>> print(i.GetBeginAtomIdx(), i.GetEndAtomIdx(), i.GetBondType())
>>
>> #print valences. Fails on index 7.
>> for i in mol.GetAtoms():
>> print(i.GetIdx(), i.GetExplicitValence())
>> --
>>
>>
>> Can anyone please offer some advice as to what the problem is?
>> Much appreciated!
>>
>> Lewis
>>
>>
>> PS. here is the output I get:
>>
>>  0 6
>>  1 6
>>  2 6
>>  3 6
>>  4 6
>>  5 7
>>  6 6
>>  7 6
>>  8 8
>>  9 0
>>  10 7
>>  11 6
>>  12 6
>>  13 6
>>  14 6
>>  15 6
>>  16 6
>> 6 7 SINGLE
>> 7 8 DOUBLE
>> 7 9 SINGLE
>> 0 4
>> 1 4
>> 2 4
>> 3 4
>> 4 4
>> 5 3
>> 6 4
>>
>> ---RuntimeError
>>   Traceback (most recent call 
>> last) in ()  7   8 for i in 
>> mol.GetAtoms():> 9 print(i.GetIdx(), i.GetExplicitValence()) 10  
>>#valence = oneMol.GetAtomWithIdx(7).GetExplicitValence() 11 
>> #print(valence)
>> RuntimeError: Pre-condition Violation
>>  getExplicitValence() called without call to calcExplicitValence()
>>  Violation occurred on line 162 in file Code/GraphMol/Atom.cpp
>>  Failed Expression: d_explicitValence > -1
>>  RDKIT: 2018.03.3
>>  BOOST: 1_65_1
>>
>>
>>
>>
>>
>> --
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>>
>>
>>
>> ___
>> Rdkit-discuss mailing 
>> listRdkit-discuss@lists.sourceforge.nethttps://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>>
>>
>>
>> --
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>> ___
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>>
> --
Sent from Gmail Mobile
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss