Re: [Rdkit-discuss] Query on a failed molecule from SureChEMBL

2021-12-16 Thread Eloy Félix
Hi Lewis,

SureChEMBL is getting its structures from:

- USPTO attached molfiles (deposited structures)
- names, using tools including OPSIN, ChemAxon, Lexichem and ACD
- images, using tools including OSRA, Imago and CLiDE

As Nicolas points out, issues like this one can occur when auto-generating
structures from names and images. That is the case for the two structures
you mention.
We plan to review all the tools we use to generate structures, as we know
there are some new ones out there.

Cheers,
Eloy


On Thu, 16 Dec 2021 at 09:28, Nicolas Bosc wrote:

> Hi Lewis,
>
> Currently structures are generated automatically in SureChEMBL so this
> kind of error unfortunately happens…
>
> My colleagues will address this issue as soon as possible.
>
> Cheers,
> Nicolas
> ---
> Dr Nicolas Bosc
> Data Mining and Analysis Scientist
> ChEMBL group
> EMBL-EBI
> Wellcome Genome Campus
> Hinxton, Cambridge, CB10 1SD
> United Kingdom
>
> nb...@ebi.ac.uk
> +44 1223 492519
>
>
> On 15 Dec 2021, at 20:42, Lewis Martin wrote:
>
> Thanks a lot Greg! That is indeed very helpful.
>
> Just to know that the molecule is odd is helpful too. The mol blocks
> appear to be V2000 format and have names like "Mrv0541 03021215572D"
> which says ChemAxon Marvin to me, but I'm still unsure why SureChEMBL would
> use such a representation (it doesn't look like a faithful transcription
> from the source patent). Off-topic, but if anyone happens to have an
> insight or connection with SureChEMBL, please do reach out!
>
> Cheers
> Lewis
>
>
>
>
> On Wed, Dec 15, 2021 at 4:24 PM Greg Landrum wrote:
>
>> Hi Lewis,
>>
>> Dealing with all the strange chemical representations that show up "in
>> the wild" is an ongoing struggle.
>>
>> Your first example is pretty clearly intended to be an azide and we can
>> certainly add a rule to normalize that one to what the RDKit expects it to
>> be (there is already a rule for C-N=N#N, but that doesn't help here). That
>> won't happen before the next feature release though.
>>
>> I'm not really sure what the intent was for the two
>> four-coordinate neutral Ns in the second molecule, so I think it's unlikely
>> that we'd add a standard cleanup for one.
>>
>> However! The good news is that there's a pretty easy (and efficient) way
>> to fix this yourself. We added a new method to chemical reactions in the
>> 2021.09 release which allows you to modify a molecule in place (subject to
>> some constraints). This is ideal for doing cleanup transformations like
>> these.
>>
>> This gist shows how to write reaction rules for your cases (I guessed at
>> what the Ns are supposed to be) and then use them:
>> https://gist.github.com/greglandrum/8fd229bc6bf6c734d1c21da7f2bebebb
>>
>> Hope this helps,
>> -greg
>>
>>
>> On Wed, Dec 15, 2021 at 12:21 AM Lewis Martin wrote:
>>
>>> Hi All,
>>> Reading molecules from a bulk download of SureChEMBL, I come across a
>>> fair few molecules that fail to parse. Not sure whether they SHOULD parse
>>> or not.
>>>
>>> Here is an example: https://www.surechembl.org/chemical/SCHEMBL386
>>> with SMILES code: COC(=O)C1=C(C=CC=C1)C1=CC=C(C[N+]#[N]=[N-])C=C1
>>>
>>> Even reading the SMILES code one can see that there are too many bonds
>>> in there - a nitrogen triply bonded and doubly bonded to other atoms.
>>>
>>> Another example: https://www.surechembl.org/chemical/SCHEMBL33957
>>> smiles: NC(N)=[NH]C1=NC(CSCC[NH]=CNS(=O)(=O)C2=CC=C(Br)C=C2)=CS1
>>>
>>> Again, valence for a nitrogen is off.
>>>
>>> Should I expect to parse these with RDKit? Might there be some way
>>> around this? It's a significant fraction of the molecules in SureChEMBL.
>>>
>>> Thanks team!
>>> Lewis
>>> ___
>>> Rdkit-discuss mailing list
>>> Rdkit-discuss@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>
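
A minimal sketch of the in-place cleanup approach Greg describes above, applied
to the azide example (SCHEMBL386). The reaction SMARTS and the helper name
parse_with_azide_fix are assumptions made for illustration and are not taken
from the gist; RunReactantInPlace requires RDKit 2021.09 or later.

```python
from rdkit import Chem
from rdkit.Chem import rdChemReactions

# Illustrative rule (an assumption, not necessarily the one in the gist):
# rewrite C-[N+]#N=[N-] as the charge-separated azide C-N=[N+]=[N-].
AZIDE_FIX = rdChemReactions.ReactionFromSmarts(
    '[N+:1]#[N:2]=[N-:3]>>[N+0:1]=[N+:2]=[N-:3]')
AZIDE_FIX.Initialize()


def parse_with_azide_fix(smiles):
    """Parse without sanitization, repair the azide in place, then sanitize.

    Returns None if the SMILES cannot be parsed or still fails sanitization.
    """
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    if mol is None:
        return None
    # Compute valence information without raising on the bad nitrogen.
    mol.UpdatePropertyCache(strict=False)
    # Modifies mol directly; returns True if the rule matched.
    AZIDE_FIX.RunReactantInPlace(mol)
    try:
        Chem.SanitizeMol(mol)
    except Exception:
        return None
    return mol


smi = 'COC(=O)C1=C(C=CC=C1)C1=CC=C(C[N+]#[N]=[N-])C=C1'
fixed = parse_with_azide_fix(smi)
if fixed is not None:
    print(Chem.MolToSmiles(fixed))
```

Parsing with sanitize=False keeps the molecule object available so the rule
can be applied before the valence check runs. Molecules that still fail
sanitization come back as None and can be logged for manual review.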


Re: [Rdkit-discuss] unique chemical representation

2020-09-14 Thread Eloy Félix
Hi,

All the steps of ChEMBL's RDKit based standardiser are explained in the
recently published manuscript:
https://link.springer.com/article/10.1186/s13321-020-00456-1

Hope it helps!

Regards,
Eloy


On Mon, 14 Sep 2020 at 04:32, Francois Berenger wrote:

> On 12/09/2020 00:27, Mike Mazanetz wrote:
> > Dear Forum,
> >
> > I'm curious as to how the community standardizes molecules to generate
> > unique chemical representations.
> >
> > Please let me know what people's preferred means are for treating:
> >
> >   * Tautomers
> >   * Protomers
> >   * Resonance structures
> >   * Salts when the salt is larger than the ligand
>
> Here is how ChEMBL does it:
>
> https://github.com/chembl/ChEMBL_Structure_Pipeline
>
> Not sure they handle all the cases you listed, though.
>
> Regards,
> F.
>
> > Particularly when converting between chemical representations: SDF to
> > SMILES, SMARTS to SMILES, and one flavour of SMILES to another.
> >
> > And are there any caveats to consider, such as the correct assignment
> > of heterocyclic nitrogens as aromatic?
> >
> > I look forward to hearing your thoughts.
> >
> > Regards,
> >
> > mike
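
For readers who want a starting point without installing the ChEMBL pipeline,
here is a rough sketch using RDKit's own rdMolStandardize module. This is a
different, simpler procedure than the one in the ChEMBL manuscript, and the
function name standardize_smiles and the order of the steps are assumptions
made for illustration.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def standardize_smiles(smiles):
    """Return a canonical SMILES for a cleaned-up, neutral parent structure.

    Illustrative only: not the ChEMBL structure pipeline.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # Normalize functional groups, reionize, disconnect metals.
    mol = rdMolStandardize.Cleanup(mol)
    # Keep the largest fragment, preferring organic ones (drops counterions).
    chooser = rdMolStandardize.LargestFragmentChooser(preferOrganic=True)
    mol = chooser.choose(mol)
    # Neutralize charges where possible (addresses protomers).
    mol = rdMolStandardize.Uncharger().uncharge(mol)
    # Pick one canonical tautomer.
    mol = rdMolStandardize.TautomerEnumerator().Canonicalize(mol)
    return Chem.MolToSmiles(mol)


print(standardize_smiles('[Na+].CC(=O)[O-]'))
print(standardize_smiles('CC(=O)O'))
```

Note that choosing the largest fragment does not handle the case Mike raises
where the salt is larger than the ligand; that case needs an explicit salt
list or manual curation.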


Re: [Rdkit-discuss] want advice for good teaching data set

2018-08-29 Thread Eloy Félix
Hi Andrew,

If you want to build a model, I guess what you want is experimental logP
values.

This should give you something to start with:

select ACTIVITY_ID, MOLREGNO, STANDARD_VALUE, STANDARD_TYPE from ACTIVITIES
where STANDARD_TYPE = 'LogP' and STANDARD_VALUE is not null and
data_validity_comment is null and POTENTIAL_DUPLICATE = 0;

Eloy.


2018-08-29 14:51 GMT+01:00 TJ O'Donnell:

> Hi Andrew
> ChEMBL 24 has compound properties in the table compound_properties.  I
> think the alogp is computed using (Crippen) atom types and the acd_logp
> uses ACD Labs methods.
> TJ
>
> On Wed, Aug 29, 2018 at 5:52 AM Andrew Dalke wrote:
>
>> Hi all,
>>
>>   I am starting to put together materials for the Python/RDKit training
>> course I'm giving just before the RDKit UGM next month.
>>
>> I would like to structure part of it around the SQLite release of the
>> ChEMBL data set. More specifically, I plan to include examples of machine
>> learning with scikit-learn, using RDKit descriptors and values from ChEMBL
>> 24 (and making sure to use the new schema).
>>
>> Two problems. First, I'm not a computational chemist and I don't know
>> what would constitute a good example to use. "Good" in this case means one
>> whose outlines are well-known to likely students. Second, I don't have much
>> experience with the ChEMBL data.
>>
>> My thought is to make a logP model. The easiest would be to base it on
>> atom types. For this option, can anyone suggest where I can find logP data
>> from ChEMBL?
>>
>> Another possibility is to use a pre-existing model, like the notebook
>> George Papadatos did for Ligand-based Target Prediction at
>> http://nbviewer.jupyter.org/gist/madgpap/10457778 .
>>
>> Perhaps someone here could point me to other existing resources along
>> similar lines?
>>
>> Best regards,
>>
>> Andrew
>> da...@dalkescientific.com
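
A short sketch of running Eloy's query from Python against the SQLite release
of ChEMBL. The join to compound_structures (to pull canonical_smiles for
descriptor calculation) is an addition not in the original query, and the
function name fetch_logp_rows and the database path in the usage comment are
hypothetical.

```python
import sqlite3

# Eloy's filter on ACTIVITIES, plus an assumed join to COMPOUND_STRUCTURES
# so that each measurement comes with a SMILES string.
LOGP_QUERY = """
SELECT a.activity_id, a.molregno, cs.canonical_smiles, a.standard_value
FROM activities AS a
JOIN compound_structures AS cs ON cs.molregno = a.molregno
WHERE a.standard_type = 'LogP'
  AND a.standard_value IS NOT NULL
  AND a.data_validity_comment IS NULL
  AND a.potential_duplicate = 0
"""


def fetch_logp_rows(conn):
    """Return (activity_id, molregno, canonical_smiles, logP) tuples."""
    return conn.execute(LOGP_QUERY).fetchall()


# Usage (the path is a placeholder for wherever the SQLite file lives):
#   conn = sqlite3.connect("chembl_24.db")
#   rows = fetch_logp_rows(conn)
```

The same molecule can have several measurements, so the rows usually need to
be aggregated per molregno (for example by median) before training a model.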


[Rdkit-discuss] cis/trans info lost when generating SMILES from molfile

2018-01-25 Thread Eloy Félix
Hi RDKitters,

I'm having trouble writing SMILES that include cis/trans info for some
molecules I load from a molfile using RDKit. Open Babel and Indigo
generate the expected SMILES.

https://www.ebi.ac.uk/chembl/compound/inspect/CHEMBL501674

# RDKIT
from rdkit import Chem
mol = Chem.MolFromMolFile('CHEMBL501674.mol')
Chem.MolToSmiles(mol, isomericSmiles=True)
'CO[C@H]1[C@H](OC[C@H]2[C@@H]3O[C@@H]3C=CC(=O)[C@H](C)CC[C@H](O[C@@H]3O[C@H](C)C[C@H](O)[C@H]3O)[C@@H](C)C=CC(=O)O[C@@H]2C)O[C@H](C)[C@@H](O)[C@H]1OC'

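# OPENBABEL
import pybel  # Open Babel 2.x; in 3.x use "from openbabel import pybel"
# readfile() returns an iterator, so take the first molecule
mol_ob = next(pybel.readfile('mol', 'CHEMBL501674.mol'))
mol_ob.write('can')
'CO[C@H]1[C@H](OC[C@@H]2[C@@H](C)OC(=O)/C=C/[C@H](C)[C@H](CC[C@H](C(=O)/C=C/[C@@H]3[C@H]2O3)C)O[C@@H]2O[C@H](C)C[C@@H]([C@H]2O)O)O[C@@H]([C@H]([C@H]1OC)O)C'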

# INDIGO
from indigo import Indigo
indigoObject = Indigo()
# named mol_indigo so it does not overwrite the RDKit mol
mol_indigo = indigoObject.loadMoleculeFromFile('CHEMBL501674.mol')
mol_indigo.smiles()
'C1(=O)[C@H](C)CC[C@H](O[C@@]2([H])[C@H](O)[C@@H](O)C[C@@H](C)O2)[C@@H](C)C=CC(=O)O[C@H](C)[C@@H](CO[C@]2([H])O[C@H](C)[C@@H](O)[C@@H](OC)[C@H]2OC)[C@]2([H])O[C@]2([H])C=C1 |t:21,51,&1:2,6,8,10,12,15,18,25,27,30,33,35,37,40,43,46|'

Indigo is using ChemAxon extended notation, but it is also recognising the
double bond stereo (t:21,51).


If I check double bond stereo info for the molecule:

# mol is the RDKit molecule loaded from the molfile
for bond in mol.GetBonds():
    if bond.GetBondType() == Chem.BondType.DOUBLE:
        print(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(),
              bond.GetStereo())
6 7 STEREONONE
11 12 STEREONONE
0 17 STEREONONE
5 43 STEREONONE

No E/Z info in bonds.

InChI bond stereo layer generated from the molfile: /b12-10+,13-9+
from the RDKit-generated SMILES (with and without Compute2DCoords): /b12-10-,13-9-
from the Open Babel-generated SMILES: /b12-10+,13-9+

This might be a bug in the piece of code that detects bond stereo info from
the molfile or maybe... I'm just missing something :P

Thanks for your great job!
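
For comparison, a small sketch of what the same bond loop reports when RDKit
does perceive double bond stereo. It uses two toy SMILES (trans- and
cis-2-butene) rather than the CHEMBL501674 molfile, and the helper name
double_bond_stereo is made up for illustration.

```python
from rdkit import Chem


def double_bond_stereo(mol):
    """List (begin_idx, end_idx, stereo) for every double bond in mol."""
    return [
        (bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetStereo())
        for bond in mol.GetBonds()
        if bond.GetBondType() == Chem.BondType.DOUBLE
    ]


trans_butene = Chem.MolFromSmiles('C/C=C/C')
cis_butene = Chem.MolFromSmiles('C/C=C\\C')

for name, m in (('trans', trans_butene), ('cis', cis_butene)):
    for begin, end, stereo in double_bond_stereo(m):
        print(name, begin, end, stereo)
```

When stereo is perceived, GetStereo() returns STEREOE or STEREOZ instead of
STEREONONE, and MolToSmiles writes the corresponding / and \ bond markers.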