Re: [Rdkit-discuss] Query on a failed molecule from SureChEMBL

2021-12-16 Thread Eloy Félix
Hi Lewis,

SureChEMBL gets its structures from three sources:

- USPTO-attached molfiles (deposited structures)
- names, converted with tools including OPSIN, ChemAxon, Lexichem and ACD
- images, converted with tools including OSRA, Imago and CLiDE

As Nicolas points out, issues like this one can occur when structures are
generated automatically from names and images, which is the case for the two
structures you mention.
We plan to review all the tools we use to generate structures, as we know
there are some new ones out there.

Cheers,
Eloy


___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Query on a failed molecule from SureChEMBL

2021-12-16 Thread Nicolas Bosc
Hi Lewis,

Currently structures are generated automatically in SureChEMBL so this kind of 
error unfortunately happens…

My colleagues will address this issue as soon as possible.

Cheers,
Nicolas
---
Dr Nicolas Bosc
Data Mining and Analysis Scientist
ChEMBL group
EMBL-EBI
Wellcome Genome Campus
Hinxton, Cambridge, CB10 1SD
United Kingdom

nb...@ebi.ac.uk
+44 1223 492519




Re: [Rdkit-discuss] Query on a failed molecule from SureChEMBL

2021-12-15 Thread Lewis Martin
Thanks a lot Greg! That is indeed very helpful.

Just to know that the molecule is odd is helpful too. The mol blocks appear
to be V2000 format and have names like "Mrv0541 03021215572D" which says
ChemAxon Marvin to me, but I'm still unsure why SureChEMBL would use such a
representation (it doesn't look like a faithful transcription from the
source patent). Off-topic, but if anyone happens to have an insight or
connection with SureChEMBL, please do reach out!
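
For reference, that string has the fixed-width layout of line 2 of a V2000
molfile header: user initials (2 characters), program name (8), a timestamp
written as MMDDYYHHmm (10) and a dimension code (2). Below is a minimal sketch
of splitting it. It assumes the two blank initials columns were dropped when
the string was quoted above, and parse_molfile_header is a hypothetical helper,
not an RDKit function:

```python
from datetime import datetime

def parse_molfile_header(line2):
    """Split line 2 of a V2000 molfile into its fixed-width fields."""
    # Pad short lines so the slices below never run past the end.
    line2 = line2.ljust(22)
    return {
        "initials": line2[0:2].strip(),     # columns 1-2
        "program": line2[2:10].strip(),     # columns 3-10
        "written": datetime.strptime(line2[10:20], "%m%d%y%H%M"),
        "dimensions": line2[20:22].strip(), # "2D" or "3D"
    }

# The quoted string, with the two blank initials columns restored (assumption).
header = parse_molfile_header("  Mrv0541 03021215572D")
print(header["program"], header["written"], header["dimensions"])
# prints: Mrv0541 2012-03-02 15:57:00 2D
```

The program field reads Mrv0541, which is consistent with a Marvin version
stamp, and the timestamp decodes to 2 March 2012, 15:57.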

Cheers
Lewis






Re: [Rdkit-discuss] Query on a failed molecule from SureChEMBL

2021-12-14 Thread Greg Landrum
Hi Lewis,

Dealing with all the strange chemical representations that show up "in the
wild" is an ongoing struggle.

Your first example is pretty clearly intended to be an azide, and we can
certainly add a rule to normalize that one to what the RDKit expects it to
be (there already is a rule for C-N=N#N, but that doesn't help here). That
won't happen before the next feature release though.

I'm not really sure what the intent was for the two four-coordinate neutral
Ns in the second molecule, so I think it's unlikely that we'd add a standard
cleanup for that one.
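
To see exactly which atoms the RDKit objects to in the two SMILES from the
original message, one option is to parse without sanitization and ask for the
list of problems. This is a minimal sketch, assuming an RDKit recent enough to
provide Chem.DetectChemistryProblems; problem_atoms and the two constant names
are hypothetical:

```python
from rdkit import Chem

# The two SMILES from the original message.
SMILES_1 = "COC(=O)C1=C(C=CC=C1)C1=CC=C(C[N+]#[N]=[N-])C=C1"
SMILES_2 = "NC(N)=[NH]C1=NC(CSCC[NH]=CNS(=O)(=O)C2=CC=C(Br)C=C2)=CS1"

def problem_atoms(smiles):
    """Return (atom index, message) for each atom-level problem reported."""
    # sanitize=False stops the parser from rejecting the molecule outright.
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    problems = Chem.DetectChemistryProblems(mol)
    # Only atom-level exceptions (e.g. AtomValenceException) have GetAtomIdx.
    return [(p.GetAtomIdx(), p.Message()) for p in problems
            if hasattr(p, "GetAtomIdx")]

for smi in (SMILES_1, SMILES_2):
    for idx, msg in problem_atoms(smi):
        print(idx, msg)
```

For the first molecule this should point at the central azide nitrogen, and
for the second at the two [NH] atoms.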

However! The good news is that there's a pretty easy (and efficient) way to
fix this yourself. We added a new method to chemical reactions in the
2021.09 release which allows you to modify a molecule in place (subject to
some constraints). This is ideal for doing cleanup transformations like
these.

This gist shows how to write reaction rules for your cases (I guessed at
what the Ns are supposed to be) and then use them:
https://gist.github.com/greglandrum/8fd229bc6bf6c734d1c21da7f2bebebb
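
For readers who cannot open the gist, here is a minimal sketch of the same
idea for the azide case only. It is not the gist's code: the reaction SMARTS
is an assumption about what a reasonable rule looks like, fix_azide is a
hypothetical helper, and it needs RDKit 2021.09 or later, where the in-place
method is exposed as ChemicalReaction.RunReactantInPlace. The second molecule
is left out because the intended structure is unclear.

```python
from rdkit import Chem
from rdkit.Chem import rdChemReactions

# Assumed rule: R-[N+]#N=[N-]  ->  R-N=[N+]=[N-]
AZIDE_RXN = rdChemReactions.ReactionFromSmarts(
    "[N+:1]#[N+0:2]=[N-:3]>>[N+0:1]=[N+:2]=[N-:3]")
AZIDE_RXN.Initialize()

def fix_azide(smiles):
    """Parse without sanitization, repair azides in place, then sanitize."""
    mol = Chem.RWMol(Chem.MolFromSmiles(smiles, sanitize=False))
    # Compute valence info leniently so matching works on the raw molecule.
    mol.UpdatePropertyCache(strict=False)
    # In case a call rewrites only one match, repeat until nothing changes;
    # the atom count bounds the loop as a safety net.
    for _ in range(mol.GetNumAtoms()):
        if not AZIDE_RXN.RunReactantInPlace(mol):
            break
    Chem.SanitizeMol(mol)  # raises if the molecule is still invalid
    return mol

fixed = fix_azide("COC(=O)C1=C(C=CC=C1)C1=CC=C(C[N+]#[N]=[N-])C=C1")
print(Chem.MolToSmiles(fixed))
```

Because the molecule is modified in place, no product molecules are built and
thrown away, which is what makes this attractive for bulk cleanup.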

Hope this helps,
-greg


On Wed, Dec 15, 2021 at 12:21 AM Lewis Martin wrote:

> Hi All,
> Reading molecules from a bulk download of SureChEMBL, I come across a fair
> few molecules that fail to parse. Not sure whether they SHOULD parse or
> not.
>
> Here is an example: https://www.surechembl.org/chemical/SCHEMBL386
> with SMILES code: COC(=O)C1=C(C=CC=C1)C1=CC=C(C[N+]#[N]=[N-])C=C1
>
> Even reading the SMILES code one can see that there are too many bonds in
> there - a nitrogen triply bonded and doubly bonded to other atoms.
>
> Another example: https://www.surechembl.org/chemical/SCHEMBL33957
> smiles: NC(N)=[NH]C1=NC(CSCC[NH]=CNS(=O)(=O)C2=CC=C(Br)C=C2)=CS1
>
> Again, valence for a nitrogen is off.
>
> Should I expect to parse these with RDKit? Might there be some way around
> this? It's a significant fraction of the molecules in SureChEMBL.
>
> Thanks team!
> Lewis