Re: [Rdkit-discuss] bad inchi or parsing problem?

2017-09-14 Thread Jason Biggs
Thanks to Curt, Markus, and John for helping me understand this.  I knew
that inchi had its limitations, but that didn't jump out at me here because
there's no hydrogen migration between the different forms; I didn't realize
that these forms also qualify as tautomers.  So this is definitely a feature
(or limitation) of inchi.


> No, my "good old" cactus service doesn't do a lookup in this case, it is
> read from the string, which is of course in opposition to what I just
> said :-). We did quite a bit regarding normalization, first, the CACTVS
> toolkit behind the service is quite good in this regard and I added a few
> things for the web service, too.
>
>
I may look into adding in a step after getting a sanitization error, but
before accepting the unsanitized structure, to see if CACTVS can give a
better SMILES string.

I want to avoid returning to the user 2D structures like the first image in
the thread, when you can point to another structure that equally matches
the input.
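
The fallback step described above could be sketched roughly as follows. This
is only an illustration under stated assumptions: `is_sane` stands in for
whatever sanitization check the caller already has, the function names are
made up for this example, and the URL pattern is the public CACTUS Chemical
Identifier Resolver one. Percent-encoding the whole identifier is an
assumption about what the service accepts.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Public CACTUS Chemical Identifier Resolver endpoint (URL pattern assumed here).
CACTUS_BASE = "https://cactus.nci.nih.gov/chemical/structure"


def cactus_smiles_url(identifier):
    """Build the resolver URL that asks for a SMILES for `identifier`."""
    # Assumption: percent-encode everything, since '#' and '/' occur in SMILES.
    return f"{CACTUS_BASE}/{quote(identifier, safe='')}/smiles"


def fetch_cactus_smiles(identifier, timeout=10.0):
    """Ask the web service for a SMILES string (needs network access)."""
    with urlopen(cactus_smiles_url(identifier), timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


def smiles_with_fallback(smiles, is_sane, fetch=fetch_cactus_smiles):
    """Return `smiles` if it passes `is_sane`, else a sane alternative if one is found."""
    if is_sane(smiles):
        return smiles
    try:
        candidate = fetch(smiles)
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return smiles
    # Only accept the alternative if it passes the same check.
    return candidate if candidate and is_sane(candidate) else smiles
```

Passing the check and the fetcher in as arguments keeps the decision logic
independent of any particular toolkit or of network access.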
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] bad inchi or parsing problem?

2017-09-14 Thread Markus Sitzmann
On Thu, Sep 14, 2017 at 8:09 PM, Jason Biggs  wrote:

> Okay, all three of these smiles strings resolve to the same inchi,
>
> "O=[N+](C1=NC2=CC=CC=C2N=C1)[N-](=O)C1=NC2=CC=CC=C2N=C1"
> "C1=CC=C2C(=C1)N=CC(=N2)N(=N(=O)C3=NC4=CC=CC=C4N=C3)=O"
> "[O-][N+](c1cnc2ccccc2n1)=[N+]([O-])c3cnc4ccccc4n3"
>
> even though to me they seem like different structures due to the specified
> charges.  Is this a limitation of inchi, or do I need to rethink my ideas
> of what makes two chemical structures the same?
>
>
Well, at least the first two I would regard as erroneous or unlikely (not
stable) creatures - and that is exactly what John meant by "InChI is an
identifier, not a representation". InChI's main purpose (particularly that
of Standard InChI) is to identify them as the same (corrected, normalized)
molecule, not as three separate species (that would be the purpose of a
representation). Of course, in many cases there can be a discussion about
where sensible correction/normalization should end and separation of
structures should start, but that is a long topic.


Re: [Rdkit-discuss] bad inchi or parsing problem?

2017-09-14 Thread Markus Sitzmann
On Thu, Sep 14, 2017 at 7:38 PM, John Mayfield 
wrote:

> InChI is an identifier and not a representation, you should not read
> InChIs... but we are beyond hope there so...
>

Wonderfully said - unfortunately one day they decided to make InChIs
"readable" ...


> The InChI string is correct and is the same if you roundtrip your
> preferred one with charge separated bonds and the 5 valent one.
>
> All toolkits will use the InChI library to read/write InChIs, and it
> generates the representation with 5v nitrogens. cactus is either applying
> normalisation after reading or in this case (since it's the name resolved)
> doing an identifier lookup from an original SMILES used to generate this
> InChI:
>

No, my "good old" cactus service doesn't do a lookup in this case, it is
read from the string, which is of course in opposition to what I just
said :-). We did quite a bit regarding normalization, first, the CACTVS
toolkit behind the service is quite good in this regard and I added a few
things for the web service, too.


 Markus


Re: [Rdkit-discuss] bad inchi or parsing problem?

2017-09-14 Thread Curt Fischer
I'm not 100% sure about this particular case, but I suspect this is a
limitation of InChI.  For example, the InChI representation of zwitterionic
phenylalanine (negative COO-, positive NH3+) and "neutral" phenylalanine
(neutral COOH, neutral NH2) is exactly the same.  This is by design.  See
https://chemistry.stackexchange.com/questions/34563/pubchem-inchi-smiles-and-uniqueness
for some possibly useful additional discussion.

The InChI FAQ at http://www.inchi-trust.org/technical-faq/#13.2 says:

> This is exemplified below by standard InChIKeys as well as standard InChI
> strings for neutral, zwitterionic, anionic and cationic states of glycine
> (note that neutral and zwitterionic states do not differ in the total
> number of protons so they have the same standard InChI/InChIKey):
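
The proton-count argument in the FAQ excerpt can be checked by hand. Below is
a small sketch using the four textbook protonation states of glycine. The
hydrogen counts follow from the structural formulas; the dictionary layout
and the function name are only for illustration, and the code counts
hydrogens rather than generating any InChI.

```python
from collections import defaultdict

# Glycine protonation states: structural formula, total hydrogen count, net charge.
GLYCINE_STATES = {
    "neutral": ("H2N-CH2-COOH", 5, 0),
    "zwitterionic": ("H3N(+)-CH2-COO(-)", 5, 0),
    "anionic": ("H2N-CH2-COO(-)", 4, -1),
    "cationic": ("H3N(+)-CH2-COOH", 6, 1),
}


def group_by_proton_count(states):
    """Group state names by their total number of hydrogens."""
    groups = defaultdict(list)
    for name, (_formula, h_count, _charge) in states.items():
        groups[h_count].append(name)
    return dict(groups)


for h_count, names in sorted(group_by_proton_count(GLYCINE_STATES).items()):
    print(h_count, names)
```

Only the neutral and zwitterionic states land in the same group, which is the
pair the FAQ says shares a standard InChI/InChIKey.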


Is this the same as or at least similar to the issue you are encountering?

Curt

On Thu, Sep 14, 2017 at 11:09 AM, Jason Biggs  wrote:

> Okay, all three of these smiles strings resolve to the same inchi,
>
> "O=[N+](C1=NC2=CC=CC=C2N=C1)[N-](=O)C1=NC2=CC=CC=C2N=C1"
> "C1=CC=C2C(=C1)N=CC(=N2)N(=N(=O)C3=NC4=CC=CC=C4N=C3)=O"
> "[O-][N+](c1cnc2ccccc2n1)=[N+]([O-])c3cnc4ccccc4n3"
>
> even though to me they seem like different structures due to the specified
> charges.  Is this a limitation of inchi, or do I need to rethink my ideas
> of what makes two chemical structures the same?
>
>
>
>
>
> Jason Biggs
>
>
> On Thu, Sep 14, 2017 at 12:38 PM, John Mayfield <
> john.wilkinson...@gmail.com> wrote:
>
>> InChI is an identifier and not a representation, you should not read
>> InChIs... but we are beyond hope there so...
>>
>> The InChI string is correct and is the same if you roundtrip your
>> preferred one with charge separated bonds and the 5 valent one.
>>
>> All toolkits will use the InChI library to read/write InChIs, and it
>> generates the representation with 5v nitrogens. cactus is either applying
>> normalisation after reading or in this case (since it's the name resolved)
>> doing an identifier lookup from an original SMILES used to generate this
>> InChI:
>>
>> echo 'InChI=1S/C16H10N6O2/c23-21(15-9-17-11-5-1-3-7-13(11)19-15)22(24)16-10-18-12-6-2-4-8-14(12)20-16/h1-10H' \
>>   | inchi -STDIO -inChi2Struct -OutputSDF | obabel -imol -osmi
>>
>> c1ccc2c(c1)ncc(n2)N(=N(=O)c1cnc2ccccc2n1)=O Structure #1
>>
>> SDF also attached without going through Open Babel.
>>
>> - John
>>
>>
>>
>
> 


Re: [Rdkit-discuss] bad inchi or parsing problem?

2017-09-14 Thread Jason Biggs
Okay, all three of these smiles strings resolve to the same inchi,

"O=[N+](C1=NC2=CC=CC=C2N=C1)[N-](=O)C1=NC2=CC=CC=C2N=C1"
"C1=CC=C2C(=C1)N=CC(=N2)N(=N(=O)C3=NC4=CC=CC=C4N=C3)=O"
"[O-][N+](c1cnc2ccccc2n1)=[N+]([O-])c3cnc4ccccc4n3"

even though to me they seem like different structures due to the specified
charges.  Is this a limitation of inchi, or do I need to rethink my ideas
of what makes two chemical structures the same?





Jason Biggs


On Thu, Sep 14, 2017 at 12:38 PM, John Mayfield  wrote:

> InChI is an identifier and not a representation, you should not read
> InChIs... but we are beyond hope there so...
>
> The InChI string is correct and is the same if you roundtrip your
> preferred one with charge separated bonds and the 5 valent one.
>
> All toolkits will use the InChI library to read/write InChIs, and it
> generates the representation with 5v nitrogens. cactus is either applying
> normalisation after reading or in this case (since it's the name resolved)
> doing an identifier lookup from an original SMILES used to generate this
> InChI:
>
> echo 'InChI=1S/C16H10N6O2/c23-21(15-9-17-11-5-1-3-7-13(11)19-15)22(24)16-10-18-12-6-2-4-8-14(12)20-16/h1-10H' \
>   | inchi -STDIO -inChi2Struct -OutputSDF | obabel -imol -osmi
>
> c1ccc2c(c1)ncc(n2)N(=N(=O)c1cnc2ccccc2n1)=O Structure #1
>
> SDF also attached without going through Open Babel.
>
> - John
>
>
>


Re: [Rdkit-discuss] bad inchi or parsing problem?

2017-09-14 Thread John Mayfield
InChI is an identifier and not a representation, you should not read
InChIs... but we are beyond hope there so...

The InChI string is correct and is the same if you roundtrip your preferred
one with charge separated bonds and the 5 valent one.

All toolkits will use the InChI library to read/write InChIs, and it
generates the representation with 5v nitrogens. cactus is either applying
normalisation after reading or in this case (since it's the name resolved)
doing an identifier lookup from an original SMILES used to generate this
InChI:

echo 'InChI=1S/C16H10N6O2/c23-21(15-9-17-11-5-1-3-7-13(11)19-15)22(24)16-10-18-12-6-2-4-8-14(12)20-16/h1-10H' \
  | inchi -STDIO -inChi2Struct -OutputSDF | obabel -imol -osmi

c1ccc2c(c1)ncc(n2)N(=N(=O)c1cnc2ccccc2n1)=O Structure #1

SDF also attached without going through Open Babel.
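
To see what the identifier actually records, the InChI string above can be
split into its layers. This is a minimal sketch, not a full InChI parser: the
function name is made up for this example, it assumes the first layer after
the version is the formula, and it does nothing special for layers that
repeat in non-standard InChIs.

```python
INCHI = (
    "InChI=1S/C16H10N6O2/c23-21(15-9-17-11-5-1-3-7-13(11)19-15)"
    "22(24)16-10-18-12-6-2-4-8-14(12)20-16/h1-10H"
)


def split_inchi_layers(inchi):
    """Split an InChI string into (prefix, body) pairs, in order of appearance."""
    head, sep, rest = inchi.partition("=")
    if head != "InChI" or not sep:
        raise ValueError("not an InChI string")
    version, *layers = rest.split("/")
    result = [("version", version)]
    if layers:
        # Assumed here: the first layer after the version is the formula (no prefix).
        result.append(("formula", layers[0]))
    # Every later layer starts with a one-letter prefix (c, h, q, p, ...).
    result.extend((layer[0], layer[1:]) for layer in layers[1:] if layer)
    return result


layers = split_inchi_layers(INCHI)
prefixes = [prefix for prefix, _body in layers]
print(prefixes)
print("charge layer (/q) present:", "q" in prefixes)
print("proton layer (/p) present:", "p" in prefixes)
```

The string contains only the formula, connectivity (/c) and hydrogen (/h)
layers. There is no /q or /p layer, so nothing in it says where formal
charges sit, which fits with the charge-separated and 5-valent drawings
giving the same string.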

- John


inchioutput.sdf
Description: chemical/mdl-sdfile