Re: [Rdkit-discuss] bad inchi or parsing problem?
Thanks to Curt, Markus, and John for helping me understand this. I knew
that InChI had its limitations, but that didn't jump out at me here because
there's no hydrogen migration between the different forms; I hadn't
realized these forms also qualify as tautomers. So this is definitely a
feature (or limitation) of InChI.

> No, my "good old" cactus service doesn't do a lookup in this case, it is
> read from the string, which is of course in opposition to what I just
> said :-). We did quite a bit regarding normalization, first, the CACTVS
> toolkit behind the service is quite good in this regard and I added a few
> things for the web service, too.

I may look into adding a step after getting a sanitization error, but
before accepting the unsanitized structure, to see if CACTVS can give a
better SMILES string. I want to avoid returning 2D structures like the
first image in the thread to the user when another structure matches the
input equally well.

--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
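The fallback order Jason describes (normal parse first, then ask a normalizing service for a better SMILES, and only then accept the unsanitized structure) might be sketched roughly as follows. This is a control-flow sketch only: `parse_strict`, `normalize`, and `parse_lenient` are hypothetical callables standing in for a sanitizing parser, a CACTVS-style normalization step, and an unsanitized parse. None of them are real RDKit or CACTVS functions, and using `ValueError` to signal a sanitization failure is also an assumption.

```python
from typing import Callable, Optional, Tuple


def parse_with_fallback(
    smiles: str,
    parse_strict: Callable[[str], object],
    normalize: Callable[[str], Optional[str]],
    parse_lenient: Callable[[str], object],
) -> Tuple[object, str]:
    """Return (structure, how), where `how` names the step that succeeded.

    All three callables are hypothetical placeholders supplied by the caller.
    """
    # Step 1: the normal, sanitizing parse.
    try:
        return parse_strict(smiles), "strict"
    except ValueError:
        pass  # sanitization failed; try to obtain a better SMILES first

    # Step 2: ask the normalizer for an alternative SMILES and re-parse it.
    better = normalize(smiles)
    if better is not None:
        try:
            return parse_strict(better), "normalized"
        except ValueError:
            pass  # the normalized string is no better; fall through

    # Step 3: last resort, accept the structure without sanitization.
    return parse_lenient(smiles), "unsanitized"
```

Returning the name of the step alongside the structure lets the caller decide whether to warn the user that the depiction came from an unsanitized parse.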
Re: [Rdkit-discuss] bad inchi or parsing problem?
On Thu, Sep 14, 2017 at 8:09 PM, Jason Biggs wrote:

> Okay, all three of these smiles strings resolve to the same inchi,
>
> "O=[N+](C1=NC2=CC=CC=C2N=C1)[N-](=O)C1=NC2=CC=CC=C2N=C1"
> "C1=CC=C2C(=C1)N=CC(=N2)N(=N(=O)C3=NC4=CC=CC=C4N=C3)=O"
> "[O-][N+](c1cnc2ccccc2n1)=[N+]([O-])c3cnc4ccccc4n3"
>
> even though to me they seem like different structures due to the specified
> charges. Is this a limitation of inchi, or do I need to rethink my ideas
> of what makes two chemical structures the same?

Well, at least the first two I would regard as erroneous or unlikely
(not stable) creatures, and that is exactly what John meant by "InChI is
an identifier, not a representation". InChI's main purpose (particularly
that of Standard InChI) is to identify them as the same (corrected,
normalized) molecule, not as three separate species (that would be the
purpose of a representation). Of course, in many cases there might be a
discussion about where sensible correction/normalization should end and
separation of structures should start, but that is a long topic.
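The identifier-versus-representation distinction can be illustrated with a small sketch that groups several drawings under one identifier. The SMILES-to-InChI table below is hard-coded from what is reported in this thread, not computed by any toolkit; the third SMILES is written with the benzo rings spelled out in full, which is an assumption based on the formula C16H10N6O2; and `group_by_identifier` is a made-up helper name.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

# The InChI quoted in this thread.
THREAD_INCHI = (
    "InChI=1S/C16H10N6O2/c23-21(15-9-17-11-5-1-3-7-13(11)19-15)"
    "22(24)16-10-18-12-6-2-4-8-14(12)20-16/h1-10H"
)

# Hard-coded from the thread (assumption: copied, not recomputed here).
REPORTED_INCHI: Dict[str, str] = {
    "O=[N+](C1=NC2=CC=CC=C2N=C1)[N-](=O)C1=NC2=CC=CC=C2N=C1": THREAD_INCHI,
    "C1=CC=C2C(=C1)N=CC(=N2)N(=N(=O)C3=NC4=CC=CC=C4N=C3)=O": THREAD_INCHI,
    # Aromatic, charge-separated form; ring atoms written out in full.
    "[O-][N+](c1cnc2ccccc2n1)=[N+]([O-])c3cnc4ccccc4n3": THREAD_INCHI,
}


def group_by_identifier(
    smiles_list: Iterable[str], identify: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Collect every representation (SMILES) under its identifier."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for smi in smiles_list:
        groups[identify(smi)].append(smi)
    return dict(groups)


groups = group_by_identifier(REPORTED_INCHI, REPORTED_INCHI.__getitem__)
for inchi, members in groups.items():
    print(f"{len(members)} representation(s) share one identifier")
```

In a real pipeline `identify` would call whatever InChI generator is in use; the point is only that one key can legitimately collect several different drawings.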
Re: [Rdkit-discuss] bad inchi or parsing problem?
On Thu, Sep 14, 2017 at 7:38 PM, John Mayfield wrote:

> InChI is an identifier and not a representation, you should not read
> InChIs... but we are beyond hope there so...

Wonderfully said - unfortunately one day they decided to make InChIs
"readable" ...

> The InChI string is correct and is the same if you roundtrip your
> preferred one with charge separated bonds and the 5 valent one.
>
> All toolkits will use the InChI library to read/write InChIs and it
> generates the representation with 5v nitrogens, cactus is either applying
> normalisation after reading or in this case (since it's the name resolved)
> doing an identifier lookup from an original SMILES used to generate this
> InChI:

No, my "good old" cactus service doesn't do a lookup in this case; it is
read from the string, which is of course in opposition to what I just
said :-). We did quite a bit regarding normalization: first, the CACTVS
toolkit behind the service is quite good in this regard, and I added a few
things for the web service, too.

Markus
Re: [Rdkit-discuss] bad inchi or parsing problem?
I'm not 100% sure about this particular case, but I suspect this is a
limitation of InChI. For example, the InChI for zwitterionic phenylalanine
(negative COO-, positive NH3+) and the InChI for "neutral" phenylalanine
(neutral COOH, neutral NH2) are exactly the same. This is by design.

See
https://chemistry.stackexchange.com/questions/34563/pubchem-inchi-smiles-and-uniqueness
for some possibly useful additional discussion.

The InChI FAQ at http://www.inchi-trust.org/technical-faq/#13.2 says:

> This is exemplified below by standard InChIKeys as well as standard InChI
> strings for neutral, zwitterionic, anionic and cationic states of glycine
> (note that neutral and zwitterionic states do not differ in the total
> number of protons so they have the same standard InChI/InChIKey):

Is this the same as, or at least similar to, the issue you are
encountering?

Curt

On Thu, Sep 14, 2017 at 11:09 AM, Jason Biggs wrote:

> Okay, all three of these smiles strings resolve to the same inchi,
>
> "O=[N+](C1=NC2=CC=CC=C2N=C1)[N-](=O)C1=NC2=CC=CC=C2N=C1"
> "C1=CC=C2C(=C1)N=CC(=N2)N(=N(=O)C3=NC4=CC=CC=C4N=C3)=O"
> "[O-][N+](c1cnc2ccccc2n1)=[N+]([O-])c3cnc4ccccc4n3"
>
> even though to me they seem like different structures due to the specified
> charges. Is this a limitation of inchi, or do I need to rethink my ideas
> of what makes two chemical structures the same?
>
> Jason Biggs
>
> On Thu, Sep 14, 2017 at 12:38 PM, John Mayfield <
> john.wilkinson...@gmail.com> wrote:
>
>> InChI is an identifier and not a representation, you should not read
>> InChIs... but we are beyond hope there so...
>>
>> The InChI string is correct and is the same if you roundtrip your
>> preferred one with charge separated bonds and the 5 valent one.
>>
>> All toolkits will use the InChI library to read/write InChIs and it
>> generates the representation with 5v nitrogens, cactus is either applying
>> normalisation after reading or in this case (since it's the name resolved)
>> doing an identifier lookup from an original SMILES used to generate this
>> InChI:
>>
>>> echo 'InChI=1S/C16H10N6O2/c23-21(15-9-17-11-5-1-3-7-13(11)19-15)22(24)16-10-18-12-6-2-4-8-14(12)20-16/h1-10H' | inchi -STDIO -inChi2Struct -OutputSDF | obabel -imol -osmi
>>
>> c1ccc2c(c1)ncc(n2)N(=N(=O)c1cnc2ccccc2n1)=O Structure #1
>>
>> SDF also attached without going through Open Babel.
>>
>> - John
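Curt's point can be checked against the InChI string from this thread by looking at its layers. The sketch below is a naive splitter written only for illustration (the name `split_layers` is made up, and it is not a validating parser). In a standard InChI, net charge and (de)protonation appear as `/q` and `/p` layers; the thread's string has neither, so the different charge placements drawn in the three SMILES leave no trace in it.

```python
from typing import Dict

# The InChI quoted in this thread.
THREAD_INCHI = (
    "InChI=1S/C16H10N6O2/c23-21(15-9-17-11-5-1-3-7-13(11)19-15)"
    "22(24)16-10-18-12-6-2-4-8-14(12)20-16/h1-10H"
)


def split_layers(inchi: str) -> Dict[str, str]:
    """Naively split an InChI into {layer prefix: content}.

    Intended for simple standard InChIs only; a repeated prefix would
    overwrite the earlier layer with the same letter.
    """
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    parts = inchi[len("InChI="):].split("/")
    layers = {"version": parts[0]}
    rest = parts[1:]
    # The formula layer has no prefix letter and starts with an
    # uppercase element symbol; every other layer starts lowercase.
    if rest and rest[0][:1].isupper():
        layers["formula"] = rest[0]
        rest = rest[1:]
    for part in rest:
        if part:  # skip empty segments
            layers[part[0]] = part[1:]
    return layers


layers = split_layers(THREAD_INCHI)
print(sorted(layers))                    # which layers are present
print("q" in layers, "p" in layers)      # charge / proton layers present?
```

The string carries only the formula, connectivity (`/c`), and hydrogen (`/h`) layers, so any two drawings that agree on those three will get this same identifier.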
Re: [Rdkit-discuss] bad inchi or parsing problem?
Okay, all three of these SMILES strings resolve to the same InChI,

"O=[N+](C1=NC2=CC=CC=C2N=C1)[N-](=O)C1=NC2=CC=CC=C2N=C1"
"C1=CC=C2C(=C1)N=CC(=N2)N(=N(=O)C3=NC4=CC=CC=C4N=C3)=O"
"[O-][N+](c1cnc2ccccc2n1)=[N+]([O-])c3cnc4ccccc4n3"

even though to me they seem like different structures due to the specified
charges. Is this a limitation of InChI, or do I need to rethink my ideas
of what makes two chemical structures the same?

Jason Biggs

On Thu, Sep 14, 2017 at 12:38 PM, John Mayfield wrote:

> InChI is an identifier and not a representation, you should not read
> InChIs... but we are beyond hope there so...
>
> The InChI string is correct and is the same if you roundtrip your
> preferred one with charge separated bonds and the 5 valent one.
>
> All toolkits will use the InChI library to read/write InChIs and it
> generates the representation with 5v nitrogens, cactus is either applying
> normalisation after reading or in this case (since it's the name resolved)
> doing an identifier lookup from an original SMILES used to generate this
> InChI:
>
>> echo 'InChI=1S/C16H10N6O2/c23-21(15-9-17-11-5-1-3-7-13(11)19-15)22(24)16-10-18-12-6-2-4-8-14(12)20-16/h1-10H' | inchi -STDIO -inChi2Struct -OutputSDF | obabel -imol -osmi
>
> c1ccc2c(c1)ncc(n2)N(=N(=O)c1cnc2ccccc2n1)=O Structure #1
>
> SDF also attached without going through Open Babel.
>
> - John
Re: [Rdkit-discuss] bad inchi or parsing problem?
InChI is an identifier and not a representation, you should not read
InChIs... but we are beyond hope there so...

The InChI string is correct and is the same if you roundtrip your
preferred one with charge separated bonds and the 5 valent one.

All toolkits will use the InChI library to read/write InChIs and it
generates the representation with 5v nitrogens. Cactus is either applying
normalisation after reading or, in this case (since it's the name
resolver), doing an identifier lookup from an original SMILES used to
generate this InChI:

    echo 'InChI=1S/C16H10N6O2/c23-21(15-9-17-11-5-1-3-7-13(11)19-15)22(24)16-10-18-12-6-2-4-8-14(12)20-16/h1-10H' | inchi -STDIO -inChi2Struct -OutputSDF | obabel -imol -osmi

    c1ccc2c(c1)ncc(n2)N(=N(=O)c1cnc2ccccc2n1)=O Structure #1

SDF also attached without going through Open Babel.

- John

Attachment: inchioutput.sdf (chemical/mdl-sdfile)