Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
Sure, here it is:

1. The question:

> I noticed that compounds that differ only on the cis-trans isomerization
> around an sp2 nitrogen get the same InChI Key from RDKit. For example:
>
> > inchi_cis = Chem.inchi.MolToInchiKey(Chem.MolFromSmiles("C/N=C(/NC#N)NCCSCc1nc[nH]c1C"))
> > inchi_cis
> 'AQIXAKUUQRKLND-UHFFFAOYSA-N'
>
> > inchi_trans = Chem.inchi.MolToInchiKey(Chem.MolFromSmiles("C/N=C(\\NC#N)NCCSCc1nc[nH]c1C"))
> > inchi_trans
> 'AQIXAKUUQRKLND-UHFFFAOYSA-N'
>
> > inchi_cis == inchi_trans
> True
>
> I wonder if this is a limitation of the InChI Key definition, or an
> implementation issue.

The answer to the question, in the end, was that the InChI Keys were
behaving as intended, by design, as pointed out by Igor Pletnev:

> though InChI is not perfect, in this case it behaves as intended.
> Please see below.
>
> The discussed molecules contain substituted guanidine fragment
> (RHN)C(=NMe)(NHR')
>
> It is subjected to tautomerism, and in different tautomers different C-N
> bonds have double order:
> (RHN)C(=NMe)(NHR')
> (RHN)C(NHMe)(=NR')
> (RN=)C(NHMe)(NHR')
>
> You generated Standard InChI, which is evidenced by "InChI=1S/" prefix in
> the examples.
> Standard InChI is specifically designed to produce the same identifier for
> all tautomers (by indicating that two hydrogens are shared by three
> nitrogen atoms, for any tautomer; bond orders are not indicated in InChI).
>
> As the tautomer-invariant Std InChI does not know which C-N bond is
> actually a double, there is the only option for treating stereo -- to
> completely ignore it as a drawing artifact.
>
> All in all:
> Standard InChI means that the exact tautomeric form is unknown ==> all
> tautomers are mapped to the same generic representation ==> the exact C-N
> double bond placement in this generic is unspecified ==> C-N double bond
> stereo is ignored ==> generated StdInChI and Std InChIKey are the same for
> seemingly different, by initial drawing, cis/trans forms.
>
> Once again, this behavior is by design; it is intended for maximal
> interoperability while comparing different drawings of the "same" compound.
>
> If, for any reason, you would like to consider your examples as the
> definite and resolvable structures, each having its own identifier, just
> use non-Standard InChI.
> The InChI which preserves the exact positions of tautomeric H's and double
> bond ("as drawn") is produced by just specifying option /FixedH upon
> generation.
>
> More on this may be found in InChI FAQ:
> https://www.inchi-trust.org/technical-faq-2/

The only question remaining was how to use this "/FixedH" option in RDKit,
and that was answered by Paolo Tosco:

> you can pass InChI options to the underlying InChI API through the options
> parameter of Chem.inchi.MolToInchi() and Chem.inchi.MolToInchiKey(); e.g.:
>
> inchi.MolToInchi(mol, options="/FixedH")
>
> Source:
> https://www.rdkit.org/docs/source/rdkit.Chem.inchi.html?highlight=inchi#rdkit.Chem.inchi.MolBlockToInchi

And this is what I'm using now to remove duplicate molecules from my
database. I'm using a Pandas DataFrame and, with the more recent versions
of Pandas, the following works fine:

> df['InChI Key'] = df[mol_col].progress_apply(Chem.MolToInchiKey, options="/FixedH")
> df.drop_duplicates(subset=['InChI Key'], keep='first', inplace=True)

All the best,
--
Gustavo Seabra.

On Fri, Oct 30, 2020 at 4:47 AM Adelene LAI wrote:

> Hi Gustavo,
>
> Looks like you found a solution for your deduplication task. Would you
> mind sharing it with us? (Seems some emails in the chain are missing.)
>
> I'm curious - returning to your original question, did we figure out why
> the same InChIKey was given for the stereoisomers?
>
> Adelene
>
> Doctoral Researcher
> Environmental Cheminformatics
> UNIVERSITÉ DU LUXEMBOURG
>
> LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
> 6, avenue du Swing, L-4367 Belvaux
> T +356 46 66 44 67 18
> [image: github.png] adelenelai
>
> --
> *From:* Gustavo Seabra
> *Sent:* Thursday, October 29, 2020 10:23:20 PM
> *To:* Paolo Tosco
> *Cc:* Igor Pletnev; RDKit Discuss
> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
>
> Aha! Fantastic!
>
> Thanks a lot!!
> Gustavo.
>
> --
> Gustavo Seabra
>
> --
> *From:* Paolo Tosco
> *Sent:* Thursday, October 29, 2020 5:13:33 PM
> *To:* Gustavo Seabra
> *Cc:* Igor Pletnev ; RDKit Discuss <rdkit-discuss@lists.sourceforge.net>
> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
>
> Hi Gusta
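
For readers who want to reproduce the comparison discussed in this message, here is a minimal sketch, not a definitive recipe. It uses the two SMILES from the original question and the options="/FixedH" argument mentioned by Paolo. The helper names (is_standard_inchi, inchi_and_key) and the guarded import are assumptions added for illustration; they are not part of RDKit or of the thread.

```python
try:
    from rdkit import Chem
except ImportError:  # assumption: keep the pure-Python helper usable without RDKit
    Chem = None

# The two SMILES from the original question; they differ only in the
# stereo descriptor on the C=N bond of the guanidine fragment.
SMILES_A = "C/N=C(/NC#N)NCCSCc1nc[nH]c1C"
SMILES_B = "C/N=C(\\NC#N)NCCSCc1nc[nH]c1C"


def is_standard_inchi(inchi: str) -> bool:
    """Standard InChI strings carry the "InChI=1S/" prefix mentioned by Igor."""
    return inchi.startswith("InChI=1S/")


def inchi_and_key(smiles: str, options: str = ""):
    """Return (InChI, InChIKey) for a SMILES, forwarding `options` to InChI."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    inchi = Chem.inchi.MolToInchi(mol, options=options)
    key = Chem.inchi.MolToInchiKey(mol, options=options)
    return inchi, key


if Chem is not None:
    for label, options in (("standard", ""), ("fixed-H", "/FixedH")):
        inchi_a, key_a = inchi_and_key(SMILES_A, options)
        inchi_b, key_b = inchi_and_key(SMILES_B, options)
        print(f"{label}: standard InChI? {is_standard_inchi(inchi_a)}; "
              f"same key for both drawings? {key_a == key_b}")
```

The prefix check is only a quick way to see which flavour of InChI was generated; whether the fixed-H keys separate a given pair of drawings is worth checking on your own data before relying on it for deduplication.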
Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
Hi Gustavo,

Looks like you found a solution for your deduplication task. Would you mind
sharing it with us? (Seems some emails in the chain are missing.)

I'm curious - returning to your original question, did we figure out why
the same InChIKey was given for the stereoisomers?

Adelene

Doctoral Researcher
Environmental Cheminformatics
UNIVERSITÉ DU LUXEMBOURG

LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
6, avenue du Swing, L-4367 Belvaux
T +356 46 66 44 67 18
[github.png] adelenelai

From: Gustavo Seabra
Sent: Thursday, October 29, 2020 10:23:20 PM
To: Paolo Tosco
Cc: Igor Pletnev; RDKit Discuss
Subject: Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key

Aha! Fantastic!

Thanks a lot!!
Gustavo.

--
Gustavo Seabra

From: Paolo Tosco
Sent: Thursday, October 29, 2020 5:13:33 PM
To: Gustavo Seabra
Cc: Igor Pletnev ; RDKit Discuss
Subject: Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key

Hi Gustavo,

you can pass InChI options to the underlying InChI API through the options
parameter of Chem.inchi.MolToInchi() and Chem.inchi.MolToInchiKey(); e.g.:

inchi.MolToInchi(mol, options="/FixedH")

Source:
https://www.rdkit.org/docs/source/rdkit.Chem.inchi.html?highlight=inchi#rdkit.Chem.inchi.MolBlockToInchi

Cheers,
p.

On Thu, Oct 29, 2020 at 9:42 PM Gustavo Seabra wrote:

Ok, thanks!
--
Gustavo Seabra.

On Thu, Oct 29, 2020 at 4:33 PM Igor Pletnev wrote:

> Is this "/FixedH" an option in RDKit? How to use that? (I don't see it in
> the docs).

Sorry, I am not so proficient in RDKit and can not answer exactly.
Anyway, this option is available in InChI API calls, and I am pretty sure
that it is also available in RDKit.

I recall that couple of years ago, on some InChI event, Greg Landrum
somewhat surprised me by saying that he himself often uses non-Standard
InChI instead of Standard one, exactly to distinguish tautomers.
So I guess Greg can answer on how it is arranged in RDKit.

Regards,
Igor

On Thu, 29 Oct 2020 at 23:03, Gustavo Seabra wrote:

That does make sense, I understand it now, thanks!

Is this "/FixedH" an option in RDKit? How to use that? (I don't see it in
the docs).

Thanks,
--
Gustavo Seabra.

On Wed, Oct 28, 2020 at 6:10 PM Igor Pletnev wrote:

Hi Gustavo,

> ... I was generating the InChI Keys to get a unique hash for each compound,
> thinking it would be better than SMILES (guaranteed to be unique), but is
> clearly not the case. On the bright side, I won't lose time generating
> InChIs...

though InChI is not perfect, in this case it behaves as intended.
Please see below.

The discussed molecules contain substituted guanidine fragment
(RHN)C(=NMe)(NHR')

It is subjected to tautomerism, and in different tautomers different C-N
bonds have double order:
(RHN)C(=NMe)(NHR')
(RHN)C(NHMe)(=NR')
(RN=)C(NHMe)(NHR')

You generated Standard InChI, which is evidenced by "InChI=1S/" prefix in
the examples.
Standard InChI is specifically designed to produce the same identifier for
all tautomers (by indicating that two hydrogens are shared by three
nitrogen atoms, for any tautomer; bond orders are not indicated in InChI).

As the tautomer-invariant Std InChI does not know which C-N bond is
actually a double, there is the only option for treating stereo -- to
completely ignore it as a drawing artifact.

All in all:
Standard InChI means that the exact tautomeric form is unknown ==> all
tautomers are mapped to the same generic representation ==> the exact C-N
double bond placement in this generic is unspecified ==> C-N double bond
stereo is ignored ==> generated StdInChI and Std InChIKey are the same for
seemingly different, by initial drawing, cis/trans forms.

Once again, this behavior is by design; it is intended for maximal
interoperability while comparing different drawings of the "same" compound.

If, for any reason, you would like to consider your examples as the
definite and resolvable structures, each having its own identifier, just
use non-Standard InChI.
The InChI which preserves the exact positions of tautomeric H's and double
bond ("as drawn") is produced by just specifying option /FixedH upon
generation.

More on this may be found in InChI FAQ:
https://www.inchi-trust.org/technical-faq-2/

Hope this helps.

Regards,
Igor

On Mon, Oct 26, 2020 at 6:56 PM Gustavo Seabra wrote:

Thanks a lot Peter and Adelene,

Yes, it looks like canonical SMILES is the way to go, and I have no problem
sticking with RDKit. I was generating the InChI Keys to get a unique hash
for each compound, thinking it would be better than SMILES (guaranteed to
be unique), but is clearly not the case.
Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
Aha! Fantastic!

Thanks a lot!!
Gustavo.

--
Gustavo Seabra

From: Paolo Tosco
Sent: Thursday, October 29, 2020 5:13:33 PM
To: Gustavo Seabra
Cc: Igor Pletnev ; RDKit Discuss
Subject: Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key

Hi Gustavo,

you can pass InChI options to the underlying InChI API through the options
parameter of Chem.inchi.MolToInchi() and Chem.inchi.MolToInchiKey(); e.g.:

inchi.MolToInchi(mol, options="/FixedH")

Source:
https://www.rdkit.org/docs/source/rdkit.Chem.inchi.html?highlight=inchi#rdkit.Chem.inchi.MolBlockToInchi

Cheers,
p.

On Thu, Oct 29, 2020 at 9:42 PM Gustavo Seabra wrote:

Ok, thanks!
--
Gustavo Seabra.

On Thu, Oct 29, 2020 at 4:33 PM Igor Pletnev wrote:

> Is this "/FixedH" an option in RDKit? How to use that? (I don't see it in
> the docs).

Sorry, I am not so proficient in RDKit and can not answer exactly.
Anyway, this option is available in InChI API calls, and I am pretty sure
that it is also available in RDKit.

I recall that couple of years ago, on some InChI event, Greg Landrum
somewhat surprised me by saying that he himself often uses non-Standard
InChI instead of Standard one, exactly to distinguish tautomers.
So I guess Greg can answer on how it is arranged in RDKit.

Regards,
Igor

On Thu, 29 Oct 2020 at 23:03, Gustavo Seabra wrote:

That does make sense, I understand it now, thanks!

Is this "/FixedH" an option in RDKit? How to use that? (I don't see it in
the docs).

Thanks,
--
Gustavo Seabra.

On Wed, Oct 28, 2020 at 6:10 PM Igor Pletnev wrote:

Hi Gustavo,

> ... I was generating the InChI Keys to get a unique hash for each compound,
> thinking it would be better than SMILES (guaranteed to be unique), but is
> clearly not the case. On the bright side, I won't lose time generating
> InChIs...

though InChI is not perfect, in this case it behaves as intended.
Please see below.

The discussed molecules contain substituted guanidine fragment
(RHN)C(=NMe)(NHR')

It is subjected to tautomerism, and in different tautomers different C-N
bonds have double order:
(RHN)C(=NMe)(NHR')
(RHN)C(NHMe)(=NR')
(RN=)C(NHMe)(NHR')

You generated Standard InChI, which is evidenced by "InChI=1S/" prefix in
the examples.
Standard InChI is specifically designed to produce the same identifier for
all tautomers (by indicating that two hydrogens are shared by three
nitrogen atoms, for any tautomer; bond orders are not indicated in InChI).

As the tautomer-invariant Std InChI does not know which C-N bond is
actually a double, there is the only option for treating stereo -- to
completely ignore it as a drawing artifact.

All in all:
Standard InChI means that the exact tautomeric form is unknown ==> all
tautomers are mapped to the same generic representation ==> the exact C-N
double bond placement in this generic is unspecified ==> C-N double bond
stereo is ignored ==> generated StdInChI and Std InChIKey are the same for
seemingly different, by initial drawing, cis/trans forms.

Once again, this behavior is by design; it is intended for maximal
interoperability while comparing different drawings of the "same" compound.

If, for any reason, you would like to consider your examples as the
definite and resolvable structures, each having its own identifier, just
use non-Standard InChI.
The InChI which preserves the exact positions of tautomeric H's and double
bond ("as drawn") is produced by just specifying option /FixedH upon
generation.

More on this may be found in InChI FAQ:
https://www.inchi-trust.org/technical-faq-2/

Hope this helps.

Regards,
Igor

On Mon, Oct 26, 2020 at 6:56 PM Gustavo Seabra wrote:

Thanks a lot Peter and Adelene,

Yes, it looks like canonical SMILES is the way to go, and I have no problem
sticking with RDKit. I was generating the InChI Keys to get a unique hash
for each compound, thinking it would be better than SMILES (guaranteed to
be unique), but is clearly not the case. On the bright side, I won't lose
time generating InChIs...

Can I trust that the same molecule will always get the same canonical
SMILES from RDKit, independent of how it is read? (Different SDF files,
geometries, atom orders, etc.?)

All the best,
Gustavo.

--
Gustavo Seabra.

On Sun, Oct 25, 2020 at 8:27 PM Peter S. Shenkin wrote:

Canonical SMILES is probably the way to go, but you might also be able to
use the InchiKey and the Inchi auxiliary information together as a compound
hash key.

-P.

On Sun, Oct 25, 2020 at 10:53 AM Adelene LAI wrote:

Hi Gustavo,

(Sorry, forgot to reply all before...)

Your deduplication task is quite fa
Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
That does make sense, I understand it now, thanks!

Is this "/FixedH" an option in RDKit? How to use that? (I don't see it in
the docs).

Thanks,
--
Gustavo Seabra.

On Wed, Oct 28, 2020 at 6:10 PM Igor Pletnev wrote:

> Hi Gustavo,
>
> > ... I was generating the InChI Keys to get a unique hash for each
> > compound, thinking it would be better than SMILES (guaranteed to be
> > unique), but is clearly not the case. On the bright side, I won't lose
> > time generating InChIs...
>
> though InChI is not perfect, in this case it behaves as intended.
> Please see below.
>
> The discussed molecules contain substituted guanidine fragment
> (RHN)C(=NMe)(NHR')
>
> It is subjected to tautomerism, and in different tautomers different C-N
> bonds have double order:
> (RHN)C(=NMe)(NHR')
> (RHN)C(NHMe)(=NR')
> (RN=)C(NHMe)(NHR')
>
> You generated Standard InChI, which is evidenced by "InChI=1S/" prefix in
> the examples.
> Standard InChI is specifically designed to produce the same identifier for
> all tautomers (by indicating that two hydrogens are shared by three
> nitrogen atoms, for any tautomer; bond orders are not indicated in InChI).
>
> As the tautomer-invariant Std InChI does not know which C-N bond is
> actually a double, there is the only option for treating stereo -- to
> completely ignore it as a drawing artifact.
>
> All in all:
> Standard InChI means that the exact tautomeric form is unknown ==> all
> tautomers are mapped to the same generic representation ==> the exact C-N
> double bond placement in this generic is unspecified ==> C-N double bond
> stereo is ignored ==> generated StdInChI and Std InChIKey are the same for
> seemingly different, by initial drawing, cis/trans forms.
>
> Once again, this behavior is by design; it is intended for maximal
> interoperability while comparing different drawings of the "same" compound.
>
> If, for any reason, you would like to consider your examples as the
> definite and resolvable structures, each having its own identifier, just
> use non-Standard InChI.
> The InChI which preserves the exact positions of tautomeric H's and double
> bond ("as drawn") is produced by just specifying option /FixedH upon
> generation.
>
> More on this may be found in InChI FAQ:
> https://www.inchi-trust.org/technical-faq-2/
>
> Hope this helps.
>
> Regards,
> Igor
>
> On Mon, Oct 26, 2020 at 6:56 PM Gustavo Seabra wrote:
>
>> Thanks a lot Peter and Adelene,
>>
>> Yes, it looks like canonical SMILES is the way to go, and I have no
>> problem sticking with RDKit. I was generating the InChI Keys to get a
>> unique hash for each compound, thinking it would be better than SMILES
>> (guaranteed to be unique), but is clearly not the case. On the bright side,
>> I won't lose time generating InChIs...
>>
>> Can I trust that the same molecule will always get the same canonical
>> SMILES from RDKit, independent of how it is read? (Different SDF files,
>> geometries, atom orders, etc.?)
>>
>> All the best,
>> Gustavo.
>>
>> --
>> Gustavo Seabra.
>>
>> On Sun, Oct 25, 2020 at 8:27 PM Peter S. Shenkin wrote:
>>
>>> Canonical SMILES is probably the way to go, but you might also be able
>>> to use the InchiKey and the Inchi auxiliary information together as a
>>> compound hash key.
>>>
>>> -P.
>>>
>>> On Sun, Oct 25, 2020 at 10:53 AM Adelene LAI wrote:
>>>
>>>> Hi Gustavo,
>>>>
>>>> (Sorry, forgot to reply all before...)
>>>>
>>>> Your deduplication task is quite familiar to me and something I do
>>>> quite a lot of in my own work ;)
>>>>
>>>> Can I suggest deduplicating using Canonical SMILES?
>>>>
>>>> It doesn't solve your InChIKey issue, but it is a solution for now.
>>>>
>>>> I updated my gist to show that it is feasible:
>>>>
>>>> https://gist.github.com/adelenelai/59a8794e1f030941c19bcb50aa8adf3f
>>>>
>>>> Adelene
>>>>
>>>> Doctoral Researcher
>>>> Environmental Cheminformatics
>>>> UNIVERSITÉ DU LUXEMBOURG
>>>>
>>>> LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
>>>>
Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
Thanks a lot Peter and Adelene,

Yes, it looks like canonical SMILES is the way to go, and I have no problem
sticking with RDKit. I was generating the InChI Keys to get a unique hash
for each compound, thinking it would be better than SMILES (guaranteed to
be unique), but is clearly not the case. On the bright side, I won't lose
time generating InChIs...

Can I trust that the same molecule will always get the same canonical
SMILES from RDKit, independent of how it is read? (Different SDF files,
geometries, atom orders, etc.?)

All the best,
Gustavo.

--
Gustavo Seabra.

On Sun, Oct 25, 2020 at 8:27 PM Peter S. Shenkin wrote:

> Canonical SMILES is probably the way to go, but you might also be able to
> use the InchiKey and the Inchi auxiliary information together as a compound
> hash key.
>
> -P.
>
> On Sun, Oct 25, 2020 at 10:53 AM Adelene LAI wrote:
>
>> Hi Gustavo,
>>
>> (Sorry, forgot to reply all before...)
>>
>> Your deduplication task is quite familiar to me and something I do quite
>> a lot of in my own work ;)
>>
>> Can I suggest deduplicating using Canonical SMILES?
>>
>> It doesn't solve your InChIKey issue, but it is a solution for now.
>>
>> I updated my gist to show that it is feasible:
>>
>> https://gist.github.com/adelenelai/59a8794e1f030941c19bcb50aa8adf3f
>>
>> Adelene
>>
>> Doctoral Researcher
>> Environmental Cheminformatics
>> UNIVERSITÉ DU LUXEMBOURG
>>
>> LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
>> 6, avenue du Swing, L-4367 Belvaux
>> T +356 46 66 44 67 18
>> [image: github.png] adelenelai
>>
>> --
>> *From:* Gustavo Seabra
>> *Sent:* Sunday, October 25, 2020 2:27:15 PM
>> *To:* Adelene LAI
>> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
>>
>> Actually, I was trying to generate all stereoisomers for molecules in a
>> database, and filter duplicate molecules by using the InChI Key to detect
>> duplicates. But it gives cis/trans isomers on sp2-N the same Key.
>>
>> Gustavo.
>>
>> --
>> Gustavo Seabra
>>
>> --
>> *From:* Adelene LAI
>> *Sent:* Sunday, October 25, 2020 1:44:01 AM
>> *To:* Gustavo Seabra
>> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
>>
>> Hi Gustavo,
>>
>> It occurred to me while swimming yesterday - was there a reason you
>> pointed out the hybridisation state of N in your original subject text?
>>
>> Was it just to specify which N to focus on, or did you expect something
>> special about sp2 hybridisation wrt InChIKey?
>>
>> Adelene
>>
>> Doctoral Researcher
>> Environmental Cheminformatics
>> UNIVERSITÉ DU LUXEMBOURG
>>
>> LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
>> 6, avenue du Swing, L-4367 Belvaux
>> T +356 46 66 44 67 18
>> [image: github.png] adelenelai
>>
>> --
>> *From:* Gustavo Seabra
>> *Sent:* Saturday, October 24, 2020 5:37:09 AM
>> *To:* RDKit Discuss; Adelene LAI
>> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
>>
>> Thanks for looking into it. I'm happy to see it wasn't just a mistake by
>> me ;-)
>>
>> I hope we can find what's wrong there.
>>
>> Best,
>> Gustavo.
>>
>> --
>> Gustavo Seabra
>>
>> --
>> *From:* Adelene LAI
>> *Sent:* Friday, October 23, 2020 11:28:55 PM
>> *To:* Gustavo Seabra ; RDKit Discuss <rdkit-discuss@lists.sourceforge.net>
>> *Subject:* Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
>>
>> Hi Gustavo,
>>
>> https://gist.github.com/adelenelai/59a8794e1f030941c19bcb50aa8adf3f
>>
>> In the gist above, I tried doing some further investigating.
>>
>> It seems for the example you gave, the rdkit functions indeed give the
>> same inchikey and inchi, but different aux info.
>>
>> Why this different aux info doesn't translate into di
Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
Canonical SMILES is probably the way to go, but you might also be able to use the InchiKey and the Inchi auxiliary information together as a compound hash key.

-P.

On Sun, Oct 25, 2020 at 10:53 AM Adelene LAI wrote:
> Can I suggest deduplicating using Canonical SMILES?

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
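A minimal sketch of the two deduplication keys mentioned in this reply, assuming an RDKit build with InChI support. The helper names (`dedupe`, `canonical_smiles_key`, `inchikey_auxinfo_key`) are made up for illustration and are not part of RDKit. One caveat, as far as I understand the format: AuxInfo records the original atom numbering, so the same molecule written with a different atom order may end up with a different composite key, which canonical SMILES avoids.

```python
from typing import Callable, Hashable, Iterable, List, Tuple


def dedupe(smiles_list: Iterable[str], key: Callable[[str], Hashable]) -> List[str]:
    """Keep the first input seen for each distinct key, preserving input order."""
    seen = set()
    unique = []
    for smi in smiles_list:
        k = key(smi)
        if k not in seen:
            seen.add(k)
            unique.append(smi)
    return unique


def canonical_smiles_key(smi: str) -> str:
    """Canonical isomeric SMILES, so double-bond stereo is part of the key."""
    # Imported here so that dedupe() itself has no RDKit dependency.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smi!r}")
    # MolToSmiles is canonical and isomeric by default in recent RDKit releases.
    return Chem.MolToSmiles(mol)


def inchikey_auxinfo_key(smi: str) -> Tuple[str, str]:
    """InChIKey paired with the InChI AuxInfo string (the composite key idea)."""
    from rdkit import Chem
    from rdkit.Chem import inchi

    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smi!r}")
    inchi_str, aux_info = inchi.MolToInchiAndAuxInfo(mol)
    return inchi.InchiToInchiKey(inchi_str), aux_info
```

For example, `dedupe(all_smiles, key=canonical_smiles_key)` would keep the first SMILES seen for each canonical form, where `all_smiles` stands for whatever list of SMILES strings you are filtering.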
Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
Hi Gustavo,

(Sorry, forgot to reply all before...)

Your deduplication task is quite familiar to me and something I do quite a lot of in my own work ;)

Can I suggest deduplicating using Canonical SMILES?

It doesn't solve your InChIKey issue, but it is a solution for now.

I updated my gist to show that it is feasible:

https://gist.github.com/adelenelai/59a8794e1f030941c19bcb50aa8adf3f

Adelene

Doctoral Researcher
Environmental Cheminformatics
UNIVERSITÉ DU LUXEMBOURG
LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
6, avenue du Swing, L-4367 Belvaux
T +356 46 66 44 67 18
GitHub: adelenelai

From: Gustavo Seabra
Sent: Sunday, October 25, 2020 2:27:15 PM
To: Adelene LAI
Subject: Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key

Actually, I was trying to generate all stereoisomers for molecules in a database, and filter duplicate molecules by using the InChI Key to detect duplicates. But it gives cis/trans isomers on sp2-N the same Key.

Gustavo.

--
Gustavo Seabra

From: Adelene LAI
Sent: Sunday, October 25, 2020 1:44:01 AM
To: Gustavo Seabra
Subject: Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key

Hi Gustavo,

It occurred to me while swimming yesterday - was there a reason you pointed out the hybridisation state of N in your original subject text?

Was it just to specify which N to focus on, or did you expect something special about sp2 hybridisation wrt InChIKey?

Adelene
Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
From: Gustavo Seabra
Sent: Saturday, October 24, 2020 5:37:09 AM
To: RDKit Discuss; Adelene LAI

Thanks for looking into it. I'm happy to see it wasn't just a mistake by me ;-)

I hope we can find what's wrong there.

Best,
Gustavo.

--
Gustavo Seabra
Re: [Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
From: Adelene LAI
Sent: Friday, October 23, 2020 11:28:55 PM
To: Gustavo Seabra; RDKit Discuss

Hi Gustavo,

https://gist.github.com/adelenelai/59a8794e1f030941c19bcb50aa8adf3f

In the gist above, I tried doing some further investigating.

It seems for the example you gave, the rdkit functions indeed give the same inchikey and inchi, but different aux info.

Why this different aux info doesn't translate into different inchikeys/inchis, I'm not sure.

Adelene
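The comparison described in this message could be reproduced along these lines. This is only a sketch, assuming an RDKit build with InChI support; it is not the code from the gist, and `inchi_fields` and `differing_fields` are hypothetical helper names.

```python
from typing import Dict, List

# The two SMILES from the original post (the backslash is escaped for Python).
CIS = "C/N=C(/NC#N)NCCSCc1nc[nH]c1C"
TRANS = "C/N=C(\\NC#N)NCCSCc1nc[nH]c1C"


def inchi_fields(smiles: str) -> Dict[str, str]:
    """Return the InChI, InChIKey and AuxInfo strings for one SMILES."""
    # Imported here so that differing_fields() below has no RDKit dependency.
    from rdkit import Chem
    from rdkit.Chem import inchi

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    inchi_str, aux_info = inchi.MolToInchiAndAuxInfo(mol)
    return {
        "inchi": inchi_str,
        "inchikey": inchi.InchiToInchiKey(inchi_str),
        "auxinfo": aux_info,
    }


def differing_fields(a: Dict[str, str], b: Dict[str, str]) -> List[str]:
    """Names of the fields whose values differ between two records."""
    return [name for name in a if a[name] != b.get(name)]
```

Calling `differing_fields(inchi_fields(CIS), inchi_fields(TRANS))` lists which of the three strings differ. If the observation above holds for your RDKit version, only the aux info should show up in that list.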
[Rdkit-discuss] Nitrogen sp2 isomers get the same InChI Key
From: Gustavo Seabra
Sent: Friday, October 23, 2020 6:43:07 PM
To: RDKit Discuss

Hi all,

I ran into an issue here, and I'd appreciate your input. I noticed that compounds that differ only in the cis-trans isomerization around an sp2 nitrogen get the same InChI Key from RDKit. For example:

> from rdkit import Chem
> inchi_cis = Chem.inchi.MolToInchiKey(Chem.MolFromSmiles("C/N=C(/NC#N)NCCSCc1nc[nH]c1C"))
> inchi_cis
'AQIXAKUUQRKLND-UHFFFAOYSA-N'

> inchi_trans = Chem.inchi.MolToInchiKey(Chem.MolFromSmiles("C/N=C(\\NC#N)NCCSCc1nc[nH]c1C"))
> inchi_trans
'AQIXAKUUQRKLND-UHFFFAOYSA-N'

> inchi_cis == inchi_trans
True

I wonder if this is a limitation of the InChI Key definition, or an implementation issue.

Thanks a lot,
--
Gustavo Seabra.