Re: [Rdkit-discuss] InChI strings from molfiles - stereo perception.
Dear Jan,

On Mon, Apr 29, 2013 at 8:03 AM, Jan Holst Jensen j...@biochemfusion.com wrote:
> Hi RDKitters,
>
> I wonder why the InChI strings generated by RDKit differ from the ones generated by the standard IUPAC inchi-1 executable.

At least some were due to an RDKit bug that has been fixed for a while (it's in the 2013.03 release). The fix isn't reflected in the knime nodes because we haven't done an update of the knime binaries in a while; that's coming in the next day or so.

> I ran the standard InChI example file Samples.sdf through the KNIME workflow and compared with the InChIs generated from the IUPAC executable. A number of InChI strings are different; it seems to be almost all stereo-related.

Here's what I get from Python:

> For example: InChI strings generated for spiro.mol (spiro.mol - attached):
> IUPAC: InChI=1S/2C9H14Cl2/c2*1-7(10)3-9(4-7)5-8(2,11)6-9/h2*3-6H2,1-2H3/t2*7-,8-,9-/m10/s1
> RDKit: InChI=1S/2C9H14Cl2/c2*1-7(10)3-9(4-7)5-8(2,11)6-9/h2*3-6H2,1-2H3

This one still doesn't recognize the stereo. I'll file a bug for it:

In [2]: Chem.MolToInchi(Chem.MolFromMolFile('spiro.mol'))
[09:53:16] WARNING: Omitted undefined stereo
Out[2]: 'InChI=1S/2C9H14Cl2/c2*1-7(10)3-9(4-7)5-8(2,11)6-9/h2*3-6H2,1-2H3'

> and stertaut.mol (stertaut.mol - attached):
> IUPAC: InChI=1S/C6H6O5/c7-1-2-3(5(8)9)4(2)6(10)11/h1,3-4,7H,(H,8,9)(H,10,11)/b2-1-/t3-,4+/m0/s1
> RDKit: InChI=1S/C6H6O5/c7-1-2-3(5(8)9)4(2)6(10)11/h1,3-4,7H,(H,8,9)(H,10,11)/t3-,4-/m1/s1

In [3]: Chem.MolToInchi(Chem.MolFromMolFile('stertaut.mol'))
Out[3]: 'InChI=1S/C6H6O5/c7-1-2-3(5(8)9)4(2)6(10)11/h1,3-4,7H,(H,8,9)(H,10,11)/b2-1-/t3-,4+/m0/s1'

looks fine.

> OK, now those InChI samples look like they are heavy on fringe cases and perhaps thus likely to really stress toolkits.

These are the best kind. :-)

> So I took something more peaceful and ran a peptide from PubChem through (pubchem_71296070.mol - attached).
> IUPAC: InChI=1S/C33H55N9O10/c1-18(43)26(37)31(49)40-23(16-20-10-4-3-5-11-20)30(48)39-21(12-6-8-14-34)28(46)38-22(13-7-9-15-35)29(47)42-27(19(2)44)32(50)41-24(33(51)52)17-25(36)45/h3-5,10-11,18-19,21-24,26-27,43-44H,6-9,12-17,34-35,37H2,1-2H3,(H2,36,45)(H,38,46)(H,39,48)(H,40,49)(H,41,50)(H,42,47)(H,51,52)/t18-,19-,21+,22+,23+,24+,26+,27+/m1/s1
> RDKit: InChI=1S/C33H55N9O10/c1-18(43)26(37)31(49)40-23(16-20-10-4-3-5-11-20)30(48)39-21(12-6-8-14-34)28(46)38-22(13-7-9-15-35)29(47)42-27(19(2)44)32(50)41-24(33(51)52)17-25(36)45/h3-5,10-11,18-19,21-24,26-27,43-44H,6-9,12-17,34-35,37H2,1-2H3,(H2,36,45)(H,38,46)(H,39,48)(H,40,49)(H,41,50)(H,42,47)(H,51,52)/t18-,19-,21+,22+,23+,24+,26+,27+/m0/s1

In [4]: Chem.MolToInchi(Chem.MolFromMolFile('pubchem_71296070.mol'))
Out[4]: 'InChI=1S/C33H55N9O10/c1-18(43)26(37)31(49)40-23(16-20-10-4-3-5-11-20)30(48)39-21(12-6-8-14-34)28(46)38-22(13-7-9-15-35)29(47)42-27(19(2)44)32(50)41-24(33(51)52)17-25(36)45/h3-5,10-11,18-19,21-24,26-27,43-44H,6-9,12-17,34-35,37H2,1-2H3,(H2,36,45)(H,38,46)(H,39,48)(H,40,49)(H,41,50)(H,42,47)(H,51,52)/t18-,19-,21+,22+,23+,24+,26+,27+/m1/s1'

also looks fine.

-greg

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
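When two InChI strings for the same input disagree, it helps to see which layers differ rather than comparing the strings by eye. The following is a minimal sketch in plain Python (no RDKit needed). The helper names `inchi_layers` and `differing_layers` are made up for illustration, and the sketch assumes a simple standard InChI in which each layer prefix occurs only once (so no isotopic layer with its own stereo sublayers).

```python
def inchi_layers(inchi):
    """Split an InChI string into a dict keyed by layer.

    Assumption: every prefix letter (c, h, b, t, m, s, ...) occurs at most
    once, which holds for the examples in this thread but not for every InChI.
    """
    parts = inchi.split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        layers["formula"] = parts[1]
    for part in parts[2:]:
        layers[part[0]] = part[1:]
    return layers


def differing_layers(inchi_a, inchi_b):
    """Return {layer: (value_in_a, value_in_b)} for layers that differ."""
    a, b = inchi_layers(inchi_a), inchi_layers(inchi_b)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# The two stertaut.mol strings quoted in the thread.
iupac = ("InChI=1S/C6H6O5/c7-1-2-3(5(8)9)4(2)6(10)11"
         "/h1,3-4,7H,(H,8,9)(H,10,11)/b2-1-/t3-,4+/m0/s1")
knime = ("InChI=1S/C6H6O5/c7-1-2-3(5(8)9)4(2)6(10)11"
         "/h1,3-4,7H,(H,8,9)(H,10,11)/t3-,4-/m1/s1")

for layer, (left, right) in differing_layers(iupac, knime).items():
    print(f"/{layer}: IUPAC={left!r} KNIME={right!r}")
```

For the stertaut.mol pair this reports the /b, /m and /t layers, i.e. the mismatch is confined to stereo while connectivity and hydrogens agree.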
Re: [Rdkit-discuss] InChI strings from molfiles - stereo perception.
Dear Greg,

On 2013-05-01 09:58, Greg Landrum wrote:
> Dear Jan,
>
> On Mon, Apr 29, 2013 at 8:03 AM, Jan Holst Jensen j...@biochemfusion.com wrote:
>> Hi RDKitters,
>>
>> I wonder why the InChI strings generated by RDKit differ from the ones generated by the standard IUPAC inchi-1 executable.
>
> At least some were due to an RDKit bug that has been fixed for a while (it's in the 2013.03 release). The fix isn't reflected in the knime nodes because we haven't done an update of the knime binaries in a while; that's coming in the next day or so.

Ah - sounds wonderful. Thanks. Out of sheer laziness my Python-enabled RDKit builds have been without InChI support, so I couldn't compare with the KNIME nodes - I just assumed that they behaved identically. Well... as they say, "Assumption is the mother of all sc***-ups" :-).

>> OK, now those InChI samples look like they are heavy on fringe cases and perhaps thus likely to really stress toolkits.
>
> These are the best kind. :-)

Indeed :-).

>> So I took something more peaceful and ran a peptide from PubChem through (pubchem_71296070.mol - attached).
>
> In [4]: Chem.MolToInchi(Chem.MolFromMolFile('pubchem_71296070.mol'))
> Out[4]: 'InChI=1S/C33H55N9O10/c1-18(43)26(37)31(49)40-23(16-20-10-4-3-5-11-20)30(48)39-21(12-6-8-14-34)28(46)38-22(13-7-9-15-35)29(47)42-27(19(2)44)32(50)41-24(33(51)52)17-25(36)45/h3-5,10-11,18-19,21-24,26-27,43-44H,6-9,12-17,34-35,37H2,1-2H3,(H2,36,45)(H,38,46)(H,39,48)(H,40,49)(H,41,50)(H,42,47)(H,51,52)/t18-,19-,21+,22+,23+,24+,26+,27+/m1/s1'
>
> also looks fine.

Yep. Everything should be in order then.

Cheers
-- Jan
[Rdkit-discuss] InChI strings from molfiles - stereo perception.
Hi RDKitters,

I wonder why the InChI strings generated by RDKit differ from the ones generated by the standard IUPAC inchi-1 executable.

I have used the IUPAC inchi-1 executable from a command line to generate IUPAC InChI strings (the executable that comes pre-built with the InChI 1.04 binary download).

RDKit InChI strings were generated with the RDKit KNIME nodes, this version:

RDKit KNIME integration 2.1.0.201302211506

I constructed a KNIME workflow that reads in an SD-file, uses the Molecule to RDKit node and then the RDKit To InChI node with default options to generate RDKit InChI strings.

I ran the standard InChI example file Samples.sdf through the KNIME workflow and compared with the InChIs generated from the IUPAC executable. A number of InChI strings are different; it seems to be almost all stereo-related.

For example: InChI strings generated for spiro.mol (spiro.mol - attached):

IUPAC: InChI=1S/2C9H14Cl2/c2*1-7(10)3-9(4-7)5-8(2,11)6-9/h2*3-6H2,1-2H3/t2*7-,8-,9-/m10/s1
RDKit: InChI=1S/2C9H14Cl2/c2*1-7(10)3-9(4-7)5-8(2,11)6-9/h2*3-6H2,1-2H3

and stertaut.mol (stertaut.mol - attached):

IUPAC: InChI=1S/C6H6O5/c7-1-2-3(5(8)9)4(2)6(10)11/h1,3-4,7H,(H,8,9)(H,10,11)/b2-1-/t3-,4+/m0/s1
RDKit: InChI=1S/C6H6O5/c7-1-2-3(5(8)9)4(2)6(10)11/h1,3-4,7H,(H,8,9)(H,10,11)/t3-,4-/m1/s1

OK, now those InChI samples look like they are heavy on fringe cases and perhaps thus likely to really stress toolkits. So I took something more peaceful and ran a peptide from PubChem through (pubchem_71296070.mol - attached).

IUPAC: InChI=1S/C33H55N9O10/c1-18(43)26(37)31(49)40-23(16-20-10-4-3-5-11-20)30(48)39-21(12-6-8-14-34)28(46)38-22(13-7-9-15-35)29(47)42-27(19(2)44)32(50)41-24(33(51)52)17-25(36)45/h3-5,10-11,18-19,21-24,26-27,43-44H,6-9,12-17,34-35,37H2,1-2H3,(H2,36,45)(H,38,46)(H,39,48)(H,40,49)(H,41,50)(H,42,47)(H,51,52)/t18-,19-,21+,22+,23+,24+,26+,27+/m1/s1
RDKit: InChI=1S/C33H55N9O10/c1-18(43)26(37)31(49)40-23(16-20-10-4-3-5-11-20)30(48)39-21(12-6-8-14-34)28(46)38-22(13-7-9-15-35)29(47)42-27(19(2)44)32(50)41-24(33(51)52)17-25(36)45/h3-5,10-11,18-19,21-24,26-27,43-44H,6-9,12-17,34-35,37H2,1-2H3,(H2,36,45)(H,38,46)(H,39,48)(H,40,49)(H,41,50)(H,42,47)(H,51,52)/t18-,19-,21+,22+,23+,24+,26+,27+/m0/s1

The only difference in this case is that IUPAC outputs an InChI string with /m1 and RDKit an InChI with /m0. As far as I can understand from the InChI FAQ, the /m0 vs. /m1 difference indicates that these are different enantiomers.

I converted the InChIs back to molecules with the InChI to RDKit KNIME node. The molecule generated from the IUPAC InChI (from-iupac-inchi.mol - attached) faithfully reconstructs the original PubChem molecule. When I construct a molecule from the RDKit InChI (from-rdkit-inchi.mol - attached), all the stereo centers have been inverted (as expected - different enantiomer).

Is there a good explanation for this?

Cheers
-- Jan

[Attachment pubchem_71296070.mol: V2000 molfile, header lines "71296070" and "-OEChem-04201305312D", counts line giving 107 atoms and 107 bonds. The inline copy of the atom block in the archive is garbled and truncated.]
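The peptide case above, where two strings agree in everything but the /m flag, can be detected mechanically. Below is a rough sketch in plain Python; `differ_only_in_m` is a hypothetical helper name, and it only handles single-component structures with exactly one /m layer. It works purely at the string level: whether such a pair really corresponds to enantiomers rests on the reading of the InChI FAQ given in the message, not on anything this code verifies.

```python
import re

# Matches the /m layer up to the next layer separator or the end of the string.
_M_LAYER = re.compile(r"/m([01.]+)(?=/|$)")


def differ_only_in_m(inchi_a, inchi_b):
    """Return True if the two InChIs are identical except for their /m flags."""
    m_a = _M_LAYER.findall(inchi_a)
    m_b = _M_LAYER.findall(inchi_b)
    # Require exactly one /m layer on each side, and the flags must differ.
    if len(m_a) != 1 or len(m_b) != 1 or m_a == m_b:
        return False
    # Mask the flag and compare everything else verbatim.
    return _M_LAYER.sub("/m?", inchi_a) == _M_LAYER.sub("/m?", inchi_b)


# Shared part of the two peptide InChIs quoted in the message (up to /t).
PEPTIDE_CORE = (
    "InChI=1S/C33H55N9O10/c1-18(43)26(37)31(49)40-23(16-20-10-4-3-5-11-20)"
    "30(48)39-21(12-6-8-14-34)28(46)38-22(13-7-9-15-35)29(47)42-27(19(2)44)"
    "32(50)41-24(33(51)52)17-25(36)45/h3-5,10-11,18-19,21-24,26-27,43-44H,"
    "6-9,12-17,34-35,37H2,1-2H3,(H2,36,45)(H,38,46)(H,39,48)(H,40,49)"
    "(H,41,50)(H,42,47)(H,51,52)/t18-,19-,21+,22+,23+,24+,26+,27+"
)
iupac_peptide = PEPTIDE_CORE + "/m1/s1"
knime_peptide = PEPTIDE_CORE + "/m0/s1"

print(differ_only_in_m(iupac_peptide, knime_peptide))
```

A check like this separates "whole molecule inverted" cases from those such as stertaut.mol, where individual /t parities and the /b layer differ as well.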