Re: [Rdkit-discuss] InChI strings from molfiles - stereo perception.

2013-05-01 Thread Greg Landrum
Dear Jan,

On Mon, Apr 29, 2013 at 8:03 AM, Jan Holst Jensen j...@biochemfusion.com 
wrote:
 Hi RDKitters,

 I wonder why the InChI strings generated by RDKit differ from the ones
 generated by the standard IUPAC inchi-1 executable.

At least some were due to an RDKit bug that has been fixed for a while
(it's in the 2013.03 release). The fix isn't reflected in the knime
nodes because we haven't done an update of the knime binaries in a
while; that's coming in the next day or so.

 I ran the standard InChI example file Samples.sdf through the KNIME workflow
 and compared with the InChIs generated from the IUPAC executable. A number
 of InChI strings are different; it seems to be almost all stereo-related.

Here's what I get from Python:


 For example: InChI strings generated for spiro.mol (spiro.mol - attached):

 IUPAC:
 InChI=1S/2C9H14Cl2/c2*1-7(10)3-9(4-7)5-8(2,11)6-9/h2*3-6H2,1-2H3/t2*7-,8-,9-/m10/s1
 RDKit: InChI=1S/2C9H14Cl2/c2*1-7(10)3-9(4-7)5-8(2,11)6-9/h2*3-6H2,1-2H3

This one still doesn't recognize the stereo. I'll file a bug for it:
In [2]: Chem.MolToInchi(Chem.MolFromMolFile('spiro.mol'))
[09:53:16] WARNING: Omitted undefined stereo
Out[2]: 'InChI=1S/2C9H14Cl2/c2*1-7(10)3-9(4-7)5-8(2,11)6-9/h2*3-6H2,1-2H3'


 and stertaut.mol (stertaut.mol - attached):

 IUPAC:
 InChI=1S/C6H6O5/c7-1-2-3(5(8)9)4(2)6(10)11/h1,3-4,7H,(H,8,9)(H,10,11)/b2-1-/t3-,4+/m0/s1
 RDKit:
 InChI=1S/C6H6O5/c7-1-2-3(5(8)9)4(2)6(10)11/h1,3-4,7H,(H,8,9)(H,10,11)/t3-,4-/m1/s1

In [3]: Chem.MolToInchi(Chem.MolFromMolFile('stertaut.mol'))
Out[3]: 
'InChI=1S/C6H6O5/c7-1-2-3(5(8)9)4(2)6(10)11/h1,3-4,7H,(H,8,9)(H,10,11)/b2-1-/t3-,4+/m0/s1'

looks fine.


 OK, now those InChI samples look like they are heavy on fringe cases and
 perhaps thus likely to really stress toolkits.

These are the best kind. :-)


 So I took something more peaceful and ran a peptide from PubChem through
 (pubchem_71296070.mol - attached).

 IUPAC:
 InChI=1S/C33H55N9O10/c1-18(43)26(37)31(49)40-23(16-20-10-4-3-5-11-20)30(48)39-21(12-6-8-14-34)28(46)38-22(13-7-9-15-35)29(47)42-27(19(2)44)32(50)41-24(33(51)52)17-25(36)45/h3-5,10-11,18-19,21-24,26-27,43-44H,6-9,12-17,34-35,37H2,1-2H3,(H2,36,45)(H,38,46)(H,39,48)(H,40,49)(H,41,50)(H,42,47)(H,51,52)/t18-,19-,21+,22+,23+,24+,26+,27+/m1/s1
 RDKit:
 InChI=1S/C33H55N9O10/c1-18(43)26(37)31(49)40-23(16-20-10-4-3-5-11-20)30(48)39-21(12-6-8-14-34)28(46)38-22(13-7-9-15-35)29(47)42-27(19(2)44)32(50)41-24(33(51)52)17-25(36)45/h3-5,10-11,18-19,21-24,26-27,43-44H,6-9,12-17,34-35,37H2,1-2H3,(H2,36,45)(H,38,46)(H,39,48)(H,40,49)(H,41,50)(H,42,47)(H,51,52)/t18-,19-,21+,22+,23+,24+,26+,27+/m0/s1


In [4]: Chem.MolToInchi(Chem.MolFromMolFile('pubchem_71296070.mol'))
Out[4]: 
'InChI=1S/C33H55N9O10/c1-18(43)26(37)31(49)40-23(16-20-10-4-3-5-11-20)30(48)39-21(12-6-8-14-34)28(46)38-22(13-7-9-15-35)29(47)42-27(19(2)44)32(50)41-24(33(51)52)17-25(36)45/h3-5,10-11,18-19,21-24,26-27,43-44H,6-9,12-17,34-35,37H2,1-2H3,(H2,36,45)(H,38,46)(H,39,48)(H,40,49)(H,41,50)(H,42,47)(H,51,52)/t18-,19-,21+,22+,23+,24+,26+,27+/m1/s1'

also looks fine.
-greg

--
Introducing AppDynamics Lite, a free troubleshooting tool for Java/.NET
Get 100% visibility into your production application - at no cost.
Code-level diagnostics for performance bottlenecks with 2% overhead
Download for free and get started troubleshooting in minutes.
http://p.sf.net/sfu/appdyn_d2d_ap1
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] InChI strings from molfiles - stereo perception.

2013-05-01 Thread Jan Holst Jensen
Dear Greg,

On 2013-05-01 09:58, Greg Landrum wrote:
 Dear Jan,

 On Mon, Apr 29, 2013 at 8:03 AM, Jan Holst Jensen j...@biochemfusion.com 
 wrote:
 Hi RDKitters,

 I wonder why the InChI strings generated by RDKit differ from the ones
 generated by the standard IUPAC inchi-1 executable.
 At least some were due to an RDKit bug that has been fixed for a while
 (it's in the 2013.03 release). The fix isn't reflected in the knime
 nodes because we haven't done an update of the knime binaries in a
 while; that's coming in the next day or so.

Ah - sounds wonderful. Thanks.

Out of sheer laziness my Python-enabled RDKit builds have been without 
InChI support, so I couldn't compare with the KNIME nodes and just assumed 
that they behaved identically. Well... as they say, "Assumption is the 
mother of all sc***-ups" :-).

 OK, now those InChI samples look like they are heavy on fringe cases and
 perhaps thus likely to really stress toolkits.
 These are the best kind. :-)

Indeed :-).

 So I took something more peaceful and ran a peptide from PubChem through
 (pubchem_71296070.mol - attached).


 In [4]: Chem.MolToInchi(Chem.MolFromMolFile('pubchem_71296070.mol'))
 Out[4]: 
 'InChI=1S/C33H55N9O10/c1-18(43)26(37)31(49)40-23(16-20-10-4-3-5-11-20)30(48)39-21(12-6-8-14-34)28(46)38-22(13-7-9-15-35)29(47)42-27(19(2)44)32(50)41-24(33(51)52)17-25(36)45/h3-5,10-11,18-19,21-24,26-27,43-44H,6-9,12-17,34-35,37H2,1-2H3,(H2,36,45)(H,38,46)(H,39,48)(H,40,49)(H,41,50)(H,42,47)(H,51,52)/t18-,19-,21+,22+,23+,24+,26+,27+/m1/s1'

 also looks fine.

Yep. Everything should be in order then.

Cheers
-- Jan



[Rdkit-discuss] InChI strings from molfiles - stereo perception.

2013-04-29 Thread Jan Holst Jensen

Hi RDKitters,

I wonder why the InChI strings generated by RDKit differ from the ones 
generated by the standard IUPAC inchi-1 executable.


I have used the IUPAC inchi-1 executable from a command line to generate 
IUPAC InChI strings (the executable that comes pre-built with the InChI 
1.04 binary download).


RDKit InChI strings were generated with the RDKit KNIME nodes, this version:

 RDKit KNIME integration    2.1.0.201302211506

I constructed a KNIME workflow that reads in an SD-file, uses the 
Molecule to RDKit node and then the RDKit To InChI node with default 
options to generate RDKit InChI strings.


I ran the standard InChI example file Samples.sdf through the KNIME 
workflow and compared with the InChIs generated from the IUPAC 
executable. A number of InChI strings are different; it seems to be 
almost all stereo-related.
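
A comparison like this can be scripted. The following is only a minimal sketch in plain Python (no toolkit needed), assuming the two InChI strings for each structure are already at hand. The helper names (`split_layers`, `differing_layers`, `stereo_only`) are made up for illustration; the strings are the spiro.mol and stertaut.mol examples quoted in this thread, with the toolkit values being the ones from the KNIME nodes.

```python
# Layers that carry stereo information in a standard InChI:
# /b (double bond), /t (tetrahedral), /m (inversion flag), /s (stereo type).
STEREO_LAYERS = {"b", "t", "m", "s"}


def split_layers(inchi):
    """Return a dict mapping each layer prefix to its layer body."""
    version, formula, *rest = inchi.split("/")
    layers = {"version": version, "formula": formula}
    for part in rest:
        if part:
            # Simplification: a prefix that occurs twice (possible with
            # isotopic or fixed-H layers) overwrites the earlier entry.
            layers[part[0]] = part[1:]
    return layers


def differing_layers(reference, candidate):
    """Sorted list of layer keys whose content differs or is missing."""
    ref, cand = split_layers(reference), split_layers(candidate)
    return sorted(k for k in ref.keys() | cand.keys() if ref.get(k) != cand.get(k))


def stereo_only(reference, candidate):
    """True if the strings differ and every difference is in a stereo layer."""
    diff = differing_layers(reference, candidate)
    return bool(diff) and set(diff) <= STEREO_LAYERS


# (IUPAC executable, RDKit KNIME node) pairs copied from this thread.
PAIRS = {
    "spiro.mol": (
        "InChI=1S/2C9H14Cl2/c2*1-7(10)3-9(4-7)5-8(2,11)6-9/h2*3-6H2,1-2H3/t2*7-,8-,9-/m10/s1",
        "InChI=1S/2C9H14Cl2/c2*1-7(10)3-9(4-7)5-8(2,11)6-9/h2*3-6H2,1-2H3",
    ),
    "stertaut.mol": (
        "InChI=1S/C6H6O5/c7-1-2-3(5(8)9)4(2)6(10)11/h1,3-4,7H,(H,8,9)(H,10,11)/b2-1-/t3-,4+/m0/s1",
        "InChI=1S/C6H6O5/c7-1-2-3(5(8)9)4(2)6(10)11/h1,3-4,7H,(H,8,9)(H,10,11)/t3-,4-/m1/s1",
    ),
}

for name, (iupac, toolkit) in PAIRS.items():
    print(name, differing_layers(iupac, toolkit), stereo_only(iupac, toolkit))
```

Splitting on "/" keeps the check independent of any toolkit, so it can be run over the output of both programs for a whole SD-file.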


For example: InChI strings generated for spiro.mol (spiro.mol - attached):

IUPAC: 
InChI=1S/2C9H14Cl2/c2*1-7(10)3-9(4-7)5-8(2,11)6-9/h2*3-6H2,1-2H3/t2*7-,8-,9-/m10/s1

RDKit: InChI=1S/2C9H14Cl2/c2*1-7(10)3-9(4-7)5-8(2,11)6-9/h2*3-6H2,1-2H3

and stertaut.mol (stertaut.mol - attached):

IUPAC: 
InChI=1S/C6H6O5/c7-1-2-3(5(8)9)4(2)6(10)11/h1,3-4,7H,(H,8,9)(H,10,11)/b2-1-/t3-,4+/m0/s1
RDKit: 
InChI=1S/C6H6O5/c7-1-2-3(5(8)9)4(2)6(10)11/h1,3-4,7H,(H,8,9)(H,10,11)/t3-,4-/m1/s1


OK, now those InChI samples look like they are heavy on fringe cases and 
perhaps thus likely to really stress toolkits.


So I took something more peaceful and ran a peptide from PubChem 
through (pubchem_71296070.mol - attached).


IUPAC: 
InChI=1S/C33H55N9O10/c1-18(43)26(37)31(49)40-23(16-20-10-4-3-5-11-20)30(48)39-21(12-6-8-14-34)28(46)38-22(13-7-9-15-35)29(47)42-27(19(2)44)32(50)41-24(33(51)52)17-25(36)45/h3-5,10-11,18-19,21-24,26-27,43-44H,6-9,12-17,34-35,37H2,1-2H3,(H2,36,45)(H,38,46)(H,39,48)(H,40,49)(H,41,50)(H,42,47)(H,51,52)/t18-,19-,21+,22+,23+,24+,26+,27+/m1/s1
RDKit: 
InChI=1S/C33H55N9O10/c1-18(43)26(37)31(49)40-23(16-20-10-4-3-5-11-20)30(48)39-21(12-6-8-14-34)28(46)38-22(13-7-9-15-35)29(47)42-27(19(2)44)32(50)41-24(33(51)52)17-25(36)45/h3-5,10-11,18-19,21-24,26-27,43-44H,6-9,12-17,34-35,37H2,1-2H3,(H2,36,45)(H,38,46)(H,39,48)(H,40,49)(H,41,50)(H,42,47)(H,51,52)/t18-,19-,21+,22+,23+,24+,26+,27+/m0/s1


The only difference in this case is that IUPAC outputs an InChI string 
with /m1 and RDKit an InChI with /m0. As far as I can understand from 
the InChI FAQ the /m0 /m1 difference indicates that these are different 
enantiomers.
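
To illustrate that reading of the /m layer, here is a small Python sketch that swaps the flag with a regular expression. It is only a string manipulation under the assumption that flipping /m while leaving /t untouched describes the mirror image; it does not replace regenerating the InChI with a toolkit. The function name `flip_enantiomer_flag` is hypothetical, and the alanine strings are a well-known textbook pair, not data from this thread.

```python
import re

# Translation table that swaps the characters "0" and "1".
_SWAP = str.maketrans("01", "10")


def flip_enantiomer_flag(inchi):
    """Swap 0 and 1 in the /m layer; every other layer is left untouched.

    Multi-component flags such as /m10 are flipped digit by digit.
    """
    return re.sub(
        r"/m([01.]+)(?=/|$)",
        lambda match: "/m" + match.group(1).translate(_SWAP),
        inchi,
    )


# Illustration only (not from this thread): L- and D-alanine share every
# layer except the /m flag.
L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"

print(flip_enantiomer_flag(L_ALANINE) == D_ALANINE)
```

Applied to one of the two peptide strings above, the same swap should reproduce the other one, since /m is the only layer in which they differ.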


I converted the InChIs back to molecules with the InChI to RDKit KNIME 
node. The molecule generated from the IUPAC InChI (from-iupac-inchi.mol 
- attached) faithfully reconstructs the original PubChem molecule. When 
I construct a molecule from the RDKit InChI (from-rdkit-inchi.mol - 
attached), all the stereo centers are inverted (as expected for the 
other enantiomer).


Is there a good explanation for this?

Cheers
-- Jan

71296070
  -OEChem-04201305312D

107107  0     1  0  0  0  0  0999 V2000
    8.0622    1.2500    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   10.6603   -1.2500    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   13.2583   -1.7500    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.4641   -1.2500    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   13.2583    1.2500    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.8660    1.2500    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.8660    4.2500    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   16.7224    0.2500    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   15.8564   -1.2500    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   16.7224    1.2500    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    8.9282   -0.2500    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    6.3301    0.2500    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   11.5263    0.2500    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    4.5981    1.2500    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   14.1244   -0.2500    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   11.5263    4.2500    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    8.9282   -4.2500    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    4.5981    3.2500    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   15.8564    2.7500    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    7.1962   -0.2500    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
    9.7942    0.2500    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
    7.1962   -1.2500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    9.7942    1.2500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    8.0622   -1.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   10.6603    1.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    8.0622    0.2500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   10.6603   -0.2500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.5981    0.2500    0.0000 C   0  0  2  0  0  0  0  0  0  0  0  0
   10.6603    2.7500    0.0000 C   0